Non UTF-8 charset fallback support in GLib (Was Re: plans for long term support releases?)
Daniel Yek
dyek at real.com
Thu Jan 18 06:18:31 UTC 2007
At 09:31 PM 1/17/2007, Bruno Wolff III wrote:
>On Wed, Jan 17, 2007 at 23:10:14 +0100,
> Ola Thoresen <redhat at olen.net> wrote:
> >
> > One of the worst examples of this is the change to UTF-8 as default
> > charset. I am a devoted UTF-8 user myself, but it is probably the
> > single change that has caused most pain for others, and it is still
> > causing trouble.
> > When we changed to UTF-8 as default, there was no
> > easy way to convert filesystems, documents, text-files, webpages...
Not sure if these two utilities could help:
(1) iconv -f old-encoding -t UTF-8 filename > newfilename
(2) utf8ize
The script:
http://ftp.penguin.cz/pub/users/utx/misc/utf8ize.gopts
The web page (search for utf8ize):
http://www.penguin.cz/~utx/
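
For a program that needs the same conversion as (1) without shelling out,
here is a minimal sketch using the POSIX iconv(3) interface. The helper name
latin1_to_utf8 is made up for illustration, and ISO-8859-1 is only an
assumed source encoding; substitute whatever "old-encoding" applies.

```c
#include <iconv.h>
#include <stdlib.h>

/* Hypothetical helper: convert in_len bytes of ISO-8859-1 text to a newly
 * allocated, NUL-terminated UTF-8 string. Returns NULL on failure.
 * The caller frees the result. */
char *latin1_to_utf8(const char *in, size_t in_len)
{
    /* iconv_open takes (to, from), the reverse of the -f/-t order. */
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    char *out, *in_ptr, *out_ptr;
    size_t in_left, out_left, rc;

    if (cd == (iconv_t)-1)
        return NULL;

    /* Each ISO-8859-1 byte needs at most two bytes in UTF-8. */
    out = malloc(in_len * 2 + 1);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    in_ptr = (char *)in;   /* iconv() does not modify the input bytes */
    in_left = in_len;
    out_ptr = out;
    out_left = in_len * 2;

    rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1) {
        free(out);
        return NULL;
    }

    *out_ptr = '\0';
    return out;
}
```

Like the command line version, this assumes the whole input really is in
the old encoding; it does not detect text that is already UTF-8.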
> > The first thing almost everyone I know who installs Fedora,
> > Redhat or Suse does is change /etc/sysconfig/i18n to go back to
> > en_US as default LANG. Simply because it takes a h... of a lot of work
> > to convert all your files and applications and there are no good tools
> > out there to help you.
>
>UTF-8 is an encoding and en_US is a locale. You are comparing different
>types of things. Perhaps you meant that UTF-8 was being used instead of
>ASCII or Latin 1? Note that ASCII is in a sense a subset of UTF-8, so
>converting from ASCII to UTF-8 isn't a big deal.
One thing I feel GLib has not done enough of is provide API for handling
non-UTF-8 content. For example, if a text file is opened with GIOChannel,
a read fails as soon as the file contains bytes that are not valid UTF-8.
The fallback could be more graceful; for example, the API could accept a
fallback charset and use it to convert bytes that aren't legal UTF-8 bytes
to UTF-8. There should be enough API that tolerates non-UTF-8 content as
far as possible (for example, by using a fallback charset).
For example, many people used a single European charset before UTF-8
became mainstream. So, with just one fallback charset specified, all of
these people could have been covered: their existing files could be
opened, and new files would be saved as UTF-8.
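
To make the fallback idea concrete, here is a rough sketch in plain C. It is
not GLib API; the function names are made up for illustration, and
ISO-8859-1 is hard-coded as the assumed fallback charset. Well-formed UTF-8
sequences are copied through unchanged, and any byte that does not start
one is treated as ISO-8859-1 and re-encoded as UTF-8.

```c
#include <stddef.h>
#include <stdlib.h>

/* Length (1..4) of the well-formed UTF-8 sequence starting at s, or 0 if
 * the bytes at s do not begin one. avail must be at least 1. */
static size_t utf8_seq_len(const unsigned char *s, size_t avail)
{
    unsigned char b0 = s[0];
    unsigned char lo = 0x80, hi = 0xBF;  /* allowed range of second byte */
    size_t len, i;

    if (b0 < 0x80)
        return 1;                        /* plain ASCII */
    if (b0 >= 0xC2 && b0 <= 0xDF) {
        len = 2;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        len = 3;
        if (b0 == 0xE0) lo = 0xA0;       /* reject overlong forms */
        if (b0 == 0xED) hi = 0x9F;       /* reject UTF-16 surrogates */
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        len = 4;
        if (b0 == 0xF0) lo = 0x90;       /* reject overlong forms */
        if (b0 == 0xF4) hi = 0x8F;       /* reject values above U+10FFFF */
    } else {
        return 0;                        /* stray continuation or bad lead */
    }

    if (avail < len)
        return 0;                        /* truncated sequence */
    if (s[1] < lo || s[1] > hi)
        return 0;
    for (i = 2; i < len; i++)
        if (s[i] < 0x80 || s[i] > 0xBF)
            return 0;
    return len;
}

/* Hypothetical helper: return a newly allocated, NUL-terminated UTF-8
 * string. Bytes that are not part of well-formed UTF-8 are interpreted
 * as ISO-8859-1. Returns NULL if allocation fails; caller frees. */
char *utf8_with_latin1_fallback(const char *in, size_t in_len)
{
    const unsigned char *s = (const unsigned char *)in;
    char *out = malloc(in_len * 2 + 1);  /* worst case: every byte doubles */
    size_t i = 0, o = 0;

    if (out == NULL)
        return NULL;

    while (i < in_len) {
        size_t n = utf8_seq_len(s + i, in_len - i);
        if (n > 0) {
            while (n--)                  /* copy the valid sequence as is */
                out[o++] = (char)s[i++];
        } else {
            /* An invalid byte is always >= 0x80 here. ISO-8859-1 byte b
             * is code point U+00b, which takes two bytes in UTF-8. */
            out[o++] = (char)(0xC0 | (s[i] >> 6));
            out[o++] = (char)(0x80 | (s[i] & 0x3F));
            i++;
        }
    }
    out[o] = '\0';
    return out;
}
```

With this, a file containing a lone ISO-8859-1 copyright byte (0xA9) would
come out with that byte rewritten as the UTF-8 pair 0xC2 0xA9, and the rest
of the text untouched. One caveat: a run of ISO-8859-1 bytes that happens to
form a well-formed UTF-8 sequence would be kept as UTF-8, so this is a
heuristic rather than a reliable detector.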
As it is now, if you want your application to read both UTF-8 and
ISO-8859-1 (just the two most common encodings, no more), most facilities
in GLib are not an option: if a text file contains just one copyright
symbol encoded in ISO-8859-1, you fail to read the entire file, which is
very far from ideal.
What do people think?
--
Daniel Yek