Non UTF-8 charset fallback support in GLib (Was Re: plans for long term support releases?)

Thu Jan 18 06:18:31 UTC 2007

At 09:31 PM 1/17/2007, Bruno Wolff III wrote:
>On Wed, Jan 17, 2007 at 23:10:14 +0100,
>   Ola Thoresen <redhat at olen.net> wrote:
> >
> > One of the worst examples of this is the change to UTF-8 as default
> > charset.  I am a devoted UTF-8 user myself, but it is probably the
> > single change that has caused most pain for others, and it is still
> > causing trouble.

> > When we changed to UTF-8 as default, there was no
> > easy way to convert filesystems, documents, text-files, webpages...

Not sure if these two utilities could help:
(1) iconv -f old-encoding -t UTF-8 filename > newfilename

(2) utf8ize

The script:
http://ftp.penguin.cz/pub/users/utx/misc/utf8ize.gopts

The web page (search for utf8ize):
http://www.penguin.cz/~utx/

> > The first thing almost everyone I know that are installing Fedora,
> > Redhat or Suse is doing is to change /etc/sysconfig/i18n to go back to
> > en_US as default LANG. Simply because it takes a h... of a lot of work
> > to convert all your files and applications and there are no good tools
> > out there to help you.
>
>UTF-8 is an encoding and en_US is a locale. You are comparing different
>types of things. Perhaps you meant that UTF-8 was being used instead of
>ASCII or Latin 1? Note that ASCII is in a sense a subset of UTF-8, so
>converting from ASCII to UTF-8 isn't a big deal.

One thing I feel GLib has not done enough of is providing API that
supports non-UTF-8 content. For example, if a text file is opened using
GIOChannel, the read fails if the file contains anything other than valid
UTF-8.

The fallback could be more graceful; for example, the API could accept a
fallback charset that is used to convert bytes that aren't legal UTF-8
bytes to UTF-8. There should be enough API that is as tolerant of
non-UTF-8 content as possible (for instance, by using a fallback charset).

For example, a lot of people were probably using a single European charset
before UTF-8 became mainstream. So, with just one fallback charset
specified, all these people would be covered: their existing files could
be opened, and new files would be saved as UTF-8.

As it is now, if you want your application to read both UTF-8 and
ISO-8859-1 (just the two most common encodings, no more), most facilities
in GLib are not an option: if a text file contains just one copyright
symbol encoded in ISO-8859-1, you fail to read the entire text file, which
is very far from an ideal scenario.
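
To make the fallback idea concrete, here is a minimal sketch in plain C
with no GLib dependency. It is not an existing GLib function; the names
utf8_seq_len and utf8_with_latin1_fallback are hypothetical, and
ISO-8859-1 is assumed as the single fallback charset. Well-formed UTF-8
sequences are copied through unchanged, and any byte that cannot start
one is treated as ISO-8859-1 and re-encoded as UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: length (1-4) of the well-formed UTF-8 sequence
 * starting at s, or 0 if the bytes at s are not valid UTF-8.
 * n is the number of bytes available at s and must be at least 1. */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    unsigned char c = s[0];
    unsigned long cp;
    size_t len, i;

    if (c < 0x80)
        return 1;                                   /* plain ASCII */
    else if (c >= 0xC2 && c <= 0xDF) { len = 2; cp = c & 0x1F; }
    else if (c >= 0xE0 && c <= 0xEF) { len = 3; cp = c & 0x0F; }
    else if (c >= 0xF0 && c <= 0xF4) { len = 4; cp = c & 0x07; }
    else
        return 0;               /* stray continuation or illegal lead */

    if (n < len)
        return 0;                                   /* truncated */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                               /* bad continuation */
        cp = (cp << 6) | (unsigned long)(s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values above U+10FFFF. */
    if (len == 3 && cp < 0x800) return 0;
    if (len == 4 && cp < 0x10000) return 0;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp > 0x10FFFF) return 0;
    return len;
}

/* Hypothetical helper: return a newly allocated, NUL-terminated UTF-8
 * copy of the n bytes at in. Valid UTF-8 is kept as-is; every other
 * byte is assumed to be ISO-8859-1. The caller frees the result.
 * Returns NULL if allocation fails. */
static char *utf8_with_latin1_fallback(const char *in, size_t n)
{
    const unsigned char *s = (const unsigned char *)in;
    unsigned char *out;
    size_t i = 0, o = 0;

    assert(in != NULL || n == 0);
    out = malloc(2 * n + 1);    /* worst case: each byte becomes two */
    if (out == NULL)
        return NULL;

    while (i < n) {
        size_t len = utf8_seq_len(s + i, n - i);
        if (len > 0) {
            memcpy(out + o, s + i, len);
            o += len;
            i += len;
        } else {
            /* The byte is >= 0x80 here. ISO-8859-1 byte b is code
             * point U+00b, which takes two bytes in UTF-8. */
            out[o++] = (unsigned char)(0xC0 | (s[i] >> 6));
            out[o++] = (unsigned char)(0x80 | (s[i] & 0x3F));
            i++;
        }
    }
    out[o] = '\0';
    return (char *)out;
}
```

With this, a buffer holding an ISO-8859-1 copyright symbol (byte 0xA9)
next to valid UTF-8 text comes back with the symbol as 0xC2 0xA9 and the
rest untouched, instead of the whole read failing. The approach is a
heuristic: an ISO-8859-1 byte pair that happens to form a valid UTF-8
sequence is kept as UTF-8, so it suits files that are mostly one
encoding or the other.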

What do people think?

-- 
Daniel Yek