FC4 and accented characters

Mon Jan 22 00:44:00 UTC 2007

jeanpca at free.fr wrote:
> 
> I work on a fc4  [2.6.11-1.1369_FC4] and my system is speaking in english
> My i18n file looks like to
> cat /etc/sysconfig/i18n
> LANG="en_US.UTF-8"
              ^^^^^

(which will only work if you're reading this in a fixed font...)

OK. This e-mail is written in what's known as "US-ASCII". "US-ASCII"
only supports the characters on an American keyboard. It uses character
values up to 127. Each character is usually stored in an eight-bit byte
(or octet), which can store values up to 255.

Then people started wanting to use accents ... and Greek letters ... and
Russian letters ... and all sorts of other symbols. So they created ways
to use the remaining values, from 128 up to 255.

Unfortunately, there were way more than 128 different characters that
different nationalities wanted to use. So we ended up with dozens of
ways of extending ASCII. The ISO 8859-1 variant was most popular for
Western European languages -- until the Euro symbol was created. And it
still wasn't possible to include Greek and Russian in the same document.
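
If you want to see those limits for yourself, here is a small sketch in
Python (my choice of language, nothing in your setup requires it). The
helper name can_encode is made up for the illustration; the codec names
are the ones Python ships with.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

e_acute = "\u00e9"       # e with acute accent
euro = "\u20ac"          # Euro sign
mixed = "\u03b1\u0436"   # a Greek alpha next to a Russian zhe

# ISO 8859-1 has the accented letter but no Euro sign.
print(can_encode(e_acute, "iso-8859-1"), can_encode(euro, "iso-8859-1"))

# ISO 8859-15 is the later revision that added the Euro sign.
euro_latin9 = euro.encode("iso-8859-15")
print(euro_latin9)

# Neither the Greek nor the Cyrillic variant can hold both letters.
print(can_encode(mixed, "iso-8859-7"), can_encode(mixed, "iso-8859-5"))
```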

And Chinese and Japanese users had to have their own standards anyway --
they have thousands of different characters.

So another standard was created -- Unicode. Unicode was originally
encoded as *two* octets -- with up to 65536 different characters. That
turned out to be (a) not enough for all the world's different languages,
and (b) rather complex to handle.

UTF-8 is a different way of encoding Unicode. US-ASCII letters are
encoded as one octet, just as they always have been. Accented letters,
and letters from other character sets, take up between two and four
octets.
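
A rough way to check those lengths, sketched in Python. The sample
characters and the helper name utf8_length are my own picks for the
illustration; the characters are written as escapes so that the code
itself stays plain US-ASCII.

```python
# One sample character from each length class of UTF-8.
samples = {
    "e": "plain ASCII letter",
    "\u00e9": "e with acute accent",
    "\u20ac": "Euro sign",
    "\U0001d11e": "musical G clef, beyond the first 65536 code points",
}

def utf8_length(ch: str) -> int:
    """Number of octets needed to store one character in UTF-8."""
    return len(ch.encode("utf-8"))

for ch, description in samples.items():
    print(f"U+{ord(ch):04X} ({description}): {utf8_length(ch)} octet(s)")
```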

And there is the promise of one standard for the whole world, and
everything being sweetness and light, and that anything that can be
written can be shown on any computer screen around the world.

In practice, UTF-8 is about as good as you can get.

> On this system, when I create an accentued char from my keyboard, it is
> written in two words:

Technical niggle -- "word" has a separate, different, technical meaning
in this context. I think you mean "byte" or "octet".

That's a two-octet UTF-8 character.
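
The "303 251" in the dump quoted next is those two octets written in
octal. A minimal sketch in Python that decodes them, assuming (as your
LANG setting suggests) that the file really is UTF-8:

```python
# The first two octets from the dump, in octal as od prints them.
octets = bytes([0o303, 0o251])

# Interpret them as UTF-8: two octets become a single character.
decoded = octets.decode("utf-8")

print(octets.hex())                      # the same octets in hexadecimal
print(len(octets), "octets ->", len(decoded), "character")
print(f"code point U+{ord(decoded):04X}")
```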

> 0000000 303 251   e   e   e  \n
> If i display this file my web server or send it by mail (php), i get some
> strange chars

OK -- in this case you *need* to read up about MIME encoding and
content-type and charset headers. These are needed in any case for your
viewers / recipients to be able to understand accents, whether you send
them as a traditional ISO-8859 encoding or as UTF-8.

Without them, some, but not all, of your recipients will see the accents
the way you meant them. Others will use a different character set as
standard and see something completely different. They might have a Greek
letter at the same "code point".
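
This is probably where your "strange chars" come from. A sketch in
Python of what a reader sees when it guesses the wrong character set;
the two wrong guesses (ISO 8859-1 and the Greek ISO 8859-7) are just
examples I picked, not a diagnosis of your particular recipients.

```python
import unicodedata

text = "\u00e9"  # e with acute accent

utf8_octets = text.encode("utf-8")          # two octets
latin1_octet = text.encode("iso-8859-1")    # one octet

# UTF-8 octets read by a program that assumes ISO 8859-1:
# the reader sees two unrelated characters instead of one accent.
garbled = utf8_octets.decode("iso-8859-1")

# ISO 8859-1 octet read by a program that assumes ISO 8859-7 (Greek):
# the same value names a completely different letter.
greek = latin1_octet.decode("iso-8859-7")

print([unicodedata.name(c) for c in garbled])
print(unicodedata.name(greek))
```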

You need some way of convincing your recipients' computers that you are
sending data in *this* particular character encoding. And once you've
got that working, you might just as well go for the UTF-8 standard and
be able to send and receive in all sorts of different languages. And
MIME encodings are the way to do this.
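
To show what a correctly labelled message looks like, here is a sketch
using Python's standard email package. You are sending from PHP, so
treat it only as an illustration of the headers you want to end up
with; the addresses and subject are placeholders.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "sender@example.org"       # placeholder address
msg["To"] = "recipient@example.org"      # placeholder address
msg["Subject"] = "Accent test"

# Declaring the charset is what tells the recipient's software how to
# turn the octets of the body back into characters.
msg.set_content("caf\u00e9\n", charset="utf-8")

print(msg["Content-Type"])               # the content-type header
print(msg.get_content_charset())         # the declared character set
```

The same idea applies to the web server: the Content-Type header of the
HTTP response has to name the charset of the page.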

Hope this helps,

James.

-- 
E-mail:     james@ | [Bradford Cathedral] took 194 years to complete. A
aprilcottage.co.uk | construction period of nearly two centuries may seem
                   | ridiculous to us, but of course builders were a lot
                   | quicker in those days.  -- "ISIHAC", BBC Radio 4