What's the internal charset of Linux?

Yin Ming yinming at mdc-ds.com
Sat Oct 30 01:50:34 UTC 2004


Hi, all

As a non-English user, I have had many problems with charsets. Thank God
they are solved perfectly in RH9, but I wonder how the problem arose in
the core.

I don't want to dive deeply into the core source; I just want to know how
Linux handles strings, and why those "???" and "*(*@#&$(@" sequences
appeared in the past.

I know there are various ways to handle strings. The worst one is treating
characters as 8-bit chars (or even 7-bit). For Chinese and other
multi-byte languages, one character is then split into two or more bytes,
and each byte is displayed as a strange single-byte character.
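
A minimal sketch of this failure mode, written in Python purely for
illustration. The sample character and the choice of GBK and Latin-1 are
assumptions made for the example, not anything specific to Linux:

```python
# Illustrative only: the sample text and the charsets are assumptions.
text = "\u4e2d"                      # one Chinese character (U+4E2D)

gbk_bytes = text.encode("gbk")       # two bytes in the GBK charset
utf8_bytes = text.encode("utf-8")    # three bytes in UTF-8

# A program that treats every byte as one 8-bit character (Latin-1 here)
# sees two unrelated Western European letters instead of one character.
garbled = gbk_bytes.decode("latin-1")

# A program limited to 7-bit ASCII cannot represent the character at all;
# with errors="replace" it substitutes a question mark.
replaced = text.encode("ascii", errors="replace")

print(len(gbk_bytes), len(utf8_bytes))
print(ascii(garbled))                # escaped so it prints on any terminal
print(replaced)
```

This is one plausible origin of both symptoms: byte-per-character display
gives garbage letters, and a forced conversion to ASCII gives "?".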

Another way is MBCS, as some versions of Windows do. Characters are stored
as multi-byte sequences, and the OS remembers their charset, displaying
them in the corresponding fonts.
I think this approach cannot handle multiple charsets at one time, unless
you convert strings from the other charsets into the default one.
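
A small sketch of that limitation, again in Python for illustration only.
DEFAULT_CHARSET is a hypothetical stand-in for the single charset the
system remembers, and Big5 stands in for "some other charset":

```python
# Illustrative only: DEFAULT_CHARSET is a hypothetical system-wide setting.
DEFAULT_CHARSET = "gbk"

text = "\u4e2d"                      # one Chinese character (U+4E2D)
big5_bytes = text.encode("big5")     # the same character in another charset

# Interpreting Big5 bytes with the default charset gives the wrong text.
misread = big5_bytes.decode(DEFAULT_CHARSET, errors="replace")

# Converting explicitly from the source charset into the default one
# recovers the intended character.
converted = big5_bytes.decode("big5").encode(DEFAULT_CHARSET)

print(misread == text)                                # wrong-charset read
print(converted.decode(DEFAULT_CHARSET) == text)      # explicit conversion
```

The conversion step only works when the source charset is known and the
character exists in the default charset.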

The better way, I think, is Unicode: use the correct charset to decode
the multi-byte input into Unicode strings, handle these Unicode strings in
the core, then encode them back into external multi-byte form before
output. This approach only needs to maintain a default I/O charset, used
for decoding and encoding at I/O.
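
The decode, process, encode flow described here could be sketched as
follows. This is Python for illustration, not how Linux itself works;
IO_CHARSET, decode_input and encode_output are hypothetical names invented
for the example:

```python
# Illustrative only: all names here are hypothetical.
IO_CHARSET = "gbk"                   # assumed default I/O charset


def decode_input(raw: bytes, charset: str = IO_CHARSET) -> str:
    """Turn external multi-byte data into an internal Unicode string."""
    return raw.decode(charset)


def encode_output(text: str, charset: str = IO_CHARSET) -> bytes:
    """Turn an internal Unicode string back into external bytes."""
    return text.encode(charset)


raw_in = b"\xd6\xd0\xce\xc4"         # two Chinese characters in GBK
internal = decode_input(raw_in)

# Internally the program works on characters, not on bytes.
first_char = internal[0]
raw_out = encode_output(first_char + "!")

# Input in a different charset decodes to the same internal form.
same_char = decode_input("\u4e2d".encode("big5"), "big5")

print(len(raw_in), len(internal))    # byte count vs. character count
print(raw_out)
print(same_char == first_char)
```

Only the two boundary functions need to know about charsets; everything
between them sees plain Unicode.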

So, in the core, the type of string should be wchar_t.

Right? How does Linux handle strings?
