UTF-8 and filenames

Wed Mar 14 19:54:40 UTC 2007

On Wed, 2007-03-14 at 10:11 -0700, Toshio Kuratomi wrote:
> On Wed, 2007-03-14 at 19:45 +0900, Mamoru Tasaka wrote:
> > Toshio Kuratomi wrote:
> > > Hi all,
> > > 
> > > I'm thinking of writing a draft guideline for the packaging committee to
> > > mandate all filenames be in utf-8.  
> > 
> > This may be difficult when filename contains multibyte
> > characters (such as Japanese Kanji characters), although
> > I am not familiar with handling filenames with multibyte
> > characters.
> > 
> I was under the impression that utf-8 was capable of storing Kanji,
> just not as efficiently as utf-16 or another encoding.  (AIUI utf-8 uses
> three bytes instead of two.)  Am I missing something important here?

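UTF-8 is a multibyte encoding with a 1-byte base unit. IIRC a single
character takes 1 to 4 bytes (the original definition allowed up to 6,
but RFC 3629 caps it at 4). UTF-8 is 7-bit ASCII compatible and never
produces a null byte except for NUL itself, so it is safe in
null-terminated C strings.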
UTF-16 is also a multibyte encoding, but the base unit is 2 bytes long
(characters outside the Basic Multilingual Plane take two units, i.e. 4
bytes). It is not 7-bit ASCII compatible and is not safe in
null-byte-terminated strings (ASCII chars translate into \00\XX in
big-endian order, with XX the actual ASCII code).
UTF-16 is further divided into LE and BE (little-endian and
big-endian), depending on the byte order of the 2-byte base unit.
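
To make the byte layouts concrete, here is a rough Python sketch (my own
illustration, not part of any guideline; the helper name encoded_bytes
and the two sample characters are arbitrary choices):

```python
# Show how one ASCII letter and one Kanji are laid out in each encoding.
# The samples are arbitrary: "A" (U+0041) and a Kanji (U+6F22).
SAMPLES = ["A", "\u6f22"]
ENCODINGS = ("utf-8", "utf-16-le", "utf-16-be")


def encoded_bytes(text):
    """Return a dict mapping encoding name to space-separated hex bytes."""
    return {
        enc: " ".join(f"{b:02x}" for b in text.encode(enc))
        for enc in ENCODINGS
    }


for sample in SAMPLES:
    print(repr(sample), encoded_bytes(sample))
```

The ASCII letter stays a single byte in UTF-8 but picks up a null byte in
both UTF-16 variants, while the Kanji takes three bytes in UTF-8 and two
in UTF-16.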

Both UTF-8 and UTF-16 are just encodings of the Unicode standard, and
as long as the input is valid you should be able to translate from
UTF-8 to UTF-16 and vice versa with no loss of information.
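
A quick way to check the round trip, sketched in Python (the function
names and the sample filename are hypothetical, and I picked the LE
variant of UTF-16 arbitrarily):

```python
def utf8_to_utf16(data: bytes) -> bytes:
    """Re-encode valid UTF-8 bytes as UTF-16LE (no byte order mark)."""
    return data.decode("utf-8").encode("utf-16-le")


def utf16_to_utf8(data: bytes) -> bytes:
    """Re-encode valid UTF-16LE bytes as UTF-8."""
    return data.decode("utf-16-le").encode("utf-8")


# Hypothetical filename mixing ASCII, a Kanji, and a character outside
# the Basic Multilingual Plane (which needs a surrogate pair in UTF-16).
original = "file-\u6f22\U0001F600.txt".encode("utf-8")
round_tripped = utf16_to_utf8(utf8_to_utf16(original))
print(round_tripped == original)
```

Invalid input is a different story: decode() raises UnicodeDecodeError
on byte sequences that are not valid in the source encoding.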

Then MS started using UCS2/UTF16 and ... well you can guess ...

Simo.