UTF-8 and filenames

Simo Sorce ssorce at redhat.com
Wed Mar 14 22:22:26 UTC 2007


On Wed, 2007-03-14 at 17:03 -0500, Callum Lerwick wrote:
> On Wed, 2007-03-14 at 00:01 -0700, Toshio Kuratomi wrote:
> > The thing is we control the filenames to some extent.  If we decide that
> > every filename in one of our packages has to be utf-8 then we'll never
> > have a filename enter the database that isn't utf-8.  If we decide that
> > it's okay for fedora packages to contain files whose names are not
> > encoded in utf-8 then the tools will have to cope with it.
> 
> I'm seeing two issues here.
> 
> Unix systems have supported arbitrary bitstreams for filenames (well,
> except for '/'...) since the beginning of time. Any low level tool that
> falls over because the filename contains whitespace or high-ascii or
> utf-8 or whatever is broken. Period.

This is not black and white; you can't say that _any_ tool that falls
over is broken.
Example: convmv is a low-level tool, and it can't translate unknown
charsets, but it is not broken.

> Now interpreting the meaning of these bitstreams is a higher level
> display problem. The great thing about having a "case sensitive"
> filesystem is the kernel doesn't have to care about encodings. That
> bloat is pushed to userspace. Its just a bunch of bits as far as the
> kernel and low-level libc are concerned. (Except the kernel DOES have to
> know about encodings in order to implement vfat, SMB, ntfs and whatnot,
> because microsoft sux...)

The fact that the Linux kernel can ignore the content of the file name
does not mean that nothing has to. Managing character sets is not only a
display problem, and it is certainly not Microsoft's fault that someone
invented a broken standard like ASCII from the start.

There are historical reasons why some protocols (and not just file
sharing protocols; think of HTTP/HTML) have standardized on some
character set, and no matter what you say, any network protocol has to
deal with that, because machines used to speak in different ways. And
you can't force everybody to use the same encoding. Actually, the fact
that the kernel accepts everything worsens the problem, as it allows
file names in mixed character sets to exist. But that is a different
problem.
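A quick sketch of that last point, assuming a POSIX system: because the kernel treats names as opaque bytes (only '/' and NUL are special), the "same" human-visible name can exist twice in one directory under incompatible encodings.

```python
import os
import tempfile

# One directory, two byte sequences that both mean "café" to a human:
d = os.fsencode(tempfile.mkdtemp())
open(os.path.join(d, "café".encode("utf-8")), "w").close()    # b'caf\xc3\xa9'
open(os.path.join(d, "café".encode("latin-1")), "w").close()  # b'caf\xe9'

# The kernel happily stores both as distinct directory entries.
print(sorted(os.listdir(d)))
```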

Now if we could standardize everything on UTF-8 we would have no
problem, and this is why, at least internally, we should standardize on
it and not propagate the plague of mixed, incompatible character sets.

We can and we should standardize at least at the user space level.
People expect consistency from a distribution and the tools it ships.

Simo.





More information about the Fedora-maintainers mailing list