UTF-8 and filenames

Nicolas Mailhot nicolas.mailhot at laposte.net
Wed Mar 14 22:24:18 UTC 2007


On Wednesday, 14 March 2007 at 17:03 -0500, Callum Lerwick wrote:

> Now interpreting the meaning of these bitstreams is a higher level
> display problem. The great thing about having a "case sensitive"
> filesystem is the kernel doesn't have to care about encodings. That
> bloat is pushed to userspace. 

Except userspace has no way to guess the filename encoding: a filename
is too short for any sort of heuristic to work, and Linux filesystems
don't provide any other hint.

The only sane thing userspace can do is postulate a system-wide encoding
and display garbage for filenames encoded otherwise (hoping that will
force users to use the default encoding), even though that will fail
spectacularly with removable media or legacy partitions that use
another convention. It is also of little help to apps that do something
with filenames other than display them.
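
To make the "postulate one encoding, show garbage for the rest" approach
concrete, here is a minimal Python sketch. The helper names
display_name and candidate_readings are made up for illustration, and
the choice of UTF-8 as the system-wide default is an assumption, not
something the filesystem tells you.

```python
def display_name(raw: bytes, assumed_encoding: str = "utf-8") -> str:
    """Decode filename bytes under one postulated encoding.

    Bytes that do not fit the assumed encoding are shown as U+FFFD,
    i.e. the "garbage" the user ends up seeing.
    """
    return raw.decode(assumed_encoding, errors="replace")


def candidate_readings(raw: bytes, encodings) -> dict:
    """Decode the same bytes under several encodings, for comparison."""
    return {enc: raw.decode(enc, errors="replace") for enc in encodings}


# The same name as written by two different producers.
utf8_name = "café".encode("utf-8")         # b'caf\xc3\xa9'
latin1_name = "café".encode("iso-8859-1")  # b'caf\xe9'

print(display_name(utf8_name))    # decodes cleanly under the assumed default
print(display_name(latin1_name))  # last byte becomes U+FFFD
# Every 8-bit reading "succeeds", so nothing tells us which one was meant.
print(candidate_readings(latin1_name, ["iso-8859-1", "koi8_r", "cp1252"]))
```

The last call is the real problem: each legacy 8-bit decoding succeeds
and yields a plausible string, so there is nothing in the bytes to pick
between them.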

Casing and sorting are quite another problem. If the encoding is fixed,
they only require locale knowledge, which is already exported to
userspace reliably.
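
As a rough illustration of that point: once names are decoded, sorting
can lean entirely on the locale that userspace already knows. The
sketch below uses Python's standard locale module; sort_for_display is
a hypothetical helper, and the resulting order depends on which locales
are installed on the machine.

```python
import locale


def sort_for_display(names, locale_name=""):
    """Sort already-decoded names by the collation rules of a locale.

    An empty locale_name means "use whatever the environment says"
    (LANG, LC_COLLATE and friends).
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return sorted(names, key=locale.strxfrm)
    finally:
        # Restore the previous collation so callers are not affected.
        locale.setlocale(locale.LC_COLLATE, previous)


# The "C" locale always exists and sorts by code point.
print(sort_for_display(["b", "a", "B"], "C"))
```

No encoding guesswork is involved here, only the locale setting, which
is the distinction being made above.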

Also don't forget UTF-8 coverage comes at the price of forbidding some
byte sequences that are valid in legacy 8-bit encodings. So anyone
blindly injecting data in a legacy 8-bit encoding into a UTF-8 system is
asking for trouble (and Linus refused to enforce UTF-8 safety
kernel-side).
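
A small Python sketch of what "forbidden sequences" means in practice.
The function name is_valid_utf8 is invented for this example; the
surrogateescape error handler is the one Python itself uses for
filenames, shown here only as one possible way for an application to
carry such names around without losing bytes.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


ascii_name = b"README.txt"
utf8_name = "café".encode("utf-8")         # b'caf\xc3\xa9'
latin1_name = "café".encode("iso-8859-1")  # b'caf\xe9'

print(is_valid_utf8(ascii_name))   # plain ASCII is always valid UTF-8
print(is_valid_utf8(utf8_name))
print(is_valid_utf8(latin1_name))  # a lone 0xE9 byte is not valid UTF-8

# An app that only needs to pass the name along can keep the raw bytes
# intact by smuggling undecodable bytes through as lone surrogates.
as_text = latin1_name.decode("utf-8", errors="surrogateescape")
round_tripped = as_text.encode("utf-8", errors="surrogateescape")
print(round_tripped == latin1_name)
```

Since the kernel accepts both names, a check like this has to live in
userspace if anyone wants it at all.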

-- 
Nicolas Mailhot




More information about the Fedora-maintainers mailing list