UTF-8 and filenames

Simo Sorce ssorce at redhat.com
Wed Mar 14 22:36:31 UTC 2007


On Wed, 2007-03-14 at 23:24 +0100, Nicolas Mailhot wrote:
> On Wed, 2007-03-14 at 17:03 -0500, Callum Lerwick wrote:
> 
> > Now interpreting the meaning of these bitstreams is a higher level
> > display problem. The great thing about having a "case sensitive"
> > filesystem is the kernel doesn't have to care about encodings. That
> > bloat is pushed to userspace. 
> 
> Except userspace has no way to guess the filename encoding: the filename
> itself is too short to use any sort of heuristic, and Linux filesystems
> won't provide any other hint.
> 
> The only sane thing userspace can do is postulate a system-wide encoding
> and display garbage for filenames encoded otherwise (hoping that will
> force users to use the default encoding), even if that will fail
> spectacularly with removable media or legacy partitions that use
> another convention. It is also of little help to apps that do something
> other than display filenames.
> 
> Casing and sorting are quite another problem. If the encoding is fixed,
> they only require locale knowledge, which is already exported to
> userspace reliably.

+1 up to this point
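
To make the "postulate one encoding, show garbage for the rest" approach
concrete, here is a minimal Python sketch. It assumes the system-wide
encoding is UTF-8; display_name and round_trip are hypothetical helper
names made up for illustration, not part of any existing tool.

```python
# Sketch only: the system-wide encoding is ASSUMED to be UTF-8.
ASSUMED_ENCODING = "utf-8"


def display_name(raw: bytes, encoding: str = ASSUMED_ENCODING) -> str:
    """Decode a raw filename for display (hypothetical helper).

    Bytes that are not valid in the assumed encoding show up as U+FFFD,
    which is the "garbage" the user gets to see.
    """
    return raw.decode(encoding, errors="replace")


def round_trip(raw: bytes, encoding: str = ASSUMED_ENCODING) -> bytes:
    """Carry a filename through str and back without losing bytes.

    The surrogateescape error handler keeps undecodable bytes as lone
    surrogates, so code that does something other than display the name
    can still recover the exact bytes the kernel handed out.
    """
    text = raw.decode(encoding, errors="surrogateescape")
    return text.encode(encoding, errors="surrogateescape")


utf8_name = "café".encode("utf-8")         # matches the assumed encoding
latin1_name = "café".encode("iso-8859-1")  # b'caf\xe9', a legacy encoding

shown_utf8 = display_name(utf8_name)       # readable
shown_latin1 = display_name(latin1_name)   # last character becomes U+FFFD
```

The display path is lossy on purpose, while the surrogateescape path is
the one to use for anything that has to hand the name back to the
filesystem.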

> Also don't forget UTF-8 coverage comes at the price of forbidding some
> valid ASCII sequences. So anyone blindly injecting data using a legacy
> 8-bit encoding in a UTF-8 system is asking for trouble (and Linus
> refused to enforce UTF-8 safety kernel-side).

To be pedantic, UTF-8 and ASCII* are perfectly compatible, but encodings
that use the upper 128 values (bytes 0x80-0xFF), like iso8859-*, are not
compatible with UTF-8.
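
A quick way to check that claim, as a rough Python sketch (is_valid_utf8
is a hypothetical helper name, and the sample filename is made up):

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Every 7-bit ASCII byte decodes to the same character under both codecs.
ascii_bytes = bytes(range(128))
ascii_is_subset = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

# A lone byte in the 0x80-0xFF range is never a complete UTF-8 sequence.
high_bytes_rejected = all(
    not is_valid_utf8(bytes([b])) for b in range(0x80, 0x100)
)

# The same name in ISO 8859-1 and in UTF-8 (sample name, for illustration).
latin1_name = "résumé.txt".encode("iso-8859-1")  # b'r\xe9sum\xe9.txt'
utf8_name = "résumé.txt".encode("utf-8")
```

Here is_valid_utf8(utf8_name) holds and is_valid_utf8(latin1_name) does
not, because 0xE9 announces a multi-byte sequence that the following
ASCII byte cannot continue. Note that some iso8859-* strings can still
happen to form valid UTF-8, so a failed decode proves the name is not
UTF-8, but a successful one proves nothing.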

Simo.

* http://en.wikipedia.org/wiki/ASCII




More information about the Fedora-maintainers mailing list