Why is "LANG=en_US.UTF-8" the default in Fedora

Fri May 21 21:44:28 UTC 2004

On Fri, 2004-05-21 at 14:00, Nico Kadel-Garcia wrote:
> Yes, in exactly the "case insensitive" fashion that "sort" has used for the
> last 20 years or so.
> 
> With "LANG=en_US.UTF-8 ls", we get lists like this:
> 
> a
> A
> ab
> aB
> Ab
> AB
> abc
> abC
> aBc
> aBC
> Abc
> AbC
> ABc
> ABC

Yup, that's lexicographically correct.  AKA a "dictionary" sort.

> With "LANG=C;  ls | sort -i", we get the same thing:

*snip* 

Funny, since sort -i ignores non-printable characters, not case. Of
course, I get the same result if I use -f.  Doesn't actually prove
anything though and you're still wrong about it being case insensitive. 
The output might look the same (in this example), but that doesn't imply
the cause is.

> 
> Looks identical, doesn't it? Also, notice that the items starting "AB" are
> no longer together. "AB" is entirely separate from "AB[cC]"

Good, then strcoll is working correctly.

> With "LANG=C ls", we get.

*snip*

Good, then strcoll is working correctly.

> Notice some wildly, wildly different behavior there, such as everything that
> starts with "AB" actually being grouped together? Now, guess how much old
> source code in the world was written in the days when ASCII meant ASCII, and
> sorting was predictable, not randomly dependent on a set of semi-randomly
> assigned locales that cannot be predicted, and now require an additional
> programming step of checking if your system supports locales and setting
> them appropriately?

You're exaggerating.  They are different, as well they should be. ASCII
is still ASCII and will always be ASCII.  Sorting is still predictable;
there is nothing "random" about it.  And, as far as the average user is
concerned, lexicographic sorting is correct.  Getting a
locale-independent sort order doesn't require any of that; it simply
requires using strcmp and not strcoll. (Read a man page every once in a
while, or even a "C" standard book and you'd know this).
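
To make the strcmp/strcoll distinction concrete, here is a minimal C
sketch.  The helper names (cmp_bytes, cmp_locale, sort_bytewise,
sort_by_locale) are made up for illustration; only qsort, strcmp and
strcoll come from the standard library.  Note that strcoll follows the
user's locale only after the program has called setlocale(LC_ALL, "")
or setlocale(LC_COLLATE, ""); until then the program runs in the "C"
locale, where on POSIX systems strcoll orders strings the same way
strcmp does.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator: plain byte-value order, independent of the locale. */
static int cmp_bytes(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);
}

/* qsort comparator: order defined by the current LC_COLLATE setting. */
static int cmp_locale(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcoll(*sa, *sb);
}

/* Hypothetical helper: sort an array of strings by byte value. */
void sort_bytewise(const char **names, size_t n)
{
    qsort(names, n, sizeof *names, cmp_bytes);
}

/* Hypothetical helper: sort an array of strings by locale collation. */
void sort_by_locale(const char **names, size_t n)
{
    qsort(names, n, sizeof *names, cmp_locale);
}
```

Sorting the names a, A, ab, AB, ABc with sort_bytewise gives
A, AB, ABc, a, ab on an ASCII system, whatever LANG is set to.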

> The effective change was from the "C" standard to what you are describing as
> the "More Natural" sorting of en_US.UTF-8, or whichever locale we happen to
> choose from moment to moment. This is conceptually reasonable, but the
> change has been breaking old code and unexpectedly multi-lingual code for
> the last few years. The shift to Unicode has been extremely painful for a
> lot of programmers, including me, and remains painful as I have to clean up
> tools or code from old source or other locations that make unwitting
> assumptions about this sort of behavior.

Wrong again.  Take a look at the copy of the "C" standard you have next
to you.  Look up "strcoll" or "setlocale" and surprise, surprise, they
specify the behavior you're complaining about.  It's more than
conceptually reasonable and it has nothing to do with Unicode.  You're
complaining about two entirely separate issues.  Try removing the
".UTF-8" from the locale and watch as nothing changes with the sort
order.  Applications which expect strcoll to behave like strcmp are, by
definition, broken.  Yes, the shift to Unicode has been painful but the
sort order is the smallest impact by far (and only tangentially related
to Unicode).  Again, if you're expecting strcoll to behave like strcmp,
your code is broken.  If you're expecting LANG=C behavior from 'ls',
then either specify that or simply sort the output after you read it in
using strcmp!  Breaking working apps to work around broken apps is a
horrible idea: the working apps break while the broken apps never get
fixed.
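
For a program that really does depend on "C" ordering, one way to
"specify that" from inside the code is to adopt the user's locale for
everything else and pin only collation.  This is a sketch under that
assumption, not a recommendation for any particular application; the
function names use_c_collation and sorts_before are hypothetical, while
setlocale, LC_ALL, LC_COLLATE and strcoll are standard.

```c
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper: honour the user's locale settings in general,
 * but force the collation category back to "C".
 * Returns 0 on success, -1 if collation could not be pinned.
 */
int use_c_collation(void)
{
    /* If the environment names an unknown locale this returns NULL
       and the program simply stays in the "C" locale. */
    setlocale(LC_ALL, "");

    return setlocale(LC_COLLATE, "C") != NULL ? 0 : -1;
}

/* Hypothetical helper: nonzero if a sorts before b under the
   collation currently in effect. */
int sorts_before(const char *a, const char *b)
{
    return strcoll(a, b) < 0;
}
```

After use_c_collation() succeeds, strcoll compares by byte value again,
so sorts_before("AB", "a") is true on an ASCII system regardless of LANG.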

Additionally, advocating changing the default locale from one that
provides the expected behavior for 95% of users to one that doesn't
because you prefer the sort order of 'ls' is arrogant and absurd. 
Change your own /etc/sysconfig/i18n and be done with it.  Or fix the
applications that *actually* break in UTF-8/non-C locales and make
everyone's life better.  Of course, by "actually break" I don't mean
"sort correctly".
-- 
Shahms King <shahms at shahms.com>