From sflaniga at redhat.com Thu Oct 2 04:45:04 2008
From: sflaniga at redhat.com (Sean Flanigan)
Date: Thu, 02 Oct 2008 14:45:04 +1000
Subject: Pseudo-locales for i18n testing by English speakers
Message-ID: <48E451D0.8020904@redhat.com>

G'day all,

I think we should make use of pseudo-locales to test Fedora.

[--- I ????? ?? ?????? ???? ??? ?? ??????-??????? ?? ???? F?????. ---]

(In case UTF-8 doesn't make it to everyone's mail client intact, the above
sentence should look similar to the first one, except that the lower-case
characters have been replaced by other similar-looking Unicode characters.
A couple of the characters don't fit into 16 bits, and really gave my text
editors some trouble!)

See http://en.wikipedia.org/wiki/Pseudo-translation and
http://blogs.msdn.com/shawnste/archive/2006/06/27/647915.aspx for more
about pseudo-locales.  Microsoft actually used three different
pseudo-locales to test Vista, with things like reverse sorting,
right-to-left characters, and large character sets.

To me, the main advantages of pseudo-localisation are the ability to test
some aspects of i18n without having to wait for translations to be turned
around, and allowing English-only speakers to test i18n areas, which is
otherwise extremely difficult.

I have a simple Ant task which can generate pseudo-translations like the
one above from gettext POT files, but I'm not suggesting that we should
integrate my humble Ant task into the makefiles of thousands of Fedora
packages.  If the gettext runtime code that fetches translations from .mo
files (in glibc?) were to recognise a pseudo-locale ID, it could generate
pseudo-translations on the fly from the English text.

Admittedly, there's a little more to it than simple character
substitution.  The pseudo-translator has to avoid changing things like
variable names and HTML tags, but a few rules (eg don't modify anything
between angle/square/curly brackets, don't touch %d/%s/etc) would cover
95% of cases.  In the other cases, you might mess up some HTML or fail to
expand a variable, but only users who choose to use a pseudo-locale would
ever see these problems.

Would there be any interest in getting something like this into glibc?

[--- S??? ---]

PS this could make sense for the OpenJDK too, but that's another story.

-- 
Sean Flanigan
Senior Software Engineer
Engineering - Internationalisation
Red Hat
From sflaniga at redhat.com Thu Oct 2 04:55:49 2008
From: sflaniga at redhat.com (Sean Flanigan)
Date: Thu, 02 Oct 2008 14:55:49 +1000
Subject: Pseudo-locales for i18n testing by English speakers
Message-ID: <48E45455.7000203@redhat.com>

(Apologies for the double post on fedora-i18n-list, but I want to keep the
thread together.  Please reply to this message, not the first one.)

G'day all,

I think we should make use of pseudo-locales to test Fedora.

[--- I ????? ?? ?????? ???? ??? ?? ??????-??????? ?? ???? F?????. ---]

(In case UTF-8 doesn't make it to everyone's mail client intact, the above
sentence should look similar to the first one, except that the lower-case
characters have been replaced by other similar-looking Unicode characters.
A couple of the characters don't fit into 16 bits, and really gave my text
editors some trouble!)

See http://en.wikipedia.org/wiki/Pseudo-translation and
http://blogs.msdn.com/shawnste/archive/2006/06/27/647915.aspx for more
about pseudo-locales.  Microsoft actually used three different
pseudo-locales to test Vista, with things like reverse sorting,
right-to-left characters, and large character sets.

To me, the main advantages of pseudo-localisation are the ability to test
some aspects of i18n without having to wait for translations to be turned
around, and allowing English-only speakers to test i18n areas, which is
otherwise extremely difficult.

I have a simple Ant task which can generate pseudo-translations like the
one above from gettext POT files, but I'm not suggesting that we should
integrate my humble Ant task into the makefiles of thousands of Fedora
packages.  If the gettext runtime code that fetches translations from .mo
files (in glibc?) were to recognise a pseudo-locale ID, it could generate
pseudo-translations on the fly from the English text.

Admittedly, there's a little more to it than simple character
substitution.  The pseudo-translator has to avoid changing things like
variable names and HTML tags, but a few rules (eg don't modify anything
between angle/square/curly brackets, don't touch %d/%s/etc) would cover
95% of cases.  In the other cases, you might mess up some HTML or fail to
expand a variable, but only users who choose to use a pseudo-locale would
ever see these problems.

Would there be any interest in getting something like this into glibc?

[--- S??? ---]

PS this could make sense for the OpenJDK too, but that's another story.

-- 
Sean Flanigan
Senior Software Engineer
Engineering - Internationalisation
Red Hat
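To make those substitution rules concrete, here is a minimal sketch of the
kind of filter being described: it swaps lower-case ASCII for UTF-8
lookalikes while copying printf-style directives and bracketed spans
through untouched.  The lookalike table, the pseudo_translate name and the
simplistic handling of format directives are all made up for illustration;
this is not the Ant task mentioned above, and nothing like it currently
exists in glibc.

/* pseudo.c -- toy pseudo-translator, illustrative only.
 * Substitutes lower-case ASCII with UTF-8 lookalikes, but copies
 * printf-style directives and anything between angle/square/curly
 * brackets through untouched, per the rules described above. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One UTF-8 lookalike per letter a-z (an arbitrary accented-Latin table,
 * not the one the Ant task uses). */
static const char *lookalike[26] = {
    "à", "ƀ", "ç", "ð", "é", "ƒ", "ĝ", "ĥ", "ï", "ĵ", "ķ", "ł", "ḿ",
    "ñ", "ö", "ṕ", "ɋ", "ŕ", "š", "ŧ", "ü", "ṽ", "ŵ", "ẋ", "ý", "ž"
};

char *pseudo_translate(const char *msgid)
{
    /* Each lookalike is at most 3 bytes of UTF-8. */
    char *out = malloc(strlen(msgid) * 3 + 1);
    if (out == NULL)
        return NULL;
    char *o = out;
    for (const char *p = msgid; *p; p++) {
        if (*p == '%' && p[1]) {
            /* Keep %s, %d, %% etc. intact (full printf syntax,
             * e.g. %1$s, would need a little more care). */
            *o++ = *p++;
            *o++ = *p;
        } else if (strchr("<[{", *p)) {
            /* Copy HTML-ish tags and bracketed placeholders verbatim. */
            char close = (*p == '<') ? '>' : (*p == '[') ? ']' : '}';
            while (*p && *p != close)
                *o++ = *p++;
            if (!*p)
                break;              /* unbalanced bracket: stop here */
            *o++ = *p;              /* the closing bracket itself */
        } else if (*p >= 'a' && *p <= 'z') {
            size_t n = strlen(lookalike[*p - 'a']);
            memcpy(o, lookalike[*p - 'a'], n);
            o += n;
        } else {
            *o++ = *p;              /* digits, capitals, punctuation, ... */
        }
    }
    *o = '\0';
    return out;
}

int main(void)
{
    /* "Deleted %d files from <b>%s</b>" keeps its markup and directives. */
    char *s = pseudo_translate("Deleted %d files from <b>%s</b>");
    puts(s);
    free(s);
    return 0;
}

Something of roughly this shape is all the on-the-fly generation would
need; the harder question, taken up below, is where to hook it in.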
From sflaniga at redhat.com Thu Oct 2 08:01:55 2008
From: sflaniga at redhat.com (Sean Flanigan)
Date: Thu, 02 Oct 2008 18:01:55 +1000
Subject: Pseudo-locales for i18n testing by English speakers
In-Reply-To: <1222929670.4697.31.camel@localhost.localdomain>
References: <48E45455.7000203@redhat.com>
	<1222929670.4697.31.camel@localhost.localdomain>
Message-ID: <48E47FF3.5060600@redhat.com>

Ding-Yi Chen wrote:
> The pseudo locale is intriguing, and I assume it helps to some degree.
> However, this approach does have its own limitations:

Of course, pseudo-localisation testing is not the same as localisation
testing in every Fedora language, but it's something!

> 1. Lack of font support: as the attachment "lack_of_font.png" shows, the
> pseudo locale might be rendered useless if all the developers can see
> are Unicode boxes. :-P

That tells me that the developers should install better fonts - otherwise,
how can they test an internationalised application?  But to be honest, I
probably shouldn't have used
http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbols since
they're only guaranteed to be available in certain mathematical fonts such
as Code2001.  I really need to find some latinesque characters that don't
come from the BMP, nor from the maths section!

Apparently Zimbra loses (without trace) the 'e' characters in my
pseudo-translation.  Bad Zimbra!

As long as it's only a couple of characters, I think having some unusual
characters is okay, since you can still work out what's going on, at least
enough to resolve the problem by installing more fonts.

> Perhaps we should specify the minimal font set as a remedy.

Before running pseudo-localised apps, you mean?  Good idea.

I found a webapp that gives the names of unicode characters - .  Just
paste text into the "cut & paste" field and hit enter.

But how can I find the name of the font which provides a given character?
I can tell you that all my pseudo-characters are readable on my computer,
but I can't tell you where they come from.  Once I work out what fonts my
pseudo-locale requires, I'd be happy to share the info as a dependency
list.  Perhaps it would make sense to define a small Fedora package which
specified certain Unicode fonts as dependencies, as well as enabling the
hypothetical pseudo-locale support in glibc.

> 2. It doesn't really solve the language-specific problem.  Take Chinese
> character sorting, for example: characters can be sorted by Pinyin,
> Zhuyin, radical, number of strokes, and "natural" order such as
> numerical characters.  The sorting is impossible to verify without that
> knowledge.

True, but a pseudo-locale which uses reverse sorting can at least show up
whether an app is using internationalised sorting, or plain old ASCII
ordering.  And we're not limited to what Microsoft did - I don't know much
about Chinese character sorting, but we could probably come up with a
couple of alternative sorts that could be understood by an
English-speaking developer.  But I don't want to tackle that just yet!

> Still, the idea itself is good.  And surely it filters out some of the
> bugs without the help of translators.

I expect a lot of i18n/L10n bugs are not picked up until someone tests one
of the affected languages.  Some of those bugs could show up in a
pseudo-locale much earlier, which has to be an improvement.  For instance,
I've already found bugs where Eclipse and joe mess up the cursor position
when editing SMP characters, without personally knowing any SMP languages.

As an English-only developer I think it's also pretty cool to see if my
code is at least partly internationalised, which otherwise I can't see for
myself at all, except in a foreign language.  I think some English-only
developers might take more interest in i18n issues if they could easily
see the results for themselves.  And for those i18n issues which can be
demonstrated with a pseudo-locale, it can be easier for multiple
developers to talk about something which is in "English", since most
developers speak English, even if they have differing native languages.

> Since the main purpose of a pseudo locale is testing, shall we agree on
> a list of pseudo locales which have their own specified behaviour?

I think it would be good if we could fit in with Vista's chosen
pseudo-locale IDs, as listed here:
http://blogs.msdn.com/shawnste/archive/2006/06/27/647915.aspx

As I said, we certainly don't have to emulate MS completely, but I think
we should use qps for the language code.  See
http://blogs.msdn.com/michkap/archive/2007/02/04/1596987.aspx

As for the behaviours, I expect that they will change as we learn more
from testing feedback, but here are some ideas:

 a. simple character substitution, rendered text to be about the same size
 b. character substitution with expansion (eg "[--- original text ---]")
    to make strings longer
 c. maybe swapping upper and lower case

Sometimes it's handy to have more than one pseudo-locale, eg to make sure
a web client is not seeing the server's locale, so having spare locales
could be useful.  And we could have options like different sort orders.
But I'd be happy to start with (a) or (b) and leave sort orders until a
bit later.  At least with (a) and (b) it's easy to see whether someone
forgot to call gettext(), because the plain English strings will stick
out.

-- 
Sean Flanigan
Senior Software Engineer
Engineering - Internationalisation
Red Hat
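The reverse-sorting idea above is easy to check for in C code: strcoll()
follows whatever collation the current locale defines, so a reverse-sorting
pseudo-locale would flip its results, while strcmp() always compares raw
bytes and would not notice.  A minimal sketch, independent of any
particular pseudo-locale:

/* sortcheck.c -- strcmp() vs strcoll(): only the latter would notice a
 * pseudo-locale with unusual collation rules. */
#include <locale.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Take collation rules from the environment (LC_COLLATE / LANG). */
    setlocale(LC_COLLATE, "");

    const char *a = "apple";
    const char *b = "Banana";

    /* Byte order: 'a' (0x61) sorts after 'B' (0x42), whatever the locale. */
    printf("strcmp : %d\n", strcmp(a, b));

    /* Locale order: the sign of this result depends entirely on the
     * collation rules of the locale the program was started in. */
    printf("strcoll: %d\n", strcoll(a, b));
    return 0;
}

An application whose sorted lists look the same in a pseudo-locale and in
the C locale is almost certainly doing the strcmp()-style comparison.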
From sflaniga at redhat.com Fri Oct 3 00:12:44 2008
From: sflaniga at redhat.com (Sean Flanigan)
Date: Fri, 03 Oct 2008 10:12:44 +1000
Subject: Pseudo-locales for i18n testing by English speakers
In-Reply-To: <48E4D40A.6000402@redhat.com>
References: <48E45455.7000203@redhat.com> <48E4D40A.6000402@redhat.com>
Message-ID: <48E5637C.7040901@redhat.com>

Ulrich Drepper wrote:
> Sean Flanigan wrote:
>> Would there be any interest in getting something like this into glibc?
>
> Hell, no.  There is no room for testing code in the runtime.

I find it hard to believe that there is no testing code in the runtime.
Even OS kernels have functions for debug messages.

> And I see absolutely no need whatsoever to have it there.  It seems to
> be the wrong place altogether.

Perhaps it is; I'm mostly a Java programmer, but in Java I'd be looking to
hook into the ResourceBundle class, which is responsible for fetching
translated strings.  I'd much prefer to keep my grubby mitts out of glibc,
but I was given to understand that the gettext() calls at runtime are
actually implemented in glibc (rather than, say, "libgettext").  If that's
wrong, please enlighten me!

> For PO files you create appropriate translations.  For the locales
> themselves you derive a file from the existing locales, compile it using
> localedef, and just use it.
>
> None of the locale code is hardcoded in glibc.  Why should this be?

Is the implementation of "fetch translations from MO files under
/usr/share/locale/" hard-coded?  If there's already a nice programmatic
hook I could use, even better.  If I could register locale-specific
overrides of gettext(), I could add any number of dynamically generated
locales.

A gettext() hook could also be used to fetch translations from other
sources, such as a shared, up-to-date translation database.  I think that
has the potential to be useful to a lot of people, not just developers and
testers.

It doesn't really make sense to generate thousands of pseudo PO files, and
compile them into static MOs, when all the required data (ie the English
text) is available at runtime.

Let me change my question then.  How would people feel about having a hook
to override the behaviour of gettext() in a system-wide fashion?

-- 
Sean Flanigan
Senior Software Engineer
Engineering - Internationalisation
Red Hat
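For reference, the lookup path being argued about is the standard libintl
sequence below; any pseudo-locale mechanism, inside glibc or bolted on from
outside, would have to sit somewhere behind the gettext() call.  The
"hello" domain is a placeholder, and qps is only the pseudo-locale ID
proposed earlier in the thread, not something glibc knows about.

/* hello_i18n.c -- the ordinary gettext lookup path.  Running it with
 * LANGUAGE=qps (or any locale without a hello.mo catalogue) simply prints
 * the English msgid, because gettext() falls back to the untranslated
 * string when no catalogue is found. */
#include <libintl.h>
#include <locale.h>
#include <stdio.h>

#define _(msgid) gettext(msgid)

int main(void)
{
    setlocale(LC_ALL, "");                         /* honour LANG / LC_MESSAGES */
    bindtextdomain("hello", "/usr/share/locale");  /* where hello.mo catalogues live */
    textdomain("hello");                           /* default domain for gettext() */

    fputs(_("Hello, world!\n"), stdout);
    return 0;
}

On glibc systems this links as-is; elsewhere it needs -lintl.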
From sflaniga at redhat.com Fri Oct 3 01:05:01 2008
From: sflaniga at redhat.com (Sean Flanigan)
Date: Fri, 03 Oct 2008 11:05:01 +1000
Subject: Pseudo-locales for i18n testing by English speakers
In-Reply-To: 
References: <48E45455.7000203@redhat.com>
	<1222929670.4697.31.camel@localhost.localdomain>
	<48E47FF3.5060600@redhat.com>
Message-ID: <48E56FBD.4020305@redhat.com>

Nicolas Mailhot wrote:
>> But how can I find the name of the font which provides a given
>> character?
>
> Answer in the fonts SIG wiki
> http://fedoraproject.org/wiki/Category:Fonts_SIG
>
> Looking at your interests and questions, you should really read it and
> join the SIG (same for other interested people)

Thanks Nicolas, but I can't make much sense of that wiki.

I gather that it covers the problem of "where can I find a font with
character X?", which would be useful, but in this case I want to know
"which of my current fonts is providing the character X?"  Or when I
choose the font "Monospace 10", which font *really* provides the character
X?  (Which probably depends on whether I'm using an X application, a Java
Swing app or a Java SWT app...)

I guess I should be reading up on font substitution.

-- 
Sean Flanigan
Senior Software Engineer
Engineering - Internationalisation
Red Hat

From asgeirf at redhat.com Fri Oct 3 07:54:26 2008
From: asgeirf at redhat.com (Asgeir Frimannsson)
Date: Fri, 3 Oct 2008 03:54:26 -0400 (EDT)
Subject: Pseudo-locales for i18n testing by English speakers
In-Reply-To: <28741995.3551223019840423.JavaMail.asgeirf@localhost.localdomain>
Message-ID: <31390880.3571223020324059.JavaMail.asgeirf@localhost.localdomain>

----- "Sean Flanigan" wrote:
> Ulrich Drepper wrote:
>> Sean Flanigan wrote:
>>> Would there be any interest in getting something like this into glibc?
>>
>> Hell, no.  There is no room for testing code in the runtime.
>
> Is the implementation of "fetch translations from MO files under
> /usr/share/locale/" hard-coded?  If there's already a nice programmatic
> hook I could use, even better.  If I could register locale-specific
> overrides of gettext(), I could add any number of dynamically generated
> locales.

It is set by bindtextdomain().

Somewhat related, look at a previous discussion relating to Ubuntu's
patched glibc for supporting language-packs:
http://sources.redhat.com/ml/libc-alpha/2005-03/msg00105.html

> A gettext() hook could also be used to fetch translations from other
> sources, such as a shared, up-to-date translation database.  I think
> that has the potential to be useful to a lot of people, not just
> developers and testers.
>
> It doesn't really make sense to generate thousands of pseudo PO files,
> and compile them into static MOs, when all the required data (ie the
> English text) is available at runtime.
>
> Let me change my question then.  How would people feel about having a
> hook to override the behaviour of gettext() in a system-wide fashion?

I guess you could experiment with LD_PRELOAD for this.  Tim Foster
experimented with this a while ago:
http://blogs.sun.com/timf/entry/how_much_translation_do_you

cheers,
asgeir
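Roughly what that LD_PRELOAD experiment could look like, assuming the
pseudo-locale is selected with LANGUAGE=qps and using a stand-in
pseudo-translator; this is a sketch, not Tim Foster's code, and a more
complete shim would also interpose dgettext() and dcgettext():

/* pseudo_preload.c -- LD_PRELOAD shim that overrides gettext().
 * Build:  gcc -shared -fPIC -o pseudo_preload.so pseudo_preload.c -ldl
 * Run:    LD_PRELOAD=./pseudo_preload.so LANGUAGE=qps some-program
 * (LD_PRELOAD is ignored for suid binaries, as noted below.) */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in pseudo-translator: wraps the msgid in the "expansion" style
 * ("[--- original text ---]") discussed earlier.  Deliberately leaked;
 * good enough for a test shim. */
static char *pseudo_translate(const char *msgid)
{
    char *out = malloc(strlen(msgid) + 11);
    if (out == NULL)
        return (char *)msgid;
    sprintf(out, "[--- %s ---]", msgid);
    return out;
}

char *gettext(const char *msgid)
{
    static char *(*real_gettext)(const char *);
    if (real_gettext == NULL)
        real_gettext = (char *(*)(const char *))dlsym(RTLD_NEXT, "gettext");

    /* Only pseudo-translate when the pseudo-locale is requested. */
    const char *lang = getenv("LANGUAGE");
    if (lang == NULL || strncmp(lang, "qps", 3) != 0)
        return real_gettext(msgid);

    return pseudo_translate(msgid);
}

Whether LANGUAGE is the right switch, and whether plain gettext() is the
only entry point worth catching, is exactly what such an experiment would
show.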
From sflaniga at redhat.com Mon Oct 6 01:17:12 2008
From: sflaniga at redhat.com (Sean Flanigan)
Date: Mon, 06 Oct 2008 11:17:12 +1000
Subject: Pseudo-locales for i18n testing by English speakers
In-Reply-To: <31390880.3571223020324059.JavaMail.asgeirf@localhost.localdomain>
References: <31390880.3571223020324059.JavaMail.asgeirf@localhost.localdomain>
Message-ID: <48E96718.4040905@redhat.com>

Asgeir Frimannsson wrote:
> ----- "Sean Flanigan" wrote:
>> A gettext() hook could also be used to fetch translations from other
>> sources, such as a shared, up-to-date translation database.  I think
>> that has the potential to be useful to a lot of people, not just
>> developers and testers.
>>
>> It doesn't really make sense to generate thousands of pseudo PO files,
>> and compile them into static MOs, when all the required data (ie the
>> English text) is available at runtime.
>>
>> Let me change my question then.  How would people feel about having a
>> hook to override the behaviour of gettext() in a system-wide fashion?
>
> I guess you could experiment with LD_PRELOAD for this.  Tim Foster
> experimented with this a while ago:
> http://blogs.sun.com/timf/entry/how_much_translation_do_you

Thank you, Asgeir, that's just the information I needed.

Looks like LD_PRELOAD won't work on suid binaries, but it should be
possible to pseudo-localise 95% of packages without touching their build
processes.

-- 
Sean Flanigan
Senior Software Engineer
Engineering - Internationalisation
Red Hat

From sflaniga at redhat.com Tue Oct 7 00:04:35 2008
From: sflaniga at redhat.com (Sean Flanigan)
Date: Tue, 07 Oct 2008 10:04:35 +1000
Subject: Pseudo-locales for i18n testing by English speakers
In-Reply-To: <46a038f90810060034y2ac8de3pf535bc057f97a3ed@mail.gmail.com>
References: <48E45455.7000203@redhat.com>
	<46a038f90810060034y2ac8de3pf535bc057f97a3ed@mail.gmail.com>
Message-ID: <48EAA793.8010606@redhat.com>

Martin Langhoff wrote:
> 2008/10/2 Sean Flanigan :
>> I have a simple Ant task which can generate pseudo-translations like
>> the one above from gettext POT files,
>
> I am after a few sets of "latin-lookalike" character tables I can use.
> Have you (or anyone) got pointers to good tables?

Well, I've made up a couple of simple ones (also attached as UTF-8):

ASCII:    "abcdefghijklmnopqrstuvwxyz"
BMP only: "??????????????????????????"
BMP+SMP:  "??????????????????????????"

You could also try googling for "LATIN SMALL LETTER {A,B,C,...} WITH",
which should turn up all sorts of modified latin characters, such as LATIN
SMALL LETTER V WITH RIGHT HOOK.  Another option is the Wikipedia Unicode
page http://en.wikipedia.org/wiki/List_of_Unicode_characters, which has
several sections for extended latin scripts; the Unicode mapping tables
down the bottom are handy if you want to go directly to a certain Unicode
range (eg to get away from the BMP).

> The simple example phrase you provided hit a bug in moodle (php webapp)
> straight away - I think a few webapps have trouble with that funny 'e'
> (U+1D5BE).  Interestingly, it's also present in Jira (Java-based
> webapp).  Might be an iconv issue.

I chose that 'e' specifically because it wasn't part of the BMP, but
apparently the mathematical alphanumeric symbols are a bit of a special
case - I'm not sure if systems are expected to provide font substitution
for them.

Zimbra (written in Java) had trouble with the 'e' too - it just removed it
entirely.  I think a lot of programs have trouble with characters that
don't fit into 16-bit Unicode.  My text editors and Thunderbird can show
the 'e' character, but the cursor handling is all wrong on those lines.

-- 
Sean Flanigan
Senior Software Engineer
Engineering - Internationalisation
Red Hat
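The 16-bit trouble is easy to demonstrate: U+1D5BE occupies four bytes in
UTF-8 and needs a surrogate pair in UTF-16, so any code that counts bytes
or 16-bit units as characters misplaces the cursor.  A small illustration,
assuming a UTF-8 locale is available in the environment:

/* smp_lengths.c -- why a character outside the BMP (here U+1D5BE, the
 * troublesome 'e') confuses code that equates characters with bytes or
 * with 16-bit units. */
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_CTYPE, "");            /* needs a UTF-8 locale in the environment */

    /* "t" U+1D5BE "st", written out as UTF-8 bytes. */
    const char *s = "t\xF0\x9D\x96\xBE" "st";

    printf("UTF-8 bytes : %zu\n", strlen(s));   /* 7: the SMP char is 4 bytes  */

    wchar_t wbuf[16];
    size_t n = mbstowcs(wbuf, s, 16);
    printf("code points : %zu\n", n);           /* 4 on glibc (32-bit wchar_t) */

    /* In UTF-16 (Java strings, Windows, some editors' buffers) the same
     * text is 5 units, because U+1D5BE needs a surrogate pair -- the
     * mismatch behind the cursor-position bugs mentioned above. */
    return 0;
}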