From sflaniga at redhat.com Thu Oct 2 04:45:04 2008
From: sflaniga at redhat.com (Sean Flanigan)
Date: Thu, 02 Oct 2008 14:45:04 +1000
Subject: Pseudo-locales for i18n testing by English speakers
Message-ID: <48E451D0.8020904@redhat.com>

G'day all,

I think we should make use of pseudo-locales to test Fedora.

[--- I ????? ?? ?????? ???? ??? ?? ??????-??????? ?? ???? F?????. ---]

(In case UTF-8 doesn't make it to everyone's mail client intact, the above
sentence should look similar to the first one, except that the lower-case
characters have been replaced by other similar-looking Unicode characters.
A couple of the characters don't fit into 16 bits, and really gave my text
editors some trouble!)

See http://en.wikipedia.org/wiki/Pseudo-translation and
http://blogs.msdn.com/shawnste/archive/2006/06/27/647915.aspx for more
about pseudo-locales.  Microsoft actually used three different
pseudo-locales to test Vista, with things like reverse sorting,
right-to-left characters, and large character sets.

To me, the main advantages of pseudo-localisation are the ability to test
some aspects of i18n without having to wait for translations to be turned
around, and allowing English-only speakers to test i18n areas, which is
otherwise extremely difficult.

I have a simple Ant task which can generate pseudo-translations like the
one above from gettext POT files, but I'm not suggesting that we should
integrate my humble Ant task into the makefiles of thousands of Fedora
packages.  If the gettext runtime code that fetches translations from .mo
files (in glibc?) were to recognise a pseudo-locale ID, it could generate
pseudo-translations on the fly from the English text.

Admittedly, there's a little more to it than simple character
substitution.  The pseudo-translator has to avoid changing things like
variable names and HTML tags, but a few rules (eg don't modify anything
between angle/square/curly brackets, don't touch %d/%s/etc) would cover
95% of cases.  In the other cases, you might mess up some HTML or fail to
expand a variable, but only users who choose to use a pseudo-locale would
ever see these problems.

Would there be any interest in getting something like this into glibc?

[--- S??? ---]

PS this could make sense for the OpenJDK too, but that's another story.

-- 
Sean Flanigan
Senior Software Engineer
Engineering - Internationalisation
Red Hat
From sflaniga at redhat.com Thu Oct 2 04:55:49 2008
From: sflaniga at redhat.com (Sean Flanigan)
Date: Thu, 02 Oct 2008 14:55:49 +1000
Subject: Pseudo-locales for i18n testing by English speakers
Message-ID: <48E45455.7000203@redhat.com>

(Apologies for the double post on fedora-i18n-list, but I want to keep the
thread together.  Please reply to this message, not the first one.)

G'day all,

I think we should make use of pseudo-locales to test Fedora.

[--- I ????? ?? ?????? ???? ??? ?? ??????-??????? ?? ???? F?????. ---]

(In case UTF-8 doesn't make it to everyone's mail client intact, the above
sentence should look similar to the first one, except that the lower-case
characters have been replaced by other similar-looking Unicode characters.
A couple of the characters don't fit into 16 bits, and really gave my text
editors some trouble!)

See http://en.wikipedia.org/wiki/Pseudo-translation and
http://blogs.msdn.com/shawnste/archive/2006/06/27/647915.aspx for more
about pseudo-locales.  Microsoft actually used three different
pseudo-locales to test Vista, with things like reverse sorting,
right-to-left characters, and large character sets.

To me, the main advantages of pseudo-localisation are the ability to test
some aspects of i18n without having to wait for translations to be turned
around, and allowing English-only speakers to test i18n areas, which is
otherwise extremely difficult.

I have a simple Ant task which can generate pseudo-translations like the
one above from gettext POT files, but I'm not suggesting that we should
integrate my humble Ant task into the makefiles of thousands of Fedora
packages.  If the gettext runtime code that fetches translations from .mo
files (in glibc?) were to recognise a pseudo-locale ID, it could generate
pseudo-translations on the fly from the English text.

Admittedly, there's a little more to it than simple character
substitution.  The pseudo-translator has to avoid changing things like
variable names and HTML tags, but a few rules (eg don't modify anything
between angle/square/curly brackets, don't touch %d/%s/etc) would cover
95% of cases.  In the other cases, you might mess up some HTML or fail to
expand a variable, but only users who choose to use a pseudo-locale would
ever see these problems.

Would there be any interest in getting something like this into glibc?

[--- S??? ---]

PS this could make sense for the OpenJDK too, but that's another story.

-- 
Sean Flanigan
Senior Software Engineer
Engineering - Internationalisation
Red Hat
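To make those substitution rules concrete, here is a minimal sketch of the
kind of filter being described: it swaps lower-case ASCII for UTF-8
lookalikes while copying printf-style directives and bracketed spans
through untouched.  The lookalike table, the pseudo_translate name and the
simplistic handling of format directives are all made up for illustration;
this is not the Ant task mentioned above, and nothing like it currently
exists in glibc.

/* pseudo.c -- toy pseudo-translator, illustrative only.
 * Substitutes lower-case ASCII with UTF-8 lookalikes, but copies
 * printf-style directives and anything between angle/square/curly
 * brackets through untouched, per the rules described above. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One UTF-8 lookalike per letter a-z (an arbitrary accented-Latin table,
 * not the one the Ant task uses). */
static const char *lookalike[26] = {
    "à", "ƀ", "ç", "ð", "é", "ƒ", "ĝ", "ĥ", "ï", "ĵ", "ķ", "ł", "ḿ",
    "ñ", "ö", "ṕ", "ɋ", "ŕ", "š", "ŧ", "ü", "ṽ", "ŵ", "ẋ", "ý", "ž"
};

char *pseudo_translate(const char *msgid)
{
    /* Each lookalike is at most 3 bytes of UTF-8. */
    char *out = malloc(strlen(msgid) * 3 + 1);
    if (out == NULL)
        return NULL;
    char *o = out;
    for (const char *p = msgid; *p; p++) {
        if (*p == '%' && p[1]) {
            /* Keep %s, %d, %% etc. intact (full printf syntax,
             * e.g. %1$s, would need a little more care). */
            *o++ = *p++;
            *o++ = *p;
        } else if (strchr("<[{", *p)) {
            /* Copy HTML-ish tags and bracketed placeholders verbatim. */
            char close = (*p == '<') ? '>' : (*p == '[') ? ']' : '}';
            while (*p && *p != close)
                *o++ = *p++;
            if (!*p)
                break;              /* unbalanced bracket: stop here */
            *o++ = *p;              /* the closing bracket itself */
        } else if (*p >= 'a' && *p <= 'z') {
            size_t n = strlen(lookalike[*p - 'a']);
            memcpy(o, lookalike[*p - 'a'], n);
            o += n;
        } else {
            *o++ = *p;              /* digits, capitals, punctuation, ... */
        }
    }
    *o = '\0';
    return out;
}

int main(void)
{
    /* "Deleted %d files from <b>%s</b>" keeps its markup and directives. */
    char *s = pseudo_translate("Deleted %d files from <b>%s</b>");
    puts(s);
    free(s);
    return 0;
}

Something of roughly this shape is all the on-the-fly generation would
need; the harder question, taken up below, is where to hook it in.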
From sflaniga at redhat.com Thu Oct 2 08:01:55 2008
From: sflaniga at redhat.com (Sean Flanigan)
Date: Thu, 02 Oct 2008 18:01:55 +1000
Subject: Pseudo-locales for i18n testing by English speakers
In-Reply-To: <1222929670.4697.31.camel@localhost.localdomain>
References: <48E45455.7000203@redhat.com>
	<1222929670.4697.31.camel@localhost.localdomain>
Message-ID: <48E47FF3.5060600@redhat.com>

Ding-Yi Chen wrote:
> The pseudo locale is intriguing, and I assume it helps to some degree.
> However, this approach does have its own limitations:

Of course, pseudo-localisation testing is not the same as localisation
testing in every Fedora language, but it's something!

> 1. Lack of font support: as the attachment "lack_of_font.png" shows, the
> pseudo locale might be rendered useless if all the developers can see
> are Unicode boxes. :-P

That tells me that the developers should install better fonts - otherwise,
how can they test an internationalised application?  But to be honest, I
probably shouldn't have used
http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbols since
they're only guaranteed to be available in certain mathematical fonts such
as Code2001.  I really need to find some latinesque characters that don't
come from the BMP, nor from the maths section!

Apparently Zimbra loses (without trace) the 'e' characters in my
pseudo-translation.  Bad Zimbra!

As long as it's only a couple of characters, I think having some unusual
characters is okay, since you can still work out what's going on, at least
enough to resolve the problem by installing more fonts.

> Perhaps we should specify the minimal font set as a remedy.

Before running pseudo-localised apps, you mean?  Good idea.

I found a webapp that gives the names of unicode characters - .  Just
paste text into the "cut & paste" field and hit enter.

But how can I find the name of the font which provides a given character?
I can tell you that all my pseudo-characters are readable on my computer,
but I can't tell you where they come from.  Once I work out what fonts my
pseudo-locale requires, I'd be happy to share the info as a dependency
list.  Perhaps it would make sense to define a small Fedora package which
specified certain Unicode fonts as dependencies, as well as enabling the
hypothetical pseudo-locale support in glibc.

> 2. It doesn't really solve the language-specific problem.  Take Chinese
> character sorting, for example: characters can be sorted by Pinyin,
> Zhuyin, radical, number of strokes, and "natural" order such as
> numerical characters.  The sorting is impossible to verify without that
> knowledge.

True, but a pseudo-locale which uses reverse sorting can at least show up
whether an app is using internationalised sorting, or plain old ASCII
ordering.  And we're not limited to what Microsoft did - I don't know much
about Chinese character sorting, but we could probably come up with a
couple of alternative sorts that could be understood by an
English-speaking developer.  But I don't want to tackle that just yet!

> Still, the idea itself is good.  And surely it filters out some of the
> bugs without the help of translators.

I expect a lot of i18n/L10n bugs are not picked up until someone tests one
of the affected languages.  Some of those bugs could show up in a
pseudo-locale much earlier, which has to be an improvement.  For instance,
I've already found bugs where Eclipse and joe mess up the cursor position
when editing SMP characters, without personally knowing any SMP languages.

As an English-only developer I think it's also pretty cool to see if my
code is at least partly internationalised, which otherwise I can't see for
myself at all, except in a foreign language.  I think some English-only
developers might take more interest in i18n issues if they could easily
see the results for themselves.  And for those i18n issues which can be
demonstrated with a pseudo-locale, it can be easier for multiple
developers to talk about something which is in "English", since most
developers speak English, even if they have differing native languages.

> Since the main purpose of a pseudo locale is testing, shall we agree on
> a list of pseudo locales which have their own specified behaviour?

I think it would be good if we could fit in with Vista's chosen
pseudo-locale IDs, as listed here:
http://blogs.msdn.com/shawnste/archive/2006/06/27/647915.aspx

As I said, we certainly don't have to emulate MS completely, but I think
we should use qps for the language code.  See
http://blogs.msdn.com/michkap/archive/2007/02/04/1596987.aspx

As for the behaviours, I expect that they will change as we learn more
from testing feedback, but here are some ideas:

 a. simple character substitution, rendered text to be about the same size
 b. character substitution with expansion (eg "[--- original text ---]")
    to make strings longer
 c. maybe swapping upper and lower case

Sometimes it's handy to have more than one pseudo-locale, eg to make sure
a web client is not seeing the server's locale, so having spare locales
could be useful.  And we could have options like different sort orders.
But I'd be happy to start with (a) or (b) and leave sort orders until a
bit later.  At least with (a) and (b) it's easy to see whether someone
forgot to call gettext(), because the plain English strings will stick
out.

-- 
Sean Flanigan
Senior Software Engineer
Engineering - Internationalisation
Red Hat
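The reverse-sorting idea above is easy to check for in C code: strcoll()
follows whatever collation the current locale defines, so a reverse-sorting
pseudo-locale would flip its results, while strcmp() always compares raw
bytes and would not notice.  A minimal sketch, independent of any
particular pseudo-locale:

/* sortcheck.c -- strcmp() vs strcoll(): only the latter would notice a
 * pseudo-locale with unusual collation rules. */
#include <locale.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Take collation rules from the environment (LC_COLLATE / LANG). */
    setlocale(LC_COLLATE, "");

    const char *a = "apple";
    const char *b = "Banana";

    /* Byte order: 'a' (0x61) sorts after 'B' (0x42), whatever the locale. */
    printf("strcmp : %d\n", strcmp(a, b));

    /* Locale order: the sign of this result depends entirely on the
     * collation rules of the locale the program was started in. */
    printf("strcoll: %d\n", strcoll(a, b));
    return 0;
}

An application whose sorted lists look the same in a pseudo-locale and in
the C locale is almost certainly doing the strcmp()-style comparison.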
From sflaniga at redhat.com Fri Oct 3 00:12:44 2008
From: sflaniga at redhat.com (Sean Flanigan)
Date: Fri, 03 Oct 2008 10:12:44 +1000
Subject: Pseudo-locales for i18n testing by English speakers
In-Reply-To: <48E4D40A.6000402@redhat.com>
References: <48E45455.7000203@redhat.com> <48E4D40A.6000402@redhat.com>
Message-ID: <48E5637C.7040901@redhat.com>

Ulrich Drepper wrote:
> Sean Flanigan wrote:
>> Would there be any interest in getting something like this into glibc?
>
> Hell, no.  There is no room for testing code in the runtime.

I find it hard to believe that there is no testing code in the runtime.
Even OS kernels have functions for debug messages.

> And I see absolutely no need whatsoever to have it there.  It seems to
> be the wrong place altogether.

Perhaps it is; I'm mostly a Java programmer, but in Java I'd be looking to
hook into the ResourceBundle class, which is responsible for fetching
translated strings.  I'd much prefer to keep my grubby mitts out of glibc,
but I was given to understand that the gettext() calls at runtime are
actually implemented in glibc (rather than, say, "libgettext").  If that's
wrong, please enlighten me!

> For PO files you create appropriate translations.  For the locales
> themselves you derive a file from the existing locales, compile it using
> localedef, and just use it.
>
> None of the locale code is hardcoded in glibc.  Why should this be?

Is the implementation of "fetch translations from MO files under
/usr/share/locale/" hard-coded?  If there's already a nice programmatic
hook I could use, even better.  If I could register locale-specific
overrides of gettext(), I could add any number of dynamically generated
locales.

A gettext() hook could also be used to fetch translations from other
sources, such as a shared, up-to-date translation database.  I think that
has the potential to be useful to a lot of people, not just developers and
testers.

It doesn't really make sense to generate thousands of pseudo PO files, and
compile them into static MOs, when all the required data (ie the English
text) is available at runtime.

Let me change my question then.  How would people feel about having a hook
to override the behaviour of gettext() in a system-wide fashion?

-- 
Sean Flanigan
Senior Software Engineer
Engineering - Internationalisation
Red Hat
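For reference, the lookup path being argued about is the standard libintl
sequence below; any pseudo-locale mechanism, inside glibc or bolted on from
outside, would have to sit somewhere behind the gettext() call.  The
"hello" domain is a placeholder, and qps is only the pseudo-locale ID
proposed earlier in the thread, not something glibc knows about.

/* hello_i18n.c -- the ordinary gettext lookup path.  Running it with
 * LANGUAGE=qps (or any locale without a hello.mo catalogue) simply prints
 * the English msgid, because gettext() falls back to the untranslated
 * string when no catalogue is found. */
#include <libintl.h>
#include <locale.h>
#include <stdio.h>

#define _(msgid) gettext(msgid)

int main(void)
{
    setlocale(LC_ALL, "");                         /* honour LANG / LC_MESSAGES */
    bindtextdomain("hello", "/usr/share/locale");  /* where hello.mo catalogues live */
    textdomain("hello");                           /* default domain for gettext() */

    fputs(_("Hello, world!\n"), stdout);
    return 0;
}

On glibc systems this links as-is; elsewhere it needs -lintl.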
From sflaniga at redhat.com Fri Oct 3 01:05:01 2008
From: sflaniga at redhat.com (Sean Flanigan)
Date: Fri, 03 Oct 2008 11:05:01 +1000
Subject: Pseudo-locales for i18n testing by English speakers
In-Reply-To: 
References: <48E45455.7000203@redhat.com>
	<1222929670.4697.31.camel@localhost.localdomain>
	<48E47FF3.5060600@redhat.com>
Message-ID: <48E56FBD.4020305@redhat.com>

Nicolas Mailhot wrote:
>> But how can I find the name of the font which provides a given
>> character?
>
> Answer in the fonts SIG wiki
> http://fedoraproject.org/wiki/Category:Fonts_SIG
>
> Looking at your interests and questions, you should really read it and
> join the SIG (same for other interested people)

Thanks Nicolas, but I can't make much sense of that wiki.

I gather that it covers the problem of "where can I find a font with
character X?", which would be useful, but in this case I want to know
"which of my current fonts is providing the character X?"  Or when I
choose the font "Monospace 10", which font *really* provides the character
X?  (Which probably depends on whether I'm using an X application, a Java
Swing app or a Java SWT app...)

I guess I should be reading up on font substitution.

-- 
Sean Flanigan
Senior Software Engineer
Engineering - Internationalisation
Red Hat

From asgeirf at redhat.com Fri Oct 3 07:54:26 2008
From: asgeirf at redhat.com (Asgeir Frimannsson)
Date: Fri, 3 Oct 2008 03:54:26 -0400 (EDT)
Subject: Pseudo-locales for i18n testing by English speakers
In-Reply-To: <28741995.3551223019840423.JavaMail.asgeirf@localhost.localdomain>
Message-ID: <31390880.3571223020324059.JavaMail.asgeirf@localhost.localdomain>

----- "Sean Flanigan" wrote:
> Ulrich Drepper wrote:
>> Sean Flanigan wrote:
>>> Would there be any interest in getting something like this into glibc?
>>
>> Hell, no.  There is no room for testing code in the runtime.
>
> Is the implementation of "fetch translations from MO files under
> /usr/share/locale/" hard-coded?  If there's already a nice programmatic
> hook I could use, even better.  If I could register locale-specific
> overrides of gettext(), I could add any number of dynamically generated
> locales.

It is set by bindtextdomain().

Somewhat related, look at a previous discussion relating to Ubuntu's
patched glibc for supporting language-packs:
http://sources.redhat.com/ml/libc-alpha/2005-03/msg00105.html

> A gettext() hook could also be used to fetch translations from other
> sources, such as a shared, up-to-date translation database.  I think
> that has the potential to be useful to a lot of people, not just
> developers and testers.
>
> It doesn't really make sense to generate thousands of pseudo PO files,
> and compile them into static MOs, when all the required data (ie the
> English text) is available at runtime.
>
> Let me change my question then.  How would people feel about having a
> hook to override the behaviour of gettext() in a system-wide fashion?

I guess you could experiment with LD_PRELOAD for this.  Tim Foster
experimented with this a while ago:
http://blogs.sun.com/timf/entry/how_much_translation_do_you

cheers,
asgeir
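Roughly what that LD_PRELOAD experiment could look like, assuming the
pseudo-locale is selected with LANGUAGE=qps and using a stand-in
pseudo-translator; this is a sketch, not Tim Foster's code, and a more
complete shim would also interpose dgettext() and dcgettext():

/* pseudo_preload.c -- LD_PRELOAD shim that overrides gettext().
 * Build:  gcc -shared -fPIC -o pseudo_preload.so pseudo_preload.c -ldl
 * Run:    LD_PRELOAD=./pseudo_preload.so LANGUAGE=qps some-program
 * (LD_PRELOAD is ignored for suid binaries, as noted below.) */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in pseudo-translator: wraps the msgid in the "expansion" style
 * ("[--- original text ---]") discussed earlier.  Deliberately leaked;
 * good enough for a test shim. */
static char *pseudo_translate(const char *msgid)
{
    char *out = malloc(strlen(msgid) + 11);
    if (out == NULL)
        return (char *)msgid;
    sprintf(out, "[--- %s ---]", msgid);
    return out;
}

char *gettext(const char *msgid)
{
    static char *(*real_gettext)(const char *);
    if (real_gettext == NULL)
        real_gettext = (char *(*)(const char *))dlsym(RTLD_NEXT, "gettext");

    /* Only pseudo-translate when the pseudo-locale is requested. */
    const char *lang = getenv("LANGUAGE");
    if (lang == NULL || strncmp(lang, "qps", 3) != 0)
        return real_gettext(msgid);

    return pseudo_translate(msgid);
}

Whether LANGUAGE is the right switch, and whether plain gettext() is the
only entry point worth catching, is exactly what such an experiment would
show.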
From sflaniga at redhat.com Mon Oct 6 01:17:12 2008
From: sflaniga at redhat.com (Sean Flanigan)
Date: Mon, 06 Oct 2008 11:17:12 +1000
Subject: Pseudo-locales for i18n testing by English speakers
In-Reply-To: <31390880.3571223020324059.JavaMail.asgeirf@localhost.localdomain>
References: <31390880.3571223020324059.JavaMail.asgeirf@localhost.localdomain>
Message-ID: <48E96718.4040905@redhat.com>

Asgeir Frimannsson wrote:
> ----- "Sean Flanigan" wrote:
>> A gettext() hook could also be used to fetch translations from other
>> sources, such as a shared, up-to-date translation database.  I think
>> that has the potential to be useful to a lot of people, not just
>> developers and testers.
>>
>> It doesn't really make sense to generate thousands of pseudo PO files,
>> and compile them into static MOs, when all the required data (ie the
>> English text) is available at runtime.
>>
>> Let me change my question then.  How would people feel about having a
>> hook to override the behaviour of gettext() in a system-wide fashion?
>
> I guess you could experiment with LD_PRELOAD for this.  Tim Foster
> experimented with this a while ago:
> http://blogs.sun.com/timf/entry/how_much_translation_do_you

Thank you, Asgeir, that's just the information I needed.

Looks like LD_PRELOAD won't work on suid binaries, but it should be
possible to pseudo-localise 95% of packages without touching their build
processes.

-- 
Sean Flanigan
Senior Software Engineer
Engineering - Internationalisation
Red Hat

From sflaniga at redhat.com Tue Oct 7 00:04:35 2008
From: sflaniga at redhat.com (Sean Flanigan)
Date: Tue, 07 Oct 2008 10:04:35 +1000
Subject: Pseudo-locales for i18n testing by English speakers
In-Reply-To: <46a038f90810060034y2ac8de3pf535bc057f97a3ed@mail.gmail.com>
References: <48E45455.7000203@redhat.com>
	<46a038f90810060034y2ac8de3pf535bc057f97a3ed@mail.gmail.com>
Message-ID: <48EAA793.8010606@redhat.com>

Martin Langhoff wrote:
> 2008/10/2 Sean Flanigan :
>> I have a simple Ant task which can generate pseudo-translations like
>> the one above from gettext POT files,
>
> I am after a few sets of "latin-lookalike" character tables I can use.
> Have you (or anyone) got pointers to good tables?

Well, I've made up a couple of simple ones (also attached as UTF-8):

ASCII:    "abcdefghijklmnopqrstuvwxyz"
BMP only: "??????????????????????????"
BMP+SMP:  "??????????????????????????"

You could also try googling for "LATIN SMALL LETTER {A,B,C,...} WITH",
which should turn up all sorts of modified latin characters, such as LATIN
SMALL LETTER V WITH RIGHT HOOK.  Another option is the Wikipedia Unicode
page http://en.wikipedia.org/wiki/List_of_Unicode_characters, which has
several sections for extended latin scripts; the Unicode mapping tables
down the bottom are handy if you want to go directly to a certain Unicode
range (eg to get away from the BMP).

> The simple example phrase you provided hit a bug in moodle (php webapp)
> straight away - I think a few webapps have trouble with that funny 'e'
> (U+1D5BE).  Interestingly, it's also present in Jira (Java-based
> webapp).  Might be an iconv issue.

I chose that 'e' specifically because it wasn't part of the BMP, but
apparently the mathematical alphanumeric symbols are a bit of a special
case - I'm not sure if systems are expected to provide font substitution
for them.

Zimbra (written in Java) had trouble with the 'e' too - it just removed it
entirely.  I think a lot of programs have trouble with characters that
don't fit into 16-bit Unicode.  My text editors and Thunderbird can show
the 'e' character, but the cursor handling is all wrong on those lines.

-- 
Sean Flanigan
Senior Software Engineer
Engineering - Internationalisation
Red Hat
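The 16-bit trouble is easy to demonstrate: U+1D5BE occupies four bytes in
UTF-8 and needs a surrogate pair in UTF-16, so any code that counts bytes
or 16-bit units as characters misplaces the cursor.  A small illustration,
assuming a UTF-8 locale is available in the environment:

/* smp_lengths.c -- why a character outside the BMP (here U+1D5BE, the
 * troublesome 'e') confuses code that equates characters with bytes or
 * with 16-bit units. */
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_CTYPE, "");            /* needs a UTF-8 locale in the environment */

    /* "t" U+1D5BE "st", written out as UTF-8 bytes. */
    const char *s = "t\xF0\x9D\x96\xBE" "st";

    printf("UTF-8 bytes : %zu\n", strlen(s));   /* 7: the SMP char is 4 bytes  */

    wchar_t wbuf[16];
    size_t n = mbstowcs(wbuf, s, 16);
    printf("code points : %zu\n", n);           /* 4 on glibc (32-bit wchar_t) */

    /* In UTF-16 (Java strings, Windows, some editors' buffers) the same
     * text is 5 units, because U+1D5BE needs a surrogate pair -- the
     * mismatch behind the cursor-position bugs mentioned above. */
    return 0;
}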