[Freeipa-devel] Python i18n (was: IPA patches pushed)

Wed Feb 20 14:01:03 UTC 2008

Rob Crittenden wrote:
> I pushed a fairly large i18n patch this evening. It includes a Japanese 
> locale so if you set your browser up properly you should be able to see 
> Japanese characters in the UI. We haven't done any i18n work for the 
> command-line yet.

I haven't had a chance to look at how the i18n handling is being done in 
our Python code yet, but I've learned the hard way it's not as obvious 
as one might think. I spent a while tracing through all the logic in 
various components to fix some i18n bugs and came up with some notes and 
a conclusion as to the optimal way to do i18n in Python and the rationale 
for it. So I thought I would share it. The key item to note here is that 
in our Python installations it is not possible for a Python program to 
reset the default encoding from ascii to utf-8 (I don't know why this is 
prohibited). Also, when I say 'output a string' what I mean is when 
CPython passes a string to another C library or writes it via IO. It's 
the other C library which is of particular importance.
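
To make the default-encoding problem concrete, here is a minimal sketch
(not taken from the IPA code). It encodes a short Japanese sample
string, chosen only for illustration, with the ascii codec, which is the
conversion Python 2 applies implicitly when the default encoding is
ascii, and then with UTF-8. All variable names are made up for the
example. It also checks for sys.setdefaultencoding, which as far as I
know site.py deletes at startup; that would explain why a program
cannot reset the default.

```python
import sys

# Sample text (assumption for illustration): three Japanese characters.
text = u'\u65e5\u672c\u8a9e'

# What the interpreter reports as its default encoding. On the Python 2
# builds discussed here this is 'ascii'.
default_encoding = sys.getdefaultencoding()

# False on a normally started interpreter, so the default cannot be reset.
can_reset_default = hasattr(sys, 'setdefaultencoding')

# Encoding with ascii fails because the code points are outside 0-127.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding explicitly with UTF-8 always succeeds.
utf8_bytes = text.encode('utf-8')
```

After this runs, ascii_failed is True and utf8_bytes holds nine bytes,
three per character.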

# i18n (internationalization) Handling
#
# Python has two builtin types which can contain strings: 'str', which
# is a conventional byte sequence where each byte contains a character,
# and 'unicode', which depending on how python was compiled is
# implemented using wide characters of 2 or 4 bytes per character
# (UCS-2, UCS-4 respectively). The Red Hat builds use UCS-4 for
# unicode.
#
#
# There are two fundamental ways an i18n string can enter a python
# application, either hardcoded via the 'u' unicode literal prefix
# (e.g. u'some i18n string') or most commonly by looking up an i18n
# string in a translation catalog via the gettext package using the
# _() method (e.g. _('some i18n string')).
#
# This application also utilizes many other packages to which i18n
# strings must be passed; by convention most packages accept i18n
# strings in the UTF-8 encoding. UTF-8 is byte orientated, representing
# a character as a single byte if possible and expanding to
# a multi-byte sequence if necessary, thus ascii text and its UTF-8
# encoding are byte identical.
#
# When python outputs a unicode string it will attempt to convert it
# to the default encoding set in site.py. It is not possible for a
# python application to set the default encoding, this is
# prohibited. In many python implementations the default encoding is
# set to ascii :-( Thus when python attempts to output a unicode
# string (UCS-2 or UCS-4) it will try to apply the default encoding
# to it (typically ascii) and the conversion will fail because many
# wide UCS code points (characters) lie outside the ascii numeric
# range.
#
# Because the external packages we 'link' with expect UTF-8 we need to
# assure strings we output to them are encoded in UTF-8. There are two
# ways to accomplish this:
#
# 1) set the default encoding to UTF-8 and internally use unicode
# strings.
#
# 2) internally use UTF-8, not unicode. Thus all i18n strings will be
# conventional byte orientated 'str' objects, not wide unicode
# (UCS). Python will happily pass these UTF-8 strings around as plain
# strings and because they are plain strings will not attempt to apply
# encoding translations to them, thus on output an i18n string encoded
# in UTF-8 remains UTF-8. The downside is len() no longer returns the
# correct number of characters (if there are multibyte characters in the
# string) and it's difficult to apply basic string operations
# (e.g. slicing, or concatenation with a unicode object). However,
# it's not common to need to perform
# such string operations on i18n strings originating from an i18n
# translation catalog.
#
# Our adopted solution is 2. We eschew use of unicode strings; all
# strings are represented as 'str', not unicode, and are encoded in
# UTF-8. We instruct gettext to not return translations via _() in
# unicode, but rather in UTF-8 by specifying the gettext codeset to be
# UTF-8. This also means any i18n strings which are not obtained by
# _() translation catalog lookup must be converted to UTF-8, e.g. with
# unicode.encode('utf-8').
#
# WARNING: It is vital that gettext.install() be called as soon as
# possible in the import loading sequence as other loaded modules may
# call _() to obtain an i18n translation from the catalog.
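
The trade-offs of option 2 can be sketched as follows. This is an
illustration only: to_utf8() is a hypothetical helper, not part of the
IPA code, and the sample strings are made up. The sketch uses the name
bytes, which is an alias for str on Python 2.6 and later, so that it
should also run on Python 3, where the byte string type is called bytes.

```python
def to_utf8(value):
    """Return value as a UTF-8 encoded byte string.

    Byte strings are assumed to be UTF-8 already and pass through
    unchanged; unicode text is encoded explicitly.
    """
    if isinstance(value, bytes):
        return value
    return value.encode('utf-8')


# Pure ascii text has the same bytes in ascii and in UTF-8.
ascii_text = u'hello'
same_bytes = ascii_text.encode('ascii') == ascii_text.encode('utf-8')

# A string that did not come from the catalog is converted once.
japanese = to_utf8(u'\u65e5\u672c\u8a9e')

# len() on the byte string counts bytes, not characters.
byte_length = len(japanese)

# The character count is only available after decoding.
char_length = len(japanese.decode('utf-8'))
```

Here byte_length is 9 while char_length is 3, which is the len()
downside mentioned in the notes.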

Example for a program, this installs the _() method in the global namespace:

import gettext

# get_config() is the application's own configuration lookup helper.
gettext.install(domain    = get_config('general', 'i18n_text_domain'),
                localedir = get_config('general', 'i18n_locale_dir'),
                unicode   = False,
                codeset   = 'utf-8')

Example for a module (note fallback setting), this sets _() locally 
within the module:

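import gettext
# codeset makes lgettext() return UTF-8 rather than the locale's
# preferred encoding.
_ = gettext.translation(get_config('general', 'i18n_text_domain'),
                        get_config('general', 'i18n_locale_dir'),
                        fallback=True,
                        codeset='utf-8').lgettext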
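
The effect of the fallback setting can be seen with a small sketch. The
domain and directory names are hypothetical and assumed to have no
catalog. The sketch uses gettext() instead of lgettext() only so that it
also runs on newer interpreters; it demonstrates the fallback behaviour,
not the encoding of the result.

```python
import gettext

# Placeholder values with no catalog behind them.
DOMAIN = 'example_domain'
LOCALEDIR = '/nonexistent/locale'

# Without fallback, a missing catalog raises an error at import time.
try:
    gettext.translation(DOMAIN, LOCALEDIR)
    missing_catalog_raised = False
except IOError:
    missing_catalog_raised = True

# With fallback=True a NullTranslations object is returned instead, and
# lookups simply return the original message.
translations = gettext.translation(DOMAIN, LOCALEDIR, fallback=True)
_ = translations.gettext
message = _('Hello')
```

So with fallback=True a module stays importable even when no
translation catalog has been installed for the current locale.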

-- 
John Dennis <jdennis at redhat.com>