[Freeipa-devel] Re: Python i18n

John Dennis jdennis at redhat.com
Mon Feb 25 21:54:22 UTC 2008


Recently I posted a recommendation for i18n coding practices in python
which raised some questions. I've investigated further and would like to
share what I've learned.

The recommendation I posted came about because of broken Python bindings
to C libraries. The fact that many Python bindings seem to be broken is
an unfortunate reality we're going to have to deal with.

How are the bindings broken?

When you author a Python binding for a C library you use the CPython
API. Since Python is written in C you're effectively just using the
Python internal API. Everything in a Python program is an object. When C
code is passed a Python object it must convert the object to something C
code can operate on. This conversion occurs in a family of API routines,
most notably PyArg_ParseTuple() whose role is to convert each argument
in the tuple to a C type. The expected C type is passed to
PyArg_ParseTuple() in a format string much like sscanf(). For example a
function which expects a single string might use the format string "s",
where "s" means return a pointer to a standard NULL terminated C string.
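
The Python-object-to-C-value boundary described above can also be seen
from pure Python through ctypes (a sketch, not PyArg_ParseTuple itself;
it assumes a Unix-like system whose C runtime exports strlen): a C
function expecting a NULL terminated C string happily accepts a byte
string, but a unicode text string must be converted first.

```python
import ctypes
import ctypes.util

# Load the C runtime (assumes a Unix-like system).
libc = ctypes.CDLL(ctypes.util.find_library("c") or None)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

# A byte string is already something C code can operate on.
assert libc.strlen(b"hello") == 5

# A unicode text string has no single obvious byte representation,
# so modern ctypes refuses to guess an encoding for it.
try:
    libc.strlen("hello")
    raise AssertionError("expected ctypes to reject a text string")
except ctypes.ArgumentError:
    pass
```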

Most Python bindings use the "s" format specifier for strings. Perhaps
this is because many of the bindings were written before Python had
widespread i18n support via the 'unicode' string object. Python added
Unicode string objects supplementing the traditional 'str' string
object (one byte per character). Internally Python implements unicode
objects in either the UCS-2 or UCS-4 encoding (i.e. 2 byte or 4 byte
characters). The choice of UCS-2 vs. UCS-4 is a compile time option.
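
Which of the two a given interpreter was built with can be checked from
Python (a minimal probe; note that CPython 3.3 and later abandoned the
compile-time choice, so there this always reports the wide range):

```python
import sys

# A "narrow" (UCS-2) build can only natively represent the Basic
# Multilingual Plane; a "wide" (UCS-4) build covers all of Unicode.
if sys.maxunicode == 0x10FFFF:
    build = "wide (UCS-4)"
else:
    build = "narrow (UCS-2)"
```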

The 's' format conversion specifier in CPython was expanded to accept
unicode objects in addition to str objects. Recall that the 's' format
specifier returns a pointer to a NULL terminated C string. For str
objects this conversion was for all practical purposes an identity
transformation: it simply returned the pointer to the character buffer
inside the str object used to hold the string.

But what happens for unicode string objects when the 's' format
conversion is specified? The first thing to note is the unicode object
has to be re-encoded from UCS-2 or UCS-4 to the destination encoding.
But what is the destination encoding for a traditional C string? The
fact is traditional C strings never had a specified encoding; for
historical reasons the best consensus guess is ascii. Since the 's'
format conversion does not specify the desired destination encoding,
the global default-encoding is used. For historical reasons the
default-encoding is initialized to ascii.
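
The failure mode is easy to reproduce explicitly: encoding a non-ascii
string with the ascii codec, which is exactly the conversion the 's'
specifier triggers implicitly under the default-encoding:

```python
# U+00E9 (e-acute) is outside the 7 bit ascii range.
s = u"caf\u00e9"
try:
    s.encode("ascii")
    raise AssertionError("expected the ascii codec to fail")
except UnicodeEncodeError as err:
    message = str(err)

# The conversion fails with the familiar complaint.
assert "ordinal not in range(128)" in message
```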

To implement the conversion from the unicode source to a destination
format one must know the destination encoding, know how big the
destination will be after conversion, and have available a buffer of
sufficient size to write the destination conversion into. Recalling
that the 's' format conversion returns a pointer, this means the
conversion code must allocate a buffer of the right size for the
destination encoding.

But who owns this buffer and who is responsible for freeing it? The
caller of PyArg_ParseTuple does not free the pointer it is returned,
that would violate the API. CPython solves this problem and at the same
time adds a performance optimization by caching the buffer for the
default-encoding inside the unicode Python object. Thus a Python unicode
object has a buffer in the UCS-{2,4} encoding and optionally a buffer
for the string represented in the default-encoding. When
PyArg_ParseTuple is passed a unicode object with the 's' format
conversion specifier it checks to see if there is already a buffer in
the unicode object representing the current string value in the
default-encoding; if not, it allocates a default-encoding buffer and
calls a conversion routine to convert from UCS-{2,4} to the
default-encoding, passing it the default-encoding buffer in the unicode
object. The 's' format conversion code then returns the pointer to the
cached default-encoding buffer in the unicode object. Internal
bookkeeping ensures the default-encoding buffer is updated whenever the
unicode version of the string is modified. When the unicode object is
freed, the UCS-{2,4} buffer is freed as well as the optional
default-encoding buffer. This allows the caller of the 's' conversion
to be ignorant of memory management issues and optimizes by performing
the conversion from unicode to the default-encoding only once and then
caching it alongside the UCS-{2,4} version.

What are the problems with the above?

The most significant issue is that the default-encoding is global.
Different C libraries may have vastly different expectations concerning
the encoding they expect for strings; one size does not fit all. One
library might expect the byte-oriented UTF-8 encoding while another
might expect 4 byte wide characters, a third might expect 2 byte wide
characters, and a fourth library might only accept 7 bit ascii.
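
To see why one global choice cannot serve all four consumers, encode
the same four character string for each of them (the libraries are
hypothetical, the byte counts are not):

```python
s = u"caf\u00e9"  # four characters, one of them non-ascii

assert s.encode("utf-8") == b"caf\xc3\xa9"   # 5 bytes, byte oriented
assert len(s.encode("utf-16-le")) == 8       # 2 bytes per character
assert len(s.encode("utf-32-le")) == 16      # 4 bytes per character

try:
    s.encode("ascii")                        # 7 bit only: fails outright
    raise AssertionError("expected the ascii codec to fail")
except UnicodeEncodeError:
    pass
```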

The next most significant problem is the choice of default-encoding
itself, and what happens if the default-encoding is modified after some
strings have already cached their representation in the old
default-encoding.

What should the default-encoding be?

Many argue that, for historical reasons, the only sensible
default-encoding is ascii. Thus in many Python implementations the
default-encoding is set to ascii. This is typically done in site.py.

But my Linux system uses UTF-8 as the default encoding, why can't I just
set the default-encoding to UTF-8? Then all the C libraries that various
Python bindings wrap will get their strings in UTF-8 when the format
conversion specifier is 's'. Problem solved, right?

Unfortunately, no. If you change the default-encoding then the cached
default-encoding buffers for existing strings will be in the wrong
encoding. At the moment there isn't a way for Python to invalidate
cached encodings, so there is no way to know whether a cached encoding
matches the current default-encoding. To protect against this problem
site.py removes the sys.setdefaultencoding() entry point, preventing
Python applications from modifying the default-encoding once Python
initializes. So unless your Python implementation shipped with the
default-encoding set to UTF-8 you're out of luck. In any case, setting
the default-encoding is not the right solution because, as noted above,
it is a global setting affecting every CPython binding, and bindings
may not share the same encoding requirements.
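
You can verify the lockout directly (on Python 2 the setter is deleted
by site.py during startup; Python 3 later removed it from the language
entirely):

```python
import sys

# Reading the default-encoding is always allowed...
assert isinstance(sys.getdefaultencoding(), str)

# ...but the setter is gone after interpreter initialization.
assert not hasattr(sys, "setdefaultencoding")
```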

So what is the solution?

The solution is simple: each CPython binding must explicitly specify
the encoding it wants to use. PyArg_ParseTuple supports format
conversion specifiers for strings other than 's', for example 'es',
which stands for "encoded string". The caller specifies both the
desired encoding and a pointer to a pointer to receive an allocated
buffer containing the string in the specified encoding. After using the
encoded string it is the caller's responsibility to free it.
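
In C this looks like PyArg_ParseTuple(args, "es", "utf-8", &buffer)
followed, once the string has been used, by PyMem_Free(buffer). A rough
Python-level analogue of that contract is sketched below (to_c_string
is a hypothetical helper, not CPython API): the caller names the
encoding explicitly and owns the resulting buffer.

```python
def to_c_string(text, encoding="utf-8"):
    # Hypothetical helper mirroring the 'es' contract: the caller
    # names the encoding explicitly and owns the returned buffer
    # (a bytes object here, a PyMem_Free'able char* in C).
    return text.encode(encoding)

assert to_c_string(u"caf\u00e9") == b"caf\xc3\xa9"
assert to_c_string(u"caf\u00e9", "latin-1") == b"caf\xe9"
```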

If the solution is so simple why are so many bindings still using the
incorrect 's' format conversion?

For many reasons. The most likely is that 's' works for applications
which do not use multi-byte internationalized strings and whose
libraries expect ascii. They don't know it's broken and they've gotten
away with it for a long time. The second reason is the code in the
CPython binding has to be augmented when using the 'es' specifier to
both specify the encoding and, more importantly, to free the returned
string, neither of which was required with the simple 's' specifier.
Some Python bindings are automatically created by code generators.
These code generators would have to be modified to also insert code to
check for buffers and free them at all exit points in the generated
binding function. This can be non-trivial.

But wait a minute, PyGTK+ works with unicode and it uses the 's' format
specifier, what's up with that? Your analysis must be wrong!

The PyGTK+ binding to GTK+ gets around this problem with a nasty trick
documented in this bug report:
http://bugzilla.gnome.org/show_bug.cgi?id=132040. It calls
PyUnicode_SetDefaultEncoding("utf-8") from the binding's C code, thus
changing the default-encoding from 'ascii' to 'utf-8'. But remember,
changing the default-encoding is so highly frowned upon by Python
developers that they made it impossible to do from Python code; it can
only be set internally from C code. This means an optional Python
module (gtk) silently modifies global Python state during its load
phase, creating inconsistencies with cached encodings.

 >>> import sys
 >>> sys.getdefaultencoding()
 'ascii'
 >>> import gtk
 >>> sys.getdefaultencoding()
 'utf-8'

This means that if and when you import gtk, the application's entire
i18n handling will change, and you may experience inexplicable encoding
errors, or have problems with incorrectly coded libraries/modules being
masked, until you change the order of imports or the set of modules
imported. The above bug report suggests one can reset the old
default-encoding as follows:

 > It is actually possible*, so you could do something like this if you
 > want:

 > old = sys.getdefaultencoding()
 > import gtk
 > import sys
 > reload(sys)
 > sys.setdefaultencoding(old)

 > *) the default site.py deletes the setdefaultencoding function from
 > sys. Fortunately for the small group of persistent hackers, that
 > technique is worked around by using reload, which re-creates the
 > module namespace.

However, the above comment does not address the fundamental issue: the
only reason to reset the default-encoding is that you have libraries
which expect the old default-encoding. Because the default-encoding is
global to all of CPython you can't have it both ways at once!

Summary:

If you import extension modules implemented in CPython (e.g. a Python
binding written in C), unicode strings will only work if that extension
module uses the 'es' family of format specifiers (unfortunately this is
rare). If you pass a unicode string to such a module you will likely
get an encoding error (something like "cannot convert xxx: ordinal not
in range(128)").

I know of only three ways to fix this, in order of preference:

1) modify the binding to use 'es'

2) don't use unicode in your Python application; use str objects
encoded in UTF-8 (see original post; this assumes all extension
libraries want UTF-8).

3) set the default-encoding to utf-8 with all the attendant problems
listed above.

The bottom nasty line:

To know whether we're going to have a problem using unicode in our
Python code we're going to have to examine the source code of each and
every extension module we load (or that is loaded as a consequence of
any other module load) and ascertain whether it uses 'es' instead of
's'. If any of the extension modules fail to do so we can't blindly use
unicode and may have to fall back to the workarounds referred to above,
accepting the problems which go with each of them. Yuck!

Observation:

Is it any wonder people have such problems with i18n? Few take the time
to understand it when their code breaks; instead they tweak things in
mostly ignorant ways until it works for their isolated case, and then
publish their mistakes into the larger universe of code, contributing
to the endless speculation over what constitutes correct i18n handling.
It's not a pretty picture.

-- 
John Dennis <jdennis at redhat.com>



