[Avocado-devel] [RFC] Text/binary data and encodings: how they relate to Avocado, extensions and tests
Cleber Rosa
crosa at redhat.com
Wed Apr 18 01:18:07 UTC 2018
Recently, Avocado has seen a lot of changes brought by the Python 3
port. One fundamental difference between Python 2 and 3 is under
the spotlight: how to deal with "text" and "binary" data[1].
It's then important to make it clear where Avocado stands (or is
headed) when it comes to handling text, binary data and encodings,
which is the goal of this document.
First, let's review some very basic concepts.
Bytes, the unassuming arrays
============================
On both Python 2 and 3, there's "bytes". On Python 2, it's nothing
but an alias to "str"[2]::
>>> import sys; sys.version[0]
'2'
>>> bytes is str
True
One of the striking characteristics of "bytes" is that every byte
counts, that is::
>>> aacute = b'\xc3\xa1'
>>> len(aacute)
2
This is as simple as it gets. The "bytes" type is an "array" of
bytes.
Also, if it's not clear enough, this sequence of two bytes happens to
be **one way** to **represent** the "LATIN SMALL LETTER A WITH
ACUTE"[3] character, as defined by the Unicode standard, in a given
encoding. Please pause for a moment and let that information settle.
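To make the "one way to represent" point concrete, here is a minimal sketch that encodes the same single character with a few encodings. The choice of encodings is just an assumption for illustration; any codec able to represent the character would do.

```python
# One character, several possible byte representations.
aacute = u'\u00e1'  # LATIN SMALL LETTER A WITH ACUTE

encoded = {}
for name in ('utf-8', 'utf-16-le', 'utf-32-le', 'latin-1'):
    # Each codec maps the same character to its own byte sequence
    encoded[name] = aacute.encode(name)

for name in sorted(encoded):
    print('%-10s %d byte(s): %r' % (name, len(encoded[name]), encoded[name]))
```

The two bytes ``\xc3\xa1`` used above are what the ``utf-8`` codec produces; the other codecs yield different sequences, of different lengths, for the very same character.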
Old habits die hard
===================
We, human beings, are used to dealing with text. Developers, being a
special kind of human being, are used to dealing with *character arrays*
instead. Those are, or have been for a long time, sequences of
one-byte characters with specific (but somewhat implicit) meaning.
Many developers will still assume that each byte contains a value that
maps to the ascii(7) table::
Oct   Dec   Hex   Char                        Oct   Dec   Hex   Char
────────────────────────────────────────────────────────────────────────
000   0     00    NUL '\0' (null character)   100   64    40    @
001   1     01    SOH (start of heading)      101   65    41    A
002   2     02    STX (start of text)         102   66    42    B
...
076   62    3E    >                           176   126   7E    ~
077   63    3F    ?                           177   127   7F    DEL
Some other developers will assume that ASCII is a thing of the past,
and each one-byte character means something according to the latin1(7)
mapping::
ISO 8859-1 characters
The following table displays the characters in ISO 8859-1, which
are printable and unlisted in the ascii(7) manual page.
Oct   Dec   Hex   Char   Description
────────────────────────────────────────────────────────────────────
240   160   A0           NO-BREAK SPACE
241   161   A1    ¡      INVERTED EXCLAMATION MARK
242   162   A2    ¢      CENT SIGN
...
376   254   FE    þ      LATIN SMALL LETTER THORN
377   255   FF    ÿ      LATIN SMALL LETTER Y WITH DIAERESIS
Then, there's yet another group of developers who believe that a byte
in an array of bytes may be either a character, or part of a
character. They believe that because Unicode and "UTF-8" are the
new standard and can be assumed to be everywhere.
The fact is, all those developers are wrong. Not because an array of
bytes cannot contain what they believe, but because one can only
guess which character set (encoding) an array of bytes maps to.
Data itself carries no intrinsic meaning
========================================
Pure data doesn't have any meaning. Its meaning depends on the
interpretation given, that is, some kind of context around it.
When dealing with text, the meaning of data is usually determined by a
character set, a mapping table or some more advanced encoding and
decoding mechanism.
For instance, the following sequence of numbers expressed in
decimal format and separated by spaces::
65 66 67 68 69
Will only mean the first letters of the western alphabet, ``ABCDE``,
**if** we determine that its meaning is based on the ASCII character
set (besides other details such as ordering, separator used, etc).
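As a rough illustration of how many assumptions are involved, the sketch below turns the string ``65 66 67 68 69`` into text twice. The separator, the numeric base and the character set are all assumptions made by the reader of the data, not something the data carries.

```python
raw = "65 66 67 68 69"

# Assumptions: space separated, base 10, one value per character, ASCII
decimal_values = [int(token, 10) for token in raw.split(" ")]
as_decimal = bytes(bytearray(decimal_values)).decode('ascii')

# Same digits, but now assuming the numbers are written in base 16
hex_values = [int(token, 16) for token in raw.split(" ")]
as_hex = bytes(bytearray(hex_values)).decode('ascii')

print(as_decimal)
print(as_hex)
```

Changing just one assumption (the numeric base) gives a completely different, yet equally valid looking, piece of text.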
Turning arrays of bytes into text
=================================
On many occasions, usually when data is destined for humans, it is
necessary to present it, and to deal with it, in a different way.
Here, we use the abstract term *text* to refer to data that is more
meaningful to humans, and would usually be found in documents (such as
this one) intended to be distributed and read by us, the non-machine
beings.
Reusing the example given earlier, one can do on a Python interpreter::
>>> aacute = b'\xc3\xa1'
>>> len(aacute.decode('utf-8'))
1
The process of turning bytes into "text" is called "decoding" by
Python. It helps to think of bytes as something that humans cannot
understand and consequently needs deciphering (or decoding) to then
become something readable by humans.
In this process, the encoding is of the utmost importance. It's
analogous to a symmetric key used in a cryptographic operation. For
instance, let's look at what happens when using the same data with a
different encoding::
>>> aacute = b'\xc3\xa1'
>>> len(aacute.decode('utf-16'))
1
>>> print(aacute.decode('utf-16'))
ꇃ
Or giving too little data for a given encoding::
>>> aacute.decode('utf-32')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.7/encodings/utf_32.py", line 11, in decode
return codecs.utf_32_decode(input, errors, True)
UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-1: truncated data
Even though Unicode is increasingly popular, it's also a good idea to
remind ourselves that other, non-Unicode encodings exist. For
instance, look at the same data when decoded using a character set
developed for the Thai language::
>>> len(aacute.decode('tis-620'))
2
>>> print(aacute.decode('tis-620'))
รก
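The experiments above can be collected into a single sketch that decodes the same two bytes under several encodings. The byte order is given explicitly for UTF-16 (``utf-16-le``) so that the result does not depend on the machine running it; that is a choice made for this example only.

```python
data = b'\xc3\xa1'

results = {}
for name in ('utf-8', 'utf-16-le', 'latin-1', 'tis-620', 'utf-32'):
    try:
        results[name] = data.decode(name)
    except UnicodeDecodeError as error:
        # Two bytes are not enough for a UTF-32 code unit
        results[name] = error

for name in sorted(results):
    print('%-10s %r' % (name, results[name]))
```

Every successful decoding is "valid", but only one of them gives back the character the producer of the data had in mind.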
Now, think about this: if you expect quick, consistent and reliable
cryptographic operations, would you save a key for later use? Or
would you just guess it whenever you need it?
Hopefully, you've answered that you would save the key. The same
applies to encoding: you should keep track of what you're using.
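A minimal sketch of "saving the key" could look like the following. The helper names (``save_text`` and ``load_text``) and the use of a temporary directory are assumptions made for this example; the point is only that the same, explicitly named encoding is used on both sides.

```python
import io
import os
import shutil
import tempfile


def save_text(path, text, encoding):
    """Write text using an explicitly chosen encoding."""
    with io.open(path, 'w', encoding=encoding) as handle:
        handle.write(text)


def load_text(path, encoding):
    """Read text back, using the encoding that was kept track of."""
    with io.open(path, 'r', encoding=encoding) as handle:
        return handle.read()


ENCODING = 'utf-16'  # the "key", kept together with the code using it
directory = tempfile.mkdtemp()
path = os.path.join(directory, 'example.txt')

save_text(path, u'\u00e1', ENCODING)
restored = load_text(path, ENCODING)

# Guessing a different encoding instead of using the saved one
try:
    load_text(path, 'utf-8')
    guess_worked = True
except UnicodeDecodeError:
    guess_worked = False

shutil.rmtree(directory)
```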
What Python offers
==================
There are a number of features that Python offers related to the
encoding used. Some of them have differences depending on the
Python version. When that's the case, the version used is made
clear.
Let's review those features now.
sys.getfilesystemencoding()
---------------------------
From the documentation, this function will *"Return the name of the
encoding used to convert Unicode filenames into system file names, or
None if the system default encoding is used"*.
To demo how this works, let's create a base directory with ASCII only
characters (and using the byte type to avoid any implicit encoding)::
>>> import os
>>> os.mkdir(b'/tmp/mydir')
And then, let's explicitly create a directory, again using a sequence of
bytes::
>>> os.mkdir(b'/tmp/mydir/\xc3\xa1')
If you look at the content of the ``/tmp/mydir`` directory, you should
find a single entry::
>>> os.listdir(b'/tmp/mydir')
['\xc3\xa1']
Which is just what we expected. Now, we'll start Python (2.7) with an
environment variable that will influence the encoding it'll use for
conversion of Unicode filenames::
$ LANG=en_US.ANSI_X3.4-1968 python2.7
>>> import sys
>>> sys.getfilesystemencoding()
'ANSI_X3.4-1968'
Now, let's ask Python to list all files (by using the standard library
module ``glob``) in that directory::
$ LANG=en_US.ANSI_X3.4-1968 python2.7 -c "import glob;
print(glob.glob(u'/tmp/mydir/\u00e1*'))"
[]
The list is empty because ``glob`` fails to match the reference given in the
encoding used. Basically, think of what would happen if you were to
do::
>>> u'/tmp/mydir/\u00e1*'.encode(sys.getfilesystemencoding())
On the other hand, by using an appropriate encoding::
$ LANG=en_US.UTF-8 python2.7 -c "import glob;
print(glob.glob(u'/tmp/mydir/\u00e1*'))"
[u'/tmp/mydir/\xe1']
The point here is that ``sys.getfilesystemencoding()`` will be used by
some Python libraries when working with filenames.
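To get a feeling for what such a conversion involves, here is a small sketch of a hypothetical helper (``encode_filename`` is not part of Python or Avocado) that tries to turn a text filename into bytes. It is only an approximation of what libraries do, and it ignores details such as the special error handlers Python 3 uses for filenames.

```python
import sys


def encode_filename(name, encoding=None):
    """Return the bytes for a text filename, or None if the
    encoding in use cannot represent it."""
    if encoding is None:
        # Fall back to the default encoding if none is reported
        encoding = sys.getfilesystemencoding() or sys.getdefaultencoding()
    try:
        return name.encode(encoding)
    except UnicodeEncodeError:
        return None


print(encode_filename(u'/tmp/mydir/\u00e1', 'ascii'))
print(encode_filename(u'/tmp/mydir/\u00e1', 'utf-8'))
```

With an ASCII-only encoding there is simply no sequence of bytes for the requested name, which is one way to end up with an empty list of matches.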
.. warning:: Don't expect any code to be perfect. For instance, the
author could find some issues with the ``glob`` module
used in the example above.
sys.std{in,out,err}.encoding
----------------------------
An ``encoding`` attribute may be set on ``sys.stdin``, ``sys.stdout``
and ``sys.stderr`` to let applications know how to input and output
meaningful text.
Suppose you need to read text from the standard input and save it to
a file in a specific encoding. The following script is going to be
used as an example (``read_encode.py``)::
import sys
# On Python 3 "str" is unicode
if sys.version_info[0] >= 3:
    unicode = str
sys.stdout.write("Enter text:\n")
input_read = sys.stdin.readline().strip()
if isinstance(input_read, unicode):
    bytes_read = input_read.encode(sys.stdin.encoding)
else:
    bytes_read = input_read
with open('/tmp/data.bin', 'wb') as data_file:
    data_file.write(bytes_read)
Now, on both Python 2 and 3 this produces the same results::
$ python2 -c 'import sys; print(sys.stdin.encoding)'
UTF-8
$ python2 read_encode.py
Enter text:
áéíóú
$ file /tmp/data.bin
/tmp/data.bin: UTF-8 Unicode text, with no line terminators
$ python3 -c 'import sys; print(sys.stdin.encoding)'
UTF-8
$ python3 read_encode.py
Enter text:
áéíóú
$ file /tmp/data.bin
/tmp/data.bin: UTF-8 Unicode text, with no line terminators
The encoding set on ``sys.stdin.encoding`` was important to
the example script as it needs to turn unicode into bytes.
Now, suppose that your application, while reading input that matches
the user's environment, must produce a file in the ``UTF-32``
encoding. The code to do that could look similar to the following
example (``write_utf32.py``)::
import sys
sys.stdout.write("Enter text:\n")
input_read = sys.stdin.readline().strip()
if isinstance(input_read, bytes):
    unicode_str = input_read.decode(sys.stdin.encoding)
else:
    unicode_str = input_read
with open('/tmp/data.bin.utf32', 'wb') as data_file:
    data_file.write(unicode_str.encode('UTF-32'))
Again, let's see how this performs under Python 2 and 3::
$ python2 -c 'import sys; print(sys.stdin.encoding)'
UTF-8
$ python2 write_utf32.py
Enter text:
áéíóú
$ file /tmp/data.bin.utf32
/tmp/data.bin.utf32: Unicode text, UTF-32, little-endian
$ python3 -c 'import sys; print(sys.stdin.encoding)'
UTF-8
$ python3 write_utf32.py
Enter text:
áéíóú
$ file /tmp/data.bin.utf32
/tmp/data.bin.utf32: Unicode text, UTF-32, little-endian
.. tip:: do not assume that ``sys.std{in,out,err}`` will always have
the ``encoding`` attribute, or that they'll be set to a valid
encoding. For instance, on Python 2, when ``sys.stdin`` is not a TTY,
its ``encoding`` attribute will have a ``None`` value.
A few points are worth noting here:
1) Using Unicode strings internally, as an intermediate format, gives
you the freedom to read (decode) from different encodings, and at
the same time, to write (encode) into any other encoding.
2) Code that is expected to work under both Python 2 and 3 needs
some extra handling with regard to the data type being handled.
3) While determining the data type, one can either check for ``bytes``
or for ``unicode``. While it's certainly a matter of preference
and style, keep in mind that the ``bytes`` name exists on both
Python 2 and 3, while ``unicode`` exists only on Python 2.
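Those points could be condensed into a few small helpers, sketched below. The names ``stream_encoding``, ``to_text`` and ``to_bytes`` are made up for this example; they check for ``bytes`` (which exists on both Python versions) and avoid relying on a stream having a usable ``encoding`` attribute.

```python
import locale


def stream_encoding(stream, fallback=None):
    """Return the stream's encoding, or a fallback if unset."""
    encoding = getattr(stream, 'encoding', None)
    if encoding:
        return encoding
    return fallback or locale.getpreferredencoding()


def to_text(data, encoding):
    """Decode bytes into text; text is returned unchanged."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


def to_bytes(data, encoding):
    """Encode text into bytes; bytes are returned unchanged."""
    if isinstance(data, bytes):
        return data
    return data.encode(encoding)
```

A script such as ``write_utf32.py`` could then read with ``to_text(line, stream_encoding(sys.stdin))`` and write with ``to_bytes(text, 'UTF-32')``, keeping text as the intermediate format.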
locale
------
This Python standard library module is a wrapper around POSIX
locale-related functionality.
Because this discussion is about text encodings, let's focus on the
``locale.getpreferredencoding()`` function. According to the
documentation, it *"Return(s) the encoding used for text data,
according to user preferences"*. As it wraps the specifics of different
platforms, it may not be able to determine it on some systems, and
because of that, the documentation notes that *"this function only
returns a guess"*.
Even though it may be a guess, it is probably the best bet you can
make.
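Since it is a guess, one possible approach is to validate it before use. The sketch below assumes a fallback of ``utf-8`` and a helper name (``preferred_encoding``) that are choices made for this example only.

```python
import codecs
import locale

FALLBACK_ENCODING = 'utf-8'  # assumption made for this sketch


def preferred_encoding(fallback=FALLBACK_ENCODING):
    """Return the preferred encoding if Python knows a codec
    for it, or the fallback otherwise."""
    try:
        name = locale.getpreferredencoding()
        # Raises LookupError if no codec is registered for the name
        codecs.lookup(name)
    except (LookupError, TypeError, locale.Error):
        return fallback
    return name


print(preferred_encoding())
```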
.. tip:: Many non-Linux/UNIX platforms implement some level of POSIX
functionality, and that happens to be the case for the
``locale`` features discussed here. Because of that, the
Python ``locale`` module can also be found on platforms such
as Microsoft Windows.
Error Handling
--------------
Some encoding and decoding operations just won't be possible. The
most straightforward example is when you're trying to map a value
that is outside the bounds of the mapping table.
For instance, the ASCII character set defines mappings for values in
the range of 0 up to 127 (7f in hexadecimal). That means that a value
larger than 127 (7f in hexadecimal) will cause an error.
When dealing with Unicode strings in Python, those errors are
represented as a ``UnicodeError`` (whose most common subclasses
are ``UnicodeEncodeError`` or ``UnicodeDecodeError``).
Getting back to the simple example given before, trying to decode
the value 126 (7e when represented in hexadecimal) using the ASCII
character set should work fine::
>>> b'\x7e'.decode('ascii')
u'~'
But anything larger than 127 (7f) won't work::
>>> b'\x80'.decode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
The reason for the failure is explicit in the error message: 0x80
(given in hex) is decimal 128, which is indeed not in ``range(128)``
(which is zero based, and thus contains 0-127).
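The same information shown in the error message is also available as attributes of the exception, which can be handy when handling those errors on an individual basis. A minimal sketch:

```python
data = b'\x7e\x80'  # a valid ASCII byte followed by an invalid one

failure = None
try:
    data.decode('ascii')
except UnicodeDecodeError as error:
    failure = {
        'encoding': error.encoding,  # codec that failed
        'start': error.start,        # index of the first bad byte
        'end': error.end,            # index right after the bad data
        'bad': error.object[error.start:error.end],
        'reason': error.reason,
    }

print(failure)
```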
There may be situations in which different error handling may be
beneficial. Instead of catching ``UnicodeError`` exceptions and
handling them on an individual basis, it's possible to use a
registered error handler. Let's use a builtin error handler as
an example.
Suppose that your application reads from a file that is known to be
encoded in ``UTF-8``, and you need to output to the system's preferred
encoding (as defined by ``locale.getpreferredencoding()``). To make
for a more realistic example, let's imagine that the application is
a test runner like Avocado itself, reading from a file containing
definitions of test variations and parameters, and writing out the
test variation IDs that were executed. The test variations/parameters
file will look like this (again, encoded in ``UTF-8``)::
intel-überdisk-workstation-20-12b3:cpu=intel;disk=überdisk;
intel-virtio-workstation-20-b322:cpu=intel;disk=virtio
amd-überdisk-workstation-20-c523:cpu=amd;disk=überdisk
amd-virtio-workstation-20-ddf3:cpu=amd;disk=virtio
And the code to parse and report the tests could look like this::
import io
import locale
INTERNAL_ENCODING = 'UTF-8'
with io.open('parameters', 'r', encoding=INTERNAL_ENCODING) as parameters_file:
    parameters_lines = parameters_file.readlines()
test_variants_run = [line.split(":", 1)[0] for line in parameters_lines]
with io.open('report.txt', 'w',
             encoding=locale.getpreferredencoding()) as output_file:
    output_file.write(u"\n".join(test_variants_run))
Now, on a given system, this runs as expected::
$ python -c 'import locale; print(locale.getpreferredencoding())'
UTF-8
$ python read_parameters_write_report.py && cat report.txt
intel-überdisk-workstation-20-12b3
intel-virtio-workstation-20-b322
amd-überdisk-workstation-20-c523
amd-virtio-workstation-20-ddf3
$ file report.txt
report.txt: UTF-8 Unicode text
But on a **different** system::
$ python -c 'import locale; print(locale.getpreferredencoding())'
ANSI_X3.4-1968
$ python read_parameters_write_report.py && cat report.txt
Traceback (most recent call last):
File "read_parameters_write_report.py", line 12, in <module>
output_file.write(u"\n".join(test_variants_run))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 6: ordinal not in range(128)
One possible solution is using an error handler, such as ``replace``,
by adding the ``errors`` parameter to the ``io.open`` call::
--- read_parameters_write_report.py 2018-04-17 18:33:26.781059079 -0400
+++ read_parameters_write_report.py.new 2018-04-17 18:33:58.677181944 -0400
@@ -9,5 +9,6 @@
 test_variants_run = [line.split(":", 1)[0] for line in parameters_lines]
 with io.open('report.txt', 'w',
-             encoding=locale.getpreferredencoding()) as output_file:
+             encoding=locale.getpreferredencoding(),
+             errors='replace') as output_file:
     output_file.write(u"\n".join(test_variants_run))
The result becomes::
intel-?berdisk-workstation-20-12b3
intel-virtio-workstation-20-b322
amd-?berdisk-workstation-20-c523
amd-virtio-workstation-20-ddf3
Which may be better than crashing, but may also be unacceptable
because information is lost. One alternative is to escape the data.
Using the ``backslashreplace`` error handler, ``report.txt`` would look
like::
intel-\xfcberdisk-workstation-20-12b3
intel-virtio-workstation-20-b322
amd-\xfcberdisk-workstation-20-c523
amd-virtio-workstation-20-ddf3
This way, no information is lost, and the generated report respects
the system preferred encoding::
$ file report.txt
report.txt: ASCII text
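The builtin error handlers can also be compared directly, without any files involved. The sketch below encodes one of the variant names into ASCII using a few of the handlers Python ships with; the selection of handlers is just for illustration.

```python
name = u'intel-\u00fcberdisk'

handlers = ('replace', 'backslashreplace', 'xmlcharrefreplace', 'ignore')
encoded = dict((handler, name.encode('ascii', handler))
               for handler in handlers)

for handler in handlers:
    print('%-18s %r' % (handler, encoded[handler]))

# The escaped form can be turned back into the original text
recovered = encoded['backslashreplace'].decode('unicode_escape')
```

Keep in mind that reversing ``backslashreplace`` this way is only unambiguous if the original text contains no backslashes of its own.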
Guidelines
==========
This section sets the general guidelines for byte/text data in
Avocado, and consequently for the encoding used. It should be
followed by Avocado plugins developed externally, so that a consistent
combined work is achieved.
It can also be used as a guideline for test writers who are targeting
Avocado on both Python 2 and 3.
1) When generating text that will be consumed by humans, Avocado SHOULD
respect the preferred system encoding. When that is not available,
Avocado's default encoding (currently ``UTF-8``, as defined in
``avocado/core/defaults.py``) should be used.
2) When operating on data that may or may not contain text, Avocado
SHOULD treat the data as binary. If the owner of the data knows it
contains data destined for humans (what we call text), then the data
owner should handle the decoding. It's OK for utility APIs to have
helper functionality. One example is the
``avocado.utils.process.CmdResult`` class, which contains both the
``stdout`` and the ``stdout_text`` attribute/property. Even then,
the user producing the data is responsible for determining the
encoding used when treating the data as text.
3) When operating on data that provides encoding as metadata (by using
an alternative channel or that can reliably be obtained from the
data itself), Avocado MUST respect that encoding. One example is
respecting the encoding that can be given on the ``Content-Type``
headers on an HTTP session.
4) Avocado functionality MAY restrict the encodings it generates if an
expressive enough character set is used and the generated data
contains metadata that clearly defines the encoding used. One
example is the HTML plugin, which is currently limited to producing
content in ``UTF-8``.
5) All input given by humans to the Avocado test runner, such as test
references, parameters coming from files and other loader
implementations, command line parameter values and others, should
be treated as text unless noted otherwise. This means that Avocado
should be able to deal with test references given in the
system's preferred encoding transparently.
6) Avocado code should, when operating on text data, use unicode
strings internally (``unicode`` on Python 2, and ``str`` on Python
3).
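As an illustration of the third guideline, here is a sketch of how encoding metadata from a ``Content-Type`` header could be respected. The helper names and the ``utf-8`` default are assumptions for this example and do not describe existing Avocado code; the header parsing relies on the standard library ``email.message.Message`` class.

```python
from email.message import Message

DEFAULT_ENCODING = 'utf-8'  # assumption, used only when no charset is given


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    message = Message()
    message['Content-Type'] = header_value
    return message.get_content_charset()


def decode_body(body, content_type, default=DEFAULT_ENCODING):
    """Decode a bytes payload, respecting the declared charset."""
    encoding = charset_from_content_type(content_type) or default
    return body.decode(encoding)


print(repr(decode_body(b'\xe1', 'text/plain; charset=ISO-8859-1')))
print(repr(decode_body(b'\xc3\xa1', 'text/plain')))
```

Both calls produce the same one-character text, even though the payloads differ, because the metadata (or its absence) determines how the bytes are interpreted.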
Besides those points, it's worth noting that a number of utility
functions related to binary and text data, and encoding handling,
are growing organically, and can be seen in modules such as
``avocado.utils.astring``. Further functionality is currently being
proposed upstream and may soon be part of the Avocado libraries.
Caveats
=======
While handling text and binary types on Avocado, please pay attention
to the following caveats:
1) The Avocado test runner replaces the stock
``sys.std{in,out,err}.encoding``, so if you're writing a plugin, do
not assume/expect these to contain an encoding setting.
2) Some features on core Avocado, as well as on external plugins,
still fall short of the guidelines described here. This is a
work in progress. Please exercise care when
---
[1] - https://docs.python.org/3/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
[2] - https://docs.python.org/2.7/c-api/object.html#c.PyObject_Bytes
[3] - http://unicode.scarfboy.com/?s=u%2B00e1
--
Cleber Rosa
[ Sr Software Engineer - Virtualization Team - Red Hat ]
[ Avocado Test Framework - avocado-framework.github.io ]
[ 7ABB 96EB 8B46 B94D 5E0F E9BB 657E 8D33 A5F2 09F3 ]