[Avocado-devel] [RFC] Text/binary data and encodings: how they relate to Avocado, extensions and tests

Cleber Rosa crosa at redhat.com
Wed Apr 18 01:18:07 UTC 2018


Recently, Avocado has seen a lot of changes brought by the Python 3
port.  One fundamental difference between Python 2 and 3 is under
the spotlight: how to deal with "text" and "binary" data[1].

It's then important to make it clear where Avocado stands (or is
headed) when it comes to handling text, binary data and encodings,
which is the goal of this document.

First, let's review some very basic concepts.

Bytes, the unassuming arrays
============================

On both Python 2 and 3, there's "bytes".  On Python 2, it's nothing
but an alias to "str"[2]::

   >>> import sys; sys.version[0]
   '2'
   >>> bytes is str
   True

One of the striking characteristics of "bytes" is that every byte
counts, that is::

   >>> aacute = b'\xc3\xa1'
   >>> len(aacute)
   2

This is as simple as it gets.  The "bytes" type is an "array" of
bytes.

Also, if it's not clear enough: this sequence of two bytes happens to
be **one way** to **represent** the "LATIN SMALL LETTER A WITH
ACUTE"[3] character, as defined by the Unicode standard, in a given
encoding.  Please pause for a moment and let that information settle.

Old habits die hard
===================

We, human beings, are used to dealing with text.  Developers, being a
special kind of human being, are used to dealing with *character
arrays* instead.  Those are, or have been for a long time, sequences
of one-byte characters with specific (but somewhat implicit) meaning.

Many developers will still assume that each byte contains a value that
maps to the ascii(7) table::

   Oct   Dec   Hex   Char                        Oct   Dec   Hex   Char
   ────────────────────────────────────────────────────────────────────────
   000   0     00    NUL '\0' (null character)   100   64    40    @
   001   1     01    SOH (start of heading)      101   65    41    A
   002   2     02    STX (start of text)         102   66    42    B
   ...
   076   62    3E    >                           176   126   7E    ~
   077   63    3F    ?                           177   127   7F    DEL

Some other developers will assume that ASCII is a thing of the past,
and each one-byte character means something according to the latin1(7)
mapping::

   ISO 8859-1 characters
   The following table displays the characters in ISO 8859-1, which
   are printable and unlisted in the ascii(7) manual page.

   Oct   Dec   Hex   Char   Description
   ────────────────────────────────────────────────────────────────────
   240   160   A0           NO-BREAK SPACE
   241   161   A1     ¡     INVERTED EXCLAMATION MARK
   242   162   A2     ¢     CENT SIGN
   ...
   376   254   FE     þ     LATIN SMALL LETTER THORN
   377   255   FF     ÿ     LATIN SMALL LETTER Y WITH DIAERESIS

Then, there's yet another group of developers who believe that a byte
in an array of bytes may be either a character, or part of a
character.  They believe that because Unicode and UTF-8 are the new
standard and can be assumed to be everywhere.

The fact is, all those developers are wrong.  Not because an array of
bytes cannot contain what they believe, but because one can only guess
the character set (encoding) that an array of bytes maps to.

Data itself carries no intrinsic meaning
========================================

Pure data doesn't have any meaning.  Its meaning depends on the
interpretation given, that is, some kind of context around it.

When dealing with text, the meaning of data is usually determined by a
character set, a mapping table or some more advanced encoding and
decoding mechanism.

For instance, the following sequence of numbers expressed in
decimal format and separated by spaces::

  65 66 67 68 69

Will only mean the first letters of the western alphabet, ``ABCDE``,
**if** we determine that its meaning is based on the ASCII character
set (besides other details such as ordering, separator used, etc).
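That interpretation step can be made explicit in Python (a Python 3
sketch, since ``bytes([...])`` is a Python 3 construct)::

```python
# Five byte values: pure data, with no intrinsic meaning
data = bytes([65, 66, 67, 68, 69])

# Only when a character set is chosen does the data become text
print(data.decode('ascii'))  # ABCDE
```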

Turning arrays of bytes into text
=================================

On many occasions, usually when data is destined for humans, it is
necessary to present it, and to deal with it, in a different way.
Here, we use the abstract term *text* to refer to data that is more
meaningful to humans, and would usually be found in documents (such as
this one) intended to be distributed and read by us, the non-machine
beings.

Reusing the example given earlier, one can do the following on a Python interpreter::

   >>> aacute = b'\xc3\xa1'
   >>> len(aacute.decode('utf-8'))
   1

The process of turning bytes into "text" is called "decoding" by
Python.  It helps to think of bytes as something that humans cannot
understand and consequently needs deciphering (or decoding) to then
become something readable by humans.

In this process, the encoding is of the utmost importance.  It's
analogous to a symmetric key used in a cryptographic operation.  For
instance, let's look at what happens when using the same data with a
different encoding::

   >>> aacute = b'\xc3\xa1'
   >>> len(aacute.decode('utf-16'))
   1
   >>> print(aacute.decode('utf-16'))
   ꇃ

Or giving too little data for a given encoding::

   >>> aacute.decode('utf-32')
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/usr/lib64/python2.7/encodings/utf_32.py", line 11, in decode
       return codecs.utf_32_decode(input, errors, True)
   UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-1: truncated data

Even though Unicode is increasingly popular, it's also a good idea to
remind ourselves that other, non-Unicode encodings exist.  For
instance, look at the same data when decoded using a character set
developed for the Thai language::

   >>> len(aacute.decode('tis-620'))
   2
   >>> print(aacute.decode('tis-620'))
   รก

Now, think about this: if you expect quick, consistent and reliable
cryptographic operations, would you save a key for later use?  Or
would you just guess it whenever you need it?

Hopefully, you've answered that you would save the key.  The same
applies to encoding: you should keep track of what you're using.

What Python offers
==================

There are a number of features that Python offers related to the
encoding used.  Some of them have differences depending on the
Python version.  When that's the case, the version used is made
clear.

Let's review those features now.

sys.getfilesystemencoding()
---------------------------

From the documentation, this function will *"Return the name of the
encoding used to convert Unicode filenames into system file names, or
None if the system default encoding is used"*.

To demo how this works, let's create a base directory with ASCII only
characters (and using the byte type to avoid any implicit encoding)::

  >>> import os
  >>> os.mkdir(b'/tmp/mydir')

And then, let's explicitly create a directory, again using a sequence of
bytes::

  >>> os.mkdir(b'/tmp/mydir/\xc3\xa1')

If you look at the contents of the ``/tmp/mydir`` directory, you
should find a single entry::

  >>> os.listdir(b'/tmp/mydir')
  ['\xc3\xa1']

Which is just what we expected.  Now, we'll start Python (2.7) with
an environment variable that will influence the encoding it'll use for
the conversion of Unicode filenames::

  $ LANG=en_US.ANSI_X3.4-1968 python2.7
  >>> import sys
  >>> sys.getfilesystemencoding()
  'ANSI_X3.4-1968'

Now, let's ask Python to list all files (by using the standard library
module ``glob``) in that directory::

  $ LANG=en_US.ANSI_X3.4-1968 python2.7 -c "import glob; print(glob.glob(u'/tmp/mydir/\u00e1*'))"
  []

The list is empty because ``glob`` cannot represent the Unicode
reference in the filesystem encoding in use.  Basically, think of what
would happen if you were to do::

  >>> u'/tmp/mydir/\u00e1*'.encode(sys.getfilesystemencoding())

On the other hand, by using an appropriate encoding::

  $ LANG=en_US.UTF-8 python2.7 -c "import glob; print(glob.glob(u'/tmp/mydir/\u00e1*'))"
  [u'/tmp/mydir/\xe1']

The point here is that ``sys.getfilesystemencoding()`` will be used by
some Python libraries when working with filenames.
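To sketch why the first ``glob`` call came back empty: encoding the
non-ASCII reference with an ASCII-like filesystem encoding simply
fails (``'ascii'`` is used below as a stand-in for
``ANSI_X3.4-1968``)::

```python
try:
    u'/tmp/mydir/\u00e1*'.encode('ascii')
except UnicodeEncodeError as details:
    # the reference cannot be expressed in the filesystem encoding,
    # so no filename could ever match it
    print(details)
```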

.. warning:: Don't expect any code to be perfect.  For instance, the
             author did run into some issues with the ``glob`` module
             used in the example above.

sys.std{in,out,err}.encoding
----------------------------

An ``encoding`` attribute may be set on ``sys.stdin``, ``sys.stdout``
and ``sys.stderr`` to let applications know how to input and output
meaningful text.

Suppose you need to read text from the standard input and save it to
a file on a specific encoding.  The following script is going to be
used as an example (``read_encode.py``)::

  import sys


  # On Python 3 "str" is unicode
  if sys.version_info[0] >= 3:
      unicode = str

  sys.stdout.write("Enter text:\n")

  input_read = sys.stdin.readline().strip()
  if isinstance(input_read, unicode):
      bytes_read = input_read.encode(sys.stdin.encoding)
  else:
      bytes_read = input_read

  with open('/tmp/data.bin', 'wb') as data_file:
      data_file.write(bytes_read)

Now, on both Python 2 and 3 this produces the same results::

  $ python2 -c 'import sys; print(sys.stdin.encoding)'
  UTF-8
  $ python2 read_encode.py
  Enter text:
  áéíóú
  $ file /tmp/data.bin
  /tmp/data.bin: UTF-8 Unicode text, with no line terminators

  $ python3 -c 'import sys; print(sys.stdin.encoding)'
  UTF-8
  $ python3 read_encode.py
  Enter text:
  áéíóú
  $ file /tmp/data.bin
  /tmp/data.bin: UTF-8 Unicode text, with no line terminators

The encoding set on ``sys.stdin.encoding`` was important to
the example script as it needs to turn unicode into bytes.

Now, suppose that your application, while reading input that matches
the user's environment, must produce a file in the ``UTF-32``
encoding.  The code to do that could look similar to the following
example (``write_utf32.py``)::

  import sys


  sys.stdout.write("Enter text:\n")

  input_read = sys.stdin.readline().strip()
  if isinstance(input_read, bytes):
      unicode_str = input_read.decode(sys.stdin.encoding)
  else:
      unicode_str = input_read

  with open('/tmp/data.bin.utf32', 'wb') as data_file:
      data_file.write(unicode_str.encode('UTF-32'))

Again, let's see how this performs under Python 2 and 3::

  $ python2 -c 'import sys; print(sys.stdin.encoding)'
  UTF-8
  $ python2 write_utf32.py
  Enter text:
  áéíóú
  $ file /tmp/data.bin.utf32
  /tmp/data.bin.utf32: Unicode text, UTF-32, little-endian

  $ python3 -c 'import sys; print(sys.stdin.encoding)'
  UTF-8
  $ python3 write_utf32.py
  Enter text:
  áéíóú
  $ file /tmp/data.bin.utf32
  /tmp/data.bin.utf32: Unicode text, UTF-32, little-endian

.. tip:: do not assume that ``sys.std{in,out,err}`` will always have
         the ``encoding`` attribute, or that it will be set to a valid
         encoding.  For instance, when ``sys.stdin`` is not a TTY,
         its ``encoding`` attribute will have a ``None`` value.
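A defensive pattern is to treat the attribute as optional and fall
back to other sources.  The fallback chain below is just a
suggestion, not Avocado's actual behavior::

```python
import locale
import sys

# encoding may be missing or None (e.g. when stdin is a pipe), so
# fall back to the locale's preference, and finally to a fixed default
encoding = (getattr(sys.stdin, 'encoding', None) or
            locale.getpreferredencoding() or
            'utf-8')
```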

A few points stand out here:

1) Using Unicode strings internally, as an intermediate format, gives
   you the freedom to read (decode) from different encodings, and at
   the same time, to write (encode) into any other encoding.

2) Code that is expected to work under both Python 2 and 3 needs
   some extra handling with regard to the data type being handled.

3) While determining the data type, one can either check for ``bytes``
   or for ``unicode``.  While it's certainly a matter of preference
   and style, keep in mind that the ``bytes`` name exists on both
   Python 2 and 3, while ``unicode`` exists only on Python 2.
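Points 1 and 3 can be condensed into a small helper that normalizes
incoming data to unicode at the boundary.  This is only a sketch: the
name ``to_text`` and the UTF-8 fallback are choices made here, not an
established API::

```python
def to_text(data, encoding=None):
    """Decode bytes into a unicode string; pass text through as-is."""
    # checking for bytes works on both Python 2 (alias to str) and 3
    if isinstance(data, bytes):
        return data.decode(encoding or 'utf-8')
    return data

# bytes are decoded, text is returned untouched
assert to_text(b'\xc3\xa1') == u'\u00e1'
assert to_text(u'\u00e1') == u'\u00e1'
```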

locale
------

This Python standard library module is a wrapper around POSIX
locale-related functionality.

Because this discussion is about text encodings, let's focus on the
``locale.getpreferredencoding()`` function.  According to the
documentation, it *"Return(s) the encoding used for text data,
according to user preferences"*.  As it wraps the specificities of
different platforms, it may not be able to determine the encoding on
some systems, and because of that, the documentation notes that
*"this function only returns a guess"*.

Even though it may be a guess, it is probably the best bet you can
make.

.. tip:: Many non-Linux/UNIX platforms implement some level of POSIX
         functionality, and that happens to be the case for the
         ``locale`` features discussed here.  Because of that, the
         Python ``locale`` module can also be found on platforms such
         as Microsoft Windows.

Error Handling
--------------

Some encoding and decoding operations just won't be possible.  The
most straightforward example is when you're trying to map a value
that is outside the bounds of the mapping table.

For instance, the ASCII character set defines mappings for values in
the range of 0 up to 127 (7f in hexadecimal).  That means that a value
larger than 127 (7f in hexadecimal) will cause an error.

When dealing with Unicode strings in Python, those errors are
represented as a ``UnicodeError`` (whose most common subclasses
are ``UnicodeEncodeError`` or ``UnicodeDecodeError``).

Getting back to the simple example given before, trying to decode
the value 126 (7e when represented in hexadecimal) using the ASCII
character set should work fine::

  >>> b'\x7e'.decode('ascii')
  u'~'

But anything larger than 127 (7f) won't work::

   >>> b'\x80'.decode('ascii')
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)

The reason for the failure is explicit in the error message: 0x80
(given in hex) is decimal 128, which is indeed not in ``range(128)``
(which is zero based, and thus contains 0-127).

There may be situations in which a different error-handling strategy
may be beneficial.  Instead of catching ``UnicodeError`` exceptions
and handling them on an individual basis, it's possible to use a
registered error handler.  Let's use a builtin error handler as
an example.
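Before the longer example, here's a quick sketch of the builtin
handlers in action on a decode (note that ``backslashreplace`` only
works on the decoding side on Python 3.5 and later)::

```python
data = b'caf\xe9'  # 'café' encoded in Latin-1, invalid as ASCII

# the offending byte is replaced with U+FFFD, dropped, or escaped
print(data.decode('ascii', errors='replace'))           # caf + U+FFFD
print(data.decode('ascii', errors='ignore'))            # caf
print(data.decode('ascii', errors='backslashreplace'))  # caf\xe9
```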

Suppose that your application reads from a file that is known to be
encoded in ``UTF-8``, and you need to produce output in the system's
preferred encoding (as defined by ``locale.getpreferredencoding()``).
To make for a more realistic example, let's imagine that the
application is a test runner like Avocado itself, reading from a file containing
definitions of test variations and parameters, and writing out the
test variation IDs that were executed.  The test variations/parameters
file will look like this (again, encoded in ``UTF-8``)::

  intel-überdisk-workstation-20-12b3:cpu=intel;disk=überdisk;
  intel-virtio-workstation-20-b322:cpu=intel;disk=virtio
  amd-überdisk-workstation-20-c523:cpu=amd;disk=überdisk
  amd-virtio-workstation-20-ddf3:cpu=amd;disk=virtio

And the code to parse and report the tests could look like this::

  import io
  import locale


  INTERNAL_ENCODING = 'UTF-8'

  with io.open('parameters', 'r', encoding=INTERNAL_ENCODING) as parameters_file:
      parameters_lines = parameters_file.readlines()

  test_variants_run = [line.split(":", 1)[0] for line in parameters_lines]
  with io.open('report.txt', 'w',
               encoding=locale.getpreferredencoding()) as output_file:
      output_file.write(u"\n".join(test_variants_run))

Now, on a given system, this runs as expected::

  $ python -c 'import locale; print(locale.getpreferredencoding())'
  UTF-8
  $ python read_parameters_write_report.py && cat report.txt
  intel-überdisk-workstation-20-12b3
  intel-virtio-workstation-20-b322
  amd-überdisk-workstation-20-c523
  amd-virtio-workstation-20-ddf3
  $ file report.txt
  report.txt: UTF-8 Unicode text

But on a **different** system::

  $ python -c 'import locale; print(locale.getpreferredencoding())'
  ANSI_X3.4-1968
  $ python read_parameters_write_report.py && cat report.txt
  Traceback (most recent call last):
    File "read_parameters_write_report.py", line 12, in <module>
      output_file.write(u"\n".join(test_variants_run))
  UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 6: ordinal not in range(128)

One possible solution is using an error handler, such as ``replace``.
By adding the ``errors`` parameter to the ``io.open``::

  --- read_parameters_write_report.py       2018-04-17 18:33:26.781059079 -0400
  +++ read_parameters_write_report.py.new   2018-04-17 18:33:58.677181944 -0400
  @@ -9,5 +9,6 @@

   test_variants_run = [line.split(":", 1)[0] for line in parameters_lines]
   with io.open('report.txt', 'w',
  -             encoding=locale.getpreferredencoding()) as output_file:
  +             encoding=locale.getpreferredencoding(),
  +             errors='replace') as output_file:
       output_file.write(u"\n".join(test_variants_run))

The result becomes::

  intel-?berdisk-workstation-20-12b3
  intel-virtio-workstation-20-b322
  amd-?berdisk-workstation-20-c523
  amd-virtio-workstation-20-ddf3

Which may be better than crashing, but may also be unacceptable
because information is lost.  One alternative is to escape the data.
Using the ``backslashreplace`` error handler, ``report.txt`` would look
like::

  intel-\xfcberdisk-workstation-20-12b3
  intel-virtio-workstation-20-b322
  amd-\xfcberdisk-workstation-20-c523
  amd-virtio-workstation-20-ddf3

This way, no information is lost, and the generated report respects
the system preferred encoding::

  $ file report.txt
  report.txt: ASCII text
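When the builtin handlers don't fit, ``codecs.register_error`` allows
registering a custom one.  The handler and the name ``u_escape``
below are purely illustrative::

```python
import codecs


def u_escape(error):
    # turn each unencodable character into a 'u+XXXX' marker
    bad = error.object[error.start:error.end]
    replacement = u''.join(u'u+%04x' % ord(char) for char in bad)
    # return the replacement text and the position to resume from
    return (replacement, error.end)

codecs.register_error('u_escape', u_escape)

print(u'\xfcberdisk'.encode('ascii', errors='u_escape'))
# b'u+00fcberdisk'
```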

Guidelines
==========

This section sets the general guidelines for byte/text data in
Avocado, and consequently for the encoding used.  It should be
followed by Avocado plugins developed externally, so that the
combined result is consistent.

It can also be used as a guideline for test writers targeting Avocado
on both Python 2 and 3.

1) When generating text that will be consumed by humans, Avocado SHOULD
   respect the preferred system encoding.  When that is not available,
   Avocado's default encoding (currently ``UTF-8``, as defined in
   ``avocado/core/defaults.py``) should be used.

2) When operating on data that may or may not contain text, Avocado
   SHOULD treat the data as binary.  If the owner of the data knows it
   contains text destined for humans, what we call text, then the data
   owner should handle the decoding.  It's OK for utility APIs to have
   helper functionality.  One example is the
   ``avocado.utils.process.CmdResult`` class, which contains both
   ``stdout`` and the ``stdout_text`` attribute/property.  Even then,
   the user producing the data is responsible for determining the
   encoding used when treating the data as text.

3) When operating on data that provides encoding as metadata (by using
   an alternative channel or that can reliably be obtained from the
   data itself), Avocado MUST respect that encoding.  One example is
   respecting the encoding that can be given on the ``Content-Type``
   headers on an HTTP session.

4) Avocado functionality CAN restrict the encodings it generates if an
   expressive enough character set is used and the generated data
   contains metadata that clearly defines the encoding used.  One
   example is the HTML plugin, which is currently limited to producing
   content in ``UTF-8``.

5) All input given by humans to the Avocado test runner, such as test
   references, parameters coming from files and other loader
   implementations, command line parameter values and others, should
   be treated as text unless noted otherwise.  This means that Avocado
   should be able to deal with test references given in the
   system's preferred encoding transparently.

6) Avocado code should, when operating on text data, use unicode
   strings internally (``unicode`` on Python 2, and ``str`` on Python
   3).
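As an illustration of points 2 and 6, a class holding
possibly-binary data could pair raw bytes with a text property,
leaving the encoding decision to the data's owner.  This is a sketch
inspired by the ``CmdResult`` stdout/stdout_text split, not its actual
implementation::

```python
class Result(object):
    def __init__(self, stdout=b'', encoding='utf-8'):
        self.stdout = stdout      # raw bytes, no meaning assumed
        self.encoding = encoding  # declared by the data's owner

    @property
    def stdout_text(self):
        # decoding only happens when text is explicitly requested
        return self.stdout.decode(self.encoding)

result = Result(stdout=b'\xc3\xa1')
assert result.stdout == b'\xc3\xa1'       # binary view, untouched
assert result.stdout_text == u'\u00e1'    # text view, on demand
```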

Besides those points, it's worth noting that a body of utility
functionality related to binary and text data, and to encoding
handling, is growing organically, and can be seen in modules such as
``avocado.utils.astring``.  Further functionality is currently being
proposed upstream and may soon be part of the Avocado libraries.

Caveats
=======

While handling text and binary types on Avocado, please pay attention
to the following caveats:

1) The Avocado test runner replaces the stock
   ``sys.std{in,out,err}.encoding``, so if you're writing a plugin, do
   not assume/expect these to contain an encoding setting.

2) Some features on core Avocado, as well as on external plugins,
   still fall short of the guidelines described here.  This is a
   work in progress.  Please exercise care when relying on their
   current behavior.

---

[1] - https://docs.python.org/3/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
[2] - https://docs.python.org/2.7/c-api/object.html#c.PyObject_Bytes
[3] - http://unicode.scarfboy.com/?s=u%2B00e1

-- 
Cleber Rosa
[ Sr Software Engineer - Virtualization Team - Red Hat ]
[ Avocado Test Framework - avocado-framework.github.io ]
[  7ABB 96EB 8B46 B94D 5E0F  E9BB 657E 8D33 A5F2 09F3  ]



