[Avocado-devel] [RFC] Text/binary data and encodings: how they relate to Avocado, extensions and tests

Wed Apr 18 15:21:09 UTC 2018

On 04/17/2018 09:18 PM, Cleber Rosa wrote:
> Recently, Avocado has seen a lot of changes brought by the Python 3
> port.  One fundamental difference between Python 2 and 3 is under
> the spotlight: how to deal with "text" and "binary" data[1].
> 
> It's then important to make it clear where Avocado stands (or is
> headed) when it comes to handling text, binary data and encodings,
> which is the goal of this document.
> 
> First, let's review some very basic concepts.
> 
> Bytes, the unassuming arrays
> ============================
> 
> On both Python 2 and 3, there's "bytes".  On Python 2, it's nothing
> but an alias to "str"[2]::
> 
>    >>> import sys; sys.version[0]
>    '2'
>    >>> bytes is str
>    True
> 
> One of the striking characteristics of "bytes" is that every byte
> counts, that is::
> 
>    >>> aacute = b'\xc3\xa1'
>    >>> len(aacute)
>    2
> 
> This is as simple as it gets.  The "bytes" type is an "array" of
> bytes.
> 
> Also, in case it's not clear enough, this sequence of two bytes happens to
> be **one way** to **represent** the "LATIN SMALL LETTER A WITH
> ACUTE"[3] character, as defined by the Unicode standard, in a given
> encoding.  Please pause for a moment and let that information settle.
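
To make the "one way" part concrete, here is a minimal sketch (the variable names are only illustrative) that encodes the same character with three different encodings. Each one yields a different sequence of bytes for the very same character:

```python
# The same character, three encodings, three different byte sequences.
aacute = u'\u00e1'  # LATIN SMALL LETTER A WITH ACUTE

as_utf8 = aacute.encode('utf-8')
as_latin1 = aacute.encode('latin-1')
as_utf16le = aacute.encode('utf-16-le')

print(repr(as_utf8))     # two bytes: 0xc3 0xa1
print(repr(as_latin1))   # one byte: 0xe1
print(repr(as_utf16le))  # two bytes: 0xe1 0x00
```

None of these byte sequences is "the" character; each is only a representation of it in a given encoding.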
> 
> Old habits die hard
> ===================
> 
> We, human beings, are used to dealing with text.  Developers, being a
> special kind of human being, are used to dealing with *character arrays*
> instead.  Those are, or have been for a long time, sequences of
> one-byte characters with specific (but somewhat implicit) meaning.
> 
> Many developers will still assume that each byte contains a value that
> maps to the ascii(7) table::
> 
>    Oct   Dec   Hex   Char                        Oct   Dec   Hex   Char
>    ────────────────────────────────────────────────────────────────────────
>    000   0     00    NUL '\0' (null character)   100   64    40    @
>    001   1     01    SOH (start of heading)      101   65    41    A
>    002   2     02    STX (start of text)         102   66    42    B
>    ...
>    076   62    3E    >                           176   126   7E    ~
>    077   63    3F    ?                           177   127   7F    DEL
> 
> Some other developers will assume that ASCII is a thing of the past,
> and each one-byte character means something according to the latin1(7)
> mapping::
> 
>    ISO 8859-1 characters
>    The following table displays the characters in ISO 8859-1, which are
>    printable and unlisted in the ascii(7) manual page.
> 
>    Oct   Dec   Hex   Char   Description
>    ────────────────────────────────────────────────────────────────────
>    240   160   A0           NO-BREAK SPACE
>    241   161   A1     ¡     INVERTED EXCLAMATION MARK
>    242   162   A2     ¢     CENT SIGN
>    ...
>    376   254   FE     þ     LATIN SMALL LETTER THORN
>    377   255   FF     ÿ     LATIN SMALL LETTER Y WITH DIAERESIS
> 
> Then, there's yet another group of developers who believe that a byte
> in an array of bytes may be either a character, or part of a
> character.  They believe that because Unicode and "UTF-8" are the
> new standard and can be assumed to be everywhere.
> 
> The fact is, all those developers are wrong.  Not because an array of
> bytes cannot contain what they believe, but because one can only
> guess that an array of bytes maps to a character set (an encoding).
> 
> Data itself carries no intrinsic meaning
> ========================================
> 
> Pure data doesn't have any meaning.  Its meaning depends on the
> interpretation given, that is, some kind of context around it.
> 
> When dealing with text, the meaning of data is usually determined by a
> character set, a mapping table or some more advanced encoding and
> decoding mechanism.
> 
> For instance, the following sequence of numbers expressed in
> decimal format and separated by spaces::
> 
>   65 66 67 68 69
> 
> Will only mean the first letters of the western alphabet, ``ABCDE``,
> **if** we determine that its meaning is based on the ASCII character
> set (besides other details such as ordering, separator used, etc).
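
As a small sketch of that point, the snippet below takes the bytes for ``ABCDE`` and interprets the very same values under two different character sets. The choice of ``cp037`` (an EBCDIC code page shipped with Python) is only an assumption made for contrast; any encoding that is not ASCII compatible would do:

```python
data = u'ABCDE'.encode('ascii')

# The raw values, as plain decimal numbers
values = list(bytearray(data))
print(values)  # [65, 66, 67, 68, 69]

# The same values, interpreted with two different mapping tables
as_ascii = data.decode('ascii')
as_ebcdic = data.decode('cp037')

print(as_ascii)               # ABCDE
print(as_ascii == as_ebcdic)  # False
```

The numbers never changed; only the mapping table used to give them meaning did.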
> 
> Turning arrays of bytes into text
> =================================
> 
> On many occasions, usually when data is destined for humans, it is
> necessary to present it, and to deal with it, in a different way.
> Here, we use the abstract term *text* to refer to data that is more
> meaningful to humans, and would usually be found in documents (such as
> this one) intended to be distributed and read by us, the non-machine
> beings.
> 
> Reusing the example given earlier, one can do on a Python interpreter::
> 
>    >>> aacute = b'\xc3\xa1'
>    >>> len(aacute.decode('utf-8'))
>    1
> 
> The process of turning bytes into "text" is called "decoding" by
> Python.  It helps to think of bytes as something that humans cannot
> understand and consequently needs deciphering (or decoding) to then
> become something readable by humans.
> 
> In this process, the encoding is of the utmost importance.  It's
> analogous to a symmetric key used in a cryptographic operation.  For
> instance, let's look at what happens when using the same data with a
> different encoding::
> 
>    >>> aacute = b'\xc3\xa1'
>    >>> len(aacute.decode('utf-16'))
>    1
>    >>> print(aacute.decode('utf-16'))
>    ꇃ
> 
> Or giving too little data for a given encoding::
> 
>    >>> aacute.decode('utf-32')
>    Traceback (most recent call last):
>      File "<stdin>", line 1, in <module>
>      File "/usr/lib64/python2.7/encodings/utf_32.py", line 11, in decode
>        return codecs.utf_32_decode(input, errors, True)
>    UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-1: truncated data
> 
> Even though Unicode is increasingly popular, it's also a good idea to
> remind ourselves that other, non-Unicode encodings exist.  For
> instance, look at the same data when decoded using a character set
> developed for the Thai language::
> 
>    >>> len(aacute.decode('tis-620'))
>    2
>    >>> print(aacute.decode('tis-620'))
>    รก
> 
> Now, think about this: if you expect quick, consistent and reliable
> cryptographic operations, would you save a key for later use?  Or
> would you just guess it whenever you need it?
> 
> Hopefully, you've answered that you would save the key.  The same
> applies to encoding: you should keep track of what you're using.
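
One way to "save the key" is to simply keep the encoding name next to the data. The following is only a sketch: the ``EncodedText`` container and the ``to_text()`` helper are hypothetical names made up for this example, not something Avocado provides:

```python
import collections

# Hypothetical container: the raw bytes plus the encoding they were written in
EncodedText = collections.namedtuple('EncodedText', ['data', 'encoding'])


def to_text(encoded):
    """Decode the data using the encoding that was stored with it."""
    return encoded.data.decode(encoded.encoding)


stored = EncodedText(u'\u00e1'.encode('utf-8'), 'utf-8')

# Using the saved "key" recovers the original character
recovered = to_text(stored)

# Guessing the "key" also "works", but silently produces something else
guessed = stored.data.decode('latin-1')

print(recovered == u'\u00e1')  # True
print(len(guessed))            # 2
```

Note that the wrong guess does not even raise an error here, which is what makes guessing so dangerous.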
> 
> What Python offers
> ==================
> 
> There are a number of features that Python offers related to the
> encoding used.  Some of them have differences depending on the
> Python version.  When that's the case, the version used is made
> clear.
> 
> Let's review those features now.
> 
> sys.getfilesystemencoding()
> ---------------------------
> 
> From the documentation, this function will *"Return the name of the
> encoding used to convert Unicode filenames into system file names, or
> None if the system default encoding is used"*.
> 
> To demo how this works, let's create a base directory with ASCII only
> characters (and using the byte type to avoid any implicit encoding)::
> 
>   >>> import os
>   >>> os.mkdir(b'/tmp/mydir')
> 
> And then, let's explicitly create a directory, again using a sequence of
> bytes::
> 
>   >>> os.mkdir(b'/tmp/mydir/\xc3\xa1')
> 
> If you look at the content of the ``/tmp/mydir`` directory, you should
> find a single entry::
> 
>   >>> os.listdir(b'/tmp/mydir')
>   ['\xc3\xa1']
> 
> Which is just what we expected.  Now, we'll start Python (2.7) with an
> environment variable that will influence the encoding it'll use for
> conversion of Unicode filenames::
> 
>   $ LANG=en_US.ANSI_X3.4-1968 python2.7
>   >>> import sys
>   >>> sys.getfilesystemencoding()
>   'ANSI_X3.4-1968'
> 
> Now, let's ask Python to list all files (by using the standard library
> module ``glob``) in that directory::
> 
>   $ LANG=en_US.ANSI_X3.4-1968 python2.7 -c "import glob; print(glob.glob(u'/tmp/mydir/\u00e1*'))"
>   []
> 
> The list is empty because ``glob`` fails to match the reference given in the
> encoding used.  Basically, think of what would happen if you were to
> do::
> 
>   >>> u'/tmp/mydir/\u00e1*'.encode(sys.getfilesystemencoding())
> 
> On the other hand, by using an appropriate encoding::
> 
>   $ LANG=en_US.UTF-8 python2.7 -c "import glob; print(glob.glob(u'/tmp/mydir/\u00e1*'))"
>   [u'/tmp/mydir/\xe1']
> 
> The point here is that ``sys.getfilesystemencoding()`` will be used by
> some Python libraries when working with filenames.
> 
> .. warning:: Don't expect any code to be perfect.  For instance, the
>              author could find some issues with the ``glob`` module
>              used in the example above.
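
For reference, on Python 3 only (these functions do not exist on Python 2), the ``os`` module has ``os.fsencode()`` and ``os.fsdecode()``, which convert between bytes and text file names using the filesystem encoding.  A minimal sketch, reusing the path from the example above without touching the filesystem:

```python
import os
import sys

raw_name = b'/tmp/mydir/\xc3\xa1'

# bytes -> str, using the filesystem encoding and its error handler
text_name = os.fsdecode(raw_name)

# str -> bytes, the reverse operation
round_trip = os.fsencode(text_name)

print(sys.getfilesystemencoding())
print(round_trip == raw_name)  # True
```

The text form may look odd when the filesystem encoding cannot represent the name, but converting it back yields the original bytes.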
> 
> sys.std{in,out,err}.encoding
> ----------------------------
> 
> An ``encoding`` attribute may be set on ``sys.stdin``, ``sys.stdout``
> and ``sys.stderr`` to let applications know how to input and output
> meaningful text.
> 
> Suppose you need to read text from the standard input and save it to
> a file on a specific encoding.  The following script is going to be
> used as an example (``read_encode.py``)::
> 
>   import sys
> 
> 
>   # On Python 3 "str" is unicode
>   if sys.version_info[0] >= 3:
>       unicode = str
> 
>   sys.stdout.write("Enter text:\n")
> 
>   input_read = sys.stdin.readline().strip()
>   if isinstance(input_read, unicode):
>       bytes_read = input_read.encode(sys.stdin.encoding)
>   else:
>       bytes_read = input_read
> 
>   with open('/tmp/data.bin', 'wb') as data_file:
>       data_file.write(bytes_read)
> 
> Now, on both Python 2 and 3 this produces the same results::
> 
>   $ python2 -c 'import sys; print(sys.stdin.encoding)'
>   UTF-8
>   $ python2 read_encode.py
>   Enter text:
>   áéíóú
>   $ file /tmp/data.bin
>   /tmp/data.bin: UTF-8 Unicode text, with no line terminators
> 
>   $ python3 -c 'import sys; print(sys.stdin.encoding)'
>   UTF-8
>   $ python3 read_encode.py
>   Enter text:
>   áéíóú
>   $ file /tmp/data.bin
>   /tmp/data.bin: UTF-8 Unicode text, with no line terminators
> 
> The encoding set on ``sys.stdin.encoding`` was important to
> the example script as it needs to turn unicode into bytes.
> 
> Now, suppose that your application, while reading input that matches
> the user's environment, must produce a file in the ``UTF-32``
> encoding.  The code to do that could look similar to the following
> example (``write_utf32.py``)::
> 
>   import sys
> 
> 
>   sys.stdout.write("Enter text:\n")
> 
>   input_read = sys.stdin.readline().strip()
>   if isinstance(input_read, bytes):
>       unicode_str = input_read.decode(sys.stdin.encoding)
>   else:
>       unicode_str = input_read
> 
>   with open('/tmp/data.bin.utf32', 'wb') as data_file:
>       data_file.write(unicode_str.encode('UTF-32'))
> 
> Again, let's see how this performs under Python 2 and 3::
> 
>   $ python2 -c 'import sys; print(sys.stdin.encoding)'
>   UTF-8
>   $ python2 write_utf32.py
>   Enter text:
>   áéíóú
>   $ file /tmp/data.bin.utf32
>   /tmp/data.bin.utf32: Unicode text, UTF-32, little-endian
> 
>   $ python3 -c 'import sys; print(sys.stdin.encoding)'
>   UTF-8
>   $ python3 write_utf32.py
>   Enter text:
>   áéíóú
>   $ file /tmp/data.bin.utf32
>   /tmp/data.bin.utf32: Unicode text, UTF-32, little-endian
> 
> .. tip:: Do not assume that ``sys.std{in,out,err}`` will always have
>          the ``encoding`` attribute, or that it'll be set to a valid
>          encoding.  For instance, on Python 2, when ``sys.stdin`` is
>          not a TTY, its ``encoding`` attribute will have a ``None``
>          value.
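
A defensive way to follow this tip could look like the sketch below.  The ``stream_encoding()`` helper and its ``utf-8`` default are assumptions made for this example only:

```python
import codecs
import io


def stream_encoding(stream, default='utf-8'):
    """Return a usable encoding name for a stream, or the default."""
    encoding = getattr(stream, 'encoding', None)
    if not encoding:
        # attribute is missing, None or empty
        return default
    try:
        # make sure Python actually knows this encoding
        codecs.lookup(encoding)
    except (LookupError, TypeError):
        return default
    return encoding


binary_stream = io.BytesIO()  # has no "encoding" attribute
text_stream = io.TextIOWrapper(io.BytesIO(), encoding='latin-1')

print(stream_encoding(binary_stream))  # utf-8
print(stream_encoding(text_stream))    # latin-1
```

The same helper can be given ``sys.stdin``, ``sys.stdout`` or ``sys.stderr``, whatever they happen to be at the time.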
> 
> A few points can be realized here:
> 
> 1) Using Unicode strings internally, as an intermediate format, gives
>    you the freedom to read (decode) from different encodings, and at
>    the same time, to write (encode) into any other encoding.
> 
> 2) Code that is expected to work under both Python 2 and 3 needs
>    some extra handling with regards to the data type being handled.
> 
> 3) While determining the data type, one can either check for ``bytes``
>    or for ``unicode``.  While it's certainly a matter of preference
>    and style, keep in mind that the ``bytes`` name exists on both
>    Python 2 and 3, while ``unicode`` exists only on Python 2.
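
The points above can be sketched as a pair of small helpers that check for ``bytes`` only, and thus behave the same on Python 2 and 3.  The names and the ``utf-8`` default are illustrative assumptions, not a description of any existing Avocado API:

```python
def to_text(data, encoding='utf-8', errors='strict'):
    """Return text (unicode), decoding only if given bytes."""
    if isinstance(data, bytes):
        return data.decode(encoding, errors)
    return data


def to_bytes(data, encoding='utf-8', errors='strict'):
    """Return bytes, encoding only if given text (unicode)."""
    if isinstance(data, bytes):
        return data
    return data.encode(encoding, errors)


# Read in one encoding, keep text internally, write in another
internal = to_text(b'\xc3\xa1', encoding='utf-8')
output = to_bytes(internal, encoding='utf-32-le')

print(len(internal))  # 1
print(len(output))    # 4
```

Because ``bytes is str`` on Python 2, the ``isinstance(data, bytes)`` check catches the native string there, and leaves ``unicode`` objects alone.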
> 
> locale
> ------
> 
> This Python standard library module is a wrapper around POSIX
> locale-related functionality.
> 
> Because this discussion is about text encodings, let's focus on the
> ``locale.getpreferredencoding()`` function.  According to the
> documentation, it *"Return(s) the encoding used for text data,
> according to user preferences"*.  As it wraps the specifics of different
> platforms, it may not be able to determine it on some systems, and
> because of that, the documentation notes that *"this function only
> returns a guess"*.
> 
> Even though it may be a guess, it is probably the best bet you can
> make.
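
Since it is a guess, it may be worth validating it before relying on it.  In the sketch below, ``preferred_encoding()`` and the ``UTF-8`` fallback are assumptions of this example:

```python
import codecs
import locale

FALLBACK_ENCODING = 'UTF-8'  # assumption made for this sketch


def preferred_encoding(fallback=FALLBACK_ENCODING):
    """Return the locale's preferred encoding if Python can use it."""
    try:
        encoding = locale.getpreferredencoding()
        # raises LookupError for empty or unknown encoding names
        codecs.lookup(encoding)
    except (locale.Error, LookupError, TypeError):
        return fallback
    return encoding


print(preferred_encoding())
```

The value printed depends on the environment, for instance on the ``LANG`` and ``LC_*`` variables on Linux.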
> 
> .. tip:: Many non-Linux/UNIX platforms implement some level of POSIX
>          functionality, and that happens to be the case for the
>          ``locale`` features discussed here.  Because of that, the
>          Python ``locale`` module can also be found on platforms such
>          as Microsoft Windows.
> 
> Error Handling
> --------------
> 
> Some encoding and decoding operations just won't be possible.  The
> most straightforward example is when you're trying to map a value
> that is outside the bounds of the mapping table.
> 
> For instance, the ASCII character set defines mappings for values in
> the range of 0 up to 127 (7f in hexadecimal).  That means that a value
> larger than 127 (7f in hexadecimal) will cause an error.
> 
> When dealing with Unicode strings in Python, those errors are
> represented as a ``UnicodeError`` (whose most common subclasses
> are ``UnicodeEncodeError`` or ``UnicodeDecodeError``).
> 
> Getting back to the simple example given before, trying to decode
> the value 126 (7e when represented in hexadecimal) using the ASCII
> character set should work fine::
> 
>   >>> b'\x7e'.decode('ascii')
>   u'~'
> 
> But anything larger than 127 (7f) won't work::
> 
>   >>> b'\x80'.decode('ascii')
>   Traceback (most recent call last):
>     File "<stdin>", line 1, in <module>
>   UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
> 
> The reason for the failure is explicit in the error message: 0x80
> (given in hex) is decimal 128, which is indeed not in ``range(128)``
> (which is zero based, and thus contains 0-127).
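
The same information shown in the message is also available as attributes of the exception, which is handy when the error needs to be handled in code rather than read by a person.  A minimal sketch:

```python
details = None

try:
    b'\x80'.decode('ascii')
except UnicodeDecodeError as error:
    # which codec failed, on which slice of the input, and why
    details = (error.encoding, error.start, error.end, error.reason)

print(details)  # ('ascii', 0, 1, 'ordinal not in range(128)')
```

Here ``start`` and ``end`` delimit the offending bytes within the input, with ``end`` being exclusive.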
> 
> There may be situations in which different error handling may be
> beneficial.  Instead of catching ``UnicodeError`` exceptions and
> handling them on an individual basis, it's possible to use a
> registered error handler.  Let's use a builtin error handler as
> an example.
> 
> Suppose that your application reads from a file that is known to be
> encoded in ``UTF-8``, and you need to output to the system's preferred
> encoding (as defined by ``locale.getpreferredencoding()``).  To make
> for a more realistic example, let's imagine that the application is
> a test runner like Avocado itself, reading from a file containing
> definitions of test variations and parameters, and writing out the
> test variation IDs that were executed.  The test variations/parameters
> file will look like this (again, encoded in ``UTF-8``)::
> 
>   intel-überdisk-workstation-20-12b3:cpu=intel;disk=überdisk;
>   intel-virtio-workstation-20-b322:cpu=intel;disk=virtio
>   amd-überdisk-workstation-20-c523:cpu=amd;disk=überdisk
>   amd-virtio-workstation-20-ddf3:cpu=amd;disk=virtio
> 
> And the code to parse and report the tests could look like this::
> 
>   import io
>   import locale
> 
> 
>   INTERNAL_ENCODING = 'UTF-8'
> 
>   with io.open('parameters', 'r', encoding=INTERNAL_ENCODING) as parameters_file:
>       parameters_lines = parameters_file.readlines()
> 
>   test_variants_run = [line.split(":", 1)[0] for line in parameters_lines]
>   with io.open('report.txt', 'w',
>                encoding=locale.getpreferredencoding()) as output_file:
>       output_file.write(u"\n".join(test_variants_run))
> 
> Now, on a given system, this runs as expected::
> 
>   $ python -c 'import locale; print(locale.getpreferredencoding())'
>   UTF-8
>   $ python read_parameters_write_report.py && cat report.txt
>   intel-überdisk-workstation-20-12b3
>   intel-virtio-workstation-20-b322
>   amd-überdisk-workstation-20-c523
>   amd-virtio-workstation-20-ddf3
>   $ file report.txt
>   report.txt: UTF-8 Unicode text
> 
> But on a **different** system::
> 
>   $ python -c 'import locale; print(locale.getpreferredencoding())'
>   ANSI_X3.4-1968
>   $ python read_parameters_write_report.py && cat report.txt
>   Traceback (most recent call last):
>     File "read_parameters_write_report.py", line 12, in <module>
>       output_file.write(u"\n".join(test_variants_run))
>   UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 6: ordinal not in range(128)
> 
> One possible solution is using an error handler, such as ``replace``.
> This is done by adding the ``errors`` parameter to the ``io.open`` call::
> 
>   --- read_parameters_write_report.py       2018-04-17 18:33:26.781059079 -0400
>   +++ read_parameters_write_report.py.new   2018-04-17 18:33:58.677181944 -0400
>   @@ -9,5 +9,6 @@
> 
>    test_variants_run = [line.split(":", 1)[0] for line in parameters_lines]
>    with io.open('report.txt', 'w',
>   -             encoding=locale.getpreferredencoding()) as output_file:
>   +             encoding=locale.getpreferredencoding(),
>   +             errors='replace') as output_file:
>        output_file.write(u"\n".join(test_variants_run))
> 
> The result becomes::
> 
>   intel-?berdisk-workstation-20-12b3
>   intel-virtio-workstation-20-b322
>   amd-?berdisk-workstation-20-c523
>   amd-virtio-workstation-20-ddf3
> 
> Which may be better than crashing, but may also be unacceptable
> because information is lost.  One alternative is to escape the data.
> Using the ``backslashreplace`` error handler, ``report.txt`` would look
> like::
> 
>   intel-\xfcberdisk-workstation-20-12b3
>   intel-virtio-workstation-20-b322
>   amd-\xfcberdisk-workstation-20-c523
>   amd-virtio-workstation-20-ddf3
> 
> This way, no information is lost, and the generated report respects
> the system preferred encoding::
> 
>   $ file report.txt
>   report.txt: ASCII text
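
The builtin handlers can be compared side by side without any files involved, and the same mechanism accepts custom handlers through ``codecs.register_error()``.  In the sketch below, the ``codepoint_marker`` handler is a made-up example, not something provided by Python or Avocado:

```python
import codecs

text = u'\u00fcberdisk'

# A few of the builtin error handlers, applied to the same text
for handler in ('ignore', 'replace', 'backslashreplace', 'xmlcharrefreplace'):
    print(handler, text.encode('ascii', handler))


def codepoint_marker(error):
    """Hypothetical handler: writes unencodable characters as <U+XXXX>."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    bad = error.object[error.start:error.end]
    marker = u''.join(u'<U+%04X>' % ord(char) for char in bad)
    # return the replacement text and the position to resume encoding at
    return marker, error.end


codecs.register_error('codepoint_marker', codepoint_marker)
marked = text.encode('ascii', 'codepoint_marker')
print(marked)
```

Once registered, the handler name can be used anywhere an ``errors`` parameter is accepted, including ``io.open()``.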
> 
> Guidelines
> ==========
> 
> This section sets the general guidelines for byte/text data in
> Avocado, and consequently for the encoding used.  It should be
> followed by Avocado plugins developed externally, so that a consistent
> combined work is achieved.
> 
> It can also be used as a guideline for test writers that are targeting
> Avocado on both Python 2 and 3.
> 
> 1) When generating text that will be consumed by humans, Avocado SHOULD
>    respect the preferred system encoding.  When that is not available,
>    Avocado's default encoding (currently ``UTF-8``, as defined in
>    ``avocado/core/defaults.py``) should be used.
> 
> 2) When operating on data that may or may not contain text, Avocado
>    SHOULD treat the data as binary.  If the owner of the data knows it
>    contains text destined for humans, what we call text, then the data
>    owner should handle the decoding.  It's OK for utility APIs to have
>    helper functionality.  One example is the
>    ``avocado.utils.process.CmdResult`` class, which contains both
>    ``stdout`` and the ``stdout_text`` attribute/property.  Even then,
>    the user producing the data is responsible for determining the
>    encoding used when treating the data as text.
> 
> 3) When operating on data that provides encoding as metadata (by using
>    an alternative channel or that can reliably be obtained from the
>    data itself), Avocado MUST respect that encoding.  One example is
>    respecting the encoding that can be given on the ``Content-Type``
>    headers on an HTTP session.
> 
> 4) Avocado functionality CAN restrict the encodings it generates if an
>    expressive enough character set is used and the generated data
>    contains metadata that clearly defines the encoding used.  One
>    example is the HTML plugin, which is currently limited to producing
>    content in ``UTF-8``.
> 
> 5) All input given by humans to the Avocado test runner, such as test
>    references, parameters coming from files and other loader
>    implementations, command line parameter values and others, should
>    be treated as text unless noted otherwise.  This means that Avocado
>    should be able to deal with test references given in the
>    system's preferred encoding transparently.
> 
> 6) Avocado code should, when operating on text data, use unicode
>    strings internally (``unicode`` on Python 2, and ``str`` on Python
>    3).
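
As an illustration of guideline 3 only, the sketch below obtains the encoding from a ``Content-Type`` header value using the standard library, and decodes a body accordingly.  The function names and the ``utf-8`` fallback are assumptions of this example, and it is not how any Avocado component is implemented:

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the charset parameter of a Content-Type value, if any."""
    message = Message()
    message['Content-Type'] = header_value
    return message.get_content_charset(default)


def decode_body(body, content_type, fallback='utf-8'):
    """Decode bytes using the encoding given as metadata."""
    charset = charset_from_content_type(content_type, fallback)
    return body.decode(charset)


print(charset_from_content_type('text/html; charset=ISO-8859-1'))
print(charset_from_content_type('text/html'))
print(decode_body(b'\xe1', 'text/plain; charset=ISO-8859-1') == u'\u00e1')
```

An unknown charset name would raise a ``LookupError`` on decoding; this sketch leaves that unhandled on purpose, since the right reaction depends on the caller.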
> 
> Besides those points, it's worth noting that a number of utility
> functions related to binary and text data, and encoding handling,
> are growing organically, and can be seen in modules such as
> ``avocado.utils.astring``.  Further functionality is currently being
> proposed upstream and may soon be part of the Avocado libraries.
> 
> Caveats
> =======
> 
> While handling text and binary types on Avocado, please pay attention
> to the following caveats:
> 
> 1) The Avocado test runner replaces the stock
>    ``sys.std{in,out,err}.encoding``, so if you're writing a plugin, do
>    not assume/expect these to contain an encoding setting.
> 
> 2) Some features on core Avocado, as well as on external plugins,
>    still fall short of the guidelines described here.  This is a
>    work in progress.  Please exercise care when
> 

Sorry for this "bad ending".  I just meant "please exercise care, and
investigate the code status when relying on it".

Cheers!
- Cleber.

> ---
> 
> [1] - https://docs.python.org/3/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
> [2] - https://docs.python.org/2.7/c-api/object.html#c.PyObject_Bytes
> [3] - http://unicode.scarfboy.com/?s=u%2B00e1
> 

-- 
Cleber Rosa
[ Sr Software Engineer - Virtualization Team - Red Hat ]
[ Avocado Test Framework - avocado-framework.github.io ]
[  7ABB 96EB 8B46 B94D 5E0F  E9BB 657E 8D33 A5F2 09F3  ]