[Avocado-devel] [RFC] Text/binary data and encodings: how they relate to Avocado, extensions and tests
Cleber Rosa
crosa at redhat.com
Wed Apr 18 15:21:09 UTC 2018
On 04/17/2018 09:18 PM, Cleber Rosa wrote:
> Recently, Avocado has seen a lot of changes brought by the Python 3
> port. One fundamental difference between Python 2 and 3 is under
> the spotlight: how to deal with "text" and "binary" data[1].
>
> It's then important to make it clear where Avocado stands (or is
> headed) when it comes to handling text, binary data and encodings,
> which is the goal of this document.
>
> First, let's review some very basic concepts.
>
> Bytes, the unassuming arrays
> ============================
>
> On both Python 2 and 3, there's "bytes". On Python 2, it's nothing
> but an alias to "str"[2]::
>
> >>> import sys; sys.version[0]
> '2'
> >>> bytes is str
> True
>
> One of the striking characteristics of "bytes" is that every byte
> counts, that is::
>
> >>> aacute = b'\xc3\xa1'
> >>> len(aacute)
> 2
>
> This is as simple as it gets. The "bytes" type is an "array" of
> bytes.
>
> Also, if it's not clear enough, this sequence of two bytes happens to
> be **one way** to **represent** the "LATIN SMALL LETTER A WITH
> ACUTE"[3] character, as defined by the Unicode standard, in a given
> encoding. Please pause for a moment and let that information settle.
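To make the "one way to represent" point concrete, here is a minimal sketch that encodes the same character using three encodings. The choice of encodings is only illustrative; any other set would make the same point.

```python
# The same abstract character, represented as bytes in different encodings.
AACUTE = u'\u00e1'  # LATIN SMALL LETTER A WITH ACUTE

representations = {
    'utf-8': AACUTE.encode('utf-8'),
    'latin-1': AACUTE.encode('latin-1'),
    'utf-16-le': AACUTE.encode('utf-16-le'),
}

# One character, but one or two bytes depending on the encoding chosen.
for encoding, data in sorted(representations.items()):
    print("%-10s %d byte(s): %r" % (encoding, len(data), data))
```

Only the ``utf-8`` entry matches the two bytes used in the example above.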
>
> Old habits die hard
> ===================
>
> We, human beings, are used to dealing with text. Developers, being a
> special kind of human being, are used to dealing with *character arrays*
> instead. Those are, or have been for a long time, sequences of
> one-byte characters with specific (but somewhat implicit) meaning.
>
> Many developers will still assume that each byte contains a value that
> maps to the ascii(7) table::
>
> Oct Dec Hex Char Oct Dec Hex Char
> ────────────────────────────────────────────────────────────────────────
> 000 0 00 NUL '\0' (null character) 100 64 40 @
> 001 1 01 SOH (start of heading) 101 65 41 A
> 002 2 02 STX (start of text) 102 66 42 B
> ...
> 076 62 3E > 176 126 7E ~
> 077 63 3F ? 177 127 7F DEL
>
> Some other developers will assume that ASCII is a thing of the past,
> and each one-byte character means something according to the latin1(7)
> mapping::
>
> ISO 8859-1 characters
> The following table displays the characters in ISO 8859-1, which
> are printable and
> unlisted in the ascii(7) manual page.
>
> Oct Dec Hex Char Description
> ────────────────────────────────────────────────────────────────────
> 240 160 A0 NO-BREAK SPACE
> 241 161 A1 ¡ INVERTED EXCLAMATION MARK
> 242 162 A2 ¢ CENT SIGN
> ...
> 376 254 FE þ LATIN SMALL LETTER THORN
> 377 255 FF ÿ LATIN SMALL LETTER Y WITH DIAERESIS
>
> Then, there's yet another group of developers who believe that a byte
> in an array of bytes may be either a character, or part of a
> character. They believe that because Unicode and "UTF-8" are the
> new standard and can be assumed to be everywhere.
>
> The fact is, all those developers are wrong. Not because an array of
> bytes cannot contain what they believe, but because one can only
> guess that an array of bytes maps to a character set (an encoding).
>
> Data itself carries no intrinsic meaning
> ========================================
>
> Pure data doesn't have any meaning. Its meaning depends on the
> interpretation given, that is, some kind of context around it.
>
> When dealing with text, the meaning of data is usually determined by a
> character set, a mapping table or some more advanced encoding and
> decoding mechanism.
>
> For instance, the following sequence of numbers expressed in
> decimal format and separated by spaces::
>
> 65 66 67 68 69
>
> Will only mean the first letters of the western alphabet, ``ABCDE``,
> **if** we determine that its meaning is based on the ASCII character
> set (besides other details such as ordering, separator used, etc).
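As a small sketch of how the interpretation gives numbers their meaning, the decimal values 65 to 69 are decoded below with two different character sets. The ``cp037`` codec (an EBCDIC variant shipped with Python) is only picked here as a contrasting assumption.

```python
# Decimal values taken as raw data, with no meaning attached yet.
values = [65, 66, 67, 68, 69]
data = bytes(bytearray(values))  # builds a bytes object on Python 2 and 3

# The same bytes, interpreted through two different character sets.
as_ascii = data.decode('ascii')
as_ebcdic = data.decode('cp037')

print(repr(as_ascii))
print(repr(as_ebcdic))
```

The data is identical in both calls; only the mapping table changes.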
>
> Turning arrays of bytes into text
> =================================
>
> On many occasions, usually when data is destined for humans, it is
> necessary to present it, and to deal with it, in a different way.
> Here, we use the abstract term *text* to refer to data that is more
> meaningful to humans, and would usually be found in documents (such as
> this one) intended to be distributed and read by us, the non-machine
> beings.
>
> Reusing the example given earlier, one can do on a Python interpreter::
>
> >>> aacute = b'\xc3\xa1'
> >>> len(aacute.decode('utf-8'))
> 1
>
> The process of turning bytes into "text" is called "decoding" by
> Python. It helps to think of bytes as something that humans cannot
> understand and consequently needs deciphering (or decoding) to then
> become something readable by humans.
>
> In this process, the encoding is of the utmost importance. It's
> analogous to a symmetric key used on a cryptographic operation. For
> instance, let's look at what happens when using the same data with a
> different encoding::
>
> >>> aacute = b'\xc3\xa1'
> >>> len(aacute.decode('utf-16'))
> 1
> >>> print(aacute.decode('utf-16'))
> ꇃ
>
> Or giving too little data for a given encoding::
>
> >>> aacute.decode('utf-32')
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/usr/lib64/python2.7/encodings/utf_32.py", line 11, in decode
> return codecs.utf_32_decode(input, errors, True)
> UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-1:
> truncated data
>
> Even though Unicode is increasingly popular, it's also a good idea to
> remind ourselves that other, non-Unicode encodings exist. For
> instance, look at the same data when decoded using a character set
> developed for the Thai language::
>
> >>> len(aacute.decode('tis-620'))
> 2
> >>> print(aacute.decode('tis-620'))
> รก
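The decoding experiments above can be collected in a single sketch. It uses ``utf-16-le`` instead of ``utf-16`` so that the result does not depend on the byte order of the machine running it, and ``latin-1`` is added as one more (assumed) candidate.

```python
# Decode the very same two bytes with a handful of encodings.
data = b'\xc3\xa1'

results = {}
for encoding in ('utf-8', 'latin-1', 'tis-620', 'utf-16-le', 'utf-32'):
    try:
        results[encoding] = data.decode(encoding)
    except UnicodeDecodeError:
        # Not enough (or not valid) data for this encoding.
        results[encoding] = None

for encoding, text in sorted(results.items()):
    print("%-10s %r" % (encoding, text))
```

Every successful result is "valid" text as far as Python is concerned, which is exactly why guessing is dangerous.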
>
> Now, think about this: if you expect quick, consistent and reliable
> cryptographic operations, would you save a key for later use? Or
> would you just guess it whenever you need it?
>
> Hopefully, you've answered that you would save the key. The same
> applies to encoding: you should keep track of what you're using.
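One possible way to "save the key" is to store the encoding right next to the data. The ``EncodedData`` type and the ``as_text()`` helper below are hypothetical names used for illustration only; they are not part of Avocado or of the Python standard library.

```python
import collections

# Hypothetical container: the bytes and the encoding travel together.
EncodedData = collections.namedtuple('EncodedData', ['data', 'encoding'])


def as_text(encoded):
    """Decode the data with the encoding that was recorded alongside it."""
    return encoded.data.decode(encoded.encoding)


stored = EncodedData(data=b'\xc3\xa1', encoding='utf-8')
print(repr(as_text(stored)))
```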
>
> What Python offers
> ==================
>
> There are a number of features that Python offers related to the
> encoding used. Some of them have differences depending on the
> Python version. When that's the case, the version used is made
> clear.
>
> Let's review those features now.
>
> sys.getfilesystemencoding()
> ---------------------------
>
> From the documentation, this function will *"Return the name of the
> encoding used to convert Unicode filenames into system file names, or
> None if the system default encoding is used".*
>
> To demo how this works, let's create a base directory with ASCII only
> characters (and using the byte type to avoid any implicit encoding)::
>
> >>> import os
> >>> os.mkdir(b'/tmp/mydir')
>
> And then, let's explicitly create a directory, again using a sequence of
> bytes::
>
> >>> os.mkdir(b'/tmp/mydir/\xc3\xa1')
>
> If you look at the content of the ``/tmp/mydir`` directory, you should
> find a single entry::
>
> >>> os.listdir(b'/tmp/mydir')
> ['\xc3\xa1']
>
> Which is just what we expected. Now, we'll start Python (2.7) with an
> environment variable that will influence the encoding it'll use for
> conversion of Unicode filenames::
>
> $ LANG=en_US.ANSI_X3.4-1968 python2.7
> >>> import sys
> >>> sys.getfilesystemencoding()
> 'ANSI_X3.4-1968'
>
> Now, let's ask Python to list all files (by using the standard library
> module ``glob``) in that directory::
>
> $ LANG=en_US.ANSI_X3.4-1968 python2.7 -c "import glob;
> print(glob.glob(u'/tmp/mydir/\u00e1*'))"
> []
>
> The list is empty because ``glob`` fails to match the reference given in the
> encoding used. Basically, think of what would happen if you were to
> do::
>
> >>> u'/tmp/mydir/\u00e1*'.encode(sys.getfilesystemencoding())
>
> On the other hand, by using an appropriate encoding::
>
> $ LANG=en_US.UTF-8 python2.7 -c "import glob;
> print(glob.glob(u'/tmp/mydir/\u00e1*'))"
> [u'/tmp/mydir/\xe1']
>
> The point here is that ``sys.getfilesystemencoding()`` will be used by
> some Python libraries when working with filenames.
>
> .. warning:: Don't expect any code to be perfect. For instance, the
> author could find some issues with the ``glob`` module
> used in the example above.
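On Python 3 only (these functions do not exist on Python 2), ``os.fsdecode()`` and ``os.fsencode()`` apply the filesystem encoding for you. The sketch below assumes nothing about the locale: it starts from bytes, so the round trip is expected to give the original bytes back.

```python
import os
import sys

# A file name as the operating system would hand it to us.
raw_name = b'\xc3\xa1'

# bytes -> str, using sys.getfilesystemencoding() and its error handler
text_name = os.fsdecode(raw_name)

# str -> bytes, reversing the operation above
round_trip = os.fsencode(text_name)

print(sys.getfilesystemencoding(), repr(text_name), repr(round_trip))
```

What ``text_name`` looks like depends on the environment, which is the same effect shown with ``glob`` above.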
>
> sys.std{in,out,err}.encoding
> ----------------------------
>
> An ``encoding`` attribute may be set on ``sys.stdin``, ``sys.stdout``
> and ``sys.stderr`` to let applications know how to input and output
> meaningful text.
>
> Suppose you need to read text from the standard input and save it to
> a file on a specific encoding. The following script is going to be
> used as an example (``read_encode.py``)::
>
> import sys
>
>
> # On Python 3 "str" is unicode
> if sys.version_info[0] >= 3:
>     unicode = str
>
> sys.stdout.write("Enter text:\n")
>
> input_read = sys.stdin.readline().strip()
> if isinstance(input_read, unicode):
>     bytes_read = input_read.encode(sys.stdin.encoding)
> else:
>     bytes_read = input_read
>
> with open('/tmp/data.bin', 'wb') as data_file:
>     data_file.write(bytes_read)
>
> Now, on both Python 2 and 3 this produces the same results::
>
> $ python2 -c 'import sys; print(sys.stdin.encoding)'
> UTF-8
> $ python2 read_encode.py
> Enter text:
> áéíóú
> $ file /tmp/data.bin
> /tmp/data.bin: UTF-8 Unicode text, with no line terminators
>
> $ python3 -c 'import sys; print(sys.stdin.encoding)'
> UTF-8
> $ python3 read_encode.py
> Enter text:
> áéíóú
> $ file /tmp/data.bin
> /tmp/data.bin: UTF-8 Unicode text, with no line terminators
>
> The encoding set on ``sys.stdin.encoding`` was important to
> the example script as it needs to turn unicode into bytes.
>
> Now, suppose that your application, while reading input that matches
> the user's environment, must produce a file in the ``UTF-32``
> encoding. The code to do that could look similar to the following
> example (``write_utf32.py``)::
>
> import sys
>
>
> sys.stdout.write("Enter text:\n")
>
> input_read = sys.stdin.readline().strip()
> if isinstance(input_read, bytes):
>     unicode_str = input_read.decode(sys.stdin.encoding)
> else:
>     unicode_str = input_read
>
> with open('/tmp/data.bin.utf32', 'wb') as data_file:
>     data_file.write(unicode_str.encode('UTF-32'))
>
> Again, let's see how this performs under Python 2 and 3::
>
> $ python2 -c 'import sys; print(sys.stdin.encoding)'
> UTF-8
> $ python2 write_utf32.py
> Enter text:
> áéíóú
> $ file /tmp/data.bin.utf32
> /tmp/data.bin.utf32: Unicode text, UTF-32, little-endian
>
> $ python3 -c 'import sys; print(sys.stdin.encoding)'
> UTF-8
> $ python3 write_utf32.py
> Enter text:
> áéíóú
> $ file /tmp/data.bin.utf32
> /tmp/data.bin.utf32: Unicode text, UTF-32, little-endian
>
> .. tip:: do not assume that ``sys.std{in,out,err}`` will always have
> the ``encoding`` attribute, or that they'll be set to a valid
> encoding. For instance, on Python 2, when ``sys.stdin`` is not a TTY,
> its ``encoding`` attribute will have a ``None`` value.
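A defensive way to follow this tip could look like the sketch below. The ``stream_encoding()`` helper and its ``fallback`` parameter are hypothetical, not an existing Avocado or Python API.

```python
import codecs
import io


def stream_encoding(stream, fallback='utf-8'):
    """Return a usable encoding name for a stream, or the fallback."""
    encoding = getattr(stream, 'encoding', None)
    if not encoding:
        # Attribute is missing, None or empty.
        return fallback
    try:
        codecs.lookup(encoding)
    except LookupError:
        # Attribute is set, but Python knows no such codec.
        return fallback
    return encoding


binary_stream = io.BytesIO()
text_stream = io.TextIOWrapper(io.BytesIO(), encoding='latin-1')
print(stream_encoding(binary_stream), stream_encoding(text_stream))
```

In real code the argument would typically be ``sys.stdin``, ``sys.stdout`` or ``sys.stderr``.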
>
> A few points can be realized here:
>
> 1) Using Unicode strings internally, as an intermediate format, gives
> you the freedom to read (decode) from different encodings, and at
> the same time, to write (encode) into any other encoding.
>
> 2) Code that is expected to work under both Python 2 and 3 needs
> some extra handling with regards to the data type being handled.
>
> 3) While determining the data type, one can either check for ``bytes``
> or for ``unicode``. While it's certainly a matter of preference
> and style, keep in mind that the ``bytes`` name exists on both
> Python 2 and 3, while ``unicode`` exists only on Python 2.
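Points 2 and 3 can be sketched as a pair of helpers that only ever check for ``bytes``, and thus behave the same on Python 2 and 3. The names ``ensure_text()`` and ``ensure_bytes()`` are made up for this example, and the ``UTF-8`` default is an assumption.

```python
def ensure_text(data, encoding='utf-8', errors='strict'):
    """Return data as a unicode string, decoding it if it is bytes."""
    if isinstance(data, bytes):
        return data.decode(encoding, errors)
    return data


def ensure_bytes(data, encoding='utf-8', errors='strict'):
    """Return data as bytes, encoding it if it is a unicode string."""
    if isinstance(data, bytes):
        return data
    return data.encode(encoding, errors)


print(repr(ensure_text(b'\xc3\xa1')))
print(repr(ensure_bytes(u'\xe1')))
```

Both helpers are idempotent, so they can be applied at API boundaries without first checking what was received.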
>
> locale
> ------
>
> This Python standard library module is a wrapper around POSIX
> locale-related functionality.
>
> Because this discussion is about text encodings, let's focus on the
> ``locale.getpreferredencoding()`` function. According to the
> documentation, it *"Return(s) the encoding used for text data,
> according to user preferences"*. As it wraps specificities of different
> platforms, it may not be able to determine it on some systems, and
> because of that, the documentation notes that *"this function only
> returns a guess"*.
>
> Even though it may be a guess, it is probably the best bet you can
> make.
>
> .. tip:: Many non-Linux/UNIX platforms implement some level of POSIX
> functionality, and that happens to be the case for the
> ``locale`` features discussed here. Because of that, the
> Python ``locale`` module can also be found on platforms such
> as Microsoft Windows.
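Since the value is only a guess, it may be worth validating it before use. The sketch below is one way to do that; the ``preferred_encoding()`` helper and the ``FALLBACK_ENCODING`` name are assumptions made for this example.

```python
import codecs
import locale

# Assumed fallback, used when the guess is empty or unknown to Python.
FALLBACK_ENCODING = 'utf-8'


def preferred_encoding():
    """Return the preferred encoding, normalized, or the fallback."""
    # False: only query the locale, do not change the process settings.
    encoding = locale.getpreferredencoding(False)
    if not encoding:
        return FALLBACK_ENCODING
    try:
        return codecs.lookup(encoding).name
    except LookupError:
        return FALLBACK_ENCODING


print(preferred_encoding())
```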
>
> Error Handling
> --------------
>
> Some encoding and decoding operations just won't be possible. The
> most straightforward example is when you're trying to map a value
> that is outside the bounds of the mapping table.
>
> For instance, the ASCII character set defines mappings for values in
> the range of 0 up to 127 (7f in hexadecimal). That means that a value
> larger than 127 (7f in hexadecimal) will cause an error.
>
> When dealing with Unicode strings in Python, those errors are
> represented as a ``UnicodeError`` (whose most common subclasses
> are ``UnicodeEncodeError`` or ``UnicodeDecodeError``).
>
> Getting back to the simple example given before, trying to decode
> the value 126 (7e when represented in hexadecimal) using the ASCII
> character set should work fine::
>
> >>> b'\x7e'.decode('ascii')
> u'~'
>
> But anything larger than 127 (7f) won't work::
>
> >>> b'\x80'.decode('ascii')
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position
> 0: ordinal not in range(128)
>
> The reason for the failure is explicit in the error message: 0x80
> (given in hex) is decimal 128, which is indeed not in ``range(128)``
> (which is zero based, and thus contains 0-127).
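The same information shown in the traceback is also available programmatically, as attributes of the exception. A minimal sketch:

```python
# Trigger the same failure and look at what the exception carries.
try:
    b'\x80'.decode('ascii')
except UnicodeDecodeError as details:
    failure = (details.encoding, details.start, details.end, details.reason)
else:
    failure = None

print(failure)
```

``start`` and ``end`` delimit the offending bytes within ``details.object``, which can help when reporting the problem to users.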
>
> There may be situations in which different error handling may be
> beneficial. Instead of catching ``UnicodeError`` exceptions and
> handling them on an individual basis, it's possible to use a
> registered error handler. Let's use a builtin error handler as
> an example.
>
> Suppose that your application reads from a file that is known to be
> encoded in ``UTF-8``, and you need to output to the system's preferred
> encoding (as defined by ``locale.getpreferredencoding()``). To make
> for a more realistic example, let's imagine that the application is
> a test runner like Avocado itself, reading from a file containing
> definitions of test variations and parameters, and writing out the
> test variation IDs that were executed. The test variations/parameters
> file will look like this (again, encoded in ``UTF-8``)::
>
> intel-überdisk-workstation-20-12b3:cpu=intel;disk=überdisk;
> intel-virtio-workstation-20-b322:cpu=intel;disk=virtio
> amd-überdisk-workstation-20-c523:cpu=amd;disk=überdisk
> amd-virtio-workstation-20-ddf3:cpu=amd;disk=virtio
>
> And the code to parse and report the tests could look like this::
>
> import io
> import locale
>
>
> INTERNAL_ENCODING = 'UTF-8'
>
> with io.open('parameters', 'r', encoding=INTERNAL_ENCODING) as parameters_file:
>     parameters_lines = parameters_file.readlines()
>
> test_variants_run = [line.split(":", 1)[0] for line in parameters_lines]
> with io.open('report.txt', 'w',
>              encoding=locale.getpreferredencoding()) as output_file:
>     output_file.write(u"\n".join(test_variants_run))
>
> Now, on a given system, this runs as expected::
>
> $ python -c 'import locale; print(locale.getpreferredencoding())'
> UTF-8
> $ python read_parameters_write_report.py && cat report.txt
> intel-überdisk-workstation-20-12b3
> intel-virtio-workstation-20-b322
> amd-überdisk-workstation-20-c523
> amd-virtio-workstation-20-ddf3
> $ file report.txt
> report.txt: UTF-8 Unicode text
>
> But on a **different** system::
>
> $ python -c 'import locale; print(locale.getpreferredencoding())'
> ANSI_X3.4-1968
> $ python read_parameters_write_report.py && cat report.txt
> Traceback (most recent call last):
> File "read_parameters_write_report.py", line 12, in <module>
> output_file.write(u"\n".join(test_variants_run))
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in
> position 6: ordinal not in range(128)
>
> One possible solution is using an error handler, such as ``replace``,
> by adding the ``errors`` parameter to the ``io.open`` call::
>
> --- read_parameters_write_report.py 2018-04-17 18:33:26.781059079 -0400
> +++ read_parameters_write_report.py.new 2018-04-17 18:33:58.677181944 -0400
> @@ -9,5 +9,6 @@
>
>  test_variants_run = [line.split(":", 1)[0] for line in parameters_lines]
>  with io.open('report.txt', 'w',
> -             encoding=locale.getpreferredencoding()) as output_file:
> +             encoding=locale.getpreferredencoding(),
> +             errors='replace') as output_file:
>      output_file.write(u"\n".join(test_variants_run))
>
> The result becomes::
>
> intel-?berdisk-workstation-20-12b3
> intel-virtio-workstation-20-b322
> amd-?berdisk-workstation-20-c523
> amd-virtio-workstation-20-ddf3
>
> Which may be better than crashing, but may also be unacceptable
> because information is lost. One alternative is to escape the data.
> Using the ``backslashreplace`` error handler, ``report.txt`` would look
> like::
>
> intel-\xfcberdisk-workstation-20-12b3
> intel-virtio-workstation-20-b322
> amd-\xfcberdisk-workstation-20-c523
> amd-virtio-workstation-20-ddf3
>
> This way, no information is lost, and the generated report respects
> the system preferred encoding::
>
> $ file report.txt
> report.txt: ASCII text
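To compare the builtin error handlers side by side without touching any files, the same effect can be reproduced on a single string. This is only a sketch; ``ignore`` and ``xmlcharrefreplace`` are two more standard handlers, added here for contrast.

```python
# One of the test variation IDs, shortened for the example.
text = u'intel-\xfcberdisk'

handlers = ('ignore', 'replace', 'backslashreplace', 'xmlcharrefreplace')
encoded = {handler: text.encode('ascii', handler) for handler in handlers}

for handler in handlers:
    print("%-18s %r" % (handler, encoded[handler]))
```

``ignore`` and ``replace`` lose information, while ``backslashreplace`` and ``xmlcharrefreplace`` keep it in an escaped form.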
>
> Guidelines
> ==========
>
> This section sets the general guidelines for byte/text data in
> Avocado, and consequently for the encoding used. It should be
> followed by Avocado plugins developed externally, so that a consistent
> combined work is achieved.
>
> It can also be used as a guideline for test writers who are targeting
> Avocado on both Python 2 and 3.
>
> 1) When generating text that will be consumed by humans, Avocado SHOULD
> respect the preferred system encoding. When that is not available,
> Avocado's default encoding (currently ``UTF-8``, as defined in
> ``avocado/core/defaults.py``) should be used.
>
> 2) When operating on data that may or may not contain text, Avocado
> SHOULD treat the data as binary. If the owner of the data knows it
> contains text destined for humans, what we call text, then the data
> owner should handle the decoding. It's OK for utility APIs to have
> helper functionality. One example is the
> ``avocado.utils.process.CmdResult`` class, which contains both
> ``stdout`` and the ``stdout_text`` attribute/property. Even then,
> the user producing the data is responsible for determining the
> encoding used when treating the data as text.
>
> 3) When operating on data that provides encoding as metadata (by using
> an alternative channel or that can reliably be obtained from the
> data itself), Avocado MUST respect that encoding. One example is
> respecting the encoding that can be given on the ``Content-Type``
> headers on an HTTP session.
>
> 4) Avocado functionality CAN restrict the encodings it generates if an
> expressive enough character set is used and the generated data
> contains metadata that clearly defines the encoding used. One
> example is the HTML plugin, which is currently limited to producing
> content in ``UTF-8``.
>
> 5) All input given by humans to the Avocado test runner, such as test
> references, parameters coming from files and other loader
> implementations, command line parameter values and others, should
> be treated as text unless noted otherwise. This means that Avocado
> should be able to deal with test references given in the
> system's preferred encoding transparently.
>
> 6) Avocado code should, when operating on text data, use unicode
> strings internally (``unicode`` on Python 2, and ``str`` on Python
> 3).
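As an illustration of guidelines 3 and 6 combined, the sketch below decodes a payload according to the charset given in a ``Content-Type`` value, and hands unicode text to the rest of the code. The ``decode_body()`` helper and ``DEFAULT_ENCODING`` are hypothetical; the parsing relies on the standard library ``email.message.Message`` class, not on any Avocado API.

```python
from email.message import Message

# Assumed default, used only when no charset is given as metadata.
DEFAULT_ENCODING = 'utf-8'


def decode_body(body, content_type):
    """Decode bytes using the charset found in a Content-Type value."""
    message = Message()
    message['Content-Type'] = content_type
    charset = message.get_content_charset() or DEFAULT_ENCODING
    # An unknown charset raises LookupError here, on purpose: the
    # metadata is respected rather than silently replaced by a guess.
    return body.decode(charset)


print(repr(decode_body(b'\xe1', 'text/plain; charset=ISO-8859-1')))
print(repr(decode_body(b'\xc3\xa1', 'text/plain')))
```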
>
> Besides those points, it's worth noting that a number of utility
> functions related to binary and text data, and encoding handling,
> are growing organically, and can be seen in modules such as
> ``avocado.utils.astring``. Further functionality is currently being
> proposed upstream and may soon be part of the Avocado libraries.
>
> Caveats
> =======
>
> While handling text and binary types on Avocado, please pay attention
> to the following caveats:
>
> 1) The Avocado test runner replaces the stock
> ``sys.std{in,out,err}.encoding``, so if you're writing a plugin, do
> not assume/expect these to contain an encoding setting.
>
> 2) Some features on core Avocado, as well as on external plugins,
> still fall short of the guidelines described here. This is a
> work in progress. Please exercise care when
>
Sorry for this "bad ending". I just meant "please exercise care, and
investigate the code status when relying on it".
Cheers!
- Cleber.
> ---
>
> [1] -
> https://docs.python.org/3/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
> [2] - https://docs.python.org/2.7/c-api/object.html#c.PyObject_Bytes
> [3] - http://unicode.scarfboy.com/?s=u%2B00e1
>
--
Cleber Rosa
[ Sr Software Engineer - Virtualization Team - Red Hat ]
[ Avocado Test Framework - avocado-framework.github.io ]
[ 7ABB 96EB 8B46 B94D 5E0F E9BB 657E 8D33 A5F2 09F3 ]