[Libguestfs] [PATCH v2] python: Fix UnicodeError in inspect_list_applications2() (RHBZ#1684004)

Sat Apr 25 19:58:33 UTC 2020

On Sat, Apr 25, 2020 at 8:32 PM Sam Eiderman <sameid at google.com> wrote:
>
> Hi Nir,
> I think latin1,
>
> How do you think we should handle latin1 errors then? Replace on latin1 or replace on utf-8?

Decoding as latin1 (or any other 8 bit encoding that defines all 256 byte
values) never fails. If the data was not really latin1, it silently returns junk.

For example, the name "Jörgen":

>>> "Jörgen".encode("utf-8")
b'J\xc3\xb6rgen'

If the data happens to be "utf-8", we will decode it successfully:

>>> b'J\xc3\xb6rgen'.decode("utf-8")
'Jörgen'

But if the data was "latin1":

>>> "Jörgen".encode("latin1")
b'J\xf6rgen'

Decoding it as utf-8 with errors="replace" will give:

>>> b'J\xf6rgen'.decode("utf-8", errors="replace")
'J�rgen'

Falling back to "latin1":

>>> b'J\xf6rgen'.decode("latin1")
'Jörgen'

But note that if the data was not latin1, like this (Hebrew Alef):

>>> "\u05d0".encode("cp1255")
b'\xe0'

Falling back to "latin1" will succeed, returning junk:

>>> b"\xe0".decode("latin1")
'à'

Instead of the actual value:

>>> b"\xe0".decode("cp1255")
'א'

Falling back to latin1 makes sense only if we know that the relevant data is
usually encoded in latin1. You can check if this gives better results for your
use case.
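
If you want to verify that a latin1 fallback can never raise, a quick check is
to try every single byte value. The sketch below does that; undecodable_bytes
is a hypothetical helper name, and the encodings in the loop are just examples.

```python
def undecodable_bytes(encoding):
    """Return the single byte values that `encoding` cannot decode."""
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            bad.append(value)
    return bad


# Print how many of the 256 single-byte inputs each codec rejects.
for name in ("latin1", "cp1252", "utf-8"):
    print(name, len(undecodable_bytes(name)))
```

latin1 reports no undecodable bytes, which is why it works as a last resort
and also why it cannot tell you that the data was in some other encoding.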

Nir