[Libguestfs] [PATCH v2] python: Fix UnicodeError in inspect_list_applications2() (RHBZ#1684004)

Nir Soffer nsoffer at redhat.com
Sat Apr 25 19:58:33 UTC 2020


On Sat, Apr 25, 2020 at 8:32 PM Sam Eiderman <sameid at google.com> wrote:
>
> Hi Nir,
> I think latin1,
>
> How do you think we should handle latin1 errors then? Replace on latin1 or replace on utf-8?

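Decoding from latin1 never fails, because every byte value maps to a character.
If the data is not actually latin1, it silently returns junk instead of raising.
(Some other 8 bit encodings, such as cp1252, leave a few bytes undefined and can
still fail.)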

For example, the name "Jörgen":

>>> "Jörgen".encode("utf-8")
b'J\xc3\xb6rgen'

If the data happens to be "utf-8", we will decode it successfully:

>>> b'J\xc3\xb6rgen'.decode("utf-8")
'Jörgen'

But if the data was "latin1":

>>> "Jörgen".encode("latin1")
b'J\xf6rgen'

Decoding as "utf-8" with errors="replace" will give:

>>> b'J\xf6rgen'.decode("utf-8", errors="replace")
'J�rgen'

Falling back to "latin1":

>>> b'J\xf6rgen'.decode("latin1")
'Jörgen'

But note that if the data was not latin1, like this (Hebrew Alef):

>>> "\u05d0".encode("cp1255")
b'\xe0'

Fallback to "latin1" will succeed, returning junk:

>>> b"\xe0".decode("latin1")
'à'

Instead of the actual value:

>>> b"\xe0".decode("cp1255")
'א'

So falling back to latin1 makes sense only if we know that the relevant data is
usually encoded in latin1. You can check if this gives better results for your
use case.
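
To make the trade-off concrete, here is a minimal sketch of the "try utf-8, then
fall back to latin1" approach discussed above. The helper name
decode_guest_string is hypothetical and is not part of the libguestfs bindings
or of the patch; it only illustrates the behavior of the two codecs.

```python
def decode_guest_string(data: bytes) -> str:
    """Decode as UTF-8, falling back to latin1 (hypothetical helper)."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # latin1 maps every byte 0x00-0xff to a code point, so this never
        # raises. The result is junk if the data used another 8 bit encoding.
        return data.decode("latin1")


# Valid UTF-8 is decoded as UTF-8.
assert decode_guest_string(b"J\xc3\xb6rgen") == "Jörgen"
# latin1 data is recovered by the fallback.
assert decode_guest_string(b"J\xf6rgen") == "Jörgen"
# cp1255 data (Hebrew Alef) is decoded without error, but wrongly.
assert decode_guest_string(b"\xe0") == "à"
```

The last assertion is the failure mode to keep in mind: the fallback cannot
tell latin1 from any other 8 bit encoding.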

Nir




