[Freeipa-devel] [PATCH] 0748 Handle encoding for ipautil.run

Wed Dec 2 07:59:54 UTC 2015

On 1.12.2015 12:26, Petr Viktorin wrote:
> On 11/30/2015 08:59 AM, Jan Cholasta wrote:
>> On 25.11.2015 15:47, Petr Viktorin wrote:
>>> On 11/25/2015 11:04 AM, Jan Cholasta wrote:
>>>> On 24.11.2015 17:21, Petr Viktorin wrote:
>>>>> On 11/23/2015 10:50 AM, Jan Cholasta wrote:
>>>>>> On 23.11.2015 07:43, Jan Cholasta wrote:
>>>>>>> On 19.11.2015 00:55, Petr Viktorin wrote:
>>>>>>>> On 11/03/2015 02:39 PM, Petr Viktorin wrote:
>>>>>>>>> Hello,
>>>>>>>>> Python 3's strings are Unicode, so data coming to or leaving a
>>>>>>>>> Python
>>>>>>>>> program needs to be decoded/encoded if it's to be handled as a
>>>>>>>>> string.
>>>>>>>>> One of the boundaries where encoding is necessary is external
>>>>>>>>> programs,
>>>>>>>>> specifically, ipautil.run.
>>>>>>>>> Unfortunately there's no one set of options that would work for all
>>>>>>>>> run() invocations, so I went through them all and specified the
>>>>>>>>> stdin/stdout/stderr encoding where necessary. I've also updated the
>>>>>>>>> call
>>>>>>>>> sites to make it clearer if the return values are used or not.
>>>>>>>>> If an encoding is not set, run() will accept/return bytes. (This
>>>>>>>>> is a
>>>>>>>>> fail-safe setting, since it can't raise errors, and using bytes
>>>>>>>>> where
>>>>>>>>> strings are expected generally fails loudly in py3.)
>>>>>>>>>
>>>>>>>>> Note that the changes are not effective under Python 2.
>>>>>>>>
>>>>>>>> ping,
>>>>>>>> Could someone look at this patch?
>>>>>>>
>>>>>>> Looking.
>>>>>>
>>>>>> 1) I think a better approach would be to use str for
>>>>>> stdin/stdout/stderr
>>>>>> by default in both Python 2 and 3, i.e. do nothing in Python 2 and
>>>>>> encode/decode using locale.getpreferredencoding() in Python 3. This is
>>>>>> consistent with sys.std{in,out,err} in both Python 2 and 3. Using
>>>>>> bytes
>>>>>> or different encoding should be opt-in.
>>>>>>
>>>>>> Note that different encoding should be used only if the LC_ALL or LANG
>>>>>> variables are overriden in the command's environment.
>>>>>
>>>>> That would assume the output of *everything* run via ipautil.run can be
>>>>> decoded using the locale encoding. Any stray invalid byte would make
>>>>> IPA
>>>>> crash, even in cases where we don't care about the output. IPA calls
>>>>> too
>>>>> many weird tools to assume they all output text.
>>>>
>>>> Such a stray invalid byte may bubble through our code and cause havoc
>>>> somewhere else, which is much harder to troubleshoot than a crash in
>>>> ipautil.run(), where you can see that it was a misbehaving command that
>>>> caused the crash and exactly what command it was.
>>>
>>> The invalid byte can't bubble through code, because if you don't specify
>>> an encoding you get bytes back. You'll get a crash when you try using it
>>> as a string.
>>
>> Invalid bytes *can* bubble through our code, into configuration files
>> and other external places. The fact that it has not caused much grief so
>> far suggests that it should be safe to use the locale encoding for most
>> commands.
>
> Yes, "suggests" for "most". I'm still wary of using the locale encoding
> for everything by default.

OK, let's keep bytes as the default, but I still think the locale 
encoding should be the default when decoding output to text.

>
>> If we do subscribe to the statement that any command can output invalid
>> bytes at any time, I don't see a point in handling encoding in
>> ipautil.run() at all - just always return bytes and let the caller worry
>> about encoding and handling related errors.
>
> Right, that's actually the basic idea.
>
> However, the encoding is two lines of boilerplate code ("if six.PY3" and
> the actual encode). I've turned this into a convenience keyword argument
> to minimize the repetition.

If you always return bytes in both Python 2 and 3, you can always 
.decode() the output, so there won't be any "if six.PY3". Bearing that 
in mind, I don't see much difference between adding ", 
stdout_encoding='ascii'" to the run() call or "stdout = 
stdout.decode('ascii')" on the line below, except the latter is more 
explicit and lets you handle decoding errors yourself.

>
>>> locale.getpreferredencoding() is not used consistently in all software,
>>> especially if it's not written in Python. For instance, wget won't
>>> magically re-encode the data it fetches for us. It's better to
>>> explicitly specify the encoding every time.
>>
>> I would say it is used consistently in most software used by IPA. The
>> wget example is obviously not relevant to this discussion, since it's
>> returning external data.
>>
>> IMO setting the encoding to 'ascii' without also setting LC_ALL=C in the
>> environment, as seen frequently in your patch, has even higher
>> probability of breaking something than using the locale encoding.
>
> The patch should use 'ascii' with programs that don't output localized
> messages: things like base64-encoded certs or the output of `ip`.
> If you see  a particular call where that's not the case, it's a bug in
> the patch, please point it out.

IMO it's better to be safe than sorry, so I would like to see either 
LC_ALL=C set with every use of 'ascii' encoding, or use the locale 
encoding instead of 'ascii'.

>
>>>> If we don't care about output somewhere, we should not capture it there
>>>> at all.
>>>
>>> Then people need to remember to put "capture_output=True" everywhere
>>> (except that also disables logging of the output, so we'd need an
>>> additional option).
>>
>> capture_output=True is the default and capture_output=False should be
>> set where output is not needed.
>
> "capture_output=True" being the default would mean that it's still
> possible to leave "capture_output=False" out, but ignore the output. It
> would make the wrong usage easier to type than the right one, which is
> something to avoid.

I'm not sure I understand what are you trying to say here.

My point is that capture_output=True is currently the default (see for 
yourself) and it is indeed possible to leave capture_output=False but 
ignore the output. I believe this is wrong and that you should always 
have to explicitly request the output to be captured.

> So I still prefer the encoding being explicit.
>
>> IMO the output should always be logged if possible, regardless of
>> capture_output value.
>
> File a bug for that?
>

-- 
Jan Cholasta