[Freeipa-devel] [PATCH] 971 detect binary LDAP data

Thu Mar 1 14:48:08 UTC 2012

On 29.2.2012 15:45, Rob Crittenden wrote:
> Jan Cholasta wrote:
>> On 28.2.2012 18:58, Rob Crittenden wrote:
>>> Jan Cholasta wrote:
>>>> On 28.2.2012 18:02, Petr Viktorin wrote:
>>>>> On 02/28/2012 04:45 PM, Rob Crittenden wrote:
>>>>>> Petr Viktorin wrote:
>>>>>>> On 02/28/2012 04:02 AM, Rob Crittenden wrote:
>>>>>>>> Petr Viktorin wrote:
>>>>>>>>> On 02/27/2012 05:10 PM, Rob Crittenden wrote:
>>>>>>>>>> Rob Crittenden wrote:
>>>>>>>>>>> Simo Sorce wrote:
>>>>>>>>>>>> On Mon, 2012-02-27 at 09:44 -0500, Rob Crittenden wrote:
>>>>>>>>>>>>> We are pretty trusting that the data coming out of LDAP
>>>>>>>>>>>>> matches
>>>>>>>>>>>>> its
>>>>>>>>>>>>> schema but it is possible to stuff non-printable characters
>>>>>>>>>>>>> into
>>>>>>>>>>>>> most
>>>>>>>>>>>>> attributes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I've added a sanity checker to keep a value as a python str
>>>>>>>>>>>>> type
>>>>>>>>>>>>> (treated as binary internally). This will result in a base64
>>>>>>>>>>>>> encoded
>>>>>>>>>>>>> blob be returned to the client.
>>>>
>>>> I don't like the idea of having arbitrary binary data where unicode
>>>> strings are expected. It might cause some unexpected errors (I have a
>>>> feeling that --addattr and/or --delattr and possibly some plugins might
>>>> not handle this very well). Wouldn't it be better to just throw away
>>>> the
>>>> value if it's invalid and warn the user?
>>>
>>> This isn't for user input, it is for data stored in LDAP. User's are
>>> going to have no way to provide binary data to us unless they use the
>>> API themselves in which case they have to follow our rules.
>>
>> Well my point was that --addattr and --delattr cause an LDAP search for
>> the given attribute and plugins might get the result of a LDAP search in
>> their post_callback and I'm not sure if they can cope with binary data.
>
> It wouldn't be any different than if we had the value as a unicode.

Let's see what happens if the mail attribute of a user contains invalid 
UTF-8 (ff 62 30 72 6b 65 64):

$ ipa user-find jdoe
--------------
1 user matched
--------------
   User login: jdoe
   First name: John
   Last name: Doe
   Home directory: /home/jdoe
   Login shell: /bin/sh
   Email address: /2IwcmtlZA==
   UID: 526
   GID: 526
   Account disabled: False
   Password: False
   Kerberos keys available: False
----------------------------
Number of entries returned 1
----------------------------

$ ipa user-mod jdoe --addattr mail=jdoe at example.com
ipa: ERROR: an internal error has occurred

The internal error is:
Traceback (most recent call last):
   File "/usr/lib/python2.7/site-packages/ipaserver/rpcserver.py", line 
302, in wsgi_execute
     result = self.Command[name](*args, **options)
   File "/usr/lib/python2.7/site-packages/ipalib/frontend.py", line 438, 
in __call__
     ret = self.run(*args, **options)
   File "/usr/lib/python2.7/site-packages/ipalib/frontend.py", line 696, 
in run
     return self.execute(*args, **options)
   File "/usr/lib/python2.7/site-packages/ipalib/plugins/baseldap.py", 
line 1217, in execute
     ldap, dn, entry_attrs, attrs_list, *keys, **options
   File "/usr/lib/python2.7/site-packages/ipalib/plugins/user.py", line 
532, in pre_callback
     entry_attrs['mail'] = self.obj._normalize_email(entry_attrs['mail'])
   File "/usr/lib/python2.7/site-packages/ipalib/plugins/user.py", line 
338, in _normalize_email
     norm_email.append(m + u'@' + config['ipadefaultemaildomain'][0])
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: 
invalid start byte

$ ipa user-mod jdoe --delattr mail=/2IwcmtlZA==
ipa: ERROR: mail does not contain '/2IwcmtlZA=='

$ ipa user-mod jdoe --delattr mail=`echo 'ff 62 30 72 6b 65 64' | xxd -p -r`
ipa: ERROR: UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in 
position 5: invalid start byte
Traceback (most recent call last):
   File "/usr/lib/python2.7/site-packages/ipalib/cli.py", line 1242, in run
     sys.exit(api.Backend.cli.run(argv))
   File "/usr/lib/python2.7/site-packages/ipalib/cli.py", line 1024, in run
     kw = self.parse(cmd, argv)
   File "/usr/lib/python2.7/site-packages/ipalib/cli.py", line 1049, in 
parse
     return dict(self.parse_iter(cmd, kw))
   File "/usr/lib/python2.7/site-packages/ipalib/cli.py", line 1058, in 
parse_iter
     yield (key, self.Backend.textui.decode(value))
   File "/usr/lib/python2.7/site-packages/ipalib/cli.py", line 136, in 
decode
     return value.decode(encoding)
   File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode
     return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 5: 
invalid start byte
ipa: ERROR: an internal error has occurred

I'm sure there is a lot more places in the code where things will break 
when you feed them arbitrary data.

>
> We treat the python type str as binary data. Anything that is a str gets
> based64 encoded before json or xml-rpc transmission.
>
> The type unicode is considered a "string" and goes in the clear.
>
> We determine what this type should be not from the data but from the
> schema. This is a big assumption. Hopefully this answer's Petr's point
> as well.
>
> We decided long ago that str means Binary and unicode means String. It
> is a bit clumsy perhaps python handles it well. It will be more clear
> when we switch to Python 3.0 and we'll have bytes and str instead as types.

Well, this is all super-obvious and I'm not really sure why do you bring 
it up here...

My concern is that with your approach, you can no longer be sure that an 
attribute value is a valid unicode string. This creates a 
maintainability burden, as you *always* have to keep in mind that you 
first have to check whether the value is valid or not before touching 
it. You can easily forget to include the check in new code and there is 
a lot of places in old code that need to be changed because of this (as 
you can see in the mail example above).

Let me quote John Dennis 
(<http://www.redhat.com/archives/freeipa-devel/2012-January/msg00075.html>):
> I'm very interested in this work. I too have recognized that we have a
> very real structural problem with how we handle the types we use
> internally, especially when it corresponds to an external type and
> requires conversion. In fact we do a remarkably bad job in this area, we
> have conversions which are done ad hoc all over the place. There is no
> well established place for the conversions to occur and it's difficult
> to know at any point in the code what type to expect.

Your patch adds yet another occurence of "it's difficult to know at any 
point in the code what type to expect". IMO this is very bad and we 
should not do this at all in new code.

I'm not sure if just dropping bad values (as I have suggested before) is 
the best idea, but IMHO anything is better than letting these bad values 
into the bowels of IPA.

>
>>> We are trusting that the data in LDAP matches its schema. This is just
>>> belt and suspenders verifying that it is the case.
>>
>> Sure, but I still think we should allow any valid unicode data to come
>> from LDAP, not just what is valid in XML-RPC.
>
> This won't affect data coming out of LDAP, only the way it is
> transmitted to the client.

It will affect the data flowing through IPA, after we get it from LDAP, 
before it is transmitted through XML-RPC.

In other words, I think it is more correct to do this:

LDAP -> check unicode validity -> (rest of IPA) -> check XML-RPC 
validitity -> XML-RPC

than this:

LDAP -> check unicode validity -> check XML-RPC validity -> (rest of 
IPA) -> XML-RPC

(Let's keep things isolated please.)

>
> rob

Honza

-- 
Jan Cholasta