[Freeipa-devel] [PATCH] 971 detect binary LDAP data

Wed Feb 29 08:55:20 UTC 2012

On 02/28/2012 09:50 PM, Rob Crittenden wrote:
> Petr Viktorin wrote:
>> On 02/28/2012 04:45 PM, Rob Crittenden wrote:
>>> Petr Viktorin wrote:
>>>> On 02/28/2012 04:02 AM, Rob Crittenden wrote:
>>>>> Petr Viktorin wrote:
>>>>>> On 02/27/2012 05:10 PM, Rob Crittenden wrote:
>>>>>>> Rob Crittenden wrote:
>>>>>>>> Simo Sorce wrote:
>>>>>>>>> On Mon, 2012-02-27 at 09:44 -0500, Rob Crittenden wrote:
>>>>>>>>>> We are pretty trusting that the data coming out of LDAP matches
>>>>>>>>>> its
>>>>>>>>>> schema but it is possible to stuff non-printable characters into
>>>>>>>>>> most
>>>>>>>>>> attributes.
>>>>>>>>>>
>>>>>>>>>> I've added a sanity checker to keep a value as a python str type
>>>>>>>>>> (treated as binary internally). This will result in a base64
>>>>>>>>>> encoded
>>>>>>>>>> blob be returned to the client.
>>>>>>>>>
>>>>>>>>> Shouldn't you try to parse it as a unicode string and catch
>>>>>>>>> TypeError to
>>>>>>>>> know when to return it as binary ?
>>>>>>>>>
>>>>>>>>> Simo.
>>>>>>>>>
>>>>>>>>
>>>>>>>> What we do now is the equivalent of unicode(chr(0)) which returns
>>>>>>>> u'\x00' and is why we are failing now.
>>>>>>>>
>>>>>>>> I believe there is a unicode category module, we might be able to
>>>>>>>> use
>>>>>>>> that if there is a category that defines non-printable characters.
>>>>>>>>
>>>>>>>> rob
>>>>>>>
>>>>>>> Like this:
>>>>>>>
>>>>>>> import unicodedata
>>>>>>>
>>>>>>> def contains_non_printable(val):
>>>>>>> for c in val:
>>>>>>> if unicodedata.category(unicode(c)) == 'Cc':
>>>>>>> return True
>>>>>>> return False
>>>>>>>
>>>>>>> This wouldn't have the exclusion of tab, CR and LF like using ord()
>>>>>>> but
>>>>>>> is probably more correct.
>>>>>>>
>>>>>>> rob
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Freeipa-devel mailing list
>>>>>>> Freeipa-devel at redhat.com
>>>>>>> https://www.redhat.com/mailman/listinfo/freeipa-devel
>>>>>>
>>>>>> If you're protecting the XML-RPC calls, it'd probably be better to
>>>>>> look
>>>>>> at the XML spec directly: http://www.w3.org/TR/xml/#charsets
>>>>>>
>>>>>> Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
>>>>>> [#x10000-#x10FFFF]
>>>>>>
>>>>>> I'd say this is a good set for CLI as well.
>>>>>>
>>>>>> And you can trap invalid UTF-8 sequences by catching the
>>>>>> UnicodeDecodeError from decode().
>>>>>>
>>>>>
>>>>> Replace my ord function with a regex that looks for invalid
>>>>> characters.
>>>>> Now catching that exception too and leaving as a str type.
>>>>>
>>>>> rob
>>>>
>>>> You're matching an Unicode regex against a bytestring. It'd be
>>>> better to
>>>> match after the decode.
>>>>
>>>
>>> The unicode regex was force of habit. I purposely do this comparison
>>> before the decoding to save decoding something that is binary. Would it
>>> be acceptable if I just remove the u'?
>>>
>>> rob
>>
>> No: re.search('[\uDFFF]', u'\uDFFF'.encode('utf-8')) --> None
>> Actually I'm not sure what encoding is used behind the scenes here; I
>> wouldn't recommend mixing Unicode and bytestrings even if it did work on
>> my machine.
>
> The data coming out of LDAP is a plain string. We encode to utf-8 if is
> not a binary as defined in _SYNTAX_MAPPING in ipaserver/plugins/ldap2.py

Now we're just confusing each other, it seems.
By plain string you mean a bytestring in either UTF-8 or IA5 (which is 
almost an UTF-8 subset), right?
We convert that to an Unicode string if possible.

Anyway just dropping the u won't work. The \u escape only works in 
Unicode strings:
 >>> list('\u1234')
['\\', 'u', '1', '2', '3', '4']

>> Also, match() only looks at the beginning of the string; use search().
>> Sorry for not catching that earlier.
>
> good point.
>
>>
>>
>> OCD notes:
>> - the "matches" variable name is misleading, there's either one match or
>> None.
>> - use `is not None` instead of `!= None` as per PEP8.
>> - The contains_non_printable method seems a bit excessive now that it
>> essentially just contains one call. Unless it's meant to be subclassed,
>> in which case the docstring should probably look different.
>>
>
> It doesn't need to be a public method. I broke it out mostly because
> embedding the documentation on why it is there overwhelmed encode().
>
> rob

-- 
Petr³