[Freeipa-devel] [PATCH] 971 detect binary LDAP data

Petr Viktorin pviktori at redhat.com
Tue Feb 28 17:02:01 UTC 2012


On 02/28/2012 04:45 PM, Rob Crittenden wrote:
> Petr Viktorin wrote:
>> On 02/28/2012 04:02 AM, Rob Crittenden wrote:
>>> Petr Viktorin wrote:
>>>> On 02/27/2012 05:10 PM, Rob Crittenden wrote:
>>>>> Rob Crittenden wrote:
>>>>>> Simo Sorce wrote:
>>>>>>> On Mon, 2012-02-27 at 09:44 -0500, Rob Crittenden wrote:
>>>>>>>> We are pretty trusting that the data coming out of LDAP matches its
>>>>>>>> schema but it is possible to stuff non-printable characters into
>>>>>>>> most
>>>>>>>> attributes.
>>>>>>>>
>>>>>>>> I've added a sanity checker to keep a value as a python str type
>>>>>>>> (treated as binary internally). This will result in a base64
>>>>>>>> encoded
>>>>>>>> blob be returned to the client.
>>>>>>>
>>>>>>> Shouldn't you try to parse it as a unicode string and catch
>>>>>>> TypeError to
>>>>>>> know when to return it as binary ?
>>>>>>>
>>>>>>> Simo.
>>>>>>>
>>>>>>
>>>>>> What we do now is the equivalent of unicode(chr(0)), which returns
>>>>>> u'\x00', and that is why we are failing now.
>>>>>>
>>>>>> I believe there is a unicode category module, we might be able to use
>>>>>> that if there is a category that defines non-printable characters.
>>>>>>
>>>>>> rob
>>>>>
>>>>> Like this:
>>>>>
>>>>> import unicodedata
>>>>>
>>>>> def contains_non_printable(val):
>>>>>     for c in val:
>>>>>         if unicodedata.category(unicode(c)) == 'Cc':
>>>>>             return True
>>>>>     return False
>>>>>
>>>>> This wouldn't have the exclusion of tab, CR and LF like using ord()
>>>>> but
>>>>> is probably more correct.
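
For reference, an ord()-based check with that tab/CR/LF exclusion might
look roughly like this (a sketch only; the function name and the DEL
handling are guesses, not the code from the patch):

def contains_non_printable_ord(val):
    # Allow tab (0x09), LF (0x0A) and CR (0x0D); flag the remaining
    # ASCII control characters, and DEL, as non-printable.
    for c in val:
        if ord(c) < 0x20 and c not in '\t\n\r':
            return True
        if ord(c) == 0x7F:
            return True
    return False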
>>>>>
>>>>> rob
>>>>>
>>>>
>>>> If you're protecting the XML-RPC calls, it'd probably be better to look
>>>> at the XML spec directly: http://www.w3.org/TR/xml/#charsets
>>>>
>>>> Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
>>>> [#x10000-#x10FFFF]
>>>>
>>>> I'd say this is a good set for CLI as well.
>>>>
>>>> And you can trap invalid UTF-8 sequences by catching the
>>>> UnicodeDecodeError from decode().
>>>>
>>>
>>> Replace my ord function with a regex that looks for invalid characters.
>>> Now catching that exception too and leaving as a str type.
>>>
>>> rob
>>
>> You're matching a Unicode regex against a bytestring. It'd be better to
>> match after the decode.
>>
>
> The unicode regex was force of habit. I purposely do this comparison
> before the decoding to save decoding something that is binary. Would it
> be acceptable if I just remove the u'?
>
> rob

No: re.search('[\uDFFF]', u'\uDFFF'.encode('utf-8'))  -->  None
Actually I'm not sure what encoding is used behind the scenes here; I 
wouldn't recommend mixing Unicode and bytestrings even if it did work on 
my machine.

Also, match() only looks at the beginning of the string; use search(). 
Sorry for not catching that earlier.
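
For what it's worth, a minimal sketch of the decode-then-check flow being
discussed might look like this (names are mine, and the supplementary-plane
range [#x10000-#x10FFFF] from the XML Char production is left out for
brevity, so those characters would also be kept as binary):

import re

# Anything outside the XML 1.0 Char set (minus the supplementary plane).
_invalid_xml_chars = re.compile(u'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD]')

def decode_ldap_value(val):
    """Return a unicode object, or the original str if it looks binary."""
    try:
        decoded = val.decode('utf-8')
    except UnicodeDecodeError:
        return val          # not valid UTF-8, keep as a binary str
    if _invalid_xml_chars.search(decoded) is not None:
        return val          # contains control/invalid characters
    return decoded

E.g. decode_ldap_value('foo\x00bar') would stay a plain str, while
decode_ldap_value('caf\xc3\xa9') would come back as u'caf\xe9'.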


OCD notes:
- the "matches" variable name is misleading, there's either one match or 
None.
- use `is not None` instead of `!= None` as per PEP8.
- The contains_non_printable method seems a bit excessive now that it 
essentially just contains one call. Unless it's meant to be subclassed, 
in which case the docstring should probably look different.
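
With those cleanups the helper would presumably reduce to a single
expression, roughly (reusing the regex object from the sketch above):

def contains_non_printable(val):
    return _invalid_xml_chars.search(val) is not None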

-- 
Petr³



