[Freeipa-devel] [PATCH] 971 detect binary LDAP data

Tue Feb 28 15:45:13 UTC 2012

Petr Viktorin wrote:
> On 02/28/2012 04:02 AM, Rob Crittenden wrote:
>> Petr Viktorin wrote:
>>> On 02/27/2012 05:10 PM, Rob Crittenden wrote:
>>>> Rob Crittenden wrote:
>>>>> Simo Sorce wrote:
>>>>>> On Mon, 2012-02-27 at 09:44 -0500, Rob Crittenden wrote:
>>>>>>> We are pretty trusting that the data coming out of LDAP matches its
>>>>>>> schema but it is possible to stuff non-printable characters into
>>>>>>> most
>>>>>>> attributes.
>>>>>>>
>>>>>>> I've added a sanity checker to keep a value as a python str type
>>>>>>> (treated as binary internally). This will result in a base64 encoded
>>>>>>> blob be returned to the client.
>>>>>>
>>>>>> Shouldn't you try to parse it as a unicode string and catch
>>>>>> TypeError to
>>>>>> know when to return it as binary ?
>>>>>>
>>>>>> Simo.
>>>>>>
>>>>>
>>>>> What we do now is the equivalent of unicode(chr(0)) which returns
>>>>> u'\x00' and is why we are failing now.
>>>>>
>>>>> I believe there is a unicode category module, we might be able to use
>>>>> that if there is a category that defines non-printable characters.
>>>>>
>>>>> rob
>>>>
>>>> Like this:
>>>>
>>>> import unicodedata
>>>>
>>>> def contains_non_printable(val):
>>>> for c in val:
>>>> if unicodedata.category(unicode(c)) == 'Cc':
>>>> return True
>>>> return False
>>>>
>>>> This wouldn't have the exclusion of tab, CR and LF like using ord() but
>>>> is probably more correct.
>>>>
>>>> rob
>>>>
>>>> _______________________________________________
>>>> Freeipa-devel mailing list
>>>> Freeipa-devel at redhat.com
>>>> https://www.redhat.com/mailman/listinfo/freeipa-devel
>>>
>>> If you're protecting the XML-RPC calls, it'd probably be better to look
>>> at the XML spec directly: http://www.w3.org/TR/xml/#charsets
>>>
>>> Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
>>> [#x10000-#x10FFFF]
>>>
>>> I'd say this is a good set for CLI as well.
>>>
>>> And you can trap invalid UTF-8 sequences by catching the
>>> UnicodeDecodeError from decode().
>>>
>>
>> Replace my ord function with a regex that looks for invalid characters.
>> Now catching that exception too and leaving as a str type.
>>
>> rob
>
> You're matching an Unicode regex against a bytestring. It'd be better to
> match after the decode.
>

The unicode regex was force of habit. I purposely do this comparison 
before the decoding to save decoding something that is binary. Would it 
be acceptable if I just remove the u'?

rob