[Freeipa-devel] [PATCH] 971 detect binary LDAP data

Tue Feb 28 08:35:04 UTC 2012

On 02/28/2012 04:02 AM, Rob Crittenden wrote:
> Petr Viktorin wrote:
>> On 02/27/2012 05:10 PM, Rob Crittenden wrote:
>>> Rob Crittenden wrote:
>>>> Simo Sorce wrote:
>>>>> On Mon, 2012-02-27 at 09:44 -0500, Rob Crittenden wrote:
>>>>>> We are pretty trusting that the data coming out of LDAP matches its
>>>>>> schema but it is possible to stuff non-printable characters into most
>>>>>> attributes.
>>>>>>
>>>>>> I've added a sanity checker to keep a value as a python str type
>>>>>> (treated as binary internally). This will result in a base64 encoded
>>>>>> blob be returned to the client.
>>>>>
>>>>> Shouldn't you try to parse it as a unicode string and catch
>>>>> TypeError to
>>>>> know when to return it as binary ?
>>>>>
>>>>> Simo.
>>>>>
>>>>
>>>> What we do now is the equivalent of unicode(chr(0)) which returns
>>>> u'\x00' and is why we are failing now.
>>>>
>>>> I believe there is a unicode category module, we might be able to use
>>>> that if there is a category that defines non-printable characters.
>>>>
>>>> rob
>>>
>>> Like this:
>>>
>>> import unicodedata
>>>
>>> def contains_non_printable(val):
>>> for c in val:
>>> if unicodedata.category(unicode(c)) == 'Cc':
>>> return True
>>> return False
>>>
>>> This wouldn't have the exclusion of tab, CR and LF like using ord() but
>>> is probably more correct.
>>>
>>> rob
>>>
>>> _______________________________________________
>>> Freeipa-devel mailing list
>>> Freeipa-devel at redhat.com
>>> https://www.redhat.com/mailman/listinfo/freeipa-devel
>>
>> If you're protecting the XML-RPC calls, it'd probably be better to look
>> at the XML spec directly: http://www.w3.org/TR/xml/#charsets
>>
>> Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
>> [#x10000-#x10FFFF]
>>
>> I'd say this is a good set for CLI as well.
>>
>> And you can trap invalid UTF-8 sequences by catching the
>> UnicodeDecodeError from decode().
>>
>
> Replace my ord function with a regex that looks for invalid characters.
> Now catching that exception too and leaving as a str type.
>
> rob

You're matching an Unicode regex against a bytestring. It'd be better to 
match after the decode.

-- 
Petr³