[Freeipa-devel] [PATCH] 971 detect binary LDAP data

Mon Feb 27 16:45:10 UTC 2012

On 02/27/2012 05:10 PM, Rob Crittenden wrote:
> Rob Crittenden wrote:
>> Simo Sorce wrote:
>>> On Mon, 2012-02-27 at 09:44 -0500, Rob Crittenden wrote:
>>>> We are pretty trusting that the data coming out of LDAP matches its
>>>> schema but it is possible to stuff non-printable characters into most
>>>> attributes.
>>>>
>>>> I've added a sanity checker to keep a value as a python str type
>>>> (treated as binary internally). This will result in a base64 encoded
>>>> blob be returned to the client.
>>>
>>> Shouldn't you try to parse it as a unicode string and catch TypeError to
>>> know when to return it as binary ?
>>>
>>> Simo.
>>>
>>
>> What we do now is the equivalent of unicode(chr(0)) which returns
>> u'\x00' and is why we are failing now.
>>
>> I believe there is a unicode category module, we might be able to use
>> that if there is a category that defines non-printable characters.
>>
>> rob
>
> Like this:
>
> import unicodedata
>
> def contains_non_printable(val):
> for c in val:
> if unicodedata.category(unicode(c)) == 'Cc':
> return True
> return False
>
> This wouldn't have the exclusion of tab, CR and LF like using ord() but
> is probably more correct.
>
> rob
>
> _______________________________________________
> Freeipa-devel mailing list
> Freeipa-devel at redhat.com
> https://www.redhat.com/mailman/listinfo/freeipa-devel

If you're protecting the XML-RPC calls, it'd probably be better to look 
at the XML spec directly: http://www.w3.org/TR/xml/#charsets

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | 
[#x10000-#x10FFFF]

I'd say this is a good set for CLI as well.

And you can trap invalid UTF-8 sequences by catching the 
UnicodeDecodeError from decode().

-- 
Petr³