[Freeipa-devel] [PATCH] 0064 Rework task naming in LDAP updates to avoid conflicts

Wed Jul 25 13:33:48 UTC 2012

Petr Viktorin wrote:
> On 07/24/2012 08:36 PM, Alexander Bokovoy wrote:
>> On Tue, 24 Jul 2012, Petr Viktorin wrote:
>>> On 07/24/2012 04:50 PM, Alexander Bokovoy wrote:
>>>> On Tue, 24 Jul 2012, Rob Crittenden wrote:
>>>>> Petr Viktorin wrote:
>>>>>> On 07/24/2012 02:49 PM, Alexander Bokovoy wrote:
>>>>>>> On Tue, 24 Jul 2012, Petr Viktorin wrote:
>>>>>>>> On 07/24/2012 02:06 PM, Alexander Bokovoy wrote:
>>>>>>>>> On Tue, 24 Jul 2012, Petr Viktorin wrote:
>>>>>>>>>> On 07/24/2012 12:01 PM, Alexander Bokovoy wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> There are two problems in task naming in LDAP updates:
>>>>>>>>>>>
>>>>>>>>>>> 1. Randomness may be scarce in virtual machines
>>>>>>>>>>> 2. Random number is added to the time value rounded to a second
>>>>>>>>>>>
>>>>>>>>>>> The second issue leads to values that may repeat themselves as
>>>>>>>>>>> time
>>>>>>>>>>> only grows and random number is non-negative as well, so
>>>>>>>>>>> t2+r2 can be equal to t1+t2 generated earlier.
>>>>>>>>>>>
>>>>>>>>>>> Since task name is a DN, there is no strict requirement to
>>>>>>>>>>> use an
>>>>>>>>>>> integer value.  Instead, we can take time and attribute name. To
>>>>>>>>>>> get
>>>>>>>>>>> reasonable 'randomness' these values are then hashed with sha1
>>>>>>>>>>> and
>>>>>>>>>>> use
>>>>>>>>>>> the resulting string as task name.
>>>>>>>>>>>
>>>>>>>>>>> SHA1 may technically be an overkill here as we could simply use
>>>>>>>>>>>
>>>>>>>>>>> indextask_$date_$attribute
>>>>>>>>>>>
>>>>>>>>>>> where $date is a value of time.time() but SHA1 gives a resonable
>>>>>>>>>>> 'randomness' into the string.
>>>>>>>>>>
>>>>>>>>>> What kind of randomness do you mean? SHA1 is deterministic, it
>>>>>>>>>> doesn't
>>>>>>>>>> add any randomness at all. It just obscures what's really
>>>>>>>>>> happening.
>>>>>>>>> Hence using quotes to describe it. We don't need randomness in the
>>>>>>>>> task
>>>>>>>>> names, we need something that avoids collisions.
>>>>>>>>>
>>>>>>>>> An issue here is in time.time() -- it may give us sub-second
>>>>>>>>> resolution
>>>>>>>>> if underlying OS supports it, it may not. Having a second-level
>>>>>>>>> resolution is not enough, especially on fast machines, so we can't
>>>>>>>>> simply use int(times.time()) as it was in the original version.
>>>>>>>>>
>>>>>>>>> indextask_$date_$attribute has this issue that we don't have
>>>>>>>>> enough
>>>>>>>>> guarantee for $date (time.time()) to be unique in sufficiently
>>>>>>>>> tight
>>>>>>>>> conditions, thus use of SHA-1 to generate something that has
>>>>>>>>> better
>>>>>>>>> chances to avoid collisions than $data_$attribute.
>>>>>>>>
>>>>>>>> My point is that if "indextask_$date_$attribute" is not unique,
>>>>>>>> neither is SHA1("indextask_$date_$attribute"). Hashing has no
>>>>>>>> effect
>>>>>>>> on the chance of collisions.
>>>>>>>>
>>>>>>>> You could use Python's pseudorandom number generator
>>>>>>>> (random.randint)
>>>>>>>> instead of random.SystemRandom. It's not cryptographically secure
>>>>>>>> but
>>>>>>>> it's enough to avoid collisions, and it doesn't use up system
>>>>>>>> entropy
>>>>>>>> (except for initial seeding, through `import random`).
>>>>>>>> "indextask_$date_$attribute_$pseudorandomvalue" should be unique
>>>>>>>> enough.
>>>>>>> Using random here is really bad.
>>>>>>> What we ideally need is a method to increment sequential calls for
>>>>>>> the same attribute.We use seconds to differentiate
>>>>>>> between all tasks but that is not really required, tasks that were
>>>>>>> processed will be removed.
>>>>>>>
>>>>>>
>>>>>> Or maybe use $date_$attribute and just warn (ignore the error) if
>>>>>> there's a duplicate -- if a reindex task for the same attribute
>>>>>> already
>>>>>> exists from the same second, do we really need to start a new one?
>>>>>>
>>>>>
>>>>> That is true.
>>>>>
>>>>> We can generate a UUID, I think that is probably a better/safer thing
>>>>> to use overall.
>>>> Updated patch attached.
>>>>
>>>> 2012-07-24T14:36:31Z INFO Creating task to index attribute: memberOf
>>>> 2012-07-24T14:36:31Z DEBUG Task id:
>>>> cn=indextask_memberOf_135624333918176170_14302,cn=index,cn=tasks,cn=config
>>>>
>>>>
>>>> 2012-07-24T14:36:32Z INFO Indexing finished
>>>> ...
>>>> 2012-07-24T14:36:43Z INFO Creating task to index attribute: memberUser
>>>> 2012-07-24T14:36:43Z DEBUG Task id:
>>>> cn=indextask_memberUser_135624334038532670_14302,cn=index,cn=tasks,cn=config
>>>>
>>>>
>>>>
>>>> 2012-07-24T14:36:44Z INFO Indexing finished
>>>>
>>>> You can note that clock_seq does not change, it is because uuid.uuid1()
>>>> uses nanosecond resolution and uuid_generate_time() if the latter is
>>>> available. If we happen to ask for uuids within sub-nanosecond
>>>> interval,
>>>> clock_seq will be different then.
>>>>
>>>> I'm extracting only time and clock_seq instead of pasting full uuid
>>>> value because uuid_generate_time() will leak out ethernet MAC address
>>>> for 48-bit node. We don't need more bits here so I drop these 48 bits
>>>> and avoid publishing the ethernet MAC address, even in logs.
>>>>
>>>> --
>>>> / Alexander Bokovoy
>>>
>>> Yes, this approach will work. Just some technical comments:
>>>
>>>
>>>> +        self.sub_dict['TIME'] = cn_uuid.get_time()
>>>> +        self.sub_dict['SEQ'] = cn_uuid.get_clock_seq()
>>>
>>> Use the attributes, `cn_uuid.time` and `cn_uuid.clock_seq`. The
>>> accessor methods are undocumented.
>> Fixed.
>>
>>>
>>>> +        self.sub_dict['ATTR'] = attribute
>>>>
>>>> -        # Refresh the time to make uniqueness more probable. Add on
>>>> some
>>>> -        # randomness for good measure.
>>>> -        self.sub_dict['TIME'] = int(time.time()) + r.randint(0,10000)
>>>> +        cn = self._template_str("indextask_${ATTR}_${TIME}_${SEQ}")
>>>
>>> You're overwriting sub_dict['TIME'], which is used elsewhere in the
>>> class. That could cause trouble.
>> Since it was set here and others are using it, I kept it updated in a
>> new version as well, based on cn_uuid.time. But since cn_uuid.time is in
>> nanoseconds resolution, I have to divide the value by 1e9.
>>
>>> There's no reason to use sub_dict here; you can simply do:
>>>
>>> cn = 'indextask_%s_%s_%s' % (attribute, cn_uuid.time, cn_uuid.clock_seq)
>> Fixed.
>>
>> This version works fine for me.
>>
>
> ACK
>

pushed to master