[Freeipa-devel] user deletion in offline mode does not get replicated after node recovery

Oleg Fayans ofayans at redhat.com
Wed Jun 17 10:58:52 UTC 2015


Hi Ludwig,

On 06/17/2015 11:06 AM, Ludwig Krispenz wrote:
> Hi Oleg,
>
> can you give a bit more info on the scenario in which this happens? 
> Does it always happen, or is it a timing problem?
I guess it is a timing problem. It happened yesterday; today I was 
unable to reproduce it. The scenario is very simple:
create user1 and make sure it's there, turn off a replica, then create 
another user on the master and delete user1 on the master, then turn 
the replica back on.
I still have an infrastructure with two replicas holding a user that 
was deleted on the master. All other user (and data) manipulations on 
this very setup now work as intended.
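
For reference, the whole scenario boils down to something like the 
following (the dirsrv instance name below is only a guess based on the 
realm, and stopping the 389-ds instance stands in for "turning off" 
the replica):

$ ipa user-add user1 --first=Test --last=User     # on master; wait until it shows up on every node
$ systemctl stop dirsrv@BAGAM-NET.service         # on replica2: simulate the outage
$ ipa user-del user1                              # on master, while replica2 is down
$ ipa user-add user2 --first=Test --last=User     # on master, while replica2 is down
$ systemctl start dirsrv@BAGAM-NET.service        # on replica2: bring it back
$ ipa user-show user1                             # on replica2 a minute later: should fail, but the entry is still there
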
>
> Ludwig
>
> On 06/16/2015 07:02 PM, thierry bordaz wrote:
>> Hello
>>
>>
>> On Master:
>>     User 'onmaster' was deleted
>>
>> [16/Jun/2015:10:16:45 -0400] conn=402 op=19 SRCH 
>> base="cn=otp,dc=bagam,dc=net" scope=1 
>> filter="(&(objectClass=ipatoken)(ipatokenOwner=uid=onmaster,cn=users,cn=accounts,dc=bagam,dc=net))" 
>> attrs="ipatokenNotAfter description ipatokenOwner objectClass 
>> ipatokenDisabled ipatokenVendor managedBy ipatokenModel 
>> ipatokenNotBefore ipatokenUniqueID ipatokenSerial"
>> [16/Jun/2015:10:16:45 -0400] conn=402 op=19 RESULT err=0 tag=101 
>> nentries=0 etime=0
>> [16/Jun/2015:10:16:45 -0400] conn=402 op=20 DEL 
>> dn="uid=onmaster,cn=users,cn=accounts,dc=bagam,dc=net"
>> [16/Jun/2015:10:16:45 -0400] conn=402 op=21 UNBIND
>> [16/Jun/2015:10:16:45 -0400] conn=402 op=21 fd=120 closed - U1
>> [16/Jun/2015:10:16:45 -0400] conn=402 op=20 RESULT err=0 tag=107 
>> nentries=0 etime=0 csn=55802fcf000300040000
>>
>>     Replication agreement failed to replicate it to the replica2
>> [16/Jun/2015:10:18:36 -0400] NSMMReplicationPlugin - 
>> agmt="cn=f22master.bagam.net-to-f22replica2.bagam.net" 
>> (f22replica2:389): Consumer failed to replay change (uniqueid 
>> b8242e18-143111e5-b1d0d0c3-ae5854ff, CSN 55802fcf000300040000): 
>> Operations error (1). Will retry later.
>>
>>
>> On replica2:
>>
>>     The replicated operation failed
>> [16/Jun/2015:10:18:27 -0400] conn=8 op=4 RESULT err=0 tag=101 
>> nentries=1 etime=0
>> [16/Jun/2015:10:18:27 -0400] conn=8 op=5 EXT 
>> oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
>> [16/Jun/2015:10:18:27 -0400] conn=8 op=5 RESULT err=0 tag=120 
>> nentries=0 etime=0
>> [16/Jun/2015:10:18:27 -0400] conn=8 op=6 DEL 
>> dn="uid=onmaster,cn=users,cn=accounts,dc=bagam,dc=net"
>> [16/Jun/2015:10:18:35 -0400] conn=8 op=6 RESULT err=1 tag=107 
>> nentries=0 etime=8 csn=55802fcf000300040000
>>
>>     because the DB failed to apply the update.
>>     The failures were E_AGAIN or E_DB_DEADLOCK. In such a situation, 
>> DS retries after a small delay.
>>     The problem is that it retried 50 times without success.
>> [16/Jun/2015:10:18:34 -0400] NSMMReplicationPlugin - changelog 
>> program - _cl5WriteOperationTxn: retry (49) the transaction 
>> (csn=55802fcf000300040000) failed (rc=-30993 (BDB0068 
>> DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
>> [16/Jun/2015:10:18:34 -0400] NSMMReplicationPlugin - changelog 
>> program - _cl5WriteOperationTxn: failed to write entry with csn 
>> (55802fcf000300040000); db error - -30993 BDB0068 DB_LOCK_DEADLOCK: 
>> Locker killed to resolve a deadlock
>> [16/Jun/2015:10:18:34 -0400] NSMMReplicationPlugin - 
>> write_changelog_and_ruv: can't add a change for 
>> uid=onmaster,cn=users,cn=accounts,dc=bagam,dc=net (uniqid: 
>> b8242e18-143111e5-b1d0d0c3-ae5854ff, optype: 32) to changelog csn 
>> 55802fcf000300040000
>> [16/Jun/2015:10:18:34 -0400] - SLAPI_PLUGIN_BE_TXN_POST_DELETE_FN 
>> plugin returned error code but did not set SLAPI_RESULT_CODE
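
Side note: the retry storm is easy to spot on the replica itself. A 
rough diagnostic sketch, assuming the default errors log location and 
that the instance name matches the realm:

$ grep -c '_cl5WriteOperationTxn: retry' /var/log/dirsrv/slapd-BAGAM-NET/errors
$ grep 'DB_LOCK_DEADLOCK' /var/log/dirsrv/slapd-BAGAM-NET/errors | tail -n 5
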
>>
>>
>> The MAIN issue here is that replica2 successfully applied other 
>> updates from the same replica that came after 55802fcf000300040000 
>> (e.g. csn=55802fcf000400040000).
>> I do not know if the master was able to detect this failure and 
>> replay this update, but I am afraid it did not!
>> It looks like you hit https://fedorahosted.org/389/ticket/47788
>> Is it possible to access your VM?
>>
>> [16/Jun/2015:10:18:27 -0400] conn=8 op=6 DEL 
>> dn="uid=onmaster,cn=users,cn=accounts,dc=bagam,dc=net"
>> [16/Jun/2015:10:18:35 -0400] conn=8 op=6 RESULT err=1 tag=107 
>> nentries=0 etime=8 csn=55802fcf000300040000
>> [16/Jun/2015:10:18:35 -0400] conn=8 op=7 MOD 
>> dn="cn=ipausers,cn=groups,cn=accounts,dc=bagam,dc=net"
>> [16/Jun/2015:10:18:36 -0400] conn=8 op=7 RESULT err=0 tag=103 
>> nentries=0 etime=1 csn=55802fcf000400040000
>> [16/Jun/2015:10:18:36 -0400] conn=8 op=8 DEL 
>> dn="cn=onmaster,cn=groups,cn=accounts,dc=bagam,dc=net"
>> [16/Jun/2015:10:18:37 -0400] conn=8 op=8 RESULT err=0 tag=107 
>> nentries=0 etime=1 csn=55802fcf000700040000
>> [16/Jun/2015:10:18:37 -0400] conn=8 op=9 MOD 
>> dn="cn=ipausers,cn=groups,cn=accounts,dc=bagam,dc=net"
>> [16/Jun/2015:10:18:37 -0400] conn=8 op=9 RESULT err=0 tag=103 
>> nentries=0 etime=0 csn=55802fd0000000060000
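
A quick way to double-check that replica2's RUV has indeed moved past 
the failed CSN is to read nsds50ruv from the replica entry under 
cn=config on replica2 and compare the maxcsn values with 
55802fcf000300040000. A sketch, assuming the usual Directory Manager 
bind:

$ ldapsearch -LLL -x -D 'cn=Directory Manager' -W \
    -b 'cn=replica,cn=dc\3Dbagam\2Cdc\3Dnet,cn=mapping tree,cn=config' nsds50ruv
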
>>
>>
>>
>>
>> On 06/16/2015 04:49 PM, Oleg Fayans wrote:
>>> Hi all,
>>>
>>> I've bumped into a strange problem: only a part of the changes 
>>> applied on the master during a replica outage gets replicated after 
>>> the replica recovers.
>>>
>>> Namely: when I delete an existing user on the master while the node 
>>> is offline, this change does not reach the node when it's back 
>>> online. User creation, however, gets replicated as expected.
>>>
>>> Steps to reproduce:
>>>
>>> 1. Create the following topology:
>>>
>>> replica1 <-> master <-> replica2 <-> replica3
>>>
>>> 2. Create user1 on master, make sure it appears on all replicas
>>> 3. Turn off replica2
>>> 4. On master delete user1 and create user2, make sure the changes 
>>> get replicated to replica1
>>> 5. Turn on replica2
>>>
>>> Expected results:
>>>
>>> A minute or so after replica2 is back up,
>>> 1. user1 exists on neither replica2 nor replica3
>>> 2. user2 exists on both replica2 and replica3
>>>
>>> Actual results:
>>> 1. user1 coexists with user2 on replica2 and replica3
>>> 2. master and replica1 have only user2
>>>
>>>
>>> In my case, though, the topology was as follows:
>>> $ ipa topologysegment-find realm
>>> ------------------
>>> 3 segments matched
>>> ------------------
>>>   Segment name: f22master.bagam.net-to-f22replica3.bagam.net
>>>   Left node: f22master.bagam.net
>>>   Right node: f22replica3.bagam.net
>>>   Connectivity: both
>>>
>>>   Segment name: replica1-to-replica2
>>>   Left node: f22replica1.bagam.net
>>>   Right node: f22replica2.bagam.net
>>>   Connectivity: both
>>>
>>>   Segment name: replica2-to-master
>>>   Left node: f22replica2.bagam.net
>>>   Right node: f22master.bagam.net
>>>   Connectivity: both
>>> ----------------------------
>>> Number of entries returned 3
>>> ----------------------------
>>> And I was turning off replica2, leaving replica1 offline, but that 
>>> does not really matter.
>>>
>>> The dirsrv error message most likely to be relevant is:
>>> ----------------------------------------------------------------------------------------------------------------------------------------------------- 
>>>
>>> Consumer failed to replay change (uniqueid 
>>> b8242e18-143111e5-b1d0d0c3-ae5854ff, CSN 55802fcf000300040000): 
>>> Operations error (1). Will retry later
>>> ----------------------------------------------------------------------------------------------------------------------------------------------------- 
>>>
>>>
>>> I'm attaching the dirsrv error and access logs from all nodes, in 
>>> case they are useful.
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>

-- 
Oleg Fayans
Quality Engineer
FreeIPA team
RedHat.


