[Freeipa-users] replica DS failure deadlock

thierry bordaz tbordaz at redhat.com
Wed Oct 19 16:33:05 UTC 2016



On 10/19/2016 03:48 PM, Andrew E. Bruno wrote:
> On Wed, Oct 19, 2016 at 10:13:26AM +0200, Ludwig Krispenz wrote:
>> On 10/18/2016 08:52 PM, Andrew E. Bruno wrote:
>>> We had one of our replicas fail today with the following errors:
>>>
>>>
>>> [18/Oct/2016:13:40:47 -0400] agmt="cn=meTosrv-m14-32.cbls.ccr.buffalo.edu" (srv-m14-32:389) - Can't locate CSN 58065ef3000100030000 in the changelog (DB rc=-30988). If replication stops, the consumer may need to be reinitialized.
>>> [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: retry (49) the transaction (csn=58065f74000500040000) failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
>>> [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: failed to write entry with csn (58065f74000500040000); db error - -30993 BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
>>> [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin - write_changelog_and_ruv: can't add a change for uid=janedoe,cn=users,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu (uniqid: 939bca48-2ced11e5-ac0b8f7e-e0b1a377, optype: 64) to changelog csn 58065f74000500040000
>>> [18/Oct/2016:13:43:07 -0400] - SLAPI_PLUGIN_BE_TXN_POST_MODRDN_FN plugin returned error but did not set SLAPI_RESULT_CODE
>>> [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin - process_postop: Failed to apply update (58065f74000500040000) error (1).  Aborting replication session(conn=1314106 op=1688559)
>>> [18/Oct/2016:13:43:12 -0400] - cos_cache_change_notify: modified entry is NULL--updating cache just in case
>>> [18/Oct/2016:13:43:12 -0400] - Skipping CoS Definition cn=Password Policy,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu--no CoS Templates found, which should be added before the CoS Definition.
>>> [18/Oct/2016:13:43:20 -0400] - Operation error fetching Null DN (4a729f9a-955a11e6-aaffa516-e778e883), error -30993.
>>> [18/Oct/2016:13:43:20 -0400] - dn2entry_ext: Failed to get id for changenumber=30856302,cn=changelog from entryrdn index (-30993)
>>> [18/Oct/2016:13:43:20 -0400] - Operation error fetching changenumber=30856302,cn=changelog (null), error -30993.
>>> [18/Oct/2016:13:43:20 -0400] DSRetroclPlugin - replog: an error occured while adding change number 30856302, dn = changenumber=30856302,cn=changelog: Operations error.
>>> [18/Oct/2016:13:43:20 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
>>> [18/Oct/2016:13:43:20 -0400] NSMMReplicationPlugin - process_postop: Failed to apply update (58065f9f000000600000) error (1).  Aborting replication session(conn=1901274 op=5)
>>> [18/Oct/2016:13:43:24 -0400] - ldbm_back_seq deadlock retry BAD 1601, err=0 BDB0062 Successful return: 0
>>> [18/Oct/2016:13:43:25 -0400] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: retry (49) the transaction (csn=58065f7c000a00040000) failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
>>> [18/Oct/2016:13:43:25 -0400] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: failed to write entry with csn (58065f7c000a00040000); db error - -30993 BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
>>> [18/Oct/2016:13:43:25 -0400] NSMMReplicationPlugin - write_changelog_and_ruv: can't add a change for uid=janedoe,cn=users,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu (uniqid: 4080421a-2d0211e5-ac0b8f7e-e0b1a377, optype: 64) to changelog csn 58065f7c000a00040000
>>>
>>>
>>> ns-slapd was hung so we restarted and now it's stuck and won't come back up. It
>>> hangs up here:
>>>
>>> [18/Oct/2016:14:12:31 -0400] - Skipping CoS Definition cn=Password Policy,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu--no CoS Templates found, which should be added before the CoS Definition.
>>> [18/Oct/2016:14:12:31 -0400] NSMMReplicationPlugin - changelog program - _cl5NewDBFile: PR_DeleteSemaphore: /var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/a32992ce-71b811e5-9d33a516-e778e883.sema; NSPR error - -5943
>>> [18/Oct/2016:14:12:32 -0400] NSMMReplicationPlugin - changelog program - _cl5NewDBFile: PR_DeleteSemaphore: /var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/986efe12-71b811e5-9d33a516-e778e883.sema; NSPR error - -5943
>>>
>>>
>>> Tried deleting the semaphore files and restarting but no luck. Attached
>>> is a stacktrace of the stuck ns-slapd process.
>>>
>>> Here's the versions were running:
>>>
>>> ipa-server-4.2.0-15.0.1.el7.centos.19.x86_64
>>> 389-ds-base-1.3.4.0-33.el7_2.x86_64
>>>
>>> FWIW, we were experimenting with the new life-cycle management features,
>>> specifically "preserved" users and deleted the user "janedoe" when this
>>> happened.  From the errors above looks like this host failed to
>>> replicate the change?  Not sure if this is related or not.
>>>
>>> Is it possible to recover the database? Thanks in advance for any pointers.
>> from the stack trace the process is not hanging, it is trying to recover.
>> After a crash/kill  the changelog does not contai a RUV and it is
>> reconstructed by reading all records in the changelog, if this is large it
>> can take some time.
>> If you look at that part of the stack repeatedly,
>>
>> #4  0x00007f4e88daeba5 in cl5DBData2Entry (data=<optimized out>, len=<optimized out>, entry=entry at entry=0x7ffff6598910) at ldap/servers/plugins/replication/cl5_api.c:2342
>>          rc = <optimized out>
>>          version = <optimized out>
>>          pos = 0x7f4e9839d091 ""
>>          strCSN = 0x0
>>          op = 0x7ffff6598980
>>          add_mods = 0x7f4e983a5e80
>>          rawDN = 0x7f4e98396e20 "fqdn=cpn-k08-29-02.cbls.ccr.buffalo.edu,cn=computers,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu"
>>          s = "\300\037>\230N\177\000\000@\210Y\366\377\177\000\000@\210Y\366\377"
>> #5  0x00007f4e88daf5d6 in _cl5GetNextEntry (entry=entry at entry=0x7ffff6598910, iterator=0x7f4e983a5e80) at ldap/servers/plugins/replication/cl5_api.c:5291
>>          rc = 0
>>          it = 0x7f4e983a5e80
>>          key = {data = 0x0, size = 21, ulen = 0, dlen = 0, doff = 0, app_data = 0x0, flags = 16}
>>          data = {data = 0x7f4e9839cff0, size = 335, ulen = 0, dlen = 0, doff = 0, app_data = 0x0, flags = 16}
>> #6  0x00007f4e88dafb34 in _cl5ConstructRUV (purge=1, obj=0x7f4e983e1fc0, replGen=0x7ffff6598910 "\200\211Y\366\377\177") at ldap/servers/plugins/replication/cl5_api.c:4306
>>
>>
>> you should see some progress in which entry is handled
>>
> Ludwig, thanks very much for the help. As you pointed out just needed to let it
> finish.  ns-slapd eventually came back up once it finished reading the
> changelog. Still seeing some errors related to the NSMMReplicationPlugin failed
> to apply update and from the managed-entries-plugin. Can these safely be
> ignored or are they indicative of a more serious problem?

This is difficult to say the reason of managed entries messages.
It says that the origin entry "uid=janedoe,cn=deleted 
users,cn=accounts,cn=provisioning,dc=cbls,dc=ccr,dc=buffalo,dc=edu"
is still having a managed entry ('|mepManagedEntry') that is possibly 
something like 
'|cn=janedoe,cn=groups,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu".

This is looking like a bug because user 'janedoe' being a preserved 
user, it should not have any reference to existing groups.

Could you dump uid=janedoe entry:
ldapsearch -D "cn=directory manager" -w xxxx -b ""uid=janedoe,cn=deleted 
users,cn=accounts,cn=provisioning,dc=cbls,dc=ccr,dc=buffalo,dc=edu" 
nscpentrywsi

If the link still exists, it is looking like a bug but IMHO it should 
not create security issue.

regards
thierry
>
> [19/Oct/2016:09:28:46 -0400] - Operation error fetching Null DN (e73b48a4-95ff11e6-8bc7a516-e778e883), error -30993.
> [19/Oct/2016:09:28:46 -0400] - dn2entry_ext: Failed to get id for changenumber=30856335,cn=changelog from entryrdn index (-30993)
> [19/Oct/2016:09:28:46 -0400] - Operation error fetching changenumber=30856335,cn=changelog (null), error -30993.
> [19/Oct/2016:09:28:46 -0400] DSRetroclPlugin - replog: an error occured while adding change number 30856335, dn = changenumber=30856335,cn=changelog: Operations error.
> [19/Oct/2016:09:28:46 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
> [19/Oct/2016:09:28:46 -0400] NSMMReplicationPlugin - process_postop: Failed to apply update (5806acf7000000600000) error (1).  Aborting replication session(conn=9 op=28)
> [19/Oct/2016:09:28:54 -0400] - Operation error fetching Null DN (e73b48a7-95ff11e6-8bc7a516-e778e883), error -30993.
> [19/Oct/2016:09:28:59 -0400] - dn2entry_ext: Failed to get id for changenumber=30856337,cn=changelog from entryrdn index (-30993)
> [19/Oct/2016:09:29:17 -0400] - Operation error fetching changenumber=30856337,cn=changelog (null), error -30993.
> [19/Oct/2016:09:29:17 -0400] DSRetroclPlugin - replog: an error occured while adding change number 30856337, dn = changenumber=30856337,cn=changelog: Operations error.
> [19/Oct/2016:09:29:17 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
> [19/Oct/2016:09:29:17 -0400] NSMMReplicationPlugin - process_postop: Failed to apply update (5806acf7000000600000) error (1).  Aborting replication session(conn=15 op=5)
> [19/Oct/2016:09:29:20 -0400] - Retry count exceeded in delete
> [19/Oct/2016:09:29:20 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 30712389 (rc: 51)
> [19/Oct/2016:09:29:28 -0400] - Operation error fetching Null DN (0afe8e82-960011e6-8bc7a516-e778e883), error -30993.
> [19/Oct/2016:09:29:28 -0400] - dn2entry_ext: Failed to get id for changenumber=30856351,cn=changelog from entryrdn index (-30993)
> [19/Oct/2016:09:29:28 -0400] - Operation error fetching changenumber=30856351,cn=changelog (null), error -30993.
> [19/Oct/2016:09:29:28 -0400] DSRetroclPlugin - replog: an error occured while adding change number 30856351, dn = changenumber=30856351,cn=changelog: Operations error.
> [19/Oct/2016:09:29:28 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
> [19/Oct/2016:09:29:28 -0400] NSMMReplicationPlugin - process_postop: Failed to apply update (5806acf7000000600000) error (1).  Aborting replication session(conn=25 op=5)
> [19/Oct/2016:09:29:31 -0400] - Retry count exceeded in delete
> [19/Oct/2016:09:29:31 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 30712865 (rc: 51)
> [19/Oct/2016:09:29:39 -0400] - Operation error fetching Null DN (0afe8e90-960011e6-8bc7a516-e778e883), error -30993.
> [19/Oct/2016:09:29:39 -0400] - dn2entry_ext: Failed to get id for changenumber=30856364,cn=changelog from entryrdn index (-30993)
> [19/Oct/2016:09:29:39 -0400] - Operation error fetching changenumber=30856364,cn=changelog (null), error -30993.
> [19/Oct/2016:09:29:39 -0400] DSRetroclPlugin - replog: an error occured while adding change number 30856364, dn = changenumber=30856364,cn=changelog: Operations error.
> [19/Oct/2016:09:29:39 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
> [19/Oct/2016:09:29:39 -0400] NSMMReplicationPlugin - process_postop: Failed to apply update (5806acf7000000600000) error (1).  Aborting replication session(conn=35 op=5)
> [19/Oct/2016:09:29:42 -0400] - Retry count exceeded in delete
> [19/Oct/2016:09:29:42 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 30713364 (rc: 51)
> [19/Oct/2016:09:29:42 -0400] - Operation error fetching Null DN (0afe8ea0-960011e6-8bc7a516-e778e883), error -30993.
> [19/Oct/2016:09:29:42 -0400] - dn2entry_ext: Failed to get id for changenumber=30856379,cn=changelog from entryrdn index (-30993)
> [19/Oct/2016:09:29:42 -0400] - Operation error fetching changenumber=30856379,cn=changelog (null), error -30993.
> [19/Oct/2016:09:29:42 -0400] DSRetroclPlugin - replog: an error occured while adding change number 30856379, dn = changenumber=30856379,cn=changelog: Operations error.
> [19/Oct/2016:09:29:42 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
> [19/Oct/2016:09:29:42 -0400] NSMMReplicationPlugin - process_postop: Failed to apply update (5806acf7000000600000) error (1).  Aborting replication session(conn=36 op=5)
> [19/Oct/2016:09:29:50 -0400] - Operation error fetching Null DN (0afe8ea2-960011e6-8bc7a516-e778e883), error -30993.
> [19/Oct/2016:09:29:50 -0400] - dn2entry_ext: Failed to get id for changenumber=30856380,cn=changelog from entryrdn index (-30993)
> [19/Oct/2016:09:29:50 -0400] - Operation error fetching changenumber=30856380,cn=changelog (null), error -30993.
> [19/Oct/2016:09:29:50 -0400] DSRetroclPlugin - replog: an error occured while adding change number 30856380, dn = changenumber=30856380,cn=changelog: Operations error.
> [19/Oct/2016:09:29:50 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
> ...
> [19/Oct/2016:09:30:06 -0400] managed-entries-plugin - mep_mod_post_op: Unable to find config for origin entry "uid=janedoe,cn=deleted users,cn=accounts,cn=provisioning,dc=cbls,dc=ccr,dc=buffalo,dc=edu".
> [19/Oct/2016:09:30:06 -0400] managed-entries-plugin - mep_mod_post_op: Unable to find config for origin entry "uid=janedoe,cn=deleted users,cn=accounts,cn=provisioning,dc=cbls,dc=ccr,dc=buffalo,dc=edu".
>
>
>
>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/freeipa-users/attachments/20161019/3a203a28/attachment.htm>


More information about the Freeipa-users mailing list