[Freeipa-users] replicas unresponsive with increasing file descriptors

Andrew E. Bruno aebruno2 at buffalo.edu
Tue Sep 1 14:39:57 UTC 2015


A few months ago we had a replica failure where the system ran out of file
descriptors and the slapd database was corrupted:

https://www.redhat.com/archives/freeipa-users/2015-June/msg00389.html

We now monitor file descriptor counts on our replicas, and last night 2 of our
3 replicas failed and became completely unresponsive. Running kinit on a failed
replica resulted in:

[user@ipa-master]$ kinit
kinit: Generic error (see e-text) while getting initial credentials
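
In case it's useful, re-running kinit with MIT's client tracing enabled gives a
bit more context on where it stalls; this is a generic diagnostic, nothing
IPA-specific:

    # KRB5_TRACE prints each step of the credential exchange, including which
    # KDC is being contacted and where the request hangs
    KRB5_TRACE=/dev/stdout kinit <username>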


Snippet from the /var/log/dirsrv/slapd-[domain]/errors:

[31/Aug/2015:17:14:39 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Warning: Attempting to release replica, but unable to receive endReplication extended operation response from the replica. Error -5 (Timed out)
[31/Aug/2015:17:16:39 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[31/Aug/2015:17:18:42 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[31/Aug/2015:17:20:42 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[31/Aug/2015:17:22:47 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[31/Aug/2015:17:24:47 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[31/Aug/2015:17:24:47 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Incremental protocol: event backoff_timer_expired should not occur in state start_backoff
[31/Aug/2015:17:26:50 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[31/Aug/2015:17:28:50 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.

The access logs were filling up with:

[31/Aug/2015:17:13:17 -0400] conn=1385990 fd=449 slot=449 connection from 10.106.14.29 to 10.113.14.30
[31/Aug/2015:17:13:18 -0400] conn=1385991 fd=450 slot=450 connection from 10.104.9.137 to 10.113.14.30
[31/Aug/2015:17:13:18 -0400] conn=1385992 fd=451 slot=451 connection from 10.104.16.19 to 10.113.14.30
[31/Aug/2015:17:13:21 -0400] conn=1385993 fd=452 slot=452 connection from 10.111.11.30 to 10.113.14.30
[31/Aug/2015:17:13:24 -0400] conn=1385994 fd=453 slot=453 connection from 10.113.27.115 to 10.113.14.30
[31/Aug/2015:17:13:27 -0400] conn=1385995 fd=454 slot=454 connection from 10.111.8.116 to 10.113.14.30
[31/Aug/2015:17:13:27 -0400] conn=1385996 fd=514 slot=514 connection from 10.113.25.40 to 10.113.14.30
[31/Aug/2015:17:13:29 -0400] conn=1385997 fd=515 slot=515 connection from 10.106.14.27 to 10.113.14.30
[31/Aug/2015:17:13:29 -0400] conn=1385998 fd=516 slot=516 connection from 10.111.10.141 to 10.113.14.30
[31/Aug/2015:17:13:30 -0400] conn=1385999 fd=528 slot=528 connection from 10.104.14.27 to 10.113.14.30
[31/Aug/2015:17:13:31 -0400] conn=1386000 fd=529 slot=529 connection from 10.106.13.132 to 10.113.14.30
[31/Aug/2015:17:13:31 -0400] conn=1386001 fd=530 slot=530 connection from 10.113.25.11 to 10.113.14.30
[31/Aug/2015:17:13:31 -0400] conn=1386002 fd=531 slot=531 connection from 10.104.15.11 to 10.113.14.30
[31/Aug/2015:17:13:32 -0400] conn=1386003 fd=533 slot=533 connection from 10.104.7.136 to 10.113.14.30
[31/Aug/2015:17:13:33 -0400] conn=1386004 fd=534 slot=534 connection from 10.113.24.23 to 10.113.14.30
[31/Aug/2015:17:13:33 -0400] conn=1386005 fd=535 slot=535 connection from 10.106.12.105 to 10.113.14.30
[31/Aug/2015:17:13:33 -0400] conn=1386006 fd=536 slot=536 connection from 10.104.16.41 to 10.113.14.30
[31/Aug/2015:17:13:34 -0400] conn=1386007 fd=537 slot=537 connection from 10.104.16.4 to 10.113.14.30
[31/Aug/2015:17:13:35 -0400] conn=1386008 fd=538 slot=538 connection from 10.111.8.12 to 10.113.14.30
[31/Aug/2015:17:13:36 -0400] conn=1386009 fd=539 slot=539 connection from 10.111.8.17 to 10.113.14.30
....


It looks like clients were connecting to the replicas but file descriptors were
not being released. Our monitoring showed steadily increasing file descriptor
counts on both replicas (FD counts are normally around 600):

DateTime                  |   Host        |   File Descriptor Count
--------------------------------------------------------------------
Mon, 31 Aug 2015 17:28:28 |   srv-m14-32  |   1394 
Mon, 31 Aug 2015 17:28:40 |   srv-m14-30  |   1192  
Mon, 31 Aug 2015 18:28:28 |   srv-m14-32  |   2478
Mon, 31 Aug 2015 18:28:40 |   srv-m14-30  |   2212
Mon, 31 Aug 2015 19:28:28 |   srv-m14-32  |   3305
Mon, 31 Aug 2015 19:28:40 |   srv-m14-30  |   3058
....
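
For anyone who wants to watch for this, something as simple as counting entries
under /proc/<pid>/fd for ns-slapd catches it early. The script below is only a
sketch; the threshold and mail alert are illustrative, not our exact setup:

    #!/bin/bash
    # Rough check: count open file descriptors held by ns-slapd and complain
    # past an arbitrary threshold. Assumes a single dirsrv instance per host.
    PID=$(pidof ns-slapd)
    FD_COUNT=$(ls "/proc/${PID}/fd" | wc -l)
    echo "$(date -R) $(hostname -s) ns-slapd fd_count=${FD_COUNT}"
    # ~600 is our normal baseline; 1000 here is just an illustrative threshold
    if [ "${FD_COUNT}" -gt 1000 ]; then
        echo "WARNING: ${FD_COUNT} fds open on $(hostname -s)" | \
            mail -s "dirsrv fd count high" root
    fi

Querying currentconnections under cn=monitor should show the same trend.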

We can confirm this via logconv.pl:


Start of Logs:    31/Aug/2015:14:55:01
End of Logs:      31/Aug/2015:16:42:37
...
FDs Taken:                    3140
FDs Returned:                 3160
Highest FD Taken:             603


Start of Logs:    31/Aug/2015:16:42:37
End of Logs:      31/Aug/2015:20:18:41
..
FDs Taken:                    4562
FDs Returned:                 1336
Highest FD Taken:             3755

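For anyone wanting to reproduce the numbers above, logconv.pl can be pointed at
the access log with a start/end window, e.g. (with our instance name filled in):

    # summarize the access log for a specific time window; -S/-E take the same
    # timestamp format that appears in the log itself
    logconv.pl -S "[31/Aug/2015:14:55:01 -0400]" -E "[31/Aug/2015:16:42:37 -0400]" \
        /var/log/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/access*
    logconv.pl -S "[31/Aug/2015:16:42:37 -0400]" -E "[31/Aug/2015:20:18:41 -0400]" \
        /var/log/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/access*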

We suspect something happened around 31/Aug/2015:17:13:00 that caused both
replicas to become unresponsive and leak file descriptors. Luckily, we caught
this before the systems ran out of file descriptors. We logged onto each
replica and restarted ipa via:

    # systemctl restart ipa

This produced some errors in the logs:

[31/Aug/2015:20:11:02 -0400] - slapd shutting down - signaling operation threads - op stack size 0 max work q size 3623 max work q stack size 496
[31/Aug/2015:20:11:02 -0400] - slapd shutting down - waiting for 30 threads to terminate
...
[31/Aug/2015:20:12:34 -0400] - 389-Directory/1.3.3.1 B2015.118.1941 starting up
[31/Aug/2015:20:12:34 -0400] - WARNING -- Minimum cache size is 512000 -- rounding up
[31/Aug/2015:20:12:34 -0400] - WARNING: changelog: entry cache size 512000B is less than db size 400465920B; We recommend to increase the entry cache size nsslapd-cachememsize.
[31/Aug/2015:20:12:34 -0400] - Detected Disorderly Shutdown last time Directory Server was running, recovering database.
[31/Aug/2015:20:12:35 -0400] schema-compat-plugin - warning: no entries set up under cn=computers, cn=compat,dc=cbls,dc=ccr,dc=buffalo,dc=edu
....
[31/Aug/2015:20:12:39 -0400] - Skipping CoS Definition cn=Password Policy,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu--no CoS Templates found, which should be added before the CoS Definition.
[31/Aug/2015:20:12:39 -0400] NSMMReplicationPlugin - changelog program - _cl5NewDBFile: PR_DeleteSemaphore: /var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/e909b405-2cb811e5-ac0b8f7e-e0b1a377.sema; NSPR error - -5943
[31/Aug/2015:20:13:08 -0400] NSMMReplicationPlugin - changelog program - _cl5NewDBFile: PR_DeleteSemaphore: /var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/0cccfa05-2cb911e5-ac0b8f7e-e0b1a377.sema; NSPR error - -5943
[31/Aug/2015:20:13:11 -0400] NSMMReplicationPlugin - replica_check_for_data_reload: Warning: disordely shutdown for replica dc=cbls,dc=ccr,dc=buffalo,dc=edu. Check if DB RUV needs to be updated
[31/Aug/2015:20:13:11 -0400] NSMMReplicationPlugin - replica_check_for_data_reload: Warning: disordely shutdown for replica o=ipaca. Check if DB RUV needs to be updated
[31/Aug/2015:20:13:11 -0400] set_krb5_creds - Could not get initial credentials for principal [ldap/srv-m14-32.cbls.ccr.buffalo.edu@CBLS.CCR.BUFFALO.EDU] in keytab [FILE:/etc/dirsrv/ds.keytab]: -1765328228 (Cannot contact any KDC for requested realm)
[31/Aug/2015:20:13:11 -0400] set_krb5_creds - Could not get initial credentials for principal [ldap/srv-m14-32.cbls.ccr.buffalo.edu@CBLS.CCR.BUFFALO.EDU] in keytab [FILE:/etc/dirsrv/ds.keytab]: -1765328228 (Cannot contact any KDC for requested realm)
....

Followed by almost 1M lines of...

[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511742 (rc: 32)
[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511743 (rc: 32)
[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511744 (rc: 32)
[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511745 (rc: 32)
[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511746 (rc: 32)
[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511747 (rc: 32)
[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511748 (rc: 32)
[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511749 (rc: 32)
[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511750 (rc: 32)
...


The change record IDs in the delete_changerecord messages ran from 511742 to
1471562 in increasing order. Are these messages normal, or a sign of something
more serious?
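
In case it helps in answering: the records in question live under cn=changelog,
so the range that's actually still present can be checked with something like
the search below (note it enumerates the whole retro changelog, so it can be slow):

    # list the lowest and highest change numbers still present in the retro changelog
    ldapsearch -x -D "cn=Directory Manager" -W -b "cn=changelog" -s one \
        "(objectclass=changelogentry)" changenumber \
        | awk -F': ' 'tolower($1) == "changenumber" {print $2}' | sort -n | sed -n '1p;$p'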

After we restarted ipa, both replicas eventually came back up and appear to be
operating normally.

Any ideas what could have caused this and where to look for more info? We'd
obviously like to prevent this from happening again as we suspect this is what
caused our replica failure back in June as well.

Should we be concerned about these errors:

 replica_check_for_data_reload: Warning: disordely shutdown for replica dc=cbls,dc=ccr,dc=buffalo,dc=edu. Check if DB RUV needs to be updated
 replica_check_for_data_reload: Warning: disordely shutdown for replica o=ipaca. Check if DB RUV needs to be updated

How do we check whether the DB RUV needs to be updated?
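
In case it helps frame an answer: I know the database RUV can be read from the
replication tombstone entry with a query like the one below (and I believe
ipa-replica-manage list-ruv shows the replica IDs), but I'm not sure what to
compare the output against to decide whether it needs updating:

    # dump the database RUV for the main suffix from the RUV tombstone entry
    ldapsearch -x -D "cn=Directory Manager" -W \
        -b "dc=cbls,dc=ccr,dc=buffalo,dc=edu" \
        "(&(objectclass=nstombstone)(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff))" \
        nsds50ruv

    # and for the CA suffix
    ldapsearch -x -D "cn=Directory Manager" -W -b "o=ipaca" \
        "(&(objectclass=nstombstone)(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff))" \
        nsds50ruv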

We're running ipa-server-4.1.0-18 and 389-ds-base-1.3.3.1-16 on CentOS 7.1.1503.

Thanks in advance for any help!

Best,

--Andrew



