[Freeipa-users] replicas unresponsive with increasing file descriptors

Andrew E. Bruno aebruno2 at buffalo.edu
Tue Sep 1 15:20:53 UTC 2015


On Tue, Sep 01, 2015 at 05:03:10PM +0200, Ludwig Krispenz wrote:
> 
> On 09/01/2015 04:39 PM, Andrew E. Bruno wrote:
> >A few months ago we had a replica failure where the system ran out of file
> >descriptors and the slapd database was corrupted:
> >
> >https://www.redhat.com/archives/freeipa-users/2015-June/msg00389.html
> >
> >We now monitor file descriptor counts on our replicas, and last night 2
> >of our 3 replicas failed and became completely unresponsive. Trying to
> >kinit on a failed replica resulted in:
> >
> >[user at ipa-master]$ kinit
> >kinit: Generic error (see e-text) while getting initial credentials
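> >
> >(When kinit reports only "Generic error (see e-text)", running it with
> >MIT krb5's client tracing enabled will usually surface the underlying
> >failure, e.g.:
> >
> >    KRB5_TRACE=/dev/stderr kinit
> >)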
> >
> >
> >Snippet from the /var/log/dirsrv/slapd-[domain]/errors:
> >
> >[31/Aug/2015:17:14:39 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Warning: Attempting to release replica, but unable to receive endReplication extended operation response from the replica. Error -5 (Timed out)
> >[31/Aug/2015:17:16:39 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
> >[31/Aug/2015:17:18:42 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
> >[31/Aug/2015:17:20:42 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
> >[31/Aug/2015:17:22:47 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
> >[31/Aug/2015:17:24:47 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
> >[31/Aug/2015:17:24:47 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Incremental protocol: event backoff_timer_expired should not occur in state start_backoff
> >[31/Aug/2015:17:26:50 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
> >[31/Aug/2015:17:28:50 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
> >
> >The access logs were filling up with:
> >
> >[31/Aug/2015:17:13:17 -0400] conn=1385990 fd=449 slot=449 connection from 10.106.14.29 to 10.113.14.30
> >[31/Aug/2015:17:13:18 -0400] conn=1385991 fd=450 slot=450 connection from 10.104.9.137 to 10.113.14.30
> >[31/Aug/2015:17:13:18 -0400] conn=1385992 fd=451 slot=451 connection from 10.104.16.19 to 10.113.14.30
> >[31/Aug/2015:17:13:21 -0400] conn=1385993 fd=452 slot=452 connection from 10.111.11.30 to 10.113.14.30
> >[31/Aug/2015:17:13:24 -0400] conn=1385994 fd=453 slot=453 connection from 10.113.27.115 to 10.113.14.30
> >[31/Aug/2015:17:13:27 -0400] conn=1385995 fd=454 slot=454 connection from 10.111.8.116 to 10.113.14.30
> >[31/Aug/2015:17:13:27 -0400] conn=1385996 fd=514 slot=514 connection from 10.113.25.40 to 10.113.14.30
> >[31/Aug/2015:17:13:29 -0400] conn=1385997 fd=515 slot=515 connection from 10.106.14.27 to 10.113.14.30
> >[31/Aug/2015:17:13:29 -0400] conn=1385998 fd=516 slot=516 connection from 10.111.10.141 to 10.113.14.30
> >[31/Aug/2015:17:13:30 -0400] conn=1385999 fd=528 slot=528 connection from 10.104.14.27 to 10.113.14.30
> >[31/Aug/2015:17:13:31 -0400] conn=1386000 fd=529 slot=529 connection from 10.106.13.132 to 10.113.14.30
> >[31/Aug/2015:17:13:31 -0400] conn=1386001 fd=530 slot=530 connection from 10.113.25.11 to 10.113.14.30
> >[31/Aug/2015:17:13:31 -0400] conn=1386002 fd=531 slot=531 connection from 10.104.15.11 to 10.113.14.30
> >[31/Aug/2015:17:13:32 -0400] conn=1386003 fd=533 slot=533 connection from 10.104.7.136 to 10.113.14.30
> >[31/Aug/2015:17:13:33 -0400] conn=1386004 fd=534 slot=534 connection from 10.113.24.23 to 10.113.14.30
> >[31/Aug/2015:17:13:33 -0400] conn=1386005 fd=535 slot=535 connection from 10.106.12.105 to 10.113.14.30
> >[31/Aug/2015:17:13:33 -0400] conn=1386006 fd=536 slot=536 connection from 10.104.16.41 to 10.113.14.30
> >[31/Aug/2015:17:13:34 -0400] conn=1386007 fd=537 slot=537 connection from 10.104.16.4 to 10.113.14.30
> >[31/Aug/2015:17:13:35 -0400] conn=1386008 fd=538 slot=538 connection from 10.111.8.12 to 10.113.14.30
> >[31/Aug/2015:17:13:36 -0400] conn=1386009 fd=539 slot=539 connection from 10.111.8.17 to 10.113.14.30
> >....
> >
> >
> >It looks like clients were connecting to the replicas but the file
> >descriptors were never being released. Our monitoring showed steadily
> >increasing file descriptor counts on both replicas (normally ~600):
> >
> >DateTime                  |   Host        |   File Descriptor Count
> >--------------------------------------------------------------------
> >Mon, 31 Aug 2015 17:28:28 |   srv-m14-32  |   1394
> >Mon, 31 Aug 2015 17:28:40 |   srv-m14-30  |   1192
> >Mon, 31 Aug 2015 18:28:28 |   srv-m14-32  |   2478
> >Mon, 31 Aug 2015 18:28:40 |   srv-m14-30  |   2212
> >Mon, 31 Aug 2015 19:28:28 |   srv-m14-32  |   3305
> >Mon, 31 Aug 2015 19:28:40 |   srv-m14-30  |   3058
> >....
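> >
> >(The counts above come from a simple poll of the ns-slapd process; one
> >rough way to get the same number, assuming a single dirsrv instance:
> >
> >    # ls /proc/$(pidof ns-slapd)/fd | wc -l
> >)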
> >
> >We can confirm this via logconv.pl:
> >
> >
> >Start of Logs:    31/Aug/2015:14:55:01
> >End of Logs:      31/Aug/2015:16:42:37
> >...
> >FDs Taken:                    3140
> >FDs Returned:                 3160
> >Highest FD Taken:             603
> >
> >
> >Start of Logs:    31/Aug/2015:16:42:37
> >End of Logs:      31/Aug/2015:20:18:41
> >..
> >FDs Taken:                    4562
> >FDs Returned:                 1336
> >Highest FD Taken:             3755
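> >
> >(The stats above are logconv.pl's default summary, run over the access
> >logs before and after the incident, roughly:
> >
> >    # logconv.pl /var/log/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/access*
> >)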
> >
> >
> >We suspect something happened around 31/Aug/2015:17:13:00 that caused both
> >replicas to become unresponsive and leak file descriptors. Luckily we caught
> >this before either system ran out of file descriptors. We logged onto each
> >replica and restarted ipa via:
> >
> >     # systemctl restart ipa
> >
> >This produced some errors in the logs:
> >
> >[31/Aug/2015:20:11:02 -0400] - slapd shutting down - signaling operation threads - op stack size 0 max work q size 3623 max work q stack size 496
> >[31/Aug/2015:20:11:02 -0400] - slapd shutting down - waiting for 30 threads to terminate
> >...
> >[31/Aug/2015:20:12:34 -0400] - 389-Directory/1.3.3.1 B2015.118.1941 starting up
> >[31/Aug/2015:20:12:34 -0400] - WARNING -- Minimum cache size is 512000 -- rounding up
> >[31/Aug/2015:20:12:34 -0400] - WARNING: changelog: entry cache size 512000B is less than db size 400465920B; We recommend to increase the entry cache size nsslapd-cachememsize.
> >[31/Aug/2015:20:12:34 -0400] - Detected Disorderly Shutdown last time Directory Server was running, recovering database.
> >[31/Aug/2015:20:12:35 -0400] schema-compat-plugin - warning: no entries set up under cn=computers, cn=compat,dc=cbls,dc=ccr,dc=buffalo,dc=edu
> >....
> >[31/Aug/2015:20:12:39 -0400] - Skipping CoS Definition cn=Password Policy,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu--no CoS Templates found, which should be added before the CoS Definition.
> >[31/Aug/2015:20:12:39 -0400] NSMMReplicationPlugin - changelog program - _cl5NewDBFile: PR_DeleteSemaphore: /var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/e909b405-2cb811e5-ac0b8f7e-e0b1a377.sema; NSPR error - -5943
> >[31/Aug/2015:20:13:08 -0400] NSMMReplicationPlugin - changelog program - _cl5NewDBFile: PR_DeleteSemaphore: /var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/0cccfa05-2cb911e5-ac0b8f7e-e0b1a377.sema; NSPR error - -5943
> >[31/Aug/2015:20:13:11 -0400] NSMMReplicationPlugin - replica_check_for_data_reload: Warning: disordely shutdown for replica dc=cbls,dc=ccr,dc=buffalo,dc=edu. Check if DB RUV needs to be updated
> >[31/Aug/2015:20:13:11 -0400] NSMMReplicationPlugin - replica_check_for_data_reload: Warning: disordely shutdown for replica o=ipaca. Check if DB RUV needs to be updated
> >[31/Aug/2015:20:13:11 -0400] set_krb5_creds - Could not get initial credentials for principal [ldap/srv-m14-32.cbls.ccr.buffalo.edu at CBLS.CCR.BUFFALO.EDU] in keytab [FILE:/etc/dirsrv/ds.keytab]: -1765328228 (Cannot contact any KDC for requested realm)
> >[31/Aug/2015:20:13:11 -0400] set_krb5_creds - Could not get initial credentials for principal [ldap/srv-m14-32.cbls.ccr.buffalo.edu at CBLS.CCR.BUFFALO.EDU] in keytab [FILE:/etc/dirsrv/ds.keytab]: -1765328228 (Cannot contact any KDC for requested realm)
> >....
> >
> >Followed by almost 1M lines of...
> >
> >[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511742 (rc: 32)
> >[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511743 (rc: 32)
> >[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511744 (rc: 32)
> >[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511745 (rc: 32)
> >[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511746 (rc: 32)
> >[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511747 (rc: 32)
> >[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511748 (rc: 32)
> >[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511749 (rc: 32)
> >[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511750 (rc: 32)
> >...
> >
> >
> >The delete_changerecord ids ran from 511742 up to 1471562 in increasing
> >order. Are these messages normal, or a sign of something more serious?
> >
> >After we restarted ipa, both replicas eventually came back up and appear
> >to be operating normally.
> >
> >Any ideas what could have caused this, and where to look for more info?
> >We'd obviously like to prevent it from happening again, as we suspect the
> >same thing caused our replica failure back in June.
> If the server is unresponsive and incoming connections pile up, consuming
> all the file descriptors, this could be caused by a deadlock in the server,
> or by some clients not reading their responses and thereby blocking all
> other worker threads from progressing.
> If you run into this situation again, could you try to get a pstack of the
> process?

Yes, we'll be sure to grab a pstack next time. Could this possibly be
related to replication? These errors appeared right around the time the
servers became unresponsive:

 - Warning: Attempting to release replica, but unable to receive endReplication extended operation response from the replica
 - Incremental protocol: event backoff_timer_expired should not occur in state start_backoff
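
For next time, we'll capture the stacks with something like this (assuming
gdb's pstack wrapper is installed and ns-slapd is still running):

    # pstack $(pidof ns-slapd) > /var/tmp/ns-slapd.pstack.$(date +%s)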

> 
> About the changelog messages: it looks like after a crash the information
> about where changelog trimming should start is lost, so trimming starts at
> the wrong number, can no longer find the records in the retro changelog,
> and logs an error for each changenumber it attempts.
> We should investigate this and try to prevent it, or at least stop it from
> polluting the log.
> 
> About the "Check if DB RUV ...." message: it is misleading. The server is
> already performing the check itself; the message would better read
> "Checking if ...".

Great, thanks for the info.
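
In the meantime, to sanity-check where the retro changelog now starts,
we'll probably run something like this (assuming the default cn=changelog
suffix):

    # ldapsearch -x -D "cn=Directory Manager" -W -b "cn=changelog" \
        "(changenumber>=1)" changenumber | grep '^changenumber:' | head -1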


> >
> >Should we be concerned about these errors:
> >
> >  replica_check_for_data_reload: Warning: disordely shutdown for replica dc=cbls,dc=ccr,dc=buffalo,dc=edu. Check if DB RUV needs to be updated
> >  replica_check_for_data_reload: Warning: disordely shutdown for replica o=ipaca. Check if DB RUV needs to be updated
> >
> >How do we check if DB RUV needs to be updated?
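> >
> >(We can dump the database RUV itself with the usual tombstone search,
> >e.g.:
> >
> >    # ldapsearch -x -D "cn=Directory Manager" -W \
> >        -b "dc=cbls,dc=ccr,dc=buffalo,dc=edu" \
> >        "(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))"
> >
> >but we're not sure what we'd compare it against.)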
> >
> >running: ipa-server-4.1.0-18, 389-ds-base-1.3.3.1-16, CentOS 7.1.1503
> >
> >Thanks in advance for any help!
> >
> >Best,
> >
> >--Andrew
> >
> 