[Freeipa-users] ipa replica failure

Rich Megginson rmeggins at redhat.com
Fri Jun 19 18:57:38 UTC 2015


On 06/19/2015 12:22 PM, Andrew E. Bruno wrote:
> Hello,
>
> First time troubleshooting an IPA server failure; looking for some
> guidance on how best to proceed.
>
> First some background on our setup:
>
> Servers are running freeipa v4.1.0 on CentOS 7.1.1503:
>
> - ipa-server-4.1.0-18.el7.centos.3.x86_64
> - 389-ds-base-1.3.3.1-16.el7_1.x86_64
>
> 3 IPA servers: 1 first master (rep1) and 2 replicas (rep2, rep3). The
> replicas were set up as CAs (i.e. ipa-replica-install --setup-ca ...).
>
> We have ~3000 user accounts (~1000 active, the rest disabled) and ~700
> hosts enrolled (all installed using ipa-client-install and running
> sssd). Client hosts are a mix of CentOS 7 and CentOS 6.5.
>
>
> We recently discovered one of our replica servers (rep2) was not
> responding. A quick check of the dirsrv error log
> /var/log/dirsrv/slapd-XXXX/errors showed (sanitized):
>
>      PR_Accept() failed, Netscape Portable Runtime error (Process open
>      FD table is full.)
>      ...
>
> The server was rebooted, and after coming back up it had these errors in the logs:
>
>          389-Directory/1.3.3.1 B2015.118.1941
>          replica2:636 (/etc/dirsrv/slapd-XXXX)
>
> [16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error detected; run recovery
> [16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to trickle, err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
> [16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error detected; run recovery
> [16/Jun/2015:10:12:33 -0400] - Serious Error---Failed in deadlock detect (aborted at 0x0), err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
> [16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error detected; run recovery
> [16/Jun/2015:10:12:33 -0400] - Serious Error---Failed in deadlock detect (aborted at 0x0), err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
> [16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error detected; run recovery
> [16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to checkpoint database, err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
> [16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error detected; run recovery
> [16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to checkpoint database, err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
> [16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error detected; run recovery
> [16/Jun/2015:10:12:33 -0400] - checkpoint_threadmain: log archive failed - BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery (-30973)
> ....
> [16/Jun/2015:16:24:04 -0400] - 389-Directory/1.3.3.1 B2015.118.1941 starting up
> [16/Jun/2015:16:24:04 -0400] - Detected Disorderly Shutdown last time Directory Server was running, recovering database.
> ...
> [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - replica_check_for_data_reload: Warning: disordely shutdown for replica dc=XXX. Check if DB RUV needs to be updated
> [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of database RUV (from CL RUV) ->  55770068000300030000
> [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of database RUV (from CL RUV) ->  556f4632001400040000
> [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of database RUV (from CL RUV) ->  556f4631004d00050000
> [16/Jun/2015:16:24:15 -0400] slapi_ldap_bind - Error: could not send startTLS request: error -1 (Can't contact LDAP server) errno 111 (Connection refused)
> [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - agmt="cn=cloneAgreement1-rep2 (rep1:389): Replication bind with SIMPLE auth failed: LDAP error -1 (Can't contact LDAP server) ()
> [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - replica_check_for_data_reload: Warning: disordely shutdown for replica o=ipaca. Check if DB RUV needs to be updated
> [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of database RUV (from CL RUV) ->  556f46290005005b0000
> [16/Jun/2015:16:24:15 -0400] set_krb5_creds - Could not get initial credentials for principal [ldap/rep2] in keytab [FILE:/etc/dirsrv/ds.keytab]: -1765328228 (Cannot contact any KDC for requested realm)
> [16/Jun/2015:16:24:15 -0400] slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error -1 (Can't contact LDAP server) ((null)) errno 111 (Connection refused)
> [16/Jun/2015:16:24:15 -0400] slapi_ldap_bind - Error: could not perform interactive bind for id [] authentication mechanism [GSSAPI]: error -1 (Can't contact LDAP server)
> [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - agmt="cn=meTorep1" (rep1:389): Replication bind with GSSAPI auth failed: LDAP error -1 (Can't contact LDAP server) ()
> [16/Jun/2015:16:24:15 -0400] - Skipping CoS Definition cn=Password Policy,cn=accounts,dc=xxx--no CoS Templates found, which should be added before the CoS Definition.
> [16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301438 (rc: 32)
> [16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301439 (rc: 32)
> [16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301440 (rc: 32)
> [16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301441 (rc: 32)
> ....
> [16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301443 (rc: 32)
> [16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301444 (rc: 32)
> [16/Jun/2015:16:24:15 -0400] - slapd started.  Listening on All Interfaces port 389 for LDAP requests
> [16/Jun/2015:16:24:15 -0400] - Listening on All Interfaces port 636 for LDAPS requests
> [16/Jun/2015:16:24:15 -0400] - Listening on /var/run/slapd-CCR-BUFFALO-EDU.socket for LDAPI requests
> [16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301445 (rc: 32)
> [16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301446 (rc: 32)
> [16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301447 (rc: 32)
> ...
> [16/Jun/2015:16:24:24 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 336362 (rc: 32)
> [16/Jun/2015:16:24:24 -0400] slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error -1 (Can't contact LDAP server) ((null)) errno 107 (Transport endpoint is not connected)
> [16/Jun/2015:16:24:24 -0400] slapi_ldap_bind - Error: could not perform interactive bind for id [] authentication mechanism [GSSAPI]: error -1 (Can't contact LDAP server)
> [16/Jun/2015:16:24:24 -0400] slapi_ldap_bind - Error: could not send startTLS request: error -1 (Can't contact LDAP server) errno 107 (Transport endpoint is not connected)
> [16/Jun/2015:16:24:24 -0400] NSMMReplicationPlugin - agmt="cn=cloneAgreement1-rep2-pki-tomcat" (rep1:389): Replication bind with SIMPLE auth resumed
> [16/Jun/2015:16:24:25 -0400] NSMMReplicationPlugin - agmt="cn=meTorep1" (rep1:389): Replication bind with GSSAPI auth resumed
> [16/Jun/2015:16:27:28 -0400] - Operation error fetching Null DN (0ce19ce4-146611e5-8135a170-bd40e05c), error -30993.
> [16/Jun/2015:16:27:28 -0400] - dn2entry_ext: Failed to get id for changenumber=336746,cn=changelog from entryrdn index (-30993)
> [16/Jun/2015:16:27:28 -0400] - Operation error fetching changenumber=336746,cn=changelog (null), error -30993.
> [16/Jun/2015:16:27:28 -0400] DSRetroclPlugin - replog: an error occured while adding change number 336746, dn = changenumber=336746,cn=changelog: Operations error.
> [16/Jun/2015:16:27:28 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
> ..
> [16/Jun/2015:16:27:36 -0400] - Operation error fetching changenumber=336762,cn=changelog (null), error -30993.
> [16/Jun/2015:16:27:36 -0400] DSRetroclPlugin - replog: an error occured while adding change number 336762, dn = changenumber=336762,cn=changelog: Operations error.
> [16/Jun/2015:16:27:36 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
> [16/Jun/2015:16:27:36 -0400] - ldbm_back_seq deadlock retry BAD 1601, err=0 BDB0062 Successful return: 0
> [16/Jun/2015:16:27:36 -0400] - ldbm_back_seq deadlock retry BAD 1601, err=0 BDB0062 Successful return: 0
> [17/Jun/2015:13:41:23 -0400] - slapd shutting down - signaling operation threads - op stack size 0 max work q size 3025 max work q stack size 5
> [17/Jun/2015:13:41:23 -0400] - slapd shutting down - waiting for 30 threads to terminate
>
>
> Assuming we had a corrupted database, we first attempted to remove the
> replica by logging into the first master and running:
>
>      # ipa-replica-manage del rep2
>
> This process hung indefinitely, so we shut down all IPA services on
> rep2 (systemctl stop ipa), then re-ran the command on the first master:
>
>      # ipa-replica-manage -v --force del rep2
>
> This appeared to work, and rep2 appears to have been deleted:
>
>      # ipa-replica-manage list
>      rep3: master
>      rep1: master
>
> However, when querying the nsDS5ReplicationAgreement entries in LDAP we
> still see rep2 with replica id 97 for the ipaca suffix:
>
> # ldapsearch -Y GSSAPI -b "cn=mapping tree,cn=config" objectClass=nsDS5ReplicationAgreement -LL
>
> dn: cn=masterAgreement1-rep3-pki-tomcat,cn=replica,cn=ipaca,cn=mapping tree,cn=config
> objectClass: top
> objectClass: nsds5replicationagreement
> cn: masterAgreement1-rep3-pki-tomcat
> nsDS5ReplicaRoot: o=ipaca
> nsDS5ReplicaHost: rep3
> nsDS5ReplicaPort: 389
> nsDS5ReplicaBindDN: cn=Replication Manager cloneAgreement1-rep3-pki-tomcat,ou=csusers,cn=config
> nsDS5ReplicaBindMethod: Simple
> nsDS5ReplicaTransportInfo: TLS
> description: masterAgreement1-rep3-pki-tomcat
> ..
> nsds50ruv: {replicageneration} 5527f74b000000600000
> nsds50ruv: {replica 91 ldap://rep3:389} 5537c7ba0000005b
>   0000 5582c7e40004005b0000
> nsds50ruv: {replica 96 ldap://rep1:389} 5527f75400000060
>   0000 5582cd19000000600000
> nsds50ruv: {replica 97 ldap://rep2:389} 5527f76000000061
>   0000 556f462b000400610000
> nsruvReplicaLastModified: {replica 91 ldap://rep3:389} 0
>   0000000
> nsruvReplicaLastModified: {replica 96 ldap://rep1:389} 0
>   0000000
> nsruvReplicaLastModified: {replica 97 ldap://rep2:389} 0
>   0000000
> nsds5replicaLastUpdateStart: 20150619173149Z
> nsds5replicaLastUpdateEnd: 20150619173149Z
> nsds5replicaChangesSentSinceStartup:: OTY6MTI0LzAg
> nsds5replicaLastUpdateStatus: 0 Replica acquired successfully: Incremental upd
>   ate succeeded
> nsds5replicaUpdateInProgress: FALSE
> nsds5replicaLastInitStart: 0
> nsds5replicaLastInitEnd: 0
>
>
> Questions:
>
> 0. Is it likely that after running out of file descriptors the dirsrv
> slapd database on rep2 was corrupted?

That would appear to be the case based on correlation of events, 
although I've never seen that happen, and it is not supposed to happen.

>
> 1. Do we have to run ipa-replica-manage del rep2 on *each* of the
> remaining replica servers (rep1 and rep3)? Or should it just be run on
> the first master?

I believe it should only be run on the first master, but it hung, so 
something is not right, and I'm not sure how to remedy the situation.

> Do we need to run ipa-csreplica-manage del as well?
>
> 2. Why does the rep2 server still appear when querying the
> nsDS5ReplicationAgreement in ldap? Is this benign or will this pose problems
> when we go to add rep2 back in?

You should remove it.
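
One way to clear the stale replica id 97 out of the o=ipaca RUV on the
remaining masters is a CLEANALLRUV task against the CA suffix. A sketch,
assuming Directory Manager credentials and a local server; the task RDN
("clean 97") is an arbitrary label:

```shell
# Sketch: launch a CLEANALLRUV task for the stale CA replica id 97.
# Assumes Directory Manager credentials; run on a surviving master.
ldapmodify -D "cn=directory manager" -W -H ldap://localhost <<EOF
dn: cn=clean 97,cn=cleanallruv,cn=tasks,cn=config
changetype: add
objectclass: extensibleObject
replica-base-dn: o=ipaca
replica-id: 97
replica-force-cleaning: no
EOF
```

Task progress is logged in the dirsrv errors log; the task entry is
removed from cn=config when it finishes.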

>
> 3. What steps/commands can we take to verify rep2 was successfully removed and
> replication is behaving normally?
>
> We had tuned our servers according to the RHEL Performance Tuning
> Guide:
>
>     # cat /proc/sys/fs/file-max
>     6534908
>
>     # cat /etc/security/limits.d/40-freeipa.conf
>     *         soft      nofile      8192
>     *         hard      nofile      8192
>
>     # cat /etc/sysctl.d/40-freeipa.conf
>     net.ipv4.ip_local_port_range = 1024 65000
>     net.ipv4.tcp_keepalive_time = 300
>
>     # cat /etc/sysconfig/dirsrv.systemd
>     [Service]
>     LimitNOFILE=8192
>
> We're now considering increasing the nofile to something larger to
> prevent running out of file descriptors. Any guidance on what number to
> set this to?

8192 is extremely high.  The fact that you ran out of file descriptors 
at 8192 seems like a bug/fd leak somewhere.  I suppose you could, as a 
very temporary workaround, set the fd limit higher, but that is no 
guarantee that you won't run out again.

Please file at least one ticket, e.g. "database corrupted when server 
ran out of file descriptors", with as much information about that 
particular problem as you can provide.
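
As a quick sanity check on the remaining replicas, you could query the
status attributes of every surviving agreement on each master. A sketch,
reusing the GSSAPI bind from the original query:

```shell
# Sketch: list surviving replication agreements and their last-update
# status (assumes GSSAPI credentials, as in the query above).
ldapsearch -Y GSSAPI -b "cn=mapping tree,cn=config" \
    "objectClass=nsDS5ReplicationAgreement" -LLL \
    nsDS5ReplicaHost nsds5replicaLastUpdateStatus \
    nsds5replicaUpdateInProgress
```

rep2 should no longer appear as an nsDS5ReplicaHost, and each remaining
agreement should report a successful last update.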

>
> Many thanks in advance for any help.
>
> --Andrew
>




More information about the Freeipa-users mailing list