[Freeipa-users] ipa replica failure

Sat Jun 20 04:08:15 UTC 2015

On 6/19/15 11:22 AM, Andrew E. Bruno wrote:
> Hello,
>
> First time trouble shooting an ipa server failure and looking for some
> guidance on how best to proceed.
>
> First some background on our setup:
>
> Servers are running freeipa v4.1.0 on CentOS 7.1.1503:
>
> - ipa-server-4.1.0-18.el7.centos.3.x86_64
> - 389-ds-base-1.3.3.1-16.el7_1.x86_64
>
> 3 ipa-servers, 1 first master (rep1) and 2 (rep2, rep3) replicates. The
> replicates were setup to be ca's (i.e. ipa-replica-install --setup-ca...)
>
> We have ~3000 user accounts (~1000 active the rest disabled). We have
> ~700 hosts enrolled (all installed using ipa-client-install and running
> sssd). Hosts clients are a mix of centos 7 and centos 6.5.
>
>
> We recently discovered one of our replica servers (rep2) was not
> responding. A quick check of the dirsrv logs
> /var/log/dirsrv/slapd-XXXX/errors (sanitized):
>
>      PR_Accept() failed, Netscape Portable Runtime error (Process open
>      FD table is full.)
>      ...
>
> The server was rebooted and after coming back up had these errors in the logs:
>
>          389-Directory/1.3.3.1 B2015.118.1941
>          replica2:636 (/etc/dirsrv/slapd-XXXX)
>
> [16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error detected; run recovery
> [16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to trickle, err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
> [16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error detected; run recovery
> [16/Jun/2015:10:12:33 -0400] - Serious Error---Failed in deadlock detect (aborted at 0x0), err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
> [16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error detected; run recovery
> [16/Jun/2015:10:12:33 -0400] - Serious Error---Failed in deadlock detect (aborted at 0x0), err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
> [16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error detected; run recovery
> [16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to checkpoint database, err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
> [16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error detected; run recovery
> [16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to checkpoint database, err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
> [16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error detected; run recovery
> [16/Jun/2015:10:12:33 -0400] - checkpoint_threadmain: log archive failed - BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery (-30973)
> ....
> [16/Jun/2015:16:24:04 -0400] - 389-Directory/1.3.3.1 B2015.118.1941 starting up
> [16/Jun/2015:16:24:04 -0400] - Detected Disorderly Shutdown last time Directory Server was running, recovering database.
> ...
> [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - replica_check_for_data_reload: Warning: disordely shutdown for replica dc=XXX. Check if DB RUV needs to be updated
> [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of database RUV (from CL RUV) ->  55770068000300030000
> [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of database RUV (from CL RUV) ->  556f4632001400040000
> [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of database RUV (from CL RUV) ->  556f4631004d00050000
> [16/Jun/2015:16:24:15 -0400] slapi_ldap_bind - Error: could not send startTLS request: error -1 (Can't contact LDAP server) errno 111 (Connection refused)
> [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - agmt="cn=cloneAgreement1-rep2 (rep1:389): Replication bind with SIMPLE auth failed: LDAP error -1 (Can't contact LDAP server) ()
> [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - replica_check_for_data_reload: Warning: disordely shutdown for replica o=ipaca. Check if DB RUV needs to be updated
> [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of database RUV (from CL RUV) ->  556f46290005005b0000
> [16/Jun/2015:16:24:15 -0400] set_krb5_creds - Could not get initial credentials for principal [ldap/rep2] in keytab [FILE:/etc/dirsrv/ds.keytab]: -1765328228 (Cannot contact any KDC for requested realm)
> [16/Jun/2015:16:24:15 -0400] slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error -1 (Can't contact LDAP server) ((null)) errno 111 (Connection refused)
> [16/Jun/2015:16:24:15 -0400] slapi_ldap_bind - Error: could not perform interactive bind for id [] authentication mechanism [GSSAPI]: error -1 (Can't contact LDAP server)
> [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - agmt="cn=meTorep1" (rep1:389): Replication bind with GSSAPI auth failed: LDAP error -1 (Can't contact LDAP server) ()
> [16/Jun/2015:16:24:15 -0400] - Skipping CoS Definition cn=Password Policy,cn=accounts,dc=xxx--no CoS Templates found, which should be added before the CoS Definition.
> [16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301438 (rc: 32)
> [16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301439 (rc: 32)
> [16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301440 (rc: 32)
> [16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301441 (rc: 32)
> ....
> [16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301443 (rc: 32)
> [16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301444 (rc: 32)
> [16/Jun/2015:16:24:15 -0400] - slapd started.  Listening on All Interfaces port 389 for LDAP requests
> [16/Jun/2015:16:24:15 -0400] - Listening on All Interfaces port 636 for LDAPS requests
> [16/Jun/2015:16:24:15 -0400] - Listening on /var/run/slapd-CCR-BUFFALO-EDU.socket for LDAPI requests
> [16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301445 (rc: 32)
> [16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301446 (rc: 32)
> [16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301447 (rc: 32)
> ...
> [16/Jun/2015:16:24:24 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 336362 (rc: 32)
> [16/Jun/2015:16:24:24 -0400] slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error -1 (Can't contact LDAP server) ((null)) errno 107 (Transport endpoint is not connected)
> [16/Jun/2015:16:24:24 -0400] slapi_ldap_bind - Error: could not perform interactive bind for id [] authentication mechanism [GSSAPI]: error -1 (Can't contact LDAP server)
> [16/Jun/2015:16:24:24 -0400] slapi_ldap_bind - Error: could not send startTLS request: error -1 (Can't contact LDAP server) errno 107 (Transport endpoint is not connected)
> [16/Jun/2015:16:24:24 -0400] NSMMReplicationPlugin - agmt="cn=cloneAgreement1-rep2-pki-tomcat" (rep1:389): Replication bind with SIMPLE auth resumed
> [16/Jun/2015:16:24:25 -0400] NSMMReplicationPlugin - agmt="cn=meTorep1" (rep1:389): Replication bind with GSSAPI auth resumed
> [16/Jun/2015:16:27:28 -0400] - Operation error fetching Null DN (0ce19ce4-146611e5-8135a170-bd40e05c), error -30993.
> [16/Jun/2015:16:27:28 -0400] - dn2entry_ext: Failed to get id for changenumber=336746,cn=changelog from entryrdn index (-30993)
> [16/Jun/2015:16:27:28 -0400] - Operation error fetching changenumber=336746,cn=changelog (null), error -30993.
> [16/Jun/2015:16:27:28 -0400] DSRetroclPlugin - replog: an error occured while adding change number 336746, dn = changenumber=336746,cn=changelog: Operations error.
> [16/Jun/2015:16:27:28 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
> ..
> [16/Jun/2015:16:27:36 -0400] - Operation error fetching changenumber=336762,cn=changelog (null), error -30993.
> [16/Jun/2015:16:27:36 -0400] DSRetroclPlugin - replog: an error occured while adding change number 336762, dn = changenumber=336762,cn=changelog: Operations error.
> [16/Jun/2015:16:27:36 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
> [16/Jun/2015:16:27:36 -0400] - ldbm_back_seq deadlock retry BAD 1601, err=0 BDB0062 Successful return: 0
> [16/Jun/2015:16:27:36 -0400] - ldbm_back_seq deadlock retry BAD 1601, err=0 BDB0062 Successful return: 0
> [17/Jun/2015:13:41:23 -0400] - slapd shutting down - signaling operation threads - op stack size 0 max work q size 3025 max work q stack size 5
> [17/Jun/2015:13:41:23 -0400] - slapd shutting down - waiting for 30 threads to terminate
>
>
> Assuming we had a corrupted database we first attempted to remove the replicate
> by logging into the first master and running:
>
>      # ipa-replica-manage del rep2
>
> This process hung indefinitely. So we proceeded to shutdown all ipa services on
> rep2 (systemctl stop ipa).. then re-ran the command on the first master:
>
>      # ipa-replica-manage -v --force del rep2
>
> This appeared to work ok and rep2 appears to have been deleted:
>
>      # ipa-replica-manage list
>      rep3: master
>      rep1: master
>
> However, when querying ldap nsDS5ReplicationAgreement we still see rep2 with a
> replica 97 id for the ipca:
>
> # ldapsearch -Y GSSAPI -b "cn=mapping tree,cn=config" objectClass=nsDS5ReplicationAgreement -LL
>
> dn: cn=masterAgreement1-rep3-pki-tomcat,cn=replica,cn=ipaca,cn=mapping tree,cn=config
> objectClass: top
> objectClass: nsds5replicationagreement
> cn: masterAgreement1-rep3-pki-tomcat
> nsDS5ReplicaRoot: o=ipaca
> nsDS5ReplicaHost: rep3
> nsDS5ReplicaPort: 389
> nsDS5ReplicaBindDN: cn=Replication Manager cloneAgreement1-rep3-pki-tomcat,ou=csusers,cn=config
> nsDS5ReplicaBindMethod: Simple
> nsDS5ReplicaTransportInfo: TLS
> description: masterAgreement1-rep3-pki-tomcat
> ..
> nsds50ruv: {replicageneration} 5527f74b000000600000
> nsds50ruv: {replica 91 ldap://rep3:389} 5537c7ba0000005b
>   0000 5582c7e40004005b0000
> nsds50ruv: {replica 96 ldap://rep1:389} 5527f75400000060
>   0000 5582cd19000000600000
> nsds50ruv: {replica 97 ldap://rep2:389} 5527f76000000061
>   0000 556f462b000400610000
> nsruvReplicaLastModified: {replica 91 ldap://rep3:389} 0
>   0000000
> nsruvReplicaLastModified: {replica 96 ldap://rep1:389} 0
>   0000000
> nsruvReplicaLastModified: {replica 97 ldap://rep2:389} 0
>   0000000
> nsds5replicaLastUpdateStart: 20150619173149Z
> nsds5replicaLastUpdateEnd: 20150619173149Z
> nsds5replicaChangesSentSinceStartup:: OTY6MTI0LzAg
> nsds5replicaLastUpdateStatus: 0 Replica acquired successfully: Incremental upd
>   ate succeeded
> nsds5replicaUpdateInProgress: FALSE
> nsds5replicaLastInitStart: 0
> nsds5replicaLastInitEnd: 0
>
>
> Questions:
>
> 0. Is it likely that after running out of file descriptors the dirsrv
> slapd database on rep2 was corrupted?
I have increased FDs to 32768, although have a somewhat larger 
environment and the bulk of logins occur via LDAP, so that uses up more 
FDs.  From looking at your numbers, I don't think 16K would be 
unreasonable. Remember to set /etc/sysconfig/dirsrv.systemd as well as 
the ldapmodify settings and finally /etc/security/limits.conf
>
> 1. Do we have to run ipa-replica-manage del rep2 on *each* of the
> remaining replica servers (rep1 and rep3)? Or should it just be run on
> the first master? Do we need to run ipa-csreplicate-manage del as well?
You mentioned it was a CA --
ipa-csreplica-manage and run the same "delete" options to remove it.
>
> 2. Why does the rep2 server still appear when querying the
> nsDS5ReplicationAgreement in ldap? Is this benign or will this pose problems
> when we go to add rep2 back in?
See above
> 3. What steps/commands can we take to verify rep2 was successfully removed and
> replication is behaving normally?
After above - it should be gone
>
> We had tuned our servers according to the rhel  Performance Tuning
> Guide:
>
>     # cat /proc/sys/fs/file-max
>     6534908
>
>     # cat /etc/security/limits.d/40-freeipa.conf
>     *         soft      nofile      8192
>     *         hard      nofile      8192
>
>     # cat /etc/sysctl.d/40-freeipa.conf
>     net.ipv4.ip_local_port_range = 1024 65000
>     net.ipv4.tcp_keepalive_time = 300
>
>     # cat /etc/sysconfig/dirsrv.systemd
>     [Service]
>     LimitNOFILE=8192

Do it.  Also, have you run an lsof of the PID of ns-slapd to see what 
the connections are?
This is a very simply thing to increase.

You mention all your logins are via ipa-client-install, which implies 
Kerberos. However, I wonder if this is the case. Check the lsof output, 
but I would also consider running logconv.pl against your access logs in 
/var/log/dirsrv/slapd-INSTANCE to really see what is going on behind the 
scenes.

~J