[Freeipa-users] ipa replica failure

Andrew E. Bruno aebruno2 at buffalo.edu
Sat Jun 20 04:53:40 UTC 2015


On Fri, Jun 19, 2015 at 09:08:15PM -0700, Janelle wrote:
> On 6/19/15 11:22 AM, Andrew E. Bruno wrote:
> >Hello,
> >
> >First time troubleshooting an IPA server failure and looking for some
> >guidance on how best to proceed.
> >
> >First some background on our setup:
> >
> >Servers are running freeipa v4.1.0 on CentOS 7.1.1503:
> >
> >- ipa-server-4.1.0-18.el7.centos.3.x86_64
> >- 389-ds-base-1.3.3.1-16.el7_1.x86_64
> >
> >3 ipa-servers: 1 first master (rep1) and 2 replicas (rep2, rep3). The
> >replicas were set up as CAs (i.e. ipa-replica-install --setup-ca ...).
> >
> >We have ~3000 user accounts (~1000 active, the rest disabled). We have
> >~700 hosts enrolled (all installed using ipa-client-install and running
> >sssd). Host clients are a mix of CentOS 7 and CentOS 6.5.
> >
> >
> >We recently discovered one of our replica servers (rep2) was not
> >responding. A quick check of the dirsrv logs
> >/var/log/dirsrv/slapd-XXXX/errors (sanitized):
> >
> >     PR_Accept() failed, Netscape Portable Runtime error (Process open
> >     FD table is full.)
> >     ...
> >
> >The server was rebooted and after coming back up had these errors in the logs:
> >
> >         389-Directory/1.3.3.1 B2015.118.1941
> >         replica2:636 (/etc/dirsrv/slapd-XXXX)
> >
> >[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error detected; run recovery
> >[16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to trickle, err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
> >[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error detected; run recovery
> >[16/Jun/2015:10:12:33 -0400] - Serious Error---Failed in deadlock detect (aborted at 0x0), err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
> >[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error detected; run recovery
> >[16/Jun/2015:10:12:33 -0400] - Serious Error---Failed in deadlock detect (aborted at 0x0), err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
> >[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error detected; run recovery
> >[16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to checkpoint database, err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
> >[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error detected; run recovery
> >[16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to checkpoint database, err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
> >[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error detected; run recovery
> >[16/Jun/2015:10:12:33 -0400] - checkpoint_threadmain: log archive failed - BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery (-30973)
> >....
> >[16/Jun/2015:16:24:04 -0400] - 389-Directory/1.3.3.1 B2015.118.1941 starting up
> >[16/Jun/2015:16:24:04 -0400] - Detected Disorderly Shutdown last time Directory Server was running, recovering database.
> >...
> >[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - replica_check_for_data_reload: Warning: disordely shutdown for replica dc=XXX. Check if DB RUV needs to be updated
> >[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of database RUV (from CL RUV) ->  55770068000300030000
> >[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of database RUV (from CL RUV) ->  556f4632001400040000
> >[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of database RUV (from CL RUV) ->  556f4631004d00050000
> >[16/Jun/2015:16:24:15 -0400] slapi_ldap_bind - Error: could not send startTLS request: error -1 (Can't contact LDAP server) errno 111 (Connection refused)
> >[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - agmt="cn=cloneAgreement1-rep2 (rep1:389): Replication bind with SIMPLE auth failed: LDAP error -1 (Can't contact LDAP server) ()
> >[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - replica_check_for_data_reload: Warning: disordely shutdown for replica o=ipaca. Check if DB RUV needs to be updated
> >[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of database RUV (from CL RUV) ->  556f46290005005b0000
> >[16/Jun/2015:16:24:15 -0400] set_krb5_creds - Could not get initial credentials for principal [ldap/rep2] in keytab [FILE:/etc/dirsrv/ds.keytab]: -1765328228 (Cannot contact any KDC for requested realm)
> >[16/Jun/2015:16:24:15 -0400] slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error -1 (Can't contact LDAP server) ((null)) errno 111 (Connection refused)
> >[16/Jun/2015:16:24:15 -0400] slapi_ldap_bind - Error: could not perform interactive bind for id [] authentication mechanism [GSSAPI]: error -1 (Can't contact LDAP server)
> >[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - agmt="cn=meTorep1" (rep1:389): Replication bind with GSSAPI auth failed: LDAP error -1 (Can't contact LDAP server) ()
> >[16/Jun/2015:16:24:15 -0400] - Skipping CoS Definition cn=Password Policy,cn=accounts,dc=xxx--no CoS Templates found, which should be added before the CoS Definition.
> >[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301438 (rc: 32)
> >[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301439 (rc: 32)
> >[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301440 (rc: 32)
> >[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301441 (rc: 32)
> >....
> >[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301443 (rc: 32)
> >[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301444 (rc: 32)
> >[16/Jun/2015:16:24:15 -0400] - slapd started.  Listening on All Interfaces port 389 for LDAP requests
> >[16/Jun/2015:16:24:15 -0400] - Listening on All Interfaces port 636 for LDAPS requests
> >[16/Jun/2015:16:24:15 -0400] - Listening on /var/run/slapd-CCR-BUFFALO-EDU.socket for LDAPI requests
> >[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301445 (rc: 32)
> >[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301446 (rc: 32)
> >[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301447 (rc: 32)
> >...
> >[16/Jun/2015:16:24:24 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 336362 (rc: 32)
> >[16/Jun/2015:16:24:24 -0400] slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error -1 (Can't contact LDAP server) ((null)) errno 107 (Transport endpoint is not connected)
> >[16/Jun/2015:16:24:24 -0400] slapi_ldap_bind - Error: could not perform interactive bind for id [] authentication mechanism [GSSAPI]: error -1 (Can't contact LDAP server)
> >[16/Jun/2015:16:24:24 -0400] slapi_ldap_bind - Error: could not send startTLS request: error -1 (Can't contact LDAP server) errno 107 (Transport endpoint is not connected)
> >[16/Jun/2015:16:24:24 -0400] NSMMReplicationPlugin - agmt="cn=cloneAgreement1-rep2-pki-tomcat" (rep1:389): Replication bind with SIMPLE auth resumed
> >[16/Jun/2015:16:24:25 -0400] NSMMReplicationPlugin - agmt="cn=meTorep1" (rep1:389): Replication bind with GSSAPI auth resumed
> >[16/Jun/2015:16:27:28 -0400] - Operation error fetching Null DN (0ce19ce4-146611e5-8135a170-bd40e05c), error -30993.
> >[16/Jun/2015:16:27:28 -0400] - dn2entry_ext: Failed to get id for changenumber=336746,cn=changelog from entryrdn index (-30993)
> >[16/Jun/2015:16:27:28 -0400] - Operation error fetching changenumber=336746,cn=changelog (null), error -30993.
> >[16/Jun/2015:16:27:28 -0400] DSRetroclPlugin - replog: an error occured while adding change number 336746, dn = changenumber=336746,cn=changelog: Operations error.
> >[16/Jun/2015:16:27:28 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
> >..
> >[16/Jun/2015:16:27:36 -0400] - Operation error fetching changenumber=336762,cn=changelog (null), error -30993.
> >[16/Jun/2015:16:27:36 -0400] DSRetroclPlugin - replog: an error occured while adding change number 336762, dn = changenumber=336762,cn=changelog: Operations error.
> >[16/Jun/2015:16:27:36 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
> >[16/Jun/2015:16:27:36 -0400] - ldbm_back_seq deadlock retry BAD 1601, err=0 BDB0062 Successful return: 0
> >[16/Jun/2015:16:27:36 -0400] - ldbm_back_seq deadlock retry BAD 1601, err=0 BDB0062 Successful return: 0
> >[17/Jun/2015:13:41:23 -0400] - slapd shutting down - signaling operation threads - op stack size 0 max work q size 3025 max work q stack size 5
> >[17/Jun/2015:13:41:23 -0400] - slapd shutting down - waiting for 30 threads to terminate
> >
> >
> >Assuming we had a corrupted database, we first attempted to remove the
> >replica by logging into the first master and running:
> >
> >     # ipa-replica-manage del rep2
> >
> >This process hung indefinitely, so we shut down all IPA services on
> >rep2 (systemctl stop ipa) and then re-ran the command on the first master:
> >
> >     # ipa-replica-manage -v --force del rep2
> >
> >This appeared to work, and rep2 appears to have been deleted:
> >
> >     # ipa-replica-manage list
> >     rep3: master
> >     rep1: master
> >
> >However, when querying LDAP for nsDS5ReplicationAgreement entries, we still
> >see rep2 with replica id 97 for the ipaca backend:
> >
> ># ldapsearch -Y GSSAPI -b "cn=mapping tree,cn=config" objectClass=nsDS5ReplicationAgreement -LL
> >
> >dn: cn=masterAgreement1-rep3-pki-tomcat,cn=replica,cn=ipaca,cn=mapping tree,cn=config
> >objectClass: top
> >objectClass: nsds5replicationagreement
> >cn: masterAgreement1-rep3-pki-tomcat
> >nsDS5ReplicaRoot: o=ipaca
> >nsDS5ReplicaHost: rep3
> >nsDS5ReplicaPort: 389
> >nsDS5ReplicaBindDN: cn=Replication Manager cloneAgreement1-rep3-pki-tomcat,ou=csusers,cn=config
> >nsDS5ReplicaBindMethod: Simple
> >nsDS5ReplicaTransportInfo: TLS
> >description: masterAgreement1-rep3-pki-tomcat
> >..
> >nsds50ruv: {replicageneration} 5527f74b000000600000
> >nsds50ruv: {replica 91 ldap://rep3:389} 5537c7ba0000005b
> >  0000 5582c7e40004005b0000
> >nsds50ruv: {replica 96 ldap://rep1:389} 5527f75400000060
> >  0000 5582cd19000000600000
> >nsds50ruv: {replica 97 ldap://rep2:389} 5527f76000000061
> >  0000 556f462b000400610000
> >nsruvReplicaLastModified: {replica 91 ldap://rep3:389} 0
> >  0000000
> >nsruvReplicaLastModified: {replica 96 ldap://rep1:389} 0
> >  0000000
> >nsruvReplicaLastModified: {replica 97 ldap://rep2:389} 0
> >  0000000
> >nsds5replicaLastUpdateStart: 20150619173149Z
> >nsds5replicaLastUpdateEnd: 20150619173149Z
> >nsds5replicaChangesSentSinceStartup:: OTY6MTI0LzAg
> >nsds5replicaLastUpdateStatus: 0 Replica acquired successfully: Incremental upd
> >  ate succeeded
> >nsds5replicaUpdateInProgress: FALSE
> >nsds5replicaLastInitStart: 0
> >nsds5replicaLastInitEnd: 0
> >
> >
> >Questions:
> >
> >0. Is it likely that after running out of file descriptors the dirsrv
> >slapd database on rep2 was corrupted?
> I have increased FDs to 32768, although we have a somewhat larger environment
> and the bulk of our logins occur via LDAP, which uses up more FDs. From
> looking at your numbers, I don't think 16K would be unreasonable. Remember
> to set /etc/sysconfig/dirsrv.systemd as well as the ldapmodify settings, and
> finally /etc/security/limits.conf.

Excellent, thanks for the tip. 

Regarding the ldapmodify settings, is it safe to run "systemctl restart
dirsrv@DOMAIN", or should "systemctl restart ipa" be run? I.e., can you
restart dirsrv without restarting the other IPA services?

Also, curious if you updated nsslapd-reservedescriptors as well?
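For anyone following along, the three spots mentioned can be raised roughly
like this. This is a sketch, not verified on this setup: the value 16384 and
the instance name XXXX are illustrative, and nsslapd-maxdescriptors is my
reading of "the ldapmodify settings" (nsslapd-reservedescriptors lives in the
same cn=config entry):

```shell
# 1) systemd limit for the dirsrv instance, /etc/sysconfig/dirsrv.systemd:
#      [Service]
#      LimitNOFILE=16384

# 2) the cn=config attribute, via ldapmodify as Directory Manager:
ldapmodify -x -D "cn=Directory Manager" -W <<EOF
dn: cn=config
changetype: modify
replace: nsslapd-maxdescriptors
nsslapd-maxdescriptors: 16384
EOF

# 3) PAM limits, /etc/security/limits.conf (or a drop-in under limits.d):
#      *   soft   nofile   16384
#      *   hard   nofile   16384

# Restart the instance to pick up the systemd change (XXXX = instance name).
systemctl restart dirsrv@XXXX
```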

> >
> >1. Do we have to run ipa-replica-manage del rep2 on *each* of the
> >remaining replica servers (rep1 and rep3)? Or should it just be run on
> >the first master? Do we need to run ipa-csreplica-manage del as well?
> You mentioned it was a CA -- use
> ipa-csreplica-manage and run the same "delete" options to remove it.

I ran ipa-csreplica-manage del but got an error:

'rep1' has no replication agreement for 'rep2'

It almost seems like the ipaca replication agreement between rep3 and rep2
is orphaned: it still exists in LDAP, but ipa-csreplica-manage list
doesn't show it, and I can't seem to delete it using the ipa-replica
commands.

I'm anticipating issues when trying to add my failed replica back in
without resolving this.
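If the o=ipaca metadata for replica id 97 really is orphaned, one avenue is
the generic 389-ds CLEANALLRUV task. This is a hedged sketch, not an
IPA-specific tool; double-check the replica ID against the RUV output before
running anything:

```shell
# On a surviving master, launch a CLEANALLRUV task to scrub the stale
# replica id 97 (rep2's ipaca RID, per the nsds50ruv output above) from
# the o=ipaca backend.
ldapmodify -x -D "cn=Directory Manager" -W <<EOF
dn: cn=clean 97,cn=cleanallruv,cn=tasks,cn=config
objectclass: extensibleObject
cn: clean 97
replica-base-dn: o=ipaca
replica-id: 97
EOF

# Watch the task entry until it disappears or reports completion:
ldapsearch -x -D "cn=Directory Manager" -W \
  -b "cn=cleanallruv,cn=tasks,cn=config" -s one
```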


> >
> >2. Why does the rep2 server still appear when querying the
> >nsDS5ReplicationAgreement in ldap? Is this benign or will this pose problems
> >when we go to add rep2 back in?
> See above
> >3. What steps/commands can we take to verify rep2 was successfully removed and
> >replication is behaving normally?
> After above - it should be gone

No luck. ipa-csreplica-manage list shows only 2 masters (rep3 and
rep1), but ldapsearch still shows references to the deleted replica
rep2.
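As a sanity check before re-adding rep2 (a hedged sketch; the CSN below is
copied from the ldapsearch output earlier in the thread): a 389-ds CSN is
timestamp (8 hex digits) + sequence number (4) + replica id (4) + subsequence
(4), so you can both confirm which replica id is stale and see when rep2 last
wrote a change:

```shell
# On a surviving master, look for any agreement still naming rep2, e.g.:
#   ldapsearch -Y GSSAPI -b "cn=mapping tree,cn=config" \
#     "(&(objectClass=nsDS5ReplicationAgreement)(nsDS5ReplicaHost=rep2))" dn

# Decode rep2's max CSN from the RUV shown earlier.
csn="556f462b000400610000"
printf 'replica id: %d\n' "$((16#${csn:12:4}))"   # prints: replica id: 97
date -u -d "@$((16#${csn:0:8}))"                  # rep2's last recorded change
```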


> >
> >We had tuned our servers according to the RHEL Performance Tuning
> >Guide:
> >
> >    # cat /proc/sys/fs/file-max
> >    6534908
> >
> >    # cat /etc/security/limits.d/40-freeipa.conf
> >    *         soft      nofile      8192
> >    *         hard      nofile      8192
> >
> >    # cat /etc/sysctl.d/40-freeipa.conf
> >    net.ipv4.ip_local_port_range = 1024 65000
> >    net.ipv4.tcp_keepalive_time = 300
> >
> >    # cat /etc/sysconfig/dirsrv.systemd
> >    [Service]
> >    LimitNOFILE=8192
> 
> Do it.  Also, have you run lsof against the PID of ns-slapd to see what the
> connections are?

$ lsof -a -p 2868 | wc -l
586

$ cat /proc/sys/fs/file-nr
3232    0   6534908

$ ls -alh /proc/2868/fd | wc -l
466

$ ldapsearch -b "cn=monitor" "(objectclass=*)"
currentconnections: 339
totalconnections: 51585
currentconnectionsatmaxthreads: 0
maxthreadsperconnhits: 485
dtablesize: 8192
readwaiters: 0
opsinitiated: 2309677
opscompleted: 2309676
entriessent: 1526462
bytessent: 24606941647
currenttime: 20150620044615Z
starttime: 20150618141112Z
nbackends: 3
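For ongoing monitoring, here is a small sketch (assumes Linux /proc and a
single ns-slapd instance; everything here is illustrative) that compares the
process's open FDs against its soft limit:

```shell
# Count open FDs for a PID, and read its soft "open files" limit from /proc.
fd_count() { ls /proc/"$1"/fd 2>/dev/null | wc -l; }
fd_limit() { awk '/Max open files/ {print $4}' /proc/"$1"/limits 2>/dev/null; }

pid=$(pidof ns-slapd)    # assumes one dirsrv instance on the box
echo "ns-slapd ($pid): $(fd_count "$pid") of $(fd_limit "$pid") FDs in use"
```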


> This is a very simple thing to increase.
> 
> You mention all your logins are via ipa-client-install, which implies
> Kerberos. However, I wonder if this is the case. Check the lsof output, but
> I would also consider running logconv.pl against your access logs in
> /var/log/dirsrv/slapd-INSTANCE to really see what is going on behind the
> scenes.

Thanks again for the tip. I'll check out logconv.pl.
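For what it's worth, a hedged usage sketch (logconv.pl ships with
389-ds-base; the flags vary by version, so check logconv.pl -h):

```shell
# Summarize binds, top clients, and operation mix from the access logs.
logconv.pl /var/log/dirsrv/slapd-XXXX/access*
```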

> 
> ~J
> 
> -- 
> Manage your subscription for the Freeipa-users mailing list:
> https://www.redhat.com/mailman/listinfo/freeipa-users
> Go to http://freeipa.org for more info on the project
> 
> 



