[Freeipa-users] ipa replica failure

Rob Crittenden rcritten at redhat.com
Fri Jun 19 19:18:50 UTC 2015


Rich Megginson wrote:
> On 06/19/2015 12:22 PM, Andrew E. Bruno wrote:
>> Hello,
>>
>> First time troubleshooting an IPA server failure; we're looking for
>> some guidance on how best to proceed.
>>
>> First some background on our setup:
>>
>> Servers are running freeipa v4.1.0 on CentOS 7.1.1503:
>>
>> - ipa-server-4.1.0-18.el7.centos.3.x86_64
>> - 389-ds-base-1.3.3.1-16.el7_1.x86_64
>>
>> 3 IPA servers: 1 first master (rep1) and 2 replicas (rep2, rep3). The
>> replicas were set up as CAs (i.e., ipa-replica-install --setup-ca ...).
>>
>> We have ~3000 user accounts (~1000 active, the rest disabled) and ~700
>> hosts enrolled (all installed using ipa-client-install and running
>> sssd). Client hosts are a mix of CentOS 7 and CentOS 6.5.
>>
>>
>> We recently discovered that one of our replica servers (rep2) was not
>> responding. A quick check of the dirsrv error log
>> /var/log/dirsrv/slapd-XXXX/errors showed (sanitized):
>>
>>      PR_Accept() failed, Netscape Portable Runtime error (Process open
>>      FD table is full.)
>>      ...
>>
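As an aside for anyone hitting the same thing: a quick way to see how
close a running ns-slapd is to its descriptor limit is to compare the
process limit against the current fd count via /proc. A generic sketch,
assuming a single dirsrv instance on the box:

     # PID=$(pidof ns-slapd)
     # grep "open files" /proc/$PID/limits
     # ls /proc/$PID/fd | wc -l

If the second number is close to the limit, PR_Accept() failures like
the above are what you'd expect to see next.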
>> The server was rebooted, and after coming back up it had these errors
>> in the logs:
>>
>>          389-Directory/1.3.3.1 B2015.118.1941
>>          replica2:636 (/etc/dirsrv/slapd-XXXX)
>>
>> [16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region
>> error detected; run recovery
>> [16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to trickle,
>> err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
>> [16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region
>> error detected; run recovery
>> [16/Jun/2015:10:12:33 -0400] - Serious Error---Failed in deadlock
>> detect (aborted at 0x0), err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal
>> error, run database recovery)
>> [16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region
>> error detected; run recovery
>> [16/Jun/2015:10:12:33 -0400] - Serious Error---Failed in deadlock
>> detect (aborted at 0x0), err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal
>> error, run database recovery)
>> [16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region
>> error detected; run recovery
>> [16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to checkpoint
>> database, err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run
>> database recovery)
>> [16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region
>> error detected; run recovery
>> [16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to checkpoint
>> database, err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run
>> database recovery)
>> [16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region
>> error detected; run recovery
>> [16/Jun/2015:10:12:33 -0400] - checkpoint_threadmain: log archive
>> failed - BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
>> (-30973)
>> ....
>> [16/Jun/2015:16:24:04 -0400] - 389-Directory/1.3.3.1 B2015.118.1941
>> starting up
>> [16/Jun/2015:16:24:04 -0400] - Detected Disorderly Shutdown last time
>> Directory Server was running, recovering database.
>> ...
>> [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin -
>> replica_check_for_data_reload: Warning: disordely shutdown for replica
>> dc=XXX. Check if DB RUV needs to be updated
>> [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of
>> database RUV (from CL RUV) ->  55770068000300030000
>> [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of
>> database RUV (from CL RUV) ->  556f4632001400040000
>> [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of
>> database RUV (from CL RUV) ->  556f4631004d00050000
>> [16/Jun/2015:16:24:15 -0400] slapi_ldap_bind - Error: could not send
>> startTLS request: error -1 (Can't contact LDAP server) errno 111
>> (Connection refused)
>> [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin -
>> agmt="cn=cloneAgreement1-rep2 (rep1:389): Replication bind with SIMPLE
>> auth failed: LDAP error -1 (Can't contact LDAP server) ()
>> [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin -
>> replica_check_for_data_reload: Warning: disordely shutdown for replica
>> o=ipaca. Check if DB RUV needs to be updated
>> [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of
>> database RUV (from CL RUV) ->  556f46290005005b0000
>> [16/Jun/2015:16:24:15 -0400] set_krb5_creds - Could not get initial
>> credentials for principal [ldap/rep2] in keytab
>> [FILE:/etc/dirsrv/ds.keytab]: -1765328228 (Cannot contact any KDC for
>> requested realm)
>> [16/Jun/2015:16:24:15 -0400] slapd_ldap_sasl_interactive_bind - Error:
>> could not perform interactive bind for id [] mech [GSSAPI]: LDAP error
>> -1 (Can't contact LDAP server) ((null)) errno 111 (Connection refused)
>> [16/Jun/2015:16:24:15 -0400] slapi_ldap_bind - Error: could not
>> perform interactive bind for id [] authentication mechanism [GSSAPI]:
>> error -1 (Can't contact LDAP server)
>> [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin -
>> agmt="cn=meTorep1" (rep1:389): Replication bind with GSSAPI auth
>> failed: LDAP error -1 (Can't contact LDAP server) ()
>> [16/Jun/2015:16:24:15 -0400] - Skipping CoS Definition cn=Password
>> Policy,cn=accounts,dc=xxx--no CoS Templates found, which should be
>> added before the CoS Definition.
>> [16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord:
>> could not delete change record 301438 (rc: 32)
>> [16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord:
>> could not delete change record 301439 (rc: 32)
>> [16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord:
>> could not delete change record 301440 (rc: 32)
>> [16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord:
>> could not delete change record 301441 (rc: 32)
>> ....
>> [16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord:
>> could not delete change record 301443 (rc: 32)
>> [16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord:
>> could not delete change record 301444 (rc: 32)
>> [16/Jun/2015:16:24:15 -0400] - slapd started.  Listening on All
>> Interfaces port 389 for LDAP requests
>> [16/Jun/2015:16:24:15 -0400] - Listening on All Interfaces port 636
>> for LDAPS requests
>> [16/Jun/2015:16:24:15 -0400] - Listening on
>> /var/run/slapd-CCR-BUFFALO-EDU.socket for LDAPI requests
>> [16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord:
>> could not delete change record 301445 (rc: 32)
>> [16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord:
>> could not delete change record 301446 (rc: 32)
>> [16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord:
>> could not delete change record 301447 (rc: 32)
>> ...
>> [16/Jun/2015:16:24:24 -0400] DSRetroclPlugin - delete_changerecord:
>> could not delete change record 336362 (rc: 32)
>> [16/Jun/2015:16:24:24 -0400] slapd_ldap_sasl_interactive_bind - Error:
>> could not perform interactive bind for id [] mech [GSSAPI]: LDAP error
>> -1 (Can't contact LDAP server) ((null)) errno 107 (Transport endpoint
>> is not connected)
>> [16/Jun/2015:16:24:24 -0400] slapi_ldap_bind - Error: could not
>> perform interactive bind for id [] authentication mechanism [GSSAPI]:
>> error -1 (Can't contact LDAP server)
>> [16/Jun/2015:16:24:24 -0400] slapi_ldap_bind - Error: could not send
>> startTLS request: error -1 (Can't contact LDAP server) errno 107
>> (Transport endpoint is not connected)
>> [16/Jun/2015:16:24:24 -0400] NSMMReplicationPlugin -
>> agmt="cn=cloneAgreement1-rep2-pki-tomcat" (rep1:389): Replication bind
>> with SIMPLE auth resumed
>> [16/Jun/2015:16:24:25 -0400] NSMMReplicationPlugin -
>> agmt="cn=meTorep1" (rep1:389): Replication bind with GSSAPI auth resumed
>> [16/Jun/2015:16:27:28 -0400] - Operation error fetching Null DN
>> (0ce19ce4-146611e5-8135a170-bd40e05c), error -30993.
>> [16/Jun/2015:16:27:28 -0400] - dn2entry_ext: Failed to get id for
>> changenumber=336746,cn=changelog from entryrdn index (-30993)
>> [16/Jun/2015:16:27:28 -0400] - Operation error fetching
>> changenumber=336746,cn=changelog (null), error -30993.
>> [16/Jun/2015:16:27:28 -0400] DSRetroclPlugin - replog: an error
>> occured while adding change number 336746, dn =
>> changenumber=336746,cn=changelog: Operations error.
>> [16/Jun/2015:16:27:28 -0400] retrocl-plugin - retrocl_postob:
>> operation failure [1]
>> ..
>> [16/Jun/2015:16:27:36 -0400] - Operation error fetching
>> changenumber=336762,cn=changelog (null), error -30993.
>> [16/Jun/2015:16:27:36 -0400] DSRetroclPlugin - replog: an error
>> occured while adding change number 336762, dn =
>> changenumber=336762,cn=changelog: Operations error.
>> [16/Jun/2015:16:27:36 -0400] retrocl-plugin - retrocl_postob:
>> operation failure [1]
>> [16/Jun/2015:16:27:36 -0400] - ldbm_back_seq deadlock retry BAD 1601,
>> err=0 BDB0062 Successful return: 0
>> [16/Jun/2015:16:27:36 -0400] - ldbm_back_seq deadlock retry BAD 1601,
>> err=0 BDB0062 Successful return: 0
>> [17/Jun/2015:13:41:23 -0400] - slapd shutting down - signaling
>> operation threads - op stack size 0 max work q size 3025 max work q
>> stack size 5
>> [17/Jun/2015:13:41:23 -0400] - slapd shutting down - waiting for 30
>> threads to terminate
>>
>>
>> Assuming we had a corrupted database, we first attempted to remove the
>> replica by logging into the first master and running:
>>
>>      # ipa-replica-manage del rep2
>>
>> This process hung indefinitely, so we proceeded to shut down all IPA
>> services on rep2 (systemctl stop ipa) and then re-ran the command on
>> the first master:
>>
>>      # ipa-replica-manage -v --force del rep2
>>
>> This appeared to work, and rep2 seems to have been deleted:
>>
>>      # ipa-replica-manage list
>>      rep3: master
>>      rep1: master
>>
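Note that ipa-replica-manage list also accepts a master hostname and
then shows that server's own replication agreements, which is a handy
cross-check that nothing still points at rep2 (hostnames here follow
the sanitized names used in this thread):

     # ipa-replica-manage list rep1
     # ipa-replica-manage list rep3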
>> However, when querying LDAP for nsDS5ReplicationAgreement entries we
>> still see rep2 with a replica id of 97 for the ipaca suffix:
>>
>> # ldapsearch -Y GSSAPI -b "cn=mapping tree,cn=config"
>> objectClass=nsDS5ReplicationAgreement -LL
>>
>> dn: cn=masterAgreement1-rep3-pki-tomcat,cn=replica,cn=ipaca,cn=mapping
>> tree,cn=config
>> objectClass: top
>> objectClass: nsds5replicationagreement
>> cn: masterAgreement1-rep3-pki-tomcat
>> nsDS5ReplicaRoot: o=ipaca
>> nsDS5ReplicaHost: rep3
>> nsDS5ReplicaPort: 389
>> nsDS5ReplicaBindDN: cn=Replication Manager
>> cloneAgreement1-rep3-pki-tomcat,ou=csusers,cn=config
>> nsDS5ReplicaBindMethod: Simple
>> nsDS5ReplicaTransportInfo: TLS
>> description: masterAgreement1-rep3-pki-tomcat
>> ..
>> nsds50ruv: {replicageneration} 5527f74b000000600000
>> nsds50ruv: {replica 91 ldap://rep3:389} 5537c7ba0000005b
>>   0000 5582c7e40004005b0000
>> nsds50ruv: {replica 96 ldap://rep1:389} 5527f75400000060
>>   0000 5582cd19000000600000
>> nsds50ruv: {replica 97 ldap://rep2:389} 5527f76000000061
>>   0000 556f462b000400610000
>> nsruvReplicaLastModified: {replica 91 ldap://rep3:389} 0
>>   0000000
>> nsruvReplicaLastModified: {replica 96 ldap://rep1:389} 0
>>   0000000
>> nsruvReplicaLastModified: {replica 97 ldap://rep2:389} 0
>>   0000000
>> nsds5replicaLastUpdateStart: 20150619173149Z
>> nsds5replicaLastUpdateEnd: 20150619173149Z
>> nsds5replicaChangesSentSinceStartup:: OTY6MTI0LzAg
>> nsds5replicaLastUpdateStatus: 0 Replica acquired successfully:
>> Incremental upd
>>   ate succeeded
>> nsds5replicaUpdateInProgress: FALSE
>> nsds5replicaLastInitStart: 0
>> nsds5replicaLastInitEnd: 0
>>
>>
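Side note: the stale "{replica 97 ldap://rep2:389}" element above will
linger in the o=ipaca RUV even after the agreement itself is deleted.
389-ds provides a CLEANALLRUV task for exactly this; a minimal sketch,
assuming replica id 97 and the o=ipaca backend shown above:

     # ldapmodify -D "cn=directory manager" -W <<EOF
     dn: cn=clean 97,cn=cleanallruv,cn=tasks,cn=config
     changetype: add
     objectclass: extensibleObject
     replica-base-dn: o=ipaca
     replica-id: 97
     replica-force-cleaning: no
     EOF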
>> Questions:
>>
>> 0. Is it likely that after running out of file descriptors the dirsrv
>> slapd database on rep2 was corrupted?
>
> That would appear to be the case based on correlation of events,
> although I've never seen that happen, and it is not supposed to happen.
>
>>
>> 1. Do we have to run ipa-replica-manage del rep2 on *each* of the
>> remaining replica servers (rep1 and rep3)? Or should it just be run on
>> the first master?
>
> I believe it should only be run on the first master, but it hung, so
> something is not right, and I'm not sure how to remedy the situation.

How long did it hang, and where?

>> Do we need to run ipa-csreplica-manage del as well?
>>
>> 2. Why does the rep2 server still appear when querying the
>> nsDS5ReplicationAgreement entries in LDAP? Is this benign, or will it
>> pose problems when we go to add rep2 back in?
>
> You should remove it.

And ipa-csreplica-manage is the tool to do it.
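For reference, the invocation mirrors ipa-replica-manage and will prompt
for the Directory Manager password:

     # ipa-csreplica-manage del rep2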

>>
>> 3. What steps/commands can we take to verify that rep2 was
>> successfully removed and replication is behaving normally?

The ldapsearch you already performed will confirm that the CA agreement 
has been removed.
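For example, re-running it scoped to the rep2 host should come back
empty once the agreement is gone (standard LDAP filter syntax, same
sanitized hostname as above):

     # ldapsearch -Y GSSAPI -b "cn=mapping tree,cn=config" \
         "(&(objectClass=nsDS5ReplicationAgreement)(nsDS5ReplicaHost=rep2))" -LL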


>>
>> We had tuned our servers according to the RHEL Performance Tuning
>> Guide:
>>
>>     # cat /proc/sys/fs/file-max
>>     6534908
>>
>>     # cat /etc/security/limits.d/40-freeipa.conf
>>     *         soft      nofile      8192
>>     *         hard      nofile      8192
>>
>>     # cat /etc/sysctl.d/40-freeipa.conf
>>     net.ipv4.ip_local_port_range = 1024 65000
>>     net.ipv4.tcp_keepalive_time = 300
>>
>>     # cat /etc/sysconfig/dirsrv.systemd
>>     [Service]
>>     LimitNOFILE=8192
>>
>> We're now considering increasing the nofile limit to something larger
>> to prevent running out of file descriptors. Any guidance on what
>> number to set it to?
>
> 8192 is extremely high.  The fact that you ran out of file descriptors
> at 8192 seems like a bug/fd leak somewhere.  I suppose you could, as a
> very temporary workaround, set the fd limit higher, but that is no
> guarantee that you won't run out again.
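If you do raise it temporarily, the knobs are the ones already shown
above -- LimitNOFILE in /etc/sysconfig/dirsrv.systemd and the nofile
lines in /etc/security/limits.d/40-freeipa.conf -- followed by a reload
and restart (16384 here is just an arbitrary example value):

     # sed -i 's/^LimitNOFILE=.*/LimitNOFILE=16384/' /etc/sysconfig/dirsrv.systemd
     # systemctl daemon-reload
     # systemctl restart dirsrv.target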
>
> Please file at least 1 ticket e.g. "database corrupted when server ran
> out of file descriptors", with as much information about that particular
> problem as you can provide.
>
>>
>> Many thanks in advance for any help.
>>
>> --Andrew
>>
>



