[Freeipa-users] ns-slapd high cpu usage

Thu Jul 16 08:32:59 UTC 2015

Thank you for the data, I think I understand now what is going on.

In the error logs we see only message like (from my test env):

[16/Jul/2015:10:12:40 +0200] NSMMReplicationPlugin - agmt="cn=100-300" 
(localhost:9759): replay_update: modifys operation 
(dn="dc=example,dc=com" csn=55a82a29000100640000) not sent - empty
[16/Jul/2015:10:12:40 +0200] NSMMReplicationPlugin - agmt="cn=100-300" 
(localhost:9759): replay_update: Consumer successfully sent operation 
with csn 55a82a29000100640000
[16/Jul/2015:10:12:40 +0200] NSMMReplicationPlugin - agmt="cn=100-300" 
(localhost:9759): Skipping update operation with no message_id (uniqueid 
7507cb26-e8ac11e2-b2898005-8430f734, CSN 55a82a29000100640000):

This happens if fractional replication is configured as IPA does and the 
changes affect only attributes which will NOT be replicated. So teh 
local RUV will be updated, but since no change is really sent, the 
consumer RUV is not updated and replciation will always set off from an 
very old starting csn. It is a rare scenario where a server receives 
only mods which are not replicated.

I have opened a ticket for this: https://fedorahosted.org/389/ticket/48225

As  a workaround can you try to apply a mod on m14-26 which will not be 
stripped, either create a dummy user or add a description attribute to 
an existing object. Repliciation will once again iterate thru all the 
changes (which can take a while), but then should replay this latest 
change and define a new offset

Regards,
Ludwig

On 07/15/2015 07:05 PM, Andrew E. Bruno wrote:
> On Wed, Jul 15, 2015 at 04:58:23PM +0200, Ludwig Krispenz wrote:
>> On 07/15/2015 04:10 PM, Andrew E. Bruno wrote:
>>> On Wed, Jul 15, 2015 at 03:22:51PM +0200, Ludwig Krispenz wrote:
>>>> On 07/14/2015 08:59 PM, Andrew E. Bruno wrote:
>>>>> On Tue, Jul 14, 2015 at 04:52:10PM +0200, Ludwig Krispenz wrote:
>>>>>> hm, the stack traces show csn_str, which correspond to Jul,8th, Jul,4th, and
>>>>>> Jul,7th - so it looks like it is iterating the changelog over and over
>>>>>> again.
>>>>>> Th consumer side Is "cn=meTosrv-m14-24.ccr.buffalo.edu" - is this the master
>>>>>> ?
>>>>>>
>>>>>> can you provide the result of the following search from
>>>>>> m14-24.ccr.buffalo.edu adn the server with the high cpu:
>>>>>>
>>>>>> ldapsearch -o ldif-wrap=no -x -D ... -w  -b "cn=config"
>>>>>> "objectclass=nsds5replica" nsds50ruv
>>>>> master is srv-m14-24.. here's the results of the ldapsearch:
>>>>>
>>>>> [srv-m14-24 ~]$ ldapsearch -o ldif-wrap=no -x -D "cn=directory manager" -W  -b "cn=config" "objectclass=nsds5replica" nsds50ruv
>>>>>
>>>>> # replica, dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu, mapping tree, config
>>>>> dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping tree,cn=config
>>>>> nsds50ruv: {replicageneration} 5527f711000000040000
>>>>> nsds50ruv: {replica 4 ldap://srv-m14-24.ccr.buffalo.edu:389} 5527f771000000040000 55a55aed001000040000
>>>>> nsds50ruv: {replica 5 ldap://srv-m14-26.ccr.buffalo.edu:389} 5537c773000000050000 5591a3d2000700050000
>>>>> nsds50ruv: {replica 6 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 55943dda000000060000 55945378000200060000
>>>> so this is really strange, the master m14-24 has the latest change from
>>>> replica 5(m14-26) as: 5591a3d2000700050000
>>>> which corresponds to Mon, 29 Jun 2015 20:00:18 GMT
>>>> so no update from 14-24 since that did arrive, or could not update the ruv.
>>>> So m14-26 tries to replicate all the changes back from that time, but looks
>>>> like iit has no success.
>>>> is there anything in the logs of m14-24 ? can you see successful mods with
>>>> csn=xxxxxxx00050000 ?
>>> Here's what I could find from the logs on srv-m14-24:
>>>
>>>
>>> [srv-m14-24 ~]# grep -r 00050000 /var/log/dirsrv/slapd-[domain]/*
>>> access.20150714-014346:[14/Jul/2015:03:10:05 -0400] conn=748529 op=14732 RESULT err=0 tag=103 nentries=0 etime=1 csn=55a4b5f0005000040000
>> ok, so no update originating at replica 5 has been replicated (probably
>> since June,29) did you experience data inconsistency between the servers ?
>>>
>>> And here's the last few lines the error log on srv-m14-24:
>> one set of messages refers to the o=ipaca backend and seem to be transient,
>> replication continues later.
>> the other set of msg "No original tombstone .." is annoying (and it is fixed
>> in ticket https://fedorahosted.org/389/ticket/47912)
>>
>> the next thing we can do to try to understand what is going on is to enable
>> replication logging on m14-26, it will then not only consume all cpu, but
>> write tons of messages to the error log.
>> But it can be turned on and off:
>>
>> ldapmodify ...
>> dn: cn=config
>> replace: nsslapd-errorlog-level
>> nsslapd-errorlog-level: 8192
>>
>> and let it run for a while, then set it back to: 0
> I enabled replication logging and it's running now. I noticed the
> default value for nsslapd-errorlog-level was set to 16384 (not 0).
>
> OK to send you the logs off list? Looks like they contain quite a bit of
> sensitive data.
>
> Thanks again for all the help looking into this.
>
> Best,
>
> --Andrew
>
>
>
>>> [12/Jul/2015:10:11:14 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2456070,cn=changelog!!
>>> [12/Jul/2015:10:11:48 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2498441,cn=changelog!!
>>> [13/Jul/2015:07:41:49 -0400] NSMMReplicationPlugin - agmt="cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat" (srv-m14-26:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a3a406000000600000): Operations error (1). Will retry later.
>>> [13/Jul/2015:11:56:50 -0400] NSMMReplicationPlugin - agmt="cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat" (srv-m14-26:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a3dfca000000600000): Operations error (1). Will retry later.
>>> [13/Jul/2015:14:26:50 -0400] NSMMReplicationPlugin - agmt="cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat" (srv-m14-26:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a402f2000000600000): Operations error (1). Will retry later.
>>> [13/Jul/2015:15:26:49 -0400] NSMMReplicationPlugin - agmt="cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat" (srv-m14-26:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a41102000000600000): Operations error (1). Will retry later.
>>> [13/Jul/2015:18:26:51 -0400] NSMMReplicationPlugin - agmt="cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat" (srv-m14-26:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a43b32000000600000): Operations error (1). Will retry later.
>>> [13/Jul/2015:18:56:51 -0400] NSMMReplicationPlugin - agmt="cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat" (srv-m14-26:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a4423a000000600000): Operations error (1). Will retry later.
>>> [13/Jul/2015:20:41:51 -0400] NSMMReplicationPlugin - agmt="cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat" (srv-m14-26:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a45ad6000000600000): Operations error (1). Will retry later.
>>> [13/Jul/2015:22:41:51 -0400] NSMMReplicationPlugin - agmt="cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat" (srv-m14-26:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a476f6000000600000): Operations error (1). Will retry later.
>>> [14/Jul/2015:06:56:51 -0400] NSMMReplicationPlugin - agmt="cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat" (srv-m14-26:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a4eafa000000600000): Operations error (1). Will retry later.
>>> [14/Jul/2015:09:56:52 -0400] NSMMReplicationPlugin - agmt="cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat" (srv-m14-26:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a5152a000000600000): Operations error (1). Will retry later.
>>> [14/Jul/2015:10:11:21 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2552223,cn=changelog!!
>>> [14/Jul/2015:10:11:21 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2552224,cn=changelog!!
>>> [14/Jul/2015:10:11:25 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2557315,cn=changelog!!
>>> [14/Jul/2015:10:11:25 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2557318,cn=changelog!!
>>> [14/Jul/2015:10:11:28 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2561020,cn=changelog!!
>>> [14/Jul/2015:10:11:28 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2561043,cn=changelog!!
>>> [14/Jul/2015:10:11:48 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2586022,cn=changelog!!
>>> [14/Jul/2015:10:11:59 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2598989,cn=changelog!!
>>> [14/Jul/2015:10:11:59 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2598990,cn=changelog!!
>>> [14/Jul/2015:10:12:01 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2600966,cn=changelog!!
>>> [14/Jul/2015:10:12:03 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2604037,cn=changelog!!
>>> [14/Jul/2015:10:12:04 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2604054,cn=changelog!!
>>> [14/Jul/2015:10:12:24 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2629803,cn=changelog!!
>>> [14/Jul/2015:10:12:24 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2629804,cn=changelog!!
>>>
>>>
>>>
>>>
>>>
>>>>> # replica, o\3Dipaca, mapping tree, config
>>>>> dn: cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config
>>>>> nsds50ruv: {replicageneration} 5527f74b000000600000
>>>>> nsds50ruv: {replica 96 ldap://srv-m14-24.ccr.buffalo.edu:389} 5527f754000000600000 55a557f6000000600000
>>>>> nsds50ruv: {replica 86 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 55943e6e000000560000 55943e6f000100560000
>>>>> nsds50ruv: {replica 91 ldap://srv-m14-26.ccr.buffalo.edu:389} 5537c7ba0000005b0000 5582c7e40004005b0000
>>>>>
>>>>>
>>>>> server with high cpu load is srv-m14-26. here's the results of the ldapsearch
>>>> >from this server:
>>>>> [srv-m14-26 ~]$ ldapsearch -o ldif-wrap=no -x -D "cn=directory manager" -W  -b "cn=config" "objectclass=nsds5replica" nsds50ruv
>>>>>
>>>>> # replica, dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu, mapping tree, config
>>>>> dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping tree,cn=config
>>>>> nsds50ruv: {replicageneration} 5527f711000000040000
>>>>> nsds50ruv: {replica 5 ldap://srv-m14-26.ccr.buffalo.edu:389} 5537c773000000050000 55a55b47000300050000
>>>>> nsds50ruv: {replica 4 ldap://srv-m14-24.ccr.buffalo.edu:389} 5527f771000000040000 55a53eb0000a00040000
>>>>> nsds50ruv: {replica 6 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 55943dda000000060000 55945378000200060000
>>>>>
>>>>> # replica, o\3Dipaca, mapping tree, config
>>>>> dn: cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config
>>>>> nsds50ruv: {replicageneration} 5527f74b000000600000
>>>>> nsds50ruv: {replica 91 ldap://srv-m14-26.ccr.buffalo.edu:389} 5537c7ba0000005b0000 5582c7e40004005b0000
>>>>> nsds50ruv: {replica 96 ldap://srv-m14-24.ccr.buffalo.edu:389} 5527f754000000600000 55a557f6000000600000
>>>>> nsds50ruv: {replica 86 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 55943e6e000000560000 55943e6f000100560000
>>>>>
>>>>>
>>>>> srv-m14-25-02 is our 3rd replicate which we recently added back in after it
>>>>> failed (was added back in 7/1).
>>>>>
>>>>> Let me know if you need anything else. Thanks for the help.
>>>>>
>>>>> --Andrew
>>>>>
>>>>>> On 07/14/2015 02:35 PM, Andrew E. Bruno wrote:
>>>>>>> On Tue, Jul 14, 2015 at 01:41:57PM +0200, Ludwig Krispenz wrote:
>>>>>>>> On 07/13/2015 06:36 PM, Andrew E. Bruno wrote:
>>>>>>>>> On Mon, Jul 13, 2015 at 05:29:13PM +0200, Ludwig Krispenz wrote:
>>>>>>>>>> On 07/13/2015 05:05 PM, Andrew E. Bruno wrote:
>>>>>>>>>>> On Mon, Jul 13, 2015 at 04:58:46PM +0200, Ludwig Krispenz wrote:
>>>>>>>>>>>> can you get a pstack of the slapd process along with a top -H to find th
>>>>>>>>>>>> ethread with high cpu usage
>>>>>>>>>>> Attached is the full stacktrace of the running ns-slapd proccess. top -H
>>>>>>>>>>> shows this thread (2879) with high cpu usage:
>>>>>>>>>>>
>>>>>>>>>>> 2879 dirsrv    20   0 3819252 1.962g  11680 R 99.9  3.1   8822:10 ns-slapd
>>>>>>>>>> this thread is a replication thread sending updates, what is strange is that
>>>>>>>>>> the current csn_str is quite old (july, 7th), I can't tell which agreeement
>>>>>>>>>> this thread is handling, but looks like it is heavily reading the changeglog
>>>>>>>>>> and sending updates. anything changed recently in replication setup ?
>>>>>>>>> Yes, we had one replica fail on (6/19) which we removed (not this one
>>>>>>>>> showing high CPU load). Had to perform some manual cleanup of the ipa-ca
>>>>>>>>> RUVs. Then we added the replica back in on 7/1. Since then, replication
>>>>>>>>> appears to have been running normally between the 3 replicas. We've been
>>>>>>>>> monitoring utilization since 7/1 and only recently seen this spike (past
>>>>>>>>> 24 hours or so).
>>>>>>>> is it still in this state ? or was it a spike.
>>>>>>> Yes same state.
>>>>>>>
>>>>>>>> if it still is high cpu consuming, could you
>>>>>>>> - get a few pstack like the one before with some time in between, I would
>>>>>>>> like to see if it is progressing with the csns or looping on the same one
>>>>>>> Attached are a few stacktraces. The thread pegging the cpu is:
>>>>>>>
>>>>>>> PID  USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+  COMMAND
>>>>>>> 2879 dirsrv    20   0 3819252 1.978g  11684 R  99.9  3.2  10148:26  ns-slapd
>>>>>>>
>>>>>>>> - check the consumer side. is there anything in the error log ? does the
>>>>>>>> access log show replication activity from this server
>>>>>>> Here's some errors showing up on the first master server rep1 (rep2 is the
>>>>>>> server with pegged cpu):
>>>>>>>
>>>>>>> [13/Jul/2015:20:41:51 -0400] NSMMReplicationPlugin - agmt="cn=masterAgreement1-rep2-pki-tomcat" (rep2:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a45ad6000000600000): Operations error (1). Will retry later.
>>>>>>> [13/Jul/2015:22:41:51 -0400] NSMMReplicationPlugin - agmt="cn=masterAgreement1-rep2-pki-tomcat" (rep2:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a476f6000000600000): Operations error (1). Will retry later.
>>>>>>> [14/Jul/2015:06:56:51 -0400] NSMMReplicationPlugin - agmt="cn=masterAgreement1-rep2-tomcat" (rep2:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a4eafa000000600000): Operations error (1). Will retry later.
>>>>>>>
>>>>>>>
>>>>>>> Here's some snips from the access log of the rep2:
>>>>>>>
>>>>>>>
>>>>>>> [14/Jul/2015:08:22:31 -0400] conn=777787 op=9794 EXT oid="2.16.840.1.113730.3.5.5" name="Netscape Replication End Session"
>>>>>>> [14/Jul/2015:08:22:31 -0400] conn=777787 op=9794 RESULT err=0 tag=120 nentries=0 etime=0
>>>>>>> [14/Jul/2015:08:22:31 -0400] conn=777787 op=9795 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
>>>>>>> [14/Jul/2015:08:22:31 -0400] conn=777787 op=9795 RESULT err=0 tag=120 nentries=0 etime=0
>>>>>>> [14/Jul/2015:08:22:33 -0400] conn=777787 op=9796 EXT oid="2.16.840.1.113730.3.5.5" name="Netscape Replication End Session"
>>>>>>> ..
>>>>>>> [14/Jul/2015:08:23:38 -0400] conn=782341 op=129 RESULT err=0 tag=103 nentries=0 etime=0 csn=55a4ff6c000000050000
>>>>>>> ..
>>>>>>> [14/Jul/2015:08:24:02 -0400] conn=781901 op=1745 RESULT err=0 tag=101 nentries=1 etime=0
>>>>>>> [14/Jul/2015:08:24:03 -0400] conn=777787 op=9810 EXT oid="2.16.840.1.113730.3.5.5" name="Netscape Replication End Session"
>>>>>>> [14/Jul/2015:08:24:03 -0400] conn=777787 op=9810 RESULT err=0 tag=120 nentries=0 etime=0
>>>>>>> [14/Jul/2015:08:24:03 -0400] conn=777787 op=9811 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
>>>>>>> [14/Jul/2015:08:24:03 -0400] conn=777787 op=9811 RESULT err=0 tag=120 nentries=0 etime=0
>>>>>>> [14/Jul/2015:08:24:05 -0400] conn=777787 op=9812 EXT oid="2.16.840.1.113730.3.5.5" name="Netscape Replication End Session"
>>>>>>> [14/Jul/2015:08:24:05 -0400] conn=777787 op=9812 RESULT err=0 tag=120 nentries=0 etime=0
>>>>>>> [14/Jul/2015:08:24:08 -0400] conn=777787 op=9813 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
>>>>>>> [14/Jul/2015:08:24:08 -0400] conn=777787 op=9813 RESULT err=0 tag=120 nentries=0 etime=0
>>>>>>> [14/Jul/2015:08:24:10 -0400] conn=777787 op=9814 EXT oid="2.16.840.1.113730.3.5.5" name="Netscape Replication End Session"
>>>>>>> [14/Jul/2015:08:24:10 -0400] conn=777787 op=9814 RESULT err=0 tag=120 nentries=0 etime=0
>>>>>>> [14/Jul/2015:08:24:10 -0400] conn=777787 op=9815 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
>>>>>>> [14/Jul/2015:08:24:10 -0400] conn=777787 op=9815 RESULT err=0 tag=120 nentries=0 etime=0
>>>>>>>
>>>>>>> and here's some from the error log:
>>>>>>>
>>>>>>> [13/Jul/2015:22:41:49 -0400] DSRetroclPlugin - replog: an error occured while adding change number 2294859, dn = changenumber=2294859,cn=changelog: Operations error.
>>>>>>> [13/Jul/2015:22:41:49 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
>>>>>>> [13/Jul/2015:23:56:50 -0400] - dn2entry_ext: Failed to get id for changenumber=2296384,cn=changelog from entryrdn index (-30993)
>>>>>>> [13/Jul/2015:23:56:50 -0400] - Operation error fetching changenumber=2296384,cn=changelog (null), error -30993.
>>>>>>> [13/Jul/2015:23:56:50 -0400] DSRetroclPlugin - replog: an error occured while adding change number 2296384, dn = changenumber=2296384,cn=changelog: Operations error.
>>>>>>> [13/Jul/2015:23:56:50 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
>>>>>>> [14/Jul/2015:06:56:50 -0400] - dn2entry_ext: Failed to get id for changenumber=2304418,cn=changelog from entryrdn index (-30993)
>>>>>>> [14/Jul/2015:06:56:50 -0400] - Operation error fetching changenumber=2304418,cn=changelog (null), error -30993.
>>>>>>> [14/Jul/2015:06:56:50 -0400] DSRetroclPlugin - replog: an error occured while adding change number 2304418, dn = changenumber=2304418,cn=changelog: Operations error.
>>>>>>> [14/Jul/2015:06:56:50 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
>>>>>>>
>>>>>>>
>>>>>>>> - eventually enable replication logging: nsslapd-errorlog-level: 8192
>>>>>>>>> On a side note, we get hit with this bug often:
>>>>>>>>>
>>>>>>>>> https://www.redhat.com/archives/freeipa-users/2015-July/msg00018.html
>>>>>>>>>
>>>>>>>>> (rouge sssd_be processing hammering a replica).
>>>>>>>>>
>>>>>>>>> This causes high ns-slapd utilization on the replica and restarting sssd
>>>>>>>>> on the client host immediately fixes the issue. However, in this
>>>>>>>>> case, we're not seeing this behavior.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>> On 07/13/2015 04:46 PM, Andrew E. Bruno wrote:
>>>>>>>>>>>>> We have 3 freeipa-replicas. Centos 7.1.1503, ipa-server 4.1.0-18, and
>>>>>>>>>>>>> 389-ds 1.3.3.1-16.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Recently, the ns-slapd process on one of our replicas started showing higher
>>>>>>>>>>>>> than normal CPU usage. ns-slapd is pegged at high CPU usage more or less
>>>>>>>>>>>>> constantly.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Seems very similar to this thread:
>>>>>>>>>>>>> https://www.redhat.com/archives/freeipa-users/2015-February/msg00281.html
>>>>>>>>>>>>>
>>>>>>>>>>>>> There are a few errors showing in /var/log/dirsrv/slapd-[domain]/errors (not
>>>>>>>>>>>>> sure if these are related):
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> [13/Jul/2015:02:56:49 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
>>>>>>>>>>>>> [13/Jul/2015:04:11:50 -0400] - dn2entry_ext: Failed to get id for changenumber=2277387,cn=changelog from entryrdn index (-30993)
>>>>>>>>>>>>> [13/Jul/2015:04:11:50 -0400] - Operation error fetching changenumber=2277387,cn=changelog (null), error -30993.
>>>>>>>>>>>>> [13/Jul/2015:04:11:50 -0400] DSRetroclPlugin - replog: an error occured while adding change number 2277387, dn = changenumber=2277387,cn=changelog: Operations error.
>>>>>>>>>>>>> [13/Jul/2015:04:11:50 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
>>>>>>>>>>>>> [13/Jul/2015:07:41:49 -0400] - Operation error fetching Null DN (01de36ac-295411e5-b94db2ab-07afbca6), error -30993.
>>>>>>>>>>>>> [13/Jul/2015:07:41:49 -0400] - dn2entry_ext: Failed to get id for changenumber=2281464,cn=changelog from entryrdn index (-30993)
>>>>>>>>>>>>> [13/Jul/2015:07:41:49 -0400] - Operation error fetching changenumber=2281464,cn=changelog (null), error -30993.
>>>>>>>>>>>>> [13/Jul/2015:07:41:49 -0400] DSRetroclPlugin - replog: an error occured while adding change number 2281464, dn = changenumber=2281464,cn=changelog: Operations error.
>>>>>>>>>>>>> [13/Jul/2015:07:41:49 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> access logs seem to be showing normal activity. Here's the number of open
>>>>>>>>>>>>> connections:
>>>>>>>>>>>>>
>>>>>>>>>>>>> # ls -al /proc/`cat /var/run/dirsrv/slapd-[domain].pid`/fd|grep socket|wc -l
>>>>>>>>>>>>> 62
>>>>>>>>>>>>>
>>>>>>>>>>>>> Note: the other two replicas have much higher open connections (>250) and low
>>>>>>>>>>>>> cpu load avgs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here's some output of logconv.pl from our most recent access log on the replica
>>>>>>>>>>>>> with high cpu load:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Start of Logs:    13/Jul/2015:04:49:18
>>>>>>>>>>>>> End of Logs:      13/Jul/2015:10:06:11
>>>>>>>>>>>>>
>>>>>>>>>>>>> Processed Log Time:  5 Hours, 16 Minutes, 53 Seconds
>>>>>>>>>>>>>
>>>>>>>>>>>>> Restarts:                     0
>>>>>>>>>>>>> Total Connections:            2343
>>>>>>>>>>>>>   - LDAP Connections:          2120
>>>>>>>>>>>>>   - LDAPI Connections:         223
>>>>>>>>>>>>>   - LDAPS Connections:         0
>>>>>>>>>>>>>   - StartTLS Extended Ops:     45
>>>>>>>>>>>>>   Secure Protocol Versions:
>>>>>>>>>>>>>    - TLS1.2 128-bit AES - 45
>>>>>>>>>>>>>
>>>>>>>>>>>>> Peak Concurrent Connections:  22
>>>>>>>>>>>>> Total Operations:             111865
>>>>>>>>>>>>> Total Results:                111034
>>>>>>>>>>>>> Overall Performance:          99.3%
>>>>>>>>>>>>>
>>>>>>>>>>>>> Searches:                     95585         (5.03/sec)  (301.64/min)
>>>>>>>>>>>>> Modifications:                3369          (0.18/sec)  (10.63/min)
>>>>>>>>>>>>> Adds:                         0             (0.00/sec)  (0.00/min)
>>>>>>>>>>>>> Deletes:                      0             (0.00/sec)  (0.00/min)
>>>>>>>>>>>>> Mod RDNs:                     0             (0.00/sec)  (0.00/min)
>>>>>>>>>>>>> Compares:                     0             (0.00/sec)  (0.00/min)
>>>>>>>>>>>>> Binds:                        7082          (0.37/sec)  (22.35/min)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Proxied Auth Operations:      0
>>>>>>>>>>>>> Persistent Searches:          0
>>>>>>>>>>>>> Internal Operations:          0
>>>>>>>>>>>>> Entry Operations:             0
>>>>>>>>>>>>> Extended Operations:          5317
>>>>>>>>>>>>> Abandoned Requests:           416
>>>>>>>>>>>>> Smart Referrals Received:     0
>>>>>>>>>>>>>
>>>>>>>>>>>>> VLV Operations:               96
>>>>>>>>>>>>> VLV Unindexed Searches:       0
>>>>>>>>>>>>> VLV Unindexed Components:     32
>>>>>>>>>>>>> SORT Operations:              64
>>>>>>>>>>>>>
>>>>>>>>>>>>> Entire Search Base Queries:   0
>>>>>>>>>>>>> Paged Searches:               3882
>>>>>>>>>>>>> Unindexed Searches:           0
>>>>>>>>>>>>> Unindexed Components:         5
>>>>>>>>>>>>>
>>>>>>>>>>>>> FDs Taken:                    2566
>>>>>>>>>>>>> FDs Returned:                 2643
>>>>>>>>>>>>> Highest FD Taken:             249
>>>>>>>>>>>>>
>>>>>>>>>>>>> Broken Pipes:                 0
>>>>>>>>>>>>> Connections Reset By Peer:    0
>>>>>>>>>>>>> Resource Unavailable:         0
>>>>>>>>>>>>> Max BER Size Exceeded:        0
>>>>>>>>>>>>>
>>>>>>>>>>>>> Binds:                        7082
>>>>>>>>>>>>> Unbinds:                      2443
>>>>>>>>>>>>>   - LDAP v2 Binds:             0
>>>>>>>>>>>>>   - LDAP v3 Binds:             6859
>>>>>>>>>>>>>   - AUTOBINDs:                 223
>>>>>>>>>>>>>   - SSL Client Binds:          0
>>>>>>>>>>>>>   - Failed SSL Client Binds:   0
>>>>>>>>>>>>>   - SASL Binds:                6814
>>>>>>>>>>>>>      GSSAPI - 6591
>>>>>>>>>>>>>      EXTERNAL - 223
>>>>>>>>>>>>>   - Directory Manager Binds:   0
>>>>>>>>>>>>>   - Anonymous Binds:           6591
>>>>>>>>>>>>>   - Other Binds:               491
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> strace timing on the ns-slapd process:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> % time     seconds  usecs/call     calls    errors syscall
>>>>>>>>>>>>> ------ ----------- ----------- --------- --------- ----------------
>>>>>>>>>>>>>   94.40    0.346659        5977        58           poll
>>>>>>>>>>>>>    4.10    0.015057       15057         1           restart_syscall
>>>>>>>>>>>>>    0.91    0.003353          57        59        59 getpeername
>>>>>>>>>>>>>    0.49    0.001796         150        12           futex
>>>>>>>>>>>>>    0.10    0.000364          73         5           read
>>>>>>>>>>>>> ------ ----------- ----------- --------- --------- ----------------
>>>>>>>>>>>>> 100.00    0.367229                   135        59 total
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> top output (with threads 'H'):
>>>>>>>>>>>>>
>>>>>>>>>>>>>    PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
>>>>>>>>>>>>>   2879 dirsrv    20   0 3819252 1.962g  11680 R 99.9  3.1   8822:10 ns-slapd
>>>>>>>>>>>>>   2895 dirsrv    20   0 3819252 1.962g  11680 R 34.1  3.1 115:10.62 ns-slapd
>>>>>>>>>>>>>   2889 dirsrv    20   0 3819252 1.962g  11680 R  2.4  3.1 115:34.42 ns-slapd
>>>>>>>>>>>>>   2917 dirsrv    20   0 3819252 1.962g  11680 S  2.4  3.1 115:26.87 ns-slapd
>>>>>>>>>>>>>   2898 dirsrv    20   0 3819252 1.962g  11680 S  2.1  3.1 116:33.12 ns-slapd
>>>>>>>>>>>>>   2904 dirsrv    20   0 3819252 1.962g  11680 S  2.1  3.1 115:08.56 ns-slapd
>>>>>>>>>>>>>   2892 dirsrv    20   0 3819252 1.962g  11680 S  1.8  3.1 115:33.04 ns-slapd
>>>>>>>>>>>>>   2897 dirsrv    20   0 3819252 1.962g  11680 R  1.8  3.1 114:54.28 ns-slapd
>>>>>>>>>>>>>   2914 dirsrv    20   0 3819252 1.962g  11680 R  1.8  3.1 116:03.35 ns-slapd
>>>>>>>>>>>>>   2907 dirsrv    20   0 3819252 1.962g  11680 S  1.5  3.1 115:42.25 ns-slapd
>>>>>>>>>>>>>   2910 dirsrv    20   0 3819252 1.962g  11680 R  1.5  3.1 116:01.99 ns-slapd
>>>>>>>>>>>>>   2870 dirsrv    20   0 3819252 1.962g  11680 R  1.2  3.1 611:30.22 ns-slapd
>>>>>>>>>>>>>   2890 dirsrv    20   0 3819252 1.962g  11680 S  1.2  3.1 115:18.25 ns-slapd
>>>>>>>>>>>>>   2891 dirsrv    20   0 3819252 1.962g  11680 S  1.2  3.1 115:22.24 ns-slapd
>>>>>>>>>>>>>   2899 dirsrv    20   0 3819252 1.962g  11680 R  1.2  3.1 116:11.85 ns-slapd
>>>>>>>>>>>>>   2888 dirsrv    20   0 3819252 1.962g  11680 S  0.9  3.1 114:51.19 ns-slapd
>>>>>>>>>>>>>   2896 dirsrv    20   0 3819252 1.962g  11680 R  0.9  3.1 115:46.84 ns-slapd
>>>>>>>>>>>>>   2915 dirsrv    20   0 3819252 1.962g  11680 S  0.9  3.1 115:49.34 ns-slapd
>>>>>>>>>>>>>   2887 dirsrv    20   0 3819252 1.962g  11680 R  0.6  3.1 115:49.85 ns-slapd
>>>>>>>>>>>>>   2894 dirsrv    20   0 3819252 1.962g  11680 S  0.6  3.1 115:58.02 ns-slapd
>>>>>>>>>>>>>   2911 dirsrv    20   0 3819252 1.962g  11680 S  0.6  3.1 116:22.84 ns-slapd
>>>>>>>>>>>>>   2913 dirsrv    20   0 3819252 1.962g  11680 S  0.6  3.1 114:43.56 ns-slapd
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> ns-slapd stays pegged >99%.  Trying to figure out what ns-slapd is doing? Any
>>>>>>>>>>>>> pointers on where else to look?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks in advance.
>>>>>>>>>>>>>
>>>>>>>>>>>>> --Andrew
>>>>>>>>>>>>>
>>>>>>>>>>>> -- 
>>>>>>>>>>>> Manage your subscription for the Freeipa-users mailing list:
>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/freeipa-users
>>>>>>>>>>>> Go to http://freeipa.org for more info on the project
>>>>>>>>>>>>
>>>>>>>>>>>>
>>