[Freeipa-users] ns-slapd high cpu usage

Ludwig Krispenz lkrispen at redhat.com
Tue Jul 14 14:52:10 UTC 2015


hm, the stack traces show csn_str, which correspond to Jul,8th, Jul,4th, 
and Jul,7th - so it looks like it is iterating the changelog over and 
over again.
Th consumer side Is "cn=meTosrv-m14-24.ccr.buffalo.edu" - is this the 
master ?

can you provide the result of the following search from 
m14-24.ccr.buffalo.edu adn the server with the high cpu:

ldapsearch -o ldif-wrap=no -x -D ... -w  -b "cn=config" 
"objectclass=nsds5replica" nsds50ruv

On 07/14/2015 02:35 PM, Andrew E. Bruno wrote:
> On Tue, Jul 14, 2015 at 01:41:57PM +0200, Ludwig Krispenz wrote:
>> On 07/13/2015 06:36 PM, Andrew E. Bruno wrote:
>>> On Mon, Jul 13, 2015 at 05:29:13PM +0200, Ludwig Krispenz wrote:
>>>> On 07/13/2015 05:05 PM, Andrew E. Bruno wrote:
>>>>> On Mon, Jul 13, 2015 at 04:58:46PM +0200, Ludwig Krispenz wrote:
>>>>>> can you get a pstack of the slapd process along with a top -H to find th
>>>>>> ethread with high cpu usage
>>>>> Attached is the full stacktrace of the running ns-slapd proccess. top -H
>>>>> shows this thread (2879) with high cpu usage:
>>>>>
>>>>> 2879 dirsrv    20   0 3819252 1.962g  11680 R 99.9  3.1   8822:10 ns-slapd
>>>> this thread is a replication thread sending updates, what is strange is that
>>>> the current csn_str is quite old (july, 7th), I can't tell which agreeement
>>>> this thread is handling, but looks like it is heavily reading the changeglog
>>>> and sending updates. anything changed recently in replication setup ?
>>> Yes, we had one replica fail on (6/19) which we removed (not this one
>>> showing high CPU load). Had to perform some manual cleanup of the ipa-ca
>>> RUVs. Then we added the replica back in on 7/1. Since then, replication
>>> appears to have been running normally between the 3 replicas. We've been
>>> monitoring utilization since 7/1 and only recently seen this spike (past
>>> 24 hours or so).
>> is it still in this state ? or was it a spike.
> Yes same state.
>
>> if it still is high cpu consuming, could you
>> - get a few pstack like the one before with some time in between, I would
>> like to see if it is progressing with the csns or looping on the same one
> Attached are a few stacktraces. The thread pegging the cpu is:
>
> PID  USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+  COMMAND
> 2879 dirsrv    20   0 3819252 1.978g  11684 R  99.9  3.2  10148:26  ns-slapd
>
>> - check the consumer side. is there anything in the error log ? does the
>> access log show replication activity from this server
>
> Here's some errors showing up on the first master server rep1 (rep2 is the
> server with pegged cpu):
>
> [13/Jul/2015:20:41:51 -0400] NSMMReplicationPlugin - agmt="cn=masterAgreement1-rep2-pki-tomcat" (rep2:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a45ad6000000600000): Operations error (1). Will retry later.
> [13/Jul/2015:22:41:51 -0400] NSMMReplicationPlugin - agmt="cn=masterAgreement1-rep2-pki-tomcat" (rep2:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a476f6000000600000): Operations error (1). Will retry later.
> [14/Jul/2015:06:56:51 -0400] NSMMReplicationPlugin - agmt="cn=masterAgreement1-rep2-tomcat" (rep2:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a4eafa000000600000): Operations error (1). Will retry later.
>
>
> Here's some snips from the access log of the rep2:
>
>
> [14/Jul/2015:08:22:31 -0400] conn=777787 op=9794 EXT oid="2.16.840.1.113730.3.5.5" name="Netscape Replication End Session"
> [14/Jul/2015:08:22:31 -0400] conn=777787 op=9794 RESULT err=0 tag=120 nentries=0 etime=0
> [14/Jul/2015:08:22:31 -0400] conn=777787 op=9795 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
> [14/Jul/2015:08:22:31 -0400] conn=777787 op=9795 RESULT err=0 tag=120 nentries=0 etime=0
> [14/Jul/2015:08:22:33 -0400] conn=777787 op=9796 EXT oid="2.16.840.1.113730.3.5.5" name="Netscape Replication End Session"
> ..
> [14/Jul/2015:08:23:38 -0400] conn=782341 op=129 RESULT err=0 tag=103 nentries=0 etime=0 csn=55a4ff6c000000050000
> ..
> [14/Jul/2015:08:24:02 -0400] conn=781901 op=1745 RESULT err=0 tag=101 nentries=1 etime=0
> [14/Jul/2015:08:24:03 -0400] conn=777787 op=9810 EXT oid="2.16.840.1.113730.3.5.5" name="Netscape Replication End Session"
> [14/Jul/2015:08:24:03 -0400] conn=777787 op=9810 RESULT err=0 tag=120 nentries=0 etime=0
> [14/Jul/2015:08:24:03 -0400] conn=777787 op=9811 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
> [14/Jul/2015:08:24:03 -0400] conn=777787 op=9811 RESULT err=0 tag=120 nentries=0 etime=0
> [14/Jul/2015:08:24:05 -0400] conn=777787 op=9812 EXT oid="2.16.840.1.113730.3.5.5" name="Netscape Replication End Session"
> [14/Jul/2015:08:24:05 -0400] conn=777787 op=9812 RESULT err=0 tag=120 nentries=0 etime=0
> [14/Jul/2015:08:24:08 -0400] conn=777787 op=9813 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
> [14/Jul/2015:08:24:08 -0400] conn=777787 op=9813 RESULT err=0 tag=120 nentries=0 etime=0
> [14/Jul/2015:08:24:10 -0400] conn=777787 op=9814 EXT oid="2.16.840.1.113730.3.5.5" name="Netscape Replication End Session"
> [14/Jul/2015:08:24:10 -0400] conn=777787 op=9814 RESULT err=0 tag=120 nentries=0 etime=0
> [14/Jul/2015:08:24:10 -0400] conn=777787 op=9815 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
> [14/Jul/2015:08:24:10 -0400] conn=777787 op=9815 RESULT err=0 tag=120 nentries=0 etime=0
>
> and here's some from the error log:
>
> [13/Jul/2015:22:41:49 -0400] DSRetroclPlugin - replog: an error occured while adding change number 2294859, dn = changenumber=2294859,cn=changelog: Operations error.
> [13/Jul/2015:22:41:49 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
> [13/Jul/2015:23:56:50 -0400] - dn2entry_ext: Failed to get id for changenumber=2296384,cn=changelog from entryrdn index (-30993)
> [13/Jul/2015:23:56:50 -0400] - Operation error fetching changenumber=2296384,cn=changelog (null), error -30993.
> [13/Jul/2015:23:56:50 -0400] DSRetroclPlugin - replog: an error occured while adding change number 2296384, dn = changenumber=2296384,cn=changelog: Operations error.
> [13/Jul/2015:23:56:50 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
> [14/Jul/2015:06:56:50 -0400] - dn2entry_ext: Failed to get id for changenumber=2304418,cn=changelog from entryrdn index (-30993)
> [14/Jul/2015:06:56:50 -0400] - Operation error fetching changenumber=2304418,cn=changelog (null), error -30993.
> [14/Jul/2015:06:56:50 -0400] DSRetroclPlugin - replog: an error occured while adding change number 2304418, dn = changenumber=2304418,cn=changelog: Operations error.
> [14/Jul/2015:06:56:50 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
>
>
>> - eventually enable replication logging: nsslapd-errorlog-level: 8192
>>> On a side note, we get hit with this bug often:
>>>
>>> https://www.redhat.com/archives/freeipa-users/2015-July/msg00018.html
>>>
>>> (rouge sssd_be processing hammering a replica).
>>>
>>> This causes high ns-slapd utilization on the replica and restarting sssd
>>> on the client host immediately fixes the issue. However, in this
>>> case, we're not seeing this behavior.
>>>
>>>
>>>
>>>
>>>>>
>>>>>
>>>>>
>>>>>> On 07/13/2015 04:46 PM, Andrew E. Bruno wrote:
>>>>>>> We have 3 freeipa-replicas. Centos 7.1.1503, ipa-server 4.1.0-18, and
>>>>>>> 389-ds 1.3.3.1-16.
>>>>>>>
>>>>>>> Recently, the ns-slapd process on one of our replicas started showing higher
>>>>>>> than normal CPU usage. ns-slapd is pegged at high CPU usage more or less
>>>>>>> constantly.
>>>>>>>
>>>>>>> Seems very similar to this thread:
>>>>>>> https://www.redhat.com/archives/freeipa-users/2015-February/msg00281.html
>>>>>>>
>>>>>>> There are a few errors showing in /var/log/dirsrv/slapd-[domain]/errors (not
>>>>>>> sure if these are related):
>>>>>>>
>>>>>>>
>>>>>>> [13/Jul/2015:02:56:49 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
>>>>>>> [13/Jul/2015:04:11:50 -0400] - dn2entry_ext: Failed to get id for changenumber=2277387,cn=changelog from entryrdn index (-30993)
>>>>>>> [13/Jul/2015:04:11:50 -0400] - Operation error fetching changenumber=2277387,cn=changelog (null), error -30993.
>>>>>>> [13/Jul/2015:04:11:50 -0400] DSRetroclPlugin - replog: an error occured while adding change number 2277387, dn = changenumber=2277387,cn=changelog: Operations error.
>>>>>>> [13/Jul/2015:04:11:50 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
>>>>>>> [13/Jul/2015:07:41:49 -0400] - Operation error fetching Null DN (01de36ac-295411e5-b94db2ab-07afbca6), error -30993.
>>>>>>> [13/Jul/2015:07:41:49 -0400] - dn2entry_ext: Failed to get id for changenumber=2281464,cn=changelog from entryrdn index (-30993)
>>>>>>> [13/Jul/2015:07:41:49 -0400] - Operation error fetching changenumber=2281464,cn=changelog (null), error -30993.
>>>>>>> [13/Jul/2015:07:41:49 -0400] DSRetroclPlugin - replog: an error occured while adding change number 2281464, dn = changenumber=2281464,cn=changelog: Operations error.
>>>>>>> [13/Jul/2015:07:41:49 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
>>>>>>>
>>>>>>>
>>>>>>> access logs seem to be showing normal activity. Here's the number of open
>>>>>>> connections:
>>>>>>>
>>>>>>> # ls -al /proc/`cat /var/run/dirsrv/slapd-[domain].pid`/fd|grep socket|wc -l
>>>>>>> 62
>>>>>>>
>>>>>>> Note: the other two replicas have much higher open connections (>250) and low
>>>>>>> cpu load avgs.
>>>>>>>
>>>>>>> Here's some output of logconv.pl from our most recent access log on the replica
>>>>>>> with high cpu load:
>>>>>>>
>>>>>>> Start of Logs:    13/Jul/2015:04:49:18
>>>>>>> End of Logs:      13/Jul/2015:10:06:11
>>>>>>>
>>>>>>> Processed Log Time:  5 Hours, 16 Minutes, 53 Seconds
>>>>>>>
>>>>>>> Restarts:                     0
>>>>>>> Total Connections:            2343
>>>>>>>   - LDAP Connections:          2120
>>>>>>>   - LDAPI Connections:         223
>>>>>>>   - LDAPS Connections:         0
>>>>>>>   - StartTLS Extended Ops:     45
>>>>>>>   Secure Protocol Versions:
>>>>>>>    - TLS1.2 128-bit AES - 45
>>>>>>>
>>>>>>> Peak Concurrent Connections:  22
>>>>>>> Total Operations:             111865
>>>>>>> Total Results:                111034
>>>>>>> Overall Performance:          99.3%
>>>>>>>
>>>>>>> Searches:                     95585         (5.03/sec)  (301.64/min)
>>>>>>> Modifications:                3369          (0.18/sec)  (10.63/min)
>>>>>>> Adds:                         0             (0.00/sec)  (0.00/min)
>>>>>>> Deletes:                      0             (0.00/sec)  (0.00/min)
>>>>>>> Mod RDNs:                     0             (0.00/sec)  (0.00/min)
>>>>>>> Compares:                     0             (0.00/sec)  (0.00/min)
>>>>>>> Binds:                        7082          (0.37/sec)  (22.35/min)
>>>>>>>
>>>>>>> Proxied Auth Operations:      0
>>>>>>> Persistent Searches:          0
>>>>>>> Internal Operations:          0
>>>>>>> Entry Operations:             0
>>>>>>> Extended Operations:          5317
>>>>>>> Abandoned Requests:           416
>>>>>>> Smart Referrals Received:     0
>>>>>>>
>>>>>>> VLV Operations:               96
>>>>>>> VLV Unindexed Searches:       0
>>>>>>> VLV Unindexed Components:     32
>>>>>>> SORT Operations:              64
>>>>>>>
>>>>>>> Entire Search Base Queries:   0
>>>>>>> Paged Searches:               3882
>>>>>>> Unindexed Searches:           0
>>>>>>> Unindexed Components:         5
>>>>>>>
>>>>>>> FDs Taken:                    2566
>>>>>>> FDs Returned:                 2643
>>>>>>> Highest FD Taken:             249
>>>>>>>
>>>>>>> Broken Pipes:                 0
>>>>>>> Connections Reset By Peer:    0
>>>>>>> Resource Unavailable:         0
>>>>>>> Max BER Size Exceeded:        0
>>>>>>>
>>>>>>> Binds:                        7082
>>>>>>> Unbinds:                      2443
>>>>>>>   - LDAP v2 Binds:             0
>>>>>>>   - LDAP v3 Binds:             6859
>>>>>>>   - AUTOBINDs:                 223
>>>>>>>   - SSL Client Binds:          0
>>>>>>>   - Failed SSL Client Binds:   0
>>>>>>>   - SASL Binds:                6814
>>>>>>>      GSSAPI - 6591
>>>>>>>      EXTERNAL - 223
>>>>>>>   - Directory Manager Binds:   0
>>>>>>>   - Anonymous Binds:           6591
>>>>>>>   - Other Binds:               491
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> strace timing on the ns-slapd process:
>>>>>>>
>>>>>>>
>>>>>>> % time     seconds  usecs/call     calls    errors syscall
>>>>>>> ------ ----------- ----------- --------- --------- ----------------
>>>>>>>   94.40    0.346659        5977        58           poll
>>>>>>>    4.10    0.015057       15057         1           restart_syscall
>>>>>>>    0.91    0.003353          57        59        59 getpeername
>>>>>>>    0.49    0.001796         150        12           futex
>>>>>>>    0.10    0.000364          73         5           read
>>>>>>> ------ ----------- ----------- --------- --------- ----------------
>>>>>>> 100.00    0.367229                   135        59 total
>>>>>>>
>>>>>>>
>>>>>>> top output (with threads 'H'):
>>>>>>>
>>>>>>>    PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
>>>>>>>   2879 dirsrv    20   0 3819252 1.962g  11680 R 99.9  3.1   8822:10 ns-slapd
>>>>>>>   2895 dirsrv    20   0 3819252 1.962g  11680 R 34.1  3.1 115:10.62 ns-slapd
>>>>>>>   2889 dirsrv    20   0 3819252 1.962g  11680 R  2.4  3.1 115:34.42 ns-slapd
>>>>>>>   2917 dirsrv    20   0 3819252 1.962g  11680 S  2.4  3.1 115:26.87 ns-slapd
>>>>>>>   2898 dirsrv    20   0 3819252 1.962g  11680 S  2.1  3.1 116:33.12 ns-slapd
>>>>>>>   2904 dirsrv    20   0 3819252 1.962g  11680 S  2.1  3.1 115:08.56 ns-slapd
>>>>>>>   2892 dirsrv    20   0 3819252 1.962g  11680 S  1.8  3.1 115:33.04 ns-slapd
>>>>>>>   2897 dirsrv    20   0 3819252 1.962g  11680 R  1.8  3.1 114:54.28 ns-slapd
>>>>>>>   2914 dirsrv    20   0 3819252 1.962g  11680 R  1.8  3.1 116:03.35 ns-slapd
>>>>>>>   2907 dirsrv    20   0 3819252 1.962g  11680 S  1.5  3.1 115:42.25 ns-slapd
>>>>>>>   2910 dirsrv    20   0 3819252 1.962g  11680 R  1.5  3.1 116:01.99 ns-slapd
>>>>>>>   2870 dirsrv    20   0 3819252 1.962g  11680 R  1.2  3.1 611:30.22 ns-slapd
>>>>>>>   2890 dirsrv    20   0 3819252 1.962g  11680 S  1.2  3.1 115:18.25 ns-slapd
>>>>>>>   2891 dirsrv    20   0 3819252 1.962g  11680 S  1.2  3.1 115:22.24 ns-slapd
>>>>>>>   2899 dirsrv    20   0 3819252 1.962g  11680 R  1.2  3.1 116:11.85 ns-slapd
>>>>>>>   2888 dirsrv    20   0 3819252 1.962g  11680 S  0.9  3.1 114:51.19 ns-slapd
>>>>>>>   2896 dirsrv    20   0 3819252 1.962g  11680 R  0.9  3.1 115:46.84 ns-slapd
>>>>>>>   2915 dirsrv    20   0 3819252 1.962g  11680 S  0.9  3.1 115:49.34 ns-slapd
>>>>>>>   2887 dirsrv    20   0 3819252 1.962g  11680 R  0.6  3.1 115:49.85 ns-slapd
>>>>>>>   2894 dirsrv    20   0 3819252 1.962g  11680 S  0.6  3.1 115:58.02 ns-slapd
>>>>>>>   2911 dirsrv    20   0 3819252 1.962g  11680 S  0.6  3.1 116:22.84 ns-slapd
>>>>>>>   2913 dirsrv    20   0 3819252 1.962g  11680 S  0.6  3.1 114:43.56 ns-slapd
>>>>>>>
>>>>>>>
>>>>>>> ns-slapd stays pegged >99%.  Trying to figure out what ns-slapd is doing? Any
>>>>>>> pointers on where else to look?
>>>>>>>
>>>>>>> Thanks in advance.
>>>>>>>
>>>>>>> --Andrew
>>>>>>>
>>>>>> -- 
>>>>>> Manage your subscription for the Freeipa-users mailing list:
>>>>>> https://www.redhat.com/mailman/listinfo/freeipa-users
>>>>>> Go to http://freeipa.org for more info on the project
>>>>>>
>>>>>>
>>




More information about the Freeipa-users mailing list