[Freeipa-users] FreeIPA 3.3 performance issues with many hosts

Ludwig Krispenz lkrispen at redhat.com
Wed Oct 21 15:03:25 UTC 2015


On 10/21/2015 03:56 PM, Dominik Korittki wrote:
>
>
> Am 07.10.2015 um 17:30 schrieb thierry bordaz:
>> On 10/07/2015 05:03 PM, Dominik Korittki wrote:
>>>
>>>
>>> Am 07.10.2015 um 15:25 schrieb thierry bordaz:
>>>> On 10/07/2015 11:19 AM, Martin Kosek wrote:
>>>>> On 10/05/2015 02:13 PM, Dominik Korittki wrote:
>>>>>>
>>>>>> Am 01.10.2015 um 21:52 schrieb Rob Crittenden:
>>>>>>> Dominik Korittki wrote:
>>>>>>>> Hello folks,
>>>>>>>>
>>>>>>>> I am running two FreeIPA servers with around 100 users and around
>>>>>>>> 15,000 hosts, which users log in to via SSH. The FreeIPA servers
>>>>>>>> (which are CentOS 7.0) ran well for a while, but as more and more
>>>>>>>> hosts got migrated to FreeIPA, it started to get slow and
>>>>>>>> unstable.
>>>>>>>>
>>>>>>>> For example, it's hard to maintain hostgroups which have more
>>>>>>>> than 1,000 hosts. The ipa host-* commands are getting slower as
>>>>>>>> the hostgroup grows. Is this normal?
>>>>>>> You mean the ipa hostgroup-* commands? Whenever the entry is
>>>>>>> displayed (show and add) it needs to dereference all members, so
>>>>>>> yes, it is understandable that it gets somewhat slower with more
>>>>>>> members. How slow are we talking about?
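>>>>>>>
>>>>>>> (A quick, non-authoritative way to quantify it is to time the
>>>>>>> call from a client; the hostgroup name below is just an example:
>>>>>>>
>>>>>>>     time ipa hostgroup-show webservers
>>>>>>>
>>>>>>> Comparing that against a small group shows how much of the wall
>>>>>>> time goes into member dereferencing.)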
>>>>>>>
>>>>>>>> We also experience random dirsrv segfaults. Here's a dmesg line
>>>>>>>> from the
>>>>>>>> latest:
>>>>>>>>
>>>>>>>> [690787.647261] traps: ns-slapd[5217] general protection
>>>>>>>> ip:7f8d6b6d6bc1
>>>>>>>> sp:7f8d3aff2a88 error:0 in libc-2.17.so[7f8d6b650000+1b6000]
>>>>>>> You probably want to start here:
>>>>>>> http://www.port389.org/docs/389ds/FAQ/faq.html#debugging-crashes
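>>>>>>>
>>>>>>> (Roughly, and assuming stock CentOS 7 paths, that page boils down
>>>>>>> to installing the debug symbols and pulling a full backtrace from
>>>>>>> the core file; the core file name below is only an example:
>>>>>>>
>>>>>>>     debuginfo-install 389-ds-base
>>>>>>>     gdb -ex 'set pagination off' \
>>>>>>>         -ex 'thread apply all bt full' -ex 'quit' \
>>>>>>>         /usr/sbin/ns-slapd /var/tmp/core.5217 > stacktrace.txt 2>&1
>>>>>>>
>>>>>>> The FAQ has the exact, current procedure.)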
>>>>>> A stacktrace from the latest crash is attached to this email. After
>>>>>> restarting
>>>>>> the service, this is what I get in
>>>>>> /var/log/dirsrv/slapd-INTERNAL/errors
>>>>>> (hostname is ipa01.internal):
>>>>> Ludwig or Thierry, can you please take a look at the stack and file
>>>>> a 389-DS ticket if appropriate?
>>>>
>>>> Hello Dominik,
>>>>
>>>> DS is crashing during a BIND, and from the argument values we can
>>>> guess it was due to a heap corruption that corrupted the operation's
>>>> pblock. This bind operation was likely a victim of the heap
>>>> corruption rather than responsible for it.
>>>>
>>>> Using valgrind is the best way to track down such a problem, but as
>>>> you already suffer from bad performance I doubt it would be
>>>> acceptable. How frequently does it crash? Did you identify any kind
>>>> of test case?
>>>
>>> At first the crashes happened on a daily basis. Simply restarting the
>>> dirsrv daemon resolved the issue for another day, but later on the
>>> daemon did not survive more than 15 minutes most of the time. There
>>> were exceptions, though. Sometimes the daemon ran for several hours
>>> until it crashed.
>>> I did not really identify a test case. However, I suspected it could
>>> have something to do with replication, as I have seen
>>> replication-related errors in the dirsrv error log (mentioned in an
>>> earlier mail in this topic).
>> Heap corruptions are usually timing-dependent, and if the server
>> becomes slower and slower, that can change the timing in favor of the
>> heap corruption.
>>>
>>> So I did the following:
>>> ipa01 has a replication agreement with ipa02. ipa01 was the one with
>>> the segfaults. I removed ipa01 from the replication agreement
>>> (ipa-replica-manage del), did an ipa-server-install --uninstall on
>>> ipa01, and recreated ipa01 as a replica of ipa02. Since then I have
>>> not experienced any crashes (for now).
>>> Instead, I'm having trouble rebuilding a clean replication agreement
>>> (old RUV data is still in the database), but that's another story I
>>> will eventually post to the mailing list as a new topic.
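>>>
>>> (For reference: as far as I can tell, the usual tooling for that is
>>>
>>>     ipa-replica-manage list-ruv
>>>     ipa-replica-manage clean-ruv 4
>>>
>>> where 4 is just a placeholder for a stale replica ID taken from the
>>> list-ruv output.)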
>>>
>>> As for valgrind: I've never used it before. Is there a handy
>>> explanation of how to use it in combination with 389ds? If I still
>>> experience those crashes and I manage to get it working, I could try
>>> it out.
>> You may follow this procedure:
>> http://www.port389.org/docs/389ds/FAQ/faq.html#debugging-memory-growthinvalid-access-with-valgrind
>> (but remove --leak-check=yes, because this is not a leak issue)
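>>
>> A minimal sketch of the invocation, assuming stock CentOS 7 paths for
>> the slapd-INTERNAL instance (stop the dirsrv service first):
>>
>>     valgrind -q --tool=memcheck --num-callers=40 \
>>         --log-file=/var/tmp/slapd.vg.%p \
>>         /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-INTERNAL \
>>             -i /var/run/dirsrv/slapd-INTERNAL.pid -d 0
>>
>> If the linked procedure differs from this, follow the procedure.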
>>
>> thanks
>> thierry
>
> I experienced segmentation faults again on host ipa01, even after I
> rebuilt the replication topology as described in my previous mail.
> I followed your advice and ran valgrind last evening. Sadly, I forgot
> to remove --leak-check=yes, but I hope the information is still useful
> to you. If not, I'll do it again without --leak-check=yes.
>
> Running under valgrind, the ns-slapd process needed quite some time
> until it opened its ports. You can see this in the error logs:
>
> [20/Oct/2015:22:27:41 +0200] - 389-Directory/1.3.1.6 B2014.219.1825 
> starting up
> [20/Oct/2015:22:27:42 +0200] - WARNING: userRoot: entry cache size 
> 10485760B is less than db size 142483456B; We recommend to increase 
> the entry cache size nsslapd-cachememsize.
> [20/Oct/2015:22:27:44 +0200] schema-compat-plugin - warning: no 
> entries set up under cn=computers, cn=compat,dc=internal
> [20/Oct/2015:23:09:16 +0200] - slapd started.  Listening on All 
> Interfaces port 389 for LDAP requests
> [20/Oct/2015:23:09:16 +0200] - Listening on All Interfaces port 636 
> for LDAPS requests
> [20/Oct/2015:23:09:16 +0200] - Listening on 
> /var/run/slapd-INTERNAL.socket for LDAPI requests
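>
> (Side note on the entry cache warning above: a minimal sketch of
> raising nsslapd-cachememsize with ldapmodify, where the 256 MB value
> is only an example and should be sized to your data:
>
>     ldapmodify -x -D "cn=Directory Manager" -W <<EOF
>     dn: cn=userRoot,cn=ldbm database,cn=plugins,cn=config
>     changetype: modify
>     replace: nsslapd-cachememsize
>     nsslapd-cachememsize: 268435456
>     EOF
>
> followed by a restart of the dirsrv instance.)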
>
> I guess that's normal, since running the process through valgrind
> incurs a huge performance loss? The daemon crashed about 25 seconds
> after it had opened its ports. Here is the valgrind log:
> http://pastebin.com/8t9RtB6p
>
> Do you see anything suspicious?
It looks like it is accessing memory that was freed in a pre-bind
plugin; this could be the issue tracked in
https://fedorahosted.org/389/ticket/48188
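
(If you want to spot that pattern in the memcheck output yourself: the
telltale sequence is an "Invalid read" whose address valgrind reports
as being inside a block that was already free'd. Assuming the log path
from the run above, something like

    grep -n -A 12 "Invalid read" /var/tmp/slapd.vg.*

shows each offending access together with its context.)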
> Many thanks for your help!
>
>
> - Dominik
>
>>>
>>>
>>> Kind regards,
>>> Dominik Korittki
>>>
>>>>
>>>> thanks
>>>> thierry
>>>>>> [05/Oct/2015:13:51:30 +0200] - slapd started.  Listening on All
>>>>>> Interfaces port 389 for LDAP requests
>>>>>> [05/Oct/2015:13:51:30 +0200] - Listening on All Interfaces port
>>>>>> 636 for LDAPS requests
>>>>>> [05/Oct/2015:13:51:30 +0200] - Listening on
>>>>>> /var/run/slapd-INTERNAL.socket for LDAPI requests
>>>>>> [05/Oct/2015:13:51:30 +0200] slapd_ldap_sasl_interactive_bind -
>>>>>> Error: could not perform interactive bind for id [] mech [GSSAPI]:
>>>>>> LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI
>>>>>> Error: Unspecified GSS failure.  Minor code may provide more
>>>>>> information (No Kerberos credentials available)) errno 0 (Success)
>>>>>> [05/Oct/2015:13:51:30 +0200] slapi_ldap_bind - Error: could not
>>>>>> perform interactive bind for id [] authentication mechanism
>>>>>> [GSSAPI]: error -2 (Local error)
>>>>>> [05/Oct/2015:13:51:30 +0200] NSMMReplicationPlugin -
>>>>>> agmt="cn=meToipa02.internal" (ipa02:389): Replication bind with
>>>>>> GSSAPI auth failed: LDAP error -2 (Local error) (SASL(-1): generic
>>>>>> failure: GSSAPI Error: Unspecified GSS failure.  Minor code may
>>>>>> provide more information (No Kerberos credentials available))
>>>>>> [05/Oct/2015:13:51:30 +0200] NSMMReplicationPlugin - changelog
>>>>>> program - agmt="cn=masterAgreement1-ipa02.internal-pki-tomcat"
>>>>>> (ipa02:389): CSN 54bea480000000600000 not found, we aren't as up
>>>>>> to date, or we purged
>>>>>> [05/Oct/2015:13:51:30 +0200] NSMMReplicationPlugin -
>>>>>> agmt="cn=masterAgreement1-ipa02.internal-pki-tomcat" (ipa02:389):
>>>>>> Data required to update replica has been purged. The replica must
>>>>>> be reinitialized.
>>>>>> [05/Oct/2015:13:51:30 +0200] NSMMReplicationPlugin -
>>>>>> agmt="cn=masterAgreement1-ipa02.internal-pki-tomcat" (ipa02:389):
>>>>>> Incremental update failed and requires administrator action
>>>>>> [05/Oct/2015:13:51:33 +0200] NSMMReplicationPlugin -
>>>>>> agmt="cn=meToipa02.internal" (ipa02:389): Replication bind with
>>>>>> GSSAPI auth resumed
>>>>>>
>>>>>>
>>>>>> These lines have been present since I replayed an LDIF dump from
>>>>>> ipa02 to ipa01, but I didn't think they related to the segfault
>>>>>> problem (which is why I said there were no related problems in the
>>>>>> logfile).
>>>>>>
>>>>>> But I am starting to believe that these errors could be related to
>>>>>> each other.
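>>>>>>
>>>>>> (For the record: the "replica must be reinitialized" message above
>>>>>> is for the pki-tomcat (CA) agreement. As far as I can tell, the
>>>>>> matching re-init, assuming ipa02 still holds good data, would be
>>>>>>
>>>>>>     ipa-csreplica-manage re-initialize --from ipa02.internal
>>>>>>
>>>>>> run on ipa01, with ipa-replica-manage re-initialize for the main
>>>>>> suffix if that one is affected as well.)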
>>>>>>
>>>>>>
>>>>>> Kind regards,
>>>>>> Dominik Korittki
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>> Nothing in /var/log/dirsrv/slapd-INTERNAL/errors that relates to
>>>>>>>> the problem.
>>>>>> Not sure about that anymore.
>>>>>>
>>>>>>>> I'm thinking about migrating to the latest FreeIPA 4 on CentOS 7,
>>>>>>>> but would that solve my problems?
>>>>>>>>
>>>>>>>> FreeIPA server version is 3.3.3-28.el7.centos
>>>>>>>> 389-ds-base.x86_64 is 1.3.1.6-26.el7_0
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Kind regards,
>>>>>>>> Dominik Korittki
>>>>>>>>




More information about the Freeipa-users mailing list