[Freeipa-users] performance scaling of sssd / freeipa

Sullivan, Daniel [CRI] dsullivan2 at bsd.uchicago.edu
Thu Jan 26 15:47:04 UTC 2017


Sumit,

Thank you for the detailed and thorough reply; I really appreciate you taking the time to get back to me.

What you describe makes sense and is in line with my thoughts and observations (and with my reading of Jakub's performance tuning doc).  You articulated this very well; thank you.

We had already increased the 389ds thread number from 30 to 180 and increased the number of processor cores to 6 (based on our reading of 389's documentation, roughly 30 threads per core is appropriate).  This actually helped the problem significantly.

This morning, we upped that again to 12 cores and 360 threads for 389ds, and this made the situation even better.  The problem has largely gone away at this point.
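
For anyone following along, the change boils down to replacing nsslapd-threadnumber on cn=config and restarting the directory server; roughly the following (untested as pasted; the bind DN is the default Directory Manager, and on an IPA master the restart can be done with ipactl):

# raise the 389ds worker thread count (takes effect after a restart)
ldapmodify -x -D "cn=Directory Manager" -W <<EOF
dn: cn=config
changetype: modify
replace: nsslapd-threadnumber
nsslapd-threadnumber: 360
EOF

ipactl restart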

I turned on the ignore_group_members setting long ago; it did not help with this particular problem.
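
For completeness, the setting from Jakub's post goes in the domain section of sssd.conf on the IPA masters (the domain name below is a placeholder), followed by an sssd restart:

[domain/example.com]
ignore_group_members = True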

I considered enabling enumeration to try to address this issue, though for some reason I thought that feature wasn't implemented for AD domains.
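
For reference, enumeration is a per-domain switch in sssd.conf; a minimal sketch with a placeholder domain name (and, as noted, I am not sure it is honored for trusted AD domains):

[domain/example.com]
enumerate = true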

Anyway, I think we are good for now, at least on this issue, thanks especially to your description of what is actually occurring.

Thank you for your time and for your help!

Best,

Dan

> On Jan 26, 2017, at 9:15 AM, Sumit Bose <sbose at redhat.com> wrote:
> 
> On Wed, Jan 25, 2017 at 10:58:34PM +0000, Sullivan, Daniel [CRI] wrote:
>> Hi,
>> 
>> My apologies for resurrecting this thread.  This issue is still ongoing; at this point we've been looking at it for over a week and now have more than one staff member analyzing and trying to resolve it on a full-time basis.  I have some more information that I was hoping a seasoned IPA expert could take a look at.  At this point I am fairly certain it is a performance tuning issue in either sssd or FreeIPA on our domain controllers.  It looks to me like the main issue is that when looking up the same user across a large number of nodes in parallel, all of our available 389ds threads get blocked in '__lll_robust_lock_wait ()' for operations involving ipa_extdom_common.c.  This usually occurs on one of our two DCs, but occasionally on both.  For example, out of the 199 threads in the attached output, 179 are in the state __lll_robust_lock_wait ().  All of the user1 at xxx.uchicago.edu entries in this attachment are the same user.
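>> 
>> For anyone wanting to capture the same kind of data, a per-thread
>> stack dump like the attached one can be taken with gdb (note this
>> briefly stops the process), e.g.:
>> 
>>   # dump a backtrace of every ns-slapd thread to a file
>>   gdb -p $(pidof ns-slapd) -batch -ex 'thread apply all bt' > ns-slapd-threads.txt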
>> 
>> Here is more information about this issue (some of it repeated for convenience):
>> 
>>  1.  We currently have 2 domain controllers.  Each has 6 processor cores and 180 threads allocated for 389ds.  We have gone through Red Hat's performance tuning guide for directory services, made what we felt were appropriate changes, and made additional tuning modifications to lower eviction rates and raise cache hit numbers for 389ds.  We have approximately 220 connections to our domain controllers (from "cn=monitor"); depending on the test, I've seen as many as 190 connected to a single DC.
>>  2.  We are using an AD domain where all of our users and groups reside.
>>  3.  I induce this by using the id command to look up, on a large number of nodes (maybe 200), a user that has never been looked up before and is not cached on either the client or the DC (see the sketch below).
>>  4.  Before I induce the problem, I can look up entries in LDAP without delay or problem (i.e. the LDAP server is performant and responsive; I can inspect cn=monitor or cn=config and get instantaneous results).
>>  5.  When I do induce the issue, the LDAP server basically becomes unresponsive (which is expected based on the attached output).  Servicing a query using the ldapsearch tool (for either cn=monitor or cn=config) can take upwards of 1-2 minutes or longer.  Eventually the LDAP server will 'recover', i.e. I do not typically need to restart IPA services to get this working again.
>>  6.  After a lookup fails, subsequent parallel lookups succeed and return the desired record (presumably from the cache).
>>  7.  It appears that these failures are also characterized by a corresponding "[monitor_hup] (0x0020): Received SIGHUP." in the sssd log.
>>  8.  Right before the problem occurs I see a brief spike in CPU utilization of the ns-slapd process, then the utilization basically drops to 0 once the threads are blocked in ns-slapd.
>>  9.  Since we are doing computation in our IPA environment, it is important that we can perform these types of parallel operations against our IPA environment at the scale we are testing.
>> 
>> I feel like we are either DoSing the LDAP server or the sssd_be / sssd_nss processes, although I am not sure.  Right now we are in the process of deploying an additional domain controller to see if that helps with distribution of load.  If anybody could provide any sort of information with respect to addressing the issue in the attached trace, I would be very grateful.
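>> 
>> The reproduction in point 3 amounts to something like the following
>> sketch (pdsh and the node range stand in for whatever parallel shell
>> and host list you use; the user is the test user from above):
>> 
>>   # trigger ~200 parallel, uncached lookups of the same AD user
>>   pdsh -w node[001-200] 'id user1@xxx.uchicago.edu'
>> 
>>   # meanwhile, gauge DS responsiveness on a DC
>>   time ldapsearch -x -D "cn=Directory Manager" -W -b cn=monitor \
>>     -s base '(objectClass=*)' currentconnections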
> 
> I think your observations are due to the fact that SSSD currently
> serializes connections from a single process. Your clients will call the
> extdom extended LDAP operation on the IPA server to get the information
> about the user from the trusted domain. The extdom plugin runs inside of
> 389ds and each client connection will run in a different thread. To get
> the information about the user from the trusted domain the extdom plugin
> calls SSSD, and this is where the serialization happens, i.e. all
> threads have to wait until the first one gets its results before the
> next thread can talk to SSSD.
> 
> With an empty cache the initial lookup of a user and all its groups
> will take some time, and since you used quite a number of clients, all
> 389ds worker threads will be "busy" waiting to talk to SSSD, making it
> hard for other requests, even ones which do not need to talk to SSSD,
> to get through because there are no free worker threads.
> 
> To improve the situation, setting 'ignore_group_members=True' as
> described at
> https://jhrozek.wordpress.com/2015/08/19/performance-tuning-sssd-for-large-ipa-ad-trust-deployments/
> (which you already mentioned) might help.
> 
> Although in general not recommended, depending on the size of the
> trusted domain (i.e. the number of users and groups in the trusted
> domain), enabling enumeration for SSSD on the IPA servers might help
> as well; see man sssd.conf for details.
> 
> For the responsiveness of 389ds it might help to increase the number
> of worker threads; see the nsslapd-threadnumber parameter in the 389ds
> docs, e.g.
> https://access.redhat.com/documentation/en-US/Red_Hat_Directory_Server/10/html/Configuration_Command_and_File_Reference/Core_Server_Configuration_Reference.html#cnconfig-nsslapd_threadnumber_Thread_Number
> But with this large a number of clients, the clients might just use up
> even a reasonably increased number of worker threads.
> 
> HTH
> 
> bye,
> Sumit
> 
>> 
>> Regards,
>> 
>> Dan Sullivan