[Freeipa-users] performance scaling of sssd / freeipa

Wed Jan 25 22:58:34 UTC 2017

Hi,

My apologies for resurrecting this thread.  This issue is still ongoing; at this point we’ve been looking at it for over a week and now have more than one staff member analyzing and trying to resolve it on a full-time basis.  I have some more information that I was hoping a seasoned IPA expert could take a look at.  At this point I am fairly certain it is a performance tuning issue in either sssd or FreeIPA on our domain controllers.  It looks to me like the main issue is that when looking up the same user across a large number of nodes in parallel, all of our available 389ds threads get blocked in '__lll_robust_lock_wait ()' for operations involving ipa_extdom_common.c.  This usually occurs on one of our two DCs, but occasionally on both.  For example, out of 199 threads in the attached output, 179 are in __lll_robust_lock_wait ().  All of the user1 at xxx.uchicago.edu entries in this attachment are the same user.
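For anyone who wants to check the blocked-thread count against a trace of their own, here is a minimal sketch in Python.  It assumes the trace is plain-text output from gdb's "thread apply all bt", where each thread starts with a "Thread N (...)" line followed by its "#0" frame.  The function name count_top_frames and the sample text are made up for illustration; they are not taken from the attachment.

```python
import re
from collections import Counter

# A thread header in "thread apply all bt" output, e.g.
#   Thread 12 (Thread 0x7f... (LWP 4321)):
THREAD_RE = re.compile(r"^Thread \d+ \(")
# The innermost frame, with or without a leading address, e.g.
#   #0  0x00007f... in __lll_robust_lock_wait () from /lib64/libpthread.so.0
#   #0  poll () at poll.c:10
FRAME0_RE = re.compile(r"^#0\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([^\s(]+)")


def count_top_frames(gdb_text):
    """Count threads by the function name in their innermost (#0) frame."""
    counts = Counter()
    awaiting_frame0 = False
    for line in gdb_text.splitlines():
        if THREAD_RE.match(line):
            awaiting_frame0 = True
            continue
        if awaiting_frame0:
            match = FRAME0_RE.match(line)
            if match:
                counts[match.group(1)] += 1
                awaiting_frame0 = False
    return counts


if __name__ == "__main__":
    # Hypothetical two-thread sample, not real data from our servers.
    sample = (
        "Thread 2 (Thread 0x7f01 (LWP 11)):\n"
        "#0  0x00007f01 in __lll_robust_lock_wait () from /lib64/libpthread.so.0\n"
        "Thread 1 (Thread 0x7f02 (LWP 12)):\n"
        "#0  poll () at poll.c:10\n"
    )
    for name, n in count_top_frames(sample).most_common():
        print(n, name)
```

Run against a real trace, the most common entry shows where most threads are parked; the sum of all counts is the number of threads that had a "#0" frame.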

Here is more information about this issue (some of it repeated for convenience):

  1.  We currently have 2 domain controllers.  Each has 6 processor cores and 180 threads allocated for 389ds.  We have gone through Red Hat’s performance tuning guide for directory services, made what we felt were appropriate changes, and made additional tuning modifications to get lower eviction rates and high cache hit numbers for 389ds.  We have approximately 220 connections to our domain controllers (from "cn=monitor"); depending on the test, I’ve seen as many as 190 connected to a single DC.
  2.  We are using an AD domain where all of our users and groups reside.
  3.  I induce this by running the id command on a large number of nodes (maybe 200) in parallel for a user that has never been looked up before and is not cached on either the client or the DC.
  4.  Before I induce the problem, I can look up entries in LDAP without delay or problem (i.e. the LDAP server is performant and responsive; I can inspect cn=monitor or cn=config and get instantaneous results).
  5.  When I do induce the issue, the LDAP server basically becomes unresponsive (which is expected based on the attached output).  Servicing a query using the ldapsearch tool (for either cn=monitor or cn=config) can take 1-2 minutes or longer.  Eventually the LDAP server will ‘recover’, i.e. I do not typically need to restart IPA services to get this working again.
  6.  After a lookup fails, subsequent parallel lookups succeed and return the desired record (presumably from the cache).
  7.  It appears that these failures are also characterized by a corresponding "[monitor_hup] (0x0020): Received SIGHUP." in the sssd log.
  8.  Right before the problem occurs I see a brief spike in CPU utilization of the ns-slapd process, then the utilization basically drops to 0 once the threads are blocked in ns-slapd.
  9.  Since we are doing computation in our IPA environment, it is important that we can perform these types of parallel operations against our IPA environment at the scale we are testing.
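
The parallel lookup described in item 3 can be sketched roughly as follows, in Python.  This is only an illustration under assumptions: it fans out over ssh (not necessarily how we launch the lookups), and the host names, the user name, and the function names run_id_over_ssh and parallel_lookup are hypothetical.

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor


def run_id_over_ssh(host, user, timeout=120):
    """Run `id <user>` on a remote host via ssh and return its exit code.

    Assumes key-based ssh access; BatchMode=yes avoids password prompts.
    """
    proc = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, "id", user],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return proc.returncode


def parallel_lookup(hosts, user, run=run_id_over_ssh):
    """Look up `user` on all hosts at once.

    Returns a list of (host, exit_code, seconds) in the order of `hosts`;
    exit_code is None if the lookup raised (for example, on timeout).
    """
    def timed(host):
        start = time.monotonic()
        try:
            rc = run(host, user)
        except Exception:
            rc = None
        return host, rc, time.monotonic() - start

    # One worker per host so that every lookup starts at about the same time.
    with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
        return list(pool.map(timed, hosts))


if __name__ == "__main__":
    # Stand-in runner so the sketch can be tried without any ssh access.
    fake_run = lambda host, user: 0
    hosts = ["node%03d" % i for i in range(1, 6)]  # hypothetical node names
    for host, rc, secs in parallel_lookup(hosts, "user1", run=fake_run):
        print(host, rc, round(secs, 3))
```

With the default runner, a nonzero or None exit code marks a node where the lookup failed, and the elapsed time makes the slow lookups easy to spot.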

I feel like we are either DoSing the LDAP server or the sss_be / sss_nss processes, although I am not sure.  Right now we are in the process of deploying an additional domain controller to see if that helps with distribution of load.  If anybody could provide any sort of information with respect to addressing the issue in the attached trace, I would be very grateful.

Regards,

Dan Sullivan

-------------- next part --------------
A non-text attachment was scrubbed...
Name: gdb.txt.zip
Type: application/zip
Size: 77563 bytes
Desc: gdb.txt.zip
URL: <http://listman.redhat.com/archives/freeipa-users/attachments/20170125/0c56bf90/attachment.zip>