[Freeipa-users] performance scaling of sssd / freeipa

Sumit Bose sbose at redhat.com
Fri Jan 20 16:27:08 UTC 2017


On Fri, Jan 20, 2017 at 03:41:46PM +0000, Sullivan, Daniel [CRI] wrote:
> Hi,
> 
> I have some more information on this issue.  I’m tracing it down through the slapd logs and  I am continuing to struggle; I was hoping that somebody could possibly help me provided this additional information.
> 
> On the domain logs (on the DC), I see an ldap_search_ext operation 155 with a timeout of 60:
> 
> (Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [sdap_get_generic_ext_step] (0x0400): calling ldap_search_ext with [(&(objectClass=ipaOverrideAnchor)(ipaAnchorUUID=:SID:S-1-5-21-xxx))][cn=Default Trust View,cn=views,cn=accounts,dc=xxx,dc=xxx,dc=uchicago,dc=edu].
> (Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [sdap_get_generic_ext_step] (0x2000): ldap_search_ext called, msgid = 155
> (Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [sdap_op_add] (0x2000): New operation 155 timeout 60
> 
> Then, on the slapd log for the server, I see the incoming request.  Based on the result, it looks like the request completes successfully within 1 second:
> 
> [20/Jan/2017:08:42:35.179890746 -0600] conn=1518 op=31 SRCH base="cn=Default Trust View,cn=views,cn=accounts,dc=xxx,dc=xxx,dc=uchicago,dc=edu" scope=2 filter="(&(objectClass=ipaOverrideAnchor)(ipaAnchorUUID=:SID:S-1-5-21-xxx))" attrs=ALL
> [20/Jan/2017:08:42:35.683485674 -0600] conn=1518 op=31 RESULT err=0 tag=101 nentries=0 etime=1 notes=U
> 
> Then, subsequently, the domain log (on the DC), I see the same operation 155 timeout (expectedly almost a minute later).

So it looks like ns-ldap send the response but it never reached SSSD in
time. Can you send what it happening in the SSSD logs between 08:42:34
and 08:43:34 (if you prefer you can send it to me directly). Maybe there
is an operation which blocks SSSD for too long so that it returns too
late to the main loop to process the response?

bye,
Sumit

> 
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [sdap_op_timeout] (0x1000): Issuing timeout for 155
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [sdap_op_destructor] (0x1000): Abandoning operation 155
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [generic_ext_search_handler] (0x0040): sdap_get_generic_ext_recv failed [110]: Connection timed out
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [ipa_get_ad_override_done] (0x0040): ipa_get_ad_override request failed.
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [sdap_id_op_destroy] (0x4000): releasing operation connection
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [ipa_initgr_get_overrides_override_done] (0x0040): IPA override lookup failed: 110
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [ipa_id_get_groups_overrides_done] (0x0040): IPA resolve user groups overrides failed [110].
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [be_mark_dom_offline] (0x1000): Marking subdomain bsdad.uchicago.edu offline
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [be_mark_subdom_offline] (0x1000): Marking subdomain bsdad.uchicago.edu as inactive
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [ipa_srv_ad_acct_lookup_done] (0x0040): ipa_get_*_acct request failed: [110]: Connection timed out.
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [ipa_subdomain_account_done] (0x0040): ipa_get_*_acct request failed: [110]: Connection timed out.
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [sdap_id_op_destroy] (0x4000): releasing operation connection
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [dp_reply_std_set] (0x0080): DP Error is OK on failed request?
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [dp_req_done] (0x0400): DP Request [Account #4292]: Request handler finished [0]: Success
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [_dp_req_recv] (0x0400): DP Request [Account #4292]: Receiving request data.
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [dp_req_reply_list_success] (0x0400): DP Request [Account #4292]: Finished. Success.
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [dp_req_reply_std] (0x1000): DP Request [Account #4292]: Returning [Success]: 0,110,Success
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [dp_table_value_destructor] (0x0400): Removing [0:1:0x0001:1:1::bsdad.uchicago.edu:name=userame at domain.uchicago.edu] from reply table
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [dp_req_destructor] (0x0400): DP Request [Account #4292]: Request removed.
> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [dp_req_destructor] (0x0400): Number of active DP request: 1
> 
> It seems like whatever problem I have is a communication issue between sssd on the domain controller, and ns-slapd.  I’m starting to venture down the road of doing trace/packet level debug logging on the LDAP server; this could be very time consuming and frustrating (I already have it enabled, I am starting to manipulate the log settings to yield potentially valuable data), I’m hoping that somebody might be able to provide some insight based on the information above; I can definitely lookup the user on both domain controllers & both IPA servers only use themselves for IPA servers.  Thank you so much for reading and for your help.
> 
> Dan
> 
> 
> > On Jan 19, 2017, at 4:15 PM, Sullivan, Daniel [CRI] <dsullivan2 at bsd.uchicago.edu> wrote:
> > 
> > Hi,
> > 
> > I’ve received incredibly good support from this mailing list previously; I am hoping that somebody can help me succeed in my ongoing efforts.  I have spent a few days on this at this point and I can’t seem to figure it out how to address this issue.  On my DCs I am seeing excessive ldap_search_ext and sdap_get_generic_ext_recv timeouts created solely by the invocation of the ‘id’ command on sssd clients.  This problem seems to present itself only when I parallelize lookups for an ‘uncached’ user (i.e. I have never performed an initial lookup).  Individual arbitrary one-off lookups for a single uncached user on a single system almost always work fine.  This leads me to believe this is a performance tuning issue.
> > 
> > We operate in an academic research computing unit (i.e. we have an HPC cluster), and I need the ability to lookup the same user in parallel (using the id command) across a relatively large number of systems, for example to spawn jobs that require large amounts of CPU cores and/or memory.  Right now I am doing about 50 parallel lookups for the same user to induce this problem.  
> > 
> > Here is some background information:
> > 
> > 1) I have read Jakub's “Anatomy of an SSSD Lookup” as well as “Performance Tuning of SSSD for large IPA-AD deployments”, as well as implemented recommendations from the performance tuning doc, including moving the sssd cache to tmpfs.
> > 2) We are on ipa-server 4.4.0-14.el7_3.4 using a trusted AD domain; all of our consumed users and groups are in the AD trusted domain.  We have two domain controllers; each is a RHEL 7.3 VM with 6 GB of memory.  Almost all (if not all) of our clients are running at least sssd 1.14, and are all RHEL 6/7.  Neither DC is swapping, and both have 2 CPUs.
> > 3) I have tuned SSSD clients on the DCs and all clients to include these options (the problem persists):
> >  a) ldap_opt_timeout = 60
> >  b) ldap_search_timeout = 60
> > 4) On both DCs, I can clear the SSSD cache, and lookup all 2000 or so users in my environment with 40 concurrent lookups occurring locally on each DC (using UNIX job control).  I can process all 2000 lookups in this manner without any failures (on either DC), and have ‘pre-populated’ the SSSD cache on both DC’s by doing this.
> > 6) I have made no additional performance tuning changes other than what has been described.
> > 
> > Would anybody be able to advise on any potential tuning that would be required (presumably on the DCs), to facilitate 50 parallel lookups without experiencing sdap_get_generic_ext_recv or  ldap_search_ext  timeouts?  Should I be able to do this sort of thing with relative ease?  I was hoping this would be the sort of thing that would just work, but based on my relatively extensive testing it doesn’t.  Any advice anybody could provide would be greatly appreciated.
> > 
> > Thank you,
> > 
> > Dan Sullivan
> > 
> > 
> 
> 
> -- 
> Manage your subscription for the Freeipa-users mailing list:
> https://www.redhat.com/mailman/listinfo/freeipa-users
> Go to http://freeipa.org for more info on the project




More information about the Freeipa-users mailing list