[Freeipa-users] 'Request is a replay'

Thu Jul 26 13:37:48 UTC 2012

On 07/26/2012 02:53 PM, Rob Crittenden wrote:
> Sigbjorn Lie wrote:
>> On Wed, July 25, 2012 09:54, Sigbjorn Lie wrote:
>>> On Tue, July 24, 2012 20:29, Simo Sorce wrote:
>>>
>>>> On Tue, 2012-07-24 at 10:22 +0200, Sigbjorn Lie wrote:
>>>>
>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>>
>>>>> I keep seeing the error message "Request is a replay" in our
>>>>> production environment, in various services using Kerberos such as
>>>>> ssh, sssd, automounter, squid and more, after the upgrade to
>>>>> RHEL 6.3 / IPA 2.2.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Jul 24 10:16:11 server027 sssd_be: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Request is a replay)
>>>>>
>>>>> Searching Google seems to suggest that this is a time-related
>>>>> error. However, we have NTP configured (the IPA servers act as NTP
>>>>> servers) and synchronized to external NTP servers. There has been
>>>>> no issue before, and I cannot find the time being out of sync on
>>>>> any of the machines where this is happening.
>>>>
>>>> This error usually appears only when the same request is found in the
>>>> replay cache. It shouldn't be related to time issues; in that case
>>>> you usually get a clock-skew error.
>>>>
>>>> Can you tell me what operation sssd was performing when you caught
>>>> that error? Can you check whether another identical operation had
>>>> been performed immediately before?
>>>>
>>>
>>> That being said, I do have 1 IPA server (out of 3) that has
>>> significantly higher CPU usage than the other 2. Its 15-minute load
>>> average sits between 0.85 and 0.95 the entire day, and the ns-slapd
>>> (389-ds) process is running at 100% most of the time.
>>>
>>> Load: 1.02, 0.94, 0.87
>>>
>>>
>>> In comparison, the other two IPA servers have a 15-minute load average
>>> between 0.10 and 0.30 throughout the day, and their ns-slapd process
>>> is far from being such a CPU hog.
>>>
>>> On the server with the high load, even a command such as "ipactl
>>> status" can take up to 20 seconds to complete: "Directory Service:
>>> RUNNING" returns after a second or so, and listing the rest of the
>>> services takes the remaining 19 seconds.
>>>
>>> Also, the web interface on this particular IPA server is rendered
>>> unusable, returning "Limits exceeded for the query" for almost any
>>> action.
>>>
>>> Restarting all the IPA services (ipactl restart) on the problematic
>>> host somewhat improves the situation; however, that server quickly
>>> returns to heavy load.
>>>
>>> Analyzing the dirsrv access log with logconv.pl shows that the server
>>> in question has the lowest search rate, at 106 queries/min. The other
>>> servers have 710 queries/min and 168 queries/min.
>>>
>>> For modifications, all the IPA servers have about 5-6 queries/sec. For
>>> unindexed searches, the problematic server has the lowest number. It
>>> does, however, have more than twice as many GSSAPI binds as the other
>>> servers, with over 61000 GSSAPI binds over a 17-hour period.
>>>
>>>
>>> The problematic server is a physical server with 2 x AMD 2.4GHz
>>> quad-core CPUs and 8GB of RAM.
>>>
>>>
>>> This issue is also impacting all the clients, where I see random
>>> hangs with anything involving an LDAP or Kerberos query to the IPA
>>> servers.
>>>
>>> Any suggestions?
>>>
>>>
>>
>> Anyone ?
>>
>> I am starting to see the replay error when using the "ipa" CLI tool
>> as well, causing the request to fail with an error.
>>
>> ipa dnsrecord-show example.com hostname
>> ipa: ERROR: Local error: SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Request is a replay)
>
> Sorry, I had started a reply yesterday and got side-tracked and never 
> sent it.
>
I know that feeling. :)
> For the one server that is busier than the others, how are your
> clients configured? Are you using DNS SRV records?
>
We use DNS SRV records for everything LDAP that supports it: SSSD and
the Linux automounter. Solaris clients, Red Hat 5 clients using
nss_ldap, and NetApp use statically configured servers; however, the
busy server is only the second entry in the server list for these
machines. The primary server got more than 7x as many LDAP queries per
minute, and the load on the primary is much, much lower. All Kerberos
clients use DNS SRV for lookups; there is no static configuration there.
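
To illustrate how SRV-based clients end up choosing a server, here is a rough Python sketch of the selection order that RFC 2782 describes: lowest priority number first, then a weighted random pick among records that share a priority. It is only an approximation of the RFC's algorithm and not what SSSD or the automounter literally run. The function name, the hostnames, and the priority and weight values are all hypothetical, since the actual SRV records are not shown in this thread.

```python
import random
from collections import defaultdict


def order_srv_targets(records, rng=random):
    """Return SRV targets in the order a client would try them.

    records: iterable of (priority, weight, target) tuples.
    Lower priority numbers are tried first; within one priority the
    order is a weighted random draw without replacement.
    """
    by_priority = defaultdict(list)
    for priority, weight, target in records:
        by_priority[priority].append((weight, target))

    ordered = []
    for priority in sorted(by_priority):
        pool = list(by_priority[priority])
        while pool:
            weights = [w for w, _ in pool]
            if sum(weights) == 0:
                # All remaining weights are zero: pick uniformly.
                idx = rng.randrange(len(pool))
            else:
                idx = rng.choices(range(len(pool)), weights=weights)[0]
            ordered.append(pool.pop(idx)[1])
    return ordered


# Hypothetical records, NOT the real ones from this environment.
records = [
    (0, 100, "ipa1.example.com"),
    (0, 100, "ipa2.example.com"),
    (10, 100, "ipa3.example.com"),
]
ordered = order_srv_targets(records)
print(ordered)
```

With equal priority and weight, clients spread roughly evenly across the servers, while a record with a higher priority number is only tried after the others. Removing a record entirely, as described in the next paragraph, only affects clients once they look up the records again.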

I see some hiccups on the clients as well, when browsing NFS shares
(looking up UIDs), unlocking a client, etc. These seem to be related to
the "faulty" IPA server with the high load, as it also seems to respond
very slowly to a lot of LDAP queries. I removed it from the DNS SRV
records an hour ago, and things seem to run smoother. A few services are
still querying it though, and the load on the "faulty" server is still
high even with fewer clients. The primary server that's now receiving
most of the queries has barely increased in CPU usage at all.

> For the replay, are your servers running on bare metal or in VMs? How
> about the clients? This sure seems like a time issue.

The time is configured as it has been for a long time. The physical IPA
servers are synchronized from external time sources and provide time to
the rest of the network. We have 2 physical servers and 1 virtual
server. I have looked into the time, and everything does seem to be
synchronized.
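
To make the difference between the two failure modes concrete, here is a simplified Python model of a Kerberos-style replay cache. It is a sketch of the general idea only, not MIT Kerberos code: the class, method, and return strings are hypothetical, and a real replay cache keys on more fields (for example the server principal). The 300-second window is the usual default clock-skew tolerance and is an assumption about this setup.

```python
import time

CLOCK_SKEW = 300  # seconds; assumed default tolerance


class ReplayCache:
    """Toy replay cache keyed on (client, ctime, cusec)."""

    def __init__(self, skew=CLOCK_SKEW):
        self.skew = skew
        self.seen = {}  # (client, ctime, cusec) -> ctime

    def check(self, client, ctime, cusec, now=None):
        now = time.time() if now is None else now
        # Forget entries too old to be accepted again anyway.
        self.seen = {k: t for k, t in self.seen.items()
                     if now - t <= self.skew}
        # A timestamp outside the window is a clock-skew error,
        # which is a different failure from a replay.
        if abs(now - ctime) > self.skew:
            return "clock skew too great"
        key = (client, ctime, cusec)
        if key in self.seen:
            # The same authenticator was already presented.
            return "request is a replay"
        self.seen[key] = ctime
        return "ok"


cache = ReplayCache()
first = cache.check("host/server027", 1000.0, 1, now=1000.0)
second = cache.check("host/server027", 1000.0, 1, now=1000.0)
print(first, "/", second)
```

In this model a clock that is merely out of sync produces the skew error, whereas the replay error needs the same authenticator to arrive twice within the window.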

The number of clients has not changed much over the last few months.

These issues started appearing just after the upgrade to RHEL 6.3 / IPA 2.2.

Any suggestions on where to continue the troubleshooting?

Regards,
Siggi