[Freeipa-users] 3.0.0-42 Replication issue after Centos6.5->6.6 upgrade

Rich Megginson rmeggins at redhat.com
Tue Nov 25 22:16:32 UTC 2014


On 11/25/2014 12:32 PM, dbischof at hrz.uni-kassel.de wrote:
> Hi,
>
> with the help of Thierry and Rich I managed to debug the running 
> ns-slapd on Server1 (see below). The failing attempt of decoding the 
> SASL data returns a not very fruitful "-1" (SASL_FAIL, "generic 
> failure").
>
> Any ideas? Short summary:
>
> Server1 = running IPA server
> Server2 = intended IPA replica
>
> Both machines run the exact same, up-to-date version of CentOS 6.6. 
> However: I had to run "ipa-replica-install" _without_ the option 
> "--setup-ca" (didn't work, installation failed with some obscure Perl 
> error), so there's no ns-slapd instance running for PKI-IPA. May this 
> be related?

Are you asking if not having --setup-ca would cause "sasl_io_recv failed 
to decode packet  for connection 2980"?  Not that I know of.

At this point, it's going to take more than a trivial amount of high-latency 
back-and-forth on the mailing lists.  I think we have probably run out of 
log levels for you to try.  Please open a ticket against IPA.  While this 
may turn out to be a bug in 389, at the moment it is only reproducible in 
your IPA environment.

The fastest way to get to the bottom of this problem would be for a 389 
developer to run an interactive gdb session on your production machine 
and poke around.  That is, allow one of us to ssh into the machine and 
run gdb (which will kill the performance and cause outages unless this 
machine can be taken out of rotation somehow). What would we be looking 
for?  I don't know, but hopefully we would know it when we see it.


>
> On Fri, 21 Nov 2014, Rich Megginson wrote:
>
>> On 11/21/2014 04:51 AM, thierry bordaz wrote:
>>> On 11/21/2014 10:59 AM, dbischof at hrz.uni-kassel.de wrote:
>>>> On Thu, 20 Nov 2014, thierry bordaz wrote:
>>>>> On 11/20/2014 12:03 PM, dbischof at hrz.uni-kassel.de wrote:
>>>>>> On Thu, 20 Nov 2014, thierry bordaz wrote:
>>>>>>
>>>>>>> Server1 successfully replicates to Server2, but Server2 fails to 
>>>>>>> replicate to Server1.
>>>>>>>
>>>>>>> The replication Server2->Server1 is done with Kerberos 
>>>>>>> authentication. Server1 receives the replication session, 
>>>>>>> successfully identifies the replication manager, and starts to 
>>>>>>> receive replication extended operations, but then suddenly closes 
>>>>>>> the connection.
>>>>>>>
>>>>>>>
>>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 fd=78 slot=78 connection from xxx to yyy
>>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=0 BIND dn="" method=sasl version=3 mech=GSSAPI
>>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=0 RESULT err=14 tag=97 nentries=0 etime=0, SASL bind in progress
>>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=1 BIND dn="" method=sasl version=3 mech=GSSAPI
>>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=1 RESULT err=14 tag=97 nentries=0 etime=0, SASL bind in progress
>>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=2 BIND dn="" method=sasl version=3 mech=GSSAPI
>>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=2 RESULT err=0 tag=97 nentries=0 etime=0 dn="krbprincipalname=xxx"
>>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=3 SRCH base="" scope=0 filter="(objectClass=*)" attrs="supportedControl supportedExtension"
>>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=3 RESULT err=0 tag=101 nentries=1 etime=0
>>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=4 SRCH base="" scope=0 filter="(objectClass=*)" attrs="supportedControl supportedExtension"
>>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=4 RESULT err=0 tag=101 nentries=1 etime=0
>>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=5 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
>>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=5 RESULT err=0 tag=120 nentries=0 etime=0
>>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=6 SRCH base="cn=schema" scope=0 filter="(objectClass=*)" attrs="nsSchemaCSN"
>>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=6 RESULT err=0 tag=101 nentries=1 etime=0
>>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=-1 fd=78 closed - I/O function error.
>>>>>>>
>>>>>>> The reason for this closure is logged in Server1's error log: 
>>>>>>> sasl_decode fails to decode a received PDU.
>>>>>>>
>>>>>>>   [19/Nov/2014:14:21:39 +0100] - sasl_io_recv failed to decode packet for connection 2980
>>>>>>>
>>>>>>> I do not know why it fails, but I wonder whether the received PDU 
>>>>>>> is larger than the maximum configured value. The attribute 
>>>>>>> nsslapd-maxsasliosize is set to 2MB by default. Would it be 
>>>>>>> possible to increase its value (to, say, 5MB) to see if it has an 
>>>>>>> impact?
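>>>>>>>
>>>>>>> For example, something like the following LDIF, applied to 
>>>>>>> cn=config on both servers with ldapmodify (just a sketch; 5242880 
>>>>>>> bytes = 5MB):
>>>>>>>
>>>>>>>   dn: cn=config
>>>>>>>   changetype: modify
>>>>>>>   replace: nsslapd-maxsasliosize
>>>>>>>   nsslapd-maxsasliosize: 5242880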
>>>>>>>
>>>>>>> [...]
>>>>>>
>>>>>> I set nsslapd-maxsasliosize to 6164480 on both machines, but the 
>>>>>> problem remains.
>>>>>
>>>>> The sasl_decode call fails, but the exact return value is not 
>>>>> logged. With the standard version we may need to attach a debugger 
>>>>> and set a conditional breakpoint in sasl_decode, just after the 
>>>>> call to conn->oparams.decode, that fires if result != 0. However, 
>>>>> attaching a debugger can change the timing and possibly prevent the 
>>>>> problem from occurring again. The other option is to use an 
>>>>> instrumented version that logs this value.
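>>>>>
>>>>> A simpler variant of the same idea (a sketch only; it assumes a 
>>>>> single ns-slapd instance and installed debuginfo) is to break at 
>>>>> the entry of sasl_decode and let gdb report its return value:
>>>>>
>>>>>   gdb -p $(pidof ns-slapd)
>>>>>   (gdb) break sasl_decode
>>>>>   (gdb) continue
>>>>>   ... wait until the replication connection hits the breakpoint ...
>>>>>   (gdb) finish
>>>>>   (gdb) detach
>>>>>   (gdb) quit
>>>>>
>>>>> "finish" runs until sasl_decode returns and prints "Value returned 
>>>>> is ...", which is the code we are after. It stops on every 
>>>>> sasl_decode call, not only the failing one, so it may take a few 
>>>>> continue/finish rounds before the bad packet arrives.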
>>>>
>>>> If I understand the mechanism correctly, Server1 needs to have 
>>>> debug versions of the relevant packages (probably 389-ds-base and 
>>>> cyrus-sasl) installed in order to track down the problem. 
>>>> Unfortunately, my Server1 is in production use - if I break it, my 
>>>> colleagues will grab forks and torches and be after me. A short 
>>>> downtime would be ok, though.
>>>>
>>>> Is there something else I could do?
>>>
>>> Sure, I do not want to cause so much trouble ;-)
>>>
>>>
>>> I think my email was not clear. To go further we would need to know 
>>> the exact reason why sasl_decode fails. I see two options:
>>>
>>>   * Prepare a debug version that would report in the error log the
>>>     return value of sasl_decode (when it fails). Apart from the
>>>     downtime needed to install the debug version, it has no impact on
>>>     production.
>>>
>>>   * Do a debug session (gdb) on Server1. The debug session would set
>>>     a breakpoint at a specific place, let the server run, catch the
>>>     sasl_decode failure, note the return code, and then exit the
>>>     debugger.
>>>     When the problem occurs, it happens regularly (every 5 seconds),
>>>     so we should not have to wait long.
>>>     That means debugging Server1 should disturb production for only 5
>>>     to 10 minutes.
>>>     A detailed procedure for the debug session is required.
>>>
>> For starters: 
>> http://www.port389.org/docs/389ds/FAQ/faq.html#debugging-crashes
>>
>> Since this is IPA you will need debuginfo packages for ipa and 
>> slapi-nis in addition to the ones for 389.
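>>
>> For example, something along these lines should pull them in on CentOS 
>> 6, assuming the debuginfo repositories are enabled (debuginfo-install 
>> is part of yum-utils; the exact package names may differ on your 
>> system):
>>
>>   debuginfo-install 389-ds-base ipa-server slapi-nis cyrus-sasl-lib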
>>
>> Take a look at the Debugging Hangs section where it describes how to 
>> use gdb to get a stack trace.  If you can use that gdb command to get 
>> a stack trace with the full debugging symbols (and if you don't know 
>> what that means, just post the redacted stack trace somewhere and 
>> provide us with a link to it), then you should be all ready to do a 
>> gdb session to reproduce the error and "catch it in the act".
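>>
>> From memory, the hang-debugging command on that page is roughly of 
>> this form (check the page for the exact invocation; it assumes a 
>> single ns-slapd instance and dumps a full stack trace of every thread):
>>
>>   gdb -ex 'set confirm off' -ex 'set pagination off' \
>>       -ex 'thread apply all bt full' -ex 'quit' \
>>       /usr/sbin/ns-slapd $(pidof ns-slapd) > stacktrace.$(date +%s).txt 2>&1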
>
>
> Mit freundlichen Gruessen/With best regards,
>
> --Daniel.



