[Freeipa-users] 3.0.0-42 Replication issue after Centos6.5->6.6 upgrade
Rich Megginson
rmeggins at redhat.com
Fri Nov 21 15:49:04 UTC 2014
On 11/21/2014 04:51 AM, thierry bordaz wrote:
> On 11/21/2014 10:59 AM, dbischof at hrz.uni-kassel.de wrote:
>> Hi,
>>
>> On Thu, 20 Nov 2014, thierry bordaz wrote:
>>
>>> On 11/20/2014 12:03 PM, dbischof at hrz.uni-kassel.de wrote:
>>>>
>>>> On Thu, 20 Nov 2014, thierry bordaz wrote:
>>>>
>>>>> Server1 successfully replicated to Server2, but Server2 fails to
>>>>> replicated to Server1.
>>>>>
>>>>> The replication Server2->Server1 is done with kerberos
>>>>> authentication. Server1 receives the replication session,
>>>>> successfully identify the replication manager, start to receives
>>>>> replication extop but suddenly closes the connection.
>>>>>
>>>>>
>>>>> [19/Nov/2014:14:21:39 +0100] conn=2980 fd=78 slot=78 connection
>>>>> from
>>>>> xxx to yyy
>>>>> [19/Nov/2014:14:21:39 +0100] conn=2980 op=0 BIND dn="" method=sasl
>>>>> version=3 mech=GSSAPI
>>>>> [19/Nov/2014:14:21:39 +0100] conn=2980 op=0 RESULT err=14 tag=97
>>>>> nentries=0 etime=0, SASL bind in progress
>>>>> [19/Nov/2014:14:21:39 +0100] conn=2980 op=1 BIND dn="" method=sasl
>>>>> version=3 mech=GSSAPI
>>>>> [19/Nov/2014:14:21:39 +0100] conn=2980 op=1 RESULT err=14 tag=97
>>>>> nentries=0 etime=0, SASL bind in progress
>>>>> [19/Nov/2014:14:21:39 +0100] conn=2980 op=2 BIND dn="" method=sasl
>>>>> version=3 mech=GSSAPI
>>>>> [19/Nov/2014:14:21:39 +0100] conn=2980 op=2 RESULT err=0 tag=97
>>>>> nentries=0 etime=0 dn="krbprincipalname=xxx"
>>>>> [19/Nov/2014:14:21:39 +0100] conn=2980 op=3 SRCH base="" scope=0
>>>>> filter="(objectClass=*)" attrs="supportedControl
>>>>> supportedExtension"
>>>>> [19/Nov/2014:14:21:39 +0100] conn=2980 op=3 RESULT err=0 tag=101
>>>>> nentries=1 etime=0
>>>>> [19/Nov/2014:14:21:39 +0100] conn=2980 op=4 SRCH base="" scope=0
>>>>> filter="(objectClass=*)" attrs="supportedControl
>>>>> supportedExtension"
>>>>> [19/Nov/2014:14:21:39 +0100] conn=2980 op=4 RESULT err=0 tag=101
>>>>> nentries=1 etime=0
>>>>> [19/Nov/2014:14:21:39 +0100] conn=2980 op=5 EXT
>>>>> oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
>>>>> [19/Nov/2014:14:21:39 +0100] conn=2980 op=5 RESULT err=0 tag=120
>>>>> nentries=0 etime=0
>>>>> [19/Nov/2014:14:21:39 +0100] conn=2980 op=6 SRCH base="cn=schema"
>>>>> scope=0 filter="(objectClass=*)" attrs="nsSchemaCSN"
>>>>> [19/Nov/2014:14:21:39 +0100] conn=2980 op=6 RESULT err=0 tag=101
>>>>> nentries=1 etime=0
>>>>> [19/Nov/2014:14:21:39 +0100] conn=2980 op=-1 fd=78 closed - I/O
>>>>> function error.
>>>>>
>>>>> The reason of this closure is logged in server1 error log.
>>>>> sasl_decode fails to decode a received PDU.
>>>>>
>>>>> [19/Nov/2014:14:21:39 +0100] - sasl_io_recv failed to decode packet
>>>>> for connection 2980
>>>>>
>>>>> I do not know why it fails but I wonder if the received PDU is not
>>>>> larger than the maximum configured value. The attribute
>>>>> nsslapd-maxsasliosize is set to 2Mb by default. Would it be
>>>>> possible to increase its value (5Mb) to see if it has an impact
>>>>>
>>>>> [...]
>>>>
>>>> I set nsslapd-maxsasliosize to 6164480 on both machines, but the
>>>> problem remains.
>>>
>>> The sasl-decode fails but the exact returned value is not logged.
>>> With standard version we may need to attach a debugger and then set
>>> a conditional breakpoint in sasl-decode just after
>>> conn->oparams.decode that will fire if result !=0. Now this can
>>> change the dynamic and possibly prevent the problem to occur again.
>>> The other option is to use an instrumented version to log this value.
>>
>> If I understand the mechanism correctly, Server1 needs to have debug
>> versions of the relevant packages (probably 389-ds-base and
>> cyrus-sasl) installed in order to track down the problem.
>> Unfortunately, my Server1 is in production use - if I break it, my
>> colleagues will grab forks and torches and be after me. A short
>> downtime would be ok, though.
>>
>> Is there something else I could do?
>
> Hello,
>
> Sure I do not want to trigger so much trouble ;-)
>
>
> I think my email was not clear. To go further we would need to know
> the exact reason why sasl_decode fails. I see two options:
>
> * Prepare a debug version, that would report in the error logs the
> returned valud of sasl_decode (when it fails). Except downtime to
> install the debug version, it has no impact in production.
>
> * Do a debug session (gdb) on Server1. The debug session will
> install a breakpoint at a specific place, let the server run,
> catch the sasl_decode failure and note the return code, exit from
> debugger.
> When the problem occurs, it happens regularly (each 5 seconds) so
> we should not have to wait long.
> That means that debugging Server1 should disturb production for 5
> to 10 min.
> A detailed procedure to do the debug session is required.
>
For starters:
http://www.port389.org/docs/389ds/FAQ/faq.html#debugging-crashes
Since this is IPA you will need debuginfo packages for ipa and slapi-nis
in addition to the ones for 389.
Take a look at the Debugging Hangs section where it describes how to use
gdb to get a stack trace. If you can use that gdb command to get a
stack trace with the full debugging symbols (and if you don't know what
that means, just post the redacted stack trace somewhere and provide us
with a link to it), then you should be all ready to do a gdb session to
reproduce the error and "catch it in the act".
> *
>
> thanks
> thierry
>
>>
>>
>> Mit freundlichen Gruessen/With best regards,
>>
>> --Daniel.
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/freeipa-users/attachments/20141121/d968392c/attachment.htm>
More information about the Freeipa-users
mailing list