[Freeipa-users] 3.0.0-42 Replication issue after Centos6.5->6.6 upgrade

thierry bordaz tbordaz at redhat.com
Fri Nov 21 10:51:39 UTC 2014


On 11/21/2014 10:59 AM, dbischof at hrz.uni-kassel.de wrote:
> Hi,
>
> On Thu, 20 Nov 2014, thierry bordaz wrote:
>
>> On 11/20/2014 12:03 PM, dbischof at hrz.uni-kassel.de wrote:
>>>
>>> On Thu, 20 Nov 2014, thierry bordaz wrote:
>>>
>>>> Server1 successfully replicates to Server2, but Server2 fails to 
>>>> replicate to Server1.
>>>>
>>>> The replication Server2->Server1 is done with Kerberos 
>>>> authentication. Server1 receives the replication session, 
>>>> successfully identifies the replication manager, and starts to 
>>>> receive replication extended operations (extop), but then suddenly 
>>>> closes the connection.
>>>>
>>>>
>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 fd=78 slot=78 connection from
>>>>   xxx to yyy
>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=0 BIND dn="" method=sasl
>>>>   version=3 mech=GSSAPI
>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=0 RESULT err=14 tag=97
>>>>   nentries=0 etime=0, SASL bind in progress
>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=1 BIND dn="" method=sasl
>>>>   version=3 mech=GSSAPI
>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=1 RESULT err=14 tag=97
>>>>   nentries=0 etime=0, SASL bind in progress
>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=2 BIND dn="" method=sasl
>>>>   version=3 mech=GSSAPI
>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=2 RESULT err=0 tag=97
>>>>   nentries=0 etime=0 dn="krbprincipalname=xxx"
>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=3 SRCH base="" scope=0
>>>>   filter="(objectClass=*)" attrs="supportedControl supportedExtension"
>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=3 RESULT err=0 tag=101
>>>>   nentries=1 etime=0
>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=4 SRCH base="" scope=0
>>>>   filter="(objectClass=*)" attrs="supportedControl supportedExtension"
>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=4 RESULT err=0 tag=101
>>>>   nentries=1 etime=0
>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=5 EXT
>>>>   oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=5 RESULT err=0 tag=120
>>>>   nentries=0 etime=0
>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=6 SRCH base="cn=schema"
>>>>   scope=0 filter="(objectClass=*)" attrs="nsSchemaCSN"
>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=6 RESULT err=0 tag=101
>>>>   nentries=1 etime=0
>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=-1 fd=78 closed - I/O
>>>>   function error.
>>>>
>>>> The reason for this closure is logged in Server1's error log: 
>>>> sasl_decode fails to decode a received PDU.
>>>>
>>>>   [19/Nov/2014:14:21:39 +0100] - sasl_io_recv failed to decode packet
>>>>   for connection 2980
>>>>
>>>> I do not know why it fails, but I wonder whether the received PDU 
>>>> is larger than the maximum configured value. The attribute 
>>>> nsslapd-maxsasliosize is set to 2 MB by default. Would it be 
>>>> possible to increase its value (to 5 MB) to see if it has an 
>>>> impact?
>>>>
>>>> [...]
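A change like the one suggested above can be applied with ldapmodify; a minimal sketch, assuming the default "cn=Directory Manager" bind DN and a placeholder server URL (adjust both for your deployment):

```shell
# Raise nsslapd-maxsasliosize (in cn=config) to 5 MB (5242880 bytes).
# The bind DN and host are assumptions; you will be prompted for the
# directory manager password.
ldapmodify -x -D "cn=Directory Manager" -W -H ldap://server1.example.com <<'EOF'
dn: cn=config
changetype: modify
replace: nsslapd-maxsasliosize
nsslapd-maxsasliosize: 5242880
EOF
```

The attribute takes effect without a server restart, so this is cheap to try on both replicas.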
>>>
>>> I set nsslapd-maxsasliosize to 6164480 on both machines, but the 
>>> problem remains.
>>
>> sasl_decode fails, but the exact return value is not logged. With 
>> the standard version we would need to attach a debugger and set a 
>> conditional breakpoint in sasl_decode, just after the call to 
>> conn->oparams.decode, that fires if the result is != 0. However, 
>> this can change the timing and possibly prevent the problem from 
>> occurring again. The other option is to use an instrumented version 
>> that logs this value.
>
> If I understand the mechanism correctly, Server1 needs to have debug 
> versions of the relevant packages (probably 389-ds-base and 
> cyrus-sasl) installed in order to track down the problem. 
> Unfortunately, my Server1 is in production use - if I break it, my 
> colleagues will grab forks and torches and be after me. A short 
> downtime would be ok, though.
>
> Is there something else I could do?

Hello,

I certainly do not want to cause that much trouble ;-)


I think my email was not clear. To go further we would need to know the 
exact reason why sasl_decode fails. I see two options:

  * Prepare a debug version that would report in the error logs the
    return value of sasl_decode (when it fails). Apart from the
    downtime to install the debug version, it has no impact on
    production.

  * Run a debug session (gdb) on Server1. The debug session would
    install a breakpoint at a specific place, let the server run,
    catch the sasl_decode failure, note the return code, and then exit
    the debugger. When the problem occurs, it happens regularly (every
    5 seconds), so we should not have to wait long.
    That means debugging Server1 should disturb production for only 5
    to 10 minutes.
    A detailed procedure to run the debug session would be needed.
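The debug session described above might look roughly like the sketch below. The breakpoint location and the variable name holding sasl_decode's result are assumptions: the exact source line and identifier depend on the installed 389-ds-base/cyrus-sasl versions, and matching debuginfo packages must be present.

```shell
# Attach gdb to the running directory server process (assumes the
# process is named ns-slapd and debuginfo is installed):
gdb -p "$(pidof ns-slapd)"

# Then, inside gdb (the source line and the variable name 'ret' are
# hypothetical -- locate the line just after the sasl_decode call in
# sasl_io_recv in your source):
#   (gdb) break sasl_io.c:LINE if ret != 0
#   (gdb) continue
#   ... wait for the breakpoint to fire (the failure recurs ~every 5 s) ...
#   (gdb) print ret
#   (gdb) detach
#   (gdb) quit
```

Detaching (rather than killing the process) lets the server resume normal operation once the return code has been recorded.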

thanks
thierry

>
>
> Mit freundlichen Gruessen/With best regards,
>
> --Daniel.


