[Freeipa-users] 3.0.0-42 Replication issue after Centos6.5->6.6 upgrade

dbischof at hrz.uni-kassel.de
Fri Nov 21 09:59:36 UTC 2014


Hi,

On Thu, 20 Nov 2014, thierry bordaz wrote:

> On 11/20/2014 12:03 PM, dbischof at hrz.uni-kassel.de wrote:
>> 
>> On Thu, 20 Nov 2014, thierry bordaz wrote:
>> 
>>> Server1 successfully replicates to Server2, but Server2 fails to
>>> replicate to Server1.
>>> 
>>> The replication Server2->Server1 uses Kerberos authentication.
>>> Server1 receives the replication session, successfully identifies the
>>> replication manager and starts to receive replication extended
>>> operations (extop), but then suddenly closes the connection:
>>> 
>>>
>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 fd=78 slot=78 connection from
>>>   xxx to yyy
>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=0 BIND dn="" method=sasl
>>>   version=3 mech=GSSAPI
>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=0 RESULT err=14 tag=97
>>>   nentries=0 etime=0, SASL bind in progress
>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=1 BIND dn="" method=sasl
>>>   version=3 mech=GSSAPI
>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=1 RESULT err=14 tag=97
>>>   nentries=0 etime=0, SASL bind in progress
>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=2 BIND dn="" method=sasl
>>>   version=3 mech=GSSAPI
>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=2 RESULT err=0 tag=97
>>>   nentries=0 etime=0 dn="krbprincipalname=xxx"
>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=3 SRCH base="" scope=0
>>>   filter="(objectClass=*)" attrs="supportedControl supportedExtension"
>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=3 RESULT err=0 tag=101
>>>   nentries=1 etime=0
>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=4 SRCH base="" scope=0
>>>   filter="(objectClass=*)" attrs="supportedControl supportedExtension"
>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=4 RESULT err=0 tag=101
>>>   nentries=1 etime=0
>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=5 EXT
>>>   oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=5 RESULT err=0 tag=120
>>>   nentries=0 etime=0
>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=6 SRCH base="cn=schema"
>>>   scope=0 filter="(objectClass=*)" attrs="nsSchemaCSN"
>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=6 RESULT err=0 tag=101
>>>   nentries=1 etime=0
>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=-1 fd=78 closed - I/O
>>>   function error.
>>> 
>>> The reason for this closure is logged in Server1's error log:
>>> sasl_decode fails to decode a received PDU.
>>>
>>>   [19/Nov/2014:14:21:39 +0100] - sasl_io_recv failed to decode packet
>>>   for connection 2980
>>> 
>>> I do not know why it fails, but I wonder whether the received PDU is
>>> larger than the maximum configured value. The attribute
>>> nsslapd-maxsasliosize is set to 2 MB by default. Would it be possible
>>> to increase its value (to, say, 5 MB) to see if that has an impact?
>>> 
>>> [...]
>> 
>> I set nsslapd-maxsasliosize to 6164480 on both machines, but the problem 
>> remains.
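
(Side note for anyone reading along: nsslapd-maxsasliosize lives in the
cn=config entry, so the change on each server amounts to something like
the following. The bind options are whatever fits your setup, and the
LDIF file name is arbitrary.)

    # maxsasl.ldif
    dn: cn=config
    changetype: modify
    replace: nsslapd-maxsasliosize
    nsslapd-maxsasliosize: 6164480

    # apply it against the local instance; restart dirsrv afterwards if
    # you are unsure whether the change is picked up live
    ldapmodify -x -h localhost -D "cn=Directory Manager" -W -f maxsasl.ldif
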
>
> The sasl_decode call fails, but the exact return value is not logged.
> With the standard version we may need to attach a debugger and set a
> conditional breakpoint in sasl_decode, just after the call to
> conn->oparams.decode, that fires if the result is != 0. However, this
> can change the timing and possibly prevent the problem from occurring
> again. The other option is to use an instrumented version that logs
> this value.

If I understand the mechanism correctly, Server1 needs to have debug 
versions of the relevant packages (probably 389-ds-base and cyrus-sasl) 
installed in order to track down the problem. Unfortunately, my Server1 is 
in production use - if I break it, my colleagues will grab forks and 
torches and be after me. A short downtime would be ok, though.
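
If attaching a debugger during such a window is what it comes down to, I
imagine a crude version of it (simply breaking on sasl_decode and checking
its return value, rather than the conditional breakpoint you describe)
would look roughly like this on CentOS 6. Attaching gdb pauses ns-slapd,
and the breakpoint fires on every SASL-protected operation, so this is
only practical while replication is essentially the only traffic:

    # pull in symbols (needs yum-utils and the CentOS debuginfo repo)
    debuginfo-install 389-ds-base cyrus-sasl-lib

    # find the directory server that handles IPA replication; if a second
    # ns-slapd is running (e.g. PKI-IPA), pick the right PID by hand
    pidof ns-slapd
    gdb -p <PID>

    # inside gdb:
    (gdb) break sasl_decode
    (gdb) continue
    #  ... wait for the replication session from Server2 to hit it, then:
    (gdb) finish      # runs to the end of sasl_decode and prints the
                      # return value that the error log does not show
    (gdb) detach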

Is there something else I could do?


Mit freundlichen Gruessen/With best regards,

--Daniel.



