[Freeipa-users] 3.0.0-42 Replication issue after Centos6.5->6.6 upgrade

dbischof at hrz.uni-kassel.de
Tue Nov 25 19:32:09 UTC 2014


Hi,

With the help of Thierry and Rich, I managed to debug the running
ns-slapd on Server1 (see below). The failing attempt to decode the
SASL data returns a not very informative "-1" (SASL_FAIL, "generic
failure").

Any ideas? Short summary:

Server1 = running IPA server
Server2 = intended IPA replica

Both machines run the exact same, up-to-date version of CentOS 6.6. 
However, I had to run "ipa-replica-install" _without_ the option
"--setup-ca" (with that option, the installation failed with an obscure
Perl error), so there is no ns-slapd instance running for PKI-IPA.
Could this be related?
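
(For reference, the dirsrv instances present on a box can be listed
with something like

   service dirsrv status
   ls /etc/dirsrv/

either of which shows whether a slapd-PKI-IPA instance exists alongside
the one for the IPA suffix.)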

On Fri, 21 Nov 2014, Rich Megginson wrote:

> On 11/21/2014 04:51 AM, thierry bordaz wrote:
>> On 11/21/2014 10:59 AM, dbischof at hrz.uni-kassel.de wrote:
>>> On Thu, 20 Nov 2014, thierry bordaz wrote:
>>>> On 11/20/2014 12:03 PM, dbischof at hrz.uni-kassel.de wrote:
>>>>> On Thu, 20 Nov 2014, thierry bordaz wrote:
>>>>> 
>>>>>> Server1 successfully replicates to Server2, but Server2 fails to 
>>>>>> replicate to Server1.
>>>>>> 
>>>>>> The replication Server2->Server1 uses Kerberos authentication.
>>>>>> Server1 receives the replication session, successfully identifies
>>>>>> the replication manager, and starts to receive replication
>>>>>> extended operations (extops), but then suddenly closes the
>>>>>> connection.
>>>>>> 
>>>>>>
>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 fd=78 slot=78 connection from
>>>>>>   xxx to yyy
>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=0 BIND dn="" method=sasl
>>>>>>   version=3 mech=GSSAPI
>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=0 RESULT err=14 tag=97
>>>>>>   nentries=0 etime=0, SASL bind in progress
>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=1 BIND dn="" method=sasl
>>>>>>   version=3 mech=GSSAPI
>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=1 RESULT err=14 tag=97
>>>>>>   nentries=0 etime=0, SASL bind in progress
>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=2 BIND dn="" method=sasl
>>>>>>   version=3 mech=GSSAPI
>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=2 RESULT err=0 tag=97
>>>>>>   nentries=0 etime=0 dn="krbprincipalname=xxx"
>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=3 SRCH base="" scope=0
>>>>>>   filter="(objectClass=*)" attrs="supportedControl supportedExtension"
>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=3 RESULT err=0 tag=101
>>>>>>   nentries=1 etime=0
>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=4 SRCH base="" scope=0
>>>>>>   filter="(objectClass=*)" attrs="supportedControl supportedExtension"
>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=4 RESULT err=0 tag=101
>>>>>>   nentries=1 etime=0
>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=5 EXT
>>>>>>   oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=5 RESULT err=0 tag=120
>>>>>>   nentries=0 etime=0
>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=6 SRCH base="cn=schema"
>>>>>>   scope=0 filter="(objectClass=*)" attrs="nsSchemaCSN"
>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=6 RESULT err=0 tag=101
>>>>>>   nentries=1 etime=0
>>>>>>   [19/Nov/2014:14:21:39 +0100] conn=2980 op=-1 fd=78 closed - I/O
>>>>>>   function error.
>>>>>> 
>>>>>> The reason for this closure is logged in Server1's error log:
>>>>>> sasl_decode fails to decode a received PDU.
>>>>>>
>>>>>>   [19/Nov/2014:14:21:39 +0100] - sasl_io_recv failed to decode packet
>>>>>>   for connection 2980
>>>>>> 
>>>>>> I do not know why it fails, but I wonder whether the received PDU
>>>>>> is larger than the maximum configured value. The attribute
>>>>>> nsslapd-maxsasliosize is set to 2 MB by default. Would it be
>>>>>> possible to increase its value (to 5 MB) to see if it has an
>>>>>> impact?
>>>>>> 
>>>>>> [...]
>>>>> 
>>>>> I set nsslapd-maxsasliosize to 6164480 on both machines, but the 
>>>>> problem remains.
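>>>>>
>>>>> (For reference, the value can be changed with an LDIF like the
>>>>> following, applied to cn=config on each server with ldapmodify
>>>>> bound as Directory Manager:
>>>>>
>>>>>   dn: cn=config
>>>>>   changetype: modify
>>>>>   replace: nsslapd-maxsasliosize
>>>>>   nsslapd-maxsasliosize: 6164480
>>>>>
>>>>> for instance via
>>>>>
>>>>>   ldapmodify -x -H ldap://localhost -D "cn=Directory Manager" -W -f maxsasl.ldif
>>>>>
>>>>> where maxsasl.ldif is just a scratch file holding the lines above.)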
>>>> 
>>>> The sasl_decode call fails, but the exact return value is not
>>>> logged. With the standard version we may need to attach a debugger
>>>> and set a conditional breakpoint in sasl_decode, just after
>>>> conn->oparams.decode, that fires if result != 0. However, this can
>>>> change the timing and possibly prevent the problem from occurring
>>>> again. The other option is to use an instrumented version to log
>>>> this value.
>>> 
>>> If I understand the mechanism correctly, Server1 needs to have debug 
>>> versions of the relevant packages (probably 389-ds-base and 
>>> cyrus-sasl) installed in order to track down the problem. 
>>> Unfortunately, my Server1 is in production use - if I break it, my 
>>> colleagues will grab forks and torches and be after me. A short 
>>> downtime would be ok, though.
>>> 
>>> Is there something else I could do?
>> 
>> Sure, I do not want to cause that much trouble ;-)
>> 
>> 
>> I think my email was not clear. To go further we would need to know the 
>> exact reason why sasl_decode fails. I see two options:
>>
>>   * Prepare a debug version that reports in the error log the
>>     return value of sasl_decode (when it fails). Apart from the
>>     downtime needed to install the debug version, it has no impact
>>     on production.
>>
>>   * Do a debug session (gdb) on Server1: install a breakpoint at a
>>     specific place, let the server run, catch the sasl_decode
>>     failure and note the return code, then exit the debugger.
>>     When the problem occurs, it recurs regularly (every 5 seconds),
>>     so we should not have to wait long.
>>     That means debugging Server1 should disturb production for only
>>     5 to 10 minutes.
>>     A detailed procedure for the debug session is needed; a rough
>>     sketch follows below.
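>>
>> A rough sketch of such a session, assuming the relevant debuginfo
>> packages are installed and <pid> is the PID of the ns-slapd instance
>> serving the IPA suffix (the transcript details will of course
>> differ):
>>
>>   # gdb -p <pid>
>>   (gdb) handle SIGPIPE nostop noprint pass
>>   (gdb) break sasl_decode
>>   (gdb) continue
>>       ...the breakpoint will also fire for ordinary GSSAPI-bound
>>       connections, so continue until the failing replication
>>       connection from Server2 is the one caught...
>>   (gdb) finish
>>       ...gdb runs sasl_decode to its return and prints...
>>   Value returned is $1 = <this is the code we are after>
>>   (gdb) detach
>>   (gdb) quit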
>> 
> For starters: 
> http://www.port389.org/docs/389ds/FAQ/faq.html#debugging-crashes
>
> Since this is IPA you will need debuginfo packages for ipa and slapi-nis 
> in addition to the ones for 389.
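>
> On CentOS 6, with the debuginfo repositories enabled, something like
> this (debuginfo-install comes from yum-utils) should pull them in;
> adjust the package names as needed:
>
>   debuginfo-install 389-ds-base ipa-server slapi-nis cyrus-sasl-lib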
>
> Take a look at the Debugging Hangs section where it describes how to use 
> gdb to get a stack trace.  If you can use that gdb command to get a 
> stack trace with the full debugging symbols (and if you don't know what 
> that means, just post the redacted stack trace somewhere and provide us 
> with a link to it), then you should be all ready to do a gdb session to 
> reproduce the error and "catch it in the act".
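>
> The gdb command on that page is roughly of the form
>
>   gdb -ex 'set confirm off' -ex 'set pagination off' \
>       -ex 'thread apply all bt full' -ex 'quit' \
>       /usr/sbin/ns-slapd <pid> > stacktrace.txt 2>&1
>
> with <pid> being the ns-slapd process in question.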


Mit freundlichen Gruessen/With best regards,

--Daniel.



