<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 11/21/2014 10:59 AM,
<a class="moz-txt-link-abbreviated" href="mailto:dbischof@hrz.uni-kassel.de">dbischof@hrz.uni-kassel.de</a> wrote:<br>
</div>
<blockquote cite="mid:alpine.LSU.2.11.1411211033510.1449@fred"
type="cite">Hi,
<br>
<br>
On Thu, 20 Nov 2014, thierry bordaz wrote:
<br>
<br>
<blockquote type="cite">On 11/20/2014 12:03 PM,
<a class="moz-txt-link-abbreviated" href="mailto:dbischof@hrz.uni-kassel.de">dbischof@hrz.uni-kassel.de</a> wrote:
<br>
<blockquote type="cite">
<br>
On Thu, 20 Nov 2014, thierry bordaz wrote:
<br>
<br>
<blockquote type="cite">Server1 successfully replicates to
Server2, but Server2 fails to replicate to Server1.
<br>
<br>
The replication Server2->Server1 is done with Kerberos
authentication. Server1 receives the replication session,
successfully identifies the replication manager, starts to
receive the replication extop, but then suddenly closes the
connection.
<br>
<br>
<br>
[19/Nov/2014:14:21:39 +0100] conn=2980 fd=78 slot=78
connection from
<br>
xxx to yyy
<br>
[19/Nov/2014:14:21:39 +0100] conn=2980 op=0 BIND dn=""
method=sasl
<br>
version=3 mech=GSSAPI
<br>
[19/Nov/2014:14:21:39 +0100] conn=2980 op=0 RESULT err=14
tag=97
<br>
nentries=0 etime=0, SASL bind in progress
<br>
[19/Nov/2014:14:21:39 +0100] conn=2980 op=1 BIND dn=""
method=sasl
<br>
version=3 mech=GSSAPI
<br>
[19/Nov/2014:14:21:39 +0100] conn=2980 op=1 RESULT err=14
tag=97
<br>
nentries=0 etime=0, SASL bind in progress
<br>
[19/Nov/2014:14:21:39 +0100] conn=2980 op=2 BIND dn=""
method=sasl
<br>
version=3 mech=GSSAPI
<br>
[19/Nov/2014:14:21:39 +0100] conn=2980 op=2 RESULT err=0
tag=97
<br>
nentries=0 etime=0 dn="krbprincipalname=xxx"
<br>
[19/Nov/2014:14:21:39 +0100] conn=2980 op=3 SRCH base=""
scope=0
<br>
filter="(objectClass=*)" attrs="supportedControl
supportedExtension"
<br>
[19/Nov/2014:14:21:39 +0100] conn=2980 op=3 RESULT err=0
tag=101
<br>
nentries=1 etime=0
<br>
[19/Nov/2014:14:21:39 +0100] conn=2980 op=4 SRCH base=""
scope=0
<br>
filter="(objectClass=*)" attrs="supportedControl
supportedExtension"
<br>
[19/Nov/2014:14:21:39 +0100] conn=2980 op=4 RESULT err=0
tag=101
<br>
nentries=1 etime=0
<br>
[19/Nov/2014:14:21:39 +0100] conn=2980 op=5 EXT
<br>
oid="2.16.840.1.113730.3.5.12"
name="replication-multimaster-extop"
<br>
[19/Nov/2014:14:21:39 +0100] conn=2980 op=5 RESULT err=0
tag=120
<br>
nentries=0 etime=0
<br>
[19/Nov/2014:14:21:39 +0100] conn=2980 op=6 SRCH
base="cn=schema"
<br>
scope=0 filter="(objectClass=*)" attrs="nsSchemaCSN"
<br>
[19/Nov/2014:14:21:39 +0100] conn=2980 op=6 RESULT err=0
tag=101
<br>
nentries=1 etime=0
<br>
[19/Nov/2014:14:21:39 +0100] conn=2980 op=-1 fd=78 closed
- I/O
<br>
function error.
<br>
<br>
The reason for this closure is logged in the Server1 error log:
sasl_decode fails to decode a received PDU.
<br>
<br>
[19/Nov/2014:14:21:39 +0100] - sasl_io_recv failed to
decode packet
<br>
for connection 2980
<br>
<br>
I do not know why it fails, but I wonder whether the received PDU
is larger than the maximum configured value. The
attribute nsslapd-maxsasliosize is set to 2 MB by default.
Would it be possible to increase its value (to 5 MB) to see if
it has an impact?
<br>
<br>
[...]
<br>
</blockquote>
<br>
I set nsslapd-maxsasliosize to 6164480 on both machines, but
the problem remains.
<br>
</blockquote>
<br>
sasl_decode fails, but the exact returned value is not
logged. With the standard version we may need to attach a debugger
and then set a conditional breakpoint in sasl_decode just after
conn->oparams.decode that fires if result != 0. However, this
can change the timing and possibly prevent the problem from occurring
again. The other option is to use an instrumented version that logs
this value.
<br>
</blockquote>
<br>
If I understand the mechanism correctly, Server1 needs to have
debug versions of the relevant packages (probably 389-ds-base and
cyrus-sasl) installed in order to track down the problem.
Unfortunately, my Server1 is in production use - if I break it, my
colleagues will grab forks and torches and be after me. A short
downtime would be ok, though.
<br>
<br>
Is there something else I could do?
<br>
</blockquote>
<br>
Hello, <br>
<br>
Sure, I do not want to cause that much trouble <span
class="moz-smiley-s3"><span> ;-) </span></span><br>
<br>
<br>
I think my email was not clear. To go further we would need to know
the exact reason why sasl_decode fails. I see two options:<br>
<ul>
<li>Prepare a debug version that reports in the error log
the value returned by sasl_decode (when it fails). Apart from the
downtime needed to install the debug version, it has no impact on
production.<br>
<br>
</li>
<li>Do a debug session (gdb) on Server1. The debug session would
set a breakpoint at a specific place, let the server run,
catch the sasl_decode failure, note the return code, and exit
the debugger. <br>
When the problem occurs, it recurs regularly (every 5 seconds),
so we should not have to wait long.<br>
This means that debugging Server1 should disturb production for
5 to 10 minutes.<br>
A detailed procedure for the debug session would be required.<br>
</li>
</ul>
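<p>Before planning a debug window, it may help to confirm how
regularly the failure really recurs. The following is only a rough
sketch: it assumes the error log lines have the same shape as the
sasl_io_recv line quoted above, and the function names and every
sample line except the first are hypothetical, made up for
illustration. It parses the timestamps with an English month name,
so it assumes a C/English locale.</p>
<pre>
```python
import re
from datetime import datetime

# Line shape taken from the error log excerpt quoted in this thread.
PATTERN = re.compile(
    r"^\[([^\]]+)\] - sasl_io_recv failed to decode packet\s+for connection (\d+)"
)


def decode_failures(lines):
    """Return (timestamp, connection id) pairs for each decode failure line."""
    failures = []
    for line in lines:
        match = PATTERN.match(line.strip())
        if match:
            # "%b" expects an English month abbreviation such as "Nov".
            stamp = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
            failures.append((stamp, int(match.group(2))))
    return failures


def intervals_in_seconds(failures):
    """Return the gaps, in seconds, between consecutive failures."""
    stamps = [stamp for stamp, _ in failures]
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


# Only the first line comes from the real log; the others are hypothetical.
SAMPLE = [
    "[19/Nov/2014:14:21:39 +0100] - sasl_io_recv failed to decode packet for connection 2980",
    "[19/Nov/2014:14:21:42 +0100] - some unrelated message",
    "[19/Nov/2014:14:21:44 +0100] - sasl_io_recv failed to decode packet for connection 2981",
    "[19/Nov/2014:14:21:49 +0100] - sasl_io_recv failed to decode packet for connection 2982",
]

if __name__ == "__main__":
    print(intervals_in_seconds(decode_failures(SAMPLE)))
```
</pre>
<p>In practice the lines would be read from a copy of the Server1
error log instead of SAMPLE. If the gaps are consistently around
5 seconds, a short debug session should be enough to catch one
failure.</p>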
<p>thanks<br>
thierry<br>
</p>
<blockquote cite="mid:alpine.LSU.2.11.1411211033510.1449@fred"
type="cite">
<br>
<br>
Mit freundlichen Gruessen/With best regards,
<br>
<br>
--Daniel.
<br>
</blockquote>
<br>
</body>
</html>