<html>
  <head>
    <meta content="text/html; charset=ISO-8859-1"
      http-equiv="Content-Type">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <div class="moz-cite-prefix">On 11/21/2014 10:59 AM,
      <a class="moz-txt-link-abbreviated" href="mailto:dbischof@hrz.uni-kassel.de">dbischof@hrz.uni-kassel.de</a> wrote:<br>
    </div>
    <blockquote cite="mid:alpine.LSU.2.11.1411211033510.1449@fred"
      type="cite">Hi,
      <br>
      <br>
      On Thu, 20 Nov 2014, thierry bordaz wrote:
      <br>
      <br>
      <blockquote type="cite">On 11/20/2014 12:03 PM,
        <a class="moz-txt-link-abbreviated" href="mailto:dbischof@hrz.uni-kassel.de">dbischof@hrz.uni-kassel.de</a> wrote:
        <br>
        <blockquote type="cite">
          <br>
          On Thu, 20 Nov 2014, thierry bordaz wrote:
          <br>
          <br>
          <blockquote type="cite">Server1 successfully replicates to
            Server2, but Server2 fails to replicate to Server1.
            <br>
            <br>
            The replication Server2-&gt;Server1 is done with Kerberos
            authentication. Server1 receives the replication session,
            successfully identifies the replication manager, starts to
            receive the replication extop, but suddenly closes the
            connection.
            <br>
            <br>
            <br>
              [19/Nov/2014:14:21:39 +0100] conn=2980 fd=78 slot=78
            connection from
            <br>
              xxx to yyy
            <br>
              [19/Nov/2014:14:21:39 +0100] conn=2980 op=0 BIND dn=""
            method=sasl
            <br>
              version=3 mech=GSSAPI
            <br>
              [19/Nov/2014:14:21:39 +0100] conn=2980 op=0 RESULT err=14
            tag=97
            <br>
              nentries=0 etime=0, SASL bind in progress
            <br>
              [19/Nov/2014:14:21:39 +0100] conn=2980 op=1 BIND dn=""
            method=sasl
            <br>
              version=3 mech=GSSAPI
            <br>
              [19/Nov/2014:14:21:39 +0100] conn=2980 op=1 RESULT err=14
            tag=97
            <br>
              nentries=0 etime=0, SASL bind in progress
            <br>
              [19/Nov/2014:14:21:39 +0100] conn=2980 op=2 BIND dn=""
            method=sasl
            <br>
              version=3 mech=GSSAPI
            <br>
              [19/Nov/2014:14:21:39 +0100] conn=2980 op=2 RESULT err=0
            tag=97
            <br>
              nentries=0 etime=0 dn="krbprincipalname=xxx"
            <br>
              [19/Nov/2014:14:21:39 +0100] conn=2980 op=3 SRCH base=""
            scope=0
            <br>
              filter="(objectClass=*)" attrs="supportedControl
            supportedExtension"
            <br>
              [19/Nov/2014:14:21:39 +0100] conn=2980 op=3 RESULT err=0
            tag=101
            <br>
              nentries=1 etime=0
            <br>
              [19/Nov/2014:14:21:39 +0100] conn=2980 op=4 SRCH base=""
            scope=0
            <br>
              filter="(objectClass=*)" attrs="supportedControl
            supportedExtension"
            <br>
              [19/Nov/2014:14:21:39 +0100] conn=2980 op=4 RESULT err=0
            tag=101
            <br>
              nentries=1 etime=0
            <br>
              [19/Nov/2014:14:21:39 +0100] conn=2980 op=5 EXT
            <br>
              oid="2.16.840.1.113730.3.5.12"
            name="replication-multimaster-extop"
            <br>
              [19/Nov/2014:14:21:39 +0100] conn=2980 op=5 RESULT err=0
            tag=120
            <br>
              nentries=0 etime=0
            <br>
              [19/Nov/2014:14:21:39 +0100] conn=2980 op=6 SRCH
            base="cn=schema"
            <br>
              scope=0 filter="(objectClass=*)" attrs="nsSchemaCSN"
            <br>
              [19/Nov/2014:14:21:39 +0100] conn=2980 op=6 RESULT err=0
            tag=101
            <br>
              nentries=1 etime=0
            <br>
              [19/Nov/2014:14:21:39 +0100] conn=2980 op=-1 fd=78 closed
            - I/O
            <br>
              function error.
            <br>
            <br>
            The reason for this closure is logged in the Server1 error
            log: sasl_decode fails to decode a received PDU.
            <br>
            <br>
              [19/Nov/2014:14:21:39 +0100] - sasl_io_recv failed to
            decode packet
            <br>
              for connection 2980
            <br>
            <br>
            I do not know why it fails, but I wonder whether the received
            PDU is larger than the maximum configured value. The
            attribute nsslapd-maxsasliosize is set to 2 MB by default.
            Would it be possible to increase its value (to 5 MB) to see
            whether it has an impact?
            <br>
            <br>
            [...]
            <br>
          </blockquote>
          <br>
          I set nsslapd-maxsasliosize to 6164480 on both machines, but
          the problem remains.
          <br>
        </blockquote>
        <br>
        sasl_decode fails, but the exact return value is not
        logged. With the standard version we may need to attach a
        debugger and set a conditional breakpoint in sasl_decode, just
        after conn-&gt;oparams.decode, that fires if result != 0.
        However, this can change the timing and possibly prevent the
        problem from occurring again. The other option is to use an
        instrumented version that logs this value.
        <br>
      </blockquote>
      <br>
      If I understand the mechanism correctly, Server1 needs to have
      debug versions of the relevant packages (probably 389-ds-base and
      cyrus-sasl) installed in order to track down the problem.
      Unfortunately, my Server1 is in production use - if I break it, my
      colleagues will grab forks and torches and be after me. A short
      downtime would be ok, though.
      <br>
      <br>
      Is there something else I could do?
      <br>
    </blockquote>
    <br>
    Hello, <br>
    <br>
    Sure, I do not want to cause that much trouble <span
      class="moz-smiley-s3"><span> ;-) </span></span><br>
    <br>
    <br>
    I think my email was not clear. To go further we would need to know
    the exact reason why sasl_decode fails. I see two options:<br>
    <ul>
      <li>Prepare a debug version that reports in the error log
        the value returned by sasl_decode (when it fails). Apart from
        the downtime needed to install the debug version, it has no
        impact on production.<br>
        <br>
      </li>
      <li>Do a debug session (gdb) on Server1. The debug session would
        set a breakpoint at a specific place, let the server run,
        catch the sasl_decode failure, note the return code, and exit
        from the debugger. <br>
        When the problem occurs, it recurs regularly (every 5 seconds),
        so we should not have to wait long.<br>
        That means that debugging Server1 should disturb production for
        5 to 10 minutes.<br>
        A detailed procedure for the debug session would be required.<br>
      </li>
    </ul>
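    <p>Before scheduling either option, it may help to confirm from the
      Server1 access log that the failure still recurs about every 5
      seconds. Below is a minimal Python sketch, assuming the close
      lines look like the one quoted above ("closed - I/O function
      error"). The function names are hypothetical helpers for
      illustration, not part of 389-ds.</p>
    <pre>
```python
import re
from datetime import datetime

# Assumption: a failing connection is closed with a line such as
#   [19/Nov/2014:14:21:39 +0100] conn=2980 op=-1 fd=78 closed - I/O function error.
CLOSE_RE = re.compile(
    r"^\s*\[([^\]]+)\] conn=(\d+) op=-1 fd=\d+ closed - I/O function error"
)


def find_io_error_closures(lines):
    """Return a list of (timestamp, conn) for connections closed on an I/O function error."""
    hits = []
    for line in lines:
        m = CLOSE_RE.match(line)
        if m:
            # Access-log timestamps look like 19/Nov/2014:14:21:39 +0100
            when = datetime.strptime(m.group(1), "%d/%b/%Y:%H:%M:%S %z")
            hits.append((when, int(m.group(2))))
    return hits


def gaps_in_seconds(hits):
    """Return the seconds elapsed between consecutive closures."""
    times = [when for when, _ in hits]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]
```
    </pre>
    <p>Calling find_io_error_closures() on the open access log file
      and passing the result to gaps_in_seconds() should show gaps
      close to 5 seconds while the problem is active. If it does not,
      a 5 to 10 minute debug session may not catch the failure.</p>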
    <p>thanks<br>
      thierry<br>
    </p>
    <blockquote cite="mid:alpine.LSU.2.11.1411211033510.1449@fred"
      type="cite">
      <br>
      <br>
      Mit freundlichen Gruessen/With best regards,
      <br>
      <br>
      --Daniel.
      <br>
    </blockquote>
    <br>
  </body>
</html>