<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 11/11/2015 04:20 PM, Andrew Krause
wrote:<br>
</div>
<blockquote
cite="mid:9789D1C1-572F-4CB7-AE8D-26E07E94B1CB@breakthroughfuel.com"
type="cite">
<pre wrap="">Yesterday I came in to 3 of my 4 freeipa replicas in an unusable state and replication was not connecting any of the hosts to each other. My first/primary host was still servicing authentication requests, but the others were in varying states of usability. I’ve investigated logs on all 4 nodes and the only thing I can see is messages like this from when the problem started until I restarted all 4 with ipactl stop/ipactl start:
[09/Nov/2015:19:17:16 -0700] NSMMReplicationPlugin - agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[09/Nov/2015:19:19:16 -0700] NSMMReplicationPlugin - agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[09/Nov/2015:19:21:19 -0700] NSMMReplicationPlugin - agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[09/Nov/2015:19:23:19 -0700] NSMMReplicationPlugin - agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[09/Nov/2015:19:25:21 -0700] NSMMReplicationPlugin - agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[09/Nov/2015:19:27:21 -0700] NSMMReplicationPlugin - agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[09/Nov/2015:19:29:26 -0700] NSMMReplicationPlugin - agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[09/Nov/2015:19:31:26 -0700] NSMMReplicationPlugin - agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[09/Nov/2015:19:32:37 -0700] NSMMReplicationPlugin - agmt="cn=meToabcloc2papp08.somedomain.com" (abcloc2papp08:389): Warning: Attempting to release replica, but unable to receive endReplication extended operation response from the replica. Error -5 (Timed out)
[09/Nov/2015:19:33:29 -0700] NSMMReplicationPlugin - agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[09/Nov/2015:19:34:37 -0700] NSMMReplicationPlugin - agmt="cn=meToa.somedomain.com" (abcloc2papp08:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[09/Nov/2015:19:35:28 -0700] NSMMReplicationPlugin - agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[09/Nov/2015:19:36:41 -0700] NSMMReplicationPlugin - agmt="cn=meToabcloc2papp08.somedomain.com" (abcloc2papp08:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
We’ve already looked into our network and there was no outage/interruption between sites during the timeframe in question. The only corrective action that was taken was to restart each node. Does anyone know any way I can investigate further what caused this issue? I don’t like giving “I don’t know” answers for why replication stopped working and did not resume by itself.
</pre>
</blockquote>
<font face="Times New Roman, Times, serif">Hi Andrew,<br>
<br>
      The logs show periodic networking issues (every minute or two)
      in which the primary host fails to complete the replication
      protocol with abcloc[12]papp08.<br>
      There may also be a problem with the 3rd replica, but it is not
      present in this portion of the logs.<br>
      Most of the time this prevents the primary master from
      establishing a replication session, so these replicas are likely
      lagging behind.<br>
      The replicas are reachable but do not answer fast enough, so the
      protocol times out.<br>
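      <br>
      To see how regular the timeouts are, the gaps between them can
      be computed per replication agreement. This is only a rough
      sketch: the sample lines are copied from the log quoted above,
      while the names retry_gaps, TIMEOUT_LINE and SAMPLE are made up
      for the illustration.<br>
<pre wrap="">
```python
import re
from collections import defaultdict
from datetime import datetime

# Three lines copied from the error log quoted above.
SAMPLE = """\
[09/Nov/2015:19:17:16 -0700] NSMMReplicationPlugin - agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[09/Nov/2015:19:19:16 -0700] NSMMReplicationPlugin - agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[09/Nov/2015:19:21:19 -0700] NSMMReplicationPlugin - agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
"""

# Group 1 is the timestamp, group 2 is the agreement name.
TIMEOUT_LINE = re.compile(
    r'^\[([^\]]+)\] NSMMReplicationPlugin - agmt="([^"]+)".*Timed out'
)


def retry_gaps(log_text):
    """Return {agreement: [seconds between consecutive timeout messages]}."""
    stamps = defaultdict(list)
    for line in log_text.splitlines():
        match = TIMEOUT_LINE.match(line)
        if match:
            when = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
            stamps[match.group(2)].append(when)
    return {
        agmt: [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]
        for agmt, times in stamps.items()
    }


if __name__ == "__main__":
    for agmt, gaps in retry_gaps(SAMPLE).items():
        print(agmt, gaps)
```
</pre>
      On the three sample lines the gaps are 120 and 123 seconds.
      Running it over the full errors log of each node would show
      whether every agreement follows the same rhythm.<br>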
<br>
      The default replication timeout is 10 minutes, but it can be
      tuned in each replica agreement with the nsds5ReplicaTimeout
      attribute.<br>
      Is that value set?<br>
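      <br>
      If you want to set the attribute, a small helper can generate
      the LDIF for ldapmodify. This is a sketch under assumptions: the
      function name timeout_ldif is invented here, and the agreement
      DN (in particular its suffix part) is hypothetical and must be
      replaced by the DN found on your own server.<br>
<pre wrap="">
```python
def timeout_ldif(agreement_dn, seconds):
    """Build an LDIF that sets nsds5ReplicaTimeout on one replication agreement."""
    # This sketch only accepts a positive whole number of seconds.
    if not (isinstance(seconds, int) and seconds > 0):
        raise ValueError("seconds must be a positive integer")
    return (
        f"dn: {agreement_dn}\n"
        "changetype: modify\n"
        "replace: nsds5ReplicaTimeout\n"
        f"nsds5ReplicaTimeout: {seconds}\n"
    )


# Hypothetical agreement DN: the suffix part is an assumption, adjust it.
AGREEMENT_DN = (
    r"cn=meToabcloc1papp08.somedomain.com,cn=replica,"
    r"cn=dc\3Dsomedomain\2Cdc\3Dcom,cn=mapping tree,cn=config"
)

if __name__ == "__main__":
    print(timeout_ldif(AGREEMENT_DN, 600), end="")
```
</pre>
      The output can be piped to ldapmodify -D "cn=Directory Manager"
      -W. To check whether a value is currently set, something like
      ldapsearch -D "cn=Directory Manager" -W -b "cn=mapping
      tree,cn=config" "(objectClass=nsds5ReplicationAgreement)"
      nsds5ReplicaTimeout should list it for every agreement.<br>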
<br>
      Since replication was working fine before, it would be
      interesting to check the replicas' logs (perhaps with replication
      logging enabled on them) at the time the timeouts occur.<br>
      Also, if the problem continues, take periodic pstacks of the
      replicas (at intervals shorter than the nsds5ReplicaTimeout
      value), because something may be keeping them busy and unable to
      answer fast enough.<br>
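      <br>
      The periodic pstacks could be collected with a loop like the one
      below. It is a minimal sketch, assuming pstack is installed on
      the replica and that you pass the PID of the ns-slapd process;
      sample_stacks and its parameters are names chosen for this
      example.<br>
<pre wrap="">
```python
import subprocess
import time


def sample_stacks(pid, interval_s, count, run=subprocess.run, sleep=time.sleep):
    """Run `pstack PID` count times, interval_s seconds apart; return the outputs."""
    dumps = []
    for i in range(count):
        # check=False: keep sampling even if one pstack call fails.
        result = run(["pstack", str(pid)], capture_output=True, text=True, check=False)
        dumps.append(result.stdout)
        if i + 1 != count:
            sleep(interval_s)
    return dumps


# Example (not executed here): five samples, 30 seconds apart, so that
# several of them fall inside one timeout period.
# dumps = sample_stacks(ns_slapd_pid, 30, 5)
```
</pre>
      Comparing consecutive samples shows whether the same threads stay
      in the same place while the supplier is waiting for an answer.<br>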
<br>
thanks<br>
thierry<br>
</font>
</body>
</html>