<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 05/21/2015 06:09 PM, Janelle wrote:<br>
</div>
<blockquote cite="mid:555E0357.3030500@gmail.com" type="cite">On
5/21/15 8:12 AM, Ludwig Krispenz wrote:
<br>
<blockquote type="cite">
<br>
On 05/21/2015 03:59 PM, Janelle wrote:
<br>
<blockquote type="cite">On 5/21/15 6:46 AM, Ludwig Krispenz
wrote:
<br>
<blockquote type="cite">
<br>
On 05/21/2015 03:28 PM, Janelle wrote:
<br>
<blockquote type="cite">I think I found the problem.
<br>
<br>
There was a lone replica running in another DC. It was
installed as a replica some time ago with all the others.
Think of this -- the original config had 5 servers, one of
them was this server. Then the other 4 servers were
RE-BUILT from scratch, so all the replication agreements
were changed AND - this is the important part - the 5th
server was never added back in. BUT - the 5th server was
left running and was never told that it was not a member
anymore. It still thought it had a replication agreement
with original "server 1", but server 1 knew otherwise.
<br>
<br>
Now, although the first 4 servers were rebuilt, the same
domain, realm, AND passwords were used.
<br>
<br>
I am guessing that somehow this 5th server keeps trying
to interject its info into the ring of 4 servers, kind of
forcing its way in. Because the original credentials
still work (but the certs are all different), this is
leaving the first 4 servers with a "can't decode" issue.
<br>
<br>
There should be some security checks so this can't happen.
It should also be easy to replicate.
<br>
<br>
Now I have to go re-initialize all the servers from a good
server, so everyone is happy again. The "problem" server
has been shut down completely. (And yes, there were
actually 3 of them in my scenario - I just used 1 to
simplify my example - but that explains the 3 CSNs that
just kept "appearing")
<br>
<br>
What concerns me most about this - were the servers
outside of the "good ring" somehow able to inject data
into replication which might have been causing bad data???
This is bad if it is true.
<br>
</blockquote>
it depends a bit on what you mean by rebuilt from scratch.
<br>
A replication session needs to meet three conditions to be
able to send data:
<br>
- the supplier side needs to be able to authenticate, and the
authenticated user has to be in the list of bind DNs of the
replica
<br>
- the data generation of supplier and consumer side need to
be the same (they all have to have the same common origin)
<br>
- the supplier needs to have the changes (CSNs) to be able
to position in its changelog to send updates
<br>
<br>
Now, if you have 5 servers, forget about one of them, do
not change the credentials on the others, and do not
reinitialize the database by an ldif import to generate a
new database generation, the fifth server will still be able
to connect and eventually send updates. How should the
other servers know that this one is no longer a "good" one?
<br>
<blockquote type="cite">
<br>
~Janelle
<br>
<br>
</blockquote>
<br>
</blockquote>
The only problem left now - is no matter what, this last entry
will NOT go away and now I have 2 "stuck" cleanruvs that will
not "abort" either.
<br>
<br>
unable to decode {replica 24} 554d53d3000000180000
554d54a4000200180000
<br>
<br>
CLEANALLRUV tasks
<br>
RID 24 None
<br>
No abort CLEANALLRUV tasks running
<br>
=====================================
<br>
<br>
ldapmodify -D "cn=directory manager" -W -a
<br>
<br>
dn: cn=abort 24, cn=abort cleanallruv, cn=tasks, cn=config
<br>
objectclass: extensibleObject
<br>
replica-base-dn: dc=example,dc=com
<br>
cn: abort 24
<br>
replica-id: 24
<br>
replica-certify-all: no
<br>
adding new entry " cn=abort 24, cn=abort cleanallruv,
cn=tasks, cn=config"
<br>
ldap_add: No such object (32)
<br>
</blockquote>
in your dse.ldif do you see something like:
<br>
<br>
nsds5ReplicaCleanRUV: 300:00000000000000000000:no
<br>
in the replica object ?
<br>
This is where the task lives as long as it couldn't reach all
servers for which a replication agreement exists.
<br>
<br>
If the abort task doesn't work, you could try to stop the server,
remove these lines from the dse.ldif, and start the server again.
<br>
</blockquote>
<br>
Sadly, nothing even close to that anywhere. And now, after trying
to remove another replica which had been showing as a duplicate,
although authentication is continuing to work, I am afraid to try
and do anything else to replication, for fear of bringing all of
production down.
<br>
<br>
I did not notice this at first - but yesterday when I shared my
RUVs -- there was something I missed:
<br>
<br>
dc1-ipa1.example.com 389 10
<br>
dc1-ipa2.example.com 389 25
<br>
dc1-ipa2.example.com 389 9
<br>
dc1-ipa3.example.com 389 8
<br>
dc1-ipa4.example.com 389 4
<br>
<br>
ipa2 appears twice, with RIDs 9 and 25 - with no explanation.
<br>
<br>
Frustrated.
<br>
~Janelle
<br>
<br>
</blockquote>
<font face="Times New Roman, Times, serif">Hi Janelle,<br>
<br>
Yes, I mentioned that duplicate yesterday. It means the node
dc1-ipa2.example.com is a master that used to be known with RID 9
and is now known as RID 25 (or the opposite).<br>
Did you reinstall that node? The purpose of CleanAllRuv is to
clear the old value from the RUV.<br>
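<br>
To see which RID a stuck RUV element belongs to, the CSNs themselves can be decoded. The following is only a minimal Python sketch. It assumes the usual 389 Directory Server CSN layout of 20 hex digits (8 for the timestamp, then 4 each for sequence number, replica ID and sub-sequence number), and the name decode_csn is purely illustrative.<br>

```python
from datetime import datetime, timezone


def decode_csn(csn):
    """Split a 20-hex-digit CSN into its four fields (assumed layout)."""
    if len(csn) != 20:
        raise ValueError("expected 20 hex digits, got %d" % len(csn))
    return {
        "timestamp": int(csn[0:8], 16),    # seconds since the epoch
        "seqnum": int(csn[8:12], 16),      # sequence number within that second
        "rid": int(csn[12:16], 16),        # replica ID that generated the change
        "subseqnum": int(csn[16:20], 16),  # sub-sequence number
    }


# The two CSNs from the "unable to decode {replica 24}" line in this thread.
for csn in ("554d53d3000000180000", "554d54a4000200180000"):
    fields = decode_csn(csn)
    when = datetime.fromtimestamp(fields["timestamp"], tz=timezone.utc)
    print(csn, "rid:", fields["rid"], "time:", when.isoformat())
```

Both values carry the replica ID field 0018 (hex), i.e. 24, which matches the RID shown in the "unable to decode" message.<br>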
Looking in the </font><font face="Times New Roman, Times, serif">dc1-ipa2.example.com
dse.ldif, you can confirm the current value and choose which one
needs to be cleared.<br>
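<br>
Checking dse.ldif can be done without opening it in an editor. The sketch below is a rough Python helper, not an official tool: it assumes the current RID is stored in the nsDS5ReplicaId attribute of the replica entry, it ignores LDIF line folding and base64 values, and the function name and the sample text are made up for illustration.<br>

```python
def replica_settings(ldif_text):
    """Collect replica ID and pending CleanRUV values from dse.ldif text."""
    wanted = ("nsds5replicaid", "nsds5replicacleanruv")
    found = {name: [] for name in wanted}
    for line in ldif_text.splitlines():
        # Split on the first colon only, since values may contain colons.
        name, sep, value = line.partition(":")
        key = name.strip().lower()
        if sep and key in wanted:
            found[key].append(value.strip())
    return found


# Hypothetical sample; on a real server, read the instance's dse.ldif
# instead (commonly under /etc/dirsrv/slapd-INSTANCE/, an assumption here).
sample = """dn: cn=replica,cn=example,cn=mapping tree,cn=config
nsDS5ReplicaId: 25
nsds5ReplicaCleanRUV: 300:00000000000000000000:no
"""

print(replica_settings(sample))
```

An empty list for nsds5replicacleanruv would mean that no pending clean task is recorded in that text.<br>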
When a RID is duplicated, you may see messages containing
'attrlist_replace:...' in the error logs.<br>
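<br>
A host that carries more than one RID can also be spotted mechanically from a host / port / RID listing like the one quoted above. This is a small Python sketch under that assumption about the listing format. The function name is illustrative only.<br>

```python
from collections import defaultdict


def hosts_with_multiple_rids(listing):
    """Map each host to its RIDs and keep hosts that have more than one."""
    rids = defaultdict(set)
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) == 3:  # hostname, port, RID
            host, _port, rid = parts
            rids[host].add(int(rid))
    # Every collected set is non-empty, so "not exactly one" means duplicates.
    return {host: sorted(ids) for host, ids in rids.items() if len(ids) != 1}


# The listing shared earlier in this thread.
listing = """dc1-ipa1.example.com 389 10
dc1-ipa2.example.com 389 25
dc1-ipa2.example.com 389 9
dc1-ipa3.example.com 389 8
dc1-ipa4.example.com 389 4
"""

print(hosts_with_multiple_rids(listing))
```

For the listing from this thread it reports only dc1-ipa2.example.com, with RIDs 9 and 25.<br>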
<br>
Thanks<br>
thierry<br>
</font>
</body>
</html>