<html> <head> <meta content="text/html; charset=windows-1252" http-equiv="Content-Type"> </head> <body bgcolor="#FFFFFF" text="#000000"> <div class="moz-cite-prefix">On 05/21/2015 09:59 AM, Janelle wrote: </div> <blockquote cite="mid:555DE4D2.9050801@gmail.com" type="cite">On 5/21/15 6:46 AM, Ludwig Krispenz wrote: <blockquote type="cite"> On 05/21/2015 03:28 PM, Janelle wrote: <blockquote type="cite">I think I found the problem. There was a lone replica running in another DC. It was installed as a replica some time ago with all the others. Think of this -- the original config had 5 servers, one of them was this server. Then the other 4 servers were RE-BUILT from scratch, so all the replication agreements were changed AND - this is the important part - the 5th server was never added back in. BUT - the 5th server was left running and was never told that it was not a member anymore. It still thought it had a replication agreement with the original "server 1", but server 1 knew otherwise. Now, although the first 4 servers were rebuilt, the same domain, realm, AND passwords were used. I am guessing that somehow this 5th server keeps trying to interject its info into the ring of 4 servers, kind of forcing its way in. Somehow, the fact that the original credentials still work (but the certs are all different) is leaving the first 4 servers with a "can't decode" issue. There should be some security checks so this can't happen. It should also be easy to replicate. Now I have to go re-initialize all the servers from a good server, so everyone is happy again. The "problem" server has been shut down completely. (And yes, there were actually 3 of them in my scenario - I just used 1 to simplify my example - but that explains the 3 CSNs that just kept "appearing".) What concerns me most about this - were the servers outside of the "good ring" somehow able to inject data into replication, which might have been causing bad data??? This is bad if it is true.
</blockquote> It depends a bit on what you mean by rebuilt from scratch. A replication session needs to meet three conditions to be able to send data:
<ul>
<li>the supplier side needs to be able to authenticate, and the authenticated user has to be in the list of bind DNs of the replica</li>
<li>the data generation of the supplier and consumer sides needs to be the same (they all have to have the same common origin)</li>
<li>the supplier needs to have the changes (CSNs) to be able to position in its changelog to send updates</li>
</ul>
Now, if you have 5 servers, forget about one of them, do not change the credentials on the others, and do not reinitialize the database by an ldif import to generate a new database generation, the fifth server will still be able to connect and eventually send updates. How should the other servers know that this one is no longer a "good" one? <blockquote type="cite"> ~Janelle </blockquote> </blockquote> The only problem left now is that, no matter what, this last entry will NOT go away, and now I have 2 "stuck" cleanruvs that will not "abort" either.
<pre>unable to decode {replica 24} 554d53d3000000180000 554d54a4000200180000
CLEANALLRUV tasks
RID 24 None
No abort CLEANALLRUV tasks running
=====================================
ldapmodify -D "cn=directory manager" -W -a
dn: cn=abort 24, cn=abort cleanallruv, cn=tasks, cn=config
objectclass: extensibleObject
replica-base-dn: dc=example,dc=com
cn: abort 24
replica-id: 24
replica-certify-all: no

adding new entry " cn=abort 24, cn=abort cleanallruv, cn=tasks, cn=config"
ldap_add: No such object (32)</pre>
</blockquote> There should not be a white space at the beginning: " cn=abort 24, cn=abort cleanallruv, cn=tasks, cn=config"<br>
<br>
When I run the abort task I don't have that extra white space, and the task is successfully added:
<pre>[root@localhost ~]# ldapmodify -D cn=dm -w password -a
dn: cn=abort 24, cn=abort cleanallruv, cn=tasks, cn=config
objectclass: extensibleObject
replica-base-dn: dc=example,dc=com
cn: abort 24
replica-id: 24
replica-certify-all: no

adding new entry "cn=abort 24, cn=abort cleanallruv, cn=tasks, cn=config"</pre>
The extra white space is the probable cause of the error 32 (no such object) you were seeing. You can verify this by looking at the access log (/var/log/dirsrv/slapd-INSTANCE/access).<br>
<br>
As I said before, you could also check the errors log to see why the cleanAllRUV task is not completing.<br>
<br>
Regards,<br>
Mark </body> </html>