[Freeipa-users] replication again :-(

Thu May 21 15:12:27 UTC 2015

On 05/21/2015 03:59 PM, Janelle wrote:
> On 5/21/15 6:46 AM, Ludwig Krispenz wrote:
>>
>> On 05/21/2015 03:28 PM, Janelle wrote:
>>> I think I found the problem.
>>>
>>> There was a lone replica running in another DC. It was installed as 
>>> a replica some time ago with all the others. Think of this -- the 
>>> original config had 5 servers, one of them was this server. Then the 
>>> other 4 servers were RE-BUILT from scratch, so all the replication 
>>> agreements were changed AND - this is the important part - the 5th 
>>> server was never added back in. BUT - the 5th server was left 
>>> running and was never told that it was no longer a member. It still 
>>> thought it had a replication agreement with original "server 1", but 
>>> server 1 knew otherwise.
>>>
>>> Now, although the first 4 servers were rebuilt, the same domain, 
>>> realm, AND passwords were used.
>>>
>>> I am guessing that somehow, this 5th server keeps trying to 
>>> interject its info into the ring of 4 servers, kind of forcing its 
>>> way in. Somehow, the fact that the original credentials still work 
>>> (though the certs are all different) is leaving the first 4 servers 
>>> with a "can't decode" issue.
>>>
>>> There should be some security checks so this can't happen. It should 
>>> also be easy to replicate.
>>>
>>> Now I have to go re-initialize all the servers from a good server, 
>>> so everyone is happy again. The "problem" server has been shut down 
>>> completely. (and yes, there were actually 3 of them in my scenario - 
>>> I just used 1 to simplify my example - but that explains the 3 CSNs 
>>> that just kept "appearing")
>>>
>>> What concerns me most about this - were the servers outside of the 
>>> "good ring" somehow able to inject data into replication which might 
>>> have been causing bad data??? This is bad if it is true.
>> It depends a bit on what you mean by rebuilt from scratch.
>> A replication session needs to meet three conditions to be able to 
>> send data:
>> - the supplier side needs to be able to authenticate, and the 
>> authenticated user has to be in the list of bind DNs of the replica
>> - the data generation of the supplier and consumer sides needs to be 
>> the same (they all have to have the same common origin)
>> - the supplier needs to have the changes (CSNs) required to position 
>> itself in its changelog before it can send updates
>>
>> Now, if you have 5 servers, forget about one of them, do not change 
>> the credentials on the others, and do not reinitialize the database by 
>> an LDIF import to generate a new database generation, the fifth 
>> server will still be able to connect and eventually send updates. 
>> How should the other servers know that this one is no longer a "good" 
>> one?
>>>
>>> ~Janelle
>>>
>>
> The only problem left now is that, no matter what, this last entry will 
> NOT go away, and now I have 2 "stuck" cleanruvs that will not "abort" 
> either.
>
> unable to decode  {replica 24} 554d53d3000000180000 554d54a4000200180000
>
> CLEANALLRUV tasks
> RID 24  None
> No abort CLEANALLRUV tasks running
> =====================================
>
> ldapmodify -D "cn=directory manager" -W -a
>
> dn: cn=abort 24, cn=abort cleanallruv, cn=tasks, cn=config
> objectclass: extensibleObject
> replica-base-dn: dc=example,dc=com
> cn: abort 24
> replica-id: 24
> replica-certify-all: no
> adding new entry " cn=abort 24, cn=abort cleanallruv, cn=tasks, 
> cn=config"
> ldap_add: No such object (32)
In your dse.ldif, do you see something like:

nsds5ReplicaCleanRUV: 300:00000000000000000000:no
in the replica object?
This is where the task lives as long as it hasn't been able to reach all 
servers for which a replication agreement exists.
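
To check for such leftovers without reading the whole file by eye, a small Python sketch like the one below might help. It is only an illustration: the function name is made up, and it assumes you feed it the text of your instance's dse.ldif (usually found under /etc/dirsrv/slapd-<instance>/, but adjust that to your installation). It only reads; it changes nothing.

```python
def find_cleanruv_lines(ldif_text):
    """Return (dn, value) pairs for every nsds5ReplicaCleanRUV attribute.

    Hypothetical helper for illustration; it does a plain text scan and
    is not a full LDIF parser (base64 "dn::" lines are not decoded).
    """
    # LDIF folds long lines by continuing them on the next line after a
    # single leading space; join those back together first.
    unfolded = ldif_text.replace("\r\n", "\n").replace("\n ", "")

    hits = []
    current_dn = None
    for line in unfolded.split("\n"):
        if not line.strip():
            # A blank line ends the current entry.
            current_dn = None
            continue
        lowered = line.lower()
        if lowered.startswith("dn:"):
            current_dn = line[3:].strip()
        elif lowered.startswith("nsds5replicacleanruv:"):
            hits.append((current_dn, line.split(":", 1)[1].strip()))
    return hits
```

Calling it with the contents of dse.ldif gives one pair per stuck task, so you can see which replica object still carries it.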

If the abort task doesn't work, you could try stopping the server, 
removing these lines from dse.ldif, and starting the server again.
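
If you would rather not edit the file by hand, the removal could be sketched in Python roughly as follows. Treat it as an assumption-laden example rather than a supported tool: the function name is hypothetical, it works on the file's text and leaves reading and writing to you, and it should only be used while the server is stopped and after you have made a backup copy of dse.ldif.

```python
def strip_cleanruv_lines(ldif_text):
    """Return ldif_text without any nsds5ReplicaCleanRUV attribute lines.

    Hypothetical helper for illustration only; every other line is
    passed through unchanged.
    """
    kept = []
    dropping = False
    for line in ldif_text.splitlines(keepends=True):
        if line.lower().startswith("nsds5replicacleanruv:"):
            # Drop the attribute line itself.
            dropping = True
            continue
        if dropping and line.startswith(" "):
            # Drop folded continuation lines of the removed attribute.
            continue
        dropping = False
        kept.append(line)
    return "".join(kept)
```

You would read dse.ldif into a string, pass it through this function, write the result back, and then start the server again.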