[Freeipa-users] replication again :-(

Thu May 21 16:09:59 UTC 2015

On 5/21/15 8:12 AM, Ludwig Krispenz wrote:
>
> On 05/21/2015 03:59 PM, Janelle wrote:
>> On 5/21/15 6:46 AM, Ludwig Krispenz wrote:
>>>
>>> On 05/21/2015 03:28 PM, Janelle wrote:
>>>> I think I found the problem.
>>>>
>>>> There was a lone replica running in another DC. It was installed as 
>>>> a replica some time ago with all the others. Think of this -- the 
>>>> original config had 5 servers, one of them was this server. Then 
>>>> the other 4 servers were RE-BUILT from scratch, so all the 
>>>> replication agreements were changed AND - this is the important 
>>>> part - the 5th server was never added back in. BUT - the 5th server 
>>>> was left running and was never told that it was no longer a
>>>> member. It still thought it had a replication agreement with the
>>>> original "server 1", but server 1 knew otherwise.
>>>>
>>>> Now, although the first 4 servers were rebuilt, the same domain, 
>>>> realm, AND passwords were used.
>>>>
>>>> I am guessing that somehow, this 5th server keeps trying to 
>>>> interject its info into the ring of 4 servers, kind of forcing its 
>>>> way in. Because the original credentials still work (but the
>>>> certs are all different), this leaves the first 4 servers with a
>>>> "can't decode" issue.
>>>>
>>>> There should be some security checks so this can't happen. It
>>>> should also be easy to reproduce.
>>>>
>>>> Now I have to go re-initialize all the servers from a good server, 
>>>> so everyone is happy again. The "problem" server has been shut down
>>>> completely. (and yes, there were actually 3 of them in my scenario 
>>>> - I just used 1 to simplify my example - but that explains the 3 
>>>> CSNs that just kept "appearing")
>>>>
>>>> What concerns me most about this - were the servers outside of the 
>>>> "good ring" somehow able to inject data into replication which 
>>>> might have been causing bad data??? This is bad if it is true.
>>> it depends a bit on what you mean by rebuilt from scratch.
>>> A replication session needs to meet three conditions to be able to 
>>> send data:
>>> - the supplier side needs to be able to authenticate, and the
>>> authenticated user has to be in the list of bind DNs of the replica
>>> - the data generation of the supplier and consumer sides needs to be
>>> the same (they all have to have the same common origin)
>>> - the supplier needs to have the changes (CSNs) to be able to
>>> position itself in its changelog to send updates
>>>
>>> now if you have 5 servers, forget about one of them and do not 
>>> change the credentials in the others and do not reinitialize the 
>>> database by an ldif import to generate a new database generation, 
>>> the fifth server will still be able to connect and eventually send 
>>> updates - how should the other servers know that this one is no
>>> longer a "good" one?
>>>>
>>>> ~Janelle
>>>>
>>>
>> The only problem left now - is no matter what, this last entry will 
>> NOT go away and now I have 2 "stuck" cleanruvs that will not "abort" 
>> either.
>>
>> unable to decode  {replica 24} 554d53d3000000180000 554d54a4000200180000
>>
>> CLEANALLRUV tasks
>> RID 24  None
>> No abort CLEANALLRUV tasks running
>> =====================================
>>
>> ldapmodify -D "cn=directory manager" -W -a
>>
>> dn: cn=abort 24, cn=abort cleanallruv, cn=tasks, cn=config
>> objectclass: extensibleObject
>> replica-base-dn: dc=example,dc=com
>> cn: abort 24
>> replica-id: 24
>> replica-certify-all: no
>> adding new entry " cn=abort 24, cn=abort cleanallruv, cn=tasks, 
>> cn=config"
>> ldap_add: No such object (32)
> in your dse.ldif do you see something like:
>
> nsds5ReplicaCleanRUV: 300:00000000000000000000:no
> in the replica object ?
> This is where the task lives as long as it couldn't reach all servers 
> for which a replication agreement exists.
>
> If the abort task doesn't work, you could try to stop the server, remove
> these lines from the dse.ldif, and start the server again.

Sadly, nothing even close to that anywhere. And now, after trying to 
remove another replica which had been showing as a duplicate, although 
authentication is continuing to work, I am afraid to try and do anything 
else to replication, for fear of bringing all of production down.
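
For reference, the two values on the "unable to decode" line quoted above can be split apart by hand. The sketch below is illustrative only and assumes the usual 389 Directory Server CSN layout of 20 hex digits: 8 for the timestamp, 4 for the sequence number, 4 for the replica ID and 4 for the sub-sequence number. The names CSN and parse_csn are made up for this example and are not part of any IPA or 389 DS tool. Under that assumption both CSNs decode to replica ID 24, which matches the {replica 24} shown on the same line.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class CSN(NamedTuple):
    timestamp: int   # seconds since the Unix epoch
    seqnum: int      # sequence number within that second
    rid: int         # replica ID that generated the change
    subseqnum: int   # sub-sequence number


def parse_csn(text: str) -> CSN:
    """Split a 20-hex-digit CSN string into four fields (assumed layout)."""
    if len(text) != 20:
        raise ValueError(f"expected 20 hex digits, got {len(text)}")
    return CSN(
        timestamp=int(text[0:8], 16),
        seqnum=int(text[8:12], 16),
        rid=int(text[12:16], 16),
        subseqnum=int(text[16:20], 16),
    )


# The two CSNs from the "unable to decode" line in this thread.
for raw in ("554d53d3000000180000", "554d54a4000200180000"):
    csn = parse_csn(raw)
    when = datetime.fromtimestamp(csn.timestamp, tz=timezone.utc)
    print(raw, "rid:", csn.rid, "seq:", csn.seqnum, "time:", when.isoformat())
```

This only reads the string; it says nothing about why the server itself cannot decode the RUV element.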

I did not notice this at first - but yesterday when I shared my RUVs -- 
there was something I missed:

dc1-ipa1.example.com 389  10
dc1-ipa2.example.com 389  25
dc1-ipa2.example.com 389  9
dc1-ipa3.example.com 389  8
dc1-ipa4.example.com 389  4

ipa2 appears twice, with replica IDs 9 and 25, and with no explanation.
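
A duplicate like this is easy to miss by eye. A minimal sketch of a check, assuming the listing is plain text with three whitespace-separated columns (host, port, replica ID) as pasted above; the function names are hypothetical and this is not an IPA command:

```python
from collections import defaultdict

# The RUV listing exactly as pasted in this message.
RUV_LISTING = """\
dc1-ipa1.example.com 389  10
dc1-ipa2.example.com 389  25
dc1-ipa2.example.com 389  9
dc1-ipa3.example.com 389  8
dc1-ipa4.example.com 389  4
"""


def replica_ids_by_host(listing):
    """Map each host name to the list of replica IDs listed for it."""
    ids = defaultdict(list)
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        host, _port, rid = parts
        ids[host].append(int(rid))
    return dict(ids)


def hosts_with_multiple_ids(listing):
    """Return only the hosts that appear with more than one replica ID."""
    return {
        host: sorted(rids)
        for host, rids in replica_ids_by_host(listing).items()
        if len(rids) > 1
    }


print(hosts_with_multiple_ids(RUV_LISTING))
```

It only flags the duplicate; deciding which of the two IDs is the stale one still has to be done by hand.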

Frustrated.
~Janelle