[Freeipa-users] replication again :-(

thierry bordaz tbordaz at redhat.com
Fri May 22 07:59:14 UTC 2015


On 05/21/2015 06:09 PM, Janelle wrote:
> On 5/21/15 8:12 AM, Ludwig Krispenz wrote:
>>
>> On 05/21/2015 03:59 PM, Janelle wrote:
>>> On 5/21/15 6:46 AM, Ludwig Krispenz wrote:
>>>>
>>>> On 05/21/2015 03:28 PM, Janelle wrote:
>>>>> I think I found the problem.
>>>>>
>>>>> There was a lone replica running in another DC. It was installed 
>>>>> as a replica some time ago with all the others. Think of this -- 
>>>>> the original config had 5 servers, one of them was this server. 
>>>>> Then the other 4 servers were RE-BUILT from scratch, so all the 
>>>>> replication agreements were changed AND - this is the important 
>>>>> part - the 5th server was never added back in. BUT - the 5th 
>>>>> server was left running and was never told that it was no longer 
>>>>> a member. It still thought it had a replication agreement with 
>>>>> the original "server 1", but server 1 knew otherwise.
>>>>>
>>>>> Now, although the first 4 servers were rebuilt, the same domain, 
>>>>> realm, AND passwords were used.
>>>>>
>>>>> I am guessing that somehow this 5th server keeps trying to 
>>>>> interject its info into the ring of 4 servers, kind of forcing its 
>>>>> way in. Because the original credentials still work (even though 
>>>>> the certs are all different), the first 4 servers are left with a 
>>>>> "can't decode" issue.
>>>>>
>>>>> There should be some security checks so this can't happen. It 
>>>>> should also be easy to reproduce.
>>>>>
>>>>> Now I have to go re-initialize all the servers from a good server, 
>>>>> so everyone is happy again. The "problem" server has been shut down 
>>>>> completely. (and yes, there were actually 3 of them in my scenario 
>>>>> - I just used 1 to simplify my example - but that explains the 3 
>>>>> CSNs that just kept "appearing")
>>>>>
>>>>> What concerns me most about this: were the servers outside of the 
>>>>> "good ring" somehow able to inject data into replication, which 
>>>>> might have been causing bad data? This is bad if it is true.
>>>> it depends a bit on what you mean by rebuilt from scratch.
>>>> A replication session needs to meet three conditions to be able to 
>>>> send data:
>>>> - the supplier side needs to be able to authenticate, and the 
>>>> authenticated user has to be in the list of bind DNs of the replica
>>>> - the data generation of the supplier and consumer sides needs to 
>>>> be the same (they all have to have the same common origin)
>>>> - the supplier needs to have the changes (CSNs) to be able to 
>>>> position itself in its changelog and send the updates
>>>> (the first two can be checked over LDAP, see the sketch below)
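>>>>
>>>> a minimal sketch, assuming the suffix dc=example,dc=com - the 
>>>> replica DN under cn=mapping tree depends on your suffix:
>>>>
>>>> # which bind DNs the replica accepts for replication sessions
>>>> ldapsearch -x -D "cn=directory manager" -W \
>>>>     -b 'cn=replica,cn="dc=example,dc=com",cn=mapping tree,cn=config' \
>>>>     nsDS5ReplicaBindDN
>>>>
>>>> # the RUV tombstone; the {replicageneration} value is the data
>>>> # generation and must be identical on all servers
>>>> ldapsearch -x -D "cn=directory manager" -W -b "dc=example,dc=com" \
>>>>     '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))' \
>>>>     nsds50ruv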
>>>>
>>>> now if you have 5 servers, forget about one of them, do not change 
>>>> the credentials on the others, and do not reinitialize the database 
>>>> by an ldif import (which would generate a new database generation), 
>>>> the fifth server will still be able to connect and eventually send 
>>>> updates - how should the other servers know that this one is no 
>>>> longer a "good" one?
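>>>>
>>>> one way to lock such a forgotten server out is to remove its bind 
>>>> DN from the replica entry; a sketch, assuming it authenticated as a 
>>>> dedicated replication manager entry (the DN below is an example, 
>>>> not necessarily what your deployment uses):
>>>>
>>>> ldapmodify -x -D "cn=directory manager" -W <<EOF
>>>> dn: cn=replica,cn="dc=example,dc=com",cn=mapping tree,cn=config
>>>> changetype: modify
>>>> delete: nsDS5ReplicaBindDN
>>>> nsDS5ReplicaBindDN: cn=replication manager,cn=config
>>>> EOF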
>>>>>
>>>>> ~Janelle
>>>>>
>>>>
>>> The only problem left now is that, no matter what, this last entry 
>>> will NOT go away, and I have 2 "stuck" CLEANALLRUV tasks that will 
>>> not "abort" either.
>>>
>>> unable to decode  {replica 24} 554d53d3000000180000 
>>> 554d54a4000200180000
>>>
>>> CLEANALLRUV tasks
>>> RID 24  None
>>> No abort CLEANALLRUV tasks running
>>> =====================================
>>>
>>> ldapmodify -D "cn=directory manager" -W -a
>>>
>>> dn: cn=abort 24, cn=abort cleanallruv, cn=tasks, cn=config
>>> objectclass: extensibleObject
>>> replica-base-dn: dc=example,dc=com
>>> cn: abort 24
>>> replica-id: 24
>>> replica-certify-all: no
>>> adding new entry " cn=abort 24, cn=abort cleanallruv, cn=tasks, 
>>> cn=config"
>>> ldap_add: No such object (32)
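>>>
>>> (the task entries can also be queried directly - they live under 
>>> cn=cleanallruv,cn=tasks,cn=config and its abort counterpart, e.g.:)
>>>
>>> ldapsearch -x -D "cn=directory manager" -W \
>>>     -b "cn=cleanallruv,cn=tasks,cn=config" "(objectclass=*)"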
>> in your dse.ldif, do you see something like
>>
>> nsds5ReplicaCleanRUV: 300:00000000000000000000:no
>>
>> in the replica object? This is where the task lives as long as it 
>> couldn't reach all servers for which a replication agreement exists.
>>
>> If the abort task doesn't work, you could try stopping the server, 
>> removing these lines from the dse.ldif, and starting the server 
>> again - roughly as sketched below.
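>>
>> (EXAMPLE-COM is a placeholder for your instance name; this assumes 
>> the nsds5ReplicaCleanRUV values are single, unwrapped lines, which 
>> they normally are:)
>>
>> systemctl stop dirsrv@EXAMPLE-COM.service
>> # back up dse.ldif, then strip the stuck task attribute
>> cp -p /etc/dirsrv/slapd-EXAMPLE-COM/dse.ldif{,.bak}
>> sed -i '/^nsds5ReplicaCleanRUV/d' /etc/dirsrv/slapd-EXAMPLE-COM/dse.ldif
>> systemctl start dirsrv@EXAMPLE-COM.service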
>
> Sadly, there is nothing even close to that anywhere. And now, after 
> trying to remove another replica which had been showing as a 
> duplicate, I am afraid to do anything else to replication, for fear 
> of bringing all of production down - although authentication is 
> continuing to work.
>
> I did not notice this at first, but in the RUVs I shared yesterday 
> there was something I missed:
>
> dc1-ipa1.example.com 389  10
> dc1-ipa2.example.com 389  25
> dc1-ipa2.example.com 389  9
> dc1-ipa3.example.com 389  8
> dc1-ipa4.example.com 389  4
>
> ipa2 appears twice, with RID 9 and RID 25 - with no explanation.
>
> Frustrated.
> ~Janelle
>
Hi Janelle,

Yes, I mentioned that duplicate yesterday. It means the node 
dc1-ipa2.example.com is a master that used to be known as RID 9 and is 
now known as RID 25 (or the opposite).
Did you reinstall that node? The purpose of CleanAllRuv is to clear the 
old value from the RUV.
By editing dc1-ipa2.example.com's dse.ldif you can confirm the current 
value and choose which one needs to be cleared; a sketch of both steps 
is below.
When you have a duplicated RID you may see 'attrlist_replace: ...' 
messages in the error logs.
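
(slapd-DC1-IPA2 is a placeholder for the instance name, and this 
assumes the dse.ldif shows nsDS5ReplicaId: 25, making RID 9 the stale 
value to clean:)

grep nsDS5ReplicaId /etc/dirsrv/slapd-DC1-IPA2/dse.ldif

ldapmodify -x -D "cn=directory manager" -W -a <<EOF
dn: cn=clean 9, cn=cleanallruv, cn=tasks, cn=config
objectclass: extensibleObject
replica-base-dn: dc=example,dc=com
cn: clean 9
replica-id: 9
replica-force-cleaning: no
EOF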

Thanks
thierry