[Freeipa-users] replication again :-(

Thu May 21 17:10:38 UTC 2015

On 05/21/2015 09:59 AM, Janelle wrote:
> On 5/21/15 6:46 AM, Ludwig Krispenz wrote:
>>
>> On 05/21/2015 03:28 PM, Janelle wrote:
>>> I think I found the problem.
>>>
>>> There was a lone replica running in another DC. It was installed as 
>>> a replica some time ago with all the others. Think of this -- the 
>>> original config had 5 servers, one of them was this server. Then the 
>>> other 4 servers were RE-BUILT from scratch, so all the replication 
>>> agreements were changed AND - this is the important part - the 5th 
>>> server was never added back in. BUT - the 5th server was left 
>>> running and was never told that it was no longer a member. It still 
>>> thought it had a replication agreement with original "server 1", but 
>>> server 1 knew otherwise.
>>>
>>> Now, although the first 4 servers were rebuilt, the same domain, 
>>> realm, AND passwords were used.
>>>
>>> I am guessing that somehow, this 5th server keeps trying to 
>>> interject its info into the ring of 4 servers, kind of forcing its 
>>> way in. Somehow, the fact that the original credentials still work 
>>> (but the certs are all different) is leaving the first 4 servers with a 
>>> "can't decode" issue.
>>>
>>> There should be some security checks so this can't happen. It should 
>>> also be easy to reproduce.
>>>
>>> Now I have to go re-initialize all the servers from a good server, 
>>> so everyone is happy again. The "problem" server has been shut down 
>>> completely. (and yes, there were actually 3 of them in my scenario - 
>>> I just used 1 to simplify my example - but that explains the 3 CSNs 
>>> that just kept "appearing")
>>>
>>> What concerns me most about this - were the servers outside of the 
>>> "good ring" somehow able to inject data into replication which might 
>>> have been causing bad data??? This is bad if it is true.
>> It depends a bit on what you mean by "rebuilt from scratch".
>> A replication session needs to meet three conditions to be able to 
>> send data:
>> - the supplier side needs to be able to authenticate, and the 
>> authenticated user has to be in the list of bind DNs of the replica
>> - the data generation of the supplier and consumer sides needs to be the 
>> same (they all have to share a common origin)
>> - the supplier needs to have the changes (CSNs) to be able to 
>> position in its changelog to send updates
>>
>> Now, if you have 5 servers, forget about one of them, do not change 
>> the credentials on the others, and do not reinitialize the database by 
>> an LDIF import to generate a new database generation, the fifth 
>> server will still be able to connect and eventually send updates. 
>> How should the other servers know that this one is no longer a "good" 
>> one?
>>>
>>> ~Janelle
>>>
>>
> The only problem left now - is no matter what, this last entry will 
> NOT go away and now I have 2 "stuck" cleanruvs that will not "abort" 
> either.
>
> unable to decode  {replica 24} 554d53d3000000180000 554d54a4000200180000
>
> CLEANALLRUV tasks
> RID 24  None
> No abort CLEANALLRUV tasks running
> =====================================
>
> ldapmodify -D "cn=directory manager" -W -a
>
> dn: cn=abort 24, cn=abort cleanallruv, cn=tasks, cn=config
> objectclass: extensibleObject
> replica-base-dn: dc=example,dc=com
> cn: abort 24
> replica-id: 24
> replica-certify-all: no
> adding new entry " cn=abort 24, cn=abort cleanallruv, cn=tasks, cn=config"
> ldap_add: No such object (32)
There should not be any white space at the beginning of the DN: " cn=abort 24, 
cn=abort cleanallruv, cn=tasks, cn=config"

When I run the abort task I don't have that extra white space, and the 
task is successfully added:

[root@localhost ~]# ldapmodify -D cn=dm -w password -a
dn: cn=abort 24, cn=abort cleanallruv, cn=tasks, cn=config
objectclass: extensibleObject
replica-base-dn: dc=example,dc=com
cn: abort 24
replica-id: 24
replica-certify-all: no

adding new entry "cn=abort 24, cn=abort cleanallruv, cn=tasks, cn=config"
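
To avoid retyping the entry by hand (and picking up stray characters along the way), the LDIF can be generated and piped into ldapmodify. This is only a minimal sketch: the function name abort_cleanallruv_ldif is hypothetical, and the replica ID and suffix are the example values from this thread, so substitute your own.

```shell
# Hypothetical helper: print the abort-task LDIF for a replica ID and suffix.
# $1 = replica ID (e.g. 24), $2 = replicated suffix (e.g. dc=example,dc=com)
abort_cleanallruv_ldif() {
    rid=$1
    suffix=$2
    cat <<EOF
dn: cn=abort ${rid}, cn=abort cleanallruv, cn=tasks, cn=config
objectclass: extensibleObject
replica-base-dn: ${suffix}
cn: abort ${rid}
replica-id: ${rid}
replica-certify-all: no
EOF
}

# Usage (needs a running server, so it is left commented out here):
# abort_cleanallruv_ldif 24 dc=example,dc=com | ldapmodify -D "cn=directory manager" -W -a
```

Printing the LDIF first also makes it easy to inspect the dn: line before anything is sent to the server.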

The extra white space is the probable cause of the error 32 (no such 
object) you were seeing. You can verify this by looking at the access 
log (/var/log/dirsrv/slapd-INSTANCE/access).
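
One possible way to do that check is sketched below. It assumes the usual access log layout, where each operation is logged as a "conn=N op=N ADD dn=..." line followed by a matching "RESULT err=N" line; the function name show_abort_adds is hypothetical, and INSTANCE is a placeholder for your instance name.

```shell
# Hypothetical helper: list ADD operations that target the abort task,
# together with any RESULT lines that returned err=32 (no such object).
# $1 = path to the access log
show_abort_adds() {
    grep -E 'ADD dn=".*cn=abort cleanallruv|RESULT err=32' "$1"
}

# Usage (INSTANCE is a placeholder, so this is left commented out):
# show_abort_adds /var/log/dirsrv/slapd-INSTANCE/access
```

Match the conn= and op= numbers of an ADD line with those of a RESULT line to see exactly which DN the server received for the failed add.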

As I said before, you can also check the errors log to see why the 
cleanAllRUV task is not completing.
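
A quick way to pull the relevant lines out of the errors log is sketched here. It assumes the errors log sits next to the access log (/var/log/dirsrv/slapd-INSTANCE/errors) and that the task's messages mention "cleanallruv" in some capitalization; the function name show_cleanallruv_errors is hypothetical.

```shell
# Hypothetical helper: show the most recent errors-log lines that mention
# the cleanAllRUV task, matching case-insensitively.
# $1 = path to the errors log, $2 = number of lines to show (default 20)
show_cleanallruv_errors() {
    grep -i 'cleanallruv' "$1" | tail -n "${2:-20}"
}

# Usage (INSTANCE is a placeholder, so this is left commented out):
# show_cleanallruv_errors /var/log/dirsrv/slapd-INSTANCE/errors 50
```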

Regards,
Mark
