[Freeipa-users] Haunted servers?

thierry bordaz tbordaz at redhat.com
Tue May 26 14:04:18 UTC 2015


On 05/26/2015 08:47 AM, Martin Kosek wrote:
> On 05/26/2015 12:20 AM, Janelle wrote:
>> On 5/24/15 3:12 AM, Janelle wrote:
>>> And just like that, my haunted servers have all returned.
>>> I am going to just put a gun to my head and be done with it. :-(
>>>
>>> Why do things run perfectly and then suddenly ???
>>> Logs show little to nothing, mostly because the servers are so busy,
>>> they have already rotated out.
>>>
>>> unable to decode  {replica 16} 55356472000300100000 55356472000300100000
>>> unable to decode  {replica 22} 55371e9e000000160000 553eec64000400160000
>>> unable to decode  {replica 23} 5545d61f000200170000 55543240000300170000
>>> unable to decode  {replica 24} 554d53d3000000180000 554d54a4000200180000
>>> unable to decode  {replica 25} 554d78bf000000190000 555af302000400190000
>>> unable to decode  {replica 9} 55402c39000300090000 55402c39000300090000
>>>
>>> Don't know what to do anymore. At my wit's end..
>>>
>>> ~J
>> So things are getting more interesting.  Still trying to find the
>> "leaking server(s)".  here is what I mean by that. As you see, I
>> continue to find these -- BUT, notice a new symptom -- replica 9 does
>> NOT show any other data - it is blank?
>
> Hello Janelle,
>
> Thanks for update. So you worry that there might still be the "rogue 
> IPA replica" that would be injecting the wrong replica data?
>
> In any case, I bet Ludwig and Thierry will follow up with your thread, 
> there is just delay caused by the various public holidays and PTOs 
> this week and we need to rest before digging into the fun with RUVs - 
> as you already know yourself :-)
>
>> unable to decode  {replica 16} 55356472000300100000 55356472000300100000
>> unable to decode  {replica 22} 55371e9e000000160000 553eec64000400160000
>> unable to decode  {replica 24} 554d53d3000100180000 554d54a4000200180000
>> unable to decode  {replica 25} 554d78bf000200190000 555af302000400190000
>> unable to decode  {replica 9}
>>
>> Now, if I delete these from a server using the ldapmodify method -
>> they go away briefly, but then if I restart the server, they come back.
>>
>> Let me try to explain -- given a number of servers, say 8, if I use
>> ldapmodify to delete from 1 of those, they seem to go away from maybe
>> 4 of them -- but if I wait a few minutes, it is almost as though
>> "replication" is re-adding these bad replicas from the servers that I
>> have NOT deleted them from.

On each server (master/replica) there is one RUV in the database and
one RUV in the changelog.
When cleanallruv succeeds, it clears both of them. All replicas should be
reachable when you issue cleanallruv, so that it can clean the RUVs on
all the replicas in almost a "single" operation.
If some replicas are not reachable, they keep information about the
cleaned RID and can later propagate that "old" RID to the rest of the
replicas.
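
As an aid for reading the "unable to decode" lines quoted earlier in this
thread, here is a minimal Python sketch that splits the two values after
each {replica N} into their fields. It assumes the usual 389 Directory
Server CSN layout of 20 hex digits (8 for the timestamp, 4 for the
sequence number, 4 for the replica ID, 4 for the subsequence number),
which is consistent with the values above (for example, 0010 hex is
replica 16). The names CSN, parse_csn and parse_ruv_line are made up for
this illustration and are not part of any IPA or 389-ds tool.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple, Optional, Tuple


class CSN(NamedTuple):
    """Decoded change sequence number (assumed 8+4+4+4 hex layout)."""
    timestamp: datetime
    seqnum: int
    replica_id: int
    subseqnum: int


def parse_csn(text: str) -> CSN:
    """Split a 20-hex-digit CSN string into its four fields."""
    if not re.fullmatch(r"[0-9a-fA-F]{20}", text):
        raise ValueError(f"not a 20-hex-digit CSN: {text!r}")
    seconds = int(text[0:8], 16)
    return CSN(
        timestamp=datetime.fromtimestamp(seconds, tz=timezone.utc),
        seqnum=int(text[8:12], 16),
        replica_id=int(text[12:16], 16),
        subseqnum=int(text[16:20], 16),
    )


# "{replica N}" optionally followed by a first and a last CSN.
RUV_LINE = re.compile(
    r"\{replica (\d+)\}\s*([0-9a-fA-F]{20})?\s*([0-9a-fA-F]{20})?"
)


def parse_ruv_line(line: str) -> Optional[Tuple[int, Optional[CSN], Optional[CSN]]]:
    """Return (replica id, first CSN, last CSN), or None if no RUV element.

    A bare "{replica 9}" with no CSNs yields (9, None, None).
    """
    match = RUV_LINE.search(line)
    if match is None:
        return None
    first = parse_csn(match.group(2)) if match.group(2) else None
    last = parse_csn(match.group(3)) if match.group(3) else None
    return int(match.group(1)), first, last


if __name__ == "__main__":
    sample = "unable to decode  {replica 22} 55371e9e000000160000 553eec64000400160000"
    rid, first, last = parse_ruv_line(sample)
    print(rid, first.timestamp.isoformat(), last.timestamp.isoformat())
```

Comparing the decoded timestamp of the last CSN across servers may help
narrow down when a given RID was last updated, which in turn gives a time
window to look at in the logs before they rotate. A wrapped entry has to
be joined back onto one line before it is passed to parse_ruv_line.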

Ludwig managed to reproduce the issue with a fairly complex test case (3
replicas and multiple cleanallruv runs).
We have not yet identified how a cleaned replica ID can get
resurrected.
In parallel, we have just reproduced it in a 2-replica topology, although
without a clear test case.


>>
>> So my question is simple - is there something in the logs I can look
>> for that would indicate the SOURCE of these bogus entries?  Is the
>> replica 9 with NO extra data any indication of something I could look
>> for?

I guess that if I had the answer to your question, we would have
understood the bug...

>>
>> I am not willing to give up easily (as you might have already
>> guessed) and I am determined to find the cause of these.  I know we
>> need more logs, but with all the traffic, the logs rollover within a
>> few hours, and if the problem is happening at 3am for example, I am
>> not able to track it down because the logs have rolled.
>>
>> Back to my investigations.
>> ~J
>>
>



