[Fedora-directory-devel] Please review: Bug 233642: MMR breaks with time skew errors

Rich Megginson rmeggins at redhat.com
Mon Jun 23 17:21:52 UTC 2008


https://bugzilla.redhat.com/show_bug.cgi?id=233642
Resolves: bug 233642
Bug Description: MMR breaks with time skew errors
Reviewed by: ???
Files: see diff
Branch: HEAD
Fix Description: CSN remote offset generation seems broken.  We seem to 
accumulate a remote offset that keeps growing until we hit the limit of 
1 day, then replication stops.  The idea behind the remote offset is 
that servers may be seconds or minutes off.  When replication starts, 
one of the items in the payload of the start extop is the latest CSN 
from the supplier.  The CSN timestamp field is (sampled_time + local 
offset + remote offset).  Sampled time comes from the time thread in the 
server that updates the time once per second.  This allows the consumer, 
if also a master, to adjust its CSN generation so as not to generate 
duplicates or CSNs less than those from the supplier.  However, the 
logic in csngen_adjust_time appears to be wrong:
        remote_offset = remote_time - gen->state.sampled_time;
That is,
        remote_offset = (remote sampled_time + remote local offset +
                         remote remote offset) - gen->state.sampled_time
It should be
        remote_offset = remote_time - (sampled_time + local offset +
                                       remote offset)
Since the sampled time is not the actual current time, it may be off by 
1 second.  So the new remote_offset will be at least 1 second more than 
it should be.  Since this is the same remote_offset used to generate the 
CSN to send back to the other master, this offset would keep increasing 
and increasing over time.  The script attached to the bug helps measure 
this effect.  The new code also attempts to refresh the sampled time 
while adjusting to make sure we have as current a sampled_time as 
possible.  In the old code, the remote_offset is "sent" back and forth 
between the masters, carried along in the CSN timestamp generation.  In 
the new code, this can happen too, but to a far lesser extent, and should 
max out at (real offset + N seconds) where N is the number of masters.
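To make the difference between the two formulas concrete, here is a minimal
sketch in C.  It is not the code from the patch: the struct and both function
names are hypothetical, and it only evaluates the two expressions quoted
above for one call; it does not model how csngen_adjust_time stores the
result or applies the 1 day limit.

```c
#include <time.h>

/* Hypothetical, simplified stand-in for the CSN generator state.
 * The real structure in the server has more fields. */
struct toy_csngen_state {
    time_t sampled_time;   /* updated once per second by the time thread */
    time_t local_offset;
    time_t remote_offset;
};

/* Old expression: the remote CSN timestamp is compared against the
 * local sampled time only, ignoring offsets already accumulated. */
static long offset_vs_sampled_time(const struct toy_csngen_state *s,
                                   time_t remote_time)
{
    return (long)(remote_time - s->sampled_time);
}

/* Proposed expression: the remote CSN timestamp is compared against
 * the full local CSN timestamp (sampled time + both offsets). */
static long offset_vs_local_timestamp(const struct toy_csngen_state *s,
                                      time_t remote_time)
{
    time_t local_ts = s->sampled_time + s->local_offset + s->remote_offset;
    return (long)(remote_time - local_ts);
}
```

With made-up values sampled_time = 1000, local_offset = 0, remote_offset = 5
and a remote timestamp of 1006, the first function returns 6 and the second
returns 1.  The first result re-counts the 5 seconds the local generator
already carries, which is the kind of over-estimate that gets passed back to
the other master in the next CSN.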
In the old code, you could only call csngen_adjust_time if you first 
made sure the remote timestamp >= local timestamp.  I have removed this 
restriction and moved that logic into csngen_adjust_time.  I also 
cleaned up the code in the consumer extop - I combined the checking of 
the CSN from the extop with the max CSN from the supplier RUV - now we 
only adjust the time once based on the max of all of these CSNs sent by 
the supplier.
Finally, I cleaned up the error handling in a few places that assumed 
all errors were time skew errors.
Platforms tested: RHEL5, F8, F9
Flag Day: no
Doc impact: no
QA impact: Should test MMR and use the script to measure the offset effect.
https://bugzilla.redhat.com/attachment.cgi?id=310040&action=diff



