[Fedora-directory-devel] Please review: Bug 233642: MMR breaks with time skew errors

Rich Megginson rmeggins at redhat.com
Tue Jun 24 21:49:35 UTC 2008


Rich Megginson wrote:
> https://bugzilla.redhat.com/show_bug.cgi?id=233642
> Resolves: bug 233642
> Bug Description: MMR breaks with time skew errors
> Reviewed by: ???
> Files: see diff
> Branch: HEAD
> Fix Description: CSN remote offset generation seems broken.  We seem 
> to accumulate a remote offset that keeps growing until we hit the 
> limit of 1 day, then replication stops.  The idea behind the remote 
> offset is that servers may be seconds or minutes off.  When 
> replication starts, one of the items in the payload of the start extop 
> is the latest CSN from the supplier.  The CSN timestamp field is 
> (sampled_time + local offset + remote offset).  Sampled time comes 
> from the time thread in the server that updates the time once per 
> second.  This allows the consumer, if also a master, to adjust its CSN 
> generation so as not to generate duplicates or CSNs less than those 
> from the supplier.  However, the logic in csngen_adjust_time appears 
> to be wrong:
>        remote_offset = remote_time - gen->state.sampled_time;
> That is, remote_offset = (remote sampled_time + remote local offset + 
> remote remote offset) - gen->state.sampled_time
> It should be
>        remote_offset = remote_time - (sampled_time + local offset + remote offset)
> Since the sampled time is not the actual current time, it may be off 
> by 1 second.  So the new remote_offset will be at least 1 second more 
> than it should be.  Since this is the same remote_offset used to 
> generate the CSN to send back to the other master, this offset would 
> keep increasing and increasing over time.  The script attached to the 
> bug helps measure this effect.  The new code also attempts to refresh 
> the sampled time while adjusting to make sure we have as current a 
> sampled_time as possible.  In the old code, the remote_offset is 
> "sent" back and forth between the masters, carried along in the CSN 
> timestamp generation.  In the new code, this can happen too, but to a 
> far lesser extent, and should max out at (real offset + N seconds) where 
> N is the number of masters.
> In the old code, you could only call csngen_adjust_time if you first 
> made sure the remote timestamp >= local timestamp.  I have removed 
> this restriction and moved that logic into csngen_adjust_time.  I also 
> cleaned up the code in the consumer extop - I combined the checking of 
> the CSN from the extop with the max CSN from the supplier RUV - now we 
> only adjust the time once based on the max of all of these CSNs sent 
> by the supplier.
> Finally, I cleaned up the error handling in a few places that assumed 
> all errors were time skew errors.
> Platforms tested: RHEL5, F8, F9
> Flag Day: no
> Doc impact: no
> QA impact: Should test MMR and use the script to measure the offset 
> effect.
> https://bugzilla.redhat.com/attachment.cgi?id=310040&action=diff
Quick follow-up: I found a bug in my previous patch. 
_csngen_adjust_local_time must not be called when the sampled time == 
the current time.  I fixed the place where I was calling 
_csngen_adjust_local_time to check for that, and I also changed 
_csngen_adjust_local_time itself so that time_diff == 0 is a no-op.  
You can view the diffs of the diff here: 
https://bugzilla.redhat.com/attachment.cgi?oldid=310040&action=interdiff&newid=310193&headers=1
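
To make the difference between the two formulas in the fix description 
concrete, here is a minimal sketch.  It is not the code from the patch: 
the gen_state struct and the two helper functions are hypothetical 
stand-ins, and only the arithmetic is taken from the description above.  
What the server then does with the result (assign it, accumulate it, 
ignore it when it is not positive) is left out on purpose.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical, simplified generator state (not the real csngen struct). */
typedef struct {
    time_t sampled_time;   /* last value published by the time thread */
    long   local_offset;   /* local clock adjustment, in seconds */
    long   remote_offset;  /* adjustment already applied for remote skew */
} gen_state;

/* Old formula: compares the remote CSN time with the raw sampled time,
 * ignoring the offsets already folded into the local CSN timestamps. */
static long old_remote_offset(const gen_state *s, time_t remote_time)
{
    return (long)(remote_time - s->sampled_time);
}

/* Proposed formula: compares the remote CSN time with the full local
 * CSN time (sampled_time + local offset + remote offset). */
static long new_remote_offset(const gen_state *s, time_t remote_time)
{
    time_t local_csn_time = s->sampled_time + s->local_offset + s->remote_offset;
    return (long)(remote_time - local_csn_time);
}

/* Prints both results for one made-up set of values. */
static void print_example(void)
{
    gen_state s = { 1000, 0, 5 };  /* already carrying a 5 second remote offset */
    time_t remote_time = 1006;     /* timestamp taken from the supplier's CSN */

    printf("old: %ld\n", old_remote_offset(&s, remote_time)); /* old: 6 */
    printf("new: %ld\n", new_remote_offset(&s, remote_time)); /* new: 1 */
}
```

With these made-up numbers the old formula reports 6 seconds even though 
5 of them are already accounted for in the local remote offset, while 
the proposed formula reports only the remaining 1 second.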
