[Fedora-directory-devel] Please review: Bug 233642: MMR breaks with time skew errors
Rich Megginson
rmeggins at redhat.com
Tue Jun 24 21:49:35 UTC 2008
Rich Megginson wrote:
> https://bugzilla.redhat.com/show_bug.cgi?id=233642
> Resolves: bug 233642
> Bug Description: MMR breaks with time skew errors
> Reviewed by: ???
> Files: see diff
> Branch: HEAD
> Fix Description: CSN remote offset generation seems broken. We seem
> to accumulate a remote offset that keeps growing until we hit the
> limit of 1 day, then replication stops. The idea behind the remote
> offset is that servers may be seconds or minutes off. When
> replication starts, one of the items in the payload of the start extop
> is the latest CSN from the supplier. The CSN timestamp field is
> (sampled_time + local offset + remote offset). Sampled time comes
> from the time thread in the server that updates the time once per
> second. This allows the consumer, if also a master, to adjust its CSN
> generation so as not to generate duplicates or CSNs less than those
> from the supplier. However, the logic in csngen_adjust_time appears
> to be wrong:
>     remote_offset = remote_time - gen->state.sampled_time;
> That is,
>     remote_offset = (remote sampled_time + remote local offset +
>                      remote remote offset) - gen->state.sampled_time
> It should be
>     remote_offset = remote_time - (sampled_time + local offset +
>                                    remote offset)
> Since the sampled time is not the actual current time, it may be off
> by 1 second. So the new remote_offset will be at least 1 second more
> than it should be. Since this is the same remote_offset used to
> generate the CSN to send back to the other master, this offset would
> keep increasing and increasing over time. The script attached to the
> bug helps measure this effect. The new code also attempts to refresh
> the sampled time while adjusting to make sure we have as current a
> sampled_time as possible. In the old code, the remote_offset is
> "sent" back and forth between the masters, carried along in the CSN
> timestamp generation. In the new code, this can happen too, but to a
> far lesser extent, and should max out at (real offset + N seconds) where
> N is the number of masters.
> In the old code, you could only call csngen_adjust_time if you first
> made sure the remote timestamp >= local timestamp. I have removed
> this restriction and moved that logic into csngen_adjust_time. I also
> cleaned up the code in the consumer extop - I combined the checking of
> the CSN from the extop with the max CSN from the supplier RUV - now we
> only adjust the time once based on the max of all of these CSNs sent
> by the supplier.
> Finally, I cleaned up the error handling in a few places that assumed
> all errors were time skew errors.
> Platforms tested: RHEL5, F8, F9
> Flag Day: no
> Doc impact: no
> QA impact: Should test MMR and use the script to measure the offset
> effect.
> https://bugzilla.redhat.com/attachment.cgi?id=310040&action=diff
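
To make the difference between the two expressions in the fix description concrete, here is a minimal sketch in C. The struct and function names (toy_csngen, toy_csn_timestamp, old_remote_offset, new_remote_offset) are hypothetical and are not taken from the server source; the sketch only evaluates the two expressions and does not model how the server stores or bounds the result.

```c
#include <time.h>

/* Hypothetical, simplified stand-in for the generator state described
 * in the fix description; the real server structure is not shown here. */
typedef struct {
    time_t sampled_time;  /* refreshed about once per second by the time thread */
    time_t local_offset;
    time_t remote_offset;
} toy_csngen;

/* Timestamp that would go into a locally generated CSN:
 * sampled_time + local offset + remote offset. */
time_t toy_csn_timestamp(const toy_csngen *gen)
{
    return gen->sampled_time + gen->local_offset + gen->remote_offset;
}

/* Old expression: compares the remote CSN timestamp, which already
 * contains the remote side's offsets, with the raw local sample. */
long old_remote_offset(const toy_csngen *gen, time_t remote_time)
{
    return (long)(remote_time - gen->sampled_time);
}

/* Expression proposed in the fix: compares the remote CSN timestamp
 * with the full local CSN timestamp, so both sides include offsets. */
long new_remote_offset(const toy_csngen *gen, time_t remote_time)
{
    return (long)(remote_time - toy_csn_timestamp(gen));
}
```

With made-up values sampled_time = 1000, local_offset = 0, remote_offset = 5 and a remote timestamp of 1006, the old expression yields 6 and the new one yields 1: the 5 seconds already folded into the local timestamp are not counted a second time.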
Quick follow-up: I found a bug in my previous patch.
_csngen_adjust_local_time must not be called when the sampled time ==
the current time. So I fixed that where I was calling
_csngen_adjust_local_time, and I also changed _csngen_adjust_local_time
so that time_diff == 0 is a no-op. You can view the interdiff between
the two patches here:
https://bugzilla.redhat.com/attachment.cgi?oldid=310040&action=interdiff&newid=310193&headers=1
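
For readers without the interdiff at hand, the two guards can be sketched roughly as follows. This is not the patch itself: toy_clock_state, toy_adjust_local_time and toy_refresh_sample are hypothetical names, and the non-zero branch is only a placeholder, since this message does not show what the real function does with the offsets.

```c
#include <time.h>

/* Hypothetical state; the adjustments counter exists only so the
 * no-op case can be observed from outside. */
typedef struct {
    time_t sampled_time;
    int adjustments;
} toy_clock_state;

/* Callee-side guard: returns 1 if the state was changed, 0 if not. */
int toy_adjust_local_time(toy_clock_state *st, time_t cur_time)
{
    time_t time_diff = cur_time - st->sampled_time;

    if (time_diff == 0) {
        return 0;  /* no-op: the sample is already current */
    }
    /* Placeholder assumption: just store the newer (or older) sample.
     * The real function presumably also adjusts its offsets here. */
    st->sampled_time = cur_time;
    st->adjustments++;
    return 1;
}

/* Caller-side guard: do not call the adjust function at all when the
 * sampled time already equals the current time. */
int toy_refresh_sample(toy_clock_state *st, time_t cur_time)
{
    if (cur_time == st->sampled_time) {
        return 0;
    }
    return toy_adjust_local_time(st, cur_time);
}
```

Having the check in both places is redundant by design in this sketch: the caller avoids the call, and the callee stays harmless if some other caller forgets to check.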