[Fedora-directory-commits] ldapserver/ldap/servers/slapd csngen.c, 1.7, 1.8

Richard Allen Megginson (rmeggins) fedora-directory-commits at redhat.com
Tue Jun 24 22:22:12 UTC 2008


Author: rmeggins

Update of /cvs/dirsec/ldapserver/ldap/servers/slapd
In directory cvs-int.fedora.redhat.com:/tmp/cvs-serv14162/ldapserver/ldap/servers/slapd

Modified Files:
	csngen.c 
Log Message:
Resolves: bug 233642
Bug Description: MMR breaks with time skew errors
Reviewed by: nhosoi, nkinder (Thanks!)
Fix Description: CSN remote offset generation seems broken.  We seem to accumulate a remote offset that keeps growing until we hit the limit of 1 day, then replication stops.  The idea behind the remote offset is that servers may be seconds or minutes off.  When replication starts, one of the itmes in the payload of the start extop is the latest CSN from the supplier.  The CSN timestamp field is (sampled_time + local offset + remote offset).  Sampled time comes from the time thread in the server that updates the time once per second.  This allows the consumer, if also a master, to adjust its CSN generation so as not to generate duplicates or CSNs less than those from the supplier.  However, the logic in csngen_adjust_time appears to be wrong:
       remote_offset = remote_time - gen->state.sampled_time;
That is, remote_offset = (remote sampled_time + remote local offset + remote remote offset) - gen->state.sampled_time
It should be
       remote_offset = remote_time - (sampled_time + local offset + remote offset)
Since the sampled time is not the actual current time, it may be off by 1 second.  So the new remote_offset will be at least 1 second more than it should be.  Since this is the same remote_offset used to generate the CSN to send back to the other master, this offset would keep increasing and increasing over time.  The script attached to the bug helps measure this effect.  The new code also attempts to refresh the sampled time while adjusting to make sure we have as current a sampled_time as possible.  In the old code, the remote_offset is "sent" back and forth between the masters, carried along in the CSN timestamp generation.  In the new code, this can happen too, but to a far less extent, and should max out at (real offset + N seconds) where N is the number of masters.
In the old code, you could only call csngen_adjust_time if you first made sure the remote timestamp >= local timestamp.  I have removed this restriction and moved that logic into csngen_adjust_time.  I also cleaned up the code in the consumer extop - I combined the checking of the CSN from the extop with the max CSN from the supplier RUV - now we only adjust the time once based on the max of all of these CSNs sent by the supplier.
Finally, I cleaned up the error handling in a few places that assumed all errors were time skew errors.
Follow up - I found a bug in my previous patch - _csngen_adjust_local_time must not be called when the sampled time == the current time.  So I fixed that where I was calling _csngen_adjust_local_time, and I also changed _csngen_adjust_local_time so that time_diff == 0 is a no-op.
Platforms tested: RHEL5, F8, F9
Flag Day: no
Doc impact: no
QA impact: Should test MMR and use the script to measure the offset effect.



Index: csngen.c
===================================================================
RCS file: /cvs/dirsec/ldapserver/ldap/servers/slapd/csngen.c,v
retrieving revision 1.7
retrieving revision 1.8
diff -u -r1.7 -r1.8
--- csngen.c	10 Nov 2006 23:45:40 -0000	1.7
+++ csngen.c	24 Jun 2008 22:22:10 -0000	1.8
@@ -60,6 +60,9 @@
 #define STATE_FORMAT			 "%8x%8x%8x%4hx%4hx"
 #define STATE_LENGTH			 32
 #define MAX_VAL(x,y)			 ((x)>(y)?(x):(y))
+#define CSN_CALC_TSTAMP(gen)     ((gen)->state.sampled_time + \
+                                  (gen)->state.local_offset + \
+                                  (gen)->state.remote_offset)
 
 /*
  * **************************************************************************
@@ -273,8 +276,7 @@
         gen->state.seq_num = 0;
     }
 
-    (*csn)->tstamp = gen->state.sampled_time + gen->state.local_offset + 
-                     gen->state.remote_offset;
+    (*csn)->tstamp = CSN_CALC_TSTAMP(gen);
     (*csn)->seqnum = gen->state.seq_num ++;
     (*csn)->rid = gen->state.rid;
 	(*csn)->subseqnum = 0;
@@ -308,8 +310,9 @@
    of time so that it does not generate smaller csns */
 int csngen_adjust_time (CSNGen *gen, const CSN* csn)
 {
-    time_t remote_time, remote_offset;
+    time_t remote_time, remote_offset, cur_time;
 	PRUint16 remote_seqnum;
+    int rc;
 
     if (gen == NULL || csn == NULL)
         return CSN_INVALID_PARAMETER;
@@ -319,21 +322,38 @@
 
     PR_RWLock_Wlock (gen->lock);
 
-	if (remote_seqnum > gen->state.seq_num )
-	{
-		if (remote_seqnum < CSN_MAX_SEQNUM)
-		{
-			gen->state.seq_num = remote_seqnum + 1;
-		}
-		else
-		{
-			remote_time++;
-		}
-	}
+    /* make sure we have the current time */
+    csngen_update_time();
+    cur_time = g_sampled_time;
+
+    /* make sure sampled_time is current */
+    /* must only call adjust_local_time if the current time is greater than
+       the generator state time */
+    if ((cur_time > gen->state.sampled_time) &&
+        (CSN_SUCCESS != (rc = _csngen_adjust_local_time(gen, cur_time))))
+    {
+        /* _csngen_adjust_local_time will log error */
+        PR_RWLock_Unlock (gen->lock);
+        csngen_dump_state(gen);
+        return rc;
+    }
 
-    if (remote_time >= gen->state.sampled_time)
+    cur_time = CSN_CALC_TSTAMP(gen);
+    if (remote_time >= cur_time)
     {
-        remote_offset = remote_time - gen->state.sampled_time;
+        if (remote_seqnum > gen->state.seq_num )
+        {
+            if (remote_seqnum < CSN_MAX_SEQNUM)
+            {
+                gen->state.seq_num = remote_seqnum + 1;
+            }
+            else
+            {
+                remote_time++;
+            }
+        }
+
+        remote_offset = remote_time - cur_time;
 		if (remote_offset > gen->state.remote_offset)
 		{
 			if (remote_offset <= CSN_MAX_TIME_ADJUST)
@@ -346,10 +366,18 @@
                             "adjustment limit exceeded; value - %ld, limit - %ld\n",
                              remote_offset, (long)CSN_MAX_TIME_ADJUST);
 				PR_RWLock_Unlock (gen->lock);
+				csngen_dump_state(gen);
 				return CSN_LIMIT_EXCEEDED;
 			}
 		}
-    }
+	}
+	else if (gen->state.remote_offset > 0)
+	{
+		/* decrease remote offset? */
+		/* how to decrease remote offset but ensure that we don't
+		   generate a duplicate CSN, or a CSN smaller than one we've already
+		   generated? */
+	}
 
     PR_RWLock_Unlock (gen->lock);
 
@@ -576,7 +604,14 @@
 {
     time_t time_diff = cur_time - gen->state.sampled_time;
 
-    if (time_diff > 0)
+    if (time_diff == 0) {
+        /* This is a no op - _csngen_adjust_local_time should never be called
+           in this case, because there is nothing to adjust - but just return
+           here to protect ourselves
+        */
+        return CSN_SUCCESS;
+    }
+    else if (time_diff > 0)
     {
         gen->state.sampled_time = cur_time;
         if (time_diff > gen->state.local_offset)
@@ -588,7 +623,7 @@
 
         return CSN_SUCCESS;
     }
-    else   /* time was turend back */
+    else   /* time was turned back */
     {
         if (abs (time_diff) > CSN_MAX_TIME_ADJUST)
         {




More information about the Fedora-directory-commits mailing list