[Freeipa-users] How to restore an IPA Replica when the CSN number generator has moved impossibly far into the future or past

JR Aquino JR.Aquino at citrix.com
Tue Feb 4 05:37:03 UTC 2014


If you are seeing clock skew errors like the following in /var/log/dirsrv/slapd-EXAMPLE-COM/errors, first verify the time/date of the server to make sure NTP is behaving correctly. If the system date is correct, it is possible that the change sequence number (CSN) generator itself has skewed.

[01/Feb/2014:14:42:06 -0800] NSMMReplicationPlugin - conn=12949 op=7 repl="dc=example,dc=com": Excessive clock skew from supplier RUV
[01/Feb/2014:14:42:06 -0800] - csngen_adjust_time: adjustment limit exceeded; value - 1448518, limit - 86400
[01/Feb/2014:14:42:06 -0800] - CSN generator's state:
[01/Feb/2014:14:42:06 -0800] -  replica id: 115
[01/Feb/2014:14:42:06 -0800] -  sampled time: 1391294526
[01/Feb/2014:14:42:06 -0800] -  local offset: 0
[01/Feb/2014:14:42:06 -0800] -  remote offset: 0
[01/Feb/2014:14:42:06 -0800] -  sequence number: 55067
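
To put the numbers in that excerpt into perspective, here is a minimal Python sketch that pulls the value and limit out of the "adjustment limit exceeded" line and expresses the value in days. It assumes the message format is exactly as shown above and that both numbers are seconds (86400 being one day); the function names are made up for this illustration and are not part of 389-ds.

```python
import re

# Matches the "adjustment limit exceeded" message as it appears in the excerpt above.
SKEW_RE = re.compile(
    r"csngen_adjust_time: adjustment limit exceeded; "
    r"value - (?P<value>\d+), limit - (?P<limit>\d+)"
)


def parse_skew_line(line):
    """Return (value, limit) as integers, or None if the line does not match."""
    match = SKEW_RE.search(line)
    if match is None:
        return None
    return int(match.group("value")), int(match.group("limit"))


def describe_skew(value, limit):
    """Render the skew as days plus seconds (assumes both numbers are seconds)."""
    days, seconds = divmod(value, 86400)
    return "skew %ds (%dd %ds) exceeds limit %ds" % (value, days, seconds, limit)
```

For the line in the excerpt, parse_skew_line() yields (1448518, 86400), which is a little over 16 days.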

Use the following readNsState.py script to determine whether the change number generator has jumped significantly from the real time/date.
https://github.com/richm/scripts/blob/master/readNsState.py


Run the script against the dse.ldif of the directory server instance like this:

[root@ipaserver.ops jaquino]# ./readNsState.py /etc/dirsrv/slapd-EXAMPLE-COM/dse.ldif
nsState is cwAAAAAAAABGPfBSAAAAAAAAAAAAAAAAAQAAAAAAAAACAAAAAAAAAA==
Little Endian
For replica cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
  fmtstr=[H6x3QH6x]
  size=40
  len of nsstate is 40
  CSN generator state:
    Replica ID    : 115
    Sampled Time  : 1391476038
    Gen as csn    : 52f03d46000201150000
    Time as str   : Mon Feb  3 17:07:18 2014
    Local Offset  : 0
    Remote Offset : 1
    Seq. num      : 2
    System time   : Mon Feb  3 17:09:11 2014
    Diff in sec.  : 113
    Day:sec diff  : 0:113
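
For the curious, the core of what the script reports can be reproduced in a few lines. This is only a sketch, not the script itself: it assumes the 40-byte little-endian layout that the output above prints as fmtstr=[H6x3QH6x], and other platforms or versions may lay out nsState differently. The function and key names are my own.

```python
import base64
import struct

# Layout taken from the "fmtstr" and "Little Endian" lines in the output above.
NSSTATE_FMT = "<H6x3QH6x"

# The nsState value from the output above, split into 8-character chunks.
SAMPLE = ("cwAAAAAA" "AABGPfBS" "AAAAAAAA" "AAAAAAAA"
          "AQAAAAAA" "AAACAAAA" "AAAAAA==")


def decode_nsstate(b64_value):
    """Decode a base64 nsState value into the CSN generator fields."""
    raw = base64.b64decode(b64_value)
    if len(raw) != struct.calcsize(NSSTATE_FMT):
        raise ValueError("unexpected nsState length: %d" % len(raw))
    rid, sampled, local_off, remote_off, seq = struct.unpack(NSSTATE_FMT, raw)
    return {
        "replica_id": rid,
        "sampled_time": sampled,
        "local_offset": local_off,
        "remote_offset": remote_off,
        "seq_num": seq,
    }


if __name__ == "__main__":
    print(decode_nsstate(SAMPLE))
```

The decoded fields should line up with the Replica ID, Sampled Time, offsets and Seq. num lines printed by readNsState.py.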

If the difference reported by the script is a day or more, the CSN generator has become grossly skewed, and you will need to perform the following steps to recover. (The sample output above shows a healthy generator, with a difference of only 113 seconds.)

--------------------------------------------
How to resolve this issue

• 1: Select an ipa server to be authoritative and write the contents of its database to an ldif file
   On the master supplier:
   /var/lib/dirsrv/scripts-EXAMPLE-COM/db2ldif.pl -D 'cn=Directory Manager' -w - -n userRoot -a /tmp/master-389.ldif
   Note that running this without the -r option deliberately omits the tainted replication data, which contains the bad CSNs

• 2: On the authoritative ipa server, shut down the dirsrv daemon so that you can reset the attribute responsible for CSN generation and re-initialize its db from the known good ldif
   On the master supplier:
   ipactl stop
  

• 3: Sanitize the dse.ldif Configuration File
   On the master supplier:
   Edit the /etc/dirsrv/slapd-EXAMPLE-COM/dse.ldif file and remove the nsState attribute from the replica config entry
   You DO NOT want to remove the nsState from: dn: cn=uniqueid generator,cn=config

   The entry you want to remove the attribute from is: dn: cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
   The attribute will look like this: nsState:: cwAAAAAAAAA3QPBSAAAAAAAAAAAAAAAAAQAAAAAAAAABAAAAAAAAAA==
   Delete the entire line (if the value is folded, also delete its continuation lines, which begin with a single space)
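
If you would rather not hand-edit the file, the same edit can be sketched in Python. This is an illustration under assumptions, not a tested tool: it assumes the replica entry's dn line starts with "cn=replica," in clear text (not base64), and the function name is hypothetical. Run it only while the server is stopped, and write the result to a new file so the original is kept.

```python
def strip_replica_nsstate(ldif_text):
    """Drop nsState from cn=replica entries; leave all other entries untouched."""
    out = []
    in_replica_entry = False
    skipping = False  # True while dropping a folded nsState value
    for line in ldif_text.splitlines():
        if line.startswith(" "):
            # LDIF continuation line: it shares the fate of the line it continues.
            if not skipping:
                out.append(line)
            continue
        skipping = False
        if line.lower().startswith("dn:"):
            dn = line[3:].strip().lower()
            # Agreements below the replica start with their own cn, so they
            # do not match here; only the replica config entry does.
            in_replica_entry = dn.startswith("cn=replica,")
        elif line == "":
            in_replica_entry = False  # a blank line ends the entry
        if in_replica_entry and line.lower().startswith("nsstate:"):
            skipping = True
            continue
        out.append(line)
    return "\n".join(out) + "\n"
```

The nsState under cn=uniqueid generator,cn=config is preserved because that entry never sets in_replica_entry.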

• 3.1: Remove traces of stale CSN tracking in the Replica Agreements themselves
   File location: /etc/dirsrv/slapd-EXAMPLE-COM/dse.ldif
   The sed expression unfolds continuation lines so that grep -v can drop each nsds50ruv value as a single line:
   cat dse.ldif | sed -n '1 {h; $ !d}; $ {x; s/\n //g; p}; /^ / {H; d}; /^ /! {x; s/\n //g; p}' | grep -v nsds50ruv > new.dse.ldif
   Back up the old dse.ldif and replace it with the new one:
   # mv dse.ldif dse.saved.ldif
   # mv new.dse.ldif dse.ldif
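
For readers who find the sed expression hard to follow, here is a rough Python equivalent of the unfold-then-filter pipeline. It is a sketch with hypothetical function names, and like the pipeline it matches the lowercase string nsds50ruv only and leaves the output with unfolded (long) lines.

```python
def unfold_ldif(lines):
    """Join LDIF continuation lines (leading single space) onto the previous line."""
    logical = []
    for line in lines:
        if line.startswith(" ") and logical:
            logical[-1] += line[1:]  # drop the single folding space
        else:
            logical.append(line)
    return logical


def drop_ruv_lines(ldif_text):
    """Unfold the LDIF, then drop every line containing nsds50ruv (like grep -v)."""
    kept = [line for line in unfold_ldif(ldif_text.splitlines())
            if "nsds50ruv" not in line]
    return "\n".join(kept) + "\n"
```

Unfolding first matters: without it, only the first physical line of a folded nsds50ruv value would be removed and its continuation would be glued onto the wrong attribute.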

• 4: Import the data from the known good ldif. This will mark all the changes with CSNs that match the current time/date stamps
   On the master supplier:
   chmod 644 /tmp/master-389.ldif
   /var/lib/dirsrv/scripts-EXAMPLE-COM/ldif2db -n userRoot -i /tmp/master-389.ldif

• 5: Restart the ipa daemons on the master supplier
   ipactl start

• 6: When the daemon starts, it will see that it does not have an nsState and will write new CSNs to -all- of the newly imported good data with today's timestamp. We need to take that data and write -it- out to an ldif file
  On the master supplier:
  /var/lib/dirsrv/scripts-EXAMPLE-COM/db2ldif.pl -D 'cn=Directory Manager' -w - -n userRoot -r -a /tmp/replication-master-389.ldif
  ^ The -r tells it to include all replica data, which includes the newly blessed CSN data
  Transfer the file to all of the other ipa servers in the fleet

• 7: Now we must re-initialize _every other_ ipa consumer server in the fleet with the new good data.
  Steps 7-10 need to be completed on one ipa consumer server at a time, for each consumer in turn
  ipactl stop

• 8: Sanitize the dse.ldif Configuration File
   On the ipa server:
   Edit the /etc/dirsrv/slapd-EXAMPLE-COM/dse.ldif file and remove the nsState attribute from the replica config entry
   You DO NOT want to remove the nsState from: dn: cn=uniqueid generator,cn=config
   The entry you want to remove the attribute from is: dn: cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
   The attribute will look like this: nsState:: cwAAAAAAAAA3QPBSAAAAAAAAAAAAAAAAAQAAAAAAAAABAAAAAAAAAA==
   Delete the entire line (if the value is folded, also delete its continuation lines, which begin with a single space)

• 8.1: Remove traces of stale CSN tracking in the Replica Agreements themselves
   File location: /etc/dirsrv/slapd-EXAMPLE-COM/dse.ldif
   cat dse.ldif | sed -n '1 {h; $ !d}; $ {x; s/\n //g; p}; /^ / {H; d}; /^ /! {x; s/\n //g; p}' | grep -v nsds50ruv > new.dse.ldif
   Back up the old dse.ldif and replace it with the new one:
   # mv dse.ldif dse.saved.ldif
   # mv new.dse.ldif dse.ldif

• 9: Import the data from the known good replication ldif created in step 6. This data already carries CSNs that match the current time/date stamps
   On the ipa server:
   chmod 644 /tmp/replication-master-389.ldif
   /var/lib/dirsrv/scripts-EXAMPLE-COM/ldif2db -n userRoot -i /tmp/replication-master-389.ldif

• 10: Restart the ipa daemons on the ipa server
   On the ipa server:
   ipactl start


--------------------------------

From Rich Megginson:
For those interested in the particulars of CSN tracking or the MultiMaster Replication (MMR) algorithm, here is some further reading:

It all starts with the Leslie Lamport paper:
http://www.stanford.edu/class/cs240/readings/lamport.pdf
"Time, Clocks, and the Ordering of Events in a Distributed System"

The next big impact on MMR protocols was the work done at Xerox PARC on the Bayou project.

These and other sources formed the basis of the IETF LDUP working group.  Much of the MMR protocol is based on the LDUP work.


The tl;dr version is this:

The MMR protocol is based on ordering operations by time, so that when you have two updates to the same attribute, the "last one wins".
So how do you guarantee some sort of consistent ordering throughout many systems that do not have clocks in sync down to the millisecond? If you say "ntp" then you lose...
The protocol itself has to have some notion of time differences among servers.
The ordering is done by CSN (Change Sequence Number).
The first part of the CSN is the timestamp of the operation in unix time_t (number of seconds since the epoch).
In order to guarantee ordering, the MMR protocol has a major constraint:
You must never, never, issue a CSN that is the same or less than another CSN.
In order to guarantee that, the MMR protocol keeps track of the time differences among _all_ of the servers that it knows about.
When it generates CSNs, it uses the largest time difference among all servers that it knows about.
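
The rules above can be illustrated with a toy generator. This is a deliberately simplified sketch and NOT the 389-ds implementation: a CSN is modelled as a (timestamp, sequence number, replica id) tuple, only a single remote offset is tracked, and all names are invented for the example.

```python
import time


class CsnGenerator:
    """Toy CSN generator: never issues a CSN <= one it has issued or seen."""

    def __init__(self, replica_id, clock=time.time):
        self.replica_id = replica_id
        self.clock = clock
        self.remote_offset = 0  # largest lead (seconds) seen from any other server
        self.last_time = 0
        self.seq = 0

    def adjust(self, remote_csn):
        """Take a CSN received from another server into account."""
        remote_time, remote_seq, _remote_rid = remote_csn
        if (remote_time, remote_seq) >= (self.last_time, self.seq):
            self.last_time, self.seq = remote_time, remote_seq
        lead = remote_time - int(self.clock())
        self.remote_offset = max(self.remote_offset, lead)

    def new_csn(self):
        now = int(self.clock()) + self.remote_offset
        if now > self.last_time:
            self.last_time, self.seq = now, 0
        else:
            # Same second, or the clock went backwards: keep the time, bump seq.
            self.seq += 1
        return (self.last_time, self.seq, self.replica_id)
```

Note that once adjust() has seen a CSN from the far future, remote_offset keeps every later CSN in the future too. That is the mechanism by which one bad clock can drag the whole fleet along with it.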

So how does the time skew grow at all?
Due to timing differences, network latency, etc. the directory server cannot always generate the absolute exact system time.  There will always be 1 or 2 second differences in some replication sessions.
These 1 to 2 second differences accumulate over time.

However, there are things which can introduce really large differences:
1) buggy NTP implementations
2) a sysadmin who screws up the system clock
3) VMs, which are notorious for having laggy system clocks, etc.


How can you monitor for this in the future?
The readNsState.py script linked earlier in this email can be used to output the effective skew of the system date vs the CSN generator.
You can set up a crontab entry to run this script and monitor its output to catch any future severe drifts.
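
As a starting point for such a cron job, here is a minimal Python sketch that scans the script's output for "Diff in sec." lines and reports any that reach one day. It assumes the output format shown earlier in this email; the function names, the default limit of 86400 seconds, and the idea of calling the script through subprocess are my own assumptions and not part of readNsState.py.

```python
import re
import subprocess

# Matches lines such as "    Diff in sec.  : 113" from the script output.
DIFF_RE = re.compile(r"^\s*Diff in sec\.\s*:\s*(-?\d+)\s*$", re.MULTILINE)


def skewed_diffs(script_output, limit=86400):
    """Return every 'Diff in sec.' value whose magnitude reaches the limit."""
    diffs = [int(value) for value in DIFF_RE.findall(script_output)]
    return [diff for diff in diffs if abs(diff) >= limit]


def run_check(script_path, dse_path, limit=86400):
    """Run readNsState.py on a dse.ldif and return the offending differences."""
    result = subprocess.run(
        [script_path, dse_path],
        capture_output=True, text=True, check=True,
    )
    return skewed_diffs(result.stdout, limit)
```

A cron wrapper could call run_check() with the paths used above and send mail whenever the returned list is not empty.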

Ticket information for some of the fixes that have been implemented because of this work so far:
https://fedorahosted.org/389/ticket/47516



"You cannot hope to secure that which you do not first understand"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
JR Aquino
