<html>
<head>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">On 02/03/2014 10:37 PM, JR Aquino
wrote:<br>
</div>
<blockquote
cite="mid:16593317-8E16-4B98-AF7D-F32E6BBB9BA9@citrixonline.com"
type="cite">If you are seeing clock skew errors in
/var/log/dirsrv/slapd-EXAMPLE-COM/errors that look like this, then
you will need to verify the time/date of the server to make sure
NTP isn't freaked out. If the system date is correct, it is
possible that the change number generator has skewed.<br>
</blockquote>
<br>
Thanks much, JR! I have wiki-fied this email at
<a class="moz-txt-link-freetext" href="http://port389.org/wiki/Howto:Fix_and_Reset_Time_Skew">http://port389.org/wiki/Howto:Fix_and_Reset_Time_Skew</a><br>
<br>
I would like to credit you on the page - how would you like to be
attributed?<br>
<br>
<blockquote
cite="mid:16593317-8E16-4B98-AF7D-F32E6BBB9BA9@citrixonline.com"
type="cite"><br>
[01/Feb/2014:14:42:06 -0800] NSMMReplicationPlugin - conn=12949
op=7 repl="dc=example,dc=com": Excessive clock skew from supplier
RUV<br>
[01/Feb/2014:14:42:06 -0800] - csngen_adjust_time: adjustment
limit exceeded; value - 1448518, limit - 86400<br>
[01/Feb/2014:14:42:06 -0800] - CSN generator's state:<br>
[01/Feb/2014:14:42:06 -0800] - replica id: 115<br>
[01/Feb/2014:14:42:06 -0800] - sampled time: 1391294526<br>
[01/Feb/2014:14:42:06 -0800] - local offset: 0<br>
[01/Feb/2014:14:42:06 -0800] - remote offset: 0<br>
[01/Feb/2014:14:42:06 -0800] - sequence number: 55067<br>
<br>
The following NsState_Script should be used to determine whether
the change number generator has jumped significantly from the real
time/date.<br>
<a moz-do-not-send="true"
href="https://github.com/richm/scripts/blob/master/readNsState.py">https://github.com/richm/scripts/blob/master/readNsState.py</a><br>
<br>
<br>
The script is used like this:<br>
<br>
[<a class="moz-txt-link-abbreviated" href="mailto:root@ipaserver.ops">root@ipaserver.ops</a> jaquino]# ./readNsState.py
/etc/dirsrv/slapd-EXAMPLE-COM/dse.ldif<br>
nsState is
cwAAAAAAAABGPfBSAAAAAAAAAAAAAAAAAQAAAAAAAAACAAAAAAAAAA==<br>
Little Endian<br>
For replica cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping
tree,cn=config<br>
fmtstr=[H6x3QH6x]<br>
size=40<br>
len of nsstate is 40<br>
CSN generator state:<br>
Replica ID : 115<br>
Sampled Time : 1391476038<br>
Gen as csn : 52f03d46000201150000<br>
Time as str : Mon Feb 3 17:07:18 2014<br>
Local Offset : 0<br>
Remote Offset : 1<br>
Seq. num : 2<br>
System time : Mon Feb 3 17:09:11 2014<br>
Diff in sec. : 113<br>
Day:sec diff : 0:113<br>
<br>
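For anyone curious what the script is unpacking, here is a minimal Python sketch. It assumes the 40-byte little-endian layout printed above (fmtstr H6x3QH6x); other platforms may lay the value out differently, and the name decode_nsstate is made up for illustration. readNsState.py remains the reference.<br>
<br>
```python
import base64
import struct

# Layout copied from the script output above: fmtstr=[H6x3QH6x], little endian.
# Assumption: this only holds for hosts that print the same fmtstr and size=40.
NSSTATE_FORMAT = "<H6x3QH6x"


def decode_nsstate(b64_value):
    """Decode a base64 nsState value into the CSN generator fields."""
    raw = base64.b64decode(b64_value)
    expected = struct.calcsize(NSSTATE_FORMAT)
    if len(raw) != expected:
        raise ValueError("nsState is %d bytes, expected %d" % (len(raw), expected))
    rid, sampled, local_off, remote_off, seq = struct.unpack(NSSTATE_FORMAT, raw)
    return {
        "replica_id": rid,
        "sampled_time": sampled,     # unix time_t last sampled by the generator
        "local_offset": local_off,
        "remote_offset": remote_off,
        "seq_num": seq,
    }
```
<br>
Passing the nsState value from the dse.ldif replica entry to decode_nsstate() should give the same replica ID, sampled time, offsets and sequence number that the script prints.<br>
<br>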
If the difference reported by the above command is a day or more,
the CSN generator has become grossly skewed, and you will need to
perform the following steps to recover.<br>
<br>
--------------------------------------------
<div>How to resolve this issue<br>
<div><br>
• 1: Select an ipa server to be authoritative and write the
contents of its database to an ldif file<br>
On the master supplier:<br>
/var/lib/dirsrv/scripts-EXAMPLE-COM/db2ldif.pl -D
'cn=Directory Manager' -w - -n userRoot -a
/tmp/master-389.ldif<br>
Note that omitting the -r option deliberately leaves out the
tainted replication data, which contains the bad CSNs<br>
<br>
• 2: On the ipa server, shut down its dirsrv daemon so that
you can reset the attribute responsible for the serial
generation and re-initialize its db from the known good ldif<br>
On the master supplier:<br>
ipactl stop<br>
<br>
<br>
• 3: Sanitize the dse.ldif Configuration File<br>
On the master supplier: <br>
edit the /etc/dirsrv/slapd-EXAMPLE-COM/dse.ldif file and
remove the nsState attribute from the replica config entry<br>
You DO NOT want to remove the nsState from: dn: cn=uniqueid
generator,cn=config<br>
<br>
The stanza you want to remove the value from is: dn:
cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping
tree,cn=config<br>
The attribute will look like this: nsState::
cwAAAAAAAAA3QPBSAAAAAAAAAAAAAAAAAQAAAAAAAAABAAAAAAAAAA==<br>
Delete the entire line<br>
<br>
• 3.1: Remove traces of stale CSN tracking in the Replica
Agreements themselves<br>
File location: /etc/dirsrv/slapd-EXAMPLE-COM/dse.ldif<br>
cat dse.ldif | sed -n '1 {h; $ !d}; $ {x; s/\n //g; p}; /^
/ {H; d}; /^ /! {x; s/\n //g; p}' | grep -v nsds50ruv >
new.dse.ldif<br>
Back up the old dse.ldif and replace it with the new one:<br>
# mv dse.ldif dse.saved.ldif<br>
# mv new.dse.ldif dse.ldif<br>
<br>
• 4: Import the data from the known good ldif. This will mark
all the changes with CSNs that match the current time/date
stamps<br>
On the master supplier:<br>
chmod 644 /tmp/master-389.ldif<br>
/var/lib/dirsrv/scripts-EXAMPLE-COM/ldif2db -n userRoot -i
/tmp/master-389.ldif<br>
<br>
• 5: Restart the ipa daemons on the master supplier<br>
# ipactl start<br>
<br>
• 6: When the daemon starts, it will see that it does not have
an nsState and will write new CSNs to -all- of the newly
imported good data with today's timestamp. We need to take that
data and write -it- out to an ldif file<br>
On the master supplier:<br>
/var/lib/dirsrv/scripts-EXAMPLE-COM/db2ldif.pl -D
'cn=Directory Manager' -w - -n userRoot -r -a
/tmp/replication-master-389.ldif<br>
^ the -r tells it to include all replica data, which includes
the newly blessed CSN data<br>
Transfer the file to all of the ipa servers in the fleet<br>
<br>
• 7: Now we must re-initialize _every other_ ipa consumer
server in the fleet with the new good data.<br>
Steps 7-10 need to be done one at a time on each ipa consumer
server<br>
ipactl stop<br>
<br>
• 8: Sanitize the dse.ldif Configuration File<br>
On the ipa server: <br>
edit the /etc/dirsrv/slapd-EXAMPLE-COM/dse.ldif file and
remove the nsState attribute from the replica config entry<br>
You DO NOT want to remove the nsState from: dn: cn=uniqueid
generator,cn=config<br>
The stanza you want to remove the value from is: dn:
cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping
tree,cn=config<br>
The attribute will look like this: nsState::
cwAAAAAAAAA3QPBSAAAAAAAAAAAAAAAAAQAAAAAAAAABAAAAAAAAAA==<br>
Delete the entire line<br>
<br>
• 8.1: Remove traces of stale CSN tracking in the Replica
Agreements themselves<br>
File location: /etc/dirsrv/slapd-EXAMPLE-COM/dse.ldif<br>
cat dse.ldif | sed -n '1 {h; $ !d}; $ {x; s/\n //g; p}; /^
/ {H; d}; /^ /! {x; s/\n //g; p}' | grep -v nsds50ruv >
new.dse.ldif<br>
Back up the old dse.ldif and replace it with the new one:<br>
# mv dse.ldif dse.saved.ldif<br>
# mv new.dse.ldif dse.ldif<br>
<br>
• 9: Import the data from the known good ldif. This will mark
all the changes with CSNs that match the current time/date
stamps<br>
On the ipa server:<br>
chmod 644 /tmp/replication-master-389.ldif<br>
/var/lib/dirsrv/scripts-EXAMPLE-COM/ldif2db -n userRoot -i
/tmp/replication-master-389.ldif<br>
<br>
• 10: Restart the ipa daemons on the ipa server<br>
On the ipa server:<br>
ipactl start<br>
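<br>
The dse.ldif clean-up in steps 3, 3.1, 8 and 8.1 can also be sketched in Python. This is only an illustration under a few assumptions: the function names are made up, the file is handled as plain text, and the server is already stopped with a backup of dse.ldif in place. Like the sed pipeline, it writes the entries out unfolded.<br>
<br>
```python
def unfold(ldif_text):
    """Join folded LDIF lines (a continuation line starts with one space)."""
    lines = []
    for line in ldif_text.splitlines():
        if line.startswith(" ") and lines:
            lines[-1] += line[1:]   # append the continuation without its leading space
        else:
            lines.append(line)
    return lines


def sanitize_dse(ldif_text):
    """Drop nsState from cn=replica entries and drop every nsds50ruv line."""
    kept = []
    in_replica = False   # True while inside a "dn: cn=replica,..." entry
    for line in unfold(ldif_text):
        lower = line.lower()
        if lower.startswith("dn:"):
            in_replica = lower.startswith("dn: cn=replica,")
        if in_replica and lower.startswith("nsstate:"):
            continue     # steps 3 and 8: only the replica entry loses its nsState
        if "nsds50ruv" in line:
            continue     # steps 3.1 and 8.1: same effect as grep -v nsds50ruv
        kept.append(line)
    return "\n".join(kept) + "\n"
```
<br>
The nsState under cn=uniqueid generator,cn=config is left alone because that entry's dn does not start with cn=replica.<br>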
<div>
<div><br>
</div>
<div><br>
</div>
<div>--------------------------------</div>
<div><br>
</div>
<div>From Rich Megginson:</div>
<div>Further reading for those interested in the particulars
of CSN tracking or the MultiMaster Replication algorithm:<br>
<br>
</div>
<div>It all starts with the Leslie Lamport paper:<br>
<a class="moz-txt-link-freetext" href="http://www.stanford.edu/class/cs240/readings/lamport.pdf">http://www.stanford.edu/class/cs240/readings/lamport.pdf</a><br>
"Time, Clocks, and the Ordering of Events in a Distributed
System"<br>
<br>
The next big impact on MMR protocols was the work done at
Xerox PARC on the Bayou project.<br>
<br>
These and other sources formed the basis of the IETF LDUP
working group. Much of the MMR protocol is based on the
LDUP work.<br>
<br>
<br>
The tl;dr version is this:<br>
<br>
The MMR protocol is based on ordering operations by time
so that when you have two updates to the same attribute,
the "last one wins"<br>
So how do you guarantee some sort of consistent ordering
throughout many systems that do not have clocks in sync
down to the millisecond? If you say "ntp" then you lose...<br>
The protocol itself has to have some notion of time
differences among servers<br>
The ordering is done by CSN (Change Sequence Number)<br>
The first part of the CSN is the timestamp of the
operation in unix time_t (number of seconds since the
epoch).<br>
In order to guarantee ordering, the MMR protocol has a
major constraint<br>
You must never, never, issue a CSN that is the same or
less than another CSN<br>
In order to guarantee that, the MMR protocol keeps track
of the time differences among _all_ of the servers that it
knows about.<br>
When it generates CSNs, it uses the largest time
difference among all servers that it knows about.<br>
<br>
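To make the "never issue a CSN that is the same or less" rule concrete, here is a toy Python sketch. It is not the 389 csngen code: the class and method names are invented, the CSN is rendered as hex time, sequence number, replica ID and sub-sequence, and sequence-number overflow and the adjustment limit are ignored.<br>
<br>
```python
import time


class CsnGenerator:
    """Toy CSN generator whose CSNs never go backwards, even if peers run ahead."""

    def __init__(self, replica_id, clock=time.time):
        self.replica_id = replica_id
        self.clock = clock
        self.remote_offset = 0   # largest lead (seconds) seen from any other server
        self.last_time = 0
        self.seq = 0

    def observe_remote(self, remote_csn_time):
        """Remember how far ahead of our clock a peer's CSN timestamp is."""
        lead = remote_csn_time - int(self.clock())
        if lead > self.remote_offset:
            self.remote_offset = lead   # the offset only ever grows

    def new_csn(self):
        """Return the next CSN string: time, seq num, replica ID, sub-sequence."""
        now = int(self.clock()) + self.remote_offset
        if now > self.last_time:
            self.last_time = now
            self.seq = 0
        else:
            self.seq += 1   # same or earlier second: bump the sequence number instead
        return "%08x%04x%04x%04x" % (self.last_time, self.seq, self.replica_id, 0)
```
<br>
Because the fields are fixed-width hex, comparing two of these strings orders them by time first and sequence number second. Note that remote_offset never shrinks in this sketch, which is the same mechanism that lets a single bad clock drag the generators forward.<br>
<br>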
So how does the time skew grow at all?<br>
Due to timing differences, network latency, etc. the
directory server cannot always generate the absolute exact
system time. There will always be 1 or 2 second
differences in some replication sessions.<br>
These 1 to 2 second differences accumulate over time.<br>
<br>
However, there are things which can introduce really large
differences<br>
1) buggy ntp implementations<br>
2) bad sysadmin screws up the system clock<br>
3) VMs, which are notorious for having laggy system clocks,
etc.<br>
<br>
<br>
How can you monitor for this in the future?<br>
The readNsState.py script supplied in this email can be
used to output the effective skew of the system date vs
the CSN generator.<br>
You can set a crontab to run this script and monitor its
output to catch any future severe drifts.<br>
<br>
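As a rough idea of what such a cron check could compare, here is a small Python sketch. It assumes you already have the sampled time (for example from the script output) and uses the 86400 second limit quoted in the error log as the alarm threshold; the function name and the threshold choice are illustrative, not part of the script.<br>
<br>
```python
MAX_SKEW_SECONDS = 86400  # the adjustment limit quoted in the error log


def skew_report(sampled_time, system_time, limit=MAX_SKEW_SECONDS):
    """Compare the CSN generator's sampled time with the system clock."""
    diff = system_time - sampled_time
    days, seconds = divmod(abs(diff), 86400)
    return {
        "diff_seconds": diff,                  # negative if the generator is ahead
        "day_sec": "%d:%d" % (days, seconds),  # same shape as "Day:sec diff"
        "excessive": abs(diff) > limit,
    }
```
<br>
With the values from the sample output (sampled time 1391476038, system time 113 seconds later) this reports 0:113 and does not flag the skew; a monitoring job would alert when "excessive" is true.<br>
<br>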
Ticket information for some of the fixes that have been
implemented because of this work so far:<br>
<a class="moz-txt-link-freetext" href="https://fedorahosted.org/389/ticket/47516">https://fedorahosted.org/389/ticket/47516</a><br>
<br>
</div>
<div><br>
</div>
<div><br>
</div>
<div apple-content-edited="true">
<div style="color: rgb(0, 0, 0); ">"You cannot hope to
secure that which you do not first understand"<br>
</div>
<div style="color: rgb(0, 0, 0); ">~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~<br>
</div>
<div>JR Aquino<br>
<br>
Senior Information Security Specialist, Technical
Operations<br>
</div>
</div>
</div>
</div>
</div>
</blockquote>
<br>
</body>
</html>