Something Wicked (with my hdwr clock) This Way Comes

Dean S. Messing deanm at sharplabs.com
Fri Dec 28 23:23:24 UTC 2007


First, apologies for the long post!

On 12/27/07, Dean S. Messing <deanm at sharplabs.com> wrote:
> I'd be interested to see if anyone else on the list running
>
> F7_x86_64 and using the 2.6.23.8-34.fc7 kernel

Thanks you Aaron Konstam and Jon Stanley for confirming that `hwclock
--show' is working properly on your systems.

Somehow, my Real Time Clock got mangled. I'm not sure how.  I had been
running some measurements with `adjtimex' and the output was going
totally bonkers.  That's when I noticed my `hwclock' problem.

Upon shutting down, bringing up the BIOS, I saw that the RTC was
nearly 10 hours off of UTC.

So I reset it, rebooted, and now `hwclock --show' works correctly.  I
don't understand why I (and you Jon) get the "/dev/rtc does not have
interrupt functions" but it _is_ getting the clock tick after all.

Still, there is some real strangeness going on which I explain
below.

Let me preface this by saying that I am not a n00bie at NTP, adjusting
kernel parameters related to time-keeping via `adjtimex', &c.  I've
worked with this stuff for several years.

Jon, Aaron, I'd like to ask you both to run another test.  (Thanks in
advance.)  As root, what does

adjtimex --utc --compare=20 --interval=10 > /tmp/adj_file.txt

give you? (It will take about 3.5 minutes for this to run)
For the F7 machine on which I'm seeing the weird behaviour, 
I get:


[root at medulla ~]# adjtimex --utc --compare=20 --interval=10
                                      --- current ---   -- suggested --
cmos time     system-cmos  error_ppm   tick      freq    tick      freq
1198880050       0.798082
1198880059       0.900194    10211.2  10000  10426576
1198880069       0.002171   -89802.3  10000  10426576   10899   4022989
1198880078       0.104114    10194.3  10000  10426576    9899   4246426
1198880087       0.206065    10195.1  10000  10426576    9899   4194864
1198880096       0.308030    10196.5  10000  10426576    9899   4102676
1198880105       0.409981    10195.1  10000  10426576    9899   4193301
1198880114       0.511942    10196.1  10000  10426576    9899   4129239
1198880123       0.612893    10095.1  10000  10426576    9900   4192826
1198880132       0.714856    10196.3  10000  10426576    9899   4116739
1198880141       0.816836    10198.0  10000  10426576    9899   4002676
1198880150       0.918770    10193.4  10000  10426576    9899   4305801
1198880160       0.020729   -89804.1  10000  10426576   10899   4141739
1198880169       0.122682    10195.3  10000  10426576    9899   4180801
1198880178       0.224641    10195.9  10000  10426576    9899   4141739
1198880187       0.326589    10194.8  10000  10426576    9899   4213614
1198880196       0.428548    10195.9  10000  10426576    9899   4141739
1198880205       0.530505    10195.7  10000  10426576    9899   4155801
1198880214       0.632459    10195.4  10000  10426576    9899   4174551
1198880223       0.734426    10196.7  10000  10426576    9899   4088614

The second column is in seconds and is the delta between the system
clock and the RTC.  On a well tuned machine with an accurate RTC and a
system clock adjusted to run in sync with UTC (say, via NTP) this
number should be small (less than 1 second) and nearly constant.

The third column is the change in Parts per Million of the number in
Column 2 from one line to the next. It tells you how much the RTC is
deviating relative to the system clock.  (100 ppm is 8.64 seconds per
day.)

As you can see, the system time appears to be advancing by almost
exactly 0.1 seconds every 10 seconds relative to the RTC (cmos clock).
Then, just as they are about to get out of phase by 1 second, something
causes either the system clock to "jump back" by ~1 second or the RTC
to jump forward.

It is certainly _not_ the system time doing the jumping.  In fact:

[root at medulla ~]# ntpdate -q montpelier.ilan.caltech.edu
server 192.12.19.20, stratum 1, offset -0.001485, delay 0.05589
28 Dec 14:23:03 ntpdate[29445]: adjust time server 192.12.19.20 offset -0.001485 sec

which says that my system is about 1 ms ahead of
"montpelier.ilan.caltech.edu" which is a highly accurate Statum 1 time
server.  

This delta is nearly constant so my system clock is not jumping
around.

That leaves the RTC doing the jumping.  But having an RTC that is
runing nearly 10000 ppm slower than my system clock and which "jumps
ahead" every 10 seconds or so seems absurd.  

Furthermore, when I halt the system, boot into the bios, and watch the
RTC time evolve I see no jumps and it is certainly not running 1/10
second slower than wall time.

That leaves two other possibilities: a bug in adjtimex or a bug in the
kernel.  That's where I am right now.

Couple of other items:

If I increase the --interval, above, to 20 seconds, I get (in part)

[root at medulla ~]# adjtimex --utc --compare=20 --interval=20 
                                      --- current ---   -- suggested --
cmos time     system-cmos  error_ppm   tick      freq    tick      freq
1198882092       0.313780
1198882111       0.415686     5095.3  10000  10426576
1198882130       0.516580     5044.7  10000  10426576    9951    942820
1198882149       0.617472     5044.6  10000  10426576    9951    950633
1198882168       0.718367     5044.8  10000  10426576    9951    939695
1198882187       0.820270     5095.1  10000  10426576    9950   4190951
1198882206       0.921165     5044.7  10000  10426576    9951    940476
1198882226       0.022069   -44954.8  10000  10426576   10451    910789
1198882245       0.122960     5044.6  10000  10426576    9951    952976
1198882264       0.224865     5095.2  10000  10426576    9950   4184701

The really odd thing is that the change in "system-cmos" from
line to line is still about 0.1 seconds!  If the RTC were really
running slowly, it should be 0.2 seconds.  In fact, no matter what
I make the interval, the 0.1 seconds remains the same.

So I vote for this being a bug in adjtimex (or the kernel).

Just as a reference here's the output of the same commond on my FC6 laptop
(whose sytem clock is also well sychronised with UTC):

[root at axon ~]# adjtimex --utc --compare=20 --interval=20
                                      --- current ---   -- suggested --
cmos time     system-cmos  error_ppm   tick      freq    tick      freq
1198882771     0.000563
1198882791     0.001026       23.1  10000   -580512
1198882811     0.000995       -1.5  10000   -580512    9999   6072345
1198882831     0.001093        4.9  10000   -580512    9999   5652814
1198882851     0.000830      -13.2  10000   -580512   10000    282026
1198882871     0.001060       11.5  10000   -580512    9999   5220782
1198882891     0.000850      -10.5  10000   -580512   10000    104683
1198882911     0.001099       12.4  10000   -580512    9999   5158282
1198882931     0.001014       -4.3  10000   -580512    9999   6252033
1198882951     0.001111        4.8  10000   -580512    9999   5656720


As you can see, the drift between the RTC and the system is quite low,
less than plus or minus 25ppm, and there's no wierd "jumping" going
on.  (Well there _is_ the "11 minute kernel correction" but that's
another bedtime story.)

I have no idea what's happening on my F7 system to cause the strange
behaviour.  Does anyone else?


Dean




More information about the fedora-list mailing list