[Crash-utility] Inaccurate timestamps in log -T output

Mon Sep 20 16:57:49 UTC 2021

Hello,

As the current maintainer of the PyKdump extension, I have been investigating a reported discrepancy between the timestamps generated by crash's log -T and PyKdump's tslog command, which provides the same functionality of timestamping the kernel log entries.  What I have found is that *both* log -T and tslog generate some inaccurate timestamps on vmcores with more than a small amount of uptime, but in different ways.  Specifically, log -T timestamps are accurate at boot time but become less accurate later in the log, while tslog timestamps are just the opposite: they're accurate around crash time but less accurate earlier in the log.

Here's an example from a panic on a kernel 3.10.0-957.5.1 system with a moderate uptime:

        DATE: Tue Jul  6 09:50:58 +03 2021
      UPTIME: 32 days, 16:29:29
       PANIC: "BUG: unable to handle kernel "

Crash correctly computes the boot time based on the date (crash time) and uptime shown above.  From "help -k":

     boot_date: Thu Jun  3 17:21:29 +03 2021

log -T shows the correct timestamps of the boot messages:

[Thu Jun  3 17:21:29 +03 2021] Initializing cgroup subsys cpuset
[Thu Jun  3 17:21:29 +03 2021] Initializing cgroup subsys cpu
[Thu Jun  3 17:21:29 +03 2021] Initializing cgroup subsys cpuacct

However, the timestamps of the BUG and surrounding messages are off by 150 seconds or 2.5 minutes (09:53:28, but the crash occurred at 09:50:58):

[Tue Jul  6 09:53:28 +03 2021] BUG: unable to handle kernel
[Tue Jul  6 09:53:28 +03 2021] NULL pointer dereference at           (null)
[Tue Jul  6 09:53:28 +03 2021] IP: [<ffffffff91cd6e66>] scsi_uninit_cmd+0x26/0x40

For comparison, tslog has the correct timestamps around the crash:

2021-07-06 09:50:58 [2824319.188837] BUG: unable to handle kernel
2021-07-06 09:50:58 [2824319.193440] NULL pointer dereference at           (null)
2021-07-06 09:50:58 [2824319.197853] IP: [<ffffffff91cd6e66>] scsi_uninit_cmd+0x26/0x40

But it's off by the same 2.5-minute interval during the boot messages (17:18:59, but it should be 17:21:29):

2021-06-03 17:18:59 [    0.000000] Initializing cgroup subsys cpuset
2021-06-03 17:18:59 [    0.000000] Initializing cgroup subsys cpu
2021-06-03 17:18:59 [    0.000000] Initializing cgroup subsys cpuacct

The reasons for these inaccuracies are:

1.       log -T computes boot time as (crash time minus uptime), and then adds the seconds part of the time stored in each log entry to generate the entry's timestamp.

2.       tslog computes a base time as (crash time minus seconds part of the final log entry) and then adds the seconds part of the time stored in each log entry.

3.       The key to the problem: printk() stores the log entry timestamp based on a call to local_clock(), which provides the raw time in the local CPU, which is not adjusted by NTP.  As such, inaccuracy accumulates as the clock drifts over time; in the example above, the discrepancy has reached 150 seconds by the time of the crash.

I have seen this now in numerous vmcores from kernels 3.0 through 4.12, which is the latest version I've checked.  The discrepancy exists in all of them with more than a few hours of uptime and can be positive or negative.  In one vmcore it was over 13 hours after 239 days of uptime.

>From the PyKdump perspective, I could live with the current tslog behavior because it's accurate for log entries near the crash time, which are generally the most useful for crash analysis.  But in an attempt to improve it, I'm testing an experimental version of tslog that smooths out the discrepancy.  It computes the boot time the same way that crash does (crash time minus uptime) and a scale factor for the time discrepancy equal to (uptime)/(seconds part of final log entry).  Then each log entry's timestamp is computed as (boot time) + (seconds part of log entry)*(scale factor).  This version results in correct timestamps at both ends of the log:

2021-06-03 17:21:29 [    0.000000] Initializing cgroup subsys cpuset
2021-06-03 17:21:29 [    0.000000] Initializing cgroup subsys cpu
2021-06-03 17:21:29 [    0.000000] Initializing cgroup subsys cpuacct
    :
2021-07-06 09:50:58 [2824319.188837] BUG: unable to handle kernel
2021-07-06 09:50:58 [2824319.193440] NULL pointer dereference at           (null)
2021-07-06 09:50:58 [2824319.197853] IP: [<ffffffff91cd6e66>] scsi_uninit_cmd+0x26/0x40

This is something of a hack since there's no guarantee that the clock will drift uniformly during the system uptime, so timestamps in the middle of the log may still be somewhat inaccurate (but probably more accurate than without this adjustment).

Regards,
Martin

Explore the HPE digital support experience: https://support.hpe.com

Martin Moore
Linux Engineering Resolution Team
HPE Pointnext Services
Hewlett Packard Enterprise

Martin.Moore at hpe.com<mailto:Martin.Moore at hpe.com>
8AM-5PM EDT (UTC-4) Monday-Friday
Manager: Becky McBride (becky.mcbride at hpe.com<mailto:becky.mcbride at hpe.com>)

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/crash-utility/attachments/20210920/8102f4e0/attachment.htm>