[Crash-utility] Inaccurate timestamps in log -T output

Wed Sep 22 08:42:48 UTC 2021

Hi Martin,

Thank you for the detailed report and interesting idea.

The "log -T" option imitates the "dmesg -T" command, and it also
looks inaccurate in nature and has a warning in its help text.

# systemctl restart psacct ; date
Wed Sep 22 17:30:49 JST 2021
# dmesg -T | tail -n 1
[Wed Sep 22 17:30:22 2021] Process accounting resumed
# dmesg --help | grep -- -T
 -T, --ctime                 show human-readable timestamp (may be inaccurate!)

Maybe at least we should add a similar warning to the help page.

Personally I would prefer users to use it with the acknowledgement
of that inaccuracy like dmesg -T, rather than doing a crash specific
adjustment..

Thanks,
Kazu

-----Original Message-----
> From: crash-utility-bounces at redhat.com <crash-utility-bounces at redhat.com> On Behalf Of Moore, Martin
> (Linux ERT)
> Sent: Tuesday, September 21, 2021 1:58 AM
> To: Discussion list for crash utility usage, maintenance and development <crash-utility at redhat.com>
> Subject: [Crash-utility] Inaccurate timestamps in log -T output
> 
> Hello,
> 
> 
> 
> As the current maintainer of the PyKdump extension, I have been investigating a reported discrepancy between
> the timestamps generated by crash’s log -T and PyKdump’s tslog command, which provides the same functionality
> of timestamping the kernel log entries.  What I have found is that *both* log -T and tslog generate some
> inaccurate timestamps on vmcores with more than a small amount of uptime, but in different ways.
> Specifically, log -T timestamps are accurate at boot time but become less accurate later in the log, while
> tslog timestamps are just the opposite: they’re accurate around crash time but less accurate earlier in
> the log.
> 
> 
> 
> Here’s an example from a panic on a kernel 3.10.0-957.5.1 system with a moderate uptime:
> 
> 
> 
>         DATE: Tue Jul  6 09:50:58 +03 2021
> 
>       UPTIME: 32 days, 16:29:29
> 
>        PANIC: "BUG: unable to handle kernel "
> 
> 
> 
> Crash correctly computes the boot time based on the date (crash time) and uptime shown above.  From “help
> -k”:
> 
> 
> 
>      boot_date: Thu Jun  3 17:21:29 +03 2021
> 
> 
> 
> log -T shows the correct timestamps of the boot messages:
> 
> 
> 
> [Thu Jun  3 17:21:29 +03 2021] Initializing cgroup subsys cpuset
> 
> [Thu Jun  3 17:21:29 +03 2021] Initializing cgroup subsys cpu
> 
> [Thu Jun  3 17:21:29 +03 2021] Initializing cgroup subsys cpuacct
> 
> 
> 
> However, the timestamps of the BUG and surrounding messages are off by 150 seconds or 2.5 minutes (09:53:28,
> but the crash occurred at 09:50:58):
> 
> 
> 
> [Tue Jul  6 09:53:28 +03 2021] BUG: unable to handle kernel
> 
> [Tue Jul  6 09:53:28 +03 2021] NULL pointer dereference at           (null)
> 
> [Tue Jul  6 09:53:28 +03 2021] IP: [<ffffffff91cd6e66>] scsi_uninit_cmd+0x26/0x40
> 
> 
> 
> For comparison, tslog has the correct timestamps around the crash:
> 
> 
> 
> 2021-07-06 09:50:58 [2824319.188837] BUG: unable to handle kernel
> 
> 2021-07-06 09:50:58 [2824319.193440] NULL pointer dereference at           (null)
> 
> 2021-07-06 09:50:58 [2824319.197853] IP: [<ffffffff91cd6e66>] scsi_uninit_cmd+0x26/0x40
> 
> 
> 
> But it’s off by the same 2.5-minute interval during the boot messages (17:18:59, but it should be 17:21:29):
> 
> 
> 
> 2021-06-03 17:18:59 [    0.000000] Initializing cgroup subsys cpuset
> 
> 2021-06-03 17:18:59 [    0.000000] Initializing cgroup subsys cpu
> 
> 2021-06-03 17:18:59 [    0.000000] Initializing cgroup subsys cpuacct
> 
> 
> 
> The reasons for these inaccuracies are:
> 
> 
> 
> 1.       log -T computes boot time as (crash time minus uptime), and then adds the seconds part of the time
> stored in each log entry to generate the entry’s timestamp.
> 
> 
> 
> 2.       tslog computes a base time as (crash time minus seconds part of the final log entry) and then adds
> the seconds part of the time stored in each log entry.
> 
> 
> 
> 3.       The key to the problem: printk() stores the log entry timestamp based on a call to local_clock(),
> which provides the raw time in the local CPU, which is not adjusted by NTP.  As such, inaccuracy accumulates
> as the clock drifts over time; in the example above, the discrepancy has reached 150 seconds by the time
> of the crash.
> 
> 
> 
> I have seen this now in numerous vmcores from kernels 3.0 through 4.12, which is the latest version I’ve
> checked.  The discrepancy exists in all of them with more than a few hours of uptime and can be positive
> or negative.  In one vmcore it was over 13 hours after 239 days of uptime.
> 
> 
> 
> From the PyKdump perspective, I could live with the current tslog behavior because it’s accurate for log
> entries near the crash time, which are generally the most useful for crash analysis.  But in an attempt
> to improve it, I’m testing an experimental version of tslog that smooths out the discrepancy.  It computes
> the boot time the same way that crash does (crash time minus uptime) and a scale factor for the time discrepancy
> equal to (uptime)/(seconds part of final log entry).  Then each log entry’s timestamp is computed as (boot
> time) + (seconds part of log entry)*(scale factor).  This version results in correct timestamps at both
> ends of the log:
> 
> 
> 
> 2021-06-03 17:21:29 [    0.000000] Initializing cgroup subsys cpuset
> 
> 2021-06-03 17:21:29 [    0.000000] Initializing cgroup subsys cpu
> 
> 2021-06-03 17:21:29 [    0.000000] Initializing cgroup subsys cpuacct
> 
>     :
> 
> 2021-07-06 09:50:58 [2824319.188837] BUG: unable to handle kernel
> 
> 2021-07-06 09:50:58 [2824319.193440] NULL pointer dereference at           (null)
> 
> 2021-07-06 09:50:58 [2824319.197853] IP: [<ffffffff91cd6e66>] scsi_uninit_cmd+0x26/0x40
> 
> 
> 
> This is something of a hack since there’s no guarantee that the clock will drift uniformly during the system
> uptime, so timestamps in the middle of the log may still be somewhat inaccurate (but probably more accurate
> than without this adjustment).
> 
> 
> 
> Regards,
> 
> Martin
> 
> 
> 
> 
> 
> Explore the HPE digital support experience: https://support.hpe.com <https://support.hpe.com>
> 
> 
> 
> Martin Moore
> Linux Engineering Resolution Team
> 
> HPE Pointnext Services
> Hewlett Packard Enterprise
> 
> Martin.Moore at hpe.com <mailto:Martin.Moore at hpe.com>
> 8AM-5PM EDT (UTC-4) Monday-Friday
> Manager: Becky McBride (becky.mcbride at hpe.com <mailto:becky.mcbride at hpe.com> )
> 
>