[Crash-utility] Broken backtrace with nested NMIs
Petr Tesarik
ptesarik at suse.cz
Sat May 24 16:42:27 UTC 2014
On Sat, 24 May 2014 20:24:30 +0800
oliver yang <yangoliver at gmail.com> wrote:
> 2014-04-29 19:27 GMT+08:00 Petr Tesarik <ptesarik at suse.cz>:
>
> >
> > It will show an incorrect register dump, but the backtrace continues.
> > For example:
> >
>
> Hi Petr,
>
> The back trace looks good.
>
> How did you know the register dump is incorrect?
The saved registers did not make any sense in the interrupted code. ;-)
And one of them, which should have been a pointer, looked like RFLAGS.
> At least the value of RSP saved in NMI stack seemed to be good,
>
> RSP: ffff880232b2ff18
Yes, SS, RSP, RFLAGS, CS, and RIP may look good, because they are
pushed onto the stack by the CPU. But they may point back to an NMI if
it was a nested NMI. See my comments below.
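To illustrate which values come straight from the hardware, here is a minimal Python sketch that decodes the five 64-bit words the CPU pushes on an x86-64 interrupt. The helper name decode_hw_frame is made up for this example and is not part of crash; the sample bytes are built from the RIP, CS, RFLAGS, RSP and SS values in the dump quoted below. Only the field order (SS pushed first, RIP last, so RIP sits at the lowest address) is the hardware layout.

```python
import struct

# Memory order of the hardware-pushed frame, lowest address first.
# The CPU pushes SS first and RIP last, so RIP ends up at the lowest address.
HW_FRAME_FIELDS = ("rip", "cs", "rflags", "rsp", "ss")


def decode_hw_frame(raw):
    """Decode 40 bytes of little-endian stack memory into a dict.

    Hypothetical helper for illustration only; not a crash function.
    """
    values = struct.unpack("<5Q", raw[:40])
    return dict(zip(HW_FRAME_FIELDS, values))


# Sample frame built from the values shown in the quoted register dump.
sample = struct.pack(
    "<5Q",
    0xFFFFFFFF8100B217,  # RIP (mwait_idle+423)
    0x10,                # CS
    0x246,               # RFLAGS
    0xFFFF880232B2FF18,  # RSP
    0x18,                # SS
)
frame = decode_hw_frame(sample)
print({name: hex(value) for name, value in frame.items()})
```

Anything outside these five words is saved by software, which is why it can be wrong even when these five look plausible.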
> Recently, I'm working on a core file analysis, and found crash tool
> couldn't give the correct NMI back trace.
> But I can find right stack trace by using IST pointer.
>
> I'm wondering whether your patch could work for my cases.
>
> May I can try your fix after it is ready.
See https://www.redhat.com/archives/crash-utility/2014-April/msg00038.html
It's now also in crash git, see
commit 8e15958e1b7183bbfbdf004f0ad8f2b62f023f9f.
So, how do you recognize a wrong register dump?
Some symptoms:
> > PID: 0 TASK: ffff880232b2c440 CPU: 7 COMMAND: "kworker/0:1"
> > #0 [ffff88023fdc7e40] crash_nmi_callback at ffffffff8102428f
> > #1 [ffff88023fdc7e50] notifier_call_chain at ffffffff81461ec7
> > #2 [ffff88023fdc7e80] __atomic_notifier_call_chain at ffffffff81461f0d
> > #3 [ffff88023fdc7e90] notify_die at ffffffff81461f5d
> > #4 [ffff88023fdc7ec0] default_do_nmi at ffffffff8145f3a7
> > #5 [ffff88023fdc7ee0] do_nmi at ffffffff8145f5d8
> > #6 [ffff88023fdc7ef0] restart_nmi at ffffffff8145eb2d
> > [exception RIP: mwait_idle+423]
> > RIP: ffffffff8100b217 RSP: ffff880232b2ff18 RFLAGS: 00000246
> > RAX: 0000000000000010 RBX: 0000000000000010 RCX: 0000000000000246
RAX is the kernel code segment (copied CS)
RBX is the kernel code segment (saved CS)
RCX looks like RFLAGS (note the typical 246 at the end).
> > RDX: ffff880232b2ff18 RSI: 0000000000000018 RDI: 0000000000000001
RDX points to a kernel stack
RSI is the kernel data segment (copied SS)
RDI is always 1 (the NMI executing flag)
> > RBP: ffffffff8100b217 R8: ffffffff8100b217 R9: 0000000000000018
RBP points to kernel text
R8 points to kernel text
R9 is the kernel data segment (saved SS)
> > R10: ffff880232b2ff18 R11: 0000000000000246 R12: ffffffffffffffff
R10 points to a kernel stack
R11 looks like RFLAGS
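The symptoms above boil down to one observation: the general-purpose registers merely repeat the values of the hardware frame (RIP, CS, RFLAGS, RSP, SS). A rough Python sketch of that check follows. The function name and the threshold of 4 are my own assumptions for illustration; this is not how crash detects the problem, and the register values are the ones from the quoted dump.

```python
HW_FIELDS = ("RIP", "CS", "RFLAGS", "RSP", "SS")
GP_REGS = ("RAX", "RBX", "RCX", "RDX", "RSI", "RDI", "RBP",
           "R8", "R9", "R10", "R11", "R12", "R13", "R14", "R15")


def suspicious_matches(regs):
    """Return the general-purpose registers whose value duplicates one of
    the hardware-pushed frame fields (hypothetical heuristic)."""
    hw_values = {regs[name] for name in HW_FIELDS}
    return [name for name in GP_REGS if regs[name] in hw_values]


# Register values taken from the dump quoted in this message.
dump = {
    "RIP": 0xFFFFFFFF8100B217, "RSP": 0xFFFF880232B2FF18, "RFLAGS": 0x246,
    "CS": 0x10, "SS": 0x18,
    "RAX": 0x10, "RBX": 0x10, "RCX": 0x246,
    "RDX": 0xFFFF880232B2FF18, "RSI": 0x18, "RDI": 0x1,
    "RBP": 0xFFFFFFFF8100B217, "R8": 0xFFFFFFFF8100B217, "R9": 0x18,
    "R10": 0xFFFF880232B2FF18, "R11": 0x246, "R12": 0xFFFFFFFFFFFFFFFF,
    "R13": 0xFFFFFFFF81D36108, "R14": 0xFFFF880232B2FFD8, "R15": 0x0,
}

matches = suspicious_matches(dump)
# Assumed threshold: a few coincidences are possible in a genuine dump,
# so only flag the dump when many registers repeat the frame values.
looks_wrong = len(matches) >= 4
print(matches, looks_wrong)
```

Note that this sketch does not test RDI == 1 (the NMI executing flag), since a value of 1 on its own is too common to be a useful signal.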
HTH,
Petr Tesarik
> > R13: ffffffff81d36108 R14: ffff880232b2ffd8 R15: 0000000000000000
> > ORIG_RAX: 0000000000000000 CS: 0010 SS: 0018
> > --- <NMI exception stack> ---
> > #7 [ffff880232b2ff18] mwait_idle at ffffffff8100b217
> > #8 [ffff880232b2ff30] cpu_idle at ffffffff81002126
> >
> > If there is a nested NMI, reading the code suggests crash may loop again
> > to the NMI stack, but I don't have a sample dump file ATM.
> >
> > Petr T
> >
>
>
>