[Crash-utility] Broken backtrace with nested NMIs

Tue Apr 29 11:27:13 UTC 2014

On Fri, 25 Apr 2014 10:11:28 -0400 (EDT)
Dave Anderson <anderson at redhat.com> wrote:

> ----- Original Message -----
> > Hi all,
> > 
> > as discovered by my colleagues, the backtrace code has been broken for
> > NMI stacks since kernel commit 3f3c8b8c4b2a34776c3470142a7c8baafcda6eb0
> > (Linux 3.3).
> > 
> > I am working on a fix, but it's tricky to get all cases right. For
> > example, the copied and saved register locations were swapped with
> > kernel commit 28696f434fef0efa97534b59986ad33b9c4df7f8, so we have at
> > least 3 possible layouts:
> > 
> > 1. pre-3.3 (no nesting)
> > 2. 3.3 to 3.8 (saved, then copied)
> > 3. 3.8+ (copied, then saved)
> > 
> > I'm writing this mail to tell you I'm working on it. I don't have a fix
> > (yet), but want to avoid duplicate efforts if more people start working
> > on this.
> > 
> > Petr T
> 
> Thanks Petr, I appreciate your efforts, and won't get in your way...
> 
> I was aware of Steven's work in this area, but haven't yet seen any
> core dumps that show the changes.  What exactly happens?  Does the
> backtrace fumble its way through the top of the NMI stack, but then
> successfully make the transition to the original stack, or does it 
> just blow up while transitioning through the NMI stack?

It shows an incorrect register dump, but the backtrace continues and
still makes the transition to the original stack.
For example:

PID: 0      TASK: ffff880232b2c440  CPU: 7   COMMAND: "kworker/0:1"
 #0 [ffff88023fdc7e40] crash_nmi_callback at ffffffff8102428f
 #1 [ffff88023fdc7e50] notifier_call_chain at ffffffff81461ec7
 #2 [ffff88023fdc7e80] __atomic_notifier_call_chain at ffffffff81461f0d
 #3 [ffff88023fdc7e90] notify_die at ffffffff81461f5d
 #4 [ffff88023fdc7ec0] default_do_nmi at ffffffff8145f3a7
 #5 [ffff88023fdc7ee0] do_nmi at ffffffff8145f5d8
 #6 [ffff88023fdc7ef0] restart_nmi at ffffffff8145eb2d
    [exception RIP: mwait_idle+423]
    RIP: ffffffff8100b217  RSP: ffff880232b2ff18  RFLAGS: 00000246
    RAX: 0000000000000010  RBX: 0000000000000010  RCX: 0000000000000246
    RDX: ffff880232b2ff18  RSI: 0000000000000018  RDI: 0000000000000001
    RBP: ffffffff8100b217   R8: ffffffff8100b217   R9: 0000000000000018
    R10: ffff880232b2ff18  R11: 0000000000000246  R12: ffffffffffffffff
    R13: ffffffff81d36108  R14: ffff880232b2ffd8  R15: 0000000000000000
    ORIG_RAX: 0000000000000000  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #7 [ffff880232b2ff18] mwait_idle at ffffffff8100b217
 #8 [ffff880232b2ff30] cpu_idle at ffffffff81002126
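
One way to see that the register dump above is wrong: several
general-purpose registers simply repeat the exception frame. RBP and R8
equal RIP, RBX and RAX equal CS, R11 and RCX equal RFLAGS, R10 and RDX
equal RSP, and R9 and RSI equal SS. A minimal sketch of a sanity check
along those lines follows. It is not code from crash:
looks_like_iret_copy() is a hypothetical helper, and the slot order is
assumed to follow the x86_64 struct pt_regs memory layout.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Words in a hardware exception frame: RIP, CS, RFLAGS, RSP, SS. */
#define IRET_WORDS 5
/* Words in x86_64 struct pt_regs up to and including orig_ax (assumed). */
#define GPR_WORDS 16

/*
 * Hypothetical helper, for illustration only.
 * 'regs' holds 21 words in pt_regs memory order:
 *   r15 r14 r13 r12 bp bx r11 r10 r9 r8 ax cx dx si di orig_ax,
 *   then rip cs rflags rsp ss.
 * Returns true if any five consecutive slots before the exception
 * frame are an exact copy of that frame.
 */
static bool looks_like_iret_copy(const uint64_t regs[GPR_WORDS + IRET_WORDS])
{
    const uint64_t *frame = &regs[GPR_WORDS];

    for (size_t start = 0; start + IRET_WORDS <= GPR_WORDS; start++) {
        size_t i = 0;

        while (i < IRET_WORDS && regs[start + i] == frame[i])
            i++;
        if (i == IRET_WORDS)
            return true;
    }
    return false;
}
```

This is only a heuristic: it flags the dump above because the slots
from RBP through R9 match the exception frame word for word, but a
match does not say which stack layout produced it.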

If there is a nested NMI, my reading of the code suggests that crash may
loop back onto the NMI stack, but I don't have a sample dump file at the
moment.

Petr T
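
The three layouts listed in the quoted message can be written down as a
small selector. This is a sketch, not the eventual fix: the enum and
nmi_layout_for() are hypothetical names, and it assumes that 3.8 is the
first release with the "copied, then saved" order. Distribution kernels
may backport either commit, so a real fix would probably have to
inspect the kernel itself rather than trust the version number.

```c
/* Hypothetical names, for illustration only. */
enum nmi_stack_layout {
    NMI_LAYOUT_NO_NESTING,    /* 1. pre-3.3 (no nesting) */
    NMI_LAYOUT_SAVED_COPIED,  /* 2. 3.3 to 3.8 (saved, then copied) */
    NMI_LAYOUT_COPIED_SAVED   /* 3. 3.8+ (copied, then saved) */
};

/*
 * Map an upstream kernel version to the NMI stack layout.
 * Assumption: 3.8 itself already uses the swapped order.
 */
static enum nmi_stack_layout nmi_layout_for(unsigned major, unsigned minor)
{
    if (major < 3 || (major == 3 && minor < 3))
        return NMI_LAYOUT_NO_NESTING;
    if (major == 3 && minor < 8)
        return NMI_LAYOUT_SAVED_COPIED;
    return NMI_LAYOUT_COPIED_SAVED;
}
```

For example, nmi_layout_for(3, 7) selects the "saved, then copied"
layout, while nmi_layout_for(3, 8) selects "copied, then saved".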