[Crash-utility] infinite loop in crash due to double-NMI on x86_64 system
Dave Anderson
anderson at redhat.com
Mon Jun 28 19:10:55 UTC 2010
----- "Lucas Silacci" <Lucas.Silacci at teradata.com> wrote:
> Below is the output of running crash (with the patch) against one of
> these dumps.
>
> -Lucas
>
>
> crash 5.0.5
> Copyright (C) 2002-2010 Red Hat, Inc.
> Copyright (C) 2004, 2005, 2006 IBM Corporation
> Copyright (C) 1999-2006 Hewlett-Packard Co
> Copyright (C) 2005, 2006 Fujitsu Limited
> Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
> Copyright (C) 2005 NEC Corporation
> Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
> Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
> This program is free software, covered by the GNU General Public License,
> and you are welcome to change it and/or distribute copies of it under
> certain conditions. Enter "help copying" to see the conditions.
>
> This program has absolutely no warranty. Enter "help warranty" for
> details.
>
> GNU gdb (GDB) 7.0
> Copyright (C) 2009 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later
> <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
>
> This GDB was configured as "x86_64-unknown-linux-gnu"...
>
> please wait... (determining panic task)
>
> WARNING: Loop detected in the NMI Exception Stack!
>
>
> bt: cannot transition from exception stack to current process stack:
> exception stack pointer: ffffffff8046dc50
> process stack pointer: ffffffff8046ddd8
> current stack base: ffffffff80422000
>
> SYSTEM MAP: /boot/System.map-2.6.16.53-0.8.PTF.434477.9.TDC.0-smp
> DEBUG KERNEL: /boot/vmlinux-2.6.16.53-0.8.PTF.434477.9.TDC.0-smp
> (2.6.16.53-0.8.PTF.434477.9.TDC.0-smp)
> DUMPFILE: /var/crash/lucas.save/vmcore [PARTIAL DUMP]
> CPUS: 4
> DATE: Tue May 18 12:46:07 2010
> UPTIME: 07:24:54
> LOAD AVERAGE: 85.74, 82.85, 82.29
> TASKS: 2449
> NODENAME: POLO5_1-9
> RELEASE: 2.6.16.53-0.8.PTF.434477.9.TDC.0-smp
> VERSION: #1 SMP Fri Aug 31 06:07:27 PDT 2007
> MACHINE: x86_64 (2660 Mhz)
> MEMORY: 7.9 GB
> PANIC: "Kernel panic - not syncing: dumpsw: Dump switch pushed; reason: 0x20 args=0xffffffff8046df08"
> PID: 0
> COMMAND: "swapper"
> TASK: ffffffff8038c340 (1 of 4) [THREAD_INFO: ffffffff80422000]
> CPU: 0
> STATE: TASK_RUNNING (PANIC)
>
> crash> bt
> PID: 0 TASK: ffffffff8038c340 CPU: 0 COMMAND: "swapper"
> #0 [ffffffff8046dc50] machine_kexec at ffffffff8011a95b
> #1 [ffffffff8046dd20] crash_kexec at ffffffff80154351
> #2 [ffffffff8046dde0] panic at ffffffff801327fa
> #3 [ffffffff8046ded0] dumpsw_notify at ffffffff8831c0c3
> #4 [ffffffff8046dee0] notifier_call_chain at ffffffff8032481f
> #5 [ffffffff8046df00] default_do_nmi at ffffffff80322fab
> #6 [ffffffff8046df40] do_nmi at ffffffff80323365
> #7 [ffffffff8046df50] nmi at ffffffff8032268f
> [exception RIP: smp_send_stop+84]
> RIP: ffffffff80116e44 RSP: ffffffff8046ddd8 RFLAGS: 00000246
> RAX: 00000000000000ff RBX: ffffffff8831c1f8 RCX: 000041049c7256e8
> RDX: 0000000000000005 RSI: 000000005238a938 RDI: 00000000002896a0
> RBP: ffffffff8046df08 R8: 00000000000040fb R9: 000000005238a7e8
> R10: 0000000000000002 R11: 0000ffff0000ffff R12: 000000000000000c
> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> --- <NMI exception stack> ---
> #8 [ffffffff8046ddd8] smp_send_stop at ffffffff80116e44
> bt: WARNING: Loop detected in the NMI Exception Stack!
> bt: cannot transition from exception stack to current process stack:
> exception stack pointer: ffffffff8046dc50
> process stack pointer: ffffffff8046ddd8
> current stack base: ffffffff80422000
> crash>
What exactly was the sequence of events? Was the system repeatedly and
erroneously taking one NMI after another for some reason, and *then* the
"dump switch" was pressed? Does the dumpsw_notify() function then send
another NMI? And where does that dumpsw_notify() function live anyway?

I'm just trying to get a grip on whether this will ever happen again, or
whether the patch is only working around a one-time hardware abnormality.

Dave
> -----Original Message-----
> From: crash-utility-bounces at redhat.com
> [mailto:crash-utility-bounces at redhat.com] On Behalf Of Dave Anderson
> Sent: Friday, June 25, 2010 12:32 PM
> To: Discussion list for crash utility usage, maintenance and development
> Subject: Re: [Crash-utility] infinite loop in crash due to double-NMI on x86_64 system
>
>
> ----- "Lucas Silacci" <Lucas.Silacci at teradata.com> wrote:
>
> > Hi,
> >
> > I've run into an issue where crash will enter an infinite loop while
> > decoding exception stacks if those stacks get corrupted.
> >
> > We've seen this on four different systems where the hardware generated
> > multiple NMIs and the second and subsequent NMIs caused the NMI
> > exception stack to be overwritten. When this condition is hit, the
> > bottom rsp on the NMI exception stack (which would normally point you
> > back to the kernel thread stack or possibly a different exception stack)
> > points you back into the middle of the same NMI exception stack. This
> > causes crash to infinitely loop when it tries to decode that exception
> > stack.
> >
> > Now clearly the root cause of the issue is faulty hardware that
> > generated multiple NMIs. However a very small change in crash can detect
> > this issue and stop the infinite loop from happening thereby allowing
> > you to get to a point in crash where you can actually tell that it was
> > an NMI that caused the system to dump.
> >
> > The patch is attached to this email. For x86_64 it will detect the
> > condition of any exception stack that points back at itself.
> >
> > Please feel free to ask me any questions on this.
>
> Wow, that's pretty interesting -- I've certainly never seen that before.
> Can you show me what the backtrace looks like with your patch applied?
>
> Thanks,
> Dave
>
> --
> Crash-utility mailing list
> Crash-utility at redhat.com
> https://www.redhat.com/mailman/listinfo/crash-utility
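
The condition described in the original report (a saved RSP at the bottom
of an exception stack that points back into that same exception stack) can
be illustrated with a small range check. This is only a sketch and is not
the attached patch, which is not reproduced in this message. The struct,
the function name, and the stack bounds used in the example are all
hypothetical; crash's real x86_64 code uses its own data structures.

```c
#include <stdint.h>

/* Hypothetical description of one exception stack (not a crash type). */
struct estack {
    uint64_t base;   /* lowest address belonging to the stack */
    uint64_t size;   /* size of the stack in bytes */
};

/*
 * Return 1 if the RSP saved in the exception frame lies inside the
 * exception stack it was read from, 0 otherwise.  A backtrace that
 * follows such an RSP would re-enter the same stack instead of moving
 * on to the process stack or to a different exception stack.
 */
static int rsp_loops_back(const struct estack *es, uint64_t saved_rsp)
{
    return saved_rsp >= es->base && (saved_rsp - es->base) < es->size;
}
```

For example, assuming (purely for illustration) a 4 KB NMI stack starting
at ffffffff8046d000, the RSP value ffffffff8046ddd8 shown in the exception
frame above falls inside that range and would be flagged, whereas an
address inside the process stack that begins at ffffffff80422000 would not.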