[Crash-utility] infinite loop in crash due to double-NMI on x86_64 system

Dave Anderson anderson at redhat.com
Mon Jun 28 20:35:10 UTC 2010


----- "Lucas Silacci" <Lucas.Silacci at teradata.com> wrote:

> > -----Original Message-----
> > From: crash-utility-bounces at redhat.com
> > [mailto:crash-utility-bounces at redhat.com] On Behalf Of Dave Anderson
> > Sent: Monday, June 28, 2010 12:11 PM
> > To: Discussion list for crash utility usage, maintenance and development
> > Subject: Re: [Crash-utility] infinite loop in crash due to double-NMI on x86_64 system
> > 
> > 
> >   
> > ----- "Lucas Silacci" <Lucas.Silacci at teradata.com> wrote:
> > 
> > > Below is the output of running crash (with the patch) against one of
> > > these dumps.
> > > 
> > > -Lucas
> > > 
> > > 
> > > crash 5.0.5
> > > Copyright (C) 2002-2010  Red Hat, Inc.
> > > Copyright (C) 2004, 2005, 2006  IBM Corporation
> > > Copyright (C) 1999-2006  Hewlett-Packard Co    
> > > Copyright (C) 2005, 2006  Fujitsu Limited      
> > > Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
> > > Copyright (C) 2005  NEC Corporation                  
> > > Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
> > > Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
> > > This program is free software, covered by the GNU General Public License,
> > > and you are welcome to change it and/or distribute copies of it under 
> > > certain conditions.  Enter "help copying" to see the conditions.
> > > This program has absolutely no warranty.  Enter "help warranty" for
> > > details.
> > > 
> > > GNU gdb (GDB) 7.0
> > > Copyright (C) 2009 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later
> > > <http://gnu.org/licenses/gpl.html>
> > > This is free software: you are free to change and redistribute it.
> > > There is NO WARRANTY, to the extent permitted by law.  Type "show copying"   
> > > and "show warranty" for details.
> > > 
> > > This GDB was configured as "x86_64-unknown-linux-gnu"...
> > > 
> > > please wait... (determining panic task)
> > > 
> > > WARNING: Loop detected in the NMI Exception Stack!
> > > 
> > > 
> > > bt: cannot transition from exception stack to current process stack:
> > >     exception stack pointer: ffffffff8046dc50
> > >       process stack pointer: ffffffff8046ddd8
> > >          current stack base: ffffffff80422000
> > > 
> > >   SYSTEM MAP: /boot/System.map-2.6.16.53-0.8.PTF.434477.9.TDC.0-smp
> > > DEBUG KERNEL: /boot/vmlinux-2.6.16.53-0.8.PTF.434477.9.TDC.0-smp (2.6.16.53-0.8.PTF.434477.9.TDC.0-smp)
> > >     DUMPFILE: /var/crash/lucas.save/vmcore  [PARTIAL DUMP]
> > >         CPUS: 4
> > >         DATE: Tue May 18 12:46:07 2010
> > >       UPTIME: 07:24:54
> > > LOAD AVERAGE: 85.74, 82.85, 82.29
> > >        TASKS: 2449
> > >     NODENAME: POLO5_1-9
> > >      RELEASE: 2.6.16.53-0.8.PTF.434477.9.TDC.0-smp
> > >      VERSION: #1 SMP Fri Aug 31 06:07:27 PDT 2007
> > >      MACHINE: x86_64  (2660 Mhz)
> > >       MEMORY: 7.9 GB
> > >        PANIC: "Kernel panic - not syncing: dumpsw: Dump switch pushed; reason: 0x20  args=0xffffffff8046df08"
> > >          PID: 0
> > >      COMMAND: "swapper"
> > >         TASK: ffffffff8038c340  (1 of 4)  [THREAD_INFO: ffffffff80422000]
> > >          CPU: 0
> > >        STATE: TASK_RUNNING (PANIC)
> > > 
> > > crash> bt
> > > PID: 0      TASK: ffffffff8038c340  CPU: 0   COMMAND: "swapper"
> > >  #0 [ffffffff8046dc50] machine_kexec at ffffffff8011a95b
> > >  #1 [ffffffff8046dd20] crash_kexec at ffffffff80154351
> > >  #2 [ffffffff8046dde0] panic at ffffffff801327fa
> > >  #3 [ffffffff8046ded0] dumpsw_notify at ffffffff8831c0c3
> > >  #4 [ffffffff8046dee0] notifier_call_chain at ffffffff8032481f
> > >  #5 [ffffffff8046df00] default_do_nmi at ffffffff80322fab
> > >  #6 [ffffffff8046df40] do_nmi at ffffffff80323365
> > >  #7 [ffffffff8046df50] nmi at ffffffff8032268f
> > >     [exception RIP: smp_send_stop+84]
> > >     RIP: ffffffff80116e44  RSP: ffffffff8046ddd8  RFLAGS: 00000246
> > >     RAX: 00000000000000ff  RBX: ffffffff8831c1f8  RCX: 000041049c7256e8
> > >     RDX: 0000000000000005  RSI: 000000005238a938  RDI: 00000000002896a0
> > >     RBP: ffffffff8046df08   R8: 00000000000040fb   R9: 000000005238a7e8
> > >     R10: 0000000000000002  R11: 0000ffff0000ffff  R12: 000000000000000c
> > >     R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
> > >     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> > > --- <NMI exception stack> ---
> > >  #8 [ffffffff8046ddd8] smp_send_stop at ffffffff80116e44
> > > bt: WARNING: Loop detected in the NMI Exception Stack!
> > > bt: cannot transition from exception stack to current process stack:
> > >     exception stack pointer: ffffffff8046dc50
> > >       process stack pointer: ffffffff8046ddd8
> > >          current stack base: ffffffff80422000
> > > crash> 
> >  
> > What exactly was the sequence of events?  Was the system repeatedly and
> > erroneously running one NMI after another for some reason, and *then* the
> > "dump switch" was pressed?  And the dumpsw_notify() function sends another
> > NMI?  And where does that dumpsw_notify() function live anyway?
> > 
> > I'm just trying to get a grip on whether this will ever happen again, or
> > whether it's fixing a one-time hardware abnormality?
> > 
> > Dave
> >
> 
> As far as I am aware, we have had three separate customers encounter
> this issue. It appears from the hardware SEL log that multiple PCI
> SERR's came in at the same time and somehow triggered multiple NMIs.
> You can see the SEL entries from the output of the "ipmitool sel"
> command:
> 
> 0231 11FC  02  01:53:47 12/17/09  3300 04   13  EB   6F  A5 15 08  Crit. Interrupt   PCI SERR (PCI Bus 15 Device 1 Function 0) was asserted
> 0232 1210  02  01:53:47 12/17/09  3300 04   13  EB   6F  A5 16 20  Crit. Interrupt   PCI SERR (PCI Bus 16 Device 4 Function 0) was asserted
> 0233 1224  02  01:53:47 12/17/09  3300 04   13  EB   6F  A5 16 21  Crit. Interrupt   PCI SERR (PCI Bus 16 Device 4 Function 1) was asserted
> 0234 1238  02  01:53:47 12/17/09  3300 04   13  EB   6F  A5 16 30  Crit. Interrupt   PCI SERR (PCI Bus 16 Device 6 Function 0) was asserted
> 0235 124C  02  01:53:47 12/17/09  3300 04   13  EB   6F  A5 16 31  Crit. Interrupt   PCI SERR (PCI Bus 16 Device 6 Function 1) was asserted
> 
> My understanding of the architecture of the system is that only one NMI
> should have been asserted to the OS regardless of the number of times
> there was a hardware error, but clearly that wasn't the case in these
> three instances.
> 
> Also, it seemed like my patch made crash a little bit more tolerant of
> "corrupted" dump images which I thought could only be a good thing.

Right, I understand that...

But you didn't answer my questions re: the "dump switch" procedure and
the dumpsw_notify() function.  Was the system stuck in the NMI handler,
somebody noticed the repetitive NMIs (?), and so they hit the "dump switch"?
(whatever that may be...) 
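
For reference, the three addresses in the "cannot transition" message can be
sanity-checked by hand. What follows is only a rough sketch, not what crash
itself does: the addresses are copied from the bt output quoted above, while
the 8 KB process stack size, the 4 KB exception stack size, and the idea that
the NMI stack is aligned to its own size are assumptions about this 2.6.16
x86_64 kernel, not facts read from the dump.

```python
# Addresses copied from the quoted "bt" output.
EXCEPTION_RSP = 0xffffffff8046dc50  # "exception stack pointer"
SAVED_RSP = 0xffffffff8046ddd8      # RSP saved in the NMI exception frame
STACK_BASE = 0xffffffff80422000     # "current stack base" (swapper THREAD_INFO)

# ASSUMPTIONS (not taken from the dump): stack sizes and alignment.
THREAD_SIZE = 8192           # assumed size of the process stack
EXCEPTION_STACK_SIZE = 4096  # assumed size of the per-cpu NMI stack


def in_range(addr, base, size):
    """Return True if addr lies in [base, base + size)."""
    return base <= addr < base + size


def aligned_base(addr, size):
    """Round addr down to a multiple of size (size must be a power of two)."""
    return addr & ~(size - 1)


# Assumed base of the NMI stack that the exception frames were found on.
nmi_stack_base = aligned_base(EXCEPTION_RSP, EXCEPTION_STACK_SIZE)

on_process_stack = in_range(SAVED_RSP, STACK_BASE, THREAD_SIZE)
on_nmi_stack = in_range(SAVED_RSP, nmi_stack_base, EXCEPTION_STACK_SIZE)

print("saved RSP on process stack:", on_process_stack)
print("saved RSP on NMI stack:    ", on_nmi_stack)
```

Under those assumptions the saved RSP falls inside the NMI stack and outside
swapper's process stack, which would be consistent with a second NMI having
arrived while the first one was still being handled.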

Dave



