<!doctype html public "-//w3c//dtd html 4.0 transitional//en"> <html> <tt></tt> <tt>Hi Rachita,</tt><tt></tt> <tt>I'm looking at an alt-sysrq-c generated crash on an x86_64 kernel,</tt> <tt>and I am seeing something similar to what you are with respect</tt> <tt>to the transition from the interrupt stack back to the process</tt> <tt>stack.</tt><tt></tt> <tt>Here's a "bt -a", where cpu 0 shows that it received a shutdown</tt> <tt>NMI while in the idle loop, and the transition from the NMI</tt> <tt>exception stack back to the process stack was clean. But on</tt> <tt>the cpu which took the alt-sysrq-c keyboard interrupt, the</tt> <tt>transition from the per-cpu interrupt stack back to the</tt> <tt>process stack is similar to what you're seeing:</tt><tt></tt> <tt>crash> bt -a</tt> <tt>PID: 0 TASK: ffffffff8034ce60 CPU: 0 COMMAND: "swapper"</tt> <tt> #0 [ffffffff80481f30] crash_nmi_callback at ffffffff8007742f</tt> <tt> #1 [ffffffff80481f40] do_nmi at ffffffff80063c2c</tt> <tt> #2 [ffffffff80481f50] nmi at ffffffff8006312f</tt> <tt> [exception RIP: mwait_idle+54]</tt> <tt> RIP: ffffffff800553b7 RSP: ffffffff80437f90 RFLAGS: 00000246</tt> <tt> RAX: 0000000000000000 RBX: ffffffff80055381 RCX: 0000000000000000</tt> <tt> RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff8034e838</tt> <tt> RBP: 0000000000099000 R8: ffffffff80436000 R9: 000000000000003e</tt> <tt> R10: ffff810037d0c038 R11: 0000000000000048 R12: 0000000000090000</tt> <tt> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000</tt> <tt> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018</tt> <tt>--- <exception stack> ---</tt> <tt> #3 [ffffffff80437f90] mwait_idle at ffffffff800553b7</tt> <tt> #4 [ffffffff80437f90] cpu_idle at ffffffff800473be</tt><tt></tt> <tt>PID: 0 TASK: ffff81003fe48100 CPU: 1 COMMAND: "swapper"</tt> <tt> #0 [ffff81003fe6bb40] crash_kexec at ffffffff800ab7c4</tt> <tt> #1 [ffff81003fe6bbc8] mwait_idle at ffffffff800553b7</tt> <tt> #2 [ffff81003fe6bc00] sysrq_handle_crashdump at ffffffff8019301f</tt> <tt> #3 [ffff81003fe6bc10] __handle_sysrq at ffffffff80192e1c</tt> <tt> #4 [ffff81003fe6bc50] kbd_event at ffffffff8018dbc1</tt> <tt> #5 [ffff81003fe6bca0] input_event at ffffffff801e9b9f</tt> <tt> #6 [ffff81003fe6bcd0] hidinput_hid_event at ffffffff801e42cb</tt> <tt> #7 [ffff81003fe6bd00] hid_process_event at ffffffff801df66b</tt> <tt> #8 [ffff81003fe6bd40] hid_input_report at ffffffff801df9d9</tt> <tt> #9 [ffff81003fe6bdc0] hid_irq_in at ffffffff801e0dc0</tt> <tt>#10 [ffff81003fe6bde0] usb_hcd_giveback_urb at ffffffff801d33d8</tt> <tt>#11 [ffff81003fe6be10] uhci_giveback_urb at ffffffff88168724</tt> <tt>#12 [ffff81003fe6be50] uhci_scan_schedule at ffffffff88168e07</tt> <tt>#13 [ffff81003fe6bed0] uhci_irq at ffffffff8816ac08</tt> <tt>#14 [ffff81003fe6bf10] usb_hcd_irq at ffffffff801d3dc7</tt> <tt>#15 [ffff81003fe6bf20] handle_IRQ_event at ffffffff80010704</tt> <tt>#16 [ffff81003fe6bf50] __do_IRQ at ffffffff800b5238</tt> <tt>#17 [ffff81003fe6bf58] __do_softirq at ffffffff80011c0b</tt> <tt>#18 [ffff81003fe6bf90] do_IRQ at ffffffff8006a762</tt> <tt>--- <IRQ stack> ---</tt> <tt>#19 [ffff81003fe65e70] ret_from_intr at ffffffff8005bac9</tt> <tt> [exception RIP: cpu_idle+149]</tt> <tt> RIP: ffffffff800473be RSP: ffffffff8042e220 RFLAGS: ffffffff80074188</tt> <tt> RAX: ffffffffffffff16 RBX: 0000000000000000 RCX: ffffffff800553b7</tt> <tt> RDX: 0000000000000010 RSI: 0000000000000246 RDI: ffff81003fe65ef0</tt> <tt> RBP: ffff81003fe64000 R8: ffffffff8034e838 R9: 0000000000000001</tt> <tt> R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000003f</tt> <tt> R13: ffff810037d0c008 R14: 0000000000000246 R15: 0000000000000001</tt> <tt> ORIG_RAX: 0000000000000018 CS: 0020 SS: 0000</tt> <tt>bt: WARNING: possibly bogus exception frame</tt> <tt>crash></tt><tt></tt> <tt>crash dutifully reports that the exception frame looks bogus</tt> <tt>because of the CS value.</tt><tt></tt> <tt>The end of the per-cpu interrupt stack looks like this:</tt><tt></tt> <tt>crash> rd -s ffff81003fe68000 2048</tt> <tt>...</tt> <tt>ffff81003fe6bf30: 000000000000e900 00000000000000e9</tt> <tt>ffff81003fe6bf40: ffff810037ca4bc0 irq_desc+59708</tt> <tt>ffff81003fe6bf50: __do_IRQ+164 __do_softirq+94</tt> <tt>ffff81003fe6bf60: 00000000000000e9 ffff81003fe65e48</tt> <tt>ffff81003fe6bf70: 00000000000000ff cpu_data+256</tt> <tt>ffff81003fe6bf80: 0000000000000100 cpu_core_map+32</tt> <tt>ffff81003fe6bf90: do_IRQ+231 ffff81003fe65e70</tt> <tt>ffff81003fe6bfa0: mwait_idle ffff81003fe65e70</tt> <tt>ffff81003fe6bfb0: ret_from_intr ffff81003fe65e70</tt> <tt>ffff81003fe6bfc0: 0000000000000000 0000000000000000</tt> <tt>ffff81003fe6bfd0: 0000000000000000 0000000000000000</tt> <tt>ffff81003fe6bfe0: 0000000000000000 0000000000000000</tt> <tt>ffff81003fe6bff0: 0000000000000000 0000000000000000</tt> <tt>crash></tt><tt></tt> <tt>...hence the supposed pointer to the generating exception frame</tt> <tt>is presumed to be ffff81003fe65e70 (which is bogus).</tt><tt></tt> <tt>Interestingly, though, is if I do an exception frame search</tt> <tt>for that task, I do find the "real" exception frame:</tt><tt></tt> <tt>crash> bt -e</tt> <tt>PID: 0 TASK: ffff81003fe48100 CPU: 1 COMMAND: "swapper"</tt><tt></tt> <tt> KERNEL-MODE EXCEPTION FRAME AT: ffff81003fe65e48</tt> <tt> RIP: ffffffff800553b7 RSP: ffff81003fe65ef0 RFLAGS: 00000246</tt> <tt> RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000</tt> <tt> RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff8034e838</tt> <tt> RBP: 0000000000000000 R8: ffff81003fe64000 R9: 000000000000003f</tt> <tt> R10: ffff810037d0c008 R11: 0000000000000246 R12: ffff810037fef040</tt> <tt> R13: 0000000000000001 R14: ffff81003fe482f0 R15: 0000000002245f5e</tt> <tt> ORIG_RAX: ffffffffffffff16 CS: 0010 SS: 0018</tt> <tt>crash> sym ffffffff800553b7</tt> <tt>ffffffff800553b7 (t) mwait_idle+0x36 include/asm/thread_info.h: 63</tt> <tt>crash></tt><tt></tt> <tt>and if I just grep for that address within the per-cpu interrupt</tt> <tt>stack, I see several refernces to it:</tt><tt></tt> <tt>crash> rd -s ffff81003fe68000 2048 | grep ffff81003fe65e48</tt> <tt>ffff81003fe6bb30: ffff81003ac7c000 ffff81003fe65e48</tt> <tt>ffff81003fe6bc60: 00ffffff8001c1fd ffff81003fe65e48</tt> <tt>ffff81003fe6bce0: ffff81003ec68000 ffff81003fe65e48</tt> <tt>ffff81003fe6bd50: ffff81003fe65e48 0000000180087720</tt> <tt>ffff81003fe6bda0: ffff810037cb7400 ffff81003fe65e48</tt> <tt>ffff81003fe6bdb0: ffff810037cb7550 ffff81003fe65e48</tt> <tt>ffff81003fe6be60: ffff81003cf37488 ffff81003fe65e48</tt> <tt>ffff81003fe6bec0: ffff81003fe65e48 ffff81003fe65e48</tt> <tt>ffff81003fe6bed0: uhci_irq+0x13f ffff81003fe65e48</tt> <tt>ffff81003fe6bf00: ffff81003fe65e48 ffff81003fe65e48</tt> <tt>ffff81003fe6bf60: 00000000000000e9 ffff81003fe65e48</tt> <tt>crash></tt><tt></tt> <tt>But none of them are located at the "fixed" location of one</tt> <tt>word below the 64-byte block at the top of the interrupt</tt> <tt>stack.</tt><tt></tt> <tt>So I don't know what's going on in this case...</tt><tt></tt> <tt>I don't ever recall seeing such a bogus interrupt-to-process</tt> <tt>stack transition on any netdump or diskdump generated vmcores.</tt> <tt>And all the "test" kdump kernel dumpfiles I've used have only</tt> <tt>been generated by using "echo c > /proc/sysrq-trigger", so the</tt> <tt>crash path never went off the process stack.</tt><tt></tt> <tt>So without blatantly pointing the finger, I wonder whether there's</tt> <tt>something that the kexec/kdump code path does that could possibly</tt> <tt>be tinkering with the contents of the interrupt stack?</tt><tt></tt> <tt>I also want to try to force another crash but with all cpus</tt> <tt>forcibly running something other than the idle task, in case</tt> <tt>there's something strange about the interrupt bringing the</tt> <tt>cpu out of that "mwait" instruction? Grasping at straws a</tt> <tt>bit here...</tt><tt></tt> <tt>Also, that's why I'm always asking for back-trace tests that</tt> <tt>do something "real" -- instead of just having the kernel</tt> <tt>call panic(), or that do a sys_write() to /proc/sysrq-trigger</tt> <tt>to force an oops on the process stack. At least an alt-sysrq-c</tt> <tt>on the console keyboard generates an interrupt, as does your</tt> <tt>forced jprobes deal...</tt><tt></tt> <tt>BTW, I also note the the reading of the module unwind tables</tt> <tt>is reading invalid data, because it's being done before the</tt> <tt>non-unity-mapped address translation can even work! In other</tt> <tt>words, vmalloc addresses can only be read after vm_init() is</tt> <tt>complete. So I've added another machdep_init() argument</tt> <tt>(POST_VM) that is called just after vm_init(), and in the</tt> <tt>case of x86_64 (and x86), can call init_unwind_table().</tt><tt></tt> <tt>Thanks,</tt> <tt> Dave</tt> <tt></tt> </html>