[Crash-utility] [PATCH] ppc64: fix 'bt' command for vmcore captured with fadump.

Hari Bathini hbathini at linux.vnet.ibm.com
Fri Jan 20 17:57:56 UTC 2017



On Thursday 19 January 2017 07:54 PM, Dave Anderson wrote:
>
> ----- Original Message -----
>>
>> On Thursday 19 January 2017 02:05 AM, Dave Anderson wrote:
>>> ----- Original Message -----
>>>> Without this patch, backtraces of active tasks may be of the form
>>>> "#0 [c0000000700b3a90] (null) at c0000000700b3b50  (unreliable)" for
>>>> kernel dumps captured with fadump.  Try to use the ptregs saved for
>>>> active tasks before falling back to the stack-search method. Also, get
>>>> rid of warnings like "‘is_hugepage’ declared inline after being called".
>>>>
>>>> Signed-off-by: Hari Bathini <hbathini at linux.vnet.ibm.com>
>>> Hari,
>>>
>>> I only have 1 sample vmcore generated by FADUMP, and I see that
>>> the backtraces of the non-panicking active tasks are an improvement
>>> given that they show the exception frame register set.  However, I also
>>> note that the panic task backtrace has changed, from this using the
>>> current method:
>>>
>>>     PID: 1913   TASK: c000000250472120  CPU: 5   COMMAND: "bash"
>>>      #0 [c000000255933620] .crash_fadump at c00000000002cbb8
>>>      #1 [c0000002559336c0] .die at c000000000030dc8
>>>      #2 [c000000255933770] .bad_page_fault at c000000000043748
>>>      #3 [c0000002559337f0] handle_page_fault at c000000000005228
>>>      Data Access [300] exception frame:
>>>      R0:  0000000000000001    R1:  c000000255933ae0    R2:  c000000000f27628
>>>      R3:  0000000000000063    R4:  0000000000000000    R5:  ffffffffffffffff
>>>      R6:  0000000000000070    R7:  00000000000020b8    R8:  000000001cbbfaa8
>>>      R9:  0000000000000000    R10: 0000000000000002    R11: c00000000039c590
>>>      R12: 0000000028242482    R13: c000000000ff3180    R14: 000000001012b3dc
>>>      R15: 0000000000000000    R16: 0000000000000000    R17: 0000000010129c58
>>>      R18: 0000000010129bf8    R19: 000000001012b948    R20: 0000000000000000
>>>      R21: 000000001012b3e4    R22: 0000000000000000    R23: c000000000e57788
>>>      R24: 0000000000000004    R25: c000000000e57928    R26: c000000000e37414
>>>      R27: 0000000000000000    R28: 0000000000000001    R29: 0000000000000063
>>>      R30: c000000000ec9208    R31: c000000001423aac
>>>      NIP: c00000000039c57c    MSR: 8000000000009032    OR3: c000000255933a20
>>>      CTR: c00000000039c560    LR:  c00000000039c8c8    XER: 0000000000000001
>>>      CCR: 0000000028242482    MQ:  0000000000000000    DAR: 0000000000000000
>>>      DSISR: 0000000042000000     Syscall Result: 0000000000000000
>>>      #4 [c000000255933ae0] .sysrq_handle_crash at c00000000039c57c
>>>      [Link Register] [c000000255933ae0] .__handle_sysrq at c00000000039c8c8
>>>      #5 [c000000255933ba0] .write_sysrq_trigger at c00000000039ca70
>>>      #6 [c000000255933c30] .proc_reg_write at c000000000244874
>>>      #7 [c000000255933ce0] .vfs_write at c0000000001c9dac
>>>      #8 [c000000255933d80] .sys_write at c0000000001c9fd8
>>>      #9 [c000000255933e30] syscall_exit at c000000000008564
>>>      System Call [c00] exception frame:
>>>      R0:  0000000000000004    R1:  00000fffec87b540    R2:  00000080cec13268
>>>      R3:  0000000000000001    R4:  00000fffa55a0000    R5:  0000000000000002
>>>      R6:  000000007fffffff    R7:  0000000000000000    R8:  0000000000000001
>>>      R9:  0000000000000000    R10: 0000000000000000    R11: 0000000000000000
>>>      R12: 0000000000000000    R13: 00000080cea0ce10    R14: 000000001012b3dc
>>>      R15: 0000000000000000    R16: 0000000000000000    R17: 0000000010129c58
>>>      R18: 0000000010129bf8    R19: 000000001012b948    R20: 0000000000000000
>>>      R21: 000000001012b3e4    R22: 000001003391c720    R23: 0000000000000000
>>>      R24: 0000000000000001    R25: 000000001012b3e0    R26: 00000fffec87b86c
>>>      R27: 00000fffec87b868    R28: 0000000000000002    R29: 00000080cec006a0
>>>      R30: 00000fffa55a0000    R31: 0000000000000002
>>>      NIP: 00000080ceb49548    MSR: 800000000000d032    OR3: 0000000000000001
>>>      CTR: 00000080cead9d50    LR:  00000080cead9db8    XER: 0000000000000000
>>>      CCR: 0000000044242424    MQ:  0000000000000001    DAR: 00000100339436b8
>>>      DSISR: 0000000042000000     Syscall Result: 0000000000000000
>>>     
>>> to this with your patch, where the exception backtrace is missing:
>>>
>>>     PID: 1913   TASK: c000000250472120  CPU: 5   COMMAND: "bash"
>>>      R0:  0000000000000001    R1:  c000000255933ae0    R2:  c000000000f27628
>>>      R3:  0000000000000063    R4:  0000000000000000    R5:  ffffffffffffffff
>>>      R6:  0000000000000070    R7:  00000000000020b8    R8:  000000001cbbfaa8
>>>      R9:  0000000000000000    R10: 0000000000000002    R11: c00000000039c590
>>>      R12: 0000000028242482    R13: c000000000ff3180    R14: 000000001012b3dc
>>>      R15: 0000000000000000    R16: 0000000000000000    R17: 0000000010129c58
>>>      R18: 0000000010129bf8    R19: 000000001012b948    R20: 0000000000000000
>>>      R21: 000000001012b3e4    R22: 0000000000000000    R23: c000000000e57788
>>>      R24: 0000000000000004    R25: c000000000e57928    R26: c000000000e37414
>>>      R27: 0000000000000000    R28: 0000000000000001    R29: 0000000000000063
>>>      R30: c000000000ec9208    R31: c000000001423aac
>>>      NIP: c00000000039c57c    MSR: 8000000000009032    OR3: c000000255933a20
>>>      CTR: c00000000039c560    LR:  c00000000039c8c8    XER: 0000000000000001
>>>      CCR: 0000000028242482    MQ:  0000000000000000    DAR: 0000000000000000
>>>      DSISR: 0000000042000000     Syscall Result: 0000000000000000
>>>      NIP [c00000000039c57c] .sysrq_handle_crash
>>>      LR  [c00000000039c8c8] .__handle_sysrq
>>>      #0 [c000000255933ae0] .__handle_sysrq at c00000000039c89c
>>>      #1 [c000000255933ba0] .write_sysrq_trigger at c00000000039ca70
>>>      #2 [c000000255933c30] .proc_reg_write at c000000000244874
>>>      #3 [c000000255933ce0] .vfs_write at c0000000001c9dac
>>>      #4 [c000000255933d80] .sys_write at c0000000001c9fd8
>>>      #5 [c000000255933e30] syscall_exit at c000000000008564
>>>      System Call [c00] exception frame:
>>>      R0:  0000000000000004    R1:  00000fffec87b540    R2:  00000080cec13268
>>>      R3:  0000000000000001    R4:  00000fffa55a0000    R5:  0000000000000002
>>>      R6:  000000007fffffff    R7:  0000000000000000    R8:  0000000000000001
>>>      R9:  0000000000000000    R10: 0000000000000000    R11: 0000000000000000
>>>      R12: 0000000000000000    R13: 00000080cea0ce10    R14: 000000001012b3dc
>>>      R15: 0000000000000000    R16: 0000000000000000    R17: 0000000010129c58
>>>      R18: 0000000010129bf8    R19: 000000001012b948    R20: 0000000000000000
>>>      R21: 000000001012b3e4    R22: 000001003391c720    R23: 0000000000000000
>>>      R24: 0000000000000001    R25: 000000001012b3e0    R26: 00000fffec87b86c
>>>      R27: 00000fffec87b868    R28: 0000000000000002    R29: 00000080cec006a0
>>>      R30: 00000fffa55a0000    R31: 0000000000000002
>>>      NIP: 00000080ceb49548    MSR: 800000000000d032    OR3: 0000000000000001
>>>      CTR: 00000080cead9d50    LR:  00000080cead9db8    XER: 0000000000000000
>>>      CCR: 0000000044242424    MQ:  0000000000000001    DAR: 00000100339436b8
>>>      DSISR: 0000000042000000     Syscall Result: 0000000000000000
>>>
>>>
>>>     
>>> And then on a rhel7 traditional KDUMP dumpfile, both the panic task and the
>>> non-panicking active tasks are missing the exception trace.  Here's a sample
>>> panic task backtrace using the current method:
>>>
>>>     PID: 32696  TASK: c0000001922ed5d0  CPU: 1   COMMAND: "runtest.sh"
>>>      #0 [c000000019823610] .crash_kexec at c0000000001725e0
>>>      #1 [c000000019823810] .die at c000000000020a48
>>>      #2 [c0000000198238c0] .bad_page_fault at c0000000000530d8
>>>      #3 [c000000019823940] handle_page_fault at c000000000009584
>>>      Data Access [300] exception frame:
>>>      R0:  c00000000055cf88    R1:  c000000019823c30    R2:  c00000000130a780
>>>      R3:  0000000000000063    R4:  c000000001845888    R5:  c0000000018564f8
>>>      R6:  0000000000005194    R7:  c0000000014b99a0    R8:  c000000000cca780
>>>      R9:  0000000000000001    R10: 0000000000000000    R11: 000000000000012f
>>>      R12: 0000000048222842    R13: c000000007b80900    R14: 0000000010142550
>>>      R15: 0000000040000000    R16: 0000000010143cdc    R17: 0000000000000000
>>>      R18: 00000000101306fc    R19: 00000000101424dc    R20: 00000000101424e0
>>>      R21: 000000001013c6f0    R22: 000000001013c970    R23: 0000000000000000
>>>      R24: 0000000000000001    R25: 0000000000000007    R26: c00000000120b170
>>>      R27: 0000000000000063    R28: c000000001709c98    R29: c00000000120b530
>>>      R30: c0000000011d8fa0    R31: 0000000000000002
>>>      NIP: c00000000055c3f8    MSR: 8000000000009032    OR3: c000000000009358
>>>      CTR: c00000000055c3e0    LR:  c00000000055cfac    XER: 0000000000000001
>>>      CCR: 0000000048222822    MQ:  0000000000000000    DAR: 0000000000000000
>>>      DSISR: 0000000042000000     Syscall Result: 0000000000000000
>>>      #4 [c000000019823c30] .sysrq_handle_crash at c00000000055c3f8
>>>      [Link Register] [c000000019823c30] .write_sysrq_trigger at c00000000055cfac
>>>      #5 [c000000019823cf0] .proc_reg_write at c00000000037d120
>>>      #6 [c000000019823d80] .sys_write at c0000000002d68e4
>>>      #7 [c000000019823e30] syscall_exit at c00000000000a17c
>>>      System Call [c00] exception frame:
>>>      R0:  0000000000000004    R1:  00003fffc7738e00    R2:  00003fffb4163cc0
>>>      R3:  0000000000000001    R4:  00003fffad680000    R5:  0000000000000002
>>>      R6:  0000000000000010    R7:  0000000000000000    R8:  0000000000000000
>>>      R9:  0000000000000000    R10: 0000000000000000    R11: 0000000000000000
>>>      R12: 0000000000000000    R13: 00003fffb426c330    R14: 0000000010142550
>>>      R15: 0000000040000000    R16: 0000000010143cdc    R17: 0000000000000000
>>>      R18: 00000000101306fc    R19: 00000000101424dc    R20: 00000000101424e0
>>>      R21: 000000001013c6f0    R22: 000000001013c970    R23: 0000000000000000
>>>      R24: 0000000010143ce0    R25: 00000000100f65d0    R26: 00000100277ffa20
>>>      R27: 0000000000000001    R28: 0000000000000002    R29: 00003fffb4151108
>>>      R30: 00003fffad680000    R31: 0000000000000002
>>>      NIP: 00003fffb408a120    MSR: 800000000280f032    OR3: 0000000000000001
>>>      CTR: 0000000000000000    LR:  00003fffb4015704    XER: 0000000000000000
>>>      CCR: 0000000048222882    MQ:  0000000000000001    DAR: 00003fffad680000
>>>      DSISR: 0000000042000000     Syscall Result: 0000000000000000
>>>
>>> And here it is with your patch:
>>>
>>>     PID: 32696  TASK: c0000001922ed5d0  CPU: 1   COMMAND: "runtest.sh"
>>>      R0:  c00000000055cf88    R1:  c000000019823c30    R2:  c00000000130a780
>>>      R3:  0000000000000063    R4:  c000000001845888    R5:  c0000000018564f8
>>>      R6:  0000000000005194    R7:  c0000000014b99a0    R8:  c000000000cca780
>>>      R9:  0000000000000001    R10: 0000000000000000    R11: 000000000000012f
>>>      R12: 0000000048222842    R13: c000000007b80900    R14: 0000000010142550
>>>      R15: 0000000040000000    R16: 0000000010143cdc    R17: 0000000000000000
>>>      R18: 00000000101306fc    R19: 00000000101424dc    R20: 00000000101424e0
>>>      R21: 000000001013c6f0    R22: 000000001013c970    R23: 0000000000000000
>>>      R24: 0000000000000001    R25: 0000000000000007    R26: c00000000120b170
>>>      R27: 0000000000000063    R28: c000000001709c98    R29: c00000000120b530
>>>      R30: c0000000011d8fa0    R31: 0000000000000002
>>>      NIP: c00000000055c3f8    MSR: 8000000000009032    OR3: c000000000009358
>>>      CTR: c00000000055c3e0    LR:  c00000000055cfac    XER: 0000000000000001
>>>      CCR: 0000000048222822    MQ:  0000000000000000    DAR: 0000000000000000
>>>      DSISR: 0000000042000000     Syscall Result: 0000000000000000
>>>      NIP [c00000000055c3f8] .sysrq_handle_crash
>>>      LR  [c00000000055cfac] .write_sysrq_trigger
>>>      #0 [c000000019823c30] .write_sysrq_trigger at c00000000055cf88
>>>      #1 [c000000019823cf0] .proc_reg_write at c00000000037d120
>>>      #2 [c000000019823d80] .sys_write at c0000000002d68e4
>>>      #3 [c000000019823e30] syscall_exit at c00000000000a17c
>>>      System Call [c00] exception frame:
>>>      R0:  0000000000000004    R1:  00003fffc7738e00    R2:  00003fffb4163cc0
>>>      R3:  0000000000000001    R4:  00003fffad680000    R5:  0000000000000002
>>>      R6:  0000000000000010    R7:  0000000000000000    R8:  0000000000000000
>>>      R9:  0000000000000000    R10: 0000000000000000    R11: 0000000000000000
>>>      R12: 0000000000000000    R13: 00003fffb426c330    R14: 0000000010142550
>>>      R15: 0000000040000000    R16: 0000000010143cdc    R17: 0000000000000000
>>>      R18: 00000000101306fc    R19: 00000000101424dc    R20: 00000000101424e0
>>>      R21: 000000001013c6f0    R22: 000000001013c970    R23: 0000000000000000
>>>      R24: 0000000010143ce0    R25: 00000000100f65d0    R26: 00000100277ffa20
>>>      R27: 0000000000000001    R28: 0000000000000002    R29: 00003fffb4151108
>>>      R30: 00003fffad680000    R31: 0000000000000002
>>>      NIP: 00003fffb408a120    MSR: 800000000280f032    OR3: 0000000000000001
>>>      CTR: 0000000000000000    LR:  00003fffb4015704    XER: 0000000000000000
>>>      CCR: 0000000048222882    MQ:  0000000000000001    DAR: 00003fffad680000
>>>      DSISR: 0000000042000000     Syscall Result: 0000000000000000
>>>
>>> And from the same kdump, here's a non-panicking active task with the
>>> current way of doing things:
>>>
>>>     PID: 0      TASK: c000000001241c00  CPU: 0   COMMAND: "swapper/0"
>>>      #0 [c0000001dffdfb90] .crash_ipi_callback at c00000000004fd44
>>>      #1 [c0000001dffdfc20] .smp_ipi_demux at c000000000046bf8
>>>      #2 [c0000001dffdfcb0] .icp_hv_ipi_action at c000000000073454
>>>      #3 [c0000001dffdfd30] .handle_irq_event_percpu at c0000000001afaa4
>>>      #4 [c0000001dffdfe10] .handle_percpu_irq at c0000000001b526c
>>>      #5 [c0000001dffdfe90] .generic_handle_irq at c0000000001aed1c
>>>      #6 [c0000001dffdff10] .__do_irq at c000000000010d44
>>>      #7 [c0000001dffdff90] .call_do_irq at c000000000023f60
>>>      #8 [c00000000130b7e0] .do_IRQ at c000000000010eec
>>>      #9 [c00000000130b880] hardware_interrupt_common at c000000000002614
>>>      Hardware Interrupt [501] exception frame:
>>>      R0:  0000000000000000    R1:  c00000000130bb70    R2:  c00000000130a780
>>>      R3:  0000000000000000    R4:  0000000000000000    R5:  800000000bb71120
>>>      R6:  800000000bb844f8    R7:  0000000000000000    R8:  0000000000000000
>>>      R9:  0000000000000040    R10: 0000000000000000    R11: 000000005f9c862a
>>>      R12: 0000000000000000    R13: c000000007b80000
>>>      NIP: c0000000000849b4    MSR: 8000000000009032    OR3: 0000000000000c00
>>>      CTR: 0000000000000000    LR:  c000000000710070    XER: 0000000000000000
>>>      CCR: 0000000024002084    MQ:  0000000000000001    DAR: c000000001818380
>>>      DSISR: c000000000157684     Syscall Result: 0000000000000000
>>>     #10 [c00000000130bb70] .plpar_hcall_norets at c0000000000849b4
>>>     [Link Register] [c00000000130bb70] .shared_cede_loop at c000000000710070
>>>     #11 [c00000000130bbf0] .cpuidle_idle_call at c00000000070d9b4
>>>     #12 [c00000000130bcc0] .pseries_lpar_idle at c0000000000872f0
>>>     #13 [c00000000130bd30] .arch_cpu_idle at c000000000017b44
>>>     #14 [c00000000130bdb0] .cpu_startup_entry at c000000000149b10
>>>     #15 [c00000000130be80] .rest_init at c00000000000c5f4
>>>     #16 [c00000000130bef0] .start_kernel at c000000000c34258
>>>     #17 [c00000000130bf90] start_here_common at c000000000009b6c
>>>
>>> and here with your patch applied:
>>>
>>>     PID: 0      TASK: c000000001241c00  CPU: 0   COMMAND: "swapper/0"
>>>      R0:  0000000000000000    R1:  c00000000130bb70    R2:  c00000000130a780
>>>      R3:  0000000000000000    R4:  0000000000000000    R5:  800000000bb71120
>>>      R6:  800000000bb844f8    R7:  0000000000000000    R8:  0000000000000000
>>>      R9:  0000000000000040    R10: 0000000000000000    R11: 000000005f9c862a
>>>      R12: 0000000000000000    R13: c000000007b80000
>>>      NIP: c0000000000849b4    MSR: 8000000000009032    OR3: 0000000000000c00
>>>      CTR: 0000000000000000    LR:  c000000000710070    XER: 0000000000000000
>>>      CCR: 0000000024002084    MQ:  0000000000000001    DAR: c000000001818380
>>>      DSISR: c000000000157684     Syscall Result: 0000000000000000
>>>      NIP [c0000000000849b4] .plpar_hcall_norets
>>>      LR  [c000000000710070] .shared_cede_loop
>>>      #0 [c00000000130bb70] (null) at 3  (unreliable)
>>>      #1 [c00000000130bbf0] .cpuidle_idle_call at c00000000070d9b4
>>>      #2 [c00000000130bcc0] .pseries_lpar_idle at c0000000000872f0
>>>      #3 [c00000000130bd30] .arch_cpu_idle at c000000000017b44
>>>      #4 [c00000000130bdb0] .cpu_startup_entry at c000000000149b10
>>>      #5 [c00000000130be80] .rest_init at c00000000000c5f4
>>>      #6 [c00000000130bef0] .start_kernel at c000000000c34258
>>>      #7 [c00000000130bf90] start_here_common at c000000000009b6c
>>>
>>> Is that what you really want?
>>>
>>> It would be unfortunate to lose all of that exception information, both
>>> for the panic and for all of the non-panicking active tasks.
>> Hi Dave,
>>
>> Unfortunate, yes. But I think the exception information we would lose
>> relates to crash_ipi_callback, crash_kexec, crash_fadump or similar
>> functions, which may not be significant for debugging?
>> At least, that was the assumption with which I posted this patch.
> While it is true in the case of crash IPI callbacks, they are legitimate
> parts of the trace, and it's worth "exercising" that backtrace path.  Have
> you tested a crash that actually occurred while running on the hard or
> soft IRQ stack?
>
> Also, the exception frame doesn't even show the [bracketed] type of exception
> that occurred -- it's just a register dump followed by the remainder of the
> backtrace.  Upon a quick glance, it's not obvious that they are even active
> tasks.  And traditionally, all of the other architectures have always dumped
> a full trace.
>
> I'm not sure what the mechanism is for shutting down the non-active
> FADUMP tasks, so that's why I asked if you could restrict this change
> to just those types of dumps.  (For that matter, is it even possible to
> differentiate a real kdump from an FADUMP dumpfile --  aside from a

Hi Dave,

A kdump dumpfile cannot be told apart from a fadump dumpfile, except that with
fadump the stack search invariably fails and the ptregs are guaranteed to have
been saved by firmware. I have posted a v2 that changes bt output only for
active tasks in the fadump case.
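
To make the intended ordering concrete, here is a minimal sketch in C. It is
not the v2 patch and does not use the crash utility's real types or functions;
every name in it (regs_sketch, task_sketch, choose_bt_start and the enum
values) is hypothetical. It assumes the stack search is tried first, and the
saved ptregs are used only when that search fails for an active task.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a saved register set (not crash's pt_regs). */
struct regs_sketch {
    unsigned long long nip;  /* next instruction pointer */
    unsigned long long sp;   /* stack pointer (R1) */
};

/* Hypothetical stand-in for the per-task state the decision depends on. */
struct task_sketch {
    bool is_active;                   /* was running on a CPU at dump time */
    const struct regs_sketch *saved;  /* saved ptregs, or NULL if none */
};

/* Where the backtrace would start from. */
enum bt_start {
    BT_FROM_STACK_SEARCH,  /* frame located by searching the stack */
    BT_FROM_PTREGS,        /* frame taken from the saved ptregs */
    BT_FROM_TASK_DEFAULT   /* neither available; use the task's default state */
};

/*
 * Assumed policy: keep the existing output whenever the stack search
 * succeeds (the kdump case), and fall back to the saved ptregs only for
 * an active task whose stack search failed (the fadump case).
 */
static enum bt_start
choose_bt_start(const struct task_sketch *t, bool stack_search_ok)
{
    if (stack_search_ok)
        return BT_FROM_STACK_SEARCH;
    if (t->is_active && t->saved != NULL)
        return BT_FROM_PTREGS;
    return BT_FROM_TASK_DEFAULT;
}
```

With this ordering, a dump in which the stack search succeeds keeps its
current backtrace, and only an active task with saved ptregs and a failed
search starts from the register set.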

Thanks
Hari



