[Crash-utility] arm64: odd backtrace?

Fri Jun 3 12:22:47 UTC 2016

On Thu, Jun 02, 2016 at 10:52:28AM -0400, Dave Anderson wrote:
> 
> ----- Original Message -----
> > Dave,
> > 
> > When I ran "bt" against a process running in user mode, I got
> > an odd backtrace result:
> > ===8<===
> > crash> ps
> >    ...
> > >  1324   1223   2  ffff80002018be80  RU   0.0     960    468  dhry
> >    1325      2   1  ffff800021089900  IN   0.0       0      0  [kworker/u16:0]
> > crash> bt 1324
> > PID: 1324   TASK: ffff80002018be80  CPU: 2   COMMAND: "dhry"
> > ffff800022f6ae08: ffff00000812ae44 (crash_save_cpu on IRQ stack)
> >  #0 [ffff800022f6ae10] crash_save_cpu at ffff00000812ae44
> >  #1 [ffff800022f6ae60] handle_IPI at ffff00000808e718
> >  #2 [ffff800022f6b020] gic_handle_irq at ffff0000080815f8
> >  #3 [ffff800022f6b050] el0_irq_naked at ffff000008084c4c
> > pt_regs: ffff800022f6af60
> >      PC: ffffffffffffffff  [unknown or invalid address]
> >      LR: ffff800020107ed0  [unknown or invalid address]
> >      SP: 0000000000000000  PSTATE: 004016a4
> >     X29: ffff000008084c4c  X28: ffff800022f6b080  X27: ffff000008e60c54
> >     X26: ffff800020107ed0  X25: 0000000000001fff  X24: 0000000000000003
> >     X23: ffff0000080815f8  X22: ffff800022f6b040  X21: 0000000000000000
> >     X20: ffff000008bce000  X19: ffff00000808e758  X18: ffff800022f6b010
> >     X17: ffff00000808a820  X16: ffff800022f6aff0  X15: 0000000000000000
> >     X14: 0000000000000000  X13: 0000000000000000  X12: 0000000000402138
> >     X11: ffff000008675850  X10: ffff800022f6afe0   X9: 0000000000000000
> >      X8: ffff800022f6afc0   X7: 0000000000000000   X6: 0000000000000000
> >      X5: 0000000000000000   X4: 0000000000000001   X3: 0000000000000000
> >      X2: 0000000000493000   X1: 0000000000498000   X0: ffffffffffffffff
> >     ORIG_X0: 0000000020000000  SYSCALLNO: 4021f0
> > bt: WARNING: arm64_unwind_frame: on IRQ stack: oriq_sp: ffff800020107ed0 fp: 0 (?)
> > pt_regs: ffff800020107ed0
> >      PC: 00000000004016a4   LR: 00000000004016a4   SP: 0000ffffc10c40a0
> >     X29: 0000ffffc10c40a0  X28: 0000000000000000  X27: 0000000000000000
> >     X26: 0000000000000000  X25: 0000000000402138  X24: 00000000004021f0
> >     X23: 0000000000000000  X22: 0000000000000000  X21: 00000000004001a0
> >     X20: 0000000000000000  X19: 0000000000000000  X18: 0000000000000000
> >     X17: 0000000000000001  X16: 0000000000000000  X15: 0000000000493000
> >     X14: 0000000000498000  X13: ffffffffffffffff  X12: 0000000000000005
> >     X11: 000000000000001e  X10: 0101010101010101   X9: fffffffff59a9190
> >      X8: 7f7f7f7f7f7f7f7f   X7: 1f535226301f2b4c   X6: 00000003001d1000
> >      X5: 00101d0003000000   X4: 0000000000000000   X3: 4952545320454d4f
> >      X2: 0000000010c35b40   X1: 0000000000000011   X0: 0000000010c35b40
> >     ORIG_X0: 0000000000498700  SYSCALLNO: ffffffffffffffff  PSTATE: 20000000
> > ===>8===
> > 
> > * PC, LR and SP look wrong.
> >   I don't know how those pt_regs values were derived.
> > * The message, "WARNING: arm64_unwind_frame: on IRQ stack: oriq_sp:
> >   ffff800020107ed0 fp: 0 (?)" should be refined.
> >   Apparently, in this case, the process is running in user mode,
> >   and so there is no normal kernel stack.
> 
> Support for IRQ stacks was only recently put in place in crash-7.1.5,
> and obviously backtraces for a task that was running in user space at the
> time of the crash are not working correctly.  Unfortunately the only test
> kdump I have on hand only has IRQ stack transitions from kernel space.  I
> tried to create a kdump from a system running user-space commands on our
> 4.5.0-based kernel, but as luck would have it, kdump fails to work.  (It
> never even reaches the secondary kernel for some reason, even though the
> kdump facility says it's functional.)
>   
> Obviously there's a problem in arm64_unwind_frame() when it tries to make the
> transition: it returns FALSE because the fp is NULL, and therefore
> INSTACK(frame->fp, bt) fails.  The function is trying to emulate the kernel's
> unwind_frame() function, which also would return -EINVAL because of the fp.
> But I'm not sure whether that fp value has been set correctly, because of the
> first, seemingly bogus, exception frame that it's showing.
> 
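
For reference, here is a minimal sketch of the kind of frame-pointer step
described above, showing why a NULL fp stops the unwind. It is not the code
from crash or from the kernel: struct stack_region, struct frame, in_stack()
and unwind_step() are hypothetical names, and the stack is modelled as a
plain buffer copied out of the dump. It assumes the usual arm64 frame
record layout, with the saved fp at [fp] and the saved lr at [fp + 8].

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one stack: addresses [base, base + size). */
struct stack_region {
    uint64_t base;
    size_t size;
    const uint8_t *data;   /* stack contents copied from the dump */
};

struct frame {
    uint64_t fp, sp, pc;
};

/* INSTACK()-like test: a 16-byte, 16-byte-aligned frame record
 * starting at addr must lie entirely inside the region. */
static int in_stack(const struct stack_region *s, uint64_t addr)
{
    if (s->size < 16 || (addr & 0xf))
        return 0;
    return addr >= s->base && addr - s->base <= s->size - 16;
}

/* One unwind step. Returns 1 on success, 0 when fp is not a usable
 * frame record on this stack (fp == 0 always ends up here). */
static int unwind_step(const struct stack_region *s, struct frame *f)
{
    uint64_t record[2];

    if (!in_stack(s, f->fp))
        return 0;

    memcpy(record, s->data + (f->fp - s->base), sizeof(record));
    f->sp = f->fp + 16;    /* stack pointer just above the record */
    f->fp = record[0];     /* caller's frame pointer */
    f->pc = record[1];     /* saved link register */
    return 1;
}
```

With a check of this shape, an fp of 0 read from a bogus exception frame
fails the range test on any stack, so the caller has to decide separately
how to cross from the IRQ stack to the process stack.
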
> As you have seen, kernel space exceptions look like this, where the fp, sp and pc
> values are legitimate, so it prints "--- <IRQ stack> ---", and transitions to the
> exception frame on the process stack:
>  
>   crash> set debug 1
>   debug: 1
>   crash> bt
>   PID: 0      TASK: fffffe035b0aae00  CPU: 3   COMMAND: "swapper/3"
>   fffffe03fe183d58: fffffe0000137ee4 (crash_save_cpu on IRQ stack)
>    #0 [fffffe03fe183d60] crash_save_cpu at fffffe0000137ee4
>    #1 [fffffe03fe183dc0] handle_IPI at fffffe000008e8d4
>    #2 [fffffe03fe183f80] gic_handle_irq at fffffe00000824c8
>    #3 [fffffe03fe183fd0] el1_irq at fffffe0000083520
>   bt: arm64_unwind_frame: switch stacks: fp: fffffe035b0f3f30 sp: fffffe035b0f3e10  pc: fffffe000008611c
>   --- <IRQ stack> ---
>   pt_regs: fffffe035b0f3e10
>        PC: fffffe000008611c  [arch_cpu_idle+60]
>        LR: fffffe0000086118  [arch_cpu_idle+56]
>        SP: fffffe035b0f3f30  PSTATE: 60000145
>       X29: fffffe035b0f3f30  X28: 0000000000000000  X27: fffffe0000084170
>       X26: fffffe0000bf13dc  X25: fffffe0000cf4000  X24: fffffe035b0f0000
>       X23: 0000000000000001  X22: fffffe0000b94c48  X21: 0000000000000003
>       X20: fffffe0000cf6000  X19: fffffe0000cf6028  X18: 000002aabb090050
>       X17: 000003ff9131a228  X16: fffffe000026dba4  X15: 00000000000000bf
>       X14: 004894597490a924  X13: 0000000000000000  X12: 0000000000000010
>       X11: 0000000000000067  X10: 0000000000000ab0   X9: fffffe035b0f0000
>        X8: fffffe035b0ab910   X7: 0000000000007b17   X6: 000000000001c690
>        X5: 0000001515d0302c   X4: 0100000000000000   X3: fffffe03fe184c8c
>        X2: fffffe03fe184c80   X1: 0000000000000000   X0: fffffe035b0f0000
>       ORIG_X0: fffffe035b0f0000  SYSCALLNO: fffffe0000b94c48
>    #4 [fffffe035b0f3e10] arch_cpu_idle at fffffe000008611c
>    #5 [fffffe035b0f3f40] default_idle_call at fffffe00000f81cc
>    #6 [fffffe035b0f3f70] cpu_startup_entry at fffffe00000f8320
>    #7 [fffffe035b0f3f80] secondary_start_kernel at fffffe000008e338
>   crash>
> 
> In your sample, it certainly doesn't appear that the first exception frame found
> on the IRQ stack is legitimate, and probably should not pass the test in 
> arm64_is_kernel_exception_frame(), but it does:
> 
> > crash> bt 1324
> > PID: 1324   TASK: ffff80002018be80  CPU: 2   COMMAND: "dhry"
> > ffff800022f6ae08: ffff00000812ae44 (crash_save_cpu on IRQ stack)
> >  #0 [ffff800022f6ae10] crash_save_cpu at ffff00000812ae44
> >  #1 [ffff800022f6ae60] handle_IPI at ffff00000808e718
> >  #2 [ffff800022f6b020] gic_handle_irq at ffff0000080815f8
> >  #3 [ffff800022f6b050] el0_irq_naked at ffff000008084c4c
> > pt_regs: ffff800022f6af60
> >      PC: ffffffffffffffff  [unknown or invalid address]
> >      LR: ffff800020107ed0  [unknown or invalid address]
> >      SP: 0000000000000000  PSTATE: 004016a4
> >     X29: ffff000008084c4c  X28: ffff800022f6b080  X27: ffff000008e60c54
> >     X26: ffff800020107ed0  X25: 0000000000001fff  X24: 0000000000000003
> >     X23: ffff0000080815f8  X22: ffff800022f6b040  X21: 0000000000000000
> >     X20: ffff000008bce000  X19: ffff00000808e758  X18: ffff800022f6b010
> >     X17: ffff00000808a820  X16: ffff800022f6aff0  X15: 0000000000000000
> >     X14: 0000000000000000  X13: 0000000000000000  X12: 0000000000402138
> >     X11: ffff000008675850  X10: ffff800022f6afe0   X9: 0000000000000000
> >      X8: ffff800022f6afc0   X7: 0000000000000000   X6: 0000000000000000
> >      X5: 0000000000000000   X4: 0000000000000001   X3: 0000000000000000
> >      X2: 0000000000493000   X1: 0000000000498000   X0: ffffffffffffffff
> >     ORIG_X0: 0000000020000000  SYSCALLNO: 4021f0
> 
> Maybe that is the cause of the bogus "fp"?  Anyway, since the orig_sp is
> from a fixed location at the top of the IRQ stack, it then manages to make its
> way back to the "dhry" process stack, where this exception frame "looks" legitimate:
> 
> > bt: WARNING: arm64_unwind_frame: on IRQ stack: oriq_sp: ffff800020107ed0 fp: 0 (?)
> > pt_regs: ffff800020107ed0
> >      PC: 00000000004016a4   LR: 00000000004016a4   SP: 0000ffffc10c40a0
> >     X29: 0000ffffc10c40a0  X28: 0000000000000000  X27: 0000000000000000
> >     X26: 0000000000000000  X25: 0000000000402138  X24: 00000000004021f0
> >     X23: 0000000000000000  X22: 0000000000000000  X21: 00000000004001a0
> >     X20: 0000000000000000  X19: 0000000000000000  X18: 0000000000000000
> >     X17: 0000000000000001  X16: 0000000000000000  X15: 0000000000493000
> >     X14: 0000000000498000  X13: ffffffffffffffff  X12: 0000000000000005
> >     X11: 000000000000001e  X10: 0101010101010101   X9: fffffffff59a9190
> >      X8: 7f7f7f7f7f7f7f7f   X7: 1f535226301f2b4c   X6: 00000003001d1000
> >      X5: 00101d0003000000   X4: 0000000000000000   X3: 4952545320454d4f
> >      X2: 0000000010c35b40   X1: 0000000000000011   X0: 0000000010c35b40
> >     ORIG_X0: 0000000000498700  SYSCALLNO: ffffffffffffffff  PSTATE: 20000000
> 
> But I'm not sure what happens when an arm64 IRQ exception occurs while
> the task is running in user space.  Does it lay an exception frame down on the
> process stack and then make the transition?  (And is the user-space frame
> above therefore legitimate?)  Or does the user-space frame get laid down
> directly on the IRQ stack?  Unfortunately I don't know enough about arm64
> exception handling.
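
One cheap check on whether a dumped pt_regs describes a user-space
context is to decode the mode field of its PSTATE value. The sketch below
does that; the constants mirror the arm64 PSR_MODE_* definitions, while
pstate_is_user() and pstate_mode_name() are hypothetical helper names, not
functions from crash.

```c
#include <stdint.h>

/* AArch64 PSTATE mode field, M[3:0]. */
#define PSR_MODE_EL0t 0x0u
#define PSR_MODE_EL1t 0x4u
#define PSR_MODE_EL1h 0x5u
#define PSR_MODE_MASK 0xfu

/* Non-zero when the saved context was running at EL0 (user mode). */
static int pstate_is_user(uint64_t pstate)
{
    return (pstate & PSR_MODE_MASK) == PSR_MODE_EL0t;
}

/* Name of the mode field, for display purposes only. */
static const char *pstate_mode_name(uint64_t pstate)
{
    switch (pstate & PSR_MODE_MASK) {
    case PSR_MODE_EL0t: return "EL0t";
    case PSR_MODE_EL1t: return "EL1t";
    case PSR_MODE_EL1h: return "EL1h";
    default:            return "other";
    }
}
```

Applied to the dumps in this thread, PSTATE 20000000 (the frame at
ffff800020107ed0) decodes as EL0t, i.e. user mode, and 60000145 (the
swapper/3 frame) decodes as EL1h. The suspicious first frame's PSTATE,
004016a4, is the same value as the PC in the user-space frame, so its mode
bits probably mean nothing.
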

Since I reviewed this IRQ stack patch on LAK-ML, I should be able to help you,
but I don't have enough time to explain it in detail this week.

> In any case, the bt should display "--- <IRQ stack> ---", and then dump
> the user-to-kernel-space exception frame, wherever it lies, i.e., either on the
> normal process stack or (maybe?) on the IRQ stack.
> 
> Anyway, can you make the vmlinux/vmcore pair available for me to download?  You can
> send the details to me offline.

I sent you a message which contains the link to those binaries.

Thanks,
-Takahiro AKASHI

> Thanks,
>   Dave