[Crash-utility] arm64: odd backtrace?

Dave Anderson anderson at redhat.com
Thu Jun 2 14:52:28 UTC 2016



----- Original Message -----
> Dave,
> 
> When I ran "bt" against a process running in a user mode, I got
> an odd backtrace result:
> ===8<===
> crash> ps
>    ...
> >  1324   1223   2  ffff80002018be80  RU   0.0     960    468  dhry
>    1325      2   1  ffff800021089900  IN   0.0       0      0  [kworker/u16:0]
> crash> bt 1324
> PID: 1324   TASK: ffff80002018be80  CPU: 2   COMMAND: "dhry"
> ffff800022f6ae08: ffff00000812ae44 (crash_save_cpu on IRQ stack)
>  #0 [ffff800022f6ae10] crash_save_cpu at ffff00000812ae44
>  #1 [ffff800022f6ae60] handle_IPI at ffff00000808e718
>  #2 [ffff800022f6b020] gic_handle_irq at ffff0000080815f8
>  #3 [ffff800022f6b050] el0_irq_naked at ffff000008084c4c
> pt_regs: ffff800022f6af60
>      PC: ffffffffffffffff  [unknown or invalid address]
>      LR: ffff800020107ed0  [unknown or invalid address]
>      SP: 0000000000000000  PSTATE: 004016a4
>     X29: ffff000008084c4c  X28: ffff800022f6b080  X27: ffff000008e60c54
>     X26: ffff800020107ed0  X25: 0000000000001fff  X24: 0000000000000003
>     X23: ffff0000080815f8  X22: ffff800022f6b040  X21: 0000000000000000
>     X20: ffff000008bce000  X19: ffff00000808e758  X18: ffff800022f6b010
>     X17: ffff00000808a820  X16: ffff800022f6aff0  X15: 0000000000000000
>     X14: 0000000000000000  X13: 0000000000000000  X12: 0000000000402138
>     X11: ffff000008675850  X10: ffff800022f6afe0   X9: 0000000000000000
>      X8: ffff800022f6afc0   X7: 0000000000000000   X6: 0000000000000000
>      X5: 0000000000000000   X4: 0000000000000001   X3: 0000000000000000
>      X2: 0000000000493000   X1: 0000000000498000   X0: ffffffffffffffff
>     ORIG_X0: 0000000020000000  SYSCALLNO: 4021f0
> bt: WARNING: arm64_unwind_frame: on IRQ stack: oriq_sp: ffff800020107ed0 fp: 0 (?)
> pt_regs: ffff800020107ed0
>      PC: 00000000004016a4   LR: 00000000004016a4   SP: 0000ffffc10c40a0
>     X29: 0000ffffc10c40a0  X28: 0000000000000000  X27: 0000000000000000
>     X26: 0000000000000000  X25: 0000000000402138  X24: 00000000004021f0
>     X23: 0000000000000000  X22: 0000000000000000  X21: 00000000004001a0
>     X20: 0000000000000000  X19: 0000000000000000  X18: 0000000000000000
>     X17: 0000000000000001  X16: 0000000000000000  X15: 0000000000493000
>     X14: 0000000000498000  X13: ffffffffffffffff  X12: 0000000000000005
>     X11: 000000000000001e  X10: 0101010101010101   X9: fffffffff59a9190
>      X8: 7f7f7f7f7f7f7f7f   X7: 1f535226301f2b4c   X6: 00000003001d1000
>      X5: 00101d0003000000   X4: 0000000000000000   X3: 4952545320454d4f
>      X2: 0000000010c35b40   X1: 0000000000000011   X0: 0000000010c35b40
>     ORIG_X0: 0000000000498700  SYSCALLNO: ffffffffffffffff  PSTATE: 20000000
> ===>8===
> 
> * PC, LR and SP look wrong.
>   I don't know how those pt_regs values were derived.
> * The message, "WARNING: arm64_unwind_frame: on IRQ stack: oriq_sp:
>   ffff800020107ed0 fp: 0 (?)" should be refined.
>   Apparently, in this case, the process is running in a user mode,
>   and so there is no normal kernel stack.

Support for IRQ stacks was only recently put in place in crash-7.1.5,
and obviously backtraces for a task that was running in user space at the
time of the crash are not working correctly.  Unfortunately the only test
kdump I have on hand only has IRQ stack transitions from kernel space.  I
tried to create a kdump from a system running user-space commands on our
4.5.0-based kernel, but as luck would have it, kdump fails to work.  (It never
even reaches the secondary kernel for some reason, even though the kdump
facility says it's functional.)
  
Obviously there's a problem in arm64_unwind_frame() when it tries to make the
transition: the fp is NULL, so the INSTACK(frame->fp, bt) check fails and the
function returns FALSE.  The function is trying to emulate the kernel's
unwind_frame() function, which would also return -EINVAL because of the fp.
But I'm not sure whether that fp value has been set correctly, because of the
first, seemingly bogus, exception frame that it's showing.
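
To make the failure mode concrete, here is a minimal sketch of a single
frame-pointer unwind step of the kind described above.  It is not crash's
arm64_unwind_frame() nor the kernel's unwind_frame(); the names sim_stack,
sim_frame, in_stack() and sim_unwind_frame() are hypothetical, and the
stack is just an in-memory array.  It only assumes the usual arm64 frame
record layout, with the saved fp at [fp] and the saved pc at [fp + 8].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a stack buffer read from the dump. */
struct sim_stack {
    uint64_t base;          /* lowest address covered by the stack */
    size_t nwords;          /* number of 64-bit words in the buffer */
    const uint64_t *words;  /* words[0] lives at address 'base' */
};

/* Hypothetical stand-in for the unwinder's current frame state. */
struct sim_frame {
    uint64_t fp;
    uint64_t sp;
    uint64_t pc;
};

/* Roughly the role an INSTACK()-style test plays: is addr in this stack? */
static int in_stack(const struct sim_stack *s, uint64_t addr)
{
    return addr >= s->base && addr < s->base + s->nwords * 8;
}

static uint64_t read_word(const struct sim_stack *s, uint64_t addr)
{
    return s->words[(addr - s->base) / 8];
}

/*
 * One unwind step.  Returns 1 if the frame was advanced, 0 if the walk
 * cannot continue (NULL, misaligned, or out-of-stack frame pointer).
 */
static int sim_unwind_frame(const struct sim_stack *s, struct sim_frame *f)
{
    uint64_t fp;

    assert(s != NULL && f != NULL);
    fp = f->fp;

    if (fp == 0 || (fp & 0xf) || !in_stack(s, fp) || !in_stack(s, fp + 8))
        return 0;

    f->sp = fp + 0x10;              /* caller's sp sits above the record */
    f->fp = read_word(s, fp);       /* saved frame pointer */
    f->pc = read_word(s, fp + 8);   /* saved return address */
    return 1;
}
```

With a frame whose fp is 0, as in the WARNING message, sim_unwind_frame()
returns 0 immediately, which is the analogue of the FALSE / -EINVAL result.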

As you have seen, kernel space exceptions look like this, where the fp, sp and pc
values are legitimate, so it prints "--- <IRQ stack> ---" and transitions to the
exception frame on the process stack:
 
  crash> set debug 1
  debug: 1
  crash> bt
  PID: 0      TASK: fffffe035b0aae00  CPU: 3   COMMAND: "swapper/3"
  fffffe03fe183d58: fffffe0000137ee4 (crash_save_cpu on IRQ stack)
   #0 [fffffe03fe183d60] crash_save_cpu at fffffe0000137ee4
   #1 [fffffe03fe183dc0] handle_IPI at fffffe000008e8d4
   #2 [fffffe03fe183f80] gic_handle_irq at fffffe00000824c8
   #3 [fffffe03fe183fd0] el1_irq at fffffe0000083520
  bt: arm64_unwind_frame: switch stacks: fp: fffffe035b0f3f30 sp: fffffe035b0f3e10  pc: fffffe000008611c
  --- <IRQ stack> ---
  pt_regs: fffffe035b0f3e10
       PC: fffffe000008611c  [arch_cpu_idle+60]
       LR: fffffe0000086118  [arch_cpu_idle+56]
       SP: fffffe035b0f3f30  PSTATE: 60000145
      X29: fffffe035b0f3f30  X28: 0000000000000000  X27: fffffe0000084170
      X26: fffffe0000bf13dc  X25: fffffe0000cf4000  X24: fffffe035b0f0000
      X23: 0000000000000001  X22: fffffe0000b94c48  X21: 0000000000000003
      X20: fffffe0000cf6000  X19: fffffe0000cf6028  X18: 000002aabb090050
      X17: 000003ff9131a228  X16: fffffe000026dba4  X15: 00000000000000bf
      X14: 004894597490a924  X13: 0000000000000000  X12: 0000000000000010
      X11: 0000000000000067  X10: 0000000000000ab0   X9: fffffe035b0f0000
       X8: fffffe035b0ab910   X7: 0000000000007b17   X6: 000000000001c690
       X5: 0000001515d0302c   X4: 0100000000000000   X3: fffffe03fe184c8c
       X2: fffffe03fe184c80   X1: 0000000000000000   X0: fffffe035b0f0000
      ORIG_X0: fffffe035b0f0000  SYSCALLNO: fffffe0000b94c48
   #4 [fffffe035b0f3e10] arch_cpu_idle at fffffe000008611c
   #5 [fffffe035b0f3f40] default_idle_call at fffffe00000f81cc
   #6 [fffffe035b0f3f70] cpu_startup_entry at fffffe00000f8320
   #7 [fffffe035b0f3f80] secondary_start_kernel at fffffe000008e338
  crash>

In your sample, it certainly doesn't appear that the first exception frame found
on the IRQ stack is legitimate, and it probably should not pass the test in
arm64_is_kernel_exception_frame(), but it does:

> crash> bt 1324
> PID: 1324   TASK: ffff80002018be80  CPU: 2   COMMAND: "dhry"
> ffff800022f6ae08: ffff00000812ae44 (crash_save_cpu on IRQ stack)
>  #0 [ffff800022f6ae10] crash_save_cpu at ffff00000812ae44
>  #1 [ffff800022f6ae60] handle_IPI at ffff00000808e718
>  #2 [ffff800022f6b020] gic_handle_irq at ffff0000080815f8
>  #3 [ffff800022f6b050] el0_irq_naked at ffff000008084c4c
> pt_regs: ffff800022f6af60
>      PC: ffffffffffffffff  [unknown or invalid address]
>      LR: ffff800020107ed0  [unknown or invalid address]
>      SP: 0000000000000000  PSTATE: 004016a4
>     X29: ffff000008084c4c  X28: ffff800022f6b080  X27: ffff000008e60c54
>     X26: ffff800020107ed0  X25: 0000000000001fff  X24: 0000000000000003
>     X23: ffff0000080815f8  X22: ffff800022f6b040  X21: 0000000000000000
>     X20: ffff000008bce000  X19: ffff00000808e758  X18: ffff800022f6b010
>     X17: ffff00000808a820  X16: ffff800022f6aff0  X15: 0000000000000000
>     X14: 0000000000000000  X13: 0000000000000000  X12: 0000000000402138
>     X11: ffff000008675850  X10: ffff800022f6afe0   X9: 0000000000000000
>      X8: ffff800022f6afc0   X7: 0000000000000000   X6: 0000000000000000
>      X5: 0000000000000000   X4: 0000000000000001   X3: 0000000000000000
>      X2: 0000000000493000   X1: 0000000000498000   X0: ffffffffffffffff
>     ORIG_X0: 0000000020000000  SYSCALLNO: 4021f0
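
As an illustration of why the frame above looks bogus, here is a sketch of
a very simple plausibility check.  It is a hypothetical example, not the
logic of arm64_is_kernel_exception_frame(), which is not reproduced here.
The names sample_regs, addr_range and plausible_kernel_frame() are made up,
and the caller is assumed to supply the kernel text range and the bounds of
the stack being examined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of an exception frame: just PC and SP. */
struct sample_regs {
    uint64_t pc;
    uint64_t sp;
};

/* Half-open address range [lo, hi), supplied by the caller. */
struct addr_range {
    uint64_t lo;
    uint64_t hi;
};

static int in_range(const struct addr_range *r, uint64_t addr)
{
    return addr >= r->lo && addr < r->hi;
}

/*
 * Returns 1 if the frame could plausibly describe a kernel-mode
 * exception: PC inside kernel text and SP inside the given stack.
 */
static int plausible_kernel_frame(const struct sample_regs *regs,
                                  const struct addr_range *text,
                                  const struct addr_range *stack)
{
    assert(regs != NULL && text != NULL && stack != NULL);
    return in_range(text, regs->pc) && in_range(stack, regs->sp);
}
```

A frame with PC ffffffffffffffff and SP 0000000000000000 fails this check
for any kernel text and stack ranges, since an all-ones PC can never be
below the end of a range and a zero SP is not a kernel stack address.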

Maybe that is the cause of the bogus "fp"?  Anyway, since the orig_sp is
read from a fixed location at the top of the IRQ stack, it then manages to make
its way back to the "dhry" process stack, where this exception frame "looks"
legitimate:

> bt: WARNING: arm64_unwind_frame: on IRQ stack: oriq_sp: ffff800020107ed0 fp: 0 (?)
> pt_regs: ffff800020107ed0
>      PC: 00000000004016a4   LR: 00000000004016a4   SP: 0000ffffc10c40a0
>     X29: 0000ffffc10c40a0  X28: 0000000000000000  X27: 0000000000000000
>     X26: 0000000000000000  X25: 0000000000402138  X24: 00000000004021f0
>     X23: 0000000000000000  X22: 0000000000000000  X21: 00000000004001a0
>     X20: 0000000000000000  X19: 0000000000000000  X18: 0000000000000000
>     X17: 0000000000000001  X16: 0000000000000000  X15: 0000000000493000
>     X14: 0000000000498000  X13: ffffffffffffffff  X12: 0000000000000005
>     X11: 000000000000001e  X10: 0101010101010101   X9: fffffffff59a9190
>      X8: 7f7f7f7f7f7f7f7f   X7: 1f535226301f2b4c   X6: 00000003001d1000
>      X5: 00101d0003000000   X4: 0000000000000000   X3: 4952545320454d4f
>      X2: 0000000010c35b40   X1: 0000000000000011   X0: 0000000010c35b40
>     ORIG_X0: 0000000000498700  SYSCALLNO: ffffffffffffffff  PSTATE: 20000000

But I'm not sure what happens when an arm64 IRQ exception occurs while
the task is running in user space.  Does it lay an exception frame down on the
process stack and then make the transition?  (and therefore the user-space frame
above is legitimate?)  Or does the user-space frame get laid down directly on the
IRQ stack?  Unfortunately I don't know enough about arm64 exception handling.
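
One thing that can be checked directly from the frames quoted above is
the exception level recorded in each PSTATE value.  The sketch below
decodes the AArch64 mode field using the PSR_MODE_* values from the
kernel's arm64 ptrace.h header; the helper names pstate_mode(),
pstate_is_user(), pstate_mode_name() and print_pstate_mode() are
hypothetical and are not crash functions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* AArch64 PSTATE mode field values, as defined in the arm64 ptrace.h. */
#define PSR_MODE_EL0t 0x0u
#define PSR_MODE_EL1t 0x4u
#define PSR_MODE_EL1h 0x5u
#define PSR_MODE_MASK 0xfu

/* Extract the mode field from a saved PSTATE value. */
static unsigned int pstate_mode(uint64_t pstate)
{
    return (unsigned int)(pstate & PSR_MODE_MASK);
}

/* EL0t means the exception was taken from user space. */
static int pstate_is_user(uint64_t pstate)
{
    return pstate_mode(pstate) == PSR_MODE_EL0t;
}

static const char *pstate_mode_name(uint64_t pstate)
{
    switch (pstate_mode(pstate)) {
    case PSR_MODE_EL0t: return "EL0t";
    case PSR_MODE_EL1t: return "EL1t";
    case PSR_MODE_EL1h: return "EL1h";
    default:            return "other";
    }
}

/* Print one PSTATE value together with its decoded mode. */
static void print_pstate_mode(FILE *out, uint64_t pstate)
{
    assert(out != NULL);
    fprintf(out, "PSTATE %08llx: %s\n",
            (unsigned long long)pstate, pstate_mode_name(pstate));
}
```

Applied to the values above, 20000000 (the frame at ffff800020107ed0)
decodes to EL0t, i.e. user mode, and 60000145 (the swapper/3 frame) decodes
to EL1h.  The first frame's 004016a4 decodes to EL1t, but that value is
identical to the user-space PC shown in the later frame, so it may not be
a real PSTATE at all.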

In any case, the bt should display "--- <IRQ stack> ---", and then dump
the user-to-kernel-space exception frame, wherever it lies, i.e., either on the
normal process stack or (maybe?) on the IRQ stack.

Anyway, can you make the vmlinux/vmcore pair available for me to download?  You can
send the details to me offline.

Thanks,
  Dave



