[Crash-utility] Question re: xen hypervisor backtrace problem

Dave Anderson anderson at redhat.com
Tue Oct 14 20:30:18 UTC 2008


Hello Oda-san,

I have a xen-syms vmcore that exercises a path that the hypervisor-related
changes in lkcd_x86_trace.c cannot handle.  When the backtrace runs
into the "process_softirqs" text return address reference from
"xen/arch/x86/x86_32/entry.S", it cannot go any further.  The backtrace
therefore fails, and the recovery code then incorrectly searches for a
(vmlinux) eframe:

  crash> bt -a
  PCPU:  0  VCPU: ffbc7080
  bt: cannot resolve stack trace:
   #0 [ff1d3ebc] elf_core_save_regs at ff10a810
   #1 [ff1d3ec4] common_interrupt at ff1222ed
   #2 [ff1d3ed0] do_nmi at ff1335bb
   #3 [ff1d3ef0] handle_nmi_mce at ff17442e
   #4 [ff1d3f24] csched_tick at ff110aa7
   #5 [ff1d3f80] timer_softirq_action at ff1155d2
   #6 [ff1d3fa0] do_softirq at ff1143fe
   #7 [ff1d3fb0] process_softirqs at ff173f61
  bt: text symbols on stack:
      [ff1d3ebc] disable_local_APIC at ff11db75
      [ff1d3ec0] crash_nmi_callback at ff13cc96
      [ff1d3ec4] common_interrupt at ff1222f2
      [ff1d3ed0] do_nmi at ff1335c1
      [ff1d3ef0] handle_nmi_mce at ff174435
      [ff1d3f18] csched_tick at ff110aa7
      [ff1d3f80] timer_softirq_action at ff1155d4
      [ff1d3fa0] do_softirq at ff114405
      [ff1d3fb0] process_softirqs at ff173f66
  
  bt: invalid structure size: task_struct
      FILE: x86.c  LINE: 1576  FUNCTION: x86_eframe_search()
  
  [/usr/bin/crash] error trace: 816373b => 8164497 => 810c40c => 813ed94
  
    813ed94: SIZE_verify+126
    810c40c: x86_eframe_search+1075
    8164497: handle_trace_error+692
    816373b: lkcd_x86_back_trace+2370
  
  bt: invalid structure size: task_struct
      FILE: x86.c  LINE: 1576  FUNCTION: x86_eframe_search()
  
  crash> 
  
Now, the bogus vmlinux eframe search can be avoided by doing this in 
handle_trace_error():

--- lkcd_x86_trace.c.orig       2008-10-14 15:46:33.000000000 -0400
+++ lkcd_x86_trace.c    2008-10-14 16:09:26.000000000 -0400
@@ -2440,12 +2441,14 @@ handle_trace_error(struct bt_info *bt, i
         bt->flags |= BT_TEXT_SYMBOLS_PRINT|BT_ERROR_MASK;
         back_trace(bt);
 
-        bt->flags = BT_EFRAME_COUNT;
-        if ((cnt = machdep->eframe_search(bt))) {
-               error(INFO, "possible exception frame%s:\n", 
-                       cnt > 1 ? "s" : "");
-               bt->flags &= ~(ulonglong)BT_EFRAME_COUNT;
-               machdep->eframe_search(bt); 
+       if (!XEN_HYPER_MODE()) {
+               bt->flags = BT_EFRAME_COUNT;
+               if ((cnt = machdep->eframe_search(bt))) {
+                       error(INFO, "possible exception frame%s:\n", 
+                               cnt > 1 ? "s" : "");
+                       bt->flags &= ~(ulonglong)BT_EFRAME_COUNT;
+                       machdep->eframe_search(bt); 
+               }
        }
 }

With that change applied, bt -a no longer fails prematurely and
continues on to the next PCPU:
  
  crash> bt -a
  PCPU:  0  VCPU: ffbc7080
  bt: cannot resolve stack trace:
   #0 [ff1d3ebc] elf_core_save_regs at ff10a810
   #1 [ff1d3ec4] common_interrupt at ff1222ed
   #2 [ff1d3ed0] do_nmi at ff1335bb
   #3 [ff1d3ef0] handle_nmi_mce at ff17442e
   #4 [ff1d3f24] csched_tick at ff110aa7
   #5 [ff1d3f80] timer_softirq_action at ff1155d2
   #6 [ff1d3fa0] do_softirq at ff1143fe
   #7 [ff1d3fb0] process_softirqs at ff173f61
  bt: text symbols on stack:
      [ff1d3ebc] disable_local_APIC at ff11db75
      [ff1d3ec0] crash_nmi_callback at ff13cc96
      [ff1d3ec4] common_interrupt at ff1222f2
      [ff1d3ed0] do_nmi at ff1335c1
      [ff1d3ef0] handle_nmi_mce at ff174435
      [ff1d3f18] csched_tick at ff110aa7
      [ff1d3f80] timer_softirq_action at ff1155d4
      [ff1d3fa0] do_softirq at ff114405
      [ff1d3fb0] process_softirqs at ff173f66

  PCPU:  1  VCPU: ff1b6080
  ...
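
As a rough illustration of the control flow only, here is a minimal
standalone sketch of the guarded recovery path.  It is not crash's
actual code: toy_bt, toy_eframe_search() and toy_recover() are
hypothetical names invented for this example, and the xen_hyper field
merely stands in for XEN_HYPER_MODE().

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the trace context; not crash's struct bt_info. */
struct toy_bt {
    bool xen_hyper;   /* models XEN_HYPER_MODE() */
    int  searches;    /* how many times the eframe search was invoked */
    int (*eframe_search)(struct toy_bt *bt, bool count_only);
};

/* Hypothetical search routine: pretends one candidate frame exists. */
static int toy_eframe_search(struct toy_bt *bt, bool count_only)
{
    bt->searches++;
    return count_only ? 1 : 0;
}

/*
 * Recovery step after a failed back trace.  The vmlinux-style eframe
 * search is skipped entirely when analyzing the hypervisor.
 * Returns the number of eframe searches that were performed.
 */
static int toy_recover(struct toy_bt *bt)
{
    int cnt;

    bt->searches = 0;
    if (!bt->xen_hyper) {
        /* First pass only counts candidate frames. */
        cnt = bt->eframe_search(bt, true);
        if (cnt) {
            printf("possible exception frame%s:\n", cnt > 1 ? "s" : "");
            /* Second pass would display them. */
            bt->eframe_search(bt, false);
        }
    }
    return bt->searches;
}
```

With xen_hyper set, toy_recover() returns without calling the search
at all, which is the behavior the patch is after; with it clear, the
count pass and the display pass both run as before.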
  
Carrying it one step further, the relevant part of the stack from
above looks like this:

  crash> rd -s ff1d3ebc 84
  ff1d3ebc:  disable_local_APIC+5 crash_nmi_callback+38 common_interrupt+82 cpu0_stack+16076 
  ff1d3ecc:  0003d027 do_nmi+49 cpu0_stack+16120 00000000 
  ff1d3edc:  ffbca000 ffbcbeb0 00000030 cpu0_stack+16308 
  ff1d3eec:  0000e010 handle_nmi_mce+91 cpu0_stack+16120 00000100 
  ff1d3efc:  00000005 000000ff 000005dc ffbdee88 
  ff1d3f0c:  00000000 00000960 00020000 csched_tick+1239 
  ff1d3f1c:  0000e008 00000083 ffbc7080 00000030 
  ff1d3f2c:  0003d027 80000003 000583a8 per_cpu__schedule_data 
  ff1d3f3c:  c840ceb2 00000000 ffbfda80 00000000 
  ff1d3f4c:  00000000 00000000 00000100 00000960 
  ff1d3f5c:  ffbdee80 00000246 000000ff csched_priv+4 
  ff1d3f6c:  00000000 ffbfda8c __per_cpu_data_end+54972 e4c5d8d9 
  ff1d3f7c:  0000008b timer_softirq_action+132 00000000 ffbc7080 
  ff1d3f8c:  per_cpu__timers 00000000 cpu0_stack+16308 0000007b 
  ff1d3f9c:  eaed7700 do_softirq+53 00000000 ffbc7080 
  ff1d3fac:  0000007b process_softirqs+6 eb396d84 00000002 
  ff1d3fbc:  c0678470 c0678470 00000002 eaed7700 
  ff1d3fcc:  00000000 000d0000 c04011a7 00000061 
  ff1d3fdc:  00000202 eb396d48 00000069 0000007b 
  ff1d3fec:  0000007b 00000000 00000000 00000000 
  ff1d3ffc:  ffbc7080 ffffffff ffffffff ffffffff
  crash> 
  
Clearly "process_softirqs" is the last text return address
reference that the backtrace code can work with.  So to try
to clean up the backtrace, I added this:

--- lkcd_x86_trace.c.orig       2008-10-14 15:46:33.000000000 -0400
+++ lkcd_x86_trace.c    2008-10-14 16:09:26.000000000 -0400
@@ -1423,6 +1423,7 @@ find_trace(
                if (XEN_HYPER_MODE()) {
                        func_name = kl_funcname(pc);
                        if (STREQ(func_name, "idle_loop") || STREQ(func_name, "hypercall")
+                               || STREQ(func_name, "process_softirqs")
                                || STREQ(func_name, "tracing_off")
                                || STREQ(func_name, "handle_exception")) {
                                UPDATE_FRAME(func_name, pc, 0, sp, bp, asp, 0, 0, bp - sp, 0);

which shows:
  
  crash> bt -a
  PCPU:  0  VCPU: ffbc7080
   #0 [ff1d3ebc] elf_core_save_regs at ff10a810
   #1 [ff1d3ec4] common_interrupt at ff1222ed
   #2 [ff1d3ed0] do_nmi at ff1335bb
   #3 [ff1d3ef0] handle_nmi_mce at ff17442e
   #4 [ff1d3f24] csched_tick at ff110aa7
   #5 [ff1d3f80] timer_softirq_action at ff1155d2
   #6 [ff1d3fa0] do_softirq at ff1143fe
   #7 [ff1d3fb0] process_softirqs at ff173f61
  
  PCPU:  1  VCPU: ff1b6080
  ...
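
The name check that the second patch extends can be pictured as a
simple table lookup.  The following is a minimal sketch under that
assumption: the function names in the table are the ones from the
patch context above, but is_terminal_frame() and the table itself are
hypothetical, and plain strcmp() is used here in place of crash's
STREQ() macro.  The real code keeps the chained comparisons inside
find_trace().

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Function names at which the hypervisor back trace appears to record
 * a final frame.  "process_softirqs" is the entry added by the patch.
 */
static const char *const terminal_frames[] = {
    "idle_loop",
    "hypercall",
    "process_softirqs",
    "tracing_off",
    "handle_exception",
};

/* Hypothetical helper: true if func_name matches one of the names above. */
static bool is_terminal_frame(const char *func_name)
{
    size_t i;

    if (func_name == NULL)
        return false;

    for (i = 0; i < sizeof(terminal_frames) / sizeof(terminal_frames[0]); i++) {
        if (strcmp(func_name, terminal_frames[i]) == 0)
            return true;
    }
    return false;
}
```

Here is_terminal_frame("process_softirqs") is true while
is_terminal_frame("do_softirq") is false, matching where the trace
above stops.  A table like this would make each future addition a
one-line change, though whether that is preferable to the existing
chained STREQ() style is a matter of taste.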
        
With the second patch applied, the first patch (skipping the eframe
search) is no longer needed for this vmcore, but it seems that it
should be left in place to cover other unforeseen failure paths in
the future.

Do you agree with these changes?

Thanks,
  Dave



