[Crash-utility] cannot find stack info on ppc64le (call out to all IBM'ers on this list)

Dave Anderson anderson at redhat.com
Mon Jan 19 20:33:41 UTC 2015


Han,

This is much worse than I thought.  When you said "sometimes", you must
mean "all of the time" with respect to the active tasks?  Because that's
what I see here.  

I provisioned a ppc64le machine, set up kdump to create a compressed kdump,
and crashed the machine with sysrq-c.  This is what I see with "bt -a":

  crash> bt -a
  PID: 12674  TASK: c00000002cc08810  CPU: 0   COMMAND: "bash"

  PID: 0      TASK: c0000001ee020000  CPU: 1   COMMAND: "swapper/1"

  PID: 0      TASK: c0000001ee021370  CPU: 2   COMMAND: "swapper/2"

  PID: 0      TASK: c0000001ee0226e0  CPU: 3   COMMAND: "swapper/3"

  PID: 0      TASK: c0000001ee023a50  CPU: 4   COMMAND: "swapper/4"

  PID: 0      TASK: c0000001ee024dc0  CPU: 5   COMMAND: "swapper/5"

  PID: 0      TASK: c0000001ee026130  CPU: 6   COMMAND: "swapper/6"

  PID: 0      TASK: c0000001ee0274a0  CPU: 7   COMMAND: "swapper/7"
  crash>

As far as I can tell, it does not even use the registers in the 
notes, but rather defaults to searching the relevant stacks for
".crash_kexec".  But on ppc64le, that symbol no longer exists as
it does on a big-endian ppc64:

  crash> sym -q crash_kexec
  c000000000051760 (t) crash_kexec_prepare_cpus
  c0000000000519c0 (T) crash_kexec_secondary
  c00000000016e8a0 (T) crash_kexec
  crash> 
 
On the big-endian ppc64 machines, there was this construct where
the text symbols started with a ".", and there were data symbols
that pointed to them:

  crash> sym -q crash_kexec
  c000000000050050 (t) .crash_kexec_prepare_cpus
  c0000000000502b0 (T) .crash_kexec_secondary
  c000000000172160 (T) .crash_kexec
  c0000000012ab5f0 (d) crash_kexec_prepare_cpus
  c0000000012ab600 (D) crash_kexec_secondary
  c0000000012b9238 (D) crash_kexec
  crash> rd c0000000012b9238
  c0000000012b9238:  c000000000172160                    ......!`
  crash> 
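
To make that construct concrete, here is a minimal sketch of what the
"rd" above is reading, assuming the usual big-endian ppc64 (ELFv1)
function descriptor layout of three doublewords.  The struct and helper
names are hypothetical and are not taken from the crash sources:

```c
#include <stdint.h>

/*
 * Assumed ELFv1 function descriptor layout: the data symbol (for example
 * "crash_kexec") names one of these, and its first doubleword holds the
 * address of the dot-prefixed text symbol (".crash_kexec").
 */
struct func_desc {
        uint64_t entry;   /* address of the text symbol */
        uint64_t toc;     /* TOC base used by the function */
        uint64_t env;     /* environment pointer */
};

/* Hypothetical helper: resolve a descriptor to its text address. */
static uint64_t desc_to_entry(const struct func_desc *d)
{
        return d->entry;
}
```

As far as I can tell, ppc64le kernels have no such descriptors, which
would explain why only the plain text symbols show up there.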

So backtraces are guaranteed to fail for the active tasks on ppc64le,
because the stack searching code does this:

        for (i = 0, up = (ulong *)bt->stackbuf;
             i < (bt->stacktop - bt->stackbase)/sizeof(ulong); i++, up++) {
                sym = closest_symbol(*up);

                if (STREQ(sym, ".netconsole_netdump") ||
                        STREQ(sym, ".netpoll_start_netdump") ||
                        STREQ(sym, ".start_disk_dump") ||
                        STREQ(sym, ".crash_kexec") ||
                        STREQ(sym, ".crash_fadump") ||
                        STREQ(sym, ".disk_dump")) {
                        *nip = *up;
                        *ksp = bt->stackbase +
                                ((char *)(up) - 16 - bt->stackbuf);
                        return TRUE;
                }
        }

where no ".crash_kexec" symbol exists in ppc64le kernels.
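
As a rough illustration only, and not a proposed patch, the scan could
tolerate both naming schemes by ignoring one leading "." when comparing
names.  Everything in this sketch is hypothetical: fake_sym,
fake_closest_symbol() and find_dump_frame() are stand-ins written for
the example and do not exist in crash.  Only the symbol list and the
"slot address minus 16" arithmetic are taken from the loop above:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a symbol table entry; not a crash type. */
struct fake_sym {
        uint64_t addr;
        const char *name;
};

/* Compare a symbol name to a base name, ignoring one leading ".". */
static int sym_matches(const char *sym, const char *base)
{
        if (sym == NULL || base == NULL)
                return 0;
        if (sym[0] == '.')
                sym++;
        return strcmp(sym, base) == 0;
}

/* The dump entry points from the loop above, written without the dot. */
static const char *dump_entry_syms[] = {
        "netconsole_netdump", "netpoll_start_netdump", "start_disk_dump",
        "crash_kexec", "crash_fadump", "disk_dump",
};

static int is_dump_entry(const char *sym)
{
        size_t i;

        for (i = 0; i < sizeof(dump_entry_syms) / sizeof(dump_entry_syms[0]); i++)
                if (sym_matches(sym, dump_entry_syms[i]))
                        return 1;
        return 0;
}

/*
 * Hypothetical lookup: name of the last symbol at or below addr, or NULL.
 * The table must be sorted by ascending address.
 */
static const char *fake_closest_symbol(const struct fake_sym *tab, size_t n,
                                       uint64_t addr)
{
        const char *name = NULL;
        size_t i;

        for (i = 0; i < n && tab[i].addr <= addr; i++)
                name = tab[i].name;
        return name;
}

/*
 * Scan a copy of the kernel stack for a return address inside a dump
 * entry function.  On a match, *nip is that return address and *ksp is
 * the kernel address of the matching slot minus 16, as in the loop above.
 */
static int find_dump_frame(const uint64_t *stackbuf, size_t nwords,
                           uint64_t stackbase,
                           const struct fake_sym *tab, size_t nsyms,
                           uint64_t *nip, uint64_t *ksp)
{
        size_t i;

        for (i = 0; i < nwords; i++) {
                if (is_dump_entry(fake_closest_symbol(tab, nsyms, stackbuf[i]))) {
                        *nip = stackbuf[i];
                        *ksp = stackbase + i * sizeof(uint64_t) - 16;
                        return 1;
                }
        }
        return 0;
}
```

With this comparison, is_dump_entry() accepts both ".crash_kexec" and
"crash_kexec" but not "crash_kexec_secondary", so the same list would
serve big-endian and little-endian kernels.  Whether that is the right
fix for the real code is for whoever owns the ppc64 backtrace code to
decide.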

So Han, can you find out who in IBM should be responsible for supporting
ppc64le in the crash utility?  Or is it you?

Thanks,
  Dave



   

----- Original Message -----
> 
> 
> ----- Original Message -----
> > 
> > ----- Original Message -----
> > > Hello,
> > > 
> > > I just noticed that on ppc64le, sometimes "bt" cannot find the stack
> > > info of current process. For example, there is a vmcore captured by
> > > > kdump on a ppc64le system, which is running kernel version 3.10. The
> > > vmcore was captured when kernel oopsed. There is no stack info found by
> > > bt:
> > 
> > Hello Han,
> > 
> > I've never worked on the backtrace code for ppc64, as it was written
> > by (and maintained by) IBM.  From the debug messages, what happened is
> > that the starting IP/SP hooks are not being found.  The crash command
> > sequence presumably looks like this:
> > 
> >   cmd_bt
> >    back_trace
> >     get_kdump_regs
> >       get_netdump_regs
> >         get_netdump_regs_ppc64   (should setup bt->machdep to point to
> >         NT_PRSTATUS note)
> >           ppc64_get_stack_frame
> >             ppc64_get_dumpfile_stack_frame
> >                ppc64_kdump_stack_frame (should get IP/SP pair based upon
> >                NT_PRSTATUS note contents)
> >     ppc64_back_trace_cmd
> >      ppc64_back_trace
> > 
> > > ppc64_kdump_stack_frame() should pull the starting NIP/KSP values from the
> > > pt_regs structure in the per-cpu NT_PRSTATUS note, but it appears that it
> > > is not, leaving the registers at their initialized values of NULL.
> > > 
> > > This causes the failure later on when ppc64_back_trace_cmd() is called,
> > > which prints the "=> PC: 0 () FP: 0" debug message shown below, and later
> > > on ppc64_back_trace() prints the "cannot find the stack info." debug
> > > message.
> > > 
> > > Without the dumpfile, I can't offer much else.  Can you verify the crash
> > > utility stack trail above, and if it is as I suspect, figure out why
> > > ppc64_kdump_stack_frame() is failing?  Or what other path it is taking?
> 
> > Actually, if this is a compressed kdump, ppc64_kdump_stack_frame() will not
> > be called, and the register access is done inside
> > ppc64_get_dumpfile_stack_frame().
> > 
> > The ppc64_get_dumpfile_stack_frame() function first grabs the registers from
> > the pt_regs structure in the per-cpu NT_PRSTATUS note, but then also checks
> > the hard and soft IRQ stacks, and the hardware interrupt stack, for known
> > instances of kernel dump functions, which would override the pt_regs
> > contents.  If nothing is found on those stacks, the registers from the
> > NT_PRSTATUS note are used.
> 
> Dave
> 
> 
> 
> 
> > 
> > > 
> > > crash 7.0.9-2.ael7b
> > > Copyright (C) 2002-2014  Red Hat, Inc.
> > > Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
> > > Copyright (C) 1999-2006  Hewlett-Packard Co
> > > Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
> > > Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
> > > Copyright (C) 2005, 2011  NEC Corporation
> > > Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
> > > Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
> > > This program is free software, covered by the GNU General Public License,
> > > and you are welcome to change it and/or distribute copies of it under
> > > certain conditions.  Enter "help copying" to see the conditions.
> > > > This program has absolutely no warranty.  Enter "help warranty" for details.
> > > 
> > > GNU gdb (GDB) 7.6
> > > Copyright (C) 2013 Free Software Foundation, Inc.
> > > > License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> > > > This is free software: you are free to change and redistribute it.
> > > > There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> > > > and "show warranty" for details.
> > > This GDB was configured as "powerpc64le-unknown-linux-gnu"...
> > > 
> > >       KERNEL: /usr/lib/debug/lib/modules/3.10.0-221.ael7b.ppc64le/vmlinux
> > > >     DUMPFILE: /var/crash/127.0.0.1-2015.01.15-22:19:14/vmcore  [PARTIAL DUMP]
> > >         CPUS: 16
> > >         DATE: Thu Jan 15 21:18:16 2015
> > >       UPTIME: 17:53:43
> > > LOAD AVERAGE: 213.58, 213.23, 212.70
> > >        TASKS: 1383
> > >     NODENAME: thymelp2.isst.aus.stglabs.ibm.com
> > >      RELEASE: 3.10.0-221.ael7b.ppc64le
> > >      VERSION: #1 SMP Wed Jan 7 09:27:09 EST 2015
> > >      MACHINE: ppc64le  (3425 Mhz)
> > >       MEMORY: 15 GB
> > > >        PANIC: "Oops: Kernel access of bad area, sig: 11 [#1]" (check log for details)
> > >          PID: 1970
> > >      COMMAND: "cat"
> > >         TASK: c0000003130874a0  [THREAD_INFO: c00000005069c000]
> > >          CPU: 5
> > >        STATE: TASK_RUNNING (PANIC)
> > > 
> > > crash> set debug 99
> > > debug: 99
> > > crash> bt
> > > PID: 1970   TASK: c0000003130874a0  CPU: 5   COMMAND: "cat"
> > > GETBUF(16384 -> 0)
> > > > <readmem: c00000005069c000, KVADDR, "stack contents", 16384, (ROE), 10a81570>
> > > <read_diskdump: addr: c00000005069c000 paddr: 5069c000 cnt: 16384>
> > > read_diskdump: paddr/pfn: 5069c000/5069 -> cache physical page: 50690000
> > > c00000005069c018: do_no_restart_syscall
> > > c00000005069e870: blk_throtl_bio+240
> > > c00000005069e990: clone_endio
> > > c00000005069ea00: generic_make_request_checks+836
> > > c00000005069eab8: hardware_interrupt_common+128
> > > c00000005069eac0: generic_make_request+36
> > > c00000005069eb10: mempool_alloc_slab+36
> > > c00000005069eb30: mempool_alloc+256
> > > c00000005069eb50: mempool_alloc_slab+36
> > > c00000005069ebc0: get_request+948
> > > c00000005069ec00: __split_and_process_bio+1408
> > > c00000005069ec20: autoremove_wake_function
> > > c00000005069ec80: find_busiest_group+544
> > > c00000005069edf0: load_balance+684
> > > c00000005069ee10: blk_throtl_bio+240
> > > c00000005069ee70: find_busiest_group+544
> > > c00000005069eee0: dequeue_task_fair+968
> > > c00000005069ef30: clone_endio
> > > c00000005069ef50: get_page_from_freelist+1436
> > > c00000005069f0a0: pSeries_cause_ipi_mux+112
> > > c00000005069f0c0: smp_send_reschedule+164
> > > c00000005069f0e0: default_wake_function+708
> > > c00000005069f160: __wake_up_locked+116
> > > c00000005069f1b0: ep_poll_callback+444
> > > c00000005069f250: run_posix_cpu_timers+104
> > > c00000005069f2c0: hvterm_raw_put_chars+64
> > > c00000005069f2e0: hvc_console_print+336
> > > c00000005069f3a8: initial_stab+2048
> > > c00000005069f3b0: crash_save_cpu+252
> > > c00000005069f488: cik_cp_resume+13476
> > > c00000005069f490: dev_get_drvdata
> > > c00000005069f580: default_machine_kexec+332
> > > c00000005069f610: pSeries_machine_kexec+60
> > > c00000005069f680: machine_kexec+56
> > > c00000005069f6a0: crash_kexec+312
> > > c00000005069f6f0: dev_attr_show+64
> > > c00000005069f748: cik_cp_resume+13476
> > > c00000005069f750: dev_get_drvdata
> > > c00000005069f7f0: radeon_hwmon_show_temp+72
> > > c00000005069f800: slb_miss_realmode+80
> > > c00000005069f808: dev_get_drvdata
> > > c00000005069f810: radeon_hwmon_show_temp+32
> > > c00000005069f890: die+840
> > > c00000005069f930: bad_page_fault+224
> > > c00000005069f948: radeon_hwmon_show_temp+72
> > > c00000005069f9a0: handle_page_fault+44
> > > c00000005069fa00: dev_attr_show+64
> > > c00000005069fa58: cik_cp_resume+13476
> > > c00000005069fa60: dev_get_drvdata
> > > c00000005069fb00: radeon_hwmon_show_temp+72
> > > c00000005069fb10: slb_miss_realmode+80
> > > c00000005069fb18: dev_get_drvdata
> > > c00000005069fb20: radeon_hwmon_show_temp+32
> > > c00000005069fb60: handle_mm_fault+1724
> > > c00000005069fb80: sysfs_open_file
> > > c00000005069fbd0: handle_page_fault+16
> > > c00000005069fc90: alloc_pages_current+416
> > > c00000005069fd00: dev_attr_show+64
> > > c00000005069fd30: sysfs_read_file+220
> > > c00000005069fde0: sys_read+304
> > > c00000005069fe40: syscall_exit
> > > [3fffd0d6fe88] back_trace:
> > >         task: c0000003130874a0
> > >        flags: 0
> > >      instptr: 0
> > >       stkptr: 0
> > >         bptr: 0
> > >    stackbase: c00000005069c000
> > >     stacktop: c0000000506a0000
> > >           tc: 1003c7b9fa8 (1970, c0000003130874a0)
> > >           hp: 0
> > >          ref: 0
> > >     stackbuf: 10a81570
> > >     textlist: 0
> > >     frameptr: 0
> > >  call_target: none
> > >    eframe_ip: 0
> > >        debug: 0
> > >        radix: 0
> > >      cpumask: 0
> > >  => PC: 0 () FP: 0
> > >   GETBUF(248 -> 1)
> > >     GETBUF(1500 -> 2)
> > > cannot find the stack info.
> > >     FREEBUF(2)
> > >   FREEBUF(1)
> > > crash>
> > > 
> > > 
> > > Is this a problem?
> > > 
> > > Thanks in advance!
> > > 
> > > --
> > > Crash-utility mailing list
> > > Crash-utility at redhat.com
> > > https://www.redhat.com/mailman/listinfo/crash-utility
> > > 
> > 
> 



