[Crash-utility] Crash faults when determining panic task

Thu Sep 29 18:08:32 UTC 2011

----- Original Message -----
> Dave,
> 
> Adding --no_elf_notes to the crash invocation does indeed start crash
> without issue.  Do you think that I am dealing with a
> corrupted/incomplete vmcore (as evident in that extremely large n_descsz
> value) or is this a bug that crash could more gracefully handle?

Hi Joe,

It should absolutely handle it more gracefully, but I'm not sure whether
the vmcore is corrupt.  It's difficult to debug this from afar, but hopefully
you can help me out a little bit.  (I'm also cc'ing the authors of this code
directly, to see if they can shed a little light on the matter.)

W/respect to your first patch that checks for a non-NULL bt->machdep
in x86_64_get_dumpfile_stack_frame(), can you tell me how that happened
exactly?  Before that function was called, it should have come through
get_netdump_regs_x86_64(), which would -- or would not -- have set
bt->machdep here:

        if (ELF_NOTES_VALID() &&
            (bt->flags & BT_DUMPFILE_SEARCH) && DISKDUMP_DUMPFILE() &&
            (note = (Elf64_Nhdr *)
             diskdump_get_prstatus_percpu(bt->tc->processor))) {  
                user_regs = get_regs_from_note((char *)note, &rip, &rsp);

                if (CRASHDEBUG(1))
                        netdump_print("ELF prstatus rsp: %lx rip: %lx\n",
                                rsp, rip);

                *rspp = rsp;
                *ripp = rip;

                if (*ripp && *rspp)
                        bt->flags |= BT_KDUMP_ELF_REGS;

                bt->machdep = (void *)user_regs;
        }

If it did *not* set bt->machdep above, then it must have been because
diskdump_get_prstatus_percpu() below returned a NULL pointer?

void *
diskdump_get_prstatus_percpu(int cpu)
{
        return dd->nt_prstatus_percpu[cpu];
}
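
For what it's worth, a more defensive version of that lookup might look
something like the sketch below.  The struct and its num_prstatus_notes
field are hypothetical stand-ins for illustration, not the real crash
definitions; the point is only that an out-of-range "cpu" and a missing
note both come back as NULL, so the caller has a single case to check.

```c
#include <stddef.h>

/* Hypothetical stand-in for the relevant part of crash's diskdump data. */
struct dd_sketch {
        int    num_prstatus_notes;   /* assumed: entries in the array below */
        void **nt_prstatus_percpu;   /* per-cpu NT_PRSTATUS note pointers */
};

/*
 * Return the note pointer for "cpu", or NULL if the index is out of
 * range or no note was recorded for that cpu.
 */
static void *
prstatus_percpu_checked(const struct dd_sketch *dd, int cpu)
{
        if (!dd || !dd->nt_prstatus_percpu)
                return NULL;
        if (cpu < 0 || cpu >= dd->num_prstatus_notes)
                return NULL;
        return dd->nt_prstatus_percpu[cpu];
}
```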

If you bring up crash with at least debug level 1 like this:

 $ crash -d1 vmlinux vmcore

you will see a dump of the array of dd->nt_prstatus_percpu[] note
pointers.  Alternatively during run-time, you can see the same output
by entering "help -n". 

Can you confirm:

  (1) what "cpu" value was passed to the function (presumably it was
      legitimate), and 
  (2) whether dd->nt_prstatus_percpu[cpu] was NULL?

Secondly, w/respect to the bogus note->n_descsz value, was the note
pointer containing it one of those listed in the dd->nt_prstatus_percpu[]
array?  If not, what was the "cpu" value passed to
diskdump_get_prstatus_percpu() that time?
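
As a rough idea of what "more gracefully" could mean for a bogus
n_descsz, here is a minimal sketch of a size check that could run before
any registers are read out of a note.  The names nhdr_sketch and
note_fits(), and the "avail"/"need" parameters, are made up for this
example; the struct just mirrors the three 32-bit words of Elf64_Nhdr,
and the 4-byte padding of the name and descriptor is an assumption about
how the notes were written.

```c
#include <stddef.h>
#include <stdint.h>

/* Same layout as Elf64_Nhdr in <elf.h>: three 32-bit words. */
struct nhdr_sketch {
        uint32_t n_namesz;
        uint32_t n_descsz;
        uint32_t n_type;
};

/* Round up to a 4-byte boundary (assumed note padding). */
#define NOTE_ALIGN4(x) (((uint64_t)(x) + 3) & ~(uint64_t)3)

/*
 * Return 1 if the header, name and descriptor all fit within "avail"
 * bytes starting at the note, and the descriptor holds at least "need"
 * bytes; return 0 otherwise.
 */
static int
note_fits(const struct nhdr_sketch *note, size_t avail, size_t need)
{
        uint64_t total;

        if (!note || avail < sizeof(*note))
                return 0;

        /* 64-bit arithmetic, so two 32-bit sizes cannot wrap around. */
        total = sizeof(*note) + NOTE_ALIGN4(note->n_namesz) +
                NOTE_ALIGN4(note->n_descsz);
        if (total > avail)
                return 0;

        return note->n_descsz >= need;
}
```

With the header values from the gdb output quoted below (n_namesz = 8,
n_descsz = 3438804992), a check along these lines would reject the note
for any realistically sized note buffer instead of faulting.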

And also, what is the output of:

  crash> help -k | grep _map:

On my workstation, I see this:

  crash> help -k | grep _map:
         cpu_possible_map: 0 1 2 3 4 5 6 7 
          cpu_present_map: 0 1 2 3 4 5 6 7 
           cpu_online_map: 0 1 2 3 4 5 6 7 
  crash>

I'm wondering if your dump shows a system with some of the lower
cpus taken offline?
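
To illustrate why that would matter (this is purely a hypothetical, not
a claim about how your vmcore was written): if the notes were saved only
for online cpus and packed in order, then indexing the array directly by
cpu number would pick up the wrong entry, or none at all.  The helper
name and the "online" array below are invented for the example.

```c
/*
 * Map a cpu number to its slot in a note array that is assumed to be
 * packed in online-cpu order.  online[i] is nonzero if cpu i was online.
 * Returns -1 if the cpu is offline or out of range.
 */
static int
packed_note_index(const int *online, int ncpus, int cpu)
{
        int i, slot = 0;

        if (!online || cpu < 0 || cpu >= ncpus || !online[cpu])
                return -1;

        /* Count the online cpus that precede this one. */
        for (i = 0; i < cpu; i++)
                if (online[i])
                        slot++;

        return slot;
}
```

In that scenario, with cpu 0 offline, cpu 1's note would sit in slot 0
rather than slot 1.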

Thanks,
  Dave

> 
> As far as the kernel is concerned,
> 2.6.32-131.0.15.el6.exp10.bz16586.x86_64 was a stock RH
> 2.6.32-131.0.15
> with an added patch for handling an MD Raid bug (RHBZ-707268).  Stratus
> does load a driver to track dirty VM pages for harvesting purposes, but
> does not change general VM behavior.
> 
> FWIW, this is the only vmcore that I've seen ELF note faulting or
> invalid section numbers.
> 
> Thanks,
> 
> -- Joe
> 
> -----Original Message-----
> From: crash-utility-bounces at redhat.com
> [mailto:crash-utility-bounces at redhat.com] On Behalf Of Dave Anderson
> Sent: Wednesday, September 28, 2011 5:15 PM
> To: Discussion list for crash utility usage,maintenance and
> development
> Subject: Re: [Crash-utility] Crash faults when determining panic task
> 
> 
> Hi Joe,
> 
> It's pretty clear it's due to this change in 5.1.5:
> 
>          - Implemented the capability of using the NT_PRSTATUS ELF note data
>            that is saved in version 4 compressed kdump headers to determine the
>            starting stack and instruction pointer hooks for x86 and x86_64
>            backtraces when they cannot be determined in the traditional manners.
>            (wang.chao at cn.fujitsu.com, wency at cn.fujitsu.com)
> 
> What happens if you run it like so:
> 
>   $ crash --no_elf_notes vmlinux vmcore
> 
> As far as this message:
> 
>   WARNING: sparsemem: invalid section number: 137438888923
> 
> That should be outside the realm of Fujitsu's ELF notes patch.  Does this kernel
> have some kind of Stratus VM modification?
> 
> Dave
> 
> ----- Original Message -----
> > 
> > Crash faults when determining panic task
> > 
> > I have a vmcore generated on RHEL6.1 that newer versions of crash
> > have trouble analyzing (5.1.1-2.el6 seems to work ok).
> > 
> > 
> > 
> > I can provide additional binary files if needed, just let me know
> > what convention best suits the list (ftp, private email attachment,
> > etc.)
> > 
> > 
> > 
> > Crash Version        OS               Result
> > crash 5.1.8          Debian wheezy    faults
> > crash 5.1.7-1.el6    RHEL6.2 Alpha    faults
> > crash 5.1.1-2.el6    RHEL6.1          ok
> > 
> > 
> > Kernel:
> > 
> > 2.6.32-131.0.15.el6.exp10.bz16586.x86_64 ( 2.6.32-131.0.15 + a fix
> > for Red Hat bz - 707268)
> > 
> > 
> > Interesting warnings when starting crash:
> > 
> > WARNING: sparsemem: invalid section number: 137438888923
> > 
> > WARNING: sparsemem: invalid section number: 137438888923
> > 
> > 
> > First fault, null pointer dereference:
> > 
> > please wait... (determining panic task)
> > Program received signal SIGSEGV, Segmentation fault.
> > x86_64_get_dumpfile_stack_frame (rsp=0x7fffffffcc58, rip=0x7fffffffcc50,
> > bt_in=0x7fffffffcce0) at x86_64.c:4183
> > 4183 ur_rip = ULONG(user_regs +
> > (gdb) p user_regs
> > $1 = 0x0
> > 
> > 
> > Workaround, check that bt->machdep is not NULL:
> > 
> > diff -Nupr crash-5.1.8/x86_64.c crash-5.1.8.new/x86_64.c
> > --- crash-5.1.8/x86_64.c 2011-09-16 15:01:12.000000000 -0400
> > +++ crash-5.1.8.new/x86_64.c 2011-09-28 14:12:45.347188571 -0400
> > @@ -4178,7 +4178,7 @@ x86_64_get_dumpfile_stack_frame(struct b
> > goto skip_stage;
> > }
> > }
> > - } else if (ELF_NOTES_VALID()) {
> > + } else if (ELF_NOTES_VALID() && bt->machdep) {
> > user_regs = bt->machdep;
> > ur_rip = ULONG(user_regs +
> > OFFSET(user_regs_struct_rip));
> > 
> > 
> > Second fault, a curiously large n_descsz in elf note header:
> > 
> > please wait... (determining panic task)
> > Program received signal SIGSEGV, Segmentation fault.
> > get_regs_from_note (note=0xd26472 "\b", ip=0x7fffffffc4e0, sp=0x7fffffffc4e8)
> > at netdump.c:2221
> > 2221 *sp = ULONG(user_regs + offset_sp);
> > (gdb) p *(Elf64_Nhdr *)note
> > $1 = {n_namesz = 8, n_descsz = 3438804992, n_type = 8}
> > 
> > 
> > Workaround, do not attempt reading registers from elf notes (this
> > chunk of code was not present in crash 5.1.1):
> > 
> > diff -Nupr crash-5.1.8/netdump.c crash-5.1.8.new/netdump.c
> > --- crash-5.1.8/netdump.c 2011-09-16 15:01:12.000000000 -0400
> > +++ crash-5.1.8.new/netdump.c 2011-09-28 14:14:43.687183734 -0400
> > @@ -2286,7 +2286,7 @@ get_netdump_regs_x86_64(struct bt_info *
> > 
> > bt->machdep = (void *)user_regs;
> > }
> > -
> > +#if 0
> > if (ELF_NOTES_VALID() &&
> > (bt->flags & BT_DUMPFILE_SEARCH) && DISKDUMP_DUMPFILE() &&
> > (note = (Elf64_Nhdr *)
> > @@ -2305,7 +2305,7 @@ get_netdump_regs_x86_64(struct bt_info *
> > 
> > bt->machdep = (void *)user_regs;
> > }
> > -
> > +#endif
> > machdep->get_stack_frame(bt, ripp, rspp);
> > }
> > 
> > 
> > Given the warning messages at the beginning of the process, I'm not
> > sure if I'm dealing with a corrupted or incomplete vmcore image. Let me
> > know what additional info could be useful if this seems worth
> > debugging further.
> > 
> > 
> > 
> > Thanks,
> > 
> > -- Joe Lawrence
> > --
> > Crash-utility mailing list
> > Crash-utility at redhat.com
> > https://www.redhat.com/mailman/listinfo/crash-utility
> > 