[Crash-utility] crash: invalid kernel virtual address: 0 type: "memory section"

Dave Anderson anderson at redhat.com
Mon Jan 5 15:36:45 UTC 2015



----- Original Message -----
> Hello,
> 
> I have a couple dumps generated on Ubuntu Trusty LTS (3.13.0-39-generic
> kernel) which crash fails on.
> 
> $ ./crash ../ddeb/usr/lib/debug/boot/vmlinux-3.13.0-39-generic
> ../dump.201412280256
> 
> crash 7.0.9
> Copyright (C) 2002-2014  Red Hat, Inc.
> Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
> Copyright (C) 1999-2006  Hewlett-Packard Co
> Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
> Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
> Copyright (C) 2005, 2011  NEC Corporation
> Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
> Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
> This program is free software, covered by the GNU General Public License,
> and you are welcome to change it and/or distribute copies of it under
> certain conditions.  Enter "help copying" to see the conditions.
> This program has absolutely no warranty.  Enter "help warranty" for details.
> 
> GNU gdb (GDB) 7.6
> Copyright (C) 2013 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-unknown-linux-gnu"...
> 
> crash: cannot determine thread return address
> please wait... (gathering kmem slab cache data)
> crash: invalid kernel virtual address: 1c  type: "kmem_cache
> objsize/object_size"
> crash: failed to read pageflag_names entry
> please wait... (gathering module symbol data)
> WARNING: invalid kernel module size: 0
> 
> crash: cannot determine idle task addresses from init_tasks[] or runqueues[]
> 
> crash: cannot resolve "init_task_union"
> 
> 
> vmlinux-3.13.0-39-generic was extracted from Ubuntu ddeb:
> 
> $ file ../ddeb/usr/lib/debug/boot/vmlinux-3.13.0-39-generic
> ../ddeb/usr/lib/debug/boot/vmlinux-3.13.0-39-generic: ELF 64-bit LSB
> executable, x86-64, version 1 (SYSV), statically linked,
> BuildID[sha1]=c4fa631d2cc34a0b2628a5de01a04e81a0667555, not stripped
> 
> With -d8 I get:
> 
> ...
> <read_diskdump: addr: ffffffffffffffff paddr: 7fffffff cnt: 1>
> read_diskdump: paddr/pfn: 7fffffff/7ffff -> cache physical page: 7ffff000
> crash: invalid kernel virtual address: 0  type: "memory section"
> 
> The entire -d8 output is attached.
> 
> Bogus "base kernel version" stands out immediately and I'm pretty sure
> I've seen "0.0.0" in there a couple times with exactly the same dump.
> From a quick look, the base kernel version code in kernel.c is not safe
> against kt->utsname.release being all zeroes.
> 
> Eddy Gonzalo (CC'ed) can probably provide access to the dumps if
> needed.
> 
> Thanks,
>                 Ilya

The obvious question is: are you sure that the vmlinux matches the dumpfile?

I say that because there are so many strange readings from this dumpfile.
As you noted, there definitely is a mismatch in the kernel version, where the dumpfile header shows

               sysname: Linux
              nodename: chqcephnas01
               release: 3.13.0-39-generic
               version: #66~precise1-Ubuntu SMP Wed Oct 29 09:56:49 UTC 2014
               machine: x86_64

but this gets read from the dumpfile:

  <readmem: ffffffff81c15284, KVADDR, "init_uts_ns", 390, (ROE), cfa7bc>
  <read_diskdump: addr: ffffffff81c15284 paddr: 1c15284 cnt: 390>
  read_diskdump: paddr/pfn: 1c15284/1c15 -> cache physical page: 1c15000
  base kernel version: 0.13.0

And one of the first sets of items accessed is the contents of the cpu mask variables:

  <readmem: ffffffff8180acf0, KVADDR, "cpu_possible_mask", 8, (FOE), 7fff5ab8b618>
  <read_diskdump: addr: ffffffff8180acf0 paddr: 180acf0 cnt: 8>
  read_diskdump: paddr/pfn: 180acf0/180a -> cache physical page: 180a000
  <readmem: ffffffff8180ace0, KVADDR, "cpu_present_mask", 8, (FOE), 7fff5ab8b618>
  <read_diskdump: addr: ffffffff8180ace0 paddr: 180ace0 cnt: 8>
  read_diskdump: paddr/pfn: 180ace0/180a -> physical page is cached: 180a000
  <readmem: ffffffff8180ace8, KVADDR, "cpu_online_mask", 8, (FOE), 7fff5ab8b618>
  <read_diskdump: addr: ffffffff8180ace8 paddr: 180ace8 cnt: 8>
  read_diskdump: paddr/pfn: 180ace8/180a -> physical page is cached: 180a000
  <readmem: ffffffff8180acd8, KVADDR, "cpu_active_mask", 8, (FOE), 7fff5ab8b618>
  <read_diskdump: addr: ffffffff8180acd8 paddr: 180acd8 cnt: 8>
  read_diskdump: paddr/pfn: 180acd8/180a -> physical page is cached: 180a000

But they all return NULL pointers.  They should return pointers to bitmasks,
which then get read, and their contents displayed.  For example, I've got
a 3.13 kernel dumpfile where each mask pointer is read, the bitmask it points
to gets read, and then the contents are dumped:

  <readmem: ffffffff8180a870, KVADDR, "cpu_possible_mask", 8, (FOE), 7fff5f116f48>
  <read_diskdump: addr: ffffffff8180a870 paddr: 180a870 cnt: 8>
  <readmem: ffffffff81d8c780, KVADDR, "possible", 1024, (ROE), f45b80>
  <read_diskdump: addr: ffffffff81d8c780 paddr: 1d8c780 cnt: 1024>
  cpu_possible_mask: 0 1 2 3
  <readmem: ffffffff8180a860, KVADDR, "cpu_present_mask", 8, (FOE), 7fff5f116f48>
  <read_diskdump: addr: ffffffff8180a860 paddr: 180a860 cnt: 8>
  <readmem: ffffffff81d8bf80, KVADDR, "present", 1024, (ROE), f45b80>
  <read_diskdump: addr: ffffffff81d8bf80 paddr: 1d8bf80 cnt: 128>
  <read_diskdump: addr: ffffffff81d8c000 paddr: 1d8c000 cnt: 896>
  cpu_present_mask: 0 1
  <readmem: ffffffff8180a868, KVADDR, "cpu_online_mask", 8, (FOE), 7fff5f116f48>
  <read_diskdump: addr: ffffffff8180a868 paddr: 180a868 cnt: 8>
  <readmem: ffffffff81d8c380, KVADDR, "online", 1024, (ROE), f45b80>
  <read_diskdump: addr: ffffffff81d8c380 paddr: 1d8c380 cnt: 1024>
  cpu_online_mask: 0 1
  <readmem: ffffffff8180a858, KVADDR, "cpu_active_mask", 8, (FOE), 7fff5f116f48>
  <read_diskdump: addr: ffffffff8180a858 paddr: 180a858 cnt: 8>
  <readmem: ffffffff81d8bb80, KVADDR, "active", 1024, (ROE), f45b80>
  <read_diskdump: addr: ffffffff81d8bb80 paddr: 1d8bb80 cnt: 1024>
  cpu_active_mask: 0 1

Right from the get-go, the __per_cpu_offset array looks like it's 
returning all zeroes, in which case pretty much all is lost and the
dumpfile is useless.

That can be seen in the following readmem() failure.  The code should take
the kt->__per_cpu_offset[0] value and add it to the (per-cpu) symbol value
of "cpu_number", which presumably is b084 in that kernel; since
kt->__per_cpu_offset[0] is apparently zero, this readmem() call:

                if (!readmem(cpu_sp->value + kt->__per_cpu_offset[i],
                    KVADDR, &cpunumber, sizeof(int),
                    "cpu number (per_cpu)", QUIET|RETURN_ON_ERROR))
                        break;

generated this failure:

  <readmem: b084, KVADDR, "cpu number (per_cpu)", 4, (ROE|Q), 7fff5ab9c800>
  crash: invalid kernel virtual address: b084  type: "cpu number (per_cpu)"

The kt->__per_cpu_offset[] array would have been set up earlier in kernel_init():

        if (symbol_exists("__per_cpu_offset")) {
                if (LKCD_KERNTYPES())
                        i = get_cpus_possible();
                else
                        i = get_array_length("__per_cpu_offset", NULL, 0);
                get_symbol_data("__per_cpu_offset",
                        sizeof(long)*((i && (i <= NR_CPUS)) ? i : NR_CPUS),
                        &kt->__per_cpu_offset[0]);
                kt->flags |= PER_CPU_OFF;
        }

It looks like the array was read OK, where the Ubuntu kernel appears to
have 256 cpus configured:

  <readmem: ffffffff81d130e0, KVADDR, "__per_cpu_offset", 2048, (FOE), cfa968>
  <read_diskdump: addr: ffffffff81d130e0 paddr: 1d130e0 cnt: 2048>
  read_diskdump: paddr/pfn: 1d130e0/1d13 -> cache physical page: 1d13000

But when the stashed kt->__per_cpu_offset[0] value was utilized later on
(for cpu 0), it was zero.

So it looks like the vmlinux and dumpfile don't match, or perhaps the dumpfile
is suspect.

It would be interesting to confirm that the kernel being used (vmlinux-3.13.0-39-generic)
runs OK live on the crashing system.
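For what it's worth, name strings alone won't catch a mismatch here, since the dump header already claims 3.13.0-39-generic.  Two quick cross-checks (a sketch; paths as in your session, and assuming the compressed kdump header is intact) are to read the release straight out of the dump header and to note the debug vmlinux Build ID, which can then be compared against the kernel actually installed on the crashing box:

```
  $ crash --osrelease ../dump.201412280256
  3.13.0-39-generic

  $ readelf -n ../ddeb/usr/lib/debug/boot/vmlinux-3.13.0-39-generic | grep "Build ID"
      Build ID: c4fa631d2cc34a0b2628a5de01a04e81a0667555
```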

Dave
