[Crash-utility] Problem with NUMA Nodes

Dave Anderson anderson at redhat.com
Mon Apr 30 13:57:12 UTC 2007


sharyathi nagesh wrote:

> Hi
>     I am seeing this problem with crash tool on a system with NUMA nodes.
> crash exits with error message and no further analysis of dump is possible.
> =====
> Error message:
>
> cassinilp1:~ # crash
>
> crash 4.0-3.14
> Copyright (C) 2002, 2003, 2004, 2005, 2006  Red Hat, Inc.
> Copyright (C) 2004, 2005, 2006  IBM Corporation
> Copyright (C) 1999-2006  Hewlett-Packard Co
> Copyright (C) 2005  Fujitsu Limited
> Copyright (C) 2005  NEC Corporation
> Copyright (C) 1999, 2002  Silicon Graphics, Inc.
> Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
> This program is free software, covered by the GNU General Public License,
> and you are welcome to change it and/or distribute copies of it under
> certain conditions.  Enter "help copying" to see the conditions.
> This program has absolutely no warranty.  Enter "help warranty" for details.
>
> GNU gdb 6.1
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for details.
> This GDB was configured as "powerpc64-unknown-linux-gnu"...
>
> crash: numnodes out of sync with pgdat_list?
>
> =====
> System configuration is given as
>
> Node 0 Memory:
> Node 1 Memory:
> Node 2 Memory:
> Node 3 Memory:
> Node 4 Memory: 0x0-0x180000000
>
> Node 0 CPUs: 0
> Node 1 CPUs:
> Node 2 CPUs:
> Node 3 CPUs:
> Node 4 CPUs: 1
> =====
> The error is raised by this check in dump_memory_nodes() in memory.c:
>
>         if (n != vt->numnodes)
>                 error(FATAL, "numnodes out of sync with pgdat_list?\n");
>
> The cause is a mismatch between node_online_map and the number of nodes
> found by traversing pgdat_list. The bits in node_online_map are set
> differently in kernel versions 2.6.16 and 2.6.19.
> In the earlier version, every bit from the first bit through the nth bit is
> set to '1', where n is the last node to which memory is assigned.
> In the later version, a node is considered online if either memory or a CPU
> is allocated to it (or both).
>
> So I need your suggestion on how to fix the problem.
> A few ideas I had were:
> 1) If KERNEL_VERSION <= 2.6.16, increment vt->numnodes only if the bits of
>    node_online_map and cpu_online_map are set;
>    if KERNEL_VERSION > 2.6.16, use only node_online_map.
>         (This will partly solve the problem.)
> 2) Or, as in node_table_init(), raise the error only when CRASHDEBUG(2) is
>    set, else update vt->numnodes with 'n'.
>
> Please let me know your opinion.
> Regards
> Sharyathi Nagesh
>

Hi Sharyathi,

Thanks a lot for debugging this.

I prefer your idea (2), which, if it works OK in your case, will not break
any other currently-working incarnations.

Also, just to clarify, when you say "Raise the error...", node_table_init()
only makes an "error(NOTE, ...)" call, so you would simply get a "NOTE: ..."
message displayed if CRASHDEBUG(2), and the crash session would
still continue.  That's also what we would want in this case, unlike the
session-ending "error(FATAL, ...)" that you're seeing now...
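
For illustration only, here is a minimal sketch of what idea (2) could look
like. It is not the actual crash source: struct vm_table_sketch, debug_level
and reconcile_numnodes() are hypothetical stand-ins for vt, the level that
CRASHDEBUG() tests, and the check in dump_memory_nodes(). It also assumes
that the count is updated whether or not the note is printed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the part of crash's vm table used here. */
struct vm_table_sketch {
    int numnodes;   /* node count recorded before pgdat_list is walked */
};

/* Hypothetical stand-in for the debug level that CRASHDEBUG(2) would test. */
static int debug_level = 0;

/*
 * Reconcile the stored node count with n, the number of nodes found by
 * traversing pgdat_list.  On a mismatch, print a non-fatal note (only at
 * debug level 2 or higher) and adopt n, so the session can continue.
 * Returns 1 if the count was changed, 0 if it already matched.
 */
static int reconcile_numnodes(struct vm_table_sketch *vt, int n)
{
    assert(vt != NULL);

    if (n == vt->numnodes)
        return 0;

    if (debug_level >= 2)
        fprintf(stderr, "NOTE: changing numnodes from %d to %d\n",
                vt->numnodes, n);

    vt->numnodes = n;   /* trust the pgdat_list traversal */
    return 1;
}
```

The point of the sketch is that a mismatch becomes an adjustment plus an
optional note, rather than a reason to end the session.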

Thanks,
  Dave




