[Crash-utility] Problem with NUMA Nodes

sharyathi nagesh sharyath at in.ibm.com
Mon Apr 30 10:29:12 UTC 2007



Hi,
    I am seeing a problem with the crash tool on a system with NUMA nodes:
crash exits with an error message, and no further analysis of the dump is possible.
=====
Error message:

cassinilp1:~ # crash

crash 4.0-3.14
Copyright (C) 2002, 2003, 2004, 2005, 2006  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005  Fujitsu Limited
Copyright (C) 2005  NEC Corporation
Copyright (C) 1999, 2002  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

GNU gdb 6.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "powerpc64-unknown-linux-gnu"...


crash: numnodes out of sync with pgdat_list?

=====
The system configuration is as follows:

Node 0 Memory:
Node 1 Memory:
Node 2 Memory:
Node 3 Memory:
Node 4 Memory: 0x0-0x180000000

Node 0 CPUs: 0
Node 1 CPUs:
Node 2 CPUs:
Node 3 CPUs:
Node 4 CPUs: 1
=====
The error is raised by this check in the dump_memory_nodes() function in memory.c:

        if (n != vt->numnodes)
                error(FATAL, "numnodes out of sync with pgdat_list?\n");
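
To make the failing comparison concrete, here is a minimal C model of it. It assumes that n is the number of entries reached by walking pgdat_list and that vt->numnodes was computed separately beforehand. struct pgdat_model, count_pgdat_nodes() and numnodes_in_sync() are hypothetical names for this sketch only; they are not crash or kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel pgdat entry: only the link needed
 * to walk the list is modeled, not crash's real data structures. */
struct pgdat_model {
    int node_id;
    struct pgdat_model *next;   /* assumed link to the next pgdat */
};

/* Count the entries reachable from the list head, i.e. the 'n' that a
 * traversal of pgdat_list would produce. */
static int count_pgdat_nodes(const struct pgdat_model *head)
{
    int n = 0;
    for (const struct pgdat_model *p = head; p != NULL; p = p->next)
        n++;
    return n;
}

/* Returns 1 if the traversal count agrees with numnodes, 0 if the
 * "numnodes out of sync with pgdat_list?" check would fire. */
static int numnodes_in_sync(const struct pgdat_model *head, int numnodes)
{
    assert(numnodes >= 0);
    return count_pgdat_nodes(head) == numnodes;
}
```

Any difference between the two counts, in either direction, is enough to trigger the FATAL error.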

	The problem is caused by a mismatch between node_online_map and the number of nodes found by traversing pgdat_list.
The kernel sets node_online_map bits differently in versions 2.6.16 and 2.6.19.
	In the earlier version, every bit from the first bit up to the nth bit, where n is the last node to which memory is assigned, is set to '1'.
	In the later version, a node is considered online if it has memory, CPUs, or both.
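
The two rules can be compared with a small C model applied to the configuration listed above (node 0 has a CPU only, node 4 has memory and a CPU, nodes 1 to 3 have neither). This is a sketch of the behavior as described in this message, not the kernel's actual code; struct node_info, online_map_old(), online_map_new(), count_online() and MAX_NUMNODES_MODEL are hypothetical names.

```c
#include <assert.h>

#define MAX_NUMNODES_MODEL 64   /* hypothetical bound: one bit per node in a 64-bit map */

/* Per-node facts, as printed in the "Node N Memory/CPUs" listing. */
struct node_info {
    int has_memory;
    int has_cpus;
};

/* Earlier rule (2.6.16, as described): set every bit from 0 up to the
 * highest-numbered node that has memory. */
static unsigned long long online_map_old(const struct node_info *nodes, int nr)
{
    int last = -1;
    unsigned long long map = 0;

    assert(nr >= 0 && nr <= MAX_NUMNODES_MODEL);
    for (int i = 0; i < nr; i++)
        if (nodes[i].has_memory)
            last = i;
    for (int i = 0; i <= last; i++)
        map |= 1ULL << i;
    return map;
}

/* Later rule (2.6.19, as described): a node is online if it has memory,
 * CPUs, or both. */
static unsigned long long online_map_new(const struct node_info *nodes, int nr)
{
    unsigned long long map = 0;

    assert(nr >= 0 && nr <= MAX_NUMNODES_MODEL);
    for (int i = 0; i < nr; i++)
        if (nodes[i].has_memory || nodes[i].has_cpus)
            map |= 1ULL << i;
    return map;
}

/* Number of online nodes, i.e. the population count of the map. */
static int count_online(unsigned long long map)
{
    int n = 0;
    while (map) {
        map &= map - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

For this system the earlier rule yields a map of 0x1f (5 online nodes) and the later rule yields 0x11 (2 online nodes), so the same machine produces a different node count depending on which rule the kernel follows.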

So I would like your suggestions on how to fix the problem. A few ideas I had:
1) If KERNEL_VERSION <= 2.6.16, increment vt->numnodes only if the bits of both node_online_map and cpu_online_map are set;
   if KERNEL_VERSION > 2.6.16, use only node_online_map.
	(This will only partly solve the problem.)
2) Alternatively, as is done in node_table_init(), raise the error only when CRASHDEBUG(2) is set; otherwise, update vt->numnodes with 'n'.
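
Option 2 could look roughly like the following C sketch. It models CRASHDEBUG(2) as "debug level of at least 2" and replaces error(FATAL, ...) with a return code so the logic can be exercised on its own; struct vm_table_model, enum sync_result and reconcile_numnodes() are hypothetical names, not crash's actual vm_table or API.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for crash's vm_table; only numnodes is modeled. */
struct vm_table_model {
    int numnodes;
};

enum sync_result { SYNC_OK, SYNC_ADJUSTED, SYNC_FATAL };

/* When the pgdat_list count n disagrees with numnodes, treat it as fatal
 * only at debug level 2 or higher; otherwise trust the traversal count. */
static enum sync_result reconcile_numnodes(struct vm_table_model *vt, int n,
                                           int debug_level)
{
    assert(vt != NULL && n >= 0);
    if (n == vt->numnodes)
        return SYNC_OK;
    if (debug_level >= 2) {
        fprintf(stderr, "numnodes out of sync with pgdat_list?\n");
        return SYNC_FATAL;   /* the real code would call error(FATAL, ...) */
    }
    vt->numnodes = n;        /* fall back to the pgdat_list count */
    return SYNC_ADJUSTED;
}
```

The trade-off is that a genuine inconsistency is silently accepted at normal debug levels, whereas option 1 keeps the check strict but has to encode the per-version rules.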

Please let me know your opinion.
Regards
Sharyathi Nagesh
