[Crash-utility] Seek error type: "tss_struct ist array" problem on 8-CPU AMD system

Mon May 14 19:20:47 UTC 2007

Dave,

I agree that we own it from this side to figure out where the rest of
the dump went.

Thank you again for your help,

Frank

________________________________

	From: crash-utility-bounces at redhat.com
[mailto:crash-utility-bounces at redhat.com] On Behalf Of Dave Anderson
	Sent: Monday, May 14, 2007 3:16 PM
	To: Discussion list for crash utility usage, maintenance and
development
	Subject: Re: [Crash-utility] Seek error type: "tss_struct ist array" problem on 8-CPU AMD system

	"Jansen, Frank" wrote: 

		> -----Original Message----- 
		> From: crash-utility-bounces at redhat.com 
		> [mailto:crash-utility-bounces at redhat.com] On Behalf Of
Dave Anderson 
		> Sent: Monday, May 14, 2007 12:22 PM 
		> To: Discussion list for crash utility usage,
maintenance and 
		> development 
		> Subject: Re: [Crash-utility] Seek error type: "tss_struct ist array" problem on 8-CPU AMD system
		> 
		> "Jansen, Frank" wrote: 
		> 
		> > Looking through the changelog, I saw that the 'tss_struct ist array'
		> > problem on 8-CPU systems had been addressed previously.  However, I'm
		> > running into this issue on an AMD server with crash 4.0-4.1 and RHEL4
		> > Update 5 (2.6.9-55.ELsmp).
		> > 
		> > The output from the crash invocation is the
following: 
		> > +++ 
		> > [root at well-rhel4564-ps3 dump]# /fpj/crash
System_map.2.6.9-55.ELsmp 
		> > vmlinux.debug.2.6.9-55.ELsmp ap3.1178895173.dmp 
		> > 
		> > crash 4.0-4.1 
		> > Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007
Red Hat, Inc. 
		> > Copyright (C) 2004, 2005, 2006  IBM Corporation
Copyright (C) 
		> > 1999-2006  Hewlett-Packard Co Copyright (C) 2005,
2006  Fujitsu 
		> > Limited Copyright (C) 2006, 2007  VA Linux Systems
Japan K.K. 
		> > Copyright (C) 2005  NEC Corporation 
		> > Copyright (C) 1999, 2002  Silicon Graphics, Inc. 
		> > Copyright (C) 1999, 2000, 2001, 2002  Mission
Critical Linux, Inc. 
		> > This program is free software, covered by the GNU
General Public 
		> > License, and you are welcome to change it and/or
distribute 
		> copies of 
		> > it under certain conditions.  Enter "help copying"
to see the 
		> > conditions. 
		> > This program has absolutely no warranty.  Enter
"help warranty" for 
		> > details. 
		> > 
		> > GNU gdb 6.1 
		> > Copyright 2004 Free Software Foundation, Inc. 
		> > GDB is free software, covered by the GNU General
Public 
		> License, and 
		> > you are welcome to change it and/or distribute
copies of it under 
		> > certain conditions. 
		> > Type "show copying" to see the conditions. 
		> > There is absolutely no warranty for GDB.  Type "show
warranty" for 
		> > details. 
		> > This GDB was configured as
"x86_64-unknown-linux-gnu"... 
		> > 
		> > crash: seek error: kernel virtual address:
10408119e84  type: 
		> > "tss_struct ist array" 
		> > --- 
		> > 
		> > The server has four dual-core AMD CPUs (2.8GHz) and 64GB of memory.
		> > 
		> > Any insights into how best to troubleshoot this are
much 
		> appreciated. 
		> > 
		> > Thanks, 
		> > 
		> > Frank Jansen 
		> 
		> I doubt this has anything to do with the 8-cpu issue. 
		> 
		I think that you are right, as the crash -d7 output seems to indicate
		that the dump may be incomplete (cf. attached crash -d7 output).

		> A few questions: 
		> 
		> Is this an RHEL4 derivative kernel of some kind?  I ask
		> because you're using a System.map file as an argument.

		> 
		It's a standard kernel, to which we add a couple of our
(Egenera) 
		drivers.  I can read the dump without the system map
argument, but was 
		just going off the data provided to me by the person
that ran into the 
		problem. 

		> Anyway, this dumpfile is Egenera's LKCD off-shoot,
correct? 
		> Since you got an "lseek" error, the question is
whether (1) 
		> the virtual address of 10408119e84 is legitimate, and
(2) 
		> whether it is included in your dumpfile. 

		I think that the virtual address is legitimate, but that
the dump is 
		incomplete at this point. 

		> 
		> What does "crash -d7 ..." show? 

		See attached output 

		> 
		> Does crash work on the live system? 
		Yes, it works

	Right -- if it works on the live system, there's a good chance that
	the page is simply missing from the dumpfile.  The tss_struct for each
	cpu is located in each cpu's per-cpu data area.  I have seen the
	exact same problem with x86_64 netdump "vmcore-incomplete" dumpfiles,
	where the per-cpu data areas, allocated with alloc_bootmem_node(),
	would tend to be located in very high physical memory (beyond the
	end of the vmcore-incomplete contents).

	On a 64GB system,  the virtual address of 10408119e84 (~16GB
physical) 
	would certainly not be out of the question.  And if it can be
read 
	on the live machine (crash -d7 will show the same address access

	sequence), then it's probably not included in the dumpfile for 
	whatever reason. 

	In fact, looking at the -d7 output, the level4_pgt pagetable pointers
	for each non-cpu0 cpu_pda get allocated with __get_free_pages() -- and
	there are a couple from the 10408xxxxxx virtual memory location:

	... 
	<readmem: ffffffff804ed700, KVADDR, "cpu_pda entry", 128, (FOE), 930580>
	CPU0: level4_pgt: ffffffff80101000 data_offset: 10087adef60
	<readmem: ffffffff804ed780, KVADDR, "cpu_pda entry", 128, (FOE), 930580>
	CPU1: level4_pgt: 1040802c000 data_offset: 10487bf8d60
	<readmem: ffffffff804ed800, KVADDR, "cpu_pda entry", 128, (FOE), 930580>
	CPU2: level4_pgt: 10408008000 data_offset: 10887bf8d60
	<readmem: ffffffff804ed880, KVADDR, "cpu_pda entry", 128, (FOE), 930580>
	CPU3: level4_pgt: 10bf9ff2000 data_offset: 10c87bfbf60
	<readmem: ffffffff804ed900, KVADDR, "cpu_pda entry", 128, (FOE), 930580>
	CPU4: level4_pgt: 10008028000 data_offset: 10087ae6f60
	<readmem: ffffffff804ed980, KVADDR, "cpu_pda entry", 128, (FOE), 930580>
	CPU5: level4_pgt: 10bf9f8a000 data_offset: 10487c00d60
	<readmem: ffffffff804eda00, KVADDR, "cpu_pda entry", 128, (FOE), 930580>
	CPU6: level4_pgt: 100f7f08000 data_offset: 10887c00d60
	<readmem: ffffffff804eda80, KVADDR, "cpu_pda entry", 128, (FOE), 930580>
	CPU7: level4_pgt: 107f9f8e000 data_offset: 10c87c03f60
	<readmem: 10008000084, KVADDR, "tss_struct ist array", 56, (FOE), 90c5b0>
	<readmem: 10408119e84, KVADDR, "tss_struct ist array", 56, (FOE), 90c5e8>
	crash: seek error: kernel virtual address: 10408119e84  type: "tss_struct ist array"

	They weren't *read* from there at that point, but it shows that
	there was memory in that neighborhood.  Anyway, the "seek error"
	from LKCD means that the physical page couldn't be found in the
	dumpfile by lkcd_lseek():

	/* 
	 *  Read from an LKCD formatted dumpfile. 
	 */ 
	int 
	read_lkcd_dumpfile(int fd, void *bufptr, int cnt, ulong addr, physaddr_t paddr)
	{ 
	        set_lkcd_fp(fp); 

	        if (!lkcd_lseek(paddr)) 
	                return SEEK_ERROR; 

	        if (lkcd_read((void *)bufptr, cnt) != cnt) 
	                return READ_ERROR; 

	        return cnt; 
	} 

	I can't really help you from that point on, though... 

	Dave 
