Kernel 2.6.9-55 issues

Troy Knabe knabe at 4j.lane.edu
Fri May 11 15:23:23 UTC 2007


The system boots and starts the kernel, then crashes. I wasn't watching the first boot. On subsequent boots it gets as far as the disk check (run because the system was not shut down cleanly), then crashes and reboots at varying points during that check. Thanks for any help you can provide.

lspci
00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
00:01.0 ISA bridge: nVidia Corporation CK804 ISA Bridge (rev a3)
00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
00:02.0 USB Controller: nVidia Corporation CK804 USB Controller (rev a2)
00:02.1 USB Controller: nVidia Corporation CK804 USB Controller (rev a3)
00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev f2)
00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev a2)
00:0a.0 Ethernet controller: nVidia Corporation CK804 Ethernet Controller (rev a3)
00:0b.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:0c.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:0d.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:05.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
04:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)

lsmod
Module                  Size  Used by
ipt_state               1985  1 
ip_conntrack           41077  1 ipt_state
ipt_multiport           2113  3 
ipt_LOG                 6593  1 
iptable_filter          3009  1 
ip_tables              17601  4 ipt_state,ipt_multiport,ipt_LOG,iptable_filter
parport_pc             24833  0 
lp                     12333  0 
parport                37513  2 parport_pc,lp
autofs4                25157  0 
i2c_dev                11585  0 
i2c_core               22337  1 i2c_dev
sunrpc                163237  1 
dm_mirror              30893  0 
dm_mod                 59989  1 dm_mirror
button                  6737  0 
battery                 9029  0 
ac                      4933  0 
md5                     4161  1 
ipv6                  235777  39 
joydev                 10497  0 
ohci_hcd               21841  0 
ehci_hcd               31301  0 
forcedeth              24001  0 
tg3                   107077  0 
ext3                  117193  3 
jbd                    71385  1 ext3
sata_nv                 9541  4 
libata                 66333  1 sata_nv
sd_mod                 17217  5 
scsi_mod              122445  2 libata,sd_mod

 

-----Original Message-----
From: redhat-list-bounces at redhat.com [mailto:redhat-list-bounces at redhat.com] On Behalf Of George Magklaras
Sent: Friday, May 11, 2007 1:27 AM
To: General Red Hat Linux discussion list
Subject: Re: Kernel 2.6.9-55 issues

Troy, what is your disk subsystem on the x2200? At what point does it fail to boot? Does it reach the bootloader and at least start the kernel? Also, could you run 'lspci' and 'lsmod' under your good kernel and post the output?


##The following is a guess##
I don't have that kind of Sun kit, but there are all sorts of references to stability problems with AMD-based chipsets. Also, FYI, there is a kernel panic report for that kernel here:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=239484

This bug report concerns the Error Detection And Correction (EDAC) modules (hence the lsmod request). The messages come from the edac kernel module deciding that something is wrong with the bus or the memory. Your x2200 probably panics (are there any console messages during the boot failure?) because there is an option that makes the kernel panic when EDAC detects parity errors. On the x4100s that boot but log the EDAC messages, run lsmod and grep -i for edac. The bug report seems to mention a 'noedac' boot option, but I am not sure about it.

On the x4100s that produce the EDAC messages, check whether /etc/modprobe.conf contains any references to the edac modules. If it does, try removing them and see if that makes a difference.
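The two checks above (lsmod filtered for edac, and a look through /etc/modprobe.conf) could be scripted roughly as follows. This is only a sketch: it assumes Python 3.7 or newer is available on the host, which nothing in this thread confirms, and the function names edac_modules and edac_config_lines are made up for illustration. Unlike a plain grep -i on the lsmod output, it matches on the module name column only.

```python
import subprocess


def edac_modules(lsmod_text):
    """Return names of loaded modules whose name contains 'edac' (any case)."""
    names = []
    for line in lsmod_text.splitlines():
        fields = line.split()
        # First column of lsmod output is the module name.
        if fields and "edac" in fields[0].lower():
            names.append(fields[0])
    return names


def edac_config_lines(conf_text):
    """Return non-comment lines of modprobe.conf-style text that mention edac."""
    hits = []
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#") and "edac" in stripped.lower():
            hits.append(stripped)
    return hits


if __name__ == "__main__":
    try:
        out = subprocess.run(["lsmod"], capture_output=True, text=True).stdout
        print("loaded EDAC modules:", edac_modules(out))
    except OSError:
        print("could not run lsmod")
    try:
        with open("/etc/modprobe.conf") as fh:
            print("modprobe.conf references:", edac_config_lines(fh.read()))
    except OSError:
        print("could not read /etc/modprobe.conf")
```

An empty list from both checks would suggest the EDAC modules are neither loaded nor configured on that kernel. It does not rule out EDAC support compiled into the kernel itself.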

GM


Troy Knabe wrote:
> I upgraded from the 2.6.9-42 kernel to 2.6.9-55 over the weekend and have had issues with 3 servers. One x2200 (AMD 148 processor) won't boot. Two x4100s, each with 2 dual-core AMD Opteron(tm) 285 processors, are spewing the errors below, but if I reboot them with the old 2.6.9-42 kernel I don't get any of them. Is anyone else experiencing issues with the new kernel?
>  
> thanks
> -Troy
>  
> May  9 16:25:43 hostname kernel: EDAC k8 MC0: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
> May  9 16:25:43 hostname kernel: MC0: CE page 0xc, offset 0x108, grain 8, syndrome 0x4b39, row 0, channel 1, label "": k8_edac
> May  9 16:25:43 hostname kernel: MC0: CE - no information available: k8_edac Error Overflow set
> May  9 16:25:43 hostname kernel: EDAC k8 MC0: extended error code: ECC chipkill x4 error
> May  9 16:25:44 hostname kernel: EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
> May  9 16:25:44 hostname kernel: MC0: CE page 0x1f1, offset 0x0, grain 8, syndrome 0x28d8, row 3, channel 1, label "": k8_edac
> May  9 16:25:44 hostname kernel: MC0: CE - no information available: k8_edac Error Overflow set
> May  9 16:25:45 hostname kernel: EDAC k8 MC0: extended error code: ECC chipkill x4 error
> May  9 16:25:46 hostname kernel: EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
> May  9 16:25:46 hostname kernel: MC0: CE page 0x1f1, offset 0x0, grain 8, syndrome 0x28d8, row 3, channel 1, label "": k8_edac
> May  9 16:25:46 hostname kernel: MC0: CE - no information available: k8_edac Error Overflow set
> May  9 16:25:46 hostname kernel: EDAC k8 MC0: extended error code: ECC chipkill x4 error
> May  9 16:25:47 hostname kernel: EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
> May  9 16:25:47 hostname kernel: MC0: CE page 0x138, offset 0xac0, grain 8, syndrome 0xeeff, row 0, channel 1, label "": k8_edac
> May  9 16:25:47 hostname kernel: MC0: CE - no information available: k8_edac Error Overflow set
> May  9 16:25:47 hostname kernel: EDAC k8 MC0: extended error code: ECC chipkill x4 error
>  
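To see whether the correctable errors (CE) quoted above cluster on one memory location, the "CE page" entries can be tallied by controller, row and channel. The sketch below is one possible way to do that, not a tool referenced in this thread. The names CE_RE, normalize and count_ce_by_location are hypothetical, and the pattern is inferred only from the log lines in this message, so other kernels may format these entries differently.

```python
import re
from collections import Counter

# Pattern inferred from the k8_edac "CE page" lines quoted in this thread.
CE_RE = re.compile(
    r"MC(?P<mc>\d+): CE page 0x[0-9a-f]+, offset 0x[0-9a-f]+, "
    r"grain \d+, syndrome 0x[0-9a-f]+, "
    r"row (?P<row>\d+), channel (?P<channel>\d+)"
)


def normalize(log_text):
    """Undo mail wrapping: drop leading '>' quote marks and collapse whitespace."""
    lines = [re.sub(r"^>\s?", "", ln) for ln in log_text.splitlines()]
    return re.sub(r"\s+", " ", " ".join(lines))


def count_ce_by_location(log_text):
    """Count CE entries per (memory controller, row, channel)."""
    counts = Counter()
    for m in CE_RE.finditer(normalize(log_text)):
        counts[(int(m["mc"]), int(m["row"]), int(m["channel"]))] += 1
    return counts


if __name__ == "__main__":
    import sys
    # Usage (hypothetical): python ce_count.py < /var/log/messages
    for (mc, row, channel), n in sorted(count_ce_by_location(sys.stdin.read()).items()):
        print(f"MC{mc} row {row} channel {channel}: {n}")
```

Because it searches the joined text rather than individual lines, it also copes with entries that were fused or wrapped in transit. A line break that falls in the middle of a word inside a CE entry would still defeat it.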

--
George Magklaras

Senior Computer Systems Engineer/UNIX Systems Administrator
EMBnet Technical Management Board
The Biotechnology Centre of Oslo, University of Oslo
http://www.biotek.uio.no/

EMBnet Norway:	http://www.no.embnet.org/





