AMD64 Northbridge errors
Marcelino Mata
mmata at multimatic.com
Mon Nov 28 23:27:49 UTC 2005
Running RHEL 3.0 x86_64 U6 (2.4.21-37.ELsmp)
I have searched and logged calls with both HP and Red Hat support, but
have turned up nothing conclusive. HP says I have memory problems; Red
Hat says it is a known non-critical error.
I am not sure if I am chasing the right problem, but all six of my AMD64
HP xw9300 workstations (based on the Tyan Thunder K8WE?), each with two
Opteron CPUs and between 4 and 16 GB of RAM, log the following errors:
Nov 10 17:18:46 node4 kernel: CPU 0: Silent Northbridge MCE
Nov 10 17:18:46 node4 kernel: Northbridge status 94044100:ac080a13
Nov 10 17:18:46 node4 kernel: Error chipkill ecc error
Nov 10 17:18:46 node4 kernel: ECC error syndrome ac08
Nov 10 17:18:46 node4 kernel: bus error local node response, request didn't time out
Nov 10 17:18:46 node4 kernel: generic read
Nov 10 17:18:46 node4 kernel: memory access, level generic
Nov 10 17:18:46 node4 kernel: link number 0
Nov 10 17:18:46 node4 kernel: dram scrub error
Nov 10 17:18:46 node4 kernel: corrected ecc error
Nov 10 17:18:46 node4 kernel: previous error lost
Nov 10 17:18:46 node4 kernel: NB error address 000000000126dd40
Nov 14 19:14:16 node4 kernel: CPU 0: Silent Northbridge MCE
Nov 14 19:14:16 node4 kernel: Northbridge status a6000001:0005001b
Nov 14 19:14:16 node4 kernel: Error gart error
Nov 14 19:14:16 node4 kernel: GART TLB error generic level generic
Nov 14 19:14:16 node4 kernel: err cpu1
Nov 14 19:14:16 node4 kernel: processor context corrupt
Nov 14 19:14:16 node4 kernel: error uncorrected
Nov 14 19:14:16 node4 kernel: previous error lost
Nov 14 19:14:16 node4 kernel: NB error address 00000000dffe0038
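As a cross-check on the two "Northbridge status" words above, here is a minimal Python sketch that decodes them by hand. The bit positions and extended error codes are assumptions based on my reading of AMD's published K8 machine-check status layout, not taken from the kernel source, and should be verified against AMD's documentation. The names decode_nb_status, HIGH_FLAGS and EXT_CODES are made up for illustration.

```python
# Hedged sketch: decode a "high:low" Northbridge status pair as printed by
# the kernel. All bit positions below are assumptions to be verified.

# Flags in the HIGH 32-bit word (assumed layout).
HIGH_FLAGS = {
    31: "valid",
    30: "overflow",
    29: "uncorrected",
    28: "enabled",
    27: "misc valid",
    26: "address valid",
    25: "processor context corrupt",
    14: "corrected ECC",
    13: "uncorrected ECC",
    8: "DRAM scrub",
}

# Extended error code, bits 19:16 of the LOW word (assumed values).
EXT_CODES = {0x0: "ECC", 0x5: "GART", 0x8: "chipkill ECC"}


def decode_nb_status(text):
    """Decode a string such as '94044100:ac080a13' into a small dict."""
    high_s, low_s = text.split(":")
    high, low = int(high_s, 16), int(low_s, 16)

    # Collect the names of all flag bits set in the high word, highest first.
    flags = [name
             for bit, name in sorted(HIGH_FLAGS.items(), reverse=True)
             if high & (1 << bit)]

    ext = (low >> 16) & 0xF

    # Syndrome assembled from low[31:24] and high[22:15]; this combination
    # reproduces the "ECC error syndrome" value the kernel printed.
    syndrome = (((low >> 24) & 0xFF) << 8) | ((high >> 15) & 0xFF)

    return {
        "flags": flags,
        "ext": EXT_CODES.get(ext, "unknown (0x%x)" % ext),
        "syndrome": "%04x" % syndrome,
        "error_code": low & 0xFFFF,
    }
```

Calling decode_nb_status("94044100:ac080a13") under these assumptions reports a corrected chipkill ECC scrub error with syndrome ac08, and decode_nb_status("a6000001:0005001b") reports an uncorrected GART error, which agrees with the text the kernel logged for each event.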
Five of the computers have logged between 1 and 30 of these error
messages in the past 3 weeks. One computer has logged over 30,000. Most
of the messages come from computers with more than 4 GB of RAM, but I
have also seen them on computers with only 4 GB.
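For anyone wanting to produce the same per-machine tallies, here is a minimal Python sketch that counts Northbridge MCE events per host and error type from syslog lines in the format shown above. The function name count_nb_events and the two regular expressions are illustrative assumptions. Each "Silent Northbridge MCE" header is assumed to be followed by one "Error ..." line, as in the excerpts.

```python
import re
from collections import Counter

# Header line that starts each event, e.g.
# "Nov 10 17:18:46 node4 kernel: CPU 0: Silent Northbridge MCE"
HEADER = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(\S+)\s+kernel:\s+CPU \d+: Silent Northbridge MCE"
)

# Error-type line that follows it, e.g.
# "Nov 10 17:18:46 node4 kernel: Error chipkill ecc error"
ERRTYPE = re.compile(r"kernel:\s+Error\s+(.+?)\s*$")


def count_nb_events(lines):
    """Return a Counter keyed by (host, error type)."""
    counts = Counter()
    host = None  # host of the event whose "Error" line we are waiting for
    for line in lines:
        header = HEADER.search(line)
        if header:
            host = header.group(1)
            continue
        errtype = ERRTYPE.search(line)
        if errtype and host is not None:
            counts[(host, errtype.group(1))] += 1
            host = None  # count each event once
    return counts
```

Feeding it the lines of a syslog file (for example, an open file object for wherever your syslog writes kernel messages) gives one count per host and error type, which makes it easy to see whether the ECC or the GART messages dominate on the machine with 30,000 entries.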
The main reason I am focusing on these messages is that the computers
have crashed numerous times since being put online. The computer with
30K instances of the error message has crashed about 1-2 times per week.
I am running the latest BIOS.
I cannot turn on diskdump because the machines have Nvidia SATA
controllers (not supported by diskdump), and netdump has not produced
anything because no data was written during the kernel crash (did the
network driver go down?).
Has anyone else seen these messages, or does anyone have an idea how to
identify the problem? Could my crashes be due to the Northbridge errors,
or am I barking up the wrong tree?
Marcelino
Reference Information below
lspci information
-----------------
00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller
(rev a3)
00:01.0 ISA bridge: nVidia Corporation CK804 ISA Bridge (rev a3)
00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
00:02.0 USB Controller: nVidia Corporation CK804 USB Controller (rev a2)
00:02.1 USB Controller: nVidia Corporation CK804 USB Controller (rev a3)
00:04.0 Multimedia audio controller: nVidia Corporation CK804 AC'97
Audio Controller (rev a2)
00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev f2)
00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller
(rev f3)
00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller
(rev f3)
00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev a2)
00:0a.0 Ethernet controller: nVidia Corporation CK804 Ethernet
Controller (rev a3)
00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
Miscellaneous Control
00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
HyperTransport Technology Configuration
00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
Address Map
00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
DRAM Controller
00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
Miscellaneous Control
05:05.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A
IEEE-1394a-2000 Controller (PHY/Link)
0a:00.0 VGA compatible controller: nVidia Corporation NV41GL [Quadro FX
1400] (rev a2)
40:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge
(rev 12)
40:01.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
40:02.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge
(rev 12)
40:02.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
61:06.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X
Fusion-MPT Dual Ultra320 SCSI (rev 07)
61:06.1 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X
Fusion-MPT Dual Ultra320 SCSI (rev 07)
61:09.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5782
Gigabit Ethernet (rev 03)
80:00.0 Memory controller: nVidia Corporation CK804 Memory Controller
(rev a3)
80:01.0 Memory controller: nVidia Corporation CK804 Memory Controller
(rev a3)
lsmod
-----
Module                  Size  Used by    Tainted: P
nfs                    95984   7  (autoclean)
audit                 127208   2  (autoclean)
nfsd                   86096   8  (autoclean)
lockd                  60528   1  (autoclean) [nfs nfsd]
sunrpc                 91944   1  (autoclean) [nfs nfsd lockd]
netconsole             19208   0  (unused)
autofs4                16912   2  (autoclean)
tg3                    69936   1
nvnet                  71168   1
sg                     37880   0  (autoclean)
sr_mod                 17676   0  (autoclean)
ide-scsi               12832   0
ide-cd                 34408   0
cdrom                  33096   0  [sr_mod ide-cd]
keybdev                 3104   0  (unused)
mousedev                6728   0  (unused)
hid                    21992   0  (unused)
input                   7520   0  [keybdev mousedev hid]
ehci-hcd               21200   0  (unused)
usb-ohci               22864   0  (unused)
usbcore                85152   1  [hid ehci-hcd usb-ohci]
ext3                   87856   2
jbd                    57088   2  [ext3]
raid0                   4368   1
sata_nv                 5116   5
libata                 49352   0  [sata_nv]
mptscsih               43792   0  (unused)
mptbase                50472   3  [mptscsih]
diskdumplib             6548   0  [mptscsih mptbase]
sd_mod                 14964  10
scsi_mod              130124   6  [sg sr_mod ide-scsi sata_nv libata mptscsih sd_mod]