[dm-devel] kernel update and dmraid causing grub errors

David C. Rankin drankinatty at suddenlinkmail.com
Mon Nov 1 22:27:16 UTC 2010


dmraid devs,

	Over the past 8-9 months I have had numerous dmraid-related boot failures
across the last 6-8 kernels. It behaves like Russian roulette: some kernels
boot with dmraid, others fail with grub errors. The problem is most acute on an
MSI SLI Platinum-based board (MS-7374) with a Phenom X4 (9850) and the
following PCI bus configuration:

[15:48 archangel:/home/david/bugs/aa] # lspci
00:00.0 RAM memory: nVidia Corporation MCP78S [GeForce 8200] Memory Controller (rev a2)
00:01.0 ISA bridge: nVidia Corporation MCP78S [GeForce 8200] LPC Bridge (rev a2)
00:01.1 SMBus: nVidia Corporation MCP78S [GeForce 8200] SMBus (rev a1)
00:01.2 RAM memory: nVidia Corporation MCP78S [GeForce 8200] Memory Controller (rev a1)
00:01.3 Co-processor: nVidia Corporation MCP78S [GeForce 8200] Co-Processor (rev a2)
00:01.4 RAM memory: nVidia Corporation MCP78S [GeForce 8200] Memory Controller (rev a1)
00:02.0 USB Controller: nVidia Corporation MCP78S [GeForce 8200] OHCI USB 1.1 Controller (rev a1)
00:02.1 USB Controller: nVidia Corporation MCP78S [GeForce 8200] EHCI USB 2.0 Controller (rev a1)
00:04.0 USB Controller: nVidia Corporation MCP78S [GeForce 8200] OHCI USB 1.1 Controller (rev a1)
00:04.1 USB Controller: nVidia Corporation MCP78S [GeForce 8200] EHCI USB 2.0 Controller (rev a1)
00:06.0 IDE interface: nVidia Corporation MCP78S [GeForce 8200] IDE (rev a1)
00:07.0 Audio device: nVidia Corporation MCP72XE/MCP72P/MCP78U/MCP78S High Definition Audio (rev a1)
00:08.0 PCI bridge: nVidia Corporation MCP78S [GeForce 8200] PCI Bridge (rev a1)
00:09.0 RAID bus controller: nVidia Corporation MCP78S [GeForce 8200] SATA Controller (RAID mode) (rev a2)
00:0a.0 Ethernet controller: nVidia Corporation MCP77 Ethernet (rev a2)
00:10.0 PCI bridge: nVidia Corporation MCP78S [GeForce 8200] PCI Express Bridge (rev a1)
00:12.0 PCI bridge: nVidia Corporation MCP78S [GeForce 8200] PCI Express Bridge (rev a1)
00:13.0 PCI bridge: nVidia Corporation MCP78S [GeForce 8200] PCI Bridge (rev a1)
00:14.0 PCI bridge: nVidia Corporation MCP78S [GeForce 8200] PCI Bridge (rev a1)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K10 [Opteron, Athlon64, Sempron] HyperTransport Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K10 [Opteron, Athlon64, Sempron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K10 [Opteron, Athlon64, Sempron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K10 [Opteron, Athlon64, Sempron] Miscellaneous Control
00:18.4 Host bridge: Advanced Micro Devices [AMD] K10 [Opteron, Athlon64, Sempron] Link Control
01:06.0 Serial controller: 3Com Corp, Modem Division 56K FaxModem Model 5610 (rev 01)
01:09.0 FireWire (IEEE 1394): VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller (rev c0)
02:00.0 VGA compatible controller: nVidia Corporation G92 [GeForce 8800 GT] (rev a2)
04:00.0 SATA controller: JMicron Technology Corp. JMB362/JMB363 Serial ATA Controller (rev 03)
04:00.1 IDE interface: JMicron Technology Corp. JMB362/JMB363 Serial ATA Controller (rev 03)

full dmidecode information at:
  http://www.3111skyline.com/dl/Archlinux/bugs/aa-dmidecode.txt

	Booting the current Arch Linux kernel (2.6.35.8-1) fails; the boot hangs at
the very start. The kernel line I use hasn't changed in a long time:

  kernel /vmlinuz root=/dev/mapper/nvidia_baaccajap5 ro vga=0x31a

	Booting first stopped with the following error:

Booting 'Arch Linux on Archangel'

root (hd1,5)
  Filesystem type is ext2fs, Partition type 0x83
Kernel /vmlinuz26 root=/dev/mapper/nvidia_baacca_jap5 ro vga=794

Error 24: Attempt to access block outside partition

Press any key to continue...
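
For anyone comparing the kernel line quoted from menu.lst with the line GRUB
echoes in the error output, the vga= values are written in different notations
(0x31a versus 794). A minimal sketch, assuming only that both strings are the
same mode number written in hex and in decimal; the helper name parse_vga is
made up for illustration:

```python
# vga= mode as written in the kernel line (hex) and as echoed by GRUB (decimal).
MENU_LST_VGA = "0x31a"
GRUB_ECHO_VGA = "794"


def parse_vga(value):
    """Parse a mode number given either in decimal or as 0x-prefixed hex."""
    # Base 0 lets int() pick the base from the prefix.
    return int(value, 0)


same_mode = parse_vga(MENU_LST_VGA) == parse_vga(GRUB_ECHO_VGA)
print(parse_vga(MENU_LST_VGA), parse_vga(GRUB_ECHO_VGA), same_mode)
```

So the differing vga= notation is not itself a change in the boot configuration.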

	Upgrading to device-mapper-2.02.75-1 completely changes the error to:

Error 5: Partition table invalid or corrupt

	Rebooting into 2.6.35.7-1 or 2.6.32.25-1 (the Arch LTS kernel) works just fine,
so the partitions and the partition table themselves are not the problem. The
Arch Linux developer (Tobias Powalowski) referred me here because the problem
isn't in the kernel itself but is something strange happening with dmraid.

	The only guess I have is that it is a dmraid/GeForce controller issue that is
triggered when dmraid loads under certain circumstances.

	This box has 2 dmraid arrays:

[17:15 archangel:/home/david/bugs/aa] # dmraid -r
/dev/sdd: nvidia, "nvidia_baaccaja", mirror, ok, 1465149166 sectors, data@ 0
/dev/sda: nvidia, "nvidia_fdaacfde", mirror, ok, 976773166 sectors, data@ 0
/dev/sdb: nvidia, "nvidia_baaccaja", mirror, ok, 1465149166 sectors, data@ 0
/dev/sdc: nvidia, "nvidia_fdaacfde", mirror, ok, 976773166 sectors, data@ 0

[17:15 archangel:/home/david/bugs/aa] # dmraid -s
*** Active Set
name   : nvidia_baaccaja
size   : 1465149056
stride : 128
type   : mirror
status : ok
subsets: 0
devs   : 2
spares : 0
*** Active Set
name   : nvidia_fdaacfde
size   : 976773120
stride : 128
type   : mirror
status : ok
subsets: 0
devs   : 2
spares : 0
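
The two listings can be cross-checked mechanically. The following is a rough
sketch, not part of dmraid: it parses the `dmraid -r` lines quoted above, groups
the member disks by set name, and compares the per-disk sector counts with the
set sizes from `dmraid -s`. The helper names (parse_members, round_down) are
made up for illustration, and the idea that the set size is the member size
rounded down to a multiple of the stride is only an observation that happens to
fit these numbers, not documented dmraid behaviour.

```python
from collections import defaultdict

# Output of `dmraid -r` as quoted above.
DMRAID_R = """\
/dev/sdd: nvidia, "nvidia_baaccaja", mirror, ok, 1465149166 sectors, data@ 0
/dev/sda: nvidia, "nvidia_fdaacfde", mirror, ok, 976773166 sectors, data@ 0
/dev/sdb: nvidia, "nvidia_baaccaja", mirror, ok, 1465149166 sectors, data@ 0
/dev/sdc: nvidia, "nvidia_fdaacfde", mirror, ok, 976773166 sectors, data@ 0
"""

# Set sizes and stride as reported by `dmraid -s` above.
SET_SIZES = {"nvidia_baaccaja": 1465149056, "nvidia_fdaacfde": 976773120}
STRIDE = 128


def parse_members(text):
    """Return {set_name: [(device, sectors), ...]} from `dmraid -r` output."""
    sets = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        device, rest = line.split(":", 1)
        fields = [f.strip() for f in rest.split(",")]
        name = fields[1].strip('"')          # e.g. nvidia_baaccaja
        sectors = int(fields[4].split()[0])  # "1465149166 sectors" -> int
        sets[name].append((device, sectors))
    return dict(sets)


def round_down(sectors, stride=STRIDE):
    """Round a sector count down to a multiple of the stride."""
    return sectors - sectors % stride


members = parse_members(DMRAID_R)
for name, disks in sorted(members.items()):
    devices = [dev for dev, _ in disks]
    sizes = {sectors for _, sectors in disks}
    # A healthy two-disk mirror should have two members of equal size.
    consistent = len(disks) == 2 and len(sizes) == 1
    fits = round_down(min(sizes)) == SET_SIZES[name]
    print(name, devices, "members consistent:", consistent,
          "set size fits stride rounding:", fits)
```

On the figures above, both mirrors have two equally sized members, and each set
size equals the member size rounded down to a multiple of 128 sectors.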

	All disks check out fine with smartctl, so it isn't a disk-hardware problem.
The detailed information on the GeForce controller (lspci -vv) is:

00:09.0 RAID bus controller: nVidia Corporation MCP78S [GeForce 8200] SATA Controller (RAID mode) (rev a2)
        Subsystem: Micro-Star International Co., Ltd. Device 7374
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0 (750ns min, 250ns max)
        Interrupt: pin A routed to IRQ 28
        Region 0: I/O ports at b080 [size=8]
        Region 1: I/O ports at b000 [size=4]
        Region 2: I/O ports at ac00 [size=8]
        Region 3: I/O ports at a880 [size=4]
        Region 4: I/O ports at a800 [size=16]
        Region 5: Memory at f9e76000 (32-bit, non-prefetchable) [size=8K]
        Capabilities: [44] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [8c] SATA HBA v1.0 InCfgSpace
        Capabilities: [b0] MSI: Enable+ Count=1/8 Maskable- 64bit+
                Address: 00000000fee0f00c  Data: 4191
        Capabilities: [ec] HyperTransport: MSI Mapping Enable+ Fixed+
        Kernel driver in use: ahci
        Kernel modules: ahci


    Basically, I'm stumped here. Nothing has changed on this box in over a year
(same grub menu.lst, same hardware). The only oddity is that about 4 of the
last 6 kernels have failed to boot with this weird grub error. It appears to
have nothing to do with grub itself (grub boots all the other kernels fine) and
instead seems to result from dmraid and the way it gets initialized (which I'm
clueless about).

    Let me know what you think, and what data or testing you would like from
me; I'll be happy to provide it. I last filed this bug with Arch against
2.6.35-1. It was never actually fixed, only "solved" by upgrading to the next
(testing) kernel, so the underlying cause was never found. The URL of the
closed report is:

https://bugs.archlinux.org/task/20918?

    Thanks for any ideas or help you can give.

-- 
David C. Rankin, J.D.,P.E.
Rankin Law Firm, PLLC
510 Ochiltree Street
Nacogdoches, Texas 75961
Telephone: (936) 715-9333
Facsimile: (936) 715-9339
www.rankinlawfirm.com



