2 different FC2 issues: >4GB of RAM, and dead USB ports

Sat Feb 5 01:16:40 UTC 2005

You'll have to excuse the lack of continuity in this message; this is 
cut&pasted from a much longer msg posted to the IBM xcat 
(XtremeClusterAdminTool) user's forums.  I didn't want to drown everyone 
in off-topic noise.

------------------------------------------------------
Issue 1 - FC2 and >4GB of ram panics the kernel
------------------------------------------------------
We're now able to install rhfc2 on the new blades.  Installation 
finishes and the blades reboot to a login prompt, but the kernel panics 
and crashes within a few minutes.  Pulling out 1GB of RAM, leaving 4GB, 
stops the panics.  Installing rhes3 on the blades with 5GB installed 
also produces _no_ kernel panics, so it's not bad RAM.  The problem 
seems to be specific to rhfc2 with >4GB of RAM.

Kernel version on the blades is 2.6.5-1.358smp, release version is 
"Fedora Core release 2 (Tettnang)".
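
For anyone wanting to check a blade quickly, here is a rough sketch of 
how the "more than 4GB" condition could be tested from a shell.  The 
function name over_4gb is made up for illustration; it only reads 
/proc/meminfo-style text on stdin and compares MemTotal (reported in kB) 
against 4GB.  Note that MemTotal is what the running kernel sees, which 
can be somewhat less than the RAM physically installed.

```shell
#!/bin/sh
# Hypothetical helper: report whether MemTotal exceeds 4GB.
# Input: /proc/meminfo-style text on stdin (MemTotal is in kB).
over_4gb() {
    awk '/^MemTotal:/ {
             found = 1
             # 4GB expressed in kB
             if ($2 > 4 * 1024 * 1024) print "over"; else print "at-or-under"
         }
         END { if (!found) print "unknown" }'
}

# Typical use on a blade, alongside the version info quoted above:
#   uname -r
#   cat /etc/redhat-release
#   over_4gb < /proc/meminfo
```

Running it on both a 4GB and a 5GB blade would at least confirm what 
the FC2 kernel thinks it has before it panics.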

Should I bother re-compiling a kernel on a blade with 4GB and rhfc2 
installed, and then get xcat to install it onto the blades as a postscript?

Or is there something else I should try first?

------------------------------------------------------
Issue 3 - FC2 kernel kills the USB ports on the blades, rendering the 
local console unresponsive.  Not xcat-related at all: an install of rhes3 
on the blades does not exhibit this behavior, so it's definitely FC2-related.
------------------------------------------------------
This is a strange one, and I can't figure out what's causing it...

I experienced the same issue when I built my 336 mgmt node.  I had a 
PS/2 keyboard and USB mouse during the OS installation, and the mouse 
was responsive up to the grub stage.  Once the kernel booted, the 
LED on the optical mouse would go out and the mouse would be dead.  If 
I was using a USB keyboard, it would be dead, too.  All the USB ports 
were dead.

I couldn't troubleshoot it within an hour of poking around on the 
redhat/fedora user forums, so I bought rhes3 for the 336 and made all my 
problems go away.  Now I see the same problem on my new blades; I knew 
the fix on the 336 was too easy :-)   But running Redhat Enterprise 
would be too expensive for a 300-blade installation, even if I only 
bought rhws.

[root@r01c01b01 configs]# dmesg | grep -i someth
uhci_hcd 0000:00:1d.0: host controller process error, something bad happened!
uhci_hcd 0000:00:1d.1: host controller process error, something bad happened!

[root@r01c01b01 configs]# lspci | grep -i usb
00:1d.0 USB Controller: Intel Corp. 6300ESB USB Universal Host Controller (rev 02)
00:1d.1 USB Controller: Intel Corp. 6300ESB USB Universal Host Controller (rev 02)

/var/log/messages on the xcat mgmt node shows:
=====================================
Feb  4 11:43:19 r01c01b01 kernel: uhci_hcd 0000:00:1d.0: host controller process error, something bad happened!
Feb  4 11:43:19 r01c01b01 kernel: uhci_hcd 0000:00:1d.0: host controller halted, very bad!
Feb  4 11:43:22 r01c01b01 kernel: uhci_hcd 0000:00:1d.0: host controller process error, something bad happened!
Feb  4 11:43:22 r01c01b01 kernel: uhci_hcd 0000:00:1d.0: host controller halted, very bad!
Feb  4 11:43:24 r01c01b01 kernel: uhci_hcd 0000:00:1d.0: host controller process error, something bad happened!
Feb  4 11:43:24 r01c01b01 kernel: uhci_hcd 0000:00:1d.0: host controller halted, very bad!
Feb  4 11:43:24 r01c01b01 kernel: usb 1-2: control timeout on ep0out
Feb  4 11:43:26 r01c01b01 kernel: usb 1-2: device not accepting address 2, error -110
=====================================
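
With 300 blades logging to one place, it may help to see which USB 
controllers are throwing these errors and how often.  Below is a rough 
sketch, not something I have run against the real logs: the function 
name count_uhci_errors is made up, and it assumes each log entry sits on 
a single line, as it would in the actual /var/log/messages.

```shell
#!/bin/sh
# Hypothetical helper: count uhci_hcd "host controller" error lines
# per PCI address.  Input: syslog- or dmesg-style text on stdin.
count_uhci_errors() {
    awk '/uhci_hcd/ && /host controller/ {
             for (i = 1; i <= NF; i++)
                 if ($i == "uhci_hcd") {
                     addr = $(i + 1)        # e.g. "0000:00:1d.0:"
                     sub(/:$/, "", addr)    # drop the trailing colon
                     n[addr]++
                 }
         }
         END { for (a in n) print a, n[a] }' | sort
}

# Typical use on the mgmt node:
#   count_uhci_errors < /var/log/messages
#   grep r01c01b01 /var/log/messages | count_uhci_errors
```

Each output line is a PCI address followed by the number of matching 
error lines, so a blade where only one controller misbehaves would 
stand out.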

-- 
John Burk

Sr. Technical Director
Mainframe Entertainment

604.628.1019