IO lockups and ext3 read-only file corruption on RHEL4 (pre and post U4)
tweeks
tweeks at rackspace.com
Tue Sep 5 20:53:52 UTC 2006
Has anyone been seeing IO lockup problems on EL4?
I've tried multiple IO scheduler options (elevator=) on the kernel boot line,
and I'm seeing the same behavior regardless. It is also independent of
hardware. Whitebox ATA, HA enclosure with dedicated SCSI, megaraid RAID
hardware, Dell 2850s... same behavior:
A semi-busy system will suddenly go into some kind of IO la-la land where
nothing can be written to disk for more than an hour. Of course when this
happens, the ext3 kernel module freaks out and remounts all the filesystems
as read-only. Then when the system is rebooted, if the system is allowed to
fsck, the journal is hosed and the filesystem eats itself. Moving these
systems off the RH kernel altogether seems to fix the problem, but I have not
found a way to reproduce the problem yet (burn-in and stress testing don't
seem to make it appear), so real re-testing is difficult at best.
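Since the problem can't be triggered on demand, one way to at least catch it
when it happens is to watch for ext3 filesystems that have flipped to
read-only. The following is only a minimal sketch, not something taken from
the affected systems: it assumes a Linux /proc/mounts in the usual
"device mountpoint fstype options dump pass" layout and a Python 3
interpreter (which is not what ships with RHEL4), and the function name
readonly_ext3_mounts is made up for illustration.

```python
import os


def readonly_ext3_mounts(mounts_text):
    """Return mount points of ext3 filesystems mounted with the 'ro' option.

    mounts_text is the contents of /proc/mounts (assumed layout:
    device, mount point, fs type, comma-separated options, dump, pass).
    """
    readonly = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        # Match 'ro' as a whole option so 'errors=remount-ro' is not counted.
        if fs_type == "ext3" and "ro" in options.split(","):
            readonly.append(mount_point)
    return readonly


if __name__ == "__main__" and os.path.exists("/proc/mounts"):
    with open("/proc/mounts") as f:
        for mount_point in readonly_ext3_mounts(f.read()):
            print("ext3 mounted read-only:", mount_point)
```

Run from cron or a monitoring agent, something like this would give a
timestamp for when the remount happened, which could then be lined up against
the kernel log. It only detects the symptom; it says nothing about the cause.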
It's become such a big problem that we're moving some customers that require
rock solid systems either over to RHEL3, or off RH and over to SLES or another
distro with a non-RH kernel.
Just the ext3 problem (minus the IO lockup part) can be seen in other BZ
tickets:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=175877
(when the filesystem fills up)
Has anyone seen this type of IO lockup + ext3 corruption on RHEL4?
Can you reproduce it?
Tweeks
More information about the Ext3-users
mailing list