IO lockups and ext3 read-only file corruption on RHEL4 (pre and post U4)

tweeks tweeks at rackspace.com
Tue Sep 5 20:53:52 UTC 2006


Has anyone been seeing IO lockup problems on EL4?  

I've tried multiple IO scheduler options (elevator=) on the kernel boot line, 
and I'm seeing the same behavior regardless.  It is also independent of 
hardware.  Whitebox ATA, HA enclosure with dedicated SCSI, megaraid RAID 
hardware, Dell 2850s... same behavior:

A semi-busy system will suddenly go into some kind of IO la-la land where 
nothing can be written to disk for more than an hour.  Of course when this 
happens, the ext3 kernel module freaks out and remounts all the filesystems 
read-only.  Then when the system is rebooted, if it is allowed to fsck, the 
journal is hosed and the filesystem eats itself.  Moving the affected systems 
off the RH kernel altogether seems to fix the problem, but I have not found a 
way to reproduce it yet (burn-in and stress testing don't seem to make it 
appear), so real re-testing is difficult at best.
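
Since the lockup can't be triggered on demand, one way to at least catch the 
read-only remount when it happens is to watch the mount table.  The following 
is only a minimal sketch, not something from the affected systems: the 
function name readonly_ext3_mounts is hypothetical, and it assumes the usual 
/proc/mounts layout (device, mount point, filesystem type, comma-separated 
options).

```python
def readonly_ext3_mounts(mounts_text):
    """Return the mount points of ext3 filesystems mounted read-only.

    mounts_text is the text of a mount table in /proc/mounts layout:
    device, mount point, filesystem type, options, dump, pass.
    """
    readonly = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            # Skip blank or malformed lines rather than guessing.
            continue
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        # Options are comma-separated; match "ro" as a whole option so
        # that something like "errors=remount-ro" does not count.
        if fs_type == "ext3" and "ro" in options.split(","):
            readonly.append(mount_point)
    return readonly
```

Feeding it the contents of /proc/mounts from cron and alerting on a non-empty 
result would flag the remount, though it says nothing about why the IO 
stalled in the first place.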

It's become such a big problem that we're moving some customers that require 
rock solid systems either over to RHEL3, or off RH and over to SLES or another 
distro with a non-RH kernel.  

The ext3 problem on its own (minus the IO lockup part) can be seen in other BZ 
tickets:
	https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=175877
(when the filesystem fills up)

Has anyone seen this type of IO lockup + ext3 corruption on RHEL4?  
Can you reproduce it?
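
For anyone comparing notes, it helps to report which elevator= value the box 
was actually booted with.  A minimal sketch for pulling that out of a kernel 
command line (such as the contents of /proc/cmdline) follows; the function 
name elevator_from_cmdline is hypothetical, and it simply returns the first 
elevator= token it finds.

```python
def elevator_from_cmdline(cmdline):
    """Return the value of the first elevator= option, or None if absent.

    cmdline is a kernel command line string, e.g. the text of
    /proc/cmdline.  None means no scheduler was requested at boot,
    so the kernel used whatever default it was built with.
    """
    prefix = "elevator="
    for token in cmdline.split():
        if token.startswith(prefix):
            return token[len(prefix):]
    return None
```

A result of None only shows that no scheduler was requested on the boot line; 
it does not say which scheduler the kernel picked by default.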

Tweeks



