EXT3-fs unexpected failure msg ?

Tue Apr 18 13:57:46 UTC 2006

Andreas Dilger wrote:
> On Apr 17, 2006  21:30 -0400, Sev Binello wrote:
> 
>>Damian Menscher wrote:
>>
>>>On Mon, 17 Apr 2006, Andreas Dilger wrote:
>>>
>>>>You really, really, really need to mount your filesystem with
>>>>"-o errors=remount-ro", at least to prevent filesystem corruption.
>>>>I'm not sure if this is enough to prevent corruption in the case
>>>>of your RAID disconnects (if it doesn't generate errors up to the
>>>>filesystem, but still discards writes), but it is at least a minimum
>>>>requirement.
>>>
>>>Since this was so strongly-worded, I just did a random spot-check of 
>>>some of our filesystems (RHEL4) and discovered they all have:
>>>
>>>  Errors behavior:          Continue
>>>
>>>in the superblock (and mount apparently takes that option).  This makes 
>>>me curious: if it's so obvious that it should remount-ro on errors, why 
>>>is the default (on RHEL4, at least) to continue?
> 
> 
> It was only so strongly worded because Sev has had repeated failures of
> the RAID hardware resulting in filesystem corruption, and it seems prudent
> to stop the filesystem at the first inkling of corruption in this case.
> Not all environments see so many problems, and the choice to use remount-ro
> is up to the admin (though I believe Debian uses this as the default).
> 
> 
>>My question/concern is that, since there are sometimes trivial errors that
>>we often have to live with until we can take our operational systems down
>>long enough to fsck, will this option automatically put us in ro mode no
>>matter how trivial the problem is?
> 
> 
> This will only trigger on cases where there is a consistency error detected
> in the ext3 metadata.  It doesn't affect regular IO errors for file data.
> 
OK, I'm assuming this would be any error reported in /var/log/messages
that is preceded by EXT3-fs.
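
A minimal sketch of such a log check, in Python. It assumes the kernel
marks these metadata errors with the string "EXT3-fs error", which is
narrower than every line that mentions EXT3-fs (mount notices, for
example, would be skipped). The function names are hypothetical and the
default path is simply the log file mentioned above.

```python
import re

# Assumption: ext3 consistency errors are logged with the prefix
# "EXT3-fs error"; other EXT3-fs lines (e.g. mount notices) are skipped.
EXT3_ERROR = re.compile(r"EXT3-fs error")


def ext3_error_lines(lines):
    """Return the log lines that report an ext3 error, without newlines."""
    return [line.rstrip("\n") for line in lines if EXT3_ERROR.search(line)]


def scan_log(path="/var/log/messages"):
    """Read a syslog file and return its ext3 error lines."""
    with open(path, errors="replace") as log:
        return ext3_error_lines(log)
```

Calling scan_log() with no arguments reads /var/log/messages, which
usually needs root. A non-empty result would be the cue to schedule a
full check.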

> However, that said, it surprises me that you are getting any kind of errors,
> even "trivial" ones, often.  I wouldn't consider a RAID system where you
> often get errors to be very reliable.
> 
No argument from us.
> 
>>Also, when we had the problem earlier today (i.e. the RAID controller
>>didn't fail over for about 20 mins), we did stop and fsck.
>>But even so, when we checked after it was done, it still said state was
>>"clean with errors"?
> 
> 
> When you run e2fsck, are you specifying the "-f" flag?  For ext3 filesystems,
> an e2fsck (without -f) will normally not do a full filesystem check unless
> the superblock has been flagged with an error.  This allows e2fsck to run
> against the filesystem always at boot, but normally only do journal replay
> (seconds at most) unless there was an error reported.
> 
> 
>>We tried fscking again with no better results,
>>though when it started it said...
>>      "ext3 recovery flag clear but journal has data"
>>any advice here?
> 
> 
> Run "e2fsck -f"?  I haven't seen this unless the superblock was corrupted
> and had to be restored from backup or similar.
> 
Will try it
Thanks
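
A rough sketch, in Python, of how the superblock fields discussed in
this thread could be checked after the forced fsck. It parses the
"Key:   value" lines that "tune2fs -l" prints, including the
"Filesystem state" and "Errors behavior" fields. The function names are
hypothetical, and the device path passed to read_superblock() is up to
the caller.

```python
import subprocess


def parse_superblock(text):
    """Turn 'Key:   value' lines from tune2fs -l output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon, such as the version banner
            fields[key.strip()] = value.strip()
    return fields


def state_and_behavior(fields):
    """Return (filesystem state, errors behavior); None if a field is absent."""
    return fields.get("Filesystem state"), fields.get("Errors behavior")


def read_superblock(device):
    """Run 'tune2fs -l' on a device (needs root) and parse the result."""
    result = subprocess.run(["tune2fs", "-l", device], check=True,
                            capture_output=True, text=True)
    return parse_superblock(result.stdout)
```

If the state still reads "clean with errors" after "e2fsck -f", the
error flag in the superblock has not been cleared. The errors behavior
itself can be changed with "tune2fs -e remount-ro" on the device.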

> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
> 


-- 

Sev Binello
Brookhaven National Laboratory
Upton, New York
631-344-5647
sev at bnl.gov