EXT3-fs unexpected failure msg ?
Sev Binello
sev at bnl.gov
Tue Apr 18 13:57:46 UTC 2006
Andreas Dilger wrote:
> On Apr 17, 2006 21:30 -0400, Sev Binello wrote:
>
>>Damian Menscher wrote:
>>
>>>On Mon, 17 Apr 2006, Andreas Dilger wrote:
>>>
>>>>You really, really, really need to mount your filesystem with
>>>>"-o errors=remount-ro", at least to prevent filesystem corruption.
>>>>I'm not sure if this is enough to prevent corruption in the case
>>>>of your RAID disconnects (if it doesn't generate errors up to the
>>>>filesystem, but still discards writes), but it is at least a minimum
>>>>requirement.
>>>
>>>Since this was so strongly-worded, I just did a random spot-check of
>>>some of our filesystems (RHEL4) and discovered they all have:
>>>
>>> Errors behavior: Continue
>>>
>>>in the superblock (and mount apparently takes that option). This makes
>>>me curious: if it's so obvious that it should remount-ro on errors, why
>>>is the default (on RHEL4, at least) to continue?
>
>
> It was only so strongly worded because Sev has had repeated failures of
> the RAID hardware resulting in filesystem corruption, and it seems prudent
> to stop the filesystem at the first inkling of corruption in this case.
> Not all environments see so many problems, and the choice to use remount-ro
> is up to the admin (though I believe Debian uses this as the default).
>
>
>>my question/concern is that since there are sometimes trivial errors that
>>we often have to live with until we can take our operational systems down
>>long enough to fsck, will this option automatically put us in ro mode no
>>matter how trivial the problem is ?
>
>
> This will only trigger on cases where there is a consistency error detected
> in the ext3 metadata. It doesn't affect regular IO errors for file data.
>
OK, I'm assuming this would be any error reported in /var/log/messages
that is preceded by EXT3-fs.
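
One way to look for such messages is to scan the log for the kernel's
ext3 prefix. The sketch below is illustrative only. It assumes the
messages of interest contain the literal text "EXT3-fs error", which is
narrower than "anything starting with EXT3-fs", since that prefix also
shows up on warnings and routine mount messages. The names
find_ext3_errors and scan_log are hypothetical helpers, not part of any
tool mentioned in this thread.

```python
from typing import Iterable, List

# Assumption: kernel lines reporting ext3 metadata errors contain this text.
EXT3_ERROR_MARKER = "EXT3-fs error"


def find_ext3_errors(lines: Iterable[str]) -> List[str]:
    """Return the lines that contain the assumed ext3 error marker."""
    return [line.rstrip("\n") for line in lines if EXT3_ERROR_MARKER in line]


def scan_log(path: str = "/var/log/messages") -> List[str]:
    """Scan a syslog file (reading it usually requires root)."""
    # errors="replace" keeps the scan going if the log has stray bytes.
    with open(path, errors="replace") as log:
        return find_ext3_errors(log)
```

Calling scan_log() with no arguments reads /var/log/messages; an empty
list means no line matched the assumed marker, not that the filesystem
is healthy.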
> However, that said, it surprises me that you are getting any kind of errors,
> even "trivial" ones, often. I wouldn't consider a RAID system where you
> often get errors to be very reliable.
>
No argument from us.
>
>>Also, when we had the problem earlier today (i.e. the raid controller
>>didn't failover for about 20 mins), we did stop and fsck.
>>But even so when we checked after it was done, it still said state was
>>"clean with errors" ?
>
>
> When you run e2fsck, are you specifying the "-f" flag? For ext3 filesystems,
> an e2fsck (without -f) will normally not do a full filesystem check unless
> the superblock has been flagged with an error. This allows e2fsck to run
> against the filesystem always at boot, but normally only do journal replay
> (seconds at most) unless there was an error reported.
>
>
>>We tried fscking again with no better results,
>>though when it started it said...
>> "ext3 recovery flag clear but journal has data"
>>any advice here ?
>
>
> Run "e2fsck -f"? I haven't seen this unless the superblock was corrupted
> and had to be restored from backup or similar.
>
Will try it.
Thanks.
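
After a forced check, the two superblock fields discussed in this
thread ("Filesystem state" and "Errors behavior") can be read back from
the output of "tune2fs -l". The following is a rough sketch, not a
definitive tool: parse_superblock_fields and superblock_summary are
hypothetical names, the device path is left to the caller, and the
parser simply assumes the "Name:   value" layout that tune2fs prints.

```python
import subprocess
from typing import Dict


def parse_superblock_fields(text: str) -> Dict[str, str]:
    """Turn 'Name:   value' lines from a superblock listing into a dict."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        # Split on the first colon only, so values containing times survive.
        name, sep, value = line.partition(":")
        if sep and name.strip():
            fields[name.strip()] = value.strip()
    return fields


def superblock_summary(device: str) -> Dict[str, str]:
    """Run `tune2fs -l` on a device (needs root) and pick out two fields."""
    out = subprocess.run(
        ["tune2fs", "-l", device],
        check=True, capture_output=True, text=True,
    ).stdout
    fields = parse_superblock_fields(out)
    return {
        "state": fields.get("Filesystem state", "unknown"),
        "errors": fields.get("Errors behavior", "unknown"),
    }
```

A "state" that still reads "clean with errors" after the check would
match the symptom described above, and an "errors" value of "Continue"
means the filesystem is not set to remount read-only on errors.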
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
>
--
Sev Binello
Brookhaven National Laboratory
Upton, New York
631-344-5647
sev at bnl.gov