EXT3-fs unexpected failure msg ?

Tue Apr 18 08:31:11 UTC 2006

On Apr 17, 2006  21:30 -0400, Sev Binello wrote:
> Damian Menscher wrote:
> >On Mon, 17 Apr 2006, Andreas Dilger wrote:
> >>You really, really, really need to mount your filesystem with
> >>"-o errors=remount-ro", at least to prevent filesystem corruption.
> >>I'm not sure if this is enough to prevent corruption in the case
> >>of your RAID disconnects (if it doesn't generate errors up to the
> >>filesystem, but still discards writes), but it is at least a minimum
> >>requirement.
> >
> >Since this was so strongly worded, I just did a random spot-check of
> >some of our filesystems (RHEL4) and discovered they all have:
> >
> >   Errors behavior:          Continue
> >
> >in the superblock (and mount apparently takes that option).  This makes 
> >me curious: if it's so obvious that it should remount-ro on errors, why 
> >is the default (on RHEL4, at least) to continue?

It was only so strongly worded because Sev has had repeated failures of
the RAID hardware resulting in filesystem corruption, and it seems prudent
to stop the filesystem at the first inkling of corruption in this case.
Not all environments see so many problems, and the choice to use remount-ro
is up to the admin (though I believe Debian uses this as the default).
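
A minimal sketch for checking what a filesystem currently does on errors,
assuming the "Field name:   value" layout that "tune2fs -l" prints. The
function name errors_behavior and the device /dev/sdXN are placeholders
for illustration only.

```shell
# Print the value of the "Errors behavior" field from superblock listing
# text on stdin (assumed to be in the format printed by `tune2fs -l`).
errors_behavior() {
    sed -n 's/^Errors behavior:[[:space:]]*//p'
}

# Hypothetical usage (needs root; /dev/sdXN and /mnt/point are placeholders):
#   tune2fs -l /dev/sdXN | errors_behavior
#   tune2fs -e remount-ro /dev/sdXN                  # store the default in the superblock
#   mount -o errors=remount-ro /dev/sdXN /mnt/point  # or set it per mount
```

Storing the behavior in the superblock with "tune2fs -e" makes it the
default for later mounts, whereas the mount option applies only to that
mount.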

> My question/concern is that, since there are sometimes trivial errors that
> we often have to live with until we can take our operational systems down
> long enough to fsck, will this option automatically put us in ro mode no
> matter how trivial the problem is?

This only triggers when ext3 detects a consistency error in its own
metadata.  It doesn't affect regular I/O errors on file data.

That said, it surprises me that you are often getting errors of any kind,
even "trivial" ones.  I wouldn't consider a RAID system where you often
get errors to be very reliable.

> Also, when we had the problem earlier today (i.e. the RAID controller
> didn't fail over for about 20 mins), we did stop and fsck.
> But even so, when we checked after it was done, it still said the state
> was "clean with errors"?

When you run e2fsck, are you specifying the "-f" flag?  For ext3 filesystems,
an e2fsck (without -f) will normally not do a full filesystem check unless
the superblock has been flagged with an error.  This allows e2fsck to run
against the filesystem at every boot, but normally only do journal replay
(seconds at most) unless an error was reported.
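
A minimal sketch for reading the recorded state before and after a check,
again assuming "tune2fs -l" style output on stdin. The helper names
filesystem_state and state_has_errors, and the device /dev/sdXN, are
hypothetical.

```shell
# Print the "Filesystem state" field (e.g. "clean" or "clean with errors")
# from `tune2fs -l` style text on stdin.
filesystem_state() {
    sed -n 's/^Filesystem state:[[:space:]]*//p'
}

# Return 0 if the state read from stdin mentions errors, 1 otherwise.
state_has_errors() {
    case "$(filesystem_state)" in
        *"with errors"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical usage (needs root, filesystem unmounted; /dev/sdXN is a placeholder):
#   tune2fs -l /dev/sdXN | filesystem_state
#   e2fsck -f /dev/sdXN        # force a full check regardless of the state
#   tune2fs -l /dev/sdXN | state_has_errors && echo "error flag still set"
```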

> We tried fscking again with no better results,
> though when it started it said...
>       "ext3 recovery flag clear but journal has data"
> Any advice here?

Run "e2fsck -f"?  I haven't seen this message except when the superblock
was corrupted and had to be restored from a backup, or something similar.
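
To confirm whether a forced check actually repaired everything, the e2fsck
exit status can be decoded. This is a sketch based on the exit codes listed
in the e2fsck man page, where the status is a sum of bit values; the
function name describe_e2fsck_status and /dev/sdXN are placeholders.

```shell
# Print one line per condition encoded in an e2fsck exit status.
# Bit values follow the e2fsck(8) man page; the wording is paraphrased.
describe_e2fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    if [ $((status & 1)) -ne 0 ]; then echo "errors corrected"; fi
    if [ $((status & 2)) -ne 0 ]; then echo "errors corrected, reboot recommended"; fi
    if [ $((status & 4)) -ne 0 ]; then echo "errors left uncorrected"; fi
    if [ $((status & 8)) -ne 0 ]; then echo "operational error"; fi
    if [ $((status & 16)) -ne 0 ]; then echo "usage or syntax error"; fi
    if [ $((status & 32)) -ne 0 ]; then echo "canceled by user request"; fi
    if [ $((status & 128)) -ne 0 ]; then echo "shared library error"; fi
    return 0
}

# Hypothetical usage (needs root, filesystem unmounted; /dev/sdXN is a placeholder):
#   e2fsck -f /dev/sdXN; describe_e2fsck_status $?
```

A status with the 4 bit set means errors remain, which would be consistent
with a state that still reads "clean with errors" afterwards.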

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.