rebooting more often to stop fsck problems and total disk loss

Andreas Dilger adilger at clusterfs.com
Mon Mar 19 21:27:19 UTC 2007


On Mar 19, 2007  17:15 -0400, ahlist wrote:
> Quite often we'll have a server that either needs a really long fsck
> (10 hours - 200 gig drive) or an fsck that eventually results in
> everything going to lost+found (pretty much a total loss).

Strange.  We get 1TB/hr fscks these days unless the filesystem is
completely corrupted and has a lot of duplicate blocks.

> Would rebooting these servers monthly (or some other frequency) stop this?

Also important: when you run fsck manually, use "-f" to actually check
the full filesystem instead of just the superblock.  e2fsck will only
do a full check on its own if the kernel detected disk corruption, OR
if the "last checked" time is more than 6 months old, or the mount
count has passed the maximum (mke2fs picks a random value between 20
and 40 mounts).  See tune2fs(8) for details.
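As a sketch, this is how you would inspect and tighten the automatic
check policy (the device path is only a placeholder; substitute your
own, e.g. an LVM volume or /dev/md0):

```shell
# Show the current check policy for a filesystem:
tune2fs -l /dev/sda1 | egrep 'Mount count|Maximum mount count|Last checked|Check interval'

# Force a full check now (filesystem must be unmounted or read-only):
e2fsck -f /dev/sda1

# Tighten the automatic policy: full fsck every 20 mounts or every
# 30 days, whichever comes first:
tune2fs -c 20 -i 30d /dev/sda1
```

Lowering -c and -i trades slower occasional boots for catching bitmap
and inode corruption while it is still small.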

> Is it correct to visualize this as small errors compounding over time
> thus more frequent reboots would allow quick fsck's to fix the errors
> before they become huge?

That is definitely true.  If the bitmaps get corrupted, then this will
spread corruption throughout the filesystem.

> (OS is redhat 7.3 and el3)

I would instead suggest updating to a newer kernel (e.g. RHEL4 2.6.9) as
this has fixed a LOT of bugs in ext3.  Also, make sure you are using the
newest e2fsck available, as some bugs have been fixed there also.
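To see which e2fsprogs you actually have installed, something like the
following works (the rpm query assumes an RPM-based distro such as the
Red Hat releases mentioned above):

```shell
# Print the e2fsck version banner:
e2fsck -V

# On RPM-based systems, query the installed e2fsprogs package:
rpm -q e2fsprogs
```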

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

More information about the Ext3-users mailing list