How does ext3 handle drive failures?

Andreas Dilger adilger at clusterfs.com
Fri Mar 19 05:46:51 UTC 2004


On Mar 17, 2004  19:15 -0600, Philip Molter wrote:
> We want to run multi-drive systems we have in a JBOD mode, where
> each drive is basically a filesystem to itself.  With the drives
> we currently have, we expect to have multiple failures, primarily
> unrecoverable ECC read errors or sometimes the drive just dying
> altogether.
> 
> How does ext[23] handle these two primary conditions?  Using them
> in a software RAID mode, I have sometimes seen problems with disks
> hanging all access to the filesystem and even the entire system, but
> I'm not sure at what level that's happening (low-level driver?
> scsi layer?  raid layer?  filesystem layer?).

This is entirely an issue with the bus or SCSI layer, and not the
filesystem.

> If I have a drive fail taking out the entire ext3 filesystem, will
> I be able to stop using the filesystem (say, my application gets
> the error from the fs indicating some sort of problem in whatever
> system call it's made, who cares what), forcibly unmount the
> filesystem, and replace the drive?  Or will the system panic?  Or
> worse, will my application just enter an uninterruptible sleep
> never to return success or error?

Of all Linux filesystems, I think you'll find that ext2/ext3 probably
handle media and device errors the most gracefully (i.e. not panicking
because of cascading errors, unless you want that with errors=panic).
Whether you'll be able to unmount is really dependent on a lot of
factors so it's hard to comment.  When our storage servers (running
ext3) have some catastrophic disk problem we can usually unmount.
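
As a rough illustration of the application side of the question above,
here is a minimal Python sketch of how a program might react when a
system call does come back with an error. It assumes the kernel reports
a failed read or write as EIO, and that writes fail with EROFS once the
filesystem has been remounted read-only (for example when mounted with
errors=remount-ro). The policy names and the helpers
classify_io_error() and attempt() are hypothetical and exist only for
this example. It cannot help in the case where the bus or SCSI layer
hangs, since the call then never returns at all.

```python
import errno

# Hypothetical policy labels for this sketch; not part of ext3 or any library.
DRIVE_SUSPECT = "drive-suspect"
READ_ONLY = "read-only"


def classify_io_error(exc):
    """Map an OSError to a coarse decision, or None if it is unrelated."""
    if exc.errno == errno.EIO:
        # Low-level I/O failure, e.g. an unrecoverable read error.
        return DRIVE_SUSPECT
    if exc.errno == errno.EROFS:
        # Writes are refused, e.g. after the fs was remounted read-only.
        return READ_ONLY
    return None


def attempt(operation, *args):
    """Run a file operation; return (result, None) or (None, decision)."""
    try:
        return operation(*args), None
    except OSError as exc:
        decision = classify_io_error(exc)
        if decision is None:
            raise  # not a media/device condition this sketch recognises
        return None, decision
```

A caller could stop using the affected drive on "drive-suspect", and
stop writing but keep reading on "read-only", before trying to unmount.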

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/





More information about the Ext3-users mailing list