[Linux-cluster] I/O Error management in GFS

Fri Apr 27 17:03:47 UTC 2007

On Fri, Apr 27, 2007 at 11:00:41AM +0200, Mathieu Avila wrote:
> Hello all,
> 
> From what I understand of the GFS1 source code, I/O errors are not
> handled: when an I/O error happens, the node either leaves the locking
> protocol's cluster (Gulm or CMAN) or sometimes asserts/panics.
> 
> In any case, most of the time the node that got the I/O error must be
> rebooted (the file system layer is left unstable), the device must be
> checked, and the file system must be fsck'ed.
> 
> Are there any plans for cleaner handling of I/O errors in GFS1, such
> as remounting in R/O mode with -EIO returned to applications, or, even
> better, advanced features like relocation mechanisms? Is this planned
> for GFS2?

You already have something very close to what you're asking for in the
"withdraw" feature, which has existed in GFS1 since RHEL4.  When GFS
detects an I/O error, it does a "withdraw" on that fs, which means
shutting it down: returning EIO to anything accessing it, telling other
nodes to do journal recovery for it, and dropping all global locks that
were held.  You can then unmount the withdrawn fs.  It's mainly about
getting the node with the errors out of the way of the other nodes so
they can continue.  It also allows you to shut down and reboot the node
experiencing errors in a controlled fashion.
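
From an application's point of view, a withdrawn fs shows up as EIO on
its file operations.  A minimal sketch in Python of how an application
might react follows.  The function name `read_or_report` and its
`opener` parameter are hypothetical, chosen only for illustration.  Note
that EIO can have other causes as well, so the code can only report an
I/O error, not that a withdraw specifically took place.

```python
import errno


def read_or_report(path, opener=open):
    """Read a file and return (data, status).

    status is "ok" on success, or "io_error" if the kernel returned EIO,
    which is what a withdrawn GFS mount is described as returning.
    Any other error is re-raised unchanged.
    """
    try:
        with opener(path, "rb") as f:
            return f.read(), "ok"
    except OSError as exc:
        if exc.errno == errno.EIO:
            # Assumption: the caller stops using this mount and alerts an
            # operator, who can then unmount the fs and reboot the node.
            return None, "io_error"
        raise
```

Returning a status rather than retrying reflects the description above:
after a withdraw, further access to that fs keeps returning EIO, so the
sensible reaction is to stop using it.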

Dave