[Linux-cluster] I/O Error management in GFS

Mathieu Avila mathieu.avila at seanodes.com
Fri May 18 13:49:14 UTC 2007


Sorry for my late reply,

I've performed the following tests with cluster-1.03:
- mount GFS on more than one node, using Gulm as the lock manager,
- cp something big (a kernel) onto it from each node,
- while the copies run, force the device to return I/O errors (one way
to do this is sketched below).
The result is not what you described: sometimes my "cp" finishes with
I/O errors (that's good), but most of the time it blocks in the
kernel. I cannot perform any action on the file system, not even
umount. Commands like "df" block, too.

I've done the same test with DLM and got the same results.

Do you need additional information to investigate this?

--
Mathieu

On Fri, 27 Apr 2007 12:03:47 -0500,
David Teigland <teigland at redhat.com> wrote:

> On Fri, Apr 27, 2007 at 11:00:41AM +0200, Mathieu Avila wrote:
> > Hello all,
> > 
> > From what I understand of the GFS1 source code, I/O errors are not
> > handled: when an I/O error happens, the node either exits the
> > locking protocol's cluster (Gulm or CMAN) or sometimes
> > asserts/panics.
> > 
> > Anyway, most of the time the node that got an I/O error must be
> > rebooted (the file system layer is left unstable), the device must
> > be checked, and the file system must be fsck'ed.
> > 
> > Are there any plans for cleaner handling of I/O errors in GFS1,
> > like, say, remounting read-only with -EIO returned to apps, or,
> > even better, advanced features like relocation mechanisms? Is this
> > planned for GFS2?
> 
> What you're asking for is very close to the "withdraw" feature,
> which has existed in gfs1 since rhel4.  When gfs detects an io
> error, it does a "withdraw" on that fs, which means shutting it
> down: returning EIO to anything accessing it, telling other nodes to
> do journal recovery for it, and dropping all global locks that were
> held; you can then unmount the withdrawn fs.  It's mainly about
> getting the node with the errors out of the way so the other nodes
> can continue.  It also lets you shut down and reboot the node
> experiencing errors in a controlled fashion.
> 
> Dave
> 
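
To make concrete what the withdraw behaviour described above should
look like from user space, here is a minimal sketch (the mount point
and file name are assumptions): once the fs is withdrawn, I/O syscalls
fail with EIO instead of blocking.

#!/usr/bin/env python3
# Minimal sketch: probe a (possibly withdrawn) GFS mount. On a
# withdrawn fs the write/fsync should fail with EIO; the path is an
# assumption.
import errno
import os

def write_probe(path="/mnt/gfs/probe"):
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
        try:
            os.write(fd, b"\0" * 4096)
            os.fsync(fd)   # push the data down to the failing device
        finally:
            os.close(fd)
        print("write succeeded")
    except OSError as e:
        if e.errno == errno.EIO:
            print("fs withdrawn: syscall returned EIO")
        else:
            raise

if __name__ == "__main__":
    write_probe()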



