[Linux-cluster] GFS withdraw and/or node I/O errors affect whole cluster?

Wed Jan 21 03:47:39 UTC 2009

We are running a 14-node cluster on CentOS 5.2 with GFS1.

We've observed a problem in production that caused us to perform an
unplanned cluster restart.  We also reproduced similar behavior in a lab
environment.

If one node loses its connection to shared storage, it can no longer
perform any filesystem activity.  The GFS filesystem may decide to
withdraw.  That's expected.

The same node that withdraws does not get fenced.  Since the cluster
itself depends on networking and not storage, and cluster services other
than GFS may be active, that's not surprising.

When one node withdraws or otherwise fails on a GFS mount without
getting fenced, other nodes freeze when attempting to access the same
filesystem.  That's unexpected.

For a high-availability cluster, this can be a bad thing, because it
isn't handled automatically and effectively causes a cluster-wide
outage.  Does this sound right?  How can we mitigate or prevent such
outages?  Are there relevant configuration settings I've missed?

Thanks for any insight.

Jeff