[Linux-cluster] GFS withdraw and/or node I/O errors affect whole cluster?

Wed Jan 21 10:12:28 UTC 2009

Hi,

On Tue, 2009-01-20 at 22:47 -0500, Jeff Sturm wrote:
> Using a 14-node cluster on CentOS 5.2 with GFS1.
>  
> We've observed a problem in production that caused us to perform an
> unplanned cluster restart.  We also reproduced similar behavior in a lab
> environment.
>  
> If one node loses its connection to shared storage, it can no longer
> perform any filesystem activity.  The GFS filesystem may decide to
> withdraw.  That's expected.
>  
> The same node that withdraws does not get fenced.  Since the cluster
> itself depends on networking and not storage, and cluster services other
> than GFS may be active, that's not surprising.
>  
> When one node withdraws or otherwise fails on a GFS mount without
> getting fenced, other nodes freeze when attempting to access the same
> filesystem.  That's unexpected.
Yes, I'd agree that should not happen.

>  
> For a high-availability cluster, this can be a bad thing, because it
> isn't handled automatically and effectively causes a cluster-wide
> outage.  Does this sound right?  How can we mitigate or prevent such
> outages?  Are there relevant configuration settings I've missed?
>  
> Thanks for any insight.
>  
> Jeff
> 
I'd suggest checking your fencing settings. The chances are that
something has gone wrong and the failed node could not be fenced for
some reason. Do you get any log messages which might explain it?
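
As a rough way to pull the relevant messages out of a log on each node,
something like the sketch below might help. It is only an illustration:
the keywords ("fenc", "withdraw") and the function names are my own
assumptions rather than anything defined by the cluster software, and the
exact wording of the messages will depend on your versions.

```python
import re

# Assumed keywords for illustration only; adjust to the messages you see.
# "fenc" matches fence/fenced/fencing, "withdraw" matches withdraw/withdrawn.
PATTERN = re.compile(r"fenc|withdraw", re.IGNORECASE)


def fencing_related(lines):
    """Return the lines that mention fencing or a withdraw."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]


def scan_file(path):
    """Read a log file and return its fencing-related lines."""
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(path, errors="replace") as log:
        return fencing_related(log)
```

Calling scan_file() on the syslog file of each node (typically
/var/log/messages on CentOS, but check your syslog configuration) and
printing the result should show whether a fence attempt was made at all
and, if so, whether it reported a failure.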

Steve.
