[Linux-cluster] failed node causes all GFS systems to hang

JACOB_LIBERMAN at Dell.com
Thu Jun 9 03:01:07 UTC 2005


How are you fencing?

I noticed a condition on certain Brocade switches where the
fence_brocade script effectively kills the entire switch.
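For reference, here is a minimal sketch of a per-node fence_brocade setup
in cluster.conf, where the agent is pinned to a single switch port for each
node (the device name, switch address, login, and port number below are
placeholders, not values from this thread):

  <clusternode name="node1" votes="1">
    <fence>
      <method name="1">
        <device name="brocade1" port="3"/>
      </method>
    </fence>
  </clusternode>

  <fencedevices>
    <fencedevice name="brocade1" agent="fence_brocade"
                 ipaddr="switch-address" login="admin" passwd="password"/>
  </fencedevices>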

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of David Teigland
> Sent: Wednesday, June 08, 2005 10:04 PM
> To: Dan B. Phung
> Cc: Linux-cluster at redhat.com
> Subject: Re: [Linux-cluster] failed node causes all GFS 
> systems to hang
> 
> On Wed, Jun 08, 2005 at 05:46:26PM -0400, Dan B. Phung wrote:
> 
> > I think I'm doing something terribly wrong here, because if one of my
> > nodes goes down, the rest of the nodes connected to GFS are hung in
> > some wait state.  Specifically, only those nodes running fenced are
> > hosed.  These machines are not only blocked on the GFS file system,
> > but the local file system stuff is hung as well, which requires me to
> > reboot everybody connected to GFS.  I have one node not running
> > fenced to reset the quorum status, so that doesn't seem to be the
> > problem.
> > 
> > I updated from the CVS sources -rRHEL4 last Friday, so I have up to
> > date stuff.  I'm running kernel 2.6.9 and fence_manual.  I remember a
> > couple of weeks back that when a node went down, I simply had to
> > fence_ack_manual the node, but that message never comes up anymore...
> 
> The joys of manual fencing; we sometimes debate whether it's more
> troublesome than helpful for people.
> 
> When a node fails, you need to run fence_ack_manual on one of the
> remaining nodes: specifically, whichever remaining node has a
> fence_manual notice in /var/log/messages.  So, you need to monitor
> /var/log/messages on the remaining nodes to figure out where to run
> fence_ack_manual (it will generally be the remaining node with the
> lowest nodeid; see cman_tool nodes).
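> 
> For example (a rough sketch; the exact log wording and option syntax
> vary by version, and "badnode" is just a placeholder name):
> 
>   # on each remaining node, look for the manual-fencing notice
>   grep fence_manual /var/log/messages
> 
>   # list members and node ids to see who is likely holding the notice
>   cman_tool nodes
> 
>   # on the node that logged the notice, acknowledge the failed node
>   fence_ack_manual -n badnode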
> 
> If the failed node caused the cluster to lose quorum, then it's a
> different story.  In that case you need to get some nodes back into
> your cluster (cman_tool join) to regain quorum before any kind of
> fencing will happen.
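> 
> Roughly (a sketch, assuming ccsd and the cluster config are already in
> place on the nodes you bring back):
> 
>   # check quorum and membership counts
>   cman_tool status
> 
>   # on a node that is not currently a member, rejoin the cluster
>   cman_tool join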
> 
> GFS is going to be blocked everywhere until you run 
> fence_ack_manual for the failed node.  If there are no manual 
> fencing notices anywhere for the failed node, then maybe you 
> lost quorum (see cman_tool status), or something else is 
> wrong.  I don't know why your local fs would be hung.
> 
> Dave
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
> 



