[Linux-cluster] GFS 6.0 node without quorum tries to fence

Michael Conrad Tadpol Tilstra mtilstra at redhat.com
Wed Aug 4 15:40:36 UTC 2004


On Wed, Aug 04, 2004 at 09:20:22AM -0500, adam manthei wrote:
> On Wed, Aug 04, 2004 at 04:06:32PM +0200, Schumacher, Bernd wrote:
> > > > The single point of failure is:
> > > > The LAN card specified in "nodes.ccs:ip_interfaces" stops working on
> > > > one node, no matter whether that node was master or slave.
> > > > 
> > > > The whole GFS is stopped:
> > > > The rest of the cluster seems to need time to form a new cluster. The
> > > > bad node does not need as much time to switch to
> > > > arbitrating mode.  So the bad node has enough time to fence all other
> > > > nodes before it would be fenced by the new master.
> > > > 
> > > > The bad node survives, but it cannot form a cluster. GFS is not working.
> > > > 
> > > > Now all other nodes will reboot. But after rebooting they cannot join
> > > > the cluster, because they cannot contact the bad node. The
> > > > LAN card is still broken. GFS is not working.
> > > > 
> > > > Did I miss something?
> > > > Please tell me that I am wrong!
> > > 
> > > Well, I guess I'm confused about how the node with the bad LAN card
> > > can contact the fencing device to fence the other nodes.  If
> > > it can't communicate with the other nodes because its NIC is
> > > down, it can't contact the fencing device over that NIC
> > > either, right?  Or are you using some alternate transport to
> > > contact the fencing device?
> > 
> > There is a second admin LAN which is used for fencing.
> >  
> > Could I perhaps use this second admin LAN for GFS heartbeats too? Can I
> > define two LAN cards in "nodes.ccs:ip_interfaces"? If this works, I would
> > no longer have a single point of failure. But the documentation does not
> > seem to allow this.
> > I will test this tomorrow.
> 
> GULM does not support multiple ethernet devices.  In this case, you would
> want to architect your network so that the fence devices are on the same
> network as the heartbeats.
> 
> However, if you did _NOT_ do that, the problem isn't as bad as you make it out
> to be.  You're correct in thinking that there will be a shootout.  One of
> your gulm servers will try to fence the others, and the others will try to
> fence the one.  When the smoke clears, you will at worst be left with a
> single server.  If that remaining server can no longer talk to the other
> lock_gulmd servers due to a net split, it will continue to sit in the
> arbitrating state waiting for the other nodes to log in.  The other nodes,
> however, will be able to start a new generation of the cluster when they
> restart because they will be quorate.  If the other quorate part of the
> netsplit wins the shootout, you only lose the one node.
> 
> If this is not acceptable, then you really need to rethink why the
> heartbeats are not going over the same interface as the fencing device.
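
For reference, with three lock_gulm servers (the smallest redundant setup)
any two that can still reach each other stay quorate after a netsplit.  A
minimal cluster.ccs stanza along those lines might look like this (the
cluster name, node names, and timing values are only placeholders, not
anyone's actual config):

    cluster {
        name = "alpha"
        lock_gulm {
            servers = ["node01", "node02", "node03"]
            heartbeat_rate = 15.0
            allowed_misses = 2
        }
    }

With three servers, the side of a netsplit holding two of them can start a
new generation; the isolated third sits in arbitrating and never gains
quorum on its own.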

Unfortunately, GULM has not yet had multiple network device support added.
We've always meant to, but lacked the time and resources to do it.  You
really *must* put heartbeats/lock traffic/fencing/etc. on the same network
device.  Things won't work the way they should otherwise.
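
To make that concrete: since GULM only uses the single interface per node
named in ip_interfaces, the fence device address (defined in fence.ccs)
should sit on that same subnet.  A rough sketch, with made-up node names,
device names, and addresses:

    nodes {
        node01 {
            ip_interfaces {
                eth0 = "10.0.1.1"
            }
            fence {
                power {
                    apc1 {
                        port = 1
                    }
                }
            }
        }
    }

    fence_devices {
        apc1 {
            agent = "fence_apc"
            ipaddr = "10.0.1.100"
            login = "apc"
            passwd = "apc"
        }
    }

That way a failed NIC takes out the node's heartbeats and its path to the
fence device together, so the broken node can't win the shootout against
the quorate side.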


-- 
Michael Conrad Tadpol Tilstra
I used to be indecisive, but now I'm not sure.