[Linux-cluster] RH Cluster doesn't pass basic acceptance tests - bug in fenced?

Lon Hohberger lhh at redhat.com
Fri Jan 12 17:56:32 UTC 2007


On Fri, 2007-01-12 at 13:41 +0100, Miroslav Zubcic wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Josef Whiter wrote:
> 
> > This isn't a bug, its working as expected.
> 
> The IT people at the central bank don't see it that way. I cannot blame
> them, because this behaviour seems strange to me, and to anybody else
> who has seen it.
> 
> > What you need is qdisk; set it up
> > with the proper heuristics and it will force the shutdown of the bad
> > node before the bad node has a chance to fence off the working node.
> 
> This is just a workaround for the lack of communication between the
> clurgmgrd and fenced daemons: the former is aware of the
> ethernet/network failure and tries to disable the active service, while
> fenced fences the other node without any good reason, because it does
> not know that its own node is the faulty one.
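
For reference, the qdisk setup described above might look roughly like
this in cluster.conf; the label, timings, and ping target are
illustrative placeholders, not values from this thread:

```xml
<!-- Sketch of a quorum-disk stanza with a network heuristic.
     A node that loses its uplink fails the ping heuristic, loses
     its qdisk vote, and is shut down before it can fence the
     healthy node. -->
<quorumd interval="1" tko="10" votes="1" label="rhcs_qdisk">
    <heuristic program="ping -c1 -w1 192.168.0.1"
               score="1" interval="2" tko="3"/>
</quorumd>
```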

There is no assumed correlation between the NIC(s) rgmanager uses for
services and the NIC(s) CMAN uses; many people use one network for
cluster traffic and another for service-related traffic.  In that case,
a service failure due to a failed NIC link is far less of a problem:
the service fails, and it moves somewhere else in the cluster.

More generally, the health of part of an rgmanager service != the
health of a node.  They are independent, despite sometimes being
correlated.


> I have an even better workaround (one bond with the native data
> ethernet and a tagged VLAN for the fence subnet) for this silly
> behaviour, but I would really like to see this fixed, because people
> laugh at us when testing our cluster configurations (we configure
> Red Hat machines and clusters).

I think it's interesting to point out that CMAN, when run in 2-node
mode, expects the fencing devices and cluster paths to be on the same
links.  This has the effect that whenever you pull the links out of a
node, that node cannot possibly fence the "good" node, because it
cannot reach the fence devices.  It sounds like you altered your
configuration to match this by using VLANs over the same links.

As a side note, it would also be trivial to add a 'reboot on link loss'
option to the IP script in rgmanager. *shrug*.
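
A "reboot on link loss" check of that sort could be sketched as
follows; this is an illustrative Python sketch, not rgmanager's actual
IP resource script (which is shell), and the interface name, sysfs
path parameter, and reboot hook are placeholders:

```python
# Hypothetical sketch of a "reboot on link loss" option; rgmanager's
# real IP script has no such option, and names here are placeholders.
import os

def link_up(iface, sysfs="/sys/class/net"):
    """True if the kernel reports carrier (link beat) on iface."""
    try:
        with open(os.path.join(sysfs, iface, "carrier")) as f:
            return f.read().strip() == "1"
    except OSError:
        # Missing or administratively-down interface: treat as lost.
        return False

def check_link(iface, on_loss=lambda: None, sysfs="/sys/class/net"):
    """Single poll; a real agent would call this from its periodic
    status check and make on_loss reboot the faulty node."""
    if not link_up(iface, sysfs=sysfs):
        on_loss()
        return False
    return True
```

The point of rebooting on link loss is the same as the qdisk
heuristic: take the node that lost its network down before it gets a
chance to fence the healthy one.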

-- Lon



