[Linux-cluster] Clusterbehaviour if one node is not reachable & fenceable any longer?

Digimer lists at alteeve.ca
Wed Jan 29 15:43:48 UTC 2014

On 29/01/14 10:14 AM, Nicolas Kukolja wrote:
> Hello,
> I have a cluster with three nodes (rhel 5.5) and every server has an
> ipmilan-module configured as fencing device in my cluster-config.
> Now, if one of the nodes is not reachable and its fencing device is not
> reachable, too, then the other two nodes try to fence this node again
> and again... without stopping it.
> Only when this node is reachable (& fenceable) again, the fencing
> proceeds successfully and the cluster service moves to another node.
> Why does the service not move to another node earlier? I think it's a
> common error scenario that one node and its fencing device are not
> reachable, e.g. due to power problems.
> How do I have to change the cluster configuration to get my
> expected behaviour?
> Thanks in advance for any suggestions...
> Kind regards,
> Nicolas

This behaviour is expected and by design. The healthy nodes can't safely 
recover until they know what state the lost node is in. The cluster is 
not allowed to simply assume that the lost node is dead (no way to tell 
"disconnected but working" from "smouldering pile of rubble").

The way I deal with this is a second fence method. I use a pair of 
switched PDUs behind each node (one PDU for the first PSU in each node 
and the second PDU for the second PSU in each node). This way, if IPMI 
fencing fails, the nodes will connect to the PDUs and cut the power to 
the lost node, thus ensuring it's off and allowing prompt recovery.
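
To make the ordering concrete, here is a rough Python sketch of the logic only, not of the real fence daemon or any real fence agent. The names fence_node, recover_services, ipmi and pdu are hypothetical, and max_rounds exists only so the example terminates; as described above, the real cluster keeps retrying until a fence succeeds.

```python
from typing import Callable, List, Optional

# A fence method takes a node name and returns True only when the node
# has been confirmed off. (Hypothetical type for this sketch.)
FenceMethod = Callable[[str], bool]


def fence_node(node: str, methods: List[FenceMethod],
               max_rounds: int = 3) -> Optional[str]:
    """Try each method in order, looping over the list until one succeeds.

    Returns the name of the method that worked, or None if every method
    failed in every round. max_rounds is an assumption of this sketch;
    the real cluster retries without giving up.
    """
    for _ in range(max_rounds):
        for method in methods:
            if method(node):
                return method.__name__
    return None


def recover_services(node: str, methods: List[FenceMethod]) -> str:
    """Recovery is allowed only after some fence method has succeeded."""
    used = fence_node(node, methods)
    if used is None:
        return f"{node}: fencing failed, services stay blocked"
    return f"{node}: fenced via {used}, services may be recovered"


# Stand-ins for real fence agents (hypothetical, hard-coded results).
def ipmi(node: str) -> bool:
    # Models the case from the question: the node's IPMI interface
    # is unreachable, so this method can never confirm the fence.
    return False


def pdu(node: str) -> bool:
    # Models switched PDUs that are still reachable and can cut power.
    return True


print(recover_services("node3", [ipmi]))       # IPMI only
print(recover_services("node3", [ipmi, pdu]))  # IPMI first, PDUs as backup
```

With only the IPMI method the services stay blocked, which is the behaviour you are seeing. With the PDU method listed second, the fence succeeds as soon as the first method fails and recovery can proceed.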

This might help:

* https://alteeve.ca/w/AN!Cluster_Tutorial_2#Why_Switched_PDUs.3F
* https://alteeve.ca/w/AN!Cluster_Tutorial_2#A_Map.21
* https://alteeve.ca/w/AN!Cluster_Tutorial_2#Using_the_Fence_Devices


Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?
