[Linux-cluster] Clusterbehaviour if one node is not reachable & fenceable any longer?
lists at alteeve.ca
Wed Jan 29 15:43:48 UTC 2014
On 29/01/14 10:14 AM, Nicolas Kukolja wrote:
> I have a cluster with three nodes (rhel 5.5) and every server has an
> ipmilan-module configured as fencing device in my cluster-config.
> Now, if one of the nodes is not reachable and its fencing device is not
> reachable, too, then the other two nodes try to fence this node again
> and again... without ever stopping.
> Only when this node is reachable (& fenceable) again does the fencing
> succeed and the cluster service move to another node.
> Why does the service not move to another node earlier? I think it's a
> common failure scenario for a node and its fencing device to be
> unreachable at the same time, e.g. due to power problems.
> How do I have to change the cluster configuration to get the
> behaviour I expect?
> Thanks in advance for any suggestions...
> Kind regards,
This behaviour is expected and by design. The healthy nodes can't safely
recover until they know what state the lost node is in. The cluster is
not allowed to simply assume that the lost node is dead (no way to tell
"disconnected but working" from "smouldering pile of rubble").
The way I deal with this is a second fence method. I use a pair of
switched PDUs behind each node (one PDU for the first PSU in each node
and the second PDU for the second PSU in each node). This way, if IPMI
fencing fails, the nodes will connect to the PDUs and cut the power to
the lost node, thus ensuring it's off and allowing prompt recovery of services.
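
To illustrate the logic described above, here is a minimal Python sketch of ordered fence methods with fallback. It is a toy model, not the actual fence daemon: the function `fence_node`, the `max_rounds` cap, and the stand-in devices `ipmi_unreachable` and `pdu_off` are all hypothetical names made up for this example. The real cluster keeps retrying without a cap, which is the "again and again" behaviour reported in the question.

```python
from typing import Callable, List, Optional, Tuple

# A fence device is modelled as a callable that returns True once it has
# confirmed the target node is off (hypothetical stand-in for a fence agent).
FenceDevice = Callable[[str], bool]
FenceMethod = Tuple[str, List[FenceDevice]]


def fence_node(node: str, methods: List[FenceMethod], max_rounds: int) -> Optional[str]:
    """Try each fence method in order; return the name of the first that succeeds.

    Returns None if every method failed in every round. The max_rounds cap
    exists only so this sketch terminates; it is an assumption of the example.
    """
    for _ in range(max_rounds):
        for name, devices in methods:
            # Every device in a method must succeed, e.g. both PDUs
            # feeding a dual-PSU node must cut power.
            if all(device(node) for device in devices):
                return name
    return None


def ipmi_unreachable(node: str) -> bool:
    # The node lost power, so its IPMI interface does not answer.
    return False


def pdu_off(node: str) -> bool:
    # The switched PDU is still reachable and cuts the outlet.
    return True


ipmi_only = [("ipmi", [ipmi_unreachable])]
ipmi_then_pdu = [("ipmi", [ipmi_unreachable]), ("pdu", [pdu_off, pdu_off])]

print(fence_node("node3", ipmi_only, max_rounds=3))
print(fence_node("node3", ipmi_then_pdu, max_rounds=3))
```

With only the IPMI method, no round ever succeeds and recovery stays blocked. With the PDU method listed second, the first round falls through to it and fencing completes.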
This might help:
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?