[Linux-cluster] Tripp Lite switched PDU fence agent; exists?
Fabio M. Di Nitto
fdinitto at redhat.com
Fri Mar 18 20:40:55 UTC 2011
On 3/18/2011 9:20 PM, bergman at merctech.com wrote:
> The pithy ruminations from "Fabio M. Di Nitto" <fdinitto at redhat.com> on "Re: [Linux-cluster] Tripp Lite switched PDU fence agent; exists?" were:
> => Wouldn't it be possible for the agent to:
> => 1) issue OFF command
> => 2) either poll for OFF status or wait > $known_random_max_delay
> => 3) issue ON command
> => 4) profit?
> Yes, but here's the problem:
> 0) there's a condition whereby cluster communication is lost between nodeA and nodeB
> 1) the agent on nodeA sends OFF command to PDU to shut down nodeB
> 2) the agent on nodeA polls for OFF status while waiting > $known_random_max_delay
> 3) the agent on nodeB sends OFF command to PDU to shut down nodeA
> 4) nodeB shuts down
> 5) nodeA shuts down
> The PDU responds quickly to network connections (i.e., telnet & commands to shut down a power outlet). The PDU accepts multiple network sessions (i.e., from nodeA and nodeB). The PDU delays executing the commands, potentially leaving enough time for multiple nodes to send commands, each to shut down the "other" node.
This is true for virtually all 2-node clusters, and it's a very well
known fencing race condition.
There are several mechanisms to avoid it:
1) fence delay option. One node basically sleeps N seconds before it can
2) both cluster heartbeat traffic and fence devices are on the same
network (if node A can't access the net, it also can't access the fence
3) qdiskd + heuristics
4) use a fence device that allows only one connection at a time (one
node access, the other is forbidden)
and note that the race is independent of how long the device takes to fence
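
To make the race and the fence delay option (mechanism 1) concrete, here is a rough sketch in Python. It is only a toy model, not code from any real fence agent: the function name simulate_fence_race, the pdu_latency parameter and the node names are all made up for illustration. It assumes a PDU that accepts an OFF command immediately but executes it pdu_latency seconds later, and that a node can still send its command at the very instant its own outlet is cut.

```python
def simulate_fence_race(pdu_latency, fence_delay):
    """Return the sorted list of nodes still powered after a two-node split.

    pdu_latency: seconds between the PDU accepting OFF and cutting power
                 (hypothetical parameter of this toy model).
    fence_delay: dict of node name -> seconds that node sleeps before
                 sending OFF; nodes not listed do not sleep at all.
    """
    peers = {"nodeA": "nodeB", "nodeB": "nodeA"}
    power_off_at = {}  # node -> time at which its outlet is really switched off

    # Handle the nodes in the order in which they would send their OFF command.
    for node in sorted(peers, key=lambda n: fence_delay.get(n, 0.0)):
        send_time = fence_delay.get(node, 0.0)
        # A node that lost power strictly before its send time cannot fence.
        if node in power_off_at and power_off_at[node] < send_time:
            continue
        victim = peers[node]
        power_off_at.setdefault(victim, send_time + pdu_latency)

    return sorted(n for n in peers if n not in power_off_at)


# No delay: both nodes get their OFF command in, nobody survives.
print(simulate_fence_race(pdu_latency=5, fence_delay={}))

# nodeB sleeps longer than the PDU takes to act: nodeA wins the race.
print(simulate_fence_race(pdu_latency=5, fence_delay={"nodeB": 10}))
```

In this model both nodes also die when pdu_latency is 0 and neither node sleeps, and the delay only helps if it is longer than the time the PDU takes to actually cut power.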