[Linux-cluster] Fenced failing continuously

Ian Hayes cthulhucalling at gmail.com
Fri Apr 10 00:07:58 UTC 2009


I've been testing a newly built two-node cluster. The cluster resources are a
virtual IP and Squid, so on a node failure the VIP should move to the
surviving node, which then starts Squid. I'm running a modified fencing agent
that SSHes into the failed node and firewalls it off via iptables (not my
choice).
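
For what it's worth, the agent boils down to something like this (a
simplified sketch of the idea, not the exact script we use):

#!/usr/bin/env python
# Rough sketch of the custom agent: fenced runs it, passing the device
# options as key=value lines on stdin, and the agent SSHes into the victim
# node and blocks all of its traffic with iptables.
import subprocess
import sys

def read_options():
    # Collect the key=value pairs that fenced writes to the agent's stdin.
    opts = {}
    for line in sys.stdin:
        line = line.strip()
        if line and "=" in line:
            key, value = line.split("=", 1)
            opts[key] = value
    return opts

def fence_node(victim):
    # Background the iptables commands on the far side so the ssh session
    # can return before the new rules cut it off.
    remote = ("nohup sh -c 'sleep 1; "
              "iptables -I INPUT -j DROP; iptables -I OUTPUT -j DROP' "
              ">/dev/null 2>&1 &")
    return subprocess.call(["ssh", "-o", "ConnectTimeout=10",
                            "root@" + victim, remote])

if __name__ == "__main__":
    opts = read_options()
    # "ipaddr" here stands in for whatever option name the device entry in
    # cluster.conf actually uses.
    victim = opts.get("ipaddr", "")
    # A non-zero exit tells fenced the fence failed, and it simply retries.
    sys.exit(1 if not victim or fence_node(victim) else 0)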

This all works fine for graceful shutdowns. But when I do something nasty,
like pulling the power cord on the node that is currently running the
service, the surviving node never takes over the service; it spends all its
time trying to fire off the fence agent, which obviously cannot succeed
because the dead node is completely offline. The only way I can get the
surviving node to take over the VIP and start Squid is to run
fence_ack_manual, which rather defeats the purpose of running a cluster in
the first place. The logs are filled with:

Apr 12 00:01:44 <hostname> fenced[3223]: fencing node "<otherhost>"
Apr 12 00:01:44 <hostname> fenced[3223]: agent "fence_iptables" reports:
Could not disable xx.xx.xx.xx
ssh: connect to host xx.xx.xx.xx port 22: No route to host
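
In case it matters, the fencing section of cluster.conf is wired up roughly
like this (trimmed, hostnames changed, and the option name on the device
line is just what my agent happens to expect):

        <clusternode name="node2" nodeid="2">
                <fence>
                        <method name="1">
                                <device name="iptables-fence" ipaddr="xx.xx.xx.xx"/>
                        </method>
                </fence>
        </clusternode>
        ...
        <fencedevices>
                <fencedevice agent="fence_iptables" name="iptables-fence"/>
        </fencedevices>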

Is this a misconfiguration, or is there an option I can include somewhere to
tell the nodes to give it up after a certain number of tries?