[Linux-cluster] Fenced failing continuously

Robert Hurst rhurst at bidmc.harvard.edu
Mon Apr 13 18:08:00 UTC 2009


You're right that there is no such thing as fail-safe ... but I would
worry more if I just hard-coded a return value of SUCCESS in my
scripts.  Management cards are supposed to keep working even when the
server itself is powered down -- short of a loss of power on both
feeds.  And if that is the case, no electricity == no servers == no
cluster, which means you are doing a cold boot regardless.
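
For what it's worth, that shortcut boils down to something like the
sketch below (a hypothetical wrapper, written in Python just for
illustration; the agent path is an example).  I am only showing it
because it makes the risk concrete: fenced treats exit status 0 as
"node fenced", whether or not anything actually happened.

#!/usr/bin/env python3
# Hypothetical "always succeed" fence wrapper -- the anti-pattern above.
import subprocess
import sys

# fenced passes the agent its options as key=value lines on stdin and
# only cares about the exit status.
options = sys.stdin.read()

# Run the real agent (path is an example) with the same options.
result = subprocess.run(["/usr/sbin/fence_ilo"], input=options, text=True)

if result.returncode != 0:
    sys.stderr.write("real agent failed; reporting success anyway\n")

# The dangerous part: claim success even though the node may NOT be fenced.
sys.exit(0)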

We have both fence_ilo and fence_bladecenter in effect.  As well as
the iLO cards have performed to date, we are still moving off HP
DL385s onto IBM BladeCenter, because its management processors are
closer to fault tolerant than anything else we have experienced.  I
have had HP iLO cards "crash" and not reset themselves -- although
later firmware revisions have greatly reduced those outages.
Monitoring the iLO's https and ssh ports for availability is a
requirement!
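
Our availability check is nothing fancy -- roughly the sketch below
(hypothetical iLO hostname; it only proves the TCP ports answer, not
that the iLO firmware is actually healthy):

#!/usr/bin/env python3
# Minimal iLO reachability probe: can we open TCP 443 (https) and 22 (ssh)?
import socket
import sys

ILO_HOST = "ilo-node1.example.com"   # placeholder -- your iLO address here
PORTS = {"https": 443, "ssh": 22}

failures = []
for name, port in PORTS.items():
    try:
        with socket.create_connection((ILO_HOST, port), timeout=5):
            pass                     # port accepted the connection
    except OSError as exc:
        failures.append("%s (%d): %s" % (name, port, exc))

if failures:
    sys.stderr.write("iLO %s check FAILED: %s\n" % (ILO_HOST, "; ".join(failures)))
    sys.exit(1)

print("iLO %s: https and ssh ports reachable" % ILO_HOST)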

There is a user-contributed fence_ilo patch posted somewhere on this
list that is worth investigating -- it runs A LOT FASTER than the
stock agent.  AFAIK, fence_ilo does not use ssh, but rather a
SOAP-style web services call over https.  In both production and
testing we have seen a typical fence_ilo operation take 42 seconds,
and a good percentage of the time up to twice that.  The bladecenter
fencing operations we have seen complete in under 7 seconds.
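
Those numbers come from timing the agents by hand, roughly as in the
sketch below.  The agent path and the stdin option names are
placeholders from memory (they differ between agents and versions, so
check the man pages before copying anything):

#!/usr/bin/env python3
# Time one fencing operation by driving the agent over its stdin interface.
import subprocess
import time

AGENT = "/sbin/fence_ilo"              # or /sbin/fence_bladecenter
OPTIONS = "\n".join([
    "hostname=ilo-node1.example.com",  # placeholder values; option names
    "login=fenceuser",                 #   vary by agent and version
    "passwd=secret",
    "action=off",
]) + "\n"

start = time.time()
result = subprocess.run([AGENT], input=OPTIONS, text=True)
elapsed = time.time() - start

print("%s exited %d after %.1f seconds" % (AGENT, result.returncode, elapsed))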

We are starting to roll out smart APC switches to allow for remote
power control, and I will consider adding them as secondary fence
devices.  But I figure that if the highly-available dual AMMs in a
blade chassis fail, I probably have a lot more problems to deal with
-- and I would rather have my clustered apps suspend than incur any
further harm.
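
For reference, a secondary fence device in cluster.conf is just a
second <method> block under the node's <fence> section -- fenced tries
the methods in order and only falls through to the APC outlet if the
first method fails.  A rough sketch only (names, addresses and
attribute names below are made up and vary by agent version):

<clusternode name="node1" nodeid="1">
  <fence>
    <method name="1">
      <!-- primary: blade chassis management module -->
      <device name="bc1" blade="3"/>
    </method>
    <method name="2">
      <!-- backup: switched APC outlet feeding node1 -->
      <device name="apc1" port="1"/>
    </method>
  </fence>
</clusternode>

<fencedevices>
  <fencedevice agent="fence_bladecenter" name="bc1"
               ipaddr="amm.example.com" login="fenceuser" passwd="secret"/>
  <fencedevice agent="fence_apc" name="apc1"
               ipaddr="apc.example.com" login="apc" passwd="secret"/>
</fencedevices>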


________________________________________________________________________


Robert Hurst, Sr. Caché Administrator
Beth Israel Deaconess Medical Center
1135 Tremont Street, REN-7
Boston, Massachusetts   02120-2140
617-754-8754 ∙ Fax: 617-754-8730 ∙ Cell: 401-787-3154
Any technology distinguishable from magic is insufficiently advanced.


On Mon, 2009-04-13 at 12:19 -0400, Ian Hayes wrote:

> I realize that the ssh option is not optimal, but I'm stuck with the
> design requirements. I'm hoping I can get them changed.
> 
> But, this got me thinking... conventional fencing is not failsafe. I
> can think of quite a number of less than optimal but entirely
> real-world situations where a node can die and not be able to be
> absolutely fenced off. iLO only works if the victim node still has
> power. I've only been in one shop that had the APC managed power, and
> they didn't even have that set up. Brocade fencing doesn't always
> apply, especially if you're just doing a virtual IP. So having a
> second fencing method as a backup may not always be feasible.
> 
> So even with more traditional fences, this may not work unless I start
> modding fence scripts to return a success code even if they fail.
> 
> 
> On Fri, Apr 10, 2009 at 2:36 AM, Virginian
> <virginian at blueyonder.co.uk> wrote:
> 
>         Hi Ian,
>          
>         I think there is a flaw in the design. For example, say the
>         network card fails on machine A. Machine B detects this and
>         tries to fence machine A. The problem with doing it via ssh to
>         modify iptables is that there is no network connectivity to
>         machine A, so this mechanism will never work. What you need is
>         a solution that works independently of the OS, such as a power
>         switch or a remote management interface (IBM RSA II, HP iLO,
>         etc.). With fencing, the solution has to be absolute and
>         ruthless: in this example, machine B needs to be able to fence
>         machine A every time there is a problem, and as soon as there
>         is a problem.
>          
>         Regards
>          
>         John
>          
>          
>                 
>                 ----- Original Message ----- 
>                 From: Ian Hayes 
>                 To: linux-cluster at redhat.com 
>                 Sent: Friday, April 10, 2009 1:07 AM
>                 Subject: [Linux-cluster] Fenced failing continuously
>                 
>                 
>                 
>                 I've been testing a newly built 2-node cluster. The
>                 cluster resources are a virtual IP and squid, so on a
>                 node failure the VIP would move to the surviving node
>                 and start up Squid. I'm running a modified fencing
>                 agent that SSHes into the failed node and firewalls
>                 it off via iptables (not my choice).
>                 
>                 This all works fine for graceful shutdowns, but when I
>                 do something nasty like pulling the power cord on the
>                 node that is currently running the service, the
>                 surviving node never assumes the service and spends
>                 all its time trying to fire off the fence agent, which
>                 obviously will not work because the server is
>                 completely offline. The only way I can get the
>                 surviving node to assume the VIP and start Squid is to
>                 run fence_ack_manual, which sort of defeats the
>                 purpose of running a cluster in the first place. The
>                 logs are filled with:
>                 
>                 Apr 12 00:01:44 <hostname> fenced[3223]: fencing node
>                 "<otherhost>"
>                 Apr 12 00:01:44 <hostname> fenced[3223]: agent
>                 "fence_iptables" reports: Could not disable
>                 xx.xx.xx.xx: ssh: connect to host xx.xx.xx.xx port 22:
>                 No route to host
>                 
>                 Is this a misconfiguration, or is there an option I
>                 can include somewhere to tell the nodes to give it up
>                 after a certain number of tries?
>                 
>                 
>                 