[Linux-cluster] What if the fence device doesn't work?

Janne Peltonen janne.peltonen at helsinki.fi
Tue Nov 21 06:59:24 UTC 2006


Hi!

I started wondering what happens if my fence device is broken. The
scenario:

 - a node (running a service) fails
 - another node notices the lost heartbeats and tries to fence the
   failed node
 - however, the fence device doesn't respond
 - ...what now?

I tried to simulate the situation on our test cluster of two HP Blade
servers, which use iLO fencing, by misconfiguring the fencing agent to
authenticate to the iLO with a wrong username. What happens is that
fenced on the running node tries to fence the failed node over and over
again, and the service I'm trying to fail over never leaves the state
"Started" on node "Unknown". That is, the cluster won't fail it over to
the running node.
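
To make the behaviour concrete, here is a toy Python model of what I
seem to be observing. It is only a sketch under my own assumptions, not
the real fenced or rgmanager code: the names fence_agent,
recover_service and max_attempts are made up, and the attempt limit
exists only so that the example terminates (the real daemon appears to
retry indefinitely).

```python
# Toy model of the observed behaviour; NOT the real fenced/rgmanager code.
# All names here (fence_agent, recover_service, max_attempts) are hypothetical.


def fence_agent(node, username):
    """Pretend iLO fence agent: succeeds only with the right login."""
    return username == "correct-user"


def recover_service(failed_node, username, max_attempts=5):
    """Try to fence failed_node; relocate the service only if fencing succeeds.

    max_attempts is an assumption added so that this sketch terminates.
    """
    state = {"service": "started", "owner": "unknown", "attempts": 0}
    for _ in range(max_attempts):
        state["attempts"] += 1
        if fence_agent(failed_node, username):
            # Fencing confirmed: safe to start the service elsewhere.
            state["owner"] = "surviving-node"
            return state
    # Fencing never succeeded, so the service is never relocated.
    return state


if __name__ == "__main__":
    print(recover_service("node1", "wrong-user"))
    print(recover_service("node1", "correct-user"))
```

With the wrong username the model never changes the owner from
"unknown", which matches the stuck state described above: recovery is
gated entirely on a successful fence.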

Not good. If the active node fails, and the fence device fails at the
same time - for example, if the active node is a Xen guest and the Xen
host fails, or if the active node loses power because the network power
switch fails or because the iLO gets confused - the service is lost.
The Xen scenario doesn't even seem too far-fetched...

Am I missing something?


--Janne Peltonen
Univ. of Helsinki
mail admin



