[Linux-cluster] More CS4 fencing fun

Lon Hohberger lhh at redhat.com
Tue Mar 7 19:12:39 UTC 2006


On Tue, 2006-03-07 at 17:04 +0100, Matteo Catanese wrote:

> Result: One node perfectly up but cluster service stalled

Fencing never completes because iLO does not have power.  This is an
architectural limitation of using iLO (or IPMI, for that matter) as the
sole fencing method in a cluster environment.  Compare to RSA, which
can have its own external power supply even though it is an integrated
solution like iLO.

With redundant power supplies, the expectation is that different
circuits (or, preferably, entirely different power sources) are used,
which should make the case you tested significantly less likely to
occur.
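
One common mitigation is to give each node a second fence method that
fenced can fall back to when iLO is unreachable, e.g. a network power
switch.  Roughly, the relevant cluster.conf fragment looks something
like this (node names, addresses, and attribute spellings below are
only illustrative; check the fence_ilo / fence_apc man pages for the
exact attributes your release expects):

   <clusternode name="node1" votes="1">
     <fence>
       <method name="1">
         <device name="node1-ilo"/>
       </method>
       <method name="2">
         <device name="apc-switch" port="1"/>
       </method>
     </fence>
   </clusternode>

   <fencedevices>
     <fencedevice name="node1-ilo" agent="fence_ilo"
                  hostname="10.0.0.11" login="admin" passwd="secret"/>
     <fencedevice name="apc-switch" agent="fence_apc"
                  ipaddr="10.0.0.50" login="apc" passwd="apc"/>
   </fencedevices>

fenced tries the methods in order, so the power switch is only used
after the iLO attempt fails.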


> Switch time: 55 seconds (+ oracle startup time).

Hrm, the backup node should take over the service after the primary node
is confirmed 'dead', i.e. after fencing is complete.  It should
certainly not be waiting around for the other node to come back to life.
What does your fence + service configuration look like, and were there
any obvious log messages which might explain the odd behavior?


> Cluster is stalled
> 
> Can you change fence behaviour to be less "radical" ?
> 
> If iLO is unreachable, it means the machine is already off and could
> not be powered on, so fence should spit out a warning and let the
> failover happen

iLO being unreachable means iLO is unreachable, and assumptions as to
why should probably not be limited to lack of power.  Routing problems,
bad network cable, disconnected cable, and the occasional infinite
iLO-DHCP loop will all make iLO unreachable, but in no way confirm that
the node is dead.

More to the point, though, you can get around this particular behavior
(fencing on startup -> hang because fencing fails) by starting fenced
with the clean start parameter.  In a two node cluster, this is useful
to start things up in a controlled way when you know you won't be able
to fence the other node.  I think it's:

   fence_tool join -c

If you (the administrator) are sure that the node is dead and does not
have any services running, this causes fenced not to fence the other
node on startup, thereby avoiding the hang entirely.  However, doing
this automatically is unsafe: if both nodes are booting while a network
partition exists between them, the cluster will end up with a split
brain.
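
As a rough sketch of what that looks like on a CS4 box when you know
the peer is down (init script names and ordering here are from memory,
so adapt them to your installation):

   # Only when you are certain the other node is powered off and is
   # running no cluster services:
   service ccsd start
   service cman start
   fence_tool join -c     # start fenced without fencing the absent node
   service rgmanager start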

-- Lon



