[Linux-cluster] What if the fence device doesn't work?

Janne Peltonen janne.peltonen at helsinki.fi
Tue Nov 21 14:07:40 UTC 2006


On Tue, Nov 21, 2006 at 08:26:20AM -0500, Eric Kerin wrote:
> Actually, it is good. A node failure comes in many shapes and sizes,
> from a full system failure (where the whole machine is powered off) to
> a partial failure (where only the NIC used for heartbeat failed, but
> not the OS or disk controllers). If only the NIC fails, your service
> is still running, still updating the hard drive, and still generally
> running correctly, but it's not able to send heartbeats.
> 
>    Now, if the other system tries to take over the service, assuming
> that the failed node is offline, it will mount the drive and start the
> service. Since two systems then have the same non-clustered filesystem
> mounted read-write, they will corrupt it pretty quickly. That is what
> fencing is designed to prevent.
> 
>    So to keep that scenario from happening, the cluster software
> ensures that a successful fence occurs before continuing operation.
> It's a fail-safe setup: better to take 30 minutes of downtime for an
> admin to make the right decision than to corrupt your filesystems and
> take 8-24 hours of downtime to restore the system.

I do understand the basics. I wouldn't want the cluster suite to think
that a node can't access a resource such as a filesystem when it can.
It would just be nice to configure the cluster suite so that if one
method of fencing fails, it tries another, instead of mindlessly banging
its head on the wall. For example, if Xen fencing doesn't work because
the fence agent can't ssh to the host system (the host being down), it
would be nice if fenced tried to fence the host system next...

Yes, I am particularly concerned about Xen systems. And, more
abstractly, about the idea of a failing fence device. If I have a
service that can run on only one node of the cluster, and my fence
device is broken in such a way that the active node goes down with it,
as would be the case with Xen fencing and a failing Xen host, then I
don't see a way to create a no-single-point-of-failure configuration.
And that is what I find 'not good'. ;)


--Janne Peltonen



