[Linux-cluster] Fence device, How it work

Lon Hohberger lhh at redhat.com
Tue Nov 8 21:40:39 UTC 2005


On Tue, 2005-11-08 at 07:52 -0800, Michael Will wrote:
> > Power-cycle.
>
> I always wondered about this.  If the node has a problem, chances are
> that rebooting does not fix it.  Now if the node comes up
> semi-functional and attempts to regain control over the resource it
> owned before, that could be bad.  Shouldn't it rather be shut down so
> that human intervention can fix it before it is made operational again?

This is a bit long, but maybe it will clear some things up a little.  As
far as a node taking over a resource it thinks it still owns after a
reboot (without notifying the other nodes of its intentions), that would
be a bug in the cluster software, and a really *bad* one too!

A couple of things to remember when thinking about failures and fencing:

(a) Failures are rare.  A decent PC has something like 99.95% uptime (I
wish I knew where I heard/read this long ago) with no redundancy at all.
A server with ECC RAM, RAID for internal disks, etc. probably has a
higher uptime.
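
(For scale - my own arithmetic, not a figure from anywhere - 99.95%
uptime works out to roughly:

    (1 - 0.9995) * 365 * 24  =~  4.4 hours of downtime per year

...so even an unremarkable box is down very rarely.)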

(b) The hardware component most likely to fail is a hard disk (moving
parts).  If that's the root hard disk, the machine probably won't boot
again.  If it's the shared RAID set, then the whole cluster will likely
have problems.

(c) I hate to say this, but the kernel is probably more likely to fail
(panic, hang) than any single piece of hardware.

(d) Consider this (I think this is an example of what you said?):
    1. Node A fails
    2. Node B reboots node A
    3. Node A correctly boots and rejoins cluster
    4. Node A mounts a GFS file system correctly
    5. Node A corrupts the GFS file system

What is the chance that 5 will happen without data corruption having
already occurred before 1?  Very slim, but nonzero - which brings me to
my next point...

(e) Always make backups of critical data, no matter what sort of block
device or cluster technology you are using.  A bad RAM chip (e.g. a
parity RAM chip missing double-bit errors) can cause periodic, quiet
data corruption.  The chances of this happening are also very slim, but
again, nonzero - probably at least as likely as (d).
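
(Even something as simple as a nightly rsync to a machine outside the
cluster goes a long way.  A sketch - the paths and host name here are
made up for the example:

    # nightly cron job on one node; copy the critical data off-cluster
    rsync -a --delete /mnt/gfs/critical/ backuphost:/backups/gfs-critical/

Tape, dump/restore, whatever - the point is that the copy lives
somewhere the cluster can't scribble on.)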

(f) If you're worried about (d) and are willing to take the expected
uptime hit for a given node when that node fails, even given (c), you
can always change the cluster configuration to turn a node "off" instead
of rebooting it. :)
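
For example (an untested sketch - the exact attribute names depend on
your fence agent and cluster suite version, and the node/device/port
names here are made up), with an APC-style power switch you would ask
the fence method for "off" rather than the default "reboot":

    <clusternode name="nodeA" votes="1">
            <fence>
                    <method name="1">
                            <device name="apc1" port="1" option="off"/>
                    </method>
            </fence>
    </clusternode>

The node then stays powered down until a human turns it back on.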

(g) You can chkconfig --del the cluster components so that they don't
automatically start on reboot; same effect as (f): the node won't
reacquire the resources if it never rejoins the cluster...
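
Something along these lines on a RHEL4-era cluster (the exact service
names vary between releases, so treat this as a sketch):

    chkconfig --del rgmanager
    chkconfig --del gfs
    chkconfig --del clvmd
    chkconfig --del fenced
    chkconfig --del cman
    chkconfig --del ccsd

Once you've looked the node over, chkconfig --add them again (or start
them by hand) and let it rejoin.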


> I/O fencing instead of power fencing kind of works like this: you undo
> the I/O block once you know the node is fine again.

Typically, we refer to that as "fabric level fencing" vs. "power level
fencing"; both fit the I/O fencing paradigm of preventing a node from
flushing buffers after it has misbehaved.
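
(A fabric-level example, again just a sketch with made-up addresses and
port numbers: instead of a power switch, the node's fence method points
at something like a Brocade FC switch, and the agent disables that
node's switch port:

    <fencedevice name="brocade1" agent="fence_brocade"
                 ipaddr="10.0.0.5" login="admin" passwd="secret"/>

    <!-- in the node's <fence><method> block: cut off FC switch port 3 -->
    <device name="brocade1" port="3"/>

Re-enabling the port once the node checks out is then up to the admin -
that's the "undo the I/O block" step you mentioned.)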

Note that typically the only way to be 100% positive a node has no
buffers waiting after it has been fenced at the fabric level is a hard
reboot.

Many administrators will reboot a failed node as a first attempt to fix
it anyway - so we're just saving them a step :)  (Again, if you want,
you can always do (f) or (g) above...)

-- Lon



