[Linux-cluster] Using qdisk and a watchdog timer to eliminate power fencing single point of failure?

Lon Hohberger lhh at redhat.com
Fri Feb 22 19:00:05 UTC 2008


On Thu, 2008-02-21 at 11:43 -0800, Jonathan Biggar wrote: 
> I've got a deployment scenario for a two node cluster (where services 
> are configured as active and standby) where the customer is concerned 
> that the external power fencing device I am using (WTI) becomes a single 
> point of failure.  If the WTI for the active node dies, taking down the 
> active node, the standby cannot bring up services because it cannot 
> successfully fence the failed node.  This leaves the cluster down.

Correct.  Though if you plug in a serial terminal server, I have a
patch that talks to the WTI switch through the terminal server in
case the switch gets unjacked from the network.


> In the setup, storage fencing is not feasible as a backup for power fencing.

Not even fence_scsi (SCSI-3 persistent reservations)?  That's
unfortunate :(


> I think I've worked out a scenario using qdiskd and the internal 
> hardware watchdog timers in our nodes to use as a backup for power 
> fencing that I hope will eliminate the single point of failure.

Hardware watchdog timers = good stuff.


> Here's how I see it working:

> 2.  Create a heuristic (besides the usual network reachability test) for 
> qdisk that resets the node's hardware watchdog timer.  (I'll have to do 
> some additional work to ensure that the watchdog gets turned off if I am 
> gracefully shutting down the node's qdisk daemon.)

There's a watchdog daemon (userspace code) that lets you configure
heuristics for it.  Most are internal to it - and are therefore superior
to how qdiskd does heuristics from an HA / memory-neutrality perspective.
If some heuristics are not met, the daemon can, at your option, stop
touching the watchdog device.
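
To make the mechanism concrete: a userspace petter is basically a loop
like the sketch below.  This is not the watchdog daemon's actual code,
and heuristics_ok() is a made-up placeholder - it just shows the "stop
touching the device and let the timer fire" behavior, using the
standard Linux watchdog interface (any write pets the timer; writing
'V' before close disarms it, which is also how you'd handle the
graceful-shutdown concern above):

    import os, time

    PET_INTERVAL = 5                  # made-up value

    def heuristics_ok():
        # placeholder heuristic; the real daemon has many built in
        return os.path.exists('/proc/net/dev')

    wd = os.open('/dev/watchdog', os.O_WRONLY)
    try:
        while True:
            if heuristics_ok():
                os.write(wd, b'\0')   # pet the hardware timer
            # else: stop petting; the timer fires and resets the node
            time.sleep(PET_INTERVAL)
    finally:
        os.write(wd, b'V')            # graceful shutdown (magic close)
        os.close(wd)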


There's an open bugzilla to provide an integration path between qdiskd
and watchdogd - so that you can configure heuristics for watchdogd and
have qdiskd base its state on those.

For example, if watchdogd says "ok, we're not updating the watchdog
driver because of X", qdiskd can trigger a self-demotion off of that, or
maybe even write an 'If you don't hear from me in X seconds, consider me
dead' message to disk...?
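
That message could be as simple as a per-node lease record on the
shared disk.  A rough sketch of the idea in Python - the record
layout and names below are NOT qdiskd's real on-disk format, just an
illustration:

    # Hypothetical per-node lease record on shared storage; NOT
    # qdiskd's real on-disk layout.
    import os, struct, time

    SECTOR = 512
    DEAD_AFTER = 30   # 'consider me dead' window (made-up value)

    def write_lease(dev, node_id):
        # record = node id, timestamp, declared timeout
        rec = struct.pack('!IdI', node_id, time.time(), DEAD_AFTER)
        fd = os.open(dev, os.O_WRONLY)  # real code would want O_DIRECT
        try:
            os.lseek(fd, node_id * SECTOR, os.SEEK_SET)
            os.write(fd, rec.ljust(SECTOR, b'\0'))
            os.fsync(fd)
        finally:
            os.close(fd)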


> 3.  Create a custom fencing script that is run if power fencing fails 
> that examines qdisk's state to see if the node that needs to be fenced 
> is no longer updating the quorum disk.

I think the easiest thing to do is make a quick, small-footprint API or
utility to talk to qdiskd to get states...


> (I'm not sure how to do this--I 
> hope that the information stored in qdisk's status_file will be 
> sufficient to determine this; if not, I might have to modify qdisk to 
> supply what I need.)

... because status_file is *sketchy* at best (really, it's a debugging
tool). ;)
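
For example, if a per-node lease record like the sketch above were
written to the shared disk, the backup fence agent could poll it
directly instead of parsing status_file.  All names here are
hypothetical - the point is the shape of the check, not a real agent:

    import os, struct, sys, time

    SECTOR = 512

    def read_lease(dev, node_id):
        fd = os.open(dev, os.O_RDONLY)
        try:
            os.lseek(fd, node_id * SECTOR, os.SEEK_SET)
            data = os.read(fd, SECTOR)
        finally:
            os.close(fd)
        return struct.unpack('!IdI', data[:16])  # id, stamp, timeout

    def victim_is_dead(dev, node_id):
        _, first, timeout = read_lease(dev, node_id)
        # Poll for the declared window; comparing stamps (rather
        # than clocks) sidesteps clock skew between the nodes.
        deadline = time.time() + timeout
        while time.time() < deadline:
            time.sleep(1)
            _, stamp, _ = read_lease(dev, node_id)
            if stamp != first:
                return False          # victim is alive: fence FAILED
        return True                   # lease expired: treat as fenced

    if __name__ == '__main__':
        dev, node = sys.argv[1], int(sys.argv[2])
        sys.exit(0 if victim_is_dead(dev, node) else 1)

A real agent would also want to wait out the hardware watchdog timeout
on top of the lease window before declaring success, since that is
what actually guarantees the node has reset.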



> The standby node should then be sure that the active node has rebooted 
> itself either by qdiskd's action or via the watchdog timer, or else it 
> is power dead.


> Can anyone see a weakness in this approach I haven't thought of?

It's good from a best-effort standpoint.  We don't have anything that
does 'best effort' fencing - it's mostly all black/white.

A question that comes up is: if we use the watchdog + watchdog daemon,
do we need qdisk at all?  I mean, if there's an 'eventual timeout'
anyway, based on the expectation that the watchdog timer will fire and
we rely on it - why bother with the intermediate steps?

Hardware watchdog timers are going to be more reliable than just about
anything qdiskd could provide.

-- Lon