[Linux-cluster] Re: Using qdisk and a watchdog timer to eliminate power fencing single point of failure?
Jonathan Biggar
jon at floorboard.com
Sat Feb 23 17:06:58 UTC 2008
Lon Hohberger wrote:
>> I've got a deployment scenario for a two node cluster (where services
>> are configured as active and standby) where the customer is concerned
>> that the external power fencing device I am using (WTI) becomes a single
>> point of failure. If the WTI for the active node dies, taking down the
>> active node, the standby cannot bring up services because it cannot
>> successfully fence the failed node. This leaves the cluster down.
>
> Correct. Although, if you plug in a serial terminal server, I have a
> patch to talk to the WTI switch through the terminal server in case
> the server gets unjacked.
Actually, I'm more worried about a WTI that blows up, taking the active
node with it. A terminal server won't help with that.
>> In the setup, storage fencing is not feasible as a backup for power fencing.
>
> Not even using fence_scsi? (SCSI3 reservations)? That's unfortunate :(
Well, it's possible, but this solution may be deployed with many
different SAN implementations, so I was hoping to find a way to avoid
having to certify that each SAN does SCSI reservations correctly.
>> I think I've worked out a scenario using qdiskd and the internal
>> hardware watchdog timers in our nodes to use as a backup for power
>> fencing that I hope will eliminate the single point of failure.
>
> Hardware watchdog timers = good stuff.
>
>
>> Here's how I see it working:
>
>> 2. Create a heuristic (besides the usual network reachability test) for
>> qdisk that resets the node's hardware watchdog timer. (I'll have to do
>> some additional work to ensure that the watchdog gets turned off if I am
>> gracefully shutting down the node's qdisk daemon.)
>
> There's a watchdog daemon (userspace code) that lets you configure
> heuristics for it. Most are internal to it - and are therefore superior
> to how qdiskd does heuristics from a HA / memory-neutrality perspective.
> If some heuristic(s) are not met, the daemon can at your option stop
> touching the watchdog device.
>
> There's an open bugzilla to provide an integration path between qdiskd
> and watchdogd - so that you can configure heuristics for watchdogd and
> have qdiskd base its state on those.
>
> For example, if watchdogd says "ok, we're not updating the watchdog
> driver because of X", qdiskd can trigger a self-demotion off of that, or
> maybe even write a 'If you don't hear from me in X seconds, consider me
> dead' message to disk...?
That looks like good stuff; I'll look into it. From looking at
watchdogd, it can monitor whether a file gets updated, so it's easy to
integrate qdiskd and watchdogd in a simple fashion by just having a
qdiskd heuristic that touches a file.
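Roughly what I have in mind -- just a sketch, with the file name,
intervals, and config fragments made up rather than tested:

#!/bin/sh
# /usr/local/sbin/qdisk_alive.sh -- hypothetical heuristic program.
#
# Wired into cluster.conf as a qdiskd heuristic, something like:
#   <quorumd interval="1" tko="10" ...>
#     <heuristic program="/usr/local/sbin/qdisk_alive.sh"
#                interval="10" score="1"/>
#   </quorumd>
#
# and watched from /etc/watchdog.conf with something like:
#   file = /var/run/qdisk-alive
#   change = 60
#
# The heuristic always succeeds; its only job is to prove to watchdogd
# that qdiskd is alive and still evaluating heuristics.
touch /var/run/qdisk-alive
exit 0

For a graceful shutdown I'd stop watchdogd before qdiskd, so the timer
gets disarmed rather than firing -- which also covers my earlier worry
about turning the watchdog off when stopping the quorum daemon cleanly.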
>> 3. Create a custom fencing script that is run if power fencing fails
>> that examines qdisk's state to see if the node that needs to be fenced
>> is no longer updating the quorum disk.
>
> I think the easiest thing to do is make a quick, small-footprint API or
> utility to talk to qdiskd to get states...
That's what I figured.
>> (I'm not sure how to do this--I
>> hope that the information stored in qdisk's status_file will be
>> sufficient to determine this, if not, I might have to modify qdisk to
>> supply what I need.)
>
> ... because status_file is *sketchy* at best (really, it's a debugging
> tool). ;)
I was afraid of that...
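For the fence side, I'm imagining the backup agent looking roughly like
this (only a sketch; check_qdisk_state is a stand-in for whatever query
utility/API ends up existing, and the exact option names may differ):

#!/bin/sh
# fence_qdisk_check -- hypothetical backup fence agent, listed in
# cluster.conf as a second fence method so it only runs after
# fence_wti has already failed.
#
# Like other fence agents, it reads its options as name=value pairs
# on stdin and must exit 0 only if it can positively claim the victim
# is down.

while read line; do
    case "$line" in
        nodename=*) victim=${line#nodename=} ;;
        timeout=*)  timeout=${line#timeout=} ;;
    esac
done

[ -n "$victim" ] || exit 1

# Give the victim's own watchdog timer time to fire before asking.
sleep "${timeout:-60}"

# check_qdisk_state is a placeholder for the small qdiskd query
# utility discussed above -- it doesn't exist today.
if check_qdisk_state --node "$victim" --stale; then
    exit 0   # victim stopped updating the quorum disk; call it fenced
fi
exit 1       # can't confirm -- don't lie to fenced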
>> The standby node can then be sure that the active node has rebooted
>> itself, either by qdiskd's action or via the watchdog timer, or else
>> has lost power entirely.
>
>
>> Can anyone see a weakness in this approach I haven't thought of?
>
> It's good from a best-effort standpoint. We don't have anything that
> does 'best effort' fencing - it's mostly all black/white.
>
> A question that comes up is: if we use the watchdog + watchdog daemon,
> do we need qdisk at all? I mean, if there's an 'eventual timeout'
> anyway based on the expectation that the watchdog timer will fire and we
> rely on it - why bother with the intermediate steps?
>
> Hardware watchdog timers are going to be more reliable than just about
> anything qdiskd could provide.
OK, I get it. It's probably a couple of orders of magnitude more
reliable, but since it relies only on timing, there's no real *positive*
indication that the fencing succeeded, so it's really only best-effort.
Even though it would take three failures (network disruption of the
heartbeat, qdiskd failing to reboot the node, and the watchdog timer
failing as well), there's still a slim, slim chance that the node is
still trying to write to the SAN. If I want to guarantee that there's
never a split brain, then this isn't good enough.
Thanks for the advice.
--
Jon Biggar
Floorboard Software
jon at floorboard.com
jon at biggar.org