[Linux-cluster] Re: Using qdisk and a watchdog timer to eliminate power fencing single point of failure?
Jonathan Biggar
jon at floorboard.com
Sat Feb 23 17:06:58 UTC 2008
Lon Hohberger wrote:
>> I've got a deployment scenario for a two node cluster (where services
>> are configured as active and standby) where the customer is concerned
>> that the external power fencing device I am using (WTI) becomes a single
>> point of failure. If the WTI for the active node dies, taking down the
>> active node, the standby cannot bring up services because it cannot
>> successfully fence the failed node. This leaves the cluster down.
>
> Correct. Although, if you plug in a serial terminal server, I have a
> patch to talk to the WTI switch through the terminal server in case
> the server gets unjacked.
Actually, I'm more worried about a WTI that blows up, taking the active
node with it. A terminal server won't help with that.
>> In the setup, storage fencing is not feasible as a backup for power fencing.
>
> Not even using fence_scsi? (SCSI3 reservations)? That's unfortunate :(
Well, it's possible, but this solution may be deployed with many
different SAN implementations, so I was hoping to find a way to avoid
having to certify that each SAN does SCSI reservations correctly.
>> I think I've worked out a scenario using qdiskd and the internal
>> hardware watchdog timers in our nodes to use as a backup for power
>> fencing that I hope will eliminate the single point of failure.
>
> Hardware watchdog timers = good stuff.
>
>
>> Here's how I see it working:
>
>> 2. Create a heuristic (besides the usual network reachability test) for
>> qdisk that resets the node's hardware watchdog timer. (I'll have to do
>> some additional work to ensure that the watchdog gets turned off if I am
>> gracefully shutting down the node's qdisk daemon.)
>
> There's a watchdog daemon (userspace code) that lets you configure
> heuristics for it. Most are internal to it - and are therefore superior
> to how qdiskd does heuristics from a HA / memory-neutrality perspective.
> If some heuristic(s) are not met, the daemon can at your option stop
> touching the watchdog device.
>
> There's an open bugzilla to provide an integration path between qdiskd
> and watchdogd - so that you can configure heuristics for watchdogd and
> have qdiskd base its state on those.
>
> For example, if watchdogd says "ok, we're not updating the watchdog
> driver because of X", qdiskd can trigger a self-demotion off of that, or
> maybe even write a 'If you don't hear from me in X seconds, consider me
> dead' message to disk...?
That looks like good stuff; I'll look into it. From looking at
watchdogd, it can monitor whether a file gets updated, so it's easy to
integrate qdiskd and watchdogd in a simple fashion by just having a
qdiskd heuristic that touches a file.
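Roughly what I have in mind -- just a sketch, with the file name,
intervals, and config fragments made up rather than tested:

#!/bin/sh
# /usr/local/sbin/qdisk_alive.sh -- hypothetical heuristic program.
#
# Wired into cluster.conf as a qdiskd heuristic, something like:
#   <quorumd interval="1" tko="10" ...>
#     <heuristic program="/usr/local/sbin/qdisk_alive.sh"
#                interval="10" score="1"/>
#   </quorumd>
#
# and watched from /etc/watchdog.conf with something like:
#   file = /var/run/qdisk-alive
#   change = 60
#
# The heuristic always succeeds; its only job is to prove to watchdogd
# that qdiskd is alive and still evaluating heuristics.
touch /var/run/qdisk-alive
exit 0

For a graceful shutdown I'd stop watchdogd before qdiskd, so the timer
gets disarmed rather than firing -- which also covers my earlier worry
about turning the watchdog off when stopping the quorum daemon cleanly.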
>> 3. Create a custom fencing script that is run if power fencing fails
>> that examines qdisk's state to see if the node that needs to be fenced
>> is no longer updating the quorum disk.
>
> I think the easiest thing to do is make a quick, small-footprint API or
> utility to talk to qdiskd to get states...
That's what I figured.
>> (I'm not sure how to do this--I
>> hope that the information stored in qdisk's status_file will be
>> sufficient to determine this, if not, I might have to modify qdisk to
>> supply what I need.)
>
> ... because status_file is *sketchy* at best (really, it's a debugging
> tool). ;)
I was afraid of that...
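For the fence side, I'm imagining the backup agent looking roughly like
this (only a sketch; check_qdisk_state is a stand-in for whatever query
utility/API ends up existing, and the exact option names may differ):

#!/bin/sh
# fence_qdisk_check -- hypothetical backup fence agent, listed in
# cluster.conf as a second fence method so it only runs after
# fence_wti has already failed.
#
# Like other fence agents, it reads its options as name=value pairs
# on stdin and must exit 0 only if it can positively claim the victim
# is down.

while read line; do
    case "$line" in
        nodename=*) victim=${line#nodename=} ;;
        timeout=*)  timeout=${line#timeout=} ;;
    esac
done

[ -n "$victim" ] || exit 1

# Give the victim's own watchdog timer time to fire before asking.
sleep "${timeout:-60}"

# check_qdisk_state is a placeholder for the small qdiskd query
# utility discussed above -- it doesn't exist today.
if check_qdisk_state --node "$victim" --stale; then
    exit 0   # victim stopped updating the quorum disk; call it fenced
fi
exit 1       # can't confirm -- don't lie to fenced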
>> The standby node can then be sure that the active node has rebooted
>> itself, either by qdiskd's action or via the watchdog timer, or else
>> has lost power entirely.
>
>
>> Can anyone see a weakness in this approach I haven't thought of?
>
> It's good from a best-effort standpoint. We don't have anything that
> does 'best effort' fencing - it's mostly all black/white.
>
> A question that comes up is: if we use the watchdog + watchdog daemon,
> do we need qdisk at all? I mean, if there's an 'eventual timeout'
> anyway based on the expectation that the watchdog timer will fire and we
> rely on it - why bother with the intermediate steps?
>
> Hardware watchdog timers are going to be more reliable than just about
> anything qdiskd could provide.
OK, I get it. It's probably a couple of orders of magnitude more
reliable, but since it relies only on timing, there's no real *positive*
indication that the fencing succeeded, so it's really only best-effort.
Even though it would take three failures (network disruption of the
heartbeat, qdiskd failing to reboot the node, and the watchdog timer
failing as well), there's still a slim, slim chance that the node is
still trying to write to the SAN. If I want to guarantee that there's
never a split brain, then this isn't good enough.
Thanks for the advice.
--
Jon Biggar
Floorboard Software
jon at floorboard.com
jon at biggar.org