[linux-lvm] has anyone used LVM in a HA cluster?

Mon May 13 11:28:01 UTC 2002

> > We're going that route.
> > 
> 
> which may backfire, if both nodes think the other one is down (split 
> brain again) and start the shutdown procedure. okay, this is a very rare 
> situation, and may happen only under strange load and scheduling 
> parameters, but it will happen as any other "very rare situation" 
> happens ;-). especially in HA environments they seem to happen much more 
> often than in simple single point of failure environments ;-) you won't 
> lose your filesystem, but the service is unavailable.

Possible, but unlikely in practice.  In any event, we have
constructed a load-sharing off-site node that receives redirected
traffic via GSLB if the main node dies, so we have some time to wander
over to the datacenter and kick one of the HA nodes.  Writes can't
happen for a few minutes, but we're not recording financial transactions
directly, and email notices of any pending purchases just queue up in
the meantime.  This solution was not really what I had initially
planned, but it was a stipulation of a large contract we bid on (and
won), so we just went ahead and built it.  Better to hobble along for a
while during a problem than to just fall over dead.  Meanwhile, the
issue of incremental backups (e.g. to recover from user errors) is, at
least in my environment, fairly well handled by using CVS for code and
nightly rsync runs for so-called 'static' files (a misnomer, but 'static'
in the sense that we only keep one revision kicking around :-)).

I've seen an awful lot of high-availability systems in production, and
the one I liked most (because it never seemed to cause problems) was the
IBM HACMP cluster(s) at IBM Microelectronics.  Until that becomes a
reasonable possibility for Linux, I guess we'll just stick to multiple
levels of redundancy and failover, the good old 'two safety nets are
better than one' theory...

The HACMP guys were the ones who suggested STONITH for Linux; while your
split-brain scenario could lead to both nodes losing power, this is not
as big an issue (with journaled filesystems) as corruption would be.
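
To make that trade-off concrete, here is a toy Python model of the
decision. It is only a sketch under assumptions of my own: the names
Node, resolve and writers are made up, it talks to no real cluster
software, and it assumes the worst case where both power-off requests
land at the same time. Each node that cannot see its peer shoots it,
and the standby mounts the shared volume only once the active node is
off.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Node:
    name: str
    powered: bool = True
    mounted: bool = False  # True if this node has the shared volume mounted


def resolve(a_sees_b: bool, b_sees_a: bool):
    """Apply the 'shoot the other node' rule once and return both nodes.

    Node a starts as the active node, b as the standby. Both fencing
    decisions are made before either takes effect, so a mutual kill
    is possible (the split-brain worst case).
    """
    a = Node("a", mounted=True)
    b = Node("b")

    a_fires = not a_sees_b  # a thinks b is dead and powers it off
    b_fires = not b_sees_a  # b thinks a is dead and powers it off
    if a_fires:
        b.powered = False
    if b_fires:
        a.powered = False

    # A node without power cannot hold the mount.
    for node in (a, b):
        if not node.powered:
            node.mounted = False

    # The standby takes over only if it survived and the active node is off.
    if b.powered and not a.powered:
        b.mounted = True
    return a, b


def writers(a: Node, b: Node) -> int:
    """Number of nodes that can currently write to the shared volume."""
    return int(a.mounted) + int(b.mounted)


if __name__ == "__main__":
    for a_sees_b, b_sees_a in product((True, False), repeat=2):
        a, b = resolve(a_sees_b, b_sees_a)
        print(a_sees_b, b_sees_a, "writers:", writers(a, b))
```

In this model there is never more than one writer. When neither node
can see the other, both end up powered off: zero writers, so the
service is down but nothing is scribbling on the shared disk.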

-- 
    "The most valuable piece of equipment in the darkroom
     is the trash can."
                                  --Ansel Adams