[Linux-cluster] Failover after partial failure because of SAN?

Fri Nov 4 10:04:57 UTC 2011

On Fri, Nov 4, 2011 at 4:03 PM, Jochen Schneider
<jochen.schneider at gmail.com> wrote:
> Hi,
>
> We are setting up a cluster for a storage application with SAN disks managed
> through HA-LVM and connected through multipath. There are actually two
> applications which have to run on the same node,

HAVE to run on the same node? Why? Can't they communicate via TCP/IP?

> but only one of them needs
> the disk. Both of them have clients.
>
> The question I have is what should happen when the SAN fails: Should both
> applications failover to another machine (possibly after a retry) or should
> the application which doesn't need the disk keep running while the other is
> shut down?

You're not giving yourself much of a choice. Since you say both
applications HAVE to run on the same node, I assume they are related
(e.g. one needs the other). In that case, the only viable option is to
fail over both of them together.

Having said that, I'm curious what you mean by "SAN fails". It's
rare for one cluster node to suddenly be unable to access the SAN
while the others can access it just fine. Usually either the SAN is
completely inaccessible (e.g. broken SAN or switches) or a server node
fails.

> I'm not sure how much recovery can come out of a failover in case
> of a SAN failure, if it's not both network cards of the node which are
> defective or whatever.

Exactly :)

If no node can access the SAN, then there is nowhere to fail over to.
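
To spell that reasoning out, here is a rough sketch in Python. It is
only an illustration of the decision, not how any cluster manager
implements it; the node names and the reachability map are made up,
and in practice you would have to feed it from your own path checks.

```python
from typing import Dict, Optional


def pick_failover_target(san_reachable: Dict[str, bool], current: str) -> Optional[str]:
    """Return the node the services should run on, or None if no node helps.

    san_reachable is a hypothetical map from node name to whether that
    node can currently reach the SAN.
    """
    # The current node still sees the SAN: stay where we are.
    if san_reachable.get(current, False):
        return current
    # Otherwise look for any other node that still has a working path.
    for node in sorted(san_reachable):
        if node != current and san_reachable[node]:
            return node
    # No node can reach the SAN, so relocating the services gains nothing.
    return None
```

With {"node1": False, "node2": True} and the services on node1, this
picks node2. With every node False it returns None, which is the
"broken SAN or switches" case where failover doesn't buy you anything.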

-- 
Fajar