[Linux-cluster] Starter Cluster / GFS

Jankowski, Chris Chris.Jankowski at hp.com
Thu Nov 11 09:59:13 UTC 2010


Gordan,

I do understand the mechanism.  I was trying to gently point out that this behaviour is unacceptable for my commercial IP customers. The customers buy clusters for high availability. Losing the whole cluster due to a single component failure - the heartbeat link - is not acceptable. The heartbeat link is a huge SPOF, and the cluster design does not support redundant links for heartbeat.

Also, none of the commercially available UNIX or Linux clusters (HP ServiceGuard, Veritas, SteelEye) displays this type of behaviour, and none of them clobber cluster filesystems.  So, an acceptable reaction to this type of failure is achievable.

Regards,

Chris Jankowski

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Gordan Bobic
Sent: Thursday, 11 November 2010 20:28
To: linux clustering
Subject: Re: [Linux-cluster] Starter Cluster / GFS

Digimer wrote:
> On 10-11-10 10:29 PM, Jankowski, Chris wrote:
>> Digimer,
>>
>> 1.
>> Digimer wrote:
>>>>> Both partitions will try to fence the other, but the slower will lose and get fenced before it can fence.
>> Well, this is certainly not my experience in dealing with modern rack mounted or blade servers where you use iLO (on HP) or DRAC (on Dell).
>>
>> What actually happens in two node clusters is that both servers issue the fence request to the iLO or DRAC. It gets processed and *both* servers get powered off.  Ouch!!  Your 100% HA cluster becomes 100% dead cluster.
> 
> That is somewhat frightening. My experience is limited to stock IPMI 
> and Node Assassin. I've not seen a situation where both die. I'd 
> strongly suggest that a bug be filed.

It's actually fairly predictable and quite common. If the nodes lose connectivity to each other but both are actually alive (e.g. a failure of the switch carrying cluster traffic), you will get this sort of shoot-out. The cause is that most out-of-band power-off mechanisms have an inherent lag of several seconds (i.e. it can be a few seconds between when you issue a power-off command and the machine actually powering off). During that race window, both machines may issue a remote power-off before they themselves shut down.

Gordan

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
