[Linux-cluster] Starter Cluster / GFS

Fri Nov 12 01:58:01 UTC 2010

Digimer,

>>>I am very curious to know how this scenario can happen. As I had previously understood it, this should simply not be possible. Obviously it is though... 

It actually is very simple. Three conditions are sufficient to guarantee that the two nodes kill each other simultaneously:

1. The fencing request is generated by the two nodes at the same time. This is fulfilled by the current design of the fencing.
2. Your fencing device needs to be a separate piece of equipment dedicated to the node to be fenced. Note that both iLO and DRAC fulfill this requirement.
3. The implementation of the fencing device needs to be transactional, i.e. it accepts an order to fence and then executes it after a certain delay. Both iLO and DRAC work transactionally, and the delay is sufficient.

What happens is simple. Think about it as transactions. Both nodes start transacting with the corresponding fencing devices at the same time. Each fencing device accepts the transaction. Only then, after a small delay, do they start executing it. At this point both fencing devices are committed to the execution and will do what they have been told.
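The sequence above can be sketched as a tiny event simulation. This is only a model under stated assumptions, not the behaviour of any real fence agent: the names `simulate` and `exec_delay` are hypothetical, the nodes are called "A" and "B", and each device is assumed to commit to a power-off the moment it accepts a request.

```python
import heapq


def simulate(request_times, exec_delay):
    """Return the set of nodes that end up powered off.

    request_times: node name -> time at which it asks for its peer to be fenced.
    exec_delay:    assumed gap between a device accepting and executing an order.
    """
    peer = {"A": "B", "B": "A"}
    # Events are (time, priority, kind, node); power-offs sort before requests
    # that fall on the same instant, so a dead node never issues a request.
    events = [(t, 1, "request", node) for node, t in request_times.items()]
    heapq.heapify(events)
    committed = set()  # nodes whose fencing device has accepted an order
    dead = set()
    while events:
        now, _, kind, node = heapq.heappop(events)
        if kind == "power_off":
            dead.add(node)
        elif node not in dead:
            victim = peer[node]
            if victim not in committed:
                # The device accepts the order and is now committed to it,
                # regardless of what happens to the requester afterwards.
                committed.add(victim)
                heapq.heappush(events, (now + exec_delay, 0, "power_off", victim))
    return dead


# Both nodes request at t=0 and each device executes half a second later.
both_down = simulate({"A": 0.0, "B": 0.0}, exec_delay=0.5)
print(sorted(both_down))
```

In this model both devices have already accepted their orders before either one executes, so both nodes go down.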

The set of conditions is sufficient in the mathematical sense.

In modern networked servers with built-in service processors, this set of conditions almost certainly holds for all of them.

The following are possible ways of resolving the problem for this set of sufficient conditions:

1. Invalidate condition 1 - introduce a different fixed delay in the fencing agent of each node, e.g. node A no delay, node B 2 seconds. This is a good solution, but it requires custom programming work. The current cluster design does not allow it as a configuration option.
2. Invalidate condition 2 - use a common physical fencing device that will accept only one request from one node. Essentially this serialises the transactions and allows at most one. This is not a clean way to do it, as such a device would be a SPOF.
3. Invalidate condition 3 - make the execution phase conditional on the state of the requestor, i.e. in the execution phase execute the request only if the requestor is still alive. This shrinks, but does not eliminate, the window in which the race condition leads to both nodes going down.
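To illustrate option 1, here is a minimal sketch of the outcome for two nodes with different pre-fence delays. The function name `fencing_outcome` and its parameters are hypothetical, and the model assumes that each node waits its own delay before asking for its peer to be fenced, that the device powers the victim off `device_latency` seconds after accepting, and that a powered-off node never issues its request.

```python
def fencing_outcome(delays, device_latency):
    """Return the set of nodes that get powered off in a two-node fence race.

    delays:         node name -> seconds that node waits before fencing its peer.
    device_latency: assumed seconds between a device accepting and executing.
    """
    (first, t_first), (second, t_second) = sorted(
        delays.items(), key=lambda item: item[1]
    )
    # The node with the shorter delay always gets its request accepted.
    # Its peer is powered off at t_first + device_latency.
    if t_second < t_first + device_latency:
        # The slower node was still alive when its own delay expired,
        # so it managed to commit the faster node's power-off as well.
        return {first, second}
    return {second}


# Node A: no delay, node B: 2 seconds, device acts after 0.5 seconds.
print(sorted(fencing_outcome({"A": 0.0, "B": 2.0}, device_latency=0.5)))
# Same stagger, but a device that needs 3 seconds to act.
print(sorted(fencing_outcome({"A": 0.0, "B": 2.0}, device_latency=3.0)))
```

Within this model the stagger only helps when it is longer than the time the fencing device needs to act. With a slower device both requests are still committed and both nodes go down.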

However, I believe that the real solution is to change the mindset of the cluster from "I am the omniscient and omnipotent master of the world and I will shoot anything I do not like" to protecting resources, i.e. protecting shared storage through SCSI reservations, which is what commercial Linux and UNIX clusters do. Alas, the STONITH concept is so ingrained in the minds of the developers of the Linux cluster that this change seems to be impossible to achieve.

--------

Please note that the STONITH concept has other fatal flaws in the modern networked world. Consider, step by step, what would happen to your highly available cluster if a node in the cluster gets completely separated from the network, losing its access, heartbeat and iLO/DRAC network connections. Again, the end result is that you have no access to your supposedly highly available application. From the functional point of view the whole cluster has failed. The core issue, again, is the inadequacy of the STONITH concept. And again, commercial UNIX and Linux clusters deal with this scenario correctly. Their clusters will continue.
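The step-by-step scenario can be written out as a small sketch from the point of view of the surviving node. The function name `survivor_steps` and the step wording are hypothetical. The model also assumes a STONITH-style policy in which recovery of shared resources waits until the fence of the lost peer has been confirmed.

```python
def survivor_steps(heartbeat_ok, fence_device_reachable):
    """List what the surviving node does when its peer may have failed."""
    if heartbeat_ok:
        return ["peer healthy, nothing to do"]
    steps = ["peer heartbeat lost", "request fence of peer"]
    if fence_device_reachable:
        steps += [
            "fence confirmed",
            "recover shared resources",
            "service continues",
        ]
    else:
        # The peer's iLO/DRAC is cut off together with the peer itself,
        # so the fence can never be confirmed.
        steps += [
            "fence failed, device unreachable",
            "recovery blocked until fence succeeds",
            "service unavailable",
        ]
    return steps


# The isolated node has lost its heartbeat link and its iLO/DRAC link.
for step in survivor_steps(heartbeat_ok=False, fence_device_reachable=False):
    print(step)
```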

Regards,

Chris Jankowski

In fact, to remove the race condition o

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Digimer
Sent: Friday, 12 November 2010 03:44
To: linux clustering
Subject: Re: [Linux-cluster] Starter Cluster / GFS

On 10-11-11 04:23 AM, Gordan Bobic wrote:
> Jankowski, Chris wrote:
>> Digimer,
>>
>> 1.
>> Digimer wrote:
>>>>> Both partitions will try to fence the other, but the slower will 
>>>>> lose and get fenced before it can fence.
>>
>> Well, this is certainly not my experience in dealing with modern rack 
>> mounted or blade servers where you use iLO (on HP) or DRAC (on Dell).
>>
>> What actually happens in two node clusters is that both servers issue 
>> the fence request to the iLO or DRAC. It gets processed and *both* 
>> servers get powered off.  Ouch!!  Your 100% HA cluster becomes 100% 
>> dead cluster.
> 
> Indeed, I've seen this, too, on a range of hardware. My quick and 
> dirty solution was to doctor the fencing agent to add a different 
> sleep() on each node, in order of survivor preference. There may be a 
> setting in cluster.conf that can be used to achieve the same effect, 
> can't remember off the top of my head.
> 
> Gordan

I've not seen such an option, though I make no claims to complete knowledge of the options available. I do know that there are per-device fence options (that is, IPMI has a set of options that differs from DRAC, etc). So perhaps there is an option there.

I am very curious to know how this scenario can happen. As I had previously understood it, this should simply not be possible. Obviously it is though... The only thing I can think of is where a fence device is external to the nodes and allows for multiple fence calls at the same time. I would expect that any fence device should terminate a node nearly instantly. If it doesn't or can't, then I would suggest that it not accept a second fence request until after the pending one completes.

--
Digimer
E-Mail: digimer at alteeve.com
AN!Whitepapers: http://alteeve.com
Node Assassin:  http://nodeassassin.org

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster