[Linux-cluster] Clusterbehaviour if one node is not reachable & fenceable any longer?

Digimer lists at alteeve.ca
Thu Jan 30 17:08:15 UTC 2014

On 30/01/14 07:00 AM, Nicolas Kukolja wrote:
> Digimer <lists <at> alteeve.ca> writes:
>> And this is the fundamental problem of stretch/geo-clusters.
>> I am loath to recommend this, because it's soooo easy to screw it up in
>> the heat of the moment, so please only ever do this after you are 100%
>> sure the other node is dead;
>> If you log into the 2 remaining nodes that are blocked (because of the
>> inability to fence), you can type 'fence_ack_manual'. That will tell the
>> cluster that you have manually confirmed the lost node is powered off.
>> It's tempting to make assumptions when you've got users and managers
>> yelling at you to get services back up. So much so that Red Hat dropped
>> 'fence_manual' entirely in RHEL 6 because it was too easy to blow things
>> up. I cannot stress enough how critical it is that you confirm
>> that the remote location is truly off before doing this. If it's still
>> on and you clear the fence action, then really bad things could happen
>> when the link returns.
>> digimer
> Thanks a lot for your support and explanations... So I will try to explain
> it to my stakeholders...
> One little question is still in my mind:
> If in a three-node scenario one node is not reachable and fenceable, but two
> other nodes are still alive and able to communicate with each other, where is
> the risk of a "split-brain" situation?

Depending on what happened at the far end, the node could be in a state 
where it could try to provide or access HA services before realizing 
it's lost quorum. Quorum only works when the node is behaving in an 
expected manner. If the node isn't responding, you have to assume it has 
entered an undefined state, in which case quorum may or may not save you.

A classic example, though I suspect it doesn't cleanly apply here, would 
be a node that froze mid-write to shared storage. It's not dead, it's 
just hung. The other nodes decide it's dead, recover the shared FS, and 
go about their business. At some point later, the hung node recovers, 
has no idea that time has passed, so it has no reason to think its locks 
are invalid or to check quorum, and it finishes the write it was in the 
middle of. You now have a corrupted FS.
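That sequence can be sketched as a toy simulation. Everything here is 
illustrative (the storage class and key names are made up); the one 
assumption that matters is that the shared storage, like a plain shared 
block device, accepts any write without checking whether the writer is 
still a live cluster member:

```python
# Toy model of shared storage with no fencing: any node that believes it
# holds the lock can write, even if the cluster declared it dead long ago.
class SharedStorage:
    def __init__(self):
        self.data = {}

    def write(self, key, value):
        # No check that the writer is still a valid cluster member.
        self.data[key] = value

storage = SharedStorage()

# Node A freezes partway through updating the journal.
storage.write("journal", "node-A: partial write")

# Nodes B and C decide A is dead and recover the filesystem.
storage.write("journal", "node-B: journal replayed, FS clean")

# Node A un-hangs, still trusting its old locks, and finishes the
# interrupted write, silently clobbering the recovery.
storage.write("journal", "node-A: rest of the partial write")

print(storage.data["journal"])
```

The last write wins, and the recovered state from node B is gone, which 
is exactly the corruption described above.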

Again, this probably doesn't map to your setup, but there are other 
scenarios where things can get equally messed up in the window between 
a node recovering and it realizing it has lost quorum. The only safe 
protection is fencing, as it puts the node into a known clean state 
(off or a fresh boot).
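As an aside on the 'fence_ack_manual' step quoted above: if you do go 
that route, it's worth forcing yourself through an explicit confirmation 
before acknowledging. This wrapper is only a sketch of that idea; the 
function, node name, and confirmation string are all mine, and the real 
'fence_ack_manual' invocation (commented out here) should be checked 
against your distribution's man page:

```shell
# Refuse to acknowledge a manual fence unless the operator has typed an
# explicit confirmation that the lost node is physically powered off.
ack_fence() {
    node="$1"
    confirm="$2"
    if [ "$confirm" != "yes-the-node-is-powered-off" ]; then
        echo "refusing: confirm the node is physically off first"
        return 1
    fi
    # On a real cluster this is where you would run, for example:
    #   fence_ack_manual "$node"
    echo "acknowledged manual fence of $node"
}
```

The point isn't the script itself but the habit: make the dangerous step 
impossible to do absent-mindedly while users and managers are yelling.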

> The "lost" third node will, if it is still running but not accessible from
> the others, disable the service because it has no contact to any other
> nodes, right?
> So if two nodes are connected, isn't it guaranteed that the third node is
> no longer providing the service?

Nope, the only guarantee is to put it into a known state.

Quorum == useful when nodes are in a defined state.
Fencing == useful when a node is in an undefined state.
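The quorum half of that distinction is just majority arithmetic. A 
minimal sketch (the function name is mine, not any cluster API):

```python
def has_quorum(votes_present: int, total_votes: int) -> bool:
    """A partition is quorate when it holds a strict majority of all votes."""
    return votes_present > total_votes // 2

# Three-node cluster, one node cut off:
print(has_quorum(2, 3))  # True: the two connected nodes keep quorum
print(has_quorum(1, 3))  # False: the isolated node should stand down
```

The catch, as above, is that the isolated node only stands down if it is 
healthy enough to evaluate that check at all. A hung node evaluates 
nothing, which is why quorum alone can't give you the guarantee you're 
asking about.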


Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?
