[Linux-cluster] Starter Cluster / GFS

Wed Nov 10 16:41:27 UTC 2010

On 10-11-10 11:09 AM, Gordan Bobic wrote:
> Digimer wrote:
>> On 10-11-10 07:17 AM, Gordan Bobic wrote:
>>>>> If you want the FS mounted on all nodes at the same time then all
>>>>> those nodes must be a part of the cluster, and they have to be
>>>>> quorate (majority of nodes have to be up). You don't need a quorum
>>>>> block device, but it can be useful when you have only 2 nodes.
>>>> At term, I will have 7 to 10 nodes, but 2 at first for initial setup
>>>> and testing. Ok, so if I have a 3-node cluster for example, I need at
>>>> least 2 nodes for the cluster, and thus the gfs, to be up ? I cannot
>>>> have a running gfs with only one node ?
>>> In a 2-node cluster, you can have running GFS with just one node up. But
>>> in that case it is advisable to have a quorum block device on the SAN.
>>> With a 3 node cluster, you cannot have quorum with just 1 node, and thus
>>> you cannot have GFS running. It will block until quorum is
>>> re-established.
>>
>> With a quorum disk, you can in fact have one node left and still have
>> quorum. This is because the quorum drive should have (node-1) votes,
>> thus always giving the last node 50%+1 even with all other nodes being
>> dead.
> 
> I've never tried testing that use-case extensively, but I suspect that
> it is only safe to do with SAN-side fencing. Otherwise two nodes could
> lose contact with each other and still both have access to the SAN and
> thus both be individually quorate.
> 
> Gordan
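
To make the vote arithmetic quoted above concrete, here is a minimal
sketch in Python. It assumes one vote per node and a simple majority
rule, and it does not model the special two-node mode. The function and
parameter names are made up for illustration and are not part of any
cluster tool.

```python
def votes_needed(total_votes: int) -> int:
    """Simple majority: more than half of all expected votes."""
    return total_votes // 2 + 1


def is_quorate(nodes_up: int, total_nodes: int, qdisk_votes: int = 0,
               qdisk_visible: bool = False) -> bool:
    """Assumes one vote per node; the quorum disk adds its votes only
    to a partition that can still see it."""
    expected = total_nodes + qdisk_votes
    held = nodes_up + (qdisk_votes if qdisk_visible else 0)
    return held >= votes_needed(expected)


# Three nodes, no quorum disk: a lone survivor holds 1 of 3 votes.
print(is_quorate(nodes_up=1, total_nodes=3))
# Three nodes plus a quorum disk worth (nodes - 1) = 2 votes:
# the survivor holds 1 + 2 = 3 of 5 votes.
print(is_quorate(nodes_up=1, total_nodes=3, qdisk_votes=2,
                 qdisk_visible=True))
```

Note that this arithmetic alone would let two isolated nodes each count
the quorum disk's votes, which is the concern raised above and one more
reason fencing is not optional.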

Clustered storage *requires* fencing. Running without fencing is like
driving tired; it's just a matter of time before something bad happens.
That said, I should have been clearer in specifying the requirement for
fencing.

Fencing doesn't need to happen at the SAN side, though that works fine
as well.

The way it works is:

In normal operation, all nodes communicate via corosync. Corosync in
turn manages the distributed locking and ensures that locks are ordered
across all nodes (virtual synchrony).

As soon as communication with one or more nodes fails, locks are no
longer issued and all I/O is blocked until either:
a) The silent node finally responds
or
b) A timeout is reached and corosync issues a fence against the
incommunicado node(s).

Once a fence is issued, nothing proceeds until the fence agent returns
a successful fence message to the fence daemon.

In the case of a split brain (the nodes partition and are up but not
talking to each other), each partition will issue a fence against the
node(s) in the other. This is now a race, often described as an
old-west style duel. The slower partition loses and gets fenced before
its own fence can complete.
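
As a toy illustration of that duel, the sketch below reduces each
partition to a single number: how long its fence call takes to
complete. The `Partition` class and the `fence_delay` field are
hypothetical, and a dead-even tie is deliberately not modelled.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    fence_delay: float  # hypothetical: seconds until its fence call completes


def fence_duel(a: Partition, b: Partition) -> tuple[str, str]:
    """Return (survivor, fenced). The faster partition wins the race."""
    if a.fence_delay == b.fence_delay:
        # A real tie is not modelled in this sketch.
        raise ValueError("tie: outcome not modelled")
    winner, loser = (a, b) if a.fence_delay < b.fence_delay else (b, a)
    return winner.name, loser.name


survivor, fenced = fence_duel(Partition("node1", 0.8), Partition("node2", 1.3))
print(f"{survivor} survives, {fenced} is fenced")
```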

With a successful fence, the surviving partition (which could be just
one node), will reconfigure and then begin restoring the clustered file
system (GFS2 in this case). Once recovery is complete, I/O unblocks and
continues.
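
The whole sequence, from lost communication to unblocked I/O, can be
sketched as a small function that returns the ordered list of steps.
This is only a model of the ordering described above, not how the
cluster software is implemented. The names are invented for the
example, and retrying after a failed fence is my assumption.

```python
def handle_comm_loss(responds_after, timeout, fence_attempts):
    """Return the ordered steps taken after a node goes silent.

    responds_after: seconds until the node answers again, or None
    timeout:        seconds to wait before fencing
    fence_attempts: booleans, the result of each fence agent call
    """
    steps = ["block I/O, stop issuing locks"]
    if responds_after is not None and responds_after < timeout:
        steps.append("node responded, unblock I/O")
        return steps
    steps.append("timeout reached, issue fence")
    for succeeded in fence_attempts:
        if succeeded:
            steps.append("fence confirmed")
            break
        # Assumption: a failed fence is retried while I/O stays blocked.
        steps.append("fence failed, stay blocked and retry")
    else:
        # No attempt succeeded, so nothing is allowed to proceed.
        steps.append("no successful fence, I/O remains blocked")
        return steps
    steps += ["reconfigure membership", "recover GFS2", "unblock I/O"]
    return steps


for step in handle_comm_loss(None, timeout=10, fence_attempts=[False, True]):
    print(step)
```

The point the sketch tries to capture is that recovery of the file
system is only ever reached through a confirmed fence.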

With SAN-side fencing, a fence takes the form of a logical
disconnection from the storage network. This has no inherent mechanism
for recovery, so the sysadmin will have to manually recover the node(s).
For this reason, I do not prefer it.

With power fencing, which is by far the most common method and can be
implemented via IPMI, addressable PDUs, etc., the fenced node is
rebooted. The benefit of this method is that the node may well reboot
"healthy" and then be able to rejoin the cluster automatically. Of
course, if you prefer, you can have nodes powered off and left off.

-- 
Digimer
E-Mail: digimer at alteeve.com
AN!Whitepapers: http://alteeve.com
Node Assassin:  http://nodeassassin.org