[Linux-cluster] Quorum disk over RAID software device

Rafael Micó Miranda rmicmirregs at gmail.com
Mon Dec 14 22:15:09 UTC 2009


Hi all,

I was wondering if there is a way to get a working CMAN cluster with a
quorum disk sitting on a software RAID device.

Explanation:

A) Environment
- 6 x different servers used as cluster nodes, each with dual FC HBAs
- 2 x different fabrics, each built from 3 FC SAN switches
- 2 x storage arrays, each with 23 x 270 GB data LUNs
- 1 x Qdisk: a 24th LUN located on one of the storage arrays

B) Objectives
- All 6 nodes must be able to mount and use any of the 2 x 23 data LUNs
in the final configuration. Already done.
- Use of a Qdisk for a last-man-standing configuration. Already done:
1 vote per node and 5 votes on the Qdisk device (the vote layout is
sketched just below).
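For reference, this is roughly the vote layout in my current
cluster.conf (the node names, the quorumd label and the interval/tko
values below are placeholders, not the real ones):

    <cman expected_votes="11"/>
    <clusternodes>
      <clusternode name="node1" nodeid="1" votes="1"/>
      <!-- nodes 2 to 5 ... -->
      <clusternode name="node6" nodeid="6" votes="1"/>
    </clusternodes>
    <quorumd label="rmqdisk" votes="5" interval="1" tko="10"/>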

C) Flaws 
- The Qdisk is located on ONE storage array. If that storage array
fails, its 5 votes are lost. The cluster has 11 votes in total, so
quorum needs 6 of them: with the Qdisk array down I am left with the 6
node votes, and the failure of a single additional node drops me to 5
and quorum is lost. In other words, with 5 nodes and one storage array
still operative I will lose quorum.

D) Possible Fixes
- Using 2 quorum disks: Not implemented yet
http://sources.redhat.com/cluster/wiki/MultiQuorumDisk

- Using an LVM mirror device as the Qdisk, with additional LUNs for the
mirror legs and log on both storage arrays: if the Qdisk is a clustered
logical volume, it won't be available during the CMAN start phase,
because CLVMD (and CMIRROR) are needed to access clustered logical
volumes and CLVMD won't be running while CMAN is not running yet.
Question: is it really necessary to use a clustered logical volume for
the Qdisk? Is there any problem in NOT using a clustered volume?
(See the first sketch after this list.)

- Using a software RAID (MDRAID) device as the Qdisk and creating an
additional LUN on the second storage array:
Each cluster node would assemble and use the MD device as the Qdisk.
Do you see any problem with this proposal? (The second sketch after
this list shows the commands I have in mind.)
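To make the non-clustered LVM idea concrete, this is the kind of thing
I have in mind. It is only a sketch: the multipath device names, the
VG/LV names and the Qdisk label are made up, and I use a core mirror
log here instead of the extra log LUNs just to keep it short:

    # On one node only: build a local (non-clustered) mirrored LV
    # across one small LUN from each storage array.
    pvcreate /dev/mapper/qdisk_lun_a /dev/mapper/qdisk_lun_b
    vgcreate --clustered n vg_qdisk \
             /dev/mapper/qdisk_lun_a /dev/mapper/qdisk_lun_b
    lvcreate -m 1 --mirrorlog core -L 100M -n lv_qdisk vg_qdisk
    # Initialise it as a quorum disk.
    mkqdisk -c /dev/vg_qdisk/lv_qdisk -l rmqdisk
    # Each of the other nodes would have to activate the VG locally
    # (vgchange -ay vg_qdisk) before qdiskd starts.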
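And this is the MDRAID variant I am proposing, again only a sketch with
the same made-up device names. MD is not cluster-aware, so whether it
is safe for all nodes to assemble the same RAID1 concurrently is
exactly my question:

    # On the first node: create a RAID1 over one LUN from each array.
    mdadm --create /dev/md0 --level=1 --raid-devices=2 --metadata=1.0 \
          /dev/mapper/qdisk_lun_a /dev/mapper/qdisk_lun_b
    mkqdisk -c /dev/md0 -l rmqdisk
    # On every other node: assemble the same array from the shared LUNs.
    mdadm --assemble /dev/md0 \
          /dev/mapper/qdisk_lun_a /dev/mapper/qdisk_lun_b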


E) Possible Flaws
- With LVM mirror: what would happen if one of the underlying disks of
the Qdisk failed on only some of the cluster nodes? Imagine a
LUN-masking problem on a storage array controller, or an admin making a
mistake, so that some nodes lose access to one of the disks.
And what would happen once that disk is fully on-line again?

- With MDRAID: the same questions (a sketch of how I would reproduce
and recover this by hand follows).
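In case it helps to frame the question, this is how I would expect to
simulate and later repair such a partial failure by hand on an affected
node in the MDRAID case (device names are the same made-up ones as
above):

    # Simulate losing one leg of the mirror on this node only.
    mdadm /dev/md0 --fail /dev/mapper/qdisk_lun_b
    mdadm /dev/md0 --remove /dev/mapper/qdisk_lun_b
    cat /proc/mdstat     # array keeps running degraded
    # Once the LUN is reachable again, put it back and let it resync.
    mdadm /dev/md0 --re-add /dev/mapper/qdisk_lun_b
    cat /proc/mdstat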


Of course, any idea or proposal is welcome. Thanks in advance. Cheers,

Rafael

-- 
Rafael Micó Miranda



