[Linux-cluster] RAIDing a CLVM?

James Firth forums at daltonfirth.co.uk
Tue Mar 21 22:41:00 UTC 2006


Patton, Matthew F, CTR, OSD-PA&E wrote:
> I can't think of a way to combine (C)LVM, GFS, GNBD, and MD (software 
> RAID) and make it work unless just one of the nodes becomes the MD 
> master and then just exports it via NFS. Can it be done? Do commercial 
> options exist to pull off this trick?

Hi,

We're working on the same problem. We have tried two approaches, both 
with their own fairly serious drawbacks.

Our goal was a two-node, all-in-one HA mega-server providing all office 
services from one cluster, with no single point of failure.

The first approach uses a RAID master for each pair.  Each member of the 
pair exports a disk using GNBD.  The pair negotiate a master using CMAN, 
and that master assembles a software RAID device from one GNBD import 
plus one local disk, then exports it via NFS or, where GFS is used, 
exports the assembled RAID device via a third GNBD export.

Our trick here was that each node exported its contributory disk via 
GNBD by default, so long as at least one other node was active 
(quorum > 1), knowing only one master would ever be active.  This 
significantly reduced complexity.
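
For the curious, here's a very rough sketch (in Python, not our actual 
scripts) of what the elected master ends up doing.  The host, device and 
export names ("node2", /dev/sdb1, "pairmirror") are made up, and the 
gnbd_import/gnbd_export/mdadm options are from memory, so treat it as 
illustrative only:

#!/usr/bin/env python
# Illustrative only: hypothetical names, command options from memory.
import subprocess

def run(cmd):
    print(" ".join(cmd))
    subprocess.check_call(cmd)

def become_raid_master(peer="node2", peer_dev="/dev/gnbd/disk2",
                       local_disk="/dev/sdb1", md_dev="/dev/md0",
                       use_gfs=True):
    # 1. Import the disk the peer contributes over GNBD.
    run(["gnbd_import", "-i", peer])
    # 2. Assemble the software RAID mirror from the import plus the local disk.
    run(["mdadm", "--assemble", md_dev, peer_dev, local_disk])
    if use_gfs:
        # 3a. GFS case: re-export the assembled mirror as a third GNBD export.
        run(["gnbd_export", "-d", md_dev, "-e", "pairmirror"])
    else:
        # 3b. NFS case: mount the mirror locally and export it.
        run(["mount", md_dev, "/export"])
        run(["exportfs", "-o", "rw", "*:/export"])

if __name__ == "__main__":
    become_raid_master()

The slave's only job is the GNBD export of its contributory disk; the 
steps above only ever run on whichever node CMAN elects.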

Problems are:
  - GNBD instabilities cause frequent locks and crashes, especially (we 
suspect) by keeping the DLM busy.
  - The NFS export scheme also causes locks and hangs for NFS clients on 
failover *IF* a member of the pair is itself also an NFS client of the 
export, as needed in some of our mega-server ideas.
  - NFS export is not much use when file locking is important, e.g. 
Subversion, procmail etc. (yes, procmail, if your mail server is also 
your Samba homes server).  You have to tell procmail to use alternative 
mailbox locking, or mailboxes get corrupted.
  - GFS on the assembled device with the GNBD export scheme works best, 
but still causes locks and hangs.  Note also that an exporting node must 
NOT import its own exported GNBD volume, so there is no symmetry between 
the pair, and it's quite difficult to manage (see the sketch after this 
list).
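
To illustrate that asymmetry, here is a trivial sketch (hostnames and 
export names invented) of the rule each node has to follow when deciding 
which GNBD exports it may import:

import socket

# Who exports which GNBD device (invented names).
EXPORTS = {"node1": "disk1", "node2": "disk2"}

def exports_to_import(local=None):
    """Return the GNBD exports this node should import: everyone's but its own."""
    local = local or socket.gethostname()
    return [name for host, name in EXPORTS.items() if host != local]

print(exports_to_import("node1"))   # ['disk2'] -- node1 must never import 'disk1'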



Our second approach, which we've only just embarked on, is so far 
proving more successful: it uses DRBD.  DRBD is used to create a 
mirrored pair of volumes, a bit like GNBD+MD above.

The result is a block device accessible from both machines, but the 
problem is that only one member of the pair (the master) is writable; 
the other is a read-only mount.

If the master dies, the remaining DRBD node becomes the master and its 
copy becomes writable.  When the dead node recovers, it rejoins as a 
read-only slave.

The problem is the read-only aspect: you still need an exporting 
mechanism for the assembled DRBD volume, running on the DRBD master.  We 
plan to do this via GNBD export (with a GFS filesystem on the volume).

That's where the complexity comes in: DRBD's master negotiation appears 
to be totally independent of the cluster control suite, so we're having 
to use custom scripts to start the exporting daemon on the DRBD master.
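
Roughly, the customization looks like the sketch below: poll the DRBD 
state and make sure the GNBD export of the DRBD device only runs on 
whichever node is currently Primary.  The resource, device and export 
names are invented, and the drbdadm/gnbd_export invocations are from 
memory, so again treat it as a sketch rather than working code:

#!/usr/bin/env python
# Sketch only: invented names, command options from memory.
import subprocess
import time

RESOURCE = "r0"
DRBD_DEV = "/dev/drbd0"
EXPORT = "drbdmirror"

def drbd_role(resource):
    # "drbdadm state r0" prints e.g. "Primary/Secondary"; the part
    # before the slash is this node's role.
    out = subprocess.check_output(["drbdadm", "state", resource])
    return out.decode().strip().split("/")[0]

def ensure_export(is_primary):
    if is_primary:
        # Primary: make sure the GNBD export of the DRBD device is up
        # (subprocess.call ignores an "already exported" error).
        subprocess.call(["gnbd_export", "-d", DRBD_DEV, "-e", EXPORT])
    else:
        # Secondary: tear the export down so clients can fail over.
        subprocess.call(["gnbd_export", "-r", EXPORT])

if __name__ == "__main__":
    while True:
        ensure_export(drbd_role(RESOURCE) == "Primary")
        time.sleep(5)

Ideally this would be driven by the cluster manager rather than polled, 
but until DRBD state changes are visible to CMAN this is the sort of 
glue we're stuck with.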


Conclusions
---

From all we've learned to date, it still seems a dedicated file server 
or SAN approach is necessary to maintain availability.

Either of the above schemes would work fairly well if we were just 
building an HA storage component, because most of the complexities we've 
encountered arise when the shared storage device is used by services on 
the same cluster nodes.

Most, if not all, of what we've done so far is unsuitable for a 
production environment, as it increases the coupling between nodes and 
therefore the chance of a cascade failure of the cluster.  In all 
seriousness, I believe a single machine with a RAID-1 pair has a higher 
MTBF than any of our experiments.

Many parts of the CCS/GFS suite released so far have serious issues when 
used in non-standard configurations.  For example, the exception handling 
we've encountered usually defaults to "while (1) { retry(); sleep(1); }".

Last year I read about plans for GFS mirroring from Red Hat, but I 
haven't found much else since.  If anyone knows more, I'd love to hear 
about it.

It also appears that the guys behind DRBD want to further develop their 
mirroring so that both volumes can be writable, in which case you can 
just stick GFS on the assembled device, and run whichever exporting 
method you like as a normal cluster service.



James

www.daltonfirth.co.uk