[Linux-cluster] RAIDing a CLVM?

Benjamin Marzinski bmarzins at redhat.com
Thu Mar 23 03:34:27 UTC 2006


On Wed, Mar 22, 2006 at 01:18:56PM -0600, Benjamin Marzinski wrote:
> On Wed, Mar 22, 2006 at 02:10:30PM +0300, Denis Medvedev wrote:
> > 
> > A better approach is to export not an GNBD but an iSCSI device from DRBD.
> > 
> 
> I would definitely go with DRBD for this setup. If I understand the setup
> correctly, there is a data corruption possibility.
> 
> If you have two machines doing raid1 over a local device and a gnbd device,
> you have the problem where if machine A dies after it has written to its local
> disk but not to the disk on machine B, the mirror is out of sync. GNBD doesn't
> do anything to help with that, and md on machine B doesn't know anything about
> the state of machine A, so it can't correct the problem. So you are left with
> an out-of-sync mirror, which is BAD. DRBD was made for exactly this setup,
> and will (I believe) automagically handle this correctly.

This is ignoring the obvious issue that after machine A is dead, B will
presumably keep writing to its device, so it will obviously be out of sync.
And you probably knew that. It's been a long week.  But still, this sounds
exactly like what DRBD was designed for.
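
For reference, the DRBD side of this is just a two-node mirrored resource,
something like the sketch below (hostnames, backing disks and IP addresses
are placeholders, and the exact drbd.conf syntax depends on your DRBD
version):

    # /etc/drbd.conf (excerpt) - one resource mirrored between two nodes
    resource r0 {
        protocol C;                  # synchronous replication
        on nodeA {
            device    /dev/drbd0;
            disk      /dev/sda3;     # local backing disk on nodeA
            address   192.168.0.1:7788;
            meta-disk internal;
        }
        on nodeB {
            device    /dev/drbd0;
            disk      /dev/sda3;     # local backing disk on nodeB
            address   192.168.0.2:7788;
            meta-disk internal;
        }
    }

When the dead node comes back, DRBD resyncs it from the surviving peer,
which is exactly the part md-over-GNBD can't do for you.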

-Ben
 
> -Ben
>  
> > James Firth wrote:
> > 
> > 
> > >Patton, Matthew F, CTR, OSD-PA&E wrote:
> > >
> > >>I can't think of a way to combine (C)LVM, GFS, GNBD, and MD (software 
> > >>RAID) and make it work unless just one of the nodes becomes the MD 
> > >>master and then just exports it via NFS. Can it be done? Do 
> > >>commercial options exist to pull off this trick?
> > >
> > >
> > >Hi,
> > >
> > >We're working on the same problem. We have tried two approaches, both 
> > >with their own fairly serious drawbacks.
> > >
> > >Our goal was a 2-node all-in-one HA mega server, providing all office 
> > >services from one cluster, and with no single point of failure.
> > >
> > >The first uses a RAID master for each pair.  Each member of the pair 
> > >exports a disk using GNBD.  The pair negotiate a master using CMAN, 
> > >and that master assembles a RAID device from one GNBD import plus 
> > >one local disk, and then exports it using NFS, or, where GFS is used, 
> > >exports the assembled RAID device via a third GNBD export.
> > >
> > >Our trick here was that each node exported its contributory disk, 
> > >using GNBD, by default, so long as at least one other node was active 
> > >(quorum > 1), knowing only one master would ever be active.  This 
> > >significantly reduced complexity.
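> > >
> > >Concretely, the master's side of this looks roughly like the 
> > >following (device and export names are placeholders, and the flags 
> > >are from memory, so check the gnbd_export/gnbd_import/mdadm man pages 
> > >on your release):
> > >
> > >    # on each node: export the local contributory disk over GNBD
> > >    gnbd_export -d /dev/sdb1 -e diskA     # run on node A
> > >    gnbd_export -d /dev/sdb1 -e diskB     # run on node B
> > >
> > >    # on whichever node wins the master negotiation: import the
> > >    # peer's disk and assemble the mirror from one local and one
> > >    # remote leg
> > >    gnbd_import -i nodeB
> > >    mdadm --assemble /dev/md0 /dev/sdb1 /dev/gnbd/diskB
> > >    # (first-time setup would instead be:
> > >    #  mdadm --create /dev/md0 --level=1 --raid-devices=2 ...)
> > >
> > >    # then export the assembled device: NFS, or a third GNBD export
> > >    # if GFS is sitting on top of it
> > >    gnbd_export -d /dev/md0 -e shared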
> > >
> > >Problems are:
> > > - GNBD instabilities cause frequent lockups and crashes, especially 
> > >(we suspect) by keeping DLM busy.
> > > - The NFS export scheme also causes locks and hangs for NFS clients 
> > >on failover *IF* a member of the pair then also imports the export as 
> > >an NFS client, as needed in some of our mega-server ideas.
> > > - NFS export is not too useful when file locking is important, e.g. 
> > >subversion, procmail etc (yes, procmail, if your mail server is also 
> > >your Samba homes server).  You have to tell procmail to use 
> > >alternative mailbox locking or mailboxes get corrupted (see the 
> > >recipe sketch after this list).
> > > - GFS on the assembled device with the GNBD export scheme works 
> > >best, but still causes locks and hangs.  Note also that a node must 
> > >NOT import its own exported GNBD volume, so there is no symmetry 
> > >between the pair, and it's quite difficult to manage.
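> > >
> > >(The procmail workaround is dot-lockfile locking in the delivery 
> > >recipe, roughly like this - the mailbox path is just an example, 
> > >adjust to your spool layout:)
> > >
> > >    # ~/.procmailrc - force lockfile (dot-lock) locking for delivery,
> > >    # since kernel locks are not reliable over NFS
> > >    MAILDIR=$HOME/mail
> > >    DEFAULT=$MAILDIR/inbox
> > >
> > >    # the trailing ':' on ':0:' makes procmail take a local lockfile
> > >    # ($DEFAULT.lock) around the delivery
> > >    :0:
> > >    $DEFAULT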
> > >
> > >
> > >
> > >Our second approach, using DRBD, is something we've just embarked on, 
> > >and so far it is proving more successful.  DRBD is used to create a 
> > >mirrored pair of volumes, a bit like GNBD+MD above.
> > >
> > >The result is a block device accessible from both machines, but the 
> > >problem is that only one member of the pair is writable (master), and 
> > >the other is a read-only mount.
> > >
> > >If the master server dies, the remaining DRBD node becomes the master 
> > >and becomes writable.  When the dead node recovers, it rejoins as a 
> > >read-only slave.
> > >
> > >The problem is the read-only aspect: you still need an exporting 
> > >mechanism for the assembled DRBD volume, running on the DRBD 
> > >master.  We plan to do this via GNBD export (with a GFS filesystem 
> > >on the device).
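> > >
> > >(Roughly, with placeholder cluster/filesystem names and assuming the 
> > >GFS 6.1-era tools, the export side would look something like:)
> > >
> > >    # one-off: put GFS on the DRBD device (two journals for two nodes)
> > >    gfs_mkfs -p lock_dlm -t mycluster:shared -j 2 /dev/drbd0
> > >
> > >    # on whichever node is currently the DRBD master, export it
> > >    gnbd_export -d /dev/drbd0 -e shared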
> > >
> > >That's where the complexity comes in - the DRBD master negotiation 
> > >appears to be totally independent of the cluster control suite, so 
> > >we're having to use custom scripts to start the exporting daemon on 
> > >the DRBD master.
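> > >
> > >(Something along these lines - resource and export names are 
> > >placeholders, and this assumes "drbdadm state" reports Primary or 
> > >Secondary for the resource on the DRBD version in use:)
> > >
> > >    #!/bin/sh
> > >    # crude helper: only the current DRBD primary starts the GNBD export
> > >    RES=r0
> > >    if drbdadm state $RES | grep -q '^Primary'; then
> > >        gnbd_export -d /dev/drbd0 -e shared
> > >    fi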
> > >
> > >
> > >Conclusions
> > >---
> > >
> > >From all we've learned to date, it still seems a dedicated file server 
> > >or SAN approach is necessary to maintain availability.
> > >
> > >Either of the above schemes would work fairly well if we were just 
> > >building a HA storage component, because most of the complexities 
> > >we've encountered come about when the shared storage device is used by 
> > >services on the same cluster nodes.
> > >
> > >Most, if not all, of what we've done so far is not suitable for a 
> > >production environment, as it just increases the coupling between 
> > >nodes, and therefore increases the chance of a cascade failure of the 
> > >cluster.  In all seriousness, I believe a single machine with a RAID-1 
> > >pair has a higher MTBF than any of our experiments.
> > >
> > >Many parts of the CCS/GFS suite released so far have serious issues 
> > >when used in non-standard configurations.  For example, the exception 
> > >handling we've encountered usually amounts to "while (1) { retry(); 
> > >sleep(1); }"
> > >
> > >Last year I read about plans for GFS mirroring from RedHat, but I 
> > >haven't found much else since.  If anyone knows more, I'd love to hear it.
> > >
> > >It also appears that the guys behind DRBD want to further develop 
> > >their mirroring so that both volumes can be writable, in which case 
> > >you can just stick GFS on the assembled device, and run whichever 
> > >exporting method you like as a normal cluster service.
> > >
> > >
> > >
> > >James
> > >
> > >www.daltonfirth.co.uk
> > >