[Linux-cluster] GFS+DRBD+Quorum: Help wrap my brain around this

Colin Simpson Colin.Simpson at iongeo.com
Fri Nov 26 15:04:40 UTC 2010

On Thu, 2010-11-25 at 16:39 +0000, A. Gideon wrote:
> On Tue, 23 Nov 2010 12:28:41 +0000, Colin Simpson wrote:
> > Since the third node is ignorant about the status of DRBD, I don't
> > really see what help it gives for it to decide on quorum.
> I've just read through the "Best Practice with DRBD RHCS and GFS2"
> thread on the drbd-users list.  And I'm still missing what seems to
> me to be a fundamental issue.
> First: It seems like you no longer (since 8.3.8) need to have GFS
> startup await the DRBD sync operation.  That's good, but is this
> because DRBD does the proper thing with I/O requests during a sync?
> That's what I think is so, but then I don't understand why you'd an
> issue with 8.2.  Or am I missing something?

The main issue I had was on 8.2: when the system booted out of sync, it
would oops in a GFS kernel module, so I couldn't get any further to see
what the filesystem was doing during the sync.

It is now fine on 8.3. From my testing, the out-of-sync node always
seems to see the up-to-date (i.e. the latest) data while it is syncing.
So it seems to do the right thing, as far as I can tell.

> But the real issue for me is quorum/consensus.  I noted:
>   startup {
>     wfc-timeout 0 ;       # Wait forever for initial connection
>     degr-wfc-timeout 60;  # Wait only 60 seconds if this node
>                           # was a degraded cluster
>   }
> and
>         net
>         {
>                 allow-two-primaries;
>                 after-sb-0pri discard-zero-changes;
>                 after-sb-1pri discard-secondary;
>                 after-sb-2pri disconnect;
>         }
> but when I break the DRBD connection between two primary nodes,
> "disconnected" apparently means that the nodes both continue as if
> they've UpToDate disks.  But this lets the data go out of sync.  Isn't
> this a Bad Thing?

Yup, that could be an issue. However, you should never get into a
situation where the connection between the two nodes breaks. This needs
to be heavily mitigated: I'm planning to bond two interfaces on two
different cards so this doesn't happen (or, I should say, is highly
unlikely).
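
To spot the state described in the question (a node that is still
Primary but has lost its peer), something like the following Python
sketch could be run against /proc/drbd. It assumes the DRBD 8.3 status
line layout ("cs:... ro:.../... ds:.../..."; 8.2 used "st:" for roles).
The function names are made up for illustration and are not part of
DRBD.

```python
import re

# One resource line of /proc/drbd, e.g.
#  0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r----
STATUS_RE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)"
    r"\s+(?:ro|st):(?P<local_role>\w+)/(?P<peer_role>\w+)"
    r"\s+ds:(?P<local_disk>\w+)/(?P<peer_disk>\w+)"
)

# Connection states in which no replication link to the peer exists.
DISCONNECTED = {
    "StandAlone", "Disconnecting", "Unconnected", "Timeout", "BrokenPipe",
    "NetworkFailure", "ProtocolError", "TearDown", "WFConnection",
}


def parse_status(text):
    """Return one dict per resource line found in /proc/drbd-style text."""
    matches = (STATUS_RE.match(line) for line in text.splitlines())
    return [m.groupdict() for m in matches if m]


def at_risk(res):
    """True if this node is Primary while it has no link to its peer."""
    return res["local_role"] == "Primary" and res["cs"] in DISCONNECTED
```

On a real node the input would be open("/proc/drbd").read(). Any
resource for which at_risk() is true on both nodes at once is the
"both think they are UpToDate" case.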

I guess there are two scenarios:

1/ The DRBD network link gets broken. This shouldn't be allowed to
happen, as above.

2/ The node goes down totally, so DRBD loses comms. But as all the comms
are down, the other node will notice and Cluster Suite will fence the
bad node. Remember that GFS will suspend all operations (on all nodes)
until the bad node is fenced.

I plan to further help the first situation by having my cluster comms
share the same bond as DRBD. So if the comms fail, Cluster Suite should
notice, and the DRBD data on both nodes shouldn't change because GFS
will have suspended operations. Assuming the fence devices are
reachable, one of the nodes should then fence the other (it might be a
bit of a shoot-out situation), and GFS should resume on the remaining
node.
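
The reasoning above can be written down as a small decision table. This
is only a toy Python model of how I expect the pieces to interact, not
anything DRBD or Cluster Suite provides. The function name, arguments
and outcome labels are all invented for the illustration.

```python
def expected_outcome(node_down, link_down, comms_share_link, fence_reachable):
    """Return a short label for what a two-node cluster is expected to do.

    All four arguments are booleans describing a hypothetical failure:
      node_down        - one node has died completely (scenario 2)
      link_down        - the DRBD replication link is cut (scenario 1)
      comms_share_link - cluster comms run over the same bond as DRBD
      fence_reachable  - the fence devices can still be reached
    """
    if not (node_down or link_down):
        return "normal"

    # Cluster Suite only reacts if it can see the failure itself.
    cluster_notices = node_down or (link_down and comms_share_link)
    if not cluster_notices:
        # DRBD is cut but cluster comms are fine: both primaries keep writing.
        return "diverge"

    # From here on GFS has suspended I/O on all nodes until fencing completes.
    if not fence_reachable:
        return "suspended"
    return "survivor-resumes"
```

The only "diverge" row is a broken DRBD link with cluster comms on a
separate path, which is the case that sharing the bond is meant to
remove.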

That's how I currently see it working anyway. 

On the startup issue: I think the order should be cman, drbd, clvmd and
then rgmanager. That is basically the default order for the cluster
services, with drbd starting after cman but before clvmd. No filesystem
mounting will take place on a node until clvmd is up, and that comes
after drbd has started. The drbd startup script is the part that does
the waiting (either forever or for the defined interval), so the boot
will stop at that point until the wait completes.
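
A quick way to check that order on a SysV-init box is to look at the
S-links in the runlevel directory, which are run in name order. The
Python sketch below does that. The helper names are hypothetical, and
the priority numbers in the usage note are made-up examples rather than
the values any particular package ships.

```python
import re

# Order suggested above: cman, then drbd, then clvmd, then rgmanager.
REQUIRED_ORDER = ["cman", "drbd", "clvmd", "rgmanager"]

# Start links look like S<two digits><service>, e.g. S21cman.
LINK_RE = re.compile(r"^S(\d\d)(.+)$")


def start_positions(link_names):
    """Map service name -> position in the boot sequence (0 = first)."""
    # init runs the S-links in lexical order of their names.
    starts = sorted(n for n in link_names if LINK_RE.match(n))
    return {LINK_RE.match(n).group(2): i for i, n in enumerate(starts)}


def order_ok(link_names, required=REQUIRED_ORDER):
    """True if every required service is present and starts in that order."""
    pos = start_positions(link_names)
    if any(service not in pos for service in required):
        return False
    seq = [pos[service] for service in required]
    return seq == sorted(seq)
```

For example, order_ok(["S21cman", "S23drbd", "S24clvmd", "S99rgmanager"])
passes, while a layout where the drbd link sorts after the clvmd link
fails. On a real node the list would come from something like
os.listdir("/etc/rc3.d"), assuming runlevel 3.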

This is all my understanding, so if anyone sees flaws, I'd like to know.

