[Linux-cluster] GFS+DRBD+Quorum: Help wrap my brain around this

A. Gideon ag8817282 at gideon.org
Mon Nov 29 21:40:42 UTC 2010


On Fri, 26 Nov 2010 15:04:40 +0000, Colin Simpson wrote:

>> but when I break the DRBD connection between two primary nodes,
>> "disconnected" apparently means that both nodes continue as if
>> they have UpToDate disks.  But this lets the data go out of sync.
>> Isn't this a Bad Thing?
> 
> Yup, that could be an issue; however, you should never be in a
> situation where you break the connection between the two nodes. This
> needs to be heavily mitigated: I'm planning to bond two interfaces on
> two different cards so this doesn't happen (or, I should say, is
> highly unlikely).

Since I'll be the person tasked with cleaning up after this situation, 
and given that I've no idea how to achieve that cleanup once writes are 
occurring on both sides independently, I think I'll want something more 
than "highly unlikely".  That's rather the point of these tools, isn't it?

[...]
> 2/ The node goes down totally, so DRBD loses comms. But as all the
> comms are down, the other node will notice and Cluster Suite will
> fence the bad node. Remember that GFS will suspend all operations (on
> all nodes) until the bad node is fenced.

Does it make sense to have Cluster Suite do this fencing, or should DRBD 
do it?  I'm thinking that DRBD's resource-and-stonith gets me pretty 
close. 
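
For concreteness, the sort of drbd.conf fragment I'm experimenting with 
looks roughly like this (the resource name is a placeholder, and the 
fence-peer handler would need to be whichever RHCS-aware script your 
DRBD packages provide; the path below is just an example):

    resource r0 {
        disk {
            # freeze I/O and call the fence-peer handler whenever the
            # peer becomes unreachable while the resource is in use
            fencing resource-and-stonith;
        }
        handlers {
            # placeholder path: a script that asks Cluster Suite to
            # fence (or at least outdate) the peer, and reports the
            # outcome back to DRBD via its exit code
            fence-peer "/usr/lib/drbd/obliterate-peer.sh";
        }
    }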

> I plan to further help the first situation by having my cluster comms
> share the same bond with the DRBD traffic. So if the comms fail,
> Cluster Suite should notice, and the DRBD resources on both nodes
> shouldn't change, as GFS will have suspended operations. Assuming the
> fence devices are reachable, one of the nodes should fence the other
> (it might be a bit of a shoot-out situation) and then GFS should
> resume on the remaining node.

This "shoot out situation" (race condition) is part of my worry.  A third 
voter of any form eliminates this, in that it can arbitrate the matter of 
which of the two nodes in a lost-comm situation should be "outdated" and 
fenced. 

And if the third voter can solve the "wait forever on startup", so much 
the better.
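
By "wait forever on startup" I mean DRBD's default behaviour of 
blocking at boot until it can reach its peer.  As I understand it, the 
relevant knobs live in the resource's startup section; the values below 
are only illustrative:

    startup {
        # 0 means wait indefinitely for the peer at boot (the default)
        wfc-timeout       0;
        # shorter wait used when the node was already degraded when it
        # last went down
        degr-wfc-timeout  60;
    }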

I'm looking at how to solve all of this at the DRBD layer.  But I'm 
also interested in a more Cluster-Suite-centric solution.  I could use 
a quorum disk, but a third node would also be useful.  I haven't 
figured out, though, how to run clvmd with the shared storage available 
on only two of the three cluster nodes.  Is there a way to do this?

	- Andrew



