[Linux-cluster] GFS+DRBD+Quorum: Help wrap my brain around this

Colin Simpson Colin.Simpson at iongeo.com
Tue Nov 23 12:45:26 UTC 2010


On Mon, 2010-11-22 at 21:21 +0000, A. Gideon wrote:
> On Sun, 21 Nov 2010 21:46:03 +0000, Colin Simpson wrote:
> 
> 
> > I suppose what I'm saying is that there is no real way to get a
> > quorum disk with DRBD. And basically it doesn't really gain you
> > anything without actual shared storage.
> 
> I understand that.  That's why I'm looking for that "external" solution
> (i.e. a separate iSCSI volume from a third machine) to act as a quorum
> disk (effectively making that third machine a quorum server).
> 
> But I'm not clear how important this is.  I think the problem is that,
> while I've some familiarity with clustering, I've less with DRBD.  I
> don't understand how DRBD handles the matter of quorum given only two
> potential voters.
> 
Just by telling Cluster Suite that a single node can be quorate in a
two-node cluster:

<cman expected_votes="1" two_node="1"/>

This is fine, but it needs a little careful handling with DRBD and the
outdated-node situation, IMHO.
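
For context, that one-liner lives in /etc/cluster/cluster.conf. Here's a
minimal sketch of where it sits (cluster and node names are made up, and
fencing is omitted for brevity, though you'd want it in real life):

 <?xml version="1.0"?>
 <cluster name="testcluster" config_version="1">
   <cman expected_votes="1" two_node="1"/>
   <clusternodes>
     <!-- example node names; fencing config omitted -->
     <clusternode name="node1" nodeid="1"/>
     <clusternode name="node2" nodeid="2"/>
   </clusternodes>
 </cluster>

With two_node="1" and expected_votes="1", cman will let either node be
quorate on its own, which is exactly why the careful DRBD handling below
matters.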

> [...]
> > The scenario is well mitigated by DRBD on two nodes already without
> > this. The system will not, if you config properly, start DRBD (and
> > all the cluster storage stuff after, presuming your start up files
> > are in the right order) until it sees the second node.
> 
> So if one node fails, the mirror is broken but storage is still
> available?  But if both nodes go down, storage only becomes available
> again once both nodes are up?  I've missed this in the documentation,
> I'm afraid.

If one node fails, the storage should be fine. When the failed node comes
back up it will see that the surviving node has newer data and rebuild its
copy from it, and I believe (and my tests seem to confirm this) that the
syncing node will pass all disk requests through to the "good" up-to-date
node during the sync.
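
As an aside, you can watch this from the syncing node with drbdadm (the
resource name "r0" is just an example, substitute your own):

 drbdadm cstate r0   # connection state: shows "SyncTarget" while rebuilding
 drbdadm dstate r0   # disk states, local/peer: "Inconsistent/UpToDate" during the sync
 cat /proc/drbd      # overall status, including sync progress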

> [...]
> > The situation of two nodes coming up when the out of date one comes
> > up first should never arise if you give it sufficient time to see
> > the other node (it will always pick the new good one's data), you
> > can make it wait forever and then require manual intervention if
> > you prefer (should a node be down for an extended period).
> 
> Waiting forever for the second node seems a little strict to me, though
> I suppose if the second node is the node with the most up-to-date data
> then this is the proper thing to do.  But waiting forever for the node
> that has outdated information seems inefficient, though I see it is
> caused by the fact that DRBD has no way to know which node is more
> up-to-date.
> 
> Am I understanding that correctly?
You seem correct to me. 
> 
> > For me a couple of minutes waiting
> > for the other node is sufficient if it was degraded already, maybe
> > a bit longer if the DRBD was sync'd before they went down.
> 
> I'm afraid I'm not clear what you mean by this.  Isn't the fact that
> each node cannot know the state of the other the problem?  So how can
> wait times be varied as you describe?

Depends on your situation, I think. I don't want to wait forever either,
as I don't want to have to visit the systems in such a scenario. I have
this in my test setup:

 startup {
 	wfc-timeout  300;      # wait up to 300 seconds for the peer at a normal boot
 	degr-wfc-timeout 60;   # wait only 60 seconds if this node was in a degraded cluster
 	become-primary-on both;
 }

So I'm saying, at startup: if I was degraded last time I was up, I assume
the other node was already down then (say, a long-term hardware outage).
It's therefore unlikely I'll see the other node come up during this boot,
so I'm probably the "primary" good up-to-date node. In that case I only
wait around for 60 seconds to see if it appears before the drbd init
script finishes and all my cluster stuff comes up.

In the normal case of not being degraded, I will wait up to 5 minutes
before assuming I'm not going to see the other node and that I therefore
have the up-to-date data.

Not perfect, but there's no way of telling for sure with only two nodes.
It's like the man with two watches who never knows the time...

This mitigates the two-node situation if you're careful. You could bump
the timeouts up for more paranoia, or wait forever for maximum paranoia
(see the sketch below).
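
If you do want that, DRBD treats a timeout of 0 as unlimited (and I
believe the same holds for degr-wfc-timeout), so a maximum-paranoia
variant of the startup section above would look something like:

 startup {
 	wfc-timeout  0;        # 0 = block boot until the peer connects
 	degr-wfc-timeout 0;    # 0 = unlimited here too, even after a degraded shutdown
 	become-primary-on both;
 }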

I have pretty much the same concerns as you on this; see the thread I
started, "Best Practice with DRBD RHCS and GFS2", on the drbd mailing
list. Someone there seemed to address most of my concerns.

Colin


