[Linux-cluster] qdiskd + cman: trying to fix the use of quorumdev_poll.

Patrick Caulfield pcaulfie at redhat.com
Tue Jan 9 13:35:59 UTC 2007


Lon Hohberger wrote:
> On Sun, 2007-01-07 at 20:29 +0100, Simone Gotti wrote:
>> Problem 2)
>>
>> After fixing Problem 1, if I set an interval > quorumdev_poll/1000*2 in
>> the quorumd tag of cluster.conf, quorum is lost and then regained over
>> and over, because qdiskd polls less frequently than cman does.
>> The right fix is probably to calculate the value of quorumdev_poll
>> from the ccs return value of "/cluster/quorumd/@interval";
>> quorumdev_poll=interval*1000*2 should be OK.
> 
> I think the poll period should be closer to (interval * tko * 1000) ms
> [10 seconds by default], and not a function of the quorum disk interval
> alone.
> 
> This is because after (interval*tko*1000), the master node of the
> cluster will write an eviction message to a hung node - and that's when
> qdiskd will either reboot the node or tell CMAN that its votes are no
> longer valid.
> 
> I do not think it will cause any problems per se, but dropping qdiskd's
> votes after ~2 seconds when the qdisk master won't write an eviction
> notice for another ~8 seconds seems a bit odd.
> 
> Normal node failure delay should be >= 2*(interval*tko*1000).  There's a
> parameter in the <totem> tag (which defaults to 5,000ms) that should be
> set to 2 * interval * tko * 1000, but I don't recall what it is called
> right now.
> 
> qdiskd needs to time out before CMAN does.  While it doesn't have to be
> "half or less", it's a good paranoia factor that's easy to remember, and
> it gives the node plenty of time.
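
The timing relationships in the quoted discussion can be sketched in a few
lines of Python. This is only an illustration of the arithmetic, not code
from qdiskd or cman: the function names are hypothetical, and the values
interval=1 and tko=10 are assumed here only because they reproduce the
"10 seconds by default" figure mentioned above.

```python
def qdisk_eviction_ms(interval_s, tko):
    """Time (ms) before the qdisk master writes an eviction message:
    interval * tko * 1000, as described in the quoted text."""
    return interval_s * tko * 1000


def proposed_poll_ms(interval_s):
    """Simone's proposed quorumdev_poll value: interval * 1000 * 2."""
    return interval_s * 1000 * 2


def suggested_node_failure_ms(interval_s, tko):
    """Lon's rule of thumb: node failure delay >= 2 * (interval * tko * 1000)."""
    return 2 * qdisk_eviction_ms(interval_s, tko)


def cman_timeout_is_safe(interval_s, tko, cman_timeout_ms):
    """True if CMAN's timeout leaves qdiskd the suggested factor-of-two margin."""
    return cman_timeout_ms >= suggested_node_failure_ms(interval_s, tko)


if __name__ == "__main__":
    interval_s, tko = 1, 10  # assumed values, chosen to match "10 seconds"
    print("qdisk eviction after (ms):", qdisk_eviction_ms(interval_s, tko))
    print("proposed quorumdev_poll (ms):", proposed_poll_ms(interval_s))
    print("suggested node failure delay (ms):",
          suggested_node_failure_ms(interval_s, tko))
    # 5000 ms is the <totem> default quoted above.
    print("5000 ms default is enough:",
          cman_timeout_is_safe(interval_s, tko, 5000))
```

With these assumed values, the 2-second poll and the 10-second eviction time
show the ~8 second gap Lon describes, and the quoted 5,000ms default falls
short of the suggested 2 * interval * tko * 1000.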


Lon: do you reckon we need a blocker bug for "Problem 1)"?

-- 

patrick



