[Linux-cluster] qdiskd + cman: trying to fix the use of quorumdev_poll.

Lon Hohberger lhh at redhat.com
Mon Jan 8 15:43:26 UTC 2007


On Sun, 2007-01-07 at 20:29 +0100, Simone Gotti wrote:
> Problem 2)
> 
> After fixing Problem 1, if I set in the quorumd tag of cluster.conf an
> interval > quorumdev_poll/1000*2 the quorum is lost then regained over
> and over as the polling frequency of qdiskd is less than the polling one
> of cman.
> Probably the right thing to do is to calculate the value of
> quorumdev_poll from the ccs return value of "/cluster/quorumd/@interval"
> and quorumdev_poll=interval*1000*2 should be ok.

I think the poll rate should be closer to (interval * tko * 1000) ms [10
seconds by default], and not a function of the quorum disk interval
alone.

This is because after (interval*tko*1000), the master node of the
cluster will write an eviction message to a hung node - and that's when
qdiskd will either reboot the node or tell CMAN that its votes are no
longer valid.
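
To make the arithmetic concrete, here is a rough sketch (not qdiskd code) comparing the two candidate poll values. The function names are made up for illustration, and interval=1, tko=10 are assumed values chosen to match the 10-second figure above.

```python
def eviction_window_ms(interval_s: int, tko: int) -> int:
    """Milliseconds before the qdisk master writes an eviction notice."""
    return interval_s * tko * 1000


def proposed_poll_ms(interval_s: int) -> int:
    """Poll value from the quoted proposal: interval * 1000 * 2."""
    return interval_s * 1000 * 2


# Assumed values, picked so that interval * tko comes to 10 seconds.
interval_s, tko = 1, 10

window = eviction_window_ms(interval_s, tko)
proposed = proposed_poll_ms(interval_s)

print(f"eviction window: {window} ms")
print(f"proposed poll:   {proposed} ms")
print(f"gap:             {window - proposed} ms")
```

With those assumed values the proposed poll would expire about 8 seconds before the eviction notice is written.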

I do not think the shorter value will cause any problems per se, but
dropping qdiskd's votes after ~2 seconds when the qdisk master won't
write an eviction notice for another ~8 seconds seems a bit odd.

Normal node failure delay should be >= 2 * (interval * tko * 1000) ms.
There's a parameter in the <totem> tag (which defaults to 5,000 ms) that
should be set to 2 * interval * tko * 1000, but I don't recall its name
right now.

qdiskd needs to time out before CMAN does.  While it doesn't have to be
"half or less", it's a good paranoia factor that's easy to remember, and
it gives the node plenty of time.
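
As a sanity check on the "qdiskd before CMAN" rule, something like the following could be used. It is only a sketch: the helper names are hypothetical, the factor of 2 is the paranoia factor described above rather than a hard requirement, and interval=1, tko=10 are assumed values.

```python
def qdiskd_timeout_ms(interval_s: int, tko: int) -> int:
    """Time after which qdiskd considers a node dead, in milliseconds."""
    return interval_s * tko * 1000


def recommended_cman_timeout_ms(interval_s: int, tko: int, factor: int = 2) -> int:
    """Suggested CMAN-side failure timeout: factor * interval * tko * 1000."""
    return factor * qdiskd_timeout_ms(interval_s, tko)


def qdiskd_times_out_first(cman_timeout_ms: int, interval_s: int, tko: int) -> bool:
    """True if qdiskd gives up on a node strictly before CMAN does."""
    return qdiskd_timeout_ms(interval_s, tko) < cman_timeout_ms


# Assumed qdisk settings; 5000 ms is the <totem> default mentioned above.
interval_s, tko = 1, 10
totem_default_ms = 5000

print(qdiskd_times_out_first(totem_default_ms, interval_s, tko))
print(recommended_cman_timeout_ms(interval_s, tko))
```

With these assumed values the 5,000 ms default is shorter than qdiskd's own 10,000 ms timeout, so the CMAN-side value would need to be raised to about 20,000 ms.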

-- Lon