[Linux-cluster] qdiskd + cman: trying to fix the use of quorumdev_poll.

Sun Jan 7 19:29:54 UTC 2007

Hi all,

I'm using the openais-based cman-2.0.35.el5 and I'm trying to understand
how the quorum disk concept is implemented in RHCS. After various
experiments I think I have found at least two problems:

Problem 1)

A small bug in the quorum disk polling mechanism:

Looking at the code in cman/daemon/commands.c, the variable
quorumdev_poll = 10000 is expressed in milliseconds and is used to call
"quorum_device_timer_fn" once every quorumdev_poll interval, to check
whether qdiskd is still telling cman that the node can use the quorum
votes.

The same variable is then used in quorum_device_timer_fn, but here it's
used as seconds:

if (quorum_device->last_hello.tv_sec + quorumdev_poll < now.tv_sec) {

So, when qdiskd dies, or access to the quorum disk is lost, it will take
more than 2 hours (10000 seconds) for cman to notice this and
recalculate the quorum.
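
To make the size of the mismatch concrete, here is a small standalone
sketch. It is not cman code: the helper names are made up for
illustration, and only the 10000 default is taken from commands.c. It
computes the timeout that results when the millisecond value is added to
tv_sec as if it were seconds:

```c
#include <stdio.h>

/* Default from cman/daemon/commands.c, expressed in milliseconds. */
#define QUORUMDEV_POLL_MS 10000

/* Hypothetical helper: timeout actually applied when the millisecond
 * value is added directly to tv_sec (i.e. treated as seconds). */
static long buggy_timeout_secs(long poll_ms)
{
    return poll_ms;
}

/* Hypothetical helper: the timeout that was presumably intended. */
static long intended_timeout_secs(long poll_ms)
{
    return poll_ms / 1000;
}

/* Print both timeouts for the default poll value. */
static void print_timeouts(void)
{
    long buggy = buggy_timeout_secs(QUORUMDEV_POLL_MS);
    long intended = intended_timeout_secs(QUORUMDEV_POLL_MS);

    printf("intended: %ld s\n", intended);
    printf("buggy:    %ld s (%ldh %ldm %lds)\n", buggy,
           buggy / 3600, (buggy % 3600) / 60, buggy % 60);
}
```

Calling print_timeouts() from a main() shows 10 s against 10000 s, which
is 2h 46m 40s.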

After changing the line:
========================================================================

--- cman-2.0.35.orig/cman/daemon/commands.c  2007-01-07 21:01:30.000000000 +0100
+++ cman-2.0.35.patched/cman/daemon/commands.c  2007-01-05 18:12:33.000000000 +0100
@@ -1038,15 +1037,12 @@ static void ccsd_timer_fn(void *arg)

 static void quorum_device_timer_fn(void *arg)
 {
        struct timeval now;
        if (!quorum_device || quorum_device->state == NODESTATE_DEAD)
                return;

        gettimeofday(&now, NULL);
-       if (quorum_device->last_hello.tv_sec + quorumdev_poll < now.tv_sec) {
+       if (quorum_device->last_hello.tv_sec + quorumdev_poll/1000 < now.tv_sec) {
                quorum_device->state = NODESTATE_DEAD;
                log_msg(LOG_INFO, "lost contact with quorum device\n");
                recalculate_quorum(0);
========================================================================


it worked. A more precise fix would be to do the comparison in
milliseconds, taking tv_usec/1000 into account as well instead of using
tv_sec alone.
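
A minimal sketch of such a millisecond comparison follows. The helper
names are hypothetical and not part of cman; the only assumption about
the real code is that last_hello and now are struct timeval values and
that the poll period is in milliseconds:

```c
#include <sys/time.h>

/* Convert a struct timeval to whole milliseconds. */
static long long timeval_to_ms(const struct timeval *tv)
{
    return (long long)tv->tv_sec * 1000 + tv->tv_usec / 1000;
}

/* Hypothetical helper: returns 1 if more than poll_ms milliseconds have
 * passed between last_hello and now, 0 otherwise. */
static int quorum_device_timed_out(const struct timeval *last_hello,
                                   const struct timeval *now,
                                   long poll_ms)
{
    return timeval_to_ms(last_hello) + poll_ms < timeval_to_ms(now);
}
```

With this, the check in quorum_device_timer_fn could read
quorum_device_timed_out(&quorum_device->last_hello, &now, quorumdev_poll),
so no precision is lost to the integer division by 1000.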


Problem 2)

After fixing Problem 1, if I set an interval > quorumdev_poll/1000*2 in
the quorumd tag of cluster.conf, the quorum is lost and then regained
over and over, because qdiskd polls less frequently than cman does.
Probably the right thing to do is to calculate quorumdev_poll from the
value that ccs returns for "/cluster/quorumd/@interval";
quorumdev_poll = interval*1000*2 should be OK.
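
The calculation could be sketched roughly as below. This is only an
illustration: the function name, the fallback to the current 10000 ms
default, and the upper bound are assumptions, and the actual ccs lookup
is left out. The function just takes the string that would be returned
for "/cluster/quorumd/@interval" (in seconds):

```c
#include <errno.h>
#include <stdlib.h>

/* Current default in commands.c, in milliseconds. */
#define DEFAULT_QUORUMDEV_POLL_MS 10000

/* Hypothetical helper: derive the cman poll period (ms) from the textual
 * value of /cluster/quorumd/@interval (seconds). Falls back to the
 * default when the value is missing or not a sane positive number. */
static int quorumdev_poll_from_interval(const char *interval_str)
{
    char *end;
    long secs;

    if (!interval_str || !*interval_str)
        return DEFAULT_QUORUMDEV_POLL_MS;

    errno = 0;
    secs = strtol(interval_str, &end, 10);
    /* The 100000 s cap is an arbitrary guard against int overflow. */
    if (errno || *end != '\0' || secs <= 0 || secs > 100000)
        return DEFAULT_QUORUMDEV_POLL_MS;

    /* Twice the qdiskd interval, converted to milliseconds. */
    return (int)(secs * 1000 * 2);
}
```

For example, an interval of "15" would give quorumdev_poll = 30000, so
cman would wait for two qdiskd cycles before declaring the quorum device
dead.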


What do you think about these problems? I'd be happy to fix them and
provide a full patch.


Thanks.

Bye!
-- 
Simone Gotti

 
 