[Linux-cluster] qdiskd eviction on missed writes

Thu Jan 4 04:12:28 UTC 2007

It seems that certain spike conditions can cause one node to evict
another based on missed writes to the qdisk.  The problem is that during
these spikes, application access to the same storage back end does not
seem to be impacted.  The SAN in this case is a high-end EMC DMX,
multipathed, etc.  Currently our clusters are set to interval="1" and
tko="15", which should allow for at least 15 seconds (a very long time
for this type of storage).
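
For reference, the arithmetic behind that 15 second figure can be
sketched as follows.  The function name is made up for illustration and
is not part of qdiskd; it assumes one status check per interval and that
up to tko consecutive misses are tolerated.

```c
/* Hypothetical helper, not qdiskd code: the shortest time, in seconds,
 * that a node can stay silent before its miss count can exceed tko,
 * assuming one status check per interval. */
unsigned int min_silence_before_eviction(unsigned int interval,
                                         unsigned int tko)
{
    return interval * tko;
}
```

With interval="1" and tko="15" this gives 15.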

In looking at ~/cluster/cman/qdisk/main.c it seems like the following is
taking place:

In quorum_loop {}

        1) read everybody else's status (not sure if this includes
           yourself)
        2) check for node transitions (write eviction notice if number
           of heartbeats missed > tko)
        3) check local heuristics (if we do not meet the requirement,
           remove ourselves from the qdisk partition and possibly reboot)
        4) find master and/or determine new master, etc.
        5) write out our status to qdisk
        6) write out our local status (heuristics)
        7) cycle (sleep for defined interval).  sleep() is measured in
           seconds, so complete cycle = interval + time for steps (1)
           through (6)
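
To make the timing concern concrete, here is a small model of the loop
as I read it.  It is only a sketch: both function names are invented,
the observer is assumed to sample exactly once per interval without any
delay of its own, and the miss counting simply follows step (2) above
(evict once misses > tko).  It is not taken from main.c.

```c
#include <stddef.h>

/* Writer side: pass i spends work[i] seconds in steps (1)-(4), writes
 * its status (step 5), then sleeps for `interval` seconds (step 7).
 * Stores the time of each status write in write_times[]. */
void fill_write_times(const unsigned long *work, size_t n,
                      unsigned long interval, unsigned long *write_times)
{
    unsigned long now = 0;

    for (size_t i = 0; i < n; i++) {
        now += work[i];          /* delay before the write */
        write_times[i] = now;    /* status reaches the qdisk */
        now += interval;         /* sleep until the next pass */
    }
}

/* Observer side: samples once per interval and counts consecutive
 * samples with no new write.  Returns the time of the first eviction
 * notice, or -1 if none happens up to `horizon` seconds. */
long first_eviction_time(const unsigned long *write_times, size_t n,
                         unsigned long interval, unsigned long tko,
                         unsigned long horizon)
{
    size_t next = 0;             /* first write not yet seen */
    unsigned long misses = 0;

    if (interval == 0)
        return -1;               /* avoid a loop that never advances */

    for (unsigned long now = interval; now <= horizon; now += interval) {
        int saw_update = 0;

        while (next < n && write_times[next] <= now) {
            next++;
            saw_update = 1;
        }
        misses = saw_update ? 0 : misses + 1;
        if (misses > tko)
            return (long)now;
    }
    return -1;
}
```

In this model, with interval=1 and tko=15, a single pass that stalls for
17 seconds before its write is enough to get the node evicted, while a
14 second stall is not.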

Do you think that any delay in steps (1) through (4) could be the
problem?  From an architectural standpoint, wouldn't it be better to
have (6) and (7) as a separate thread or daemon?  A kernel thread like
cman_hbeat, for example?
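
Roughly what I have in mind, as a userspace sketch with pthreads rather
than a kernel thread.  All of the names here (hb_writer and friends) are
hypothetical, and the actual write to the quorum disk is replaced by a
counter.  Note that this would only help if the delay is in the other
steps of the loop; if the disk write itself stalls, a separate thread
stalls too.

```c
#include <pthread.h>
#include <unistd.h>

/* Hypothetical state for a dedicated status-writer thread. */
struct hb_writer {
    pthread_t thread;
    pthread_mutex_t lock;
    int running;                /* cleared to ask the thread to exit */
    unsigned long writes_done;  /* stands in for real qdisk writes */
    unsigned int interval;      /* seconds between writes */
};

static void *hb_writer_main(void *arg)
{
    struct hb_writer *w = arg;
    int keep_going;

    do {
        pthread_mutex_lock(&w->lock);
        w->writes_done++;       /* placeholder for the status write */
        pthread_mutex_unlock(&w->lock);

        sleep(w->interval);     /* independent of the main loop */

        pthread_mutex_lock(&w->lock);
        keep_going = w->running;
        pthread_mutex_unlock(&w->lock);
    } while (keep_going);

    return NULL;
}

/* Start the writer thread; returns 0 on success, -1 on failure. */
int hb_writer_start(struct hb_writer *w, unsigned int interval)
{
    w->running = 1;
    w->writes_done = 0;
    w->interval = interval;

    if (pthread_mutex_init(&w->lock, NULL) != 0)
        return -1;
    if (pthread_create(&w->thread, NULL, hb_writer_main, w) != 0) {
        pthread_mutex_destroy(&w->lock);
        return -1;
    }
    return 0;
}

/* Stop the writer thread and return how many writes it performed. */
unsigned long hb_writer_stop(struct hb_writer *w)
{
    pthread_mutex_lock(&w->lock);
    w->running = 0;
    pthread_mutex_unlock(&w->lock);

    pthread_join(w->thread, NULL);
    pthread_mutex_destroy(&w->lock);
    return w->writes_done;
}
```

The main loop would then keep steps (1) through (4), and a slow pass
there would no longer push back the next status write.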

Further, in the check_transitions procedure (case #2), it might be more
helpful to clulog what actually caused this to trigger.  The current
logging is a bit generic.
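
Something along these lines is what I mean.  The function and the
message format are invented for illustration, and I am assuming the miss
count, tko and interval are all available at that point in the code.
The resulting string would then be handed to clulog.

```c
#include <stdio.h>

/* Hypothetical helper: build a log message that says why a node is
 * being evicted.  Returns what snprintf() returns. */
int format_eviction_reason(char *buf, size_t len, int nodeid,
                           unsigned int misses, unsigned int tko,
                           unsigned int interval)
{
    return snprintf(buf, len,
                    "Node %d evicted: %u missed updates "
                    "(tko=%u, interval=%us)",
                    nodeid, misses, tko, interval);
}
```

For node 2 with 16 misses this would produce "Node 2 evicted: 16 missed
updates (tko=15, interval=1s)", which at least shows which threshold was
crossed.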

Am I totally off base or does this seem plausible?

Thanks,
 Dan