[Linux-cluster] Startup ordering of cman, fenced, qdiskd, etc.

Lars Kellogg-Stedman lars at oddbit.com
Wed Sep 24 02:19:10 UTC 2008

Hello all,

I've recently started evaluating the Red Hat Cluster Suite.  I've run
into some behavior I don't fully understand, and I'm hoping someone
here can help me out.

First, the essentials: I'm running cman-2.0.84-2 under CentOS 5.2.
The cluster is a two-node cluster with a quorum disk; the full
cluster.conf is available here: <http://tinyurl.com/3zdh2k>.  There
are two nodes, cluster0 (nodeid 1) and cluster1 (nodeid 2). The quorum
disk configuration has a single 'ping' heuristic.

While investigating failure scenarios, I introduced a network
interruption that would (a) isolate the two cluster nodes from each
other and (b) isolate node 2 ("cluster1") from the endpoint used in
qdiskd's "ping" heuristic:

  iptables -A INPUT -s cluster0 -j DROP
  iptables -A OUTPUT -d cluster0 -j DROP
  iptables -A OUTPUT -p icmp -j DROP

Note that this introduces a temporary network interruption that will
not persist across a system reboot.  At the time I ran this command,
cluster1 was the quorum master ("Master Node ID: 2").

This initiated the following sequence of events:

(1) cluster1 fenced cluster0.

    --> the cluster now has one node and a quorum disk
    --> cluster is (still) quorate

(2) qdiskd noted the failure of the ping heuristic.

    --> cluster becomes inquorate
    --> cman reboots cluster1 ("[CMAN ] quorum lost, blocking activity")

(3) cluster1 boots up and attempts to start cman.  qdiskd is not yet
running.  cluster0 is powered off.

    --> cluster never achieves quorum
    --> cman start never gets beyond "Starting fencing..."
    --> bootup never completes, so no manual intervention is possible
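
To make the vote arithmetic behind steps (1) through (3) explicit, here
is a minimal sketch.  It assumes one vote per node and one vote for the
quorum disk, and that quorum is a simple majority of the expected
votes; the real values live in cluster.conf and may differ.  The
variable and function names are made up for illustration.

```shell
#!/bin/sh
# Assumed vote assignments (not taken from the actual cluster.conf).
NODE_VOTES=1      # votes per cluster node
QDISK_VOTES=1     # votes contributed by the quorum disk
EXPECTED=$(( 2 * NODE_VOTES + QDISK_VOTES ))   # two nodes plus qdisk
QUORUM=$(( EXPECTED / 2 + 1 ))                 # simple majority

# is_quorate NODES_UP QDISK_UP
# Prints "quorate" if the votes present reach the majority threshold,
# "inquorate" otherwise.  QDISK_UP is 1 when qdiskd contributes its vote.
is_quorate() {
    votes=$(( $1 * NODE_VOTES + $2 * QDISK_VOTES ))
    if [ "$votes" -ge "$QUORUM" ]; then
        echo quorate
    else
        echo inquorate
    fi
}

is_quorate 2 1   # normal operation: both nodes and the qdisk
is_quorate 1 1   # step (1): cluster0 fenced, qdisk still counted
is_quorate 1 0   # steps (2) and (3): one node, no qdisk vote
```

Under these assumptions a lone node can only become quorate once the
quorum disk vote is counted, which is consistent with cman waiting
forever in step (3) when qdiskd has not been started yet.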

I've swapped the startup order of cman and qdiskd, and now things seem
to work (read: "the system actually boots and starts up cluster
services").  Is there any reason *not* to do this?  Would it make more
sense to start qdiskd out of the cman init script (like fenced,
groupd, et al)?  It seems to make a lot more sense this way, but
presumably people much smarter than I configured the default behavior,
so I'm nervous.
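
For anyone who wants to check which of the two starts first on their
own system, here is a minimal sketch.  It assumes SysV-style rcN.d
directories in which start links are named SNNname and are run in
sorted order; the runlevel 3 path and the helper name start_order are
assumptions for illustration, not something taken from the cluster
suite itself.

```shell
#!/bin/sh
# start_order DIR
# Prints the cman and qdiskd start links found in DIR, in the order the
# shell glob returns them (sorted), which is assumed to match the order
# in which init runs them.
start_order() {
    for link in "$1"/S[0-9][0-9]*; do
        name=${link##*/}          # strip the directory part
        case "$name" in
            S[0-9][0-9]cman|S[0-9][0-9]qdiskd) echo "$name" ;;
        esac
    done
}

# Assumed default runlevel directory; adjust for the runlevel in use.
start_order /etc/rc3.d
```

Whichever name is printed first is the service that starts first at
that runlevel.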
