[Linux-cluster] "dlm_controld[nnnn]: cluster is down, exiting" on node1 when starting node2

Charlie Brady charlieb-linux-cluster at budge.apana.org.au
Thu Jun 4 20:23:13 UTC 2009


I'm trying to understand a node shutdown during the transition from a
one-node to a two-node cluster with qdisk. The platform is CentOS 5.3,
with these package versions:

cman-2.0.98-1.el5
openais-0.80.3-22.el5

Jun  4 10:55:08 sun4150node1 root[8103]:
S10make-event-queue=action|Event|cluster-node-added|Action|S10make-event-queue|Start|1244127308 610636|End|1244127308 614973|Elapsed|0.004337
Jun  4 10:55:08 sun4150node1 root[8103]: Running event handler:
/etc/e-smith/events/cluster-node-added/S15iscsi-adjust
Jun  4 10:55:08 sun4150node1 root[8103]:
S15iscsi-adjust=action|Event|cluster-node-added|Action|S15iscsi-adjust|Start|1244127308 615333|End|1244127308 677757|Elapsed|0.062424
Jun  4 10:55:08 sun4150node1 root[8103]: Running event handler:
/etc/e-smith/events/cluster-node-added/S20cluster-conf
Jun  4 10:55:08 sun4150node1 ccsd[7879]: Update of cluster.conf complete
(version 2 -> 3).
Jun  4 10:55:08 sun4150node1 root[8103]: Config file updated from version
2 to 3
Jun  4 10:55:08 sun4150node1 root[8103]:
Jun  4 10:55:08 sun4150node1 root[8103]: Update complete.
Jun  4 10:55:08 sun4150node1 root[8103]:
S20cluster-conf=action|Event|cluster-node-added|Action|S20cluster-conf|Start|1244127308 678119|End|1244127308 793629|Elapsed|0.11551
Jun  4 10:55:08 sun4150node1 root[8103]: Running event handler:
/etc/e-smith/events/cluster-node-added/S31qdiskd-adjust
Jun  4 10:55:08 sun4150node1 qdiskd[8128]: <info> Quorum Daemon
Initializing
Jun  4 10:55:08 sun4150node1 root[8103]: Starting the Quorum Disk Daemon:[  OK  ]^M
Jun  4 10:55:08 sun4150node1 root[8103]:
S31qdiskd-adjust=action|Event|cluster-node-added|Action|S31qdiskd-adjust|Start|1244127308 797450|End|1244127308 928144|Elapsed|0.130694
Jun  4 10:55:08 sun4150node1 root[8103]: Running event handler:
/etc/e-smith/events/cluster-node-added/S32cman-adjust
Jun  4 10:55:09 sun4150node1 root[8103]: Starting cluster:
Jun  4 10:55:09 sun4150node1 root[8103]:    Loading modules... done
Jun  4 10:55:09 sun4150node1 root[8103]:    Mounting configfs... done
Jun  4 10:55:09 sun4150node1 root[8103]:    Starting ccsd... done
Jun  4 10:55:09 sun4150node1 root[8103]:    Starting cman... done
Jun  4 10:55:09 sun4150node1 root[8103]:    Starting daemons... done
Jun  4 10:55:10 sun4150node1 root[8103]:    Starting fencing... done
Jun  4 10:55:10 sun4150node1 root[8103]: [  OK  ]^M
Jun  4 10:55:10 sun4150node1 root[8103]:
S32cman-adjust=action|Event|cluster-node-added|Action|S32cman-adjust|Start|1244127308 928465|End|1244127310 103254|Elapsed|1.174789
Jun  4 10:55:10 sun4150node1 root[8103]: Running event handler:
/etc/e-smith/events/cluster-node-added/S40cluster-join
Jun  4 10:55:10 sun4150node1 root[8103]: building file list ... done
Jun  4 10:55:10 sun4150node1 root[8103]:
Jun  4 10:55:10 sun4150node1 root[8103]: sent 64 bytes  received 20 bytes
168.00 bytes/sec
Jun  4 10:55:10 sun4150node1 root[8103]: total size is 3162  speedup is
37.64
Jun  4 10:55:17 sun4150node1 qdiskd[8128]: <info> Initial score 1/1
Jun  4 10:55:17 sun4150node1 qdiskd[8128]: <info> Initialization complete
Jun  4 10:55:17 sun4150node1 qdiskd[8128]: <notice> Score sufficient for
master operation (1/1; required=1); upgrading
Jun  4 10:55:29 sun4150node1 qdiskd[8128]: <info> Assuming master role
Jun  4 10:55:34 sun4150node1 dlm_controld[7916]: cluster is down, exiting
Jun  4 10:55:34 sun4150node1 fenced[7910]: cluster is down, exiting
Jun  4 10:55:34 sun4150node1 gfs_controld[7922]: cluster is down, exiting
Jun  4 10:55:34 sun4150node1 kernel: dlm: closing connection to node 2
Jun  4 10:55:34 sun4150node1 kernel: dlm: closing connection to node 1
Jun  4 10:55:35 sun4150node1 qdiskd[8128]: <err> cman_dispatch: Host is
down
Jun  4 10:55:35 sun4150node1 qdiskd[8128]: <err> Halting qdisk operations
Jun  4 10:55:51 sun4150node1 kernel: dlm: FS1: remove fr 0 ID 1
Jun  4 10:56:01 sun4150node1 ccsd[7879]: Unable to connect to cluster
infrastructure after 30 seconds.
Jun  4 10:56:31 sun4150node1 ccsd[7879]: Unable to connect to cluster
infrastructure after 60 seconds.
Jun  4 10:57:01 sun4150node1 ccsd[7879]: Unable to connect to cluster
infrastructure after 90 seconds.
Jun  4 10:57:31 sun4150node1 ccsd[7879]: Unable to connect to cluster
infrastructure after 120 seconds.

The first thing I see awry is "dlm_controld[7916]: cluster is down,
exiting". From the source code, that message could come from either
process_member() or cluster_dead(), both of which are reached via
callbacks from loop(). My best guess is that process_member() called
cman_dispatch(ch, CMAN_DISPATCH_ALL) and the call returned -1 with errno
set to EHOSTDOWN. But I don't know why that would happen, and in
particular why at this point.
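
In case it helps, here is a small standalone probe I could run next time
(my own sketch, not code taken from dlm_controld): it sits in the same
kind of cman_dispatch(ch, CMAN_DISPATCH_ALL) loop I believe
process_member() runs, and reports if the dispatch fails with EHOSTDOWN.
cman_init(), cman_get_fd() and cman_finish() are just the obvious libcman
companions I'm assuming here, so treat the details as approximate.

/*
 * cmanprobe.c -- minimal sketch, NOT dlm_controld code.  Watches the cman
 * socket and calls cman_dispatch(CMAN_DISPATCH_ALL), the call I suspect
 * fails with EHOSTDOWN inside process_member().
 *
 * Build (assuming cman-devel is installed):
 *   gcc -o cmanprobe cmanprobe.c -lcman
 */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <poll.h>
#include <libcman.h>

int main(void)
{
	cman_handle_t ch;
	struct pollfd pfd;
	int rv;

	ch = cman_init(NULL);		/* plain (non-admin) connection to cman */
	if (!ch) {
		perror("cman_init");
		return 1;
	}

	pfd.fd = cman_get_fd(ch);	/* fd dlm_controld-style code polls on */
	pfd.events = POLLIN;

	for (;;) {
		rv = poll(&pfd, 1, -1);
		if (rv < 0 && errno == EINTR)
			continue;
		if (rv < 0) {
			perror("poll");
			break;
		}

		/* The call I suspect is failing inside process_member(). */
		rv = cman_dispatch(ch, CMAN_DISPATCH_ALL);
		if (rv == -1 && errno == EHOSTDOWN) {
			/* The condition that (I think) makes dlm_controld
			   log "cluster is down, exiting" and quit. */
			fprintf(stderr, "cman_dispatch: %s\n", strerror(errno));
			break;
		}
	}

	cman_finish(ch);
	return 0;
}

If that probe sees EHOSTDOWN at the same moment (10:55:34 here), it would
at least confirm that cman itself, not dlm_controld, is dropping the
connection.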

cman started fine on node2, and node1 rejoined without incident after a
reboot.

Any hints on how to debug this would be appreciated.

Thanks

---
Charlie
