[Linux-cluster] Rejoin cluster after failure without reboot?

Wed Nov 25 15:22:18 UTC 2015

Hi,

I'm experimenting with corosync+dlm+gfs2 (approximately following 
http://people.redhat.com/teigland/cluster4-gfs2-dlm.txt) and am trying 
to establish whether it meets my requirements. I have a query about a 
node rejoining a cluster after failure, and want to make sure I'm not 
overlooking something.

I have a three-node cluster and deliberately cause token loss by 
firewalling one node (call it node A) off from the network for longer 
than the token timeout. At this point, the other two nodes (B and C) 
decide that A has disappeared and continue with quorum. That is fine.

When I remove the firewall rules on node A, dlm tries to reconnect to 
its peers on B and C. But then I see the following on node B:

16:29:25.823496 nodeb dlm_controld[6548]: 908 daemon node 85 stateful merge
16:29:25.823529 nodeb dlm_controld[6548]: 908 daemon node 85 kill due to stateful merge
16:29:25.823543 nodeb dlm_controld[6548]: 908 tell corosync to remove nodeid 85 from cluster
16:29:25.823696 nodeb corosync[6536]:  [CFG   ] request to kill node 85(us=83): xxx

and then the following on node A:

16:29:25.828547 nodea corosync[3896]:  [CFG   ] Killed by node 83: dlm_controld
16:29:25.828575 nodea corosync[3896]:  [MAIN  ] Corosync Cluster Engine exiting with status -1 at cfg.c:530.
16:29:25.834828 nodea dlm_controld[3466]: 1183 process_cluster_cfg cfg_dispatch 2
16:29:25.834871 nodea dlm_controld[3466]: 1183 cluster is down, exiting
16:29:25.834886 nodea dlm_controld[3466]: 1183 process_cluster quorum_dispatch 2
16:29:25.834903 nodea dlm_controld[3466]: 1183 daemon cpg_dispatch error 2
16:29:25.834917 nodea dlm_controld[3466]: 1183 cpg_dispatch error 2
16:29:25.837152 nodea dlm_controld[3466]: 1183 abandoned lockspace mygfs2

resulting in both corosync and dlm_controld exiting on node A.
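
In case it helps anyone reproduce this, below is a minimal sketch of how 
the kill events could be picked out of the logs. It assumes each log 
entry sits on a single line in the "time host daemon[pid]: message" 
layout shown above; the function name and the sample list are my own 
illustration, not part of dlm or corosync.

```python
import re

# Assumed layout: "HH:MM:SS.usec host daemon[pid]: message" (one entry per line).
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+(?P<host>\S+)\s+"
    r"(?P<daemon>[\w.-]+)\[(?P<pid>\d+)\]:\s+(?P<msg>.*)$"
)
# Message text copied from the dlm_controld output quoted in this post.
KILL_RE = re.compile(r"daemon node (?P<nodeid>\d+) kill due to stateful merge")


def find_stateful_merge_kills(lines):
    """Return (time, host, nodeid) for each 'kill due to stateful merge' entry."""
    kills = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group("daemon") != "dlm_controld":
            continue  # skip corosync lines and anything that does not parse
        k = KILL_RE.search(m.group("msg"))
        if k:
            kills.append((m.group("time"), m.group("host"), int(k.group("nodeid"))))
    return kills


# Sample entries taken from node B's log above (unwrapped).
sample = [
    "16:29:25.823496 nodeb dlm_controld[6548]: 908 daemon node 85 stateful merge",
    "16:29:25.823529 nodeb dlm_controld[6548]: 908 daemon node 85 kill due to stateful merge",
    "16:29:25.823543 nodeb dlm_controld[6548]: 908 tell corosync to remove nodeid 85 from cluster",
]
print(find_stateful_merge_kills(sample))
```

Run against the surviving nodes' logs, this shows which node issued the 
kill and which nodeid was removed.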

Later, if I try to manually restart corosync and dlm on node A, I see 
the following:

16:32:08.382871 nodea dlm_controld[20483]: 2872 dlm_controld 4.0.2 started
16:32:08.392453 nodea dlm_controld[20483]: 2872 found uncontrolled lockspace mygfs2
16:32:08.392477 nodea dlm_controld[20483]: 2872 tell corosync to remove nodeid 85 from cluster
16:32:08.394965 nodea corosync[20456]:  [CFG   ] request to kill node 85(us=85): xxx
16:32:08.394998 nodea corosync[20456]:  [CFG   ] Killed by node 85: dlm_controld

The only way of making A rejoin the cluster is to reboot.

I would be grateful if you could confirm the following statements:
   (a) The "stateful merge" is unavoidable when node A leaves the 
cluster for longer than the token timeout and then tries to rejoin.
   (b) Killing corosync on node A is unavoidable when node B sees the 
"stateful merge".
   (c) dlm exiting is unavoidable when corosync dies.
   (d) Restarting corosync then dlm on node A will necessarily result in 
"found uncontrolled lockspace".
   (e) The only way to recover from "found uncontrolled lockspace" (for 
a gfs2 lockspace) is to reboot.

I'm hoping that I'm overlooking something and that at least one of 
(a)--(e) is false! I'm not comfortable with a reboot being the only 
means of recovery when the token timeout is exceeded.

Thanks,
Jonathan