[Linux-cluster] Rejoin cluster after failure without reboot?
oalbrigt at redhat.com
Thu Nov 26 08:57:02 UTC 2015
On 26/11/15 09:39, Christine Caulfield wrote:
> On 25/11/15 15:22, Jonathan Davies wrote:
>> I'm experimenting with corosync+dlm+gfs2 (approximately following
>> http://people.redhat.com/teigland/cluster4-gfs2-dlm.txt) and am trying
>> to establish whether it meets my requirements. I have a query about a
>> node rejoining a cluster after failure, and want to make sure I'm not
>> overlooking something.
>> I have a three-node cluster and deliberately cause token loss by
>> firewalling one of them (call it node A) out of the network for longer
>> than the token timeout. At this point, the other two hosts (B and C)
>> decide that A has disappeared and continue with quorum. That is fine.
>> When I unfirewall node A, dlm tries to reconnect to its peers on B and
>> C. But then I see the following on host B:
>> 16:29:25.823496 nodeb dlm_controld: 908 daemon node 85 stateful merge
>> 16:29:25.823529 nodeb dlm_controld: 908 daemon node 85 kill due to stateful merge
>> 16:29:25.823543 nodeb dlm_controld: 908 tell corosync to remove nodeid 85 from cluster
>> 16:29:25.823696 nodeb corosync: [CFG ] request to kill node 85(us=83): xxx
>> and then the following on node A:
>> 16:29:25.828547 nodea corosync: [CFG ] Killed by node 83:
>> 16:29:25.828575 nodea corosync: [MAIN ] Corosync Cluster Engine exiting with status -1 at cfg.c:530.
>> 16:29:25.834828 nodea dlm_controld: 1183 process_cluster_cfg cfg_dispatch 2
>> 16:29:25.834871 nodea dlm_controld: 1183 cluster is down, exiting
>> 16:29:25.834886 nodea dlm_controld: 1183 process_cluster quorum_dispatch 2
>> 16:29:25.834903 nodea dlm_controld: 1183 daemon cpg_dispatch error 2
>> 16:29:25.834917 nodea dlm_controld: 1183 cpg_dispatch error 2
>> 16:29:25.837152 nodea dlm_controld: 1183 abandoned lockspace mygfs2
>> resulting in both corosync and dlm_controld exiting on node A.
>> Later, if I try to manually restart corosync and dlm on node A, I see
>> the following:
>> 16:32:08.382871 nodea dlm_controld: 2872 dlm_controld 4.0.2 started
>> 16:32:08.392453 nodea dlm_controld: 2872 found uncontrolled lockspace mygfs2
>> 16:32:08.392477 nodea dlm_controld: 2872 tell corosync to remove nodeid 85 from cluster
>> 16:32:08.394965 nodea corosync: [CFG ] request to kill node 85(us=85): xxx
>> 16:32:08.394998 nodea corosync: [CFG ] Killed by node 85:
>> The only way of making A rejoin the cluster is to reboot.
> Yes. You need to implement fencing, so that the node will automatically
> be restarted when it leaves the cluster.
You'll probably have to use this patch to make fencing work as expected:
>> I would be grateful if you could confirm the following statements:
>> (a) The "stateful merge" is unavoidable when node A leaves the cluster
>> for longer than the token timeout then tries to rejoin.
>> (b) Killing corosync on node A is unavoidable when node B sees the
>> "stateful merge".
>> (c) dlm exiting is unavoidable when corosync dies.
>> (d) Restarting corosync then dlm on node A will necessarily result in
>> "found uncontrolled lockspace".
>> (e) The only way to recover from "found uncontrolled lockspace" (for a
>> gfs2 lockspace) is to reboot.
>> I'm hoping that I'm overlooking something and that at least one of
>> (a)--(e) is false! I'm not comfortable with a reboot being the only
>> means of recovery when the token timeout is exceeded.
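
For spotting this sequence in collected logs, a minimal Python sketch
might look like the following. It assumes each log record sits on a
single line (not wrapped by a mail client). The function name classify
and the event labels are made up for illustration, and the patterns are
just substrings of the messages quoted above, not an interface provided
by dlm_controld or corosync.

```python
import re

# Patterns are substrings of the log messages quoted in this thread.
# The event labels on the left are illustrative names, not dlm/corosync terms.
SIGNATURES = {
    "stateful_merge": re.compile(r"daemon node (\d+) stateful merge"),
    "killed_by": re.compile(r"Killed by node (\d+)"),
    "uncontrolled_lockspace": re.compile(r"found uncontrolled lockspace (\S+)"),
}


def classify(lines):
    """Return (event_label, captured_value) tuples in the order they appear."""
    events = []
    for line in lines:
        for label, pattern in SIGNATURES.items():
            match = pattern.search(line)
            if match:
                events.append((label, match.group(1)))
    return events


# Sample records copied from the thread, one record per line.
sample = [
    "16:29:25.823496 nodeb dlm_controld: 908 daemon node 85 stateful merge",
    "16:29:25.828547 nodea corosync: [CFG ] Killed by node 83:",
    "16:32:08.392453 nodea dlm_controld: 2872 found uncontrolled lockspace mygfs2",
]

for event in classify(sample):
    print(event)
```

With the three sample records this reports the merged node (85), the
node that issued the kill (83) and the leftover lockspace (mygfs2), in
that order.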