[Linux-cluster] Rejoin cluster after failure without reboot?

Thu Nov 26 08:57:02 UTC 2015

On 26/11/15 09:39, Christine Caulfield wrote:
> On 25/11/15 15:22, Jonathan Davies wrote:
>> Hi,
>>
>> I'm experimenting with corosync+dlm+gfs2 (approximately following
>> http://people.redhat.com/teigland/cluster4-gfs2-dlm.txt) and am trying
>> to establish whether it meets my requirements. I have a query about a
>> node rejoining a cluster after failure, and want to make sure I'm not
>> overlooking something.
>>
>> I have a three-node cluster and deliberately cause token loss by
>> firewalling one of them (call it node A) out of the network for longer
>> than the token timeout. At this point, the other two hosts (B and C)
>> decide that A has disappeared and continue with quorum. That is fine.
>>
>> When I unfirewall node A, dlm tries to reconnect to its peers on B and
>> C. But then I see the following on host B:
>>
>> 16:29:25.823496 nodeb dlm_controld[6548]: 908 daemon node 85 stateful merge
>> 16:29:25.823529 nodeb dlm_controld[6548]: 908 daemon node 85 kill due to
>> stateful merge
>> 16:29:25.823543 nodeb dlm_controld[6548]: 908 tell corosync to remove
>> nodeid 85 from cluster
>> 16:29:25.823696 nodeb corosync[6536]:  [CFG   ] request to kill node
>> 85(us=83): xxx
>>
>> and then the following on node A:
>>
>> 16:29:25.828547 nodea corosync[3896]:  [CFG   ] Killed by node 83:
>> dlm_controld
>> 16:29:25.828575 nodea corosync[3896]:  [MAIN  ] Corosync Cluster Engine
>> exiting with status -1 at cfg.c:530.
>> 16:29:25.834828 nodea dlm_controld[3466]: 1183 process_cluster_cfg
>> cfg_dispatch 2
>> 16:29:25.834871 nodea dlm_controld[3466]: 1183 cluster is down, exiting
>> 16:29:25.834886 nodea dlm_controld[3466]: 1183 process_cluster
>> quorum_dispatch 2
>> 16:29:25.834903 nodea dlm_controld[3466]: 1183 daemon cpg_dispatch error 2
>> 16:29:25.834917 nodea dlm_controld[3466]: 1183 cpg_dispatch error 2
>> 16:29:25.837152 nodea dlm_controld[3466]: 1183 abandoned lockspace mygfs2
>>
>> resulting in both corosync and dlm_controld exiting on node A.
>>
>> Later, if I try to manually restart corosync and dlm on node A, I see
>> the following:
>>
>> 16:32:08.382871 nodea dlm_controld[20483]: 2872 dlm_controld 4.0.2 started
>> 16:32:08.392453 nodea dlm_controld[20483]: 2872 found uncontrolled
>> lockspace mygfs2
>> 16:32:08.392477 nodea dlm_controld[20483]: 2872 tell corosync to remove
>> nodeid 85 from cluster
>> 16:32:08.394965 nodea corosync[20456]:  [CFG   ] request to kill node
>> 85(us=85): xxx
>> 16:32:08.394998 nodea corosync[20456]:  [CFG   ] Killed by node 85:
>> dlm_controld
>>
>> The only way of making A rejoin the cluster is to reboot.
>>
>
> Yes. You need to implement fencing, so that the node will automatically
> be restarted when it leaves the cluster.
>
> Chrissie
You'll probably have to use this patch to make fencing work as expected:
https://github.com/ClusterLabs/pacemaker/pull/839
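
If it helps with triage, the failure sequence in the quoted logs can be spotted mechanically. What follows is only a minimal Python sketch, not part of dlm or corosync. The names MARKERS and classify are hypothetical, and the patterns are taken solely from the log lines quoted in this thread, so they are an assumption and may not match other versions. It expects each log entry on a single line, without the wrapping the mail client added.

```python
import re

# Patterns copied from the log excerpts quoted in this thread (assumption:
# other dlm_controld/corosync versions may word these messages differently).
MARKERS = [
    ("stateful_merge", re.compile(r"daemon node (\d+) stateful merge")),
    ("corosync_killed", re.compile(r"Killed by node (\d+)")),
    ("uncontrolled_lockspace", re.compile(r"found uncontrolled lockspace (\S+)")),
]


def classify(lines):
    """Return (event, detail) tuples in the order the markers appear.

    detail is the node id for the first two events and the lockspace
    name for the third.
    """
    events = []
    for line in lines:
        for name, pattern in MARKERS:
            match = pattern.search(line)
            if match:
                events.append((name, match.group(1)))
    return events
```

Calling classify() on the unwrapped lines from nodeb and nodea would yield the three stages in order: the stateful merge seen on B, corosync being killed on A, and the uncontrolled lockspace found after the manual restart. The last event is the one that, per this thread, leaves a reboot (or fencing) as the way back.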
>
>
>> I would be grateful if you could confirm the following statements:
>>    (a) The "stateful merge" is unavoidable when node A leaves the cluster
>> for longer than the token timeout then tries to rejoin.
>>    (b) Killing corosync on node A is unavoidable when node B sees the
>> "stateful merge".
>>    (c) dlm exiting is unavoidable when corosync dies.
>>    (d) Restarting corosync then dlm on node A will necessarily result in
>> "found uncontrolled lockspace".
>>    (e) The only way to recover from "found uncontrolled lockspace" (for a
>> gfs2 lockspace) is to reboot.
>>
>> I'm hoping that I'm overlooking something and that at least one of
>> (a)--(e) is false! I'm not comfortable with a reboot being the only
>> means of recovery when the token timeout is exceeded.
>>
>> Thanks,
>> Jonathan
>>