[Linux-cluster] Rejoin cluster after failure without reboot?

Wed Nov 25 17:10:05 UTC 2015

On Wed, Nov 25, 2015 at 03:22:18PM +0000, Jonathan Davies wrote:
> 16:32:08.392453 nodea dlm_controld[20483]: 2872 found uncontrolled
> lockspace mygfs2

> The only way of making A rejoin the cluster is to reboot.

That's expected because we don't have the ability to clear the dlm and
gfs2 kernel state that was left behind when dlm_controld exited.  A reboot
is the only way to clear it.
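
For anyone who wants to detect this condition from logs rather than by eye,
here is a minimal sketch. It assumes the message always has the form of the
one dlm_controld line quoted above ("found uncontrolled lockspace NAME");
other versions may word it differently. The function name and the sample
text are illustrative only, and nothing here clears the kernel state.

```python
import re

# Pattern taken from the single log line quoted in this thread; treating it
# as the general message format is an assumption.
UNCONTROLLED = re.compile(r"found uncontrolled\s+lockspace\s+(\S+)")


def uncontrolled_lockspaces(log_text):
    """Return lockspace names reported as uncontrolled, in order, no repeats."""
    # Mail clients prefix quoted lines with "> " and wrap long lines;
    # strip the quote markers and let \s+ in the pattern span the wrap.
    unquoted = re.sub(r"^\s*>\s?", "", log_text, flags=re.MULTILINE)
    names = []
    for match in UNCONTROLLED.finditer(unquoted):
        name = match.group(1)
        if name not in names:
            names.append(name)
    return names


# Hypothetical input: the log line from this thread, wrapped as it was quoted.
sample = (
    "16:32:08.392453 nodea dlm_controld[20483]: 2872 found uncontrolled\n"
    "lockspace mygfs2\n"
)
print(uncontrolled_lockspaces(sample))  # ['mygfs2']
```

A non-empty result means the node still holds lockspace state from before
dlm_controld exited, and will need the reboot described above before it can
rejoin.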

> I would be grateful if you could confirm the following statements:
>   (a) The "stateful merge" is unavoidable when node A leaves the
> cluster for longer than the token timeout then tries to rejoin.

correct

>   (b) Killing corosync on node A is unavoidable when node B sees the
> "stateful merge".

correct

>   (c) dlm exiting is unavoidable when corosync dies.

correct

>   (d) Restarting corosync then dlm on node A will necessarily result
> in "found uncontrolled lockspace".

correct

>   (e) The only way to recover from "found uncontrolled lockspace"
> (for a gfs2 lockspace) is to reboot.

correct

> I'm hoping that I'm overlooking something and that at least one of
> (a)--(e) is false! I'm not comfortable with a reboot being the only
> means of recovery when the token timeout is exceeded.

It's the nature of the beast, I'm afraid -- an effect of the extremely
tight coupling of components that's needed to make gfs2 semantics as near
as possible to those of a local fs.  File systems willing to diverge a
little more from local fs behavior are generally more forgiving.

Dave