[Linux-cluster] Two-node cluster: Node attempts stateful merge after clean reboot

Wed Sep 11 12:50:11 UTC 2013

> The problem is that, if you enable cman on boot, the fenced node will
> try to join the cluster, fail to reach its peer after post_join_delay
> (default 6 seconds, iirc) and fence its peer. That peer reboots,
> starts cman, tries to connect, fences its peer...
>
> The easiest way to avoid this in 2-node clusters is to not let
> cman/rgmanager start automatically. That way, if a node is fenced, it
> will boot back up and you can log into it remotely (assuming it's not
> totally dead). When you know things are fixed, manually start cman.
>
In my case, however, the node that is trying to join is fully operational
and has network access. Also, if you look at the configuration in my
original email, my post_join_delay is 360 (for testing purposes), so a
timeout cannot be what triggers the fencing.

I might be wrong here, but judging from corosync's log file, the other
node even joins the cluster successfully, before being marked for
fencing by dlm_controld:

    Sep 11 11:14:09 corosync [CLM   ] CLM CONFIGURATION CHANGE
    Sep 11 11:14:09 corosync [CLM   ] New Configuration:
    Sep 11 11:14:09 corosync [CLM   ]     r(0) ip(10.xx.xx.1)
    Sep 11 11:14:09 corosync [CLM   ]     r(0) ip(10.xx.xx.2)
    Sep 11 11:14:09 corosync [CLM   ] Members Left:
    Sep 11 11:14:09 corosync [CLM   ] Members Joined:
    Sep 11 11:14:09 corosync [CLM   ]     r(0) ip(10.xx.xx.2)
    Sep 11 11:14:09 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
    Sep 11 11:14:09 corosync [QUORUM] Members[2]: 1 2
    Sep 11 11:14:09 corosync [QUORUM] Members[2]: 1 2
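
In case it helps anyone check the same thing on their own nodes, here is
a rough sketch of how the join could be pulled out of the log with a
small script. It assumes the CLM line format shown in the excerpt above;
the function name joined_members and the regular expression are my own
illustration, not part of corosync or cman.

```python
import re

# Matches the address in corosync CLM member lines such as
# "... corosync [CLM   ]     r(0) ip(10.xx.xx.2)"
# (format assumed from the log excerpt above).
MEMBER_RE = re.compile(r"\[CLM\s*\]\s+r\(\d+\)\s+ip\(([^)]+)\)")


def joined_members(log_lines):
    """Return the addresses listed under each 'Members Joined:' heading."""
    joined = []
    in_joined = False
    for line in log_lines:
        if "Members Joined:" in line:
            in_joined = True
            continue
        match = MEMBER_RE.search(line)
        if in_joined and match:
            joined.append(match.group(1))
        elif not match:
            # Any non-member line ends the current "Members Joined:" list.
            in_joined = False
    return joined


sample = [
    "Sep 11 11:14:09 corosync [CLM   ] New Configuration:",
    "Sep 11 11:14:09 corosync [CLM   ]     r(0) ip(10.xx.xx.1)",
    "Sep 11 11:14:09 corosync [CLM   ]     r(0) ip(10.xx.xx.2)",
    "Sep 11 11:14:09 corosync [CLM   ] Members Left:",
    "Sep 11 11:14:09 corosync [CLM   ] Members Joined:",
    "Sep 11 11:14:09 corosync [CLM   ]     r(0) ip(10.xx.xx.2)",
    "Sep 11 11:14:09 corosync [QUORUM] Members[2]: 1 2",
]
print(joined_members(sample))
```

Run against the excerpt above, this reports only the second node's
address as joined, which is how I read the log as well.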
