[Linux-cluster] [cman] can't join cluster after reboot

Yuriy Demchenko demchenko.ya at gmail.com
Thu Nov 7 12:04:30 UTC 2013


Hi,

I'm trying to set up a 3-node cluster (2 nodes + 1 standby node for 
quorum) with the cman+pacemaker stack, following this quickstart 
article: http://clusterlabs.org/quickstart-redhat.html
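
For reference, the cluster.conf (attached below) was built with the ccs 
commands from that article; from memory it was roughly the following (the 
exact invocations are approximate - the attached cluster.conf is what 
actually came out of them):

> ccs -f /etc/cluster/cluster.conf --createcluster ocluster
> ccs -f /etc/cluster/cluster.conf --addnode node-1.spb.stone.local
> ccs -f /etc/cluster/cluster.conf --addnode node-2.spb.stone.local
> ccs -f /etc/cluster/cluster.conf --addnode vnode-3.spb.stone.local
> ccs -f /etc/cluster/cluster.conf --addfencedev pcmk agent=fence_pcmk
> ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect node-1.spb.stone.local
> ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk node-1.spb.stone.local pcmk-redirect port=node-1
> ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect node-2.spb.stone.local
> ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk node-2.spb.stone.local pcmk-redirect port=node-2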

The cluster starts, all nodes see each other, quorum is gained, stonith 
works - but I've run into a problem with cman: a node can't join the 
cluster after a reboot. cman starts, but "cman_tool nodes" reports only 
that node as a cluster member, while on the other 2 nodes it reports 2 
nodes as cluster members and the 3rd as offline. A cman 
stop/start/restart on the problem node has no effect - it still sees 
only itself - but if I restart cman on one of the working nodes, 
everything goes back to normal: all 3 nodes join the cluster, and 
subsequent cman service restarts on any node work fine - the node leaves 
the cluster and rejoins successfully. But again - only until the node's 
OS is rebooted.

For example:
[1] Working cluster:
> [root@node-1 ~]# cman_tool nodes
> Node  Sts   Inc   Joined               Name
>    1   M    592   2013-11-07 15:20:54  node-1.spb.stone.local
>    2   M    760   2013-11-07 15:20:54  node-2.spb.stone.local
>    3   M    760   2013-11-07 15:20:54  vnode-3.spb.stone.local
> [root@node-1 ~]# cman_tool status
> Version: 6.2.0
> Config Version: 10
> Cluster Name: ocluster
> Cluster Id: 2059
> Cluster Member: Yes
> Cluster Generation: 760
> Membership state: Cluster-Member
> Nodes: 3
> Expected votes: 3
> Total votes: 3
> Node votes: 1
> Quorum: 2
> Active subsystems: 7
> Flags:
> Ports Bound: 0
> Node name: node-1.spb.stone.local
> Node ID: 1
> Multicast addresses: 239.192.8.19
> Node addresses: 192.168.220.21
The picture is the same on all 3 nodes (except for node name and id) - same 
cluster name, cluster id, and multicast address.

[2] I rebooted node-1. After the reboot completed, "cman_tool 
nodes" on node-2 and vnode-3 shows this:
> Node  Sts   Inc   Joined               Name
>    1   X    760                        node-1.spb.stone.local
>    2   M    588   2013-11-07 15:11:23  node-2.spb.stone.local
>    3   M    760   2013-11-07 15:20:54  vnode-3.spb.stone.local
> [root@node-2 ~]# cman_tool status
> Version: 6.2.0
> Config Version: 10
> Cluster Name: ocluster
> Cluster Id: 2059
> Cluster Member: Yes
> Cluster Generation: 764
> Membership state: Cluster-Member
> Nodes: 2
> Expected votes: 3
> Total votes: 2
> Node votes: 1
> Quorum: 2
> Active subsystems: 7
> Flags:
> Ports Bound: 0
> Node name: node-2.spb.stone.local
> Node ID: 2
> Multicast addresses: 239.192.8.19
> Node addresses: 192.168.220.22
But on the rebooted node-1 it shows this:
> Node  Sts   Inc   Joined               Name
>    1   M    764   2013-11-07 15:49:01  node-1.spb.stone.local
>    2   X      0                        node-2.spb.stone.local
>    3   X      0                        vnode-3.spb.stone.local
> [root@node-1 ~]# cman_tool status
> Version: 6.2.0
> Config Version: 10
> Cluster Name: ocluster
> Cluster Id: 2059
> Cluster Member: Yes
> Cluster Generation: 776
> Membership state: Cluster-Member
> Nodes: 1
> Expected votes: 3
> Total votes: 1
> Node votes: 1
> Quorum: 2 Activity blocked
> Active subsystems: 7
> Flags:
> Ports Bound: 0
> Node name: node-1.spb.stone.local
> Node ID: 1
> Multicast addresses: 239.192.8.19
> Node addresses: 192.168.220.21
So: the same cluster name, cluster id, and multicast address - but it 
can't see the other nodes. And there is nothing in /var/log/messages or 
/var/log/cluster/corosync.log on the other two nodes - they don't seem to 
notice node-1 coming back online at all; the last records are about 
node-1 leaving the cluster.

[3] If I now do "service cman restart" on node-2 or vnode-3, everything 
goes back to normal operation as in [1].
In the logs this shows up as node-2 leaving the cluster (service stop) and 
then node-2 and node-1 joining simultaneously (service start):
> Nov  7 11:47:06 vnode-3 corosync[26692]: [QUORUM] Members[2]: 2 3
> Nov  7 11:47:06 vnode-3 corosync[26692]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Nov  7 11:47:06 vnode-3 kernel: dlm: closing connection to node 1
> Nov  7 11:47:06 vnode-3 corosync[26692]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.220.22) ; members(old:3 left:1)
> Nov  7 11:47:06 vnode-3 corosync[26692]:   [MAIN  ] Completed service synchronization, ready to provide service.
> Nov  7 11:53:28 vnode-3 corosync[26692]:   [QUORUM] Members[1]: 3
> Nov  7 11:53:28 vnode-3 corosync[26692]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Nov  7 11:53:28 vnode-3 corosync[26692]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.220.14) ; members(old:2 left:1)
> Nov  7 11:53:28 vnode-3 corosync[26692]:   [MAIN  ] Completed service synchronization, ready to provide service.
> Nov  7 11:53:28 vnode-3 kernel: dlm: closing connection to node 2
> Nov  7 11:53:30 vnode-3 corosync[26692]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Nov  7 11:53:30 vnode-3 corosync[26692]:   [QUORUM] Members[2]: 1 3
> Nov  7 11:53:30 vnode-3 corosync[26692]:   [QUORUM] Members[2]: 1 3
> Nov  7 11:53:30 vnode-3 corosync[26692]:   [QUORUM] Members[3]: 1 2 3
> Nov  7 11:53:30 vnode-3 corosync[26692]:   [QUORUM] Members[3]: 1 2 3
> Nov  7 11:53:30 vnode-3 corosync[26692]:   [QUORUM] Members[3]: 1 2 3
> Nov  7 11:53:30 vnode-3 corosync[26692]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.220.21) ; members(old:1 left:0)
> Nov  7 11:53:30 vnode-3 corosync[26692]:   [MAIN  ] Completed service synchronization, ready to provide service.
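
To be explicit, the only command behind that log excerpt is the restart on 
one of the surviving members, in this case node-2:

> [root@node-2 ~]# service cman restart

after which "cman_tool nodes" on all three hosts shows all members as "M" again.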

I've set up such a cluster before in much the same configuration and never 
had any problems, but now I'm completely stuck.
So, what is wrong with my cluster, and how do I fix it?

OS is CentOS 6.4 with the latest updates, firewall disabled, selinux 
permissive, and all 3 nodes are on the same network. Multicast is working - 
checked with omping.
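
(The omping check was along these lines, run simultaneously on all three 
hosts - exact flags from memory, the group is the one cman picked:

> omping -m 239.192.8.19 node-1.spb.stone.local node-2.spb.stone.local vnode-3.spb.stone.local

and every host got both unicast and multicast answers back from the others.)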
cman.x86_64                   3.0.12.1-49.el6_4.2 @centos6-updates
corosync.x86_64               1.4.1-15.el6_4.1 @centos6-updates
pacemaker.x86_64              1.1.10-1.el6_4.4 @centos6-updates

cluster.conf is attached.

-- 
Yuriy Demchenko

-------------- next part --------------
<cluster config_version="10" name="ocluster">
  <fence_daemon/>
  <clusternodes>
    <clusternode name="node-1.spb.stone.local" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="node-1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node-2.spb.stone.local" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="node-2"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="vnode-3.spb.stone.local" nodeid="3"/>
  </clusternodes>
  <cman/>
  <fencedevices>
    <fencedevice agent="fence_pcmk" name="pcmk"/>
  </fencedevices>
  <rm>
    <failoverdomains/>
    <resources/>
  </rm>
</cluster>


