[Linux-cluster] [cman] can't join cluster after reboot

Vishesh kumar linuxtovishesh at gmail.com
Thu Nov 7 13:11:27 UTC 2013


My understanding is that the node was fenced while rebooting. I suggest you look into
the fencing logs as well. If your fencing logs are not detailed enough, add the following
to cluster.conf to enable debug logging:

<logging>
    <logging_daemon name="fenced" debug="on"/>
</logging>
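
Once that is in place, you can check on one of the surviving nodes whether node-1 was
actually fenced around the time of its reboot. A rough sketch, assuming the usual
cluster3 defaults on CentOS 6 for the log locations:

    # show fence domain members and whether any node is waiting to be fenced
    fence_tool ls
    # look for fence events around the reboot time
    grep -i fence /var/log/cluster/fenced.log
    grep -i fence /var/log/messages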


Thanks


On Thu, Nov 7, 2013 at 5:34 PM, Yuriy Demchenko <demchenko.ya at gmail.com> wrote:

> Hi,
>
> I'm trying to set up a 3-node cluster (2 nodes + 1 standby node for quorum)
> with the cman+pacemaker stack, following this quickstart article:
> http://clusterlabs.org/quickstart-redhat.html
>
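The cluster.conf for this kind of setup, following the quickstart above, would look
roughly like the sketch below. The node names, cluster name and config version are
taken from the cman_tool output further down; the fence_pcmk hand-off to pacemaker is
how the quickstart does it and is an assumption here, not the poster's actual file:

    <?xml version="1.0"?>
    <cluster name="ocluster" config_version="10">
      <!-- one vote per node, quorum at 2 of 3 -->
      <cman expected_votes="3"/>
      <clusternodes>
        <clusternode name="node-1.spb.stone.local" nodeid="1">
          <fence>
            <method name="pcmk-redirect">
              <!-- redirect fencing requests to pacemaker's stonith -->
              <device name="pcmk" port="node-1.spb.stone.local"/>
            </method>
          </fence>
        </clusternode>
        <clusternode name="node-2.spb.stone.local" nodeid="2">
          <fence>
            <method name="pcmk-redirect">
              <device name="pcmk" port="node-2.spb.stone.local"/>
            </method>
          </fence>
        </clusternode>
        <clusternode name="vnode-3.spb.stone.local" nodeid="3">
          <fence>
            <method name="pcmk-redirect">
              <device name="pcmk" port="vnode-3.spb.stone.local"/>
            </method>
          </fence>
        </clusternode>
      </clusternodes>
      <fencedevices>
        <fencedevice name="pcmk" agent="fence_pcmk"/>
      </fencedevices>
    </cluster>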
> Cluster starts, all nodes see each other, quorum is gained, stonith works,
> but I've run into a problem with cman: a node can't join the cluster after a
> reboot - cman starts, and cman_tool nodes reports only that node as a cluster
> member, while on the other 2 nodes it reports 2 nodes as cluster members and
> the 3rd as offline. cman stop/start/restart on the problem node has no effect -
> it still sees only itself. But if I restart cman on one of the working nodes,
> everything goes back to normal: all 3 nodes join the cluster, and subsequent
> cman service restarts on any node work fine - the node leaves the cluster and
> rejoins successfully. But again - only until the node's OS is rebooted.
>
> For example:
> [1] Working cluster:
>
>> [root@node-1 ~]# cman_tool nodes
>> Node  Sts   Inc   Joined               Name
>>    1   M    592   2013-11-07 15:20:54  node-1.spb.stone.local
>>    2   M    760   2013-11-07 15:20:54  node-2.spb.stone.local
>>    3   M    760   2013-11-07 15:20:54  vnode-3.spb.stone.local
>> [root@node-1 ~]# cman_tool status
>> Version: 6.2.0
>> Config Version: 10
>> Cluster Name: ocluster
>> Cluster Id: 2059
>> Cluster Member: Yes
>> Cluster Generation: 760
>> Membership state: Cluster-Member
>> Nodes: 3
>> Expected votes: 3
>> Total votes: 3
>> Node votes: 1
>> Quorum: 2
>> Active subsystems: 7
>> Flags:
>> Ports Bound: 0
>> Node name: node-1.spb.stone.local
>> Node ID: 1
>> Multicast addresses: 239.192.8.19
>> Node addresses: 192.168.220.21
>>
> The picture is the same on all 3 nodes (except for node name and id) - same
> cluster name, cluster id, and multicast address.
>
> [2] I rebooted node-1. After the reboot completed, "cman_tool nodes"
> on node-2 and vnode-3 shows this:
>
>> Node  Sts   Inc   Joined               Name
>>    1   X    760                        node-1.spb.stone.local
>>    2   M    588   2013-11-07 15:11:23  node-2.spb.stone.local
>>    3   M    760   2013-11-07 15:20:54  vnode-3.spb.stone.local
>> [root@node-2 ~]# cman_tool status
>> Version: 6.2.0
>> Config Version: 10
>> Cluster Name: ocluster
>> Cluster Id: 2059
>> Cluster Member: Yes
>> Cluster Generation: 764
>> Membership state: Cluster-Member
>> Nodes: 2
>> Expected votes: 3
>> Total votes: 2
>> Node votes: 1
>> Quorum: 2
>> Active subsystems: 7
>> Flags:
>> Ports Bound: 0
>> Node name: node-2.spb.stone.local
>> Node ID: 2
>> Multicast addresses: 239.192.8.19
>> Node addresses: 192.168.220.22
>>
> But on the rebooted node-1 it shows this:
>
>> Node  Sts   Inc   Joined               Name
>>    1   M    764   2013-11-07 15:49:01  node-1.spb.stone.local
>>    2   X      0                        node-2.spb.stone.local
>>    3   X      0                        vnode-3.spb.stone.local
>> [root@node-1 ~]# cman_tool status
>> Version: 6.2.0
>> Config Version: 10
>> Cluster Name: ocluster
>> Cluster Id: 2059
>> Cluster Member: Yes
>> Cluster Generation: 776
>> Membership state: Cluster-Member
>> Nodes: 1
>> Expected votes: 3
>> Total votes: 1
>> Node votes: 1
>> Quorum: 2 Activity blocked
>> Active subsystems: 7
>> Flags:
>> Ports Bound: 0
>> Node name: node-1.spb.stone.local
>> Node ID: 1
>> Multicast addresses: 239.192.8.19
>> Node addresses: 192.168.220.21
>>
> So, same cluster name, cluster id, and multicast address - but it can't see the
> other nodes. And there is nothing in /var/log/messages or
> /var/log/cluster/corosync.log on the other two nodes - they don't seem to notice
> node-1 coming back online at all; the last records are about node-1 leaving the cluster.
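A quick way to confirm that on node-2 or vnode-3 after node-1 boots is to pull the
recent membership and quorum messages out of the corosync log mentioned above, e.g.:

    # last membership/quorum changes as corosync recorded them
    grep -e TOTEM -e QUORUM /var/log/cluster/corosync.log | tail -n 20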
>
> [3] If I now do "service cman restart" on node-2 or vnode-3, everything
> goes back to normal operation as in [1].
> In the logs this shows up as node-2 leaving the cluster (service stop) and
> then both node-2 and node-1 joining simultaneously (service start):
>
>> Nov  7 11:47:06 vnode-3 corosync[26692]: [QUORUM] Members[2]: 2 3
>> Nov  7 11:47:06 vnode-3 corosync[26692]:   [TOTEM ] A processor joined or
>> left the membership and a new membership was formed.
>> Nov  7 11:47:06 vnode-3 kernel: dlm: closing connection to node 1
>> Nov  7 11:47:06 vnode-3 corosync[26692]:   [CPG   ] chosen downlist:
>> sender r(0) ip(192.168.220.22) ; members(old:3 left:1)
>> Nov  7 11:47:06 vnode-3 corosync[26692]:   [MAIN  ] Completed service
>> synchronization, ready to provide service.
>> Nov  7 11:53:28 vnode-3 corosync[26692]:   [QUORUM] Members[1]: 3
>> Nov  7 11:53:28 vnode-3 corosync[26692]:   [TOTEM ] A processor joined or
>> left the membership and a new membership was formed.
>> Nov  7 11:53:28 vnode-3 corosync[26692]:   [CPG   ] chosen downlist:
>> sender r(0) ip(192.168.220.14) ; members(old:2 left:1)
>> Nov  7 11:53:28 vnode-3 corosync[26692]:   [MAIN  ] Completed service
>> synchronization, ready to provide service.
>> Nov  7 11:53:28 vnode-3 kernel: dlm: closing connection to node 2
>> Nov  7 11:53:30 vnode-3 corosync[26692]:   [TOTEM ] A processor joined or
>> left the membership and a new membership was formed.
>> Nov  7 11:53:30 vnode-3 corosync[26692]:   [QUORUM] Members[2]: 1 3
>> Nov  7 11:53:30 vnode-3 corosync[26692]:   [QUORUM] Members[2]: 1 3
>> Nov  7 11:53:30 vnode-3 corosync[26692]:   [QUORUM] Members[3]: 1 2 3
>> Nov  7 11:53:30 vnode-3 corosync[26692]:   [QUORUM] Members[3]: 1 2 3
>> Nov  7 11:53:30 vnode-3 corosync[26692]:   [QUORUM] Members[3]: 1 2 3
>> Nov  7 11:53:30 vnode-3 corosync[26692]:   [CPG   ] chosen downlist:
>> sender r(0) ip(192.168.220.21) ; members(old:1 left:0)
>> Nov  7 11:53:30 vnode-3 corosync[26692]:   [MAIN  ] Completed service
>> synchronization, ready to provide service.
>>
>
> I've set up such a cluster before in much the same configuration and never had
> any problems, but now I'm completely stuck.
> So, what is wrong with my cluster and how do I fix it?
>
> OS is CentOS 6.4 with latest updates, firewall disabled, selinux permissive,
> all 3 nodes on the same network. Multicast is working - checked with omping
> (see the example after the package list below).
> cman.x86_64                   3.0.12.1-49.el6_4.2 @centos6-updates
> corosync.x86_64               1.4.1-15.el6_4.1 @centos6-updates
> pacemaker.x86_64              1.1.10-1.el6_4.4 @centos6-updates
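For reference, multicast on the cluster's group address can be re-checked on all three
nodes at once with something like the following (the group address is from the cman_tool
status output above; port 5405 is the usual corosync default and is an assumption here):

    omping -c 10 -p 5405 -m 239.192.8.19 \
        node-1.spb.stone.local node-2.spb.stone.local vnode-3.spb.stone.local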
>
> cluster.conf is attached.
>
> --
> Yuriy Demchenko
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
http://linuxmantra.com

