[Linux-cluster] CLVM & CMAN live adding nodes
Bjoern Teipel
bjoern.teipel at internetbrands.com
Mon Feb 24 19:45:59 UTC 2014
Thanks Chrissie,
that was an old artifact from testing with two nodes.
I have now set expected votes to 4 (the 3 existing nodes in the cluster
plus one new), but I still have the same issue.
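For reference, a sketch of the kind of change I mean (assuming one vote
per node, and bumping config_version in cluster.conf as usual):

    # at runtime, on one of the running nodes
    cman_tool expected -e 4

    # persistently, in cluster.conf
    <cman expected_votes="4"/>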
It seems the new node can't gain quorum over corosync; I see multicast
packets flowing over the wire, but the quorum membership stays static:
Feb 24 11:29:09 corosync [QUORUM] Members[3]: 1 2 3
cman_tool status on node01:

Version: 6.2.0
Config Version: 4
Cluster Name: hv-1618-106-1
Cluster Id: 11612
Cluster Member: Yes
Cluster Generation: 244
Membership state: Cluster-Member
Nodes: 3
Expected votes: 4
Total votes: 3
Node votes: 1
Quorum: 3
Active subsystems: 8
Flags:
Ports Bound: 0 11
Node name: node01
Node ID: 1
Multicast addresses: 239.192.45.137
Node addresses: 10.14.10.6
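(With expected_votes=4, cman's quorum works out to 4/2 + 1 = 3 votes,
which matches the Quorum: 3 above: the three existing members stay
quorate, while node04 alone, with its single vote, can never reach 3.)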
On Node04:
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... Timed-out waiting for cluster
[FAILED]
Stopping cluster:
Leaving fence domain... [ OK ]
Stopping gfs_controld... [ OK ]
Stopping dlm_controld... [ OK ]
Stopping fenced... [ OK ]
Stopping cman... [ OK ]
Waiting for corosync to shutdown: [ OK ]
Unloading kernel modules... [ OK ]
Unmounting configfs... [ OK ]
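(If it were only the init script giving up, the quorum wait could be
tuned in /etc/sysconfig/cman; a sketch, assuming the stock CentOS 6
init script:

    # /etc/sysconfig/cman
    CMAN_QUORUM_TIMEOUT=0    # 0 = do not wait for quorum at startup

but here the node never becomes a member at all, see below.)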
Node status (cman_tool nodes):
Node Sts Inc Joined Name
1 M 236 2014-02-24 00:22:32 node01
2 M 240 2014-02-24 00:22:34 node02
3 M 244 2014-02-24 00:22:38 node03
4 X 0 node04
On Mon, Feb 24, 2014 at 2:25 AM, Christine Caulfield <ccaulfie at redhat.com> wrote:
> On 24/02/14 08:39, Bjoern Teipel wrote:
>
>> Hi Fabio,
>>
>> removing UDPU does not change the behavior: the new node still doesn't
>> join the cluster and still wants to fence node01. It still feels like
>> a split brain.
>> How do you join a new node: with /etc/init.d/cman start, or with
>> cman_tool join / dlm_tool join?
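>> For what it's worth, what I run on the new node today is just the init
>> script, which should do the join itself, e.g.:
>>
>>     service cman start    # runs cman_tool join internally
>>     service clvmd start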
>>
>> Bjoern
>>
>>
>> On Sat, Feb 22, 2014 at 10:16 PM, Fabio M. Di Nitto
>> <fdinitto at redhat.com> wrote:
>>
>> On 02/22/2014 08:05 PM, Bjoern Teipel wrote:
>> > Thanks Fabio for replying to my request.
>> >
>> > I'm using stock CentOS 6.4 versions and no rm (rgmanager), just clvmd
>> > and dlm.
>> >
>> > Name    : cman            Relocations: (not relocatable)
>> > Version : 3.0.12.1        Vendor: CentOS
>> > Release : 49.el6_4.2      Build Date: Tue 03 Sep 2013 02:18:10 AM PDT
>> >
>> > Name    : lvm2-cluster    Relocations: (not relocatable)
>> > Version : 2.02.98         Vendor: CentOS
>> > Release : 9.el6_4.3       Build Date: Tue 05 Nov 2013 07:36:18 AM PST
>> >
>> > Name    : corosync        Relocations: (not relocatable)
>> > Version : 1.4.1           Vendor: CentOS
>> > Release : 15.el6_4.1      Build Date: Tue 14 May 2013 02:09:27 PM PDT
>> >
>> > My question is based on this problem I have had since January:
>> >
>> >
>> > Whenever I add a new node (I put it into cluster.conf and reload
>> > with cman_tool version -r -S), I end up with situations where the
>> > new node tries to gain quorum on its own and starts to fence the
>> > existing pool master, which appears to create some sort of split
>> > cluster. Does it work at all when corosync and dlm do not yet know
>> > about the recently added node?
>>
>> I can see you are using UDPU and that could be the culprit. Can you
>> drop UDPU and work with multicast?
>>
>> Jan/Chrissie: do you remember if we support adding nodes at runtime
>> with UDPU?
>>
>> The standalone node should not have quorum at all and should not be
>> able to fence anybody to start with.
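>> (To try multicast it should be enough to drop transport="udpu" from
>> the <cman> element; cman then derives a multicast address from the
>> cluster id. To pin it explicitly, something like this, where the
>> address is only an example:
>>
>>     <cman>
>>       <multicast addr="239.192.0.1"/>
>>     </cman>
>> )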
>>
>> >
>> > New Node
>> > ==========
>> >
>> > Node Sts Inc Joined Name
>> > 1 X 0 hv-1
>> > 2 X 0 hv-2
>> > 3 X 0 hv-3
>> > 4 X 0 hv-4
>> > 5 X 0 hv-5
>> > 6 M 80 2014-01-07 21:37:42 hv-6   <--- host added
>> >
>> >
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] The network interface [10.14.18.77] is now up.
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [QUORUM] Using quorum provider quorum_cman
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [CMAN ] CMAN 3.0.12.1 (built Sep 3 2013 09:17:34) started
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync CMAN membership service 2.90
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: openais checkpoint service B.01.01
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync extended virtual synchrony service
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync configuration service
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync cluster config database access v1.01
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync profile loading service
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [QUORUM] Using quorum provider quorum_cman
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.65}
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.67}
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.68}
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.70}
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.66}
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.77}
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [CMAN ] quorum regained, resuming activity
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [QUORUM] This node is within the primary component and will provide service.
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [QUORUM] Members[1]: 6
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [QUORUM] Members[1]: 6
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [CPG ] chosen downlist: sender r(0) ip(10.14.18.77) ; members(old:0 left:0)
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [MAIN ] Completed service synchronization, ready to provide service.
>> > Jan 7 21:37:46 hv-1 fenced[12620]: fenced 3.0.12.1 started
>> > Jan 7 21:37:46 hv-1 dlm_controld[12643]: dlm_controld 3.0.12.1 started
>> > Jan 7 21:37:47 hv-1 gfs_controld[12695]: gfs_controld 3.0.12.1 started
>> > Jan 7 21:37:54 hv-1 fenced[12620]: fencing node hv-b1clcy1
>> >
>> > sudo -i corosync-objctl |grep member
>> >
>> > totem.interface.member.memberaddr=hv-1
>> > totem.interface.member.memberaddr=hv-2
>> > totem.interface.member.memberaddr=hv-3
>> > totem.interface.member.memberaddr=hv-4
>> > totem.interface.member.memberaddr=hv-5
>> > totem.interface.member.memberaddr=hv-6
>> > runtime.totem.pg.mrp.srp.members.6.ip=r(0) ip(10.14.18.77)
>> > runtime.totem.pg.mrp.srp.members.6.join_count=1
>> > runtime.totem.pg.mrp.srp.members.6.status=joined
>> >
>> >
>> > Existing Node
>> > =============
>> >
>> > Member 6 has not been added to the quorum list:
>> >
>> > Jan 7 21:36:28 hv-1 corosync[7769]: [QUORUM] Members[4]: 1 2 3 5
>> > Jan 7 21:37:54 hv-1 corosync[7769]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> > Jan 7 21:37:54 hv-1 corosync[7769]: [CPG ] chosen downlist: sender r(0) ip(10.14.18.65) ; members(old:4 left:0)
>> > Jan 7 21:37:54 hv-1 corosync[7769]: [MAIN ] Completed service synchronization, ready to provide service.
>> >
>> >
>> > Node Sts Inc Joined Name
>> > 1 M 4468 2013-12-10 14:33:27 hv-1
>> > 2 M 4468 2013-12-10 14:33:27 hv-2
>> > 3 M 5036 2014-01-07 17:51:26 hv-3
>> > 4 X 4468                     hv-4   (dead at the moment)
>> > 5 M 4468 2013-12-10 14:33:27 hv-5
>> > 6 X 0                        hv-6   <--- added
>> >
>> >
>> > totem.interface.member.memberaddr=hv-1
>> > totem.interface.member.memberaddr=hv-2
>> > totem.interface.member.memberaddr=hv-3
>> > totem.interface.member.memberaddr=hv-4
>> > totem.interface.member.memberaddr=hv-5
>> > runtime.totem.pg.mrp.srp.members.1.ip=r(0) ip(10.14.18.65)
>> > runtime.totem.pg.mrp.srp.members.1.join_count=1
>> > runtime.totem.pg.mrp.srp.members.1.status=joined
>> > runtime.totem.pg.mrp.srp.members.2.ip=r(0) ip(10.14.18.66)
>> > runtime.totem.pg.mrp.srp.members.2.join_count=1
>> > runtime.totem.pg.mrp.srp.members.2.status=joined
>> > runtime.totem.pg.mrp.srp.members.4.ip=r(0) ip(10.14.18.68)
>> > runtime.totem.pg.mrp.srp.members.4.join_count=1
>> > runtime.totem.pg.mrp.srp.members.4.status=left
>> > runtime.totem.pg.mrp.srp.members.5.ip=r(0) ip(10.14.18.70)
>> > runtime.totem.pg.mrp.srp.members.5.join_count=1
>> > runtime.totem.pg.mrp.srp.members.5.status=joined
>> > runtime.totem.pg.mrp.srp.members.3.ip=r(0) ip(10.14.18.67)
>> > runtime.totem.pg.mrp.srp.members.3.join_count=3
>> > runtime.totem.pg.mrp.srp.members.3.status=joined
>> >
>> >
>> > cluster.conf:
>> >
>> > <?xml version="1.0"?>
>> > <cluster config_version="32" name="hv-1618-110-1">
>> > <fence_daemon clean_start="0"/>
>> > <cman transport="udpu" expected_votes="1"/>
>>
>
>
> Setting expected_votes to 1 in a six-node cluster is a serious
> configuration error and needs to be changed. That is what is causing
> the new node to fence the rest of the cluster.
>
> Check that all of the nodes have the same cluster.conf file; any
> difference between the one on the existing nodes and the new one will
> prevent the new node from joining too.
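> With one vote per node, something sane for six nodes would be, e.g.:
>
>     <cman transport="udpu" expected_votes="6"/>
>
> or drop the expected_votes attribute entirely and let cman calculate
> it from the nodes listed in cluster.conf.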
>
> Chrissie
>
>
>