[Linux-cluster] CLVM & CMAN live adding nodes
Bjoern Teipel
bjoern.teipel at internetbrands.com
Mon Feb 24 19:45:59 UTC 2014
Thanks Chrissie,
that was an old artifact from testing with two nodes.
I have now set expected votes to 4 (the 3 existing nodes in the cluster
plus one new), but I still have the same issue.
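For reference, a sketch of the kind of change I mean (assuming one vote
per node, and bumping config_version in cluster.conf as usual):

    # at runtime, on one of the running nodes
    cman_tool expected -e 4

    # persistently, in cluster.conf
    <cman expected_votes="4"/>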
It seems the new node can't gain quorum over corosync; I see multicast
packets flowing over the wire, but the quorum membership stays static:
Feb 24 11:29:09 corosync [QUORUM] Members[3]: 1 2 3
cman_tool status on node01:

Version: 6.2.0
Config Version: 4
Cluster Name: hv-1618-106-1
Cluster Id: 11612
Cluster Member: Yes
Cluster Generation: 244
Membership state: Cluster-Member
Nodes: 3
Expected votes: 4
Total votes: 3
Node votes: 1
Quorum: 3
Active subsystems: 8
Flags:
Ports Bound: 0 11
Node name: node01
Node ID: 1
Multicast addresses: 239.192.45.137
Node addresses: 10.14.10.6
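(With expected_votes=4, cman's quorum works out to 4/2 + 1 = 3 votes,
which matches the Quorum: 3 above: the three existing members stay
quorate, while node04 alone, with its single vote, can never reach 3.)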
On Node04:
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... Timed-out waiting for cluster
[FAILED]
Stopping cluster:
Leaving fence domain... [ OK ]
Stopping gfs_controld... [ OK ]
Stopping dlm_controld... [ OK ]
Stopping fenced... [ OK ]
Stopping cman... [ OK ]
Waiting for corosync to shutdown: [ OK ]
Unloading kernel modules... [ OK ]
Unmounting configfs... [ OK ]
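(If it were only the init script giving up, the quorum wait could be
tuned in /etc/sysconfig/cman; a sketch, assuming the stock CentOS 6
init script:

    # /etc/sysconfig/cman
    CMAN_QUORUM_TIMEOUT=0    # 0 = do not wait for quorum at startup

but here the node never becomes a member at all, see below.)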
Node status (cman_tool nodes):
Node Sts Inc Joined Name
1 M 236 2014-02-24 00:22:32 node01
2 M 240 2014-02-24 00:22:34 node02
3 M 244 2014-02-24 00:22:38 node03
4 X 0 node04
On Mon, Feb 24, 2014 at 2:25 AM, Christine Caulfield <ccaulfie at redhat.com> wrote:
> On 24/02/14 08:39, Bjoern Teipel wrote:
>
>> Hi Fabio,
>>
>> removing UDPU does not change the behavior: the new node still doesn't
>> join the cluster and still wants to fence node01. It still feels like
>> a split brain.
>> How do you join a new node: with /etc/init.d/cman start, or with
>> cman_tool join / dlm_tool join?
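>> For what it's worth, what I run on the new node today is just the init
>> script, which should do the join itself, e.g.:
>>
>>     service cman start    # runs cman_tool join internally
>>     service clvmd start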
>>
>> Bjoern
>>
>>
>> On Sat, Feb 22, 2014 at 10:16 PM, Fabio M. Di Nitto
>> <fdinitto at redhat.com> wrote:
>>
>> On 02/22/2014 08:05 PM, Bjoern Teipel wrote:
>> > Thanks Fabio for replying to my request.
>> >
>> > I'm using stock CentOS 6.4 versions and no rm (rgmanager), just clvmd
>> > and dlm.
>> >
>> > Name    : cman            Relocations: (not relocatable)
>> > Version : 3.0.12.1        Vendor: CentOS
>> > Release : 49.el6_4.2      Build Date: Tue 03 Sep 2013 02:18:10 AM PDT
>> >
>> > Name    : lvm2-cluster    Relocations: (not relocatable)
>> > Version : 2.02.98         Vendor: CentOS
>> > Release : 9.el6_4.3       Build Date: Tue 05 Nov 2013 07:36:18 AM PST
>> >
>> > Name    : corosync        Relocations: (not relocatable)
>> > Version : 1.4.1           Vendor: CentOS
>> > Release : 15.el6_4.1      Build Date: Tue 14 May 2013 02:09:27 PM PDT
>> >
>> > My question is based on this problem I have had since January:
>> >
>> >
>> > Whenever I add a new node (I put it into cluster.conf and reload
>> > with cman_tool version -r -S), I end up with situations where the
>> > new node tries to gain quorum on its own and starts to fence the
>> > existing pool master, which appears to create some sort of split
>> > cluster. Does it work at all when corosync and dlm do not yet know
>> > about the recently added node?
>>
>> I can see you are using UDPU and that could be the culprit. Can you
>> drop UDPU and work with multicast?
>>
>> Jan/Chrissie: do you remember if we support adding nodes at runtime
>> with UDPU?
>>
>> The standalone node should not have quorum at all and should not be
>> able to fence anybody to start with.
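>> (To try multicast it should be enough to drop transport="udpu" from
>> the <cman> element; cman then derives a multicast address from the
>> cluster id. To pin it explicitly, something like this, where the
>> address is only an example:
>>
>>     <cman>
>>       <multicast addr="239.192.0.1"/>
>>     </cman>
>> )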
>>
>> >
>> > New Node
>> > ==========
>> >
>> > Node Sts Inc Joined Name
>> > 1 X 0 hv-1
>> > 2 X 0 hv-2
>> > 3 X 0 hv-3
>> > 4 X 0 hv-4
>> > 5 X 0 hv-5
>> > 6 M 80 2014-01-07 21:37:42 hv-6   <--- host added
>> >
>> >
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] The network interface [10.14.18.77] is now up.
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [QUORUM] Using quorum provider quorum_cman
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [CMAN ] CMAN 3.0.12.1 (built Sep 3 2013 09:17:34) started
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync CMAN membership service 2.90
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: openais checkpoint service B.01.01
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync extended virtual synchrony service
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync configuration service
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync cluster config database access v1.01
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync profile loading service
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [QUORUM] Using quorum provider quorum_cman
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.65}
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.67}
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.68}
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.70}
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.66}
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.77}
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [CMAN ] quorum regained, resuming activity
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [QUORUM] This node is within the primary component and will provide service.
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [QUORUM] Members[1]: 6
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [QUORUM] Members[1]: 6
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [CPG ] chosen downlist: sender r(0) ip(10.14.18.77) ; members(old:0 left:0)
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [MAIN ] Completed service synchronization, ready to provide service.
>> > Jan 7 21:37:46 hv-1 fenced[12620]: fenced 3.0.12.1 started
>> > Jan 7 21:37:46 hv-1 dlm_controld[12643]: dlm_controld 3.0.12.1 started
>> > Jan 7 21:37:47 hv-1 gfs_controld[12695]: gfs_controld 3.0.12.1 started
>> > Jan 7 21:37:54 hv-1 fenced[12620]: fencing node hv-b1clcy1
>> >
>> > sudo -i corosync-objctl |grep member
>> >
>> > totem.interface.member.memberaddr=hv-1
>> > totem.interface.member.memberaddr=hv-2
>> > totem.interface.member.memberaddr=hv-3
>> > totem.interface.member.memberaddr=hv-4
>> > totem.interface.member.memberaddr=hv-5
>> > totem.interface.member.memberaddr=hv-6
>> > runtime.totem.pg.mrp.srp.members.6.ip=r(0) ip(10.14.18.77)
>> > runtime.totem.pg.mrp.srp.members.6.join_count=1
>> > runtime.totem.pg.mrp.srp.members.6.status=joined
>> >
>> >
>> > Existing Node
>> > =============
>> >
>> > Member 6 has not been added to the quorum list:
>> >
>> > Jan 7 21:36:28 hv-1 corosync[7769]: [QUORUM] Members[4]: 1 2 3 5
>> > Jan 7 21:37:54 hv-1 corosync[7769]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> > Jan 7 21:37:54 hv-1 corosync[7769]: [CPG ] chosen downlist: sender r(0) ip(10.14.18.65) ; members(old:4 left:0)
>> > Jan 7 21:37:54 hv-1 corosync[7769]: [MAIN ] Completed service synchronization, ready to provide service.
>> >
>> >
>> > Node Sts Inc Joined Name
>> > 1 M 4468 2013-12-10 14:33:27 hv-1
>> > 2 M 4468 2013-12-10 14:33:27 hv-2
>> > 3 M 5036 2014-01-07 17:51:26 hv-3
>> > 4 X 4468                     hv-4   (dead at the moment)
>> > 5 M 4468 2013-12-10 14:33:27 hv-5
>> > 6 X 0                        hv-6   <--- added
>> >
>> >
>> > totem.interface.member.memberaddr=hv-1
>> > totem.interface.member.memberaddr=hv-2
>> > totem.interface.member.memberaddr=hv-3
>> > totem.interface.member.memberaddr=hv-4
>> > totem.interface.member.memberaddr=hv-5
>> > runtime.totem.pg.mrp.srp.members.1.ip=r(0) ip(10.14.18.65)
>> > runtime.totem.pg.mrp.srp.members.1.join_count=1
>> > runtime.totem.pg.mrp.srp.members.1.status=joined
>> > runtime.totem.pg.mrp.srp.members.2.ip=r(0) ip(10.14.18.66)
>> > runtime.totem.pg.mrp.srp.members.2.join_count=1
>> > runtime.totem.pg.mrp.srp.members.2.status=joined
>> > runtime.totem.pg.mrp.srp.members.4.ip=r(0) ip(10.14.18.68)
>> > runtime.totem.pg.mrp.srp.members.4.join_count=1
>> > runtime.totem.pg.mrp.srp.members.4.status=left
>> > runtime.totem.pg.mrp.srp.members.5.ip=r(0) ip(10.14.18.70)
>> > runtime.totem.pg.mrp.srp.members.5.join_count=1
>> > runtime.totem.pg.mrp.srp.members.5.status=joined
>> > runtime.totem.pg.mrp.srp.members.3.ip=r(0) ip(10.14.18.67)
>> > runtime.totem.pg.mrp.srp.members.3.join_count=3
>> > runtime.totem.pg.mrp.srp.members.3.status=joined
>> >
>> >
>> > cluster.conf:
>> >
>> > <?xml version="1.0"?>
>> > <cluster config_version="32" name="hv-1618-110-1">
>> > <fence_daemon clean_start="0"/>
>> > <cman transport="udpu" expected_votes="1"/>
>>
>
>
> Setting expected_votes to 1 in a six-node cluster is a serious
> configuration error and needs to be changed. That is what is causing
> the new node to fence the rest of the cluster.
>
> Check that all of the nodes have the same cluster.conf file; any
> difference between the one on the existing nodes and the new one will
> prevent the new node from joining too.
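> With one vote per node, something sane for six nodes would be, e.g.:
>
>     <cman transport="udpu" expected_votes="6"/>
>
> or drop the expected_votes attribute entirely and let cman calculate
> it from the nodes listed in cluster.conf.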
>
> Chrissie
>
>
>