[Linux-cluster] CLVM & CMAN live adding nodes

Bjoern Teipel bjoern.teipel at internetbrands.com
Mon Feb 24 08:39:51 UTC 2014


Hi Fabio,

Removing UDPU does not change the behavior: the new node still doesn't join
the cluster and still wants to fence node 01.
It still feels like a split-brain situation.
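
(Concretely, by "removing UDPU" I mean changing the <cman> element roughly
like this; the multicast address below is only an example:

  before:  <cman transport="udpu" expected_votes="1"/>
  after:   <cman expected_votes="1">
             <multicast addr="239.192.18.4"/>
           </cman>
)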
How do you join a new node: using /etc/init.d/cman start, or using
cman_tool / dlm_tool join?
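
(For reference, the two join sequences I am comparing are roughly the stock
CentOS 6 init scripts versus the lower-level tools; just a sketch:

  # A: init scripts (start corosync/cman, fenced, dlm_controld, gfs_controld)
  /etc/init.d/cman start
  /etc/init.d/clvmd start

  # B: manual join
  cman_tool join -w
  fence_tool join
)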

Bjoern


On Sat, Feb 22, 2014 at 10:16 PM, Fabio M. Di Nitto <fdinitto at redhat.com> wrote:

> On 02/22/2014 08:05 PM, Bjoern Teipel wrote:
> > Thanks, Fabio, for replying to my request.
> >
> > I'm using stock CentOS 6.4 versions and no rm, just clvmd and dlm.
> >
> > Name        : cman                         Relocations: (not relocatable)
> > Version     : 3.0.12.1                          Vendor: CentOS
> > Release     : 49.el6_4.2                    Build Date: Tue 03 Sep 2013
> > 02:18:10 AM PDT
> >
> > Name        : lvm2-cluster                 Relocations: (not relocatable)
> > Version     : 2.02.98                           Vendor: CentOS
> > Release     : 9.el6_4.3                     Build Date: Tue 05 Nov 2013
> > 07:36:18 AM PST
> >
> > Name        : corosync                     Relocations: (not relocatable)
> > Version     : 1.4.1                             Vendor: CentOS
> > Release     : 15.el6_4.1                    Build Date: Tue 14 May 2013
> > 02:09:27 PM PDT
> >
> >
> > My question is based on this problem I have had since January:
> >
> >
> > Whenever I add a new node (I put it into cluster.conf and reload it
> > with cman_tool version -r -S), I end up with a situation where the new
> > node wants to gain quorum, starts to fence the existing pool master,
> > and appears to create some sort of split cluster. Does this work at
> > all, given that corosync and dlm do not know about the recently added
> > node?
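> >
> > (Spelled out, the sequence is roughly: bump config_version in
> > /etc/cluster/cluster.conf on an existing member, then
> >
> >   ccs_config_validate
> >   cman_tool version -r -S
> >   cman_tool nodes          # check whether the new node shows up
> >
> > and only after that start cman and clvmd on the new box.)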
>
> I can see you are using UDPU and that could be the culprit. Can you drop
> UDPU and work with multicast?
>
> Jan/Chrissie: do you remember if we support adding nodes at runtime with
> UDPU?
>
> The standalone node should not have quorum at all and should not be able
> to fence anybody to start with.
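>
> (On the new node, something like
>
>   cman_tool status | grep -i quorum
>   fence_tool ls
>
> would show whether it really thinks it is quorate and what the fence
> domain looks like from its side.)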
>
> >
> > New Node
> > ==========
> >
> > Node  Sts   Inc   Joined               Name
> >    1   X      0                        hv-1
> >    2   X      0                        hv-2
> >    3   X      0                        hv-3
> >    4   X      0                        hv-4
> >    5   X      0                        hv-5
> >    6   M     80   2014-01-07 21:37:42  hv-6  <--- host added
> >
> >
> > Jan  7 21:37:42 hv-1 corosync[12564]:   [TOTEM ] The network interface
> > [10.14.18.77] is now up.
> > Jan  7 21:37:42 hv-1 corosync[12564]:   [QUORUM] Using quorum provider
> > quorum_cman
> > Jan  7 21:37:42 hv-1 corosync[12564]:   [SERV  ] Service engine loaded:
> > corosync cluster quorum service v0.1
> > Jan  7 21:37:42 hv-1 corosync[12564]:   [CMAN  ] CMAN 3.0.12.1 (built
> > Sep  3 2013 09:17:34) started
> > Jan  7 21:37:42 hv-1 corosync[12564]:   [SERV  ] Service engine loaded:
> > corosync CMAN membership service 2.90
> > Jan  7 21:37:42 hv-1 corosync[12564]:   [SERV  ] Service engine loaded:
> > openais checkpoint service B.01.01
> > Jan  7 21:37:42 hv-1 corosync[12564]:   [SERV  ] Service engine loaded:
> > corosync extended virtual synchrony service
> > Jan  7 21:37:42 hv-1 corosync[12564]:   [SERV  ] Service engine loaded:
> > corosync configuration service
> > Jan  7 21:37:42 hv-1 corosync[12564]:   [SERV  ] Service engine loaded:
> > corosync cluster closed process group service v1.01
> > Jan  7 21:37:42 hv-1 corosync[12564]:   [SERV  ] Service engine loaded:
> > corosync cluster config database access v1.01
> > Jan  7 21:37:42 hv-1 corosync[12564]:   [SERV  ] Service engine loaded:
> > corosync profile loading service
> > Jan  7 21:37:42 hv-1 corosync[12564]:   [QUORUM] Using quorum provider
> > quorum_cman
> > Jan  7 21:37:42 hv-1 corosync[12564]:   [SERV  ] Service engine loaded:
> > corosync cluster quorum service v0.1
> > Jan  7 21:37:42 hv-1 corosync[12564]:   [MAIN  ] Compatibility mode set
> > to whitetank.  Using V1 and V2 of the synchronization engine.
> > Jan  7 21:37:42 hv-1 corosync[12564]:   [TOTEM ] adding new UDPU member
> > {10.14.18.65}
> > Jan  7 21:37:42 hv-1 corosync[12564]:   [TOTEM ] adding new UDPU member
> > {10.14.18.67}
> > Jan  7 21:37:42 hv-1 corosync[12564]:   [TOTEM ] adding new UDPU member
> > {10.14.18.68}
> > Jan  7 21:37:42 hv-1 corosync[12564]:   [TOTEM ] adding new UDPU member
> > {10.14.18.70}
> > Jan  7 21:37:42 hv-1 corosync[12564]:   [TOTEM ] adding new UDPU member
> > {10.14.18.66}
> > Jan  7 21:37:42 hv-1 corosync[12564]:   [TOTEM ] adding new UDPU member
> > {10.14.18.77}
> > Jan  7 21:37:42 hv-1  corosync[12564]:   [TOTEM ] A processor joined or
> > left the membership and a new membership was formed.
> > Jan  7 21:37:42 hv-1 corosync[12564]:   [CMAN  ] quorum regained,
> > resuming activity
> > Jan  7 21:37:42 hv-1 corosync[12564]:   [QUORUM] This node is within the
> > primary component and will provide service.
> > Jan  7 21:37:42 hv-1 corosync[12564]:   [QUORUM] Members[1]: 6
> > Jan  7 21:37:42 hv-1 corosync[12564]:   [QUORUM] Members[1]: 6
> > Jan  7 21:37:42 hv-1 corosync[12564]:   [CPG   ] chosen downlist: sender
> > r(0) ip(10.14.18.77) ; members(old:0 left:0)
> > Jan  7 21:37:42 hv-1 corosync[12564]:   [MAIN  ] Completed service
> > synchronization, ready to provide service.
> > Jan  7 21:37:46 hv-1 fenced[12620]: fenced 3.0.12.1 started
> > Jan  7 21:37:46 hv-1 dlm_controld[12643]: dlm_controld 3.0.12.1 started
> > Jan  7 21:37:47 hv-1 gfs_controld[12695]: gfs_controld 3.0.12.1 started
> > Jan  7 21:37:54 hv-1 fenced[12620]: fencing node hv-b1clcy1
> >
> > sudo -i corosync-objctl  |grep member
> >
> > totem.interface.member.memberaddr=hv-1
> > totem.interface.member.memberaddr=hv-2
> > totem.interface.member.memberaddr=hv-3
> > totem.interface.member.memberaddr=hv-4
> > totem.interface.member.memberaddr=hv-5
> > totem.interface.member.memberaddr=hv-6
> > runtime.totem.pg.mrp.srp.members.6.ip=r(0) ip(10.14.18.77)
> > runtime.totem.pg.mrp.srp.members.6.join_count=1
> > runtime.totem.pg.mrp.srp.members.6.status=joined
> >
> >
> > Existing Node
> > =============
> >
> > member 6 has not been added to the quorum list:
> >
> >
> > Node  Sts   Inc   Joined               Name
> >    1   M   4468   2013-12-10 14:33:27  hv-1
> >    2   M   4468   2013-12-10 14:33:27  hv-2
> >    3   M   5036   2014-01-07 17:51:26  hv-3
> >    4   X   4468                        hv-4  (dead at the moment)
> >    5   M   4468   2013-12-10 14:33:27  hv-5
> >    6   X      0                        hv-6  <--- added
> >
> >
> > Jan  7 21:36:28 hv-1 corosync[7769]:   [QUORUM] Members[4]: 1 2 3 5
> > Jan  7 21:37:54 hv-1 corosync[7769]:   [TOTEM ] A processor joined or
> > left the membership and a new membership was formed.
> > Jan  7 21:37:54 hv-1 corosync[7769]:   [CPG   ] chosen downlist: sender
> > r(0) ip(10.14.18.65) ; members(old:4 left:0)
> > Jan  7 21:37:54 hv-1 corosync[7769]:   [MAIN  ] Completed service
> > synchronization, ready to provide service.
> >
> >
> > totem.interface.member.memberaddr=hv-1
> > totem.interface.member.memberaddr=hv-2
> > totem.interface.member.memberaddr=hv-3
> > totem.interface.member.memberaddr=hv-4
> > totem.interface.member.memberaddr=hv-5
> > runtime.totem.pg.mrp.srp.members.1.ip=r(0) ip(10.14.18.65)
> > runtime.totem.pg.mrp.srp.members.1.join_count=1
> > runtime.totem.pg.mrp.srp.members.1.status=joined
> > runtime.totem.pg.mrp.srp.members.2.ip=r(0) ip(10.14.18.66)
> > runtime.totem.pg.mrp.srp.members.2.join_count=1
> > runtime.totem.pg.mrp.srp.members.2.status=joined
> > runtime.totem.pg.mrp.srp.members.4.ip=r(0) ip(10.14.18.68)
> > runtime.totem.pg.mrp.srp.members.4.join_count=1
> > runtime.totem.pg.mrp.srp.members.4.status=left
> > runtime.totem.pg.mrp.srp.members.5.ip=r(0) ip(10.14.18.70)
> > runtime.totem.pg.mrp.srp.members.5.join_count=1
> > runtime.totem.pg.mrp.srp.members.5.status=joined
> > runtime.totem.pg.mrp.srp.members.3.ip=r(0) ip(10.14.18.67)
> > runtime.totem.pg.mrp.srp.members.3.join_count=3
> > runtime.totem.pg.mrp.srp.members.3.status=joined
> >
> >
> > cluster.conf:
> >
> > <?xml version="1.0"?>
> > <cluster config_version="32" name="hv-1618-110-1">
> >   <fence_daemon clean_start="0"/>
> >   <cman transport="udpu" expected_votes="1"/>
> >   <logging debug="off"/>
> >   <clusternodes>
> >     <clusternode name="hv-1" votes="1" nodeid="1"><fence><method
> > name="single"><device name="human"/></method></fence></clusternode>
> >     <clusternode name="hv-2" votes="1" nodeid="3"><fence><method
> > name="single"><device name="human"/></method></fence></clusternode>
> >     <clusternode name="hv-3" votes="1" nodeid="4"><fence><method
> > name="single"><device name="human"/></method></fence></clusternode>
> >     <clusternode name="hv-4" votes="1" nodeid="5"><fence><method
> > name="single"><device name="human"/></method></fence></clusternode>
> >     <clusternode name="hv-5" votes="1" nodeid="2"><fence><method
> > name="single"><device name="human"/></method></fence></clusternode>
> >     <clusternode name="hv-6" votes="1" nodeid="6"><fence><method
> > name="single"><device name="human"/></method></fence></clusternode>
> >   </clusternodes>
> >   <fencedevices>
> >   <fencedevice name="human" agent="manual"/></fencedevices>
> >   <rm/>
> > </cluster>
> >
> > (manual fencing just for testing)
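> >
> > (My understanding is that with manual fencing a pending fence operation
> > then has to be acknowledged by hand with something like
> > "fence_ack_manual <nodename>" on a member of the fence domain.)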
> >
> >
> > corosync.conf:
> >
> > compatibility: whitetank
> > totem {
> >   version: 2
> >   secauth: off
> >   threads: 0
> >   # fail_recv_const: 5000
> >   interface {
> >     ringnumber: 0
> >     bindnetaddr: 10.14.18.0
> >     mcastaddr: 239.0.0.4
> >     mcastport: 5405
> >   }
> > }
> > logging {
> >   fileline: off
> >   to_stderr: no
> >   to_logfile: yes
> >   to_syslog: yes
> >   # the pathname of the log file
> >   logfile: /var/log/cluster/corosync.log
> >   debug: off
> >   timestamp: on
> >   logger_subsys {
> >     subsys: AMF
> >     debug: off
> >   }
> > }
> >
> > amf {
> >   mode: disabled
> > }
> >
>
> when using cman, corosync.conf is not used/read.
>
> Fabio
>
> >
> >
> > On Sat, Feb 22, 2014 at 5:54 AM, Fabio M. Di Nitto <fdinitto at redhat.com> wrote:
> >
> >     On 02/22/2014 10:33 AM, emmanuel segura wrote:
> >     > As far as I know, if you need to modify anything outside the
> >     > <rm>...</rm> tag (used by rgmanager) in the cluster.conf file, you
> >     > need to restart the whole cluster stack. With cman+rgmanager I have
> >     > never seen how to add or remove a node from the cluster without
> >     > restarting cman.
> >
> >     It depends on the version. For RHEL5 that's correct; on RHEL6 it also
> >     works for changes outside of <rm>, but there are some limitations, as
> >     some parameters just can't be changed at runtime.
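> >
> >     (On RHEL6 that means editing cluster.conf, bumping config_version,
> >     and running something like "cman_tool version -r" on a member to
> >     push the new configuration to the running cluster, instead of
> >     restarting cman.)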
> >
> >     Fabio
> >
> >     >
> >     >
> >     >
> >     >
> >     > 2014-02-22 6:21 GMT+01:00 Bjoern Teipel
> >     > <bjoern.teipel at internetbrands.com>:
> >     >
> >     >     Hi all,
> >     >
> >     >     Who is using CLVM with CMAN in a cluster with more than two
> >     >     nodes in production?
> >     >     Did you manage to live-add a new node to the cluster while
> >     >     everything is running?
> >     >     I'm only able to add nodes while the cluster stack is shut down.
> >     >     That's certainly not a good idea when you have to run CLVM on
> >     >     hypervisors and you would need to shut down all VMs to add a
> >     >     new box.
> >     >     It would also be good if you could paste some of your configs
> >     >     using IPMI fencing.
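> >     >
> >     >     (To illustrate the kind of IPMI config I mean, here is a rough
> >     >     sketch with made-up addresses and credentials, not a tested
> >     >     config:
> >     >
> >     >     <clusternode name="hv-1" votes="1" nodeid="1">
> >     >       <fence><method name="ipmi"><device name="ipmi-hv-1"/></method></fence>
> >     >     </clusternode>
> >     >     <!-- one such block per node -->
> >     >     <fencedevices>
> >     >       <fencedevice agent="fence_ipmilan" name="ipmi-hv-1"
> >     >                    ipaddr="10.14.19.65" login="admin" passwd="secret"
> >     >                    lanplus="1"/>
> >     >     </fencedevices>
> >     >     )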
> >     >
> >     >     Thanks in advance,
> >     >     Bjoern
> >     >
> >     >     --
> >     >     Linux-cluster mailing list
> >     >     Linux-cluster at redhat.com
> >     >     https://www.redhat.com/mailman/listinfo/linux-cluster
> >     >
> >     >
> >     >
> >     >
> >     > --
> >     > this is my life and I live it as long as God wills
> >     >
> >     >
> >
> >     --
> >     Linux-cluster mailing list
> >     Linux-cluster at redhat.com
> >     https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> >
> >
> >
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>

