[Linux-cluster] daemon cpg_join error retrying

Wed Oct 29 22:38:00 UTC 2014

> On 30 Oct 2014, at 9:32 am, Lax Kota (lkota) <lkota at cisco.com> wrote:
> 
> 
>>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with.
>>> How do I check the cluster name of a GFS file system? I have a similar configuration running fine in multiple other setups with no such issue.
> 
>> I don't really recall. Hopefully someone more familiar with GFS2 can chime in.
> Ok.
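
One way to test the mismatch theory, as a minimal sketch rather than a definitive procedure: a GFS2 lock table has the form clustername:fsname, and cluster.conf carries the cluster name as the name attribute of its root <cluster> element, so the two can be compared directly. The lock table string is assumed to have been read from the filesystem superblock beforehand (tunegfs2 or gfs2_tool from gfs2-utils are the usual route, but the exact invocation depends on the release and is not confirmed in this thread). The function names and sample values below are hypothetical.

```python
import xml.etree.ElementTree as ET


def cluster_name_from_conf(conf_xml: str) -> str:
    """Return the name attribute of the root <cluster> element of cluster.conf."""
    root = ET.fromstring(conf_xml)
    if root.tag != "cluster":
        raise ValueError(f"unexpected root element: {root.tag!r}")
    name = root.get("name")
    if name is None:
        raise ValueError("<cluster> element has no name attribute")
    return name


def cluster_name_from_lock_table(lock_table: str) -> str:
    """Split a GFS2 lock table ('clustername:fsname') and return the cluster part."""
    cluster, sep, fsname = lock_table.strip().partition(":")
    if not sep or not cluster or not fsname:
        raise ValueError(f"not of the form clustername:fsname: {lock_table!r}")
    return cluster


def names_match(conf_xml: str, lock_table: str) -> bool:
    """True when cluster.conf and the filesystem agree on the cluster name."""
    return cluster_name_from_conf(conf_xml) == cluster_name_from_lock_table(lock_table)


# Hypothetical sample values, not taken from the thread.
SAMPLE_CONF = '<cluster name="vsomcluster" config_version="1"></cluster>'
SAMPLE_LOCK_TABLE = "vsomcluster:gfs2data"

if __name__ == "__main__":
    print(names_match(SAMPLE_CONF, SAMPLE_LOCK_TABLE))
```

In practice the first argument would be the contents of /etc/cluster/cluster.conf and the second the lock table reported for the GFS2 device; a False result would support the mismatch theory.
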
> 
>>> 
>>> Also, in one other setup I am seeing a repeated flood of 'A processor joined or left the membership and a new membership was formed' messages every 4 seconds. I am running with the default TOTEM settings, with a token timeout of 10 seconds. Even after I increase the token and consensus values, the same message keeps flooding at the new consensus interval (e.g., if I increase it to 10 seconds, I see new-membership messages every 10 seconds).
>>> 
>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]:   [CPG   ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]:   [MAIN  ] Completed service synchronization, ready to provide service.
>>> 
>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]:   [CPG   ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]:   [MAIN  ] Completed service synchronization, ready to provide service.
> 
>> It does not sound like your network is particularly healthy.
>> Are you using multicast or udpu? If multicast, it might be worth trying udpu
> 
> I am using udpu, and the firewall is open for ports 5404 and 5405. tcpdump looks fine too; it does not show any issues. This is a VM environment, and even if I switch to another node within the same VM I keep getting the same failure.

Depending on what the host and VMs are doing, that might be your problem.
In any case, I will defer to the corosync guys at this point.
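
For anyone wanting to quantify how often the membership is re-forming before taking this further, here is a minimal sketch that pulls the TOTEM membership messages out of a syslog excerpt and reports the gap between them. The log lines are the ones quoted in this thread; the function names are hypothetical, and the year is an assumption because syslog timestamps do not carry one.

```python
from datetime import datetime

MEMBERSHIP_MSG = (
    "A processor joined or left the membership "
    "and a new membership was formed."
)

# Log lines as quoted in this thread.
LOG_EXCERPT = """\
Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]:   [CPG   ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]:   [CPG   ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]:   [MAIN  ] Completed service synchronization, ready to provide service.
"""


def membership_times(log_text, year=2014):
    """Return the timestamps of TOTEM membership-change lines.

    The year is an assumption supplied by the caller.
    """
    times = []
    for line in log_text.splitlines():
        if "[TOTEM ]" in line and MEMBERSHIP_MSG in line:
            # The first three whitespace-separated fields are month, day, time.
            stamp = " ".join(line.split()[:3])
            times.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    return times


def intervals_seconds(times):
    """Seconds elapsed between consecutive membership changes."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    print(intervals_seconds(membership_times(LOG_EXCERPT)))  # prints [4.0]
```

Run against a longer slice of the log, a steady interval that tracks the configured consensus value would match the behaviour described above, while irregular gaps would point more towards the host or network.
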

> 
> Thanks
> Lax
> 
> 
> 
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof
> Sent: Wednesday, October 29, 2014 3:17 PM
> To: linux clustering
> Subject: Re: [Linux-cluster] daemon cpg_join error retrying
> 
> 
>> On 30 Oct 2014, at 9:06 am, Lax Kota (lkota) <lkota at cisco.com> wrote:
>> 
>>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with.
>> How do I check the cluster name of a GFS file system? I have a similar configuration running fine in multiple other setups with no such issue.
> 
> I don't really recall. Hopefully someone more familiar with GFS2 can chime in.
> 
>> 
>> Also, in one other setup I am seeing a repeated flood of 'A processor joined or left the membership and a new membership was formed' messages every 4 seconds. I am running with the default TOTEM settings, with a token timeout of 10 seconds. Even after I increase the token and consensus values, the same message keeps flooding at the new consensus interval (e.g., if I increase it to 10 seconds, I see new-membership messages every 10 seconds).
>> 
>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]:   [CPG   ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]:   [MAIN  ] Completed service synchronization, ready to provide service.
>> 
>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]:   [CPG   ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]:   [MAIN  ] Completed service synchronization, ready to provide service.
> 
> It does not sound like your network is particularly healthy.
> Are you using multicast or udpu? If multicast, it might be worth trying udpu
> 
>> 
>> Thanks
>> Lax
>> 
>> 
>> -----Original Message-----
>> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof
>> Sent: Wednesday, October 29, 2014 2:42 PM
>> To: linux clustering
>> Subject: Re: [Linux-cluster] daemon cpg_join error retrying
>> 
>> 
>>> On 30 Oct 2014, at 8:38 am, Lax Kota (lkota) <lkota at cisco.com> wrote:
>>> 
>>> Hi All,
>>> 
>>> In one of my setups, I keep getting 'gfs_controld[10744]: daemon cpg_join error  retrying'. I have a 2-node setup with pacemaker and corosync.
>> 
>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with.
>> 
>>> 
>>> Even after I force-kill the pacemaker processes, reboot the server, and bring pacemaker back up, it keeps giving the cpg_join error. Is there any way to fix this issue?
>>> 
>>> 
>>> Thanks
>>> Lax
>>> 
>>> -- 
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com
>>> https://www.redhat.com/mailman/listinfo/linux-cluster