[Linux-cluster] Help needed

Digimer lists at alteeve.ca
Fri Jun 1 18:43:55 UTC 2012


What do 'shr289.cup.hp.com' and 'shr295.cup.hp.com' resolve to? Does
your switch support multicast properly? If the switch periodically tears
down a multicast group, your cluster will partition.
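
A quick way to check both, as a rough sketch (assuming the omping package
is available; 'yum install omping' if it isn't):

  # On each node, confirm what the cluster node names resolve to
  getent hosts shr289.cup.hp.com
  getent hosts shr295.cup.hp.com

  # Run on both nodes at the same time to exercise multicast between them
  omping shr289.cup.hp.com shr295.cup.hp.com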

You *must* have fencing configured. Fencing using iLO works fine, please
use it. See
https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Example_.3Cfencedevice....3E_Tag_For_HP_iLO
Without fencing, your cluster will be unstable.
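
As a rough sketch of what that looks like (the iLO hostnames, login and
password below are placeholders, not your real values):

  <clusternode name="shr289.cup.hp.com" nodeid="1">
        <fence>
              <method name="ilo">
                    <device name="ilo_shr289" action="reboot"/>
              </method>
        </fence>
  </clusternode>
  ...
  <fencedevices>
        <fencedevice agent="fence_ilo" name="ilo_shr289"
                     ipaddr="shr289-ilo.cup.hp.com" login="admin" passwd="secret"/>
        <fencedevice agent="fence_ilo" name="ilo_shr295"
                     ipaddr="shr295-ilo.cup.hp.com" login="admin" passwd="secret"/>
  </fencedevices>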

Digimer

On 06/01/2012 01:53 PM, Chen, Ming Ming wrote:
> Thanks for returning my email. Below are the cluster configuration file and the network configuration. Also, one piece of bad news: the original issue has come back again.
> So I've seen two problems, both of which occur sporadically:
> Thanks again for your help.
> Regards
> Ming
> 
> 1. The original one. I've increased the version number, and it was gone for a while, but it came back. Do you know why?
> 
>>> May 31 09:08:05 shr295 corosync[3542]:   [MAIN  ] Completed service synchronization, ready to provide service.
>>> May 31 09:08:05 shr295 corosync[3542]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Can't get updated config version 4: New configuration version has to be newer than current running configuration#012.
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Activity suspended on this node
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Error reloading the configuration, will retry every second
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Node 1 conflict, remote config version id=4, local=2
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Can't get updated config version 4: New configuration version has to be newer than current running configuration#012.
> 
> 2. The new one:
>> [root@shr295 ~]# tail -f /var/log/messages
>> May 31 16:56:01 shr295 dlm_controld[3375]: dlm_controld 3.0.12.1 started
>> May 31 16:56:11 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:12 shr295 gfs_controld[3447]: gfs_controld 3.0.12.1 started
>> May 31 16:56:12 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:21 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:22 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:22 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>> May 31 16:56:31 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:32 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:32 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>> May 31 16:56:41 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:42 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:42 shr295 gfs_controld[3447]: daemon cpg_join error retrying
> 
> Cluster configuration File:
>>> <?xml version="1.0"?>
>>> <cluster config_version="2" name="vmcluster">
>>>       <logging debug="on"/>
>>>       <cman expected_votes="1" two_node="1"/>
>>>       <clusternodes>
>>>             <clusternode name="shr289.cup.hp.com" nodeid="1">
>>>                   <fence>
>>>                   </fence>
>>>             </clusternode>
>>>             <clusternode name="shr295.cup.hp.com" nodeid="2">
>>>                   <fence>
>>>                   </fence>
>>>             </clusternode>
>>>       </clusternodes>
>>>       <fencedevices>
>>>       </fencedevices>
>>>       <rm>
>>>       </rm>
>>> </cluster>
> 
> I had a fencing configuration there, but I'd like to see that I can bring up a simple cluster first; then I will add the fencing back.
> 
> The network configuration:
> eth1      Link encap:Ethernet  HWaddr 00:23:7D:36:05:20
>           inet addr:16.89.112.182  Bcast:16.89.119.255  Mask:255.255.248.0
>           inet6 addr: fe80::223:7dff:fe36:520/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:1210316 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:73158 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:150775766 (143.7 MiB)  TX bytes:11749950 (11.2 MiB)
>           Interrupt:16 Memory:f6000000-f6012800
> 
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:291 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:291 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:38225 (37.3 KiB)  TX bytes:38225 (37.3 KiB)
> 
> virbr0    Link encap:Ethernet  HWaddr 52:54:00:30:33:BD
>           inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:488 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:0 (0.0 b)  TX bytes:25273 (24.6 KiB)
> 
> 
> -----Original Message-----
> From: Digimer [mailto:lists at alteeve.ca]
> Sent: Thursday, May 31, 2012 7:05 PM
> To: Chen, Ming Ming
> Cc: linux clustering
> Subject: Re: [Linux-cluster] Help needed
> 
> Please send your cluster.conf, editing out only the passwords. Please also
> include your network configs.
> 
> On 05/31/2012 08:12 PM, Chen, Ming Ming wrote:
>> Hi Digimer,
>> Thanks for your comment. I've gotten rid of the first problem, and now I see the following messages. Any idea?
>> Thanks in advance.
>> Ming
>>
>> [root@shr295 ~]# tail -f /var/log/messages
>> May 31 16:56:01 shr295 dlm_controld[3375]: dlm_controld 3.0.12.1 started
>> May 31 16:56:11 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:12 shr295 gfs_controld[3447]: gfs_controld 3.0.12.1 started
>> May 31 16:56:12 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:21 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:22 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:22 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>> May 31 16:56:31 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:32 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:32 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>> May 31 16:56:41 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:42 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:42 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>>
>> -----Original Message-----
>> From: Digimer [mailto:lists at alteeve.ca]
>> Sent: Thursday, May 31, 2012 10:13 AM
>> To: Chen, Ming Ming
>> Cc: linux clustering
>> Subject: Re: [Linux-cluster] Help needed
>>
>> On 05/31/2012 12:27 PM, Chen, Ming Ming wrote:
>>> Hi, I have the following simple cluster config, just to try it out on CentOS 6.2:
>>>
>>> <?xml version="1.0"?>
>>> <cluster config_version="2" name="vmcluster">
>>>       <logging debug="on"/>
>>>       <cman expected_votes="1" two_node="1"/>
>>>       <clusternodes>
>>>             <clusternode name="shr289.cup.hp.com" nodeid="1">
>>>                   <fence>
>>>                   </fence>
>>>             </clusternode>
>>>             <clusternode name="shr295.cup.hp.com" nodeid="2">
>>>                   <fence>
>>>                   </fence>
>>>             </clusternode>
>>>       </clusternodes>
>>>       <fencedevices>
>>>       </fencedevices>
>>>       <rm>
>>>       </rm>
>>> </cluster>
>>>
>>>
>>> And I got the following error messages when I ran "service cman start". I see the same messages on both nodes.
>>> Any help will be appreciated.
>>>
>>> May 31 09:08:04 corosync [TOTEM ] RRP multicast threshold (100 problem count)
>>> May 31 09:08:05 shr295 corosync[3542]:   [MAIN  ] Completed service synchronization, ready to provide service.
>>> May 31 09:08:05 shr295 corosync[3542]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Can't get updated config version 4: New configuration version has to be newer than current running configuration#012.
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Activity suspended on this node
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Error reloading the configuration, will retry every second
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Node 1 conflict, remote config version id=4, local=2
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Can't get updated config version 4: New configuration version has to be newer than current running configuration#012.
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Activity suspended on this node
>>>
>>
>> Run 'cman_tool version' on both nodes to get the running configuration
>> version, then set config_version="x" in cluster.conf higher than both.
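>>
>> As a rough sketch of the update flow on RHEL/CentOS 6 (run on one node;
>> cman distributes the config to the other):
>>
>> cman_tool version              # note the running config version
>> vi /etc/cluster/cluster.conf   # set config_version higher than both nodes report
>> ccs_config_validate            # sanity-check the new XML
>> cman_tool version -r           # load and push the new config cluster-wide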
>>
>> Also, configure fencing! If you don't, your cluster will hang the first
>> time anything goes wrong.
>>
>> --
>> Digimer
>> Papers and Projects: https://alteeve.com
> 
> 
> --
> Digimer
> Papers and Projects: https://alteeve.com


-- 
Digimer
Papers and Projects: https://alteeve.com



