[Linux-cluster] Help needed
Digimer
lists at alteeve.ca
Fri Jun 1 18:43:55 UTC 2012
What do 'shr289.cup.hp.com' and 'shr295.cup.hp.com' resolve to? Does
your switch support multicast properly? If the switch periodically tears
down a multicast group, your cluster will partition.
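A quick way to check the first question is to see what address each node name resolves to; cman/corosync binds to the interface that owns that address, and if the name resolves to 127.0.0.1 (a common /etc/hosts mistake) the nodes will never see each other. A minimal sketch, with "localhost" as a stand-in for your actual node names:

```python
import socket

def node_address(hostname):
    """Return the IPv4 address a cluster node name resolves to."""
    return socket.gethostbyname(hostname)

# On the cluster, check node_address("shr289.cup.hp.com") and
# node_address("shr295.cup.hp.com") on each node; both must return a
# real NIC address (like 16.89.112.182), never the loopback address.
print(node_address("localhost"))
```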
You *must* have fencing configured. Fencing using iLO works fine, please
use it. See
https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Example_.3Cfencedevice....3E_Tag_For_HP_iLO
Without fencing, your cluster will be unstable.
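As a minimal sketch of what that looks like, following the pattern in the tutorial above (the iLO hostnames, login and password here are placeholders, not real values):

```xml
<clusternode name="shr289.cup.hp.com" nodeid="1">
	<fence>
		<method name="ilo">
			<device name="ilo_shr289" action="reboot"/>
		</method>
	</fence>
</clusternode>
<!-- ...same pattern for shr295... -->
<fencedevices>
	<fencedevice name="ilo_shr289" agent="fence_ilo" ipaddr="ilo-shr289.cup.hp.com" login="admin" passwd="secret"/>
	<fencedevice name="ilo_shr295" agent="fence_ilo" ipaddr="ilo-shr295.cup.hp.com" login="admin" passwd="secret"/>
</fencedevices>
```

Each node references a fencedevice by name, and the device entry carries the agent (fence_ilo) plus the iLO's address and credentials.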
Digimer
On 06/01/2012 01:53 PM, Chen, Ming Ming wrote:
> Thanks for returning my email. Below are the cluster configuration file and the network configuration. Also, one piece of bad news: the original issue has come back.
> So I've seen two problems, and both occur sporadically:
> Thanks again for your help.
> Regards
> Ming
>
> 1. The original one. I've increased the version number, and it was gone for a while, but it came back. Do you know why?
>
> May 31 09:08:05 shr295 corosync[3542]: [MAIN  ] Completed service synchronization, ready to provide service.
> May 31 09:08:05 shr295 corosync[3542]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> May 31 09:08:05 shr295 corosync[3542]: [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
> May 31 09:08:05 shr295 corosync[3542]: [CMAN  ] Can't get updated config version 4: New configuration version has to be newer than current running configuration.
> May 31 09:08:05 shr295 corosync[3542]: [CMAN  ] Activity suspended on this node
> May 31 09:08:05 shr295 corosync[3542]: [CMAN  ] Error reloading the configuration, will retry every second
> May 31 09:08:05 shr295 corosync[3542]: [CMAN  ] Node 1 conflict, remote config version id=4, local=2
> May 31 09:08:05 shr295 corosync[3542]: [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
> May 31 09:08:05 shr295 corosync[3542]: [CMAN  ] Can't get updated config version 4: New configuration version has to be newer than current running configuration.
>
> 2.
>> [root@shr295 ~]# tail -f /var/log/messages
>> May 31 16:56:01 shr295 dlm_controld[3375]: dlm_controld 3.0.12.1 started
>> May 31 16:56:11 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:12 shr295 gfs_controld[3447]: gfs_controld 3.0.12.1 started
>> May 31 16:56:12 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:21 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:22 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:22 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>> May 31 16:56:31 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:32 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:32 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>> May 31 16:56:41 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:42 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:42 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>
> Cluster configuration File:
>>> <?xml version="1.0"?>
>>> <cluster config_version="2" name="vmcluster">
>>> <logging debug="on"/>
>>> <cman expected_votes="1" two_node="1"/>
>>> <clusternodes>
>>> <clusternode name="shr289.cup.hp.com" nodeid="1">
>>> <fence>
>>> </fence>
>>> </clusternode>
>>> <clusternode name="shr295.cup.hp.com" nodeid="2">
>>> <fence>
>>> </fence>
>>> </clusternode>
>>> </clusternodes>
>>> <fencedevices>
>>> </fencedevices>
>>> <rm>
>>> </rm>
>>> </cluster>
>
> I had a fencing configuration there, but I'd like to see that I can bring up a simple cluster first; I will add the fencing back afterward.
>
> The network configuration:
> eth1 Link encap:Ethernet HWaddr 00:23:7D:36:05:20
> inet addr:16.89.112.182 Bcast:16.89.119.255 Mask:255.255.248.0
> inet6 addr: fe80::223:7dff:fe36:520/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:1210316 errors:0 dropped:0 overruns:0 frame:0
> TX packets:73158 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:150775766 (143.7 MiB) TX bytes:11749950 (11.2 MiB)
> Interrupt:16 Memory:f6000000-f6012800
>
> lo Link encap:Local Loopback
> inet addr:127.0.0.1 Mask:255.0.0.0
> inet6 addr: ::1/128 Scope:Host
> UP LOOPBACK RUNNING MTU:16436 Metric:1
> RX packets:291 errors:0 dropped:0 overruns:0 frame:0
> TX packets:291 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:38225 (37.3 KiB) TX bytes:38225 (37.3 KiB)
>
> virbr0 Link encap:Ethernet HWaddr 52:54:00:30:33:BD
> inet addr:192.168.122.1 Bcast:192.168.122.255 Mask:255.255.255.0
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> TX packets:488 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:0 (0.0 b) TX bytes:25273 (24.6 KiB)
>
>
> -----Original Message-----
> From: Digimer [mailto:lists at alteeve.ca]
> Sent: Thursday, May 31, 2012 7:05 PM
> To: Chen, Ming Ming
> Cc: linux clustering
> Subject: Re: [Linux-cluster] Help needed
>
> Please send your cluster.conf, editing out only the passwords. Please also
> include your network configs.
>
> On 05/31/2012 08:12 PM, Chen, Ming Ming wrote:
>> Hi Digimer,
>> Thanks for your comment. I've got rid of the first problem, and now I have the following messages. Any idea?
>> Thanks in advance.
>> Ming
>>
>> [root@shr295 ~]# tail -f /var/log/messages
>> May 31 16:56:01 shr295 dlm_controld[3375]: dlm_controld 3.0.12.1 started
>> May 31 16:56:11 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:12 shr295 gfs_controld[3447]: gfs_controld 3.0.12.1 started
>> May 31 16:56:12 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:21 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:22 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:22 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>> May 31 16:56:31 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:32 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:32 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>> May 31 16:56:41 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:42 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:42 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>>
>> -----Original Message-----
>> From: Digimer [mailto:lists at alteeve.ca]
>> Sent: Thursday, May 31, 2012 10:13 AM
>> To: Chen, Ming Ming
>> Cc: linux clustering
>> Subject: Re: [Linux-cluster] Help needed
>>
>> On 05/31/2012 12:27 PM, Chen, Ming Ming wrote:
>>> Hi, I have the following simple cluster config, just to try it out on CentOS 6.2:
>>>
>>> <?xml version="1.0"?>
>>> <cluster config_version="2" name="vmcluster">
>>> <logging debug="on"/>
>>> <cman expected_votes="1" two_node="1"/>
>>> <clusternodes>
>>> <clusternode name="shr289.cup.hp.com" nodeid="1">
>>> <fence>
>>> </fence>
>>> </clusternode>
>>> <clusternode name="shr295.cup.hp.com" nodeid="2">
>>> <fence>
>>> </fence>
>>> </clusternode>
>>> </clusternodes>
>>> <fencedevices>
>>> </fencedevices>
>>> <rm>
>>> </rm>
>>> </cluster>
>>>
>>>
>>> And I got the following error messages when I ran "service cman start". I see the same messages on both nodes.
>>> Any help will be appreciated.
>>>
>>> May 31 09:08:04 corosync [TOTEM ] RRP multicast threshold (100 problem count)
>>> May 31 09:08:05 shr295 corosync[3542]: [MAIN  ] Completed service synchronization, ready to provide service.
>>> May 31 09:08:05 shr295 corosync[3542]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN  ] Can't get updated config version 4: New configuration version has to be newer than current running configuration.
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN  ] Activity suspended on this node
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN  ] Error reloading the configuration, will retry every second
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN  ] Node 1 conflict, remote config version id=4, local=2
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN  ] Can't get updated config version 4: New configuration version has to be newer than current running configuration.
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN  ] Activity suspended on this node
>>>
>>
>> Run 'cman_tool version' to get the current version of the configuration,
>> then increase the config_version="x" to be one higher.
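Concretely: the logs show the running cluster is already at config version 4, so the edited cluster.conf needs a number higher than that (the "5" here is just illustrative):

```xml
<cluster config_version="5" name="vmcluster">
```

If I recall the RHEL 6 tooling correctly, running 'cman_tool version -r 5' on the node with the edited file then asks cman to load the new version without a restart; otherwise, restart cman on both nodes with identical files.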
>>
>> Also, configure fencing! If you don't, your cluster will hang the first
>> time anything goes wrong.
>>
>> --
>> Digimer
>> Papers and Projects: https://alteeve.com
>
>
> --
> Digimer
> Papers and Projects: https://alteeve.com
--
Digimer
Papers and Projects: https://alteeve.com