[Linux-cluster] Help needed

Fri Jun 1 17:53:07 UTC 2012

Thanks for returning my email. The cluster configuration file and the network configuration are included below. One piece of bad news is that the original issue has come back.
So I've seen two problems, and both occur sporadically (details below).
Thanks again for your help.
Regards,
Ming

1. The original one. I increased the version number and the problem went away for a while, but it has come back. Do you know why?

   May 31 09:08:05 shr295 corosync[3542]:   [MAIN  ] Completed service synchronization, ready to provide service.
   May 31 09:08:05 shr295 corosync[3542]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
   May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
   May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Can't get updated config version 4: New configuration version has to be newer than current running configuration#012.
   May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Activity suspended on this node
   May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Error reloading the configuration, will retry every second
   May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Node 1 conflict, remote config version id=4, local=2
   [...] shr295 corosync[3542]:   [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
   May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Can't get updated config version 4: New configuration version has to be newer than current running configuration#012.

2. [root@shr295 ~]# tail -f /var/log/messages
   May 31 16:56:01 shr295 dlm_controld[3375]: dlm_controld 3.0.12.1 started
   May 31 16:56:11 shr295 fenced[3353]: daemon cpg_join error retrying
   May 31 16:56:12 shr295 gfs_controld[3447]: gfs_controld 3.0.12.1 started
   May 31 16:56:12 shr295 dlm_controld[3375]: daemon cpg_join error retrying
   May 31 16:56:21 shr295 fenced[3353]: daemon cpg_join error retrying
   May 31 16:56:22 shr295 dlm_controld[3375]: daemon cpg_join error retrying
   May 31 16:56:22 shr295 gfs_controld[3447]: daemon cpg_join error retrying
   May 31 16:56:31 shr295 fenced[3353]: daemon cpg_join error retrying
   May 31 16:56:32 shr295 dlm_controld[3375]: daemon cpg_join error retrying
   May 31 16:56:32 shr295 gfs_controld[3447]: daemon cpg_join error retrying
   May 31 16:56:41 shr295 fenced[3353]: daemon cpg_join error retrying
   May 31 16:56:42 shr295 dlm_controld[3375]: daemon cpg_join error retrying
   May 31 16:56:42 shr295 gfs_controld[3447]: daemon cpg_join error retrying

Cluster configuration file:
<?xml version="1.0"?>
<cluster config_version="2" name="vmcluster">
      <logging debug="on"/>
      <cman expected_votes="1" two_node="1"/>
      <clusternodes>
            <clusternode name="shr289.cup.hp.com" nodeid="1">
                  <fence>
                  </fence>
            </clusternode>
            <clusternode name="shr295.cup.hp.com" nodeid="2">
                  <fence>
                  </fence>
            </clusternode>
      </clusternodes>
      <fencedevices>
      </fencedevices>
      <rm>
      </rm>
</cluster>
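
The file above carries config_version="2", while the log for problem 1 reports "remote config version id=4, local=2". A minimal shell sketch of how that comparison could be checked by hand follows. The function names get_config_version and check_config_version are made up for illustration, and the sketch assumes the attribute appears once, on the <cluster> element. The running version is passed in as a plain number (for example, read from 'cman_tool version' or from the log line), because the sketch does not try to parse any tool's output.

```shell
#!/bin/sh
# Hypothetical helper: print the config_version attribute of a cluster.conf.
# Assumes the attribute sits on the <cluster ...> element, as in the file above.
get_config_version() {
    sed -n 's/.*<cluster[^>]*config_version="\([0-9][0-9]*\)".*/\1/p' "$1" | head -n 1
}

# Hypothetical helper: compare the on-disk version with the running version.
# $1 = path to cluster.conf
# $2 = version the running cluster reports (entered by hand)
# Returns 0 if the file is newer, 1 if it is not, 2 if no version was found.
check_config_version() {
    file_version=$(get_config_version "$1")
    if [ -z "$file_version" ]; then
        echo "no config_version found in $1"
        return 2
    fi
    if [ "$file_version" -gt "$2" ]; then
        echo "ok: file version $file_version is newer than running version $2"
    else
        echo "stale: file version $file_version must be greater than running version $2"
        return 1
    fi
}
```

With the file as posted and a running version of 4, check_config_version /etc/cluster/cluster.conf 4 would report the file as stale, which matches the "has to be newer than current running configuration" message. The path is the usual location for cluster.conf and is an assumption about this installation.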

I had a fencing configuration there, but I'd like to bring up a simple cluster first and then add the fencing back.

The network configuration:
eth1      Link encap:Ethernet  HWaddr 00:23:7D:36:05:20
          inet addr:16.89.112.182  Bcast:16.89.119.255  Mask:255.255.248.0
          inet6 addr: fe80::223:7dff:fe36:520/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1210316 errors:0 dropped:0 overruns:0 frame:0
          TX packets:73158 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:150775766 (143.7 MiB)  TX bytes:11749950 (11.2 MiB)
          Interrupt:16 Memory:f6000000-f6012800

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:291 errors:0 dropped:0 overruns:0 frame:0
          TX packets:291 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:38225 (37.3 KiB)  TX bytes:38225 (37.3 KiB)

virbr0    Link encap:Ethernet  HWaddr 52:54:00:30:33:BD
          inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:488 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:25273 (24.6 KiB)

-----Original Message-----
From: Digimer [mailto:lists at alteeve.ca]
Sent: Thursday, May 31, 2012 7:05 PM
To: Chen, Ming Ming
Cc: linux clustering
Subject: Re: [Linux-cluster] Help needed

Send your cluster.conf please, editing out only the passwords. Please also
include your network configs.

On 05/31/2012 08:12 PM, Chen, Ming Ming wrote:
> Hi Digimer,
> Thanks for your comment. I've got rid of the first problem, and now I have the following messages. Any idea?
> Thanks in advance.
> Ming
>
> [root at shr295 ~]# tail -f /var/log/messages
> May 31 16:56:01 shr295 dlm_controld[3375]: dlm_controld 3.0.12.1 started
> May 31 16:56:11 shr295 fenced[3353]: daemon cpg_join error retrying
> May 31 16:56:12 shr295 gfs_controld[3447]: gfs_controld 3.0.12.1 started
> May 31 16:56:12 shr295 dlm_controld[3375]: daemon cpg_join error retrying
> May 31 16:56:21 shr295 fenced[3353]: daemon cpg_join error retrying
> May 31 16:56:22 shr295 dlm_controld[3375]: daemon cpg_join error retrying
> May 31 16:56:22 shr295 gfs_controld[3447]: daemon cpg_join error retrying
> May 31 16:56:31 shr295 fenced[3353]: daemon cpg_join error retrying
> May 31 16:56:32 shr295 dlm_controld[3375]: daemon cpg_join error retrying
> May 31 16:56:32 shr295 gfs_controld[3447]: daemon cpg_join error retrying
> May 31 16:56:41 shr295 fenced[3353]: daemon cpg_join error retrying
> May 31 16:56:42 shr295 dlm_controld[3375]: daemon cpg_join error retrying
> May 31 16:56:42 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>
> -----Original Message-----
> From: Digimer [mailto:lists at alteeve.ca]
> Sent: Thursday, May 31, 2012 10:13 AM
> To: Chen, Ming Ming
> Cc: linux clustering
> Subject: Re: [Linux-cluster] Help needed
>
> On 05/31/2012 12:27 PM, Chen, Ming Ming wrote:
>>  Hi, I have the following simple cluster config just to try out on CentOS 6.2
>>
>> <?xml version="1.0"?>
>> <cluster config_version="2" name="vmcluster">
>>       <logging debug="on"/>
>>       <cman expected_votes="1" two_node="1"/>
>>       <clusternodes>
>>             <clusternode name="shr289.cup.hp.com" nodeid="1">
>>                   <fence>
>>                   </fence>
>>             </clusternode>
>>             <clusternode name="shr295.cup.hp.com" nodeid="2">
>>                   <fence>
>>                   </fence>
>>             </clusternode>
>>       </clusternodes>
>>       <fencedevices>
>>       </fencedevices>
>>       <rm>
>>       </rm>
>> </cluster>
>>
>>
>> And I got the following error messages when I ran "service cman start". I got the same messages on both nodes.
>> Any help will be appreciated.
>>
>> May 31 09:08:04 corosync [TOTEM ] RRP multicast threshold (100 problem count)
>> May 31 09:08:05 shr295 corosync[3542]:   [MAIN  ] Completed service synchronization, ready to provide service.
>> May 31 09:08:05 shr295 corosync[3542]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Can't get updated config version 4: New configuration version has to be newer than current running configuration#012.
>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Activity suspended on this node
>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Error reloading the configuration, will retry every second
>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Node 1 conflict, remote config version id=4, local=2
>> [...] shr295 corosync[3542]:   [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Can't get updated config version 4: New configuration version has to be newer than current running configuration#012.
>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Activity suspended on this node
>>
>
> Run 'cman_tool version' to get the current version of the configuration,
> then increase the config_version="x" to be one higher.
>
> Also, configure fencing! If you don't, your cluster will hang the first
> time anything goes wrong.
>
> --
> Digimer
> Papers and Projects: https://alteeve.com

--
Digimer
Papers and Projects: https://alteeve.com
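
The advice quoted in this thread (run 'cman_tool version', then raise config_version to one higher) could be sketched as a small shell helper. This is only an illustration: bump_config_version is a made-up name, the running version is typed in by hand rather than parsed from any tool's output, and the helper writes to a separate file so the original cluster.conf is left untouched.

```shell
#!/bin/sh
# Hypothetical helper: write a copy of a cluster.conf whose config_version is
# one higher than the version the running cluster reports.
# $1 = source cluster.conf
# $2 = destination file (must differ from the source)
# $3 = running version, read by hand from 'cman_tool version'
# Prints the new version number on success.
bump_config_version() {
    src=$1
    dst=$2
    running=$3
    # Reject anything that is not a plain non-negative integer.
    case "$running" in
        ''|*[!0-9]*)
            echo "running version must be a non-negative integer" >&2
            return 1
            ;;
    esac
    next=$((running + 1))
    # Replace only the number inside config_version="..." on the <cluster> element.
    sed "s/\(<cluster[^>]*config_version=\"\)[0-9][0-9]*\"/\1$next\"/" "$src" > "$dst"
    echo "$next"
}
```

For the case in this thread, where the log shows a remote version of 4, bump_config_version cluster.conf cluster.conf.new 4 would produce a file with config_version="5". The new file would then need to be identical on shr289 and shr295; how it is copied between the nodes is not covered in the thread.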