From ming-ming.chen at hp.com  Fri Jun  1 00:12:31 2012
From: ming-ming.chen at hp.com (Chen, Ming Ming)
Date: Fri, 1 Jun 2012 00:12:31 +0000
Subject: [Linux-cluster] Help needed
In-Reply-To: <4FC7A6B5.30305@alteeve.ca>
References: <CAKrd530bTYqYSxZBf-aVjEjVHhWu-_KgP_9VjmTy2e913eD8hQ@mail.gmail.com>
	<1D241511770E2F4BA89AFD224EDD527117B82078@G9W0737.americas.hpqcorp.net>
	<1D241511770E2F4BA89AFD224EDD527117B90213@G9W0737.americas.hpqcorp.net>
	<CAE7pJ3C_V4Qkrs44CMtjXMZLoVrwGOHQh9NmHu6NjwZgxs7Gaw@mail.gmail.com>
	<4F7FAF45.8070104@alteeve.ca>
	<1D241511770E2F4BA89AFD224EDD527117B904A3@G9W0737.americas.hpqcorp.net>
	<1D241511770E2F4BA89AFD224EDD52712A9ED63F@G9W0737.americas.hpqcorp.net>
	<4FC053A1.8070407@alteeve.ca>
	<1D241511770E2F4BA89AFD224EDD52712A9EE4F0@G9W0737.americas.hpqcorp.net>
	<4FC7A6B5.30305@alteeve.ca>
Message-ID: <1D241511770E2F4BA89AFD224EDD52712A9EE73C@G9W0737.americas.hpqcorp.net>

Hi Digimer,
Thanks for your comment. I've got rid of the first problem, and now I have the following messages. Any idea?
Thanks in advance.
Ming

[root at shr295 ~]# tail -f /var/log/messages
May 31 16:56:01 shr295 dlm_controld[3375]: dlm_controld 3.0.12.1 started
May 31 16:56:11 shr295 fenced[3353]: daemon cpg_join error retrying
May 31 16:56:12 shr295 gfs_controld[3447]: gfs_controld 3.0.12.1 started
May 31 16:56:12 shr295 dlm_controld[3375]: daemon cpg_join error retrying
May 31 16:56:21 shr295 fenced[3353]: daemon cpg_join error retrying
May 31 16:56:22 shr295 dlm_controld[3375]: daemon cpg_join error retrying
May 31 16:56:22 shr295 gfs_controld[3447]: daemon cpg_join error retrying
May 31 16:56:31 shr295 fenced[3353]: daemon cpg_join error retrying
May 31 16:56:32 shr295 dlm_controld[3375]: daemon cpg_join error retrying
May 31 16:56:32 shr295 gfs_controld[3447]: daemon cpg_join error retrying
May 31 16:56:41 shr295 fenced[3353]: daemon cpg_join error retrying
May 31 16:56:42 shr295 dlm_controld[3375]: daemon cpg_join error retrying
May 31 16:56:42 shr295 gfs_controld[3447]: daemon cpg_join error retrying

-----Original Message-----
From: Digimer [mailto:lists at alteeve.ca]
Sent: Thursday, May 31, 2012 10:13 AM
To: Chen, Ming Ming
Cc: linux clustering
Subject: Re: [Linux-cluster] Help needed

On 05/31/2012 12:27 PM, Chen, Ming Ming wrote:
>  Hi, I have the following simple cluster config just to try out on SertOS 6.2
>
> <?xml version="1.0"?>
> <cluster config_version="2" name="vmcluster">
>       <logging debug="on"/>
>       <cman expected_votes="1" two_node="1"/>
>       <clusternodes>
>             <clusternode name="shr289.cup.hp.com" nodeid="1">
>                   <fence>
>                   </fence>
>             </clusternode>
>             <clusternode name="shr295.cup.hp.com" nodeid="2">
>                   <fence>
>                   </fence>
>             </clusternode>
>       </clusternodes>
>       <fencedevices>
>       </fencedevices>
>       <rm>
>       </rm>
> </cluster>
>
>
> And I got the following error message when I did "service cman start" I got the same messages on both nodes.
> Any help will be appreciated.
>
> May 31 09:08:04 corosync [TOTEM ] RRP multicast threshold (100 problem count)
> May 31 09:08:05 shr295 corosync[3542]:   [MAIN  ] Completed service synchronizat
> ion, ready to provide service.
> May 31 09:08:05 shr295 corosync[3542]:   [TOTEM ] A processor joined or left the
>  membership and a new membership was formed.
> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Unable to load new config in c
> orosync: New configuration version has to be newer than current running configur
> ation
> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Can't get updated config versi
> on 4: New configuration version has to be newer than current running configurati
> on#012.
> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Activity suspended on this nod
> e
> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Error reloading the configurat
> ion, will retry every second
> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Node 1 conflict, remote config
>  version id=4, local=2
> -- VISUAL BLOCK --r295 corosync[3542]:   [CMAN  ] Unable to load new config in c
> orosync: New configuration version has to be newer than current running configur
> ation
> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Can't get updated config versi
> on 4: New configuration version has to be newer than current running configurati
> on#012.
> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Activity suspended on this nod
> E
>

Run 'cman_tool version' to get the current version of the configuration,
then increase the config_version="x" to be one higher.

Also, configure fencing! If you don't, your cluster will hang the first
time anything goes wrong.

--
Digimer
Papers and Projects: https://alteeve.com



From lists at alteeve.ca  Fri Jun  1 02:05:17 2012
From: lists at alteeve.ca (Digimer)
Date: Thu, 31 May 2012 22:05:17 -0400
Subject: [Linux-cluster] Help needed
In-Reply-To: <1D241511770E2F4BA89AFD224EDD52712A9EE73C@G9W0737.americas.hpqcorp.net>
References: <CAKrd530bTYqYSxZBf-aVjEjVHhWu-_KgP_9VjmTy2e913eD8hQ@mail.gmail.com>
	<1D241511770E2F4BA89AFD224EDD527117B82078@G9W0737.americas.hpqcorp.net>
	<1D241511770E2F4BA89AFD224EDD527117B90213@G9W0737.americas.hpqcorp.net>
	<CAE7pJ3C_V4Qkrs44CMtjXMZLoVrwGOHQh9NmHu6NjwZgxs7Gaw@mail.gmail.com>
	<4F7FAF45.8070104@alteeve.ca>
	<1D241511770E2F4BA89AFD224EDD527117B904A3@G9W0737.americas.hpqcorp.net>
	<1D241511770E2F4BA89AFD224EDD52712A9ED63F@G9W0737.americas.hpqcorp.net>
	<4FC053A1.8070407@alteeve.ca>
	<1D241511770E2F4BA89AFD224EDD52712A9EE4F0@G9W0737.americas.hpqcorp.net>
	<4FC7A6B5.30305@alteeve.ca>
	<1D241511770E2F4BA89AFD224EDD52712A9EE73C@G9W0737.americas.hpqcorp.net>
Message-ID: <4FC8235D.6050206@alteeve.ca>

Send your cluster.conf please, editing only password please. Please also
include you network configs.

On 05/31/2012 08:12 PM, Chen, Ming Ming wrote:
> Hi Digimer,
> Thanks for your comment. I've got rid of the first problem, and now I have the following messages. Any idea?
> Thanks in advance.
> Ming
> 
> [root at shr295 ~]# tail -f /var/log/messages
> May 31 16:56:01 shr295 dlm_controld[3375]: dlm_controld 3.0.12.1 started
> May 31 16:56:11 shr295 fenced[3353]: daemon cpg_join error retrying
> May 31 16:56:12 shr295 gfs_controld[3447]: gfs_controld 3.0.12.1 started
> May 31 16:56:12 shr295 dlm_controld[3375]: daemon cpg_join error retrying
> May 31 16:56:21 shr295 fenced[3353]: daemon cpg_join error retrying
> May 31 16:56:22 shr295 dlm_controld[3375]: daemon cpg_join error retrying
> May 31 16:56:22 shr295 gfs_controld[3447]: daemon cpg_join error retrying
> May 31 16:56:31 shr295 fenced[3353]: daemon cpg_join error retrying
> May 31 16:56:32 shr295 dlm_controld[3375]: daemon cpg_join error retrying
> May 31 16:56:32 shr295 gfs_controld[3447]: daemon cpg_join error retrying
> May 31 16:56:41 shr295 fenced[3353]: daemon cpg_join error retrying
> May 31 16:56:42 shr295 dlm_controld[3375]: daemon cpg_join error retrying
> May 31 16:56:42 shr295 gfs_controld[3447]: daemon cpg_join error retrying
> 
> -----Original Message-----
> From: Digimer [mailto:lists at alteeve.ca]
> Sent: Thursday, May 31, 2012 10:13 AM
> To: Chen, Ming Ming
> Cc: linux clustering
> Subject: Re: [Linux-cluster] Help needed
> 
> On 05/31/2012 12:27 PM, Chen, Ming Ming wrote:
>>  Hi, I have the following simple cluster config just to try out on SertOS 6.2
>>
>> <?xml version="1.0"?>
>> <cluster config_version="2" name="vmcluster">
>>       <logging debug="on"/>
>>       <cman expected_votes="1" two_node="1"/>
>>       <clusternodes>
>>             <clusternode name="shr289.cup.hp.com" nodeid="1">
>>                   <fence>
>>                   </fence>
>>             </clusternode>
>>             <clusternode name="shr295.cup.hp.com" nodeid="2">
>>                   <fence>
>>                   </fence>
>>             </clusternode>
>>       </clusternodes>
>>       <fencedevices>
>>       </fencedevices>
>>       <rm>
>>       </rm>
>> </cluster>
>>
>>
>> And I got the following error message when I did "service cman start" I got the same messages on both nodes.
>> Any help will be appreciated.
>>
>> May 31 09:08:04 corosync [TOTEM ] RRP multicast threshold (100 problem count)
>> May 31 09:08:05 shr295 corosync[3542]:   [MAIN  ] Completed service synchronizat
>> ion, ready to provide service.
>> May 31 09:08:05 shr295 corosync[3542]:   [TOTEM ] A processor joined or left the
>>  membership and a new membership was formed.
>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Unable to load new config in c
>> orosync: New configuration version has to be newer than current running configur
>> ation
>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Can't get updated config versi
>> on 4: New configuration version has to be newer than current running configurati
>> on#012.
>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Activity suspended on this nod
>> e
>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Error reloading the configurat
>> ion, will retry every second
>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Node 1 conflict, remote config
>>  version id=4, local=2
>> -- VISUAL BLOCK --r295 corosync[3542]:   [CMAN  ] Unable to load new config in c
>> orosync: New configuration version has to be newer than current running configur
>> ation
>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Can't get updated config versi
>> on 4: New configuration version has to be newer than current running configurati
>> on#012.
>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Activity suspended on this nod
>> E
>>
> 
> Run 'cman_tool version' to get the current version of the configuration,
> then increase the config_version="x" to be one higher.
> 
> Also, configure fencing! If you don't, your cluster will hang the first
> time anything goes wrong.
> 
> --
> Digimer
> Papers and Projects: https://alteeve.com


-- 
Digimer
Papers and Projects: https://alteeve.com



From lists at alteeve.ca  Fri Jun  1 04:11:28 2012
From: lists at alteeve.ca (Digimer)
Date: Fri, 01 Jun 2012 00:11:28 -0400
Subject: [Linux-cluster] cluster 3.1.91 released (Test Release 2)
Message-ID: <4FC840F0.7080102@alteeve.ca>

Welcome to the cluster 3.1.91 (Test Release 2) release.

This 3.1.91 release is the second Test Release for the coming version
3.2. The release includes several bug fixes. One new feature is
'cpglockd' support in rgmanager. rgmanager will use cpglockd
automatically when corosync/cman are configured in RRP mode. Users of
cluster 3.1.90 are recommended to upgrade to this release.

The new source tarball can be downloaded here:

https://fedorahosted.org/releases/c/l/cluster/cluster-3.1.91.tar.xz

ChangeLog:

https://fedorahosted.org/releases/c/l/cluster/Changelog-3.1.91

To report bugs or issues:

   https://bugzilla.redhat.com/

Would you like to meet the cluster team or members of its community?

   Join us on IRC (irc.freenode.net #linux-cluster) and share your
   experience with other sysadmins and users.

Thanks/congratulations to all people that contributed to this release!

Happy clustering,
digimer

-- 
Digimer
Papers and Projects: https://alteeve.com



From ming-ming.chen at hp.com  Fri Jun  1 17:53:07 2012
From: ming-ming.chen at hp.com (Chen, Ming Ming)
Date: Fri, 1 Jun 2012 17:53:07 +0000
Subject: [Linux-cluster] Help needed
In-Reply-To: <4FC8235D.6050206@alteeve.ca>
References: <CAKrd530bTYqYSxZBf-aVjEjVHhWu-_KgP_9VjmTy2e913eD8hQ@mail.gmail.com>
	<1D241511770E2F4BA89AFD224EDD527117B82078@G9W0737.americas.hpqcorp.net>
	<1D241511770E2F4BA89AFD224EDD527117B90213@G9W0737.americas.hpqcorp.net>
	<CAE7pJ3C_V4Qkrs44CMtjXMZLoVrwGOHQh9NmHu6NjwZgxs7Gaw@mail.gmail.com>
	<4F7FAF45.8070104@alteeve.ca>
	<1D241511770E2F4BA89AFD224EDD527117B904A3@G9W0737.americas.hpqcorp.net>
	<1D241511770E2F4BA89AFD224EDD52712A9ED63F@G9W0737.americas.hpqcorp.net>
	<4FC053A1.8070407@alteeve.ca>
	<1D241511770E2F4BA89AFD224EDD52712A9EE4F0@G9W0737.americas.hpqcorp.net>
	<4FC7A6B5.30305@alteeve.ca>
	<1D241511770E2F4BA89AFD224EDD52712A9EE73C@G9W0737.americas.hpqcorp.net>
	<4FC8235D.6050206@alteeve.ca>
Message-ID: <1D241511770E2F4BA89AFD224EDD52712A9EE90C@G9W0737.americas.hpqcorp.net>

Thanks for returning my email. The cluster configuration file and network configuration. Also one bad news is that the original issues come back again.
So I've see two problems, and both problems will come sporatically:
Thanks again for your help.
Regards
Ming

1. The original one. I've increased the version number, and it was gone for a while, but come back. Do you know why?

   May 31 09:08:05 shr295 corosync[3542]:   [MAIN  ] Completed service synchronizat
>> ion, ready to provide service.
>> May 31 09:08:05 shr295 corosync[3542]:   [TOTEM ] A processor joined or left the
>>  membership and a new membership was formed.
>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Unable to load new config in c
>> orosync: New configuration version has to be newer than current running configur
>> ation
>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Can't get updated config versi
>> on 4: New configuration version has to be newer than current running configurati
>> on#012.
>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Activity suspended on this nod
>> e
>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Error reloading the configurat
>> ion, will retry every second
>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Node 1 conflict, remote config
>>  version id=4, local=2
>> -- VISUAL BLOCK --r295 corosync[3542]:   [CMAN  ] Unable to load new config in c
>> orosync: New configuration version has to be newer than current running configur
>> ation
>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Can't get updated config versi
>> on 4: New configuration version has to be newer than current running configurati
>> on#012.

2. > [root at shr295 ~]# tail -f /var/log/messages
> May 31 16:56:01 shr295 dlm_controld[3375]: dlm_controld 3.0.12.1 started
> May 31 16:56:11 shr295 fenced[3353]: daemon cpg_join error retrying
> May 31 16:56:12 shr295 gfs_controld[3447]: gfs_controld 3.0.12.1 started
> May 31 16:56:12 shr295 dlm_controld[3375]: daemon cpg_join error retrying
> May 31 16:56:21 shr295 fenced[3353]: daemon cpg_join error retrying
> May 31 16:56:22 shr295 dlm_controld[3375]: daemon cpg_join error retrying
> May 31 16:56:22 shr295 gfs_controld[3447]: daemon cpg_join error retrying
> May 31 16:56:31 shr295 fenced[3353]: daemon cpg_join error retrying
> May 31 16:56:32 shr295 dlm_controld[3375]: daemon cpg_join error retrying
> May 31 16:56:32 shr295 gfs_controld[3447]: daemon cpg_join error retrying
> May 31 16:56:41 shr295 fenced[3353]: daemon cpg_join error retrying
> May 31 16:56:42 shr295 dlm_controld[3375]: daemon cpg_join error retrying
> May 31 16:56:42 shr295 gfs_controld[3447]: daemon cpg_join error retrying

Cluster configuration File:
>> <?xml version="1.0"?>
>> <cluster config_version="2" name="vmcluster">
>>       <logging debug="on"/>
>>       <cman expected_votes="1" two_node="1"/>
>>       <clusternodes>
>>             <clusternode name="shr289.cup.hp.com" nodeid="1">
>>                   <fence>
>>                   </fence>
>>             </clusternode>
>>             <clusternode name="shr295.cup.hp.com" nodeid="2">
>>                   <fence>
>>                   </fence>
>>             </clusternode>
>>       </clusternodes>
>>       <fencedevices>
>>       </fencedevices>
>>       <rm>
>>       </rm>
>> </cluster>

I had a fencing configuration there, but I'd like to see that I can bring up a simple cluster first, then will add the fencing there.

The network configuration:
eth1      Link encap:Ethernet  HWaddr 00:23:7D:36:05:20
          inet addr:16.89.112.182  Bcast:16.89.119.255  Mask:255.255.248.0
          inet6 addr: fe80::223:7dff:fe36:520/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1210316 errors:0 dropped:0 overruns:0 frame:0
          TX packets:73158 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:150775766 (143.7 MiB)  TX bytes:11749950 (11.2 MiB)
          Interrupt:16 Memory:f6000000-f6012800

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:291 errors:0 dropped:0 overruns:0 frame:0
          TX packets:291 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:38225 (37.3 KiB)  TX bytes:38225 (37.3 KiB)

virbr0    Link encap:Ethernet  HWaddr 52:54:00:30:33:BD
          inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:488 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:25273 (24.6 KiB)


-----Original Message-----
From: Digimer [mailto:lists at alteeve.ca]
Sent: Thursday, May 31, 2012 7:05 PM
To: Chen, Ming Ming
Cc: linux clustering
Subject: Re: [Linux-cluster] Help needed

Send your cluster.conf please, editing only password please. Please also
include you network configs.

On 05/31/2012 08:12 PM, Chen, Ming Ming wrote:
> Hi Digimer,
> Thanks for your comment. I've got rid of the first problem, and now I have the following messages. Any idea?
> Thanks in advance.
> Ming
>
> [root at shr295 ~]# tail -f /var/log/messages
> May 31 16:56:01 shr295 dlm_controld[3375]: dlm_controld 3.0.12.1 started
> May 31 16:56:11 shr295 fenced[3353]: daemon cpg_join error retrying
> May 31 16:56:12 shr295 gfs_controld[3447]: gfs_controld 3.0.12.1 started
> May 31 16:56:12 shr295 dlm_controld[3375]: daemon cpg_join error retrying
> May 31 16:56:21 shr295 fenced[3353]: daemon cpg_join error retrying
> May 31 16:56:22 shr295 dlm_controld[3375]: daemon cpg_join error retrying
> May 31 16:56:22 shr295 gfs_controld[3447]: daemon cpg_join error retrying
> May 31 16:56:31 shr295 fenced[3353]: daemon cpg_join error retrying
> May 31 16:56:32 shr295 dlm_controld[3375]: daemon cpg_join error retrying
> May 31 16:56:32 shr295 gfs_controld[3447]: daemon cpg_join error retrying
> May 31 16:56:41 shr295 fenced[3353]: daemon cpg_join error retrying
> May 31 16:56:42 shr295 dlm_controld[3375]: daemon cpg_join error retrying
> May 31 16:56:42 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>
> -----Original Message-----
> From: Digimer [mailto:lists at alteeve.ca]
> Sent: Thursday, May 31, 2012 10:13 AM
> To: Chen, Ming Ming
> Cc: linux clustering
> Subject: Re: [Linux-cluster] Help needed
>
> On 05/31/2012 12:27 PM, Chen, Ming Ming wrote:
>>  Hi, I have the following simple cluster config just to try out on SertOS 6.2
>>
>> <?xml version="1.0"?>
>> <cluster config_version="2" name="vmcluster">
>>       <logging debug="on"/>
>>       <cman expected_votes="1" two_node="1"/>
>>       <clusternodes>
>>             <clusternode name="shr289.cup.hp.com" nodeid="1">
>>                   <fence>
>>                   </fence>
>>             </clusternode>
>>             <clusternode name="shr295.cup.hp.com" nodeid="2">
>>                   <fence>
>>                   </fence>
>>             </clusternode>
>>       </clusternodes>
>>       <fencedevices>
>>       </fencedevices>
>>       <rm>
>>       </rm>
>> </cluster>
>>
>>
>> And I got the following error message when I did "service cman start" I got the same messages on both nodes.
>> Any help will be appreciated.
>>
>> May 31 09:08:04 corosync [TOTEM ] RRP multicast threshold (100 problem count)
>> May 31 09:08:05 shr295 corosync[3542]:   [MAIN  ] Completed service synchronizat
>> ion, ready to provide service.
>> May 31 09:08:05 shr295 corosync[3542]:   [TOTEM ] A processor joined or left the
>>  membership and a new membership was formed.
>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Unable to load new config in c
>> orosync: New configuration version has to be newer than current running configur
>> ation
>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Can't get updated config versi
>> on 4: New configuration version has to be newer than current running configurati
>> on#012.
>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Activity suspended on this nod
>> e
>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Error reloading the configurat
>> ion, will retry every second
>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Node 1 conflict, remote config
>>  version id=4, local=2
>> -- VISUAL BLOCK --r295 corosync[3542]:   [CMAN  ] Unable to load new config in c
>> orosync: New configuration version has to be newer than current running configur
>> ation
>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Can't get updated config versi
>> on 4: New configuration version has to be newer than current running configurati
>> on#012.
>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Activity suspended on this nod
>> E
>>
>
> Run 'cman_tool version' to get the current version of the configuration,
> then increase the config_version="x" to be one higher.
>
> Also, configure fencing! If you don't, your cluster will hang the first
> time anything goes wrong.
>
> --
> Digimer
> Papers and Projects: https://alteeve.com


--
Digimer
Papers and Projects: https://alteeve.com



From lists at alteeve.ca  Fri Jun  1 18:43:55 2012
From: lists at alteeve.ca (Digimer)
Date: Fri, 01 Jun 2012 14:43:55 -0400
Subject: [Linux-cluster] Help needed
In-Reply-To: <1D241511770E2F4BA89AFD224EDD52712A9EE90C@G9W0737.americas.hpqcorp.net>
References: <CAKrd530bTYqYSxZBf-aVjEjVHhWu-_KgP_9VjmTy2e913eD8hQ@mail.gmail.com>
	<1D241511770E2F4BA89AFD224EDD527117B82078@G9W0737.americas.hpqcorp.net>
	<1D241511770E2F4BA89AFD224EDD527117B90213@G9W0737.americas.hpqcorp.net>
	<CAE7pJ3C_V4Qkrs44CMtjXMZLoVrwGOHQh9NmHu6NjwZgxs7Gaw@mail.gmail.com>
	<4F7FAF45.8070104@alteeve.ca>
	<1D241511770E2F4BA89AFD224EDD527117B904A3@G9W0737.americas.hpqcorp.net>
	<1D241511770E2F4BA89AFD224EDD52712A9ED63F@G9W0737.americas.hpqcorp.net>
	<4FC053A1.8070407@alteeve.ca>
	<1D241511770E2F4BA89AFD224EDD52712A9EE4F0@G9W0737.americas.hpqcorp.net>
	<4FC7A6B5.30305@alteeve.ca>
	<1D241511770E2F4BA89AFD224EDD52712A9EE73C@G9W0737.americas.hpqcorp.net>
	<4FC8235D.6050206@alteeve.ca>
	<1D241511770E2F4BA89AFD224EDD52712A9EE90C@G9W0737.americas.hpqcorp.net>
Message-ID: <4FC90D6B.90900@alteeve.ca>

What does 'shr289.cup.hp.com' and 'shr295.cup.hp.com' resolve to? Does
your switch support multicast properly? If the switch periodically tears
down a multicast group, your cluster will partition.

You *must* have fencing configured. Fencing using iLO works fine, please
use it. See
https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Example_.3Cfencedevice....3E_Tag_For_HP_iLO
Without fencing, you cluster will be unstable.

Digimer

On 06/01/2012 01:53 PM, Chen, Ming Ming wrote:
> Thanks for returning my email. The cluster configuration file and network configuration. Also one bad news is that the original issues come back again.
> So I've see two problems, and both problems will come sporatically:
> Thanks again for your help.
> Regards
> Ming
> 
> 1. The original one. I've increased the version number, and it was gone for a while, but come back. Do you know why?
> 
>    May 31 09:08:05 shr295 corosync[3542]:   [MAIN  ] Completed service synchronizat
>>> ion, ready to provide service.
>>> May 31 09:08:05 shr295 corosync[3542]:   [TOTEM ] A processor joined or left the
>>>  membership and a new membership was formed.
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Unable to load new config in c
>>> orosync: New configuration version has to be newer than current running configur
>>> ation
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Can't get updated config versi
>>> on 4: New configuration version has to be newer than current running configurati
>>> on#012.
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Activity suspended on this nod
>>> e
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Error reloading the configurat
>>> ion, will retry every second
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Node 1 conflict, remote config
>>>  version id=4, local=2
>>> -- VISUAL BLOCK --r295 corosync[3542]:   [CMAN  ] Unable to load new config in c
>>> orosync: New configuration version has to be newer than current running configur
>>> ation
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Can't get updated config versi
>>> on 4: New configuration version has to be newer than current running configurati
>>> on#012.
> 
> 2. > [root at shr295 ~]# tail -f /var/log/messages
>> May 31 16:56:01 shr295 dlm_controld[3375]: dlm_controld 3.0.12.1 started
>> May 31 16:56:11 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:12 shr295 gfs_controld[3447]: gfs_controld 3.0.12.1 started
>> May 31 16:56:12 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:21 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:22 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:22 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>> May 31 16:56:31 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:32 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:32 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>> May 31 16:56:41 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:42 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:42 shr295 gfs_controld[3447]: daemon cpg_join error retrying
> 
> Cluster configuration File:
>>> <?xml version="1.0"?>
>>> <cluster config_version="2" name="vmcluster">
>>>       <logging debug="on"/>
>>>       <cman expected_votes="1" two_node="1"/>
>>>       <clusternodes>
>>>             <clusternode name="shr289.cup.hp.com" nodeid="1">
>>>                   <fence>
>>>                   </fence>
>>>             </clusternode>
>>>             <clusternode name="shr295.cup.hp.com" nodeid="2">
>>>                   <fence>
>>>                   </fence>
>>>             </clusternode>
>>>       </clusternodes>
>>>       <fencedevices>
>>>       </fencedevices>
>>>       <rm>
>>>       </rm>
>>> </cluster>
> 
> I had a fencing configuration there, but I'd like to see that I can bring up a simple cluster first, then will add the fencing there.
> 
> The network configuration:
> eth1      Link encap:Ethernet  HWaddr 00:23:7D:36:05:20
>           inet addr:16.89.112.182  Bcast:16.89.119.255  Mask:255.255.248.0
>           inet6 addr: fe80::223:7dff:fe36:520/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:1210316 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:73158 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:150775766 (143.7 MiB)  TX bytes:11749950 (11.2 MiB)
>           Interrupt:16 Memory:f6000000-f6012800
> 
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:291 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:291 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:38225 (37.3 KiB)  TX bytes:38225 (37.3 KiB)
> 
> virbr0    Link encap:Ethernet  HWaddr 52:54:00:30:33:BD
>           inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:488 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:0 (0.0 b)  TX bytes:25273 (24.6 KiB)
> 
> 
> -----Original Message-----
> From: Digimer [mailto:lists at alteeve.ca]
> Sent: Thursday, May 31, 2012 7:05 PM
> To: Chen, Ming Ming
> Cc: linux clustering
> Subject: Re: [Linux-cluster] Help needed
> 
> Send your cluster.conf please, editing only password please. Please also
> include you network configs.
> 
> On 05/31/2012 08:12 PM, Chen, Ming Ming wrote:
>> Hi Digimer,
>> Thanks for your comment. I've got rid of the first problem, and now I have the following messages. Any idea?
>> Thanks in advance.
>> Ming
>>
>> [root at shr295 ~]# tail -f /var/log/messages
>> May 31 16:56:01 shr295 dlm_controld[3375]: dlm_controld 3.0.12.1 started
>> May 31 16:56:11 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:12 shr295 gfs_controld[3447]: gfs_controld 3.0.12.1 started
>> May 31 16:56:12 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:21 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:22 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:22 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>> May 31 16:56:31 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:32 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:32 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>> May 31 16:56:41 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:42 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:42 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>>
>> -----Original Message-----
>> From: Digimer [mailto:lists at alteeve.ca]
>> Sent: Thursday, May 31, 2012 10:13 AM
>> To: Chen, Ming Ming
>> Cc: linux clustering
>> Subject: Re: [Linux-cluster] Help needed
>>
>> On 05/31/2012 12:27 PM, Chen, Ming Ming wrote:
>>>  Hi, I have the following simple cluster config just to try out on SertOS 6.2
>>>
>>> <?xml version="1.0"?>
>>> <cluster config_version="2" name="vmcluster">
>>>       <logging debug="on"/>
>>>       <cman expected_votes="1" two_node="1"/>
>>>       <clusternodes>
>>>             <clusternode name="shr289.cup.hp.com" nodeid="1">
>>>                   <fence>
>>>                   </fence>
>>>             </clusternode>
>>>             <clusternode name="shr295.cup.hp.com" nodeid="2">
>>>                   <fence>
>>>                   </fence>
>>>             </clusternode>
>>>       </clusternodes>
>>>       <fencedevices>
>>>       </fencedevices>
>>>       <rm>
>>>       </rm>
>>> </cluster>
>>>
>>>
>>> And I got the following error message when I did "service cman start" I got the same messages on both nodes.
>>> Any help will be appreciated.
>>>
>>> May 31 09:08:04 corosync [TOTEM ] RRP multicast threshold (100 problem count)
>>> May 31 09:08:05 shr295 corosync[3542]:   [MAIN  ] Completed service synchronizat
>>> ion, ready to provide service.
>>> May 31 09:08:05 shr295 corosync[3542]:   [TOTEM ] A processor joined or left the
>>>  membership and a new membership was formed.
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Unable to load new config in c
>>> orosync: New configuration version has to be newer than current running configur
>>> ation
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Can't get updated config versi
>>> on 4: New configuration version has to be newer than current running configurati
>>> on#012.
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Activity suspended on this nod
>>> e
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Error reloading the configurat
>>> ion, will retry every second
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Node 1 conflict, remote config
>>>  version id=4, local=2
>>> -- VISUAL BLOCK --r295 corosync[3542]:   [CMAN  ] Unable to load new config in c
>>> orosync: New configuration version has to be newer than current running configur
>>> ation
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Can't get updated config versi
>>> on 4: New configuration version has to be newer than current running configurati
>>> on#012.
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Activity suspended on this nod
>>> E
>>>
>>
>> Run 'cman_tool version' to get the current version of the configuration,
>> then increase the config_version="x" to be one higher.
>>
>> Also, configure fencing! If you don't, your cluster will hang the first
>> time anything goes wrong.
>>
>> --
>> Digimer
>> Papers and Projects: https://alteeve.com
> 
> 
> --
> Digimer
> Papers and Projects: https://alteeve.com


-- 
Digimer
Papers and Projects: https://alteeve.com



From ming-ming.chen at hp.com  Fri Jun  1 21:04:49 2012
From: ming-ming.chen at hp.com (Chen, Ming Ming)
Date: Fri, 1 Jun 2012 21:04:49 +0000
Subject: [Linux-cluster] Help needed
In-Reply-To: <4FC90D6B.90900@alteeve.ca>
References: <CAKrd530bTYqYSxZBf-aVjEjVHhWu-_KgP_9VjmTy2e913eD8hQ@mail.gmail.com>
	<1D241511770E2F4BA89AFD224EDD527117B82078@G9W0737.americas.hpqcorp.net>
	<1D241511770E2F4BA89AFD224EDD527117B90213@G9W0737.americas.hpqcorp.net>
	<CAE7pJ3C_V4Qkrs44CMtjXMZLoVrwGOHQh9NmHu6NjwZgxs7Gaw@mail.gmail.com>
	<4F7FAF45.8070104@alteeve.ca>
	<1D241511770E2F4BA89AFD224EDD527117B904A3@G9W0737.americas.hpqcorp.net>
	<1D241511770E2F4BA89AFD224EDD52712A9ED63F@G9W0737.americas.hpqcorp.net>
	<4FC053A1.8070407@alteeve.ca>
	<1D241511770E2F4BA89AFD224EDD52712A9EE4F0@G9W0737.americas.hpqcorp.net>
	<4FC7A6B5.30305@alteeve.ca>
	<1D241511770E2F4BA89AFD224EDD52712A9EE73C@G9W0737.americas.hpqcorp.net>
	<4FC8235D.6050206@alteeve.ca>
	<1D241511770E2F4BA89AFD224EDD52712A9EE90C@G9W0737.americas.hpqcorp.net>
	<4FC90D6B.90900@alteeve.ca>
Message-ID: <1D241511770E2F4BA89AFD224EDD52712A9EEA1D@G9W0737.americas.hpqcorp.net>

Shr289.cup.hp.com resolves to 16.89.116.32
Shr295.cup.hp.com resolves to 16.89.112.182
I would assume that our switches should support multicast, since we have another cluster RH6.2 which runs OK using the same switch.
Also I'll put the fencing in the cluster conf to try it again.
Thanks
Ming


-----Original Message-----
From: Digimer [mailto:lists at alteeve.ca]
Sent: Friday, June 01, 2012 11:44 AM
To: Chen, Ming Ming
Cc: linux clustering
Subject: Re: [Linux-cluster] Help needed

What does 'shr289.cup.hp.com' and 'shr295.cup.hp.com' resolve to? Does
your switch support multicast properly? If the switch periodically tears
down a multicast group, your cluster will partition.

You *must* have fencing configured. Fencing using iLO works fine, please
use it. See
https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Example_.3Cfencedevice....3E_Tag_For_HP_iLO
Without fencing, you cluster will be unstable.

Digimer

On 06/01/2012 01:53 PM, Chen, Ming Ming wrote:
> Thanks for returning my email. The cluster configuration file and network configuration. Also one bad news is that the original issues come back again.
> So I've see two problems, and both problems will come sporatically:
> Thanks again for your help.
> Regards
> Ming
>
> 1. The original one. I've increased the version number, and it was gone for a while, but come back. Do you know why?
>
>    May 31 09:08:05 shr295 corosync[3542]:   [MAIN  ] Completed service synchronizat
>>> ion, ready to provide service.
>>> May 31 09:08:05 shr295 corosync[3542]:   [TOTEM ] A processor joined or left the
>>>  membership and a new membership was formed.
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Unable to load new config in c
>>> orosync: New configuration version has to be newer than current running configur
>>> ation
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Can't get updated config versi
>>> on 4: New configuration version has to be newer than current running configurati
>>> on#012.
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Activity suspended on this nod
>>> e
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Error reloading the configurat
>>> ion, will retry every second
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Node 1 conflict, remote config
>>>  version id=4, local=2
>>> -- VISUAL BLOCK --r295 corosync[3542]:   [CMAN  ] Unable to load new config in c
>>> orosync: New configuration version has to be newer than current running configur
>>> ation
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Can't get updated config versi
>>> on 4: New configuration version has to be newer than current running configurati
>>> on#012.
>
> 2. > [root at shr295 ~]# tail -f /var/log/messages
>> May 31 16:56:01 shr295 dlm_controld[3375]: dlm_controld 3.0.12.1 started
>> May 31 16:56:11 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:12 shr295 gfs_controld[3447]: gfs_controld 3.0.12.1 started
>> May 31 16:56:12 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:21 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:22 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:22 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>> May 31 16:56:31 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:32 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:32 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>> May 31 16:56:41 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:42 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:42 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>
> Cluster configuration File:
>>> <?xml version="1.0"?>
>>> <cluster config_version="2" name="vmcluster">
>>>       <logging debug="on"/>
>>>       <cman expected_votes="1" two_node="1"/>
>>>       <clusternodes>
>>>             <clusternode name="shr289.cup.hp.com" nodeid="1">
>>>                   <fence>
>>>                   </fence>
>>>             </clusternode>
>>>             <clusternode name="shr295.cup.hp.com" nodeid="2">
>>>                   <fence>
>>>                   </fence>
>>>             </clusternode>
>>>       </clusternodes>
>>>       <fencedevices>
>>>       </fencedevices>
>>>       <rm>
>>>       </rm>
>>> </cluster>
>
> I had a fencing configuration there, but I'd like to see that I can bring up a simple cluster first, then will add the fencing there.
>
> The network configuration:
> eth1      Link encap:Ethernet  HWaddr 00:23:7D:36:05:20
>           inet addr:16.89.112.182  Bcast:16.89.119.255  Mask:255.255.248.0
>           inet6 addr: fe80::223:7dff:fe36:520/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:1210316 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:73158 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:150775766 (143.7 MiB)  TX bytes:11749950 (11.2 MiB)
>           Interrupt:16 Memory:f6000000-f6012800
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:291 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:291 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:38225 (37.3 KiB)  TX bytes:38225 (37.3 KiB)
>
> virbr0    Link encap:Ethernet  HWaddr 52:54:00:30:33:BD
>           inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:488 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:0 (0.0 b)  TX bytes:25273 (24.6 KiB)
>
>
> -----Original Message-----
> From: Digimer [mailto:lists at alteeve.ca]
> Sent: Thursday, May 31, 2012 7:05 PM
> To: Chen, Ming Ming
> Cc: linux clustering
> Subject: Re: [Linux-cluster] Help needed
>
> Send your cluster.conf please, editing only password please. Please also
> include you network configs.
>
> On 05/31/2012 08:12 PM, Chen, Ming Ming wrote:
>> Hi Digimer,
>> Thanks for your comment. I've got rid of the first problem, and now I have the following messages. Any idea?
>> Thanks in advance.
>> Ming
>>
>> [root at shr295 ~]# tail -f /var/log/messages
>> May 31 16:56:01 shr295 dlm_controld[3375]: dlm_controld 3.0.12.1 started
>> May 31 16:56:11 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:12 shr295 gfs_controld[3447]: gfs_controld 3.0.12.1 started
>> May 31 16:56:12 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:21 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:22 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:22 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>> May 31 16:56:31 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:32 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:32 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>> May 31 16:56:41 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:42 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:42 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>>
>> -----Original Message-----
>> From: Digimer [mailto:lists at alteeve.ca]
>> Sent: Thursday, May 31, 2012 10:13 AM
>> To: Chen, Ming Ming
>> Cc: linux clustering
>> Subject: Re: [Linux-cluster] Help needed
>>
>> On 05/31/2012 12:27 PM, Chen, Ming Ming wrote:
>>>  Hi, I have the following simple cluster config just to try out on SertOS 6.2
>>>
>>> <?xml version="1.0"?>
>>> <cluster config_version="2" name="vmcluster">
>>>       <logging debug="on"/>
>>>       <cman expected_votes="1" two_node="1"/>
>>>       <clusternodes>
>>>             <clusternode name="shr289.cup.hp.com" nodeid="1">
>>>                   <fence>
>>>                   </fence>
>>>             </clusternode>
>>>             <clusternode name="shr295.cup.hp.com" nodeid="2">
>>>                   <fence>
>>>                   </fence>
>>>             </clusternode>
>>>       </clusternodes>
>>>       <fencedevices>
>>>       </fencedevices>
>>>       <rm>
>>>       </rm>
>>> </cluster>
>>>
>>>
>>> And I got the following error message when I did "service cman start" I got the same messages on both nodes.
>>> Any help will be appreciated.
>>>
>>> May 31 09:08:04 corosync [TOTEM ] RRP multicast threshold (100 problem count)
>>> May 31 09:08:05 shr295 corosync[3542]:   [MAIN  ] Completed service synchronizat
>>> ion, ready to provide service.
>>> May 31 09:08:05 shr295 corosync[3542]:   [TOTEM ] A processor joined or left the
>>>  membership and a new membership was formed.
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Unable to load new config in c
>>> orosync: New configuration version has to be newer than current running configur
>>> ation
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Can't get updated config versi
>>> on 4: New configuration version has to be newer than current running configurati
>>> on#012.
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Activity suspended on this nod
>>> e
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Error reloading the configurat
>>> ion, will retry every second
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Node 1 conflict, remote config
>>>  version id=4, local=2
>>> -- VISUAL BLOCK --r295 corosync[3542]:   [CMAN  ] Unable to load new config in c
>>> orosync: New configuration version has to be newer than current running configur
>>> ation
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Can't get updated config versi
>>> on 4: New configuration version has to be newer than current running configurati
>>> on#012.
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Activity suspended on this nod
>>> E
>>>
>>
>> Run 'cman_tool version' to get the current version of the configuration,
>> then increase the config_version="x" to be one higher.
>>
>> Also, configure fencing! If you don't, your cluster will hang the first
>> time anything goes wrong.
>>
>> --
>> Digimer
>> Papers and Projects: https://alteeve.com
>
>
> --
> Digimer
> Papers and Projects: https://alteeve.com


--
Digimer
Papers and Projects: https://alteeve.com



From lists at alteeve.ca  Fri Jun  1 21:59:13 2012
From: lists at alteeve.ca (Digimer)
Date: Fri, 01 Jun 2012 17:59:13 -0400
Subject: [Linux-cluster] cluster 3.1.92 released (Test Release 3)
Message-ID: <4FC93B31.2050406@alteeve.ca>

Welcome to the cluster 3.1.92 (Test Release 3) release.

This releases fixes a config parse bug. Users of previous test releases
are encouraged to upgrade to this version.

The new source tarball can be downloaded here:

https://fedorahosted.org/releases/c/l/cluster/cluster-3.1.92.tar.xz

ChangeLog:

https://fedorahosted.org/releases/c/l/cluster/Changelog-3.1.92

To report bugs or issues:

   https://bugzilla.redhat.com/

Would you like to meet the cluster team or members of its community?

   Join us on IRC (irc.freenode.net #linux-cluster) and share your
   experience with other sysadmins and users.

Thanks/congratulations to all people that contributed to this release!

Happy clustering,
digimer

-- 
Digimer
Papers and Projects: https://alteeve.com



From rhel_cluster at ckimaru.com  Sun Jun  3 01:25:23 2012
From: rhel_cluster at ckimaru.com (Cedric Kimaru)
Date: Sat, 2 Jun 2012 21:25:23 -0400
Subject: [Linux-cluster] Rhel 5.7 Cluster - gfs2 volume in
	"LEAVE_START_WAIT" status
Message-ID: <CANe6q9+_A-_cQqaHrZNZHj3q89MdW0ve7SP5KiwqxKzkYOCWbg@mail.gmail.com>

Fellow Cluster Compatriots,
I'm looking for some guidance here. Whenever my rhel 5.7 cluster get's into
"*LEAVE_START_WAIT*" on on a given iscsi volume, the following occurs:

   1. I can't r/w io to the volume.
   2. Can't unmount it, from any node.
   3. In flight/pending IO's are impossible to determine or kill since lsof
   on the mount fails. Basically all IO operations stall/fail.

So my questions are:

   1. What does the output from group_tool -v really indicate, *"00030005
   LEAVE_START_WAIT 12 c000b0002 1" *? Man on group_tool doesn't list these
   fields.
   2. Does anyone have a list of what these fields represent ?
   3. Corrective actions. How do i get out of this state without rebooting
   the entire cluster ?
   4. Is it possible to determine the offending node ?

thanks,
-Cedric


//misc output

root at bl13-node13:~# clustat
Cluster Status for cluster3 @ Sat Jun  2 20:47:08 2012
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
bl01-node01                                      1 Online, rgmanager
 bl04-node04                                      4 Online, rgmanager
 bl05-node05                                      5 Online, rgmanager
 bl06-node06                                      6 Online, rgmanager
 bl07-node07                                      7 Online, rgmanager
 bl08-node08                                      8 Online, rgmanager
 bl09-node09                                      9 Online, rgmanager
 bl10-node10                                     10 Online, rgmanager
 bl11-node11                                     11 Online, rgmanager
 bl12-node12                                     12 Online, rgmanager
 bl13-node13                                     13 Online, Local, rgmanager
 bl14-node14                                     14 Online, rgmanager
 bl15-node15                                     15 Online, rgmanager


 Service Name                                                 Owner
(Last)                                                 State
 ------- ----                                                 -----
------                                                 -----
 service:httpd
bl05-node05                               started
 service:nfs_disk2
bl08-node08                               started


root at bl13-node13:~# group_tool -v
type             level name            id       state node id local_done
fence            0     default         0001000d none
[1 4 5 6 7 8 9 10 11 12 13 14 15]
dlm              1     clvmd           0001000c none
[1 4 5 6 7 8 9 10 11 12 13 14 15]
dlm              1     cluster3_disk1  00020005 none
[4 5 6 7 8 9 10 11 12 13 14 15]
dlm              1     cluster3_disk2  00040005 none
[4 5 6 7 8 9 10 11 13 14 15]
dlm              1     cluster3_disk7  00060005 none
[1 4 5 6 7 8 9 10 11 12 13 14 15]
dlm              1     cluster3_disk8  00080005 none
[1 4 5 6 7 8 9 10 11 12 13 14 15]
dlm              1     cluster3_disk9  000a0005 none
[1 4 5 6 7 8 9 10 11 12 13 14 15]
dlm              1     disk10          000c0005 none
[1 4 5 6 7 8 9 10 11 12 13 14 15]
dlm              1     rgmanager       0001000a none
[1 4 5 6 7 8 9 10 11 12 13 14 15]
dlm              1     cluster3_disk3  00020001 none
[1 5 6 7 8 9 10 11 12 13]
dlm              1     cluster3_disk6  00020008 none
[1 4 5 6 7 8 9 10 11 12 13 14 15]
gfs              2     cluster3_disk1  00010005 none
[4 5 6 7 8 9 10 11 12 13 14 15]
*gfs              2     cluster3_disk2  00030005 LEAVE_START_WAIT 12
c000b0002 1
[4 5 6 7 8 9 10 11 13 14 15]*
gfs              2     cluster3_disk7  00050005 none
[1 4 5 6 7 8 9 10 11 12 13 14 15]
gfs              2     cluster3_disk8  00070005 none
[1 4 5 6 7 8 9 10 11 12 13 14 15]
gfs              2     cluster3_disk9  00090005 none
[1 4 5 6 7 8 9 10 11 12 13 14 15]
gfs              2     disk10          000b0005 none
[1 4 5 6 7 8 9 10 11 12 13 14 15]
gfs              2     cluster3_disk3  00010001 none
[1 5 6 7 8 9 10 11 12 13]
gfs              2     cluster3_disk6  00010008 none
[1 4 5 6 7 8 9 10 11 12 13 14 15]

root at bl13-node13:~# gfs2_tool list
253:15 cluster3:cluster3_disk6
253:16 cluster3:cluster3_disk3
253:18 cluster3:disk10
253:17 cluster3:cluster3_disk9
253:19 cluster3:cluster3_disk8
253:21 cluster3:cluster3_disk7
253:22 cluster3:cluster3_disk2
253:23 cluster3:cluster3_disk1

root at bl13-node13:~# lvs
    Logging initialised at Sat Jun  2 20:50:03 2012
    Set umask from 0022 to 0077
    Finding all logical volumes
  LV                            VG                            Attr
LSize   Origin Snap%  Move Log Copy%  Convert
  lv_cluster3_Disk7             vg_Cluster3_Disk7             -wi-ao
3.00T
  lv_cluster3_Disk9             vg_Cluster3_Disk9             -wi-ao
200.01G
  lv_Cluster3_libvert           vg_Cluster3_libvert           -wi-a-
100.00G
  lv_cluster3_disk1             vg_cluster3_disk1             -wi-ao
100.00G
  lv_cluster3_disk10            vg_cluster3_disk10            -wi-ao
15.00T
  lv_cluster3_disk2             vg_cluster3_disk2             -wi-ao
220.00G
  lv_cluster3_disk3             vg_cluster3_disk3             -wi-ao
330.00G
  lv_cluster3_disk4_1T-kvm-thin vg_cluster3_disk4_1T-kvm-thin -wi-a-
1.00T
  lv_cluster3_disk5             vg_cluster3_disk5             -wi-a-
555.00G
  lv_cluster3_disk6             vg_cluster3_disk6             -wi-ao
2.00T
  lv_cluster3_disk8             vg_cluster3_disk8             -wi-ao
2.00T
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120602/0d5cd131/attachment.htm>

From emi2fast at gmail.com  Sun Jun  3 17:17:16 2012
From: emi2fast at gmail.com (emmanuel segura)
Date: Sun, 3 Jun 2012 19:17:16 +0200
Subject: [Linux-cluster] Rhel 5.7 Cluster - gfs2 volume in
 "LEAVE_START_WAIT" status
In-Reply-To: <CANe6q9+_A-_cQqaHrZNZHj3q89MdW0ve7SP5KiwqxKzkYOCWbg@mail.gmail.com>
References: <CANe6q9+_A-_cQqaHrZNZHj3q89MdW0ve7SP5KiwqxKzkYOCWbg@mail.gmail.com>
Message-ID: <CAE7pJ3B63X6cW19axUBgKzS1jpVDw36XaU98fP_KEhK7tzj6=g@mail.gmail.com>

Hello Cedric

Are you using gfs or gfs2? if you are using gfs  i recommend to use gfs2

2012/6/3 Cedric Kimaru <rhel_cluster at ckimaru.com>

> Fellow Cluster Compatriots,
> I'm looking for some guidance here. Whenever my rhel 5.7 cluster get's
> into "*LEAVE_START_WAIT*" on on a given iscsi volume, the following
> occurs:
>
>    1. I can't r/w io to the volume.
>    2. Can't unmount it, from any node.
>    3. In flight/pending IO's are impossible to determine or kill since
>    lsof on the mount fails. Basically all IO operations stall/fail.
>
> So my questions are:
>
>    1. What does the output from group_tool -v really indicate, *"00030005
>    LEAVE_START_WAIT 12 c000b0002 1" *? Man on group_tool doesn't list
>    these fields.
>    2. Does anyone have a list of what these fields represent ?
>    3. Corrective actions. How do i get out of this state without
>    rebooting the entire cluster ?
>    4. Is it possible to determine the offending node ?
>
> thanks,
> -Cedric
>
>
> //misc output
>
> root at bl13-node13:~# clustat
> Cluster Status for cluster3 @ Sat Jun  2 20:47:08 2012
> Member Status: Quorate
>
>  Member Name                                                     ID
> Status
>  ------ ----                                                     ----
> ------
> bl01-node01                                      1 Online, rgmanager
>  bl04-node04                                      4 Online, rgmanager
>  bl05-node05                                      5 Online, rgmanager
>  bl06-node06                                      6 Online, rgmanager
>  bl07-node07                                      7 Online, rgmanager
>  bl08-node08                                      8 Online, rgmanager
>  bl09-node09                                      9 Online, rgmanager
>  bl10-node10                                     10 Online, rgmanager
>  bl11-node11                                     11 Online, rgmanager
>  bl12-node12                                     12 Online, rgmanager
>  bl13-node13                                     13 Online, Local,
> rgmanager
>  bl14-node14                                     14 Online, rgmanager
>  bl15-node15                                     15 Online, rgmanager
>
>
>  Service Name                                                 Owner
> (Last)                                                 State
>  ------- ----                                                 -----
> ------                                                 -----
>  service:httpd
> bl05-node05                               started
>  service:nfs_disk2
> bl08-node08                               started
>
>
> root at bl13-node13:~# group_tool -v
> type             level name            id       state node id local_done
> fence            0     default         0001000d none
> [1 4 5 6 7 8 9 10 11 12 13 14 15]
> dlm              1     clvmd           0001000c none
> [1 4 5 6 7 8 9 10 11 12 13 14 15]
> dlm              1     cluster3_disk1  00020005 none
> [4 5 6 7 8 9 10 11 12 13 14 15]
> dlm              1     cluster3_disk2  00040005 none
> [4 5 6 7 8 9 10 11 13 14 15]
> dlm              1     cluster3_disk7  00060005 none
> [1 4 5 6 7 8 9 10 11 12 13 14 15]
> dlm              1     cluster3_disk8  00080005 none
> [1 4 5 6 7 8 9 10 11 12 13 14 15]
> dlm              1     cluster3_disk9  000a0005 none
> [1 4 5 6 7 8 9 10 11 12 13 14 15]
> dlm              1     disk10          000c0005 none
> [1 4 5 6 7 8 9 10 11 12 13 14 15]
> dlm              1     rgmanager       0001000a none
> [1 4 5 6 7 8 9 10 11 12 13 14 15]
> dlm              1     cluster3_disk3  00020001 none
> [1 5 6 7 8 9 10 11 12 13]
> dlm              1     cluster3_disk6  00020008 none
> [1 4 5 6 7 8 9 10 11 12 13 14 15]
> gfs              2     cluster3_disk1  00010005 none
> [4 5 6 7 8 9 10 11 12 13 14 15]
> *gfs              2     cluster3_disk2  00030005 LEAVE_START_WAIT 12
> c000b0002 1
> [4 5 6 7 8 9 10 11 13 14 15]*
> gfs              2     cluster3_disk7  00050005 none
> [1 4 5 6 7 8 9 10 11 12 13 14 15]
> gfs              2     cluster3_disk8  00070005 none
> [1 4 5 6 7 8 9 10 11 12 13 14 15]
> gfs              2     cluster3_disk9  00090005 none
> [1 4 5 6 7 8 9 10 11 12 13 14 15]
> gfs              2     disk10          000b0005 none
> [1 4 5 6 7 8 9 10 11 12 13 14 15]
> gfs              2     cluster3_disk3  00010001 none
> [1 5 6 7 8 9 10 11 12 13]
> gfs              2     cluster3_disk6  00010008 none
> [1 4 5 6 7 8 9 10 11 12 13 14 15]
>
> root at bl13-node13:~# gfs2_tool list
> 253:15 cluster3:cluster3_disk6
> 253:16 cluster3:cluster3_disk3
> 253:18 cluster3:disk10
> 253:17 cluster3:cluster3_disk9
> 253:19 cluster3:cluster3_disk8
> 253:21 cluster3:cluster3_disk7
> 253:22 cluster3:cluster3_disk2
> 253:23 cluster3:cluster3_disk1
>
> root at bl13-node13:~# lvs
>     Logging initialised at Sat Jun  2 20:50:03 2012
>     Set umask from 0022 to 0077
>     Finding all logical volumes
>   LV                            VG                            Attr
> LSize   Origin Snap%  Move Log Copy%  Convert
>   lv_cluster3_Disk7             vg_Cluster3_Disk7             -wi-ao
> 3.00T
>   lv_cluster3_Disk9             vg_Cluster3_Disk9             -wi-ao
> 200.01G
>   lv_Cluster3_libvert           vg_Cluster3_libvert           -wi-a-
> 100.00G
>   lv_cluster3_disk1             vg_cluster3_disk1             -wi-ao
> 100.00G
>   lv_cluster3_disk10            vg_cluster3_disk10            -wi-ao
> 15.00T
>   lv_cluster3_disk2             vg_cluster3_disk2             -wi-ao
> 220.00G
>   lv_cluster3_disk3             vg_cluster3_disk3             -wi-ao
> 330.00G
>   lv_cluster3_disk4_1T-kvm-thin vg_cluster3_disk4_1T-kvm-thin -wi-a-
> 1.00T
>   lv_cluster3_disk5             vg_cluster3_disk5             -wi-a-
> 555.00G
>   lv_cluster3_disk6             vg_cluster3_disk6             -wi-ao
> 2.00T
>   lv_cluster3_disk8             vg_cluster3_disk8             -wi-ao
> 2.00T
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
esta es mi vida e me la vivo hasta que dios quiera
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120603/359bdacd/attachment.htm>

From nicolas at ecarnot.net  Mon Jun  4 09:24:37 2012
From: nicolas at ecarnot.net (Nicolas Ecarnot)
Date: Mon, 04 Jun 2012 11:24:37 +0200
Subject: [Linux-cluster] Error mounting lockproto lock_dlm
Message-ID: <4FCC7ED5.2090002@ecarnot.net>

Hi,

I had a 2-nodes cluster running too fine under Ubuntu server 11.10, with 
cman, corosync, GFS2, OCFS2, clvm, ctdb, samba, winbind.

So I decided to upgrade :)

Under Precise (12.04), my OCFS2 partition is still working well.
CLVM is still OK, nicely speaking with the dlm layer (dlm_controld).

I ran "dlm_controld -D" and I can see the nice interaction with clvmd 
when ran.

But when I try to mount any GFS2 partition (either directly with 
mount.gfs2, or via the init.d script), I get the good old error:

| gfs_controld join connect error: Connection refused
| error mounting lockproto lock_dlm

When getting this, I don't see the smallest contact with dlm_controld 
(ran with -D, it should blink somewhere).


I guess something has changed : in Precise, here are the version numbers :
- libdlm3        3.1.7
- libdlmcontrol3 3.1.7
- gfs2-utils     3.1.3

What point must I check to explain to mount.gfs2 that dlm is actually up 
and running?
Does all that depend on other components I should check?

-- 
Nicolas Ecarnot



From nicolas at ecarnot.net  Mon Jun  4 12:35:12 2012
From: nicolas at ecarnot.net (Nicolas Ecarnot)
Date: Mon, 04 Jun 2012 14:35:12 +0200
Subject: [Linux-cluster] Error mounting lockproto lock_dlm [SOLVED]
In-Reply-To: <4FCC7ED5.2090002@ecarnot.net>
References: <4FCC7ED5.2090002@ecarnot.net>
Message-ID: <4FCCAB80.3030904@ecarnot.net>

Le 04/06/2012 11:24, Nicolas Ecarnot a ?crit :
> Hi,
>
> I had a 2-nodes cluster running too fine under Ubuntu server 11.10, with
> cman, corosync, GFS2, OCFS2, clvm, ctdb, samba, winbind.
>
> So I decided to upgrade :)
>
> Under Precise (12.04), my OCFS2 partition is still working well.
> CLVM is still OK, nicely speaking with the dlm layer (dlm_controld).
>
> I ran "dlm_controld -D" and I can see the nice interaction with clvmd
> when ran.
>
> But when I try to mount any GFS2 partition (either directly with
> mount.gfs2, or via the init.d script), I get the good old error:
>
> | gfs_controld join connect error: Connection refused
> | error mounting lockproto lock_dlm
>
> When getting this, I don't see the smallest contact with dlm_controld
> (ran with -D, it should blink somewhere).
>
>
> I guess something has changed

Dear me :)

Check these pages and their diffs:
- http://manpages.ubuntu.com/manpages/oneiric/man8/gfs_controld.8.html
- http://manpages.ubuntu.com/manpages/precise/man8/gfs_controld.8.html

Especialy look at the second line :
Provided by: ...

Was provided by the package cman, and is now provided by the package 
gfs2-cluster...

Ubuntu, I like you, but you're sometimes hard to follow...

-- 
Nicolas Ecarnot



From rhel_cluster at ckimaru.com  Mon Jun  4 13:29:13 2012
From: rhel_cluster at ckimaru.com (Cedric Kimaru)
Date: Mon, 4 Jun 2012 09:29:13 -0400
Subject: [Linux-cluster] Rhel 5.7 Cluster - gfs2 volume in
 "LEAVE_START_WAIT" status
In-Reply-To: <CAE7pJ3B63X6cW19axUBgKzS1jpVDw36XaU98fP_KEhK7tzj6=g@mail.gmail.com>
References: <CANe6q9+_A-_cQqaHrZNZHj3q89MdW0ve7SP5KiwqxKzkYOCWbg@mail.gmail.com>
	<CAE7pJ3B63X6cW19axUBgKzS1jpVDw36XaU98fP_KEhK7tzj6=g@mail.gmail.com>
Message-ID: <CANe6q9KoHmNK0fyi5gQDefh5RoKbddC8gheiNVNi-YuPvGRF1Q@mail.gmail.com>

Hi Emmanuel,
 Yes, i'm running gfs2. I'm also trying this out on Rhel 6.2 with three
nodes so see if this happens upstream.
Looks like i may have to open a BZ to get more info on this.

root at bl13-node13:~# gfs2_tool list
253:15 cluster3:cluster3_disk6
253:16 cluster3:cluster3_disk3
253:18 cluster3:disk10
253:17 cluster3:cluster3_disk9
253:19 cluster3:cluster3_disk8
253:21 cluster3:cluster3_disk7
253:22 cluster3:cluster3_disk2
253:23 cluster3:cluster3_disk1

thanks,
-Cedric
On Sun, Jun 3, 2012 at 1:17 PM, emmanuel segura <emi2fast at gmail.com> wrote:

> Hello Cedric
>
> Are you using gfs or gfs2? if you are using gfs  i recommend to use gfs2
>
> 2012/6/3 Cedric Kimaru <rhel_cluster at ckimaru.com>
>
>> Fellow Cluster Compatriots,
>> I'm looking for some guidance here. Whenever my rhel 5.7 cluster get's
>> into "*LEAVE_START_WAIT*" on on a given iscsi volume, the following
>> occurs:
>>
>>    1. I can't r/w io to the volume.
>>    2. Can't unmount it, from any node.
>>    3. In flight/pending IO's are impossible to determine or kill since
>>    lsof on the mount fails. Basically all IO operations stall/fail.
>>
>> So my questions are:
>>
>>    1. What does the output from group_tool -v really indicate, *"00030005
>>    LEAVE_START_WAIT 12 c000b0002 1" *? Man on group_tool doesn't list
>>    these fields.
>>    2. Does anyone have a list of what these fields represent ?
>>    3. Corrective actions. How do i get out of this state without
>>    rebooting the entire cluster ?
>>    4. Is it possible to determine the offending node ?
>>
>> thanks,
>> -Cedric
>>
>>
>> //misc output
>>
>> root at bl13-node13:~# clustat
>> Cluster Status for cluster3 @ Sat Jun  2 20:47:08 2012
>> Member Status: Quorate
>>
>>  Member Name                                                     ID
>> Status
>>  ------ ----                                                     ----
>> ------
>> bl01-node01                                      1 Online, rgmanager
>>  bl04-node04                                      4 Online, rgmanager
>>  bl05-node05                                      5 Online, rgmanager
>>  bl06-node06                                      6 Online, rgmanager
>>  bl07-node07                                      7 Online, rgmanager
>>  bl08-node08                                      8 Online, rgmanager
>>  bl09-node09                                      9 Online, rgmanager
>>  bl10-node10                                     10 Online, rgmanager
>>  bl11-node11                                     11 Online, rgmanager
>>  bl12-node12                                     12 Online, rgmanager
>>  bl13-node13                                     13 Online, Local,
>> rgmanager
>>  bl14-node14                                     14 Online, rgmanager
>>  bl15-node15                                     15 Online, rgmanager
>>
>>
>>  Service Name                                                 Owner
>> (Last)                                                 State
>>  ------- ----                                                 -----
>> ------                                                 -----
>>  service:httpd
>> bl05-node05                               started
>>  service:nfs_disk2
>> bl08-node08                               started
>>
>>
>> root at bl13-node13:~# group_tool -v
>> type             level name            id       state node id local_done
>> fence            0     default         0001000d none
>> [1 4 5 6 7 8 9 10 11 12 13 14 15]
>> dlm              1     clvmd           0001000c none
>> [1 4 5 6 7 8 9 10 11 12 13 14 15]
>> dlm              1     cluster3_disk1  00020005 none
>> [4 5 6 7 8 9 10 11 12 13 14 15]
>> dlm              1     cluster3_disk2  00040005 none
>> [4 5 6 7 8 9 10 11 13 14 15]
>> dlm              1     cluster3_disk7  00060005 none
>> [1 4 5 6 7 8 9 10 11 12 13 14 15]
>> dlm              1     cluster3_disk8  00080005 none
>> [1 4 5 6 7 8 9 10 11 12 13 14 15]
>> dlm              1     cluster3_disk9  000a0005 none
>> [1 4 5 6 7 8 9 10 11 12 13 14 15]
>> dlm              1     disk10          000c0005 none
>> [1 4 5 6 7 8 9 10 11 12 13 14 15]
>> dlm              1     rgmanager       0001000a none
>> [1 4 5 6 7 8 9 10 11 12 13 14 15]
>> dlm              1     cluster3_disk3  00020001 none
>> [1 5 6 7 8 9 10 11 12 13]
>> dlm              1     cluster3_disk6  00020008 none
>> [1 4 5 6 7 8 9 10 11 12 13 14 15]
>> gfs              2     cluster3_disk1  00010005 none
>> [4 5 6 7 8 9 10 11 12 13 14 15]
>> *gfs              2     cluster3_disk2  00030005 LEAVE_START_WAIT 12
>> c000b0002 1
>> [4 5 6 7 8 9 10 11 13 14 15]*
>> gfs              2     cluster3_disk7  00050005 none
>> [1 4 5 6 7 8 9 10 11 12 13 14 15]
>> gfs              2     cluster3_disk8  00070005 none
>> [1 4 5 6 7 8 9 10 11 12 13 14 15]
>> gfs              2     cluster3_disk9  00090005 none
>> [1 4 5 6 7 8 9 10 11 12 13 14 15]
>> gfs              2     disk10          000b0005 none
>> [1 4 5 6 7 8 9 10 11 12 13 14 15]
>> gfs              2     cluster3_disk3  00010001 none
>> [1 5 6 7 8 9 10 11 12 13]
>> gfs              2     cluster3_disk6  00010008 none
>> [1 4 5 6 7 8 9 10 11 12 13 14 15]
>>
>> root at bl13-node13:~# gfs2_tool list
>> 253:15 cluster3:cluster3_disk6
>> 253:16 cluster3:cluster3_disk3
>> 253:18 cluster3:disk10
>> 253:17 cluster3:cluster3_disk9
>> 253:19 cluster3:cluster3_disk8
>> 253:21 cluster3:cluster3_disk7
>> 253:22 cluster3:cluster3_disk2
>> 253:23 cluster3:cluster3_disk1
>>
>> root at bl13-node13:~# lvs
>>     Logging initialised at Sat Jun  2 20:50:03 2012
>>     Set umask from 0022 to 0077
>>     Finding all logical volumes
>>   LV                            VG                            Attr
>> LSize   Origin Snap%  Move Log Copy%  Convert
>>   lv_cluster3_Disk7             vg_Cluster3_Disk7             -wi-ao
>> 3.00T
>>   lv_cluster3_Disk9             vg_Cluster3_Disk9             -wi-ao
>> 200.01G
>>   lv_Cluster3_libvert           vg_Cluster3_libvert           -wi-a-
>> 100.00G
>>   lv_cluster3_disk1             vg_cluster3_disk1             -wi-ao
>> 100.00G
>>   lv_cluster3_disk10            vg_cluster3_disk10            -wi-ao
>> 15.00T
>>   lv_cluster3_disk2             vg_cluster3_disk2             -wi-ao
>> 220.00G
>>   lv_cluster3_disk3             vg_cluster3_disk3             -wi-ao
>> 330.00G
>>   lv_cluster3_disk4_1T-kvm-thin vg_cluster3_disk4_1T-kvm-thin -wi-a-
>> 1.00T
>>   lv_cluster3_disk5             vg_cluster3_disk5             -wi-a-
>> 555.00G
>>   lv_cluster3_disk6             vg_cluster3_disk6             -wi-ao
>> 2.00T
>>   lv_cluster3_disk8             vg_cluster3_disk8             -wi-ao
>> 2.00T
>>
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
>
> --
> esta es mi vida e me la vivo hasta que dios quiera
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120604/c53b8326/attachment.htm>

From dan131riley at gmail.com  Mon Jun  4 14:52:23 2012
From: dan131riley at gmail.com (Dan Riley)
Date: Mon, 4 Jun 2012 10:52:23 -0400
Subject: [Linux-cluster] Rhel 5.7 Cluster - gfs2 volume in
	"LEAVE_START_WAIT" status
In-Reply-To: <CANe6q9+_A-_cQqaHrZNZHj3q89MdW0ve7SP5KiwqxKzkYOCWbg@mail.gmail.com>
References: <CANe6q9+_A-_cQqaHrZNZHj3q89MdW0ve7SP5KiwqxKzkYOCWbg@mail.gmail.com>
Message-ID: <C7408501-85D5-488A-A609-5245FC52F5FF@gmail.com>

Hi Cedric,

About the only doc I've found that describes the barrier state transitions is in the cluster2 architecture doc

http://people.redhat.com/teigland/cluster2-arch.txt

When group membership changes, there's a barrier operation that stops the group, changes the membership, and restarts the group, so that all members agree on the membership change synchronization.  LEAVE_START_WAIT means that a node (12) left the group, but restarting the group hasn't completed because not all the nodes have acknowledged agreement.  You should do 'group_tool -v' on the different nodes of the cluster and look for a node where the final 'local_done' flag is 0, or where the group membership is inconsistent with the other nodes.  Dumping the debug buffer for the group on the various nodes may also identify which node is being waited on.  In the cases where we've found inconsistent group membership, fencing the node with the inconsistency let the group finish starting.

[as an aside--is there a plan to reengineer the RH cluster group membership protocol stack to take advantage of the virtual synchrony capabilities of Corosync/TOTEM?]

-dan

On Jun 2, 2012, at 9:25 PM, Cedric Kimaru wrote:

> Fellow Cluster Compatriots,
> I'm looking for some guidance here. Whenever my rhel 5.7 cluster get's into "LEAVE_START_WAIT" on on a given iscsi volume, the following occurs: 
> 	? I can't r/w io to the volume.
> 	? Can't unmount it, from any node.
> 	? In flight/pending IO's are impossible to determine or kill since lsof on the mount fails. Basically all IO operations stall/fail.
> So my questions are:
> 
> 	? What does the output from group_tool -v really indicate, "00030005 LEAVE_START_WAIT 12 c000b0002 1" ? Man on group_tool doesn't list these fields.
> 	? Does anyone have a list of what these fields represent ?
> 	? Corrective actions. How do i get out of this state without rebooting the entire cluster ?
> 	? Is it possible to determine the offending node ?
> thanks,
> -Cedric
> 
> 
> //misc output
> 
> root at bl13-node13:~# group_tool -v
> type             level name            id       state node id local_done
> fence            0     default         0001000d none        
> [1 4 5 6 7 8 9 10 11 12 13 14 15]
> dlm              1     clvmd           0001000c none        
> [1 4 5 6 7 8 9 10 11 12 13 14 15]
> dlm              1     cluster3_disk1  00020005 none        
> [4 5 6 7 8 9 10 11 12 13 14 15]
> dlm              1     cluster3_disk2  00040005 none        
> [4 5 6 7 8 9 10 11 13 14 15]
> dlm              1     cluster3_disk7  00060005 none        
> [1 4 5 6 7 8 9 10 11 12 13 14 15]
> dlm              1     cluster3_disk8  00080005 none        
> [1 4 5 6 7 8 9 10 11 12 13 14 15]
> dlm              1     cluster3_disk9  000a0005 none        
> [1 4 5 6 7 8 9 10 11 12 13 14 15]
> dlm              1     disk10          000c0005 none        
> [1 4 5 6 7 8 9 10 11 12 13 14 15]
> dlm              1     rgmanager       0001000a none        
> [1 4 5 6 7 8 9 10 11 12 13 14 15]
> dlm              1     cluster3_disk3  00020001 none        
> [1 5 6 7 8 9 10 11 12 13]
> dlm              1     cluster3_disk6  00020008 none        
> [1 4 5 6 7 8 9 10 11 12 13 14 15]
> gfs              2     cluster3_disk1  00010005 none        
> [4 5 6 7 8 9 10 11 12 13 14 15]
> gfs              2     cluster3_disk2  00030005 LEAVE_START_WAIT 12 c000b0002 1
> [4 5 6 7 8 9 10 11 13 14 15]
> gfs              2     cluster3_disk7  00050005 none        
> [1 4 5 6 7 8 9 10 11 12 13 14 15]
> gfs              2     cluster3_disk8  00070005 none        
> [1 4 5 6 7 8 9 10 11 12 13 14 15]
> gfs              2     cluster3_disk9  00090005 none        
> [1 4 5 6 7 8 9 10 11 12 13 14 15]
> gfs              2     disk10          000b0005 none        
> [1 4 5 6 7 8 9 10 11 12 13 14 15]
> gfs              2     cluster3_disk3  00010001 none        
> [1 5 6 7 8 9 10 11 12 13]
> gfs              2     cluster3_disk6  00010008 none        
> [1 4 5 6 7 8 9 10 11 12 13 14 15]




From jpokorny at redhat.com  Tue Jun  5 10:26:51 2012
From: jpokorny at redhat.com (Jan =?utf-8?Q?Pokorn=C3=BD?=)
Date: Tue, 5 Jun 2012 12:26:51 +0200
Subject: [Linux-cluster] Connection Reset when trying to brorwse luci
 web interface
In-Reply-To: <CAOf__1PkzsPbuELb8nw0OfUO3aQ3Fs2GhBpwhxLEan3vCx8+qw@mail.gmail.com>
References: <CAOf__1PkzsPbuELb8nw0OfUO3aQ3Fs2GhBpwhxLEan3vCx8+qw@mail.gmail.com>
Message-ID: <20120605102651.GD12834@redhat.com>

On 26/05/12 19:12 +0100, fosiul alam wrote:
> Hi

Hello Fosiul,

> I am trying cluster in my lab and I have 3 nodes.
> 
> in 1st node, i have installed luci as
> 
> yum install luci
> then
> luci_admin init
> 
> then service luci restart
> 
> Now when i am trying to browse the web interface
> https://clstr1:8084/
> 
> from mozilla or internet explorer ,
> its ask for Certificate but after that , its say: Connection reset  .

To make the context clear, I can see (from luci_admin usage) you are
talking about luci used in RHEL 5 (Plone-based).

What is not clear to me is the exact reproducer.
I haven't encountered anything like this so far.

Do you actually accept the certificate[*] for luci?  And this is then
followed by "Connection reset"?  Is it this message browser-specific,
or displayed as a plain page (i.e., you can find this message in
the source of the page)?

And does /var/lib/luci/log/event.log contain anything suspicious
such as tracebacks?  Only messages after "Zope Ready to handle requests"
(and "Plone Deprecation Warning") are important here.

Also, could you be more specific about the web browsers?

[*] It is unrecognized as opposed to common public websites;  simply
because there is no recognized certification authority behind
the certificate.

> So the luci web page is not comming
> Can any one tell me why ?

Please, see that log file and provide browsers details to track
down the issue.
You can also try (change IP address/hostname appropriately):

$ LUCIHOST=192.168.122.243
$ wget --no-check-certificate -q -S --spider "https://${LUCIHOST}:8084/luci/doc"

The result should look something like ("200 OK" status code is important):

  HTTP/1.0 200 OK
  Server: Zope/(Zope 2.9.8-final, python 2.4.3, linux2) ZServer/1.1 Plone/2.5.5
  Date: Tue, 05 Jun 2012 10:18:54 GMT
  Content-Length: 25957
  Content-Location: https://192.168.122.243:8084/luci/doc/
  Accept-Ranges: none
  Connection: Keep-Alive
  Last-Modified: Wed, 27 Jun 2007 07:22:52 GMT
  Connection: close
  Date: Tue, 05 Jun 2012 10:18:54 GMT
  Content-Type: text/html

> iptables is turned off and selinux is turned off.

Firewall is OK provided that TCP destination port 8084
(default for luci) is enabled on the machine running luci.

SELinux should be OK as well, or do you see any related
messages in permissive mode/any failure if enforcing?

> Thanks for your help.

I hope this will help you,
Jan



From rhel_cluster at ckimaru.com  Tue Jun  5 14:14:57 2012
From: rhel_cluster at ckimaru.com (Cedric Kimaru)
Date: Tue, 5 Jun 2012 10:14:57 -0400
Subject: [Linux-cluster] Rhel 5.7 Cluster - gfs2 volume in
 "LEAVE_START_WAIT" status
In-Reply-To: <C7408501-85D5-488A-A609-5245FC52F5FF@gmail.com>
References: <CANe6q9+_A-_cQqaHrZNZHj3q89MdW0ve7SP5KiwqxKzkYOCWbg@mail.gmail.com>
	<C7408501-85D5-488A-A609-5245FC52F5FF@gmail.com>
Message-ID: <CANe6q9JzQxUhsWeV55Oa9o=XKySSHT2_rRpY_2trWHVf_D5D8w@mail.gmail.com>

Hi Dan,
 Thanks for the response and breadcrumb. The link to Davids document will
hopefully shed more light into this state.
I tried fencing the node with the pending sync restart, 12 in my case, but
that didn't seem to get the volume out of the weeds. Attempting to restart
from other nodes gfs2 also fails since it has to unmount, which it can't
... weeds, weeds, weeds.

Now, Could elaborate on which diags you are referring to, glock ?

thanks,
-Cedric

On Mon, Jun 4, 2012 at 10:52 AM, Dan Riley <dan131riley at gmail.com> wrote:

> Hi Cedric,
>
> About the only doc I've found that describes the barrier state transitions
> is in the cluster2 architecture doc
>
> http://people.redhat.com/teigland/cluster2-arch.txt
>
> When group membership changes, there's a barrier operation that stops the
> group, changes the membership, and restarts the group, so that all members
> agree on the membership change synchronization.  LEAVE_START_WAIT means
> that a node (12) left the group, but restarting the group hasn't completed
> because not all the nodes have acknowledged agreement.  You should do
> 'group_tool -v' on the different nodes of the cluster and look for a node
> where the final 'local_done' flag is 0, or where the group membership is
> inconsistent with the other nodes.  Dumping the debug buffer for the group
> on the various nodes may also identify which node is being waited on.  In
> the cases where we've found inconsistent group membership, fencing the node
> with the inconsistency let the group finish starting.
>
> [as an aside--is there a plan to reengineer the RH cluster group
> membership protocol stack to take advantage of the virtual synchrony
> capabilities of Corosync/TOTEM?]
>
> -dan
>
> On Jun 2, 2012, at 9:25 PM, Cedric Kimaru wrote:
>
> > Fellow Cluster Compatriots,
> > I'm looking for some guidance here. Whenever my rhel 5.7 cluster get's
> into "LEAVE_START_WAIT" on on a given iscsi volume, the following occurs:
> >       ? I can't r/w io to the volume.
> >       ? Can't unmount it, from any node.
> >       ? In flight/pending IO's are impossible to determine or kill since
> lsof on the mount fails. Basically all IO operations stall/fail.
> > So my questions are:
> >
> >       ? What does the output from group_tool -v really indicate,
> "00030005 LEAVE_START_WAIT 12 c000b0002 1" ? Man on group_tool doesn't list
> these fields.
> >       ? Does anyone have a list of what these fields represent ?
> >       ? Corrective actions. How do i get out of this state without
> rebooting the entire cluster ?
> >       ? Is it possible to determine the offending node ?
> > thanks,
> > -Cedric
> >
> >
> > //misc output
> >
> > root at bl13-node13:~# group_tool -v
> > type             level name            id       state node id local_done
> > fence            0     default         0001000d none
> > [1 4 5 6 7 8 9 10 11 12 13 14 15]
> > dlm              1     clvmd           0001000c none
> > [1 4 5 6 7 8 9 10 11 12 13 14 15]
> > dlm              1     cluster3_disk1  00020005 none
> > [4 5 6 7 8 9 10 11 12 13 14 15]
> > dlm              1     cluster3_disk2  00040005 none
> > [4 5 6 7 8 9 10 11 13 14 15]
> > dlm              1     cluster3_disk7  00060005 none
> > [1 4 5 6 7 8 9 10 11 12 13 14 15]
> > dlm              1     cluster3_disk8  00080005 none
> > [1 4 5 6 7 8 9 10 11 12 13 14 15]
> > dlm              1     cluster3_disk9  000a0005 none
> > [1 4 5 6 7 8 9 10 11 12 13 14 15]
> > dlm              1     disk10          000c0005 none
> > [1 4 5 6 7 8 9 10 11 12 13 14 15]
> > dlm              1     rgmanager       0001000a none
> > [1 4 5 6 7 8 9 10 11 12 13 14 15]
> > dlm              1     cluster3_disk3  00020001 none
> > [1 5 6 7 8 9 10 11 12 13]
> > dlm              1     cluster3_disk6  00020008 none
> > [1 4 5 6 7 8 9 10 11 12 13 14 15]
> > gfs              2     cluster3_disk1  00010005 none
> > [4 5 6 7 8 9 10 11 12 13 14 15]
> > gfs              2     cluster3_disk2  00030005 LEAVE_START_WAIT 12
> c000b0002 1
> > [4 5 6 7 8 9 10 11 13 14 15]
> > gfs              2     cluster3_disk7  00050005 none
> > [1 4 5 6 7 8 9 10 11 12 13 14 15]
> > gfs              2     cluster3_disk8  00070005 none
> > [1 4 5 6 7 8 9 10 11 12 13 14 15]
> > gfs              2     cluster3_disk9  00090005 none
> > [1 4 5 6 7 8 9 10 11 12 13 14 15]
> > gfs              2     disk10          000b0005 none
> > [1 4 5 6 7 8 9 10 11 12 13 14 15]
> > gfs              2     cluster3_disk3  00010001 none
> > [1 5 6 7 8 9 10 11 12 13]
> > gfs              2     cluster3_disk6  00010008 none
> > [1 4 5 6 7 8 9 10 11 12 13 14 15]
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120605/4719b577/attachment.htm>

From swhiteho at redhat.com  Wed Jun  6 08:23:55 2012
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Wed, 06 Jun 2012 09:23:55 +0100
Subject: [Linux-cluster] Error mounting lockproto lock_dlm
In-Reply-To: <4FCC7ED5.2090002@ecarnot.net>
References: <4FCC7ED5.2090002@ecarnot.net>
Message-ID: <1338971035.2714.1.camel@menhir>

Hi,

On Mon, 2012-06-04 at 11:24 +0200, Nicolas Ecarnot wrote:
> Hi,
> 
> I had a 2-nodes cluster running too fine under Ubuntu server 11.10, with 
> cman, corosync, GFS2, OCFS2, clvm, ctdb, samba, winbind.
> 
> So I decided to upgrade :)
> 
> Under Precise (12.04), my OCFS2 partition is still working well.
> CLVM is still OK, nicely speaking with the dlm layer (dlm_controld).
> 
> I ran "dlm_controld -D" and I can see the nice interaction with clvmd 
> when ran.
> 
> But when I try to mount any GFS2 partition (either directly with 
> mount.gfs2, or via the init.d script), I get the good old error:
> 
> | gfs_controld join connect error: Connection refused
> | error mounting lockproto lock_dlm
> 
Are you running selinux perhaps? That usually means that the unix socket
used to communicate cannot be opened for some reason,

Steve.

> When getting this, I don't see the smallest contact with dlm_controld 
> (ran with -D, it should blink somewhere).
> 
> 
> I guess something has changed : in Precise, here are the version numbers :
> - libdlm3        3.1.7
> - libdlmcontrol3 3.1.7
> - gfs2-utils     3.1.3
> 
> What point must I check to explain to mount.gfs2 that dlm is actually up 
> and running?
> Does all that depend on other components I should check?
> 




From epretorious at yahoo.com  Thu Jun  7 04:12:13 2012
From: epretorious at yahoo.com (Eric)
Date: Wed, 6 Jun 2012 21:12:13 -0700 (PDT)
Subject: [Linux-cluster] Bonding Interfaces: Active Load Balancing & LACP
Message-ID: <1339042333.47748.YahooMailNeo@web121704.mail.ne1.yahoo.com>

I'm currently using the HP Procurve 2824 24-port Gigabit Ethernet switch to for a backside network for synchronizing file systems between the nodes in the group. Each host has 4 Gigabit NIC's and the goal is to bond two of the Gigabit NIC's together to create a 2 Gbps link from any host? to any other host but what I'm finding is that the bonded links are only capable of 1 Gbps from any host to any other host. Is it possible to 
create a multi-Gigabit link between two hosts (without having to upgrade to 10G) using a switch that "uses the 
SA/DA (Source Address/Destination Address) method of distributing 
traffic across the trunked links"?


The problem, at least as far as I can tell, comes down to the 
limitation of ARP resolution (in the host) and mac-address tables (in 
the switch):
When configured to use Active Load Balancing, the kernel driver leaves each of the interface's MAC 
addresses unchanged. In this scenario, when Host A sends sends traffic 
to host Host B, the kernel uses the MAC address of only one of Host B's 
NIC's as the DA. When the packet arrives at the switch, the switch 
consults the mac-address table for the DA and then sends the packet to 
the interface connected to the NIC with MAC address equal to DA. Thus 
packets from Host A to Host B will only leave the switch through one 
interface - the interface connected to the NIC with MAC address equal to DA. This has the effect of limiting the throughput from Host A to Host B to the speed of the one interface connected to the NIC with MAC address equal to DA.

When configured to use IEEE 802.3ad (LACP), the kernel driver assigns the same MAC address to all of the hosts' 
interfaces. In this scenario, when Host A sends traffic to Host B, the 
kernel uses Host B's shared MAC address as the DA. When the packet 
arrives at the switch, the switch creates a hash based on the SA/DA 
pair, consults the mac-address table for the DA, and and assigns the 
flow (i.e., traffic from Host A to Host B) to one of the interfaces 
connected to Host B. Thus packets from Host A to Host B will only leave 
the switch through one interface - the interface determined by the SA/DA hash. This has the effect of limiting the throughput from Host A to Host B to the speed of the one?interface determined by the hashing method. However, if the flow (from Host A to Host B's shared MAC 
address) were to be distributed across the different interfaces in a 
round-robin 
fashion (as the 
packets were leaving the switch) the throughput between the hosts would 
equal the aggregate of 
the links (IIUC).

Is this a limitation of the the Procurve's 
implementation of LACP? Do other switches use  different methods of 
distributing traffic across the trunked links? Is there another method 
of aggregating the links between the two hosts (e.g., multipathing)?

TIA,
Eric Pretorious
Truckee, CA
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120606/40789be5/attachment.htm>

From lists at alteeve.ca  Thu Jun  7 04:35:32 2012
From: lists at alteeve.ca (Digimer)
Date: Thu, 07 Jun 2012 00:35:32 -0400
Subject: [Linux-cluster] Bonding Interfaces: Active Load Balancing & LACP
In-Reply-To: <1339042333.47748.YahooMailNeo@web121704.mail.ne1.yahoo.com>
References: <1339042333.47748.YahooMailNeo@web121704.mail.ne1.yahoo.com>
Message-ID: <4FD02F94.9020300@alteeve.ca>

I know that the only *supported* bond is Active/Passive (mode=1), which 
of course provides no performance benefit.

I tested all types, using more modest D-Link DGS-3100 switches and all 
other modes failed at some point in failure and recovery testing. If you 
want to experiment, I'd suggest tweaking corosync's timeouts to be 
(much?) more generous.

I'm curious to hear back on what your experimenting finds.

Digimer

On 06/07/2012 12:12 AM, Eric wrote:
> I'm currently using the HP Procurve 2824 24-port Gigabit Ethernet switch
> to for a backside network for synchronizing file systems between the
> nodes in the group. Each host has 4 Gigabit NIC's and the goal is to
> bond two of the Gigabit NIC's together to create a 2 Gbps link from any
> host to any other host but what I'm finding is that the bonded links are
> only capable of 1 Gbps from any host to any other host. Is it possible
> to create a multi-Gigabit link between two hosts (without having to
> upgrade to 10G) using a switch that "uses the SA/DA (Source
> Address/Destination Address) method of distributing traffic across the
> trunked links"?
>
> The problem, at least as far as I can tell, comes down to the limitation
> of ARP resolution (in the host) and mac-address tables (in the switch):
>
> When configured to use Active Load Balancing, the kernel driver leaves
> each of the interface's MAC addresses unchanged. In this scenario, when
> Host A sends sends traffic to host Host B, the kernel uses the MAC
> address of only one of Host B's NIC's as the DA. When the packet arrives
> at the switch, the switch consults the mac-address table for the DA and
> then sends the packet to the interface connected to the NIC with MAC
> address equal to DA. Thus packets from Host A to Host B will only leave
> the switch through one interface - the interface connected to the NIC
> with MAC address equal to DA. This has the effect of limiting the
> throughput from Host A to Host B to the speed of the one interface
> connected to the NIC with MAC address equal to DA.
>
> When configured to use IEEE 802.3ad (LACP), the kernel driver assigns
> the same MAC address to all of the hosts' interfaces. In this scenario,
> when Host A sends traffic to Host B, the kernel uses Host B's shared MAC
> address as the DA. When the packet arrives at the switch, the switch
> creates a hash based on the SA/DA pair, consults the mac-address table
> for the DA, and and assigns the flow (i.e., traffic from Host A to Host
> B) to one of the interfaces connected to Host B. Thus packets from Host
> A to Host B will only leave the switch through one interface - the
> interface determined by the SA/DA hash. This has the effect of limiting
> the throughput from Host A to Host B to the speed of the one interface
> determined by the hashing method. However, if the flow (from Host A to
> Host B's shared MAC address) were to be distributed across the different
> interfaces in a round-robin fashion (as the packets were leaving the
> switch) the throughput between the hosts would equal the aggregate of
> the links (IIUC).
>
> Is this a limitation of the the Procurve's implementation of LACP? Do
> other switches use different methods of distributing traffic across the
> trunked links? Is there another method of aggregating the links between
> the two hosts (e.g., multipathing)?
>
> TIA,
> Eric Pretorious
> Truckee, CA
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


-- 
Digimer
Papers and Projects: https://alteeve.com



From kkovachev at varna.net  Thu Jun  7 08:22:39 2012
From: kkovachev at varna.net (Kaloyan Kovachev)
Date: Thu, 07 Jun 2012 11:22:39 +0300
Subject: [Linux-cluster] Bonding Interfaces: Active Load Balancing & LACP
In-Reply-To: <1339042333.47748.YahooMailNeo@web121704.mail.ne1.yahoo.com>
References: <1339042333.47748.YahooMailNeo@web121704.mail.ne1.yahoo.com>
Message-ID: <b30c5a37e601fbedfc2bde783d772da3@mx.varna.net>

On Wed, 6 Jun 2012 21:12:13 -0700 (PDT), Eric <epretorious at yahoo.com>
wrote:
> I'm currently using the HP Procurve 2824 24-port Gigabit Ethernet switch
> to for a backside network for synchronizing file systems between the
nodes
> in the group. Each host has 4 Gigabit NIC's and the goal is to bond two
of
> the Gigabit NIC's together to create a 2 Gbps link from any host? to any
> other host but what I'm finding is that the bonded links are only
capable
> of 1 Gbps from any host to any other host. Is it possible to 
> create a multi-Gigabit link between two hosts (without having to upgrade
> to 10G) using a switch that "uses the 
> SA/DA (Source Address/Destination Address) method of distributing 
> traffic across the trunked links"?
> 
> 
> The problem, at least as far as I can tell, comes down to the 
> limitation of ARP resolution (in the host) and mac-address tables (in 
> the switch):
> When configured to use Active Load Balancing, the kernel driver leaves
> each of the interface's MAC 
> addresses unchanged. In this scenario, when Host A sends sends traffic 
> to host Host B, the kernel uses the MAC address of only one of Host B's 
> NIC's as the DA. When the packet arrives at the switch, the switch 
> consults the mac-address table for the DA and then sends the packet to 
> the interface connected to the NIC with MAC address equal to DA. Thus 
> packets from Host A to Host B will only leave the switch through one 
> interface - the interface connected to the NIC with MAC address equal to
> DA. This has the effect of limiting the throughput from Host A to Host B
to
> the speed of the one interface connected to the NIC with MAC address
equal
> to DA.
> 
> When configured to use IEEE 802.3ad (LACP), the kernel driver assigns
the
> same MAC address to all of the hosts' 
> interfaces. In this scenario, when Host A sends traffic to Host B, the 
> kernel uses Host B's shared MAC address as the DA. When the packet 
> arrives at the switch, the switch creates a hash based on the SA/DA 
> pair, consults the mac-address table for the DA, and and assigns the 
> flow (i.e., traffic from Host A to Host B) to one of the interfaces 
> connected to Host B. Thus packets from Host A to Host B will only leave 
> the switch through one interface - the interface determined by the SA/DA
> hash. This has the effect of limiting the throughput from Host A to Host
B
> to the speed of the one?interface determined by the hashing method.
> However, if the flow (from Host A to Host B's shared MAC 
> address) were to be distributed across the different interfaces in a 
> round-robin 
> fashion (as the 
> packets were leaving the switch) the throughput between the hosts would 
> equal the aggregate of 
> the links (IIUC).
> 
> Is this a limitation of the the Procurve's 
> implementation of LACP? Do other switches use  different methods of 
> distributing traffic across the trunked links? Is there another method 
> of aggregating the links between the two hosts (e.g., multipathing)?
> 

Not sure if you can choose a different hashing mode on Procurve, but
Netgear GSM7352 for example supports hashing by IP and port among other
modes:

1. Source MAC, VLAN, EtherType, and port ID
2. Destination MAC, VLAN, EtherType, and port ID
3. Source IP and source TCP/UDP port
4. Destination IP and destination TCP/UDP port
5. Source/Destination MAC, VLAN, EtherType and port
6. Source/Destination IP and source/destination TCP/UDP port

By using LACP with mode 6 for example you may get more bandwidth for
several applications (running simultaneously), but still limited to 1G for
a single socket

> TIA,
> Eric Pretorious
> Truckee, CA



From radu.rendec at mindbit.ro  Thu Jun  7 08:51:31 2012
From: radu.rendec at mindbit.ro (Radu Rendec)
Date: Thu, 07 Jun 2012 11:51:31 +0300
Subject: [Linux-cluster] Bonding Interfaces: Active Load Balancing & LACP
In-Reply-To: <4FD02F94.9020300@alteeve.ca>
References: <1339042333.47748.YahooMailNeo@web121704.mail.ne1.yahoo.com>
	<4FD02F94.9020300@alteeve.ca>
Message-ID: <1339059091.25816.470.camel@localhost>

I also experimented with D-Link DGS-3xxx switches and the bonding
driver, but in a quite strange configuration: 2 distinct switches
without any "knowledge" of each other, and with each server having NIC
#1 connected in one switch and NIC #2 in the other.

In my case, the bonding driver actually splitted the traffic between the
2 links and I could achieve higher speeds than with a single link. This
is my config (nothing fancy):

[root at host ~]# cat /etc/sysconfig/network-scripts/ifcfg-bond2 
DEVICE=bond2
TYPE=Bonding
BONDING_OPTS="miimon=50"
ONBOOT=yes
BOOTPROTO=none
BRIDGE=br2

The bond interface was assigned to a bridge in my case, because I needed
to give network access to some VMs (besides the physical host).

If you don't need 802.1q tagging, I think you could try to create 2
VLANs in your switch and for all servers have one NIC in one vlan and
and another NIC in the other vlan and then a bond across those 2 NICs.

The drawback is that everything that is connected in this manner needs
to have exactly 2 NICs connected in the 2 VLANs. The other problem is
that if one NIC fails in one of the servers, it won't receive the
packets that are sent on the corresponding VLAN, so the server will not
receive half of the traffic that is meant for it.

I'm also curious about the results of your experiments. Please post back
if you have time.

Thanks,

Radu Rendec

On Thu, 2012-06-07 at 00:35 -0400, Digimer wrote:
> I know that the only *supported* bond is Active/Passive (mode=1), which 
> of course provides no performance benefit.
> 
> I tested all types, using more modest D-Link DGS-3100 switches and all 
> other modes failed at some point in failure and recovery testing. If you 
> want to experiment, I'd suggest tweaking corosync's timeouts to be 
> (much?) more generous.
> 
> I'm curious to hear back on what your experimenting finds.
> 
> Digimer
> 
> On 06/07/2012 12:12 AM, Eric wrote:
> > I'm currently using the HP Procurve 2824 24-port Gigabit Ethernet switch
> > to for a backside network for synchronizing file systems between the
> > nodes in the group. Each host has 4 Gigabit NIC's and the goal is to
> > bond two of the Gigabit NIC's together to create a 2 Gbps link from any
> > host to any other host but what I'm finding is that the bonded links are
> > only capable of 1 Gbps from any host to any other host. Is it possible
> > to create a multi-Gigabit link between two hosts (without having to
> > upgrade to 10G) using a switch that "uses the SA/DA (Source
> > Address/Destination Address) method of distributing traffic across the
> > trunked links"?
> >
> > The problem, at least as far as I can tell, comes down to the limitation
> > of ARP resolution (in the host) and mac-address tables (in the switch):
> >
> > When configured to use Active Load Balancing, the kernel driver leaves
> > each of the interface's MAC addresses unchanged. In this scenario, when
> > Host A sends sends traffic to host Host B, the kernel uses the MAC
> > address of only one of Host B's NIC's as the DA. When the packet arrives
> > at the switch, the switch consults the mac-address table for the DA and
> > then sends the packet to the interface connected to the NIC with MAC
> > address equal to DA. Thus packets from Host A to Host B will only leave
> > the switch through one interface - the interface connected to the NIC
> > with MAC address equal to DA. This has the effect of limiting the
> > throughput from Host A to Host B to the speed of the one interface
> > connected to the NIC with MAC address equal to DA.
> >
> > When configured to use IEEE 802.3ad (LACP), the kernel driver assigns
> > the same MAC address to all of the hosts' interfaces. In this scenario,
> > when Host A sends traffic to Host B, the kernel uses Host B's shared MAC
> > address as the DA. When the packet arrives at the switch, the switch
> > creates a hash based on the SA/DA pair, consults the mac-address table
> > for the DA, and and assigns the flow (i.e., traffic from Host A to Host
> > B) to one of the interfaces connected to Host B. Thus packets from Host
> > A to Host B will only leave the switch through one interface - the
> > interface determined by the SA/DA hash. This has the effect of limiting
> > the throughput from Host A to Host B to the speed of the one interface
> > determined by the hashing method. However, if the flow (from Host A to
> > Host B's shared MAC address) were to be distributed across the different
> > interfaces in a round-robin fashion (as the packets were leaving the
> > switch) the throughput between the hosts would equal the aggregate of
> > the links (IIUC).
> >
> > Is this a limitation of the the Procurve's implementation of LACP? Do
> > other switches use different methods of distributing traffic across the
> > trunked links? Is there another method of aggregating the links between
> > the two hosts (e.g., multipathing)?
> >
> > TIA,
> > Eric Pretorious
> > Truckee, CA
> >
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> 




From lists at alteeve.ca  Thu Jun  7 15:37:07 2012
From: lists at alteeve.ca (Digimer)
Date: Thu, 07 Jun 2012 11:37:07 -0400
Subject: [Linux-cluster] Bonding Interfaces: Active Load Balancing & LACP
In-Reply-To: <1339059091.25816.470.camel@localhost>
References: <1339042333.47748.YahooMailNeo@web121704.mail.ne1.yahoo.com>
	<4FD02F94.9020300@alteeve.ca>
	<1339059091.25816.470.camel@localhost>
Message-ID: <4FD0CAA3.7060906@alteeve.ca>

As an aside; I was using the DGS-3100 switches stacked. The new 
generation of DGS-3120 switches I also used stacked, and are a *marked* 
improvement over the 3100 series. I've not gone back to re-test the 
other bond modes on these switches, as I must live within Red Hat's 
supported configuration. However, this thread might just motivate me to 
pull aside a test cluster and do some of my own testing again, on the 
new switches.

Digimer

On 06/07/2012 04:51 AM, Radu Rendec wrote:
> I also experimented with D-Link DGS-3xxx switches and the bonding
> driver, but in a quite strange configuration: 2 distinct switches
> without any "knowledge" of each other, and with each server having NIC
> #1 connected in one switch and NIC #2 in the other.
>
> In my case, the bonding driver actually splitted the traffic between the
> 2 links and I could achieve higher speeds than with a single link. This
> is my config (nothing fancy):
>
> [root at host ~]# cat /etc/sysconfig/network-scripts/ifcfg-bond2
> DEVICE=bond2
> TYPE=Bonding
> BONDING_OPTS="miimon=50"
> ONBOOT=yes
> BOOTPROTO=none
> BRIDGE=br2
>
> The bond interface was assigned to a bridge in my case, because I needed
> to give network access to some VMs (besides the physical host).
>
> If you don't need 802.1q tagging, I think you could try to create 2
> VLANs in your switch and for all servers have one NIC in one vlan and
> and another NIC in the other vlan and then a bond across those 2 NICs.
>
> The drawback is that everything that is connected in this manner needs
> to have exactly 2 NICs connected in the 2 VLANs. The other problem is
> that if one NIC fails in one of the servers, it won't receive the
> packets that are sent on the corresponding VLAN, so the server will not
> receive half of the traffic that is meant for it.
>
> I'm also curious about the results of your experiments. Please post back
> if you have time.
>
> Thanks,
>
> Radu Rendec
>
> On Thu, 2012-06-07 at 00:35 -0400, Digimer wrote:
>> I know that the only *supported* bond is Active/Passive (mode=1), which
>> of course provides no performance benefit.
>>
>> I tested all types, using more modest D-Link DGS-3100 switches and all
>> other modes failed at some point in failure and recovery testing. If you
>> want to experiment, I'd suggest tweaking corosync's timeouts to be
>> (much?) more generous.
>>
>> I'm curious to hear back on what your experimenting finds.
>>
>> Digimer
>>
>> On 06/07/2012 12:12 AM, Eric wrote:
>>> I'm currently using the HP Procurve 2824 24-port Gigabit Ethernet switch
>>> to for a backside network for synchronizing file systems between the
>>> nodes in the group. Each host has 4 Gigabit NIC's and the goal is to
>>> bond two of the Gigabit NIC's together to create a 2 Gbps link from any
>>> host to any other host but what I'm finding is that the bonded links are
>>> only capable of 1 Gbps from any host to any other host. Is it possible
>>> to create a multi-Gigabit link between two hosts (without having to
>>> upgrade to 10G) using a switch that "uses the SA/DA (Source
>>> Address/Destination Address) method of distributing traffic across the
>>> trunked links"?
>>>
>>> The problem, at least as far as I can tell, comes down to the limitation
>>> of ARP resolution (in the host) and mac-address tables (in the switch):
>>>
>>> When configured to use Active Load Balancing, the kernel driver leaves
>>> each of the interface's MAC addresses unchanged. In this scenario, when
>>> Host A sends sends traffic to host Host B, the kernel uses the MAC
>>> address of only one of Host B's NIC's as the DA. When the packet arrives
>>> at the switch, the switch consults the mac-address table for the DA and
>>> then sends the packet to the interface connected to the NIC with MAC
>>> address equal to DA. Thus packets from Host A to Host B will only leave
>>> the switch through one interface - the interface connected to the NIC
>>> with MAC address equal to DA. This has the effect of limiting the
>>> throughput from Host A to Host B to the speed of the one interface
>>> connected to the NIC with MAC address equal to DA.
>>>
>>> When configured to use IEEE 802.3ad (LACP), the kernel driver assigns
>>> the same MAC address to all of the hosts' interfaces. In this scenario,
>>> when Host A sends traffic to Host B, the kernel uses Host B's shared MAC
>>> address as the DA. When the packet arrives at the switch, the switch
>>> creates a hash based on the SA/DA pair, consults the mac-address table
>>> for the DA, and and assigns the flow (i.e., traffic from Host A to Host
>>> B) to one of the interfaces connected to Host B. Thus packets from Host
>>> A to Host B will only leave the switch through one interface - the
>>> interface determined by the SA/DA hash. This has the effect of limiting
>>> the throughput from Host A to Host B to the speed of the one interface
>>> determined by the hashing method. However, if the flow (from Host A to
>>> Host B's shared MAC address) were to be distributed across the different
>>> interfaces in a round-robin fashion (as the packets were leaving the
>>> switch) the throughput between the hosts would equal the aggregate of
>>> the links (IIUC).
>>>
>>> Is this a limitation of the the Procurve's implementation of LACP? Do
>>> other switches use different methods of distributing traffic across the
>>> trunked links? Is there another method of aggregating the links between
>>> the two hosts (e.g., multipathing)?
>>>
>>> TIA,
>>> Eric Pretorious
>>> Truckee, CA
>>>
>>>
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


-- 
Digimer
Papers and Projects: https://alteeve.com



From epretorious at yahoo.com  Wed Jun 13 03:08:58 2012
From: epretorious at yahoo.com (Eric)
Date: Tue, 12 Jun 2012 20:08:58 -0700 (PDT)
Subject: [Linux-cluster] Bonding Interfaces: Active Load Balancing & LACP
In-Reply-To: <1339042333.47748.YahooMailNeo@web121704.mail.ne1.yahoo.com>
References: <1339042333.47748.YahooMailNeo@web121704.mail.ne1.yahoo.com>
Message-ID: <1339556938.6009.YahooMailNeo@web121702.mail.ne1.yahoo.com>

A good friend explained it this way:

The problem you describe is a fairly well-known issue and there's really not a good fix for it.? Often, a switch will support multiple 
addressing algorithms (L2, L2_L3, L2_L3_L4, L3_L4).? All bond a flow to a given port for egress. This means that if you have a single data flow between two servers that are connected to the same switch, you are limited to the speed of a single uplink.

I'm
 assuming in the case of the HP Procurve 2824 that the "SA/DA (Source 
Address/Destination Address) method of distributing traffic" is really 
marketing speak for L2 hashing.

If there's an option to do L3_L4 
or L2_L3_L4, you might be slightly better off if there are multiple 
flows involved. In your case, it doesn't sound like there actually are 
multiple flows.? If it really is only one flow, you'd need a 10GbE switch and interfaces to go faster.

FYI,
 Brocade has supposedly implemented frame-spraying (true round-robin) on
 their latest switches.? They do this by the use of custom ASICs derived
 from their fibre channel switch lines, which have had frame-spraying 
for some time.

In frame-spraying (assuming a 4-port 
port-channel/loadshare), frame A goes to port 1, frame B goes to port 2,
 frame C goes to port 3, frame D goes to port 4, frame E goes to port 1,
 frame F goes to port 2, etc.

This method supposedly gives a 
fairly good traffic distribution even with small numbers of flows.? 
There are still corner cases where it wouldn't work well.? It also 
doesn't fix any problems that can arise if the sending system doesn't 
implement frame-spraying (which it probably won't).

HTH,
Eric Pretorious
Truckee, CA





>________________________________
> From: Eric <epretorious at yahoo.com>
>To: "linux-cluster at redhat.com" <linux-cluster at redhat.com> 
>Sent: Wednesday, June 6, 2012 9:12 PM
>Subject: [Linux-cluster] Bonding Interfaces: Active Load Balancing & LACP
> 
>
>I'm currently using the HP Procurve 2824 24-port Gigabit Ethernet switch to for a backside network for synchronizing file systems between the nodes in the group. Each host has 4 Gigabit NIC's and the goal is to bond two of the Gigabit NIC's together to create a 2 Gbps link from any host? to any other host but what I'm finding is that the bonded links are only capable of 1 Gbps from any host to any other host. Is it possible to 
create a multi-Gigabit link between two hosts (without having to upgrade to 10G) using a switch that "uses the 
SA/DA (Source Address/Destination Address) method of distributing 
traffic across the trunked links"?
>
>
>
>The problem, at least as far as I can tell, comes down to the 
limitation of ARP resolution (in the host) and mac-address tables (in 
the switch):
>
>When configured to use Active Load Balancing, the kernel driver leaves each of the interface's MAC 
addresses unchanged. In this scenario, when Host A sends sends traffic 
to host Host B, the kernel uses the MAC address of only one of Host B's 
NIC's as the DA. When the packet arrives at the switch, the switch 
consults the mac-address table for the DA and then sends the packet to 
the interface connected to the NIC with MAC address equal to DA. Thus 
packets from Host A to Host B will only leave the switch through one 
interface - the interface connected to the NIC with MAC address equal to DA. This has the effect of limiting the throughput from Host A to Host B to the speed of the one interface connected to the NIC with MAC address equal to DA.
>
>
>When configured to use IEEE 802.3ad (LACP), the kernel driver assigns the same MAC address to all of the hosts' 
interfaces. In this scenario, when Host A sends traffic to Host B, the 
kernel uses Host B's shared MAC address as the DA. When the packet 
arrives at the switch, the switch creates a hash based on the SA/DA 
pair, consults the mac-address table for the DA, and and assigns the 
flow (i.e., traffic from Host A to Host B) to one of the interfaces 
connected to Host B. Thus packets from Host A to Host B will only leave 
the switch through one interface - the interface determined by the SA/DA hash. This has the effect of limiting the throughput from Host A to Host B to the speed of the one?interface determined by the hashing method. However, if the flow (from Host A to Host B's shared MAC 
address) were to be distributed across the different interfaces in a 
round-robin 
fashion (as the 
packets were leaving the switch) the throughput between the hosts would 
equal the aggregate of 
the links (IIUC).
>
>Is this a limitation of the the Procurve's 
implementation of LACP? Do other switches use  different methods of 
distributing traffic across the trunked links? Is there another method 
of aggregating the links between the two hosts (e.g., multipathing)?
>
>TIA,
>Eric Pretorious
>Truckee, CA
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120612/566396de/attachment.htm>

From kailash.kumawat at rudrainfotainment.com  Wed Jun 13 07:35:19 2012
From: kailash.kumawat at rudrainfotainment.com (kailash kumawat)
Date: Wed, 13 Jun 2012 13:05:19 +0530
Subject: [Linux-cluster] Problem in Cluster
Message-ID: <CALO-jX4407SiBVnaQSfUfnNBeWpryeeV=JuL2fON8ESS5_XiWQ@mail.gmail.com>

Hi

I am using two machine for the cluster and i am using 192.168.1.X network
for the private network and one for the public ip see the details below


node1                                    node2                         node3
Primary Cluster                    Secondary Cluster           LUCI
192.168.1.11                        192.168.1.12
192.168.1.13

i have one public ip which is 115.111.45.23 so i want to use this public as
a floating ip for my web server because this is the register ip in my DNS
server so can anyone help me for configure cluster server because http is
not starting when i m using this public ip as a floating ip.



-- 
Regards
*Kailash Kumawat*
System Admin
09167396313
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120613/198e4de7/attachment.htm>

From mailing.sr at gmail.com  Wed Jun 13 10:36:26 2012
From: mailing.sr at gmail.com (Seb)
Date: Wed, 13 Jun 2012 12:36:26 +0200
Subject: [Linux-cluster] Problem in Cluster
In-Reply-To: <CALO-jX4407SiBVnaQSfUfnNBeWpryeeV=JuL2fON8ESS5_XiWQ@mail.gmail.com>
References: <CALO-jX4407SiBVnaQSfUfnNBeWpryeeV=JuL2fON8ESS5_XiWQ@mail.gmail.com>
Message-ID: <CAJrH5GsswxnZmZ8JTcBwyVzGAmM_FKg5qGT8_1H8WxERC478jw@mail.gmail.com>

2012/6/13 kailash kumawat <kailash.kumawat at rudrainfotainment.com>

> Hi
>
> I am using two machine for the cluster and i am using 192.168.1.X network
> for the private network and one for the public ip see the details below
>
>
> node1                                    node2
> node3
> Primary Cluster                    Secondary Cluster           LUCI
> 192.168.1.11                        192.168.1.12
> 192.168.1.13
>
> i have one public ip which is 115.111.45.23 so i want to use this public
> as a floating ip for my web server because this is the register ip in my
> DNS server so can anyone help me for configure cluster server because http
> is not starting when i m using this public ip as a floating ip.
>

Hello,

Please attach your cluster.conf and give the versions of your cluster
packages (and OS) if you want some help.
It could simply be a mis-ordered IP and apache in your cluster.conf
resources.
Does the apache service start fine when launched manually? (floating IP
must be up first if your httpd.conf is relying on this IP)

Regards,
--Sebastien
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120613/7735091a/attachment.htm>

From tspauld98 at yahoo.com  Wed Jun 13 22:27:37 2012
From: tspauld98 at yahoo.com (Timothy Spaulding)
Date: Wed, 13 Jun 2012 15:27:37 -0700 (PDT)
Subject: [Linux-cluster] secret info!))
Message-ID: <1339626457.37034.YahooMailNeo@web162104.mail.bf1.yahoo.com>

http://www.ifhnos2014.org/wp-content/themes/wiredrive-classic/reospa.php?Potatoe+Yam
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120613/19db693f/attachment.htm>

From tspauld98 at yahoo.com  Thu Jun 14 06:27:42 2012
From: tspauld98 at yahoo.com (Timothy Spaulding)
Date: Wed, 13 Jun 2012 23:27:42 -0700 (PDT)
Subject: [Linux-cluster] ... idea!
Message-ID: <1339655262.51152.YahooMailNeo@web162103.mail.bf1.yahoo.com>

http://amydanielson.net/wp-content/plugins/extended-comment-options/reospa.php?Potatoe+Lettuce
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120613/5fe61ef8/attachment.htm>

From jvdiago at gmail.com  Wed Jun 20 12:18:17 2012
From: jvdiago at gmail.com (Javier Vela)
Date: Wed, 20 Jun 2012 14:18:17 +0200
Subject: [Linux-cluster] Node can't join already quorated cluster
Message-ID: <CAEAM5QVAXY9Rq05YayUe6MY9S3S=FcnHs1G2bryzcyDMZO9Whg@mail.gmail.com>

Hi, I have a very strange problem, and after searching through lot of
forums, I haven't found the solution. This is the scenario:

Two node cluster with Red Hat 5.7, HA-LVM, no fencing and quorum disk. I
start qdiskd, cman and rgmanager on one node. After 5 minutes, finally the
fencing finishes and cluster get quorate with 2 votes:

[root at node2 ~]# clustat
Cluster Status for test_cluster @ Wed Jun 20 05:56:39 2012
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 node1-hb                                  1 Offline
 node2-hb                               2 Online, Local, rgmanager
 /dev/mapper/vg_qdisk-lv_qdisk               0 Online, Quorum Disk

 Service Name                   Owner (Last)                   State
 ------- ----                   ----- ------                   -----
 service:postgres                   node2                  started

Now, I start the second node. When cman reaches fencing, it hangs for 5
minutes aprox, and finally fails. clustat says:

root at node1 ~]# clustat
Cluster Status for test_cluster @ Wed Jun 20 06:01:12 2012
Member Status: Inquorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
node1-hb                                  1 Online, Local
node2-hb                               2 Offline
 /dev/mapper/vg_qdisk-lv_qdisk               0 Offline

And in /var/log/messages I can see this errors:

Jun 20 06:02:12 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
Jun 20 06:02:12 node1 openais[6098]: [CLM  ] got nodejoin message 15.15.2.10
Jun 20 06:02:13 node1 dlm_controld[5386]: connect to ccs error -111, check
ccsd or cluster status
Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.  Refusing
connection.
Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect:
Connection refused
Jun 20 06:02:13 node1 ccsd[6090]: Initial status:: Inquorate
Jun 20 06:02:13 node1 gfs_controld[5392]: connect to ccs error -111, check
ccsd or cluster status
Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.  Refusing
connection.
Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect:
Connection refused
Jun 20 06:02:14 node1 openais[6098]: [TOTEM] entering GATHER state from 9.
Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.  Refusing
connection.
Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect:
Connection refused
Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.  Refusing
connection.
Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect:
Connection refused
Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing
connection.
Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
Connection refused
Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing
connection.
Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
Connection refused
Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing
connection.
Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
Connection refused
Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.  Refusing
connection.
Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect:
Connection refused
Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.  Refusing
connection.
Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect:
Connection refused
Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.  Refusing
connection.
Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect:
Connection refused
Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.  Refusing
connection.
Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect:
Connection refused
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state from 0.
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Creating commit token because
I am the rep.
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Storing new sequence id for
ring 15c
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering COMMIT state.
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering RECOVERY state.
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] position [0] member 15.15.2.10:
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] previous ring seq 344 rep
15.15.2.10
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] aru e high delivered e
received flag 1
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Did not need to originate any
messages in recovery.
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Sending initial ORF token
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.  Refusing
connection.
Jun 20 06:02:18 node1 ccsd[6090]: Error while processing connect:
Connection refused
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state from 9.
Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.  Refusing
connection.

And the quorum disk:

[root at node2 ~]# mkqdisk -L -d
kqdisk v0.6.0
/dev/mapper/vg_qdisk-lv_qdisk:
/dev/vg_qdisk/lv_qdisk:
        Magic:                eb7a62c2
        Label:                cluster_qdisk
        Created:              Thu Jun  7 09:23:34 2012
        Host:                 node1
        Kernel Sector Size:   512
        Recorded Sector Size: 512

Status block for node 1
        Last updated by node 2
        Last updated on Wed Jun 20 06:17:23 2012
        State: Evicted
        Flags: 0000
        Score: 0/0
        Average Cycle speed: 0.000500 seconds
        Last Cycle speed: 0.000000 seconds
        Incarnation: 4fe1a06c4fe1a06c
Status block for node 2
        Last updated by node 2
        Last updated on Wed Jun 20 07:09:38 2012
        State: Master
        Flags: 0000
        Score: 0/0
        Average Cycle speed: 0.001000 seconds
        Last Cycle speed: 0.000000 seconds
        Incarnation: 4fe1a06c4fe1a06c


In the other node I don't see any errors in /var/log/messages. One strange
thing is that if I start cman on both nodes at the same time, everything
works fine and both nodes quorate (until I reboot one node and the problem
appears). I've checked that multicast is working properly. With iperf I can
send a receive multicast paquets. Moreover I've seen with tcpdump the
paquets that openais send when cman is trying to start. I've readed about a
bug in RH 5.3 with the same behaviour, but it is solved in RH 5.4.

I don't have Selinux enabled, and Iptables are also disabled. Here is the
cluster.conf simplified (with less services and resources). I want to point
out one thing. I have allow_kill="0" in order to avoid fencing errors when
quorum tries to fence a failed node. As <fence/> is empty, before this
stanza I got a lot of messages in /var/log/messages with failed fencing.

<?xml version="1.0"?>
<cluster alias="test_cluster" config_version="15" name="test_cluster">
        <fence_daemon clean_start="0" post_fail_delay="0"
post_join_delay="-1"/>
        <clusternodes>
                <clusternode name="node1-hb" nodeid="1" votes="1">
                        <fence/>
                </clusternode>
                <clusternode name="node2-hb" nodeid="2" votes="1">
                        <fence/>
                </clusternode>
        </clusternodes>
        <cman two_node="0" expected_votes="3"/>
        <fencedevices/>

        <rm log_facility="local4" log_level="7">
                <failoverdomains>
                        <failoverdomain name="etest_cluster_fo"
nofailback="1" ordered="1" restricted="1">
                                <failoverdomainnode name="node1-hb"
priority="1"/>
                                <failoverdomainnode name="node2-hb"
priority="2"/>
                        </failoverdomain>
                </failoverdomains>
        <resources/>
        <service autostart="1" domain="test_cluster_fo" exclusive="0"
name="postgres" recovery="relocate">
                <ip address="172.24.119.44" monitor_link="1"/>
                <lvm name="vg_postgres" vg_name="vg_postgres"
lv_name="postgres"/>

                <fs device="/dev/vg_postgres/postgres" force_fsck="1"
force_unmount="1" fstype="ext3" mountpoint="/var/lib/pgsql" name="postgres"
self_fence="0"/>

                <script file="/etc/init.d/postgresql" name="postgres">
                </script>
        </service>
        </rm>
        <totem consensus="4000" join="60" token="20000"
token_retransmits_before_loss_const="20"/>
    <quorumd allow_kill="0" interval="1" label="cluster_qdisk" tko="10"
votes="1">
                <heuristic program="/usr/share/cluster/check_eth_link.sh
eth0" score="1" interval="2" tko="3"/>
        </quorumd>
 </cluster>


The /etc/hosts:
172.24.119.10 node1
172.24.119.34 node2
15.15.2.10 node1-hb node1-hb.localdomain
15.15.2.11 node2-hb node2-hb.localdomain

And the versions:
Red Hat Enterprise Linux Server release 5.7 (Tikanga)
cman-2.0.115-85.el5
rgmanager-2.0.52-21.el5
openais-0.80.6-30.el5

I don't know what else I should try, so if you can give me some ideas, I
will be very pleased.

Regards, Javi.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120620/d6ad95f4/attachment.htm>

From emi2fast at gmail.com  Wed Jun 20 12:45:28 2012
From: emi2fast at gmail.com (emmanuel segura)
Date: Wed, 20 Jun 2012 14:45:28 +0200
Subject: [Linux-cluster] Node can't join already quorated cluster
In-Reply-To: <CAEAM5QVAXY9Rq05YayUe6MY9S3S=FcnHs1G2bryzcyDMZO9Whg@mail.gmail.com>
References: <CAEAM5QVAXY9Rq05YayUe6MY9S3S=FcnHs1G2bryzcyDMZO9Whg@mail.gmail.com>
Message-ID: <CAE7pJ3BUnQyzNa+v5YmDi4hsWfojMKogWP_GDVvdmmqkjOiT6A@mail.gmail.com>

If you don't wanna use a real fence divice, because you only do some test,
you have to use fence_manual agent

2012/6/20 Javier Vela <jvdiago at gmail.com>

> Hi, I have a very strange problem, and after searching through lot of
> forums, I haven't found the solution. This is the scenario:
>
> Two node cluster with Red Hat 5.7, HA-LVM, no fencing and quorum disk. I
> start qdiskd, cman and rgmanager on one node. After 5 minutes, finally the
> fencing finishes and cluster get quorate with 2 votes:
>
> [root at node2 ~]# clustat
> Cluster Status for test_cluster @ Wed Jun 20 05:56:39 2012
> Member Status: Quorate
>
>  Member Name                             ID   Status
>  ------ ----                             ---- ------
>  node1-hb                                  1 Offline
>  node2-hb                               2 Online, Local, rgmanager
>  /dev/mapper/vg_qdisk-lv_qdisk               0 Online, Quorum Disk
>
>  Service Name                   Owner (Last)                   State
>  ------- ----                   ----- ------                   -----
>  service:postgres                   node2                  started
>
> Now, I start the second node. When cman reaches fencing, it hangs for 5
> minutes aprox, and finally fails. clustat says:
>
> root at node1 ~]# clustat
> Cluster Status for test_cluster @ Wed Jun 20 06:01:12 2012
> Member Status: Inquorate
>
>  Member Name                             ID   Status
>  ------ ----                             ---- ------
> node1-hb                                  1 Online, Local
> node2-hb                               2 Offline
>  /dev/mapper/vg_qdisk-lv_qdisk               0 Offline
>
> And in /var/log/messages I can see this errors:
>
> Jun 20 06:02:12 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
> Jun 20 06:02:12 node1 openais[6098]: [CLM  ] got nodejoin message
> 15.15.2.10
> Jun 20 06:02:13 node1 dlm_controld[5386]: connect to ccs error -111, check
> ccsd or cluster status
> Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> connection.
> Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:13 node1 ccsd[6090]: Initial status:: Inquorate
> Jun 20 06:02:13 node1 gfs_controld[5392]: connect to ccs error -111, check
> ccsd or cluster status
> Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> connection.
> Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:14 node1 openais[6098]: [TOTEM] entering GATHER state from 9.
> Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> connection.
> Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> connection.
> Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> connection.
> Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> connection.
> Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> connection.
> Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> connection.
> Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> connection.
> Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> connection.
> Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> connection.
> Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state from 0.
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Creating commit token because
> I am the rep.
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Storing new sequence id for
> ring 15c
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering COMMIT state.
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering RECOVERY state.
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] position [0] member
> 15.15.2.10:
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] previous ring seq 344 rep
> 15.15.2.10
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] aru e high delivered e
> received flag 1
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Did not need to originate any
> messages in recovery.
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Sending initial ORF token
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
> Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> connection.
> Jun 20 06:02:18 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state from 9.
> Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> connection.
>
> And the quorum disk:
>
> [root at node2 ~]# mkqdisk -L -d
> kqdisk v0.6.0
> /dev/mapper/vg_qdisk-lv_qdisk:
> /dev/vg_qdisk/lv_qdisk:
>         Magic:                eb7a62c2
>         Label:                cluster_qdisk
>         Created:              Thu Jun  7 09:23:34 2012
>         Host:                 node1
>         Kernel Sector Size:   512
>         Recorded Sector Size: 512
>
> Status block for node 1
>         Last updated by node 2
>         Last updated on Wed Jun 20 06:17:23 2012
>         State: Evicted
>         Flags: 0000
>         Score: 0/0
>         Average Cycle speed: 0.000500 seconds
>         Last Cycle speed: 0.000000 seconds
>         Incarnation: 4fe1a06c4fe1a06c
> Status block for node 2
>         Last updated by node 2
>         Last updated on Wed Jun 20 07:09:38 2012
>         State: Master
>         Flags: 0000
>         Score: 0/0
>         Average Cycle speed: 0.001000 seconds
>         Last Cycle speed: 0.000000 seconds
>         Incarnation: 4fe1a06c4fe1a06c
>
>
> In the other node I don't see any errors in /var/log/messages. One strange
> thing is that if I start cman on both nodes at the same time, everything
> works fine and both nodes quorate (until I reboot one node and the problem
> appears). I've checked that multicast is working properly. With iperf I can
> send a receive multicast paquets. Moreover I've seen with tcpdump the
> paquets that openais send when cman is trying to start. I've readed about a
> bug in RH 5.3 with the same behaviour, but it is solved in RH 5.4.
>
> I don't have Selinux enabled, and Iptables are also disabled. Here is the
> cluster.conf simplified (with less services and resources). I want to point
> out one thing. I have allow_kill="0" in order to avoid fencing errors when
> quorum tries to fence a failed node. As <fence/> is empty, before this
> stanza I got a lot of messages in /var/log/messages with failed fencing.
>
> <?xml version="1.0"?>
> <cluster alias="test_cluster" config_version="15" name="test_cluster">
>         <fence_daemon clean_start="0" post_fail_delay="0"
> post_join_delay="-1"/>
>         <clusternodes>
>                 <clusternode name="node1-hb" nodeid="1" votes="1">
>                         <fence/>
>                 </clusternode>
>                 <clusternode name="node2-hb" nodeid="2" votes="1">
>                         <fence/>
>                 </clusternode>
>         </clusternodes>
>         <cman two_node="0" expected_votes="3"/>
>         <fencedevices/>
>
>         <rm log_facility="local4" log_level="7">
>                 <failoverdomains>
>                         <failoverdomain name="etest_cluster_fo"
> nofailback="1" ordered="1" restricted="1">
>                                 <failoverdomainnode name="node1-hb"
> priority="1"/>
>                                 <failoverdomainnode name="node2-hb"
> priority="2"/>
>                         </failoverdomain>
>                 </failoverdomains>
>         <resources/>
>         <service autostart="1" domain="test_cluster_fo" exclusive="0"
> name="postgres" recovery="relocate">
>                 <ip address="172.24.119.44" monitor_link="1"/>
>                 <lvm name="vg_postgres" vg_name="vg_postgres"
> lv_name="postgres"/>
>
>                 <fs device="/dev/vg_postgres/postgres" force_fsck="1"
> force_unmount="1" fstype="ext3" mountpoint="/var/lib/pgsql" name="postgres"
> self_fence="0"/>
>
>                 <script file="/etc/init.d/postgresql" name="postgres">
>                 </script>
>         </service>
>         </rm>
>         <totem consensus="4000" join="60" token="20000"
> token_retransmits_before_loss_const="20"/>
>     <quorumd allow_kill="0" interval="1" label="cluster_qdisk" tko="10"
> votes="1">
>                 <heuristic program="/usr/share/cluster/check_eth_link.sh
> eth0" score="1" interval="2" tko="3"/>
>         </quorumd>
>  </cluster>
>
>
> The /etc/hosts:
> 172.24.119.10 node1
> 172.24.119.34 node2
> 15.15.2.10 node1-hb node1-hb.localdomain
> 15.15.2.11 node2-hb node2-hb.localdomain
>
> And the versions:
> Red Hat Enterprise Linux Server release 5.7 (Tikanga)
> cman-2.0.115-85.el5
> rgmanager-2.0.52-21.el5
> openais-0.80.6-30.el5
>
> I don't know what else I should try, so if you can give me some ideas, I
> will be very pleased.
>
> Regards, Javi.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
esta es mi vida e me la vivo hasta que dios quiera
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120620/258b6cd7/attachment.htm>

From jvdiago at gmail.com  Wed Jun 20 13:43:04 2012
From: jvdiago at gmail.com (Javier Vela)
Date: Wed, 20 Jun 2012 15:43:04 +0200
Subject: [Linux-cluster] Node can't join already quorated cluster
Message-ID: <CAEAM5QWVeTAdFNevhENARacq3fyRX0N9-TgB13DchyfzzDFpeA@mail.gmail.com>

As I readed, if you use HA-LVM you don't need fencing because of vg
tagging. Is It absolutely mandatory to use fencing with qdisk?

If it is, i supose i can use manual_fence, but in production I also won't
use fencing.

Regards, Javi.

Date: Wed, 20 Jun 2012 14:45:28 +0200
From: emi2fast at gmail.com
To: linux-cluster at redhat.com
Subject: Re: [Linux-cluster] Node can't join already quorated cluster

If you don't wanna use a real fence divice, because you only do some test,
you have to use fence_manual agent

2012/6/20 Javier Vela <jvdiago at gmail.com>

Hi, I have a very strange problem, and after searching through lot of
forums, I haven't found the solution. This is the scenario:

Two node cluster with Red Hat 5.7, HA-LVM, no fencing and quorum disk. I
start qdiskd, cman and rgmanager on one node. After 5 minutes, finally the
fencing finishes and cluster get quorate with 2 votes:

[root at node2 ~]# clustat
Cluster Status for test_cluster @ Wed Jun 20 05:56:39 2012
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 node1-hb                                  1 Offline
 node2-hb                               2 Online, Local, rgmanager
 /dev/mapper/vg_qdisk-lv_qdisk               0 Online, Quorum Disk

 Service Name                   Owner (Last)                   State
 ------- ----                   ----- ------                   -----
 service:postgres                   node2                  started

Now, I start the second node. When cman reaches fencing, it hangs for 5
minutes aprox, and finally fails. clustat says:

root at node1 ~]# clustat
Cluster Status for test_cluster @ Wed Jun 20 06:01:12 2012
Member Status: Inquorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
node1-hb                                  1 Online, Local
node2-hb                               2 Offline
 /dev/mapper/vg_qdisk-lv_qdisk               0 Offline

And in /var/log/messages I can see this errors:

Jun 20 06:02:12 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
Jun 20 06:02:12 node1 openais[6098]: [CLM  ] got nodejoin message 15.15.2.10
Jun 20 06:02:13 node1 dlm_controld[5386]: connect to ccs error -111, check
ccsd or cluster status
Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.  Refusing
connection.
Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect:
Connection refused
Jun 20 06:02:13 node1 ccsd[6090]: Initial status:: Inquorate
Jun 20 06:02:13 node1 gfs_controld[5392]: connect to ccs error -111, check
ccsd or cluster status
Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.  Refusing
connection.
Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect:
Connection refused
Jun 20 06:02:14 node1 openais[6098]: [TOTEM] entering GATHER state from 9.
Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.  Refusing
connection.
Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect:
Connection refused
Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.  Refusing
connection.
Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect:
Connection refused
Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing
connection.
Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
Connection refused
Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing
connection.
Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
Connection refused
Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing
connection.
Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
Connection refused
Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.  Refusing
connection.
Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect:
Connection refused
Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.  Refusing
connection.
Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect:
Connection refused
Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.  Refusing
connection.
Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect:
Connection refused
Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.  Refusing
connection.
Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect:
Connection refused
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state from 0.
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Creating commit token because
I am the rep.
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Storing new sequence id for
ring 15c
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering COMMIT state.
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering RECOVERY state.
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] position [0] member 15.15.2.10:
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] previous ring seq 344 rep
15.15.2.10
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] aru e high delivered e
received flag 1
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Did not need to originate any
messages in recovery.
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Sending initial ORF token
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.  Refusing
connection.
Jun 20 06:02:18 node1 ccsd[6090]: Error while processing connect:
Connection refused
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state from 9.
Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.  Refusing
connection.

And the quorum disk:

[root at node2 ~]# mkqdisk -L -d
kqdisk v0.6.0
/dev/mapper/vg_qdisk-lv_qdisk:
/dev/vg_qdisk/lv_qdisk:
        Magic:                eb7a62c2
        Label:                cluster_qdisk
        Created:              Thu Jun  7 09:23:34 2012
        Host:                 node1
        Kernel Sector Size:   512
        Recorded Sector Size: 512

Status block for node 1
        Last updated by node 2
        Last updated on Wed Jun 20 06:17:23 2012
        State: Evicted
        Flags: 0000
        Score: 0/0
        Average Cycle speed: 0.000500 seconds
        Last Cycle speed: 0.000000 seconds
        Incarnation: 4fe1a06c4fe1a06c
Status block for node 2
        Last updated by node 2
        Last updated on Wed Jun 20 07:09:38 2012
        State: Master
        Flags: 0000
        Score: 0/0
        Average Cycle speed: 0.001000 seconds
        Last Cycle speed: 0.000000 seconds
        Incarnation: 4fe1a06c4fe1a06c


In the other node I don't see any errors in /var/log/messages. One strange
thing is that if I start cman on both nodes at the same time, everything
works fine and both nodes quorate (until I reboot one node and the problem
appears). I've checked that multicast is working properly. With iperf I can
send a receive multicast paquets. Moreover I've seen with tcpdump the
paquets that openais send when cman is trying to start. I've readed about a
bug in RH 5.3 with the same behaviour, but it is solved in RH 5.4.

I don't have Selinux enabled, and Iptables are also disabled. Here is the
cluster.conf simplified (with less services and resources). I want to point
out one thing. I have allow_kill="0" in order to avoid fencing errors when
quorum tries to fence a failed node. As <fence/> is empty, before this
stanza I got a lot of messages in /var/log/messages with failed fencing.

<?xml version="1.0"?>
<cluster alias="test_cluster" config_version="15" name="test_cluster">
        <fence_daemon clean_start="0" post_fail_delay="0"
post_join_delay="-1"/>
        <clusternodes>
                <clusternode name="node1-hb" nodeid="1" votes="1">
                        <fence/>
                </clusternode>
                <clusternode name="node2-hb" nodeid="2" votes="1">
                        <fence/>
                </clusternode>
        </clusternodes>
        <cman two_node="0" expected_votes="3"/>
        <fencedevices/>

        <rm log_facility="local4" log_level="7">
                <failoverdomains>
                        <failoverdomain name="etest_cluster_fo"
nofailback="1" ordered="1" restricted="1">
                                <failoverdomainnode name="node1-hb"
priority="1"/>
                                <failoverdomainnode name="node2-hb"
priority="2"/>
                        </failoverdomain>
                </failoverdomains>
        <resources/>
        <service autostart="1" domain="test_cluster_fo" exclusive="0"
name="postgres" recovery="relocate">
                <ip address="172.24.119.44" monitor_link="1"/>
                <lvm name="vg_postgres" vg_name="vg_postgres"
lv_name="postgres"/>

                <fs device="/dev/vg_postgres/postgres" force_fsck="1"
force_unmount="1" fstype="ext3" mountpoint="/var/lib/pgsql" name="postgres"
self_fence="0"/>

                <script file="/etc/init.d/postgresql" name="postgres">
                </script>
        </service>
        </rm>
        <totem consensus="4000" join="60" token="20000"
token_retransmits_before_loss_const="20"/>
    <quorumd allow_kill="0" interval="1" label="cluster_qdisk" tko="10"
votes="1">
                <heuristic program="/usr/share/cluster/check_eth_link.sh
eth0" score="1" interval="2" tko="3"/>
        </quorumd>
 </cluster>


The /etc/hosts:
172.24.119.10 node1
172.24.119.34 node2
15.15.2.10 node1-hb node1-hb.localdomain
15.15.2.11 node2-hb node2-hb.localdomain

And the versions:
Red Hat Enterprise Linux Server release 5.7 (Tikanga)
cman-2.0.115-85.el5
rgmanager-2.0.52-21.el5
openais-0.80.6-30.el5

I don't know what else I should try, so if you can give me some ideas, I
will be very pleased.

Regards, Javi.

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster




-- 
esta es mi vida e me la vivo hasta que dios quiera

-- Linux-cluster mailing list Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120620/125988e2/attachment.htm>

From jvdiago at gmail.com  Wed Jun 20 13:50:13 2012
From: jvdiago at gmail.com (Javier Vela)
Date: Wed, 20 Jun 2012 15:50:13 +0200
Subject: [Linux-cluster]
	=?utf-8?q?Node_can=C2=B4t_join_already_quorated_c?=
	=?utf-8?b?bHVzdGVy4oCP?=
Message-ID: <CAEAM5QUfcfZEM-0K4kZgieDc2ZdBr9UOFxCBLyLY612OBD4YpQ@mail.gmail.com>

In the lvm.conf I have in volume_list the name of the vg_qdisk so this
volume group should be available to both nodes at the same time.

My volume_list in lvm.conf:

node1:
volume_list = [ "vg00", "vg_qdisk", "@node1-hb" ]

node2:
volume_list = [ "vg00", "vg_qdisk", "@node2-hb" ]

Moreover with the comand lvdisplay I can see that the lv is available to
both nodes. But maybe is worth to try another qdisk without lvm.


> Hi.
>
>
>
> Since you have HA-LVM, are you using volume tagging ? I noticed that your
> quorum disk belongs to a volume group vg_qdisk and I think when the first
> node that will activate the volumegroup will not allow the second node to
> activate the volumegroup because of volume tagging, so remove the
> quorumdisk from the volumegroup and just use it as a physical volume.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120620/3e77274d/attachment.htm>

From emi2fast at gmail.com  Wed Jun 20 13:59:18 2012
From: emi2fast at gmail.com (emmanuel segura)
Date: Wed, 20 Jun 2012 15:59:18 +0200
Subject: [Linux-cluster] Node can't join already quorated cluster
In-Reply-To: <CAEAM5QWVeTAdFNevhENARacq3fyRX0N9-TgB13DchyfzzDFpeA@mail.gmail.com>
References: <CAEAM5QWVeTAdFNevhENARacq3fyRX0N9-TgB13DchyfzzDFpeA@mail.gmail.com>
Message-ID: <CAE7pJ3BeC7LdMdMK+HZka7dReGGP6Thj4x2fMM=FceaoGM6q8Q@mail.gmail.com>

Fencing it's critical component of a cluster and i think it requires

A cluster without fencing it's not a good idea, but as you know that's your
choice

2012/6/20 Javier Vela <jvdiago at gmail.com>

> As I readed, if you use HA-LVM you don't need fencing because of vg
> tagging. Is It absolutely mandatory to use fencing with qdisk?
>
> If it is, i supose i can use manual_fence, but in production I also won't
> use fencing.
>
> Regards, Javi.
>
> Date: Wed, 20 Jun 2012 14:45:28 +0200
> From: emi2fast at gmail.com
> To: linux-cluster at redhat.com
> Subject: Re: [Linux-cluster] Node can't join already quorated cluster
>
>
> If you don't wanna use a real fence divice, because you only do some test,
> you have to use fence_manual agent
>
> 2012/6/20 Javier Vela <jvdiago at gmail.com>
>
> Hi, I have a very strange problem, and after searching through lot of
> forums, I haven't found the solution. This is the scenario:
>
> Two node cluster with Red Hat 5.7, HA-LVM, no fencing and quorum disk. I
> start qdiskd, cman and rgmanager on one node. After 5 minutes, finally the
> fencing finishes and cluster get quorate with 2 votes:
>
> [root at node2 ~]# clustat
> Cluster Status for test_cluster @ Wed Jun 20 05:56:39 2012
> Member Status: Quorate
>
>  Member Name                             ID   Status
>  ------ ----                             ---- ------
>  node1-hb                                  1 Offline
>  node2-hb                               2 Online, Local, rgmanager
>  /dev/mapper/vg_qdisk-lv_qdisk               0 Online, Quorum Disk
>
>  Service Name                   Owner (Last)                   State
>  ------- ----                   ----- ------                   -----
>  service:postgres                   node2                  started
>
> Now, I start the second node. When cman reaches fencing, it hangs for 5
> minutes aprox, and finally fails. clustat says:
>
> root at node1 ~]# clustat
> Cluster Status for test_cluster @ Wed Jun 20 06:01:12 2012
> Member Status: Inquorate
>
>  Member Name                             ID   Status
>  ------ ----                             ---- ------
> node1-hb                                  1 Online, Local
> node2-hb                               2 Offline
>  /dev/mapper/vg_qdisk-lv_qdisk               0 Offline
>
> And in /var/log/messages I can see this errors:
>
> Jun 20 06:02:12 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
> Jun 20 06:02:12 node1 openais[6098]: [CLM  ] got nodejoin message
> 15.15.2.10
> Jun 20 06:02:13 node1 dlm_controld[5386]: connect to ccs error -111, check
> ccsd or cluster status
> Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> connection.
> Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:13 node1 ccsd[6090]: Initial status:: Inquorate
> Jun 20 06:02:13 node1 gfs_controld[5392]: connect to ccs error -111, check
> ccsd or cluster status
> Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> connection.
> Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:14 node1 openais[6098]: [TOTEM] entering GATHER state from 9.
> Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> connection.
> Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> connection.
> Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> connection.
> Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> connection.
> Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> connection.
> Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> connection.
> Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> connection.
> Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> connection.
> Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> connection.
> Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state from 0.
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Creating commit token because
> I am the rep.
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Storing new sequence id for
> ring 15c
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering COMMIT state.
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering RECOVERY state.
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] position [0] member
> 15.15.2.10:
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] previous ring seq 344 rep
> 15.15.2.10
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] aru e high delivered e
> received flag 1
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Did not need to originate any
> messages in recovery.
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Sending initial ORF token
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
> Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> connection.
> Jun 20 06:02:18 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state from 9.
> Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> connection.
>
> And the quorum disk:
>
> [root at node2 ~]# mkqdisk -L -d
> kqdisk v0.6.0
> /dev/mapper/vg_qdisk-lv_qdisk:
> /dev/vg_qdisk/lv_qdisk:
>         Magic:                eb7a62c2
>         Label:                cluster_qdisk
>         Created:              Thu Jun  7 09:23:34 2012
>         Host:                 node1
>         Kernel Sector Size:   512
>         Recorded Sector Size: 512
>
> Status block for node 1
>         Last updated by node 2
>         Last updated on Wed Jun 20 06:17:23 2012
>         State: Evicted
>         Flags: 0000
>         Score: 0/0
>         Average Cycle speed: 0.000500 seconds
>         Last Cycle speed: 0.000000 seconds
>         Incarnation: 4fe1a06c4fe1a06c
> Status block for node 2
>         Last updated by node 2
>         Last updated on Wed Jun 20 07:09:38 2012
>         State: Master
>         Flags: 0000
>         Score: 0/0
>         Average Cycle speed: 0.001000 seconds
>         Last Cycle speed: 0.000000 seconds
>         Incarnation: 4fe1a06c4fe1a06c
>
>
> In the other node I don't see any errors in /var/log/messages. One strange
> thing is that if I start cman on both nodes at the same time, everything
> works fine and both nodes quorate (until I reboot one node and the problem
> appears). I've checked that multicast is working properly. With iperf I can
> send a receive multicast paquets. Moreover I've seen with tcpdump the
> paquets that openais send when cman is trying to start. I've readed about a
> bug in RH 5.3 with the same behaviour, but it is solved in RH 5.4.
>
> I don't have Selinux enabled, and Iptables are also disabled. Here is the
> cluster.conf simplified (with less services and resources). I want to point
> out one thing. I have allow_kill="0" in order to avoid fencing errors when
> quorum tries to fence a failed node. As <fence/> is empty, before this
> stanza I got a lot of messages in /var/log/messages with failed fencing.
>
> <?xml version="1.0"?>
> <cluster alias="test_cluster" config_version="15" name="test_cluster">
>         <fence_daemon clean_start="0" post_fail_delay="0"
> post_join_delay="-1"/>
>         <clusternodes>
>                 <clusternode name="node1-hb" nodeid="1" votes="1">
>                         <fence/>
>                 </clusternode>
>                 <clusternode name="node2-hb" nodeid="2" votes="1">
>                         <fence/>
>                 </clusternode>
>         </clusternodes>
>         <cman two_node="0" expected_votes="3"/>
>         <fencedevices/>
>
>         <rm log_facility="local4" log_level="7">
>                 <failoverdomains>
>                         <failoverdomain name="etest_cluster_fo"
> nofailback="1" ordered="1" restricted="1">
>                                 <failoverdomainnode name="node1-hb"
> priority="1"/>
>                                 <failoverdomainnode name="node2-hb"
> priority="2"/>
>                         </failoverdomain>
>                 </failoverdomains>
>         <resources/>
>         <service autostart="1" domain="test_cluster_fo" exclusive="0"
> name="postgres" recovery="relocate">
>                 <ip address="172.24.119.44" monitor_link="1"/>
>                 <lvm name="vg_postgres" vg_name="vg_postgres"
> lv_name="postgres"/>
>
>                 <fs device="/dev/vg_postgres/postgres" force_fsck="1"
> force_unmount="1" fstype="ext3" mountpoint="/var/lib/pgsql" name="postgres"
> self_fence="0"/>
>
>                 <script file="/etc/init.d/postgresql" name="postgres">
>                 </script>
>         </service>
>         </rm>
>         <totem consensus="4000" join="60" token="20000"
> token_retransmits_before_loss_const="20"/>
>     <quorumd allow_kill="0" interval="1" label="cluster_qdisk" tko="10"
> votes="1">
>                 <heuristic program="/usr/share/cluster/check_eth_link.sh
> eth0" score="1" interval="2" tko="3"/>
>         </quorumd>
>  </cluster>
>
>
> The /etc/hosts:
> 172.24.119.10 node1
> 172.24.119.34 node2
> 15.15.2.10 node1-hb node1-hb.localdomain
> 15.15.2.11 node2-hb node2-hb.localdomain
>
> And the versions:
> Red Hat Enterprise Linux Server release 5.7 (Tikanga)
> cman-2.0.115-85.el5
> rgmanager-2.0.52-21.el5
> openais-0.80.6-30.el5
>
> I don't know what else I should try, so if you can give me some ideas, I
> will be very pleased.
>
> Regards, Javi.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>
>
> --
> esta es mi vida e me la vivo hasta que dios quiera
>
> -- Linux-cluster mailing list Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
esta es mi vida e me la vivo hasta que dios quiera
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120620/67f9817a/attachment.htm>

From emi2fast at gmail.com  Wed Jun 20 14:01:42 2012
From: emi2fast at gmail.com (emmanuel segura)
Date: Wed, 20 Jun 2012 16:01:42 +0200
Subject: [Linux-cluster]
	=?utf-8?q?Node_can=C2=B4t_join_already_quorated_c?=
	=?utf-8?b?bHVzdGVy4oCP?=
In-Reply-To: <CAEAM5QUfcfZEM-0K4kZgieDc2ZdBr9UOFxCBLyLY612OBD4YpQ@mail.gmail.com>
References: <CAEAM5QUfcfZEM-0K4kZgieDc2ZdBr9UOFxCBLyLY612OBD4YpQ@mail.gmail.com>
Message-ID: <CAE7pJ3A8vOeYkH3kmn=zKepRMrGSMnyyRMq82OnFhTVXMsCmrg@mail.gmail.com>

Hello Javier

Can you send the ouput of this command for every node?

vgs -o tags,vg_name

2012/6/20 Javier Vela <jvdiago at gmail.com>

> In the lvm.conf I have in volume_list the name of the vg_qdisk so this
> volume group should be available to both nodes at the same time.
>
> My volume_list in lvm.conf:
>
> node1:
> volume_list = [ "vg00", "vg_qdisk", "@node1-hb" ]
>
> node2:
> volume_list = [ "vg00", "vg_qdisk", "@node2-hb" ]
>
> Moreover with the comand lvdisplay I can see that the lv is available to
> both nodes. But maybe is worth to try another qdisk without lvm.
>
>
>> Hi.
>>
>>
>>
>> Since you have HA-LVM, are you using volume tagging ? I noticed that your
>> quorum disk belongs to a volume group vg_qdisk and I think when the first
>> node that will activate the volumegroup will not allow the second node to
>> activate the volumegroup because of volume tagging, so remove the
>> quorumdisk from the volumegroup and just use it as a physical volume.
>>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
esta es mi vida e me la vivo hasta que dios quiera
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120620/09e9db6d/attachment.htm>

From lists at alteeve.ca  Wed Jun 20 14:40:59 2012
From: lists at alteeve.ca (Digimer)
Date: Wed, 20 Jun 2012 10:40:59 -0400
Subject: [Linux-cluster] Node can't join already quorated cluster
In-Reply-To: <CAEAM5QWVeTAdFNevhENARacq3fyRX0N9-TgB13DchyfzzDFpeA@mail.gmail.com>
References: <CAEAM5QWVeTAdFNevhENARacq3fyRX0N9-TgB13DchyfzzDFpeA@mail.gmail.com>
Message-ID: <4FE1E0FB.1020001@alteeve.ca>

Fencing is critical, and running a cluster without fencing, even with 
qdisk, is not supported. Manual fencing is also not supported. The 
*only* way to have a reliable cluster, testing or production, is to use 
fencing.

Why do you not wish to use it?

On 06/20/2012 09:43 AM, Javier Vela wrote:
> As I readed, if you use HA-LVM you don't need fencing because of vg
> tagging. Is It absolutely mandatory to use fencing with qdisk?
>
> If it is, i supose i can use manual_fence, but in production I also
> won't use fencing.
>
> Regards, Javi.
>
> Date: Wed, 20 Jun 2012 14:45:28 +0200
> From: emi2fast at gmail.com <mailto:emi2fast at gmail.com>
> To: linux-cluster at redhat.com <mailto:linux-cluster at redhat.com>
> Subject: Re: [Linux-cluster] Node can't join already quorated cluster
>
> If you don't wanna use a real fence divice, because you only do some
> test, you have to use fence_manual agent
>
> 2012/6/20 Javier Vela <jvdiago at gmail.com <mailto:jvdiago at gmail.com>>
>
>     Hi, I have a very strange problem, and after searching through lot
>     of forums, I haven't found the solution. This is the scenario:
>
>     Two node cluster with Red Hat 5.7, HA-LVM, no fencing and quorum
>     disk. I start qdiskd, cman and rgmanager on one node. After 5
>     minutes, finally the fencing finishes and cluster get quorate with 2
>     votes:
>
>     [root at node2 ~]# clustat
>     Cluster Status for test_cluster @ Wed Jun 20 05:56:39 2012
>     Member Status: Quorate
>
>       Member Name                             ID   Status
>       ------ ----                             ---- ------
>       node1-hb                                  1 Offline
>       node2-hb                               2 Online, Local, rgmanager
>       /dev/mapper/vg_qdisk-lv_qdisk               0 Online, Quorum Disk
>
>       Service Name                   Owner (Last)                   State
>       ------- ----                   ----- ------                   -----
>       service:postgres                   node2                  started
>
>     Now, I start the second node. When cman reaches fencing, it hangs
>     for 5 minutes aprox, and finally fails. clustat says:
>
>     root at node1 ~]# clustat
>     Cluster Status for test_cluster @ Wed Jun 20 06:01:12 2012
>     Member Status: Inquorate
>
>       Member Name                             ID   Status
>       ------ ----                             ---- ------
>     node1-hb                                  1 Online, Local
>     node2-hb                               2 Offline
>       /dev/mapper/vg_qdisk-lv_qdisk               0 Offline
>
>     And in /var/log/messages I can see this errors:
>
>     Jun 20 06:02:12 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
>     Jun 20 06:02:12 node1 openais[6098]: [CLM  ] got nodejoin message
>     15.15.2.10
>     Jun 20 06:02:13 node1 dlm_controld[5386]: connect to ccs error -111,
>     check ccsd or cluster status
>     Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>     connection.
>     Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect:
>     Connection refused
>     Jun 20 06:02:13 node1 ccsd[6090]: Initial status:: Inquorate
>     Jun 20 06:02:13 node1 gfs_controld[5392]: connect to ccs error -111,
>     check ccsd or cluster status
>     Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>     connection.
>     Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect:
>     Connection refused
>     Jun 20 06:02:14 node1 openais[6098]: [TOTEM] entering GATHER state
>     from 9.
>     Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>     connection.
>     Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect:
>     Connection refused
>     Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>     connection.
>     Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect:
>     Connection refused
>     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>     connection.
>     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
>     Connection refused
>     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>     connection.
>     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
>     Connection refused
>     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>     connection.
>     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
>     Connection refused
>     Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>     connection.
>     Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect:
>     Connection refused
>     Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>     connection.
>     Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect:
>     Connection refused
>     Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>     connection.
>     Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect:
>     Connection refused
>     Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>     connection.
>     Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect:
>     Connection refused
>     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state
>     from 0.
>     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Creating commit token
>     because I am the rep.
>     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Storing new sequence id
>     for ring 15c
>     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering COMMIT state.
>     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering RECOVERY state.
>     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] position [0] member
>     15.15.2.10 <http://15.15.2.10>:
>     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] previous ring seq 344
>     rep 15.15.2.10
>     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] aru e high delivered e
>     received flag 1
>     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Did not need to
>     originate any messages in recovery.
>     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Sending initial ORF token
>     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
>     Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>     connection.
>     Jun 20 06:02:18 node1 ccsd[6090]: Error while processing connect:
>     Connection refused
>     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state
>     from 9.
>     Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>     connection.
>
>     And the quorum disk:
>
>     [root at node2 ~]# mkqdisk -L -d
>     kqdisk v0.6.0
>     /dev/mapper/vg_qdisk-lv_qdisk:
>     /dev/vg_qdisk/lv_qdisk:
>              Magic:                eb7a62c2
>              Label:                cluster_qdisk
>              Created:              Thu Jun  7 09:23:34 2012
>              Host:                 node1
>              Kernel Sector Size:   512
>              Recorded Sector Size: 512
>
>     Status block for node 1
>              Last updated by node 2
>              Last updated on Wed Jun 20 06:17:23 2012
>              State: Evicted
>              Flags: 0000
>              Score: 0/0
>              Average Cycle speed: 0.000500 seconds
>              Last Cycle speed: 0.000000 seconds
>              Incarnation: 4fe1a06c4fe1a06c
>     Status block for node 2
>              Last updated by node 2
>              Last updated on Wed Jun 20 07:09:38 2012
>              State: Master
>              Flags: 0000
>              Score: 0/0
>              Average Cycle speed: 0.001000 seconds
>              Last Cycle speed: 0.000000 seconds
>              Incarnation: 4fe1a06c4fe1a06c
>
>
>     In the other node I don't see any errors in /var/log/messages. One
>     strange thing is that if I start cman on both nodes at the same
>     time, everything works fine and both nodes quorate (until I reboot
>     one node and the problem appears). I've checked that multicast is
>     working properly. With iperf I can send a receive multicast paquets.
>     Moreover I've seen with tcpdump the paquets that openais send when
>     cman is trying to start. I've readed about a bug in RH 5.3 with the
>     same behaviour, but it is solved in RH 5.4.
>
>     I don't have Selinux enabled, and Iptables are also disabled. Here
>     is the cluster.conf simplified (with less services and resources). I
>     want to point out one thing. I have allow_kill="0" in order to avoid
>     fencing errors when quorum tries to fence a failed node. As <fence/>
>     is empty, before this stanza I got a lot of messages in
>     /var/log/messages with failed fencing.
>
>     <?xml version="1.0"?>
>     <cluster alias="test_cluster" config_version="15" name="test_cluster">
>              <fence_daemon clean_start="0" post_fail_delay="0"
>     post_join_delay="-1"/>
>              <clusternodes>
>                      <clusternode name="node1-hb" nodeid="1" votes="1">
>                              <fence/>
>                      </clusternode>
>                      <clusternode name="node2-hb" nodeid="2" votes="1">
>                              <fence/>
>                      </clusternode>
>              </clusternodes>
>              <cman two_node="0" expected_votes="3"/>
>              <fencedevices/>
>
>              <rm log_facility="local4" log_level="7">
>                      <failoverdomains>
>                              <failoverdomain name="etest_cluster_fo"
>     nofailback="1" ordered="1" restricted="1">
>                                      <failoverdomainnode name="node1-hb"
>     priority="1"/>
>                                      <failoverdomainnode name="node2-hb"
>     priority="2"/>
>                              </failoverdomain>
>                      </failoverdomains>
>              <resources/>
>              <service autostart="1" domain="test_cluster_fo"
>     exclusive="0" name="postgres" recovery="relocate">
>                      <ip address="172.24.119.44" monitor_link="1"/>
>                      <lvm name="vg_postgres" vg_name="vg_postgres"
>     lv_name="postgres"/>
>
>                      <fs device="/dev/vg_postgres/postgres"
>     force_fsck="1" force_unmount="1" fstype="ext3"
>     mountpoint="/var/lib/pgsql" name="postgres" self_fence="0"/>
>
>                      <script file="/etc/init.d/postgresql" name="postgres">
>                      </script>
>              </service>
>              </rm>
>              <totem consensus="4000" join="60" token="20000"
>     token_retransmits_before_loss_const="20"/>
>          <quorumd allow_kill="0" interval="1" label="cluster_qdisk"
>     tko="10" votes="1">
>                      <heuristic
>     program="/usr/share/cluster/check_eth_link.sh eth0" score="1"
>     interval="2" tko="3"/>
>              </quorumd>
>       </cluster>
>
>
>     The /etc/hosts:
>     172.24.119.10 node1
>     172.24.119.34 node2
>     15.15.2.10 node1-hb node1-hb.localdomain
>     15.15.2.11 node2-hb node2-hb.localdomain
>
>     And the versions:
>     Red Hat Enterprise Linux Server release 5.7 (Tikanga)
>     cman-2.0.115-85.el5
>     rgmanager-2.0.52-21.el5
>     openais-0.80.6-30.el5
>
>     I don't know what else I should try, so if you can give me some
>     ideas, I will be very pleased.
>
>     Regards, Javi.
>
>     --
>     Linux-cluster mailing list
>     Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
>     https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>
>
> --
> esta es mi vida e me la vivo hasta que dios quiera
>
> -- Linux-cluster mailing list Linux-cluster at redhat.com
> <mailto:Linux-cluster at redhat.com>
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


-- 
Digimer
Papers and Projects: https://alteeve.com




From jvdiago at gmail.com  Wed Jun 20 14:57:38 2012
From: jvdiago at gmail.com (Javier Vela)
Date: Wed, 20 Jun 2012 16:57:38 +0200
Subject: [Linux-cluster]
	=?utf-8?q?Node_can=C2=B4t_join_already_quorated_c?=
	=?utf-8?b?bHVzdGVy4oCP4oCP?=
Message-ID: <CAEAM5QVRDpBi5XB_N6PHjGz=GyWEXCynY_ndihOoAQtiC-Hyjw@mail.gmail.com>

node1 (node inquorate):

[root at node1 ~]# vgs -o tags,vg_name
  VG Tags       VG
  node1-hb  vg00


                vg_www
  node2-hb vg_jabber
                vg_postgres
                vg_qdisk
                vg_tomcat


node2 (quorate)

[root at node2 ~]# vgs -o tags,vg_name
  VG Tags       VG
                vg00
                vg_emasweb
  node2-hb vg_jabber
                vg_postgres
                vg_qdisk
                vg_tomcat


It's true that vg_qdisk has the label of node2-hb. But it's in the
volume_list.

Regards, Javier

Hello Javier
>
> Can you send the ouput of this command for every node?
>
> vgs -o tags,vg_name
>
> 2012/6/20 Javier Vela <jvdiago at gmail.com>
>
> In the lvm.conf I have in volume_list the name of the vg_qdisk so this
> volume group should be available to both nodes at the same time.
>
> My volume_list in lvm.conf:
>
> node1:
> volume_list = [ "vg00", "vg_qdisk", "@node1-hb" ]
>
> node2:
> volume_list = [ "vg00", "vg_qdisk", "@node2-hb" ]
>
> Moreover with the comand lvdisplay I can see that the lv is available to
> both nodes. But maybe is worth to try another qdisk without lvm.
>
>
> Hi.
>
>
>
> Since you have HA-LVM, are you using volume tagging ? I noticed that your
> quorum disk belongs to a volume group vg_qdisk and I think when the first
> node that will activate the volumegroup will not allow the second node to
> activate the volumegroup because of volume tagging, so remove the
> quorumdisk from the volumegroup and just use it as a physical volume.
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>
>
> --
> esta es mi vida e me la vivo hasta que dios quiera
>
> -- Linux-cluster mailing list Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120620/6b97d108/attachment.htm>

From jvdiago at gmail.com  Wed Jun 20 15:07:38 2012
From: jvdiago at gmail.com (Javier Vela)
Date: Wed, 20 Jun 2012 17:07:38 +0200
Subject: [Linux-cluster]
	=?utf-8?q?Node_can=27t_join_already_quorated_clus?=
	=?utf-8?b?dGVy4oCP?=
Message-ID: <CAEAM5QU2KT04rWnGttkqaWWcfvUyJx=j32GaPudwQwbu3BikJQ@mail.gmail.com>

I don't use fencing because with ha-lvm I thought that I dind't need it.
But also because both nodes are VMs in VMWare. I know that there is a
module to do fencing with vmware but I prefer to avoid it. I'm not in
control of the VMWare infraestructure and probably VMWare admins won't give
me the tools to use this module.

Regards, Javi


> Fencing is critical, and running a cluster without fencing, even with
> qdisk, is not supported. Manual fencing is also not supported. The
> *only* way to have a reliable cluster, testing or production, is to use
> fencing.
>
> Why do you not wish to use it?
>
> On 06/20/2012 09:43 AM, Javier Vela wrote:
> > As I readed, if you use HA-LVM you don't need fencing because of vg
> > tagging. Is It absolutely mandatory to use fencing with qdisk?
> >
> > If it is, i supose i can use manual_fence, but in production I also
> > won't use fencing.
> >
> > Regards, Javi.
> >
> > Date: Wed, 20 Jun 2012 14:45:28 +0200
> > From: emi2fast at gmail.com <mailto:emi2fast at gmail.com>
> > To: linux-cluster at redhat.com <mailto:linux-cluster at redhat.com>
> > Subject: Re: [Linux-cluster] Node can't join already quorated cluster
> >
> > If you don't wanna use a real fence divice, because you only do some
> > test, you have to use fence_manual agent
> >
> > 2012/6/20 Javier Vela <jvdiago at gmail.com <mailto:jvdiago at gmail.com>>
> >
> >     Hi, I have a very strange problem, and after searching through lot
> >     of forums, I haven't found the solution. This is the scenario:
> >
> >     Two node cluster with Red Hat 5.7, HA-LVM, no fencing and quorum
> >     disk. I start qdiskd, cman and rgmanager on one node. After 5
> >     minutes, finally the fencing finishes and cluster get quorate with 2
> >     votes:
> >
> >     [root at node2 ~]# clustat
> >     Cluster Status for test_cluster @ Wed Jun 20 05:56:39 2012
> >     Member Status: Quorate
> >
> >       Member Name                             ID   Status
> >       ------ ----                             ---- ------
> >       node1-hb                                  1 Offline
> >       node2-hb                               2 Online, Local, rgmanager
> >       /dev/mapper/vg_qdisk-lv_qdisk               0 Online, Quorum Disk
> >
> >       Service Name                   Owner (Last)                   State
> >       ------- ----                   ----- ------                   -----
> >       service:postgres                   node2                  started
> >
> >     Now, I start the second node. When cman reaches fencing, it hangs
> >     for 5 minutes aprox, and finally fails. clustat says:
> >
> >     root at node1 ~]# clustat
> >     Cluster Status for test_cluster @ Wed Jun 20 06:01:12 2012
> >     Member Status: Inquorate
> >
> >       Member Name                             ID   Status
> >       ------ ----                             ---- ------
> >     node1-hb                                  1 Online, Local
> >     node2-hb                               2 Offline
> >       /dev/mapper/vg_qdisk-lv_qdisk               0 Offline
> >
> >     And in /var/log/messages I can see this errors:
> >
> >     Jun 20 06:02:12 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
> >     Jun 20 06:02:12 node1 openais[6098]: [CLM  ] got nodejoin message
> >     15.15.2.10
> >     Jun 20 06:02:13 node1 dlm_controld[5386]: connect to ccs error -111,
> >     check ccsd or cluster status
> >     Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> >     connection.
> >     Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect:
> >     Connection refused
> >     Jun 20 06:02:13 node1 ccsd[6090]: Initial status:: Inquorate
> >     Jun 20 06:02:13 node1 gfs_controld[5392]: connect to ccs error -111,
> >     check ccsd or cluster status
> >     Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> >     connection.
> >     Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect:
> >     Connection refused
> >     Jun 20 06:02:14 node1 openais[6098]: [TOTEM] entering GATHER state
> >     from 9.
> >     Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> >     connection.
> >     Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect:
> >     Connection refused
> >     Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> >     connection.
> >     Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect:
> >     Connection refused
> >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> >     connection.
> >     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
> >     Connection refused
> >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> >     connection.
> >     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
> >     Connection refused
> >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> >     connection.
> >     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
> >     Connection refused
> >     Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> >     connection.
> >     Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect:
> >     Connection refused
> >     Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> >     connection.
> >     Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect:
> >     Connection refused
> >     Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> >     connection.
> >     Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect:
> >     Connection refused
> >     Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> >     connection.
> >     Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect:
> >     Connection refused
> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state
> >     from 0.
> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Creating commit token
> >     because I am the rep.
> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Storing new sequence id
> >     for ring 15c
> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering COMMIT state.
> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering RECOVERY state.
> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] position [0] member
> >     15.15.2.10 <http://15.15.2.10>:
> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] previous ring seq 344
> >     rep 15.15.2.10
> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] aru e high delivered e
> >     received flag 1
> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Did not need to
> >     originate any messages in recovery.
> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Sending initial ORF token
> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
> >     Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> >     connection.
> >     Jun 20 06:02:18 node1 ccsd[6090]: Error while processing connect:
> >     Connection refused
> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state
> >     from 9.
> >     Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> >     connection.
> >
> >     And the quorum disk:
> >
> >     [root at node2 ~]# mkqdisk -L -d
> >     kqdisk v0.6.0
> >     /dev/mapper/vg_qdisk-lv_qdisk:
> >     /dev/vg_qdisk/lv_qdisk:
> >              Magic:                eb7a62c2
> >              Label:                cluster_qdisk
> >              Created:              Thu Jun  7 09:23:34 2012
> >              Host:                 node1
> >              Kernel Sector Size:   512
> >              Recorded Sector Size: 512
> >
> >     Status block for node 1
> >              Last updated by node 2
> >              Last updated on Wed Jun 20 06:17:23 2012
> >              State: Evicted
> >              Flags: 0000
> >              Score: 0/0
> >              Average Cycle speed: 0.000500 seconds
> >              Last Cycle speed: 0.000000 seconds
> >              Incarnation: 4fe1a06c4fe1a06c
> >     Status block for node 2
> >              Last updated by node 2
> >              Last updated on Wed Jun 20 07:09:38 2012
> >              State: Master
> >              Flags: 0000
> >              Score: 0/0
> >              Average Cycle speed: 0.001000 seconds
> >              Last Cycle speed: 0.000000 seconds
> >              Incarnation: 4fe1a06c4fe1a06c
> >
> >
> >     In the other node I don't see any errors in /var/log/messages. One
> >     strange thing is that if I start cman on both nodes at the same
> >     time, everything works fine and both nodes quorate (until I reboot
> >     one node and the problem appears). I've checked that multicast is
> >     working properly. With iperf I can send a receive multicast paquets.
> >     Moreover I've seen with tcpdump the paquets that openais send when
> >     cman is trying to start. I've readed about a bug in RH 5.3 with the
> >     same behaviour, but it is solved in RH 5.4.
> >
> >     I don't have Selinux enabled, and Iptables are also disabled. Here
> >     is the cluster.conf simplified (with less services and resources). I
> >     want to point out one thing. I have allow_kill="0" in order to avoid
> >     fencing errors when quorum tries to fence a failed node. As <fence/>
> >     is empty, before this stanza I got a lot of messages in
> >     /var/log/messages with failed fencing.
> >
> >     <?xml version="1.0"?>
> >     <cluster alias="test_cluster" config_version="15" name="test_cluster">
> >              <fence_daemon clean_start="0" post_fail_delay="0"
> >     post_join_delay="-1"/>
> >              <clusternodes>
> >                      <clusternode name="node1-hb" nodeid="1" votes="1">
> >                              <fence/>
> >                      </clusternode>
> >                      <clusternode name="node2-hb" nodeid="2" votes="1">
> >                              <fence/>
> >                      </clusternode>
> >              </clusternodes>
> >              <cman two_node="0" expected_votes="3"/>
> >              <fencedevices/>
> >
> >              <rm log_facility="local4" log_level="7">
> >                      <failoverdomains>
> >                              <failoverdomain name="etest_cluster_fo"
> >     nofailback="1" ordered="1" restricted="1">
> >                                      <failoverdomainnode name="node1-hb"
> >     priority="1"/>
> >                                      <failoverdomainnode name="node2-hb"
> >     priority="2"/>
> >                              </failoverdomain>
> >                      </failoverdomains>
> >              <resources/>
> >              <service autostart="1" domain="test_cluster_fo"
> >     exclusive="0" name="postgres" recovery="relocate">
> >                      <ip address="172.24.119.44" monitor_link="1"/>
> >                      <lvm name="vg_postgres" vg_name="vg_postgres"
> >     lv_name="postgres"/>
> >
> >                      <fs device="/dev/vg_postgres/postgres"
> >     force_fsck="1" force_unmount="1" fstype="ext3"
> >     mountpoint="/var/lib/pgsql" name="postgres" self_fence="0"/>
> >
> >                      <script file="/etc/init.d/postgresql" name="postgres">
> >                      </script>
> >              </service>
> >              </rm>
> >              <totem consensus="4000" join="60" token="20000"
> >     token_retransmits_before_loss_const="20"/>
> >          <quorumd allow_kill="0" interval="1" label="cluster_qdisk"
> >     tko="10" votes="1">
> >                      <heuristic
> >     program="/usr/share/cluster/check_eth_link.sh eth0" score="1"
> >     interval="2" tko="3"/>
> >              </quorumd>
> >       </cluster>
> >
> >
> >     The /etc/hosts:
> >     172.24.119.10 node1
> >     172.24.119.34 node2
> >     15.15.2.10 node1-hb node1-hb.localdomain
> >     15.15.2.11 node2-hb node2-hb.localdomain
> >
> >     And the versions:
> >     Red Hat Enterprise Linux Server release 5.7 (Tikanga)
> >     cman-2.0.115-85.el5
> >     rgmanager-2.0.52-21.el5
> >     openais-0.80.6-30.el5
> >
> >     I don't know what else I should try, so if you can give me some
> >     ideas, I will be very pleased.
> >
> >     Regards, Javi.
> >
> >     --
> >     Linux-cluster mailing list
> >     Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
> >     https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> >
> >
> >
> > --
> > esta es mi vida e me la vivo hasta que dios quiera
> >
> > -- Linux-cluster mailing list Linux-cluster at redhat.com
> > <mailto:Linux-cluster at redhat.com>
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> >
>
>
> --
> Digimer
> Papers and Projects: https://alteeve.com
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120620/e97146d3/attachment.htm>

From jvdiago at gmail.com  Wed Jun 20 15:13:57 2012
From: jvdiago at gmail.com (Javier Vela)
Date: Wed, 20 Jun 2012 17:13:57 +0200
Subject: [Linux-cluster] =?utf-8?q?_Node_can=C2=B4t_join_already_quorated_?=
	=?utf-8?b?Y2x1c3RlcuKAj+KAj+KAjw==?=
Message-ID: <CAEAM5QU1mDxsTN_jnt23X_UxruV=6=Zs728WJO0aXFsL2c1BQw@mail.gmail.com>

I don't use fencing because with ha-lvm I thought that I dind't need it.
But also because both nodes are VMs in VMWare. I know that there is a
module to do fencing with vmware but I prefer to avoid it. I'm not in
control of the VMWare infraestructure and probably VMWare admins won't give
me the tools to use this module.

Regards, Javi


> Fencing is critical, and running a cluster without fencing, even with
>
> qdisk, is not supported. Manual fencing is also not supported. The
> *only* way to have a reliable cluster, testing or production, is to use
> fencing.
>
> Why do you not wish to use it?
>
> On 06/20/2012 09:43 AM, Javier Vela wrote:
>
> > As I readed, if you use HA-LVM you don't need fencing because of vg
> > tagging. Is It absolutely mandatory to use fencing with qdisk?
> >
> > If it is, i supose i can use manual_fence, but in production I also
>
> > won't use fencing.
> >
> > Regards, Javi.
> >
> > Date: Wed, 20 Jun 2012 14:45:28 +0200
> > From: emi2fast at gmail.com <mailto:emi2fast at gmail.com>
>
> > To: linux-cluster at redhat.com <mailto:linux-cluster at redhat.com>
> > Subject: Re: [Linux-cluster] Node can't join already quorated cluster
>
> >
> > If you don't wanna use a real fence divice, because you only do some
> > test, you have to use fence_manual agent
> >
> > 2012/6/20 Javier Vela <jvdiago at gmail.com <mailto:jvdiago at gmail.com>>
>
> >
> >     Hi, I have a very strange problem, and after searching through lot
> >     of forums, I haven't found the solution. This is the scenario:
> >
> >     Two node cluster with Red Hat 5.7, HA-LVM, no fencing and quorum
>
> >     disk. I start qdiskd, cman and rgmanager on one node. After 5
> >     minutes, finally the fencing finishes and cluster get quorate with 2
> >     votes:
> >
> >     [root at node2 ~]# clustat
> >     Cluster Status for test_cluster @ Wed Jun 20 05:56:39 2012
>
> >     Member Status: Quorate
> >
> >       Member Name                             ID   Status
> >       ------ ----                             ---- ------
> >       node1-hb                                  1 Offline
>
> >       node2-hb                               2 Online, Local, rgmanager
> >       /dev/mapper/vg_qdisk-lv_qdisk               0 Online, Quorum Disk
> >
> >       Service Name                   Owner (Last)                   State
>
> >       ------- ----                   ----- ------                   -----
> >       service:postgres                   node2                  started
> >
> >     Now, I start the second node. When cman reaches fencing, it hangs
>
> >     for 5 minutes aprox, and finally fails. clustat says:
> >
> >     root at node1 ~]# clustat
> >     Cluster Status for test_cluster @ Wed Jun 20 06:01:12 2012
> >     Member Status: Inquorate
> >
>
> >       Member Name                             ID   Status
> >       ------ ----                             ---- ------
> >     node1-hb                                  1 Online, Local
> >     node2-hb                               2 Offline
>
> >       /dev/mapper/vg_qdisk-lv_qdisk               0 Offline
> >
> >     And in /var/log/messages I can see this errors:
> >
> >     Jun 20 06:02:12 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
>
> >     Jun 20 06:02:12 node1 openais[6098]: [CLM  ] got nodejoin message
> >     15.15.2.10
> >     Jun 20 06:02:13 node1 dlm_controld[5386]: connect to ccs error -111,
> >     check ccsd or cluster status
>
> >     Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> >     connection.
> >     Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect:
> >     Connection refused
> >     Jun 20 06:02:13 node1 ccsd[6090]: Initial status:: Inquorate
>
> >     Jun 20 06:02:13 node1 gfs_controld[5392]: connect to ccs error -111,
> >     check ccsd or cluster status
> >     Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> >     connection.
>
> >     Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect:
> >     Connection refused
> >     Jun 20 06:02:14 node1 openais[6098]: [TOTEM] entering GATHER state
> >     from 9.
> >     Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>
> >     connection.
> >     Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect:
> >     Connection refused
> >     Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> >     connection.
>
> >     Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect:
> >     Connection refused
> >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> >     connection.
> >     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
>
> >     Connection refused
> >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> >     connection.
> >     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
> >     Connection refused
>
> >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> >     connection.
> >     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
> >     Connection refused
> >     Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>
> >     connection.
> >     Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect:
> >     Connection refused
> >     Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> >     connection.
>
> >     Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect:
> >     Connection refused
> >     Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> >     connection.
> >     Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect:
>
> >     Connection refused
> >     Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> >     connection.
> >     Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect:
> >     Connection refused
>
> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state
> >     from 0.
> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Creating commit token
> >     because I am the rep.
> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Storing new sequence id
>
> >     for ring 15c
> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering COMMIT state.
> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering RECOVERY state.
> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] position [0] member
>
> >     15.15.2.10 <http://15.15.2.10>:
> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] previous ring seq 344
> >     rep 15.15.2.10
> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] aru e high delivered e
>
> >     received flag 1
> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Did not need to
> >     originate any messages in recovery.
> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Sending initial ORF token
>
> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
> >     Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> >     connection.
> >     Jun 20 06:02:18 node1 ccsd[6090]: Error while processing connect:
>
> >     Connection refused
> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state
> >     from 9.
> >     Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.  Refusing
> >     connection.
>
> >
> >     And the quorum disk:
> >
> >     [root at node2 ~]# mkqdisk -L -d
> >     kqdisk v0.6.0
> >     /dev/mapper/vg_qdisk-lv_qdisk:
> >     /dev/vg_qdisk/lv_qdisk:
> >              Magic:                eb7a62c2
>
> >              Label:                cluster_qdisk
> >              Created:              Thu Jun  7 09:23:34 2012
> >              Host:                 node1
> >              Kernel Sector Size:   512
>
> >              Recorded Sector Size: 512
> >
> >     Status block for node 1
> >              Last updated by node 2
> >              Last updated on Wed Jun 20 06:17:23 2012
> >              State: Evicted
>
> >              Flags: 0000
> >              Score: 0/0
> >              Average Cycle speed: 0.000500 seconds
> >              Last Cycle speed: 0.000000 seconds
> >              Incarnation: 4fe1a06c4fe1a06c
>
> >     Status block for node 2
> >              Last updated by node 2
> >              Last updated on Wed Jun 20 07:09:38 2012
> >              State: Master
> >              Flags: 0000
> >              Score: 0/0
>
> >              Average Cycle speed: 0.001000 seconds
> >              Last Cycle speed: 0.000000 seconds
> >              Incarnation: 4fe1a06c4fe1a06c
> >
> >
> >     In the other node I don't see any errors in /var/log/messages. One
>
> >     strange thing is that if I start cman on both nodes at the same
> >     time, everything works fine and both nodes quorate (until I reboot
> >     one node and the problem appears). I've checked that multicast is
>
> >     working properly. With iperf I can send a receive multicast paquets.
> >     Moreover I've seen with tcpdump the paquets that openais send when
> >     cman is trying to start. I've readed about a bug in RH 5.3 with the
>
> >     same behaviour, but it is solved in RH 5.4.
> >
> >     I don't have Selinux enabled, and Iptables are also disabled. Here
> >     is the cluster.conf simplified (with less services and resources). I
>
> >     want to point out one thing. I have allow_kill="0" in order to avoid
> >     fencing errors when quorum tries to fence a failed node. As <fence/>
> >     is empty, before this stanza I got a lot of messages in
>
> >     /var/log/messages with failed fencing.
> >
> >     <?xml version="1.0"?>
> >     <cluster alias="test_cluster" config_version="15" name="test_cluster">
>
> >              <fence_daemon clean_start="0" post_fail_delay="0"
> >     post_join_delay="-1"/>
> >              <clusternodes>
> >                      <clusternode name="node1-hb" nodeid="1" votes="1">
>
> >                              <fence/>
> >                      </clusternode>
> >                      <clusternode name="node2-hb" nodeid="2" votes="1">
> >                              <fence/>
>
> >                      </clusternode>
> >              </clusternodes>
> >              <cman two_node="0" expected_votes="3"/>
> >              <fencedevices/>
>
> >
> >              <rm log_facility="local4" log_level="7">
> >                      <failoverdomains>
> >                              <failoverdomain name="etest_cluster_fo"
>
> >     nofailback="1" ordered="1" restricted="1">
> >                                      <failoverdomainnode name="node1-hb"
> >     priority="1"/>
>
> >                                      <failoverdomainnode name="node2-hb"
> >     priority="2"/>
> >                              </failoverdomain>
> >                      </failoverdomains>
>
> >              <resources/>
> >              <service autostart="1" domain="test_cluster_fo"
> >     exclusive="0" name="postgres" recovery="relocate">
>
> >                      <ip address="172.24.119.44" monitor_link="1"/>
> >                      <lvm name="vg_postgres" vg_name="vg_postgres"
> >     lv_name="postgres"/>
>
> >
> >                      <fs device="/dev/vg_postgres/postgres"
> >     force_fsck="1" force_unmount="1" fstype="ext3"
> >     mountpoint="/var/lib/pgsql" name="postgres" self_fence="0"/>
>
> >
> >                      <script file="/etc/init.d/postgresql" name="postgres">
> >                      </script>
> >              </service>
> >              </rm>
>
> >              <totem consensus="4000" join="60" token="20000"
> >     token_retransmits_before_loss_const="20"/>
> >          <quorumd allow_kill="0" interval="1" label="cluster_qdisk"
>
> >     tko="10" votes="1">
> >                      <heuristic
> >     program="/usr/share/cluster/check_eth_link.sh eth0" score="1"
> >     interval="2" tko="3"/>
>
> >              </quorumd>
> >       </cluster>
> >
> >
> >     The /etc/hosts:
> >     172.24.119.10 node1
> >     172.24.119.34 node2
> >     15.15.2.10 node1-hb node1-hb.localdomain
>
> >     15.15.2.11 node2-hb node2-hb.localdomain
> >
> >     And the versions:
> >     Red Hat Enterprise Linux Server release 5.7 (Tikanga)
> >     cman-2.0.115-85.el5
> >     rgmanager-2.0.52-21.el5
>
> >     openais-0.80.6-30.el5
> >
> >     I don't know what else I should try, so if you can give me some
> >     ideas, I will be very pleased.
> >
> >     Regards, Javi.
> >
> >     --
>
> >     Linux-cluster mailing list
> >     Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
> >     https://www.redhat.com/mailman/listinfo/linux-cluster
>
> >
> >
> >
> >
> > --
> > esta es mi vida e me la vivo hasta que dios quiera
> >
> > -- Linux-cluster mailing list Linux-cluster at redhat.com
> > <mailto:Linux-cluster at redhat.com>
>
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
>
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> >
>
>
> --
> Digimer
> Papers and Projects: https://alteeve.com
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120620/295abf9b/attachment.htm>

From emi2fast at gmail.com  Wed Jun 20 15:17:07 2012
From: emi2fast at gmail.com (emmanuel segura)
Date: Wed, 20 Jun 2012 17:17:07 +0200
Subject: [Linux-cluster]
	=?utf-8?q?Node_can=C2=B4t_join_already_quorated_c?=
	=?utf-8?b?bHVzdGVy4oCP4oCP?=
In-Reply-To: <CAEAM5QVRDpBi5XB_N6PHjGz=GyWEXCynY_ndihOoAQtiC-Hyjw@mail.gmail.com>
References: <CAEAM5QVRDpBi5XB_N6PHjGz=GyWEXCynY_ndihOoAQtiC-Hyjw@mail.gmail.com>
Message-ID: <CAE7pJ3Dkb-yxGNuirr34Sz+Lx0nEQ5=VUX8QeWxabg2fqaRWtA@mail.gmail.com>

Hello Javier

I think it's better use a qdisk with a plain device not with lvm

2012/6/20 Javier Vela <jvdiago at gmail.com>

> node1 (node inquorate):
>
> [root at node1 ~]# vgs -o tags,vg_name
>   VG Tags       VG
>   node1-hb  vg00
>
>
>                 vg_www
>   node2-hb vg_jabber
>                 vg_postgres
>                 vg_qdisk
>                 vg_tomcat
>
>
> node2 (quorate)
>
> [root at node2 ~]# vgs -o tags,vg_name
>   VG Tags       VG
>                 vg00
>                 vg_emasweb
>   node2-hb vg_jabber
>                 vg_postgres
>                 vg_qdisk
>                 vg_tomcat
>
>
> It's true that vg_qdisk has the label of node2-hb. But it's in the
> volume_list.
>
> Regards, Javier
>
> Hello Javier
>>
>> Can you send the ouput of this command for every node?
>>
>> vgs -o tags,vg_name
>>
>> 2012/6/20 Javier Vela <jvdiago at gmail.com>
>>
>> In the lvm.conf I have in volume_list the name of the vg_qdisk so this
>> volume group should be available to both nodes at the same time.
>>
>> My volume_list in lvm.conf:
>>
>> node1:
>> volume_list = [ "vg00", "vg_qdisk", "@node1-hb" ]
>>
>> node2:
>> volume_list = [ "vg00", "vg_qdisk", "@node2-hb" ]
>>
>> Moreover with the comand lvdisplay I can see that the lv is available to
>> both nodes. But maybe is worth to try another qdisk without lvm.
>>
>>
>> Hi.
>>
>>
>>
>> Since you have HA-LVM, are you using volume tagging ? I noticed that your
>> quorum disk belongs to a volume group vg_qdisk and I think when the first
>> node that will activate the volumegroup will not allow the second node to
>> activate the volumegroup because of volume tagging, so remove the
>> quorumdisk from the volumegroup and just use it as a physical volume.
>>
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>>
>>
>>
>> --
>> esta es mi vida e me la vivo hasta que dios quiera
>>
>> -- Linux-cluster mailing list Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
esta es mi vida e me la vivo hasta que dios quiera
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120620/e80faab1/attachment.htm>

From emi2fast at gmail.com  Wed Jun 20 15:22:21 2012
From: emi2fast at gmail.com (emmanuel segura)
Date: Wed, 20 Jun 2012 17:22:21 +0200
Subject: [Linux-cluster]
	=?utf-8?q?Node_can=27t_join_already_quorated_clus?=
	=?utf-8?b?dGVy4oCP?=
In-Reply-To: <CAEAM5QU2KT04rWnGttkqaWWcfvUyJx=j32GaPudwQwbu3BikJQ@mail.gmail.com>
References: <CAEAM5QU2KT04rWnGttkqaWWcfvUyJx=j32GaPudwQwbu3BikJQ@mail.gmail.com>
Message-ID: <CAE7pJ3ArON1yCJvePZdKR+BJErJthJ98pe3AVjH287TOCC=1PA@mail.gmail.com>

Ok Javier

So now i know you don't wanna the fencing and the reason :-)

<fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="-1"/>

and use the fence_manual



2012/6/20 Javier Vela <jvdiago at gmail.com>

> I don't use fencing because with ha-lvm I thought that I dind't need it.
> But also because both nodes are VMs in VMWare. I know that there is a
> module to do fencing with vmware but I prefer to avoid it. I'm not in
> control of the VMWare infraestructure and probably VMWare admins won't give
> me the tools to use this module.
>
> Regards, Javi
>
>
>> Fencing is critical, and running a cluster without fencing, even with
>>
>> qdisk, is not supported. Manual fencing is also not supported. The
>> *only* way to have a reliable cluster, testing or production, is to use
>> fencing.
>>
>> Why do you not wish to use it?
>>
>> On 06/20/2012 09:43 AM, Javier Vela wrote:
>>
>> > As I readed, if you use HA-LVM you don't need fencing because of vg
>> > tagging. Is It absolutely mandatory to use fencing with qdisk?
>> >
>> > If it is, i supose i can use manual_fence, but in production I also
>>
>> > won't use fencing.
>> >
>> > Regards, Javi.
>> >
>> > Date: Wed, 20 Jun 2012 14:45:28 +0200
>> > From: emi2fast at gmail.com <mailto:emi2fast at gmail.com>
>>
>> > To: linux-cluster at redhat.com <mailto:linux-cluster at redhat.com>
>> > Subject: Re: [Linux-cluster] Node can't join already quorated cluster
>>
>> >
>> > If you don't wanna use a real fence divice, because you only do some
>> > test, you have to use fence_manual agent
>> >
>> > 2012/6/20 Javier Vela <jvdiago at gmail.com <mailto:jvdiago at gmail.com>>
>>
>> >
>> >     Hi, I have a very strange problem, and after searching through lot
>> >     of forums, I haven't found the solution. This is the scenario:
>> >
>> >     Two node cluster with Red Hat 5.7, HA-LVM, no fencing and quorum
>>
>> >     disk. I start qdiskd, cman and rgmanager on one node. After 5
>> >     minutes, finally the fencing finishes and cluster get quorate with 2
>> >     votes:
>> >
>> >     [root at node2 ~]# clustat
>> >     Cluster Status for test_cluster @ Wed Jun 20 05:56:39 2012
>>
>> >     Member Status: Quorate
>> >
>> >       Member Name                             ID   Status
>> >       ------ ----                             ---- ------
>> >       node1-hb                                  1 Offline
>>
>> >       node2-hb                               2 Online, Local, rgmanager
>> >       /dev/mapper/vg_qdisk-lv_qdisk               0 Online, Quorum Disk
>> >
>> >       Service Name                   Owner (Last)                   State
>>
>> >       ------- ----                   ----- ------                   -----
>> >       service:postgres                   node2                  started
>> >
>> >     Now, I start the second node. When cman reaches fencing, it hangs
>>
>> >     for 5 minutes aprox, and finally fails. clustat says:
>> >
>> >     root at node1 ~]# clustat
>> >     Cluster Status for test_cluster @ Wed Jun 20 06:01:12 2012
>> >     Member Status: Inquorate
>> >
>>
>> >       Member Name                             ID   Status
>> >       ------ ----                             ---- ------
>> >     node1-hb                                  1 Online, Local
>> >     node2-hb                               2 Offline
>>
>> >       /dev/mapper/vg_qdisk-lv_qdisk               0 Offline
>> >
>> >     And in /var/log/messages I can see this errors:
>> >
>> >     Jun 20 06:02:12 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
>>
>> >     Jun 20 06:02:12 node1 openais[6098]: [CLM  ] got nodejoin message
>> >     15.15.2.10
>> >     Jun 20 06:02:13 node1 dlm_controld[5386]: connect to ccs error -111,
>> >     check ccsd or cluster status
>>
>> >     Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>> >     connection.
>> >     Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect:
>> >     Connection refused
>> >     Jun 20 06:02:13 node1 ccsd[6090]: Initial status:: Inquorate
>>
>> >     Jun 20 06:02:13 node1 gfs_controld[5392]: connect to ccs error -111,
>> >     check ccsd or cluster status
>> >     Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>> >     connection.
>>
>> >     Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect:
>> >     Connection refused
>> >     Jun 20 06:02:14 node1 openais[6098]: [TOTEM] entering GATHER state
>> >     from 9.
>> >     Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>>
>> >     connection.
>> >     Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect:
>> >     Connection refused
>> >     Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>> >     connection.
>>
>> >     Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect:
>> >     Connection refused
>> >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>> >     connection.
>> >     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
>>
>> >     Connection refused
>> >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>> >     connection.
>> >     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
>> >     Connection refused
>>
>> >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>> >     connection.
>> >     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
>> >     Connection refused
>> >     Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>>
>> >     connection.
>> >     Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect:
>> >     Connection refused
>> >     Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>> >     connection.
>>
>> >     Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect:
>> >     Connection refused
>> >     Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>> >     connection.
>> >     Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect:
>>
>> >     Connection refused
>> >     Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>> >     connection.
>> >     Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect:
>> >     Connection refused
>>
>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state
>> >     from 0.
>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Creating commit token
>> >     because I am the rep.
>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Storing new sequence id
>>
>> >     for ring 15c
>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering COMMIT state.
>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering RECOVERY state.
>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] position [0] member
>>
>> >     15.15.2.10 <http://15.15.2.10>:
>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] previous ring seq 344
>> >     rep 15.15.2.10
>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] aru e high delivered e
>>
>> >     received flag 1
>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Did not need to
>> >     originate any messages in recovery.
>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Sending initial ORF token
>>
>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
>> >     Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>> >     connection.
>> >     Jun 20 06:02:18 node1 ccsd[6090]: Error while processing connect:
>>
>> >     Connection refused
>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state
>> >     from 9.
>> >     Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>> >     connection.
>>
>> >
>> >     And the quorum disk:
>> >
>> >     [root at node2 ~]# mkqdisk -L -d
>> >     kqdisk v0.6.0
>> >     /dev/mapper/vg_qdisk-lv_qdisk:
>> >     /dev/vg_qdisk/lv_qdisk:
>> >              Magic:                eb7a62c2
>>
>> >              Label:                cluster_qdisk
>> >              Created:              Thu Jun  7 09:23:34 2012
>> >              Host:                 node1
>> >              Kernel Sector Size:   512
>>
>> >              Recorded Sector Size: 512
>> >
>> >     Status block for node 1
>> >              Last updated by node 2
>> >              Last updated on Wed Jun 20 06:17:23 2012
>> >              State: Evicted
>>
>> >              Flags: 0000
>> >              Score: 0/0
>> >              Average Cycle speed: 0.000500 seconds
>> >              Last Cycle speed: 0.000000 seconds
>> >              Incarnation: 4fe1a06c4fe1a06c
>>
>> >     Status block for node 2
>> >              Last updated by node 2
>> >              Last updated on Wed Jun 20 07:09:38 2012
>> >              State: Master
>> >              Flags: 0000
>> >              Score: 0/0
>>
>> >              Average Cycle speed: 0.001000 seconds
>> >              Last Cycle speed: 0.000000 seconds
>> >              Incarnation: 4fe1a06c4fe1a06c
>> >
>> >
>> >     In the other node I don't see any errors in /var/log/messages. One
>>
>> >     strange thing is that if I start cman on both nodes at the same
>> >     time, everything works fine and both nodes quorate (until I reboot
>> >     one node and the problem appears). I've checked that multicast is
>>
>> >     working properly. With iperf I can send a receive multicast paquets.
>> >     Moreover I've seen with tcpdump the paquets that openais send when
>> >     cman is trying to start. I've readed about a bug in RH 5.3 with the
>>
>> >     same behaviour, but it is solved in RH 5.4.
>> >
>> >     I don't have Selinux enabled, and Iptables are also disabled. Here
>> >     is the cluster.conf simplified (with less services and resources). I
>>
>> >     want to point out one thing. I have allow_kill="0" in order to avoid
>> >     fencing errors when quorum tries to fence a failed node. As <fence/>
>> >     is empty, before this stanza I got a lot of messages in
>>
>> >     /var/log/messages with failed fencing.
>> >
>> >     <?xml version="1.0"?>
>> >     <cluster alias="test_cluster" config_version="15" name="test_cluster">
>>
>> >              <fence_daemon clean_start="0" post_fail_delay="0"
>> >     post_join_delay="-1"/>
>> >              <clusternodes>
>> >                      <clusternode name="node1-hb" nodeid="1" votes="1">
>>
>> >                              <fence/>
>> >                      </clusternode>
>> >                      <clusternode name="node2-hb" nodeid="2" votes="1">
>> >                              <fence/>
>>
>> >                      </clusternode>
>> >              </clusternodes>
>> >              <cman two_node="0" expected_votes="3"/>
>> >              <fencedevices/>
>>
>> >
>> >              <rm log_facility="local4" log_level="7">
>> >                      <failoverdomains>
>> >                              <failoverdomain name="etest_cluster_fo"
>>
>> >     nofailback="1" ordered="1" restricted="1">
>> >                                      <failoverdomainnode name="node1-hb"
>> >     priority="1"/>
>>
>> >                                      <failoverdomainnode name="node2-hb"
>> >     priority="2"/>
>> >                              </failoverdomain>
>> >                      </failoverdomains>
>>
>> >              <resources/>
>> >              <service autostart="1" domain="test_cluster_fo"
>> >     exclusive="0" name="postgres" recovery="relocate">
>>
>> >                      <ip address="172.24.119.44" monitor_link="1"/>
>> >                      <lvm name="vg_postgres" vg_name="vg_postgres"
>> >     lv_name="postgres"/>
>>
>> >
>> >                      <fs device="/dev/vg_postgres/postgres"
>> >     force_fsck="1" force_unmount="1" fstype="ext3"
>> >     mountpoint="/var/lib/pgsql" name="postgres" self_fence="0"/>
>>
>> >
>> >                      <script file="/etc/init.d/postgresql" name="postgres">
>> >                      </script>
>> >              </service>
>> >              </rm>
>>
>> >              <totem consensus="4000" join="60" token="20000"
>> >     token_retransmits_before_loss_const="20"/>
>> >          <quorumd allow_kill="0" interval="1" label="cluster_qdisk"
>>
>> >     tko="10" votes="1">
>> >                      <heuristic
>> >     program="/usr/share/cluster/check_eth_link.sh eth0" score="1"
>> >     interval="2" tko="3"/>
>>
>> >              </quorumd>
>> >       </cluster>
>> >
>> >
>> >     The /etc/hosts:
>> >     172.24.119.10 node1
>> >     172.24.119.34 node2
>> >     15.15.2.10 node1-hb node1-hb.localdomain
>>
>> >     15.15.2.11 node2-hb node2-hb.localdomain
>> >
>> >     And the versions:
>> >     Red Hat Enterprise Linux Server release 5.7 (Tikanga)
>> >     cman-2.0.115-85.el5
>> >     rgmanager-2.0.52-21.el5
>>
>> >     openais-0.80.6-30.el5
>> >
>> >     I don't know what else I should try, so if you can give me some
>> >     ideas, I will be very pleased.
>> >
>> >     Regards, Javi.
>> >
>> >     --
>>
>> >     Linux-cluster mailing list
>> >     Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
>> >     https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>> >
>> >
>> >
>> >
>> > --
>> > esta es mi vida e me la vivo hasta que dios quiera
>> >
>> > -- Linux-cluster mailing list Linux-cluster at redhat.com
>> > <mailto:Linux-cluster at redhat.com>
>>
>> > https://www.redhat.com/mailman/listinfo/linux-cluster
>> >
>> >
>> > --
>> > Linux-cluster mailing list
>> > Linux-cluster at redhat.com
>>
>> > https://www.redhat.com/mailman/listinfo/linux-cluster
>> >
>>
>>
>> --
>> Digimer
>> Papers and Projects: https://alteeve.com
>>
>>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
esta es mi vida e me la vivo hasta que dios quiera
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120620/9b4df71d/attachment.htm>

From jvdiago at gmail.com  Wed Jun 20 15:25:06 2012
From: jvdiago at gmail.com (Javier Vela)
Date: Wed, 20 Jun 2012 17:25:06 +0200
Subject: [Linux-cluster]
	=?iso-8859-1?q?Node_can=B4t_join_already_quorated?=
	=?iso-8859-1?q?_cluster?=
In-Reply-To: <BAF521AEFC0EE6408117583A7B11949B42BB3FB6@REK-EXMBX-P01.isbank.is>
References: <BAF521AEFC0EE6408117583A7B11949B42BB3FB6@REK-EXMBX-P01.isbank.is>
Message-ID: <CAEAM5QXBF_w63Ly0yyPci_2QajJn_-D_zx4+SGuT2dAHNd3wbQ@mail.gmail.com>

Yes, It works:

[root at node1 ~]# vgchange -ay vg_qdisk
  1 logical volume(s) in volume group "vg_qdisk" now active

lvdisplay:
 --- Logical volume ---
  LV Name                /dev/vg_qdisk/lv_qdisk
  VG Name                vg_qdisk
  LV UUID                dEYtaV-W2GW-RFOw-ckWB-ppy5-sERn-kuXTLt
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                20,00 MB
  Current LE             5
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:4


2012/6/20 J?n Bj?rn Nj?lsson <Jon.Njalsson at islandsbanki.is>

>  Are you able to activate the volume group on node1 ?****
>
> ** **
>
> Vgchange ?ay vg_qdisk ?****
>
> ** **
>
> *J?n Bj?rn Nj?lsson*
>
> Data Management & Data Security****
>
> IT Operations****
>
> ** **
>
> *ISLANDSBANKI*
>
> Lyngh?ls 4, 110 Reykjav?k****
>
> Iceland****
>
> ** **
>
> Phone:   +354 440 3898
> Mobile: +354 844 3898****
>
> www.islandsbanki.is
> Disclaimer: http://www.islandsbanki.is/disclaimer/****
>
> ** **
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120620/8b37732d/attachment.htm>

From jvdiago at gmail.com  Wed Jun 20 15:31:17 2012
From: jvdiago at gmail.com (Javier Vela)
Date: Wed, 20 Jun 2012 17:31:17 +0200
Subject: [Linux-cluster]
	=?utf-8?q?Node_can=27t_join_already_quorated_clus?=
	=?utf-8?b?dGVy4oCP?=
In-Reply-To: <CAE7pJ3ArON1yCJvePZdKR+BJErJthJ98pe3AVjH287TOCC=1PA@mail.gmail.com>
References: <CAEAM5QU2KT04rWnGttkqaWWcfvUyJx=j32GaPudwQwbu3BikJQ@mail.gmail.com>
	<CAE7pJ3ArON1yCJvePZdKR+BJErJthJ98pe3AVjH287TOCC=1PA@mail.gmail.com>
Message-ID: <CAEAM5QVjMM43G9fiq9tLDCtMfHShFcpWwBTu4Pk+dAG_X7piiA@mail.gmail.com>

Ok. I'll try the fence_manual and change the clean_start to one. I will
report you the results ASAP.

Thank you for the feedback.

2012/6/20 emmanuel segura <emi2fast at gmail.com>

> Ok Javier
>
> So now i know you don't wanna the fencing and the reason :-)
>
> <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="-1"/>
>
> and use the fence_manual
>
>
>
> 2012/6/20 Javier Vela <jvdiago at gmail.com>
>
>> I don't use fencing because with ha-lvm I thought that I dind't need it.
>> But also because both nodes are VMs in VMWare. I know that there is a
>> module to do fencing with vmware but I prefer to avoid it. I'm not in
>> control of the VMWare infraestructure and probably VMWare admins won't give
>> me the tools to use this module.
>>
>> Regards, Javi
>>
>>
>>> Fencing is critical, and running a cluster without fencing, even with
>>>
>>>
>>> qdisk, is not supported. Manual fencing is also not supported. The
>>> *only* way to have a reliable cluster, testing or production, is to use
>>> fencing.
>>>
>>> Why do you not wish to use it?
>>>
>>> On 06/20/2012 09:43 AM, Javier Vela wrote:
>>>
>>>
>>> > As I readed, if you use HA-LVM you don't need fencing because of vg
>>> > tagging. Is It absolutely mandatory to use fencing with qdisk?
>>> >
>>> > If it is, i supose i can use manual_fence, but in production I also
>>>
>>>
>>> > won't use fencing.
>>> >
>>> > Regards, Javi.
>>> >
>>> > Date: Wed, 20 Jun 2012 14:45:28 +0200
>>> > From: emi2fast at gmail.com <mailto:emi2fast at gmail.com>
>>>
>>>
>>> > To: linux-cluster at redhat.com <mailto:linux-cluster at redhat.com>
>>> > Subject: Re: [Linux-cluster] Node can't join already quorated cluster
>>>
>>>
>>> >
>>> > If you don't wanna use a real fence divice, because you only do some
>>> > test, you have to use fence_manual agent
>>> >
>>> > 2012/6/20 Javier Vela <jvdiago at gmail.com <mailto:jvdiago at gmail.com>>
>>>
>>>
>>> >
>>> >     Hi, I have a very strange problem, and after searching through lot
>>> >     of forums, I haven't found the solution. This is the scenario:
>>> >
>>> >     Two node cluster with Red Hat 5.7, HA-LVM, no fencing and quorum
>>>
>>>
>>> >     disk. I start qdiskd, cman and rgmanager on one node. After 5
>>> >     minutes, finally the fencing finishes and cluster get quorate with 2
>>> >     votes:
>>> >
>>> >     [root at node2 ~]# clustat
>>> >     Cluster Status for test_cluster @ Wed Jun 20 05:56:39 2012
>>>
>>>
>>> >     Member Status: Quorate
>>> >
>>> >       Member Name                             ID   Status
>>> >       ------ ----                             ---- ------
>>> >       node1-hb                                  1 Offline
>>>
>>>
>>> >       node2-hb                               2 Online, Local, rgmanager
>>> >       /dev/mapper/vg_qdisk-lv_qdisk               0 Online, Quorum Disk
>>> >
>>> >       Service Name                   Owner (Last)                   State
>>>
>>>
>>> >       ------- ----                   ----- ------                   -----
>>> >       service:postgres                   node2                  started
>>> >
>>> >     Now, I start the second node. When cman reaches fencing, it hangs
>>>
>>>
>>> >     for 5 minutes aprox, and finally fails. clustat says:
>>> >
>>> >     root at node1 ~]# clustat
>>> >     Cluster Status for test_cluster @ Wed Jun 20 06:01:12 2012
>>> >     Member Status: Inquorate
>>> >
>>>
>>>
>>> >       Member Name                             ID   Status
>>> >       ------ ----                             ---- ------
>>> >     node1-hb                                  1 Online, Local
>>> >     node2-hb                               2 Offline
>>>
>>>
>>> >       /dev/mapper/vg_qdisk-lv_qdisk               0 Offline
>>> >
>>> >     And in /var/log/messages I can see this errors:
>>> >
>>> >     Jun 20 06:02:12 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
>>>
>>>
>>> >     Jun 20 06:02:12 node1 openais[6098]: [CLM  ] got nodejoin message
>>> >     15.15.2.10
>>> >     Jun 20 06:02:13 node1 dlm_controld[5386]: connect to ccs error -111,
>>> >     check ccsd or cluster status
>>>
>>>
>>> >     Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>>> >     connection.
>>> >     Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect:
>>> >     Connection refused
>>> >     Jun 20 06:02:13 node1 ccsd[6090]: Initial status:: Inquorate
>>>
>>>
>>> >     Jun 20 06:02:13 node1 gfs_controld[5392]: connect to ccs error -111,
>>> >     check ccsd or cluster status
>>> >     Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>>> >     connection.
>>>
>>>
>>> >     Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect:
>>> >     Connection refused
>>> >     Jun 20 06:02:14 node1 openais[6098]: [TOTEM] entering GATHER state
>>> >     from 9.
>>> >     Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>>>
>>>
>>> >     connection.
>>> >     Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect:
>>> >     Connection refused
>>> >     Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>>> >     connection.
>>>
>>>
>>> >     Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect:
>>> >     Connection refused
>>> >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>>> >     connection.
>>> >     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
>>>
>>>
>>> >     Connection refused
>>> >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>>> >     connection.
>>> >     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
>>> >     Connection refused
>>>
>>>
>>> >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>>> >     connection.
>>> >     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
>>> >     Connection refused
>>> >     Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>>>
>>>
>>> >     connection.
>>> >     Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect:
>>> >     Connection refused
>>> >     Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>>> >     connection.
>>>
>>>
>>> >     Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect:
>>> >     Connection refused
>>> >     Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>>> >     connection.
>>> >     Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect:
>>>
>>>
>>> >     Connection refused
>>> >     Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>>> >     connection.
>>> >     Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect:
>>> >     Connection refused
>>>
>>>
>>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state
>>> >     from 0.
>>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Creating commit token
>>> >     because I am the rep.
>>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Storing new sequence id
>>>
>>>
>>> >     for ring 15c
>>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering COMMIT state.
>>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering RECOVERY state.
>>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] position [0] member
>>>
>>>
>>> >     15.15.2.10 <http://15.15.2.10>:
>>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] previous ring seq 344
>>> >     rep 15.15.2.10
>>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] aru e high delivered e
>>>
>>>
>>> >     received flag 1
>>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Did not need to
>>> >     originate any messages in recovery.
>>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Sending initial ORF token
>>>
>>>
>>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
>>> >     Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>>> >     connection.
>>> >     Jun 20 06:02:18 node1 ccsd[6090]: Error while processing connect:
>>>
>>>
>>> >     Connection refused
>>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state
>>> >     from 9.
>>> >     Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>>> >     connection.
>>>
>>>
>>> >
>>> >     And the quorum disk:
>>> >
>>> >     [root at node2 ~]# mkqdisk -L -d
>>> >     kqdisk v0.6.0
>>> >     /dev/mapper/vg_qdisk-lv_qdisk:
>>> >     /dev/vg_qdisk/lv_qdisk:
>>> >              Magic:                eb7a62c2
>>>
>>>
>>> >              Label:                cluster_qdisk
>>> >              Created:              Thu Jun  7 09:23:34 2012
>>> >              Host:                 node1
>>> >              Kernel Sector Size:   512
>>>
>>>
>>> >              Recorded Sector Size: 512
>>> >
>>> >     Status block for node 1
>>> >              Last updated by node 2
>>> >              Last updated on Wed Jun 20 06:17:23 2012
>>> >              State: Evicted
>>>
>>>
>>> >              Flags: 0000
>>> >              Score: 0/0
>>> >              Average Cycle speed: 0.000500 seconds
>>> >              Last Cycle speed: 0.000000 seconds
>>> >              Incarnation: 4fe1a06c4fe1a06c
>>>
>>>
>>> >     Status block for node 2
>>> >              Last updated by node 2
>>> >              Last updated on Wed Jun 20 07:09:38 2012
>>> >              State: Master
>>> >              Flags: 0000
>>> >              Score: 0/0
>>>
>>>
>>> >              Average Cycle speed: 0.001000 seconds
>>> >              Last Cycle speed: 0.000000 seconds
>>> >              Incarnation: 4fe1a06c4fe1a06c
>>> >
>>> >
>>> >     In the other node I don't see any errors in /var/log/messages. One
>>>
>>>
>>> >     strange thing is that if I start cman on both nodes at the same
>>> >     time, everything works fine and both nodes quorate (until I reboot
>>> >     one node and the problem appears). I've checked that multicast is
>>>
>>>
>>> >     working properly. With iperf I can send a receive multicast paquets.
>>> >     Moreover I've seen with tcpdump the paquets that openais send when
>>> >     cman is trying to start. I've readed about a bug in RH 5.3 with the
>>>
>>>
>>> >     same behaviour, but it is solved in RH 5.4.
>>> >
>>> >     I don't have Selinux enabled, and Iptables are also disabled. Here
>>> >     is the cluster.conf simplified (with less services and resources). I
>>>
>>>
>>> >     want to point out one thing. I have allow_kill="0" in order to avoid
>>> >     fencing errors when quorum tries to fence a failed node. As <fence/>
>>> >     is empty, before this stanza I got a lot of messages in
>>>
>>>
>>> >     /var/log/messages with failed fencing.
>>> >
>>> >     <?xml version="1.0"?>
>>> >     <cluster alias="test_cluster" config_version="15" name="test_cluster">
>>>
>>>
>>> >              <fence_daemon clean_start="0" post_fail_delay="0"
>>> >     post_join_delay="-1"/>
>>> >              <clusternodes>
>>> >                      <clusternode name="node1-hb" nodeid="1" votes="1">
>>>
>>>
>>> >                              <fence/>
>>> >                      </clusternode>
>>> >                      <clusternode name="node2-hb" nodeid="2" votes="1">
>>> >                              <fence/>
>>>
>>>
>>> >                      </clusternode>
>>> >              </clusternodes>
>>> >              <cman two_node="0" expected_votes="3"/>
>>> >              <fencedevices/>
>>>
>>>
>>> >
>>> >              <rm log_facility="local4" log_level="7">
>>> >                      <failoverdomains>
>>> >                              <failoverdomain name="etest_cluster_fo"
>>>
>>>
>>> >     nofailback="1" ordered="1" restricted="1">
>>> >                                      <failoverdomainnode name="node1-hb"
>>> >     priority="1"/>
>>>
>>>
>>> >                                      <failoverdomainnode name="node2-hb"
>>> >     priority="2"/>
>>> >                              </failoverdomain>
>>> >                      </failoverdomains>
>>>
>>>
>>> >              <resources/>
>>> >              <service autostart="1" domain="test_cluster_fo"
>>> >     exclusive="0" name="postgres" recovery="relocate">
>>>
>>>
>>> >                      <ip address="172.24.119.44" monitor_link="1"/>
>>> >                      <lvm name="vg_postgres" vg_name="vg_postgres"
>>> >     lv_name="postgres"/>
>>>
>>>
>>> >
>>> >                      <fs device="/dev/vg_postgres/postgres"
>>> >     force_fsck="1" force_unmount="1" fstype="ext3"
>>> >     mountpoint="/var/lib/pgsql" name="postgres" self_fence="0"/>
>>>
>>>
>>> >
>>> >                      <script file="/etc/init.d/postgresql" name="postgres">
>>> >                      </script>
>>> >              </service>
>>> >              </rm>
>>>
>>>
>>> >              <totem consensus="4000" join="60" token="20000"
>>> >     token_retransmits_before_loss_const="20"/>
>>> >          <quorumd allow_kill="0" interval="1" label="cluster_qdisk"
>>>
>>>
>>> >     tko="10" votes="1">
>>> >                      <heuristic
>>> >     program="/usr/share/cluster/check_eth_link.sh eth0" score="1"
>>> >     interval="2" tko="3"/>
>>>
>>>
>>> >              </quorumd>
>>> >       </cluster>
>>> >
>>> >
>>> >     The /etc/hosts:
>>> >     172.24.119.10 node1
>>> >     172.24.119.34 node2
>>> >     15.15.2.10 node1-hb node1-hb.localdomain
>>>
>>>
>>> >     15.15.2.11 node2-hb node2-hb.localdomain
>>> >
>>> >     And the versions:
>>> >     Red Hat Enterprise Linux Server release 5.7 (Tikanga)
>>> >     cman-2.0.115-85.el5
>>> >     rgmanager-2.0.52-21.el5
>>>
>>>
>>> >     openais-0.80.6-30.el5
>>> >
>>> >     I don't know what else I should try, so if you can give me some
>>> >     ideas, I will be very pleased.
>>> >
>>> >     Regards, Javi.
>>> >
>>> >     --
>>>
>>>
>>> >     Linux-cluster mailing list
>>> >     Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
>>>
>>> >     https://www.redhat.com/mailman/listinfo/linux-cluster
>>>
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > esta es mi vida e me la vivo hasta que dios quiera
>>> >
>>> > -- Linux-cluster mailing list Linux-cluster at redhat.com
>>>
>>> > <mailto:Linux-cluster at redhat.com>
>>>
>>> > https://www.redhat.com/mailman/listinfo/linux-cluster
>>> >
>>> >
>>> > --
>>> > Linux-cluster mailing list
>>> > Linux-cluster at redhat.com
>>>
>>>
>>> > https://www.redhat.com/mailman/listinfo/linux-cluster
>>> >
>>>
>>>
>>> --
>>> Digimer
>>>
>>> Papers and Projects: https://alteeve.com
>>>
>>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
>
> --
> esta es mi vida e me la vivo hasta que dios quiera
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120620/fa7a172a/attachment.htm>

From jvdiago at gmail.com  Wed Jun 20 15:32:40 2012
From: jvdiago at gmail.com (Javier Vela)
Date: Wed, 20 Jun 2012 17:32:40 +0200
Subject: [Linux-cluster]
	=?utf-8?q?Node_can=C2=B4t_join_already_quorated_c?=
	=?utf-8?b?bHVzdGVy4oCP4oCP?=
In-Reply-To: <CAE7pJ3Dkb-yxGNuirr34Sz+Lx0nEQ5=VUX8QeWxabg2fqaRWtA@mail.gmail.com>
References: <CAEAM5QVRDpBi5XB_N6PHjGz=GyWEXCynY_ndihOoAQtiC-Hyjw@mail.gmail.com>
	<CAE7pJ3Dkb-yxGNuirr34Sz+Lx0nEQ5=VUX8QeWxabg2fqaRWtA@mail.gmail.com>
Message-ID: <CAEAM5QUYKSq2nkjbqOVmxB4z4E6wzRaUrBBSDyUqqUtiXxntRg@mail.gmail.com>

Ok. I'm going to use qdisk without lvm. I will tell you.

Thank you for the advice.

Regards, javi.

2012/6/20 emmanuel segura <emi2fast at gmail.com>

> Hello Javier
>
> I think it's better use a qdisk with a plain device not with lvm
>
> 2012/6/20 Javier Vela <jvdiago at gmail.com>
>
>> node1 (node inquorate):
>>
>> [root at node1 ~]# vgs -o tags,vg_name
>>   VG Tags       VG
>>   node1-hb  vg00
>>
>>
>>                 vg_www
>>   node2-hb vg_jabber
>>                 vg_postgres
>>                 vg_qdisk
>>                 vg_tomcat
>>
>>
>> node2 (quorate)
>>
>> [root at node2 ~]# vgs -o tags,vg_name
>>   VG Tags       VG
>>                 vg00
>>                 vg_emasweb
>>   node2-hb vg_jabber
>>                 vg_postgres
>>                 vg_qdisk
>>                 vg_tomcat
>>
>>
>> It's true that vg_qdisk has the label of node2-hb. But it's in the
>> volume_list.
>>
>> Regards, Javier
>>
>> Hello Javier
>>>
>>> Can you send the ouput of this command for every node?
>>>
>>> vgs -o tags,vg_name
>>>
>>> 2012/6/20 Javier Vela <jvdiago at gmail.com>
>>>
>>> In the lvm.conf I have in volume_list the name of the vg_qdisk so this
>>> volume group should be available to both nodes at the same time.
>>>
>>> My volume_list in lvm.conf:
>>>
>>> node1:
>>> volume_list = [ "vg00", "vg_qdisk", "@node1-hb" ]
>>>
>>> node2:
>>> volume_list = [ "vg00", "vg_qdisk", "@node2-hb" ]
>>>
>>> Moreover with the comand lvdisplay I can see that the lv is available to
>>> both nodes. But maybe is worth to try another qdisk without lvm.
>>>
>>>
>>> Hi.
>>>
>>>
>>>
>>> Since you have HA-LVM, are you using volume tagging ? I noticed that
>>> your quorum disk belongs to a volume group vg_qdisk and I think when the
>>> first node that will activate the volumegroup will not allow the second
>>> node to activate the volumegroup because of volume tagging, so remove the
>>> quorumdisk from the volumegroup and just use it as a physical volume.
>>>
>>>
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>
>>>
>>>
>>>
>>> --
>>> esta es mi vida e me la vivo hasta que dios quiera
>>>
>>> -- Linux-cluster mailing list Linux-cluster at redhat.com
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
>
> --
> esta es mi vida e me la vivo hasta que dios quiera
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120620/5fcd7f89/attachment.htm>

From emi2fast at gmail.com  Wed Jun 20 15:33:30 2012
From: emi2fast at gmail.com (emmanuel segura)
Date: Wed, 20 Jun 2012 17:33:30 +0200
Subject: [Linux-cluster]
	=?iso-8859-1?q?Node_can=B4t_join_already_quorated?=
	=?iso-8859-1?q?_cluster?=
In-Reply-To: <CAEAM5QXBF_w63Ly0yyPci_2QajJn_-D_zx4+SGuT2dAHNd3wbQ@mail.gmail.com>
References: <BAF521AEFC0EE6408117583A7B11949B42BB3FB6@REK-EXMBX-P01.isbank.is>
	<CAEAM5QXBF_w63Ly0yyPci_2QajJn_-D_zx4+SGuT2dAHNd3wbQ@mail.gmail.com>
Message-ID: <CAE7pJ3Bnj4DTf4ORakJpx8eeyZp9nSWOn4vJWJCZS6D-ioddcw@mail.gmail.com>

;-) I'm Happy

2012/6/20 Javier Vela <jvdiago at gmail.com>

> Yes, It works:
>
> [root at node1 ~]# vgchange -ay vg_qdisk
>   1 logical volume(s) in volume group "vg_qdisk" now active
>
> lvdisplay:
>  --- Logical volume ---
>   LV Name                /dev/vg_qdisk/lv_qdisk
>   VG Name                vg_qdisk
>   LV UUID                dEYtaV-W2GW-RFOw-ckWB-ppy5-sERn-kuXTLt
>   LV Write Access        read/write
>   LV Status              available
>   # open                 0
>   LV Size                20,00 MB
>   Current LE             5
>   Segments               1
>   Allocation             inherit
>   Read ahead sectors     auto
>   - currently set to     256
>   Block device           253:4
>
>
> 2012/6/20 J?n Bj?rn Nj?lsson <Jon.Njalsson at islandsbanki.is>
>
>>  Are you able to activate the volume group on node1 ?****
>>
>> ** **
>>
>> Vgchange ?ay vg_qdisk ?****
>>
>> ** **
>>
>> *J?n Bj?rn Nj?lsson*
>>
>> Data Management & Data Security****
>>
>> IT Operations****
>>
>> ** **
>>
>> *ISLANDSBANKI*
>>
>> Lyngh?ls 4, 110 Reykjav?k****
>>
>> Iceland****
>>
>> ** **
>>
>> Phone:   +354 440 3898
>> Mobile: +354 844 3898****
>>
>> www.islandsbanki.is
>> Disclaimer: http://www.islandsbanki.is/disclaimer/****
>>
>> ** **
>>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
esta es mi vida e me la vivo hasta que dios quiera
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120620/6e571431/attachment.htm>

From lists at alteeve.ca  Wed Jun 20 15:44:00 2012
From: lists at alteeve.ca (Digimer)
Date: Wed, 20 Jun 2012 11:44:00 -0400
Subject: [Linux-cluster]
 =?utf-8?q?Node_can=27t_join_already_quorated_clus?=
 =?utf-8?b?dGVy4oCP?=
In-Reply-To: <CAE7pJ3ArON1yCJvePZdKR+BJErJthJ98pe3AVjH287TOCC=1PA@mail.gmail.com>
References: <CAEAM5QU2KT04rWnGttkqaWWcfvUyJx=j32GaPudwQwbu3BikJQ@mail.gmail.com>
	<CAE7pJ3ArON1yCJvePZdKR+BJErJthJ98pe3AVjH287TOCC=1PA@mail.gmail.com>
Message-ID: <4FE1EFC0.8070709@alteeve.ca>

It's worth re-stating;

You are running an unsupported configuration. Please try to have the 
VMWare admins enable fence calls against your nodes and setup fencing. 
Until and unless you do, you will almost certainly run into problems, up 
to and including corrupting your data.

Please take a minute to read this:

https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Concept.3B_Fencing

Digimer

On 06/20/2012 11:22 AM, emmanuel segura wrote:
> Ok Javier
>
> So now i know you don't wanna the fencing and the reason :-)
>
> <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="-1"/>
>
> and use the fence_manual
>
>
>
> 2012/6/20 Javier Vela <jvdiago at gmail.com <mailto:jvdiago at gmail.com>>
>
>     I don't use fencing because with ha-lvm I thought that I dind't need
>     it. But also because both nodes are VMs in VMWare. I know that there
>     is a module to do fencing with vmware but I prefer to avoid it. I'm
>     not in control of the VMWare infraestructure and probably VMWare
>     admins won't give me the tools to use this module.
>
>     Regards, Javi
>
>         Fencing is critical, and running a cluster without fencing, even with
>
>
>         qdisk, is not supported. Manual fencing is also not supported. The
>         *only* way to have a reliable cluster, testing or production, is to use
>         fencing.
>
>         Why do you not wish to use it?
>
>         On 06/20/2012 09:43 AM, Javier Vela wrote:
>
>
>         > As I readed, if you use HA-LVM you don't need fencing because of vg
>         > tagging. Is It absolutely mandatory to use fencing with qdisk?
>         >
>         > If it is, i supose i can use manual_fence, but in production I also
>
>
>         > won't use fencing.
>         >
>         > Regards, Javi.
>         >
>         > Date: Wed, 20 Jun 2012 14:45:28 +0200
>         > From:emi2fast at gmail.com  <mailto:emi2fast at gmail.com>  <mailto:emi2fast at gmail.com  <mailto:emi2fast at gmail.com>>
>
>
>         > To:linux-cluster at redhat.com  <mailto:linux-cluster at redhat.com>  <mailto:linux-cluster at redhat.com  <mailto:linux-cluster at redhat.com>>
>         > Subject: Re: [Linux-cluster] Node can't join already quorated cluster
>
>
>         >
>         > If you don't wanna use a real fence divice, because you only do some
>         > test, you have to use fence_manual agent
>         >
>         > 2012/6/20 Javier Vela <jvdiago at gmail.com  <mailto:jvdiago at gmail.com>  <mailto:jvdiago at gmail.com  <mailto:jvdiago at gmail.com>>>
>
>
>         >
>         >     Hi, I have a very strange problem, and after searching through lot
>         >     of forums, I haven't found the solution. This is the scenario:
>         >
>         >     Two node cluster with Red Hat 5.7, HA-LVM, no fencing and quorum
>
>
>         >     disk. I start qdiskd, cman and rgmanager on one node. After 5
>         >     minutes, finally the fencing finishes and cluster get quorate with 2
>         >     votes:
>         >
>         >     [root at node2 ~]# clustat
>         >     Cluster Status for test_cluster @ Wed Jun 20 05:56:39 2012
>
>
>         >     Member Status: Quorate
>         >
>         >       Member Name                             ID   Status
>         >       ------ ----                             ---- ------
>         >       node1-hb                                  1 Offline
>
>
>         >       node2-hb                               2 Online, Local, rgmanager
>         >       /dev/mapper/vg_qdisk-lv_qdisk               0 Online, Quorum Disk
>         >
>         >       Service Name                   Owner (Last)                   State
>
>
>         >       ------- ----                   ----- ------                   -----
>         >       service:postgres                   node2                  started
>         >
>         >     Now, I start the second node. When cman reaches fencing, it hangs
>
>
>         >     for 5 minutes aprox, and finally fails. clustat says:
>         >
>         >     root at node1 ~]# clustat
>         >     Cluster Status for test_cluster @ Wed Jun 20 06:01:12 2012
>         >     Member Status: Inquorate
>         >
>
>
>         >       Member Name                             ID   Status
>         >       ------ ----                             ---- ------
>         >     node1-hb                                  1 Online, Local
>         >     node2-hb                               2 Offline
>
>
>         >       /dev/mapper/vg_qdisk-lv_qdisk               0 Offline
>         >
>         >     And in /var/log/messages I can see this errors:
>         >
>         >     Jun 20 06:02:12 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
>
>
>         >     Jun 20 06:02:12 node1 openais[6098]: [CLM  ] got nodejoin message
>         >     15.15.2.10
>         >     Jun 20 06:02:13 node1 dlm_controld[5386]: connect to ccs error -111,
>         >     check ccsd or cluster status
>
>
>         >     Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>         >     connection.
>         >     Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect:
>         >     Connection refused
>         >     Jun 20 06:02:13 node1 ccsd[6090]: Initial status:: Inquorate
>
>
>         >     Jun 20 06:02:13 node1 gfs_controld[5392]: connect to ccs error -111,
>         >     check ccsd or cluster status
>         >     Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>         >     connection.
>
>
>         >     Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect:
>         >     Connection refused
>         >     Jun 20 06:02:14 node1 openais[6098]: [TOTEM] entering GATHER state
>         >     from 9.
>         >     Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>
>
>         >     connection.
>         >     Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect:
>         >     Connection refused
>         >     Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>         >     connection.
>
>
>         >     Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect:
>         >     Connection refused
>         >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>         >     connection.
>         >     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
>
>
>         >     Connection refused
>         >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>         >     connection.
>         >     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
>         >     Connection refused
>
>
>         >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>         >     connection.
>         >     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
>         >     Connection refused
>         >     Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>
>
>         >     connection.
>         >     Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect:
>         >     Connection refused
>         >     Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>         >     connection.
>
>
>         >     Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect:
>         >     Connection refused
>         >     Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>         >     connection.
>         >     Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect:
>
>
>         >     Connection refused
>         >     Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>         >     connection.
>         >     Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect:
>         >     Connection refused
>
>
>         >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state
>         >     from 0.
>         >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Creating commit token
>         >     because I am the rep.
>         >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Storing new sequence id
>
>
>         >     for ring 15c
>         >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering COMMIT state.
>         >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering RECOVERY state.
>         >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] position [0] member
>
>
>         >     15.15.2.10 <http://15.15.2.10>:
>         >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] previous ring seq 344
>         >     rep 15.15.2.10
>         >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] aru e high delivered e
>
>
>         >     received flag 1
>         >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Did not need to
>         >     originate any messages in recovery.
>         >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Sending initial ORF token
>
>
>         >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
>         >     Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>         >     connection.
>         >     Jun 20 06:02:18 node1 ccsd[6090]: Error while processing connect:
>
>
>         >     Connection refused
>         >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state
>         >     from 9.
>         >     Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.  Refusing
>         >     connection.
>
>
>         >
>         >     And the quorum disk:
>         >
>         >     [root at node2 ~]# mkqdisk -L -d
>         >     kqdisk v0.6.0
>         >     /dev/mapper/vg_qdisk-lv_qdisk:
>         >     /dev/vg_qdisk/lv_qdisk:
>         >              Magic:                eb7a62c2
>
>
>         >              Label:                cluster_qdisk
>         >              Created:              Thu Jun  7 09:23:34 2012
>         >              Host:                 node1
>         >              Kernel Sector Size:   512
>
>
>         >              Recorded Sector Size: 512
>         >
>         >     Status block for node 1
>         >              Last updated by node 2
>         >              Last updated on Wed Jun 20 06:17:23 2012
>         >              State: Evicted
>
>
>         >              Flags: 0000
>         >              Score: 0/0
>         >              Average Cycle speed: 0.000500 seconds
>         >              Last Cycle speed: 0.000000 seconds
>         >              Incarnation: 4fe1a06c4fe1a06c
>
>
>         >     Status block for node 2
>         >              Last updated by node 2
>         >              Last updated on Wed Jun 20 07:09:38 2012
>         >              State: Master
>         >              Flags: 0000
>         >              Score: 0/0
>
>
>         >              Average Cycle speed: 0.001000 seconds
>         >              Last Cycle speed: 0.000000 seconds
>         >              Incarnation: 4fe1a06c4fe1a06c
>         >
>         >
>         >     In the other node I don't see any errors in /var/log/messages. One
>
>
>         >     strange thing is that if I start cman on both nodes at the same
>         >     time, everything works fine and both nodes quorate (until I reboot
>         >     one node and the problem appears). I've checked that multicast is
>
>
>         >     working properly. With iperf I can send a receive multicast paquets.
>         >     Moreover I've seen with tcpdump the paquets that openais send when
>         >     cman is trying to start. I've readed about a bug in RH 5.3 with the
>
>
>         >     same behaviour, but it is solved in RH 5.4.
>         >
>         >     I don't have Selinux enabled, and Iptables are also disabled. Here
>         >     is the cluster.conf simplified (with less services and resources). I
>
>
>         >     want to point out one thing. I have allow_kill="0" in order to avoid
>         >     fencing errors when quorum tries to fence a failed node. As <fence/>
>         >     is empty, before this stanza I got a lot of messages in
>
>
>         >     /var/log/messages with failed fencing.
>         >
>         >     <?xml version="1.0"?>
>         >     <cluster alias="test_cluster" config_version="15" name="test_cluster">
>
>
>         >              <fence_daemon clean_start="0" post_fail_delay="0"
>         >     post_join_delay="-1"/>
>         >              <clusternodes>
>         >                      <clusternode name="node1-hb" nodeid="1" votes="1">
>
>
>         >                              <fence/>
>         >                      </clusternode>
>         >                      <clusternode name="node2-hb" nodeid="2" votes="1">
>         >                              <fence/>
>
>
>         >                      </clusternode>
>         >              </clusternodes>
>         >              <cman two_node="0" expected_votes="3"/>
>         >              <fencedevices/>
>
>
>         >
>         >              <rm log_facility="local4" log_level="7">
>         >                      <failoverdomains>
>         >                              <failoverdomain name="etest_cluster_fo"
>
>
>         >     nofailback="1" ordered="1" restricted="1">
>         >                                      <failoverdomainnode name="node1-hb"
>         >     priority="1"/>
>
>
>         >                                      <failoverdomainnode name="node2-hb"
>         >     priority="2"/>
>         >                              </failoverdomain>
>         >                      </failoverdomains>
>
>
>         >              <resources/>
>         >              <service autostart="1" domain="test_cluster_fo"
>         >     exclusive="0" name="postgres" recovery="relocate">
>
>
>         >                      <ip address="172.24.119.44" monitor_link="1"/>
>         >                      <lvm name="vg_postgres" vg_name="vg_postgres"
>         >     lv_name="postgres"/>
>
>
>         >
>         >                      <fs device="/dev/vg_postgres/postgres"
>         >     force_fsck="1" force_unmount="1" fstype="ext3"
>         >     mountpoint="/var/lib/pgsql" name="postgres" self_fence="0"/>
>
>
>         >
>         >                      <script file="/etc/init.d/postgresql" name="postgres">
>         >                      </script>
>         >              </service>
>         >              </rm>
>
>
>         >              <totem consensus="4000" join="60" token="20000"
>         >     token_retransmits_before_loss_const="20"/>
>         >          <quorumd allow_kill="0" interval="1" label="cluster_qdisk"
>
>
>         >     tko="10" votes="1">
>         >                      <heuristic
>         >     program="/usr/share/cluster/check_eth_link.sh eth0" score="1"
>         >     interval="2" tko="3"/>
>
>
>         >              </quorumd>
>         >       </cluster>
>         >
>         >
>         >     The /etc/hosts:
>         >     172.24.119.10 node1
>         >     172.24.119.34 node2
>         >     15.15.2.10 node1-hb node1-hb.localdomain
>
>
>         >     15.15.2.11 node2-hb node2-hb.localdomain
>         >
>         >     And the versions:
>         >     Red Hat Enterprise Linux Server release 5.7 (Tikanga)
>         >     cman-2.0.115-85.el5
>         >     rgmanager-2.0.52-21.el5
>
>
>         >     openais-0.80.6-30.el5
>         >
>         >     I don't know what else I should try, so if you can give me some
>         >     ideas, I will be very pleased.
>         >
>         >     Regards, Javi.
>         >
>         >     --
>
>
>         >     Linux-cluster mailing list
>         >Linux-cluster at redhat.com  <mailto:Linux-cluster at redhat.com>  <mailto:Linux-cluster at redhat.com  <mailto:Linux-cluster at redhat.com>>
>
>         >https://www.redhat.com/mailman/listinfo/linux-cluster
>
>         >
>         >
>         >
>         >
>         > --
>         > esta es mi vida e me la vivo hasta que dios quiera
>         >
>         > -- Linux-cluster mailing listLinux-cluster at redhat.com  <mailto:Linux-cluster at redhat.com>
>
>         > <mailto:Linux-cluster at redhat.com  <mailto:Linux-cluster at redhat.com>>
>
>         >https://www.redhat.com/mailman/listinfo/linux-cluster
>         >
>         >
>         > --
>         > Linux-cluster mailing list
>         >Linux-cluster at redhat.com  <mailto:Linux-cluster at redhat.com>
>
>
>         >https://www.redhat.com/mailman/listinfo/linux-cluster
>         >
>
>
>         --
>         Digimer
>
>         Papers and Projects:https://alteeve.com
>
>
>     --
>     Linux-cluster mailing list
>     Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
>     https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>
>
> --
> esta es mi vida e me la vivo hasta que dios quiera
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


-- 
Digimer
Papers and Projects: https://alteeve.com




From jvdiago at gmail.com  Wed Jun 20 21:54:30 2012
From: jvdiago at gmail.com (Javier Vela)
Date: Wed, 20 Jun 2012 23:54:30 +0200
Subject: [Linux-cluster]
	=?utf-8?q?Node_can=27t_join_already_quorated_clus?=
	=?utf-8?b?dGVy4oCP?=
In-Reply-To: <4FE1EFC0.8070709@alteeve.ca>
References: <CAEAM5QU2KT04rWnGttkqaWWcfvUyJx=j32GaPudwQwbu3BikJQ@mail.gmail.com>
	<CAE7pJ3ArON1yCJvePZdKR+BJErJthJ98pe3AVjH287TOCC=1PA@mail.gmail.com>
	<4FE1EFC0.8070709@alteeve.ca>
Message-ID: <CAEAM5QVOnpzOf_-=TbBz9t3DmbxVcw=yUYVOA+fvd-OtKTxZng@mail.gmail.com>

Hi, Finally I solved the problem. First, the qdisk over lvm does not work
very well. Switching to a plain device works better. And, as you stated,
without a fence device is not possible to get the cluster to work well, so
I`m going to push VMWare admins and use vmware fencing.

I'm very grateful, I've been working on this problem 3 days without
understanding what was happening, and with only a few emails the problem is
solved. The only thing that bothers me is why cman doesn't advise you that
without a proper fencing the cluster won't work. Moreover I haven't found
in the Red Hat documentati?n a statement telling what I've readed in the
link you pasted:

Fencing is a absolutely critical part of clustering. Without fully working
> fence devices, your cluster will fail.
>


I'm a bit sorry, but now I have another problem. With the cluster quorate
and the two nodes online + qdisk, when I start rgmanager on one node,
everything works ok, an the service starts. Then I start rgmanager in the
other node, but in the second node clustat doesn't show the service:

node2 (with the service working):

[root at node2 ~]# clustat
Cluster Status for test_cluster @ Wed Jun 20 16:21:19 2012
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
node1-hb                                  1 Online
node2-hb                               2 Online, Local, rgmanager
 /dev/disk/by-path/pci-0000:02:01.0-scsi-    0 Online, Quorum Disk

 Service Name                   Owner (Last)                   State
 ------- ----                   ----- ------                   -----
 service:postgres                  node2-hb                  started

node1 (doesn't see the service)

[root at node1 ~]# clustat
Cluster Status for test_cluster @ Wed Jun 20 16:21:15 2012
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 node1-hb                                  1 Online, Local
 node2-hb                               2 Online
 /dev/disk/by-path/pci-0000:02:01.0-scsi-    0 Online, Quorum Disk

In the /var/log/messages  I don't see errors, only this:
last message repeated X times

?What I'm missing? As far I can see, rgmanager doesn't appear in node1, but:

[root at node1 ~]# service rgmanager status
Se est?? ejecutando clurgmgrd (pid  8254)...

The cluster conf:

<?xml version="1.0"?>
<cluster alias="test_cluster" config_version="15" name="test_cluster">
        <fence_daemon clean_start="0" post_fail_delay="0"
post_join_delay="6"/>
        <clusternodes>
                <clusternode name="node1-hb" nodeid="1" votes="1">
                       <fence>
                                <method name="manual">
                                        <device name="fence_manual"
nodename="node1-hb"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="node2-hb" nodeid="2" votes="1">
                        <fence>
                                <method name="manual">
                                        <device name="fence_manual"
nodename="node2-hb"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman two_node="0" expected_votes="3"/>
        <fencedevices>
                <fencedevice agent="fence_manual" name="fence_manual"/>
        </fencedevices>


        <rm log_facility="local4" log_level="7">
                <failoverdomains>
                        <failoverdomain name="etest_cluster_fo"
nofailback="1" ordered="1" restricted="1">


  <failoverdomainnode name="node1-hb" priority="1"/>
                                <failoverdomainnode name="node2-hb"
priority="2"/>
                        </failoverdomain>
                </failoverdomains>
        <resources/>
        <service autostart="1" domain="test_cluster_fo" exclusive="0"
name="postgres" recovery="relocate">
                <ip address="172.24.119.44" monitor_link="1"/>
                <lvm name="vg_postgres" vg_name="vg_postgres"
lv_name="postgres"/>

                <fs device="/dev/vg_postgres/postgres" force_fsck="1"
force_unmount="1" fstype="ext3" mountpoint="/var/lib/pgsql" name="postgres"
self_fence="0"/>

                <script file="/etc/init.d/postgresql" name="postgres">
                </script>
        </service>
        </rm>
        <totem consensus="4000" join="60" token="20000"
token_retransmits_before_loss_const="20"/>
    <quorumd  interval="1" label="cluster_qdisk" tko="10" votes="1">
                <heuristic program="/usr/share/cluster/check_eth_link.sh
eth0" score="1" interval="2" tko="3"/>
        </quorumd>
 </cluster>



Regards, Javi.


2012/6/20 Digimer <lists at alteeve.ca>

> It's worth re-stating;
>
> You are running an unsupported configuration. Please try to have the
> VMWare admins enable fence calls against your nodes and setup fencing.
> Until and unless you do, you will almost certainly run into problems, up to
> and including corrupting your data.
>
> Please take a minute to read this:
>
> https://alteeve.com/w/2-Node_**Red_Hat_KVM_Cluster_Tutorial#**
> Concept.3B_Fencing<https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Concept.3B_Fencing>
>
> Digimer
>
>
> On 06/20/2012 11:22 AM, emmanuel segura wrote:
>
>> Ok Javier
>>
>> So now i know you don't wanna the fencing and the reason :-)
>>
>> <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="-1"/>
>>
>> and use the fence_manual
>>
>>
>>
>> 2012/6/20 Javier Vela <jvdiago at gmail.com <mailto:jvdiago at gmail.com>>
>>
>>
>>    I don't use fencing because with ha-lvm I thought that I dind't need
>>    it. But also because both nodes are VMs in VMWare. I know that there
>>    is a module to do fencing with vmware but I prefer to avoid it. I'm
>>    not in control of the VMWare infraestructure and probably VMWare
>>    admins won't give me the tools to use this module.
>>
>>    Regards, Javi
>>
>>        Fencing is critical, and running a cluster without fencing, even
>> with
>>
>>
>>        qdisk, is not supported. Manual fencing is also not supported. The
>>        *only* way to have a reliable cluster, testing or production, is
>> to use
>>        fencing.
>>
>>        Why do you not wish to use it?
>>
>>        On 06/20/2012 09:43 AM, Javier Vela wrote:
>>
>>
>>        > As I readed, if you use HA-LVM you don't need fencing because of
>> vg
>>        > tagging. Is It absolutely mandatory to use fencing with qdisk?
>>        >
>>        > If it is, i supose i can use manual_fence, but in production I
>> also
>>
>>
>>        > won't use fencing.
>>        >
>>        > Regards, Javi.
>>        >
>>        > Date: Wed, 20 Jun 2012 14:45:28 +0200
>>        > From:emi2fast at gmail.com  <mailto:emi2fast at gmail.com>  <mailto:
>> emi2fast at gmail.com  <mailto:emi2fast at gmail.com>>
>>
>>
>>        > To:linux-cluster at redhat.com  <mailto:linux-cluster at redhat.**com<linux-cluster at redhat.com>>
>>  <mailto:linux-cluster at redhat.**com <linux-cluster at redhat.com>  <mailto:
>> linux-cluster at redhat.**com <linux-cluster at redhat.com>>>
>>
>>        > Subject: Re: [Linux-cluster] Node can't join already quorated
>> cluster
>>
>>
>>        >
>>        > If you don't wanna use a real fence divice, because you only do
>> some
>>        > test, you have to use fence_manual agent
>>        >
>>        > 2012/6/20 Javier Vela <jvdiago at gmail.com  <mailto:
>> jvdiago at gmail.com>  <mailto:jvdiago at gmail.com  <mailto:jvdiago at gmail.com
>> >>>
>>
>>
>>
>>        >
>>        >     Hi, I have a very strange problem, and after searching
>> through lot
>>        >     of forums, I haven't found the solution. This is the
>> scenario:
>>        >
>>        >     Two node cluster with Red Hat 5.7, HA-LVM, no fencing and
>> quorum
>>
>>
>>        >     disk. I start qdiskd, cman and rgmanager on one node. After 5
>>        >     minutes, finally the fencing finishes and cluster get
>> quorate with 2
>>        >     votes:
>>        >
>>        >     [root at node2 ~]# clustat
>>        >     Cluster Status for test_cluster @ Wed Jun 20 05:56:39 2012
>>
>>
>>        >     Member Status: Quorate
>>        >
>>        >       Member Name                             ID   Status
>>        >       ------ ----                             ---- ------
>>        >       node1-hb                                  1 Offline
>>
>>
>>        >       node2-hb                               2 Online, Local,
>> rgmanager
>>        >       /dev/mapper/vg_qdisk-lv_qdisk               0 Online,
>> Quorum Disk
>>        >
>>        >       Service Name                   Owner (Last)
>>     State
>>
>>
>>        >       ------- ----                   ----- ------
>>     -----
>>        >       service:postgres                   node2
>>  started
>>        >
>>        >     Now, I start the second node. When cman reaches fencing, it
>> hangs
>>
>>
>>        >     for 5 minutes aprox, and finally fails. clustat says:
>>        >
>>        >     root at node1 ~]# clustat
>>        >     Cluster Status for test_cluster @ Wed Jun 20 06:01:12 2012
>>        >     Member Status: Inquorate
>>        >
>>
>>
>>        >       Member Name                             ID   Status
>>        >       ------ ----                             ---- ------
>>        >     node1-hb                                  1 Online, Local
>>        >     node2-hb                               2 Offline
>>
>>
>>        >       /dev/mapper/vg_qdisk-lv_qdisk               0 Offline
>>        >
>>        >     And in /var/log/messages I can see this errors:
>>        >
>>        >     Jun 20 06:02:12 node1 openais[6098]: [TOTEM] entering
>> OPERATIONAL state.
>>
>>
>>        >     Jun 20 06:02:12 node1 openais[6098]: [CLM  ] got nodejoin
>> message
>>        >     15.15.2.10
>>        >     Jun 20 06:02:13 node1 dlm_controld[5386]: connect to ccs
>> error -111,
>>        >     check ccsd or cluster status
>>
>>
>>        >     Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.
>>  Refusing
>>        >     connection.
>>        >     Jun 20 06:02:13 node1 ccsd[6090]: Error while processing
>> connect:
>>        >     Connection refused
>>        >     Jun 20 06:02:13 node1 ccsd[6090]: Initial status:: Inquorate
>>
>>
>>        >     Jun 20 06:02:13 node1 gfs_controld[5392]: connect to ccs
>> error -111,
>>        >     check ccsd or cluster status
>>        >     Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.
>>  Refusing
>>        >     connection.
>>
>>
>>        >     Jun 20 06:02:13 node1 ccsd[6090]: Error while processing
>> connect:
>>        >     Connection refused
>>        >     Jun 20 06:02:14 node1 openais[6098]: [TOTEM] entering GATHER
>> state
>>        >     from 9.
>>        >     Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.
>>  Refusing
>>
>>
>>        >     connection.
>>        >     Jun 20 06:02:14 node1 ccsd[6090]: Error while processing
>> connect:
>>        >     Connection refused
>>        >     Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.
>>  Refusing
>>        >     connection.
>>
>>
>>        >     Jun 20 06:02:14 node1 ccsd[6090]: Error while processing
>> connect:
>>        >     Connection refused
>>        >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.
>>  Refusing
>>        >     connection.
>>        >     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing
>> connect:
>>
>>
>>        >     Connection refused
>>        >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.
>>  Refusing
>>        >     connection.
>>        >     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing
>> connect:
>>        >     Connection refused
>>
>>
>>        >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.
>>  Refusing
>>        >     connection.
>>        >     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing
>> connect:
>>        >     Connection refused
>>        >     Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.
>>  Refusing
>>
>>
>>        >     connection.
>>        >     Jun 20 06:02:16 node1 ccsd[6090]: Error while processing
>> connect:
>>        >     Connection refused
>>        >     Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.
>>  Refusing
>>        >     connection.
>>
>>
>>        >     Jun 20 06:02:16 node1 ccsd[6090]: Error while processing
>> connect:
>>        >     Connection refused
>>        >     Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.
>>  Refusing
>>        >     connection.
>>        >     Jun 20 06:02:17 node1 ccsd[6090]: Error while processing
>> connect:
>>
>>
>>        >     Connection refused
>>        >     Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.
>>  Refusing
>>        >     connection.
>>        >     Jun 20 06:02:17 node1 ccsd[6090]: Error while processing
>> connect:
>>        >     Connection refused
>>
>>
>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER
>> state
>>        >     from 0.
>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Creating commit
>> token
>>        >     because I am the rep.
>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Storing new
>> sequence id
>>
>>
>>        >     for ring 15c
>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering COMMIT
>> state.
>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering
>> RECOVERY state.
>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] position [0]
>> member
>>
>>
>>        >     15.15.2.10 <http://15.15.2.10>:
>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] previous ring
>> seq 344
>>        >     rep 15.15.2.10
>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] aru e high
>> delivered e
>>
>>
>>        >     received flag 1
>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Did not need to
>>        >     originate any messages in recovery.
>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Sending initial
>> ORF token
>>
>>
>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering
>> OPERATIONAL state.
>>        >     Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.
>>  Refusing
>>        >     connection.
>>        >     Jun 20 06:02:18 node1 ccsd[6090]: Error while processing
>> connect:
>>
>>
>>        >     Connection refused
>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER
>> state
>>        >     from 9.
>>        >     Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.
>>  Refusing
>>        >     connection.
>>
>>
>>        >
>>        >     And the quorum disk:
>>        >
>>        >     [root at node2 ~]# mkqdisk -L -d
>>        >     kqdisk v0.6.0
>>        >     /dev/mapper/vg_qdisk-lv_qdisk:
>>        >     /dev/vg_qdisk/lv_qdisk:
>>        >              Magic:                eb7a62c2
>>
>>
>>        >              Label:                cluster_qdisk
>>        >              Created:              Thu Jun  7 09:23:34 2012
>>        >              Host:                 node1
>>        >              Kernel Sector Size:   512
>>
>>
>>        >              Recorded Sector Size: 512
>>        >
>>        >     Status block for node 1
>>        >              Last updated by node 2
>>        >              Last updated on Wed Jun 20 06:17:23 2012
>>        >              State: Evicted
>>
>>
>>        >              Flags: 0000
>>        >              Score: 0/0
>>        >              Average Cycle speed: 0.000500 seconds
>>        >              Last Cycle speed: 0.000000 seconds
>>        >              Incarnation: 4fe1a06c4fe1a06c
>>
>>
>>        >     Status block for node 2
>>        >              Last updated by node 2
>>        >              Last updated on Wed Jun 20 07:09:38 2012
>>        >              State: Master
>>        >              Flags: 0000
>>        >              Score: 0/0
>>
>>
>>        >              Average Cycle speed: 0.001000 seconds
>>        >              Last Cycle speed: 0.000000 seconds
>>        >              Incarnation: 4fe1a06c4fe1a06c
>>        >
>>        >
>>        >     In the other node I don't see any errors in
>> /var/log/messages. One
>>
>>
>>        >     strange thing is that if I start cman on both nodes at the
>> same
>>        >     time, everything works fine and both nodes quorate (until I
>> reboot
>>        >     one node and the problem appears). I've checked that
>> multicast is
>>
>>
>>        >     working properly. With iperf I can send a receive multicast
>> paquets.
>>        >     Moreover I've seen with tcpdump the paquets that openais
>> send when
>>        >     cman is trying to start. I've readed about a bug in RH 5.3
>> with the
>>
>>
>>        >     same behaviour, but it is solved in RH 5.4.
>>        >
>>        >     I don't have Selinux enabled, and Iptables are also
>> disabled. Here
>>        >     is the cluster.conf simplified (with less services and
>> resources). I
>>
>>
>>        >     want to point out one thing. I have allow_kill="0" in order
>> to avoid
>>        >     fencing errors when quorum tries to fence a failed node. As
>> <fence/>
>>        >     is empty, before this stanza I got a lot of messages in
>>
>>
>>        >     /var/log/messages with failed fencing.
>>        >
>>        >     <?xml version="1.0"?>
>>        >     <cluster alias="test_cluster" config_version="15"
>> name="test_cluster">
>>
>>
>>        >              <fence_daemon clean_start="0" post_fail_delay="0"
>>        >     post_join_delay="-1"/>
>>        >              <clusternodes>
>>        >                      <clusternode name="node1-hb" nodeid="1"
>> votes="1">
>>
>>
>>        >                              <fence/>
>>        >                      </clusternode>
>>        >                      <clusternode name="node2-hb" nodeid="2"
>> votes="1">
>>        >                              <fence/>
>>
>>
>>        >                      </clusternode>
>>        >              </clusternodes>
>>        >              <cman two_node="0" expected_votes="3"/>
>>        >              <fencedevices/>
>>
>>
>>        >
>>        >              <rm log_facility="local4" log_level="7">
>>        >                      <failoverdomains>
>>        >                              <failoverdomain
>> name="etest_cluster_fo"
>>
>>
>>        >     nofailback="1" ordered="1" restricted="1">
>>        >                                      <failoverdomainnode
>> name="node1-hb"
>>        >     priority="1"/>
>>
>>
>>        >                                      <failoverdomainnode
>> name="node2-hb"
>>        >     priority="2"/>
>>        >                              </failoverdomain>
>>        >                      </failoverdomains>
>>
>>
>>        >              <resources/>
>>        >              <service autostart="1" domain="test_cluster_fo"
>>        >     exclusive="0" name="postgres" recovery="relocate">
>>
>>
>>        >                      <ip address="172.24.119.44"
>> monitor_link="1"/>
>>        >                      <lvm name="vg_postgres"
>> vg_name="vg_postgres"
>>        >     lv_name="postgres"/>
>>
>>
>>        >
>>        >                      <fs device="/dev/vg_postgres/**postgres"
>>        >     force_fsck="1" force_unmount="1" fstype="ext3"
>>        >     mountpoint="/var/lib/pgsql" name="postgres" self_fence="0"/>
>>
>>
>>        >
>>        >                      <script file="/etc/init.d/postgresql"
>> name="postgres">
>>        >                      </script>
>>        >              </service>
>>        >              </rm>
>>
>>
>>        >              <totem consensus="4000" join="60" token="20000"
>>        >     token_retransmits_before_loss_**const="20"/>
>>        >          <quorumd allow_kill="0" interval="1"
>> label="cluster_qdisk"
>>
>>
>>        >     tko="10" votes="1">
>>        >                      <heuristic
>>        >     program="/usr/share/cluster/**check_eth_link.sh eth0"
>> score="1"
>>        >     interval="2" tko="3"/>
>>
>>
>>        >              </quorumd>
>>        >       </cluster>
>>        >
>>        >
>>        >     The /etc/hosts:
>>        >     172.24.119.10 node1
>>        >     172.24.119.34 node2
>>        >     15.15.2.10 node1-hb node1-hb.localdomain
>>
>>
>>        >     15.15.2.11 node2-hb node2-hb.localdomain
>>        >
>>        >     And the versions:
>>        >     Red Hat Enterprise Linux Server release 5.7 (Tikanga)
>>        >     cman-2.0.115-85.el5
>>        >     rgmanager-2.0.52-21.el5
>>
>>
>>        >     openais-0.80.6-30.el5
>>        >
>>        >     I don't know what else I should try, so if you can give me
>> some
>>        >     ideas, I will be very pleased.
>>        >
>>        >     Regards, Javi.
>>        >
>>        >     --
>>
>>
>>        >     Linux-cluster mailing list
>>        >Linux-cluster at redhat.com  <mailto:Linux-cluster at redhat.**com<Linux-cluster at redhat.com>>
>>  <mailto:Linux-cluster at redhat.**com <Linux-cluster at redhat.com>  <mailto:
>> Linux-cluster at redhat.**com <Linux-cluster at redhat.com>>>
>>
>>
>>        >https://www.redhat.com/**mailman/listinfo/linux-cluster<https://www.redhat.com/mailman/listinfo/linux-cluster>
>>
>>        >
>>        >
>>        >
>>        >
>>        > --
>>        > esta es mi vida e me la vivo hasta que dios quiera
>>        >
>>        > -- Linux-cluster mailing listLinux-cluster at redhat.com  <mailto:
>> Linux-cluster at redhat.**com <Linux-cluster at redhat.com>>
>>
>>        > <mailto:Linux-cluster at redhat.**com <Linux-cluster at redhat.com> <mailto:
>> Linux-cluster at redhat.**com <Linux-cluster at redhat.com>>>
>>
>>
>>        >https://www.redhat.com/**mailman/listinfo/linux-cluster<https://www.redhat.com/mailman/listinfo/linux-cluster>
>>        >
>>        >
>>        > --
>>        > Linux-cluster mailing list
>>        >Linux-cluster at redhat.com  <mailto:Linux-cluster at redhat.**com<Linux-cluster at redhat.com>
>> >
>>
>>
>>        >https://www.redhat.com/**mailman/listinfo/linux-cluster<https://www.redhat.com/mailman/listinfo/linux-cluster>
>>        >
>>
>>
>>        --
>>        Digimer
>>
>>        Papers and Projects:https://alteeve.com
>>
>>
>>    --
>>    Linux-cluster mailing list
>>    Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.**com<Linux-cluster at redhat.com>
>> >
>>    https://www.redhat.com/**mailman/listinfo/linux-cluster<https://www.redhat.com/mailman/listinfo/linux-cluster>
>>
>>
>>
>>
>> --
>> esta es mi vida e me la vivo hasta que dios quiera
>>
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/**mailman/listinfo/linux-cluster<https://www.redhat.com/mailman/listinfo/linux-cluster>
>>
>>
>
> --
> Digimer
> Papers and Projects: https://alteeve.com
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/**mailman/listinfo/linux-cluster<https://www.redhat.com/mailman/listinfo/linux-cluster>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120620/bc7583be/attachment.htm>

From lists at alteeve.ca  Wed Jun 20 22:13:47 2012
From: lists at alteeve.ca (Digimer)
Date: Wed, 20 Jun 2012 18:13:47 -0400
Subject: [Linux-cluster]
 =?utf-8?q?Node_can=27t_join_already_quorated_clus?=
 =?utf-8?b?dGVy4oCP?=
In-Reply-To: <CAEAM5QVOnpzOf_-=TbBz9t3DmbxVcw=yUYVOA+fvd-OtKTxZng@mail.gmail.com>
References: <CAEAM5QU2KT04rWnGttkqaWWcfvUyJx=j32GaPudwQwbu3BikJQ@mail.gmail.com>
	<CAE7pJ3ArON1yCJvePZdKR+BJErJthJ98pe3AVjH287TOCC=1PA@mail.gmail.com>
	<4FE1EFC0.8070709@alteeve.ca>
	<CAEAM5QVOnpzOf_-=TbBz9t3DmbxVcw=yUYVOA+fvd-OtKTxZng@mail.gmail.com>
Message-ID: <4FE24B1B.2020101@alteeve.ca>

You won't see services until the rgmanager daemon is running. Look at this:

 > node1-hb                                  1 Online
 > node2-hb                               2 Online, Local, rgmanager

This tells you that both node1-hb and node2-hb are running CMAN (That's 
the "Online" part), but only node2-hb is running "rgmanager". So on 
node1-hb, run '/etc/init.d/rgmanager start'.

As for the fence requirement, I agree that it should be said more 
directly, but it is covered here:

http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/High_Availability_Add-On_Overview/ch-fencing.html

Specifically:

     For example, DLM and GFS2, when notified of a node failure, suspend
     activity until they detect that fenced has completed fencing the
     failed node. Upon confirmation that the failed node is fenced, DLM
     and GFS2 perform recovery. DLM releases locks of the failed node;
     GFS2 recovers the journal of the failed node.

The key is "Upon confirmation that the failed node is fenced, DLM and 
GFS2 perform recovery."

If there is no fence configured, they will never get confirmation of 
success, so the cluster stays blocked (effectively hung forever, by 
design).

This requirement is also documented on the official cluster wiki:

https://fedorahosted.org/cluster/wiki/FAQ/Fencing#fence_manual2

Digimer

On 06/20/2012 05:54 PM, Javier Vela wrote:
> Hi, Finally I solved the problem. First, the qdisk over lvm does not
> work very well. Switching to a plain device works better. And, as you
> stated, without a fence device is not possible to get the cluster to
> work well, so I`m going to push VMWare admins and use vmware fencing.
>
> I'm very grateful, I've been working on this problem 3 days without
> understanding what was happening, and with only a few emails the problem
> is solved. The only thing that bothers me is why cman doesn't advise you
> that without a proper fencing the cluster won't work. Moreover I haven't
> found in the Red Hat documentati?n a statement telling what I've readed
> in the link you pasted:
>
>     Fencing is a absolutely critical part of clustering. Without fully
>     working fence devices, your cluster will fail.
>
>
>
> I'm a bit sorry, but now I have another problem. With the cluster
> quorate and the two nodes online + qdisk, when I start rgmanager on one
> node, everything works ok, an the service starts. Then I start rgmanager
> in the other node, but in the second node clustat doesn't show the service:
>
> node2 (with the service working):
>
> [root at node2 ~]# clustat
> Cluster Status for test_cluster @ Wed Jun 20 16:21:19 2012
> Member Status: Quorate
>
>   Member Name                             ID   Status
>   ------ ----                             ---- ------
> node1-hb                                  1 Online
> node2-hb                               2 Online, Local, rgmanager
>   /dev/disk/by-path/pci-0000:02:01.0-scsi-    0 Online, Quorum Disk
>
>   Service Name                   Owner (Last)                   State
>   ------- ----                   ----- ------                   -----
>   service:postgres                  node2-hb                  started
>
> node1 (doesn't see the service)
>
> [root at node1 ~]# clustat
> Cluster Status for test_cluster @ Wed Jun 20 16:21:15 2012
> Member Status: Quorate
>
>   Member Name                             ID   Status
>   ------ ----                             ---- ------
>   node1-hb                                  1 Online, Local
>   node2-hb                               2 Online
>   /dev/disk/by-path/pci-0000:02:01.0-scsi-    0 Online, Quorum Disk
>
> In the /var/log/messages  I don't see errors, only this:
> last message repeated X times
>
> ?What I'm missing? As far I can see, rgmanager doesn't appear in node1, but:
>
> [root at node1 ~]# service rgmanager status
> Se est?? ejecutando clurgmgrd (pid  8254)...
>
> The cluster conf:
>
> <?xml version="1.0"?>
> <cluster alias="test_cluster" config_version="15" name="test_cluster">
>          <fence_daemon clean_start="0" post_fail_delay="0"
> post_join_delay="6"/>
>          <clusternodes>
>                  <clusternode name="node1-hb" nodeid="1" votes="1">
>                         <fence>
>                                  <method name="manual">
>                                          <device name="fence_manual"
> nodename="node1-hb"/>
>                                  </method>
>                          </fence>
>                  </clusternode>
>                  <clusternode name="node2-hb" nodeid="2" votes="1">
>                          <fence>
>                                  <method name="manual">
>                                          <device name="fence_manual"
> nodename="node2-hb"/>
>                                  </method>
>                          </fence>
>                  </clusternode>
>          </clusternodes>
>          <cman two_node="0" expected_votes="3"/>
>          <fencedevices>
>                  <fencedevice agent="fence_manual" name="fence_manual"/>
>          </fencedevices>
>
>
>          <rm log_facility="local4" log_level="7">
>                  <failoverdomains>
>                          <failoverdomain name="etest_cluster_fo"
> nofailback="1" ordered="1" restricted="1">
>
>        <failoverdomainnode name="node1-hb" priority="1"/>
>                                      <failoverdomainnode name="node2-hb"
>     priority="2"/>
>                              </failoverdomain>
>                      </failoverdomains>
>              <resources/>
>              <service autostart="1" domain="test_cluster_fo"
>     exclusive="0" name="postgres" recovery="relocate">
>                      <ip address="172.24.119.44" monitor_link="1"/>
>                      <lvm name="vg_postgres" vg_name="vg_postgres"
>     lv_name="postgres"/>
>
>                      <fs device="/dev/vg_postgres/postgres"
>     force_fsck="1" force_unmount="1" fstype="ext3"
>     mountpoint="/var/lib/pgsql" name="postgres" self_fence="0"/>
>
>                      <script file="/etc/init.d/postgresql" name="postgres">
>                      </script>
>              </service>
>              </rm>
>              <totem consensus="4000" join="60" token="20000"
>     token_retransmits_before_loss_const="20"/>
>          <quorumd  interval="1" label="cluster_qdisk" tko="10" votes="1">
>                      <heuristic
>     program="/usr/share/cluster/check_eth_link.sh eth0" score="1"
>     interval="2" tko="3"/>
>              </quorumd>
>       </cluster>
>
>
>
> Regards, Javi.
>
>
> 2012/6/20 Digimer <lists at alteeve.ca <mailto:lists at alteeve.ca>>
>
>     It's worth re-stating;
>
>     You are running an unsupported configuration. Please try to have the
>     VMWare admins enable fence calls against your nodes and setup
>     fencing. Until and unless you do, you will almost certainly run into
>     problems, up to and including corrupting your data.
>
>     Please take a minute to read this:
>
>     https://alteeve.com/w/2-Node___Red_Hat_KVM_Cluster_Tutorial#__Concept.3B_Fencing
>     <https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Concept.3B_Fencing>
>
>     Digimer
>
>
>     On 06/20/2012 11:22 AM, emmanuel segura wrote:
>
>         Ok Javier
>
>         So now i know you don't wanna the fencing and the reason :-)
>
>         <fence_daemon clean_start="1" post_fail_delay="0"
>         post_join_delay="-1"/>
>
>         and use the fence_manual
>
>
>
>         2012/6/20 Javier Vela <jvdiago at gmail.com
>         <mailto:jvdiago at gmail.com> <mailto:jvdiago at gmail.com
>         <mailto:jvdiago at gmail.com>>>
>
>
>             I don't use fencing because with ha-lvm I thought that I
>         dind't need
>             it. But also because both nodes are VMs in VMWare. I know
>         that there
>             is a module to do fencing with vmware but I prefer to avoid
>         it. I'm
>             not in control of the VMWare infraestructure and probably VMWare
>             admins won't give me the tools to use this module.
>
>             Regards, Javi
>
>                 Fencing is critical, and running a cluster without
>         fencing, even with
>
>
>                 qdisk, is not supported. Manual fencing is also not
>         supported. The
>                 *only* way to have a reliable cluster, testing or
>         production, is to use
>                 fencing.
>
>                 Why do you not wish to use it?
>
>                 On 06/20/2012 09:43 AM, Javier Vela wrote:
>
>
>                 > As I readed, if you use HA-LVM you don't need fencing
>         because of vg
>                 > tagging. Is It absolutely mandatory to use fencing
>         with qdisk?
>                 >
>                 > If it is, i supose i can use manual_fence, but in
>         production I also
>
>
>                 > won't use fencing.
>                 >
>                 > Regards, Javi.
>                 >
>                 > Date: Wed, 20 Jun 2012 14:45:28 +0200
>                 > From:emi2fast at gmail.com
>         <mailto:From%3Aemi2fast at gmail.com>  <mailto:emi2fast at gmail.com
>         <mailto:emi2fast at gmail.com>>  <mailto:emi2fast at gmail.com
>         <mailto:emi2fast at gmail.com>  <mailto:emi2fast at gmail.com
>         <mailto:emi2fast at gmail.com>>>
>
>
>                 > To:linux-cluster at redhat.com
>         <mailto:To%3Alinux-cluster at redhat.com>
>           <mailto:linux-cluster at redhat.__com
>         <mailto:linux-cluster at redhat.com>>
>           <mailto:linux-cluster at redhat.__com
>         <mailto:linux-cluster at redhat.com>
>           <mailto:linux-cluster at redhat.__com
>         <mailto:linux-cluster at redhat.com>>>
>
>                 > Subject: Re: [Linux-cluster] Node can't join already
>         quorated cluster
>
>
>                 >
>                 > If you don't wanna use a real fence divice, because
>         you only do some
>                 > test, you have to use fence_manual agent
>                 >
>                 > 2012/6/20 Javier Vela <jvdiago at gmail.com
>         <mailto:jvdiago at gmail.com>  <mailto:jvdiago at gmail.com
>         <mailto:jvdiago at gmail.com>>  <mailto:jvdiago at gmail.com
>         <mailto:jvdiago at gmail.com>  <mailto:jvdiago at gmail.com
>         <mailto:jvdiago at gmail.com>>>>
>
>
>
>                 >
>                 >     Hi, I have a very strange problem, and after
>         searching through lot
>                 >     of forums, I haven't found the solution. This is
>         the scenario:
>                 >
>                 >     Two node cluster with Red Hat 5.7, HA-LVM, no
>         fencing and quorum
>
>
>                 >     disk. I start qdiskd, cman and rgmanager on one
>         node. After 5
>                 >     minutes, finally the fencing finishes and cluster
>         get quorate with 2
>                 >     votes:
>                 >
>                 >     [root at node2 ~]# clustat
>                 >     Cluster Status for test_cluster @ Wed Jun 20
>         05:56:39 2012
>
>
>                 >     Member Status: Quorate
>                 >
>                 >       Member Name                             ID   Status
>                 >       ------ ----                             ---- ------
>                 >       node1-hb                                  1 Offline
>
>
>                 >       node2-hb                               2 Online,
>         Local, rgmanager
>                 >       /dev/mapper/vg_qdisk-lv_qdisk               0
>         Online, Quorum Disk
>                 >
>                 >       Service Name                   Owner (Last)
>                        State
>
>
>                 >       ------- ----                   ----- ------
>                        -----
>                 >       service:postgres                   node2
>                   started
>                 >
>                 >     Now, I start the second node. When cman reaches
>         fencing, it hangs
>
>
>                 >     for 5 minutes aprox, and finally fails. clustat says:
>                 >
>                 >     root at node1 ~]# clustat
>                 >     Cluster Status for test_cluster @ Wed Jun 20
>         06:01:12 2012
>                 >     Member Status: Inquorate
>                 >
>
>
>                 >       Member Name                             ID   Status
>                 >       ------ ----                             ---- ------
>                 >     node1-hb                                  1
>         Online, Local
>                 >     node2-hb                               2 Offline
>
>
>                 >       /dev/mapper/vg_qdisk-lv_qdisk               0
>         Offline
>                 >
>                 >     And in /var/log/messages I can see this errors:
>                 >
>                 >     Jun 20 06:02:12 node1 openais[6098]: [TOTEM]
>         entering OPERATIONAL state.
>
>
>                 >     Jun 20 06:02:12 node1 openais[6098]: [CLM  ] got
>         nodejoin message
>                 >     15.15.2.10
>                 >     Jun 20 06:02:13 node1 dlm_controld[5386]: connect
>         to ccs error -111,
>                 >     check ccsd or cluster status
>
>
>                 >     Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not
>         quorate.  Refusing
>                 >     connection.
>                 >     Jun 20 06:02:13 node1 ccsd[6090]: Error while
>         processing connect:
>                 >     Connection refused
>                 >     Jun 20 06:02:13 node1 ccsd[6090]: Initial status::
>         Inquorate
>
>
>                 >     Jun 20 06:02:13 node1 gfs_controld[5392]: connect
>         to ccs error -111,
>                 >     check ccsd or cluster status
>                 >     Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not
>         quorate.  Refusing
>                 >     connection.
>
>
>                 >     Jun 20 06:02:13 node1 ccsd[6090]: Error while
>         processing connect:
>                 >     Connection refused
>                 >     Jun 20 06:02:14 node1 openais[6098]: [TOTEM]
>         entering GATHER state
>                 >     from 9.
>                 >     Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not
>         quorate.  Refusing
>
>
>                 >     connection.
>                 >     Jun 20 06:02:14 node1 ccsd[6090]: Error while
>         processing connect:
>                 >     Connection refused
>                 >     Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not
>         quorate.  Refusing
>                 >     connection.
>
>
>                 >     Jun 20 06:02:14 node1 ccsd[6090]: Error while
>         processing connect:
>                 >     Connection refused
>                 >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not
>         quorate.  Refusing
>                 >     connection.
>                 >     Jun 20 06:02:15 node1 ccsd[6090]: Error while
>         processing connect:
>
>
>                 >     Connection refused
>                 >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not
>         quorate.  Refusing
>                 >     connection.
>                 >     Jun 20 06:02:15 node1 ccsd[6090]: Error while
>         processing connect:
>                 >     Connection refused
>
>
>                 >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not
>         quorate.  Refusing
>                 >     connection.
>                 >     Jun 20 06:02:15 node1 ccsd[6090]: Error while
>         processing connect:
>                 >     Connection refused
>                 >     Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not
>         quorate.  Refusing
>
>
>                 >     connection.
>                 >     Jun 20 06:02:16 node1 ccsd[6090]: Error while
>         processing connect:
>                 >     Connection refused
>                 >     Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not
>         quorate.  Refusing
>                 >     connection.
>
>
>                 >     Jun 20 06:02:16 node1 ccsd[6090]: Error while
>         processing connect:
>                 >     Connection refused
>                 >     Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not
>         quorate.  Refusing
>                 >     connection.
>                 >     Jun 20 06:02:17 node1 ccsd[6090]: Error while
>         processing connect:
>
>
>                 >     Connection refused
>                 >     Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not
>         quorate.  Refusing
>                 >     connection.
>                 >     Jun 20 06:02:17 node1 ccsd[6090]: Error while
>         processing connect:
>                 >     Connection refused
>
>
>                 >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM]
>         entering GATHER state
>                 >     from 0.
>                 >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM]
>         Creating commit token
>                 >     because I am the rep.
>                 >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM]
>         Storing new sequence id
>
>
>                 >     for ring 15c
>                 >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM]
>         entering COMMIT state.
>                 >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM]
>         entering RECOVERY state.
>                 >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM]
>         position [0] member
>
>
>                 >     15.15.2.10 <http://15.15.2.10>:
>                 >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM]
>         previous ring seq 344
>                 >     rep 15.15.2.10
>                 >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] aru e
>         high delivered e
>
>
>                 >     received flag 1
>                 >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Did
>         not need to
>                 >     originate any messages in recovery.
>                 >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM]
>         Sending initial ORF token
>
>
>                 >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM]
>         entering OPERATIONAL state.
>                 >     Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not
>         quorate.  Refusing
>                 >     connection.
>                 >     Jun 20 06:02:18 node1 ccsd[6090]: Error while
>         processing connect:
>
>
>                 >     Connection refused
>                 >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM]
>         entering GATHER state
>                 >     from 9.
>                 >     Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not
>         quorate.  Refusing
>                 >     connection.
>
>
>                 >
>                 >     And the quorum disk:
>                 >
>                 >     [root at node2 ~]# mkqdisk -L -d
>                 >     kqdisk v0.6.0
>                 >     /dev/mapper/vg_qdisk-lv_qdisk:
>                 >     /dev/vg_qdisk/lv_qdisk:
>                 >              Magic:                eb7a62c2
>
>
>                 >              Label:                cluster_qdisk
>                 >              Created:              Thu Jun  7 09:23:34
>         2012
>                 >              Host:                 node1
>                 >              Kernel Sector Size:   512
>
>
>                 >              Recorded Sector Size: 512
>                 >
>                 >     Status block for node 1
>                 >              Last updated by node 2
>                 >              Last updated on Wed Jun 20 06:17:23 2012
>                 >              State: Evicted
>
>
>                 >              Flags: 0000
>                 >              Score: 0/0
>                 >              Average Cycle speed: 0.000500 seconds
>                 >              Last Cycle speed: 0.000000 seconds
>                 >              Incarnation: 4fe1a06c4fe1a06c
>
>
>                 >     Status block for node 2
>                 >              Last updated by node 2
>                 >              Last updated on Wed Jun 20 07:09:38 2012
>                 >              State: Master
>                 >              Flags: 0000
>                 >              Score: 0/0
>
>
>                 >              Average Cycle speed: 0.001000 seconds
>                 >              Last Cycle speed: 0.000000 seconds
>                 >              Incarnation: 4fe1a06c4fe1a06c
>                 >
>                 >
>                 >     In the other node I don't see any errors in
>         /var/log/messages. One
>
>
>                 >     strange thing is that if I start cman on both
>         nodes at the same
>                 >     time, everything works fine and both nodes quorate
>         (until I reboot
>                 >     one node and the problem appears). I've checked
>         that multicast is
>
>
>                 >     working properly. With iperf I can send a receive
>         multicast paquets.
>                 >     Moreover I've seen with tcpdump the paquets that
>         openais send when
>                 >     cman is trying to start. I've readed about a bug
>         in RH 5.3 with the
>
>
>                 >     same behaviour, but it is solved in RH 5.4.
>                 >
>                 >     I don't have Selinux enabled, and Iptables are
>         also disabled. Here
>                 >     is the cluster.conf simplified (with less services
>         and resources). I
>
>
>                 >     want to point out one thing. I have allow_kill="0"
>         in order to avoid
>                 >     fencing errors when quorum tries to fence a failed
>         node. As <fence/>
>                 >     is empty, before this stanza I got a lot of
>         messages in
>
>
>                 >     /var/log/messages with failed fencing.
>                 >
>                 >     <?xml version="1.0"?>
>                 >     <cluster alias="test_cluster" config_version="15"
>         name="test_cluster">
>
>
>                 >              <fence_daemon clean_start="0"
>         post_fail_delay="0"
>                 >     post_join_delay="-1"/>
>                 >              <clusternodes>
>                 >                      <clusternode name="node1-hb"
>         nodeid="1" votes="1">
>
>
>                 >                              <fence/>
>                 >                      </clusternode>
>                 >                      <clusternode name="node2-hb"
>         nodeid="2" votes="1">
>                 >                              <fence/>
>
>
>                 >                      </clusternode>
>                 >              </clusternodes>
>                 >              <cman two_node="0" expected_votes="3"/>
>                 >              <fencedevices/>
>
>
>                 >
>                 >              <rm log_facility="local4" log_level="7">
>                 >                      <failoverdomains>
>                 >                              <failoverdomain
>         name="etest_cluster_fo"
>
>
>                 >     nofailback="1" ordered="1" restricted="1">
>                 >
>           <failoverdomainnode name="node1-hb"
>                 >     priority="1"/>
>
>
>                 >
>           <failoverdomainnode name="node2-hb"
>                 >     priority="2"/>
>                 >                              </failoverdomain>
>                 >                      </failoverdomains>
>
>
>                 >              <resources/>
>                 >              <service autostart="1"
>         domain="test_cluster_fo"
>                 >     exclusive="0" name="postgres" recovery="relocate">
>
>
>                 >                      <ip address="172.24.119.44"
>         monitor_link="1"/>
>                 >                      <lvm name="vg_postgres"
>         vg_name="vg_postgres"
>                 >     lv_name="postgres"/>
>
>
>                 >
>                 >                      <fs
>         device="/dev/vg_postgres/__postgres"
>                 >     force_fsck="1" force_unmount="1" fstype="ext3"
>                 >     mountpoint="/var/lib/pgsql" name="postgres"
>         self_fence="0"/>
>
>
>                 >
>                 >                      <script
>         file="/etc/init.d/postgresql" name="postgres">
>                 >                      </script>
>                 >              </service>
>                 >              </rm>
>
>
>                 >              <totem consensus="4000" join="60"
>         token="20000"
>                 >     token_retransmits_before_loss___const="20"/>
>                 >          <quorumd allow_kill="0" interval="1"
>         label="cluster_qdisk"
>
>
>                 >     tko="10" votes="1">
>                 >                      <heuristic
>                 >     program="/usr/share/cluster/__check_eth_link.sh
>         eth0" score="1"
>                 >     interval="2" tko="3"/>
>
>
>                 >              </quorumd>
>                 >       </cluster>
>                 >
>                 >
>                 >     The /etc/hosts:
>                 >     172.24.119.10 node1
>                 >     172.24.119.34 node2
>                 >     15.15.2.10 node1-hb node1-hb.localdomain
>
>
>                 >     15.15.2.11 node2-hb node2-hb.localdomain
>                 >
>                 >     And the versions:
>                 >     Red Hat Enterprise Linux Server release 5.7 (Tikanga)
>                 >     cman-2.0.115-85.el5
>                 >     rgmanager-2.0.52-21.el5
>
>
>                 >     openais-0.80.6-30.el5
>                 >
>                 >     I don't know what else I should try, so if you can
>         give me some
>                 >     ideas, I will be very pleased.
>                 >
>                 >     Regards, Javi.
>                 >
>                 >     --
>
>
>                 >     Linux-cluster mailing list
>                 >Linux-cluster at redhat.com
>         <mailto:Linux-cluster at redhat.com>
>           <mailto:Linux-cluster at redhat.__com
>         <mailto:Linux-cluster at redhat.com>>
>           <mailto:Linux-cluster at redhat.__com
>         <mailto:Linux-cluster at redhat.com>
>           <mailto:Linux-cluster at redhat.__com
>         <mailto:Linux-cluster at redhat.com>>>
>
>
>                 >https://www.redhat.com/__mailman/listinfo/linux-cluster
>         <https://www.redhat.com/mailman/listinfo/linux-cluster>
>
>                 >
>                 >
>                 >
>                 >
>                 > --
>                 > esta es mi vida e me la vivo hasta que dios quiera
>                 >
>                 > -- Linux-cluster mailing listLinux-cluster at redhat.com
>         <mailto:listLinux-cluster at redhat.com>
>           <mailto:Linux-cluster at redhat.__com
>         <mailto:Linux-cluster at redhat.com>>
>
>                 > <mailto:Linux-cluster at redhat.__com
>         <mailto:Linux-cluster at redhat.com>
>           <mailto:Linux-cluster at redhat.__com
>         <mailto:Linux-cluster at redhat.com>>>
>
>
>                 >https://www.redhat.com/__mailman/listinfo/linux-cluster
>         <https://www.redhat.com/mailman/listinfo/linux-cluster>
>                 >
>                 >
>                 > --
>                 > Linux-cluster mailing list
>                 >Linux-cluster at redhat.com
>         <mailto:Linux-cluster at redhat.com>
>           <mailto:Linux-cluster at redhat.__com
>         <mailto:Linux-cluster at redhat.com>>
>
>
>                 >https://www.redhat.com/__mailman/listinfo/linux-cluster
>         <https://www.redhat.com/mailman/listinfo/linux-cluster>
>                 >
>
>
>                 --
>                 Digimer
>
>                 Papers and Projects:https://alteeve.com
>
>
>             --
>             Linux-cluster mailing list
>         Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
>         <mailto:Linux-cluster at redhat.__com
>         <mailto:Linux-cluster at redhat.com>>
>         https://www.redhat.com/__mailman/listinfo/linux-cluster
>         <https://www.redhat.com/mailman/listinfo/linux-cluster>
>
>
>
>
>         --
>         esta es mi vida e me la vivo hasta que dios quiera
>
>
>         --
>         Linux-cluster mailing list
>         Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
>         https://www.redhat.com/__mailman/listinfo/linux-cluster
>         <https://www.redhat.com/mailman/listinfo/linux-cluster>
>
>
>
>     --
>     Digimer
>     Papers and Projects: https://alteeve.com
>
>
>     --
>     Linux-cluster mailing list
>     Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
>     https://www.redhat.com/__mailman/listinfo/linux-cluster
>     <https://www.redhat.com/mailman/listinfo/linux-cluster>
>
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


-- 
Digimer
Papers and Projects: https://alteeve.com




From emi2fast at gmail.com  Thu Jun 21 07:39:46 2012
From: emi2fast at gmail.com (emmanuel segura)
Date: Thu, 21 Jun 2012 09:39:46 +0200
Subject: [Linux-cluster]
	=?utf-8?q?Node_can=27t_join_already_quorated_clus?=
	=?utf-8?b?dGVy4oCP?=
In-Reply-To: <CAEAM5QVOnpzOf_-=TbBz9t3DmbxVcw=yUYVOA+fvd-OtKTxZng@mail.gmail.com>
References: <CAEAM5QU2KT04rWnGttkqaWWcfvUyJx=j32GaPudwQwbu3BikJQ@mail.gmail.com>
	<CAE7pJ3ArON1yCJvePZdKR+BJErJthJ98pe3AVjH287TOCC=1PA@mail.gmail.com>
	<4FE1EFC0.8070709@alteeve.ca>
	<CAEAM5QVOnpzOf_-=TbBz9t3DmbxVcw=yUYVOA+fvd-OtKTxZng@mail.gmail.com>
Message-ID: <CAE7pJ3D4wk58dg8AD4UjcQ9DW_mN=aYV2yW3z7Xa5OXqUd0AHQ@mail.gmail.com>

Hello Javier

use clustat -l on the node doesn't show the service


2012/6/20 Javier Vela <jvdiago at gmail.com>

> Hi, Finally I solved the problem. First, the qdisk over lvm does not work
> very well. Switching to a plain device works better. And, as you stated,
> without a fence device is not possible to get the cluster to work well, so
> I`m going to push VMWare admins and use vmware fencing.
>
> I'm very grateful, I've been working on this problem 3 days without
> understanding what was happening, and with only a few emails the problem is
> solved. The only thing that bothers me is why cman doesn't advise you that
> without a proper fencing the cluster won't work. Moreover I haven't found
> in the Red Hat documentati?n a statement telling what I've readed in the
> link you pasted:
>
> Fencing is a absolutely critical part of clustering. Without fully working
>> fence devices, your cluster will fail.
>>
>
>
> I'm a bit sorry, but now I have another problem. With the cluster quorate
> and the two nodes online + qdisk, when I start rgmanager on one node,
> everything works ok, an the service starts. Then I start rgmanager in the
> other node, but in the second node clustat doesn't show the service:
>
> node2 (with the service working):
>
> [root at node2 ~]# clustat
> Cluster Status for test_cluster @ Wed Jun 20 16:21:19 2012
> Member Status: Quorate
>
>  Member Name                             ID   Status
>  ------ ----                             ---- ------
> node1-hb                                  1 Online
> node2-hb                               2 Online, Local, rgmanager
>  /dev/disk/by-path/pci-0000:02:01.0-scsi-    0 Online, Quorum Disk
>
>  Service Name                   Owner (Last)                   State
>  ------- ----                   ----- ------                   -----
>  service:postgres                  node2-hb                  started
>
> node1 (doesn't see the service)
>
> [root at node1 ~]# clustat
> Cluster Status for test_cluster @ Wed Jun 20 16:21:15 2012
> Member Status: Quorate
>
>  Member Name                             ID   Status
>  ------ ----                             ---- ------
>  node1-hb                                  1 Online, Local
>  node2-hb                               2 Online
>  /dev/disk/by-path/pci-0000:02:01.0-scsi-    0 Online, Quorum Disk
>
> In the /var/log/messages  I don't see errors, only this:
> last message repeated X times
>
> ?What I'm missing? As far I can see, rgmanager doesn't appear in node1,
> but:
>
> [root at node1 ~]# service rgmanager status
> Se est?? ejecutando clurgmgrd (pid  8254)...
>
> The cluster conf:
>
> <?xml version="1.0"?>
> <cluster alias="test_cluster" config_version="15" name="test_cluster">
>         <fence_daemon clean_start="0" post_fail_delay="0"
> post_join_delay="6"/>
>         <clusternodes>
>                 <clusternode name="node1-hb" nodeid="1" votes="1">
>                        <fence>
>                                 <method name="manual">
>                                         <device name="fence_manual"
> nodename="node1-hb"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>                 <clusternode name="node2-hb" nodeid="2" votes="1">
>                         <fence>
>                                 <method name="manual">
>                                         <device name="fence_manual"
> nodename="node2-hb"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>         </clusternodes>
>         <cman two_node="0" expected_votes="3"/>
>         <fencedevices>
>                 <fencedevice agent="fence_manual" name="fence_manual"/>
>         </fencedevices>
>
>
>         <rm log_facility="local4" log_level="7">
>                 <failoverdomains>
>                         <failoverdomain name="etest_cluster_fo"
> nofailback="1" ordered="1" restricted="1">
>
>
>   <failoverdomainnode name="node1-hb" priority="1"/>
>                                 <failoverdomainnode name="node2-hb"
> priority="2"/>
>                         </failoverdomain>
>                 </failoverdomains>
>         <resources/>
>         <service autostart="1" domain="test_cluster_fo" exclusive="0"
> name="postgres" recovery="relocate">
>                 <ip address="172.24.119.44" monitor_link="1"/>
>                 <lvm name="vg_postgres" vg_name="vg_postgres"
> lv_name="postgres"/>
>
>                 <fs device="/dev/vg_postgres/postgres" force_fsck="1"
> force_unmount="1" fstype="ext3" mountpoint="/var/lib/pgsql" name="postgres"
> self_fence="0"/>
>
>                 <script file="/etc/init.d/postgresql" name="postgres">
>                 </script>
>         </service>
>         </rm>
>         <totem consensus="4000" join="60" token="20000"
> token_retransmits_before_loss_const="20"/>
>     <quorumd  interval="1" label="cluster_qdisk" tko="10" votes="1">
>                 <heuristic program="/usr/share/cluster/check_eth_link.sh
> eth0" score="1" interval="2" tko="3"/>
>         </quorumd>
>  </cluster>
>
>
>
> Regards, Javi.
>
>
> 2012/6/20 Digimer <lists at alteeve.ca>
>
>> It's worth re-stating;
>>
>> You are running an unsupported configuration. Please try to have the
>> VMWare admins enable fence calls against your nodes and setup fencing.
>> Until and unless you do, you will almost certainly run into problems, up to
>> and including corrupting your data.
>>
>> Please take a minute to read this:
>>
>> https://alteeve.com/w/2-Node_**Red_Hat_KVM_Cluster_Tutorial#**
>> Concept.3B_Fencing<https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Concept.3B_Fencing>
>>
>> Digimer
>>
>>
>> On 06/20/2012 11:22 AM, emmanuel segura wrote:
>>
>>> Ok Javier
>>>
>>> So now i know you don't wanna the fencing and the reason :-)
>>>
>>> <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="-1"/>
>>>
>>> and use the fence_manual
>>>
>>>
>>>
>>> 2012/6/20 Javier Vela <jvdiago at gmail.com <mailto:jvdiago at gmail.com>>
>>>
>>>
>>>    I don't use fencing because with ha-lvm I thought that I dind't need
>>>    it. But also because both nodes are VMs in VMWare. I know that there
>>>    is a module to do fencing with vmware but I prefer to avoid it. I'm
>>>    not in control of the VMWare infraestructure and probably VMWare
>>>    admins won't give me the tools to use this module.
>>>
>>>    Regards, Javi
>>>
>>>        Fencing is critical, and running a cluster without fencing, even
>>> with
>>>
>>>
>>>        qdisk, is not supported. Manual fencing is also not supported. The
>>>        *only* way to have a reliable cluster, testing or production, is
>>> to use
>>>        fencing.
>>>
>>>        Why do you not wish to use it?
>>>
>>>        On 06/20/2012 09:43 AM, Javier Vela wrote:
>>>
>>>
>>>        > As I readed, if you use HA-LVM you don't need fencing because
>>> of vg
>>>        > tagging. Is It absolutely mandatory to use fencing with qdisk?
>>>        >
>>>        > If it is, i supose i can use manual_fence, but in production I
>>> also
>>>
>>>
>>>        > won't use fencing.
>>>        >
>>>        > Regards, Javi.
>>>        >
>>>        > Date: Wed, 20 Jun 2012 14:45:28 +0200
>>>        > From:emi2fast at gmail.com  <mailto:emi2fast at gmail.com>  <mailto:
>>> emi2fast at gmail.com  <mailto:emi2fast at gmail.com>>
>>>
>>>
>>>        > To:linux-cluster at redhat.com  <mailto:linux-cluster at redhat.**com<linux-cluster at redhat.com>>
>>>  <mailto:linux-cluster at redhat.**com <linux-cluster at redhat.com>  <mailto:
>>> linux-cluster at redhat.**com <linux-cluster at redhat.com>>>
>>>
>>>        > Subject: Re: [Linux-cluster] Node can't join already quorated
>>> cluster
>>>
>>>
>>>        >
>>>        > If you don't wanna use a real fence divice, because you only do
>>> some
>>>        > test, you have to use fence_manual agent
>>>        >
>>>        > 2012/6/20 Javier Vela <jvdiago at gmail.com  <mailto:
>>> jvdiago at gmail.com>  <mailto:jvdiago at gmail.com  <mailto:jvdiago at gmail.com
>>> >>>
>>>
>>>
>>>
>>>        >
>>>        >     Hi, I have a very strange problem, and after searching
>>> through lot
>>>        >     of forums, I haven't found the solution. This is the
>>> scenario:
>>>        >
>>>        >     Two node cluster with Red Hat 5.7, HA-LVM, no fencing and
>>> quorum
>>>
>>>
>>>        >     disk. I start qdiskd, cman and rgmanager on one node. After
>>> 5
>>>        >     minutes, finally the fencing finishes and cluster get
>>> quorate with 2
>>>        >     votes:
>>>        >
>>>        >     [root at node2 ~]# clustat
>>>        >     Cluster Status for test_cluster @ Wed Jun 20 05:56:39 2012
>>>
>>>
>>>        >     Member Status: Quorate
>>>        >
>>>        >       Member Name                             ID   Status
>>>        >       ------ ----                             ---- ------
>>>        >       node1-hb                                  1 Offline
>>>
>>>
>>>        >       node2-hb                               2 Online, Local,
>>> rgmanager
>>>        >       /dev/mapper/vg_qdisk-lv_qdisk               0 Online,
>>> Quorum Disk
>>>        >
>>>        >       Service Name                   Owner (Last)
>>>     State
>>>
>>>
>>>        >       ------- ----                   ----- ------
>>>     -----
>>>        >       service:postgres                   node2
>>>  started
>>>        >
>>>        >     Now, I start the second node. When cman reaches fencing, it
>>> hangs
>>>
>>>
>>>        >     for 5 minutes aprox, and finally fails. clustat says:
>>>        >
>>>        >     root at node1 ~]# clustat
>>>        >     Cluster Status for test_cluster @ Wed Jun 20 06:01:12 2012
>>>        >     Member Status: Inquorate
>>>        >
>>>
>>>
>>>        >       Member Name                             ID   Status
>>>        >       ------ ----                             ---- ------
>>>        >     node1-hb                                  1 Online, Local
>>>        >     node2-hb                               2 Offline
>>>
>>>
>>>        >       /dev/mapper/vg_qdisk-lv_qdisk               0 Offline
>>>        >
>>>        >     And in /var/log/messages I can see this errors:
>>>        >
>>>        >     Jun 20 06:02:12 node1 openais[6098]: [TOTEM] entering
>>> OPERATIONAL state.
>>>
>>>
>>>        >     Jun 20 06:02:12 node1 openais[6098]: [CLM  ] got nodejoin
>>> message
>>>        >     15.15.2.10
>>>        >     Jun 20 06:02:13 node1 dlm_controld[5386]: connect to ccs
>>> error -111,
>>>        >     check ccsd or cluster status
>>>
>>>
>>>        >     Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.
>>>  Refusing
>>>        >     connection.
>>>        >     Jun 20 06:02:13 node1 ccsd[6090]: Error while processing
>>> connect:
>>>        >     Connection refused
>>>        >     Jun 20 06:02:13 node1 ccsd[6090]: Initial status:: Inquorate
>>>
>>>
>>>        >     Jun 20 06:02:13 node1 gfs_controld[5392]: connect to ccs
>>> error -111,
>>>        >     check ccsd or cluster status
>>>        >     Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.
>>>  Refusing
>>>        >     connection.
>>>
>>>
>>>        >     Jun 20 06:02:13 node1 ccsd[6090]: Error while processing
>>> connect:
>>>        >     Connection refused
>>>        >     Jun 20 06:02:14 node1 openais[6098]: [TOTEM] entering
>>> GATHER state
>>>        >     from 9.
>>>        >     Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.
>>>  Refusing
>>>
>>>
>>>        >     connection.
>>>        >     Jun 20 06:02:14 node1 ccsd[6090]: Error while processing
>>> connect:
>>>        >     Connection refused
>>>        >     Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.
>>>  Refusing
>>>        >     connection.
>>>
>>>
>>>        >     Jun 20 06:02:14 node1 ccsd[6090]: Error while processing
>>> connect:
>>>        >     Connection refused
>>>        >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.
>>>  Refusing
>>>        >     connection.
>>>        >     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing
>>> connect:
>>>
>>>
>>>        >     Connection refused
>>>        >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.
>>>  Refusing
>>>        >     connection.
>>>        >     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing
>>> connect:
>>>        >     Connection refused
>>>
>>>
>>>        >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.
>>>  Refusing
>>>        >     connection.
>>>        >     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing
>>> connect:
>>>        >     Connection refused
>>>        >     Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.
>>>  Refusing
>>>
>>>
>>>        >     connection.
>>>        >     Jun 20 06:02:16 node1 ccsd[6090]: Error while processing
>>> connect:
>>>        >     Connection refused
>>>        >     Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.
>>>  Refusing
>>>        >     connection.
>>>
>>>
>>>        >     Jun 20 06:02:16 node1 ccsd[6090]: Error while processing
>>> connect:
>>>        >     Connection refused
>>>        >     Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.
>>>  Refusing
>>>        >     connection.
>>>        >     Jun 20 06:02:17 node1 ccsd[6090]: Error while processing
>>> connect:
>>>
>>>
>>>        >     Connection refused
>>>        >     Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.
>>>  Refusing
>>>        >     connection.
>>>        >     Jun 20 06:02:17 node1 ccsd[6090]: Error while processing
>>> connect:
>>>        >     Connection refused
>>>
>>>
>>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering
>>> GATHER state
>>>        >     from 0.
>>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Creating
>>> commit token
>>>        >     because I am the rep.
>>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Storing new
>>> sequence id
>>>
>>>
>>>        >     for ring 15c
>>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering
>>> COMMIT state.
>>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering
>>> RECOVERY state.
>>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] position [0]
>>> member
>>>
>>>
>>>        >     15.15.2.10 <http://15.15.2.10>:
>>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] previous ring
>>> seq 344
>>>        >     rep 15.15.2.10
>>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] aru e high
>>> delivered e
>>>
>>>
>>>        >     received flag 1
>>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Did not need to
>>>        >     originate any messages in recovery.
>>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Sending
>>> initial ORF token
>>>
>>>
>>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering
>>> OPERATIONAL state.
>>>        >     Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.
>>>  Refusing
>>>        >     connection.
>>>        >     Jun 20 06:02:18 node1 ccsd[6090]: Error while processing
>>> connect:
>>>
>>>
>>>        >     Connection refused
>>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering
>>> GATHER state
>>>        >     from 9.
>>>        >     Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.
>>>  Refusing
>>>        >     connection.
>>>
>>>
>>>        >
>>>        >     And the quorum disk:
>>>        >
>>>        >     [root at node2 ~]# mkqdisk -L -d
>>>        >     kqdisk v0.6.0
>>>        >     /dev/mapper/vg_qdisk-lv_qdisk:
>>>        >     /dev/vg_qdisk/lv_qdisk:
>>>        >              Magic:                eb7a62c2
>>>
>>>
>>>        >              Label:                cluster_qdisk
>>>        >              Created:              Thu Jun  7 09:23:34 2012
>>>        >              Host:                 node1
>>>        >              Kernel Sector Size:   512
>>>
>>>
>>>        >              Recorded Sector Size: 512
>>>        >
>>>        >     Status block for node 1
>>>        >              Last updated by node 2
>>>        >              Last updated on Wed Jun 20 06:17:23 2012
>>>        >              State: Evicted
>>>
>>>
>>>        >              Flags: 0000
>>>        >              Score: 0/0
>>>        >              Average Cycle speed: 0.000500 seconds
>>>        >              Last Cycle speed: 0.000000 seconds
>>>        >              Incarnation: 4fe1a06c4fe1a06c
>>>
>>>
>>>        >     Status block for node 2
>>>        >              Last updated by node 2
>>>        >              Last updated on Wed Jun 20 07:09:38 2012
>>>        >              State: Master
>>>        >              Flags: 0000
>>>        >              Score: 0/0
>>>
>>>
>>>        >              Average Cycle speed: 0.001000 seconds
>>>        >              Last Cycle speed: 0.000000 seconds
>>>        >              Incarnation: 4fe1a06c4fe1a06c
>>>        >
>>>        >
>>>        >     In the other node I don't see any errors in
>>> /var/log/messages. One
>>>
>>>
>>>        >     strange thing is that if I start cman on both nodes at the
>>> same
>>>        >     time, everything works fine and both nodes quorate (until I
>>> reboot
>>>        >     one node and the problem appears). I've checked that
>>> multicast is
>>>
>>>
>>>        >     working properly. With iperf I can send a receive multicast
>>> paquets.
>>>        >     Moreover I've seen with tcpdump the paquets that openais
>>> send when
>>>        >     cman is trying to start. I've readed about a bug in RH 5.3
>>> with the
>>>
>>>
>>>        >     same behaviour, but it is solved in RH 5.4.
>>>        >
>>>        >     I don't have Selinux enabled, and Iptables are also
>>> disabled. Here
>>>        >     is the cluster.conf simplified (with less services and
>>> resources). I
>>>
>>>
>>>        >     want to point out one thing. I have allow_kill="0" in order
>>> to avoid
>>>        >     fencing errors when quorum tries to fence a failed node. As
>>> <fence/>
>>>        >     is empty, before this stanza I got a lot of messages in
>>>
>>>
>>>        >     /var/log/messages with failed fencing.
>>>        >
>>>        >     <?xml version="1.0"?>
>>>        >     <cluster alias="test_cluster" config_version="15"
>>> name="test_cluster">
>>>
>>>
>>>        >              <fence_daemon clean_start="0" post_fail_delay="0"
>>>        >     post_join_delay="-1"/>
>>>        >              <clusternodes>
>>>        >                      <clusternode name="node1-hb" nodeid="1"
>>> votes="1">
>>>
>>>
>>>        >                              <fence/>
>>>        >                      </clusternode>
>>>        >                      <clusternode name="node2-hb" nodeid="2"
>>> votes="1">
>>>        >                              <fence/>
>>>
>>>
>>>        >                      </clusternode>
>>>        >              </clusternodes>
>>>        >              <cman two_node="0" expected_votes="3"/>
>>>        >              <fencedevices/>
>>>
>>>
>>>        >
>>>        >              <rm log_facility="local4" log_level="7">
>>>        >                      <failoverdomains>
>>>        >                              <failoverdomain
>>> name="etest_cluster_fo"
>>>
>>>
>>>        >     nofailback="1" ordered="1" restricted="1">
>>>        >                                      <failoverdomainnode
>>> name="node1-hb"
>>>        >     priority="1"/>
>>>
>>>
>>>        >                                      <failoverdomainnode
>>> name="node2-hb"
>>>        >     priority="2"/>
>>>        >                              </failoverdomain>
>>>        >                      </failoverdomains>
>>>
>>>
>>>        >              <resources/>
>>>        >              <service autostart="1" domain="test_cluster_fo"
>>>        >     exclusive="0" name="postgres" recovery="relocate">
>>>
>>>
>>>        >                      <ip address="172.24.119.44"
>>> monitor_link="1"/>
>>>        >                      <lvm name="vg_postgres"
>>> vg_name="vg_postgres"
>>>        >     lv_name="postgres"/>
>>>
>>>
>>>        >
>>>        >                      <fs device="/dev/vg_postgres/**postgres"
>>>        >     force_fsck="1" force_unmount="1" fstype="ext3"
>>>        >     mountpoint="/var/lib/pgsql" name="postgres" self_fence="0"/>
>>>
>>>
>>>        >
>>>        >                      <script file="/etc/init.d/postgresql"
>>> name="postgres">
>>>        >                      </script>
>>>        >              </service>
>>>        >              </rm>
>>>
>>>
>>>        >              <totem consensus="4000" join="60" token="20000"
>>>        >     token_retransmits_before_loss_**const="20"/>
>>>        >          <quorumd allow_kill="0" interval="1"
>>> label="cluster_qdisk"
>>>
>>>
>>>        >     tko="10" votes="1">
>>>        >                      <heuristic
>>>        >     program="/usr/share/cluster/**check_eth_link.sh eth0"
>>> score="1"
>>>        >     interval="2" tko="3"/>
>>>
>>>
>>>        >              </quorumd>
>>>        >       </cluster>
>>>        >
>>>        >
>>>        >     The /etc/hosts:
>>>        >     172.24.119.10 node1
>>>        >     172.24.119.34 node2
>>>        >     15.15.2.10 node1-hb node1-hb.localdomain
>>>
>>>
>>>        >     15.15.2.11 node2-hb node2-hb.localdomain
>>>        >
>>>        >     And the versions:
>>>        >     Red Hat Enterprise Linux Server release 5.7 (Tikanga)
>>>        >     cman-2.0.115-85.el5
>>>        >     rgmanager-2.0.52-21.el5
>>>
>>>
>>>        >     openais-0.80.6-30.el5
>>>        >
>>>        >     I don't know what else I should try, so if you can give me
>>> some
>>>        >     ideas, I will be very pleased.
>>>        >
>>>        >     Regards, Javi.
>>>        >
>>>        >     --
>>>
>>>
>>>        >     Linux-cluster mailing list
>>>        >Linux-cluster at redhat.com  <mailto:Linux-cluster at redhat.**com<Linux-cluster at redhat.com>>
>>>  <mailto:Linux-cluster at redhat.**com <Linux-cluster at redhat.com>  <mailto:
>>> Linux-cluster at redhat.**com <Linux-cluster at redhat.com>>>
>>>
>>>
>>>        >https://www.redhat.com/**mailman/listinfo/linux-cluster<https://www.redhat.com/mailman/listinfo/linux-cluster>
>>>
>>>        >
>>>        >
>>>        >
>>>        >
>>>        > --
>>>        > esta es mi vida e me la vivo hasta que dios quiera
>>>        >
>>>        > -- Linux-cluster mailing listLinux-cluster at redhat.com  <mailto:
>>> Linux-cluster at redhat.**com <Linux-cluster at redhat.com>>
>>>
>>>        > <mailto:Linux-cluster at redhat.**com <Linux-cluster at redhat.com> <mailto:
>>> Linux-cluster at redhat.**com <Linux-cluster at redhat.com>>>
>>>
>>>
>>>        >https://www.redhat.com/**mailman/listinfo/linux-cluster<https://www.redhat.com/mailman/listinfo/linux-cluster>
>>>        >
>>>        >
>>>        > --
>>>        > Linux-cluster mailing list
>>>        >Linux-cluster at redhat.com  <mailto:Linux-cluster at redhat.**com<Linux-cluster at redhat.com>
>>> >
>>>
>>>
>>>        >https://www.redhat.com/**mailman/listinfo/linux-cluster<https://www.redhat.com/mailman/listinfo/linux-cluster>
>>>        >
>>>
>>>
>>>        --
>>>        Digimer
>>>
>>>        Papers and Projects:https://alteeve.com
>>>
>>>
>>>    --
>>>    Linux-cluster mailing list
>>>    Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.**com<Linux-cluster at redhat.com>
>>> >
>>>    https://www.redhat.com/**mailman/listinfo/linux-cluster<https://www.redhat.com/mailman/listinfo/linux-cluster>
>>>
>>>
>>>
>>>
>>> --
>>> esta es mi vida e me la vivo hasta que dios quiera
>>>
>>>
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com
>>> https://www.redhat.com/**mailman/listinfo/linux-cluster<https://www.redhat.com/mailman/listinfo/linux-cluster>
>>>
>>>
>>
>> --
>> Digimer
>> Papers and Projects: https://alteeve.com
>>
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/**mailman/listinfo/linux-cluster<https://www.redhat.com/mailman/listinfo/linux-cluster>
>>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
esta es mi vida e me la vivo hasta que dios quiera
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120621/a2061ee1/attachment.htm>

From jvdiago at gmail.com  Thu Jun 21 10:56:36 2012
From: jvdiago at gmail.com (Javier Vela)
Date: Thu, 21 Jun 2012 12:56:36 +0200
Subject: [Linux-cluster]
	=?utf-8?q?Node_can=27t_join_already_quorated_clus?=
	=?utf-8?b?dGVy4oCP?=
In-Reply-To: <CAE7pJ3D4wk58dg8AD4UjcQ9DW_mN=aYV2yW3z7Xa5OXqUd0AHQ@mail.gmail.com>
References: <CAEAM5QU2KT04rWnGttkqaWWcfvUyJx=j32GaPudwQwbu3BikJQ@mail.gmail.com>
	<CAE7pJ3ArON1yCJvePZdKR+BJErJthJ98pe3AVjH287TOCC=1PA@mail.gmail.com>
	<4FE1EFC0.8070709@alteeve.ca>
	<CAEAM5QVOnpzOf_-=TbBz9t3DmbxVcw=yUYVOA+fvd-OtKTxZng@mail.gmail.com>
	<CAE7pJ3D4wk58dg8AD4UjcQ9DW_mN=aYV2yW3z7Xa5OXqUd0AHQ@mail.gmail.com>
Message-ID: <CAEAM5QU9aO3+7NPpLQ9ADSw1YbuOCoewadY7=34yaQC7+zc+1A@mail.gmail.com>

Hi,

Finally It works. I think, after all the tests, the cluster got a bit
unstable. I stopped the cluster software, rebooted both nodes, and started
all the software. Now both nodes are in the cluster with cman and rgmanager
running.

Thank you all for the help.

2012/6/21 emmanuel segura <emi2fast at gmail.com>

> Hello Javier
>
> use clustat -l on the node doesn't show the service
>
>
>
> 2012/6/20 Javier Vela <jvdiago at gmail.com>
>
>> Hi, Finally I solved the problem. First, the qdisk over lvm does not work
>> very well. Switching to a plain device works better. And, as you stated,
>> without a fence device is not possible to get the cluster to work well, so
>> I`m going to push VMWare admins and use vmware fencing.
>>
>> I'm very grateful, I've been working on this problem 3 days without
>> understanding what was happening, and with only a few emails the problem is
>> solved. The only thing that bothers me is why cman doesn't advise you that
>> without a proper fencing the cluster won't work. Moreover I haven't found
>> in the Red Hat documentati?n a statement telling what I've readed in the
>> link you pasted:
>>
>> Fencing is a absolutely critical part of clustering. Without fully
>>> working fence devices, your cluster will fail.
>>>
>>
>>
>> I'm a bit sorry, but now I have another problem. With the cluster quorate
>> and the two nodes online + qdisk, when I start rgmanager on one node,
>> everything works ok, an the service starts. Then I start rgmanager in the
>> other node, but in the second node clustat doesn't show the service:
>>
>> node2 (with the service working):
>>
>> [root at node2 ~]# clustat
>> Cluster Status for test_cluster @ Wed Jun 20 16:21:19 2012
>> Member Status: Quorate
>>
>>  Member Name                             ID   Status
>>  ------ ----                             ---- ------
>> node1-hb                                  1 Online
>> node2-hb                               2 Online, Local, rgmanager
>>  /dev/disk/by-path/pci-0000:02:01.0-scsi-    0 Online, Quorum Disk
>>
>>  Service Name                   Owner (Last)                   State
>>  ------- ----                   ----- ------                   -----
>>  service:postgres                  node2-hb                  started
>>
>> node1 (doesn't see the service)
>>
>> [root at node1 ~]# clustat
>> Cluster Status for test_cluster @ Wed Jun 20 16:21:15 2012
>> Member Status: Quorate
>>
>>  Member Name                             ID   Status
>>  ------ ----                             ---- ------
>>  node1-hb                                  1 Online, Local
>>  node2-hb                               2 Online
>>  /dev/disk/by-path/pci-0000:02:01.0-scsi-    0 Online, Quorum Disk
>>
>> In the /var/log/messages  I don't see errors, only this:
>> last message repeated X times
>>
>> ?What I'm missing? As far I can see, rgmanager doesn't appear in node1,
>> but:
>>
>> [root at node1 ~]# service rgmanager status
>> Se est?? ejecutando clurgmgrd (pid  8254)...
>>
>> The cluster conf:
>>
>> <?xml version="1.0"?>
>> <cluster alias="test_cluster" config_version="15" name="test_cluster">
>>         <fence_daemon clean_start="0" post_fail_delay="0"
>> post_join_delay="6"/>
>>         <clusternodes>
>>                 <clusternode name="node1-hb" nodeid="1" votes="1">
>>                        <fence>
>>                                 <method name="manual">
>>                                         <device name="fence_manual"
>> nodename="node1-hb"/>
>>                                 </method>
>>                         </fence>
>>                 </clusternode>
>>                 <clusternode name="node2-hb" nodeid="2" votes="1">
>>                         <fence>
>>                                 <method name="manual">
>>                                         <device name="fence_manual"
>> nodename="node2-hb"/>
>>                                 </method>
>>                         </fence>
>>                 </clusternode>
>>         </clusternodes>
>>         <cman two_node="0" expected_votes="3"/>
>>         <fencedevices>
>>                 <fencedevice agent="fence_manual" name="fence_manual"/>
>>         </fencedevices>
>>
>>
>>         <rm log_facility="local4" log_level="7">
>>                 <failoverdomains>
>>                         <failoverdomain name="etest_cluster_fo"
>> nofailback="1" ordered="1" restricted="1">
>>
>>
>>   <failoverdomainnode name="node1-hb" priority="1"/>
>>                                 <failoverdomainnode name="node2-hb"
>> priority="2"/>
>>                         </failoverdomain>
>>                 </failoverdomains>
>>         <resources/>
>>         <service autostart="1" domain="test_cluster_fo" exclusive="0"
>> name="postgres" recovery="relocate">
>>                 <ip address="172.24.119.44" monitor_link="1"/>
>>                 <lvm name="vg_postgres" vg_name="vg_postgres"
>> lv_name="postgres"/>
>>
>>                 <fs device="/dev/vg_postgres/postgres" force_fsck="1"
>> force_unmount="1" fstype="ext3" mountpoint="/var/lib/pgsql" name="postgres"
>> self_fence="0"/>
>>
>>                 <script file="/etc/init.d/postgresql" name="postgres">
>>                 </script>
>>         </service>
>>         </rm>
>>         <totem consensus="4000" join="60" token="20000"
>> token_retransmits_before_loss_const="20"/>
>>     <quorumd  interval="1" label="cluster_qdisk" tko="10" votes="1">
>>                 <heuristic program="/usr/share/cluster/check_eth_link.sh
>> eth0" score="1" interval="2" tko="3"/>
>>         </quorumd>
>>  </cluster>
>>
>>
>>
>> Regards, Javi.
>>
>>
>> 2012/6/20 Digimer <lists at alteeve.ca>
>>
>>> It's worth re-stating;
>>>
>>> You are running an unsupported configuration. Please try to have the
>>> VMWare admins enable fence calls against your nodes and setup fencing.
>>> Until and unless you do, you will almost certainly run into problems, up to
>>> and including corrupting your data.
>>>
>>> Please take a minute to read this:
>>>
>>> https://alteeve.com/w/2-Node_**Red_Hat_KVM_Cluster_Tutorial#**
>>> Concept.3B_Fencing<https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Concept.3B_Fencing>
>>>
>>> Digimer
>>>
>>>
>>> On 06/20/2012 11:22 AM, emmanuel segura wrote:
>>>
>>>> Ok Javier
>>>>
>>>> So now i know you don't wanna the fencing and the reason :-)
>>>>
>>>> <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="-1"/>
>>>>
>>>> and use the fence_manual
>>>>
>>>>
>>>>
>>>> 2012/6/20 Javier Vela <jvdiago at gmail.com <mailto:jvdiago at gmail.com>>
>>>>
>>>>
>>>>    I don't use fencing because with ha-lvm I thought that I dind't need
>>>>    it. But also because both nodes are VMs in VMWare. I know that there
>>>>    is a module to do fencing with vmware but I prefer to avoid it. I'm
>>>>    not in control of the VMWare infraestructure and probably VMWare
>>>>    admins won't give me the tools to use this module.
>>>>
>>>>    Regards, Javi
>>>>
>>>>        Fencing is critical, and running a cluster without fencing, even
>>>> with
>>>>
>>>>
>>>>        qdisk, is not supported. Manual fencing is also not supported.
>>>> The
>>>>        *only* way to have a reliable cluster, testing or production, is
>>>> to use
>>>>        fencing.
>>>>
>>>>        Why do you not wish to use it?
>>>>
>>>>        On 06/20/2012 09:43 AM, Javier Vela wrote:
>>>>
>>>>
>>>>        > As I readed, if you use HA-LVM you don't need fencing because
>>>> of vg
>>>>        > tagging. Is It absolutely mandatory to use fencing with qdisk?
>>>>        >
>>>>        > If it is, i supose i can use manual_fence, but in production I
>>>> also
>>>>
>>>>
>>>>        > won't use fencing.
>>>>        >
>>>>        > Regards, Javi.
>>>>        >
>>>>        > Date: Wed, 20 Jun 2012 14:45:28 +0200
>>>>        > From:emi2fast at gmail.com  <mailto:emi2fast at gmail.com>  <mailto:
>>>> emi2fast at gmail.com  <mailto:emi2fast at gmail.com>>
>>>>
>>>>
>>>>        > To:linux-cluster at redhat.com  <mailto:linux-cluster at redhat.**
>>>> com <linux-cluster at redhat.com>>  <mailto:linux-cluster at redhat.**com<linux-cluster at redhat.com> <mailto:
>>>> linux-cluster at redhat.**com <linux-cluster at redhat.com>>>
>>>>
>>>>        > Subject: Re: [Linux-cluster] Node can't join already quorated
>>>> cluster
>>>>
>>>>
>>>>        >
>>>>        > If you don't wanna use a real fence divice, because you only
>>>> do some
>>>>        > test, you have to use fence_manual agent
>>>>        >
>>>>        > 2012/6/20 Javier Vela <jvdiago at gmail.com  <mailto:
>>>> jvdiago at gmail.com>  <mailto:jvdiago at gmail.com  <mailto:
>>>> jvdiago at gmail.com>>>
>>>>
>>>>
>>>>
>>>>        >
>>>>        >     Hi, I have a very strange problem, and after searching
>>>> through lot
>>>>        >     of forums, I haven't found the solution. This is the
>>>> scenario:
>>>>        >
>>>>        >     Two node cluster with Red Hat 5.7, HA-LVM, no fencing and
>>>> quorum
>>>>
>>>>
>>>>        >     disk. I start qdiskd, cman and rgmanager on one node.
>>>> After 5
>>>>        >     minutes, finally the fencing finishes and cluster get
>>>> quorate with 2
>>>>        >     votes:
>>>>        >
>>>>        >     [root at node2 ~]# clustat
>>>>        >     Cluster Status for test_cluster @ Wed Jun 20 05:56:39 2012
>>>>
>>>>
>>>>        >     Member Status: Quorate
>>>>        >
>>>>        >       Member Name                             ID   Status
>>>>        >       ------ ----                             ---- ------
>>>>        >       node1-hb                                  1 Offline
>>>>
>>>>
>>>>        >       node2-hb                               2 Online, Local,
>>>> rgmanager
>>>>        >       /dev/mapper/vg_qdisk-lv_qdisk               0 Online,
>>>> Quorum Disk
>>>>        >
>>>>        >       Service Name                   Owner (Last)
>>>>       State
>>>>
>>>>
>>>>        >       ------- ----                   ----- ------
>>>>       -----
>>>>        >       service:postgres                   node2
>>>>  started
>>>>        >
>>>>        >     Now, I start the second node. When cman reaches fencing,
>>>> it hangs
>>>>
>>>>
>>>>        >     for 5 minutes aprox, and finally fails. clustat says:
>>>>        >
>>>>        >     root at node1 ~]# clustat
>>>>        >     Cluster Status for test_cluster @ Wed Jun 20 06:01:12 2012
>>>>        >     Member Status: Inquorate
>>>>        >
>>>>
>>>>
>>>>        >       Member Name                             ID   Status
>>>>        >       ------ ----                             ---- ------
>>>>        >     node1-hb                                  1 Online, Local
>>>>        >     node2-hb                               2 Offline
>>>>
>>>>
>>>>        >       /dev/mapper/vg_qdisk-lv_qdisk               0 Offline
>>>>        >
>>>>        >     And in /var/log/messages I can see this errors:
>>>>        >
>>>>        >     Jun 20 06:02:12 node1 openais[6098]: [TOTEM] entering
>>>> OPERATIONAL state.
>>>>
>>>>
>>>>        >     Jun 20 06:02:12 node1 openais[6098]: [CLM  ] got nodejoin
>>>> message
>>>>        >     15.15.2.10
>>>>        >     Jun 20 06:02:13 node1 dlm_controld[5386]: connect to ccs
>>>> error -111,
>>>>        >     check ccsd or cluster status
>>>>
>>>>
>>>>        >     Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.
>>>>  Refusing
>>>>        >     connection.
>>>>        >     Jun 20 06:02:13 node1 ccsd[6090]: Error while processing
>>>> connect:
>>>>        >     Connection refused
>>>>        >     Jun 20 06:02:13 node1 ccsd[6090]: Initial status::
>>>> Inquorate
>>>>
>>>>
>>>>        >     Jun 20 06:02:13 node1 gfs_controld[5392]: connect to ccs
>>>> error -111,
>>>>        >     check ccsd or cluster status
>>>>        >     Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.
>>>>  Refusing
>>>>        >     connection.
>>>>
>>>>
>>>>        >     Jun 20 06:02:13 node1 ccsd[6090]: Error while processing
>>>> connect:
>>>>        >     Connection refused
>>>>        >     Jun 20 06:02:14 node1 openais[6098]: [TOTEM] entering
>>>> GATHER state
>>>>        >     from 9.
>>>>        >     Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.
>>>>  Refusing
>>>>
>>>>
>>>>        >     connection.
>>>>        >     Jun 20 06:02:14 node1 ccsd[6090]: Error while processing
>>>> connect:
>>>>        >     Connection refused
>>>>        >     Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.
>>>>  Refusing
>>>>        >     connection.
>>>>
>>>>
>>>>        >     Jun 20 06:02:14 node1 ccsd[6090]: Error while processing
>>>> connect:
>>>>        >     Connection refused
>>>>        >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.
>>>>  Refusing
>>>>        >     connection.
>>>>        >     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing
>>>> connect:
>>>>
>>>>
>>>>        >     Connection refused
>>>>        >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.
>>>>  Refusing
>>>>        >     connection.
>>>>        >     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing
>>>> connect:
>>>>        >     Connection refused
>>>>
>>>>
>>>>        >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.
>>>>  Refusing
>>>>        >     connection.
>>>>        >     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing
>>>> connect:
>>>>        >     Connection refused
>>>>        >     Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.
>>>>  Refusing
>>>>
>>>>
>>>>        >     connection.
>>>>        >     Jun 20 06:02:16 node1 ccsd[6090]: Error while processing
>>>> connect:
>>>>        >     Connection refused
>>>>        >     Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.
>>>>  Refusing
>>>>        >     connection.
>>>>
>>>>
>>>>        >     Jun 20 06:02:16 node1 ccsd[6090]: Error while processing
>>>> connect:
>>>>        >     Connection refused
>>>>        >     Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.
>>>>  Refusing
>>>>        >     connection.
>>>>        >     Jun 20 06:02:17 node1 ccsd[6090]: Error while processing
>>>> connect:
>>>>
>>>>
>>>>        >     Connection refused
>>>>        >     Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.
>>>>  Refusing
>>>>        >     connection.
>>>>        >     Jun 20 06:02:17 node1 ccsd[6090]: Error while processing
>>>> connect:
>>>>        >     Connection refused
>>>>
>>>>
>>>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering
>>>> GATHER state
>>>>        >     from 0.
>>>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Creating
>>>> commit token
>>>>        >     because I am the rep.
>>>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Storing new
>>>> sequence id
>>>>
>>>>
>>>>        >     for ring 15c
>>>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering
>>>> COMMIT state.
>>>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering
>>>> RECOVERY state.
>>>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] position [0]
>>>> member
>>>>
>>>>
>>>>        >     15.15.2.10 <http://15.15.2.10>:
>>>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] previous ring
>>>> seq 344
>>>>        >     rep 15.15.2.10
>>>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] aru e high
>>>> delivered e
>>>>
>>>>
>>>>        >     received flag 1
>>>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Did not need
>>>> to
>>>>        >     originate any messages in recovery.
>>>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Sending
>>>> initial ORF token
>>>>
>>>>
>>>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering
>>>> OPERATIONAL state.
>>>>        >     Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.
>>>>  Refusing
>>>>        >     connection.
>>>>        >     Jun 20 06:02:18 node1 ccsd[6090]: Error while processing
>>>> connect:
>>>>
>>>>
>>>>        >     Connection refused
>>>>        >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering
>>>> GATHER state
>>>>        >     from 9.
>>>>        >     Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.
>>>>  Refusing
>>>>        >     connection.
>>>>
>>>>
>>>>        >
>>>>        >     And the quorum disk:
>>>>        >
>>>>        >     [root at node2 ~]# mkqdisk -L -d
>>>>        >     kqdisk v0.6.0
>>>>        >     /dev/mapper/vg_qdisk-lv_qdisk:
>>>>        >     /dev/vg_qdisk/lv_qdisk:
>>>>        >              Magic:                eb7a62c2
>>>>
>>>>
>>>>        >              Label:                cluster_qdisk
>>>>        >              Created:              Thu Jun  7 09:23:34 2012
>>>>        >              Host:                 node1
>>>>        >              Kernel Sector Size:   512
>>>>
>>>>
>>>>        >              Recorded Sector Size: 512
>>>>        >
>>>>        >     Status block for node 1
>>>>        >              Last updated by node 2
>>>>        >              Last updated on Wed Jun 20 06:17:23 2012
>>>>        >              State: Evicted
>>>>
>>>>
>>>>        >              Flags: 0000
>>>>        >              Score: 0/0
>>>>        >              Average Cycle speed: 0.000500 seconds
>>>>        >              Last Cycle speed: 0.000000 seconds
>>>>        >              Incarnation: 4fe1a06c4fe1a06c
>>>>
>>>>
>>>>        >     Status block for node 2
>>>>        >              Last updated by node 2
>>>>        >              Last updated on Wed Jun 20 07:09:38 2012
>>>>        >              State: Master
>>>>        >              Flags: 0000
>>>>        >              Score: 0/0
>>>>
>>>>
>>>>        >              Average Cycle speed: 0.001000 seconds
>>>>        >              Last Cycle speed: 0.000000 seconds
>>>>        >              Incarnation: 4fe1a06c4fe1a06c
>>>>        >
>>>>        >
>>>>        >     In the other node I don't see any errors in
>>>> /var/log/messages. One
>>>>
>>>>
>>>>        >     strange thing is that if I start cman on both nodes at the
>>>> same
>>>>        >     time, everything works fine and both nodes quorate (until
>>>> I reboot
>>>>        >     one node and the problem appears). I've checked that
>>>> multicast is
>>>>
>>>>
>>>>        >     working properly. With iperf I can send a receive
>>>> multicast paquets.
>>>>        >     Moreover I've seen with tcpdump the paquets that openais
>>>> send when
>>>>        >     cman is trying to start. I've readed about a bug in RH 5.3
>>>> with the
>>>>
>>>>
>>>>        >     same behaviour, but it is solved in RH 5.4.
>>>>        >
>>>>        >     I don't have Selinux enabled, and Iptables are also
>>>> disabled. Here
>>>>        >     is the cluster.conf simplified (with less services and
>>>> resources). I
>>>>
>>>>
>>>>        >     want to point out one thing. I have allow_kill="0" in
>>>> order to avoid
>>>>        >     fencing errors when quorum tries to fence a failed node.
>>>> As <fence/>
>>>>        >     is empty, before this stanza I got a lot of messages in
>>>>
>>>>
>>>>        >     /var/log/messages with failed fencing.
>>>>        >
>>>>        >     <?xml version="1.0"?>
>>>>        >     <cluster alias="test_cluster" config_version="15"
>>>> name="test_cluster">
>>>>
>>>>
>>>>        >              <fence_daemon clean_start="0" post_fail_delay="0"
>>>>        >     post_join_delay="-1"/>
>>>>        >              <clusternodes>
>>>>        >                      <clusternode name="node1-hb" nodeid="1"
>>>> votes="1">
>>>>
>>>>
>>>>        >                              <fence/>
>>>>        >                      </clusternode>
>>>>        >                      <clusternode name="node2-hb" nodeid="2"
>>>> votes="1">
>>>>        >                              <fence/>
>>>>
>>>>
>>>>        >                      </clusternode>
>>>>        >              </clusternodes>
>>>>        >              <cman two_node="0" expected_votes="3"/>
>>>>        >              <fencedevices/>
>>>>
>>>>
>>>>        >
>>>>        >              <rm log_facility="local4" log_level="7">
>>>>        >                      <failoverdomains>
>>>>        >                              <failoverdomain
>>>> name="etest_cluster_fo"
>>>>
>>>>
>>>>        >     nofailback="1" ordered="1" restricted="1">
>>>>        >                                      <failoverdomainnode
>>>> name="node1-hb"
>>>>        >     priority="1"/>
>>>>
>>>>
>>>>        >                                      <failoverdomainnode
>>>> name="node2-hb"
>>>>        >     priority="2"/>
>>>>        >                              </failoverdomain>
>>>>        >                      </failoverdomains>
>>>>
>>>>
>>>>        >              <resources/>
>>>>        >              <service autostart="1" domain="test_cluster_fo"
>>>>        >     exclusive="0" name="postgres" recovery="relocate">
>>>>
>>>>
>>>>        >                      <ip address="172.24.119.44"
>>>> monitor_link="1"/>
>>>>        >                      <lvm name="vg_postgres"
>>>> vg_name="vg_postgres"
>>>>        >     lv_name="postgres"/>
>>>>
>>>>
>>>>        >
>>>>        >                      <fs device="/dev/vg_postgres/**postgres"
>>>>        >     force_fsck="1" force_unmount="1" fstype="ext3"
>>>>        >     mountpoint="/var/lib/pgsql" name="postgres"
>>>> self_fence="0"/>
>>>>
>>>>
>>>>        >
>>>>        >                      <script file="/etc/init.d/postgresql"
>>>> name="postgres">
>>>>        >                      </script>
>>>>        >              </service>
>>>>        >              </rm>
>>>>
>>>>
>>>>        >              <totem consensus="4000" join="60" token="20000"
>>>>        >     token_retransmits_before_loss_**const="20"/>
>>>>        >          <quorumd allow_kill="0" interval="1"
>>>> label="cluster_qdisk"
>>>>
>>>>
>>>>        >     tko="10" votes="1">
>>>>        >                      <heuristic
>>>>        >     program="/usr/share/cluster/**check_eth_link.sh eth0"
>>>> score="1"
>>>>        >     interval="2" tko="3"/>
>>>>
>>>>
>>>>        >              </quorumd>
>>>>        >       </cluster>
>>>>        >
>>>>        >
>>>>        >     The /etc/hosts:
>>>>        >     172.24.119.10 node1
>>>>        >     172.24.119.34 node2
>>>>        >     15.15.2.10 node1-hb node1-hb.localdomain
>>>>
>>>>
>>>>        >     15.15.2.11 node2-hb node2-hb.localdomain
>>>>        >
>>>>        >     And the versions:
>>>>        >     Red Hat Enterprise Linux Server release 5.7 (Tikanga)
>>>>        >     cman-2.0.115-85.el5
>>>>        >     rgmanager-2.0.52-21.el5
>>>>
>>>>
>>>>        >     openais-0.80.6-30.el5
>>>>        >
>>>>        >     I don't know what else I should try, so if you can give me
>>>> some
>>>>        >     ideas, I will be very pleased.
>>>>        >
>>>>        >     Regards, Javi.
>>>>        >
>>>>        >     --
>>>>
>>>>
>>>>        >     Linux-cluster mailing list
>>>>        >Linux-cluster at redhat.com  <mailto:Linux-cluster at redhat.**com<Linux-cluster at redhat.com>>
>>>>  <mailto:Linux-cluster at redhat.**com <Linux-cluster at redhat.com> <mailto:
>>>> Linux-cluster at redhat.**com <Linux-cluster at redhat.com>>>
>>>>
>>>>
>>>>        >https://www.redhat.com/**mailman/listinfo/linux-cluster<https://www.redhat.com/mailman/listinfo/linux-cluster>
>>>>
>>>>        >
>>>>        >
>>>>        >
>>>>        >
>>>>        > --
>>>>        > esta es mi vida e me la vivo hasta que dios quiera
>>>>        >
>>>>        > -- Linux-cluster mailing listLinux-cluster at redhat.com <mailto:
>>>> Linux-cluster at redhat.**com <Linux-cluster at redhat.com>>
>>>>
>>>>        > <mailto:Linux-cluster at redhat.**com <Linux-cluster at redhat.com> <mailto:
>>>> Linux-cluster at redhat.**com <Linux-cluster at redhat.com>>>
>>>>
>>>>
>>>>        >https://www.redhat.com/**mailman/listinfo/linux-cluster<https://www.redhat.com/mailman/listinfo/linux-cluster>
>>>>        >
>>>>        >
>>>>        > --
>>>>        > Linux-cluster mailing list
>>>>        >Linux-cluster at redhat.com  <mailto:Linux-cluster at redhat.**com<Linux-cluster at redhat.com>
>>>> >
>>>>
>>>>
>>>>        >https://www.redhat.com/**mailman/listinfo/linux-cluster<https://www.redhat.com/mailman/listinfo/linux-cluster>
>>>>        >
>>>>
>>>>
>>>>        --
>>>>        Digimer
>>>>
>>>>        Papers and Projects:https://alteeve.com
>>>>
>>>>
>>>>    --
>>>>    Linux-cluster mailing list
>>>>    Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.**com<Linux-cluster at redhat.com>
>>>> >
>>>>    https://www.redhat.com/**mailman/listinfo/linux-cluster<https://www.redhat.com/mailman/listinfo/linux-cluster>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> esta es mi vida e me la vivo hasta que dios quiera
>>>>
>>>>
>>>> --
>>>> Linux-cluster mailing list
>>>> Linux-cluster at redhat.com
>>>> https://www.redhat.com/**mailman/listinfo/linux-cluster<https://www.redhat.com/mailman/listinfo/linux-cluster>
>>>>
>>>>
>>>
>>> --
>>> Digimer
>>> Papers and Projects: https://alteeve.com
>>>
>>>
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com
>>> https://www.redhat.com/**mailman/listinfo/linux-cluster<https://www.redhat.com/mailman/listinfo/linux-cluster>
>>>
>>
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
>
> --
> esta es mi vida e me la vivo hasta que dios quiera
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120621/c88a1e02/attachment.htm>

From yogsothoth at sinistar.org  Thu Jun 21 17:54:27 2012
From: yogsothoth at sinistar.org (Jay Tingle)
Date: Thu, 21 Jun 2012 13:54:27 -0400
Subject: [Linux-cluster] pvmove locking problem CLVM on RHEL 6
Message-ID: <20120621175427.GA771@black13.sinistar.org>

Hi All, 
    I am having a problem using pvmove during some testing with Red Hat
Cluster using CLVM on RHEL 6.2.  I have 3 nodes which are ESXi 5u1 VMs with the
'multi-writer' flag set for the shared vmdk devices. I keep getting locking
errors during the pvmove.  Everything else seems to be working great as far as
CLVM goes.  Searching through the list archives and consulting the manuals it
looks like all you need is to have cmirrord running.  The RHEL 6 manual
mentions cmirror-kmod which doesn't seem to exist anymore.  Is there still a
kernel module on RHEL 6?  I am standard clvm with ext4 in an active/passive
cluster.  Anyone know what I am doing wrong?  Below is my lvm config and my
cluster config.  Thanks in advance.

[root at rhc6esx1 ~]# rpm -qa|grep -i lvm
lvm2-libs-2.02.87-6.el6.x86_64
lvm2-2.02.87-6.el6.x86_64
lvm2-cluster-2.02.87-6.el6.x86_64
[root at rhc6esx1 ~]# rpm -q cman
cman-3.0.12.1-23.el6.x86_64
[root at rhc6esx1 ~]# rpm -q cmirror
cmirror-2.02.87-6.el6.x86_64

[root at rhc6esx1 ~]# ps -ef|grep cmirror
root     21253 20692  0 13:37 pts/1    00:00:00 grep cmirror
root     31858     1  0 13:18 ?        00:00:00 cmirrord
        
[root at rhc6esx1 ~]# pvs|grep cfq888dbvg
  /dev/sdf1  cfq888dbvg        lvm2 a--  20.00g     0
  /dev/sdi1  cfq888dbvg        lvm2 a--  20.00g     0
  /dev/sdj1  cfq888dbvg        lvm2 a--  20.00g     0
  /dev/sdk1  cfq888dbvg        lvm2 a--  80.00g 80.00g

[root at rhc6esx1 ~]# pvmove -v /dev/sdi1 /dev/sdk1
    Finding volume group "cfq888dbvg"
    Executing: /sbin/modprobe dm-log-userspace
    Archiving volume group "cfq888dbvg" metadata (seqno 7).
    Creating logical volume pvmove0
    Moving 5119 extents of logical volume cfq888dbvg/cfq888_db
  Error locking on node rhc6esx1-priv: Device or resource busy
  Error locking on node rhc6esx3-priv: Volume is busy on another node
  Error locking on node rhc6esx2-priv: Volume is busy on another node
  Failed to activate cfq888_db

[root at rhc6esx1 ~]# clustat
Cluster Status for rhc6 @ Thu Jun 21 13:35:49 2012
Member Status: Quorate

 Member Name                                              ID   Status
 ------ ----                                              ---- ------
 rhc6esx1-priv                                                1 Online, Local, rgmanager
 rhc6esx2-priv                                                2 Online, rgmanager
 rhc6esx3-priv                                                3 Online, rgmanager
 /dev/block/8:33                                              0 Online, Quorum Disk

 Service Name                                    Owner (Last)                                    State
 ------- ----                                    ----- ------                                    -----
 service:cfq888_grp                              rhc6esx1-priv                                   started

[root at rhc6esx1 ~]# lvm dumpconfig
  devices {
        dir="/dev"
        scan="/dev"
        obtain_device_list_from_udev=1
        preferred_names=["^/dev/mpath/", "^/dev/mapper/mpath", "^/dev/[hs]d"]
        filter="a/.*/"
        cache_dir="/etc/lvm/cache"
        cache_file_prefix=""
        write_cache_state=1
        sysfs_scan=1
        md_component_detection=1
        md_chunk_alignment=1
        data_alignment_detection=1
        data_alignment=0
        data_alignment_offset_detection=1
        ignore_suspended_devices=0
        disable_after_error_count=0
        require_restorefile_with_uuid=1
        pv_min_size=2048
        issue_discards=0
  }
  dmeventd {
        mirror_library="libdevmapper-event-lvm2mirror.so"
        snapshot_library="libdevmapper-event-lvm2snapshot.so"
  }
  activation {
        checks=0
        udev_sync=1
        udev_rules=1
        verify_udev_operations=0
        missing_stripe_filler="error"
        reserved_stack=256
        reserved_memory=8192
        process_priority=-18
        mirror_region_size=512
        readahead="auto"
        mirror_log_fault_policy="allocate"
        mirror_image_fault_policy="remove"
        snapshot_autoextend_threshold=100
        snapshot_autoextend_percent=20
        use_mlockall=0
        monitoring=1
        polling_interval=15
  }
  global {
        umask=63
        test=0
        units="h"
        si_unit_consistency=1
        activation=1
        proc="/proc"
        locking_type=3
        wait_for_locks=1
        fallback_to_clustered_locking=1
        fallback_to_local_locking=1
        locking_dir="/var/lock/lvm"
        prioritise_write_locks=1
        abort_on_internal_errors=0
        detect_internal_vg_cache_corruption=0
        metadata_read_only=0
        mirror_segtype_default="mirror"
  }
  shell {
        history_size=100
  }
  backup {
        backup=1
        backup_dir="/etc/lvm/backup"
        archive=1
        archive_dir="/etc/lvm/archive"
        retain_min=10
        retain_days=30
  }
  log {
        verbose=0
        syslog=1
        overwrite=0
        level=0
        indent=1
        command_names=0
        prefix="  "
  }


[root at rhc6esx1 ~]# ccs -h localhost --getconf
<cluster config_version="273" name="rhc6">
  <fence_daemon clean_start="0" post_fail_delay="20" post_join_delay="60"/>
  <clusternodes>
    <clusternode name="rhc6esx1-priv" nodeid="1">
      <fence>
        <method name="1">
          <device name="fence_vmware" uuid="422a2b6a-4093-2694-65e0-a01332ef54bd"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="rhc6esx2-priv" nodeid="2">
      <fence>
        <method name="1">
          <device name="fence_vmware" uuid="422a9c5d-f9e2-8150-340b-c84b834ba068"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="rhc6esx3-priv" nodeid="3">
      <fence>
        <method name="1">
          <device name="fence_vmware" uuid="422af24c-909f-187d-4e64-2a28cbe5d09d"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="5"/>
  <fencedevices>
    <fencedevice agent="fence_vmware_soap" ipaddr="192.168.1.111" login="mrfence" name="fence_vmware" passwd="FenceM3" ssl="yes" verbose="yes"/>
  </fencedevices>
  <totem token="30000"/>
  <quorumd interval="1" label="rhc6esx-quorum" stop_cman="1" tko="10" votes="2"/>
  <logging logfile_priority="info" syslog_facility="daemon" syslog_priority="warning" to_logfile="yes" to_syslog="yes">
    <logging_daemon logfile="/var/log/cluster/qdiskd.log" name="qdiskd"/>
    <logging_daemon logfile="/var/log/cluster/fenced.log" name="fenced"/>
    <logging_daemon logfile="/var/log/cluster/dlm_controld.log" name="dlm_controld"/>
    <logging_daemon logfile="/var/log/cluster/gfs_controld.log" name="gfs_controld"/>
    <logging_daemon debug="on" logfile="/var/log/cluster/rgmanager.log" name="rgmanager"/>
    <logging_daemon name="corosync" to_logfile="no"/>
  </logging>
  <rm log_level="7">
    <failoverdomains>
      <failoverdomain name="rhc6esx3_home" nofailback="1" ordered="1" restricted="1">
        <failoverdomainnode name="rhc6esx3-priv" priority="1"/>
        <failoverdomainnode name="rhc6esx2-priv" priority="2"/>
        <failoverdomainnode name="rhc6esx1-priv" priority="3"/>
      </failoverdomain>
      <failoverdomain name="rhc6esx2_home" nofailback="1" ordered="1" restricted="1">
        <failoverdomainnode name="rhc6esx2-priv" priority="1"/>
        <failoverdomainnode name="rhc6esx1-priv" priority="2"/>
        <failoverdomainnode name="rhc6esx3-priv" priority="3"/>
      </failoverdomain>
      <failoverdomain name="rhc6esx1_home" nofailback="1" ordered="1" restricted="1">
        <failoverdomainnode name="rhc6esx1-priv" priority="1"/>
        <failoverdomainnode name="rhc6esx2-priv" priority="2"/>
        <failoverdomainnode name="rhc6esx3-priv" priority="3"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <lvm name="cfq888vg_lvm" self_fence="1" vg_name="cfq888vg"/>
      <lvm name="cfq888bkpvg_lvm" self_fence="1" vg_name="cfq888bkpvg"/>
      <lvm name="cfq888dbvg_lvm" self_fence="1" vg_name="cfq888dbvg"/>
      <lvm name="cfq888revg_lvm" vg_name="cfq888revg"/>
      <lvm name="cfq888flashvg_lvm" self_fence="1" vg_name="cfq888flashvg"/>
      <ip address="192.168.1.31" monitor_link="1"/>
      <fs device="/dev/cfq888vg/cfq888" force_fsck="0" force_unmount="1" fstype="ext4" mountpoint="/cfq888" name="cfq888_mnt" self_fence="0"/>
      <fs device="/dev/cfq888vg/cfq888_ar" force_fsck="0" force_unmount="1" fstype="ext4" mountpoint="/cfq888/cfq888_ar" name="cfq888_ar_mnt" self_fence="0"/>
      <fs device="/dev/cfq888vg/cfq888_sw" force_fsck="0" force_unmount="1" fstype="ext4" mountpoint="/cfq888/cfq888_sw" name="cfq888_sw_mnt" self_fence="0"/>
      <fs device="/dev/cfq888bkpvg/cfq888_dmp" force_fsck="0" force_unmount="1" fstype="ext4" mountpoint="/cfq888/cfq888_dmp" name="cfq888_dmp_mnt" self_fence="0"/>
      <fs device="/dev/cfq888bkpvg/cfq888_bk" force_fsck="0" force_unmount="1" fstype="ext4" mountpoint="/cfq888/cfq888_bk" name="cfq888_bk_mnt" self_fence="0"/>
      <fs device="/dev/cfq888dbvg/cfq888_db" force_fsck="0" force_unmount="1" fstype="ext4" mountpoint="/cfq888/cfq888_db" name="cfq888_db_mnt" self_fence="0"/>
      <fs device="/dev/cfq888flashvg/cfq888_flash" force_fsck="0" force_unmount="1" fstype="ext4" mountpoint="/cfq888/cfq888_bk/cfq888_flash" name="cfq888_flash_mnt" self_fence="0"/>
      <fs device="/dev/cfq888revg/cfq888_rd" force_fsck="0" force_unmount="1" fstype="ext4" mountpoint="/cfq888/cfq888_rd" name="cfq888_rd_mnt" self_fence="0"/>
      <oracledb home="/u01/app/oracle/product/11.2.0/dbhome_1" listener_name="cfq888_lsnr" name="cfq888" type="base" user="oracle"/>
    </resources>
    <service autostart="1" domain="rhc6esx1_home" exclusive="0" name="cfq888_grp" recovery="restart">
      <lvm ref="cfq888vg_lvm"/>
      <lvm ref="cfq888bkpvg_lvm"/>
      <lvm ref="cfq888dbvg_lvm"/>
      <lvm ref="cfq888revg_lvm"/>
      <lvm ref="cfq888flashvg_lvm"/>
      <fs ref="cfq888_mnt">
        <fs ref="cfq888_ar_mnt"/>
        <fs ref="cfq888_sw_mnt"/>
        <fs ref="cfq888_dmp_mnt"/>
        <fs ref="cfq888_bk_mnt">
          <fs ref="cfq888_flash_mnt"/>
        </fs>
        <fs ref="cfq888_db_mnt"/>
        <fs ref="cfq888_rd_mnt"/>
      </fs>
      <ip ref="192.168.1.31"/>
      <oracledb ref="cfq888"/>
    </service>
  </rm>
</cluster>


thanks,
--Jason





From emi2fast at gmail.com  Fri Jun 22 07:37:29 2012
From: emi2fast at gmail.com (emmanuel segura)
Date: Fri, 22 Jun 2012 09:37:29 +0200
Subject: [Linux-cluster] pvmove locking problem CLVM on RHEL 6
In-Reply-To: <20120621175427.GA771@black13.sinistar.org>
References: <20120621175427.GA771@black13.sinistar.org>
Message-ID: <CAE7pJ3A2pcwcX9s02oZO+wVS67gam-1hrND_eADi7AytWGhRYA@mail.gmail.com>

Hello Jay

The error is so clear, you have lvm configured in exclusive mode, that
means you can't access to your vg from more then one node at a time
2012/6/21 Jay Tingle <yogsothoth at sinistar.org>

> Hi All,    I am having a problem using pvmove during some testing with Red
> Hat
> Cluster using CLVM on RHEL 6.2.  I have 3 nodes which are ESXi 5u1 VMs
> with the
> 'multi-writer' flag set for the shared vmdk devices. I keep getting locking
> errors during the pvmove.  Everything else seems to be working great as
> far as
> CLVM goes.  Searching through the list archives and consulting the manuals
> it
> looks like all you need is to have cmirrord running.  The RHEL 6 manual
> mentions cmirror-kmod which doesn't seem to exist anymore.  Is there still
> a
> kernel module on RHEL 6?  I am standard clvm with ext4 in an active/passive
> cluster.  Anyone know what I am doing wrong?  Below is my lvm config and my
> cluster config.  Thanks in advance.
>
> [root at rhc6esx1 ~]# rpm -qa|grep -i lvm
> lvm2-libs-2.02.87-6.el6.x86_64
> lvm2-2.02.87-6.el6.x86_64
> lvm2-cluster-2.02.87-6.el6.**x86_64
> [root at rhc6esx1 ~]# rpm -q cman
> cman-3.0.12.1-23.el6.x86_64
> [root at rhc6esx1 ~]# rpm -q cmirror
> cmirror-2.02.87-6.el6.x86_64
>
> [root at rhc6esx1 ~]# ps -ef|grep cmirror
> root     21253 20692  0 13:37 pts/1    00:00:00 grep cmirror
> root     31858     1  0 13:18 ?        00:00:00 cmirrord
>       [root at rhc6esx1 ~]# pvs|grep cfq888dbvg
>  /dev/sdf1  cfq888dbvg        lvm2 a--  20.00g     0
>  /dev/sdi1  cfq888dbvg        lvm2 a--  20.00g     0
>  /dev/sdj1  cfq888dbvg        lvm2 a--  20.00g     0
>  /dev/sdk1  cfq888dbvg        lvm2 a--  80.00g 80.00g
>
> [root at rhc6esx1 ~]# pvmove -v /dev/sdi1 /dev/sdk1
>   Finding volume group "cfq888dbvg"
>   Executing: /sbin/modprobe dm-log-userspace
>   Archiving volume group "cfq888dbvg" metadata (seqno 7).
>   Creating logical volume pvmove0
>   Moving 5119 extents of logical volume cfq888dbvg/cfq888_db
>  Error locking on node rhc6esx1-priv: Device or resource busy
>  Error locking on node rhc6esx3-priv: Volume is busy on another node
>  Error locking on node rhc6esx2-priv: Volume is busy on another node
>  Failed to activate cfq888_db
>
> [root at rhc6esx1 ~]# clustat
> Cluster Status for rhc6 @ Thu Jun 21 13:35:49 2012
> Member Status: Quorate
>
> Member Name                                              ID   Status
> ------ ----                                              ---- ------
> rhc6esx1-priv                                                1 Online,
> Local, rgmanager
> rhc6esx2-priv                                                2 Online,
> rgmanager
> rhc6esx3-priv                                                3 Online,
> rgmanager
> /dev/block/8:33                                              0 Online,
> Quorum Disk
>
> Service Name                                    Owner (Last)
>                      State
> ------- ----                                    ----- ------
>                      -----
> service:cfq888_grp                              rhc6esx1-priv
>                       started
>
> [root at rhc6esx1 ~]# lvm dumpconfig
>  devices {
>       dir="/dev"
>       scan="/dev"
>       obtain_device_list_from_udev=1
>       preferred_names=["^/dev/mpath/**", "^/dev/mapper/mpath",
> "^/dev/[hs]d"]
>       filter="a/.*/"
>       cache_dir="/etc/lvm/cache"
>       cache_file_prefix=""
>       write_cache_state=1
>       sysfs_scan=1
>       md_component_detection=1
>       md_chunk_alignment=1
>       data_alignment_detection=1
>       data_alignment=0
>       data_alignment_offset_**detection=1
>       ignore_suspended_devices=0
>       disable_after_error_count=0
>       require_restorefile_with_uuid=**1
>       pv_min_size=2048
>       issue_discards=0
>  }
>  dmeventd {
>       mirror_library="libdevmapper-**event-lvm2mirror.so"
>       snapshot_library="**libdevmapper-event-**lvm2snapshot.so"
>  }
>  activation {
>       checks=0
>       udev_sync=1
>       udev_rules=1
>       verify_udev_operations=0
>       missing_stripe_filler="error"
>       reserved_stack=256
>       reserved_memory=8192
>       process_priority=-18
>       mirror_region_size=512
>       readahead="auto"
>       mirror_log_fault_policy="**allocate"
>       mirror_image_fault_policy="**remove"
>       snapshot_autoextend_threshold=**100
>       snapshot_autoextend_percent=20
>       use_mlockall=0
>       monitoring=1
>       polling_interval=15
>  }
>  global {
>       umask=63
>       test=0
>       units="h"
>       si_unit_consistency=1
>       activation=1
>       proc="/proc"
>       locking_type=3
>       wait_for_locks=1
>       fallback_to_clustered_locking=**1
>       fallback_to_local_locking=1
>       locking_dir="/var/lock/lvm"
>       prioritise_write_locks=1
>       abort_on_internal_errors=0
>       detect_internal_vg_cache_**corruption=0
>       metadata_read_only=0
>       mirror_segtype_default="**mirror"
>  }
>  shell {
>       history_size=100
>  }
>  backup {
>       backup=1
>       backup_dir="/etc/lvm/backup"
>       archive=1
>       archive_dir="/etc/lvm/archive"
>       retain_min=10
>       retain_days=30
>  }
>  log {
>       verbose=0
>       syslog=1
>       overwrite=0
>       level=0
>       indent=1
>       command_names=0
>       prefix="  "
>  }
>
>
> [root at rhc6esx1 ~]# ccs -h localhost --getconf
> <cluster config_version="273" name="rhc6">
>  <fence_daemon clean_start="0" post_fail_delay="20" post_join_delay="60"/>
>  <clusternodes>
>   <clusternode name="rhc6esx1-priv" nodeid="1">
>     <fence>
>       <method name="1">
>         <device name="fence_vmware" uuid="422a2b6a-4093-2694-65e0-**
> a01332ef54bd"/>
>       </method>
>     </fence>
>   </clusternode>
>   <clusternode name="rhc6esx2-priv" nodeid="2">
>     <fence>
>       <method name="1">
>         <device name="fence_vmware" uuid="422a9c5d-f9e2-8150-340b-**
> c84b834ba068"/>
>       </method>
>     </fence>
>   </clusternode>
>   <clusternode name="rhc6esx3-priv" nodeid="3">
>     <fence>
>       <method name="1">
>         <device name="fence_vmware" uuid="422af24c-909f-187d-4e64-**
> 2a28cbe5d09d"/>
>       </method>
>     </fence>
>   </clusternode>
>  </clusternodes>
>  <cman expected_votes="5"/>
>  <fencedevices>
>   <fencedevice agent="fence_vmware_soap" ipaddr="192.168.1.111"
> login="mrfence" name="fence_vmware" passwd="FenceM3" ssl="yes"
> verbose="yes"/>
>  </fencedevices>
>  <totem token="30000"/>
>  <quorumd interval="1" label="rhc6esx-quorum" stop_cman="1" tko="10"
> votes="2"/>
>  <logging logfile_priority="info" syslog_facility="daemon"
> syslog_priority="warning" to_logfile="yes" to_syslog="yes">
>   <logging_daemon logfile="/var/log/cluster/**qdiskd.log" name="qdiskd"/>
>   <logging_daemon logfile="/var/log/cluster/**fenced.log" name="fenced"/>
>   <logging_daemon logfile="/var/log/cluster/dlm_**controld.log"
> name="dlm_controld"/>
>   <logging_daemon logfile="/var/log/cluster/gfs_**controld.log"
> name="gfs_controld"/>
>   <logging_daemon debug="on" logfile="/var/log/cluster/**rgmanager.log"
> name="rgmanager"/>
>   <logging_daemon name="corosync" to_logfile="no"/>
>  </logging>
>  <rm log_level="7">
>   <failoverdomains>
>     <failoverdomain name="rhc6esx3_home" nofailback="1" ordered="1"
> restricted="1">
>       <failoverdomainnode name="rhc6esx3-priv" priority="1"/>
>       <failoverdomainnode name="rhc6esx2-priv" priority="2"/>
>       <failoverdomainnode name="rhc6esx1-priv" priority="3"/>
>     </failoverdomain>
>     <failoverdomain name="rhc6esx2_home" nofailback="1" ordered="1"
> restricted="1">
>       <failoverdomainnode name="rhc6esx2-priv" priority="1"/>
>       <failoverdomainnode name="rhc6esx1-priv" priority="2"/>
>       <failoverdomainnode name="rhc6esx3-priv" priority="3"/>
>     </failoverdomain>
>     <failoverdomain name="rhc6esx1_home" nofailback="1" ordered="1"
> restricted="1">
>       <failoverdomainnode name="rhc6esx1-priv" priority="1"/>
>       <failoverdomainnode name="rhc6esx2-priv" priority="2"/>
>       <failoverdomainnode name="rhc6esx3-priv" priority="3"/>
>     </failoverdomain>
>   </failoverdomains>
>   <resources>
>     <lvm name="cfq888vg_lvm" self_fence="1" vg_name="cfq888vg"/>
>     <lvm name="cfq888bkpvg_lvm" self_fence="1" vg_name="cfq888bkpvg"/>
>     <lvm name="cfq888dbvg_lvm" self_fence="1" vg_name="cfq888dbvg"/>
>     <lvm name="cfq888revg_lvm" vg_name="cfq888revg"/>
>     <lvm name="cfq888flashvg_lvm" self_fence="1" vg_name="cfq888flashvg"/>
>     <ip address="192.168.1.31" monitor_link="1"/>
>     <fs device="/dev/cfq888vg/cfq888" force_fsck="0" force_unmount="1"
> fstype="ext4" mountpoint="/cfq888" name="cfq888_mnt" self_fence="0"/>
>     <fs device="/dev/cfq888vg/cfq888_**ar" force_fsck="0"
> force_unmount="1" fstype="ext4" mountpoint="/cfq888/cfq888_ar"
> name="cfq888_ar_mnt" self_fence="0"/>
>     <fs device="/dev/cfq888vg/cfq888_**sw" force_fsck="0"
> force_unmount="1" fstype="ext4" mountpoint="/cfq888/cfq888_sw"
> name="cfq888_sw_mnt" self_fence="0"/>
>     <fs device="/dev/cfq888bkpvg/**cfq888_dmp" force_fsck="0"
> force_unmount="1" fstype="ext4" mountpoint="/cfq888/cfq888_**dmp"
> name="cfq888_dmp_mnt" self_fence="0"/>
>     <fs device="/dev/cfq888bkpvg/**cfq888_bk" force_fsck="0"
> force_unmount="1" fstype="ext4" mountpoint="/cfq888/cfq888_bk"
> name="cfq888_bk_mnt" self_fence="0"/>
>     <fs device="/dev/cfq888dbvg/**cfq888_db" force_fsck="0"
> force_unmount="1" fstype="ext4" mountpoint="/cfq888/cfq888_db"
> name="cfq888_db_mnt" self_fence="0"/>
>     <fs device="/dev/cfq888flashvg/**cfq888_flash" force_fsck="0"
> force_unmount="1" fstype="ext4" mountpoint="/cfq888/cfq888_bk/**cfq888_flash"
> name="cfq888_flash_mnt" self_fence="0"/>
>     <fs device="/dev/cfq888revg/**cfq888_rd" force_fsck="0"
> force_unmount="1" fstype="ext4" mountpoint="/cfq888/cfq888_rd"
> name="cfq888_rd_mnt" self_fence="0"/>
>     <oracledb home="/u01/app/oracle/product/**11.2.0/dbhome_1"
> listener_name="cfq888_lsnr" name="cfq888" type="base" user="oracle"/>
>   </resources>
>   <service autostart="1" domain="rhc6esx1_home" exclusive="0"
> name="cfq888_grp" recovery="restart">
>     <lvm ref="cfq888vg_lvm"/>
>     <lvm ref="cfq888bkpvg_lvm"/>
>     <lvm ref="cfq888dbvg_lvm"/>
>     <lvm ref="cfq888revg_lvm"/>
>     <lvm ref="cfq888flashvg_lvm"/>
>     <fs ref="cfq888_mnt">
>       <fs ref="cfq888_ar_mnt"/>
>       <fs ref="cfq888_sw_mnt"/>
>       <fs ref="cfq888_dmp_mnt"/>
>       <fs ref="cfq888_bk_mnt">
>         <fs ref="cfq888_flash_mnt"/>
>       </fs>
>       <fs ref="cfq888_db_mnt"/>
>       <fs ref="cfq888_rd_mnt"/>
>     </fs>
>     <ip ref="192.168.1.31"/>
>     <oracledb ref="cfq888"/>
>   </service>
>  </rm>
> </cluster>
>
>
> thanks,
> --Jason
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/**mailman/listinfo/linux-cluster<https://www.redhat.com/mailman/listinfo/linux-cluster>
>



-- 
esta es mi vida e me la vivo hasta que dios quiera
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120622/33943b37/attachment.htm>

From Colin.Simpson at iongeo.com  Mon Jun 25 19:59:11 2012
From: Colin.Simpson at iongeo.com (Colin Simpson)
Date: Mon, 25 Jun 2012 19:59:11 +0000
Subject: [Linux-cluster] Nodes don't Power Off on halt
Message-ID: <1340654341.23233.42.camel@bhac.iouk.ioroot.tld>

Hi

On my 2 node clusters I have a UPS connected to them. This UPS is
programmed to shutdown both nodes of the clusters. Node 2 shutsdown
several minutes before node 1.

The fence mechanisms I have setup are primarily an APC network power
switch and for backup Dell DRAC with fence_ipmilan.

If you shutdown a node where the cluster services are chkconfig'd on,
and it withdraws cleanly from the cluster on shutdown it sits at system
halted and doesn't power off.

If no cluster services have ever been started (all chkconfig'd off) and
the system is shutdown, it halts and powers off.

Now I know the recommendation is for acpi to be turned off but it makes
no difference either way, and I have verified both fence mechanisms
power hard down the system acpid or not.

By not powering down, the UPS continues to have power drawn from it
until it's power off timer expires (that the last node send it in
halt.local). This has the effect of needlessly reducing battery capacity
and lowering protection from multiple consecutive power outages.

What causes the system not to power down, I realise this maybe by
design, but is it configurable?

Thanks

Colin


________________________________


This email and any files transmitted with it are confidential and are intended solely for the use of the individual or entity to whom they are addressed. If you are not the original recipient or the person responsible for delivering the email to the intended recipient, be advised that you have received this email in error, and that any use, dissemination, forwarding, printing, or copying of this email is strictly prohibited. If you received this email in error, please immediately notify the sender and delete the original.




From queszama at yahoo.in  Tue Jun 26 14:36:39 2012
From: queszama at yahoo.in (Zama Ques)
Date: Tue, 26 Jun 2012 22:36:39 +0800 (SGT)
Subject: [Linux-cluster] Clarifications needed on Setting up a two node HA
	cluster
Message-ID: <1340721399.31467.YahooMailNeo@web193006.mail.sg3.yahoo.com>

Hi ,

I need to setup a two node HA cluster on top of HP blade servers using Redhat Cluster Solution.? I have started going through the docs and have the following? doubts as of now. 

I am planning to build my two node setup based on the following architecture as shown in Fig 1.1 in http://www.centos.org/docs/5/pdf/Cluster_Administration.pdf?



As per t'he above fig, I am planning to build my setup as per the following .

1) Each of the nodes will have two interfaces. One interface say eth0 on both the nodes will be assigned private addresses and 
??? will be connected to a switch which will be used for cluster traffic.The other interface on both the nodes say eth1 will be 
??? assigned public ip address and will be connected to another switch which is connected to internet via a firewall/router . I 
? ? will also be assigning a virtual public ip address to the cluster by configuring a ip resource in conga. This ip address I 
?? will add it to ?Listen directive in apache configuration file ?so that apache listens on this ip address only to serve client 
??? requests. This ip address will also resolve to a registered domain name for our portal which we are going to serve by this 
??? setup. 

?? And as a prerequisite for conga setup , I will update /etc/hosts on both nodes by supplying FQDNs 
?? corresponding to private ip address assigned earlier on eth0 on both the nodes. 
?
2) Regarding storage , I am not sure as of now what kind of storage device will be used . If it is not a SAN storage 
than I will ?configure one of the partitions on the storage as iscsi target and will share it to both the cluster nodes. 
From both the cluster nodes , I will create volumes using clvm and will use GFS on top of it as file system . 

3)Regarding cluster resource , we will be using apache as one of the resource to serve http traffic. Will configure apache
using Conga and after configuration is done , will copy the httpd config file manually to the other cluster node . Will not 
start apache service on both cluster nodes and will leave it to cluster software to start services . Will also do chkconfig 
httpd off on both the nodes and will also not update /etc/fstab with the GFS file system leaving it to cluster node to handle?
mounting of a file system.


Sorry for writing too lenghthy? , but want to clear my doubts before starting up.?

Will be very much grateful if members can read my long mail with patience and reply back whether I am going in the right direction. 



Thanks in Advance
Zaman
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120626/7ab4fe64/attachment.htm>

From lists at alteeve.ca  Tue Jun 26 17:22:52 2012
From: lists at alteeve.ca (Digimer)
Date: Tue, 26 Jun 2012 13:22:52 -0400
Subject: [Linux-cluster] Clarifications needed on Setting up a two node
 HA cluster
In-Reply-To: <1340721399.31467.YahooMailNeo@web193006.mail.sg3.yahoo.com>
References: <1340721399.31467.YahooMailNeo@web193006.mail.sg3.yahoo.com>
Message-ID: <4FE9EFEC.4020907@alteeve.ca>

Is your primary concern load balancing or high availability?

On 06/26/2012 10:36 AM, Zama Ques wrote:
> Hi ,
>
> I need to setup a two node HA cluster on top of HP blade servers using
> Redhat Cluster Solution.  I have started going through the docs and have
> the following  doubts as of now.
>
> I am planning to build my two node setup based on the following
> architecture as shown in Fig 1.1 in
> http://www.centos.org/docs/5/pdf/Cluster_Administration.pdf
>
>
>
> As per t'he above fig, I am planning to build my setup as per the
> following .
>
> 1) Each of the nodes will have two interfaces. One interface say eth0 on
> both the nodes will be assigned private addresses and
>      will be connected to a switch which will be used for cluster
> traffic.The other interface on both the nodes say eth1 will be
>      assigned public ip address and will be connected to another switch
> which is connected to internet via a firewall/router . I
>      will also be assigning a virtual public ip address to the cluster
> by configuring a ip resource in conga. This ip address I
>     will add it to  Listen directive in apache configuration file  so
> that apache listens on this ip address only to serve client
>      requests. This ip address will also resolve to a registered domain
> name for our portal which we are going to serve by this
>      setup.
>
>     And as a prerequisite for conga setup , I will update /etc/hosts on
> both nodes by supplying FQDNs
>     corresponding to private ip address assigned earlier on eth0 on both
> the nodes.
>
> 2) Regarding storage , I am not sure as of now what kind of storage
> device will be used . If it is not a SAN storage
> than I will  configure one of the partitions on the storage as iscsi
> target and will share it to both the cluster nodes.
>  From both the cluster nodes , I will create volumes using clvm and will
> use GFS on top of it as file system .
>
> 3)Regarding cluster resource , we will be using apache as one of the
> resource to serve http traffic. Will configure apache
> using Conga and after configuration is done , will copy the httpd config
> file manually to the other cluster node . Will not
> start apache service on both cluster nodes and will leave it to cluster
> software to start services . Will also do chkconfig
> httpd off on both the nodes and will also not update /etc/fstab with the
> GFS file system leaving it to cluster node to handle
> mounting of a file system.
>
> Sorry for writing too lenghthy  , but want to clear my doubts before
> starting up.
>
> Will be very much grateful if members can read my long mail with
> patience and reply back whether I am going in the right direction.
>
>
> Thanks in Advance
> Zaman
>
>
>
>
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


-- 
Digimer
Papers and Projects: https://alteeve.com




From queszama at yahoo.in  Wed Jun 27 02:13:49 2012
From: queszama at yahoo.in (Zama Ques)
Date: Wed, 27 Jun 2012 10:13:49 +0800 (SGT)
Subject: [Linux-cluster] Clarifications needed on Setting up a two node
	HA cluster
In-Reply-To: <4FE9EFEC.4020907@alteeve.ca>
References: <1340721399.31467.YahooMailNeo@web193006.mail.sg3.yahoo.com>
	<4FE9EFEC.4020907@alteeve.ca>
Message-ID: <1340763229.86678.YahooMailNeo@web193006.mail.sg3.yahoo.com>

Primary concern is high availability only . 



________________________________
 From: Digimer <lists at alteeve.ca>
To: Zama Ques <queszama at yahoo.in>; linux clustering <linux-cluster at redhat.com> 
Sent: Tuesday, 26 June 2012 10:52 PM
Subject: Re: [Linux-cluster] Clarifications needed on Setting up a two node HA cluster
 
Is your primary concern load balancing or high availability?

On 06/26/2012 10:36 AM, Zama Ques wrote:
> Hi ,
>
> I need to setup a two node HA cluster on top of HP blade servers using
> Redhat Cluster Solution.? I have started going through the docs and have
> the following? doubts as of now.
>
> I am planning to build my two node setup based on the following
> architecture as shown in Fig 1.1 in
> http://www.centos.org/docs/5/pdf/Cluster_Administration.pdf
>
>
>
> As per t'he above fig, I am planning to build my setup as per the
> following .
>
> 1) Each of the nodes will have two interfaces. One interface say eth0 on
> both the nodes will be assigned private addresses and
>? ? ? will be connected to a switch which will be used for cluster
> traffic.The other interface on both the nodes say eth1 will be
>? ? ? assigned public ip address and will be connected to another switch
> which is connected to internet via a firewall/router . I
>? ? ? will also be assigning a virtual public ip address to the cluster
> by configuring a ip resource in conga. This ip address I
>? ?  will add it to? Listen directive in apache configuration file? so
> that apache listens on this ip address only to serve client
>? ? ? requests. This ip address will also resolve to a registered domain
> name for our portal which we are going to serve by this
>? ? ? setup.
>
>? ?  And as a prerequisite for conga setup , I will update /etc/hosts on
> both nodes by supplying FQDNs
>? ?  corresponding to private ip address assigned earlier on eth0 on both
> the nodes.
>
> 2) Regarding storage , I am not sure as of now what kind of storage
> device will be used . If it is not a SAN storage
> than I will? configure one of the partitions on the storage as iscsi
> target and will share it to both the cluster nodes.
>? From both the cluster nodes , I will create volumes using clvm and will
> use GFS on top of it as file system .
>
> 3)Regarding cluster resource , we will be using apache as one of the
> resource to serve http traffic. Will configure apache
> using Conga and after configuration is done , will copy the httpd config
> file manually to the other cluster node . Will not
> start apache service on both cluster nodes and will leave it to cluster
> software to start services . Will also do chkconfig
> httpd off on both the nodes and will also not update /etc/fstab with the
> GFS file system leaving it to cluster node to handle
> mounting of a file system.
>
> Sorry for writing too lenghthy? , but want to clear my doubts before
> starting up.
>
> Will be very much grateful if members can read my long mail with
> patience and reply back whether I am going in the right direction.
>
>
> Thanks in Advance
> Zaman
>
>
>
>
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


-- 
Digimer
Papers and Projects: https://alteeve.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120627/4b7bbf93/attachment.htm>

From lists at alteeve.ca  Wed Jun 27 02:46:17 2012
From: lists at alteeve.ca (Digimer)
Date: Tue, 26 Jun 2012 22:46:17 -0400
Subject: [Linux-cluster] Clarifications needed on Setting up a two node
 HA cluster
In-Reply-To: <1340763229.86678.YahooMailNeo@web193006.mail.sg3.yahoo.com>
References: <1340721399.31467.YahooMailNeo@web193006.mail.sg3.yahoo.com>
	<4FE9EFEC.4020907@alteeve.ca>
	<1340763229.86678.YahooMailNeo@web193006.mail.sg3.yahoo.com>
Message-ID: <4FEA73F9.7040105@alteeve.ca>

What I like to recommend then is to put the service on a virtual 
machine. Then make the VM itself the highly available service. The 
reason I prefer this is that the same setup can then be re-used for 
pretty much any other service on any operating system. The down side 
though it that recovery from a node failure takes however long it takes 
for the VM to reboot, which might be too long for you. I've got a 
tutorial for this kind of setup:

https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial

If you need to make the recovery faster, then you will want to make the 
apache service itself the HA service. The image you linked is from RHEL 
5. Be sure to use RHEL 6 docs.

You might want to look at DRBD as an alternative to a SAN if you want to 
keep costs down. This is, effectively, "RAID 1 over a network". The idea 
is that your storage backing your active node is replicated to the 
backup node. Should the primary fail, you'd "promote" the backup node's 
storage, start apache and take over the floating IP address.

If you want even faster fail-over, then you can run DRBD is "dual 
primary mode", use GFS2 on it and have apache running on both nodes all 
the time. Then the only thing you need to make the highly available 
service is the floating IP address.

You can configure the cluster using 'luci', which can be installed on a 
machine outside the cluster if you would like. Personally, I recommend 
people work with the core /etc/cluster/cluster.conf file as it helps you 
understand what is happening behind the scenes better.

Happy clustering. :)

On 06/26/2012 10:13 PM, Zama Ques wrote:
> Primary concern is high availability only .
>
> ------------------------------------------------------------------------
> *From:* Digimer <lists at alteeve.ca>
> *To:* Zama Ques <queszama at yahoo.in>; linux clustering
> <linux-cluster at redhat.com>
> *Sent:* Tuesday, 26 June 2012 10:52 PM
> *Subject:* Re: [Linux-cluster] Clarifications needed on Setting up a two
> node HA cluster
>
> Is your primary concern load balancing or high availability?
>
> On 06/26/2012 10:36 AM, Zama Ques wrote:
>  > Hi ,
>  >
>  > I need to setup a two node HA cluster on top of HP blade servers using
>  > Redhat Cluster Solution.  I have started going through the docs and have
>  > the following  doubts as of now.
>  >
>  > I am planning to build my two node setup based on the following
>  > architecture as shown in Fig 1.1 in
>  > http://www.centos.org/docs/5/pdf/Cluster_Administration.pdf
>  >
>  >
>  >
>  > As per t'he above fig, I am planning to build my setup as per the
>  > following .
>  >
>  > 1) Each of the nodes will have two interfaces. One interface say eth0 on
>  > both the nodes will be assigned private addresses and
>  >      will be connected to a switch which will be used for cluster
>  > traffic.The other interface on both the nodes say eth1 will be
>  >      assigned public ip address and will be connected to another switch
>  > which is connected to internet via a firewall/router . I
>  >      will also be assigning a virtual public ip address to the cluster
>  > by configuring a ip resource in conga. This ip address I
>  >    will add it to  Listen directive in apache configuration file  so
>  > that apache listens on this ip address only to serve client
>  >      requests. This ip address will also resolve to a registered domain
>  > name for our portal which we are going to serve by this
>  >      setup.
>  >
>  >    And as a prerequisite for conga setup , I will update /etc/hosts on
>  > both nodes by supplying FQDNs
>  >    corresponding to private ip address assigned earlier on eth0 on both
>  > the nodes.
>  >
>  > 2) Regarding storage , I am not sure as of now what kind of storage
>  > device will be used . If it is not a SAN storage
>  > than I will  configure one of the partitions on the storage as iscsi
>  > target and will share it to both the cluster nodes.
>  >  From both the cluster nodes , I will create volumes using clvm and will
>  > use GFS on top of it as file system .
>  >
>  > 3)Regarding cluster resource , we will be using apache as one of the
>  > resource to serve http traffic. Will configure apache
>  > using Conga and after configuration is done , will copy the httpd config
>  > file manually to the other cluster node . Will not
>  > start apache service on both cluster nodes and will leave it to cluster
>  > software to start services . Will also do chkconfig
>  > httpd off on both the nodes and will also not update /etc/fstab with the
>  > GFS file system leaving it to cluster node to handle
>  > mounting of a file system.
>  >
>  > Sorry for writing too lenghthy  , but want to clear my doubts before
>  > starting up.
>  >
>  > Will be very much grateful if members can read my long mail with
>  > patience and reply back whether I am going in the right direction.
>  >
>  >
>  > Thanks in Advance
>  > Zaman
>  >
>  >
>  >
>  >
>  >
>  >
>  >
>  > --
>  > Linux-cluster mailing list
>  > Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
>  > https://www.redhat.com/mailman/listinfo/linux-cluster
>  >
>
>
> --
> Digimer
> Papers and Projects: https://alteeve.com <https://alteeve.com/>
>
>
>
>


-- 
Digimer
Papers and Projects: https://alteeve.com




From queszama at yahoo.in  Wed Jun 27 03:23:35 2012
From: queszama at yahoo.in (Zama Ques)
Date: Wed, 27 Jun 2012 11:23:35 +0800 (SGT)
Subject: [Linux-cluster] Clarifications needed on Setting up a two node
	HA cluster
In-Reply-To: <4FEA73F9.7040105@alteeve.ca>
References: <1340721399.31467.YahooMailNeo@web193006.mail.sg3.yahoo.com>
	<4FE9EFEC.4020907@alteeve.ca>
	<1340763229.86678.YahooMailNeo@web193006.mail.sg3.yahoo.com>
	<4FEA73F9.7040105@alteeve.ca>
Message-ID: <1340767415.52186.YahooMailNeo@web193005.mail.sg3.yahoo.com>

Thanks Digimer for replying back . Your suggestion feels great , but for us it is as per client requirement and so we need to go for a two node setup. I am yet to be clear on our storage part. 


I will take care of going through Redhat6 docs.?

I also have some confusion on configuring ip address for the cluster as mentioned in my original mail .? Pasting the contents once more. 


"1) Each of the nodes will have two interfaces. One interface say eth0 on
>? > both the nodes will be assigned private addresses and
>? >? ? ? will be connected to a switch which will be used for cluster
>? > traffic.The other interface on both the nodes say eth1 will be
>? >? ? ? assigned public ip address and will be connected to another switch
>? > which is connected to internet via a firewall/router . I
>? >? ? ? will also be assigning a virtual public ip address to the cluster
>? > by configuring a ip resource in conga. This ip address I
>? >? ? will add it to? Listen directive in apache configuration file? so
>? > that apache listens on this ip address only to serve client
>? >? ? ? requests. This ip address will? resolve to a registered domain
>? > name for our portal which we are going to serve by this
>? >? ? ? setup.



Will be great if you let me know whether configuring ip address for the cluster as per above notes is the right way of doing it. 


Thanks 

Zaman



________________________________
 From: Digimer <lists at alteeve.ca>
To: Zama Ques <queszama at yahoo.in> 
Cc: linux clustering <linux-cluster at redhat.com> 
Sent: Wednesday, 27 June 2012 8:16 AM
Subject: Re: [Linux-cluster] Clarifications needed on Setting up a two node HA cluster
 
What I like to recommend then is to put the service on a virtual machine. Then make the VM itself the highly available service. The reason I prefer this is that the same setup can then be re-used for pretty much any other service on any operating system. The down side though it that recovery from a node failure takes however long it takes for the VM to reboot, which might be too long for you. I've got a tutorial for this kind of setup:

https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial

If you need to make the recovery faster, then you will want to make the apache service itself the HA service. The image you linked is from RHEL 5. Be sure to use RHEL 6 docs.

You might want to look at DRBD as an alternative to a SAN if you want to keep costs down. This is, effectively, "RAID 1 over a network". The idea is that your storage backing your active node is replicated to the backup node. Should the primary fail, you'd "promote" the backup node's storage, start apache and take over the floating IP address.

If you want even faster fail-over, then you can run DRBD is "dual primary mode", use GFS2 on it and have apache running on both nodes all the time. Then the only thing you need to make the highly available service is the floating IP address.

You can configure the cluster using 'luci', which can be installed on a machine outside the cluster if you would like. Personally, I recommend people work with the core /etc/cluster/cluster.conf file as it helps you understand what is happening behind the scenes better.

Happy clustering. :)

On 06/26/2012 10:13 PM, Zama Ques wrote:
> Primary concern is high availability only .
> 
> ------------------------------------------------------------------------
> *From:* Digimer <lists at alteeve.ca>
> *To:* Zama Ques <queszama at yahoo.in>; linux clustering
> <linux-cluster at redhat.com>
> *Sent:* Tuesday, 26 June 2012 10:52 PM
> *Subject:* Re: [Linux-cluster] Clarifications needed on Setting up a two
> node HA cluster
> 
> Is your primary concern load balancing or high availability?
> 
> On 06/26/2012 10:36 AM, Zama Ques wrote:
>? > Hi ,
>? >
>? > I need to setup a two node HA cluster on top of HP blade servers using
>? > Redhat Cluster Solution.? I have started going through the docs and have
>? > the following? doubts as of now.
>? >
>? > I am planning to build my two node setup based on the following
>? > architecture as shown in Fig 1.1 in
>? > http://www.centos.org/docs/5/pdf/Cluster_Administration.pdf
>? >
>? >
>? >
>? > As per t'he above fig, I am planning to build my setup as per the
>? > following .
>? >
>? > 1) Each of the nodes will have two interfaces. One interface say eth0 on
>? > both the nodes will be assigned private addresses and
>? >? ? ? will be connected to a switch which will be used for cluster
>? > traffic.The other interface on both the nodes say eth1 will be
>? >? ? ? assigned public ip address and will be connected to another switch
>? > which is connected to internet via a firewall/router . I
>? >? ? ? will also be assigning a virtual public ip address to the cluster
>? > by configuring a ip resource in conga. This ip address I
>? >? ? will add it to? Listen directive in apache configuration file? so
>? > that apache listens on this ip address only to serve client
>? >? ? ? requests. This ip address will also resolve to a registered domain
>? > name for our portal which we are going to serve by this
>? >? ? ? setup.
>? >
>? >? ? And as a prerequisite for conga setup , I will update /etc/hosts on
>? > both nodes by supplying FQDNs
>? >? ? corresponding to private ip address assigned earlier on eth0 on both
>? > the nodes.
>? >
>? > 2) Regarding storage , I am not sure as of now what kind of storage
>? > device will be used . If it is not a SAN storage
>? > than I will? configure one of the partitions on the storage as iscsi
>? > target and will share it to both the cluster nodes.
>? >? From both the cluster nodes , I will create volumes using clvm and will
>? > use GFS on top of it as file system .
>? >
>? > 3)Regarding cluster resource , we will be using apache as one of the
>? > resource to serve http traffic. Will configure apache
>? > using Conga and after configuration is done , will copy the httpd config
>? > file manually to the other cluster node . Will not
>? > start apache service on both cluster nodes and will leave it to cluster
>? > software to start services . Will also do chkconfig
>? > httpd off on both the nodes and will also not update /etc/fstab with the
>? > GFS file system leaving it to cluster node to handle
>? > mounting of a file system.
>? >
>? > Sorry for writing too lenghthy? , but want to clear my doubts before
>? > starting up.
>? >
>? > Will be very much grateful if members can read my long mail with
>? > patience and reply back whether I am going in the right direction.
>? >
>? >
>? > Thanks in Advance
>? > Zaman
>? >
>? >
>? >
>? >
>? >
>? >
>? >
>? > --
>? > Linux-cluster mailing list
>? > Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
>? > https://www.redhat.com/mailman/listinfo/linux-cluster
>? >
> 
> 
> --
> Digimer
> Papers and Projects: https://alteeve.com <https://alteeve.com/>
> 
> 
> 
> 


-- Digimer
Papers and Projects: https://alteeve.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120627/eddb0184/attachment.htm>

From lipson12 at yahoo.com  Wed Jun 27 03:56:15 2012
From: lipson12 at yahoo.com (Kaisar Ahmed Khan)
Date: Tue, 26 Jun 2012 20:56:15 -0700 (PDT)
Subject: [Linux-cluster] mysql cluster
Message-ID: <1340769375.83343.YahooMailNeo@web162405.mail.bf1.yahoo.com>

Dear all,

what is the easiest way to make mysql cluster with redhat cluster suite.

if anybody has proper doc pls send me.

?

thanks.

Md.Kaisar Ahmed Khan
????????????? (RHCSS,RHCE,MCSE)
IT Manager,X-net Ltd.Dhaka

Cell : 0191 5431453
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120626/148e30a9/attachment.htm>

From rodgersr at yahoo.com  Wed Jun 27 10:16:47 2012
From: rodgersr at yahoo.com (Rick)
Date: Wed, 27 Jun 2012 03:16:47 -0700 (PDT)
Subject: [Linux-cluster] (no subject)
Message-ID: <1340792207.96615.YahooMailNeo@web160306.mail.bf1.yahoo.com>

http://expressions.mx/therpo.html
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120627/95a81ff0/attachment.htm>

From queszama at yahoo.in  Wed Jun 27 14:20:58 2012
From: queszama at yahoo.in (Zama Ques)
Date: Wed, 27 Jun 2012 22:20:58 +0800 (SGT)
Subject: [Linux-cluster] mysql cluster
In-Reply-To: <1340769375.83343.YahooMailNeo@web162405.mail.bf1.yahoo.com>
References: <1340769375.83343.YahooMailNeo@web162405.mail.bf1.yahoo.com>
Message-ID: <1340806858.16209.YahooMailNeo@web193003.mail.sg3.yahoo.com>

Hi Kaisar ,
You can use Redhat cluster suite to configure mysql as a high availability service .? The docs are available here
http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Cluster_Administration/index.html

Thanks
Zaman



________________________________
 From: Kaisar Ahmed Khan <lipson12 at yahoo.com>
To: "linux-cluster at redhat.com" <linux-cluster at redhat.com> 
Sent: Wednesday, 27 June 2012 9:26 AM
Subject: [Linux-cluster] mysql cluster
 

Dear all,

what is the easiest way to make mysql cluster with redhat cluster suite.

if anybody has proper doc pls send me.

?

thanks.

Md.Kaisar Ahmed Khan
????????????? (RHCSS,RHCE,MCSE)
IT Manager,X-net Ltd.Dhaka

Cell : 0191 5431453

?
--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120627/d5a4595f/attachment.htm>

From queszama at yahoo.in  Fri Jun 29 02:32:06 2012
From: queszama at yahoo.in (Zama Ques)
Date: Fri, 29 Jun 2012 10:32:06 +0800 (SGT)
Subject: [Linux-cluster] Options for fencing at the node level .
Message-ID: <1340937126.95735.YahooMailNeo@web193003.mail.sg3.yahoo.com>

Hi All ,

I need to setup HA clustering using redhat cluster suite on two nodes , primary concern being high availability . Before trying it on production , I am trying to configure the setup on two desktop machines . For storage , I am creating a partition and sharing the partition as a iscsi target on a third machine . Would like to know what are the options for fencing available at the node level ?. ?I tried going through the conga interface for creating a
shared fence device , I could see one option is using GNBD . virtual machine fencing is there in the list but that is for xen based HA
cluster . scsi fencing is there , but as far as what I understand it does not support iscsi target as of now. Manual fencing is also there , and I am planning to use that , but would like to? know is there any other options are available for fencing at node level ? 

Thanks in Advance
Zaman
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120629/c108864b/attachment.htm>

From lists at alteeve.ca  Fri Jun 29 03:01:07 2012
From: lists at alteeve.ca (Digimer)
Date: Thu, 28 Jun 2012 23:01:07 -0400
Subject: [Linux-cluster] Options for fencing at the node level .
In-Reply-To: <1340937126.95735.YahooMailNeo@web193003.mail.sg3.yahoo.com>
References: <1340937126.95735.YahooMailNeo@web193003.mail.sg3.yahoo.com>
Message-ID: <4FED1A73.5060207@alteeve.ca>

On 06/28/2012 10:32 PM, Zama Ques wrote:
> Hi All ,
>
> I need to setup HA clustering using redhat cluster suite on two nodes ,
> primary concern being high availability . Before trying it on production
> , I am trying to configure the setup on two desktop machines . For
> storage , I am creating a partition and sharing the partition as a iscsi
> target on a third machine . Would like to know what are the options for
> fencing available at the node level  .  I tried going through the conga
> interface for creating a
> shared fence device , I could see one option is using GNBD . virtual
> machine fencing is there in the list but that is for xen based HA
> cluster . scsi fencing is there , but as far as what I understand it
> does not support iscsi target as of now. Manual fencing is also there ,
> and I am planning to use that , but would like to  know is there any
> other options are available for fencing at node level ?
>
> Thanks in Advance
> Zaman

You need a real device that will power off the cluster node. If your 
machines do not have IPMI, which desktops rarely ever do, you next best 
option is a switched PDU. I have had excellent luck with APC AP7900 (or 
your country's version of). The cluster can call this PDU and ask it to 
turn off the power to the target node.

Fencing requires a mechanism totally independent of the target. With 
virtual machines, the host hypervisor can do this. For real machines 
though, you need hardware.

-- 
Digimer
Papers and Projects: https://alteeve.com




From erik.redding at txstate.edu  Fri Jun 29 17:16:24 2012
From: erik.redding at txstate.edu (Redding, Erik)
Date: Fri, 29 Jun 2012 12:16:24 -0500
Subject: [Linux-cluster] Options for fencing at the node level .
In-Reply-To: <4FED1A73.5060207@alteeve.ca>
References: <1340937126.95735.YahooMailNeo@web193003.mail.sg3.yahoo.com>
	<4FED1A73.5060207@alteeve.ca>
Message-ID: <3DB913F9-D9E5-43DD-B9AF-4340C1141396@txstate.edu>

unsubscribe 

Erik Redding
Systems Programmer, RHCE
Core Systems
Texas State University

On Jun 28, 2012, at 10:01 PM, Digimer wrote:

> On 06/28/2012 10:32 PM, Zama Ques wrote:
>> Hi All ,
>> 
>> I need to setup HA clustering using redhat cluster suite on two nodes ,
>> primary concern being high availability . Before trying it on production
>> , I am trying to configure the setup on two desktop machines . For
>> storage , I am creating a partition and sharing the partition as a iscsi
>> target on a third machine . Would like to know what are the options for
>> fencing available at the node level  .  I tried going through the conga
>> interface for creating a
>> shared fence device , I could see one option is using GNBD . virtual
>> machine fencing is there in the list but that is for xen based HA
>> cluster . scsi fencing is there , but as far as what I understand it
>> does not support iscsi target as of now. Manual fencing is also there ,
>> and I am planning to use that , but would like to  know is there any
>> other options are available for fencing at node level ?
>> 
>> Thanks in Advance
>> Zaman
> 
> You need a real device that will power off the cluster node. If your 
> machines do not have IPMI, which desktops rarely ever do, you next best 
> option is a switched PDU. I have had excellent luck with APC AP7900 (or 
> your country's version of). The cluster can call this PDU and ask it to 
> turn off the power to the target node.
> 
> Fencing requires a mechanism totally independent of the target. With 
> virtual machines, the host hypervisor can do this. For real machines 
> though, you need hardware.
> 
> -- 
> Digimer
> Papers and Projects: https://alteeve.com
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster




From lists at alteeve.ca  Fri Jun 29 17:18:15 2012
From: lists at alteeve.ca (Digimer)
Date: Fri, 29 Jun 2012 13:18:15 -0400
Subject: [Linux-cluster] Options for fencing at the node level .
In-Reply-To: <3DB913F9-D9E5-43DD-B9AF-4340C1141396@txstate.edu>
References: <1340937126.95735.YahooMailNeo@web193003.mail.sg3.yahoo.com>
	<4FED1A73.5060207@alteeve.ca>
	<3DB913F9-D9E5-43DD-B9AF-4340C1141396@txstate.edu>
Message-ID: <4FEDE357.5000903@alteeve.ca>

https://www.redhat.com/mailman/listinfo/linux-cluster <- go here

On 06/29/2012 01:16 PM, Redding, Erik wrote:
> unsubscribe
>
> Erik Redding
> Systems Programmer, RHCE
> Core Systems
> Texas State University
>
> On Jun 28, 2012, at 10:01 PM, Digimer wrote:
>
>> On 06/28/2012 10:32 PM, Zama Ques wrote:
>>> Hi All ,
>>>
>>> I need to setup HA clustering using redhat cluster suite on two nodes ,
>>> primary concern being high availability . Before trying it on production
>>> , I am trying to configure the setup on two desktop machines . For
>>> storage , I am creating a partition and sharing the partition as a iscsi
>>> target on a third machine . Would like to know what are the options for
>>> fencing available at the node level  .  I tried going through the conga
>>> interface for creating a
>>> shared fence device , I could see one option is using GNBD . virtual
>>> machine fencing is there in the list but that is for xen based HA
>>> cluster . scsi fencing is there , but as far as what I understand it
>>> does not support iscsi target as of now. Manual fencing is also there ,
>>> and I am planning to use that , but would like to  know is there any
>>> other options are available for fencing at node level ?
>>>
>>> Thanks in Advance
>>> Zaman
>>
>> You need a real device that will power off the cluster node. If your
>> machines do not have IPMI, which desktops rarely ever do, you next best
>> option is a switched PDU. I have had excellent luck with APC AP7900 (or
>> your country's version of). The cluster can call this PDU and ask it to
>> turn off the power to the target node.
>>
>> Fencing requires a mechanism totally independent of the target. With
>> virtual machines, the host hypervisor can do this. For real machines
>> though, you need hardware.
>>
>> --
>> Digimer
>> Papers and Projects: https://alteeve.com
>>
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


-- 
Digimer
Papers and Projects: https://alteeve.com