From akaris at gmail.com Sun Jan 1 22:04:54 2012 From: akaris at gmail.com (Michel Nadeau) Date: Sun, 1 Jan 2012 17:04:54 -0500 Subject: [Linux-cluster] CMAN across different datacenters In-Reply-To: <4EFE5821.1020602@hastexo.com> References: <4EFE5821.1020602@hastexo.com> Message-ID: Hi, I upgraded to Corosync 1.4.2 and cman 6.2.0, but when I add this to my cluster.conf : I get (when starting cman) : Starting cman... Relax-NG validity error : Extra element cman in interleave tempfile:20: element cman: Relax-NG validity error : Element cluster failed to validate content Configuration fails to validate Any idea why? Thanks, - Mike akaris at gmail.com On Fri, Dec 30, 2011 at 7:32 PM, Andreas Kurz wrote: > Hello, > > On 12/30/2011 10:16 PM, Michel Nadeau wrote: > > Hi, > > > > We're trying to configure a CMAN cluster with 2 nodes located in 2 > > different datacenters. > > only one of various problems of split-site cluster: how do you plan to > implement reliable fencing? > > > > > The 2 nodes are running Debian 6 and they can access each other on the > > private LAN (using the eth0 interface). > > > > The problem is that the 2 nodes don't have the same subnet and the > > multicast doesn't seem to work: is there any way to make this work? > > since 1.3.0 corosync supports unicasts (UDPU) ... it ships with a nice > example configuration. > > Regards, > Andreas > > -- > Need help with Corosync? > http://www.hastexo.com/now > > > > > Thanks, > > > > - Mike > > akaris at gmail.com > > > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From szekelyi at niif.hu Sun Jan 1 23:24:32 2012 From: szekelyi at niif.hu (=?ISO-8859-1?Q?Sz=E9kelyi?= Szabolcs) Date: Mon, 02 Jan 2012 00:24:32 +0100 Subject: [Linux-cluster] CMAN across different datacenters In-Reply-To: References: <4EFE5821.1020602@hastexo.com> Message-ID: <15595999.ZutN6uBcOj@mranderson> On 2012. January 1. 17:04:54 Michel Nadeau wrote: > I upgraded to Corosync 1.4.2 and cman 6.2.0, but when I add this to my > cluster.conf : > > > > I get (when starting cman) : > > Starting cman... Relax-NG validity error : Extra element cman in > interleave > tempfile:20: element cman: Relax-NG validity error : Element cluster failed > to validate content > Configuration fails to validate > > Any idea why? I've fiddled around quite a lot with this. I wanted to keep multicasting, but change the TTL to more than 1 so that the nodes' packets can reach each other. It turned out that Corosync 1.4.1 that's included in Debian Backports, supports this feature, but I've figured out that this is not enough since you need a new cman to communicate the config to Corosync. Debian has cman 3.0.12 which is unable to do this. The situation was strange, because it looked like I'm using the same version than people on this list, but this feature works for them but not for me. After further research, it turned out that there are two versions of 3.0.12 out there, 3.0.12 and 3.0.12.1. Debian has the older one, which doesn't have this feature. The latter one came out long after the old one, and according to changelogs, has significant enhancements including the one in question. 
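For reference, here is a minimal sketch of the unicast (udpu) variant that was suggested earlier in the thread, assuming a cman new enough to understand the transport attribute (3.0.12.1 or the 3.1.x series, as discussed here). The cluster name, node names and the two-node settings are placeholders, not anyone's actual configuration, and fencing is omitted for brevity:

  <?xml version="1.0"?>
  <cluster name="example" config_version="1">
    <!-- udpu = UDP unicast; avoids multicast entirely (corosync >= 1.3.0) -->
    <cman transport="udpu" two_node="1" expected_votes="1"/>
    <clusternodes>
      <!-- with udpu the member list is built from these names, so each
           name must resolve to a reachable address on every node -->
      <clusternode name="node1.example.com" nodeid="1"/>
      <clusternode name="node2.example.com" nodeid="2"/>
    </clusternodes>
  </cluster>

The alternative is to keep multicast and raise the TTL, which is the feature discussed above, but that too needs a cman new enough to pass the setting through to corosync; otherwise you are left with the iptables workaround shown further down.
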
Looking at https://fedorahosted.org/releases/c/l/cluster/, here are the version numbers ordered by release dates: 3.0.11: 21-Apr-2010 3.0.12: 11-May-2010 3.0.13: 08-Jun-2010 3.0.14: 30-Jul-2010 3.0.15: 02-Sep-2010 3.0.16: 02-Sep-2010 3.0.17: 06-Oct-2010 3.1.0: 02-Dec-2010 3.1.1: 08-Mar-2011 3.0.12.1: 27-May-2011 3.1.2: 16-Jun-2011 Wow, it looks like the cman guys have a strange idea on versioning... The size of changelogs is also very interesting. Anyway, since Debian doesn't have the "new" 3.0.12, I worked around this problem by using multicast and some iptables magic to achieve what you need: iptables -t mangle -A OUTPUT -d -j TTL --ttl-set 8 Cheers, -- cc From akaris at gmail.com Mon Jan 2 00:24:14 2012 From: akaris at gmail.com (Michel Nadeau) Date: Mon, 2 Jan 2012 00:24:14 +0000 Subject: [Linux-cluster] CMAN across different datacenters In-Reply-To: <15595999.ZutN6uBcOj@mranderson> References: <4EFE5821.1020602@hastexo.com> <15595999.ZutN6uBcOj@mranderson> Message-ID: <1358280838-1325463856-cardhu_decombobulator_blackberry.rim.net-1070681867-@b15.c31.bise6.blackberry> I'm not sure to understand how your iptables rule can fix this? I'm trying to get 2 nodes in 2 datacenters using 2 IP subnets to work. -----Original Message----- From: Sz?kelyi Szabolcs Sender: linux-cluster-bounces at redhat.com Date: Mon, 02 Jan 2012 00:24:32 To: Reply-To: linux clustering Subject: Re: [Linux-cluster] CMAN across different datacenters On 2012. January 1. 17:04:54 Michel Nadeau wrote: > I upgraded to Corosync 1.4.2 and cman 6.2.0, but when I add this to my > cluster.conf : > > > > I get (when starting cman) : > > Starting cman... Relax-NG validity error : Extra element cman in > interleave > tempfile:20: element cman: Relax-NG validity error : Element cluster failed > to validate content > Configuration fails to validate > > Any idea why? I've fiddled around quite a lot with this. I wanted to keep multicasting, but change the TTL to more than 1 so that the nodes' packets can reach each other. It turned out that Corosync 1.4.1 that's included in Debian Backports, supports this feature, but I've figured out that this is not enough since you need a new cman to communicate the config to Corosync. Debian has cman 3.0.12 which is unable to do this. The situation was strange, because it looked like I'm using the same version than people on this list, but this feature works for them but not for me. After further research, it turned out that there are two versions of 3.0.12 out there, 3.0.12 and 3.0.12.1. Debian has the older one, which doesn't have this feature. The latter one came out long after the old one, and according to changelogs, has significant enhancements including the one in question. Looking at https://fedorahosted.org/releases/c/l/cluster/, here are the version numbers ordered by release dates: 3.0.11: 21-Apr-2010 3.0.12: 11-May-2010 3.0.13: 08-Jun-2010 3.0.14: 30-Jul-2010 3.0.15: 02-Sep-2010 3.0.16: 02-Sep-2010 3.0.17: 06-Oct-2010 3.1.0: 02-Dec-2010 3.1.1: 08-Mar-2011 3.0.12.1: 27-May-2011 3.1.2: 16-Jun-2011 Wow, it looks like the cman guys have a strange idea on versioning... The size of changelogs is also very interesting. 
Anyway, since Debian doesn't have the "new" 3.0.12, I worked around this problem by using multicast and some iptables magic to achieve what you need: iptables -t mangle -A OUTPUT -d -j TTL --ttl-set 8 Cheers, -- cc -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From fdinitto at redhat.com Mon Jan 2 04:20:36 2012 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Mon, 02 Jan 2012 05:20:36 +0100 Subject: [Linux-cluster] CMAN across different datacenters In-Reply-To: <15595999.ZutN6uBcOj@mranderson> References: <4EFE5821.1020602@hastexo.com> <15595999.ZutN6uBcOj@mranderson> Message-ID: <4F013094.7050302@redhat.com> On 01/02/2012 12:24 AM, Sz?kelyi Szabolcs wrote: > 3.0.11: 21-Apr-2010 > 3.0.12: 11-May-2010 > 3.0.13: 08-Jun-2010 > 3.0.14: 30-Jul-2010 > 3.0.15: 02-Sep-2010 > 3.0.16: 02-Sep-2010 > 3.0.17: 06-Oct-2010 > 3.1.0: 02-Dec-2010 > 3.1.1: 08-Mar-2011 > 3.0.12.1: 27-May-2011 > 3.1.2: 16-Jun-2011 > > Wow, it looks like the cman guys have a strange idea on versioning... The size > of changelogs is also very interesting. What strange idea? The 3.0.12.x is based on the RHEL6 branch (what you ge tin RHEL6.x) and the 3.1.x serie is from STABLE31 branch from upstream. 3.0.12.x is a more strict bug fix only series. 3.1 gets a bit more. Mind to explain what's strange about it? From fdinitto at redhat.com Mon Jan 2 04:21:40 2012 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Mon, 02 Jan 2012 05:21:40 +0100 Subject: [Linux-cluster] CMAN across different datacenters In-Reply-To: References: <4EFE5821.1020602@hastexo.com> Message-ID: <4F0130D4.5040909@redhat.com> On 01/01/2012 11:04 PM, Michel Nadeau wrote: > Hi, > > I upgraded to Corosync 1.4.2 and cman 6.2.0, but when I add this to my > cluster.conf : cman 6.2.0 does not exists anywhere. > > > > I get (when starting cman) : > > Starting cman... Relax-NG validity error : Extra element cman in > interleave > tempfile:20: element cman: Relax-NG validity error : Element cluster > failed to validate content > Configuration fails to validate > > Any idea why? This issue has been fixed a while ago in both upstream (3.1.x) and 3.0.12.x in RHEL. Fabio From list at fajar.net Mon Jan 2 04:33:20 2012 From: list at fajar.net (Fajar A. Nugraha) Date: Mon, 2 Jan 2012 11:33:20 +0700 Subject: [Linux-cluster] CMAN across different datacenters In-Reply-To: <4F013094.7050302@redhat.com> References: <4EFE5821.1020602@hastexo.com> <15595999.ZutN6uBcOj@mranderson> <4F013094.7050302@redhat.com> Message-ID: On Mon, Jan 2, 2012 at 11:20 AM, Fabio M. Di Nitto wrote: > On 01/02/2012 12:24 AM, Sz?kelyi Szabolcs wrote: >> 3.0.12: 11-May-2010 >> 3.0.12.1: 27-May-2011 >> 3.1.2: 16-Jun-2011 >> >> Wow, it looks like the cman guys have a strange idea on versioning... The size >> of changelogs is also very interesting. > > What strange idea? The 3.0.12.x is based on the RHEL6 branch (what you > ge tin RHEL6.x) and the 3.1.x serie is from STABLE31 branch from upstream. > > 3.0.12.x is a more strict bug fix only series. 3.1 gets a bit more. > > Mind to explain what's strange about it? I think Sz?kelyi is refering to the size of 3.0.12.1 changelog and the things that went in it. If it's a "strict bug fix only series", then e.g. 
arguably these changes shouldn't be there, cause they're new features: cman init: add support for "nocluster" kernel cmdline to not start cman at boot Cman: Add support for udpu and rdma transport resource-agents: Add NFSv4 support -- Fajar From fdinitto at redhat.com Mon Jan 2 05:30:58 2012 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Mon, 02 Jan 2012 06:30:58 +0100 Subject: [Linux-cluster] CMAN across different datacenters In-Reply-To: References: <4EFE5821.1020602@hastexo.com> <15595999.ZutN6uBcOj@mranderson> <4F013094.7050302@redhat.com> Message-ID: <4F014112.4000607@redhat.com> On 01/02/2012 05:33 AM, Fajar A. Nugraha wrote: > On Mon, Jan 2, 2012 at 11:20 AM, Fabio M. Di Nitto wrote: >> On 01/02/2012 12:24 AM, Sz?kelyi Szabolcs wrote: > >>> 3.0.12: 11-May-2010 > >>> 3.0.12.1: 27-May-2011 >>> 3.1.2: 16-Jun-2011 >>> >>> Wow, it looks like the cman guys have a strange idea on versioning... The size >>> of changelogs is also very interesting. >> >> What strange idea? The 3.0.12.x is based on the RHEL6 branch (what you >> ge tin RHEL6.x) and the 3.1.x serie is from STABLE31 branch from upstream. >> >> 3.0.12.x is a more strict bug fix only series. 3.1 gets a bit more. >> >> Mind to explain what's strange about it? > > I think Sz?kelyi is refering to the size of 3.0.12.1 changelog and the > things that went in it. If it's a "strict bug fix only series", then > e.g. arguably these changes shouldn't be there, cause they're new > features: > > cman init: add support for "nocluster" kernel cmdline to not > start cman at boot > Cman: Add support for udpu and rdma transport > resource-agents: Add NFSv4 support > Request For Enhancement or integration bits are also bugs. corosync supports nocluster and so cman needs the same. corosync added support for udpu and rdma transports. In order for those to work, cman needs to understand them. Similar, NFSv4 support has been added, and that support needs to be reflected in resource-agents. The lack of those integration bits are bugs, we can argue that there is a thin gray line here. It's not always black and white. >From a cman perspective those are "new features" but when you look at it from a more global integration overview, the lack of those are issues. Fabio From akaris at gmail.com Mon Jan 2 16:52:14 2012 From: akaris at gmail.com (Michel Nadeau) Date: Mon, 2 Jan 2012 11:52:14 -0500 Subject: [Linux-cluster] CMAN across different datacenters In-Reply-To: <4F0130D4.5040909@redhat.com> References: <4EFE5821.1020602@hastexo.com> <4F0130D4.5040909@redhat.com> Message-ID: I just found out that my cman version isn't 6.2.0 (what's that version anyway? The cman_tool version?). My Debian package version is 3.0.12-3.. so I guess I don't have the udpu fix as I don't have 3.0.12.x. - Mike akaris at gmail.com On Sun, Jan 1, 2012 at 11:21 PM, Fabio M. Di Nitto wrote: > On 01/01/2012 11:04 PM, Michel Nadeau wrote: > > Hi, > > > > I upgraded to Corosync 1.4.2 and cman 6.2.0, but when I add this to my > > cluster.conf : > > cman 6.2.0 does not exists anywhere. > > > > > > > > > I get (when starting cman) : > > > > Starting cman... Relax-NG validity error : Extra element cman in > > interleave > > tempfile:20: element cman: Relax-NG validity error : Element cluster > > failed to validate content > > Configuration fails to validate > > > > Any idea why? > > This issue has been fixed a while ago in both upstream (3.1.x) and > 3.0.12.x in RHEL. 
> > Fabio > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From swhiteho at redhat.com Tue Jan 3 09:52:27 2012 From: swhiteho at redhat.com (Steven Whitehouse) Date: Tue, 03 Jan 2012 09:52:27 +0000 Subject: [Linux-cluster] CLVM/GFS2 distributed locking In-Reply-To: <4EFE116D.3080505@dbtgroup.com> References: <4EFE071F.8030406@alteeve.com> <4EFE116D.3080505@dbtgroup.com> Message-ID: <1325584347.2685.0.camel@menhir> Hi, On Fri, 2011-12-30 at 19:30 +0000, yvette hirth wrote: > Digimer wrote: > > > For GFS2, one of the easiest performance wins is to set > > 'noatime,nodiratime' in the mount options to avoid requiring locks to > > update the access times on files when you only read them. > > i've found that "noatime" implies "nodiratime", so both are not needed - > unless GFS/GFS2 behaves differently than other fs's wrt this attribute. > if so, that would be good to know for certain. > > see here: http://lwn.net/Articles/245002/ > > the article didn't specify the filesystem... > > yvette > Earlier GFS did have different atime code, but GFS2 uses the same code as all other filesystems, so the behaviour should also be the same, Steve. From swhiteho at redhat.com Tue Jan 3 09:55:28 2012 From: swhiteho at redhat.com (Steven Whitehouse) Date: Tue, 03 Jan 2012 09:55:28 +0000 Subject: [Linux-cluster] CLVM/GFS2 distributed locking In-Reply-To: References: <4EFE071F.8030406@alteeve.com> <4EFE116D.3080505@dbtgroup.com> <4EFE1DA7.3020506@alteeve.com> Message-ID: <1325584528.2685.2.camel@menhir> Hi, On Fri, 2011-12-30 at 21:37 +0100, Stevo Slavi? wrote: > Pulling the cables between shared storage and foo01, foo01 gets > fenced. Here is some info from foo02 about shared storage and dlm > debug (lock file seems to remain locked) > > root at foo02:-//data/activemq_data#ls -li > total 276 > 66467 -rw-r--r-- 1 root root 33030144 Dec 30 16:32 db-1.log > 66468 -rw-r--r-- 1 root root 73728 Dec 30 16:24 db.data > 66470 -rw-r--r-- 1 root root 53344 Dec 30 16:24 db.redo > 128014 -rw-r--r-- 1 root root 0 Dec 30 19:49 dummy > 66466 -rw-r--r-- 1 root root 0 Dec 30 16:23 lock > root at foo02:-//data/activemq_data#grep -A 7 -i > 103a2 /debug/dlm/activemq > Resource ffff81090faf96c0 Name (len=24) " 2 103a2" > Master Copy > Granted Queue > 03d10002 PR Remote: 1 00c80001 > 00e00001 PR > Conversion Queue > Waiting Queue > -- > Resource ffff81090faf97c0 Name (len=24) " 5 103a2" > Master Copy > Granted Queue > 03c30003 PR Remote: 1 039a0001 > 03550001 PR > Conversion Queue > Waiting Queue > > > Are there some docs for interpreting this dlm debug output? > > Not as such I think. It sounds like the issue is recovery related. Are there any messages which indicate what might be going on? Once the failed node has been fenced, then recovery should proceed fairly soon afterwards, Steve. > Regards, > Stevo. > > On Fri, Dec 30, 2011 at 9:23 PM, Digimer wrote: > On 12/30/2011 03:08 PM, Stevo Slavi? wrote: > > Hi Digimer and Yvette, > > > > Thanks for tips! I don't doubt reliability of the > technology, just want > > to make sure it is configured well. > > > > After fencing a node that held a lock on a file on shared > storage, lock > > remains, and non-fenced node cannot take over the lock on > that file. > > Wondering how can one check which process (from which node > if possible) > > is holding a lock on a file on shared storage. 
> > dlm should have taken care of releasing the lock once node > got fenced, > > right? > > > > Regards, > > Stevo. > > > After a successful fence call, DLM will clean up any locks > held by the > lost node. That's why it's so critical that the fence action > succeeded > (ie: test-test-test). If a node doesn't actually die in a > fence, but the > cluster thinks it did, and somehow the lost node returns, the > lost node > will think it's locks are still valid and modify shared > storage, leading > to near-certain data corruption. > > It's all perfectly safe, provided you've tested your fencing > properly. :) > > Yvette, > > You might be right on the 'noatime' implying 'nodiratime'... > I add > both out of habit. > > -- > Digimer > E-Mail: digimer at alteeve.com > Freenode handle: digimer > Papers and Projects: http://alteeve.com > Node Assassin: http://nodeassassin.org > "omg my singularity battery is dead again. > stupid hawking radiation." - epitron > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From swhiteho at redhat.com Tue Jan 3 09:59:16 2012 From: swhiteho at redhat.com (Steven Whitehouse) Date: Tue, 03 Jan 2012 09:59:16 +0000 Subject: [Linux-cluster] CLVM/GFS2 distributed locking In-Reply-To: References: Message-ID: <1325584756.2685.6.camel@menhir> Hi, On Fri, 2011-12-30 at 14:39 +0100, Stevo Slavi? wrote: > Hello RedHat Linux cluster community, > > I'm in process of configuring shared filesystem storage master/slave > Apache ActiveMQ setup. For it to work, it requires reliable > distributed locking - master is node that holds exclusive lock on a > file on shared filesystem storage. > How does it do this locking? There are several possible ways this might be done, and some will work better than others. > On RHEL (5.4), using CLVM with GFS2 is one of the options that should > work. Why are you using RHEL 5.4 and not something more recent? Note that if you are a Red Hat customer, then you should contact our support team who will be very happy to assist. > Third party configured the CLVM/GFS2. I'd like to make sure that > distributed locking works OK. > What are my options for verifying this? > I think we need to verify which type of locking the application uses before we can answer this, Steve. > Thanks in advance! > > Regards, > Stevo. > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From linux at alteeve.com Tue Jan 3 14:26:32 2012 From: linux at alteeve.com (Digimer) Date: Tue, 03 Jan 2012 09:26:32 -0500 Subject: [Linux-cluster] New Tutorial - RHCS + DRBD + KVM; 2-Node HA on EL6 Message-ID: <4F031018.4060204@alteeve.com> Hi all, I'm happy to announce a new tutorial! https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial This tutorial walks a user through the entire process of building a 2-Node cluster for making KVM virtual machines highly available. It uses Red Hat Cluster services v3 and DRBD 8.3.12. It is written such that you can use entirely free or fully Red Hat supported environments. Highlights; * Full network and power redundancy; no single-points of failure. * All off-the-shelf hardware; Storage via DRBD. * Starts with base OS install, no clustering experience required. * All software components explained. * Includes all testing steps covered. * Configuration is used in production environments! 
This tutorial is totally free (no ads, no registration) and released under the Creative Common 3.0 Share-Alike Non-Commercial license. Feedback is always appreciated! -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "omg my singularity battery is dead again. stupid hawking radiation." - epitron From jeff.sturm at eprize.com Tue Jan 3 16:31:57 2012 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Tue, 3 Jan 2012 16:31:57 +0000 Subject: [Linux-cluster] CLVM/GFS2 distributed locking In-Reply-To: <1325584347.2685.0.camel@menhir> References: <4EFE071F.8030406@alteeve.com> <4EFE116D.3080505@dbtgroup.com> <1325584347.2685.0.camel@menhir> Message-ID: > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] > On Behalf Of Steven Whitehouse > Sent: Tuesday, January 03, 2012 4:52 AM > > Earlier GFS did have different atime code, but GFS2 uses the same code as all other > filesystems, so the behaviour should also be the same, My testing on GFS (a few years ago) showed that "noatime" definitely did not set "nodiratime" implicitly, so I've always set both. Good to know that's corrected for GFS2. -Jeff From sslavic at gmail.com Wed Jan 4 17:10:46 2012 From: sslavic at gmail.com (=?UTF-8?Q?Stevo_Slavi=C4=87?=) Date: Wed, 4 Jan 2012 18:10:46 +0100 Subject: [Linux-cluster] CLVM/GFS2 distributed locking In-Reply-To: <1325584756.2685.6.camel@menhir> References: <1325584756.2685.6.camel@menhir> Message-ID: Hello Steven, I guess license covers only 5.4. Anyway I'm just told it's not an option at the moment to do the upgrade. About locking used, ActiveMQ uses Java 6 standard API for trying to acquire file lock, here is javadoc for the method used: http://docs.oracle.com/javase/6/docs/api/java/nio/channels/FileChannel.html#tryLock%28long,%20long,%20boolean%29 ActiveMQ tries to obtain non-shared (thus exclusive) lock on whole file on shared file system, with range from 0 to 1, since the file being locked is empty. As documentation states, tryLock is non-blocking, it executes immediately. If ActiveMQ fails to obtain a lock it will loop (pause and retry acquiring lock again) until lock is obtained. In initial state first node obtains lock and becomes master, second one fails to obtain lock and gets into this loop, as expected. Problem is that slave ActiveMQ on cannot obtain a lock even after first node gets fenced - it reports that the file on shared storage is still locked. Simple custom java tool that I made reports the same, that the file is locked. OpenJDK 1.6 update 20 is being used as Java runtime. I haven't yet found in openjdk source exact code which tryLock will call on Linux. Is there non-Java tool that could be used to reliably check if a file (on gfs2) is (or can be) exclusively locked (regardless of where the process holding lock is running, on same or different node where the tool is being run)? Regards, Stevo. On Tue, Jan 3, 2012 at 10:59 AM, Steven Whitehouse wrote: > Hi, > > On Fri, 2011-12-30 at 14:39 +0100, Stevo Slavi? wrote: > > Hello RedHat Linux cluster community, > > > > I'm in process of configuring shared filesystem storage master/slave > > Apache ActiveMQ setup. For it to work, it requires reliable > > distributed locking - master is node that holds exclusive lock on a > > file on shared filesystem storage. > > > How does it do this locking? There are several possible ways this might > be done, and some will work better than others. 
> > > On RHEL (5.4), using CLVM with GFS2 is one of the options that should > > work. > Why are you using RHEL 5.4 and not something more recent? Note that if > you are a Red Hat customer, then you should contact our support team who > will be very happy to assist. > > > Third party configured the CLVM/GFS2. I'd like to make sure that > > distributed locking works OK. > > What are my options for verifying this? > > > I think we need to verify which type of locking the application uses > before we can answer this, > > Steve. > > > Thanks in advance! > > > > Regards, > > Stevo. > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sslavic at gmail.com Thu Jan 5 00:00:20 2012 From: sslavic at gmail.com (=?UTF-8?Q?Stevo_Slavi=C4=87?=) Date: Thu, 5 Jan 2012 01:00:20 +0100 Subject: [Linux-cluster] CLVM/GFS2 distributed locking In-Reply-To: References: <1325584756.2685.6.camel@menhir> Message-ID: Here is a simple C utility for locking file - it's combination of two sources: - reading lock info from here: http://uw714doc.sco.com/en/SDK_sysprog/_Getting_Lock_Information.html - acquiring file lock from here: http://siber.cankaya.edu.tr/ozdogan/SystemsProgramming/ceng425/node161.html Now, to make use of it. Regards, Stevo. On Wed, Jan 4, 2012 at 6:10 PM, Stevo Slavi? wrote: > Hello Steven, > > I guess license covers only 5.4. Anyway I'm just told it's not an option > at the moment to do the upgrade. > > > About locking used, ActiveMQ uses Java 6 standard API for trying to > acquire file lock, here is javadoc for the method used: > > > http://docs.oracle.com/javase/6/docs/api/java/nio/channels/FileChannel.html#tryLock%28long,%20long,%20boolean%29 > > ActiveMQ tries to obtain non-shared (thus exclusive) lock on whole file on > shared file system, with range from 0 to 1, since the file being locked is > empty. As documentation states, tryLock is non-blocking, it executes > immediately. If ActiveMQ fails to obtain a lock it will loop (pause and > retry acquiring lock again) until lock is obtained. In initial state first > node obtains lock and becomes master, second one fails to obtain lock and > gets into this loop, as expected. Problem is that slave ActiveMQ on cannot > obtain a lock even after first node gets fenced - it reports that the file > on shared storage is still locked. Simple custom java tool that I made > reports the same, that the file is locked. > > OpenJDK 1.6 update 20 is being used as Java runtime. I haven't yet found > in openjdk source exact code which tryLock will call on Linux. > > > Is there non-Java tool that could be used to reliably check if a file (on > gfs2) is (or can be) exclusively locked (regardless of where the process > holding lock is running, on same or different node where the tool is being > run)? > > > Regards, > Stevo. > > > > > On Tue, Jan 3, 2012 at 10:59 AM, Steven Whitehouse wrote: > >> Hi, >> >> On Fri, 2011-12-30 at 14:39 +0100, Stevo Slavi? wrote: >> > Hello RedHat Linux cluster community, >> > >> > I'm in process of configuring shared filesystem storage master/slave >> > Apache ActiveMQ setup. For it to work, it requires reliable >> > distributed locking - master is node that holds exclusive lock on a >> > file on shared filesystem storage. >> > >> How does it do this locking? 
There are several possible ways this might >> be done, and some will work better than others. >> >> > On RHEL (5.4), using CLVM with GFS2 is one of the options that should >> > work. >> Why are you using RHEL 5.4 and not something more recent? Note that if >> you are a Red Hat customer, then you should contact our support team who >> will be very happy to assist. >> >> > Third party configured the CLVM/GFS2. I'd like to make sure that >> > distributed locking works OK. >> > What are my options for verifying this? >> > >> I think we need to verify which type of locking the application uses >> before we can answer this, >> >> Steve. >> >> > Thanks in advance! >> > >> > Regards, >> > Stevo. >> > -- >> > Linux-cluster mailing list >> > Linux-cluster at redhat.com >> > https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: lock-file.c Type: text/x-csrc Size: 1433 bytes Desc: not available URL: From linux at alteeve.com Thu Jan 5 05:46:35 2012 From: linux at alteeve.com (Digimer) Date: Thu, 05 Jan 2012 00:46:35 -0500 Subject: [Linux-cluster] cluster 3.1.90 released Message-ID: <4F05393B.2030501@alteeve.com> Welcome to the cluster 3.1.90 release. The release has bug fixes and code clean-up. It is a test release for the coming 3.2 branch. Feedback is always appreciated. The new source tarball can be downloaded here: https://fedorahosted.org/releases/c/l/cluster/cluster-3.1.90.tar.xz ChangeLog: https://fedorahosted.org/releases/c/l/cluster/Changelog-3.1.90 To report bugs or issues: https://bugzilla.redhat.com/ Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other sysadministrators or power users. Thanks/congratulations to all people that contributed to achieve this great milestone. Happy clustering, Digimer From dkelson at gurulabs.com Thu Jan 5 17:07:42 2012 From: dkelson at gurulabs.com (Dax Kelson) Date: Thu, 5 Jan 2012 10:07:42 -0700 Subject: [Linux-cluster] Maximum number of GFS server nodes? Message-ID: Looking in older Red Hat Magazine article by Matthew O'Keefe such as: http://www.redhat.com/magazine/008jun05/features/gfs/ http://www.redhat.com/magazine/008jun05/features/gfs_nfs/ There are references to large GFS clusters. "For example, if 128 GFS server nodes require..." and "scalability 300+ or more" Why is it on RHEL6 only a max of 16 nodes is supported? Thanks, Dax Kelson Guru Labs -------------- next part -------------- An HTML attachment was scrubbed... URL: From swhiteho at redhat.com Thu Jan 5 17:20:44 2012 From: swhiteho at redhat.com (Steven Whitehouse) Date: Thu, 05 Jan 2012 17:20:44 +0000 Subject: [Linux-cluster] Maximum number of GFS server nodes? In-Reply-To: References: Message-ID: <1325784044.2690.52.camel@menhir> Hi, On Thu, 2012-01-05 at 10:07 -0700, Dax Kelson wrote: > Looking in older Red Hat Magazine article by Matthew O'Keefe such as: > > http://www.redhat.com/magazine/008jun05/features/gfs/ > http://www.redhat.com/magazine/008jun05/features/gfs_nfs/ > > There are references to large GFS clusters. > > "For example, if 128 GFS server nodes require..." and "scalability 300 > + or more" > > Why is it on RHEL6 only a max of 16 nodes is supported? 
> Those articles are rather out of date. I don't think that GFS was ever used or tested at that scale and that was probably the theoretical limit at the time. The reason for the 16 node limit is that it is what we test (and therefore what we support), which largely reflects what people have requested. There is no reason why larger numbers of nodes couldn't be made to work in theory though, Steve. > Thanks, > Dax Kelson > Guru Labs > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From adrew at redhat.com Thu Jan 5 17:26:41 2012 From: adrew at redhat.com (Adam Drew) Date: Thu, 5 Jan 2012 12:26:41 -0500 Subject: [Linux-cluster] Maximum number of GFS server nodes? In-Reply-To: References: Message-ID: Hello, Red Hat magazine articles aren't official documentation. Additionally RHM is no longer published (hasn't been for years.) The difference between what the article is talking about and what we support in RHEL is a matter of quality assurance and testing - we can only support what we can reasonably test and what we can commit to being able to dedicate to issue reproduction and resolution in the course of a support case. Linux-cluster and GFS/GFS2 will scale well past 16 nodes but Red Hat doesn't test or do engineering and development work on more than 16. The other side of the equation is that linux-cluster + GFS2 on RHEL as marketed by Red Hat is a high availability product - not a distributed computing or "big data" product. It's hard to make a case for HA at large scale. For HA purposes 16 nodes is on the generous side - I rarely see clusters greater than 4 nodes in the course of my work with cluster customers. Cluster and GFS2 could be spun into the back-bone for distributed computing or big data deployments but that's not how Red Hat tests, develops, and thus supports the combination of those products. If you are doing a research, academic, community, or personal project and don't require enterprise support you could likely do some really interesting things with GFS2/cluster at large scale - but for supported deployments with a commitment from Red Hat to test, QA, develop, and resolve issues the limit is 16. Hope this information helps you. -- Adam Drew Software Maintenance Engineer Support Engineering Group Red Hat, Inc. On Jan 5, 2012, at 12:07 PM, Dax Kelson wrote: > Looking in older Red Hat Magazine article by Matthew O'Keefe such as: > > http://www.redhat.com/magazine/008jun05/features/gfs/ > http://www.redhat.com/magazine/008jun05/features/gfs_nfs/ > > There are references to large GFS clusters. > > "For example, if 128 GFS server nodes require..." and "scalability 300+ or more" > > Why is it on RHEL6 only a max of 16 nodes is supported? > > Thanks, > Dax Kelson > Guru Labs > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux at alteeve.com Thu Jan 5 17:48:23 2012 From: linux at alteeve.com (Digimer) Date: Thu, 05 Jan 2012 12:48:23 -0500 Subject: [Linux-cluster] Maximum number of GFS server nodes? In-Reply-To: References: Message-ID: <4F05E267.6030307@alteeve.com> On 01/05/2012 12:07 PM, Dax Kelson wrote: > Looking in older Red Hat Magazine article by Matthew O'Keefe such as: > > http://www.redhat.com/magazine/008jun05/features/gfs/ > http://www.redhat.com/magazine/008jun05/features/gfs_nfs/ > > There are references to large GFS clusters. 
> > "For example, if 128 GFS server nodes require..." and "scalability 300+ > or more" > > Why is it on RHEL6 only a max of 16 nodes is supported? > > Thanks, > Dax Kelson > Guru Labs Speaking as an independent; I often see people with latency issues when they try to grow past 16 nodes when using corosync, which is the HA cluster communication layer in RHEL 6. More specifically, DLM (the distributed lock manager) can start to suffer from a performance perspective as the size of the cluster grows. You may be able to go higher, but be prepared to do a lot of network tweaking. Also, as Steven and Adam pointed out, >16 is outside the supported size so you will have trouble getting any official support. If you want to try anyway, the freenode IRC channel #linux-cluster is a good place to ask about specific problems you run into. -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "omg my singularity battery is dead again. stupid hawking radiation." - epitron From wmodes at ucsc.edu Thu Jan 5 21:54:25 2012 From: wmodes at ucsc.edu (Wes Modes) Date: Thu, 05 Jan 2012 13:54:25 -0800 Subject: [Linux-cluster] GFS on CentOS - cman unable to start Message-ID: <4F061C11.5030303@ucsc.edu> Howdy, y'all. I'm trying to set up GFS in a cluster on CentOS systems running on vmWare. The GFS FS is on a Dell Equilogic SAN. I keep running into the same problem despite many differently-flavored attempts to set up GFS. The problem comes when I try to start cman, the cluster management software. [root at test01]# service cman start Starting cluster: Loading modules... done Mounting configfs... done Starting ccsd... done Starting cman... failed cman not started: Can't find local node name in cluster.conf /usr/sbin/cman_tool: aisexec daemon didn't start [FAILED] [root at test01]# tail /var/log/messages Jan 5 13:39:40 testbench06 ccsd[13194]: Unable to connect to cluster infrastructure after 1193640 seconds. Jan 5 13:40:10 testbench06 ccsd[13194]: Unable to connect to cluster infrastructure after 1193670 seconds. Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive Service RELEASE 'subrev 1887 version 0.80.6' Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors. Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) 2006 Red Hat, Inc. Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive Service: started and ready to provide service. Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] local node name "test01.gdao.ucsc.edu" not found in cluster.conf Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error reading CCS info, cannot start Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error reading config from CCS Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive exiting (reason: could not read the main configuration file). Here are details of my configuration: [root at test01]# rpm -qa | grep cman cman-2.0.115-85.el5_7.2 [root at test01]# echo $HOSTNAME test01.gdao.ucsc.edu [root at test01]# hostname test01.gdao.ucsc.edu [root at test01]# cat /etc/hosts # Do not remove the following line, or various programs # that require network functionality will fail. 
128.114.31.112 test01 test01.gdao test01.gdao.ucsc.edu 128.114.31.113 test02 test02.gdao test02.gdao.ucsc.edu 127.0.0.1 localhost.localdomain localhost ::1 localhost6.localdomain6 localhost6 [root at test01]# sestatus SELinux status: enabled SELinuxfs mount: /selinux Current mode: permissive Mode from config file: permissive Policy version: 21 Policy from config file: targeted [root at test01]# cat /etc/cluster/cluster.conf I've seen much discussion of this problem, but no definitive solutions. Any help you can provide will be welcome. Wes Modes From dkelson at gurulabs.com Fri Jan 6 00:45:07 2012 From: dkelson at gurulabs.com (Dax Kelson) Date: Thu, 5 Jan 2012 17:45:07 -0700 Subject: [Linux-cluster] Getting a SPC-3 compliant PR enabled iSCSI target up and running? Message-ID: Hi, I have a testing lab where I'm attempting to get some more experience with LIO and targetcli. Is there an IRC channel where cluster and/or target folks hang out? I have: - A 3 node RHEL6.2 cluster with clvmd and GFS2 - A Fedora 16 box (kernel 3.1.6) with LIO/targetcli-2.0rc1.fb3-2 for my shared storage Is there special configuration needs to be done on the target to enable PR, because it doesn't seem to be working. I'm not able to get fence_scsi working. I'm seeing registrations, but no reservation. >From one of the RHEL6 cluster nodes, and attempt to do a read-key fails: # sg_persist -n -i -k -d /dev/sdf PR in: aborted command Which generates this message over on my Fedora 16 scsi target: filp_open(/var/target/pr/ aptpl_086b4b49-8736-45e1-a80c-2ddeb8a5a01e) for APTPL metadata failed # cat /sys/kernel/config/target/core/iblock_0/store01/pr/res_* APTPL Bit Status: Disabled Ready to process PR APTPL metadata.. No SPC-3 Reservation holder No SPC-3 Reservation holder 0x00000008 No SPC-3 Reservation holder SPC-3 PR Registrations: iSCSI Node: iqn.1994-05.com.redhat:226e63cf8cf5,i,0x00023d010000 Key: 0x00000000aa230001 PRgen: 0x00000003 iSCSI Node: iqn.1994-05.com.redhat:93471b6582,i,0x00023d010000 Key: 0x00000000aa230002 PRgen: 0x00000007 iSCSI Node: iqn.1994-05.com.redhat:21d24bc1b670,i,0x00023d010000 Key: 0x00000000aa230003 PRgen: 0x00000005 No SPC-3 Reservation holder SPC3_PERSISTENT_RESERVATIONS On the cluster node, this is what my fence_scsi log file looks like: Jan 5 17:38:45 fence_scsi: [debug] main::do_register_ignore (node_key=aa230003, dev=/dev/sdf) Jan 5 17:38:45 fence_scsi: [debug] main::do_reset (dev=/dev/sdf, status=0) (cmd=sg_turs /dev/sdf) Jan 5 17:38:45 fence_scsi: [debug] main::do_register_ignore (err=0) (cmd=sg_persist -n -o -I -S aa230003 -d /dev/sdf) Jan 5 17:38:45 fence_scsi: [error] main::get_reservation_key (err=11) (cmd=sg_persist -n -i -r -d /dev/sdf) Running that failing sg_persist gives "PR in: aborted command" A wireshark packet capture of that command shows: === packet generated by sg_persist === iSCSI (SCSI Command) Opcode: SCSI Command (0x01) .0.. .... = I: Queued delivery Flags: 0xc1 1... .... = F: Final PDU in sequence .1.. .... = R: Data will be read from target ..0. .... = W: No data will be written to target .... .001 = Attr: Simple (0x01) TotalAHSLength: 0x00 DataSegmentLength: 0x00000000 LUN: 0000000000000000 InitiatorTaskTag: 0x20000000 ExpectedDataTransferLength: 0x00002000 CmdSN: 0x000000e2 ExpStatSN: 0x4c6ecb1b SCSI CDB Persistent Reserve In [LUN: 0x0000] [Command Set:Direct Access Device (0x00) (Using default commandset)] Opcode: Persistent Reserve In (0x5e) .... 0001 = Service Action: Read Reservation (0x01) Allocation Length: 8192 Control: 0x00 00.. .... 
= Vendor specific: 0x00 ..00 0... = Reserved: 0x00 .... .0.. = NACA: Normal ACA is not set .... ..0. = Obsolete: 0x00 .... ...0 = Obsolete: 0x00 === response from target === iSCSI (SCSI Response) Opcode: SCSI Response (0x21) Flags: 0x80 ...0 .... = o: No overflow of read part of bi-directional command .... 0... = u: No underflow of read part of bi-directional command .... .0.. = O: No residual overflow occurred .... ..0. = U: No residual underflow occurred Response: Command completed at target (0x00) Status: Check Condition (0x02) TotalAHSLength: 0x00 DataSegmentLength: 0x00000062 InitiatorTaskTag: 0x20000000 StatSN: 0x4c6ecb1b ExpCmdSN: 0x000000e3 MaxCmdSN: 0x000000f2 ExpDataSN: 0x00000000 BidiReadResidualCount: 0x00000000 ResidualCount: 0x00000000 Request in: 1 Time from request: 0.000096000 seconds SenseLength: 0x0060 SCSI: SNS Info [LUN: 0x0000] Valid: 0 .111 0000 = SNS Error Type: Current Error (0x70) Filemark: 0, EOM: 0, ILI: 0 .... 1011 = Sense Key: Command Aborted (0x0b) Sense Info: 0x00000000 Additional Sense Length: 0 Command-Specific Information: 00000000 Additional Sense Code+Qualifier: Invalid Field In Cdb (0x2400) Field Replaceable Unit Code: 0x00 0... .... = SKSV: False Sense Key Specific: 000000 (there are more packets in the capture if needed) My target's saveconfig.json looks like this: { "storage_objects": [ { "attributes": { "block_size": 512, "emulate_dpo": 0, "emulate_fua_read": 0, "emulate_fua_write": 1, "emulate_rest_reord": 0, "emulate_tas": 1, "emulate_tpu": 0, "emulate_tpws": 0, "emulate_ua_intlck_ctrl": 0, "emulate_write_cache": 0, "enforce_pr_isids": 1, "is_nonrot": 0, "max_sectors": 1024, "max_unmap_block_desc_count": 0, "max_unmap_lba_count": 0, "optimal_sectors": 1024, "queue_depth": 128, "task_timeout": 0, "unmap_granularity": 0, "unmap_granularity_alignment": 0 }, "dev": "/dev/vg_station11/iscsi- lun01", "name": "store01", "plugin": "block", "wwn": "086b4b49-8736-45e1-a80c-2ddeb8a5a01e" } ], "targets": [ { "fabric": "iscsi", "tpgs": [ { "attributes": { "authentication": 0, "cache_dynamic_acls": 0, "default_cmdsn_depth": 16, "demo_mode_write_protect": 1, "generate_node_acls": 0, "login_timeout": 15, "netif_timeout": 2, "prod_mode_write_protect": 0 }, "luns": [ { "index": 0, "storage_object": "/backstores/block/store01" } ], "node_acls": [ { "attributes": { "dataout_timeout": 3, "dataout_timeout_retries": 5, "default_erl": 0, "nopin_response_timeout": 5, "nopin_timeout": 5, "random_datain_pdu_offsets": 0, "random_datain_seq_offsets": 0, "random_r2t_offsets": 0 }, "chap_mutual_password": "", "chap_mutual_userid": "", "chap_password": "", "chap_userid": "", "mapped_luns": [ { "index": 0, "write_protect": false } ], "node_wwn": "iqn.1994-05.com.redhat:21d24bc1b670", "tcq_depth": 16 }, { "attributes": { "dataout_timeout": 3, "dataout_timeout_retries": 5, "default_erl": 0, "nopin_response_timeout": 5, "nopin_timeout": 5, "random_datain_pdu_offsets": 0, "random_datain_seq_offsets": 0, "random_r2t_offsets": 0 }, "chap_mutual_password": "", "chap_mutual_userid": "", "chap_password": "", "chap_userid": "", "mapped_luns": [ { "index": 0, "write_protect": false } ], "node_wwn": "iqn.1994-05.com.redhat:93471b6582", "tcq_depth": 16 }, { "attributes": { "dataout_timeout": 3, "dataout_timeout_retries": 5, "default_erl": 0, "nopin_response_timeout": 5, "nopin_timeout": 5, "random_datain_pdu_offsets": 0, "random_datain_seq_offsets": 0, "random_r2t_offsets": 0 }, "chap_mutual_password": "", "chap_mutual_userid": "", "chap_password": "", "chap_userid": "", "mapped_luns": [ { 
"index": 0, "write_protect": false } ], "node_wwn": "iqn.1994-05.com.redhat:226e63cf8cf5", "tcq_depth": 16 } ], "portals": [ { "ip_address": "10.100.0.12", "port": 3260 } ], "tag": 1 } ], "wwn": "iqn.2003-01.org.linux-iscsi.station11.x8664:sn.32668e1cd52d" } ] } -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux at alteeve.com Fri Jan 6 00:55:57 2012 From: linux at alteeve.com (Digimer) Date: Thu, 05 Jan 2012 19:55:57 -0500 Subject: [Linux-cluster] Getting a SPC-3 compliant PR enabled iSCSI target up and running? In-Reply-To: References: Message-ID: <4F06469D.2090704@alteeve.com> On 01/05/2012 07:45 PM, Dax Kelson wrote: > Hi, > > I have a testing lab where I'm attempting to get some more experience > with LIO and targetcli. Is there an IRC channel where cluster and/or > target folks hang out? Hi Dax, The two main HA clustering channels on freenode are #linux-cluster and #linux-ha. There is also an HPC channel at #hpc. The first is slightly more red hat focused, but certainly not exclusively RH. The second is in the same way slightly more pacemaker focused, but again, not exclusively in the least. Feel free to drop in and say hi. :) -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "omg my singularity battery is dead again. stupid hawking radiation." - epitron From swhiteho at redhat.com Fri Jan 6 10:00:29 2012 From: swhiteho at redhat.com (Steven Whitehouse) Date: Fri, 06 Jan 2012 10:00:29 +0000 Subject: [Linux-cluster] GFS on CentOS - cman unable to start In-Reply-To: <4F061C11.5030303@ucsc.edu> References: <4F061C11.5030303@ucsc.edu> Message-ID: <1325844029.2703.8.camel@menhir> Hi, On Thu, 2012-01-05 at 13:54 -0800, Wes Modes wrote: > Howdy, y'all. I'm trying to set up GFS in a cluster on CentOS systems > running on vmWare. The GFS FS is on a Dell Equilogic SAN. > > I keep running into the same problem despite many differently-flavored > attempts to set up GFS. The problem comes when I try to start cman, the > cluster management software. > > [root at test01]# service cman start > Starting cluster: > Loading modules... done > Mounting configfs... done > Starting ccsd... done > Starting cman... failed > cman not started: Can't find local node name in cluster.conf > /usr/sbin/cman_tool: aisexec daemon didn't start > [FAILED] > This looks like what it says... whatever the node name is in cluster.conf, it doesn't exist when the name is looked up, or possibly it does exist, but is mapped to the loopback address (it needs to map to an address which is valid cluster-wide) Since your config files look correct, the next thing to check is what the resolver is actually returning. Try (for example) a ping to test01 (you need to specify exactly the same form of the name as is used in cluster.conf) from test02 and see whether it uses the correct ip address, just in case the wrong thing is being returned. Steve. > [root at test01]# tail /var/log/messages > Jan 5 13:39:40 testbench06 ccsd[13194]: Unable to connect to > cluster infrastructure after 1193640 seconds. > Jan 5 13:40:10 testbench06 ccsd[13194]: Unable to connect to > cluster infrastructure after 1193670 seconds. > Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive > Service RELEASE 'subrev 1887 version 0.80.6' > Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) > 2002-2006 MontaVista Software, Inc and contributors. 
> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) > 2006 Red Hat, Inc. > Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive > Service: started and ready to provide service. > Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] local node name > "test01.gdao.ucsc.edu" not found in cluster.conf > Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error reading CCS > info, cannot start > Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error reading > config from CCS > Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive > exiting (reason: could not read the main configuration file). > > Here are details of my configuration: > > [root at test01]# rpm -qa | grep cman > cman-2.0.115-85.el5_7.2 > > [root at test01]# echo $HOSTNAME > test01.gdao.ucsc.edu > > [root at test01]# hostname > test01.gdao.ucsc.edu > > [root at test01]# cat /etc/hosts > # Do not remove the following line, or various programs > # that require network functionality will fail. > 128.114.31.112 test01 test01.gdao test01.gdao.ucsc.edu > 128.114.31.113 test02 test02.gdao test02.gdao.ucsc.edu > 127.0.0.1 localhost.localdomain localhost > ::1 localhost6.localdomain6 localhost6 > > [root at test01]# sestatus > SELinux status: enabled > SELinuxfs mount: /selinux > Current mode: permissive > Mode from config file: permissive > Policy version: 21 > Policy from config file: targeted > > [root at test01]# cat /etc/cluster/cluster.conf > > > > > > > > > > > > > > > > > > > > > > > ipaddr="gdvcenter.ucsc.edu" login="root" passwd="1hateAmazon.com" > vmlogin="root" vmpasswd="esxpass" > port="/vmfs/volumes/49086551-c64fd83c-0401-001e0bcd6848/eagle1/gfs1.vmx"/> > > > > > > > I've seen much discussion of this problem, but no definitive solutions. > Any help you can provide will be welcome. > > Wes Modes > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From wmodes at ucsc.edu Fri Jan 6 19:01:28 2012 From: wmodes at ucsc.edu (Wes Modes) Date: Fri, 06 Jan 2012 11:01:28 -0800 Subject: [Linux-cluster] GFS on CentOS - cman unable to start In-Reply-To: <1325844029.2703.8.camel@menhir> References: <4F061C11.5030303@ucsc.edu> <1325844029.2703.8.camel@menhir> Message-ID: <4F074508.7020701@ucsc.edu> Hi, Steven. I've tried just about every possible combination of hostname and cluster.conf. ping to test01 resolves to 128.114.31.112 ping to test01.gdao.ucsc.edu resolves to 128.114.31.112 It feels like the right thing is being returned. This feels like it might be a quirk (or bug possibly) of cman or openais. There are some old bug reports around this, for example https://bugzilla.redhat.com/show_bug.cgi?id=488565. It sounds like the way that cman reports this error is anything but straightforward. Is there anyone who has encountered this error and found a solution? Wes On 1/6/2012 2:00 AM, Steven Whitehouse wrote: > Hi, > > On Thu, 2012-01-05 at 13:54 -0800, Wes Modes wrote: >> Howdy, y'all. I'm trying to set up GFS in a cluster on CentOS systems >> running on vmWare. The GFS FS is on a Dell Equilogic SAN. >> >> I keep running into the same problem despite many differently-flavored >> attempts to set up GFS. The problem comes when I try to start cman, the >> cluster management software. >> >> [root at test01]# service cman start >> Starting cluster: >> Loading modules... done >> Mounting configfs... done >> Starting ccsd... done >> Starting cman... 
failed >> cman not started: Can't find local node name in cluster.conf >> /usr/sbin/cman_tool: aisexec daemon didn't start >> [FAILED] >> > This looks like what it says... whatever the node name is in > cluster.conf, it doesn't exist when the name is looked up, or possibly > it does exist, but is mapped to the loopback address (it needs to map to > an address which is valid cluster-wide) > > Since your config files look correct, the next thing to check is what > the resolver is actually returning. Try (for example) a ping to test01 > (you need to specify exactly the same form of the name as is used in > cluster.conf) from test02 and see whether it uses the correct ip > address, just in case the wrong thing is being returned. > > Steve. > >> [root at test01]# tail /var/log/messages >> Jan 5 13:39:40 testbench06 ccsd[13194]: Unable to connect to >> cluster infrastructure after 1193640 seconds. >> Jan 5 13:40:10 testbench06 ccsd[13194]: Unable to connect to >> cluster infrastructure after 1193670 seconds. >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive >> Service RELEASE 'subrev 1887 version 0.80.6' >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) >> 2002-2006 MontaVista Software, Inc and contributors. >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) >> 2006 Red Hat, Inc. >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive >> Service: started and ready to provide service. >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] local node name >> "test01.gdao.ucsc.edu" not found in cluster.conf >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error reading CCS >> info, cannot start >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error reading >> config from CCS >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive >> exiting (reason: could not read the main configuration file). >> >> Here are details of my configuration: >> >> [root at test01]# rpm -qa | grep cman >> cman-2.0.115-85.el5_7.2 >> >> [root at test01]# echo $HOSTNAME >> test01.gdao.ucsc.edu >> >> [root at test01]# hostname >> test01.gdao.ucsc.edu >> >> [root at test01]# cat /etc/hosts >> # Do not remove the following line, or various programs >> # that require network functionality will fail. >> 128.114.31.112 test01 test01.gdao test01.gdao.ucsc.edu >> 128.114.31.113 test02 test02.gdao test02.gdao.ucsc.edu >> 127.0.0.1 localhost.localdomain localhost >> ::1 localhost6.localdomain6 localhost6 >> >> [root at test01]# sestatus >> SELinux status: enabled >> SELinuxfs mount: /selinux >> Current mode: permissive >> Mode from config file: permissive >> Policy version: 21 >> Policy from config file: targeted >> >> [root at test01]# cat /etc/cluster/cluster.conf >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> > ipaddr="gdvcenter.ucsc.edu" login="root" passwd="1hateAmazon.com" >> vmlogin="root" vmpasswd="esxpass" >> port="/vmfs/volumes/49086551-c64fd83c-0401-001e0bcd6848/eagle1/gfs1.vmx"/> >> >> >> >> >> >> >> I've seen much discussion of this problem, but no definitive solutions. >> Any help you can provide will be welcome. 
>> >> Wes Modes >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From gustavo.tonello at gmail.com Fri Jan 6 20:05:23 2012 From: gustavo.tonello at gmail.com (Luiz Gustavo Tonello) Date: Fri, 6 Jan 2012 18:05:23 -0200 Subject: [Linux-cluster] GFS on CentOS - cman unable to start In-Reply-To: <4F074508.7020701@ucsc.edu> References: <4F061C11.5030303@ucsc.edu> <1325844029.2703.8.camel@menhir> <4F074508.7020701@ucsc.edu> Message-ID: Hi, This servers is on VMware? At the same host? SElinux is disable? iptables have something? In my environment I had a problem to start GFS2 with servers in differents hosts. To clustering servers, was need migrate one server to the same host of the other, and restart this. I think, one of the problem was because the virtual switchs. To solve, I changed a multicast IP, to use 225.0.0.13 at cluster.conf And add a static route in both, to use default gateway. I don't know if it's correct, but this solve my problem. I hope that help you. Regards. On Fri, Jan 6, 2012 at 5:01 PM, Wes Modes wrote: > Hi, Steven. > > I've tried just about every possible combination of hostname and > cluster.conf. > > ping to test01 resolves to 128.114.31.112 > ping to test01.gdao.ucsc.edu resolves to 128.114.31.112 > > It feels like the right thing is being returned. This feels like it > might be a quirk (or bug possibly) of cman or openais. > > There are some old bug reports around this, for example > https://bugzilla.redhat.com/show_bug.cgi?id=488565. It sounds like the > way that cman reports this error is anything but straightforward. > > Is there anyone who has encountered this error and found a solution? > > Wes > > > On 1/6/2012 2:00 AM, Steven Whitehouse wrote: > > Hi, > > > > On Thu, 2012-01-05 at 13:54 -0800, Wes Modes wrote: > >> Howdy, y'all. I'm trying to set up GFS in a cluster on CentOS systems > >> running on vmWare. The GFS FS is on a Dell Equilogic SAN. > >> > >> I keep running into the same problem despite many differently-flavored > >> attempts to set up GFS. The problem comes when I try to start cman, the > >> cluster management software. > >> > >> [root at test01]# service cman start > >> Starting cluster: > >> Loading modules... done > >> Mounting configfs... done > >> Starting ccsd... done > >> Starting cman... failed > >> cman not started: Can't find local node name in cluster.conf > >> /usr/sbin/cman_tool: aisexec daemon didn't start > >> [FAILED] > >> > > This looks like what it says... whatever the node name is in > > cluster.conf, it doesn't exist when the name is looked up, or possibly > > it does exist, but is mapped to the loopback address (it needs to map to > > an address which is valid cluster-wide) > > > > Since your config files look correct, the next thing to check is what > > the resolver is actually returning. Try (for example) a ping to test01 > > (you need to specify exactly the same form of the name as is used in > > cluster.conf) from test02 and see whether it uses the correct ip > > address, just in case the wrong thing is being returned. > > > > Steve. > > > >> [root at test01]# tail /var/log/messages > >> Jan 5 13:39:40 testbench06 ccsd[13194]: Unable to connect to > >> cluster infrastructure after 1193640 seconds. > >> Jan 5 13:40:10 testbench06 ccsd[13194]: Unable to connect to > >> cluster infrastructure after 1193670 seconds. 
> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive > >> Service RELEASE 'subrev 1887 version 0.80.6' > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) > >> 2002-2006 MontaVista Software, Inc and contributors. > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) > >> 2006 Red Hat, Inc. > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive > >> Service: started and ready to provide service. > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] local node name > >> "test01.gdao.ucsc.edu" not found in cluster.conf > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error reading CCS > >> info, cannot start > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error reading > >> config from CCS > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive > >> exiting (reason: could not read the main configuration file). > >> > >> Here are details of my configuration: > >> > >> [root at test01]# rpm -qa | grep cman > >> cman-2.0.115-85.el5_7.2 > >> > >> [root at test01]# echo $HOSTNAME > >> test01.gdao.ucsc.edu > >> > >> [root at test01]# hostname > >> test01.gdao.ucsc.edu > >> > >> [root at test01]# cat /etc/hosts > >> # Do not remove the following line, or various programs > >> # that require network functionality will fail. > >> 128.114.31.112 test01 test01.gdao test01.gdao.ucsc.edu > >> 128.114.31.113 test02 test02.gdao test02.gdao.ucsc.edu > >> 127.0.0.1 localhost.localdomain localhost > >> ::1 localhost6.localdomain6 localhost6 > >> > >> [root at test01]# sestatus > >> SELinux status: enabled > >> SELinuxfs mount: /selinux > >> Current mode: permissive > >> Mode from config file: permissive > >> Policy version: 21 > >> Policy from config file: targeted > >> > >> [root at test01]# cat /etc/cluster/cluster.conf > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> >> ipaddr="gdvcenter.ucsc.edu" login="root" passwd="1hateAmazon.com" > >> vmlogin="root" vmpasswd="esxpass" > >> > port="/vmfs/volumes/49086551-c64fd83c-0401-001e0bcd6848/eagle1/gfs1.vmx"/> > >> > >> > >> > >> > >> > >> > >> I've seen much discussion of this problem, but no definitive solutions. > >> Any help you can provide will be welcome. > >> > >> Wes Modes > >> > >> -- > >> Linux-cluster mailing list > >> Linux-cluster at redhat.com > >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Luiz Gustavo P Tonello. -------------- next part -------------- An HTML attachment was scrubbed... URL: From wmodes at ucsc.edu Fri Jan 6 20:38:43 2012 From: wmodes at ucsc.edu (Wes Modes) Date: Fri, 06 Jan 2012 12:38:43 -0800 Subject: [Linux-cluster] GFS on CentOS - cman unable to start In-Reply-To: References: <4F061C11.5030303@ucsc.edu> <1325844029.2703.8.camel@menhir> <4F074508.7020701@ucsc.edu> Message-ID: <4F075BD3.3090702@ucsc.edu> These servers are currently on the same host, but may not be in the future. They are in a vm cluster (though honestly, I'm not sure what this means yet). SElinux is on, but disabled. Firewalling through iptables is turned off via system-config-securitylevel There is no line currently in the cluster.conf that deals with multicasting. Any other suggestions? 
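A quick way to see what cman is actually being handed (a sketch of the checks suggested in this thread, not a guaranteed fix) is to run the exact name string used in cluster.conf through the same lookup path on every node:

    NODE=test01.gdao.ucsc.edu      # spelled exactly as in cluster.conf

    uname -n                       # should match, or at least resolve to the same host
    getent hosts "$NODE"           # follows /etc/nsswitch.conf (files, then dns)
    ping -c1 "$NODE"               # should answer from 128.114.31.112, never 127.0.0.1
    ip addr list                   # the address returned above must be bound to a local NIC

If getent returns nothing, or returns a loopback address, openais can report the same "local node name ... not found in cluster.conf" error shown in the logs even when /etc/hosts looks correct, so a stray "127.0.0.1 test01" style entry anywhere in the lookup chain is worth hunting down.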
Wes On 1/6/2012 12:05 PM, Luiz Gustavo Tonello wrote: > Hi, > > This servers is on VMware? At the same host? > SElinux is disable? iptables have something? > > In my environment I had a problem to start GFS2 with servers in > differents hosts. > To clustering servers, was need migrate one server to the same host of > the other, and restart this. > > I think, one of the problem was because the virtual switchs. > To solve, I changed a multicast IP, to use 225.0.0.13 at cluster.conf > > And add a static route in both, to use default gateway. > > I don't know if it's correct, but this solve my problem. > > I hope that help you. > > Regards. > > On Fri, Jan 6, 2012 at 5:01 PM, Wes Modes > wrote: > > Hi, Steven. > > I've tried just about every possible combination of hostname and > cluster.conf. > > ping to test01 resolves to 128.114.31.112 > ping to test01.gdao.ucsc.edu > resolves to 128.114.31.112 > > It feels like the right thing is being returned. This feels like it > might be a quirk (or bug possibly) of cman or openais. > > There are some old bug reports around this, for example > https://bugzilla.redhat.com/show_bug.cgi?id=488565. It sounds > like the > way that cman reports this error is anything but straightforward. > > Is there anyone who has encountered this error and found a solution? > > Wes > > > On 1/6/2012 2:00 AM, Steven Whitehouse wrote: > > Hi, > > > > On Thu, 2012-01-05 at 13:54 -0800, Wes Modes wrote: > >> Howdy, y'all. I'm trying to set up GFS in a cluster on CentOS > systems > >> running on vmWare. The GFS FS is on a Dell Equilogic SAN. > >> > >> I keep running into the same problem despite many > differently-flavored > >> attempts to set up GFS. The problem comes when I try to start > cman, the > >> cluster management software. > >> > >> [root at test01]# service cman start > >> Starting cluster: > >> Loading modules... done > >> Mounting configfs... done > >> Starting ccsd... done > >> Starting cman... failed > >> cman not started: Can't find local node name in cluster.conf > >> /usr/sbin/cman_tool: aisexec daemon didn't start > >> > [FAILED] > >> > > This looks like what it says... whatever the node name is in > > cluster.conf, it doesn't exist when the name is looked up, or > possibly > > it does exist, but is mapped to the loopback address (it needs > to map to > > an address which is valid cluster-wide) > > > > Since your config files look correct, the next thing to check is > what > > the resolver is actually returning. Try (for example) a ping to > test01 > > (you need to specify exactly the same form of the name as is used in > > cluster.conf) from test02 and see whether it uses the correct ip > > address, just in case the wrong thing is being returned. > > > > Steve. > > > >> [root at test01]# tail /var/log/messages > >> Jan 5 13:39:40 testbench06 ccsd[13194]: Unable to connect to > >> cluster infrastructure after 1193640 seconds. > >> Jan 5 13:40:10 testbench06 ccsd[13194]: Unable to connect to > >> cluster infrastructure after 1193670 seconds. > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS > Executive > >> Service RELEASE 'subrev 1887 version 0.80.6' > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] > Copyright (C) > >> 2002-2006 MontaVista Software, Inc and contributors. > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] > Copyright (C) > >> 2006 Red Hat, Inc. > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS > Executive > >> Service: started and ready to provide service. 
> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] local > node name > >> "test01.gdao.ucsc.edu " not found > in cluster.conf > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error > reading CCS > >> info, cannot start > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error > reading > >> config from CCS > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS > Executive > >> exiting (reason: could not read the main configuration file). > >> > >> Here are details of my configuration: > >> > >> [root at test01]# rpm -qa | grep cman > >> cman-2.0.115-85.el5_7.2 > >> > >> [root at test01]# echo $HOSTNAME > >> test01.gdao.ucsc.edu > >> > >> [root at test01]# hostname > >> test01.gdao.ucsc.edu > >> > >> [root at test01]# cat /etc/hosts > >> # Do not remove the following line, or various programs > >> # that require network functionality will fail. > >> 128.114.31.112 test01 test01.gdao test01.gdao.ucsc.edu > > >> 128.114.31.113 test02 test02.gdao test02.gdao.ucsc.edu > > >> 127.0.0.1 localhost.localdomain localhost > >> ::1 localhost6.localdomain6 localhost6 > >> > >> [root at test01]# sestatus > >> SELinux status: enabled > >> SELinuxfs mount: /selinux > >> Current mode: permissive > >> Mode from config file: permissive > >> Policy version: 21 > >> Policy from config file: targeted > >> > >> [root at test01]# cat /etc/cluster/cluster.conf > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> >> ipaddr="gdvcenter.ucsc.edu " > login="root" passwd="1hateAmazon.com" > >> vmlogin="root" vmpasswd="esxpass" > >> > port="/vmfs/volumes/49086551-c64fd83c-0401-001e0bcd6848/eagle1/gfs1.vmx"/> > >> > >> > >> > >> > >> > >> > >> I've seen much discussion of this problem, but no definitive > solutions. > >> Any help you can provide will be welcome. > >> > >> Wes Modes > >> > >> -- > >> Linux-cluster mailing list > >> Linux-cluster at redhat.com > >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > -- > Luiz Gustavo P Tonello. > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From pbruna at it-linux.cl Sat Jan 7 00:30:19 2012 From: pbruna at it-linux.cl (Patricio A. Bruna) Date: Fri, 06 Jan 2012 21:30:19 -0300 (CLST) Subject: [Linux-cluster] GFS on CentOS - cman unable to start In-Reply-To: <4F075BD3.3090702@ucsc.edu> Message-ID: <60c71479-72a3-47aa-a91d-fa1c91c3e9ef@lisa.itlinux.cl> Hi, I think CMAN expect that the names of the cluster nodes be the same returned by the command "uname -n". For what you write your nodes hostnames are: test01.gdao.ucsc.edu and test02.gdao.ucsc.edu, but in cluster.conf you have declared only "test01" and "test02". ------------------------------------ Patricio Bruna V. IT Linux Ltda. www.it-linux.cl Twitter Fono : (+56-2) 333 0578 M?vil: (+56-9) 8899 6618 ----- Mensaje original ----- > These servers are currently on the same host, but may not be in the > future. They are in a vm cluster (though honestly, I'm not sure what > this means yet). > SElinux is on, but disabled. 
> > > > > > >> > > > > > > >> Wes Modes > > > > > > >> > > > > > > >> -- > > > > > > >> Linux-cluster mailing list > > > > > > >> Linux-cluster at redhat.com > > > > > > >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > > > > > > > -- > > > > > > > Linux-cluster mailing list > > > > > > > Linux-cluster at redhat.com > > > > > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > -- > > > > > > Linux-cluster mailing list > > > > > > Linux-cluster at redhat.com > > > > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > -- > > > Luiz Gustavo P Tonello. > > > -- > > > Linux-cluster mailing list Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: zimbra_gold_partner.png Type: image/png Size: 2893 bytes Desc: not available URL: From kevin.stanton at eprize.com Sat Jan 7 02:06:03 2012 From: kevin.stanton at eprize.com (Kevin Stanton) Date: Sat, 7 Jan 2012 02:06:03 +0000 Subject: [Linux-cluster] GFS on CentOS - cman unable to start In-Reply-To: <60c71479-72a3-47aa-a91d-fa1c91c3e9ef@lisa.itlinux.cl> References: <4F075BD3.3090702@ucsc.edu> <60c71479-72a3-47aa-a91d-fa1c91c3e9ef@lisa.itlinux.cl> Message-ID: > Hi, > I think CMAN expect that the names of the cluster nodes be the same returned by the command "uname -n". > For what you write your nodes hostnames are: test01.gdao.ucsc.edu and test02.gdao.ucsc.edu, but in cluster.conf you have declared only "test01" and "test02". I haven't found this to be the case in the past. I actually use a separate short name to reference each node which is different than the hostname of the server itself. All I've ever had to do is make sure it resolves correctly. You can do this either in DNS and/or in /etc/hosts. I have found that it's a good idea to do both in case your DNS server is a virtual machine and is not running for some reason. In that case with /etc/hosts you can still start cman. I would make sure whatever node names you use in the cluster.conf will resolve when you try to ping it from all nodes in the cluster. Also make sure your cluster.conf is in sync between all nodes. -Kevin ________________________________ These servers are currently on the same host, but may not be in the future. They are in a vm cluster (though honestly, I'm not sure what this means yet). SElinux is on, but disabled. Firewalling through iptables is turned off via system-config-securitylevel There is no line currently in the cluster.conf that deals with multicasting. Any other suggestions? Wes On 1/6/2012 12:05 PM, Luiz Gustavo Tonello wrote: Hi, This servers is on VMware? At the same host? SElinux is disable? iptables have something? In my environment I had a problem to start GFS2 with servers in differents hosts. To clustering servers, was need migrate one server to the same host of the other, and restart this. I think, one of the problem was because the virtual switchs. To solve, I changed a multicast IP, to use 225.0.0.13 at cluster.conf And add a static route in both, to use default gateway. I don't know if it's correct, but this solve my problem. I hope that help you. Regards. On Fri, Jan 6, 2012 at 5:01 PM, Wes Modes > wrote: Hi, Steven. 
I've tried just about every possible combination of hostname and cluster.conf. ping to test01 resolves to 128.114.31.112 ping to test01.gdao.ucsc.edu resolves to 128.114.31.112 It feels like the right thing is being returned. This feels like it might be a quirk (or bug possibly) of cman or openais. There are some old bug reports around this, for example https://bugzilla.redhat.com/show_bug.cgi?id=488565. It sounds like the way that cman reports this error is anything but straightforward. Is there anyone who has encountered this error and found a solution? Wes On 1/6/2012 2:00 AM, Steven Whitehouse wrote: > Hi, > > On Thu, 2012-01-05 at 13:54 -0800, Wes Modes wrote: >> Howdy, y'all. I'm trying to set up GFS in a cluster on CentOS systems >> running on vmWare. The GFS FS is on a Dell Equilogic SAN. >> >> I keep running into the same problem despite many differently-flavored >> attempts to set up GFS. The problem comes when I try to start cman, the >> cluster management software. >> >> [root at test01]# service cman start >> Starting cluster: >> Loading modules... done >> Mounting configfs... done >> Starting ccsd... done >> Starting cman... failed >> cman not started: Can't find local node name in cluster.conf >> /usr/sbin/cman_tool: aisexec daemon didn't start >> [FAILED] >> > This looks like what it says... whatever the node name is in > cluster.conf, it doesn't exist when the name is looked up, or possibly > it does exist, but is mapped to the loopback address (it needs to map to > an address which is valid cluster-wide) > > Since your config files look correct, the next thing to check is what > the resolver is actually returning. Try (for example) a ping to test01 > (you need to specify exactly the same form of the name as is used in > cluster.conf) from test02 and see whether it uses the correct ip > address, just in case the wrong thing is being returned. > > Steve. > >> [root at test01]# tail /var/log/messages >> Jan 5 13:39:40 testbench06 ccsd[13194]: Unable to connect to >> cluster infrastructure after 1193640 seconds. >> Jan 5 13:40:10 testbench06 ccsd[13194]: Unable to connect to >> cluster infrastructure after 1193670 seconds. >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive >> Service RELEASE 'subrev 1887 version 0.80.6' >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) >> 2002-2006 MontaVista Software, Inc and contributors. >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) >> 2006 Red Hat, Inc. >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive >> Service: started and ready to provide service. >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] local node name >> "test01.gdao.ucsc.edu" not found in cluster.conf >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error reading CCS >> info, cannot start >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error reading >> config from CCS >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive >> exiting (reason: could not read the main configuration file). >> >> Here are details of my configuration: >> >> [root at test01]# rpm -qa | grep cman >> cman-2.0.115-85.el5_7.2 >> >> [root at test01]# echo $HOSTNAME >> test01.gdao.ucsc.edu >> >> [root at test01]# hostname >> test01.gdao.ucsc.edu >> >> [root at test01]# cat /etc/hosts >> # Do not remove the following line, or various programs >> # that require network functionality will fail. 
>> 128.114.31.112 test01 test01.gdao test01.gdao.ucsc.edu >> 128.114.31.113 test02 test02.gdao test02.gdao.ucsc.edu >> 127.0.0.1 localhost.localdomain localhost >> ::1 localhost6.localdomain6 localhost6 >> >> [root at test01]# sestatus >> SELinux status: enabled >> SELinuxfs mount: /selinux >> Current mode: permissive >> Mode from config file: permissive >> Policy version: 21 >> Policy from config file: targeted >> >> [root at test01]# cat /etc/cluster/cluster.conf >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> > ipaddr="gdvcenter.ucsc.edu" login="root" passwd="1hateAmazon.com" >> vmlogin="root" vmpasswd="esxpass" >> port="/vmfs/volumes/49086551-c64fd83c-0401-001e0bcd6848/eagle1/gfs1.vmx"/> >> >> >> >> >> >> >> I've seen much discussion of this problem, but no definitive solutions. >> Any help you can provide will be welcome. >> >> Wes Modes >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Luiz Gustavo P Tonello. -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From christiankwall-qsa at yahoo.com Sat Jan 7 08:22:12 2012 From: christiankwall-qsa at yahoo.com (Chris Kwall) Date: Sat, 7 Jan 2012 08:22:12 +0000 (GMT) Subject: [Linux-cluster] GFS on CentOS - cman unable to start In-Reply-To: <4F061C11.5030303@ucsc.edu> References: <4F061C11.5030303@ucsc.edu> Message-ID: <1325924532.90505.YahooMailNeo@web29703.mail.ird.yahoo.com> Hi Wes Please excuse my poor english - it's not my mother?language I'm writing in. ----- Urspr?ngliche Message ----- > Howdy, y'all. I'm trying to set up GFS in a cluster on CentOS systems > running on vmWare. The GFS FS is on a Dell Equilogic SAN. > > I keep running into the same problem despite many differently-flavored > attempts to set up GFS. The problem comes when I try to start cman, the > cluster management software. > > ? ? [root at test01]# service cman start > ? ? Starting cluster: > ? ? ? Loading modules... done > ? ? ? Mounting configfs... done > ? ? ? Starting ccsd... done > ? ? ? Starting cman... failed > ? ? cman not started: Can't find local node name in cluster.conf > /usr/sbin/cman_tool: aisexec daemon didn't start > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? [FAILED] I don't think that the cluster is your main-problem. The nodename must "not" present in DNS, but it must be?resolvable by files, ldap whatever. Please verify?that "files" is present at /etc/nsswitch.conf. e.g:?hosts: ? ? ?files dns Did you've check with "ip addr list" that the ip-address matches the same as in /etc/hosts? >? > ? ? [root at test01]# tail /var/log/messages > ? ? Jan? 5 13:39:40 testbench06 ccsd[13194]: Unable to connect to > cluster infrastructure after 1193640 seconds. > ? ? Jan? 5 13:40:10 testbench06 ccsd[13194]: Unable to connect to > cluster infrastructure after 1193670 seconds. > ? ? Jan? 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive > Service RELEASE 'subrev 1887 version 0.80.6' > ? ? Jan? 
5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) > 2002-2006 MontaVista Software, Inc and contributors. > ? ? Jan? 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) > 2006 Red Hat, Inc. > ? ? Jan? 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive > Service: started and ready to provide service. > ? ? Jan? 5 13:40:24 testbench06 openais[3939]: [MAIN ] local node name > "test01.gdao.ucsc.edu" not found in cluster.conf > ? ? Jan? 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error reading CCS > info, cannot start > ? ? Jan? 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error reading > config from CCS > ? ? Jan? 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive > exiting (reason: could not read the main configuration file). > > Here are details of my configuration: > > ? ? [root at test01]# rpm -qa | grep cman > ? ? cman-2.0.115-85.el5_7.2 > > ? ? [root at test01]# echo $HOSTNAME > ? ? test01.gdao.ucsc.edu > > ? ? [root at test01]# hostname > ? ? test01.gdao.ucsc.edu > > ? ? [root at test01]# cat /etc/hosts > ? ? # Do not remove the following line, or various programs > ? ? # that require network functionality will fail. > ? ? 128.114.31.112? ? ? test01 test01.gdao test01.gdao.ucsc.edu > ? ? 128.114.31.113? ? ? test02 test02.gdao test02.gdao.ucsc.edu > ? ? 127.0.0.1? ? ? ? ? ? ? localhost.localdomain localhost > ? ? ::1? ? ? ? ? ? localhost6.localdomain6 localhost6 > > ? ? [root at test01]# sestatus > ? ? SELinux status:? ? ? ? ? ? ? ? enabled > ? ? SELinuxfs mount:? ? ? ? ? ? ? ? /selinux > ? ? Current mode:? ? ? ? ? ? ? ? ? permissive > ? ? Mode from config file:? ? ? ? ? permissive > ? ? Policy version:? ? ? ? ? ? ? ? 21 > ? ? Policy from config file:? ? ? ? targeted > > ? ? [root at test01]# cat /etc/cluster/cluster.conf > ? ? > ? ? > ? ? ? ? post_join_delay="120"/> > ? ? ? ? > ? ? ? ? ? ? votes="1"> > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? > ? ? ? ? ? ? votes="1"> > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? > ? ? ? ? > ? ? ? ? > ? ? ? ? > ? ? ? ? ? ? name="gfs1_ipmi"/> > ? ? ? ? ? ? name="gfs_vmware" > ipaddr="gdvcenter.ucsc.edu" login="root" > passwd="1hateAmazon.com" > vmlogin="root" vmpasswd="esxpass" > port="/vmfs/volumes/49086551-c64fd83c-0401-001e0bcd6848/eagle1/gfs1.vmx"/> > ? ? ? ? > ? ? ? ? > ? ? ? ? > ? ? ? ? > ? ? - Chris From pbruna at it-linux.cl Sat Jan 7 14:11:38 2012 From: pbruna at it-linux.cl (Patricio Bruna) Date: Sat, 7 Jan 2012 11:11:38 -0300 Subject: [Linux-cluster] GFS on CentOS - cman unable to start In-Reply-To: <1325924532.90505.YahooMailNeo@web29703.mail.ird.yahoo.com> References: <4F061C11.5030303@ucsc.edu> <1325924532.90505.YahooMailNeo@web29703.mail.ird.yahoo.com> Message-ID: Hi, The error log is clear --------- Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] local node name "test01.gdao.ucsc.edu" not found in cluster.conf --------- And is related to the email i sent before. Give a try, you don't lose anything. El 07-01-2012, a las 5:22, Chris Kwall escribi?: > Hi Wes > > Please excuse my poor english - it's not my mother language I'm writing in. > > ----- Urspr?ngliche Message ----- > >> Howdy, y'all. I'm trying to set up GFS in a cluster on CentOS systems >> running on vmWare. The GFS FS is on a Dell Equilogic SAN. >> >> I keep running into the same problem despite many differently-flavored >> attempts to set up GFS. 
The problem comes when I try to start cman, the >> cluster management software. >> >> [root at test01]# service cman start >> Starting cluster: >> Loading modules... done >> Mounting configfs... done >> Starting ccsd... done >> Starting cman... failed >> cman not started: Can't find local node name in cluster.conf >> /usr/sbin/cman_tool: aisexec daemon didn't start >> [FAILED] > > > I don't think that the cluster is your main-problem. > The nodename must "not" present in DNS, but it must be resolvable by files, ldap whatever. > > Please verify that "files" is present at /etc/nsswitch.conf. > > e.g: hosts: files dns > > Did you've check with "ip addr list" that the ip-address matches the same as in /etc/hosts? > >> > >> [root at test01]# tail /var/log/messages >> Jan 5 13:39:40 testbench06 ccsd[13194]: Unable to connect to >> cluster infrastructure after 1193640 seconds. >> Jan 5 13:40:10 testbench06 ccsd[13194]: Unable to connect to >> cluster infrastructure after 1193670 seconds. >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive >> Service RELEASE 'subrev 1887 version 0.80.6' >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) >> 2002-2006 MontaVista Software, Inc and contributors. >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) >> 2006 Red Hat, Inc. >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive >> Service: started and ready to provide service. >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] local node name >> "test01.gdao.ucsc.edu" not found in cluster.conf >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error reading CCS >> info, cannot start >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error reading >> config from CCS >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive >> exiting (reason: could not read the main configuration file). >> >> Here are details of my configuration: >> >> [root at test01]# rpm -qa | grep cman >> cman-2.0.115-85.el5_7.2 >> >> [root at test01]# echo $HOSTNAME >> test01.gdao.ucsc.edu >> >> [root at test01]# hostname >> test01.gdao.ucsc.edu >> >> [root at test01]# cat /etc/hosts >> # Do not remove the following line, or various programs >> # that require network functionality will fail. >> 128.114.31.112 test01 test01.gdao test01.gdao.ucsc.edu >> 128.114.31.113 test02 test02.gdao test02.gdao.ucsc.edu >> 127.0.0.1 localhost.localdomain localhost >> ::1 localhost6.localdomain6 localhost6 >> >> [root at test01]# sestatus >> SELinux status: enabled >> SELinuxfs mount: /selinux >> Current mode: permissive >> Mode from config file: permissive >> Policy version: 21 >> Policy from config file: targeted >> >> [root at test01]# cat /etc/cluster/cluster.conf >> >> >> > post_join_delay="120"/> >> >> > votes="1"> >> >> >> >> >> >> >> > votes="1"> >> >> >> >> >> >> >> >> >> >> > name="gfs1_ipmi"/> >> > name="gfs_vmware" >> ipaddr="gdvcenter.ucsc.edu" login="root" >> passwd="1hateAmazon.com" >> vmlogin="root" vmpasswd="esxpass" >> port="/vmfs/volumes/49086551-c64fd83c-0401-001e0bcd6848/eagle1/gfs1.vmx"/> >> >> >> >> >> > > > - Chris > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster ------------------------------------ Patricio Bruna V. IT Linux Ltda. Fono : (+56-2) 333 0578 M?vil: (+56-9) 8899 6618 Twitter: http://twitter.com/ITLinux -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From td3201 at gmail.com Sun Jan 8 23:39:37 2012 From: td3201 at gmail.com (Terry) Date: Sun, 8 Jan 2012 17:39:37 -0600 Subject: [Linux-cluster] centos5 to RHEL6 migration Message-ID: Hello, I am trying to gently migrate a 3-node cluster from centos5 to RHEL6. I have already taken one of the three nodes out and rebuilt it. My thinking is to build a new cluster from the RHEL node but want to run it by everyone here first. The cluster consists of a handful of NFS volumes and a PostgreSQL database. I am not concerned about the database. I am moving to a new version and will simply migrate that. I am more concerned about all of the ext4 clustered LVM volumes. In this process, if I shut down the old cluster, what's the process to force the new node to read those volumes in to the new single-node cluster? A pvscan on the new server shows all of the volumes fine. I am concerned there's something else I'll have to do here to begin mounting these volumes in the new cluster. [root at server ~]# pvdisplay Skipping clustered volume group vg_data01b Thanks! -------------- next part -------------- An HTML attachment was scrubbed... URL: From michael.allen at visi.com Mon Jan 9 01:36:10 2012 From: michael.allen at visi.com (Michael Allen) Date: Sun, 8 Jan 2012 19:36:10 -0600 Subject: [Linux-cluster] centos5 to RHEL6 migration In-Reply-To: References: Message-ID: <20120108193610.2404b425@godelsrevenge.induswx.com> On Sun, 8 Jan 2012 17:39:37 -0600 Terry wrote: > Hello, > > I am trying to gently migrate a 3-node cluster from centos5 to > RHEL6. I have already taken one of the three nodes out and rebuilt > it. My thinking is to build a new cluster from the RHEL node but want > to run it by everyone here first. The cluster consists of a handful > of NFS volumes and a PostgreSQL database. I am not concerned about > the database. I am moving to a new version and will simply migrate > that. I am more concerned about all of the ext4 clustered LVM > volumes. In this process, if I shut down the old cluster, what's the > process to force the new node to read those volumes in to the new > single-node cluster? A pvscan on the new server shows all of the > volumes fine. I am concerned there's something else I'll have to do > here to begin mounting these volumes in the new cluster. [root at server > ~]# pvdisplay Skipping clustered volume group vg_data01b > > Thanks! This message comes at a good time for me, too, since I am considering the same thing. I have 10 nodes but it appears that a change to CentOS 6.xx is about due. Michael Allen From linux at alteeve.com Mon Jan 9 02:38:34 2012 From: linux at alteeve.com (Digimer) Date: Sun, 08 Jan 2012 21:38:34 -0500 Subject: [Linux-cluster] centos5 to RHEL6 migration In-Reply-To: References: Message-ID: <4F0A532A.2000202@alteeve.com> On 01/08/2012 06:39 PM, Terry wrote: > Hello, > > I am trying to gently migrate a 3-node cluster from centos5 to RHEL6. I > have already taken one of the three nodes out and rebuilt it. My > thinking is to build a new cluster from the RHEL node but want to run it > by everyone here first. The cluster consists of a handful of NFS volumes > and a PostgreSQL database. I am not concerned about the database. I am > moving to a new version and will simply migrate that. I am more > concerned about all of the ext4 clustered LVM volumes. In this process, > if I shut down the old cluster, what's the process to force the new node > to read those volumes in to the new single-node cluster? A pvscan on > the new server shows all of the volumes fine. 
I am concerned there's > something else I'll have to do here to begin mounting these volumes in > the new cluster. > [root at server ~]# pvdisplay > Skipping clustered volume group vg_data01b > > Thanks! Technically yes, practically no. Or rather, not without a lot of testing first. I've never done this, but here are some pointers; upgrading Set this if you are performing a rolling upgrade of the cluster between major releases. disallowed Set this to 1 enable cman's Disallowed mode. This is usually only needed for backwards compatibility. Enable compatibility with cluster2 nodes. groupd(8) There may be some other things you need to do as well. Please be sure to do proper testing and, if you have the budget, hire Red Hat to advise on this process. Also, please report back your results. It would help me help others in the same boat later. :) -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "omg my singularity battery is dead again. stupid hawking radiation." - epitron From td3201 at gmail.com Mon Jan 9 03:31:38 2012 From: td3201 at gmail.com (Terry) Date: Sun, 8 Jan 2012 21:31:38 -0600 Subject: [Linux-cluster] centos5 to RHEL6 migration In-Reply-To: <4F0A532A.2000202@alteeve.com> References: <4F0A532A.2000202@alteeve.com> Message-ID: If it's not practical, am I left with building a new cluster from scratch? On Sun, Jan 8, 2012 at 8:38 PM, Digimer wrote: > On 01/08/2012 06:39 PM, Terry wrote: > > Hello, > > > > I am trying to gently migrate a 3-node cluster from centos5 to RHEL6. I > > have already taken one of the three nodes out and rebuilt it. My > > thinking is to build a new cluster from the RHEL node but want to run it > > by everyone here first. The cluster consists of a handful of NFS volumes > > and a PostgreSQL database. I am not concerned about the database. I am > > moving to a new version and will simply migrate that. I am more > > concerned about all of the ext4 clustered LVM volumes. In this process, > > if I shut down the old cluster, what's the process to force the new node > > to read those volumes in to the new single-node cluster? A pvscan on > > the new server shows all of the volumes fine. I am concerned there's > > something else I'll have to do here to begin mounting these volumes in > > the new cluster. > > [root at server ~]# pvdisplay > > Skipping clustered volume group vg_data01b > > > > Thanks! > > Technically yes, practically no. Or rather, not without a lot of > testing first. > > I've never done this, but here are some pointers; > > > > upgrading > Set this if you are performing a rolling upgrade of the cluster > between major releases. > > disallowed > Set this to 1 enable cman's Disallowed mode. This is usually > only needed for backwards compatibility. > > > > Enable compatibility with cluster2 nodes. groupd(8) > > There may be some other things you need to do as well. Please be sure > to do proper testing and, if you have the budget, hire Red Hat to advise > on this process. Also, please report back your results. It would help me > help others in the same boat later. :) > > -- > Digimer > E-Mail: digimer at alteeve.com > Freenode handle: digimer > Papers and Projects: http://alteeve.com > Node Assassin: http://nodeassassin.org > "omg my singularity battery is dead again. > stupid hawking radiation." - epitron > -------------- next part -------------- An HTML attachment was scrubbed... 
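Two sketches for this migration sub-thread, with the same caveat Digimer gives above (untested, so verify against the RHEL 6 documentation before relying on either). The "upgrading", "disallowed" and groupd compatibility pointers are cman/cluster.conf settings and would be written roughly like this; the attribute values shown here are assumptions:

    <cman upgrading="yes" disallowed="1"/>    <!-- rolling-upgrade / Disallowed mode flags quoted above -->
    <group groupd_compat="1"/>                <!-- cluster2-compatible groupd behaviour -->

Fajar's reply a little further down suggests clearing the clustered flag so a single, non-clustered node can activate the volume group. As a sketch only (the logical volume name lv_example is made up, and the locking_type override is the usual workaround while clvmd is not running):

    vgchange -cn vg_data01b --config 'global {locking_type = 0}'   # clear the clustered flag
    vgchange -ay vg_data01b                                        # activate its LVs locally
    mount /dev/vg_data01b/lv_example /mnt/data                     # ext4 mounts as usual; only a GFS/GFS2
                                                                   # volume would need -o lock_nolock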
URL: From list at fajar.net Mon Jan 9 04:01:06 2012 From: list at fajar.net (Fajar A. Nugraha) Date: Mon, 9 Jan 2012 11:01:06 +0700 Subject: [Linux-cluster] centos5 to RHEL6 migration In-Reply-To: References: <4F0A532A.2000202@alteeve.com> Message-ID: On Mon, Jan 9, 2012 at 10:31 AM, Terry wrote: > If it's not practical, am I left with building a new cluster from scratch? I'm pretty sure if your ONLY problem is "Skipping clustered volume group vg_data01b", you can just turn off cluster flag with "vgchange -cn", then use "-o lock_nolock" to mount it on a SINGLE (i.e. not cluster) node. That was your original question, wasn't it? As for upgrading, I haven't tested it. You should be able to use your old storage, but just create other settings from scratch. Like Digimer said, be sure to do proper testing :) -- Fajar From wmodes at ucsc.edu Mon Jan 9 04:03:18 2012 From: wmodes at ucsc.edu (Wes Modes) Date: Sun, 08 Jan 2012 20:03:18 -0800 Subject: [Linux-cluster] GFS on CentOS - cman unable to start In-Reply-To: References: <4F075BD3.3090702@ucsc.edu> <60c71479-72a3-47aa-a91d-fa1c91c3e9ef@lisa.itlinux.cl> Message-ID: <4F0A6706.6090308@ucsc.edu> The behavior of cman's resolving of cluster node names is less than clear, as per the RHEL bugzilla report. The hostname and cluster.conf match, as does /etc/hosts and uname -n. The short names and FQDN ping. I believe all the node cluster.conf are in sync, and all nodes are accessible to each other using either short or long names. You'll have to trust that I've tried everything obvious, and every possible combination of FQDN and short names in cluster.conf and hostname. That said, it is totally possible I missed something obvious. I suspect, there is something else going on and I don't know how to get at it. Wes On 1/6/2012 6:06 PM, Kevin Stanton wrote: > > > Hi, > > > I think CMAN expect that the names of the cluster nodes be the same > returned by the command "uname -n". > > > For what you write your nodes hostnames are: test01.gdao.ucsc.edu > and test02.gdao.ucsc.edu, but in cluster.conf you have declared only > "test01" and "test02". > > > > I haven't found this to be the case in the past. I actually use a > separate short name to reference each node which is different than the > hostname of the server itself. All I've ever had to do is make sure > it resolves correctly. You can do this either in DNS and/or in > /etc/hosts. I have found that it's a good idea to do both in case > your DNS server is a virtual machine and is not running for some > reason. In that case with /etc/hosts you can still start cman. > > > > I would make sure whatever node names you use in the cluster.conf will > resolve when you try to ping it from all nodes in the cluster. Also > make sure your cluster.conf is in sync between all nodes. > > > > -Kevin > > > > > > ------------------------------------------------------------------------ > > These servers are currently on the same host, but may not be in > the future. They are in a vm cluster (though honestly, I'm not > sure what this means yet). > > SElinux is on, but disabled. > Firewalling through iptables is turned off via > system-config-securitylevel > > There is no line currently in the cluster.conf that deals with > multicasting. > > Any other suggestions? > > Wes > > On 1/6/2012 12:05 PM, Luiz Gustavo Tonello wrote: > > Hi, > > > > This servers is on VMware? At the same host? > > SElinux is disable? iptables have something? > > > > In my environment I had a problem to start GFS2 with servers in > differents hosts. 
> > To clustering servers, was need migrate one server to the same > host of the other, and restart this. > > > > I think, one of the problem was because the virtual switchs. > > To solve, I changed a multicast IP, to use 225.0.0.13 at cluster.conf > > > > And add a static route in both, to use default gateway. > > > > I don't know if it's correct, but this solve my problem. > > > > I hope that help you. > > > > Regards. > > > > On Fri, Jan 6, 2012 at 5:01 PM, Wes Modes > wrote: > > Hi, Steven. > > I've tried just about every possible combination of hostname and > cluster.conf. > > ping to test01 resolves to 128.114.31.112 > ping to test01.gdao.ucsc.edu > resolves to 128.114.31.112 > > It feels like the right thing is being returned. This feels like it > might be a quirk (or bug possibly) of cman or openais. > > There are some old bug reports around this, for example > https://bugzilla.redhat.com/show_bug.cgi?id=488565. It sounds > like the > way that cman reports this error is anything but straightforward. > > Is there anyone who has encountered this error and found a solution? > > Wes > > > > On 1/6/2012 2:00 AM, Steven Whitehouse wrote: > > Hi, > > > > On Thu, 2012-01-05 at 13:54 -0800, Wes Modes wrote: > >> Howdy, y'all. I'm trying to set up GFS in a cluster on CentOS > systems > >> running on vmWare. The GFS FS is on a Dell Equilogic SAN. > >> > >> I keep running into the same problem despite many > differently-flavored > >> attempts to set up GFS. The problem comes when I try to start > cman, the > >> cluster management software. > >> > >> [root at test01]# service cman start > >> Starting cluster: > >> Loading modules... done > >> Mounting configfs... done > >> Starting ccsd... done > >> Starting cman... failed > >> cman not started: Can't find local node name in cluster.conf > >> /usr/sbin/cman_tool: aisexec daemon didn't start > >> > [FAILED] > >> > > This looks like what it says... whatever the node name is in > > cluster.conf, it doesn't exist when the name is looked up, or > possibly > > it does exist, but is mapped to the loopback address (it needs to > map to > > an address which is valid cluster-wide) > > > > Since your config files look correct, the next thing to check is what > > the resolver is actually returning. Try (for example) a ping to > test01 > > (you need to specify exactly the same form of the name as is used in > > cluster.conf) from test02 and see whether it uses the correct ip > > address, just in case the wrong thing is being returned. > > > > Steve. > > > >> [root at test01]# tail /var/log/messages > >> Jan 5 13:39:40 testbench06 ccsd[13194]: Unable to connect to > >> cluster infrastructure after 1193640 seconds. > >> Jan 5 13:40:10 testbench06 ccsd[13194]: Unable to connect to > >> cluster infrastructure after 1193670 seconds. > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive > >> Service RELEASE 'subrev 1887 version 0.80.6' > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) > >> 2002-2006 MontaVista Software, Inc and contributors. > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) > >> 2006 Red Hat, Inc. > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive > >> Service: started and ready to provide service. 
> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] local > node name > >> "test01.gdao.ucsc.edu " not found > in cluster.conf > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error > reading CCS > >> info, cannot start > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error reading > >> config from CCS > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive > >> exiting (reason: could not read the main configuration file). > >> > >> Here are details of my configuration: > >> > >> [root at test01]# rpm -qa | grep cman > >> cman-2.0.115-85.el5_7.2 > >> > >> [root at test01]# echo $HOSTNAME > >> test01.gdao.ucsc.edu > >> > >> [root at test01]# hostname > >> test01.gdao.ucsc.edu > >> > >> [root at test01]# cat /etc/hosts > >> # Do not remove the following line, or various programs > >> # that require network functionality will fail. > >> 128.114.31.112 test01 test01.gdao test01.gdao.ucsc.edu > > >> 128.114.31.113 test02 test02.gdao test02.gdao.ucsc.edu > > >> 127.0.0.1 localhost.localdomain localhost > >> ::1 localhost6.localdomain6 localhost6 > >> > >> [root at test01]# sestatus > >> SELinux status: enabled > >> SELinuxfs mount: /selinux > >> Current mode: permissive > >> Mode from config file: permissive > >> Policy version: 21 > >> Policy from config file: targeted > >> > >> [root at test01]# cat /etc/cluster/cluster.conf > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> >> ipaddr="gdvcenter.ucsc.edu " > login="root" passwd="1hateAmazon.com" > >> vmlogin="root" vmpasswd="esxpass" > >> > port="/vmfs/volumes/49086551-c64fd83c-0401-001e0bcd6848/eagle1/gfs1.vmx"/> > >> > >> > >> > >> > >> > >> > >> I've seen much discussion of this problem, but no definitive > solutions. > >> Any help you can provide will be welcome. > >> > >> Wes Modes > >> > >> -- > >> Linux-cluster mailing list > >> Linux-cluster at redhat.com > >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > -- > Luiz Gustavo P Tonello. > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From sathyanarayanan.varadharajan at precisionit.co.in Mon Jan 9 04:37:45 2012 From: sathyanarayanan.varadharajan at precisionit.co.in (SATHYA - IT) Date: Mon, 9 Jan 2012 10:07:45 +0530 Subject: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment Message-ID: <003101ccce88$71ddd7e0$559987a0$@precisionit.co.in> Hi, We had configured RHEL 6.2 - 2 node Cluster with clvmd + gfs2 + cman + smb. We have 4 nic cards in the servers where 2 been configured in bonding for heartbeat (with mode=1) and 2 been configured in bonding for public access (with mode=0). Heartbeat network is connected directly from server to server. Once in 3 - 4 days, the heartbeat goes down and comes up automatically in 2 to 3 seconds. Not sure why this down and up occurs. 
Because of this in cluster, one system is got fenced by other. Is there anyway where we can increase the time to wait for the cluster to wait for heartbeat. Ie if the cluster can wait for 5-6 seconds even the heartbeat fails for 5-6 seconds the node won't get fenced. Kindly advise. Thanks Sathya Narayanan V Solution Architect This communication may contain confidential information. If you are not the intended recipient it may be unlawful for you to read, copy, distribute, disclose or otherwise use the information contained within this communication.. Errors and Omissions may occur in the contents of this Email arising out of or in connection with data transmission, network malfunction or failure, machine or software error, malfunction, or operator errors by the person who is sending the email. Precision Group accepts no responsibility for any such errors or omissions. The information, views and comments within this communication are those of the individual and not necessarily those of Precision Group. All email that is sent from/to Precision Group is scanned for the presence of computer viruses, security issues and inappropriate content. However, it is the recipient's responsibility to check any attachments for viruses before use. -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux at alteeve.com Mon Jan 9 04:51:54 2012 From: linux at alteeve.com (Digimer) Date: Sun, 08 Jan 2012 23:51:54 -0500 Subject: [Linux-cluster] centos5 to RHEL6 migration In-Reply-To: References: <4F0A532A.2000202@alteeve.com> Message-ID: <4F0A726A.6050304@alteeve.com> On 01/08/2012 10:31 PM, Terry wrote: > If it's not practical, am I left with building a new cluster from scratch? I don't know enough to say either way. I'd strongly suggest talking to Red hat, as you have a subscription, and ask them for advice. It might cost a bit, but I am certain it will save you trouble and money in the long wrong. Alternatively, use some spare machines to mock-up the current cluster and then test-upgrade. It might work flawlessly, I genuinely don't know. I do know that a good attempt was made at on-wire compatibility, I just don't know if it's actually been used in production, so I was erring on the side of caution. -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "omg my singularity battery is dead again. stupid hawking radiation." - epitron From linux at alteeve.com Mon Jan 9 04:56:38 2012 From: linux at alteeve.com (Digimer) Date: Sun, 08 Jan 2012 23:56:38 -0500 Subject: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment In-Reply-To: <003101ccce88$71ddd7e0$559987a0$@precisionit.co.in> References: <003101ccce88$71ddd7e0$559987a0$@precisionit.co.in> Message-ID: <4F0A7386.5000303@alteeve.com> On 01/08/2012 11:37 PM, SATHYA - IT wrote: > Hi, > > We had configured RHEL 6.2 - 2 node Cluster with clvmd + gfs2 + cman + > smb. We have 4 nic cards in the servers where 2 been configured in > bonding for heartbeat (with mode=1) and 2 been configured in bonding for > public access (with mode=0). Heartbeat network is connected directly > from server to server. Once in 3 ? 4 days, the heartbeat goes down and > comes up automatically in 2 to 3 seconds. Not sure why this down and up > occurs. Because of this in cluster, one system is got fenced by other. > > Is there anyway where we can increase the time to wait for the cluster > to wait for heartbeat. 
Ie if the cluster can wait for 5-6 seconds even > the heartbeat fails for 5-6 seconds the node won?t get fenced. Kindly > advise. "mode=1" is Active/Passive and I use it extensively with no trouble. I'm not sure where "heartbeat" comes from, but I might be missing the obvious. Can you share your bond and eth configuration files here please (as plain-text attachments)? Secondly, make sure that you are actually using that interface/bond. Run 'gethostip -d ', where "nodename" is what you set in cluster.conf. The returned IP will be the one used by the cluster. Back to the bond; A failed link would nearly instantly transfer to the backup link. So if you are going down for 2~3 seconds on both links, something else is happening. Look at syslog on both nodes around the time the last fence happened and see what logs are written just prior to the fence. That might give you a clue. -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "omg my singularity battery is dead again. stupid hawking radiation." - epitron From sathyanarayanan.varadharajan at precisionit.co.in Mon Jan 9 05:12:43 2012 From: sathyanarayanan.varadharajan at precisionit.co.in (SATHYA - IT) Date: Mon, 9 Jan 2012 10:42:43 +0530 Subject: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment Message-ID: <003901ccce8d$55a9be40$00fd3ac0$@precisionit.co.in> Hi, Thanks for your mail. I herewith attaching the bonding and eth configuration files. And on the /var/log/messages during the fence operation we can get the logs updated related to network only in the node which fences the other. Server 1 NIC 1: (eth2) /etc/sysconfig/network-scripts/ifcfg-eth2 DEVICE="eth2" HWADDR="3C:D9:2B:04:2D:7A" NM_CONTROLLED="no" ONBOOT="yes" MASTER=bond0 SLAVE=yes USERCTL=no BOOTPROTO=none Server 1 NIC 4: (eth5) /etc/sysconfig/network-scripts/ifcfg-eth5 DEVICE="eth5" HWADDR="3C:D9:2B:04:2D:80" NM_CONTROLLED="no" ONBOOT="yes" MASTER=bond0 SLAVE=yes USERCTL=no BOOTPROTO=none Server 1 NIC 2: (eth3) /etc/sysconfig/network-scripts/ifcfg-eth3 DEVICE="eth3" HWADDR="3C:D9:2B:04:2D:7C" NM_CONTROLLED="no" ONBOOT="yes" MASTER=bond1 SLAVE=yes USERCTL=no BOOTPROTO=none Server 1 NIC 3: /etc/sysconfig/network-scripts/ifcfg-eth4 DEVICE="eth4" HWADDR="3C:D9:2B:04:2D:7E" NM_CONTROLLED="no" ONBOOT="yes" MASTER=bond1 SLAVE=yes USERCTL=no BOOTPROTO=none Server 1 Bond0: (Public Access) /etc/sysconfig/network-scripts/ifcfg-bond0 DEVICE=bond0 BOOTPROTO=static IPADDR=192.168.129.10 NETMASK=255.255.255.0 GATEWAY=192.168.129.1 USERCTL=no ONBOOT=yes BONDING_OPTS="miimon=100 mode=0" Server 1 Bond1: (Heartbeat) /etc/sysconfig/network-scripts/ifcfg-bond1 DEVICE=bond1 BOOTPROTO=static IPADDR=10.0.0.10 NETMASK=255.0.0.0 USERCTL=no ONBOOT=yes BONDING_OPTS="miimon=100 mode=1" On the log messages, Jan 3 14:46:07 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Down Jan 3 14:46:07 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Down Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: now running without any active interface ! 
Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it Jan 3 14:46:10 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 3 14:46:10 filesrv2 kernel: bond1: link status definitely up for interface eth3, 1000 Mbps full duplex. Jan 3 14:46:10 filesrv2 kernel: bonding: bond1: making interface eth3 the new active one. Jan 3 14:46:10 filesrv2 kernel: bonding: bond1: first active interface up! Jan 3 14:46:10 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 3 14:46:10 filesrv2 kernel: bond1: link status definitely up for interface eth4, 1000 Mbps full duplex. Thanks Sathya Narayanan V Solution Architect -----Original Message----- From: Digimer [mailto:linux at alteeve.com] Sent: Monday, January 09, 2012 10:27 AM To: linux clustering Cc: SATHYA - IT Subject: SPAM - Re: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment On 01/08/2012 11:37 PM, SATHYA - IT wrote: > Hi, > > We had configured RHEL 6.2 - 2 node Cluster with clvmd + gfs2 + cman + > smb. We have 4 nic cards in the servers where 2 been configured in > bonding for heartbeat (with mode=1) and 2 been configured in bonding > for public access (with mode=0). Heartbeat network is connected > directly from server to server. Once in 3 - 4 days, the heartbeat goes > down and comes up automatically in 2 to 3 seconds. Not sure why this > down and up occurs. Because of this in cluster, one system is got fenced by other. > > Is there anyway where we can increase the time to wait for the cluster > to wait for heartbeat. Ie if the cluster can wait for 5-6 seconds even > the heartbeat fails for 5-6 seconds the node won't get fenced. Kindly > advise. "mode=1" is Active/Passive and I use it extensively with no trouble. I'm not sure where "heartbeat" comes from, but I might be missing the obvious. Can you share your bond and eth configuration files here please (as plain-text attachments)? Secondly, make sure that you are actually using that interface/bond. Run 'gethostip -d ', where "nodename" is what you set in cluster.conf. The returned IP will be the one used by the cluster. Back to the bond; A failed link would nearly instantly transfer to the backup link. So if you are going down for 2~3 seconds on both links, something else is happening. Look at syslog on both nodes around the time the last fence happened and see what logs are written just prior to the fence. That might give you a clue. -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "omg my singularity battery is dead again. stupid hawking radiation." - epitron This communication may contain confidential information. If you are not the intended recipient it may be unlawful for you to read, copy, distribute, disclose or otherwise use the information contained within this communication.. Errors and Omissions may occur in the contents of this Email arising out of or in connection with data transmission, network malfunction or failure, machine or software error, malfunction, or operator errors by the person who is sending the email. Precision Group accepts no responsibility for any such errors or omissions. The information, views and comments within this communication are those of the individual and not necessarily those of Precision Group. 
All email that is sent from/to Precision Group is scanned for the presence of computer viruses, security issues and inappropriate content. However, it is the recipient's responsibility to check any attachments for viruses before use. From linux at alteeve.com Mon Jan 9 05:24:10 2012 From: linux at alteeve.com (Digimer) Date: Mon, 09 Jan 2012 00:24:10 -0500 Subject: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment In-Reply-To: <003901ccce8d$55a9be40$00fd3ac0$@precisionit.co.in> References: <003901ccce8d$55a9be40$00fd3ac0$@precisionit.co.in> Message-ID: <4F0A79FA.7080408@alteeve.com> On 01/09/2012 12:12 AM, SATHYA - IT wrote: > Hi, > > Thanks for your mail. I herewith attaching the bonding and eth configuration > files. And on the /var/log/messages during the fence operation we can get > the logs updated related to network only in the node which fences the other. What IPs do the node names resolve to? I'm assuming bond1, but I would like you to confirm. > Server 1 Bond1: (Heartbeat) I'm still not sure what you mean by heartbeat. Do you mean the channel corosync is using? > On the log messages, > > Jan 3 14:46:07 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is > Down > Jan 3 14:46:07 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is > Down This tells me both links dropped at the same time. These messages are coming from below the cluster though. > Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: link status definitely down > for interface eth3, disabling it > Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: now running without any > active interface ! > Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: link status definitely down > for interface eth4, disabling it With both of the bond's NICs down, the bond itself is going to drop. > Jan 3 14:46:10 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is > Up, 1000 Mbps full duplex, receive & transmit flow control ON > Jan 3 14:46:10 filesrv2 kernel: bond1: link status definitely up for > interface eth3, 1000 Mbps full duplex. > Jan 3 14:46:10 filesrv2 kernel: bonding: bond1: making interface eth3 the > new active one. > Jan 3 14:46:10 filesrv2 kernel: bonding: bond1: first active interface up! > Jan 3 14:46:10 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is > Up, 1000 Mbps full duplex, receive & transmit flow control ON > Jan 3 14:46:10 filesrv2 kernel: bond1: link status definitely up for > interface eth4, 1000 Mbps full duplex. I don't see any messages about the cluster in here, which I assume you cropped out. In this case, it doesn't matter as the problem is well below the cluster, but in general, please provide more data, not less. You never know what might help. :) Anyway, you need to sort out what is happening here. Bad drivers? Bad card (assuming dual-port)? Something is taking the NICs down, as though they were actually unplugged. If you can run them through a switch, if might help isolate which node is causing the problems as then you would only see one node record "NIC Copper Link is Down" and can then focus on just that node. -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "omg my singularity battery is dead again. stupid hawking radiation." 
- epitron From sathyanarayanan.varadharajan at precisionit.co.in Mon Jan 9 05:51:08 2012 From: sathyanarayanan.varadharajan at precisionit.co.in (SATHYA - IT) Date: Mon, 9 Jan 2012 11:21:08 +0530 Subject: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment Message-ID: <004001ccce92$b0bfd210$123f7630$@precisionit.co.in> Hi, Herewith attaching the /var/log/messages of both the servers. Yesterday (08th Jan) one of the server got fenced by other around 10:48 AM. I am also attaching the cluster.conf file for your reference. On the related note, related to heartbeat - I am referring the channel used by corosync. And the name which has been configured in cluster.conf file resolves with bond1 only. Related to the network card, we are using 2 dual port card where we configured 1 port from each for bond0 and 1 port from the other for bond1. So it doesn't seems be a network card related issue. Moreover, we are not having any errors related to bond0. Thanks Sathya Narayanan V Solution Architect -----Original Message----- From: Digimer [mailto:linux at alteeve.com] Sent: Monday, January 09, 2012 10:54 AM To: SATHYA - IT Cc: 'linux clustering' Subject: SPAM - Re: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment On 01/09/2012 12:12 AM, SATHYA - IT wrote: > Hi, > > Thanks for your mail. I herewith attaching the bonding and eth > configuration files. And on the /var/log/messages during the fence > operation we can get the logs updated related to network only in the node which fences the other. What IPs do the node names resolve to? I'm assuming bond1, but I would like you to confirm. > Server 1 Bond1: (Heartbeat) I'm still not sure what you mean by heartbeat. Do you mean the channel corosync is using? > On the log messages, > > Jan 3 14:46:07 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper > Link is Down Jan 3 14:46:07 filesrv2 kernel: bnx2 0000:03:00.1: eth3: > NIC Copper Link is Down This tells me both links dropped at the same time. These messages are coming from below the cluster though. > Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: link status > definitely down for interface eth3, disabling it Jan 3 14:46:07 > filesrv2 kernel: bonding: bond1: now running without any active > interface ! > Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: link status > definitely down for interface eth4, disabling it With both of the bond's NICs down, the bond itself is going to drop. > Jan 3 14:46:10 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper > Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON > Jan 3 14:46:10 filesrv2 kernel: bond1: link status definitely up for > interface eth3, 1000 Mbps full duplex. > Jan 3 14:46:10 filesrv2 kernel: bonding: bond1: making interface eth3 > the new active one. > Jan 3 14:46:10 filesrv2 kernel: bonding: bond1: first active interface up! > Jan 3 14:46:10 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper > Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON > Jan 3 14:46:10 filesrv2 kernel: bond1: link status definitely up for > interface eth4, 1000 Mbps full duplex. I don't see any messages about the cluster in here, which I assume you cropped out. In this case, it doesn't matter as the problem is well below the cluster, but in general, please provide more data, not less. You never know what might help. :) Anyway, you need to sort out what is happening here. Bad drivers? Bad card (assuming dual-port)? Something is taking the NICs down, as though they were actually unplugged. 
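A hedged aside on narrowing down "something is taking the NICs down": the bonding and NIC drivers usually leave enough state behind to separate a cabling/peer problem from a driver reset. Interface names below follow the configs quoted above and may differ on other hosts.

cat /proc/net/bonding/bond1                  # active slave, MII status, per-slave link failure counters
ethtool eth3                                 # link, speed and duplex as the bnx2 driver currently sees them
ethtool -S eth3 | grep -iE 'err|drop|crc'    # driver statistics; climbing error counters hint at cabling/hardware
dmesg | grep -iE 'bnx2|bond1' | tail -50     # driver messages around the time of the last flap

Since miimon=100 is already set on both bonds, two slaves that sit on two different physical cards dropping in the same second on a direct server-to-server link points more towards the peer's ports resetting (reboot, power management, driver reload) or towards cabling than towards a single bad NIC, though that is only an inference from the logs shown so far.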
If you can run them through a switch, if might help isolate which node is causing the problems as then you would only see one node record "NIC Copper Link is Down" and can then focus on just that node. -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "omg my singularity battery is dead again. stupid hawking radiation." - epitron This communication may contain confidential information. If you are not the intended recipient it may be unlawful for you to read, copy, distribute, disclose or otherwise use the information contained within this communication.. Errors and Omissions may occur in the contents of this Email arising out of or in connection with data transmission, network malfunction or failure, machine or software error, malfunction, or operator errors by the person who is sending the email. Precision Group accepts no responsibility for any such errors or omissions. The information, views and comments within this communication are those of the individual and not necessarily those of Precision Group. All email that is sent from/to Precision Group is scanned for the presence of computer viruses, security issues and inappropriate content. However, it is the recipient's responsibility to check any attachments for viruses before use. -------------- next part -------------- A non-text attachment was scrubbed... Name: cluster.conf Type: application/octet-stream Size: 1043 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: messages_filesrv1 Type: application/octet-stream Size: 117290 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: messages_filesrv2 Type: application/octet-stream Size: 15302 bytes Desc: not available URL: From sathyanarayanan.varadharajan at precisionit.co.in Mon Jan 9 06:18:22 2012 From: sathyanarayanan.varadharajan at precisionit.co.in (SATHYA - IT) Date: Mon, 9 Jan 2012 11:48:22 +0530 Subject: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment Message-ID: <004c01ccce96$7f974ca0$7ec5e5e0$@precisionit.co.in> Not sure whether you received the logs and cluster.conf file. Herewith pasting the same... On File Server1: Jan 8 03:15:04 filesrv1 kernel: imklog 4.6.2, log source = /proc/kmsg started. Jan 8 03:15:04 filesrv1 rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="8765" x-info="http://www.rsyslog.com"] (re)start Jan 8 10:52:42 filesrv1 kernel: imklog 4.6.2, log source = /proc/kmsg started. 
Jan 8 10:52:42 filesrv1 rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="8751" x-info="http://www.rsyslog.com"] (re)start Jan 8 10:52:42 filesrv1 kernel: Initializing cgroup subsys cpuset Jan 8 10:52:42 filesrv1 kernel: Initializing cgroup subsys cpu Jan 8 10:52:42 filesrv1 kernel: Linux version 2.6.32-220.el6.x86_64 (mockbuild at x86-004.build.bos.redhat.com) (gcc version 4.4.5 20110214 (Red Hat 4.4.5-6) (GCC) ) #1 SMP Wed Nov 9 08:03:13 EST 2011 Jan 8 10:52:42 filesrv1 kernel: Command line: ro root=/dev/mapper/vg01-LogVol01 rd_LVM_LV=vg01/LogVol01 rd_LVM_LV=vg01/LogVol00 rd_NO_LUKS rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us crashkernel=128M rhgb quiet acpi=off Jan 8 10:52:42 filesrv1 kernel: KERNEL supported cpus: Jan 8 10:52:42 filesrv1 kernel: Intel GenuineIntel Jan 8 10:52:42 filesrv1 kernel: AMD AuthenticAMD Jan 8 10:52:42 filesrv1 kernel: Centaur CentaurHauls Jan 8 10:52:42 filesrv1 kernel: BIOS-provided physical RAM map: Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 0000000000000000 - 000000000009f400 (usable) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 000000000009f400 - 00000000000a0000 (reserved) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 0000000000100000 - 00000000d762f000 (usable) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 00000000d762f000 - 00000000d763c000 (ACPI data) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 00000000d763c000 - 00000000d763d000 (usable) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 00000000d763d000 - 00000000dc000000 (reserved) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 00000000fec00000 - 00000000fee10000 (reserved) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 00000000ff800000 - 0000000100000000 (reserved) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 0000000100000000 - 00000008a7fff000 (usable) Jan 8 10:52:42 filesrv1 kernel: DMI 2.7 present. Jan 8 10:52:42 filesrv1 kernel: SMBIOS version 2.7 @ 0xF4F40 Jan 8 10:52:42 filesrv1 kernel: last_pfn = 0x8a7fff max_arch_pfn = 0x400000000 Jan 8 10:52:42 filesrv1 kernel: x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106 Jan 8 10:52:42 filesrv1 kernel: last_pfn = 0xd763d max_arch_pfn = 0x400000000 . . On File Server 2: Jan 8 03:09:06 filesrv2 rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="8648" x-info="http://www.rsyslog.com"] (re)start Jan 8 10:48:07 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Down Jan 8 10:48:07 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Down Jan 8 10:48:07 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it Jan 8 10:48:07 filesrv2 kernel: bonding: bond1: now running without any active interface ! Jan 8 10:48:07 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it Jan 8 10:48:09 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:48:09 filesrv2 kernel: bond1: link status definitely up for interface eth4, 1000 Mbps full duplex. Jan 8 10:48:09 filesrv2 kernel: bonding: bond1: making interface eth4 the new active one. Jan 8 10:48:09 filesrv2 kernel: bonding: bond1: first active interface up! Jan 8 10:48:09 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:48:09 filesrv2 kernel: bond1: link status definitely up for interface eth3, 1000 Mbps full duplex. 
Jan 8 10:48:15 filesrv2 corosync[8933]: [TOTEM ] A processor failed, forming new configuration. Jan 8 10:48:17 filesrv2 corosync[8933]: [QUORUM] Members[1]: 2 Jan 8 10:48:17 filesrv2 corosync[8933]: [TOTEM ] A processor joined or left the membership and a new membership was formed. Jan 8 10:48:17 filesrv2 rgmanager[12557]: State change: clustsrv1 DOWN Jan 8 10:48:17 filesrv2 corosync[8933]: [CPG ] chosen downlist: sender r(0) ip(10.0.0.20) ; members(old:2 left:1) Jan 8 10:48:17 filesrv2 corosync[8933]: [MAIN ] Completed service synchronization, ready to provide service. Jan 8 10:48:17 filesrv2 kernel: dlm: closing connection to node 1 Jan 8 10:48:17 filesrv2 fenced[8989]: fencing node clustsrv1 Jan 8 10:48:17 filesrv2 kernel: GFS2: fsid=samba:ctdb.0: jid=1: Trying to acquire journal lock... Jan 8 10:48:17 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Trying to acquire journal lock... Jan 8 10:48:24 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Down Jan 8 10:48:24 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Down Jan 8 10:48:24 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it Jan 8 10:48:24 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it Jan 8 10:48:24 filesrv2 kernel: bonding: bond1: now running without any active interface ! Jan 8 10:48:25 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Up, 100 Mbps full duplex, receive & transmit flow control ON Jan 8 10:48:25 filesrv2 kernel: bond1: link status definitely up for interface eth4, 100 Mbps full duplex. Jan 8 10:48:25 filesrv2 kernel: bonding: bond1: making interface eth4 the new active one. Jan 8 10:48:25 filesrv2 kernel: bonding: bond1: first active interface up! Jan 8 10:48:25 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Up, 100 Mbps full duplex, receive & transmit flow control ON Jan 8 10:48:25 filesrv2 kernel: bond1: link status definitely up for interface eth3, 100 Mbps full duplex. Jan 8 10:48:25 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Down Jan 8 10:48:25 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Down Jan 8 10:48:25 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it Jan 8 10:48:25 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it Jan 8 10:48:25 filesrv2 kernel: bonding: bond1: now running without any active interface ! Jan 8 10:48:27 filesrv2 fenced[8989]: fence clustsrv1 success Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:ctdb.0: jid=1: Looking at journal... Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:ctdb.0: jid=1: Done Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:gen02.0: jid=1: Trying to acquire journal lock... Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Trying to acquire journal lock... Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:gen02.0: jid=1: Looking at journal... Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:gen02.0: jid=1: Done Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Looking at journal... Jan 8 10:48:28 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:48:28 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Acquiring the transaction lock... 
Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Replaying journal... Jan 8 10:48:28 filesrv2 kernel: bond1: link status definitely up for interface eth3, 1000 Mbps full duplex. Jan 8 10:48:28 filesrv2 kernel: bonding: bond1: making interface eth3 the new active one. Jan 8 10:48:28 filesrv2 kernel: bonding: bond1: first active interface up! Jan 8 10:48:28 filesrv2 kernel: bond1: link status definitely up for interface eth4, 1000 Mbps full duplex. Jan 8 10:48:30 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Replayed 29140 of 29474 blocks Jan 8 10:48:30 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Found 334 revoke tags Jan 8 10:48:30 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Journal replayed in 2s Jan 8 10:48:30 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Done Jan 8 10:49:01 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Down Jan 8 10:49:01 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it Jan 8 10:49:01 filesrv2 kernel: bonding: bond1: making interface eth4 the new active one. Jan 8 10:49:01 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Down Jan 8 10:49:01 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it Jan 8 10:49:01 filesrv2 kernel: bonding: bond1: now running without any active interface ! Jan 8 10:49:03 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:49:03 filesrv2 kernel: bond1: link status definitely up for interface eth3, 1000 Mbps full duplex. Jan 8 10:49:03 filesrv2 kernel: bonding: bond1: making interface eth3 the new active one. Jan 8 10:49:03 filesrv2 kernel: bonding: bond1: first active interface up! Jan 8 10:49:04 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:49:04 filesrv2 kernel: bond1: link status definitely up for interface eth4, 1000 Mbps full duplex. Jan 8 10:50:13 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Looking at journal... Jan 8 10:50:13 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Acquiring the transaction lock... Jan 8 10:50:13 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Replaying journal... Jan 8 10:50:13 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Replayed 0 of 0 blocks Jan 8 10:50:13 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Found 0 revoke tags Jan 8 10:50:13 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Journal replayed in 0s Jan 8 10:50:13 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Done Jan 8 10:52:37 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Down Jan 8 10:52:38 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it Jan 8 10:52:38 filesrv2 kernel: bonding: bond1: making interface eth4 the new active one. Jan 8 10:52:38 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Down Jan 8 10:52:38 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it Jan 8 10:52:38 filesrv2 kernel: bonding: bond1: now running without any active interface ! 
Jan 8 10:52:40 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:52:40 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:52:40 filesrv2 kernel: bond1: link status definitely up for interface eth3, 1000 Mbps full duplex. Jan 8 10:52:40 filesrv2 kernel: bonding: bond1: making interface eth3 the new active one. Jan 8 10:52:40 filesrv2 kernel: bonding: bond1: first active interface up! Jan 8 10:52:40 filesrv2 kernel: bond1: link status definitely up for interface eth4, 1000 Mbps full duplex. Jan 8 10:52:44 filesrv2 corosync[8933]: [TOTEM ] A processor joined or left the membership and a new membership was formed. Jan 8 10:52:44 filesrv2 corosync[8933]: [QUORUM] Members[2]: 1 2 Jan 8 10:52:44 filesrv2 corosync[8933]: [QUORUM] Members[2]: 1 2 Jan 8 10:52:44 filesrv2 corosync[8933]: [CPG ] chosen downlist: sender r(0) ip(10.0.0.10) ; members(old:1 left:0) Jan 8 10:52:44 filesrv2 corosync[8933]: [MAIN ] Completed service synchronization, ready to provide service. Jan 8 10:52:51 filesrv2 kernel: dlm: got connection from 1 Jan 8 10:55:57 filesrv2 kernel: INFO: task gfs2_quotad:9389 blocked for more than 120 seconds. Jan 8 10:55:57 filesrv2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 8 10:55:57 filesrv2 kernel: gfs2_quotad D ffff8808a7824900 0 9389 2 0x00000080 Jan 8 10:55:57 filesrv2 kernel: ffff88087580da88 0000000000000046 0000000000000000 00000000000001c3 Jan 8 10:55:57 filesrv2 kernel: ffff88087580da18 ffff88087580da50 ffffffff810ea694 ffff88088b184080 Jan 8 10:55:57 filesrv2 kernel: ffff88088e71a5f8 ffff88087580dfd8 000000000000f4e8 ffff88088e71a5f8 Jan 8 10:55:57 filesrv2 kernel: Call Trace: Jan 8 10:55:57 filesrv2 kernel: [] ? rb_reserve_next_event+0xb4/0x370 Jan 8 10:55:57 filesrv2 kernel: [] ? native_sched_clock+0x13/0x60 Jan 8 10:55:57 filesrv2 kernel: [] rwsem_down_failed_common+0x95/0x1d0 Jan 8 10:55:57 filesrv2 kernel: [] ? native_sched_clock+0x13/0x60 Jan 8 10:55:57 filesrv2 kernel: [] rwsem_down_read_failed+0x26/0x30 Jan 8 10:55:57 filesrv2 kernel: [] call_rwsem_down_read_failed+0x14/0x30 Jan 8 10:55:57 filesrv2 kernel: [] ? down_read+0x24/0x30 Jan 8 10:55:57 filesrv2 kernel: [] dlm_lock+0x62/0x1e0 [dlm] Jan 8 10:55:57 filesrv2 kernel: [] ? ring_buffer_lock_reserve+0xa2/0x160 Jan 8 10:55:57 filesrv2 kernel: [] gdlm_lock+0xf2/0x130 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] ? gdlm_ast+0x0/0xe0 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] ? gdlm_bast+0x0/0x50 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] do_xmote+0x17f/0x260 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] run_queue+0xf1/0x1d0 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] gfs2_glock_nq+0x1b7/0x360 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] ? try_to_del_timer_sync+0x7b/0xe0 Jan 8 10:55:57 filesrv2 kernel: [] gfs2_statfs_sync+0x58/0x1b0 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] ? schedule_timeout+0x19a/0x2e0 Jan 8 10:55:57 filesrv2 kernel: [] ? gfs2_statfs_sync+0x50/0x1b0 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] quotad_check_timeo+0x57/0xb0 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] gfs2_quotad+0x234/0x2b0 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] ? autoremove_wake_function+0x0/0x40 Jan 8 10:55:57 filesrv2 kernel: [] ? gfs2_quotad+0x0/0x2b0 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] kthread+0x96/0xa0 Jan 8 10:55:57 filesrv2 kernel: [] child_rip+0xa/0x20 Jan 8 10:55:57 filesrv2 kernel: [] ? 
kthread+0x0/0xa0 Jan 8 10:55:57 filesrv2 kernel: [] ? child_rip+0x0/0x20 Jan 8 10:57:57 filesrv2 kernel: INFO: task gfs2_quotad:9389 blocked for more than 120 seconds. Jan 8 10:57:57 filesrv2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 8 10:57:57 filesrv2 kernel: gfs2_quotad D ffff8808a7824900 0 9389 2 0x00000080 Jan 8 10:57:57 filesrv2 kernel: ffff88087580da88 0000000000000046 0000000000000000 00000000000001c3 Jan 8 10:57:57 filesrv2 kernel: ffff88087580da18 ffff88087580da50 ffffffff810ea694 ffff88088b184080 Jan 8 10:57:57 filesrv2 kernel: ffff88088e71a5f8 ffff88087580dfd8 000000000000f4e8 ffff88088e71a5f8 Jan 8 10:57:57 filesrv2 kernel: Call Trace: Jan 8 10:57:57 filesrv2 kernel: [] ? rb_reserve_next_event+0xb4/0x370 Jan 8 10:57:57 filesrv2 kernel: [] ? native_sched_clock+0x13/0x60 Jan 8 10:57:57 filesrv2 kernel: [] rwsem_down_failed_common+0x95/0x1d0 Jan 8 10:57:57 filesrv2 kernel: [] ? native_sched_clock+0x13/0x60 Jan 8 10:57:57 filesrv2 kernel: [] rwsem_down_read_failed+0x26/0x30 Jan 8 10:57:57 filesrv2 kernel: [] call_rwsem_down_read_failed+0x14/0x30 Jan 8 10:57:57 filesrv2 kernel: [] ? down_read+0x24/0x30 Jan 8 10:57:57 filesrv2 kernel: [] dlm_lock+0x62/0x1e0 [dlm] Jan 8 10:57:57 filesrv2 kernel: [] ? ring_buffer_lock_reserve+0xa2/0x160 Jan 8 10:57:57 filesrv2 kernel: [] gdlm_lock+0xf2/0x130 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] ? gdlm_ast+0x0/0xe0 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] ? gdlm_bast+0x0/0x50 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] do_xmote+0x17f/0x260 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] run_queue+0xf1/0x1d0 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] gfs2_glock_nq+0x1b7/0x360 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] ? try_to_del_timer_sync+0x7b/0xe0 Jan 8 10:57:57 filesrv2 kernel: [] gfs2_statfs_sync+0x58/0x1b0 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] ? schedule_timeout+0x19a/0x2e0 Jan 8 10:57:57 filesrv2 kernel: [] ? gfs2_statfs_sync+0x50/0x1b0 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] quotad_check_timeo+0x57/0xb0 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] gfs2_quotad+0x234/0x2b0 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] ? autoremove_wake_function+0x0/0x40 Jan 8 10:57:57 filesrv2 kernel: [] ? gfs2_quotad+0x0/0x2b0 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] kthread+0x96/0xa0 Jan 8 10:57:57 filesrv2 kernel: [] child_rip+0xa/0x20 Jan 8 10:57:57 filesrv2 kernel: [] ? kthread+0x0/0xa0 Jan 8 10:57:57 filesrv2 kernel: [] ? child_rip+0x0/0x20 Jan 8 10:59:22 filesrv2 rgmanager[12557]: State change: clustsrv1 UP Cluster.conf File: Thanks Sathya Narayanan V Solution Architect -----Original Message----- From: SATHYA - IT [mailto:sathyanarayanan.varadharajan at precisionit.co.in] Sent: Monday, January 09, 2012 11:21 AM To: 'Digimer'; 'linux clustering' Subject: Re: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment Hi, Herewith attaching the /var/log/messages of both the servers. Yesterday (08th Jan) one of the server got fenced by other around 10:48 AM. I am also attaching the cluster.conf file for your reference. On the related note, related to heartbeat - I am referring the channel used by corosync. And the name which has been configured in cluster.conf file resolves with bond1 only. Related to the network card, we are using 2 dual port card where we configured 1 port from each for bond0 and 1 port from the other for bond1. So it doesn't seems be a network card related issue. Moreover, we are not having any errors related to bond0. 
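On the question that started this thread, letting the cluster ride out a 5-6 second loss of the corosync link: the knob that governs this is the totem token timeout, which decides how long a silent node is tolerated before a new membership is formed and fencing begins. On RHEL 6 it can be overridden from cluster.conf; a minimal sketch, with a purely illustrative value (cman's own default is already around 10 seconds, so check what is actually in effect before raising it):

<totem token="15000"/>

placed as a direct child of <cluster> in /etc/cluster/cluster.conf on both nodes (alongside <cman>, <clusternodes> and so on), with config_version bumped as usual. Something like the following can then sanity-check the file and show the value the running corosync is using:

ccs_config_validate
corosync-objctl | grep -i totem

A related knob is post_fail_delay on <fence_daemon>, which only delays the fence call rather than the membership change. Either way, a longer timeout merely papers over the flaps; whatever keeps taking bond1 down still needs to be found.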
Thanks Sathya Narayanan V Solution Architect This communication may contain confidential information. If you are not the intended recipient it may be unlawful for you to read, copy, distribute, disclose or otherwise use the information contained within this communication.. Errors and Omissions may occur in the contents of this Email arising out of or in connection with data transmission, network malfunction or failure, machine or software error, malfunction, or operator errors by the person who is sending the email. Precision Group accepts no responsibility for any such errors or omissions. The information, views and comments within this communication are those of the individual and not necessarily those of Precision Group. All email that is sent from/to Precision Group is scanned for the presence of computer viruses, security issues and inappropriate content. However, it is the recipient's responsibility to check any attachments for viruses before use. From Klaus.Steinberger at physik.uni-muenchen.de Mon Jan 9 07:28:38 2012 From: Klaus.Steinberger at physik.uni-muenchen.de (Klaus Steinberger) Date: Mon, 9 Jan 2012 08:28:38 +0100 Subject: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment In-Reply-To: References: Message-ID: <53FBFF3E-A139-43F7-A500-FE69539ECF84@physik.uni-muenchen.de> > > > Jan 3 14:46:07 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is > Down > Jan 3 14:46:07 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is > Down > Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: link status definitely down > for interface eth3, disabling it > Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: now running without any > active interface ! > Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: link status definitely down > for interface eth4, disabling it > Jan 3 14:46:10 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is > Up, 1000 Mbps full duplex, receive & transmit flow control ON > Jan 3 14:46:10 filesrv2 kernel: bond1: link status definitely up for > interface eth3, 1000 Mbps full duplex. > Jan 3 14:46:10 filesrv2 kernel: bonding: bond1: making interface eth3 the > new active one. > Jan 3 14:46:10 filesrv2 kernel: bonding: bond1: first active interface up! > Jan 3 14:46:10 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is > Up, 1000 Mbps full duplex, receive & transmit flow control ON > Jan 3 14:46:10 filesrv2 kernel: bond1: link status definitely up for > interface eth4, 1000 Mbps full duplex. Both links are going down at same time. Did you connect them to the same switch? Is there a switch reboot at that time or something else going on in the switch? Sincerly, Klaus -------------- next part -------------- An HTML attachment was scrubbed... URL: From ajb2 at mssl.ucl.ac.uk Mon Jan 9 08:52:30 2012 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Mon, 09 Jan 2012 08:52:30 +0000 Subject: [Linux-cluster] centos5 to RHEL6 migration In-Reply-To: <4F0A532A.2000202@alteeve.com> References: <4F0A532A.2000202@alteeve.com> Message-ID: <4F0AAACE.7080602@mssl.ucl.ac.uk> On 09/01/12 02:38, Digimer wrote: > Technically yes, practically no. Or rather, not without a lot of > testing first. This is "rather a shame" I have a similar requirement (EL5 -> EL6 with GFS) > There may be some other things you need to do as well. Please be sure > to do proper testing and, if you have the budget, hire Red Hat to advise > on this process. Also, please report back your results. It would help me > help others in the same boat later. 
:) RH's advice to use is to "Big Bang" it. The last such transition (EL4 to EL5) was an unmitigated disaster even with RH onsite to make the change, so we're _very_ wary this time around. From ajb2 at mssl.ucl.ac.uk Mon Jan 9 08:55:15 2012 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Mon, 09 Jan 2012 08:55:15 +0000 Subject: [Linux-cluster] centos5 to RHEL6 migration In-Reply-To: <4F0A726A.6050304@alteeve.com> References: <4F0A532A.2000202@alteeve.com> <4F0A726A.6050304@alteeve.com> Message-ID: <4F0AAB73.2040201@mssl.ucl.ac.uk> On 09/01/12 04:51, Digimer wrote: > Alternatively, use some spare machines to mock-up the current cluster > and then test-upgrade. It might work flawlessly, I genuinely don't know. Test setups aren't always a good metric. Everything worked fine on our last changeover until we put real-world load on. From fdinitto at redhat.com Mon Jan 9 09:36:05 2012 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Mon, 09 Jan 2012 10:36:05 +0100 Subject: [Linux-cluster] centos5 to RHEL6 migration In-Reply-To: <4F0AAACE.7080602@mssl.ucl.ac.uk> References: <4F0A532A.2000202@alteeve.com> <4F0AAACE.7080602@mssl.ucl.ac.uk> Message-ID: <4F0AB505.2020402@redhat.com> On 1/9/2012 9:52 AM, Alan Brown wrote: > On 09/01/12 02:38, Digimer wrote: > >> Technically yes, practically no. Or rather, not without a lot of >> testing first. > > This is "rather a shame" > > I have a similar requirement (EL5 -> EL6 with GFS) > Well the cluster stack itself (openais/cman/gfs/rgmanager -> corosync/cman/gfs2/rgmanager) is capable of handling the upgrade in a compatible mode. *BUT* (yes there are tons of those) in time, while performing different upgrade scenarios/tests, we come to the conclusion that it is a lot more complicated for any user (even expert/advanced ones) to perform a safe upgrade than rebuilding the cluster from scratch (*) given that setup/config/etc are known from the old cluster. >> There may be some other things you need to do as well. Please be sure >> to do proper testing and, if you have the budget, hire Red Hat to advise >> on this process. Also, please report back your results. It would help me >> help others in the same boat later. :) > > RH's advice to use is to "Big Bang" it. It?s not much of an advice, as RH does not officially support this upgrade method. > > The last such transition (EL4 to EL5) was an unmitigated disaster even > with RH onsite to make the change, so we're _very_ wary this time around. > The amount of changes in the cluster software between EL5 and EL6 are a lot less intrusive at system level. I can?t really say for sure for the entire OS, since the upgrade doesn?t involve only RHCS. Fabio (*) The major issues, while upgrading from 5 to 6 are: - GFS1 is not support in EL6. Volumes need to be migrated to GFS2 (and there are several ways to do it, but still needs to be done offline) - cluster.conf cannot be updated automatically during an upgrade or nodes running in mixed mode (some nodes at 5 and others at 6). - some config options, while backward compat should be retained, needs to be changed in very specific sequence, making it really hard to perform an easy upgrade. - but the biggest blocker of all are all the resources (driven or not by rgmanager). For example, apache2 config in EL5 cannot be used out-of-the-box on EL6. So assuming rgmanager is driving apache2, then you would need to setup 2 separate apache2 configs, test them individually, perform migration checks between EL5 and 6... etc. 
This kind of testing is more time consuming and complex than what you can possibly gain by redoing the cluster from scratch. There are also other resources that are simply unable to deal with this kind of upgrade. Let?s make the example of a db stored on a gfs2 filesystem. DB created in version 1, after a migration to EL6, the DB format is upgraded to internal version 2. Version 2 being incompatible with 1. IF there is a situation where the service needs to failover back to a node running EL5, the DB will be unable to start. Effectively killing the purpose of HA. What you want to notice is that the service compatibility level has nothing to do with cluster itself. Now, when you multiply the amount of possible services, failover scenarios, config changes etc, you will easily come to the conclusion that an upgrade of this proportion is a path to insanity for the administrator. From rajatjpatel at gmail.com Mon Jan 9 10:20:19 2012 From: rajatjpatel at gmail.com (rajatjpatel) Date: Mon, 9 Jan 2012 15:50:19 +0530 Subject: [Linux-cluster] centos5 to RHEL6 migration Message-ID: 1. Back up anything you care about. 2. Remember - A fresh install is generally *strongly* preferred over an upgrade. Regards, Rajat Patel http://studyhat.blogspot.com FIRST THEY IGNORE YOU... THEN THEY LAUGH AT YOU... THEN THEY FIGHT YOU... THEN YOU WIN... On Mon, Jan 9, 2012 at 9:33 AM, wrote: > Send Linux-cluster mailing list submissions to > linux-cluster at redhat.com > > To subscribe or unsubscribe via the World Wide Web, visit > https://www.redhat.com/mailman/listinfo/linux-cluster > or, via email, send a message with subject or body 'help' to > linux-cluster-request at redhat.com > > You can reach the person managing the list at > linux-cluster-owner at redhat.com > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of Linux-cluster digest..." > > > Today's Topics: > > 1. centos5 to RHEL6 migration (Terry) > 2. Re: centos5 to RHEL6 migration (Michael Allen) > 3. Re: centos5 to RHEL6 migration (Digimer) > 4. Re: centos5 to RHEL6 migration (Terry) > 5. Re: centos5 to RHEL6 migration (Fajar A. Nugraha) > 6. Re: GFS on CentOS - cman unable to start (Wes Modes) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Sun, 8 Jan 2012 17:39:37 -0600 > From: Terry > To: linux clustering > Subject: [Linux-cluster] centos5 to RHEL6 migration > Message-ID: > > > Content-Type: text/plain; charset="iso-8859-1" > > Hello, > > I am trying to gently migrate a 3-node cluster from centos5 to RHEL6. I > have already taken one of the three nodes out and rebuilt it. My thinking > is to build a new cluster from the RHEL node but want to run it by everyone > here first. The cluster consists of a handful of NFS volumes and a > PostgreSQL database. I am not concerned about the database. I am moving > to a new version and will simply migrate that. I am more concerned about > all of the ext4 clustered LVM volumes. In this process, if I shut down the > old cluster, what's the process to force the new node to read those volumes > in to the new single-node cluster? A pvscan on the new server shows all of > the volumes fine. I am concerned there's something else I'll have to do > here to begin mounting these volumes in the new cluster. > [root at server ~]# pvdisplay > Skipping clustered volume group vg_data01b > > Thanks! > -------------- next part -------------- > An HTML attachment was scrubbed... 
> URL: < > https://www.redhat.com/archives/linux-cluster/attachments/20120108/bc344718/attachment.html > > > > ------------------------------ > > Message: 2 > Date: Sun, 8 Jan 2012 19:36:10 -0600 > From: Michael Allen > To: linux clustering > Subject: Re: [Linux-cluster] centos5 to RHEL6 migration > Message-ID: <20120108193610.2404b425 at godelsrevenge.induswx.com> > Content-Type: text/plain; charset=US-ASCII > > On Sun, 8 Jan 2012 17:39:37 -0600 > Terry wrote: > > > Hello, > > > > I am trying to gently migrate a 3-node cluster from centos5 to > > RHEL6. I have already taken one of the three nodes out and rebuilt > > it. My thinking is to build a new cluster from the RHEL node but want > > to run it by everyone here first. The cluster consists of a handful > > of NFS volumes and a PostgreSQL database. I am not concerned about > > the database. I am moving to a new version and will simply migrate > > that. I am more concerned about all of the ext4 clustered LVM > > volumes. In this process, if I shut down the old cluster, what's the > > process to force the new node to read those volumes in to the new > > single-node cluster? A pvscan on the new server shows all of the > > volumes fine. I am concerned there's something else I'll have to do > > here to begin mounting these volumes in the new cluster. [root at server > > ~]# pvdisplay Skipping clustered volume group vg_data01b > > > > Thanks! > This message comes at a good time for me, too, since I am considering > the same thing. I have 10 nodes but it appears that a change to CentOS > 6.xx is about due. > > Michael Allen > > > > ------------------------------ > > Message: 3 > Date: Sun, 08 Jan 2012 21:38:34 -0500 > From: Digimer > To: linux clustering > Subject: Re: [Linux-cluster] centos5 to RHEL6 migration > Message-ID: <4F0A532A.2000202 at alteeve.com> > Content-Type: text/plain; charset=ISO-8859-1 > > On 01/08/2012 06:39 PM, Terry wrote: > > Hello, > > > > I am trying to gently migrate a 3-node cluster from centos5 to RHEL6. I > > have already taken one of the three nodes out and rebuilt it. My > > thinking is to build a new cluster from the RHEL node but want to run it > > by everyone here first. The cluster consists of a handful of NFS volumes > > and a PostgreSQL database. I am not concerned about the database. I am > > moving to a new version and will simply migrate that. I am more > > concerned about all of the ext4 clustered LVM volumes. In this process, > > if I shut down the old cluster, what's the process to force the new node > > to read those volumes in to the new single-node cluster? A pvscan on > > the new server shows all of the volumes fine. I am concerned there's > > something else I'll have to do here to begin mounting these volumes in > > the new cluster. > > [root at server ~]# pvdisplay > > Skipping clustered volume group vg_data01b > > > > Thanks! > > Technically yes, practically no. Or rather, not without a lot of > testing first. > > I've never done this, but here are some pointers; > > > > upgrading > Set this if you are performing a rolling upgrade of the cluster > between major releases. > > disallowed > Set this to 1 enable cman's Disallowed mode. This is usually > only needed for backwards compatibility. > > > > Enable compatibility with cluster2 nodes. groupd(8) > > There may be some other things you need to do as well. Please be sure > to do proper testing and, if you have the budget, hire Red Hat to advise > on this process. Also, please report back your results. 
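(The inline XML in the pointers above was eaten by the list archiver. Going by the descriptions that survived and by cman(5)/groupd(8), the stripped snippets were presumably along these lines; treat this as a hedged reconstruction rather than a copy of the original mail, and confirm the exact attribute names and values against cluster.conf(5) on the target release:

<cman upgrading="yes" disallowed="1"/>
<group groupd_compat="1"/>

i.e. attributes on the <cman> element for the rolling-upgrade and Disallowed modes, and groupd_compat on <group> for cluster2 compatibility.)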
It would help me > help others in the same boat later. :) > > -- > Digimer > E-Mail: digimer at alteeve.com > Freenode handle: digimer > Papers and Projects: http://alteeve.com > Node Assassin: http://nodeassassin.org > "omg my singularity battery is dead again. > stupid hawking radiation." - epitron > > > > ------------------------------ > > Message: 4 > Date: Sun, 8 Jan 2012 21:31:38 -0600 > From: Terry > To: Digimer > Cc: linux clustering > Subject: Re: [Linux-cluster] centos5 to RHEL6 migration > Message-ID: > > > Content-Type: text/plain; charset="iso-8859-1" > > If it's not practical, am I left with building a new cluster from scratch? > > On Sun, Jan 8, 2012 at 8:38 PM, Digimer wrote: > > > On 01/08/2012 06:39 PM, Terry wrote: > > > Hello, > > > > > > I am trying to gently migrate a 3-node cluster from centos5 to RHEL6. > I > > > have already taken one of the three nodes out and rebuilt it. My > > > thinking is to build a new cluster from the RHEL node but want to run > it > > > by everyone here first. The cluster consists of a handful of NFS > volumes > > > and a PostgreSQL database. I am not concerned about the database. I > am > > > moving to a new version and will simply migrate that. I am more > > > concerned about all of the ext4 clustered LVM volumes. In this > process, > > > if I shut down the old cluster, what's the process to force the new > node > > > to read those volumes in to the new single-node cluster? A pvscan on > > > the new server shows all of the volumes fine. I am concerned there's > > > something else I'll have to do here to begin mounting these volumes in > > > the new cluster. > > > [root at server ~]# pvdisplay > > > Skipping clustered volume group vg_data01b > > > > > > Thanks! > > > > Technically yes, practically no. Or rather, not without a lot of > > testing first. > > > > I've never done this, but here are some pointers; > > > > > > > > upgrading > > Set this if you are performing a rolling upgrade of the cluster > > between major releases. > > > > disallowed > > Set this to 1 enable cman's Disallowed mode. This is usually > > only needed for backwards compatibility. > > > > > > > > Enable compatibility with cluster2 nodes. groupd(8) > > > > There may be some other things you need to do as well. Please be sure > > to do proper testing and, if you have the budget, hire Red Hat to advise > > on this process. Also, please report back your results. It would help me > > help others in the same boat later. :) > > > > -- > > Digimer > > E-Mail: digimer at alteeve.com > > Freenode handle: digimer > > Papers and Projects: http://alteeve.com > > Node Assassin: http://nodeassassin.org > > "omg my singularity battery is dead again. > > stupid hawking radiation." - epitron > > > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: < > https://www.redhat.com/archives/linux-cluster/attachments/20120108/e8917278/attachment.html > > > > ------------------------------ > > Message: 5 > Date: Mon, 9 Jan 2012 11:01:06 +0700 > From: "Fajar A. Nugraha" > To: linux clustering > Subject: Re: [Linux-cluster] centos5 to RHEL6 migration > Message-ID: > > > Content-Type: text/plain; charset=ISO-8859-1 > > On Mon, Jan 9, 2012 at 10:31 AM, Terry wrote: > > If it's not practical, am I left with building a new cluster from > scratch? > > I'm pretty sure if your ONLY problem is "Skipping clustered volume > group vg_data01b", you can just turn off cluster flag with "vgchange > -cn", then use "-o lock_nolock" to mount it on a SINGLE (i.e. 
not > cluster) node. That was your original question, wasn't it? > > As for upgrading, I haven't tested it. You should be able to use your > old storage, but just create other settings from scratch. Like Digimer > said, be sure to do proper testing :) > > -- > Fajar > > > > ------------------------------ > > Message: 6 > Date: Sun, 08 Jan 2012 20:03:18 -0800 > From: Wes Modes > To: linux-cluster at redhat.com > Subject: Re: [Linux-cluster] GFS on CentOS - cman unable to start > Message-ID: <4F0A6706.6090308 at ucsc.edu> > Content-Type: text/plain; charset="iso-8859-1" > > The behavior of cman's resolving of cluster node names is less than > clear, as per the RHEL bugzilla report. > > The hostname and cluster.conf match, as does /etc/hosts and uname -n. > The short names and FQDN ping. I believe all the node cluster.conf are > in sync, and all nodes are accessible to each other using either short > or long names. > > You'll have to trust that I've tried everything obvious, and every > possible combination of FQDN and short names in cluster.conf and > hostname. That said, it is totally possible I missed something obvious. > > I suspect, there is something else going on and I don't know how to get > at it. > > Wes > > > On 1/6/2012 6:06 PM, Kevin Stanton wrote: > > > > > Hi, > > > > > I think CMAN expect that the names of the cluster nodes be the same > > returned by the command "uname -n". > > > > > For what you write your nodes hostnames are: test01.gdao.ucsc.edu > > and test02.gdao.ucsc.edu, but in cluster.conf you have declared only > > "test01" and "test02". > > > > > > > > I haven't found this to be the case in the past. I actually use a > > separate short name to reference each node which is different than the > > hostname of the server itself. All I've ever had to do is make sure > > it resolves correctly. You can do this either in DNS and/or in > > /etc/hosts. I have found that it's a good idea to do both in case > > your DNS server is a virtual machine and is not running for some > > reason. In that case with /etc/hosts you can still start cman. > > > > > > > > I would make sure whatever node names you use in the cluster.conf will > > resolve when you try to ping it from all nodes in the cluster. Also > > make sure your cluster.conf is in sync between all nodes. > > > > > > > > -Kevin > > > > > > > > > > > > ------------------------------------------------------------------------ > > > > These servers are currently on the same host, but may not be in > > the future. They are in a vm cluster (though honestly, I'm not > > sure what this means yet). > > > > SElinux is on, but disabled. > > Firewalling through iptables is turned off via > > system-config-securitylevel > > > > There is no line currently in the cluster.conf that deals with > > multicasting. > > > > Any other suggestions? > > > > Wes > > > > On 1/6/2012 12:05 PM, Luiz Gustavo Tonello wrote: > > > > Hi, > > > > > > > > This servers is on VMware? At the same host? > > > > SElinux is disable? iptables have something? > > > > > > > > In my environment I had a problem to start GFS2 with servers in > > differents hosts. > > > > To clustering servers, was need migrate one server to the same > > host of the other, and restart this. > > > > > > > > I think, one of the problem was because the virtual switchs. > > > > To solve, I changed a multicast IP, to use 225.0.0.13 at cluster.conf > > > > > > > > And add a static route in both, to use default gateway. > > > > > > > > I don't know if it's correct, but this solve my problem. 
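(For reference, the workaround described just above normally has two pieces: an explicit multicast address in cluster.conf and a route that steers that group where the other nodes can see it. Luiz routed it via the default gateway; pinning it to the cluster-facing interface is the more common variant. Roughly, for the CentOS 5 / cman 2.0 generation being discussed, using the address from the mail and an illustrative interface, and noting that this release may also want a per-node <multicast> element as per the Cluster Administration guide:

<cman>
    <multicast addr="225.0.0.13"/>
</cman>

ip route add 225.0.0.13/32 dev eth0

This is a hedged sketch, not a verified copy of Luiz's setup.)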
> > > > > > > > I hope that help you. > > > > > > > > Regards. > > > > > > > > On Fri, Jan 6, 2012 at 5:01 PM, Wes Modes > > wrote: > > > > Hi, Steven. > > > > I've tried just about every possible combination of hostname and > > cluster.conf. > > > > ping to test01 resolves to 128.114.31.112 > > ping to test01.gdao.ucsc.edu > > resolves to 128.114.31.112 > > > > It feels like the right thing is being returned. This feels like it > > might be a quirk (or bug possibly) of cman or openais. > > > > There are some old bug reports around this, for example > > https://bugzilla.redhat.com/show_bug.cgi?id=488565. It sounds > > like the > > way that cman reports this error is anything but straightforward. > > > > Is there anyone who has encountered this error and found a solution? > > > > Wes > > > > > > > > On 1/6/2012 2:00 AM, Steven Whitehouse wrote: > > > Hi, > > > > > > On Thu, 2012-01-05 at 13:54 -0800, Wes Modes wrote: > > >> Howdy, y'all. I'm trying to set up GFS in a cluster on CentOS > > systems > > >> running on vmWare. The GFS FS is on a Dell Equilogic SAN. > > >> > > >> I keep running into the same problem despite many > > differently-flavored > > >> attempts to set up GFS. The problem comes when I try to start > > cman, the > > >> cluster management software. > > >> > > >> [root at test01]# service cman start > > >> Starting cluster: > > >> Loading modules... done > > >> Mounting configfs... done > > >> Starting ccsd... done > > >> Starting cman... failed > > >> cman not started: Can't find local node name in cluster.conf > > >> /usr/sbin/cman_tool: aisexec daemon didn't start > > >> > > [FAILED] > > >> > > > This looks like what it says... whatever the node name is in > > > cluster.conf, it doesn't exist when the name is looked up, or > > possibly > > > it does exist, but is mapped to the loopback address (it needs to > > map to > > > an address which is valid cluster-wide) > > > > > > Since your config files look correct, the next thing to check is > what > > > the resolver is actually returning. Try (for example) a ping to > > test01 > > > (you need to specify exactly the same form of the name as is used > in > > > cluster.conf) from test02 and see whether it uses the correct ip > > > address, just in case the wrong thing is being returned. > > > > > > Steve. > > > > > >> [root at test01]# tail /var/log/messages > > >> Jan 5 13:39:40 testbench06 ccsd[13194]: Unable to connect to > > >> cluster infrastructure after 1193640 seconds. > > >> Jan 5 13:40:10 testbench06 ccsd[13194]: Unable to connect to > > >> cluster infrastructure after 1193670 seconds. > > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS > Executive > > >> Service RELEASE 'subrev 1887 version 0.80.6' > > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright > (C) > > >> 2002-2006 MontaVista Software, Inc and contributors. > > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright > (C) > > >> 2006 Red Hat, Inc. > > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS > Executive > > >> Service: started and ready to provide service. 
> > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] local > > node name > > >> "test01.gdao.ucsc.edu " not found > > in cluster.conf > > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error > > reading CCS > > >> info, cannot start > > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error > reading > > >> config from CCS > > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS > Executive > > >> exiting (reason: could not read the main configuration file). > > >> > > >> Here are details of my configuration: > > >> > > >> [root at test01]# rpm -qa | grep cman > > >> cman-2.0.115-85.el5_7.2 > > >> > > >> [root at test01]# echo $HOSTNAME > > >> test01.gdao.ucsc.edu > > >> > > >> [root at test01]# hostname > > >> test01.gdao.ucsc.edu > > >> > > >> [root at test01]# cat /etc/hosts > > >> # Do not remove the following line, or various programs > > >> # that require network functionality will fail. > > >> 128.114.31.112 test01 test01.gdao test01.gdao.ucsc.edu > > > > >> 128.114.31.113 test02 test02.gdao test02.gdao.ucsc.edu > > > > >> 127.0.0.1 localhost.localdomain localhost > > >> ::1 localhost6.localdomain6 localhost6 > > >> > > >> [root at test01]# sestatus > > >> SELinux status: enabled > > >> SELinuxfs mount: /selinux > > >> Current mode: permissive > > >> Mode from config file: permissive > > >> Policy version: 21 > > >> Policy from config file: targeted > > >> > > >> [root at test01]# cat /etc/cluster/cluster.conf > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > >> ipaddr="gdvcenter.ucsc.edu " > > login="root" passwd="1hateAmazon.com" > > >> vmlogin="root" vmpasswd="esxpass" > > >> > > > port="/vmfs/volumes/49086551-c64fd83c-0401-001e0bcd6848/eagle1/gfs1.vmx"/> > > >> > > >> > > >> > > >> > > >> > > >> > > >> I've seen much discussion of this problem, but no definitive > > solutions. > > >> Any help you can provide will be welcome. > > >> > > >> Wes Modes > > >> > > >> -- > > >> Linux-cluster mailing list > > >> Linux-cluster at redhat.com > > >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > -- > > > Linux-cluster mailing list > > > Linux-cluster at redhat.com > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > > > > > -- > > Luiz Gustavo P Tonello. > > > > > > > > -- > > > > Linux-cluster mailing list > > > > Linux-cluster at redhat.com > > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > > > > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: < > https://www.redhat.com/archives/linux-cluster/attachments/20120108/707d1029/attachment.html > > > > ------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > End of Linux-cluster Digest, Vol 93, Issue 7 > ******************************************** > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From sathyanarayanan.varadharajan at precisionit.co.in Mon Jan 9 10:43:15 2012 From: sathyanarayanan.varadharajan at precisionit.co.in (SATHYA - IT) Date: Mon, 9 Jan 2012 16:13:15 +0530 Subject: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment Message-ID: <003201cccebb$805852e0$8108f8a0$@precisionit.co.in> Klaus, For your point the corosync network is not connected to the switch. They are connected directly to the servers (server to server). Thanks Sathya Narayanan V Solution Architect -----Original Message----- From: SATHYA - IT [mailto:sathyanarayanan.varadharajan at precisionit.co.in] Sent: Monday, January 09, 2012 11:48 AM To: 'Digimer'; 'linux clustering' Subject: RE: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment Not sure whether you received the logs and cluster.conf file. Herewith pasting the same... On File Server1: Jan 8 03:15:04 filesrv1 kernel: imklog 4.6.2, log source = /proc/kmsg started. Jan 8 03:15:04 filesrv1 rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="8765" x-info="http://www.rsyslog.com"] (re)start Jan 8 10:52:42 filesrv1 kernel: imklog 4.6.2, log source = /proc/kmsg started. Jan 8 10:52:42 filesrv1 rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="8751" x-info="http://www.rsyslog.com"] (re)start Jan 8 10:52:42 filesrv1 kernel: Initializing cgroup subsys cpuset Jan 8 10:52:42 filesrv1 kernel: Initializing cgroup subsys cpu Jan 8 10:52:42 filesrv1 kernel: Linux version 2.6.32-220.el6.x86_64 (mockbuild at x86-004.build.bos.redhat.com) (gcc version 4.4.5 20110214 (Red Hat 4.4.5-6) (GCC) ) #1 SMP Wed Nov 9 08:03:13 EST 2011 Jan 8 10:52:42 filesrv1 kernel: Command line: ro root=/dev/mapper/vg01-LogVol01 rd_LVM_LV=vg01/LogVol01 rd_LVM_LV=vg01/LogVol00 rd_NO_LUKS rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us crashkernel=128M rhgb quiet acpi=off Jan 8 10:52:42 filesrv1 kernel: KERNEL supported cpus: Jan 8 10:52:42 filesrv1 kernel: Intel GenuineIntel Jan 8 10:52:42 filesrv1 kernel: AMD AuthenticAMD Jan 8 10:52:42 filesrv1 kernel: Centaur CentaurHauls Jan 8 10:52:42 filesrv1 kernel: BIOS-provided physical RAM map: Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 0000000000000000 - 000000000009f400 (usable) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 000000000009f400 - 00000000000a0000 (reserved) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 0000000000100000 - 00000000d762f000 (usable) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 00000000d762f000 - 00000000d763c000 (ACPI data) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 00000000d763c000 - 00000000d763d000 (usable) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 00000000d763d000 - 00000000dc000000 (reserved) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 00000000fec00000 - 00000000fee10000 (reserved) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 00000000ff800000 - 0000000100000000 (reserved) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 0000000100000000 - 00000008a7fff000 (usable) Jan 8 10:52:42 filesrv1 kernel: DMI 2.7 present. Jan 8 10:52:42 filesrv1 kernel: SMBIOS version 2.7 @ 0xF4F40 Jan 8 10:52:42 filesrv1 kernel: last_pfn = 0x8a7fff max_arch_pfn = 0x400000000 Jan 8 10:52:42 filesrv1 kernel: x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106 Jan 8 10:52:42 filesrv1 kernel: last_pfn = 0xd763d max_arch_pfn = 0x400000000 . . 
On File Server 2: Jan 8 03:09:06 filesrv2 rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="8648" x-info="http://www.rsyslog.com"] (re)start Jan 8 10:48:07 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Down Jan 8 10:48:07 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Down Jan 8 10:48:07 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it Jan 8 10:48:07 filesrv2 kernel: bonding: bond1: now running without any active interface ! Jan 8 10:48:07 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it Jan 8 10:48:09 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:48:09 filesrv2 kernel: bond1: link status definitely up for interface eth4, 1000 Mbps full duplex. Jan 8 10:48:09 filesrv2 kernel: bonding: bond1: making interface eth4 the new active one. Jan 8 10:48:09 filesrv2 kernel: bonding: bond1: first active interface up! Jan 8 10:48:09 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:48:09 filesrv2 kernel: bond1: link status definitely up for interface eth3, 1000 Mbps full duplex. Jan 8 10:48:15 filesrv2 corosync[8933]: [TOTEM ] A processor failed, forming new configuration. Jan 8 10:48:17 filesrv2 corosync[8933]: [QUORUM] Members[1]: 2 Jan 8 10:48:17 filesrv2 corosync[8933]: [TOTEM ] A processor joined or left the membership and a new membership was formed. Jan 8 10:48:17 filesrv2 rgmanager[12557]: State change: clustsrv1 DOWN Jan 8 10:48:17 filesrv2 corosync[8933]: [CPG ] chosen downlist: sender r(0) ip(10.0.0.20) ; members(old:2 left:1) Jan 8 10:48:17 filesrv2 corosync[8933]: [MAIN ] Completed service synchronization, ready to provide service. Jan 8 10:48:17 filesrv2 kernel: dlm: closing connection to node 1 Jan 8 10:48:17 filesrv2 fenced[8989]: fencing node clustsrv1 Jan 8 10:48:17 filesrv2 kernel: GFS2: fsid=samba:ctdb.0: jid=1: Trying to acquire journal lock... Jan 8 10:48:17 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Trying to acquire journal lock... Jan 8 10:48:24 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Down Jan 8 10:48:24 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Down Jan 8 10:48:24 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it Jan 8 10:48:24 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it Jan 8 10:48:24 filesrv2 kernel: bonding: bond1: now running without any active interface ! Jan 8 10:48:25 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Up, 100 Mbps full duplex, receive & transmit flow control ON Jan 8 10:48:25 filesrv2 kernel: bond1: link status definitely up for interface eth4, 100 Mbps full duplex. Jan 8 10:48:25 filesrv2 kernel: bonding: bond1: making interface eth4 the new active one. Jan 8 10:48:25 filesrv2 kernel: bonding: bond1: first active interface up! Jan 8 10:48:25 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Up, 100 Mbps full duplex, receive & transmit flow control ON Jan 8 10:48:25 filesrv2 kernel: bond1: link status definitely up for interface eth3, 100 Mbps full duplex. 
Jan 8 10:48:25 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Down Jan 8 10:48:25 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Down Jan 8 10:48:25 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it Jan 8 10:48:25 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it Jan 8 10:48:25 filesrv2 kernel: bonding: bond1: now running without any active interface ! Jan 8 10:48:27 filesrv2 fenced[8989]: fence clustsrv1 success Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:ctdb.0: jid=1: Looking at journal... Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:ctdb.0: jid=1: Done Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:gen02.0: jid=1: Trying to acquire journal lock... Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Trying to acquire journal lock... Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:gen02.0: jid=1: Looking at journal... Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:gen02.0: jid=1: Done Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Looking at journal... Jan 8 10:48:28 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:48:28 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Acquiring the transaction lock... Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Replaying journal... Jan 8 10:48:28 filesrv2 kernel: bond1: link status definitely up for interface eth3, 1000 Mbps full duplex. Jan 8 10:48:28 filesrv2 kernel: bonding: bond1: making interface eth3 the new active one. Jan 8 10:48:28 filesrv2 kernel: bonding: bond1: first active interface up! Jan 8 10:48:28 filesrv2 kernel: bond1: link status definitely up for interface eth4, 1000 Mbps full duplex. Jan 8 10:48:30 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Replayed 29140 of 29474 blocks Jan 8 10:48:30 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Found 334 revoke tags Jan 8 10:48:30 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Journal replayed in 2s Jan 8 10:48:30 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Done Jan 8 10:49:01 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Down Jan 8 10:49:01 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it Jan 8 10:49:01 filesrv2 kernel: bonding: bond1: making interface eth4 the new active one. Jan 8 10:49:01 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Down Jan 8 10:49:01 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it Jan 8 10:49:01 filesrv2 kernel: bonding: bond1: now running without any active interface ! Jan 8 10:49:03 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:49:03 filesrv2 kernel: bond1: link status definitely up for interface eth3, 1000 Mbps full duplex. Jan 8 10:49:03 filesrv2 kernel: bonding: bond1: making interface eth3 the new active one. Jan 8 10:49:03 filesrv2 kernel: bonding: bond1: first active interface up! Jan 8 10:49:04 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:49:04 filesrv2 kernel: bond1: link status definitely up for interface eth4, 1000 Mbps full duplex. 
Jan 8 10:50:13 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Looking at journal... Jan 8 10:50:13 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Acquiring the transaction lock... Jan 8 10:50:13 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Replaying journal... Jan 8 10:50:13 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Replayed 0 of 0 blocks Jan 8 10:50:13 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Found 0 revoke tags Jan 8 10:50:13 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Journal replayed in 0s Jan 8 10:50:13 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Done Jan 8 10:52:37 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Down Jan 8 10:52:38 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it Jan 8 10:52:38 filesrv2 kernel: bonding: bond1: making interface eth4 the new active one. Jan 8 10:52:38 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Down Jan 8 10:52:38 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it Jan 8 10:52:38 filesrv2 kernel: bonding: bond1: now running without any active interface ! Jan 8 10:52:40 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:52:40 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:52:40 filesrv2 kernel: bond1: link status definitely up for interface eth3, 1000 Mbps full duplex. Jan 8 10:52:40 filesrv2 kernel: bonding: bond1: making interface eth3 the new active one. Jan 8 10:52:40 filesrv2 kernel: bonding: bond1: first active interface up! Jan 8 10:52:40 filesrv2 kernel: bond1: link status definitely up for interface eth4, 1000 Mbps full duplex. Jan 8 10:52:44 filesrv2 corosync[8933]: [TOTEM ] A processor joined or left the membership and a new membership was formed. Jan 8 10:52:44 filesrv2 corosync[8933]: [QUORUM] Members[2]: 1 2 Jan 8 10:52:44 filesrv2 corosync[8933]: [QUORUM] Members[2]: 1 2 Jan 8 10:52:44 filesrv2 corosync[8933]: [CPG ] chosen downlist: sender r(0) ip(10.0.0.10) ; members(old:1 left:0) Jan 8 10:52:44 filesrv2 corosync[8933]: [MAIN ] Completed service synchronization, ready to provide service. Jan 8 10:52:51 filesrv2 kernel: dlm: got connection from 1 Jan 8 10:55:57 filesrv2 kernel: INFO: task gfs2_quotad:9389 blocked for more than 120 seconds. Jan 8 10:55:57 filesrv2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 8 10:55:57 filesrv2 kernel: gfs2_quotad D ffff8808a7824900 0 9389 2 0x00000080 Jan 8 10:55:57 filesrv2 kernel: ffff88087580da88 0000000000000046 0000000000000000 00000000000001c3 Jan 8 10:55:57 filesrv2 kernel: ffff88087580da18 ffff88087580da50 ffffffff810ea694 ffff88088b184080 Jan 8 10:55:57 filesrv2 kernel: ffff88088e71a5f8 ffff88087580dfd8 000000000000f4e8 ffff88088e71a5f8 Jan 8 10:55:57 filesrv2 kernel: Call Trace: Jan 8 10:55:57 filesrv2 kernel: [] ? rb_reserve_next_event+0xb4/0x370 Jan 8 10:55:57 filesrv2 kernel: [] ? native_sched_clock+0x13/0x60 Jan 8 10:55:57 filesrv2 kernel: [] rwsem_down_failed_common+0x95/0x1d0 Jan 8 10:55:57 filesrv2 kernel: [] ? native_sched_clock+0x13/0x60 Jan 8 10:55:57 filesrv2 kernel: [] rwsem_down_read_failed+0x26/0x30 Jan 8 10:55:57 filesrv2 kernel: [] call_rwsem_down_read_failed+0x14/0x30 Jan 8 10:55:57 filesrv2 kernel: [] ? 
down_read+0x24/0x30 Jan 8 10:55:57 filesrv2 kernel: [] dlm_lock+0x62/0x1e0 [dlm] Jan 8 10:55:57 filesrv2 kernel: [] ? ring_buffer_lock_reserve+0xa2/0x160 Jan 8 10:55:57 filesrv2 kernel: [] gdlm_lock+0xf2/0x130 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] ? gdlm_ast+0x0/0xe0 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] ? gdlm_bast+0x0/0x50 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] do_xmote+0x17f/0x260 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] run_queue+0xf1/0x1d0 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] gfs2_glock_nq+0x1b7/0x360 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] ? try_to_del_timer_sync+0x7b/0xe0 Jan 8 10:55:57 filesrv2 kernel: [] gfs2_statfs_sync+0x58/0x1b0 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] ? schedule_timeout+0x19a/0x2e0 Jan 8 10:55:57 filesrv2 kernel: [] ? gfs2_statfs_sync+0x50/0x1b0 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] quotad_check_timeo+0x57/0xb0 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] gfs2_quotad+0x234/0x2b0 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] ? autoremove_wake_function+0x0/0x40 Jan 8 10:55:57 filesrv2 kernel: [] ? gfs2_quotad+0x0/0x2b0 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] kthread+0x96/0xa0 Jan 8 10:55:57 filesrv2 kernel: [] child_rip+0xa/0x20 Jan 8 10:55:57 filesrv2 kernel: [] ? kthread+0x0/0xa0 Jan 8 10:55:57 filesrv2 kernel: [] ? child_rip+0x0/0x20 Jan 8 10:57:57 filesrv2 kernel: INFO: task gfs2_quotad:9389 blocked for more than 120 seconds. Jan 8 10:57:57 filesrv2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 8 10:57:57 filesrv2 kernel: gfs2_quotad D ffff8808a7824900 0 9389 2 0x00000080 Jan 8 10:57:57 filesrv2 kernel: ffff88087580da88 0000000000000046 0000000000000000 00000000000001c3 Jan 8 10:57:57 filesrv2 kernel: ffff88087580da18 ffff88087580da50 ffffffff810ea694 ffff88088b184080 Jan 8 10:57:57 filesrv2 kernel: ffff88088e71a5f8 ffff88087580dfd8 000000000000f4e8 ffff88088e71a5f8 Jan 8 10:57:57 filesrv2 kernel: Call Trace: Jan 8 10:57:57 filesrv2 kernel: [] ? rb_reserve_next_event+0xb4/0x370 Jan 8 10:57:57 filesrv2 kernel: [] ? native_sched_clock+0x13/0x60 Jan 8 10:57:57 filesrv2 kernel: [] rwsem_down_failed_common+0x95/0x1d0 Jan 8 10:57:57 filesrv2 kernel: [] ? native_sched_clock+0x13/0x60 Jan 8 10:57:57 filesrv2 kernel: [] rwsem_down_read_failed+0x26/0x30 Jan 8 10:57:57 filesrv2 kernel: [] call_rwsem_down_read_failed+0x14/0x30 Jan 8 10:57:57 filesrv2 kernel: [] ? down_read+0x24/0x30 Jan 8 10:57:57 filesrv2 kernel: [] dlm_lock+0x62/0x1e0 [dlm] Jan 8 10:57:57 filesrv2 kernel: [] ? ring_buffer_lock_reserve+0xa2/0x160 Jan 8 10:57:57 filesrv2 kernel: [] gdlm_lock+0xf2/0x130 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] ? gdlm_ast+0x0/0xe0 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] ? gdlm_bast+0x0/0x50 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] do_xmote+0x17f/0x260 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] run_queue+0xf1/0x1d0 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] gfs2_glock_nq+0x1b7/0x360 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] ? try_to_del_timer_sync+0x7b/0xe0 Jan 8 10:57:57 filesrv2 kernel: [] gfs2_statfs_sync+0x58/0x1b0 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] ? schedule_timeout+0x19a/0x2e0 Jan 8 10:57:57 filesrv2 kernel: [] ? gfs2_statfs_sync+0x50/0x1b0 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] quotad_check_timeo+0x57/0xb0 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] gfs2_quotad+0x234/0x2b0 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] ? autoremove_wake_function+0x0/0x40 Jan 8 10:57:57 filesrv2 kernel: [] ? 
gfs2_quotad+0x0/0x2b0 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] kthread+0x96/0xa0 Jan 8 10:57:57 filesrv2 kernel: [] child_rip+0xa/0x20 Jan 8 10:57:57 filesrv2 kernel: [] ? kthread+0x0/0xa0 Jan 8 10:57:57 filesrv2 kernel: [] ? child_rip+0x0/0x20 Jan 8 10:59:22 filesrv2 rgmanager[12557]: State change: clustsrv1 UP Cluster.conf File: Thanks Sathya Narayanan V Solution Architect -----Original Message----- From: SATHYA - IT [mailto:sathyanarayanan.varadharajan at precisionit.co.in] Sent: Monday, January 09, 2012 11:21 AM To: 'Digimer'; 'linux clustering' Subject: Re: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment Hi, Herewith attaching the /var/log/messages of both the servers. Yesterday (08th Jan) one of the server got fenced by other around 10:48 AM. I am also attaching the cluster.conf file for your reference. On the related note, related to heartbeat - I am referring the channel used by corosync. And the name which has been configured in cluster.conf file resolves with bond1 only. Related to the network card, we are using 2 dual port card where we configured 1 port from each for bond0 and 1 port from the other for bond1. So it doesn't seems be a network card related issue. Moreover, we are not having any errors related to bond0. Thanks Sathya Narayanan V Solution Architect This communication may contain confidential information. If you are not the intended recipient it may be unlawful for you to read, copy, distribute, disclose or otherwise use the information contained within this communication.. Errors and Omissions may occur in the contents of this Email arising out of or in connection with data transmission, network malfunction or failure, machine or software error, malfunction, or operator errors by the person who is sending the email. Precision Group accepts no responsibility for any such errors or omissions. The information, views and comments within this communication are those of the individual and not necessarily those of Precision Group. All email that is sent from/to Precision Group is scanned for the presence of computer viruses, security issues and inappropriate content. However, it is the recipient's responsibility to check any attachments for viruses before use. From kkovachev at varna.net Mon Jan 9 11:08:25 2012 From: kkovachev at varna.net (Kaloyan Kovachev) Date: Mon, 09 Jan 2012 13:08:25 +0200 Subject: [Linux-cluster] GFS on CentOS - cman unable to start In-Reply-To: <4F0A6706.6090308@ucsc.edu> References: <4F075BD3.3090702@ucsc.edu> <60c71479-72a3-47aa-a91d-fa1c91c3e9ef@lisa.itlinux.cl> <4F0A6706.6090308@ucsc.edu> Message-ID: <7b4965e95aef00d06ba7be68951fb79b@mx.varna.net> Hi, check /etc/sysconfig/cman maybe there is a different name present as NODENAME ... remove the file (if present) or try to create one as: #CMAN_CLUSTER_TIMEOUT=120 #CMAN_QUORUM_TIMEOUT=0 #CMAN_SHUTDOWN_TIMEOUT=60 FENCED_START_TIMEOUT=120 ##FENCE_JOIN=no #LOCK_FILE="/var/lock/subsys/cman" CLUSTERNAME=ClusterName NODENAME=NodeName On Sun, 08 Jan 2012 20:03:18 -0800, Wes Modes wrote: > The behavior of cman's resolving of cluster node names is less than > clear, as per the RHEL bugzilla report. > > The hostname and cluster.conf match, as does /etc/hosts and uname -n. > The short names and FQDN ping. I believe all the node cluster.conf are > in sync, and all nodes are accessible to each other using either short > or long names. > > You'll have to trust that I've tried everything obvious, and every > possible combination of FQDN and short names in cluster.conf and > hostname. 
That said, it is totally possible I missed something obvious. > > I suspect, there is something else going on and I don't know how to get > at it. > > Wes > > > On 1/6/2012 6:06 PM, Kevin Stanton wrote: >> >> > Hi, >> >> > I think CMAN expect that the names of the cluster nodes be the same >> returned by the command "uname -n". >> >> > For what you write your nodes hostnames are: test01.gdao.ucsc.edu >> and test02.gdao.ucsc.edu, but in cluster.conf you have declared only >> "test01" and "test02". >> >> >> >> I haven't found this to be the case in the past. I actually use a >> separate short name to reference each node which is different than the >> hostname of the server itself. All I've ever had to do is make sure >> it resolves correctly. You can do this either in DNS and/or in >> /etc/hosts. I have found that it's a good idea to do both in case >> your DNS server is a virtual machine and is not running for some >> reason. In that case with /etc/hosts you can still start cman. >> >> >> >> I would make sure whatever node names you use in the cluster.conf will >> resolve when you try to ping it from all nodes in the cluster. Also >> make sure your cluster.conf is in sync between all nodes. >> >> >> >> -Kevin >> >> >> >> >> >> ------------------------------------------------------------------------ >> >> These servers are currently on the same host, but may not be in >> the future. They are in a vm cluster (though honestly, I'm not >> sure what this means yet). >> >> SElinux is on, but disabled. >> Firewalling through iptables is turned off via >> system-config-securitylevel >> >> There is no line currently in the cluster.conf that deals with >> multicasting. >> >> Any other suggestions? >> >> Wes >> >> On 1/6/2012 12:05 PM, Luiz Gustavo Tonello wrote: >> >> Hi, >> >> >> >> This servers is on VMware? At the same host? >> >> SElinux is disable? iptables have something? >> >> >> >> In my environment I had a problem to start GFS2 with servers in >> differents hosts. >> >> To clustering servers, was need migrate one server to the same >> host of the other, and restart this. >> >> >> >> I think, one of the problem was because the virtual switchs. >> >> To solve, I changed a multicast IP, to use 225.0.0.13 at >> cluster.conf >> >> >> >> And add a static route in both, to use default gateway. >> >> >> >> I don't know if it's correct, but this solve my problem. >> >> >> >> I hope that help you. >> >> >> >> Regards. >> >> >> >> On Fri, Jan 6, 2012 at 5:01 PM, Wes Modes > > wrote: >> >> Hi, Steven. >> >> I've tried just about every possible combination of hostname and >> cluster.conf. >> >> ping to test01 resolves to 128.114.31.112 >> ping to test01.gdao.ucsc.edu >> resolves to 128.114.31.112 >> >> It feels like the right thing is being returned. This feels like it >> might be a quirk (or bug possibly) of cman or openais. >> >> There are some old bug reports around this, for example >> https://bugzilla.redhat.com/show_bug.cgi?id=488565. It sounds >> like the >> way that cman reports this error is anything but straightforward. >> >> Is there anyone who has encountered this error and found a solution? >> >> Wes >> >> >> >> On 1/6/2012 2:00 AM, Steven Whitehouse wrote: >> > Hi, >> > >> > On Thu, 2012-01-05 at 13:54 -0800, Wes Modes wrote: >> >> Howdy, y'all. I'm trying to set up GFS in a cluster on CentOS >> systems >> >> running on vmWare. The GFS FS is on a Dell Equilogic SAN. >> >> >> >> I keep running into the same problem despite many >> differently-flavored >> >> attempts to set up GFS. 
The problem comes when I try to start >> cman, the >> >> cluster management software. >> >> >> >> [root at test01]# service cman start >> >> Starting cluster: >> >> Loading modules... done >> >> Mounting configfs... done >> >> Starting ccsd... done >> >> Starting cman... failed >> >> cman not started: Can't find local node name in cluster.conf >> >> /usr/sbin/cman_tool: aisexec daemon didn't start >> >> >> [FAILED] >> >> >> > This looks like what it says... whatever the node name is in >> > cluster.conf, it doesn't exist when the name is looked up, or >> possibly >> > it does exist, but is mapped to the loopback address (it needs to >> map to >> > an address which is valid cluster-wide) >> > >> > Since your config files look correct, the next thing to check is >> > what >> > the resolver is actually returning. Try (for example) a ping to >> test01 >> > (you need to specify exactly the same form of the name as is used >> > in >> > cluster.conf) from test02 and see whether it uses the correct ip >> > address, just in case the wrong thing is being returned. >> > >> > Steve. >> > >> >> [root at test01]# tail /var/log/messages >> >> Jan 5 13:39:40 testbench06 ccsd[13194]: Unable to connect to >> >> cluster infrastructure after 1193640 seconds. >> >> Jan 5 13:40:10 testbench06 ccsd[13194]: Unable to connect to >> >> cluster infrastructure after 1193670 seconds. >> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS >> >> Executive >> >> Service RELEASE 'subrev 1887 version 0.80.6' >> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright >> >> (C) >> >> 2002-2006 MontaVista Software, Inc and contributors. >> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright >> >> (C) >> >> 2006 Red Hat, Inc. >> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS >> >> Executive >> >> Service: started and ready to provide service. >> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] local >> node name >> >> "test01.gdao.ucsc.edu " not found >> in cluster.conf >> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error >> reading CCS >> >> info, cannot start >> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error >> >> reading >> >> config from CCS >> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS >> >> Executive >> >> exiting (reason: could not read the main configuration file). >> >> >> >> Here are details of my configuration: >> >> >> >> [root at test01]# rpm -qa | grep cman >> >> cman-2.0.115-85.el5_7.2 >> >> >> >> [root at test01]# echo $HOSTNAME >> >> test01.gdao.ucsc.edu >> >> >> >> [root at test01]# hostname >> >> test01.gdao.ucsc.edu >> >> >> >> [root at test01]# cat /etc/hosts >> >> # Do not remove the following line, or various programs >> >> # that require network functionality will fail. 
>> >> 128.114.31.112 test01 test01.gdao test01.gdao.ucsc.edu >> >> >> 128.114.31.113 test02 test02.gdao test02.gdao.ucsc.edu >> >> >> 127.0.0.1 localhost.localdomain localhost >> >> ::1 localhost6.localdomain6 localhost6 >> >> >> >> [root at test01]# sestatus >> >> SELinux status: enabled >> >> SELinuxfs mount: /selinux >> >> Current mode: permissive >> >> Mode from config file: permissive >> >> Policy version: 21 >> >> Policy from config file: targeted >> >> >> >> [root at test01]# cat /etc/cluster/cluster.conf >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> > >> ipaddr="gdvcenter.ucsc.edu " >> login="root" passwd="1hateAmazon.com" >> >> vmlogin="root" vmpasswd="esxpass" >> >> >> port="/vmfs/volumes/49086551-c64fd83c-0401-001e0bcd6848/eagle1/gfs1.vmx"/> >> >> >> >> >> >> >> >> >> >> >> >> >> >> I've seen much discussion of this problem, but no definitive >> solutions. >> >> Any help you can provide will be welcome. >> >> >> >> Wes Modes >> >> >> >> -- >> >> Linux-cluster mailing list >> >> Linux-cluster at redhat.com >> >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > >> > -- >> > Linux-cluster mailing list >> > Linux-cluster at redhat.com >> > https://www.redhat.com/mailman/listinfo/linux-cluster >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> >> >> >> -- >> Luiz Gustavo P Tonello. >> >> >> >> -- >> >> Linux-cluster mailing list >> >> Linux-cluster at redhat.com >> >> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> >> >> >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster From klaus.steinberger at Physik.Uni-Muenchen.DE Mon Jan 9 12:37:44 2012 From: klaus.steinberger at Physik.Uni-Muenchen.DE (Klaus Steinberger) Date: Mon, 09 Jan 2012 13:37:44 +0100 Subject: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment In-Reply-To: <003201cccebb$805852e0$8108f8a0$@precisionit.co.in> References: <003201cccebb$805852e0$8108f8a0$@precisionit.co.in> Message-ID: <4F0ADF98.9030809@Physik.Uni-Muenchen.DE> Am 09.01.2012 11:43, schrieb SATHYA - IT: > Klaus, > > For your point the corosync network is not connected to the switch. They are > connected directly to the servers (server to server). Ahh, then the going down of the bond is probably not a sign of a network problem, it probably goes down when the other server is already down (fenced ?) Sincerly, Klaus -- Rechnerbetriebsgruppe / IT, Fakult?t f?r Physik Klaus Steinberger FAX: +49 89 28914280 Tel: +49 89 28914287 -------------- next part -------------- A non-text attachment was scrubbed... Name: 0x7FC1E68A.asc Type: application/pgp-keys Size: 6692 bytes Desc: not available URL: From sathyanarayanan.varadharajan at precisionit.co.in Mon Jan 9 12:46:43 2012 From: sathyanarayanan.varadharajan at precisionit.co.in (SATHYA - IT) Date: Mon, 9 Jan 2012 18:16:43 +0530 Subject: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment In-Reply-To: <4F0ADF98.9030809@Physik.Uni-Muenchen.DE> References: <003201cccebb$805852e0$8108f8a0$@precisionit.co.in> <4F0ADF98.9030809@Physik.Uni-Muenchen.DE> Message-ID: <000401cccecc$bfe3f660$3fabe320$@precisionit.co.in> Klaus, That is weird. 
If you refer the logs which I had posted earlier, the server initiate the fence only after this error message. And the network fail error message is only on one server and not sure how it is not reflecting in the other. The server which has the error message fences the other server. Moreover on the error message, the link is getting down and is back on within 2 seconds. Not sure where it leads to... Thanks Sathya Narayanan V Solution Architect -----Original Message----- From: Klaus Steinberger [mailto:klaus.steinberger at Physik.Uni-Muenchen.DE] Sent: Monday, January 09, 2012 6:08 PM To: SATHYA - IT Cc: 'Digimer'; 'linux clustering' Subject: Re: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment Am 09.01.2012 11:43, schrieb SATHYA - IT: > Klaus, > > For your point the corosync network is not connected to the switch. > They are connected directly to the servers (server to server). Ahh, then the going down of the bond is probably not a sign of a network problem, it probably goes down when the other server is already down (fenced ?) Sincerly, Klaus -- Rechnerbetriebsgruppe / IT, Fakult?t f?r Physik Klaus Steinberger FAX: +49 89 28914280 Tel: +49 89 28914287 This communication may contain confidential information. If you are not the intended recipient it may be unlawful for you to read, copy, distribute, disclose or otherwise use the information contained within this communication.. Errors and Omissions may occur in the contents of this Email arising out of or in connection with data transmission, network malfunction or failure, machine or software error, malfunction, or operator errors by the person who is sending the email. Precision Group accepts no responsibility for any such errors or omissions. The information, views and comments within this communication are those of the individual and not necessarily those of Precision Group. All email that is sent from/to Precision Group is scanned for the presence of computer viruses, security issues and inappropriate content. However, it is the recipient's responsibility to check any attachments for viruses before use. From ajb2 at mssl.ucl.ac.uk Mon Jan 9 13:16:18 2012 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Mon, 09 Jan 2012 13:16:18 +0000 Subject: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment In-Reply-To: <4F0A79FA.7080408@alteeve.com> References: <003901ccce8d$55a9be40$00fd3ac0$@precisionit.co.in> <4F0A79FA.7080408@alteeve.com> Message-ID: <4F0AE8A2.2010601@mssl.ucl.ac.uk> On 09/01/12 05:24, Digimer wrote: > With both of the bond's NICs down, the bond itself is going to drop. Odds are, both NICs are plugged into the same switch. (assuming the OP isn't running things plugged nic-nic - which I have found in the past tends to be flakey when N-way negotiation becomes involved) I'm assuming "heartbeat" - is a dedicated corosync (v)lan. To the OP: Please look at http://www.cyberciti.biz/howto/question/static/linux-ethernet-bonding-driver-howto.php and the descriptions of bonding there. The type of bond you want for this purpose is either LACP (mode 3) (if NICs are plugged into a single switch or switch stack which supports LACP) or Active Failover (mode 1) if separate switches are involved. Any other mode is potentially failure prone if things go wrong. FWIW: My heartbeat setup is as follows. 2 switches with a 4way LACP bond between them. 2 NICs on each cluster member in bonding mode 1, one NIC on each switch. 
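For reference, in the Linux bonding driver active-backup is mode 1 and 802.3ad/LACP is mode 4 (mode 3 is broadcast). A minimal sketch of an active-backup bond on RHEL 6 follows; the device names and address are purely illustrative, not taken from anyone's setup here:

/etc/sysconfig/network-scripts/ifcfg-bond1
  DEVICE=bond1
  ONBOOT=yes
  BOOTPROTO=none
  IPADDR=10.0.0.10
  NETMASK=255.255.255.0
  BONDING_OPTS="mode=1 miimon=100"

/etc/sysconfig/network-scripts/ifcfg-eth3   (and the same again for eth4)
  DEVICE=eth3
  MASTER=bond1
  SLAVE=yes
  ONBOOT=yes
  BOOTPROTO=none

miimon=100 makes the driver poll link state every 100 ms; without some form of link monitoring the bond never notices a dead slave.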
This setup is resiliant against individual link (NIC, cable or fat fingers) OR switch failures. Switches used for this purpose are best completely isolated from the rest of the network and multicast traffic control should be DISABLED. Corosync can be set to failover to the public lan as a last resort but I've found it's not necessary - if things get bad enough that the private lan is completely out of action then the systems should shut themselves down (bad data is worse than zero data). Switch ports should be set "portfast" or whatever the non-cisco equivalent is, or else ~30 seconds will be wasted in checking that whatever's attached doesn't have a lan segment behind it. This can also lead to fencing. From sathyanarayanan.varadharajan at precisionit.co.in Mon Jan 9 13:23:42 2012 From: sathyanarayanan.varadharajan at precisionit.co.in (SATHYA - IT) Date: Mon, 9 Jan 2012 18:53:42 +0530 Subject: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment In-Reply-To: <4F0AE8A2.2010601@mssl.ucl.ac.uk> References: <003901ccce8d$55a9be40$00fd3ac0$@precisionit.co.in> <4F0A79FA.7080408@alteeve.com> <4F0AE8A2.2010601@mssl.ucl.ac.uk> Message-ID: <000d01ccced1$eaac2a20$c0047e60$@precisionit.co.in> Alan, Corosync (heartbeat) network is not connected to switch. The network is connected between server to server directly. Thanks Sathya Narayanan V Solution Architect -----Original Message----- From: Alan Brown [mailto:ajb2 at mssl.ucl.ac.uk] Sent: Monday, January 09, 2012 6:46 PM To: linux clustering Cc: Digimer; SATHYA - IT Subject: Re: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment On 09/01/12 05:24, Digimer wrote: > With both of the bond's NICs down, the bond itself is going to drop. Odds are, both NICs are plugged into the same switch. (assuming the OP isn't running things plugged nic-nic - which I have found in the past tends to be flakey when N-way negotiation becomes involved) I'm assuming "heartbeat" - is a dedicated corosync (v)lan. To the OP: Please look at http://www.cyberciti.biz/howto/question/static/linux-ethernet-bonding-driver -howto.php and the descriptions of bonding there. The type of bond you want for this purpose is either LACP (mode 3) (if NICs are plugged into a single switch or switch stack which supports LACP) or Active Failover (mode 1) if separate switches are involved. Any other mode is potentially failure prone if things go wrong. FWIW: My heartbeat setup is as follows. 2 switches with a 4way LACP bond between them. 2 NICs on each cluster member in bonding mode 1, one NIC on each switch. This setup is resiliant against individual link (NIC, cable or fat fingers) OR switch failures. Switches used for this purpose are best completely isolated from the rest of the network and multicast traffic control should be DISABLED. Corosync can be set to failover to the public lan as a last resort but I've found it's not necessary - if things get bad enough that the private lan is completely out of action then the systems should shut themselves down (bad data is worse than zero data). Switch ports should be set "portfast" or whatever the non-cisco equivalent is, or else ~30 seconds will be wasted in checking that whatever's attached doesn't have a lan segment behind it. This can also lead to fencing. This communication may contain confidential information. If you are not the intended recipient it may be unlawful for you to read, copy, distribute, disclose or otherwise use the information contained within this communication.. 
Errors and Omissions may occur in the contents of this Email arising out of or in connection with data transmission, network malfunction or failure, machine or software error, malfunction, or operator errors by the person who is sending the email. Precision Group accepts no responsibility for any such errors or omissions. The information, views and comments within this communication are those of the individual and not necessarily those of Precision Group. All email that is sent from/to Precision Group is scanned for the presence of computer viruses, security issues and inappropriate content. However, it is the recipient's responsibility to check any attachments for viruses before use. From ajb2 at mssl.ucl.ac.uk Mon Jan 9 13:27:10 2012 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Mon, 09 Jan 2012 13:27:10 +0000 Subject: [Linux-cluster] centos5 to RHEL6 migration In-Reply-To: <4F0AB505.2020402@redhat.com> References: <4F0A532A.2000202@alteeve.com> <4F0AAACE.7080602@mssl.ucl.ac.uk> <4F0AB505.2020402@redhat.com> Message-ID: <4F0AEB2E.2060203@mssl.ucl.ac.uk> On 09/01/12 09:36, Fabio M. Di Nitto wrote: >> RH's advice to use is to "Big Bang" it. > > It?s not much of an advice, as RH does not officially support this > upgrade method. Indeed, but scheduling downtime in a 24*7*365.254 operation like space science ftp servers is tricky. (1: You can't please everyone all the time and they all believe their priorities are of earth-shattering importance. 2: You can't schedule downtime during nights or vacation periods as the people concerned tend to decide this is the best time to run heavy duty batch processing that's due first thing Monday morning.) > The amount of changes in the cluster software between EL5 and EL6 are a > lot less intrusive at system level. I can?t really say for sure for the > entire OS, since the upgrade doesn?t involve only RHCS. Aye. In this case the boxes are ONLY used as NFS fileservers because running anything else on them which touched the GFS(2) FSes resulted in file corruption (which is a case of "NFS vs everything else", more than clustering itself.) It would be _nice_ to have NFSv4 support working and supported in a GFS2 cluster. It's really a pity Ken Olsen refused to opensource VMS (and OSF1) all those years ago. They had this stuff working along time ago and there's a lot of wheel reinvention going on. :( AB From raju.rajsand at gmail.com Mon Jan 9 13:33:03 2012 From: raju.rajsand at gmail.com (Rajagopal Swaminathan) Date: Mon, 9 Jan 2012 19:03:03 +0530 Subject: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment In-Reply-To: <4F0AE8A2.2010601@mssl.ucl.ac.uk> References: <003901ccce8d$55a9be40$00fd3ac0$@precisionit.co.in> <4F0A79FA.7080408@alteeve.com> <4F0AE8A2.2010601@mssl.ucl.ac.uk> Message-ID: Greetings, On Mon, Jan 9, 2012 at 6:46 PM, Alan Brown wrote: > On 09/01/12 05:24, Digimer wrote: > > > Switches used for this purpose are best completely isolated from the rest of > the network and multicast traffic control should be DISABLED. > I distinctly remember asking the network guys Multicast mode to be on for the Heartbeat network (for the clusters that I have built). This is BIG change I suppose from 5.x That was about couple years ago. -- Regards, Rajagopal From fdinitto at redhat.com Mon Jan 9 13:34:18 2012 From: fdinitto at redhat.com (Fabio M. 
Di Nitto) Date: Mon, 09 Jan 2012 14:34:18 +0100 Subject: [Linux-cluster] centos5 to RHEL6 migration In-Reply-To: <4F0AEB2E.2060203@mssl.ucl.ac.uk> References: <4F0A532A.2000202@alteeve.com> <4F0AAACE.7080602@mssl.ucl.ac.uk> <4F0AB505.2020402@redhat.com> <4F0AEB2E.2060203@mssl.ucl.ac.uk> Message-ID: <4F0AECDA.9060402@redhat.com> On 1/9/2012 2:27 PM, Alan Brown wrote: > On 09/01/12 09:36, Fabio M. Di Nitto wrote: > >>> RH's advice to use is to "Big Bang" it. >> >> It?s not much of an advice, as RH does not officially support this >> upgrade method. > > Indeed, but scheduling downtime in a 24*7*365.254 operation like space > science ftp servers is tricky. (1: You can't please everyone all the > time and they all believe their priorities are of earth-shattering > importance. 2: You can't schedule downtime during nights or vacation > periods as the people concerned tend to decide this is the best time to > run heavy duty batch processing that's due first thing Monday morning.) Yeah you are not telling me anything new :) Something i forgot to mention in the other email, is that for example, you can just move the LUNs from your SAN from one cluster to another assuming you are running GFS2 and that will work. So in theory the downtime would be reduced to just stop old cluster -> rewire the SAN -> start new cluster. > >> The amount of changes in the cluster software between EL5 and EL6 are a >> lot less intrusive at system level. I can?t really say for sure for the >> entire OS, since the upgrade doesn?t involve only RHCS. > > Aye. > > In this case the boxes are ONLY used as NFS fileservers because running > anything else on them which touched the GFS(2) FSes resulted in file > corruption (which is a case of "NFS vs everything else", more than > clustering itself.) > Possibly this is one of the use case where upgrading could work. > It would be _nice_ to have NFSv4 support working and supported in a GFS2 > cluster. Steven can answer to this one.. but I think the point is more active/active vs active/passive (IIRC from previous discussions). Fabio From ajb2 at mssl.ucl.ac.uk Mon Jan 9 14:04:25 2012 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Mon, 09 Jan 2012 14:04:25 +0000 Subject: [Linux-cluster] centos5 to RHEL6 migration In-Reply-To: <4F0AECDA.9060402@redhat.com> References: <4F0A532A.2000202@alteeve.com> <4F0AAACE.7080602@mssl.ucl.ac.uk> <4F0AB505.2020402@redhat.com> <4F0AEB2E.2060203@mssl.ucl.ac.uk> <4F0AECDA.9060402@redhat.com> Message-ID: <4F0AF3E9.9060907@mssl.ucl.ac.uk> On 09/01/12 13:34, Fabio M. Di Nitto wrote: > Something i forgot to mention in the other email, is that for example, > you can just move the LUNs from your SAN from one cluster to another > assuming you are running GFS2 and that will work. And assuming that you have 2 clusters. This might be a possiblity shortly. >> It would be _nice_ to have NFSv4 support working and supported in a GFS2 >> cluster. > > Steven can answer to this one.. but I think the point is more > active/active vs active/passive (IIRC from previous discussions). We break up NFS serving into one service (ip) per FS. Any given FS is only served from one node because NFSv3 doesnt play nicely with anything else, including other instances of itself. Bringing all the NFS services all onto one node is perfectly possible but it's still a bunch of individual services. Running all NFS on one box turns into a choke point several times/day due to the loads involved. The protocol just doesn't scale very well. 
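To make the one-IP-per-filesystem pattern concrete, here is a stripped-down rgmanager service stanza as a sketch only; the service name, address, volume and export paths are all illustrative:

  <service name="nfs-fs1" autostart="1" domain="fsdomain1" recovery="relocate">
    <ip address="10.0.1.21" monitor_link="1"/>
    <clusterfs name="fs1" device="/dev/vg01/fs1" mountpoint="/export/fs1"
               fstype="gfs2" force_unmount="0">
      <nfsexport name="fs1-export">
        <nfsclient name="fs1-clients" target="*" options="rw,sync"/>
      </nfsexport>
    </clusterfs>
  </service>

Each filesystem gets its own copy of this with its own IP, so any one export can be relocated or failed over on its own without touching the others.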
From ajb2 at mssl.ucl.ac.uk Mon Jan 9 14:07:21 2012 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Mon, 09 Jan 2012 14:07:21 +0000 Subject: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment In-Reply-To: <000d01ccced1$eaac2a20$c0047e60$@precisionit.co.in> References: <003901ccce8d$55a9be40$00fd3ac0$@precisionit.co.in> <4F0A79FA.7080408@alteeve.com> <4F0AE8A2.2010601@mssl.ucl.ac.uk> <000d01ccced1$eaac2a20$c0047e60$@precisionit.co.in> Message-ID: <4F0AF499.5020208@mssl.ucl.ac.uk> On 09/01/12 13:23, SATHYA - IT wrote: > Alan, > > Corosync (heartbeat) network is not connected to switch. The network is > connected between server to server directly. See my comment about direct hookups. My experience is that they are prone to playing up for no apparent reason (NICs simply aren't designed or tested well enough for this kind of connection mode) Managed Gb Switches are pretty cheap compared to the hours you'll waste trying to make it go. Put a couple in between the servers. From ajb2 at mssl.ucl.ac.uk Mon Jan 9 14:14:46 2012 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Mon, 09 Jan 2012 14:14:46 +0000 Subject: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment In-Reply-To: References: <003901ccce8d$55a9be40$00fd3ac0$@precisionit.co.in> <4F0A79FA.7080408@alteeve.com> <4F0AE8A2.2010601@mssl.ucl.ac.uk> Message-ID: <4F0AF656.7070708@mssl.ucl.ac.uk> On 09/01/12 13:33, Rajagopal Swaminathan wrote: >> Switches used for this purpose are best completely isolated from the rest of >> the network and multicast traffic control should be DISABLED. >> > > I distinctly remember asking the network guys Multicast mode to be on > for the Heartbeat network (for the clusters that I have built). You need multicast. What you don't want, is any form of filtering based on packet rates (broadcast/multicast rate limiting). This gets in the way. I can't emphasise enough that the heartbeat equipment is best separated from everything else. A spanning tree rebuild initiated elsewhere in the LAN may be enough to cause an outage long enough to generate a fence event (Ethernet fabric switching is spreading, but spanning tree will be around for quite a while yet) From wmodes at ucsc.edu Mon Jan 9 15:57:14 2012 From: wmodes at ucsc.edu (Wes Modes) Date: Mon, 09 Jan 2012 07:57:14 -0800 Subject: [Linux-cluster] GFS on CentOS - cman unable to start In-Reply-To: <7b4965e95aef00d06ba7be68951fb79b@mx.varna.net> References: <4F075BD3.3090702@ucsc.edu> <60c71479-72a3-47aa-a91d-fa1c91c3e9ef@lisa.itlinux.cl> <4F0A6706.6090308@ucsc.edu> <7b4965e95aef00d06ba7be68951fb79b@mx.varna.net> Message-ID: <4F0B0E5A.40401@ucsc.edu> Thanks, Kaloyan. Now we're talking. This is something I hadn't already tried yet. I will try it as soon as I get in. Wes On 1/9/2012 3:08 AM, Kaloyan Kovachev wrote: > Hi, > check /etc/sysconfig/cman maybe there is a different name present as > NODENAME ... remove the file (if present) or try to create one as: > > #CMAN_CLUSTER_TIMEOUT=120 > #CMAN_QUORUM_TIMEOUT=0 > #CMAN_SHUTDOWN_TIMEOUT=60 > FENCED_START_TIMEOUT=120 > ##FENCE_JOIN=no > #LOCK_FILE="/var/lock/subsys/cman" > CLUSTERNAME=ClusterName > NODENAME=NodeName > > > On Sun, 08 Jan 2012 20:03:18 -0800, Wes Modes wrote: >> The behavior of cman's resolving of cluster node names is less than >> clear, as per the RHEL bugzilla report. >> >> The hostname and cluster.conf match, as does /etc/hosts and uname -n. >> The short names and FQDN ping. 
I believe all the node cluster.conf are >> in sync, and all nodes are accessible to each other using either short >> or long names. >> >> You'll have to trust that I've tried everything obvious, and every >> possible combination of FQDN and short names in cluster.conf and >> hostname. That said, it is totally possible I missed something obvious. >> >> I suspect, there is something else going on and I don't know how to get >> at it. >> >> Wes >> >> >> On 1/6/2012 6:06 PM, Kevin Stanton wrote: >>>> Hi, >>>> I think CMAN expect that the names of the cluster nodes be the same >>> returned by the command "uname -n". >>> >>>> For what you write your nodes hostnames are: test01.gdao.ucsc.edu >>> and test02.gdao.ucsc.edu, but in cluster.conf you have declared only >>> "test01" and "test02". >>> >>> >>> >>> I haven't found this to be the case in the past. I actually use a >>> separate short name to reference each node which is different than the >>> hostname of the server itself. All I've ever had to do is make sure >>> it resolves correctly. You can do this either in DNS and/or in >>> /etc/hosts. I have found that it's a good idea to do both in case >>> your DNS server is a virtual machine and is not running for some >>> reason. In that case with /etc/hosts you can still start cman. >>> >>> >>> >>> I would make sure whatever node names you use in the cluster.conf will >>> resolve when you try to ping it from all nodes in the cluster. Also >>> make sure your cluster.conf is in sync between all nodes. >>> >>> >>> >>> -Kevin >>> >>> >>> >>> >>> >>> > ------------------------------------------------------------------------ >>> These servers are currently on the same host, but may not be in >>> the future. They are in a vm cluster (though honestly, I'm not >>> sure what this means yet). >>> >>> SElinux is on, but disabled. >>> Firewalling through iptables is turned off via >>> system-config-securitylevel >>> >>> There is no line currently in the cluster.conf that deals with >>> multicasting. >>> >>> Any other suggestions? >>> >>> Wes >>> >>> On 1/6/2012 12:05 PM, Luiz Gustavo Tonello wrote: >>> >>> Hi, >>> >>> >>> >>> This servers is on VMware? At the same host? >>> >>> SElinux is disable? iptables have something? >>> >>> >>> >>> In my environment I had a problem to start GFS2 with servers in >>> differents hosts. >>> >>> To clustering servers, was need migrate one server to the same >>> host of the other, and restart this. >>> >>> >>> >>> I think, one of the problem was because the virtual switchs. >>> >>> To solve, I changed a multicast IP, to use 225.0.0.13 at >>> cluster.conf >>> >>> >>> >>> And add a static route in both, to use default gateway. >>> >>> >>> >>> I don't know if it's correct, but this solve my problem. >>> >>> >>> >>> I hope that help you. >>> >>> >>> >>> Regards. >>> >>> >>> >>> On Fri, Jan 6, 2012 at 5:01 PM, Wes Modes >> > wrote: >>> >>> Hi, Steven. >>> >>> I've tried just about every possible combination of hostname and >>> cluster.conf. >>> >>> ping to test01 resolves to 128.114.31.112 >>> ping to test01.gdao.ucsc.edu >>> resolves to 128.114.31.112 >>> >>> It feels like the right thing is being returned. This feels like > it >>> might be a quirk (or bug possibly) of cman or openais. >>> >>> There are some old bug reports around this, for example >>> https://bugzilla.redhat.com/show_bug.cgi?id=488565. It sounds >>> like the >>> way that cman reports this error is anything but straightforward. >>> >>> Is there anyone who has encountered this error and found a > solution? 
>>> Wes >>> >>> >>> >>> On 1/6/2012 2:00 AM, Steven Whitehouse wrote: >>> > Hi, >>> > >>> > On Thu, 2012-01-05 at 13:54 -0800, Wes Modes wrote: >>> >> Howdy, y'all. I'm trying to set up GFS in a cluster on CentOS >>> systems >>> >> running on vmWare. The GFS FS is on a Dell Equilogic SAN. >>> >> >>> >> I keep running into the same problem despite many >>> differently-flavored >>> >> attempts to set up GFS. The problem comes when I try to start >>> cman, the >>> >> cluster management software. >>> >> >>> >> [root at test01]# service cman start >>> >> Starting cluster: >>> >> Loading modules... done >>> >> Mounting configfs... done >>> >> Starting ccsd... done >>> >> Starting cman... failed >>> >> cman not started: Can't find local node name in cluster.conf >>> >> /usr/sbin/cman_tool: aisexec daemon didn't start >>> >> >>> [FAILED] >>> >> >>> > This looks like what it says... whatever the node name is in >>> > cluster.conf, it doesn't exist when the name is looked up, or >>> possibly >>> > it does exist, but is mapped to the loopback address (it needs to >>> map to >>> > an address which is valid cluster-wide) >>> > >>> > Since your config files look correct, the next thing to check is >>> > what >>> > the resolver is actually returning. Try (for example) a ping to >>> test01 >>> > (you need to specify exactly the same form of the name as is used >>> > in >>> > cluster.conf) from test02 and see whether it uses the correct ip >>> > address, just in case the wrong thing is being returned. >>> > >>> > Steve. >>> > >>> >> [root at test01]# tail /var/log/messages >>> >> Jan 5 13:39:40 testbench06 ccsd[13194]: Unable to connect > to >>> >> cluster infrastructure after 1193640 seconds. >>> >> Jan 5 13:40:10 testbench06 ccsd[13194]: Unable to connect > to >>> >> cluster infrastructure after 1193670 seconds. >>> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS >>> >> Executive >>> >> Service RELEASE 'subrev 1887 version 0.80.6' >>> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright >>> >> (C) >>> >> 2002-2006 MontaVista Software, Inc and contributors. >>> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright >>> >> (C) >>> >> 2006 Red Hat, Inc. >>> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS >>> >> Executive >>> >> Service: started and ready to provide service. >>> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] local >>> node name >>> >> "test01.gdao.ucsc.edu " not found >>> in cluster.conf >>> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error >>> reading CCS >>> >> info, cannot start >>> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error >>> >> reading >>> >> config from CCS >>> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS >>> >> Executive >>> >> exiting (reason: could not read the main configuration file). >>> >> >>> >> Here are details of my configuration: >>> >> >>> >> [root at test01]# rpm -qa | grep cman >>> >> cman-2.0.115-85.el5_7.2 >>> >> >>> >> [root at test01]# echo $HOSTNAME >>> >> test01.gdao.ucsc.edu >>> >> >>> >> [root at test01]# hostname >>> >> test01.gdao.ucsc.edu >>> >> >>> >> [root at test01]# cat /etc/hosts >>> >> # Do not remove the following line, or various programs >>> >> # that require network functionality will fail. 
>>> >> 128.114.31.112 test01 test01.gdao test01.gdao.ucsc.edu >>> >>> >> 128.114.31.113 test02 test02.gdao test02.gdao.ucsc.edu >>> >>> >> 127.0.0.1 localhost.localdomain localhost >>> >> ::1 localhost6.localdomain6 localhost6 >>> >> >>> >> [root at test01]# sestatus >>> >> SELinux status: enabled >>> >> SELinuxfs mount: /selinux >>> >> Current mode: permissive >>> >> Mode from config file: permissive >>> >> Policy version: 21 >>> >> Policy from config file: targeted >>> >> >>> >> [root at test01]# cat /etc/cluster/cluster.conf >>> >> >>> >> >>> >> post_join_delay="120"/> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >> >> ipaddr="gdvcenter.ucsc.edu " >>> login="root" passwd="1hateAmazon.com" >>> >> vmlogin="root" vmpasswd="esxpass" >>> >> >>> > port="/vmfs/volumes/49086551-c64fd83c-0401-001e0bcd6848/eagle1/gfs1.vmx"/> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> I've seen much discussion of this problem, but no definitive >>> solutions. >>> >> Any help you can provide will be welcome. >>> >> >>> >> Wes Modes >>> >> >>> >> -- >>> >> Linux-cluster mailing list >>> >> Linux-cluster at redhat.com >>> >> https://www.redhat.com/mailman/listinfo/linux-cluster >>> > >>> > -- >>> > Linux-cluster mailing list >>> > Linux-cluster at redhat.com >>> > https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >>> >>> >>> >>> -- >>> Luiz Gustavo P Tonello. >>> >>> >>> >>> -- >>> >>> Linux-cluster mailing list >>> >>> Linux-cluster at redhat.com >>> >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >>> >>> >>> >>> >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From rajendra.roka at pacificmags.com.au Mon Jan 9 21:57:29 2012 From: rajendra.roka at pacificmags.com.au (Roka, Rajendra) Date: Tue, 10 Jan 2012 08:57:29 +1100 Subject: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster Message-ID: <508450C6AB960E4299CA64E838597F6202F22B63@nsw-mmp-exch1.snl.7net.com.au> I am having issue with mysql service in RHEL6.2 cluster. 
While starting service I receive the following error in /var/log/message Jan 10 08:47:52 atp-wwdev1 rgmanager[6015]: Starting Service mysql:mysql Jan 10 08:47:54 atp-wwdev1 rgmanager[6096]: Starting Service mysql:mysql > Failed - Timeout Error Jan 10 08:47:54 atp-wwdev1 rgmanager[1842]: start on mysql "mysql" returned 1 (generic error) Jan 10 08:47:54 atp-wwdev1 rgmanager[1842]: #68: Failed to start service:mysql; return value: 1 Jan 10 08:47:54 atp-wwdev1 rgmanager[1842]: Stopping service service:mysql Jan 10 08:47:54 atp-wwdev1 rgmanager[6180]: Stopping Service mysql:mysql Jan 10 08:47:55 atp-wwdev1 rgmanager[6202]: Checking Existence Of File /var/run/cluster/mysql.pid [mysql:mysql] > Failed - File Doesn't Exist Jan 10 08:47:55 atp-wwdev1 rgmanager[6224]: Stopping Service mysql:mysql > Succeed Jan 10 08:47:55 atp-wwdev1 rgmanager[1842]: Service service:mysql is recovering Jan 10 08:47:55 atp-wwdev1 rgmanager[1842]: #71: Relocating failed service service:mysql Jan 10 08:47:59 atp-wwdev1 rgmanager[1842]: Service service:mysql is stopped Can you please help me with the above problem? My cluster.conf file is follows: /etc/my.cnf file is follows: [mysqld] datadir=/var/lib/mysql socket=/var/lib/mysql/mysql.sock user=mysql # Disabling symbolic-links is recommended to prevent assorted security risks symbolic-links=0 [mysqld_safe] log-error=/var/log/mysqld.log pid-file=/var/run/cluster/mysql.pid Thanks Raj Important Notice: This message and its attachments are confidential and may contain information which is protected by copyright. It is intended solely for the named addressee. If you are not the authorised recipient (or responsible for delivery of the message to the authorised recipient), you must not use, disclose, print, copy or deliver this message or its attachments to anyone. If you receive this email in error, please contact the sender immediately and permanently delete this message and its attachments from your system. Any content of this message and its attachments that does not relate to the official business of Pacific Magazines Pty Limited must be taken not to have been sent or endorsed by it. No representation is made that this email or its attachments are without defect or that the contents express views other than those of the sender. Please consider the environment - do you really need to print this email? -------------- next part -------------- An HTML attachment was scrubbed... URL: From td3201 at gmail.com Mon Jan 9 22:36:31 2012 From: td3201 at gmail.com (Terry) Date: Mon, 9 Jan 2012 16:36:31 -0600 Subject: [Linux-cluster] centos5 to RHEL6 migration In-Reply-To: <4F0AF3E9.9060907@mssl.ucl.ac.uk> References: <4F0A532A.2000202@alteeve.com> <4F0AAACE.7080602@mssl.ucl.ac.uk> <4F0AB505.2020402@redhat.com> <4F0AEB2E.2060203@mssl.ucl.ac.uk> <4F0AECDA.9060402@redhat.com> <4F0AF3E9.9060907@mssl.ucl.ac.uk> Message-ID: So here's what I have done so far: 1. Created new cluster based on RHEL 6. 2. Created resources and services from scratch to match that in the old cluster (fsid, mount points, everything). I am using Congra (luci/ricci) just to ensure I am using the right syntax. 3. Gave access to storage volumes (iscsi) to new cluster node 4. pvscan/vgscan/lvscan 5. Disabled NFS services on old cluster 6. Enabled the NFS services on the new cluster That's it. Life's good for the volumes on the cluster. I am yet to transfer my postgres stuff but I am moving from 8.3 to 9.0 so that will be a new volume and postgres installation so nothing exciting there. 
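For reference, the storage handover in steps 3 through 6 boils down to something like the sketch below. This is only an illustration: the iSCSI portal address, volume group name, and service name are placeholders, not the actual values from this cluster.

# on the new RHEL 6 node: discover and log in to the existing iSCSI volumes
iscsiadm -m discovery -t sendtargets -p 192.168.1.10   # placeholder portal
iscsiadm -m node --login

# re-read LVM metadata so the old cluster's volume groups become visible
pvscan && vgscan && lvscan
vgchange -ay vg_nfsdata                                # placeholder VG name

# hand the service over: disable it on the old cluster, enable it on the new one
clusvcadm -d nfs-svc    # run on an old-cluster node
clusvcadm -e nfs-svc    # run on a new-cluster node

Because the filesystems are plain ext3/4 rather than a cluster filesystem, the important point is that only one cluster has the service (and therefore the mount) enabled at any time.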
On Mon, Jan 9, 2012 at 8:04 AM, Alan Brown wrote: > On 09/01/12 13:34, Fabio M. Di Nitto wrote: > > Something i forgot to mention in the other email, is that for example, >> you can just move the LUNs from your SAN from one cluster to another >> assuming you are running GFS2 and that will work. >> > > And assuming that you have 2 clusters. This might be a possiblity shortly. > > > It would be _nice_ to have NFSv4 support working and supported in a GFS2 >>> cluster. >>> >> >> Steven can answer to this one.. but I think the point is more >> active/active vs active/passive (IIRC from previous discussions). >> > > We break up NFS serving into one service (ip) per FS. > > Any given FS is only served from one node because NFSv3 doesnt play nicely > with anything else, including other instances of itself. > > Bringing all the NFS services all onto one node is perfectly possible but > it's still a bunch of individual services. > > Running all NFS on one box turns into a choke point several times/day due > to the loads involved. The protocol just doesn't scale very well. > > > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/**mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pbruna at it-linux.cl Mon Jan 9 22:34:22 2012 From: pbruna at it-linux.cl (Patricio A. Bruna) Date: Mon, 09 Jan 2012 19:34:22 -0300 (CLST) Subject: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster In-Reply-To: <508450C6AB960E4299CA64E838597F6202F22B63@nsw-mmp-exch1.snl.7net.com.au> Message-ID: Has the mysql user the permissions to write on the /var/run/cluster directory? ------------------------------------ Patricio Bruna V. IT Linux Ltda. www.it-linux.cl Twitter Fono : (+56-2) 333 0578 M?vil: (+56-9) 8899 6618 ----- Mensaje original ----- > I am having issue with mysql service in RHEL6.2 cluster. While > starting service I receive the following error in /var/log/message > Jan 10 08:47:52 atp-wwdev1 rgmanager[6015]: Starting Service > mysql:mysql > Jan 10 08:47:54 atp-wwdev1 rgmanager[6096]: Starting Service > mysql:mysql > Failed - Timeout Error > Jan 10 08:47:54 atp-wwdev1 rgmanager[1842]: start on mysql "mysql" > returned 1 (generic error) > Jan 10 08:47:54 atp-wwdev1 rgmanager[1842]: #68: Failed to start > service:mysql; return value: 1 > Jan 10 08:47:54 atp-wwdev1 rgmanager[1842]: Stopping service > service:mysql > Jan 10 08:47:54 atp-wwdev1 rgmanager[6180]: Stopping Service > mysql:mysql > Jan 10 08:47:55 atp-wwdev1 rgmanager[6202]: Checking Existence Of > File /var/run/cluster/mysql.pid [mysql:mysql] > Failed - File > Doesn't Exist > Jan 10 08:47:55 atp-wwdev1 rgmanager[6224]: Stopping Service > mysql:mysql > Succeed > Jan 10 08:47:55 atp-wwdev1 rgmanager[1842]: Service service:mysql is > recovering > Jan 10 08:47:55 atp-wwdev1 rgmanager[1842]: #71: Relocating failed > service service:mysql > Jan 10 08:47:59 atp-wwdev1 rgmanager[1842]: Service service:mysql is > stopped > Can you please help me with the above problem? 
> My cluster.conf file is follows: > > > > > > > > > > > > > > > > > > restricted="0"> > > > > > > > host="10.26.240.190" mountpoint="/var/lib/mysql" name="filesystem" > no_unmount="on"/> > name="MySQL server" shutdown_wait="2" startup_wait="0"/> > > name="access_ip" recovery="relocate"> > > > name="mysql" recovery="relocate"> > name="mysql" shutdown_wait="2" startup_wait="2"/> > > name="storage" recovery="relocate"> > > > > post_join_delay="3"/> > > > > > > > /etc/my.cnf file is follows: > [mysqld] > datadir=/var/lib/mysql > socket=/var/lib/mysql/mysql.sock > user=mysql > # Disabling symbolic-links is recommended to prevent assorted > security risks > symbolic-links=0 > [mysqld_safe] > log-error=/var/log/mysqld.log > pid-file=/var/run/cluster/mysql.pid > Thanks > Raj > Important Notice: > This message and its attachments are confidential and may contain > information which is protected by copyright. It is intended solely > for the named addressee. If you are not the authorised recipient (or > responsible for delivery of the message to the authorised > recipient), you must not use, disclose, print, copy or deliver this > message or its attachments to anyone. If you receive this email in > error, please contact the sender immediately and permanently delete > this message and its attachments from your system. > Any content of this message and its attachments that does not relate > to the official business of Pacific Magazines Pty Limited must be > taken not to have been sent or endorsed by it. No representation is > made that this email or its attachments are without defect or that > the contents express views other than those of the sender. > Please consider the environment - do you really need to print this > email? > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: zimbra_gold_partner.png Type: image/png Size: 2893 bytes Desc: not available URL: From rajendra.roka at pacificmags.com.au Mon Jan 9 23:09:25 2012 From: rajendra.roka at pacificmags.com.au (Roka, Rajendra) Date: Tue, 10 Jan 2012 10:09:25 +1100 Subject: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster In-Reply-To: References: <508450C6AB960E4299CA64E838597F6202F22B63@nsw-mmp-exch1.snl.7net.com.au> Message-ID: <508450C6AB960E4299CA64E838597F6202F22B64@nsw-mmp-exch1.snl.7net.com.au> Yes it has. drwx--x--x. 3 mysql root 4096 Jan 9 13:45 cluster Thanks From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Patricio A. Bruna Sent: Tuesday, 10 January 2012 9:34 AM To: linux clustering Subject: Re: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster Has the mysql user the permissions to write on the /var/run/cluster directory? ------------------------------------ Patricio Bruna V. IT Linux Ltda. www.it-linux.cl Twitter Fono : (+56-2) 333 0578 M?vil: (+56-9) 8899 6618 ________________________________ I am having issue with mysql service in RHEL6.2 cluster. 
While starting service I receive the following error in /var/log/message Jan 10 08:47:52 atp-wwdev1 rgmanager[6015]: Starting Service mysql:mysql Jan 10 08:47:54 atp-wwdev1 rgmanager[6096]: Starting Service mysql:mysql > Failed - Timeout Error Jan 10 08:47:54 atp-wwdev1 rgmanager[1842]: start on mysql "mysql" returned 1 (generic error) Jan 10 08:47:54 atp-wwdev1 rgmanager[1842]: #68: Failed to start service:mysql; return value: 1 Jan 10 08:47:54 atp-wwdev1 rgmanager[1842]: Stopping service service:mysql Jan 10 08:47:54 atp-wwdev1 rgmanager[6180]: Stopping Service mysql:mysql Jan 10 08:47:55 atp-wwdev1 rgmanager[6202]: Checking Existence Of File /var/run/cluster/mysql.pid [mysql:mysql] > Failed - File Doesn't Exist Jan 10 08:47:55 atp-wwdev1 rgmanager[6224]: Stopping Service mysql:mysql > Succeed Jan 10 08:47:55 atp-wwdev1 rgmanager[1842]: Service service:mysql is recovering Jan 10 08:47:55 atp-wwdev1 rgmanager[1842]: #71: Relocating failed service service:mysql Jan 10 08:47:59 atp-wwdev1 rgmanager[1842]: Service service:mysql is stopped Can you please help me with the above problem? My cluster.conf file is follows: /etc/my.cnf file is follows: [mysqld] datadir=/var/lib/mysql socket=/var/lib/mysql/mysql.sock user=mysql # Disabling symbolic-links is recommended to prevent assorted security risks symbolic-links=0 [mysqld_safe] log-error=/var/log/mysqld.log pid-file=/var/run/cluster/mysql.pid Thanks Raj Important Notice: This message and its attachments are confidential and may contain information which is protected by copyright. It is intended solely for the named addressee. If you are not the authorised recipient (or responsible for delivery of the message to the authorised recipient), you must not use, disclose, print, copy or deliver this message or its attachments to anyone. If you receive this email in error, please contact the sender immediately and permanently delete this message and its attachments from your system. Any content of this message and its attachments that does not relate to the official business of Pacific Magazines Pty Limited must be taken not to have been sent or endorsed by it. No representation is made that this email or its attachments are without defect or that the contents express views other than those of the sender. Please consider the environment - do you really need to print this email? -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster Important Notice: This message and its attachments are confidential and may contain information which is protected by copyright. It is intended solely for the named addressee. If you are not the authorised recipient (or responsible for delivery of the message to the authorised recipient), you must not use, disclose, print, copy or deliver this message or its attachments to anyone. If you receive this email in error, please contact the sender immediately and permanently delete this message and its attachments from your system. Any content of this message and its attachments that does not relate to the official business of Pacific Magazines Pty Limited must be taken not to have been sent or endorsed by it. No representation is made that this email or its attachments are without defect or that the contents express views other than those of the sender. Please consider the environment - do you really need to print this email? -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 2893 bytes Desc: image001.png URL: From rmitchel at redhat.com Mon Jan 9 23:12:08 2012 From: rmitchel at redhat.com (Ryan Mitchell) Date: Tue, 10 Jan 2012 09:12:08 +1000 Subject: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster In-Reply-To: <508450C6AB960E4299CA64E838597F6202F22B63@nsw-mmp-exch1.snl.7net.com.au> References: <508450C6AB960E4299CA64E838597F6202F22B63@nsw-mmp-exch1.snl.7net.com.au> Message-ID: <4F0B7448.9060008@redhat.com> On 01/10/2012 07:57 AM, Roka, Rajendra wrote: > > *I am having issue with mysql service in RHEL6.2 cluster. While > starting service I receive the following error in /var/log/message* > > Jan 10 08:47:52 atp-wwdev1 rgmanager[6015]: Starting Service mysql:mysql > > Jan 10 08:47:54 atp-wwdev1 rgmanager[6096]: Starting Service > mysql:mysql > Failed - Timeout Error > > Jan 10 08:47:54 atp-wwdev1 rgmanager[1842]: start on mysql "mysql" > returned 1 (generic error) > > I'm pretty sure the first problem is that mysql doesn't start before the script times out. All subsequent errors are trying to clean up from the failed start and can be ignored. There won't be a pid file if the service did not start or if it was cleanly shut down outside of rgmanager. Try increasing the startup_wait (something large until you find its successful, like 60). Its currently waiting 2 seconds. Also, I don't think you want to have the VIP and the service that uses it (mysql) in different services. They should be in the same service, because they always have to run on the same node (they aren't independent). Same goes for the filesystem resource if that is required by MYSQL. Perhaps something like the following?: Lastly, you have created a fence device but you haven't assigned it to the nodes so they currently have no fencing devices. Make sure you do that and test fencing before doing anything important with this cluster. Regards, Ryan Mitchell Software Maintenance Engineer Support Engineering Group Red Hat, Inc. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rajendra.roka at pacificmags.com.au Tue Jan 10 00:48:35 2012 From: rajendra.roka at pacificmags.com.au (Roka, Rajendra) Date: Tue, 10 Jan 2012 11:48:35 +1100 Subject: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster In-Reply-To: <4F0B7448.9060008@redhat.com> References: <508450C6AB960E4299CA64E838597F6202F22B63@nsw-mmp-exch1.snl.7net.com.au> <4F0B7448.9060008@redhat.com> Message-ID: <508450C6AB960E4299CA64E838597F6202F22B66@nsw-mmp-exch1.snl.7net.com.au> I have changed the resources and service in cluster.conf as follows: But no luck with the following message: Jan 10 11:42:57 atp-wwdev1 modcluster: Starting service: mysql on node Jan 10 11:42:57 atp-wwdev1 rgmanager[1690]: Starting stopped service service:mysql Jan 10 11:42:58 atp-wwdev1 rgmanager[5252]: Adding IPv4 address 10.26.240.95/24 to eth0 Jan 10 11:43:01 atp-wwdev1 rgmanager[5401]: Starting Service mysql:mysql Jan 10 11:44:01 atp-wwdev1 rgmanager[5657]: Starting Service mysql:mysql > Failed - Timeout Error Jan 10 11:44:01 atp-wwdev1 rgmanager[1690]: start on mysql "mysql" returned 1 (generic error) Jan 10 11:44:02 atp-wwdev1 rgmanager[1690]: #68: Failed to start service:mysql; return value: 1 Jan 10 11:44:02 atp-wwdev1 rgmanager[1690]: Stopping service service:mysql Jan 10 11:44:02 atp-wwdev1 rgmanager[5742]: Stopping Service mysql:mysql Jan 10 11:44:02 atp-wwdev1 rgmanager[5764]: Checking Existence Of File /var/run/cluster/mysql/mysql:mysql.pid [mysql:mysql] > Failed - File Doesn't Exist Jan 10 11:44:02 atp-wwdev1 rgmanager[5786]: Stopping Service mysql:mysql > Succeed Jan 10 11:44:02 atp-wwdev1 rgmanager[5837]: Removing IPv4 address 10.26.240.95/24 from eth0 Jan 10 11:44:04 atp-wwdev1 rgmanager[5924]: unmounting /var/lib/mysql Jan 10 11:44:04 atp-wwdev1 rgmanager[1690]: Service service:mysql is recovering Jan 10 11:44:04 atp-wwdev1 rgmanager[1690]: #71: Relocating failed service service:mysql Jan 10 11:45:14 atp-wwdev1 rgmanager[1690]: Service service:mysql is stopped Also changed the my.conf to: pid-file=/var/run/cluster/mysql/mysql.pid Cheers From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ryan Mitchell Sent: Tuesday, 10 January 2012 10:12 AM To: linux-cluster at redhat.com Subject: Re: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster On 01/10/2012 07:57 AM, Roka, Rajendra wrote: I am having issue with mysql service in RHEL6.2 cluster. While starting service I receive the following error in /var/log/message Jan 10 08:47:52 atp-wwdev1 rgmanager[6015]: Starting Service mysql:mysql Jan 10 08:47:54 atp-wwdev1 rgmanager[6096]: Starting Service mysql:mysql > Failed - Timeout Error Jan 10 08:47:54 atp-wwdev1 rgmanager[1842]: start on mysql "mysql" returned 1 (generic error) I'm pretty sure the first problem is that mysql doesn't start before the script times out. All subsequent errors are trying to clean up from the failed start and can be ignored. There won't be a pid file if the service did not start or if it was cleanly shut down outside of rgmanager. Try increasing the startup_wait (something large until you find its successful, like 60). Its currently waiting 2 seconds. Also, I don't think you want to have the VIP and the service that uses it (mysql) in different services. They should be in the same service, because they always have to run on the same node (they aren't independent). Same goes for the filesystem resource if that is required by MYSQL. 
Perhaps something like the following?: Lastly, you have created a fence device but you haven't assigned it to the nodes so they currently have no fencing devices. Make sure you do that and test fencing before doing anything important with this cluster. Regards, Ryan Mitchell Software Maintenance Engineer Support Engineering Group Red Hat, Inc. Important Notice: This message and its attachments are confidential and may contain information which is protected by copyright. It is intended solely for the named addressee. If you are not the authorised recipient (or responsible for delivery of the message to the authorised recipient), you must not use, disclose, print, copy or deliver this message or its attachments to anyone. If you receive this email in error, please contact the sender immediately and permanently delete this message and its attachments from your system. Any content of this message and its attachments that does not relate to the official business of Pacific Magazines Pty Limited must be taken not to have been sent or endorsed by it. No representation is made that this email or its attachments are without defect or that the contents express views other than those of the sender. Please consider the environment - do you really need to print this email? -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux at alteeve.com Tue Jan 10 00:59:39 2012 From: linux at alteeve.com (Digimer) Date: Mon, 09 Jan 2012 19:59:39 -0500 Subject: [Linux-cluster] centos5 to RHEL6 migration In-Reply-To: References: <4F0A532A.2000202@alteeve.com> <4F0AAACE.7080602@mssl.ucl.ac.uk> <4F0AB505.2020402@redhat.com> <4F0AEB2E.2060203@mssl.ucl.ac.uk> <4F0AECDA.9060402@redhat.com> <4F0AF3E9.9060907@mssl.ucl.ac.uk> Message-ID: <4F0B8D7B.7030502@alteeve.com> On 01/09/2012 05:36 PM, Terry wrote: > So here's what I have done so far: > 1. Created new cluster based on RHEL 6. > 2. Created resources and services from scratch to match that in the old > cluster (fsid, mount points, everything). I am using Congra (luci/ricci) > just to ensure I am using the right syntax. > 3. Gave access to storage volumes (iscsi) to new cluster node > 4. pvscan/vgscan/lvscan > 5. Disabled NFS services on old cluster > 6. Enabled the NFS services on the new cluster > > That's it. Life's good for the volumes on the cluster. I am yet to > transfer my postgres stuff but I am moving from 8.3 to 9.0 so that will > be a new volume and postgres installation so nothing exciting there. Thanks for reporting back. I'm glad to hear it worked out well. Did you have to change your gfs part to gfs2? -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "omg my singularity battery is dead again. stupid hawking radiation." - epitron From Gert.Wieberdink at enovation.nl Tue Jan 10 11:12:08 2012 From: Gert.Wieberdink at enovation.nl (Gert Wieberdink) Date: Tue, 10 Jan 2012 12:12:08 +0100 Subject: [Linux-cluster] (no subject) Message-ID: <8634845864125D4D9B397A3E598995980C9497F45A@MBX.emd.enovation.net> RHCS/GFS2 support team, I would like to inform you about a serious GFS2 problem we encountered last week. Please find a detailed description below. I have enclosed a tarfile containing detailed information about this problem. Description Two-node cluster is used as a test cluster without any load. Only functionality is tested, no performance tests. The RHCS services that run on this cluster are rather standard services. 
In a 2-day timeframe we had two occurrences of this problem which were both very similar. On the 2nd node, a Perl script tried to write some info to a file on the GFS2 filesystem, but the process hung at that time. From the GFS2 lockdump info we saw one W-lock associated with an inode and it turned out that the inode was a directory on GFS2. Every command executed on that file (eg. ls -l) or on this directory resulted in a hang of that process (eg. du ). The processes that hung all had the D-state (uninterruptable sleep). However, from the 1st node all files and directories were accessible without any problem. Even ls -lR executed on the 1st node from top of the GFS2 filesystem traversed the full directory tree without problems. We suspect that the offending directory has got a W-lock and that there is no lock owner anymore. So, it does not look like a 'global' file system hang, but it seems to to be a local problem on the 2nd node, where the major part of the GFS2 is also accessible from the 2nd node, except the dir with the lock. Needless to say that this causes the application to be unavailable. We are unable to reproduce the problem. 1st occurrence. After collecting information, we rebooted the 2nd node and after the reboot it joined the 1st node in the cluster without any problem. 2nd occurrence. This happened 2 days later in the same way on the same node. After collecting information, we now also ran gfs2_fsck on the GFS2 filesystem before letting it join the cluster. No errors, orphans, corruption was reported. After the fsck we started the cluster software on the 2nd node and the 2nd node joined the cluster without any problem. Additional information (gfs2_lockdump, gfs2_hangalyzer, sysrq-t info, etc.) was collected in a tarball (enov_additional_info.tar). Additional information in additional_info.tar - enov_clusterinfo_app2.txt.gz containing - /etc/cluster.conf - gfs2_hangalyzer output from 2nd node - cman_tool - group_tool < -v, dump, dump fence, dump gfs2> - ccs_tool - openais-cfgtool -s - clustat -fl - Process status information of all processes - gfs2_tool gettune /gfsdata - enov_sysrq-t_app2.txt.gz - enov_glocks_app2.txt.gz - enov_debugfs_dlm_app2.tar.gz Contains compressed tarball of dlm directory from debugfs filesystem from 2nd node. Environment 2-node cluster running CentOS 5.7, with RedHat Cluster Suite and GFS2. Latest updates for OS and RHCS/GFS2 (as per Jan 8, 2012) are installed. Kernel version 2.6.18-274.12.1.el5PAE. One GFS2 filesystem (20G) on HP/LeftHand Networks iSCSI SAN volume. iSCSI initiator version 6.2.0.872-10.el5. Thanking you in advance for your cooperation. If you need additional information to help to solve this problem, please let me know. With kind regards, G. Wieberdink Sr. Engineer at E.Novation gert.wieberdink at enovation.nl -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: enov_additional_info.tar Type: application/x-tar Size: 102400 bytes Desc: enov_additional_info.tar URL: From td3201 at gmail.com Tue Jan 10 15:04:13 2012 From: td3201 at gmail.com (Terry) Date: Tue, 10 Jan 2012 09:04:13 -0600 Subject: [Linux-cluster] centos5 to RHEL6 migration In-Reply-To: <4F0B8D7B.7030502@alteeve.com> References: <4F0A532A.2000202@alteeve.com> <4F0AAACE.7080602@mssl.ucl.ac.uk> <4F0AB505.2020402@redhat.com> <4F0AEB2E.2060203@mssl.ucl.ac.uk> <4F0AECDA.9060402@redhat.com> <4F0AF3E9.9060907@mssl.ucl.ac.uk> <4F0B8D7B.7030502@alteeve.com> Message-ID: On Mon, Jan 9, 2012 at 6:59 PM, Digimer wrote: > On 01/09/2012 05:36 PM, Terry wrote: > > So here's what I have done so far: > > 1. Created new cluster based on RHEL 6. > > 2. Created resources and services from scratch to match that in the old > > cluster (fsid, mount points, everything). I am using Congra (luci/ricci) > > just to ensure I am using the right syntax. > > 3. Gave access to storage volumes (iscsi) to new cluster node > > 4. pvscan/vgscan/lvscan > > 5. Disabled NFS services on old cluster > > 6. Enabled the NFS services on the new cluster > > > > That's it. Life's good for the volumes on the cluster. I am yet to > > transfer my postgres stuff but I am moving from 8.3 to 9.0 so that will > > be a new volume and postgres installation so nothing exciting there. > > Thanks for reporting back. I'm glad to hear it worked out well. Did you > have to change your gfs part to gfs2? > > -- > Digimer > E-Mail: digimer at alteeve.com > Freenode handle: digimer > Papers and Projects: http://alteeve.com > Node Assassin: http://nodeassassin.org > "omg my singularity battery is dead again. > stupid hawking radiation." - epitron > I am not using GFS. All ext3/4. -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux at alteeve.com Tue Jan 10 15:52:18 2012 From: linux at alteeve.com (Digimer) Date: Tue, 10 Jan 2012 10:52:18 -0500 Subject: [Linux-cluster] centos5 to RHEL6 migration In-Reply-To: References: <4F0A532A.2000202@alteeve.com> <4F0AAACE.7080602@mssl.ucl.ac.uk> <4F0AB505.2020402@redhat.com> <4F0AEB2E.2060203@mssl.ucl.ac.uk> <4F0AECDA.9060402@redhat.com> <4F0AF3E9.9060907@mssl.ucl.ac.uk> <4F0B8D7B.7030502@alteeve.com> Message-ID: <4F0C5EB2.9060209@alteeve.com> > I am not using GFS. All ext3/4. Well then, that would make it easy to deal with, I suppose. :P -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "omg my singularity battery is dead again. stupid hawking radiation." - epitron From dkelson at gurulabs.com Wed Jan 11 19:42:12 2012 From: dkelson at gurulabs.com (Dax Kelson) Date: Wed, 11 Jan 2012 12:42:12 -0700 Subject: [Linux-cluster] [PATCH] fence_scsi log actual commands Message-ID: <1326310932.4540.11.camel@mentor.gurulabs.com> There is a new Linux iSCSI target in the Linux kernel 3.1. Unlike tgt, it supports SPC-3 compliant persistent reservations so that it can be used with fence_scsi. I encountered a bug in the iSCSI target (an easy workaround is available) and it was very helpful to see the actual commands that fence_scsi was running. I now have a fully working 3 node RHEL6.2 cluster with a Fedora 16 iSCSI target with working SCSI fencing. Please consider applying this patch so, that if logging is enabled, the actual command being run will be logged as well. 
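As background for this kind of debugging, the SCSI-3 reservation state that fence_scsi manipulates can be inspected by hand with sg_persist. A rough sketch follows; the device path is a placeholder, and on a target affected by the bug mentioned below the larger allocation length from the workaround may be needed on these commands as well.

# list the keys registered on the shared LUN (one per cluster node)
sg_persist --in --read-keys --device=/dev/sdb

# show the current reservation holder and reservation type
sg_persist --in --read-reservation --device=/dev/sdb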
Dax Kelson Guru Labs Workaround details -- the bug should be fixed when the scatterlist conversion is completed by Andy Grover, but for now modifying the allocation length used by the sg_persist commands to 512 by adding '-l 512' to the sg_persist command lines is the workaround. --- fence_scsi.org 2012-01-11 12:27:52.234042483 -0700 +++ fence_scsi 2012-01-10 18:09:34.301813562 -0700 @@ -208,7 +208,7 @@ # log_error ("$self (err=$err)"); # } - log_debug ("$self (err=$err)"); + log_debug ("$self (err=$err cmd=$cmd)"); return ($err); } @@ -245,7 +245,7 @@ # log_error ("$self (err=$err)"); # } - log_debug ("$self (err=$err)"); + log_debug ("$self (err=$err) (cmd=$cmd)"); return ($err); } @@ -265,7 +265,7 @@ # log_error ("$self (err=$err)"); # } - log_debug ("$self (err=$err)"); + log_debug ("$self (err=$err, cmd=$cmd)"); return ($err); } @@ -285,7 +285,7 @@ # log_error ("$self (err=$err)"); # } - log_debug ("$self (err=$err)"); + log_debug ("$self (err=$err cmd=$cmd)"); return ($err); } @@ -305,7 +305,7 @@ # log_error ("$self (err=$err)"); # } - log_debug ("$self (err=$err)"); + log_debug ("$self (err=$err cmd=$cmd)"); return ($err); } @@ -325,7 +325,7 @@ # log_error ("$self (err=$err)"); # } - log_debug ("$self (err=$err)"); + log_debug ("$self (err=$err cmd=$cmd)"); return ($err); } @@ -342,7 +342,7 @@ ## note that it is not necessarily an error is $err is non-zero, ## so just log the device and status and continue. - log_debug ("$self (dev=$dev, status=$err)"); + log_debug ("$self (dev=$dev, status=$err, cmd=$cmd)"); return ($err); } @@ -425,7 +425,7 @@ my $err = ($?>>8); if ($err != 0) { - log_error ("$self (err=$err)"); + log_error ("$self (err=$err cmd=$cmd)"); } # die "[error]: $self\n" if ($?>>8); @@ -447,7 +447,7 @@ my $err = ($?>>8); if ($err != 0) { - log_error ("$self (err=$err)"); + log_error ("$self (err=$err cmd=$cmd)"); } # die "[error]: $self\n" if ($?>>8); @@ -479,7 +479,7 @@ my $err = ($?>>8); if ($err != 0) { - log_error ("$self (err=$err)"); + log_error ("$self (err=$err cmd=$cmd)"); } # die "[error]: $self\n" if ($?>>8); @@ -576,7 +576,7 @@ my $err = ($?>>8); if ($err != 0) { - log_error ("$self (err=$err)"); + log_error ("$self (err=$err cmd=$cmd)"); } # die "[error]: $self\n" if ($?>>8); @@ -602,7 +602,7 @@ my $err = ($?>>8); if ($err != 0) { - log_error ("$self (err=$err)"); + log_error ("$self (err=$err cmd=$cmd)"); } # die "[error]: $self\n" if ($?>>8); From florian at hastexo.com Wed Jan 11 20:32:21 2012 From: florian at hastexo.com (Florian Haas) Date: Wed, 11 Jan 2012 21:32:21 +0100 Subject: [Linux-cluster] [PATCH] fence_scsi log actual commands In-Reply-To: <1326310932.4540.11.camel@mentor.gurulabs.com> References: <1326310932.4540.11.camel@mentor.gurulabs.com> Message-ID: On Wed, Jan 11, 2012 at 8:42 PM, Dax Kelson wrote: > There is a new Linux iSCSI target in the Linux kernel 3.1. Unlike tgt, > it supports SPC-3 compliant persistent reservations so that it can be > used with fence_scsi. "Unlike tgt"? I thought tgt does support PR since its 1.0 release. In fact I seem to recall that implementing PR was what prompted Tomo to move to 1.0. Are you saying that tgt targets don't work with fence_iscsi? Cheers, Florian -- Need help with High Availability? 
http://www.hastexo.com/now From dkelson at gurulabs.com Wed Jan 11 20:43:59 2012 From: dkelson at gurulabs.com (Dax Kelson) Date: Wed, 11 Jan 2012 13:43:59 -0700 Subject: [Linux-cluster] [PATCH] fence_scsi log actual commands In-Reply-To: References: <1326310932.4540.11.camel@mentor.gurulabs.com> Message-ID: <1326314639.4540.19.camel@mentor.gurulabs.com> On Wed, 2012-01-11 at 21:32 +0100, Florian Haas wrote: > On Wed, Jan 11, 2012 at 8:42 PM, Dax Kelson wrote: > > There is a new Linux iSCSI target in the Linux kernel 3.1. Unlike tgt, > > it supports SPC-3 compliant persistent reservations so that it can be > > used with fence_scsi. > > "Unlike tgt"? I thought tgt does support PR since its 1.0 release. In > fact I seem to recall that implementing PR was what prompted Tomo to > move to 1.0. Are you saying that tgt targets don't work with > fence_iscsi? > > Cheers, > Florian My understanding is that tgt has support for PR but not the PR_OUT_PREEMPT_AND_ABORT service action necessary for I/O fencing. Maybe this has changed in the last year. Dax Kelson Guru Labs From florian at hastexo.com Wed Jan 11 21:19:01 2012 From: florian at hastexo.com (Florian Haas) Date: Wed, 11 Jan 2012 22:19:01 +0100 Subject: [Linux-cluster] [PATCH] fence_scsi log actual commands In-Reply-To: <1326314639.4540.19.camel@mentor.gurulabs.com> References: <1326310932.4540.11.camel@mentor.gurulabs.com> <1326314639.4540.19.camel@mentor.gurulabs.com> Message-ID: On Wed, Jan 11, 2012 at 9:43 PM, Dax Kelson wrote: > On Wed, 2012-01-11 at 21:32 +0100, Florian Haas wrote: >> On Wed, Jan 11, 2012 at 8:42 PM, Dax Kelson wrote: >> > There is a new Linux iSCSI target in the Linux kernel 3.1. Unlike tgt, >> > it supports SPC-3 compliant persistent reservations so that it can be >> > used with fence_scsi. >> >> "Unlike tgt"? I thought tgt does support PR since its 1.0 release. In >> fact I seem to recall that implementing PR was what prompted Tomo to >> move to 1.0. Are you saying that tgt targets don't work with >> fence_iscsi? >> >> Cheers, >> Florian > > My understanding is that tgt has support for PR but not the > PR_OUT_PREEMPT_AND_ABORT service action necessary for I/O fencing. Ah, that sounds about right (iirc). Cheers, Florian -- Need help with High Availability? http://www.hastexo.com/now From rajendra.roka at pacificmags.com.au Thu Jan 12 03:11:01 2012 From: rajendra.roka at pacificmags.com.au (Roka, Rajendra) Date: Thu, 12 Jan 2012 14:11:01 +1100 Subject: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster In-Reply-To: <508450C6AB960E4299CA64E838597F6202F22B66@nsw-mmp-exch1.snl.7net.com.au> References: <508450C6AB960E4299CA64E838597F6202F22B63@nsw-mmp-exch1.snl.7net.com.au><4F0B7448.9060008@redhat.com> <508450C6AB960E4299CA64E838597F6202F22B66@nsw-mmp-exch1.snl.7net.com.au> Message-ID: <508450C6AB960E4299CA64E838597F6202F22B6A@nsw-mmp-exch1.snl.7net.com.au> Any more suggestions on this? 
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Roka, Rajendra Sent: Tuesday, 10 January 2012 11:49 AM To: linux-cluster at redhat.com Subject: Re: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster I have changed the resources and service in cluster.conf as follows: But no luck with the following message: Jan 10 11:42:57 atp-wwdev1 modcluster: Starting service: mysql on node Jan 10 11:42:57 atp-wwdev1 rgmanager[1690]: Starting stopped service service:mysql Jan 10 11:42:58 atp-wwdev1 rgmanager[5252]: Adding IPv4 address 10.26.240.95/24 to eth0 Jan 10 11:43:01 atp-wwdev1 rgmanager[5401]: Starting Service mysql:mysql Jan 10 11:44:01 atp-wwdev1 rgmanager[5657]: Starting Service mysql:mysql > Failed - Timeout Error Jan 10 11:44:01 atp-wwdev1 rgmanager[1690]: start on mysql "mysql" returned 1 (generic error) Jan 10 11:44:02 atp-wwdev1 rgmanager[1690]: #68: Failed to start service:mysql; return value: 1 Jan 10 11:44:02 atp-wwdev1 rgmanager[1690]: Stopping service service:mysql Jan 10 11:44:02 atp-wwdev1 rgmanager[5742]: Stopping Service mysql:mysql Jan 10 11:44:02 atp-wwdev1 rgmanager[5764]: Checking Existence Of File /var/run/cluster/mysql/mysql:mysql.pid [mysql:mysql] > Failed - File Doesn't Exist Jan 10 11:44:02 atp-wwdev1 rgmanager[5786]: Stopping Service mysql:mysql > Succeed Jan 10 11:44:02 atp-wwdev1 rgmanager[5837]: Removing IPv4 address 10.26.240.95/24 from eth0 Jan 10 11:44:04 atp-wwdev1 rgmanager[5924]: unmounting /var/lib/mysql Jan 10 11:44:04 atp-wwdev1 rgmanager[1690]: Service service:mysql is recovering Jan 10 11:44:04 atp-wwdev1 rgmanager[1690]: #71: Relocating failed service service:mysql Jan 10 11:45:14 atp-wwdev1 rgmanager[1690]: Service service:mysql is stopped Also changed the my.conf to: pid-file=/var/run/cluster/mysql/mysql.pid Cheers From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ryan Mitchell Sent: Tuesday, 10 January 2012 10:12 AM To: linux-cluster at redhat.com Subject: Re: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster On 01/10/2012 07:57 AM, Roka, Rajendra wrote: I am having issue with mysql service in RHEL6.2 cluster. While starting service I receive the following error in /var/log/message Jan 10 08:47:52 atp-wwdev1 rgmanager[6015]: Starting Service mysql:mysql Jan 10 08:47:54 atp-wwdev1 rgmanager[6096]: Starting Service mysql:mysql > Failed - Timeout Error Jan 10 08:47:54 atp-wwdev1 rgmanager[1842]: start on mysql "mysql" returned 1 (generic error) I'm pretty sure the first problem is that mysql doesn't start before the script times out. All subsequent errors are trying to clean up from the failed start and can be ignored. There won't be a pid file if the service did not start or if it was cleanly shut down outside of rgmanager. Try increasing the startup_wait (something large until you find its successful, like 60). Its currently waiting 2 seconds. Also, I don't think you want to have the VIP and the service that uses it (mysql) in different services. They should be in the same service, because they always have to run on the same node (they aren't independent). Same goes for the filesystem resource if that is required by MYSQL. Perhaps something like the following?: Lastly, you have created a fence device but you haven't assigned it to the nodes so they currently have no fencing devices. Make sure you do that and test fencing before doing anything important with this cluster. 
Regards, Ryan Mitchell Software Maintenance Engineer Support Engineering Group Red Hat, Inc. Important Notice: This message and its attachments are confidential and may contain information which is protected by copyright. It is intended solely for the named addressee. If you are not the authorised recipient (or responsible for delivery of the message to the authorised recipient), you must not use, disclose, print, copy or deliver this message or its attachments to anyone. If you receive this email in error, please contact the sender immediately and permanently delete this message and its attachments from your system. Any content of this message and its attachments that does not relate to the official business of Pacific Magazines Pty Limited must be taken not to have been sent or endorsed by it. No representation is made that this email or its attachments are without defect or that the contents express views other than those of the sender. Please consider the environment - do you really need to print this email? Important Notice: This message and its attachments are confidential and may contain information which is protected by copyright. It is intended solely for the named addressee. If you are not the authorised recipient (or responsible for delivery of the message to the authorised recipient), you must not use, disclose, print, copy or deliver this message or its attachments to anyone. If you receive this email in error, please contact the sender immediately and permanently delete this message and its attachments from your system. Any content of this message and its attachments that does not relate to the official business of Pacific Magazines Pty Limited must be taken not to have been sent or endorsed by it. No representation is made that this email or its attachments are without defect or that the contents express views other than those of the sender. Please consider the environment - do you really need to print this email? -------------- next part -------------- An HTML attachment was scrubbed... URL: From rmitchel at redhat.com Thu Jan 12 04:00:39 2012 From: rmitchel at redhat.com (Ryan Mitchell) Date: Thu, 12 Jan 2012 14:00:39 +1000 Subject: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster In-Reply-To: <508450C6AB960E4299CA64E838597F6202F22B6A@nsw-mmp-exch1.snl.7net.com.au> References: <508450C6AB960E4299CA64E838597F6202F22B63@nsw-mmp-exch1.snl.7net.com.au><4F0B7448.9060008@redhat.com> <508450C6AB960E4299CA64E838597F6202F22B66@nsw-mmp-exch1.snl.7net.com.au> <508450C6AB960E4299CA64E838597F6202F22B6A@nsw-mmp-exch1.snl.7net.com.au> Message-ID: <4F0E5AE7.1080607@redhat.com> On 01/12/2012 01:11 PM, Roka, Rajendra wrote: > > Any more suggestions on this? 
> According to the new log, it still timed out after 60 seconds, so either that wasn't long enough either, or there is a misconfiguration and the database can't start because of it: > ** > > Jan 10 11:42:57 atp-wwdev1 modcluster: Starting service: mysql on node > > Jan 10 11:42:57 atp-wwdev1 rgmanager[1690]: Starting stopped service > service:mysql > > Jan 10 11:42:58 atp-wwdev1 rgmanager[5252]: Adding IPv4 address > 10.26.240.95/24 to eth0 > > Jan 10 11:43:01 atp-wwdev1 rgmanager[5401]: Starting Service mysql:mysql > > Jan 10 11:44:01 atp-wwdev1 rgmanager[5657]: Starting Service > mysql:mysql > Failed - Timeout Error > > Jan 10 11:44:01 atp-wwdev1 rgmanager[1690]: start on mysql "mysql" > returned 1 (generic error) > > Jan 10 11:44:02 atp-wwdev1 rgmanager[1690]: #68: Failed to start > service:mysql; return value: 1 > What does it say in your mysql log? The resource script runs the command to start the database and then waits for it to return success. It waited 60 seconds, and hadn't received any notice that the database started or not, so it gave up. Look in the logs to see if there is any indication as to why the database won't start. It could be because you have the wrong configuration in /etc/my.cnf, no permissions on some critical directories, or the resource script is misconfigured. Also, you should investigate whether you can manually start the database (after mounting the NFS mount and adding the VIP of course) outside of cluster (and compare working and failing mysql logs). Regards, Ryan Mitchell Software Maintenance Engineer Support Engineering Group Red Hat, Inc. -------------- next part -------------- An HTML attachment was scrubbed... URL: From tc3driver at gmail.com Thu Jan 12 04:20:01 2012 From: tc3driver at gmail.com (Bill G.) Date: Wed, 11 Jan 2012 20:20:01 -0800 Subject: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster In-Reply-To: <4F0E5AE7.1080607@redhat.com> References: <508450C6AB960E4299CA64E838597F6202F22B63@nsw-mmp-exch1.snl.7net.com.au> <4F0B7448.9060008@redhat.com> <508450C6AB960E4299CA64E838597F6202F22B66@nsw-mmp-exch1.snl.7net.com.au> <508450C6AB960E4299CA64E838597F6202F22B6A@nsw-mmp-exch1.snl.7net.com.au> <4F0E5AE7.1080607@redhat.com> Message-ID: Really dumb question... do you have mysql installed? What happens when you try to start mysql stand alone? Is mysql already running? Is there anything in /var/log/messages? anything in the mysql logs? On Wed, Jan 11, 2012 at 8:00 PM, Ryan Mitchell wrote: > ** > On 01/12/2012 01:11 PM, Roka, Rajendra wrote: > > Any more suggestions on this?**** > > According to the new log, it still timed out after 60 seconds, so either > that wasn't long enough either, or there is a misconfiguration and the > database can't start because of it: > > ** > > Jan 10 11:42:57 atp-wwdev1 modcluster: Starting service: mysql on node *** > * > > Jan 10 11:42:57 atp-wwdev1 rgmanager[1690]: Starting stopped service > service:mysql**** > > Jan 10 11:42:58 atp-wwdev1 rgmanager[5252]: Adding IPv4 address > 10.26.240.95/24 to eth0**** > > Jan 10 11:43:01 atp-wwdev1 rgmanager[5401]: Starting Service mysql:mysql** > ** > > Jan 10 11:44:01 atp-wwdev1 rgmanager[5657]: Starting Service mysql:mysql > > Failed - Timeout Error**** > > Jan 10 11:44:01 atp-wwdev1 rgmanager[1690]: start on mysql "mysql" > returned 1 (generic error)**** > > Jan 10 11:44:02 atp-wwdev1 rgmanager[1690]: #68: Failed to start > service:mysql; return value: 1**** > > > What does it say in your mysql log? 
The resource script runs the command > to start the database and then waits for it to return success. It waited > 60 seconds, and hadn't received any notice that the database started or > not, so it gave up. > > Look in the logs to see if there is any indication as to why the database > won't start. It could be because you have the wrong configuration in > /etc/my.cnf, no permissions on some critical directories, or the resource > script is misconfigured. Also, you should investigate whether you can > manually start the database (after mounting the NFS mount and adding the > VIP of course) outside of cluster (and compare working and failing mysql > logs). > > > Regards, > > Ryan Mitchell > Software Maintenance Engineer > Support Engineering Group > Red Hat, Inc. > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Thanks, Bill G. tc3driver at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From rajendra.roka at pacificmags.com.au Thu Jan 12 04:39:43 2012 From: rajendra.roka at pacificmags.com.au (Roka, Rajendra) Date: Thu, 12 Jan 2012 15:39:43 +1100 Subject: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster In-Reply-To: <4F0E5AE7.1080607@redhat.com> References: <508450C6AB960E4299CA64E838597F6202F22B6A@nsw-mmp-exch1.snl.7net.com.au> <4F0E5AE7.1080607@redhat.com> Message-ID: <508450C6AB960E4299CA64E838597F6202F22B6B@nsw-mmp-exch1.snl.7net.com.au> Yes it starts if I do manually: [root at atp-wwdev1 ~]# mount -t nfs 10.26.240.190:/nfs/mysql /var/lib/mysql/ [root at atp-wwdev1 ~]# /etc/init.d/mysqld start Starting mysqld: [ OK ] [root at atp-wwdev1 ~]# cat /var/log/mysqld.log 120112 15:28:57 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql 120112 15:28:58 InnoDB: Started; log sequence number 0 44233 120112 15:28:58 [Note] Event Scheduler: Loaded 0 events 120112 15:28:58 [Note] /usr/libexec/mysqld: ready for connections. Version: '5.1.52' socket: '/var/lib/mysql/mysql.sock' port: 3306 Source distribution [root at atp-wwdev1 ~]# /etc/init.d/mysqld stop Stopping mysqld: [ OK ] root at atp-wwdev1 ~]# cat /var/log/mysqld.log 120112 15:29:39 [Note] /usr/libexec/mysqld: Normal shutdown 120112 15:29:39 [Note] Event Scheduler: Purging the queue. 0 events 120112 15:29:39 InnoDB: Starting shutdown... 120112 15:29:43 InnoDB: Shutdown completed; log sequence number 0 44233 120112 15:29:43 [Note] /usr/libexec/mysqld: Shutdown complete But if I start with cluster, it doesnot give any error message in /var/log/mysqld.log Once again my cluster.conf is follows: And my.cnf is follows: [mysqld] datadir=/var/lib/mysql socket=/var/lib/mysql/mysql.sock user=mysql # Disabling symbolic-links is recommended to prevent assorted security risks symbolic-links=0 [mysqld_safe] log-error=/var/log/mysqld.log pid-file=/var/run/cluster/mysql/mysql.pid If you need any more info, please let me know. Thanks From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ryan Mitchell Sent: Thursday, 12 January 2012 3:01 PM To: linux-cluster at redhat.com Subject: Re: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster On 01/12/2012 01:11 PM, Roka, Rajendra wrote: Any more suggestions on this? 
According to the new log, it still timed out after 60 seconds, so either that wasn't long enough either, or there is a misconfiguration and the database can't start because of it: Jan 10 11:42:57 atp-wwdev1 modcluster: Starting service: mysql on node Jan 10 11:42:57 atp-wwdev1 rgmanager[1690]: Starting stopped service service:mysql Jan 10 11:42:58 atp-wwdev1 rgmanager[5252]: Adding IPv4 address 10.26.240.95/24 to eth0 Jan 10 11:43:01 atp-wwdev1 rgmanager[5401]: Starting Service mysql:mysql Jan 10 11:44:01 atp-wwdev1 rgmanager[5657]: Starting Service mysql:mysql > Failed - Timeout Error Jan 10 11:44:01 atp-wwdev1 rgmanager[1690]: start on mysql "mysql" returned 1 (generic error) Jan 10 11:44:02 atp-wwdev1 rgmanager[1690]: #68: Failed to start service:mysql; return value: 1 What does it say in your mysql log? The resource script runs the command to start the database and then waits for it to return success. It waited 60 seconds, and hadn't received any notice that the database started or not, so it gave up. Look in the logs to see if there is any indication as to why the database won't start. It could be because you have the wrong configuration in /etc/my.cnf, no permissions on some critical directories, or the resource script is misconfigured. Also, you should investigate whether you can manually start the database (after mounting the NFS mount and adding the VIP of course) outside of cluster (and compare working and failing mysql logs). Regards, Ryan Mitchell Software Maintenance Engineer Support Engineering Group Red Hat, Inc. Important Notice: This message and its attachments are confidential and may contain information which is protected by copyright. It is intended solely for the named addressee. If you are not the authorised recipient (or responsible for delivery of the message to the authorised recipient), you must not use, disclose, print, copy or deliver this message or its attachments to anyone. If you receive this email in error, please contact the sender immediately and permanently delete this message and its attachments from your system. Any content of this message and its attachments that does not relate to the official business of Pacific Magazines Pty Limited must be taken not to have been sent or endorsed by it. No representation is made that this email or its attachments are without defect or that the contents express views other than those of the sender. Please consider the environment - do you really need to print this email? -------------- next part -------------- An HTML attachment was scrubbed... URL: From tc3driver at gmail.com Thu Jan 12 05:17:49 2012 From: tc3driver at gmail.com (Bill G.) Date: Wed, 11 Jan 2012 21:17:49 -0800 Subject: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster In-Reply-To: <508450C6AB960E4299CA64E838597F6202F22B6B@nsw-mmp-exch1.snl.7net.com.au> References: <508450C6AB960E4299CA64E838597F6202F22B6A@nsw-mmp-exch1.snl.7net.com.au> <4F0E5AE7.1080607@redhat.com> <508450C6AB960E4299CA64E838597F6202F22B6B@nsw-mmp-exch1.snl.7net.com.au> Message-ID: Ok more dumb things... In the past I have had problems bringing up VIPs that have the subnet mask bits in the address try changing this line: to this Also remove it from the ip ref= tag as well... Then try starting the service. also it may be easier to enable debug logging to help figure out what is going on with the service... but I am betting the change to the ip will probably work. 
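On the debug-logging suggestion, one way to see exactly where the start hangs is to drive the service through rg_test, which runs each resource agent in the foreground and prints every step. A rough sketch, assuming the service is still named "mysql" and that rgmanager is not simultaneously trying to manage it on that node:

# show how rgmanager assembles the resource tree from cluster.conf
rg_test test /etc/cluster/cluster.conf

# start, then stop, the whole service in the foreground, step by step
rg_test test /etc/cluster/cluster.conf start service mysql
rg_test test /etc/cluster/cluster.conf stop service mysql

# rgmanager's own log is also worth tailing during a clustered start attempt
tail -f /var/log/cluster/rgmanager.log

The foreground output should make it clear whether the agent is stuck waiting on the pid file, on the listen address, or on something else entirely.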
HTH, Bill On Wed, Jan 11, 2012 at 8:39 PM, Roka, Rajendra < rajendra.roka at pacificmags.com.au> wrote: > *Yes it starts if I do manually:* > > ** ** > > [root at atp-wwdev1 ~]# mount -t nfs 10.26.240.190:/nfs/mysql /var/lib/mysql/ > **** > > [root at atp-wwdev1 ~]# /etc/init.d/mysqld start**** > > Starting mysqld: [ OK ]**** > > ** ** > > [root at atp-wwdev1 ~]# cat /var/log/mysqld.log**** > > 120112 15:28:57 mysqld_safe Starting mysqld daemon with databases from > /var/lib/mysql**** > > 120112 15:28:58 InnoDB: Started; log sequence number 0 44233**** > > 120112 15:28:58 [Note] Event Scheduler: Loaded 0 events**** > > 120112 15:28:58 [Note] /usr/libexec/mysqld: ready for connections.**** > > Version: '5.1.52' socket: '/var/lib/mysql/mysql.sock' port: 3306 Source > distribution**** > > ** ** > > [root at atp-wwdev1 ~]# /etc/init.d/mysqld stop**** > > Stopping mysqld: [ OK ]**** > > root at atp-wwdev1 ~]# cat /var/log/mysqld.log**** > > 120112 15:29:39 [Note] /usr/libexec/mysqld: Normal shutdown**** > > 120112 15:29:39 [Note] Event Scheduler: Purging the queue. 0 events**** > > 120112 15:29:39 InnoDB: Starting shutdown...**** > > 120112 15:29:43 InnoDB: Shutdown completed; log sequence number 0 44233** > ** > > 120112 15:29:43 [Note] /usr/libexec/mysqld: Shutdown complete**** > > ** ** > > *But if I start with cluster, it doesnot give any error message in > /var/log/mysqld.log* > > * * > > *Once again my cluster.conf is follows:* > > **** > > **** > > **** > > *** > * > > **** > > **** > > **** > > votes="1">**** > > **** > > **** > > **** > > **** > > **** > > **** > > **** > > **** > > **** > > ordered="1" restricted="0">**** > > **** > > **** > > **** > > **** > > **** > > sleeptime="2"/>**** > > listen_address="10.26.24.95" name="mysql" shutdown_wait="60" > startup_wait="60"/>**** > > fstype="nfs" host="10.26.240.190" mountpoint="/var/lib/mysql" > name="storage" no_unmount="on"/>**** > > **** > > name="mysql" recovery="relocate">**** > > **** > > **** > > **** > > **** > > **** > > post_join_delay="3"/>**** > > **** > > **** > > **** > > **** > > **** > > **** > > ** ** > > *And my.cnf is follows:* > > [mysqld]**** > > datadir=/var/lib/mysql**** > > socket=/var/lib/mysql/mysql.sock**** > > user=mysql**** > > # Disabling symbolic-links is recommended to prevent assorted security > risks**** > > symbolic-links=0**** > > ** ** > > [mysqld_safe]**** > > log-error=/var/log/mysqld.log**** > > pid-file=/var/run/cluster/mysql/mysql.pid**** > > ** ** > > If you need any more info, please let me know.**** > > ** ** > > Thanks**** > > ** ** > > ** ** > > ** ** > > ** ** > > *From:* linux-cluster-bounces at redhat.com [mailto: > linux-cluster-bounces at redhat.com] *On Behalf Of *Ryan Mitchell > *Sent:* Thursday, 12 January 2012 3:01 PM > > *To:* linux-cluster at redhat.com > *Subject:* Re: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster > **** > > ** ** > > On 01/12/2012 01:11 PM, Roka, Rajendra wrote: **** > > Any more suggestions on this?**** > > According to the new log, it still timed out after 60 seconds, so either > that wasn't long enough either, or there is a misconfiguration and the > database can't start because of it: > > **** > > Jan 10 11:42:57 atp-wwdev1 modcluster: Starting service: mysql on node *** > * > > Jan 10 11:42:57 atp-wwdev1 rgmanager[1690]: Starting stopped service > service:mysql**** > > Jan 10 11:42:58 atp-wwdev1 rgmanager[5252]: Adding IPv4 address > 10.26.240.95/24 to eth0**** > > Jan 10 11:43:01 atp-wwdev1 rgmanager[5401]: Starting Service mysql:mysql** > 
** > > Jan 10 11:44:01 atp-wwdev1 rgmanager[5657]: Starting Service mysql:mysql > > Failed - Timeout Error**** > > Jan 10 11:44:01 atp-wwdev1 rgmanager[1690]: start on mysql "mysql" > returned 1 (generic error)**** > > Jan 10 11:44:02 atp-wwdev1 rgmanager[1690]: #68: Failed to start > service:mysql; return value: 1**** > > > What does it say in your mysql log? The resource script runs the command > to start the database and then waits for it to return success. It waited > 60 seconds, and hadn't received any notice that the database started or > not, so it gave up. > > Look in the logs to see if there is any indication as to why the database > won't start. It could be because you have the wrong configuration in > /etc/my.cnf, no permissions on some critical directories, or the resource > script is misconfigured. Also, you should investigate whether you can > manually start the database (after mounting the NFS mount and adding the > VIP of course) outside of cluster (and compare working and failing mysql > logs). > > Regards, > > Ryan Mitchell > Software Maintenance Engineer > Support Engineering Group > Red Hat, Inc.**** > > Important Notice: > This message and its attachments are confidential and may contain information which is protected by copyright. It is intended solely for the named addressee. If you are not the authorised recipient (or responsible for delivery of the message to the authorised recipient), you must not use, disclose, print, copy or deliver this message or its attachments to anyone. If you receive this email in error, please contact the sender immediately and permanently delete this message and its attachments from your system. > Any content of this message and its attachments that does not relate to the official business of Pacific Magazines Pty Limited must be taken not to have been sent or endorsed by it. No representation is made that this email or its attachments are without defect or that the contents express views other than those of the sender. > > Please consider the environment - do you really need to print this email? > > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Thanks, Bill G. tc3driver at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From scooter at cgl.ucsf.edu Thu Jan 12 22:50:43 2012 From: scooter at cgl.ucsf.edu (Scooter Morris) Date: Thu, 12 Jan 2012 14:50:43 -0800 Subject: [Linux-cluster] GFS2 mounts taking a *very* long time Message-ID: <4F0F63C3.1010309@cgl.ucsf.edu> Greetings all, We've got a 4 node cluster running RHEL 6.2. As part of the cluster, we've got several gfs2 filesystem. We've often noticed that when we reboot a single node in the cluster, the gfs2 mounts take a long time -- eventually getting the 120 second delay messages. When we migrated to 6.2, the default mount script echoed the filesystem being mounted, and we discovered that the long delays were filesystem-dependent. In particular, two filesystems were causing all of the problems, both of which had >1M files in them. We also noticed that dlm_recoverd on one of the other nodes accumulates a lot of time when this is happening. Is this expected? Are there non-ilnear handshaking algorithms between the mounting node and the cluster that are dependent on the number of files? Thanks in advance! -- scooter From pbruna at it-linux.cl Thu Jan 12 23:40:13 2012 From: pbruna at it-linux.cl (Patricio A. 
Bruna) Date: Thu, 12 Jan 2012 20:40:13 -0300 (CLST) Subject: [Linux-cluster] GFS2 mounts taking a *very* long time In-Reply-To: <4F0F63C3.1010309@cgl.ucsf.edu> Message-ID: <593e1c15-bfa2-4f0b-9053-8b34248d67db@lisa.itlinux.cl> Hi scooter, Logs would be welcome ------------------------------------ Patricio Bruna V. IT Linux Ltda. www.it-linux.cl Twitter Fono : (+56-2) 333 0578 M?vil: (+56-9) 8899 6618 ----- Mensaje original ----- > Greetings all, > We've got a 4 node cluster running RHEL 6.2. As part of the > cluster, we've got several gfs2 filesystem. We've often noticed that > when we reboot a single node in the cluster, the gfs2 mounts take a > long > time -- eventually getting the 120 second delay messages. When we > migrated to 6.2, the default mount script echoed the filesystem being > mounted, and we discovered that the long delays were > filesystem-dependent. In particular, two filesystems were causing all > of the problems, both of which had >1M files in them. We also noticed > that dlm_recoverd on one of the other nodes accumulates a lot of time > when this is happening. Is this expected? Are there non-ilnear > handshaking algorithms between the mounting node and the cluster that > are dependent on the number of files? > Thanks in advance! > -- scooter > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: zimbra_gold_partner.png Type: image/png Size: 2893 bytes Desc: not available URL: From scooter at cgl.ucsf.edu Fri Jan 13 00:26:05 2012 From: scooter at cgl.ucsf.edu (Scooter Morris) Date: Thu, 12 Jan 2012 16:26:05 -0800 Subject: [Linux-cluster] GFS2 mounts taking a *very* long time In-Reply-To: <593e1c15-bfa2-4f0b-9053-8b34248d67db@lisa.itlinux.cl> References: <593e1c15-bfa2-4f0b-9053-8b34248d67db@lisa.itlinux.cl> Message-ID: <4F0F7A1D.8070601@cgl.ucsf.edu> Hi Patricio, Sure thing -- which logs would help? I don't think the kernel logs would be of much use, and when the dlm_recoverd process is going it doesn't log anything, so it's not clear what would be useful, here. -- scooter On 01/12/2012 03:40 PM, Patricio A. Bruna wrote: > Hi scooter, > Logs would be welcome > > ------------------------------------ > Patricio Bruna V. > IT Linux Ltda. > www.it-linux.cl > Twitter > Fono : (+56-2) 333 0578 > M?vil: (+56-9) 8899 6618 > > > > ------------------------------------------------------------------------ > > Greetings all, > We've got a 4 node cluster running RHEL 6.2. As part of the > cluster, we've got several gfs2 filesystem. We've often noticed that > when we reboot a single node in the cluster, the gfs2 mounts take > a long > time -- eventually getting the 120 second delay messages. When we > migrated to 6.2, the default mount script echoed the filesystem being > mounted, and we discovered that the long delays were > filesystem-dependent. In particular, two filesystems were causing > all > of the problems, both of which had >1M files in them. We also > noticed > that dlm_recoverd on one of the other nodes accumulates a lot of time > when this is happening. Is this expected? Are there non-ilnear > handshaking algorithms between the mounting node and the cluster that > are dependent on the number of files? > > Thanks in advance! 
> > -- scooter > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 2893 bytes Desc: not available URL: From zheka at uvt.cz Fri Jan 13 00:52:04 2012 From: zheka at uvt.cz (Yevheniy Demchenko) Date: Fri, 13 Jan 2012 02:52:04 +0200 Subject: [Linux-cluster] GFS2 mounts taking a *very* long time In-Reply-To: <4F0F63C3.1010309@cgl.ucsf.edu> References: <4F0F63C3.1010309@cgl.ucsf.edu> Message-ID: <49D5F414-AFB6-49CF-A02B-B80BDFDB6F89@uvt.cz> Hi! This patched version of dlm will probably resolve your issue, please try it. http://www.bosson.eu/temp/dlm-kmod-1.0-1.el6.src.rpm See detailed description in the list earlier ( Subject: [Linux-cluster] [PATCH] dlm: faster dlm recovery ) And yes, mounts and umounts with unpatched dlm are proportional to N*N, where N is a number of files. Sincerely, Yevheniy Demchenko On Jan 13, 2012, at 00:50 , Scooter Morris wrote: > Greetings all, > We've got a 4 node cluster running RHEL 6.2. As part of the cluster, we've got several gfs2 filesystem. We've often noticed that when we reboot a single node in the cluster, the gfs2 mounts take a long time -- eventually getting the 120 second delay messages. When we migrated to 6.2, the default mount script echoed the filesystem being mounted, and we discovered that the long delays were filesystem-dependent. In particular, two filesystems were causing all of the problems, both of which had >1M files in them. We also noticed that dlm_recoverd on one of the other nodes accumulates a lot of time when this is happening. Is this expected? Are there non-ilnear handshaking algorithms between the mounting node and the cluster that are dependent on the number of files? > > Thanks in advance! > > -- scooter > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From wmodes at ucsc.edu Fri Jan 13 20:30:13 2012 From: wmodes at ucsc.edu (Wes Modes) Date: Fri, 13 Jan 2012 12:30:13 -0800 Subject: [Linux-cluster] Clustered filesystem questions for shared storage on CentOS/vmWare/SAN Message-ID: <4F109455.5090607@ucsc.edu> I have some general clustered filesystem questions for you. I'm wading through the confusing and often contradictory web sources RE clustering. I struggled through the initial setup of the GFS software, and am now working to create a shared GFS disk. But all of this brings up some general questions: 1) First several online sources have pointed me to the Microsoft Clustered Filesystem doc to set up my linux clustered FSs on vmWare. Though it deals with MSCS, I can see that it has some applicability. However, I have yet to find a step-by-step guide to linux clustered filesystems. Is there a better suited document to guide me thorough the process of creating shared filesystems on CentOS/RHEL on vmWare across boxes? 2) Is it necessary to create a private network for access to the shared filesystem as the MSCS doc suggests? 3) So far I've been looking at GFS because it is native to CentOS/RHEL. Is there a better non-commercial/free choice? 
4) Is there a clustered filesystem method that supports vmWare HA? This is important to us. 5) Seems there at least three different methods to set up GFS (using parted, using lvmconf, and using iSCSI). If I go with GFS, which method should I use? Clustering seems to have a steep learning curve, but I'm laboriously climbing the slope! Thanks for your help. Wes Modes UCSC Library ITS Programmer/Analyst From linux at alteeve.com Fri Jan 13 20:44:28 2012 From: linux at alteeve.com (Digimer) Date: Fri, 13 Jan 2012 15:44:28 -0500 Subject: [Linux-cluster] Clustered filesystem questions for shared storage on CentOS/vmWare/SAN In-Reply-To: <4F109455.5090607@ucsc.edu> References: <4F109455.5090607@ucsc.edu> Message-ID: <4F1097AC.6030409@alteeve.com> On 01/13/2012 03:30 PM, Wes Modes wrote: > I have some general clustered filesystem questions for you. I'm wading > through the confusing and often contradictory web sources RE > clustering. I struggled through the initial setup of the GFS software, > and am now working to create a shared GFS disk. But all of this brings > up some general questions: > > 1) First several online sources have pointed me to the Microsoft > Clustered Filesystem doc to set up my linux clustered FSs on vmWare. > Though it deals with MSCS, I can see that it has some applicability. > However, I have yet to find a step-by-step guide to linux clustered > filesystems. Is there a better suited document to guide me thorough the > process of creating shared filesystems on CentOS/RHEL on vmWare across > boxes? > > 2) Is it necessary to create a private network for access to the shared > filesystem as the MSCS doc suggests? > > 3) So far I've been looking at GFS because it is native to > CentOS/RHEL. Is there a better non-commercial/free choice? > > 4) Is there a clustered filesystem method that supports vmWare HA? > This is important to us. > > 5) Seems there at least three different methods to set up GFS (using > parted, using lvmconf, and using iSCSI). If I go with GFS, which method > should I use? > > Clustering seems to have a steep learning curve, but I'm laboriously > climbing the slope! Thanks for your help. > > Wes Modes > UCSC Library ITS > Programmer/Analyst Hi Wes, I can't speak to windows or VMWare as I have near-null experience with both. So allow me to speak in general terms; 1. GFS2 is my preferred clustered file system, but it requires distributed locking as provided by DLM, which is part of the Red Hat Cluster Suite. 2. A private storage network is not required, but it is usually a good idea simply because of how much traffic storage uses and how easy it is to saturate a link and cause problems for other network stuff. 3. No. OCFS2 is the only other clustered file system I am aware of, and it's under the control of Oracle. I shall say no more. 4. I'm not familiar with what requirements VMWare HA has. Can you elaborate? In short though, all nodes take common storage and mount them as local partitions/filesystems. Once done, your GFS2 partition is, effectively, just another file system. 5. The storage layer and the file system should be independent of one another. So long as the back-end storage presents said storage as raw space to the nodes, GFS2 and the cluster shouldn't care. As for managing that storage... That is effectively up to you. Personally, I like to use clustered LVM on the raw storage, then create my GFS2 file system on an LV. Of course, you can put the file system directly on the raw storage and forego cLVM. 
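For illustration only, a bare-bones version of that cLVM-then-GFS2 layering might look like the sketch below. The device, VG/LV and cluster names are placeholders, and cman (membership, DLM and fencing) has to be up before clvmd:

lvmconf --enable-cluster                  # switch lvm.conf to cluster-wide locking
service cman start                        # membership, DLM and fencing first
service clvmd start                       # then the clustered LVM daemon
pvcreate /dev/sdX                         # the raw shared LUN (placeholder device)
vgcreate shared_vg /dev/sdX
vgcreate: wait, use lvcreate next           # (comment removed)
lvcreate -n shared_lv -L 500G shared_vg
mkfs.gfs2 -p lock_dlm -t mycluster:shared_fs -j 2 /dev/shared_vg/shared_lv   # one journal per node
mount -t gfs2 /dev/shared_vg/shared_lv /mnt/shared                           # repeat on every node

The value passed to -t has to match the cluster name in cluster.conf; a mismatch there is one of the more common reasons the later mount step fails.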
I'm not sure how much this will help, given you want to use VMWare, but I've got a tutorial that, among other steps, walks you through setting up the base cluster , fencing (which is *required* for *any* shared storage) and configuring and using the clustered LVM and GFS2 tools; https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "omg my singularity battery is dead again. stupid hawking radiation." - epitron From linux at alteeve.com Fri Jan 13 20:45:59 2012 From: linux at alteeve.com (Digimer) Date: Fri, 13 Jan 2012 15:45:59 -0500 Subject: [Linux-cluster] Clustered filesystem questions for shared storage on CentOS/vmWare/SAN In-Reply-To: <4F109455.5090607@ucsc.edu> References: <4F109455.5090607@ucsc.edu> Message-ID: <4F109807.6070507@alteeve.com> I forgot to mention; Friendly cluster folks can be found on freenode at #linux-cluster :) -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "omg my singularity battery is dead again. stupid hawking radiation." - epitron From dkelson at gurulabs.com Fri Jan 13 20:50:13 2012 From: dkelson at gurulabs.com (Dax Kelson) Date: Fri, 13 Jan 2012 13:50:13 -0700 Subject: [Linux-cluster] Clustered filesystem questions for shared storage on CentOS/vmWare/SAN In-Reply-To: <4F109455.5090607@ucsc.edu> References: <4F109455.5090607@ucsc.edu> Message-ID: <1326487813.3314.8.camel@mentor.gurulabs.com> A few comments below. On Fri, 2012-01-13 at 12:30 -0800, Wes Modes wrote: > I have some general clustered filesystem questions for you. I'm wading > > 1) First several online sources have pointed me to the Microsoft > Clustered Filesystem doc to set up my linux clustered FSs on vmWare. > Though it deals with MSCS, I can see that it has some applicability. > However, I have yet to find a step-by-step guide to linux clustered > filesystems. Is there a better suited document to guide me thorough the > process of creating shared filesystems on CentOS/RHEL on vmWare across > boxes? http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Global_File_System_2/index.html http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Cluster_Administration/index.html > 2) Is it necessary to create a private network for access to the shared > filesystem as the MSCS doc suggests? Required? No. > 3) So far I've been looking at GFS because it is native to > CentOS/RHEL. Is there a better non-commercial/free choice? Not really. Probably the next most popular is OCFS2. > 4) Is there a clustered filesystem method that supports vmWare HA? > This is important to us. Not sure what you mean. Do you mean fencing? > 5) Seems there at least three different methods to set up GFS (using > parted, using lvmconf, and using iSCSI). If I go with GFS, which method > should I use? GFS2 requires shared storage such as SAN, iSCSI or DRBD. Pick one. >From the RH docs, "While a GFS2 file system may be used outside of LVM, Red Hat supports only GFS2 file systems that are created on a CLVM logical volume." On RHEL6 and clones, clvmd requires cman. GFS2 requires fencing for safety and reliability suitable for production. Dax Kelson Guru Labs From pbruna at it-linux.cl Fri Jan 13 22:26:56 2012 From: pbruna at it-linux.cl (Patricio A. 
Bruna) Date: Fri, 13 Jan 2012 19:26:56 -0300 (CLST) Subject: [Linux-cluster] Clustered filesystem questions for shared storage on CentOS/vmWare/SAN In-Reply-To: <4F109455.5090607@ucsc.edu> Message-ID: <2578fc6c-f9cf-489f-ba18-13a8ee665bad@lisa.itlinux.cl> Hi, I used to use GFS for a while, but it has several requirements and some make it very inflexible. These days i'm all for GlusterFS (www.glusterfs.org) Gluster is a distributed filesystem, recently adquired by Red Hat. Gluster provides the main benefit of GFS, Cluster Filesystem, but without so many constrains. ------------------------------------ Patricio Bruna V. IT Linux Ltda. www.it-linux.cl Twitter Fono : (+56-2) 333 0578 M?vil: (+56-9) 8899 6618 ----- Mensaje original ----- > I have some general clustered filesystem questions for you. I'm > wading > through the confusing and often contradictory web sources RE > clustering. I struggled through the initial setup of the GFS > software, > and am now working to create a shared GFS disk. But all of this > brings > up some general questions: > 1) First several online sources have pointed me to the Microsoft > Clustered Filesystem doc to set up my linux clustered FSs on vmWare. > Though it deals with MSCS, I can see that it has some applicability. > However, I have yet to find a step-by-step guide to linux clustered > filesystems. Is there a better suited document to guide me thorough > the > process of creating shared filesystems on CentOS/RHEL on vmWare > across > boxes? > 2) Is it necessary to create a private network for access to the > shared > filesystem as the MSCS doc suggests? > 3) So far I've been looking at GFS because it is native to > CentOS/RHEL. Is there a better non-commercial/free choice? > 4) Is there a clustered filesystem method that supports vmWare HA? > This is important to us. > 5) Seems there at least three different methods to set up GFS (using > parted, using lvmconf, and using iSCSI). If I go with GFS, which > method > should I use? > Clustering seems to have a steep learning curve, but I'm laboriously > climbing the slope! Thanks for your help. > Wes Modes > UCSC Library ITS > Programmer/Analyst > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: zimbra_gold_partner.png Type: image/png Size: 2893 bytes Desc: not available URL: From td3201 at gmail.com Sat Jan 14 16:27:35 2012 From: td3201 at gmail.com (Terry) Date: Sat, 14 Jan 2012 10:27:35 -0600 Subject: [Linux-cluster] LVM not available on 2/6 clustered volumes on reboot Message-ID: All of my nodes have experienced this issue but I can't determine root cause. After reboot, 2/6 of my volumes are set to NOT available. I either have to do a vgscan or vgchange -ay on the volume group to then set the LV to available. Doing an lvchange -ay before doing the vgchange or vgscan results in this error: Jan 14 10:14:25 omadvnfs01c kernel: device-mapper: table: 253:45: linear: dm-linear: Device lookup failed Jan 14 10:14:25 omadvnfs01c kernel: device-mapper: ioctl: error adding target to table I am sure I can hack in a vgchange or something but clvmd does a vgscan I believe during the startup process not sure this workaround would even help. I just need to be pointed down a path to try to determine root cause here. Thanks! 
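A few commands that may help narrow down whether this is a startup-ordering problem or a device-visibility problem (assuming the affected volumes are clustered VGs that the clvmd init script activates at boot; <vgname> is a placeholder):

chkconfig --list | egrep 'cman|clvmd'   # clvmd has to start after cman and after the storage is visible
vgs -o vg_name,vg_attr                  # a 'c' in the attributes marks a clustered VG
lvs -o lv_name,vg_name,lv_attr          # an 'a' in the attributes means the LV is active
vgchange -ay <vgname>                   # manual activation, the same thing the workaround above does

If clvmd is ordered before cman, or before the backing devices appear, that alone could explain LVs that come up inactive after a reboot.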
-------------- next part -------------- An HTML attachment was scrubbed... URL: From linux at alteeve.com Sat Jan 14 17:47:01 2012 From: linux at alteeve.com (Digimer) Date: Sat, 14 Jan 2012 12:47:01 -0500 Subject: [Linux-cluster] LVM not available on 2/6 clustered volumes on reboot In-Reply-To: References: Message-ID: <4F11BF95.7090707@alteeve.com> On 01/14/2012 11:27 AM, Terry wrote: > All of my nodes have experienced this issue but I can't determine root > cause. After reboot, 2/6 of my volumes are set to NOT available. I > either have to do a vgscan or vgchange -ay on the volume group to then > set the LV to available. > > Doing an lvchange -ay before doing the vgchange or vgscan results in > this error: > Jan 14 10:14:25 omadvnfs01c kernel: device-mapper: table: 253:45: > linear: dm-linear: Device lookup failed > Jan 14 10:14:25 omadvnfs01c kernel: device-mapper: ioctl: error adding > target to table > > I am sure I can hack in a vgchange or something but clvmd does a vgscan > I believe during the startup process not sure this workaround would even > help. I just need to be pointed down a path to try to determine root > cause here. > > Thanks! First thing that comes to mind is that LVM is starting before the devices are available. Some questions; * Do you mean clustered LVM? * How and when is (c)lvm started? * How and when is the backing device connected? * What kind of cluster is this? What versions? * What are the relevant configuration files? -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "omg my singularity battery is dead again. stupid hawking radiation." - epitron From ga at steadfasttelecom.com Sun Jan 22 19:19:41 2012 From: ga at steadfasttelecom.com (Gilad Abada) Date: Sun, 22 Jan 2012 14:19:41 -0500 Subject: [Linux-cluster] crm issue Message-ID: Hi Guys I am new to the world of clustering. I am working on Ubuntu 10.04 LTS 64 bit and im running into a weird issue. When I am in crm -> configure after I type primitive if I try to tab anything out it doesnt work. It seems like its frozen. The only way to get out is to CTRL + C. Also this may be a related issue if i go to crm -> configure -> edit and actually make an edit, I am trying to add: primitive drbd_disk ocf:linbit:drbd \ params drbd_resource="disk0" \ op monitor interval="15s" primitive fs_drbd ocf:heartbeat:Filesystem \ params device="/dev/drbd/by-res/disk0" directory="/mnt" fstype="ext3" ms ms_drbd drbd_disk \ meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" colocation mnt_on_master inf: fs_drbd ms_drbd:Master order mount_after_drbd inf: ms_drbd:promote fs_drbd:start then I :wq! and it freezes again and i have to CTRL + C I am hoping its a bad config issue on my side? Also if anyone has any good links for n00bs on clustering with ubuntu please send them along this is pretty overwhelming. Thanks so much!! Gill -- Gilad Abada SteadFast Telecommunications, Inc. Call us to find out how much you can save with VoIP! V: 212.589.1001 F: 212.589.1011 For 35 years, Steadfast Telecommunications has been providing state-of-the-art communications technology to businesses and government agencies - large and small. Steadfast Telecommunications tailors Unified Communications and Voice-Over IP Solutions to single-site offices or multi-site and worldwide enterprises.?? Make your virtual office a reality.? Enjoy the freedom to travel while remaining connected to your office. 
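One way to take the interactive crm shell out of the picture while this gets sorted out is to put the same lines in a plain file and load it in batch mode. This is only a sketch; drbd.crm is a made-up filename and the resource definitions are the ones from the message above:

cat > drbd.crm <<'EOF'
primitive drbd_disk ocf:linbit:drbd \
        params drbd_resource="disk0" \
        op monitor interval="15s"
primitive fs_drbd ocf:heartbeat:Filesystem \
        params device="/dev/drbd/by-res/disk0" directory="/mnt" fstype="ext3"
ms ms_drbd drbd_disk \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
colocation mnt_on_master inf: fs_drbd ms_drbd:Master
order mount_after_drbd inf: ms_drbd:promote fs_drbd:start
EOF
crm configure load update drbd.crm   # merge the edits from the file instead of typing them interactively
crm configure show                   # confirm what ended up in the configuration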
From df.cluster at gmail.com Mon Jan 23 06:36:32 2012 From: df.cluster at gmail.com (Dan Frincu) Date: Mon, 23 Jan 2012 08:36:32 +0200 Subject: [Linux-cluster] crm issue In-Reply-To: References: Message-ID: Hi, On Sun, Jan 22, 2012 at 9:19 PM, Gilad Abada wrote: > Hi Guys > > I am new to the world of clustering. > > I am working on Ubuntu 10.04 LTS 64 bit and im running into a weird issue. > > When I am in crm -> configure after I type primitive if I try to tab > anything out it doesnt work. It seems like its frozen. > > The only way to get out is to CTRL + C. Maybe this helps http://www.gossamer-threads.com/lists/linuxha/pacemaker/77423?do=post_view_threaded#77423 Regards, Dan > > Also this may be a related issue if i go to crm -> configure -> edit > and actually make an edit, I am trying to add: > > primitive drbd_disk ocf:linbit:drbd \ > ? ? ? ?params drbd_resource="disk0" \ > ? ? ? ?op monitor interval="15s" > primitive fs_drbd ocf:heartbeat:Filesystem \ > ? ? ? ?params device="/dev/drbd/by-res/disk0" directory="/mnt" fstype="ext3" > ms ms_drbd drbd_disk \ > ? ? ? ?meta master-max="1" master-node-max="1" clone-max="2" > clone-node-max="1" notify="true" > colocation mnt_on_master inf: fs_drbd ms_drbd:Master > order mount_after_drbd inf: ms_drbd:promote fs_drbd:start > > then I :wq! > and it freezes again and i have to CTRL + C > > I am hoping its a bad config issue on my side? > > Also if anyone has any good links for n00bs on clustering with ubuntu > please send them along this is pretty overwhelming. > > Thanks so much!! > > Gill > > > -- > Gilad Abada > > SteadFast Telecommunications, Inc. > > Call us to find out how much you can save with VoIP! > > V: 212.589.1001 > F: 212.589.1011 > > > For 35 years, Steadfast Telecommunications has been providing > state-of-the-art communications technology to businesses and > government agencies - large and small. Steadfast Telecommunications > tailors Unified Communications and Voice-Over IP Solutions to > single-site offices or multi-site and worldwide enterprises.?? Make > your virtual office a reality.? Enjoy the freedom to travel while > remaining connected to your office. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Dan Frincu CCNA, RHCE From ga at steadfasttelecom.com Mon Jan 23 18:12:12 2012 From: ga at steadfasttelecom.com (Gilad Abada) Date: Mon, 23 Jan 2012 13:12:12 -0500 Subject: [Linux-cluster] crm issue In-Reply-To: References: Message-ID: Hi Dan, Thank you!! That worked. On Mon, Jan 23, 2012 at 1:36 AM, Dan Frincu wrote: > Hi, > > On Sun, Jan 22, 2012 at 9:19 PM, Gilad Abada wrote: >> Hi Guys >> >> I am new to the world of clustering. >> >> I am working on Ubuntu 10.04 LTS 64 bit and im running into a weird issue. >> >> When I am in crm -> configure after I type primitive if I try to tab >> anything out it doesnt work. It seems like its frozen. >> >> The only way to get out is to CTRL + C. > > Maybe this helps > http://www.gossamer-threads.com/lists/linuxha/pacemaker/77423?do=post_view_threaded#77423 > > Regards, > Dan > >> >> Also this may be a related issue if i go to crm -> configure -> edit >> and actually make an edit, I am trying to add: >> >> primitive drbd_disk ocf:linbit:drbd \ >> ? ? ? ?params drbd_resource="disk0" \ >> ? ? ? ?op monitor interval="15s" >> primitive fs_drbd ocf:heartbeat:Filesystem \ >> ? ? ? ?params device="/dev/drbd/by-res/disk0" directory="/mnt" fstype="ext3" >> ms ms_drbd drbd_disk \ >> ? ? ? 
?meta master-max="1" master-node-max="1" clone-max="2" >> clone-node-max="1" notify="true" >> colocation mnt_on_master inf: fs_drbd ms_drbd:Master >> order mount_after_drbd inf: ms_drbd:promote fs_drbd:start >> >> then I :wq! >> and it freezes again and i have to CTRL + C >> >> I am hoping its a bad config issue on my side? >> >> Also if anyone has any good links for n00bs on clustering with ubuntu >> please send them along this is pretty overwhelming. >> >> Thanks so much!! >> >> Gill >> >> >> -- >> Gilad Abada >> >> SteadFast Telecommunications, Inc. >> >> Call us to find out how much you can save with VoIP! >> >> V: 212.589.1001 >> F: 212.589.1011 >> >> >> For 35 years, Steadfast Telecommunications has been providing >> state-of-the-art communications technology to businesses and >> government agencies - large and small. Steadfast Telecommunications >> tailors Unified Communications and Voice-Over IP Solutions to >> single-site offices or multi-site and worldwide enterprises.?? Make >> your virtual office a reality.? Enjoy the freedom to travel while >> remaining connected to your office. >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Dan Frincu > CCNA, RHCE > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Gilad Abada SteadFast Telecommunications, Inc. Call us to find out how much you can save with VoIP! V: 212.589.1001 F: 212.589.1011 For 35 years, Steadfast Telecommunications has been providing state-of-the-art communications technology to businesses and government agencies - large and small. Steadfast Telecommunications tailors Unified Communications and Voice-Over IP Solutions to single-site offices or multi-site and worldwide enterprises.?? Make your virtual office a reality.? Enjoy the freedom to travel while remaining connected to your office. From kortux at gmail.com Tue Jan 24 20:57:57 2012 From: kortux at gmail.com (Miguel Angel Guerrero) Date: Tue, 24 Jan 2012 15:57:57 -0500 Subject: [Linux-cluster] Halt nodes in cluster with cable disconnect Message-ID: Hi i'm trying to setup a centos cluster with two nodes with cman, drbd, gfs2 and i'm using ipmi for fencing. DRBD is set up between the nodes using a dedicated interface. So, when I unplug the drbd network cable, both nodes power off immediatly (i tried using crossover cable and both nodes connected to a switch, but both scenarios fail), and the logs doesn't seem to show something useful. In a previous thread on this list, it is recommended to deactivate ACPID daemon, even at BIOS level, but I'm still having troubles. If I simulate a physical disconnection with ifdown command in some node, this node reboots with no hassle, but unpluging the cable kills both nodes. I think the first scenario is correct, but the second one is not what I expect. Thanks for your help the next are my cluster.conf -- Att: ------------------------------------ Miguel Angel Guerrero Usuario GNU/Linux Registrado #353531 ------------------------------------ -------------- next part -------------- An HTML attachment was scrubbed... URL: From wmodes at ucsc.edu Tue Jan 24 21:01:53 2012 From: wmodes at ucsc.edu (Wes Modes) Date: Tue, 24 Jan 2012 13:01:53 -0800 Subject: [Linux-cluster] Expanding a LUN and a GFS2 filesystem Message-ID: <4F1F1C41.5030701@ucsc.edu> I am running CentOS with a GFS2 filesystem on a Dell EqualLogic SAN. 
I created the filesystem by mapping an RDM through VMWare to the guest OS. I used pvcreate, vgcreate, lvcreate, and mkfs.gfs2 to create the filesystem and the underlying architecture. I've included the log I created to document the process below. I've already increased the size of the LUN on the SAN. Now, how do I increase the size of the GFS2 filesystem and the LVM beneath it? Do I need to do something with the PV and VG as well? Thanks in advance for your help. Wes Here is the log of the process I used to create the filesystem: With the RDM created and all the daemons started (luci, ricci, cman) now I can config GFS. Make sure they are running on all of our nodes. We can even see the RDM on the guest systems: [root at test03]# ls /dev/sdb /dev/sdb [root at test04]# ls /dev/sdb /dev/sdb So we are doing this using lvm clustering: http://emrahbaysal.blogspot.com/2011/03/gfs-cluster-on-vmware-vsphere-rh... and http://linuxdynasty.org/215/howto-setup-gfs2-with-clustering/ We've already set up gfs daemons and fencing and whatnot. Before we start to create the LVM2 volumes and Proceed to GFS2, we will need to enable clustering in LVM2. [root at test03]# lvmconf --enable-cluster I try to create the cluster FS [root at test03]# pvcreate /dev/sdb connect() failed on local socket: No such file or directory Internal cluster locking initialisation failed. WARNING: Falling back to local file-based locking. Volume Groups with the clustered attribute will be inaccessible. Physical volume "/dev/sdb" successfully created One internet source says: >> That indicates that you have cluster locking enabled but that the cluster LVM >> daemon (clvmd) is not running. So let's start it, [root at test03]# service clvmd status clvmd is stopped [root at test03]# service clvmd start Starting clvmd: Activating VG(s): 2 logical volume(s) in volume group "VolGroup00" now active clvmd not running on node test04 [ OK ] [root at test03]# chkconfig clvmd on Okay, over on the other node: [root at test04]# service clvmd status clvmd is stopped [root at test04]# service clvmd start Starting clvmd: clvmd could not connect to cluster manager Consult syslog for more information [root at test04]# service cman status groupd is stopped [root at test04]# service cman start Starting cluster: Loading modules... done Mounting configfs... done Starting ccsd... done Starting cman... done Starting daemons... done Starting fencing... done [ OK ] [root at test04]# chkconfig cman on [root at test04]# service luci status luci is running... [root at test04]# service ricci status ricci (pid 4381) is running... [root at test04]# chkconfig ricci on [root at test04]# chkconfig luci on [root at test04]# service clvmd start Starting clvmd: Activating VG(s): 2 logical volume(s) in volume group "VolGroup00" now active [ OK ] And this time, no complaints: [root at test03]# service clvmd restart Restarting clvmd: [ OK ] Try again with pvcreate: [root at test03]# pvcreate /dev/sdb Physical volume "/dev/sdb" successfully created Create volume group: [root at test03]# vgcreate gdcache_vg /dev/sdb Clustered volume group "gdcache_vg" successfully created Create logical volume: [root at test03]# lvcreate -n gdcache_lv -L 2T gdcache_vg Logical volume "gdcache_lv" created Create GFS filesystem, ahem, GFS2 filesystem. I screwed this up the first time. [root at test03]# mkfs.gfs2 -j 8 -p lock_dlm -t gdcluster:gdcache -j 4 /dev/mapper/gdcache_vg-gdcache_lv This will destroy any data on /dev/mapper/gdcache_vg-gdcache_lv. It appears to contain a gfs filesystem. 
Are you sure you want to proceed? [y/n] y Device: /dev/mapper/gdcache_vg-gdcache_lv Blocksize: 4096 Device Size 2048.00 GB (536870912 blocks) Filesystem Size: 2048.00 GB (536870910 blocks) Journals: 4 Resource Groups: 8192 Locking Protocol: "lock_dlm" Lock Table: "gdcluster:gdcache" UUID: 0542628C-D8B8-2480-F67D-081435F38606 Okay! And! Finally! We mount it! [root at test03]# mount /dev/mapper/gdcache_vg-gdcache_lv /data /sbin/mount.gfs: fs is for a different cluster /sbin/mount.gfs: error mounting lockproto lock_dlm Wawawwah. Bummer. /var/log/messages says: Jan 19 14:21:05 test03 gfs_controld[3369]: mount: fs requires cluster="gdcluster" current="gdao_cluster" Someone on the interwebs concurs: the cluster name defined in /etc/cluster/cluster.conf is different from the one tagged on the GFS volume. Okay, so looking at cluster.conf: [root at test03]# vi /etc/cluster/cluster.conf Let's change that to match how I named the cluster in the above cfg_mkfs [root at test03]# vi /etc/cluster/cluster.conf And restart some stuff: [root at test03]# /etc/init.d/gfs2 stop [root at test03]# service luci stop Shutting down luci: service ricci [ OK ] [root at test03]# service ricci stop Shutting down ricci: [ OK ] [root at test03]# service cman stop Stopping cluster: Stopping fencing... done Stopping cman... failed /usr/sbin/cman_tool: Error leaving cluster: Device or resource busy [FAILED] [root at test03]# cman_tool leave force [root at test03]# service cman stop Stopping cluster: Stopping fencing... done Stopping cman... done Stopping ccsd... done Unmounting configfs... done [ OK ] AAAARRRRGGGHGHHH [root at test03]# service ricci start Starting ricci: [ OK ] [root at test03]# service luci start Starting luci: [ OK ] Point your web browser to https://test03.gdao.ucsc.edu:8084 to access luci [root at test03]# service gfs2 start [root at test03]# service cman start Starting cluster: Loading modules... done Mounting configfs... done Starting ccsd... done Starting cman... done Starting daemons... done Starting fencing... failed [FAILED] I had to reboot. [root at test03]# service luci status luci is running... [root at test03]# service ricci status ricci (pid 4385) is running... [root at test03]# service cman status cman is running. [root at test03]# service gfs2 status Okay, again? [root at test03]# mount /dev/mapper/gdcache_vg-gdcache_lv /data Did that just work? And on test04 [root at test04]# mount /dev/mapper/gdcache_vg-gdcache_lv /data Okay, how about a test: [root at test03]# touch /data/killme And then we look on the other node: [root at test04]# ls /data killme Holy shit. I've been working so hard for this moment that I don't completely know what to do now. Question is, now that I have two working nodes, can I duplicate it? Okay, finish up: [root at test03]# chkconfig rgmanager on [root at test03]# service rgmanager start Starting Cluster Service Manager: [ OK ] [root at test03]# vi /etc/fstab /dev/mapper/gdcache_vg-gdcache_lv /data gfs2 defaults,noatime,nodiratime 0 0 and on the other node: [root at test04]# chkconfig rgmanager on [root at test04]# service rgmanager start Starting Cluster Service Manager: [root at test04]# vi /etc/fstab /dev/mapper/gdcache_vg-gdcache_lv /data gfs2 defaults,noatime,nodiratime 0 0 And it works. Hell, yeah. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From linux at alteeve.com Tue Jan 24 21:09:55 2012 From: linux at alteeve.com (Digimer) Date: Tue, 24 Jan 2012 16:09:55 -0500 Subject: [Linux-cluster] Halt nodes in cluster with cable disconnect In-Reply-To: References: Message-ID: <4F1F1E23.6080308@alteeve.com> On 01/24/2012 03:57 PM, Miguel Angel Guerrero wrote: > Hi i'm trying to setup a centos cluster with two nodes with cman, drbd, > gfs2 and i'm using ipmi for fencing. DRBD is set up between the nodes > using a dedicated interface. So, when I unplug the drbd network cable, > both nodes power off immediatly (i tried using crossover cable and both > nodes connected to a switch, but both scenarios fail), and the logs > doesn't seem to show something useful. In a previous thread on this > list, it is recommended to deactivate ACPID daemon, even at BIOS level, > but I'm still having troubles. > > If I simulate a physical disconnection with ifdown command in some node, > this node reboots with no hassle, but unpluging the cable kills both > nodes. I think the first scenario is correct, but the second one is not > what I expect. > > Thanks for your help the next are my cluster.conf This is likely caused by both nodes getting their fence calls off before one of them dies. How do you have DRBD configured? Specifically, what fence handler are you using? If you're interested in testing, I have rewritten lon's obliterate-peer.sh and added explicit delays to help resolve this exact issue. https://github.com/digimer/rhcs_fence Alternatively, add a 'sleep 10' or similar to one of your existing fence handlers and you should find that the node with the delay consistently loses while the other node remains up. -- Digimer E-Mail: digimer at alteeve.com Papers and Projects: https://alteeve.com From rpeterso at redhat.com Tue Jan 24 21:24:12 2012 From: rpeterso at redhat.com (Bob Peterson) Date: Tue, 24 Jan 2012 16:24:12 -0500 (EST) Subject: [Linux-cluster] Expanding a LUN and a GFS2 filesystem In-Reply-To: <4F1F1C41.5030701@ucsc.edu> Message-ID: <17a7e975-d459-41b3-a5ed-2b3d9958c4de@zmail16.collab.prod.int.phx2.redhat.com> ----- Original Message ----- | I am running CentOS with a GFS2 filesystem on a Dell EqualLogic SAN. | I | created the filesystem by mapping an RDM through VMWare to the guest | OS. I used pvcreate, vgcreate, lvcreate, and mkfs.gfs2 to create the | filesystem and the underlying architecture. I've included the log I | created to document the process below. | | I've already increased the size of the LUN on the SAN. Now, how do I | increase the size of the GFS2 filesystem and the LVM beneath it? Do | I | need to do something with the PV and VG as well? | | Thanks in advance for your help. | | Wes Hi Wes, Yep, you do need to start cman service before clvmd. If you've already extended the volume with lvresize or lvextend, then the procedure to expand the GFS2 file system to use that extra space is simple: 1. mount it on both nodes 2. gfs2_grow /mnt/point (your mount point) If it was my file system, I'd umount it at that point and do sync just to be on the safe side. Some older versions of the software didn't always sync the statfs information correctly, etc. It shouldn't be necessary, but it doesn't hurt to do it, right? Then mount it again. 
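As a sketch only, using the volume names from the log earlier in the thread: if the extra space comes from growing the existing LUN rather than adding a second device, the physical volume has to be told about the new size before the logical volume can be extended. The /dev/sdb path, the rescan step and the +1T figure are assumptions here:

echo 1 > /sys/block/sdb/device/rescan             # have the guest re-read the LUN's new size
pvresize /dev/sdb                                 # grow the PV to fill the enlarged device
vgs gdcache_vg                                    # free extents should now show up
lvextend -L +1T /dev/gdcache_vg/gdcache_lv
mount -t gfs2 /dev/gdcache_vg/gdcache_lv /data    # on both nodes
gfs2_grow /data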
Regards, Bob Peterson Red Hat File Systems From wmodes at ucsc.edu Tue Jan 24 21:25:33 2012 From: wmodes at ucsc.edu (Wes Modes) Date: Tue, 24 Jan 2012 13:25:33 -0800 Subject: [Linux-cluster] Expanding a LUN and a GFS2 filesystem Message-ID: <4F1F21CD.3000702@ucsc.edu> I am running CentOS with a GFS2 filesystem on a Dell EqualLogic SAN. I created the filesystem by mapping an RDM through VMWare to the guest OS. I used pvcreate, vgcreate, lvcreate, and mkfs.gfs2 to create the filesystem and the underlying architecture. I've included the log I created to document the process below. I've already increased the size of the LUN on the SAN. Now, how do I increase the size of the GFS2 filesystem and the LVM beneath it? Do I need to do something with the PV and VG as well? Thanks in advance for your help. Wes Here is the log of the process I used to create the filesystem: With the RDM created and all the daemons started (luci, ricci, cman) now I can config GFS. Make sure they are running on all of our nodes. We can even see the RDM on the guest systems: [root at test03]# ls /dev/sdb /dev/sdb [root at test04]# ls /dev/sdb /dev/sdb So we are doing this using lvm clustering: http://emrahbaysal.blogspot.com/2011/03/gfs-cluster-on-vmware-vsphere-rh... and http://linuxdynasty.org/215/howto-setup-gfs2-with-clustering/ We've already set up gfs daemons and fencing and whatnot. Before we start to create the LVM2 volumes and Proceed to GFS2, we will need to enable clustering in LVM2. [root at test03]# lvmconf --enable-cluster I try to create the cluster FS [root at test03]# pvcreate /dev/sdb connect() failed on local socket: No such file or directory Internal cluster locking initialisation failed. WARNING: Falling back to local file-based locking. Volume Groups with the clustered attribute will be inaccessible. Physical volume "/dev/sdb" successfully created One internet source says: >> That indicates that you have cluster locking enabled but that the cluster LVM >> daemon (clvmd) is not running. So let's start it, [root at test03]# service clvmd status clvmd is stopped [root at test03]# service clvmd start Starting clvmd: Activating VG(s): 2 logical volume(s) in volume group "VolGroup00" now active clvmd not running on node test04 [ OK ] [root at test03]# chkconfig clvmd on Okay, over on the other node: [root at test04]# service clvmd status clvmd is stopped [root at test04]# service clvmd start Starting clvmd: clvmd could not connect to cluster manager Consult syslog for more information [root at test04]# service cman status groupd is stopped [root at test04]# service cman start Starting cluster: Loading modules... done Mounting configfs... done Starting ccsd... done Starting cman... done Starting daemons... done Starting fencing... done [ OK ] [root at test04]# chkconfig cman on [root at test04]# service luci status luci is running... [root at test04]# service ricci status ricci (pid 4381) is running... 
[root at test04]# chkconfig ricci on [root at test04]# chkconfig luci on [root at test04]# service clvmd start Starting clvmd: Activating VG(s): 2 logical volume(s) in volume group "VolGroup00" now active [ OK ] And this time, no complaints: [root at test03]# service clvmd restart Restarting clvmd: [ OK ] Try again with pvcreate: [root at test03]# pvcreate /dev/sdb Physical volume "/dev/sdb" successfully created Create volume group: [root at test03]# vgcreate gdcache_vg /dev/sdb Clustered volume group "gdcache_vg" successfully created Create logical volume: [root at test03]# lvcreate -n gdcache_lv -L 2T gdcache_vg Logical volume "gdcache_lv" created Create GFS filesystem, ahem, GFS2 filesystem. I screwed this up the first time. [root at test03]# mkfs.gfs2 -j 8 -p lock_dlm -t gdcluster:gdcache -j 4 /dev/mapper/gdcache_vg-gdcache_lv This will destroy any data on /dev/mapper/gdcache_vg-gdcache_lv. It appears to contain a gfs filesystem. Are you sure you want to proceed? [y/n] y Device: /dev/mapper/gdcache_vg-gdcache_lv Blocksize: 4096 Device Size 2048.00 GB (536870912 blocks) Filesystem Size: 2048.00 GB (536870910 blocks) Journals: 4 Resource Groups: 8192 Locking Protocol: "lock_dlm" Lock Table: "gdcluster:gdcache" UUID: 0542628C-D8B8-2480-F67D-081435F38606 Okay! And! Finally! We mount it! [root at test03]# mount /dev/mapper/gdcache_vg-gdcache_lv /data /sbin/mount.gfs: fs is for a different cluster /sbin/mount.gfs: error mounting lockproto lock_dlm Wawawwah. Bummer. /var/log/messages says: Jan 19 14:21:05 test03 gfs_controld[3369]: mount: fs requires cluster="gdcluster" current="gdao_cluster" Someone on the interwebs concurs: the cluster name defined in /etc/cluster/cluster.conf is different from the one tagged on the GFS volume. Okay, so looking at cluster.conf: [root at test03]# vi /etc/cluster/cluster.conf Let's change that to match how I named the cluster in the above cfg_mkfs [root at test03]# vi /etc/cluster/cluster.conf And restart some stuff: [root at test03]# /etc/init.d/gfs2 stop [root at test03]# service luci stop Shutting down luci: service ricci [ OK ] [root at test03]# service ricci stop Shutting down ricci: [ OK ] [root at test03]# service cman stop Stopping cluster: Stopping fencing... done Stopping cman... failed /usr/sbin/cman_tool: Error leaving cluster: Device or resource busy [FAILED] [root at test03]# cman_tool leave force [root at test03]# service cman stop Stopping cluster: Stopping fencing... done Stopping cman... done Stopping ccsd... done Unmounting configfs... done [ OK ] AAAARRRRGGGHGHHH [root at test03]# service ricci start Starting ricci: [ OK ] [root at test03]# service luci start Starting luci: [ OK ] Point your web browser to https://test03.gdao.ucsc.edu:8084 to access luci [root at test03]# service gfs2 start [root at test03]# service cman start Starting cluster: Loading modules... done Mounting configfs... done Starting ccsd... done Starting cman... done Starting daemons... done Starting fencing... failed [FAILED] I had to reboot. [root at test03]# service luci status luci is running... [root at test03]# service ricci status ricci (pid 4385) is running... [root at test03]# service cman status cman is running. [root at test03]# service gfs2 status Okay, again? [root at test03]# mount /dev/mapper/gdcache_vg-gdcache_lv /data Did that just work? And on test04 [root at test04]# mount /dev/mapper/gdcache_vg-gdcache_lv /data Okay, how about a test: [root at test03]# touch /data/killme And then we look on the other node: [root at test04]# ls /data killme Holy shit. 
I've been working so hard for this moment that I don't completely know what to do now. Question is, now that I have two working nodes, can I duplicate it? Okay, finish up: [root at test03]# chkconfig rgmanager on [root at test03]# service rgmanager start Starting Cluster Service Manager: [ OK ] [root at test03]# vi /etc/fstab /dev/mapper/gdcache_vg-gdcache_lv /data gfs2 defaults,noatime,nodiratime 0 0 and on the other node: [root at test04]# chkconfig rgmanager on [root at test04]# service rgmanager start Starting Cluster Service Manager: [root at test04]# vi /etc/fstab /dev/mapper/gdcache_vg-gdcache_lv /data gfs2 defaults,noatime,nodiratime 0 0 And it works. Hell, yeah. -------------- next part -------------- An HTML attachment was scrubbed... URL: From kortux at gmail.com Tue Jan 24 21:34:42 2012 From: kortux at gmail.com (Miguel Angel Guerrero) Date: Tue, 24 Jan 2012 16:34:42 -0500 Subject: [Linux-cluster] Halt nodes in cluster with cable disconnect In-Reply-To: <4F1F1E23.6080308@alteeve.com> References: <4F1F1E23.6080308@alteeve.com> Message-ID: Digimer i use your manual ;) https://alteeve.com/w/Red_Hat_Cluster_Service_2_Tutorial in a test environment y desactivate drbd daemon for testing but with or without drbd daemon running, the problem persist I use the next handler and fencing policy in drbd fencing resource-and-stonith; outdate-peer "/sbin/obliterate-peer.sh"; Digimer when you suggest add "sleep 10"' is in drbd.conf? On Tue, Jan 24, 2012 at 4:09 PM, Digimer wrote: > On 01/24/2012 03:57 PM, Miguel Angel Guerrero wrote: > > Hi i'm trying to setup a centos cluster with two nodes with cman, drbd, > > gfs2 and i'm using ipmi for fencing. DRBD is set up between the nodes > > using a dedicated interface. So, when I unplug the drbd network cable, > > both nodes power off immediatly (i tried using crossover cable and both > > nodes connected to a switch, but both scenarios fail), and the logs > > doesn't seem to show something useful. In a previous thread on this > > list, it is recommended to deactivate ACPID daemon, even at BIOS level, > > but I'm still having troubles. > > > > If I simulate a physical disconnection with ifdown command in some node, > > this node reboots with no hassle, but unpluging the cable kills both > > nodes. I think the first scenario is correct, but the second one is not > > what I expect. > > > > Thanks for your help the next are my cluster.conf > > This is likely caused by both nodes getting their fence calls off before > one of them dies. > > How do you have DRBD configured? Specifically, what fence handler are > you using? If you're interested in testing, I have rewritten lon's > obliterate-peer.sh and added explicit delays to help resolve this exact > issue. > > https://github.com/digimer/rhcs_fence > > Alternatively, add a 'sleep 10' or similar to one of your existing fence > handlers and you should find that the node with the delay consistently > loses while the other node remains up. > > -- > Digimer > E-Mail: digimer at alteeve.com > Papers and Projects: https://alteeve.com > -- Atte: ------------------------------------ Miguel Angel Guerrero Usuario GNU/Linux Registrado #353531 ------------------------------------ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From linux at alteeve.com Tue Jan 24 21:42:56 2012 From: linux at alteeve.com (Digimer) Date: Tue, 24 Jan 2012 16:42:56 -0500 Subject: [Linux-cluster] Halt nodes in cluster with cable disconnect In-Reply-To: References: <4F1F1E23.6080308@alteeve.com> Message-ID: <4F1F25E0.80002@alteeve.com> On 01/24/2012 04:34 PM, Miguel Angel Guerrero wrote: > Digimer i use your manual ;) > > https://alteeve.com/w/Red_Hat_Cluster_Service_2_Tutorial > > in a test environment y desactivate drbd daemon for testing but with or > without drbd daemon running, the problem persist > I use the next handler and fencing policy in drbd > > fencingresource-and-stonith; > outdate-peer"/sbin/obliterate-peer.sh"; > > Digimer when you suggest add "sleep 10"' is in drbd.conf? That's awesome! :) No, you would put the sleep at the start of obliterate-peer.sh on one node only. If this works, would you be willing to test 'rhcs_fence' for me? It's new, and could use some testing. It automatically adds a delay based on the node's cluster ID, with no delay for the node with ID of "1". If so, here is how to install it on both nodes; wget -c https://raw.github.com/digimer/rhcs_fence/master/rhcs_fence chmod 755 rhcs_fence mv rhcs_fence /usr/sbin/ Then in drbd.conf, change: outdate-peer "/sbin/obliterate-peer.sh"; to outdate-peer "/usr/sbin/rhcs_fence"; Cheers. :) -- Digimer E-Mail: digimer at alteeve.com Papers and Projects: https://alteeve.com From wmodes at ucsc.edu Tue Jan 24 22:19:58 2012 From: wmodes at ucsc.edu (Wes Modes) Date: Tue, 24 Jan 2012 14:19:58 -0800 Subject: [Linux-cluster] Expanding a LUN and a GFS2 filesystem In-Reply-To: <17a7e975-d459-41b3-a5ed-2b3d9958c4de@zmail16.collab.prod.int.phx2.redhat.com> References: <17a7e975-d459-41b3-a5ed-2b3d9958c4de@zmail16.collab.prod.int.phx2.redhat.com> Message-ID: <4F1F2E8E.4010308@ucsc.edu> I have not extended the volume. That was precisely my question. I already understand how to grow the GFS2 filesystem (conceptually). As per https://alteeve.com/w/Grow_a_GFS2_Partition. I've tried to increase the size of the volume with lvextend, but it's not having it. [root at test03]# lvextend -L +2T /dev/sdb Path required for Logical Volume "sdb" Please provide a volume group name Run `lvextend --help' for more information. [root at test03]# lvextend -L +2T /dev/mapper/gdcache_vg-gdcache_lv /dev/sdb Extending logical volume gdcache_lv to 4.00 TB Insufficient free space: 524288 extents needed, but only 3 available [root at test03]# lvextend -L +2000G /dev/mapper/gdcache_vg-gdcache_lv /dev/sdb Extending logical volume gdcache_lv to 3.95 TB Insufficient free space: 512000 extents needed, but only 3 available [root at test03]# lvextend -L +1999G /dev/mapper/gdcache_vg-gdcache_lv /dev/sdb Extending logical volume gdcache_lv to 3.95 TB Insufficient free space: 511744 extents needed, but only 3 available I assume I need to expand the underlying PV or VG. But how? Wes On 1/24/2012 1:24 PM, Bob Peterson wrote: > ----- Original Message ----- > | I am running CentOS with a GFS2 filesystem on a Dell EqualLogic SAN. > | I > | created the filesystem by mapping an RDM through VMWare to the guest > | OS. I used pvcreate, vgcreate, lvcreate, and mkfs.gfs2 to create the > | filesystem and the underlying architecture. I've included the log I > | created to document the process below. > | > | I've already increased the size of the LUN on the SAN. Now, how do I > | increase the size of the GFS2 filesystem and the LVM beneath it? Do > | I > | need to do something with the PV and VG as well? 
> | > | Thanks in advance for your help. > | > | Wes > > Hi Wes, > > Yep, you do need to start cman service before clvmd. > > If you've already extended the volume with lvresize or lvextend, > then the procedure to expand the GFS2 file system to use that > extra space is simple: > > 1. mount it on both nodes > 2. gfs2_grow /mnt/point (your mount point) > > If it was my file system, I'd umount it at that point and do sync > just to be on the safe side. Some older versions of the software > didn't always sync the statfs information correctly, etc. > It shouldn't be necessary, but it doesn't hurt to do it, right? > Then mount it again. > > Regards, > > Bob Peterson > Red Hat File Systems > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpeterso at redhat.com Tue Jan 24 22:30:41 2012 From: rpeterso at redhat.com (Bob Peterson) Date: Tue, 24 Jan 2012 17:30:41 -0500 (EST) Subject: [Linux-cluster] Expanding a LUN and a GFS2 filesystem In-Reply-To: <4F1F2E8E.4010308@ucsc.edu> Message-ID: <28f21da3-b41b-496c-9be0-4104a0a7df91@zmail16.collab.prod.int.phx2.redhat.com> ----- Original Message ----- | I have not extended the volume. That was precisely my question. I | already understand how to grow the GFS2 filesystem (conceptually). | As | per https://alteeve.com/w/Grow_a_GFS2_Partition. | | I've tried to increase the size of the volume with lvextend, but it's | not having it. | | [root at test03]# lvextend -L +2T /dev/sdb | Path required for Logical Volume "sdb" | Please provide a volume group name | Run `lvextend --help' for more information. | [root at test03]# lvextend -L +2T /dev/mapper/gdcache_vg-gdcache_lv | /dev/sdb | Extending logical volume gdcache_lv to 4.00 TB | Insufficient free space: 524288 extents needed, but only 3 | available | [root at test03]# lvextend -L +2000G | /dev/mapper/gdcache_vg-gdcache_lv /dev/sdb | Extending logical volume gdcache_lv to 3.95 TB | Insufficient free space: 512000 extents needed, but only 3 | available | [root at test03]# lvextend -L +1999G | /dev/mapper/gdcache_vg-gdcache_lv /dev/sdb | Extending logical volume gdcache_lv to 3.95 TB | Insufficient free space: 511744 extents needed, but only 3 | available | | I assume I need to expand the underlying PV or VG. But how? | | Wes In order to make the volume bigger, you need to lvresize or lvextend it. In order to do that, you need to make the volume group bigger. If your volume group has no more space, you can add storage devices to it with a command like this: vgextend gdcache_vg /dev/sdt /dev/sdu /dev/sdv (assuming you want to add those devices to the vg) Once you've done that, you can extend the lv with lvresize or lvextend. So something like: vgextend gdcache_vg /dev/sdt /dev/sdu /dev/sdv lvresize -L+1T /dev/gdcache_vg/gdcache_lv mount -t gfs2 /dev/gdcache_vg/gdcache_lv /mnt/gfs2 gfs2_grow /mnt/gfs2 Regards, Bob Peterson Red Hat File Systems From jayesh.shinde at netcore.co.in Wed Jan 25 07:50:32 2012 From: jayesh.shinde at netcore.co.in (jayesh.shinde) Date: Wed, 25 Jan 2012 13:20:32 +0530 Subject: [Linux-cluster] Few queries about fence working Message-ID: <4F1FB448.6060709@netcore.co.in> Hi all , I have few queries about fence working. I am using 2 different the 2 node cluster with Dell and IBM hardware in two different IDC. Recently I came across the network failure problem at different time and I found my 2 nodes are power off state. 
Below is how the situation happened with my 2 different 2 node cluster. With 2 node IBM node cluster with SAN :-- ============================== 1) Network connectivity was failed totally for few minutes. 2) And as per the /var/log/messages both servers failed to fence to each other and both server was UP as it is with all services. 3) But the "clustat" was showing serves are not in cluster mode and "regmanger" status was stop. 4) I simply reboot the server. 5) After that I found both server in power off stat. with another 2 node Dell server with DRBD :-- ================================= 1) Network connectivity was failed totally. 2) DRAC ip was unavailable so fence failed from both server. 3) after some time I fond the servers are shutdown. In normal conditions both cluster work properly my queries are now :-- =============== 1) What could be the reason for power off ? 2) Does cluster's fencing method caused for the power off of server ( i.e because of previous failed fence ) ? 3) Is there any test cases mentioned on net / blog / wiki about the fence , i.e different situation under which fence works. Please guide. Thanks & Regards Jayesh Shinde -------------- next part -------------- An HTML attachment was scrubbed... URL: From emi2fast at gmail.com Wed Jan 25 08:29:27 2012 From: emi2fast at gmail.com (emmanuel segura) Date: Wed, 25 Jan 2012 09:29:27 +0100 Subject: [Linux-cluster] Few queries about fence working In-Reply-To: <4F1FB448.6060709@netcore.co.in> References: <4F1FB448.6060709@netcore.co.in> Message-ID: Can you show me your cluster config? 2012/1/25 jayesh.shinde > ** > Hi all , > > I have few queries about fence working. > > I am using 2 different the 2 node cluster with Dell and IBM hardware in > two different IDC. > Recently I came across the network failure problem at different time and I > found my 2 nodes are power off state. > > Below is how the situation happened with my 2 different 2 node cluster. > > With 2 node IBM node cluster with SAN :-- > ============================== > 1) Network connectivity was failed totally for few minutes. > 2) And as per the /var/log/messages both servers failed to fence to each > other and both server was UP as it is with all services. > 3) But the "clustat" was showing serves are not in cluster mode and > "regmanger" status was stop. > 4) I simply reboot the server. > 5) After that I found both server in power off stat. > > > with another 2 node Dell server with DRBD :-- > ================================= > 1) Network connectivity was failed totally. > 2) DRAC ip was unavailable so fence failed from both server. > 3) after some time I fond the servers are shutdown. > > In normal conditions both cluster work properly > > my queries are now :-- > =============== > 1) What could be the reason for power off ? > 2) Does cluster's fencing method caused for the power off of server ( > i.e because of previous failed fence ) ? > 3) Is there any test cases mentioned on net / blog / wiki about the fence > , i.e different situation under which fence works. > > Please guide. > > Thanks & Regards > Jayesh Shinde > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- esta es mi vida e me la vivo hasta que dios quiera -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From emi2fast at gmail.com Wed Jan 25 09:20:21 2012 From: emi2fast at gmail.com (emmanuel segura) Date: Wed, 25 Jan 2012 10:20:21 +0100 Subject: [Linux-cluster] Halt nodes in cluster with cable disconnect In-Reply-To: References: Message-ID: Hello Miguel Talking about the problem when both nodes gets poweroff, this is called fencing-race, Redhat has this problem from so much time and the only fix was made it fence delay delay="30" man fence_ipmilan And I thinks you can look for a quorum qdisk 2012/1/24 Miguel Angel Guerrero > Hi i'm trying to setup a centos cluster with two nodes with cman, drbd, > gfs2 and i'm using ipmi for fencing. DRBD is set up between the nodes using > a dedicated interface. So, when I unplug the drbd network cable, both nodes > power off immediatly (i tried using crossover cable and both nodes > connected to a switch, but both scenarios fail), and the logs doesn't seem > to show something useful. In a previous thread on this list, it is > recommended to deactivate ACPID daemon, even at BIOS level, but I'm still > having troubles. > > If I simulate a physical disconnection with ifdown command in some node, > this node reboots with no hassle, but unpluging the cable kills both nodes. > I think the first scenario is correct, but the second one is not what I > expect. > > Thanks for your help the next are my cluster.conf > > > > > > > > > action="reboot"/> > > > > > > > action="reboot"/> > > > > > > ipaddr="192.168.201.220" lanplus="1" login="ADMIN" name="ipmi1" > passwd="itac321"/> > ipaddr="192.168.201.186" lanplus="1" login="ADMIN" name="ipmi2" > passwd="itac321"/> > > > > > -- > Att: > ------------------------------------ > Miguel Angel Guerrero > Usuario GNU/Linux Registrado #353531 > ------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- esta es mi vida e me la vivo hasta que dios quiera -------------- next part -------------- An HTML attachment was scrubbed... URL: From jayesh.shinde at netcore.co.in Wed Jan 25 09:38:45 2012 From: jayesh.shinde at netcore.co.in (jayesh.shinde) Date: Wed, 25 Jan 2012 15:08:45 +0530 Subject: [Linux-cluster] Few queries about fence working In-Reply-To: References: <4F1FB448.6060709@netcore.co.in> Message-ID: <4F1FCDA5.4010909@netcore.co.in> Dear Emmanuel Segura, Find the config below. Because of policy I have removed some login details. #############