From kkovachev at varna.net Wed Jun 1 08:19:31 2011 From: kkovachev at varna.net (Kaloyan Kovachev) Date: Wed, 01 Jun 2011 11:19:31 +0300 Subject: [Linux-cluster] quorum dissolved but resources are still alive In-Reply-To: <4DE515A9.40003@abilene.it> References: <4DE515A9.40003@abilene.it> Message-ID: <527e664e6a568f3049f42559cebf8359@mx.varna.net> Hi, replying to your original email ... the problem i can see in the logs is the line: openais[971]: [SYNC ] This node is within the primary component and will provide service. as you have expected_votes=2 and node votes=1 this shouldn't happen, so it looks as a bug P.S. If you had fencing configured - when node2 is back it would fence node1 and start the services On Tue, 31 May 2011 18:22:01 +0200, Martin Claudio wrote: > Hi, > > i have a problem with a 2 node cluster with this conf: > > > > > > > > > > > > > > all is ok but when node 2 goes down quorum dissolved but resources is > not stopped, here log: > > > clurgmgrd[1302]: #1: Quorum Dissolved > kernel: dlm: closing connection to node 2 > openais[971]: [CLM ] r(0) ip(10.1.1.11) > openais[971]: [CLM ] Members Left: > openais[971]: [CLM ] r(0) ip(10.1.1.12) > openais[971]: [CLM ] Members Joined: > openais[971]: [CMAN ] quorum lost, blocking activity > openais[971]: [CLM ] CLM CONFIGURATION CHANGE > openais[971]: [CLM ] New Configuration: > openais[971]: [CLM ] r(0) ip(10.1.1.11) > openais[971]: [CLM ] Members Left: > openais[971]: [CLM ] Members Joined: > openais[971]: [SYNC ] This node is within the primary component and will > provide service. > openais[971]: [TOTEM] entering OPERATIONAL state. > openais[971]: [CLM ] got nodejoin message 10.1.1.11 > openais[971]: [CPG ] got joinlist message from node 1 > ccsd[964]: Cluster is not quorate. Refusing connection. > > > cluster recognized that quorum is dissolved but resource manager doesn't > stop resource, ip address is still alive, filesystem is still mount, > i'll expect an emergency shutdown but it does not happen.... From carlopmart at gmail.com Wed Jun 1 19:48:22 2011 From: carlopmart at gmail.com (carlopmart) Date: Wed, 01 Jun 2011 21:48:22 +0200 Subject: [Linux-cluster] Corosync goes cpu to 95-99% In-Reply-To: <4DE545D7.1080703@redhat.com> References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com> <4DD2BD7D.5070704@gmail.com> <4DD2CA90.6090802@redhat.com> <3B50BA7445114813AE429BEE51A2BA52@versa> <4DD78908.2030801@gmail.com> <0B1965C8-9807-42B6-9453-01BE0C0B1DCB@cybercat.ca><4DD80D5D.10004@gmail.com> <4DD873C7.8080402@cybercat.ca> <22E7D11CD5E64E338A66811F31F06238@versa> <4DE545D7.1080703@redhat.com> Message-ID: <4DE69786.5010204@gmail.com> On 05/31/2011 09:47 PM, Steven Dake wrote: > On 05/31/2011 12:00 PM, Nicolas Ross wrote: >>>>> I've opened a support case at redhat for this. While collecting the >>>>> sosreport for redhat, I found ot in my var/log/message file something >>>>> about gfs2_quotad being stalled for more than 120 seconds. Tought I >>>>> disabled quotas with the noquota option. It appears that it's >>>>> "quota=off". Since I cannot chane thecluster config and remount the >>>>> filessystems at the moment, I did not made the change to tes it. >>>> >>>> Thanks Nicolas. what bugzilla id is?? >>> >>> It's not a bugzilla, it's a support case. >> >> Hi ! >> >> FYI, my support ticket is still open, and GSS are searching to find the >> cause of the problem. In the mean time, they suggested that I start >> corosync with -p option and see if that changes anything. 
>> >> I wanted to know how to do that since it's cman that does start corosync ? >> > > cman_tool join is called in /etc/rc.d/init.d/cman I believe. Add a -P > option to it. > > Regards > -steve Where is "-P" option under cman_tool manpage?? I didn't see it. Appears "-S", "-X", "-A", "-D" ... but not -P ... Is it correct to put this option under /etc/sysconfig/cman config file on RHEL6?? Thanks. -- CL Martinez carlopmart {at} gmail {d0t} com From rossnick-lists at cybercat.ca Wed Jun 1 23:27:50 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Wed, 01 Jun 2011 19:27:50 -0400 Subject: [Linux-cluster] Corosync goes cpu to 95-99% In-Reply-To: <4DE69786.5010204@gmail.com> References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com> <4DD2BD7D.5070704@gmail.com> <4DD2CA90.6090802@redhat.com> <3B50BA7445114813AE429BEE51A2BA52@versa> <4DD78908.2030801@gmail.com> <0B1965C8-9807-42B6-9453-01BE0C0B1DCB@cybercat.ca><4DD80D5D.10004@gmail.com> <4DD873C7.8080402@cybercat.ca> <22E7D11CD5E64E338A66811F31F06238@versa> <4DE545D7.1080703@redhat.com> <4DE69786.5010204@gmail.com> Message-ID: <4DE6CAF6.4000002@cybercat.ca> >> >> cman_tool join is called in /etc/rc.d/init.d/cman I believe. Add a -P >> option to it. >> >> Regards >> -steve > > Where is "-P" option under cman_tool manpage?? I didn't see it. Appears > "-S", "-X", "-A", "-D" ... but not -P ... > > Is it correct to put this option under /etc/sysconfig/cman config file > on RHEL6?? I had to modify my /etc/rc.d/init.d/cman script on each node and add -P (undocumented) at line 500, after $cman_join_opts And it did not solve the problem, but it help verry little bit to aliviate it. While a node is experiencing it, it's still not usable by ssh, but response time to service seems a very little better, barely noticable. GSS asked me today to produce a core dump of corosync while it's eating up CPU. Regards, From carlopmart at gmail.com Thu Jun 2 09:21:06 2011 From: carlopmart at gmail.com (carlopmart) Date: Thu, 02 Jun 2011 11:21:06 +0200 Subject: [Linux-cluster] Corosync goes cpu to 95-99% In-Reply-To: <4DE6CAF6.4000002@cybercat.ca> References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com> <4DD2BD7D.5070704@gmail.com> <4DD2CA90.6090802@redhat.com> <3B50BA7445114813AE429BEE51A2BA52@versa> <4DD78908.2030801@gmail.com> <0B1965C8-9807-42B6-9453-01BE0C0B1DCB@cybercat.ca><4DD80D5D.10004@gmail.com> <4DD873C7.8080402@cybercat.ca> <22E7D11CD5E64E338A66811F31F06238@versa> <4DE545D7.1080703@redhat.com> <4DE69786.5010204@gmail.com> <4DE6CAF6.4000002@cybercat.ca> Message-ID: <4DE75602.1000408@gmail.com> On 06/02/2011 01:27 AM, Nicolas Ross wrote: > >>> >>> cman_tool join is called in /etc/rc.d/init.d/cman I believe. Add a -P >>> option to it. >>> >>> Regards >>> -steve >> >> Where is "-P" option under cman_tool manpage?? I didn't see it. Appears >> "-S", "-X", "-A", "-D" ... but not -P ... >> >> Is it correct to put this option under /etc/sysconfig/cman config file >> on RHEL6?? > > I had to modify my /etc/rc.d/init.d/cman script on each node and add -P > (undocumented) at line 500, after $cman_join_opts > > And it did not solve the problem, but it help verry little bit to > aliviate it. While a node is experiencing it, it's still not usable by > ssh, but response time to service seems a very little better, barely > noticable. > > GSS asked me today to produce a core dump of corosync while it's eating > up CPU. > > Regards, > Oops .. Bad, bad, very bad news, almost for me. 
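A rough sketch of how the core dump GSS asked for can be captured without stopping the daemon, assuming the gdb package is installed so that gcore is available; the output path here is only a placeholder:

    # show which corosync thread is burning CPU
    top -H -b -n 1 -p $(pidof corosync)
    # write a core file of the running process; produces /tmp/corosync.<pid>
    gcore -o /tmp/corosync $(pidof corosync)

gcore attaches via ptrace, so the process is only paused for a moment rather than killed.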
Nicolas, I have found the option to pass "-p" to corosync without modifying cman startup script. In /etc/sysconfig/cman config file, I have put a line with this: CMAN_JOIN_OPTS="-P" .. and it works ok. [root at rhelnode01 sysconfig]# ps xa |grep corosync 1033 ? SLsl 0:00 corosync -f -p 1494 pts/1 S+ 0:00 grep corosync I will do some tests with two nodes, but I think RHEL 6.x is not yet ready for production environments, at least not RHCS. -- CL Martinez carlopmart {at} gmail {d0t} com
From ajb2 at mssl.ucl.ac.uk Thu Jun 2 09:34:43 2011 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Thu, 02 Jun 2011 10:34:43 +0100 Subject: [Linux-cluster] defragmentation..... Message-ID: <4DE75933.4030302@mssl.ucl.ac.uk> GFS2 seems horribly prone to fragmentation. I have a filesystem which has been written to once (data archive, migrated from a GFS1 filesystem to a clean GFS2 fs) and a lot of the files are composed of hundreds of extents - most of these are only 1-2Mb so this is a bit over the top and it badly affects backup performance. Has there been any progress on tools to help with this kind of problem? Alan
From swhiteho at redhat.com Thu Jun 2 09:46:51 2011 From: swhiteho at redhat.com (Steven Whitehouse) Date: Thu, 02 Jun 2011 10:46:51 +0100 Subject: [Linux-cluster] defragmentation..... In-Reply-To: <4DE75933.4030302@mssl.ucl.ac.uk> References: <4DE75933.4030302@mssl.ucl.ac.uk> Message-ID: <1307008011.2823.22.camel@menhir> Hi, On Thu, 2011-06-02 at 10:34 +0100, Alan Brown wrote: > GFS2 seems horribly prone to fragmentation. > > I have a filesystem which has been written to once (data archive, > migrated from a GFS1 filesystem to a clean GFS2 fs) and a lot of the > files are composed of hundreds of extents - most of these are only 1-2Mb > so this is a bit over the top and it badly affects backup performance. > > Has there been any progress on tools to help with this kind of problem? > > Alan > The thing to check is what size the extents are... the on-disk layout is designed so that you should have a metadata block separating each data extent at exactly the place where we would need to read a new metadata block in order to continue reading the file in a streaming fashion. That means on a 4k block size filesystem, the data extents are usually around 509 blocks in length, and if you see a number of these with (mostly) a single metadata block between them (sometimes more if the height of the metadata tree grows) then that is the expected layout. Fragmentation tends to be more of an issue with directories than with regular files, and that is something that we are looking into at the moment, Steve.
From ajb2 at mssl.ucl.ac.uk Thu Jun 2 10:47:32 2011 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Thu, 02 Jun 2011 11:47:32 +0100 Subject: [Linux-cluster] defragmentation..... In-Reply-To: <1307008011.2823.22.camel@menhir> References: <4DE75933.4030302@mssl.ucl.ac.uk> <1307008011.2823.22.camel@menhir> Message-ID: <4DE76A44.2000902@mssl.ucl.ac.uk> Steven Whitehouse wrote: > The thing to check is what size the extents are... filefrag doesn't show this. > the on-disk layout is > designed so that you should have a metadata block separating each data > extent at exactly the place where we would need to read a new metadata > block in order to continue reading the file in a streaming fashion.
> > That means on a 4k block size filesystem, the data extents are usually > around 509 blocks in length, and if you see a number of these with > (mostly) a single metadata block between them (sometimes more if the > height of the metadata tree grows) then that is the expected layout. 4k*509 = 2024k - most of these files are 800-1010k (there isn't a file on this FS larger than 2Mb) I've just taken one directory (225 entries, all 880-900k), copied each file and moved the copy back to the original spot. Filefrag says they're now 1-3 extents (50% 1 extent, 30% 2 extents) This filesystem is 700G and was originally populated in a single rsync pass. Filesystem Size Used Avail Use% Mounted on /dev/mapper/VolGroupBeast03-LogVolSarch01--GFS2 700G 660G 41G 95% /stage/sarch01 Filesystem Inodes IUsed IFree IUse% Mounted on /dev/mapper/VolGroupBeast03-LogVolSarch01--GFS2 13072686 2542375 10530311 20% /stage/sarch01 I'd understand if the last files written were like this, but it's right across the entire FS. From ajb2 at mssl.ucl.ac.uk Thu Jun 2 10:58:10 2011 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Thu, 02 Jun 2011 11:58:10 +0100 Subject: [Linux-cluster] defragmentation..... In-Reply-To: <1307008011.2823.22.camel@menhir> References: <4DE75933.4030302@mssl.ucl.ac.uk> <1307008011.2823.22.camel@menhir> Message-ID: <4DE76CC2.8010201@mssl.ucl.ac.uk> This is interesting too. note the variation in extents (the file is a piece of marketing fluff, name is unimportant) $ df -h . Filesystem Size Used Avail Use% Mounted on /dev/mapper/VolGroupBeast03-LogVolUser1 250G 113G 138G 45% /stage/user1 $ ls -l SUMO-SATA-Competitive-Positioning-v1.ppt -rw-r--r-- 1 ajb2 computing 3746304 Nov 8 2007 SUMO-SATA-Competitive-Positioning-v1.ppt $ rsync SUMO-SATA-Competitive-Positioning-v1.ppt SUMO-SATA-Competitive-Positioning-v1.ppt.new $ filefrag SUMO-SATA-Competitive-Positioning-v1.ppt* SUMO-SATA-Competitive-Positioning-v1.ppt: 915 extents found SUMO-SATA-Competitive-Positioning-v1.ppt.new: 153 extents found $ rsync SUMO-SATA-Competitive-Positioning-v1.ppt SUMO-SATA-Competitive-Positioning-v1.ppt.new $ filefrag SUMO-SATA-Competitive-Positioning-v1.ppt* SUMO-SATA-Competitive-Positioning-v1.ppt: 915 extents found SUMO-SATA-Competitive-Positioning-v1.ppt.new: 73 extents found $ rsync SUMO-SATA-Competitive-Positioning-v1.ppt SUMO-SATA-Competitive-Positioning-v1.ppt.new $ filefrag SUMO-SATA-Competitive-Positioning-v1.ppt* SUMO-SATA-Competitive-Positioning-v1.ppt: 915 extents found SUMO-SATA-Competitive-Positioning-v1.ppt.new: 12 extents found $ rsync SUMO-SATA-Competitive-Positioning-v1.ppt SUMO-SATA-Competitive-Positioning-v1.ppt.new $ filefrag SUMO-SATA-Competitive-Positioning-v1.ppt* SUMO-SATA-Competitive-Positioning-v1.ppt: 915 extents found SUMO-SATA-Competitive-Positioning-v1.ppt.new: 16 extents found $ rsync SUMO-SATA-Competitive-Positioning-v1.ppt SUMO-SATA-Competitive-Positioning-v1.ppt.new $ filefrag SUMO-SATA-Competitive-Positioning-v1.ppt* SUMO-SATA-Competitive-Positioning-v1.ppt: 915 extents found SUMO-SATA-Competitive-Positioning-v1.ppt.new: 9 extents found $ rsync SUMO-SATA-Competitive-Positioning-v1.ppt SUMO-SATA-Competitive-Positioning-v1.ppt.new $ filefrag SUMO-SATA-Competitive-Positioning-v1.ppt* SUMO-SATA-Competitive-Positioning-v1.ppt: 915 extents found SUMO-SATA-Competitive-Positioning-v1.ppt.new: 16 extents found $ cp SUMO-SATA-Competitive-Positioning-v1.ppt SUMO-SATA-Competitive-Positioning-v1.ppt.new cp: overwrite `SUMO-SATA-Competitive-Positioning-v1.ppt.new'? 
y $ filefrag SUMO-SATA-Competitive-Positioning-v1.ppt* SUMO-SATA-Competitive-Positioning-v1.ppt: 915 extents found SUMO-SATA-Competitive-Positioning-v1.ppt.new: 5 extents found $ cp SUMO-SATA-Competitive-Positioning-v1.ppt SUMO-SATA-Competitive-Positioning-v1.ppt.new cp: overwrite `SUMO-SATA-Competitive-Positioning-v1.ppt.new'? y $ filefrag SUMO-SATA-Competitive-Positioning-v1.ppt* SUMO-SATA-Competitive-Positioning-v1.ppt: 915 extents found SUMO-SATA-Competitive-Positioning-v1.ppt.new: 16 extents found All these commands were executed in a 30 second period. From swhiteho at redhat.com Thu Jun 2 11:03:39 2011 From: swhiteho at redhat.com (Steven Whitehouse) Date: Thu, 02 Jun 2011 12:03:39 +0100 Subject: [Linux-cluster] defragmentation..... In-Reply-To: <4DE76A44.2000902@mssl.ucl.ac.uk> References: <4DE75933.4030302@mssl.ucl.ac.uk> <1307008011.2823.22.camel@menhir> <4DE76A44.2000902@mssl.ucl.ac.uk> Message-ID: <1307012619.2823.31.camel@menhir> Hi, On Thu, 2011-06-02 at 11:47 +0100, Alan Brown wrote: > Steven Whitehouse wrote: > > > The thing to check is what size the extents are... > > filefrag doesn't show this. > Yes it does. You need the -v flag > > the on-disk layout is > > designed so that you should have a metadata block separating each data > > extent at exactly the place where we would need to read a new metadata > > block in order to continue reading the file in a streaming fashion. > > > > That means on a 4k block size filesystem, the data extents are usually > > around 509 blocks in length, and if you see a number of these with > > (mostly) a single metadata block between them (sometimes more if the > > height of the metadata tree grows) then that is the expected layout. > > 4k*509 = 2024k - most of these files are 800-1010k (there isn't a file > on this FS larger than 2Mb) > > I've just taken one directory (225 entries, all 880-900k), copied each > file and moved the copy back to the original spot. > > Filefrag says they're now 1-3 extents (50% 1 extent, 30% 2 extents) > That doesn't sound too unreasonable to me. Usually the best way to defrag is simply to copy the file elsewhere and copy it back as you've done. That is why there is no specific tool to do this. > This filesystem is 700G and was originally populated in a single rsync pass. > > Filesystem Size Used Avail Use% Mounted on > /dev/mapper/VolGroupBeast03-LogVolSarch01--GFS2 > 700G 660G 41G 95% /stage/sarch01 > > Filesystem Inodes IUsed IFree IUse% Mounted on > /dev/mapper/VolGroupBeast03-LogVolSarch01--GFS2 > 13072686 2542375 10530311 20% /stage/sarch01 > > I'd understand if the last files written were like this, but it's right > across the entire FS. > > If rsync is writing only a single file at a time, it should be pretty good wrt to fragmentation. If it is trying to write multiple files at the same time, bit by bit, then that is the kind of thing which might increase fragmentation a bit depending on the exact pattern in this case, Steve. From ajb2 at mssl.ucl.ac.uk Thu Jun 2 11:50:33 2011 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Thu, 02 Jun 2011 12:50:33 +0100 Subject: [Linux-cluster] defragmentation..... In-Reply-To: <4DE76CC2.8010201@mssl.ucl.ac.uk> References: <4DE75933.4030302@mssl.ucl.ac.uk> <1307008011.2823.22.camel@menhir> <4DE76CC2.8010201@mssl.ucl.ac.uk> Message-ID: <4DE77909.6000405@mssl.ucl.ac.uk> Alan Brown wrote: > This is interesting too. 
note the variation in extents (the file is a > piece of marketing fluff, name is unimportant) I'm getting the same thing in sarch01 and that's mounted read-only by the clients - there's zero write activity going on. From rossnick-lists at cybercat.ca Thu Jun 2 13:42:39 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Thu, 2 Jun 2011 09:42:39 -0400 Subject: [Linux-cluster] Corosync goes cpu to 95-99% References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com> <4DD2BD7D.5070704@gmail.com> <4DD2CA90.6090802@redhat.com> <3B50BA7445114813AE429BEE51A2BA52@versa> <4DD78908.2030801@gmail.com> <0B1965C8-9807-42B6-9453-01BE0C0B1DCB@cybercat.ca><4DD80D5D.10004@gmail.com> <4DD873C7.8080402@cybercat.ca> <22E7D11CD5E64E338A66811F31F06238@versa> <4DE545D7.1080703@redhat.com> <4DE69786.5010204@gmail.com><4DE6CAF6.4000002@cybercat.ca> <4DE75602.1000408@gmail.com> Message-ID: <51BB988BCCF547E69BF222BDAF34C4DE@versa> > Oops .. Bad, bad, very bad news, almost for me. Nicolas, I have found the > option to pass "-p" to corosync without modifying cman startup script. In > /etc/sysconfig/cman config file, I have put a line with this: > > CMAN_JOIN_OPTS="-P" > > .. and works ok. > > [root at rhelnode01 sysconfig]# ps xa |grep corosync > 1033 ? SLsl 0:00 corosync -f -p > 1494 pts/1 S+ 0:00 grep corosync > > I will do some tests with two nodes, But I think RHEL6.x is not yet ready > for production environments, almost RHCS. Thanks for that, that'll prevent me from modifying a system file... And yes, I find it a little disapointing. We're now at 6.1, and our setup is exactly what RHCS was designed for... A GFS over fiber, httpd running content from that gfs... From swap_project at yahoo.com Thu Jun 2 15:37:07 2011 From: swap_project at yahoo.com (Srija) Date: Thu, 2 Jun 2011 08:37:07 -0700 (PDT) Subject: [Linux-cluster] Cluster environment issue In-Reply-To: <9537f2a4eb5ae1c11038deed2e3fe40f@mx.varna.net> Message-ID: <748430.45768.qm@web112815.mail.gq1.yahoo.com> Thank you so much for your reply again. --- On Tue, 5/31/11, Kaloyan Kovachev wrote: Thanks for your reply again. > > If it is a switch restart you will have in your logs the > interface going > down/up, but more problematic is to find a short drop of > the multicast I checked all nodes did not find anything about interface, but in all the nodes it is reporting that server19(node 12) /server18 (node 11) is the problematic, here I am mentioning the logs from three nodes (out of 16 nodes) May 24 18:04:59 server7 openais[6113]: [TOTEM] entering GATHER state from 12. May 24 18:05:01 server7 crond[5068]: (root) CMD ( /opt/hp/hp-health/bin/check-for-restart-requests) May 24 18:05:19 server7 openais[6113]: [TOTEM] entering GATHER state from 11. May 24 18:04:59 server1 openais[6148]: [TOTEM] entering GATHER state from 12. May 24 18:05:01 server1 crond[2275]: (root) CMD ( /opt/hp/hp-health/bin/check-for-restart-requests) May 24 18:05:19 server1 openais[6148]: [TOTEM] entering GATHER state from 11. May 24 18:04:59 server8 openais[6279]: [TOTEM] entering GATHER state from 12. May 24 18:05:01 server8 crond[11125]: (root) CMD ( /opt/hp/hp-health/bin/check-for-restart-requests) May 24 18:05:19 server8 openais[6279]: [TOTEM] entering GATHER state from 11. Here is some lines from node12 , at the same time ___________________________________________________ May 24 18:04:59 server19 openais[5950]: [TOTEM] The token was lost in the OPERATIONAL state. May 24 18:04:59 server19 openais[5950]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes). 
May 24 18:04:59 server19 openais[5950]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes). May 24 18:04:59 server19 openais[5950]: [TOTEM] entering GATHER state from 2. May 24 18:05:19 server19 openais[5950]: [TOTEM] entering GATHER state from 11. May 24 18:05:20 server19 openais[5950]: [TOTEM] Saving state aru 39a8f high seq received 39a8f May 24 18:05:20 server19 openais[5950]: [TOTEM] Storing new sequence id for ring 2af0 May 24 18:05:20 server19 openais[5950]: [TOTEM] entering COMMIT state. May 24 18:05:20 server19 openais[5950]: [TOTEM] entering RECOVERY state. Here is few lines on node11 ie server18 ------------------------------------------ ay 24 18:04:48 server18 May 24 18:10:14 server18 syslog-ng[5619]: syslog-ng starting up; version='2.0.10' May 24 18:10:14 server18 Bootdata ok (command line is ro root=/dev/vgroot_xen/lvroot rhgb quiet) So it seems that node11 is rebooting just after few mintues we get all the problems in the logs of all nodes. > You may ask the network people to check for STP changes and > double check > the multicast configuration and you may also try to use > broadcast instead > of multicast or use a dedicated switch. As per the dedicated switch, I don't think it is possible as per the network team. I asked the STP chanes related. their answer is "there are no stp changes for the private network as there are no redundant devices in the environment. the multicast configs is igmp snooping with Pim" I have talked to the network team for using the broadcast instead of multicast, as per them , they can set.. Pl. comment on this... > your interface and multicast address) > ??? ping -I ethX -b -L 239.x.x.x -c 1 > and finaly run this script until the cluster gets broken Yes , I have checked it , it is working fine now. I have also set a cron for this script and set in one node. I have few questions regarding the cluster configuration ... - We are using clvm in the cluster environment. As I understand it is active-active. The environment is xen . all the xen hosts are in the cluster and each host have the guests. We are keeping the options to live migrate the guests from one host to another. - I was looking into the redhat knowledgebase https://access.redhat.com/kb/docs/DOC-3068, as per the document , what do you think using CLVM or HA-LVM will be the best choice? Pl. advice. Thanks and regards again. From bergman at merctech.com Thu Jun 2 20:05:04 2011 From: bergman at merctech.com (bergman at merctech.com) Date: Thu, 02 Jun 2011 16:05:04 -0400 Subject: [Linux-cluster] recommended method for changing quorum device In-Reply-To: Your message of "Tue, 31 May 2011 22:22:44 +0200." <215272920.3337.1306873364406.JavaMail.root@axgroupware01-1.gallien.atix> References: <215272920.3337.1306873364406.JavaMail.root@axgroupware01-1.gallien.atix> Message-ID: <2865.1307045104@localhost> In the message dated: Tue, 31 May 2011 22:22:44 +0200, The pithy ruminations from Mark Hlawatschek on were: => Mark, => => without guarantee ;-) I believe that the following method should work: Thanks for the suggestion. Here's what I did: => => 1. make sure that all 3 nodes are running and part of the cluster Yes. 1a. Decrement the number of expected votes to the expected quorum value without a quorum disk (for a 3-node cluster): cman_tool expected -e 2 1b. Change the cluster config to remove the quorum disk and decrease the number of expected votes to 2; then run "ccs_tool update" "clustat" shows the old quorum device as being "offline" the cluster remains quorate => 2. 
stop qdiskd on all nodes (#service qdiskd stop) Yes. => 3. create new quorum disk (#mkqdisk ...) Yes. => 4. modify cluster.conf => 5. #ccs_tool update /etc/cluster/cluster.conf Yes. Modified to use the new quorum disk. Did NOT change the expected number of votes back to 5. The cluster remains quorate. At this point, "mkqdisk -L" shows two quorum devices. => 6. start qdiskd on all nodes (#service qdiskd start) Yes. At this point, "cman_tool status" shows 2 votes from the quorum disk (5 votes total, 2 needed for quorum). 6a. Modify the cluster config to use the new quorum disk and to use the previous number of expected votes (3, to allow the 3-node cluster to function with 1 node + the quorum device). The cluster remains quorate. The expected number of votes is 3, the actual number of votes is 5. ---------------------------------------------------------------- The good news: No errors, no sudden cluster failures. However, "clustat" shows the path to the old quorum device, and doesn't show the new disk. The [old] quorum disk is shown as being "Online". Running "qdiskd -f -d" shows that the quorum device is functioning (hueristic checks, etc.), but doesn't give information about which device is being used. Running: strace -o /tmp/qdisk.strace -f /usr/sbin/qdiskd -d -f and examining the system calls shows that the new quorum device is in use. So, aside from the incorrect information from "clustat", it looks like the change in quorum device was successful. Now the old array hardware can continue failing. :) Thanks, Mark => => Kind regards, => Mark => => => ----- bergman at merctech.com wrote: => => > I've got a 3-node RHCS cluster and the quorum device is on a SAN disk => > array that needs to be replaced. The relevent versions are: => > => > CentOS 5.6 (2.6.18-238.9.1.el5) => > openais-0.80.6-28.el5_6.1 => > cman-2.0.115-68.el5_6.3 => > rgmanager-2.0.52-9.el5.centos.1 => > => > => > Currently the cluster is configured with each node having one vote => > and => > the quorum device having 2 votes, to allow operation in the event of => > multiple node failures. => > => > I'd like to know if there's any recommended method for changing the => > quorum disk "in place", without shutting down the cluster. => > => > The following approaches come to mind: => > => > 1. Create a new quorum device (multipath, mkqdisk). => > => > Ensure that at least 2 of the 3 nodes are up. => > => > Change the cluster configuration to use the new path to => > the new device instead of the old device. => > => > Commit the change to the cluster. => > => > 2. Create a new quorum device (multipath, mkqdisk). => > => > Ensure that at least 2 of the 3 nodes are up. => > => > Change the cluster configuration to not use any quorum => > device. => > => > Commit the change to the cluster. => > => > Change the cluster configuration to use the new quorum => > device. => > => > Commit the change to the cluster. => > => > 3. Create a new quorum device (multipath, mkqdisk). => > => > Change the cluster configuration to use both quorum => > devices. => > => > Commit the change to the cluster. => > => > -------------------------------------------------- => > Note: the 'mkqdisk' manual page (dated July 2006) => > states: => > using multiple different devices is currently => > not supported => > Is that still accurate? => > -------------------------------------------------- => > => > Change the cluster configuration to use just the => > new quorum device instead of the old device. => > => > Commit the change to the cluster. 
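Condensed into commands, the sequence described above looks roughly like the following on a RHEL/CentOS 5 cluster; the device path and label are placeholders, and each ccs_tool update assumes cluster.conf has already been edited and its config_version bumped:

    # keep the 3-node cluster quorate without the quorum disk (2 votes needed)
    cman_tool expected -e 2
    # push the cluster.conf that no longer references the old quorum disk
    ccs_tool update /etc/cluster/cluster.conf
    # on every node
    service qdiskd stop
    # label the replacement device (device and label are placeholders)
    mkqdisk -c /dev/mapper/new-qdisk -l NEWQUORUM
    # push the cluster.conf that references the new label, then restart qdiskd on every node
    ccs_tool update /etc/cluster/cluster.conf
    service qdiskd start
    # sanity checks
    mkqdisk -L
    cman_tool status
    clustat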
=> > => > Thanks for any suggestions. => > => > Mark => > => > -- => > Linux-cluster mailing list => > Linux-cluster at redhat.com => > https://www.redhat.com/mailman/listinfo/linux-cluster => => -- => Mark Hlawatschek => => ATIX Informationstechnologie und Consulting AG | Einsteinstrasse 10 | => 85716 Unterschleissheim | www.atix.de => => http://www.linux-subscriptions.com => From kkovachev at varna.net Fri Jun 3 08:48:31 2011 From: kkovachev at varna.net (Kaloyan Kovachev) Date: Fri, 03 Jun 2011 11:48:31 +0300 Subject: [Linux-cluster] Cluster environment issue In-Reply-To: <748430.45768.qm@web112815.mail.gq1.yahoo.com> References: <748430.45768.qm@web112815.mail.gq1.yahoo.com> Message-ID: Hi, On Thu, 2 Jun 2011 08:37:07 -0700 (PDT), Srija wrote: > Thank you so much for your reply again. > > --- On Tue, 5/31/11, Kaloyan Kovachev wrote: > Thanks for your reply again. > > > > >> If it is a switch restart you will have in your logs the >> interface going >> down/up, but more problematic is to find a short drop of >> the multicast > > I checked all nodes did not find anything about interface, but in all the > nodes it is reporting that server19(node 12) /server18 (node 11) is the > problematic, here I am mentioning the logs from three nodes (out of 16 > nodes) > > May 24 18:04:59 server7 openais[6113]: [TOTEM] entering GATHER state > from 12. > May 24 18:05:01 server7 crond[5068]: (root) CMD ( > /opt/hp/hp-health/bin/check-for-restart-requests) > May 24 18:05:19 server7 openais[6113]: [TOTEM] entering GATHER state > from 11. > > May 24 18:04:59 server1 openais[6148]: [TOTEM] entering GATHER state > from 12. > May 24 18:05:01 server1 crond[2275]: (root) CMD ( > /opt/hp/hp-health/bin/check-for-restart-requests) > May 24 18:05:19 server1 openais[6148]: [TOTEM] entering GATHER state > from 11. > > May 24 18:04:59 server8 openais[6279]: [TOTEM] entering GATHER state > from 12. > May 24 18:05:01 server8 crond[11125]: (root) CMD ( > /opt/hp/hp-health/bin/check-for-restart-requests) > May 24 18:05:19 server8 openais[6279]: [TOTEM] entering GATHER state > from 11. > > > Here is some lines from node12 , at the same time > ___________________________________________________ > > > May 24 18:04:59 server19 openais[5950]: [TOTEM] The token was lost in the > OPERATIONAL state. > May 24 18:04:59 server19 openais[5950]: [TOTEM] Receive multicast socket > recv buffer size (320000 bytes). > May 24 18:04:59 server19 openais[5950]: [TOTEM] Transmit multicast socket > send buffer size (262142 bytes). > May 24 18:04:59 server19 openais[5950]: [TOTEM] entering GATHER state from > 2. > May 24 18:05:19 server19 openais[5950]: [TOTEM] entering GATHER state from > 11. > May 24 18:05:20 server19 openais[5950]: [TOTEM] Saving state aru 39a8f > high seq received 39a8f > May 24 18:05:20 server19 openais[5950]: [TOTEM] Storing new sequence id > for ring 2af0 > May 24 18:05:20 server19 openais[5950]: [TOTEM] entering COMMIT state. > May 24 18:05:20 server19 openais[5950]: [TOTEM] entering RECOVERY state. > > > Here is few lines on node11 ie server18 > ------------------------------------------ > > ay 24 18:04:48 server18 > May 24 18:10:14 server18 syslog-ng[5619]: syslog-ng starting up; > version='2.0.10' > May 24 18:10:14 server18 Bootdata ok (command line is ro > root=/dev/vgroot_xen/lvroot rhgb quiet) > > > So it seems that node11 is rebooting just after few mintues we get all > the problems in the logs of all nodes. 
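The multicast check that comes up again just below (the ping test Kaloyan had suggested earlier) can be left running in a loop so the exact moment of a multicast drop gets logged; a minimal sketch, where the interface and group address are placeholders that must match the cluster's totem settings, and which assumes the other nodes answer multicast echo requests:

    #!/bin/bash
    # log a timestamp as soon as a 1-packet multicast ping stops being answered
    IFACE=eth1            # placeholder: cluster interconnect interface
    GROUP=239.192.0.1     # placeholder: totem multicast address from cluster.conf
    while ping -I "$IFACE" -L -b -c 1 -w 2 "$GROUP" >/dev/null 2>&1; do
        sleep 1
    done
    echo "$(date) multicast to $GROUP on $IFACE lost" >> /var/log/mcast-watch.log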
> > > > You may ask the network people to check for STP changes and >> double check >> the multicast configuration and you may also try to use >> broadcast instead >> of multicast or use a dedicated switch. > > As per the dedicated switch, I don't think it is possible as per the > network team. I asked the STP chanes related. their answer is > > "there are no stp changes for the private network as there are no > redundant devices in the environment. the multicast configs is igmp > snooping with Pim" > > I have talked to the network team for using the broadcast instead of > multicast, as per them , they can set.. > > Pl. comment on this... > to use broadcast (if private addresses are in the same VLAN/subnet) you just need to set it in cluster.conf - cman section, but not sure if it can be done on a running cluster (without stopping or braking it) > > your interface and multicast address) >> ping -I ethX -b -L 239.x.x.x -c 1 >> and finaly run this script until the cluster gets broken > > Yes , I have checked it , it is working fine now. I have also set a cron > for this script and set in one node. no need for cron if you haven't changed the script - this will start several processes and your network will be overloaded !!! the script was made to run on a console (or via screen) and it will exit _only_ when multicast is lost > > I have few questions regarding the cluster configuration ... > > > - We are using clvm in the cluster environment. As I understand it > is active-active. > The environment is xen . all the xen hosts are in the cluster and > each host have > the guests. We are keeping the options to live migrate the guests > from one host to another. > > - I was looking into the redhat knowledgebase > https://access.redhat.com/kb/docs/DOC-3068, > as per the document , what do you think using CLVM or HA-LVM will be > the best choice? > > Pl. advice. can't comment on this sorry > > > Thanks and regards again. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From jpeteb at gmail.com Fri Jun 3 14:01:53 2011 From: jpeteb at gmail.com (Pete) Date: Fri, 3 Jun 2011 10:01:53 -0400 Subject: [Linux-cluster] Some nodes not starting groupd correctly Message-ID: Hello, I have a startup issue with a cluster that we've set up. We have 34 HP G7 servers running in a cluster to share one SAN resource, a HP (Lefthand) P4500. All the servers are running RHEL 5.4. When we reboot the cluster, a small, random number of nodes will not mount the SAN. On inspection, the failing nodes are members of the cluster (looking at clustat). When I run a "service cman status" on them, they say that groupd is not running. I'm assuming that because of this, clvmd does not run correctly (I see a "clvmd: Can't open cluster manager socket: No such file or directory" in the messages log), so no SAN VG and no SAN mount. If I do a "service cman restart; service clvmd restart; mount -a" the SAN will mount correctly. I've created a sample cluster.conf below. It only contains 4 nodes, but it is identical to the 34 node system. We use IPMI for the fencing, as the HP G7 systems are iLO3, and we could not get fence_ilo to work with them. Any help is appreciated - thanks! --pete From swap_project at yahoo.com Fri Jun 3 15:27:58 2011 From: swap_project at yahoo.com (Srija) Date: Fri, 3 Jun 2011 08:27:58 -0700 (PDT) Subject: [Linux-cluster] Cluster environment issue In-Reply-To: Message-ID: <807701.11474.qm@web112805.mail.gq1.yahoo.com> Thanks for your reply. 
--- On Fri, 6/3/11, Kaloyan Kovachev wrote: > > to use broadcast (if private addresses are in the same > VLAN/subnet) you > just need to set it in cluster.conf - cman section, but not > sure if it can > be done on a running cluster (without stopping or braking > it) Yes all the ips are in the same vlan. I will test it in the lab with the 3 nodes cluster. If I want to check the difference between multicast setting and broadcast setting, how to test ? My plan is, already the test environment is set with multicast. I will test it. Then I will change the cluster.conf with broadcast setting then test. Pl. let me know. Thanks and regards. From mkathuria at tuxtechnologies.co.in Mon Jun 6 07:53:55 2011 From: mkathuria at tuxtechnologies.co.in (Manish Kathuria) Date: Mon, 6 Jun 2011 13:23:55 +0530 Subject: [Linux-cluster] Fencing Issues: fence_node fails but fence_ipmilan works Message-ID: I am facing a strange problem configuring a two node cluster using RHCS 4.8. Both nodes are HP Proliant DL 180 G6 servers using HP LO 100i (IPMI Based). When I run the fence_node command to check the fence device configuration for either of the nodes, it fails giving the following message in the logs: fence_node[nnnn]: Fence of "node1" was unsuccessful fence_node[nnnn]: Fence of "node2" was unsuccessful However, when I run the fence_impilan command using the same credentials, it executes successfully and is able to switch on, off and reboot the nodes. The cluster configuration for the fence devices is: IPMI Lan Type Name: lo1 IP Address: 172.16.1.x Login admin Password passone Auth Type password Name: lo2 IP Address: 172.16.1.y Login admin Password passtwo Auth Type password I have already tried different options for Auth Type (blank, password, md5). Have also tried using / not using lanplus for both the fence devices in the Manage Fencing dialog without success. Any suggestions ? Thanks, Manish Kathuria From sakect at gmail.com Mon Jun 6 10:47:03 2011 From: sakect at gmail.com (POWERBALL ONLINE) Date: Mon, 6 Jun 2011 17:47:03 +0700 Subject: [Linux-cluster] Fencing Issues: fence_node fails but fence_ipmilan works In-Reply-To: References: Message-ID: Please give me the cluster.conf file On Mon, Jun 6, 2011 at 2:53 PM, Manish Kathuria < mkathuria at tuxtechnologies.co.in> wrote: > I am facing a strange problem configuring a two node cluster using > RHCS 4.8. Both nodes are HP Proliant DL 180 G6 servers using HP LO > 100i (IPMI Based). > > When I run the fence_node command to check the fence device > configuration for either of the nodes, it fails giving the following > message in the logs: > > fence_node[nnnn]: Fence of "node1" was unsuccessful > fence_node[nnnn]: Fence of "node2" was unsuccessful > > However, when I run the fence_impilan command using the same > credentials, it executes successfully and is able to switch on, off > and reboot the nodes. The cluster configuration for the fence devices > is: > > IPMI Lan Type > Name: lo1 > IP Address: 172.16.1.x > Login admin > Password passone > Auth Type password > > Name: lo2 > IP Address: 172.16.1.y > Login admin > Password passtwo > Auth Type password > > I have already tried different options for Auth Type (blank, password, > md5). Have also tried using / not using lanplus for both the fence > devices in the Manage Fencing dialog without success. > > Any suggestions ? 
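One way to narrow this kind of problem down is to drive the fence agent by hand with exactly the values configured in cluster.conf; a rough sketch using the addresses and credentials from the report above (whether -P/lanplus and which -A auth type are needed depends on the BMC firmware):

    # ask the agent itself for the power state, verbosely
    fence_ipmilan -a 172.16.1.x -l admin -p passone -A password -o status -v
    # cross-check with ipmitool to rule the agent in or out
    ipmitool -I lanplus -H 172.16.1.x -U admin -P passone chassis power status

If both work but fence_node still fails, the mismatch is usually between the node name in the clusternode/fence block and what fence_node is being asked to fence.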
> > Thanks, > > Manish Kathuria > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From RMartinez-Sanchez at nds.com Mon Jun 6 13:52:47 2011 From: RMartinez-Sanchez at nds.com (Martinez-Sanchez, Raul) Date: Mon, 6 Jun 2011 14:52:47 +0100 Subject: [Linux-cluster] Fencing Issues: fence_node fails but fence_ipmilan works In-Reply-To: References: Message-ID: <7370F6F5ED3B874F988F5CE657D801EA13A9309268@UKMA1.UK.NDS.COM> Hi, I had a similar issue on an IBM M3 system although it was on a rhel5u6. The way we got it fix was by changing the ipmilam configuration *location*(attribute lanplus="1" in device element to fencedevice element) in the cluster.conf file, I am not sure if this would be relevant to you but just in case .... I.E. Regards, Ra?l Mart?nez S?nchez From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of POWERBALL ONLINE Sent: Monday, June 06, 2011 11:47 AM To: linux clustering Subject: Re: [Linux-cluster] Fencing Issues: fence_node fails but fence_ipmilan works Please give me the cluster.conf file On Mon, Jun 6, 2011 at 2:53 PM, Manish Kathuria > wrote: I am facing a strange problem configuring a two node cluster using RHCS 4.8. Both nodes are HP Proliant DL 180 G6 servers using HP LO 100i (IPMI Based). When I run the fence_node command to check the fence device configuration for either of the nodes, it fails giving the following message in the logs: fence_node[nnnn]: Fence of "node1" was unsuccessful fence_node[nnnn]: Fence of "node2" was unsuccessful However, when I run the fence_impilan command using the same credentials, it executes successfully and is able to switch on, off and reboot the nodes. The cluster configuration for the fence devices is: IPMI Lan Type Name: lo1 IP Address: 172.16.1.x Login admin Password passone Auth Type password Name: lo2 IP Address: 172.16.1.y Login admin Password passtwo Auth Type password I have already tried different options for Auth Type (blank, password, md5). Have also tried using / not using lanplus for both the fence devices in the Manage Fencing dialog without success. Any suggestions ? Thanks, Manish Kathuria -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster ________________________________ ************************************************************************************** This message is confidential and intended only for the addressee. If you have received this message in error, please immediately notify the postmaster at nds.com and delete it from your system as well as any copies. The content of e-mails as well as traffic data may be monitored by NDS for employment and security purposes. To protect the environment please do not print this e-mail unless necessary. NDS Limited. Registered Office: One London Road, Staines, Middlesex, TW18 4EX, United Kingdom. A company registered in England and Wales. Registered no. 3080780. VAT no. GB 603 8808 40-00 ************************************************************************************** -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mkathuria at tuxtechnologies.co.in Mon Jun 6 15:13:32 2011 From: mkathuria at tuxtechnologies.co.in (Manish Kathuria) Date: Mon, 6 Jun 2011 20:43:32 +0530 Subject: [Linux-cluster] Fencing Issues: fence_node fails but fence_ipmilan works In-Reply-To: <7370F6F5ED3B874F988F5CE657D801EA13A9309268@UKMA1.UK.NDS.COM> References: <7370F6F5ED3B874F988F5CE657D801EA13A9309268@UKMA1.UK.NDS.COM> Message-ID: On Mon, Jun 6, 2011 at 7:22 PM, Martinez-Sanchez, Raul wrote: > Hi, > > > > I had a similar issue on an IBM M3 system although it was on a rhel5u6. The > way we got it fix was by changing the ipmilam configuration > *location*(attribute lanplus="1" in device element to fencedevice element) > in the cluster.conf file, I am not sure if this would be relevant to you but > just in case ?. > > > > I.E. > > > > > > > > > > > > > > > > > lanplus="1" login="Admin" name="m3vgc1b-ilo" passwd="***"/> > lanplus="1" login="Admin" name="m3vgc1a-ilo" passwd="***"/> > > Regards, > > Ra?l Mart?nez S?nchez > > > > Please give me the cluster.conf file > > On Mon, Jun 6, 2011 at 2:53 PM, Manish Kathuria > wrote: > > I am facing a strange problem configuring a two node cluster using > RHCS 4.8. Both nodes are HP Proliant DL 180 G6 servers using HP LO > 100i (IPMI Based). > > When I run the fence_node command to check the fence device > configuration for either of the nodes, it fails giving the following > message in the logs: > > fence_node[nnnn]: Fence of "node1" was unsuccessful > fence_node[nnnn]: Fence of "node2" was unsuccessful > > However, when I run the fence_impilan command using the same > credentials, it executes successfully and is able to switch on, off > and reboot the nodes. The cluster configuration for the fence devices > is: > > IPMI Lan Type > Name: ? ? ? ? ? lo1 > IP Address: ? ? 172.16.1.x > Login ? ? ? ? ? admin > Password ? ? ? ?passone > Auth Type ? ? ? password > > Name: ? ? ? ? ? lo2 > IP Address: ? ? 172.16.1.y > Login ? ? ? ? ? admin > Password ? ? ? ?passtwo > Auth Type ? ? ? password > > I have already tried different options for Auth Type (blank, password, > md5). Have also tried using / not using lanplus for both the fence > devices in the Manage Fencing dialog without success. > > Any suggestions ? > Thanks for the tip, I will try that out. Another interesting thing which I discovered subsequently was that the nodes were being fenced by the cluster during testing and its just the command fence_node which fails to execute giving the error message mentioned in the initial mail. Quite surprising. -- Manish From fdinitto at redhat.com Mon Jun 6 17:56:44 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Mon, 06 Jun 2011 19:56:44 +0200 Subject: [Linux-cluster] resource agents 3.9.1rc1 release Message-ID: <4DED14DC.8070604@redhat.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA256 Hi everybody, The current resource agent repository [1] has been tagged to v3.9.1rc1. Tarballs are also available [2]. This is the very first release of the common resource agent repository. It is a big milestone towards eliminating duplication of effort with the goal of improving the overall quality and user experience. There is still a long way to go but the first stone has been laid down. 
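For anyone building from the tarball, choosing between the two agent sets covered below happens at configure time; a hypothetical invocation is sketched here, with the flag name assumed to be --with-ras-set (worth confirming against ./configure --help in the tarball):

    ./autogen.sh                             # only needed when building from git
    ./configure --with-ras-set=rgmanager     # assumed values: all, linux-ha, rgmanager
    make
    make install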
Highlights for the LHA resource agents set: - - lxc, symlink: new resource agents - - db2: major rewrite and support for master/slave mode of operation - - exportfs: backup/restore of rmtab is back - - mysql: multiple improvements for master/slave and replication - - ocft: new tests for pgsql, postfix, and iscsi Highlights for the rgmanager resource agents set: - - oracledb: use shutdown immediate - - tomcat5: fix generated XML - - nfsclient: fix client name mismatch - - halvm: fix mirror dev failure - - nfs: fix selinux integration Several changes have been made to the build system and the spec file to accommodate both projects? needs. The most noticeable change is the option to select "all", "linux-ha" or "rgmanager" resource agents at configuration time, which will also set the default for the spec file. The full list of changes is available in the "ChangeLog" file for users, and in an auto-generated git-to-changelog file called "ChangeLog.devel". NOTE: About the 3.9.x version (particularly for linux-ha folks): This version was chosen simply because the rgmanager set was already at 3.1.x. In order to make it easier for distribution, and to keep package upgrades linear, we decided to bump the number higher than both projects. There is no other special meaning associated with it. The final 3.9.1 release will take place soon. Many thanks to everybody who helped with this release, in particular to the numerous contributors. Without you, the release would certainly not be possible. Cheers, The RAS Tribe [1] https://github.com/ClusterLabs/resource-agents/tarball/v3.9.1rc1 [2] https://fedorahosted.org/releases/r/e/resource-agents/ PS: I am absolutely sure that URL [2] might give some people a fit, but we are still working to get a common release area. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iQIcBAEBCAAGBQJN7RTaAAoJEFA6oBJjVJ+OLd0QAJsNrNxjwDOuHAIt8LW6pOPL WZ7kR0S/S4rXzMC93jFAx3c4UE+7WBUHAEnOqZKQBLpkCti+o6lGG31EsM8sqk94 cHa7P4sLZ7OqnjbulBvORaGFkRrtewxugQMzX03UOnDplTluDaE4duBWou9uCZI1 mq6hd9EyDHXZJCrCN5BRAWAV2JbRb2cp9Wu4HaSFVb/1662mOuaUVLvYoAkmriF8 p5URVsJ09qJVRCpyLFIO4Xd51x46B807naRJMVEclNu5qv6IzL+HvqsR0KLL7CCv cDAzNMqOGYRi3PQlywPaC/D+/PWw5LspmdepizooyIwleUK0O9d8dl3PuMjtewfn 4uMPdp2Vc9OqpAcZpcSIBwrK9zRH+JOQDUJmCL4dRZtsukU2qxAT4f7pX66hTVts DkCkuDcX+xhi/y5eTu5cMKvsfrdcpNaDmIimKtq6T34Axncp8TYaLBfaoSB/2LIm RD7MDXxY9tLD6b/e2gK6xtSXT4A+YQm7eXsBMhjYu30Ozq9Jvjz58V3bivMDtp+E aUI/vxRnxOMjw9io8w2ltnCU9oLI3T9dDkj1Dilnl+HI0ju1flzsW8mhCA0c0GsY tqZ1Em7js1Mp4PcoI57wS4f0INfU32KTkhPBViRn+o8GNJ9wFLd6XtwMFYrinqhS mZxO0uDsvQ9gTnoVTUvL =2KKW -----END PGP SIGNATURE----- From zaeem.arshad at gmail.com Tue Jun 7 18:44:51 2011 From: zaeem.arshad at gmail.com (Zaeem Arshad) Date: Tue, 7 Jun 2011 23:44:51 +0500 Subject: [Linux-cluster] Mixing kernel versions in a GFS cluster In-Reply-To: <30487.1303931741@datil.uphs.upenn.edu> References: <4DB1C7A5.10307@ntsg.umt.edu> <30487.1303931741@datil.uphs.upenn.edu> Message-ID: On Thu, Apr 28, 2011 at 12:15 AM, wrote: > In the message dated: Fri, 22 Apr 2011 12:23:33 MDT, > The pithy ruminations from "Andrew A. Neuschwander" on > <[Linux-cluster] Mixing kernel versions in a GFS cluster> were: > => Would it be a problem to mix CentOS 5.5 and CentOS 5.6 nodes in a GFS(1) > cluster? > => > > Any information? > > Has anyone tried this? I'm trying to figure out the best update path for a > 3-node CentOS 5.5 cluster (with GFS(1) and GFS2). 
> > Not sure if it's relevant but we had two nodes running CentOS 5.4 and 5.5 respectively for quite a while without any issues. We evetually got around to updating the second node but never experienced any issues. HTH -- Zaeem -------------- next part -------------- An HTML attachment was scrubbed... URL: From swap_project at yahoo.com Tue Jun 7 18:57:02 2011 From: swap_project at yahoo.com (Srija) Date: Tue, 7 Jun 2011 11:57:02 -0700 (PDT) Subject: [Linux-cluster] Cluster environment issue In-Reply-To: <807701.11474.qm@web112805.mail.gq1.yahoo.com> Message-ID: <687930.36765.qm@web112809.mail.gq1.yahoo.com> Hi Kaloyan > --- On Fri, 6/3/11, Kaloyan Kovachev > wrote: > > > > > to use broadcast (if private addresses are in the > same > > VLAN/subnet) you > > just need to set it in cluster.conf - cman section, > but not > > sure if it can > > be done on a running cluster (without stopping or > braking > > it) I have configured the cluster in the lab ( with three nodes) and set the broadcast. Here is the configuration -- #--------------------------------------- #--------------------------------------- When I am executeing the cman_tool status command , getting the following output [root ~]# cman_tool status Version: 6.2.0 Config Version: 61 Cluster Name: test Cluster Id: 25790 Cluster Member: Yes Cluster Generation: 968 Membership state: Cluster-Member Nodes: 3 Expected votes: 3 Total votes: 3 Quorum: 2 Active subsystems: 7 Flags: Dirty Ports Bound: 0 Node name: node1 Node ID: 1 Multicast addresses: 239.192.xxx.xx Node addresses: 192.168.205.1 Would you pl. confirm the broadcast configuration!! Again ,in the following document https://access.redhat.com/kb/docs/DOC-40821 under Unsopported items/ Netowrking, it is telling that broadcast is not supportive... Thanks and regards. From kkovachev at varna.net Wed Jun 8 08:33:01 2011 From: kkovachev at varna.net (Kaloyan Kovachev) Date: Wed, 08 Jun 2011 11:33:01 +0300 Subject: [Linux-cluster] Cluster environment issue In-Reply-To: <687930.36765.qm@web112809.mail.gq1.yahoo.com> References: <687930.36765.qm@web112809.mail.gq1.yahoo.com> Message-ID: <8a55259aef507690dbae1bd902e0dc83@mx.varna.net> Hi, On Tue, 7 Jun 2011 11:57:02 -0700 (PDT), Srija wrote: > Hi Kaloyan > >> --- On Fri, 6/3/11, Kaloyan Kovachev >> wrote: >> >> > >> > to use broadcast (if private addresses are in the >> same >> > VLAN/subnet) you >> > just need to set it in cluster.conf - cman section, >> but not >> > sure if it can >> > be done on a running cluster (without stopping or >> braking >> > it) > > > I have configured the cluster in the lab ( with three nodes) and set the > broadcast. Here is the configuration -- > > > > post_join_delay="3"/> > > > > > > > > > > > > > > > > > > > > > > > > #--------------------------------------- > > > try to replace this with just > #--------------------------------------- > > > login="Admin" name="ilo-node1r" passwd="xxx"/> > login="Admin" name="ilo-node2r" passwd="xxx"/> > login="Admin" name="ilo-node3r" passwd="xxx"/> > > > > > > > > > > > When I am executeing the cman_tool status command , getting the > following output > > [root ~]# cman_tool status > Version: 6.2.0 > Config Version: 61 > Cluster Name: test > Cluster Id: 25790 > Cluster Member: Yes > Cluster Generation: 968 > Membership state: Cluster-Member > Nodes: 3 > Expected votes: 3 > Total votes: 3 > Quorum: 2 > Active subsystems: 7 > Flags: Dirty > Ports Bound: 0 > Node name: node1 > Node ID: 1 > Multicast addresses: 239.192.xxx.xx > Node addresses: 192.168.205.1 > > Would you pl. 
confirm the broadcast configuration!! > > Again ,in the following document > > https://access.redhat.com/kb/docs/DOC-40821 unfortunately i can't access the document, but using broadcast is just to confirm the problem is with multicast (like the script i've sent earlier). > > under Unsopported items/ Netowrking, it is telling that broadcast is not > supportive... > > Thanks and regards. > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From yamato at redhat.com Thu Jun 9 08:55:29 2011 From: yamato at redhat.com (Masatake YAMATO) Date: Thu, 09 Jun 2011 17:55:29 +0900 (JST) Subject: [Linux-cluster] [RFC] Read access to /config/dlm//comms//addr Message-ID: <20110609.175529.646090028440251828.yamato@redhat.com> Hi, I've found /config/dlm//comms//addr is readable (in meaning of ls -l) but no handler(comm_addr_read) is defined in dlm/fs/dlm/config.c. If cat command works fine with /config/dlm//comms//addr, it will be nice to understand the status of dlm. So I'm thinking about writing a patch. But after reading the source code, I've found its difficulties; /config/dlm//comms//addr holds 'struct sockaddr_storage'. I'd like to get your comment before going ahead. I think we have three choice. Which do you think the best? 1. When 'cat /config/dlm//comms//addr' is invoked, it converts the held sockaddr_storage to human readable text and provids it to userland. e.g. # cat /config/dlm//comms//addr AF_INET 192.168.151.1 # Advantage: human readable Disadvantage: data asymmetry in writing and reading When writing to /config/dlm//comms//addr, it expects binary format of sockaddr_storage. 2. When 'cat /config/dlm//comms//addr' is invoked, it provides the held sockaddr_storage to userland. Advantage: data symmetry in writing and reading. Disadvantage: not human readable. It needs something effort to understanding the returned binary data. 3. Make /config/dlm//comms//addr unreadable (in meaning of ls -l) e.g. # ls -l /config/dlm//comms//addr --w-------. 1 root root 4096 Jun 9 08:51 /config/dlm//comms//addr Advantage: easy to implement. Disadvantage: no way to know the value of node addr of dlm view. Regards, Masatake YAMATO From laszlo.budai at gmail.com Thu Jun 9 09:45:29 2011 From: laszlo.budai at gmail.com (Budai Laszlo) Date: Thu, 09 Jun 2011 12:45:29 +0300 Subject: [Linux-cluster] Remove GFS journal Message-ID: <4DF09639.70607@gmail.com> Hi, I would like to know if it is possible to remove a journal from GFS. I have tried to google for it, but did not found anything conclusive. I've read the documentation on the following address: http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html-single/Global_File_System/index.html but I did not found any mention about the possibility or impossibility of removing journals. Thank you, Laszlo From swhiteho at redhat.com Thu Jun 9 09:56:53 2011 From: swhiteho at redhat.com (Steven Whitehouse) Date: Thu, 09 Jun 2011 10:56:53 +0100 Subject: [Linux-cluster] Remove GFS journal In-Reply-To: <4DF09639.70607@gmail.com> References: <4DF09639.70607@gmail.com> Message-ID: <1307613413.2821.1.camel@menhir> Hi, On Thu, 2011-06-09 at 12:45 +0300, Budai Laszlo wrote: > Hi, > > I would like to know if it is possible to remove a journal from GFS. I > have tried to google for it, but did not found anything conclusive. 
I've > read the documentation on the following address: > http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html-single/Global_File_System/index.html > but I did not found any mention about the possibility or impossibility > of removing journals. > > Thank you, > Laszlo > There is no tool to do this, I'm afraid. Theoretically it could be done by editing the fs directly, but it would be a pretty tricky thing to do, and certainly not a recommended procedure, Steve. From shankar.jha at gmail.com Thu Jun 9 10:27:22 2011 From: shankar.jha at gmail.com (Shankar Jha) Date: Thu, 9 Jun 2011 15:57:22 +0530 Subject: [Linux-cluster] cluster is not relocation on second node. Message-ID: Hi, I have problem in rhel5.5 cluster. Mysqld service is on cluster. when there is any issue with cluster, services(hell) not relocation automatically. Even I have tried to enable on second node but fails. In that case we need to reboot both nodes and enable it on manually on anyone. HP-ILO fencing is not working. Please find the below /var/log/message and suggest. Jun 9 02:46:25 indls0040 clurgmgrd[6530]: Stopping service service:hell Jun 9 02:46:27 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 Jun 9 02:46:44 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 Jun 9 02:46:45 indls0040 ccsd[5222]: Unable to connect to cluster infrastructure after 19710 seconds. Jun 9 02:46:55 indls0040 clurgmgrd[6530]: #52: Failed changing RG status Jun 9 02:47:03 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 Jun 9 02:47:05 indls0040 clurgmgrd: [6530]: 10.48.64.82 is not configured Jun 9 02:47:05 indls0040 clurgmgrd[6530]: Stopping service service:hell Jun 9 02:47:15 indls0040 ccsd[5222]: Unable to connect to cluster infrastructure after 19740 seconds. Jun 9 02:47:20 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 Jun 9 02:47:35 indls0040 clurgmgrd[6530]: #52: Failed changing RG status Jun 9 02:47:38 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 Jun 9 02:47:45 indls0040 clurgmgrd: [6530]: 10.48.64.82 is not configured Jun 9 02:47:45 indls0040 clurgmgrd[6530]: Stopping service service:hell Jun 9 02:47:45 indls0040 ccsd[5222]: Unable to connect to cluster infrastructure after 19770 seconds. Jun 9 02:47:50 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 Jun 9 02:48:14 indls0040 last message repeated 2 times Jun 9 02:48:15 indls0040 ccsd[5222]: Unable to connect to cluster infrastructure after 19800 seconds. Jun 9 02:48:15 indls0040 clurgmgrd[6530]: #52: Failed changing RG status Jun 9 02:48:23 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 Jun 9 02:48:25 indls0040 clurgmgrd: [6530]: 10.48.64.82 is not configured Jun 9 02:48:25 indls0040 clurgmgrd[6530]: Stopping service service:hell Jun 9 02:48:37 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 Jun 9 02:48:45 indls0040 ccsd[5222]: Unable to connect to cluster infrastructure after 19830 seconds. Jun 9 02:48:55 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 Jun 9 02:48:55 indls0040 clurgmgrd[6530]: #52: Failed changing RG status Jun 9 02:49:05 indls0040 clurgmgrd: [6530]: 10.48.64.82 is not configured Jun 9 02:49:05 indls0040 clurgmgrd[6530]: Stopping service service:hell Jun 9 02:49:13 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 Jun 9 02:49:15 indls0040 ccsd[5222]: Unable to connect to cluster infrastructure after 19860 seconds. 
Jun 9 02:49:26 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 Jun 9 02:49:35 indls0040 clurgmgrd[6530]: #52: Failed changing RG status Jun 9 02:49:45 indls0040 clurgmgrd: [6530]: 10.48.64.82 is not configured Jun 9 02:49:45 indls0040 clurgmgrd[6530]: Stopping service service:hell Jun 9 02:49:45 indls0040 ccsd[5222]: Unable to connect to cluster infrastructure after 19890 seconds. Jun 9 02:49:47 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 Jun 9 02:50:10 indls0040 last message repeated 2 times Jun 9 02:50:15 indls0040 clurgmgrd[6530]: #52: Failed changing RG status Jun 9 10:03:59 indls0040 openais[23169]: [MAIN ] Using default multicast address of 239.192.67.158 Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] Token Timeout (10000 ms) retransmit timeout (495 ms) Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] token hold (386 ms) retransmits before loss (20 retrans) Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] join (60 ms) send_join (0 ms) consensus (4800 ms) merge (200 ms) Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] downcheck (1000 ms) fail to recv const (50 msgs) Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] seqno unchanged const (30 rotations) Maximum network MTU 1402 Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] window size per rotation (50 messages) maximum messages per rotation (1 7 messages) Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] missed count const (5 messages) Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] send threads (0 threads) Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] RRP token expired timeout (495 ms) Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] RRP token problem counter (2000 ms) Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] RRP threshold (10 problem count) Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] RRP mode set to none. Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] heartbeat_failures_allowed (0) Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] max_network_delay (50 ms) Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] HeartBeat is Disabled. To enable set heartbeat_failures_allowed > 0 Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes). Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes). Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] The network interface [10.48.65.54] is now up. Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] Created or loaded sequence id 7136704.10.48.65.54 for this ring. Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] entering GATHER state from 15. 
Jun 9 10:04:00 indls0040 openais[23169]: [CMAN ] CMAN 2.0.115 (built Jul 28 2010 19:18:41) started Jun 9 10:04:00 indls0040 openais[23169]: [MAIN ] Service initialized 'openais CMAN membership service 2.01' Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized 'openais extended virtual synchrony service' Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized 'openais cluster membership service B.01.01' Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized 'openais availability management framework B.01.01' Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized 'openais checkpoint service B.01.01' Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized 'openais event service B.01.01' Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized 'openais distributed locking service B.01.01' Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized 'openais message service B.01.01' Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized 'openais configuration service' Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized 'openais cluster closed process group service v1.01 ' Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized 'openais cluster config database access v1.01' Jun 9 10:04:00 indls0040 openais[23169]: [SYNC ] Not using a virtual synchrony filter. Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] Creating commit token because I am the rep. --More-- Thanks- Shankar Jun 9 10:04:01 indls0040 openais[23169]: [CLM ] r(0) ip(10.48.64.67) Jun 9 10:04:01 indls0040 openais[23169]: [SYNC ] This node is within the primary component and will provide service. Jun 9 10:04:01 indls0040 openais[23169]: [TOTEM] entering OPERATIONAL state. 
Jun 9 10:04:02 indls0040 openais[23169]: [CLM ] got nodejoin message 10.48.64.67 Jun 9 10:04:02 indls0040 openais[23169]: [CLM ] got nodejoin message 10.48.65.54 Jun 9 10:04:02 indls0040 openais[23169]: [CMAN ] cman killed by node 2 because we were killed by cman_tool or other appl ication Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading all openais components Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais component: openais_confdb v0 (19/10) Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais component: openais_cpg v0 (18/8) Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais component: openais_cfg v0 (17/7) Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais component: openais_msg v0 (16/6) Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais component: openais_lck v0 (15/5) Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais component: openais_evt v0 (14/4) Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais component: openais_ckpt v0 (13/3) Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais component: openais_amf v0 (12/2) Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais component: openais_clm v0 (11/1) Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais component: openais_evs v0 (10/0) Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais component: openais_cman v0 (9/9) Jun 9 10:04:03 indls0040 dlm_controld[23196]: cluster is down, exiting Jun 9 10:04:03 indls0040 fenced[23188]: cluster is down, exiting Jun 9 10:04:03 indls0040 kernel: dlm: closing connection to node 1 Jun 9 10:04:03 indls0040 gfs_controld[23203]: cpg_join error 2 Jun 9 10:04:06 indls0040 fence_node[23194]: Fence of "indls0040.qdx.in" was unsuccessful Jun 9 10:04:15 indls0040 ccsd[5222]: Unable to connect to cluster infrastructure after 45930 seconds. Jun 9 10:04:16 indls0040 clurgmgrd[6530]: #52: Failed changing RG status -------------- next part -------------- A non-text attachment was scrubbed... Name: logs.docx Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document Size: 147479 bytes Desc: not available URL: From ccaulfie at redhat.com Thu Jun 9 10:40:28 2011 From: ccaulfie at redhat.com (Christine Caulfield) Date: Thu, 09 Jun 2011 11:40:28 +0100 Subject: [Linux-cluster] cluster is not relocation on second node. In-Reply-To: References: Message-ID: <4DF0A31C.70605@redhat.com> On 09/06/11 11:27, Shankar Jha wrote: > Hi, > > I have problem in rhel5.5 cluster. > Mysqld service is on cluster. when there is any issue with cluster, > services(hell) not relocation automatically. Even I have tried to > enable on second node but fails. In that case we need to reboot both > nodes and enable it on manually on anyone. HP-ILO fencing is not > working. You answered your own question. Fix fencing and the failover should work fine :-) Chrissie > Please find the below /var/log/message and suggest. > > > Jun 9 02:46:25 indls0040 clurgmgrd[6530]: Stopping service > service:hell > Jun 9 02:46:27 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 > Jun 9 02:46:44 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 > Jun 9 02:46:45 indls0040 ccsd[5222]: Unable to connect to cluster > infrastructure after 19710 seconds. 
> Jun 9 02:46:55 indls0040 clurgmgrd[6530]: #52: Failed changing RG status > Jun 9 02:47:03 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 > Jun 9 02:47:05 indls0040 clurgmgrd: [6530]: 10.48.64.82 is > not configured > Jun 9 02:47:05 indls0040 clurgmgrd[6530]: Stopping service > service:hell > Jun 9 02:47:15 indls0040 ccsd[5222]: Unable to connect to cluster > infrastructure after 19740 seconds. > Jun 9 02:47:20 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 > Jun 9 02:47:35 indls0040 clurgmgrd[6530]: #52: Failed changing RG status > Jun 9 02:47:38 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 > Jun 9 02:47:45 indls0040 clurgmgrd: [6530]: 10.48.64.82 is > not configured > Jun 9 02:47:45 indls0040 clurgmgrd[6530]: Stopping service > service:hell > Jun 9 02:47:45 indls0040 ccsd[5222]: Unable to connect to cluster > infrastructure after 19770 seconds. > Jun 9 02:47:50 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 > Jun 9 02:48:14 indls0040 last message repeated 2 times > Jun 9 02:48:15 indls0040 ccsd[5222]: Unable to connect to cluster > infrastructure after 19800 seconds. > Jun 9 02:48:15 indls0040 clurgmgrd[6530]: #52: Failed changing RG status > Jun 9 02:48:23 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 > Jun 9 02:48:25 indls0040 clurgmgrd: [6530]: 10.48.64.82 is > not configured > Jun 9 02:48:25 indls0040 clurgmgrd[6530]: Stopping service > service:hell > Jun 9 02:48:37 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 > Jun 9 02:48:45 indls0040 ccsd[5222]: Unable to connect to cluster > infrastructure after 19830 seconds. > Jun 9 02:48:55 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 > Jun 9 02:48:55 indls0040 clurgmgrd[6530]: #52: Failed changing RG status > Jun 9 02:49:05 indls0040 clurgmgrd: [6530]: 10.48.64.82 is > not configured > Jun 9 02:49:05 indls0040 clurgmgrd[6530]: Stopping service > service:hell > Jun 9 02:49:13 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 > Jun 9 02:49:15 indls0040 ccsd[5222]: Unable to connect to cluster > infrastructure after 19860 seconds. > Jun 9 02:49:26 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 > Jun 9 02:49:35 indls0040 clurgmgrd[6530]: #52: Failed changing RG status > Jun 9 02:49:45 indls0040 clurgmgrd: [6530]: 10.48.64.82 is > not configured > Jun 9 02:49:45 indls0040 clurgmgrd[6530]: Stopping service > service:hell > Jun 9 02:49:45 indls0040 ccsd[5222]: Unable to connect to cluster > infrastructure after 19890 seconds. 
> Jun 9 02:49:47 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 > Jun 9 02:50:10 indls0040 last message repeated 2 times > Jun 9 02:50:15 indls0040 clurgmgrd[6530]: #52: Failed changing RG status > > > Jun 9 10:03:59 indls0040 openais[23169]: [MAIN ] Using default > multicast address of 239.192.67.158 > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] Token Timeout (10000 > ms) retransmit timeout (495 ms) > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] token hold (386 ms) > retransmits before loss (20 retrans) > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] join (60 ms) > send_join (0 ms) consensus (4800 ms) merge (200 ms) > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] downcheck (1000 ms) > fail to recv const (50 msgs) > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] seqno unchanged > const (30 rotations) Maximum network MTU 1402 > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] window size per > rotation (50 messages) maximum messages per rotation (1 > 7 messages) > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] missed count const > (5 messages) > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] send threads (0 threads) > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] RRP token expired > timeout (495 ms) > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] RRP token problem > counter (2000 ms) > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] RRP threshold (10 > problem count) > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] RRP mode set to none. > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] heartbeat_failures_allowed (0) > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] max_network_delay (50 ms) > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] HeartBeat is > Disabled. To enable set heartbeat_failures_allowed> 0 > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] Receive multicast > socket recv buffer size (320000 bytes). > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] Transmit multicast > socket send buffer size (262142 bytes). > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] The network > interface [10.48.65.54] is now up. > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] Created or loaded > sequence id 7136704.10.48.65.54 for this ring. > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] entering GATHER state from 15. 
> Jun 9 10:04:00 indls0040 openais[23169]: [CMAN ] CMAN 2.0.115 (built > Jul 28 2010 19:18:41) started > Jun 9 10:04:00 indls0040 openais[23169]: [MAIN ] Service initialized > 'openais CMAN membership service 2.01' > Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized > 'openais extended virtual synchrony service' > Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized > 'openais cluster membership service B.01.01' > Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized > 'openais availability management framework B.01.01' > > Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized > 'openais checkpoint service B.01.01' > Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized > 'openais event service B.01.01' > Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized > 'openais distributed locking service B.01.01' > Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized > 'openais message service B.01.01' > Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized > 'openais configuration service' > Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized > 'openais cluster closed process group service v1.01 > ' > Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized > 'openais cluster config database access v1.01' > Jun 9 10:04:00 indls0040 openais[23169]: [SYNC ] Not using a virtual > synchrony filter. > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] Creating commit > token because I am the rep. > --More-- > > > Thanks- > Shankar > > > > Jun 9 10:04:01 indls0040 openais[23169]: [CLM ] r(0) ip(10.48.64.67) > Jun 9 10:04:01 indls0040 openais[23169]: [SYNC ] This node is within > the primary component and will provide service. > Jun 9 10:04:01 indls0040 openais[23169]: [TOTEM] entering OPERATIONAL state. 
> Jun 9 10:04:02 indls0040 openais[23169]: [CLM ] got nodejoin message > 10.48.64.67 > Jun 9 10:04:02 indls0040 openais[23169]: [CLM ] got nodejoin message > 10.48.65.54 > Jun 9 10:04:02 indls0040 openais[23169]: [CMAN ] cman killed by node > 2 because we were killed by cman_tool or other appl > ication > Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading all > openais components > Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais > component: openais_confdb v0 (19/10) > Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais > component: openais_cpg v0 (18/8) > Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais > component: openais_cfg v0 (17/7) > Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais > component: openais_msg v0 (16/6) > Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais > component: openais_lck v0 (15/5) > Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais > component: openais_evt v0 (14/4) > Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais > component: openais_ckpt v0 (13/3) > Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais > component: openais_amf v0 (12/2) > Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais > component: openais_clm v0 (11/1) > Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais > component: openais_evs v0 (10/0) > Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais > component: openais_cman v0 (9/9) > Jun 9 10:04:03 indls0040 dlm_controld[23196]: cluster is down, exiting > Jun 9 10:04:03 indls0040 fenced[23188]: cluster is down, exiting > Jun 9 10:04:03 indls0040 kernel: dlm: closing connection to node 1 > Jun 9 10:04:03 indls0040 gfs_controld[23203]: cpg_join error 2 > Jun 9 10:04:06 indls0040 fence_node[23194]: Fence of > "indls0040.qdx.in" was unsuccessful > Jun 9 10:04:15 indls0040 ccsd[5222]: Unable to connect to cluster > infrastructure after 45930 seconds. > Jun 9 10:04:16 indls0040 clurgmgrd[6530]: #52: Failed changing RG status > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From teigland at redhat.com Thu Jun 9 14:05:46 2011 From: teigland at redhat.com (David Teigland) Date: Thu, 9 Jun 2011 10:05:46 -0400 Subject: [Linux-cluster] [RFC] Read access to /config/dlm//comms//addr In-Reply-To: <20110609.175529.646090028440251828.yamato@redhat.com> References: <20110609.175529.646090028440251828.yamato@redhat.com> Message-ID: <20110609140546.GA30732@redhat.com> On Thu, Jun 09, 2011 at 05:55:29PM +0900, Masatake YAMATO wrote: > Hi, > > I've found /config/dlm//comms//addr is readable > (in meaning of ls -l) but no handler(comm_addr_read) is defined in > dlm/fs/dlm/config.c. > > If cat command works fine with /config/dlm//comms//addr, > it will be nice to understand the status of dlm. So I'm thinking about > writing a patch. > > But after reading the source code, I've found its difficulties; > /config/dlm//comms//addr holds 'struct > sockaddr_storage'. Another problem is that you can write multiple addr's to that file sequentially when using SCTP, so which do you get when you read it? > 3. Make /config/dlm//comms//addr unreadable (in meaning of ls -l) > > e.g. > # ls -l /config/dlm//comms//addr > --w-------. 1 root root 4096 Jun 9 08:51 /config/dlm//comms//addr > > Advantage: easy to implement. > Disadvantage: no way to know the value of node addr of dlm view. I suggest this. 
If you want a way to read them, I'd add a new readonly file addr_list, # cat /config/dlm//comms//addr_list AF_INET 192.168.151.1 AF_INET 192.168.151.2 Dave From yamato at redhat.com Thu Jun 9 14:39:30 2011 From: yamato at redhat.com (Masatake YAMATO) Date: Thu, 09 Jun 2011 23:39:30 +0900 (JST) Subject: [Linux-cluster] [RFC] Read access to /config/dlm//comms//addr In-Reply-To: <20110609140546.GA30732@redhat.com> References: <20110609.175529.646090028440251828.yamato@redhat.com> <20110609140546.GA30732@redhat.com> Message-ID: <20110609.233930.815852573745836394.yamato@redhat.com> > On Thu, Jun 09, 2011 at 05:55:29PM +0900, Masatake YAMATO wrote: >> Hi, >> >> I've found /config/dlm//comms//addr is readable >> (in meaning of ls -l) but no handler(comm_addr_read) is defined in >> dlm/fs/dlm/config.c. >> >> If cat command works fine with /config/dlm//comms//addr, >> it will be nice to understand the status of dlm. So I'm thinking about >> writing a patch. >> >> But after reading the source code, I've found its difficulties; >> /config/dlm//comms//addr holds 'struct >> sockaddr_storage'. > > Another problem is that you can write multiple addr's to that file > sequentially when using SCTP, so which do you get when you read it? > >> 3. Make /config/dlm//comms//addr unreadable (in meaning of ls -l) >> >> e.g. >> # ls -l /config/dlm//comms//addr >> --w-------. 1 root root 4096 Jun 9 08:51 /config/dlm//comms//addr >> >> Advantage: easy to implement. >> Disadvantage: no way to know the value of node addr of dlm view. > > I suggest this. If you want a way to read them, I'd add a new readonly > file addr_list, Of course, I want:) > # cat /config/dlm//comms//addr_list > AF_INET 192.168.151.1 > AF_INET 192.168.151.2 This is what I want. > Dave > From laszlo.budai at gmail.com Thu Jun 9 14:46:57 2011 From: laszlo.budai at gmail.com (Budai Laszlo) Date: Thu, 09 Jun 2011 17:46:57 +0300 Subject: [Linux-cluster] gfs mount at boot Message-ID: <4DF0DCE1.2000406@gmail.com> Hi, What should be done in order to mount a gfs file system at boot? I've created the following line in /etc/fstab: /dev/clvg/gfsvol /mnt/testgfs gfs defaults 0 0 but it is not mounting the fs at boot. If I run "mount -a" then the fs will get mounted. Is there any option for fstab to specify that this mount should be delayed until the cluster is up and running? Thank you, Laszlo From corey.kovacs at gmail.com Thu Jun 9 14:52:46 2011 From: corey.kovacs at gmail.com (Corey Kovacs) Date: Thu, 9 Jun 2011 15:52:46 +0100 Subject: [Linux-cluster] umount failing... Message-ID: Folks, I have a 5 node cluster serving out several NFS exports, one of which is /home. All of the nfs services can be moved from node to node without problem except for the one providing /home. The logs on that node indicate the umount is failing and then the service is disabled (self-fence is not enabled). Even after the service is put into a failed state and then disabled manually, umount fails... I had noticed recently while playing with conga that creating a service for /home on a test cluster a warning was issued about reserved words and as I recall (i could be wrong) /home was among the illegal parameters for the mount point. I have turned everything off that I could think of which might be "holding" the mount and have run the various iterations of lsof, find etc. nothing shows up as having anything being actively used. This particular file system is 1TB. Is there something wrong with using /home as an export? Some specifics. 
RHEL5.6 (updated as of last week) HA-LVM protecting ext3 using the newer "preferred method" with clvmd Ext3 for exported file systems 5 nodes. Any ideas would be greatly appreciated. -C From thomas at sjolshagen.net Thu Jun 9 15:04:29 2011 From: thomas at sjolshagen.net (Thomas Sjolshagen) Date: Thu, 9 Jun 2011 11:04:29 -0400 Subject: [Linux-cluster] gfs mount at boot In-Reply-To: <4DF0DCE1.2000406@gmail.com> References: <4DF0DCE1.2000406@gmail.com> Message-ID: Usually, there's a gfs boot service or network filesystem boot service you may need to enable. On Jun 9, 2011, at 10:46, Budai Laszlo wrote: > Hi, > > What should be done in order to mount a gfs file system at boot? > I've created the following line in /etc/fstab: > > /dev/clvg/gfsvol /mnt/testgfs gfs defaults 0 0 > > but it is not mounting the fs at boot. If I run "mount -a" then the fs > will get mounted. > Is there any option for fstab to specify that this mount should be > delayed until the cluster is up and running? > > Thank you, > Laszlo > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From corey.kovacs at gmail.com Thu Jun 9 15:12:35 2011 From: corey.kovacs at gmail.com (Corey Kovacs) Date: Thu, 9 Jun 2011 16:12:35 +0100 Subject: [Linux-cluster] gfs mount at boot In-Reply-To: References: <4DF0DCE1.2000406@gmail.com> Message-ID: Put "_netfs" in the options line. GFS is dependent on the network so once the network is up, it should try to mount again, but not before. On Thu, Jun 9, 2011 at 4:04 PM, Thomas Sjolshagen wrote: > Usually, there's a gfs boot service or network filesystem boot service you may need to enable. > > On Jun 9, 2011, at 10:46, Budai Laszlo wrote: > >> Hi, >> >> What should be done in order to mount a gfs file system at boot? >> I've created the following line in /etc/fstab: >> >> /dev/clvg/gfsvol ? ? ? ?/mnt/testgfs ? ? ? ? ? ?gfs ? ? defaults ? ? ? ?0 0 >> >> but it is not mounting the fs at boot. If I run "mount -a" then the fs >> will get mounted. >> Is there any option for fstab to specify that this mount should be >> delayed ?until the cluster is up and running? >> >> Thank you, >> Laszlo >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From linux at alteeve.com Thu Jun 9 15:20:18 2011 From: linux at alteeve.com (Digimer) Date: Thu, 09 Jun 2011 11:20:18 -0400 Subject: [Linux-cluster] gfs mount at boot In-Reply-To: <4DF0DCE1.2000406@gmail.com> References: <4DF0DCE1.2000406@gmail.com> Message-ID: <4DF0E4B2.1070704@alteeve.com> On 06/09/2011 10:46 AM, Budai Laszlo wrote: > Hi, > > What should be done in order to mount a gfs file system at boot? > I've created the following line in /etc/fstab: > > /dev/clvg/gfsvol /mnt/testgfs gfs defaults 0 0 > > but it is not mounting the fs at boot. If I run "mount -a" then the fs > will get mounted. > Is there any option for fstab to specify that this mount should be > delayed until the cluster is up and running? > > Thank you, > Laszlo The trick is that you need to setup the GFS2 partition with "rw,suid,dev,exec,nouser,async" instead of "defaults". This is because "defaults" implies "auto", and the cluster is not online that early in the boot process. To have it mount on boot, start the cluster (chkconfig cman on). If you defined GFS2 as a managed resource, then also enable rgmanager at boot. 
If not, then instead, enable "gfs2" at boot.

If you're not using RHCS, then the same should still work. You just need
to ensure that the service that provides quorum (corosync in pacemaker)
starts so that the cluster can form and provide DLM, which is needed by
GFS2. With DLM, then it's a matter of starting the resource manager
(pacemaker/rgmanager) if the partitions are managed, or starting GFS2
which will consult /etc/fstab and mount any found GFS2 partitions.

--
Digimer
E-Mail:              digimer at alteeve.com
Freenode handle:     digimer
Papers and Projects: http://alteeve.com
Node Assassin:       http://nodeassassin.org
"I feel confined, only free to expand myself within boundaries."

From spaulo05 at hotmail.com Thu Jun 9 15:08:47 2011
From: spaulo05 at hotmail.com (Sergio Paulo)
Date: Thu, 9 Jun 2011 16:08:47 +0100
Subject: [Linux-cluster] gfs mount at boot
In-Reply-To: <4DF0DCE1.2000406@gmail.com>
References: <4DF0DCE1.2000406@gmail.com>
Message-ID:

Hi!

look at this example and try to adapt it on /etc/fstab

/dev/VG01/LV00 /oracle gfs _netdev,defaults 0 0

manually I use

mount.gfs /dev/VG01/LV00 /oracle

Sérgio Paulo Fonseca

> Date: Thu, 9 Jun 2011 17:46:57 +0300
> From: laszlo.budai at gmail.com
> To: linux-cluster at redhat.com
> Subject: [Linux-cluster] gfs mount at boot
>
> Hi,
>
> What should be done in order to mount a gfs file system at boot?
> I've created the following line in /etc/fstab:
>
> /dev/clvg/gfsvol        /mnt/testgfs            gfs     defaults        0 0
>
> but it is not mounting the fs at boot. If I run "mount -a" then the fs
> will get mounted.
> Is there any option for fstab to specify that this mount should be
> delayed until the cluster is up and running?
>
> Thank you,
> Laszlo
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

From linux at alteeve.com Thu Jun 9 15:23:40 2011
From: linux at alteeve.com (Digimer)
Date: Thu, 09 Jun 2011 11:23:40 -0400
Subject: [Linux-cluster] gfs mount at boot
In-Reply-To:
References: <4DF0DCE1.2000406@gmail.com>
Message-ID: <4DF0E57C.8000204@alteeve.com>

On 06/09/2011 11:12 AM, Corey Kovacs wrote:
> Put "_netfs" in the options line. GFS is dependent on the network so
> once the network is up, it should try to mount again, but not before.

GFS2 is dependent on the cluster's distributed lock manager. It can't
come up until: network -> cluster engine -> resource manager or gfs2
daemon.

--
Digimer
E-Mail:              digimer at alteeve.com
Freenode handle:     digimer
Papers and Projects: http://alteeve.com
Node Assassin:       http://nodeassassin.org
"I feel confined, only free to expand myself within boundaries."

From corey.kovacs at gmail.com Thu Jun 9 15:27:27 2011
From: corey.kovacs at gmail.com (Corey Kovacs)
Date: Thu, 9 Jun 2011 16:27:27 +0100
Subject: [Linux-cluster] gfs mount at boot
In-Reply-To: <4DF0E4B2.1070704@alteeve.com>
References: <4DF0DCE1.2000406@gmail.com> <4DF0E4B2.1070704@alteeve.com>
Message-ID:

Ahh, forgot about the gfs2 service. Been a long time since I've set
GFS1/2 up. I'll go crawl back into my cave now...

-C

On Thu, Jun 9, 2011 at 4:20 PM, Digimer wrote:
> On 06/09/2011 10:46 AM, Budai Laszlo wrote:
>>
>> Hi,
>>
>> What should be done in order to mount a gfs file system at boot?
>> I've created the following line in /etc/fstab:
>>
>> /dev/clvg/gfsvol        /mnt/testgfs            gfs     defaults        0
>> 0
>>
>> but it is not mounting the fs at boot. If I run "mount -a" then the fs
>> will get mounted.
>> Is there any option for fstab to specify that this mount should be >> delayed ?until the cluster is up and running? >> >> Thank you, >> Laszlo > > The trick is that you need to setup the GFS2 partition with > "rw,suid,dev,exec,nouser,async" instead of "defaults". This is because > "defaults" implies "auto", and the cluster is not online that early in the > boot process. > > To have it mount on boot, start the cluster (chkconfig cman on). If you > defined GFS2 as a managed resource, then also enable rgmanager at boot. If > not, then instead, enable "gfs2" at boot. > > If you're not using RHCS, then the same should still work. You just need to > ensure that the service that provides quorum (corosync in pacemaker) starts > so that the cluster can form and provide DLM, which is needed by GFS2. With > DLM, then it's a matter of starting the resource manager > (pacemaker/rgmanager) if the partitions are managed, or starting GFS2 which > will consult /etc/fstab and mount any found GFS2 partitions. > > -- > Digimer > E-Mail: ? ? ? ? ? ? ?digimer at alteeve.com > Freenode handle: ? ? digimer > Papers and Projects: http://alteeve.com > Node Assassin: ? ? ? http://nodeassassin.org > "I feel confined, only free to expand myself within boundaries." > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From ajb2 at mssl.ucl.ac.uk Thu Jun 9 18:48:04 2011 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Thu, 09 Jun 2011 19:48:04 +0100 Subject: [Linux-cluster] gfs mount at boot In-Reply-To: <4DF0DCE1.2000406@gmail.com> References: <4DF0DCE1.2000406@gmail.com> Message-ID: <4DF11564.7090108@mssl.ucl.ac.uk> On 09/06/11 15:46, Budai Laszlo wrote: > Hi, > > What should be done in order to mount a gfs file system at boot? > I've created the following line in /etc/fstab: > > /dev/clvg/gfsvol /mnt/testgfs gfs defaults 0 0 > > but it is not mounting the fs at boot. If I run "mount -a" then the fs > will get mounted. > Is there any option for fstab to specify that this mount should be > delayed until the cluster is up and running? Add _netfs after defaults. From rossnick-lists at cybercat.ca Sat Jun 11 02:43:22 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Fri, 10 Jun 2011 22:43:22 -0400 Subject: [Linux-cluster] clustat -x Message-ID: <4DF2D64A.3040702@cybercat.ca> Hi ! I am make scripts to help monitor and administer our cluster. And I wounder where I could find info about the xml output of clustat ? Notably what are the different state and flags ? I know that state of 112 means started, but what else ? thanks, Nicolas From lcl at nym.hush.com Sun Jun 12 23:54:00 2011 From: lcl at nym.hush.com (lcl at nym.hush.com) Date: Sun, 12 Jun 2011 16:54:00 -0700 Subject: [Linux-cluster] GFS2 reads eventually cause writes to slow Message-ID: <20110612235400.4273F6F437@smtp.hushmail.com> Hello, My team has been having a problem while testing a cluster in a lab in which write operations are extremely slow after reads have been performed continuously for some period of time. We eventually isolated the problem to where we can replicate it on only one node. The other two nodes are powered on, but no filesystems are mounted on those nodes and no operations are performed on those nodes. To replicate, we first reboot everything, then start a number of threads (300+) doing random reads of 8K files in a large directory structure. This goes well for up to 45 minutes. 
(We're not expecting the reads to be that fast, given they are not cached at that point, but they are within expectations.) We don't do any writes at this time. Then something changes, and we can see that glock_manager is generally at 98-99% in iotop. At this point, however, the reads are still fast enough. Once the node has gotten into this state, an attempt to write an 8K file will usually take several seconds. Note that it takes several seconds to write a file even if it is written on a different filesystem from the one on which we are doing the reads. This bad condition persists until reads are stopped. After reads are stopped, the node recovers in a few minutes, after which writes can be performed quickly. After that, once the test is restarted, it will once again take up to 45 minutes to get the node into the bad state again. Our hypothesis at this point is that there is some cleanup that is not getting performed as long as intensive reads are ongoing. Because that cleanup has not been done, writes are extremely slow. Once the reads stop, the necessary cleanup gets performed, and then it is a long time to cause the problem again. We've tried various tuning options and are starting to dig into source code to find out more, but I thought I'd find out if anyone has any insight into this. We're testing on CentOS 5 with kernel 2.6.18-238.12.1.el5, with gfs2-kmod-debuginfo.x86_64 1.92-1.1.el5_2.2. Thanks, Brian From torajveersingh at gmail.com Mon Jun 13 10:44:12 2011 From: torajveersingh at gmail.com (Rajveer Singh) Date: Mon, 13 Jun 2011 16:14:12 +0530 Subject: [Linux-cluster] umount failing... In-Reply-To: References: Message-ID: On Thu, Jun 9, 2011 at 8:22 PM, Corey Kovacs wrote: > Folks, > > I have a 5 node cluster serving out several NFS exports, one of which is > /home. > > All of the nfs services can be moved from node to node without problem > except for the one providing /home. > > The logs on that node indicate the umount is failing and then the > service is disabled (self-fence is not enabled). > > Even after the service is put into a failed state and then disabled > manually, umount fails... > > I had noticed recently while playing with conga that creating a > service for /home on a test cluster a warning was issued about > reserved words and as I recall (i could be wrong) /home was among the > illegal parameters for the mount point. > > I have turned everything off that I could think of which might be > "holding" the mount and have run the various iterations of lsof, find > etc. nothing shows up as having anything being actively used. > > This particular file system is 1TB. > > Is there something wrong with using /home as an export? > > Some specifics. > > RHEL5.6 (updated as of last week) > HA-LVM protecting ext3 using the newer "preferred method" with clvmd > Ext3 for exported file systems > 5 nodes. > > > Any ideas would be greatly appreciated. > > -C > > Can you share your log file and cluster.conf file -------------- next part -------------- An HTML attachment was scrubbed... URL: From laszlo.budai at gmail.com Tue Jun 14 11:13:13 2011 From: laszlo.budai at gmail.com (Budai Laszlo) Date: Tue, 14 Jun 2011 14:13:13 +0300 Subject: [Linux-cluster] gfs mount at boot In-Reply-To: References: <4DF0DCE1.2000406@gmail.com> Message-ID: <4DF74249.6020906@gmail.com> Hi all, Indeed enabling the gfs service has mounted the file system after reboot. 
I have also tried the other suggestions, but none of them worked for me
(the most probable cause is that the cluster stack was not yet ready when
the system tried to do the mount).

So my conclusion is that if one needs a gfs at boot without configuring
any cluster resource to mount it, then the gfs system service needs to be
enabled (chkconfig gfs on).

Thank you all for your ideas and time.
Laszlo

On 06/09/2011 06:04 PM, Thomas Sjolshagen wrote:
> Usually, there's a gfs boot service or network filesystem boot service you may need to enable.
>
> On Jun 9, 2011, at 10:46, Budai Laszlo wrote:
>
>> Hi,
>>
>> What should be done in order to mount a gfs file system at boot?
>> I've created the following line in /etc/fstab:
>>
>> /dev/clvg/gfsvol        /mnt/testgfs            gfs     defaults        0 0
>>
>> but it is not mounting the fs at boot. If I run "mount -a" then the fs
>> will get mounted.
>> Is there any option for fstab to specify that this mount should be
>> delayed until the cluster is up and running?
>>
>> Thank you,
>> Laszlo
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>

From benpro82 at gmail.com Tue Jun 14 15:31:12 2011
From: benpro82 at gmail.com (benpro)
Date: Tue, 14 Jun 2011 17:31:12 +0200
Subject: [Linux-cluster] CLVM Documentation.
Message-ID:

Hi there,

I'm actually studying some solutions to have a shared FS for KVM and live
migration. I've already tested DRBD+OCFS2 with success.

I wanted to take a look at CLVM, but I don't find any explicit
documentation, like how to configure lvm.conf and how to set up CLVM?

Do you have any links which talk about CLVM? I've already found the Red Hat
documentation [1], but it still doesn't explain the subject in terms of
software configuration.

Thanks in advance.

[1] : http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Logical_Volume_Manager_Administration/LVM_Cluster_Overview.html

Regards,

---
Benoît.S
alias Benpro

From linux at alteeve.com Tue Jun 14 15:41:52 2011
From: linux at alteeve.com (Digimer)
Date: Tue, 14 Jun 2011 11:41:52 -0400
Subject: [Linux-cluster] CLVM Documentation.
In-Reply-To:
References:
Message-ID: <4DF78140.9080203@alteeve.com>

On 06/14/2011 11:31 AM, benpro wrote:
> Hi there,
>
> I'm actually studying some solutions to have a shared FS for KVM and live
> migration.
> I've already tested DRBD+OCFS2 with success.
>
> I wanted to take a look at CLVM, but I don't find any explicit
> documentation, like how to configure lvm.conf and how to set up CLVM?
>
> Do you have any links which talk about CLVM? I've already found the Red Hat
> documentation [1], but it still doesn't explain the subject in terms of
> software configuration.
>
> Thanks in advance.
>
> [1] : http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Logical_Volume_Manager_Administration/LVM_Cluster_Overview.html
>
>
> Regards,
>
> ---
> Benoît.S
> alias Benpro

In the end, all that is really needed is to change locking_type to "3"
and fallback_to_local_locking to "0" (and, of course, have DLM). I've
got a bit of documentation on implementing CLVM on EL5 with DRBD here:

http://wiki.alteeve.com/index.php/Red_Hat_Cluster_Service_2_Tutorial#Setting_Up_Clustered_LVM

It is not at all extensive, but hopefully it's sufficient to help.
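A minimal sketch of those lvm.conf changes, assuming a stock EL5 lvm.conf
(defaults and init script names can differ between releases):

# /etc/lvm/lvm.conf -- switch LVM to cluster-wide (DLM-backed) locking
locking_type = 3                  # 3 = clustered locking via clvmd/DLM
fallback_to_local_locking = 0     # fail rather than silently fall back to local locking

# make sure the cluster stack and clvmd come up at boot:
# chkconfig cman on ; chkconfig clvmd on
# service clvmd start

Volume groups meant to be shared also need the clustered flag, e.g.
"vgcreate -cy ..." for new ones or "vgchange -cy <vg>" for existing ones,
and every node must see the same physical volumes.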
-- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "I feel confined, only free to expand myself within boundaries." From skjbalaji at gmail.com Tue Jun 14 18:46:18 2011 From: skjbalaji at gmail.com (Balaji S) Date: Wed, 15 Jun 2011 00:16:18 +0530 Subject: [Linux-cluster] Cluster Failover Failed Message-ID: Hi, In my setup implemented 10 tow node cluster's which running mysql as cluster service, ipmi card as fencing device. In my /var/log/messages i am keep getting the errors like below, Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdm, sector 0 Jun 14 12:50:48 hostname kernel: sd 3:0:2:2: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:50:48 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:50:48 hostname kernel: Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdn, sector 0 Jun 14 12:50:48 hostname kernel: sd 3:0:2:4: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:50:48 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:50:48 hostname kernel: Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdp, sector 0 Jun 14 12:51:10 hostname kernel: sd 3:0:0:1: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:51:10 hostname kernel: Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdc, sector 0 Jun 14 12:51:10 hostname kernel: printk: 3 messages suppressed. Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdc, logical block 0 Jun 14 12:51:10 hostname kernel: sd 3:0:0:2: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:51:10 hostname kernel: Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdd, sector 0 Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdd, logical block 0 Jun 14 12:51:10 hostname kernel: sd 3:0:0:4: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required when i am checking the multipath -ll , this all devices are in passive path. Environment : RHEL 5.4 & EMC SAN Please suggest how to overcome this issue. Support will be highly helpful. Thanks in Advance -- Thanks, BSK -------------- next part -------------- An HTML attachment was scrubbed... URL: From parvez.h.shaikh at gmail.com Wed Jun 15 05:45:03 2011 From: parvez.h.shaikh at gmail.com (Parvez Shaikh) Date: Wed, 15 Jun 2011 11:15:03 +0530 Subject: [Linux-cluster] Plugged out blade from bladecenter chassis - fence_bladecenter failed In-Reply-To: References: <4DBA71EA.9070303@redhat.com> <4DBE5E66.80802@redhat.com> Message-ID: Hi, Has anyone used missing_as_off in cluster.conf file? Any help where to put this option in cluster.conf would be greatly appreciated Thanks, Parvez On Mon, May 2, 2011 at 6:49 PM, Parvez Shaikh wrote: > Hi Marek, > > I tried the option missing_as_off="1" and now I get an another error - > > fenced[18433]: fence "node5.sscdomain" failed > fenced[18433]: fencing node "node5.sscdomain" > > Sniplet of cluster.conf file is - > .... > > > > > > > > > .... > > login="USERID" name="BladeCenterFencing" passwd="PASSW0RD"/> > > > Did I miss something? 
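For reference, a hedged sketch of where missing_as_off usually sits: as an
attribute on the fencedevice definition in the fencedevices section of
cluster.conf. The ipaddr value below is a placeholder; only the attributes
still visible in the quote above come from the original config:

<fencedevice agent="fence_bladecenter" ipaddr="x.x.x.x" login="USERID" missing_as_off="1" name="BladeCenterFencing" passwd="PASSW0RD"/>

The per-node device entries inside each fence method then reference this
device by its name ("BladeCenterFencing").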
> > Thanks > Parvez > > > > On Mon, May 2, 2011 at 1:03 PM, Marek Grac wrote: > >> Hi, >> >> >> On 04/29/2011 10:15 AM, Parvez Shaikh wrote: >> >>> Hi Marek, >>> >>> Can we give this option in cluster.conf file for bladecenter fencing >>> device or method >>> >> >> for cluster.conf you should add ... missing_as_off="1" ... to fence >> configuration >> >> >> >>> For IPMI, fencing is there similar option? >>> >>> >> There is no such method for IPMI. >> >> >> m, >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From RMartinez-Sanchez at nds.com Wed Jun 15 11:11:03 2011 From: RMartinez-Sanchez at nds.com (Martinez-Sanchez, Raul) Date: Wed, 15 Jun 2011 12:11:03 +0100 Subject: [Linux-cluster] Cluster Failover Failed In-Reply-To: References: Message-ID: <7370F6F5ED3B874F988F5CE657D801EA13A9309289@UKMA1.UK.NDS.COM> Hi Balaji, According to RedHat documentation some Storage Array Devices configured in active/passive mode and using multipath will display this I/O error messages, so this might also be your case (see https://access.redhat.com/kb/docs/DOC-35489), this link indicates that the messages are harmless and can be avoided following its instructions. The logs you sent do not indicate anything related to fencing, so you might need to send the relevant info for that. Cheers, Ra?l From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Balaji S Sent: Tuesday, June 14, 2011 7:46 PM To: Linux-cluster at redhat.com Subject: [Linux-cluster] Cluster Failover Failed Hi, In my setup implemented 10 tow node cluster's which running mysql as cluster service, ipmi card as fencing device. In my /var/log/messages i am keep getting the errors like below, Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdm, sector 0 Jun 14 12:50:48 hostname kernel: sd 3:0:2:2: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:50:48 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:50:48 hostname kernel: Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdn, sector 0 Jun 14 12:50:48 hostname kernel: sd 3:0:2:4: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:50:48 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:50:48 hostname kernel: Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdp, sector 0 Jun 14 12:51:10 hostname kernel: sd 3:0:0:1: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:51:10 hostname kernel: Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdc, sector 0 Jun 14 12:51:10 hostname kernel: printk: 3 messages suppressed. Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdc, logical block 0 Jun 14 12:51:10 hostname kernel: sd 3:0:0:2: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:51:10 hostname kernel: Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdd, sector 0 Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdd, logical block 0 Jun 14 12:51:10 hostname kernel: sd 3:0:0:4: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:51:10 hostname kernel: Add. 
Sense: Logical unit not ready, manual intervention required when i am checking the multipath -ll , this all devices are in passive path. Environment : RHEL 5.4 & EMC SAN Please suggest how to overcome this issue. Support will be highly helpful. Thanks in Advance -- Thanks, BSK ________________________________ ************************************************************************************** This message is confidential and intended only for the addressee. If you have received this message in error, please immediately notify the postmaster at nds.com and delete it from your system as well as any copies. The content of e-mails as well as traffic data may be monitored by NDS for employment and security purposes. To protect the environment please do not print this e-mail unless necessary. NDS Limited. Registered Office: One London Road, Staines, Middlesex, TW18 4EX, United Kingdom. A company registered in England and Wales. Registered no. 3080780. VAT no. GB 603 8808 40-00 ************************************************************************************** -------------- next part -------------- An HTML attachment was scrubbed... URL: From alvaro.fernandez at sivsa.com Wed Jun 15 12:14:44 2011 From: alvaro.fernandez at sivsa.com (Alvaro Jose Fernandez) Date: Wed, 15 Jun 2011 14:14:44 +0200 Subject: [Linux-cluster] Cluster Failover Failed References: <7370F6F5ED3B874F988F5CE657D801EA13A9309289@UKMA1.UK.NDS.COM> Message-ID: <607D6181D9919041BE792D70EF2AEC4801A50506@LIMENS.sivsa.int> Hi, DOC-35489 only partionally approaches the problem. I have it too, on a passive/active IBM DS4000 array and RHEL5.5. I've excluded from lvm.conf any SAN partitions as per the note (and also made a new initrd boot, as lvm.conf is included at boot time as the / partition I have it LVM'ed) , but messages still apears on bootup. They always dissapear when multipathd service starts and its scsi_dh_rdac discipline is loaded. Even opened a case with Redhat, and obtained the same response (but not workaround): "it's entirely harmless, they are normal". Alvaro ________________________________ De: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] En nombre de Martinez-Sanchez, Raul Enviado el: mi?rcoles, 15 de junio de 2011 13:11 Para: 'Linux-cluster at redhat.com' Asunto: Re: [Linux-cluster] Cluster Failover Failed Hi Balaji, According to RedHat documentation some Storage Array Devices configured in active/passive mode and using multipath will display this I/O error messages, so this might also be your case (see https://access.redhat.com/kb/docs/DOC-35489), this link indicates that the messages are harmless and can be avoided following its instructions. The logs you sent do not indicate anything related to fencing, so you might need to send the relevant info for that. Cheers, Ra?l From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Balaji S Sent: Tuesday, June 14, 2011 7:46 PM To: Linux-cluster at redhat.com Subject: [Linux-cluster] Cluster Failover Failed Hi, In my setup implemented 10 tow node cluster's which running mysql as cluster service, ipmi card as fencing device. In my /var/log/messages i am keep getting the errors like below, Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdm, sector 0 Jun 14 12:50:48 hostname kernel: sd 3:0:2:2: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:50:48 hostname kernel: Add. 
Sense: Logical unit not ready, manual intervention required Jun 14 12:50:48 hostname kernel: Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdn, sector 0 Jun 14 12:50:48 hostname kernel: sd 3:0:2:4: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:50:48 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:50:48 hostname kernel: Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdp, sector 0 Jun 14 12:51:10 hostname kernel: sd 3:0:0:1: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:51:10 hostname kernel: Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdc, sector 0 Jun 14 12:51:10 hostname kernel: printk: 3 messages suppressed. Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdc, logical block 0 Jun 14 12:51:10 hostname kernel: sd 3:0:0:2: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:51:10 hostname kernel: Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdd, sector 0 Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdd, logical block 0 Jun 14 12:51:10 hostname kernel: sd 3:0:0:4: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required when i am checking the multipath -ll , this all devices are in passive path. Environment : RHEL 5.4 & EMC SAN Please suggest how to overcome this issue. Support will be highly helpful. Thanks in Advance -- Thanks, BSK ________________________________ ************************************************************************************** This message is confidential and intended only for the addressee. If you have received this message in error, please immediately notify the postmaster at nds.com and delete it from your system as well as any copies. The content of e-mails as well as traffic data may be monitored by NDS for employment and security purposes. To protect the environment please do not print this e-mail unless necessary. NDS Limited. Registered Office: One London Road, Staines, Middlesex, TW18 4EX, United Kingdom. A company registered in England and Wales. Registered no. 3080780. VAT no. GB 603 8808 40-00 ************************************************************************************** -------------- next part -------------- An HTML attachment was scrubbed... URL: From RMartinez-Sanchez at nds.com Wed Jun 15 12:54:47 2011 From: RMartinez-Sanchez at nds.com (Martinez-Sanchez, Raul) Date: Wed, 15 Jun 2011 13:54:47 +0100 Subject: [Linux-cluster] Cluster Failover Failed In-Reply-To: <607D6181D9919041BE792D70EF2AEC4801A50506@LIMENS.sivsa.int> References: <7370F6F5ED3B874F988F5CE657D801EA13A9309289@UKMA1.UK.NDS.COM> <607D6181D9919041BE792D70EF2AEC4801A50506@LIMENS.sivsa.int> Message-ID: <7370F6F5ED3B874F988F5CE657D801EA13A930928B@UKMA1.UK.NDS.COM> Hi Alvaro, I have also opened a ticket with RedHat for the same reasons on rhel5u6 and a DS5020 and a DS3524 which I believe they are both active/active and multipath seems to treat them as active/passive, but I guess this is for another mailing list. 
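As a reference for the lvm.conf exclusion Alvaro mentions, it is typically
a filter rule of this form (a sketch only; the accept/reject patterns are
assumptions and have to match the local device names):

# /etc/lvm/lvm.conf -- scan multipath devices and the local disk,
# skip the raw sd* SAN paths (first matching pattern wins)
filter = [ "a|/dev/mapper/mpath.*|", "a|/dev/sda.*|", "r|/dev/sd.*|" ]

If the root filesystem is on LVM, the initrd has to be rebuilt afterwards
(e.g. with mkinitrd) so the boot-time copy of lvm.conf carries the same
filter.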
Ra?l From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Alvaro Jose Fernandez Sent: Wednesday, June 15, 2011 1:15 PM To: linux clustering Subject: Re: [Linux-cluster] Cluster Failover Failed Hi, DOC-35489 only partionally approaches the problem. I have it too, on a passive/active IBM DS4000 array and RHEL5.5. I've excluded from lvm.conf any SAN partitions as per the note (and also made a new initrd boot, as lvm.conf is included at boot time as the / partition I have it LVM'ed) , but messages still apears on bootup. They always dissapear when multipathd service starts and its scsi_dh_rdac discipline is loaded. Even opened a case with Redhat, and obtained the same response (but not workaround): "it's entirely harmless, they are normal". Alvaro ________________________________ De: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] En nombre de Martinez-Sanchez, Raul Enviado el: mi?rcoles, 15 de junio de 2011 13:11 Para: 'Linux-cluster at redhat.com' Asunto: Re: [Linux-cluster] Cluster Failover Failed Hi Balaji, According to RedHat documentation some Storage Array Devices configured in active/passive mode and using multipath will display this I/O error messages, so this might also be your case (see https://access.redhat.com/kb/docs/DOC-35489), this link indicates that the messages are harmless and can be avoided following its instructions. The logs you sent do not indicate anything related to fencing, so you might need to send the relevant info for that. Cheers, Ra?l From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Balaji S Sent: Tuesday, June 14, 2011 7:46 PM To: Linux-cluster at redhat.com Subject: [Linux-cluster] Cluster Failover Failed Hi, In my setup implemented 10 tow node cluster's which running mysql as cluster service, ipmi card as fencing device. In my /var/log/messages i am keep getting the errors like below, Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdm, sector 0 Jun 14 12:50:48 hostname kernel: sd 3:0:2:2: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:50:48 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:50:48 hostname kernel: Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdn, sector 0 Jun 14 12:50:48 hostname kernel: sd 3:0:2:4: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:50:48 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:50:48 hostname kernel: Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdp, sector 0 Jun 14 12:51:10 hostname kernel: sd 3:0:0:1: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:51:10 hostname kernel: Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdc, sector 0 Jun 14 12:51:10 hostname kernel: printk: 3 messages suppressed. Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdc, logical block 0 Jun 14 12:51:10 hostname kernel: sd 3:0:0:2: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:51:10 hostname kernel: Add. 
Sense: Logical unit not ready, manual intervention required Jun 14 12:51:10 hostname kernel: Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdd, sector 0 Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdd, logical block 0 Jun 14 12:51:10 hostname kernel: sd 3:0:0:4: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required when i am checking the multipath -ll , this all devices are in passive path. Environment : RHEL 5.4 & EMC SAN Please suggest how to overcome this issue. Support will be highly helpful. Thanks in Advance -- Thanks, BSK ________________________________ ************************************************************************************** This message is confidential and intended only for the addressee. If you have received this message in error, please immediately notify the postmaster at nds.com and delete it from your system as well as any copies. The content of e-mails as well as traffic data may be monitored by NDS for employment and security purposes. To protect the environment please do not print this e-mail unless necessary. NDS Limited. Registered Office: One London Road, Staines, Middlesex, TW18 4EX, United Kingdom. A company registered in England and Wales. Registered no. 3080780. VAT no. GB 603 8808 40-00 ************************************************************************************** -------------- next part -------------- An HTML attachment was scrubbed... URL: From alvaro.fernandez at sivsa.com Wed Jun 15 13:16:14 2011 From: alvaro.fernandez at sivsa.com (Alvaro Jose Fernandez) Date: Wed, 15 Jun 2011 15:16:14 +0200 Subject: [Linux-cluster] Cluster Failover Failed References: <7370F6F5ED3B874F988F5CE657D801EA13A9309289@UKMA1.UK.NDS.COM><607D6181D9919041BE792D70EF2AEC4801A50506@LIMENS.sivsa.int> <7370F6F5ED3B874F988F5CE657D801EA13A930928B@UKMA1.UK.NDS.COM> Message-ID: <607D6181D9919041BE792D70EF2AEC4801A50524@LIMENS.sivsa.int> Hi Raul, Yes, it seems like-stuff. Thanks for pointing out the same still applies to RHEL5.6 . There is a opened bugzilla at https://bugzilla.redhat.com/show_bug.cgi?id=649705 . Low priority, of course (for Redhat), as no response at all. They seem to ignore that sometimes we have to do demostrations to prospective customers, etc, and the image of all these messages popping out from the console and the logs are unforgettable. Alvaro ________________________________ De: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] En nombre de Martinez-Sanchez, Raul Enviado el: mi?rcoles, 15 de junio de 2011 14:55 Para: 'linux clustering' Asunto: Re: [Linux-cluster] Cluster Failover Failed Hi Alvaro, I have also opened a ticket with RedHat for the same reasons on rhel5u6 and a DS5020 and a DS3524 which I believe they are both active/active and multipath seems to treat them as active/passive, but I guess this is for another mailing list. Ra?l From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Alvaro Jose Fernandez Sent: Wednesday, June 15, 2011 1:15 PM To: linux clustering Subject: Re: [Linux-cluster] Cluster Failover Failed Hi, DOC-35489 only partionally approaches the problem. I have it too, on a passive/active IBM DS4000 array and RHEL5.5. I've excluded from lvm.conf any SAN partitions as per the note (and also made a new initrd boot, as lvm.conf is included at boot time as the / partition I have it LVM'ed) , but messages still apears on bootup. 
They always dissapear when multipathd service starts and its scsi_dh_rdac discipline is loaded. Even opened a case with Redhat, and obtained the same response (but not workaround): "it's entirely harmless, they are normal". Alvaro ________________________________ De: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] En nombre de Martinez-Sanchez, Raul Enviado el: mi?rcoles, 15 de junio de 2011 13:11 Para: 'Linux-cluster at redhat.com' Asunto: Re: [Linux-cluster] Cluster Failover Failed Hi Balaji, According to RedHat documentation some Storage Array Devices configured in active/passive mode and using multipath will display this I/O error messages, so this might also be your case (see https://access.redhat.com/kb/docs/DOC-35489), this link indicates that the messages are harmless and can be avoided following its instructions. The logs you sent do not indicate anything related to fencing, so you might need to send the relevant info for that. Cheers, Ra?l From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Balaji S Sent: Tuesday, June 14, 2011 7:46 PM To: Linux-cluster at redhat.com Subject: [Linux-cluster] Cluster Failover Failed Hi, In my setup implemented 10 tow node cluster's which running mysql as cluster service, ipmi card as fencing device. In my /var/log/messages i am keep getting the errors like below, Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdm, sector 0 Jun 14 12:50:48 hostname kernel: sd 3:0:2:2: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:50:48 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:50:48 hostname kernel: Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdn, sector 0 Jun 14 12:50:48 hostname kernel: sd 3:0:2:4: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:50:48 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:50:48 hostname kernel: Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdp, sector 0 Jun 14 12:51:10 hostname kernel: sd 3:0:0:1: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:51:10 hostname kernel: Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdc, sector 0 Jun 14 12:51:10 hostname kernel: printk: 3 messages suppressed. Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdc, logical block 0 Jun 14 12:51:10 hostname kernel: sd 3:0:0:2: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:51:10 hostname kernel: Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdd, sector 0 Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdd, logical block 0 Jun 14 12:51:10 hostname kernel: sd 3:0:0:4: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required when i am checking the multipath -ll , this all devices are in passive path. Environment : RHEL 5.4 & EMC SAN Please suggest how to overcome this issue. Support will be highly helpful. Thanks in Advance -- Thanks, BSK ________________________________ ************************************************************************************** This message is confidential and intended only for the addressee. 
If you have received this message in error, please immediately notify the postmaster at nds.com and delete it from your system as well as any copies. The content of e-mails as well as traffic data may be monitored by NDS for employment and security purposes. To protect the environment please do not print this e-mail unless necessary. NDS Limited. Registered Office: One London Road, Staines, Middlesex, TW18 4EX, United Kingdom. A company registered in England and Wales. Registered no. 3080780. VAT no. GB 603 8808 40-00 ************************************************************************************** -------------- next part -------------- An HTML attachment was scrubbed... URL: From corey.kovacs at gmail.com Thu Jun 16 08:18:19 2011 From: corey.kovacs at gmail.com (Corey Kovacs) Date: Thu, 16 Jun 2011 09:18:19 +0100 Subject: [Linux-cluster] umount failing... In-Reply-To: References: Message-ID: My appologies for not getting back sooner. I am in the middle of a move. I cannot post my configs or logs (yeah, not helpful I know) but suffice it to say I strongly believe they are correct (I know, everyone says that). I've had other people look at them just make sure it wasn't a case of proofreading my own paper etc. and it always comes down to the umount failing. I have 6 other identical NFS services (save for the mount point/export location) and they all work flawlessly. That's why I am zeroing in on the use of '/home' as the culprit. Anyway, it's not a lot to go on I know, but I am just looking for directions to search for now. Thanks Corey On Mon, Jun 13, 2011 at 11:44 AM, Rajveer Singh wrote: > > > On Thu, Jun 9, 2011 at 8:22 PM, Corey Kovacs wrote: >> >> Folks, >> >> I have a 5 node cluster serving out several NFS exports, one of which is >> /home. >> >> All of the nfs services can be moved from node to node without problem >> except for the one providing /home. >> >> The logs on that node indicate the umount is failing and then the >> service is disabled (self-fence is not enabled). >> >> Even after the service is put into a failed state and then disabled >> manually, umount fails... >> >> I had noticed recently while playing with conga that creating a >> service for /home on a test cluster a warning was issued about >> reserved words and as I recall (i could be wrong) /home was among the >> illegal parameters for the mount point. >> >> I have turned everything off that I could think of which might be >> "holding" the mount and have run the various iterations of lsof, find >> etc. nothing shows up as having anything being actively used. >> >> This particular file system is 1TB. >> >> Is there something wrong with using /home as an export? >> >> Some specifics. >> >> RHEL5.6 (updated as of last week) >> HA-LVM protecting ext3 using the newer "preferred method" with clvmd >> Ext3 for exported file systems >> 5 nodes. >> >> >> Any ideas would be greatly appreciated. >> >> -C >> > Can you share your log file and cluster.conf file > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From fdinitto at redhat.com Thu Jun 16 13:13:14 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Thu, 16 Jun 2011 15:13:14 +0200 Subject: [Linux-cluster] resource agents 3.9.1 final release Message-ID: <4DFA016A.8040708@redhat.com> Hi everybody, The current resource agent repository [1] has been tagged to v3.9.1. Tarballs are also available [2]. This is the very first release of the common resource agent repository. 
It is a big milestone towards eliminating duplication of effort with the goal of improving the overall quality and user experience. There is still a long way to go but the first stone has been laid down. Highlights for the LHA resource agents set: - lxc, symlink: new resource agents - db2: major rewrite and support for master/slave mode of operation - exportfs: backup/restore of rmtab is back - mysql: multiple improvements for master/slave and replication - ocft: new tests for pgsql, postfix, and iscsi - CTDB: minor bug fixes - pgsql: improve configuration check and probe handling Highlights for the rgmanager resource agents set: - oracledb: use shutdown immediate - tomcat5: fix generated XML - nfsclient: fix client name mismatch - halvm: fix mirror dev failure - nfs: fix selinux integration Several changes have been made to the build system and the spec file to accommodate both projects? needs. The most noticeable change is the option to select "all", "linux-ha" or "rgmanager" resource agents at configuration time, which will also set the default for the spec file. Also several improvements have been made to correctly build srpm/rpms on different distributions in different versions. The full list of changes is available in the "ChangeLog" file for users, and in an auto-generated git-to-changelog file called "ChangeLog.devel". NOTE: About the 3.9.x version (particularly for linux-ha folks): This version was chosen simply because the rgmanager set was already at 3.1.x. In order to make it easier for distribution, and to keep package upgrades linear, we decided to bump the number higher than both projects. There is no other special meaning associated with it. Many thanks to everybody who helped with this release, in particular to the numerous contributors. Without you, the release would certainly not be possible. Cheers, The RAS Tribe [1] https://github.com/ClusterLabs/resource-agents/tarball/v3.9.1 [2] https://fedorahosted.org/releases/r/e/resource-agents/ PS: I am absolutely sure that URL [2] might give some people a fit, but we are still working to get a common release area. From fdinitto at redhat.com Thu Jun 16 13:48:32 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Thu, 16 Jun 2011 15:48:32 +0200 Subject: [Linux-cluster] cluster 3.1.2 stable release Message-ID: <4DFA09B0.2020001@redhat.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA256 Welcome to the cluster 3.1.2 release. This release contains several bug fixes and improvements. This version must be used in conjunction with resource-agents 3.9.1. The new source tarball can be downloaded here: https://fedorahosted.org/releases/c/l/cluster/cluster-3.1.2.tar.xz ChangeLog: https://fedorahosted.org/releases/c/l/cluster/Changelog-3.1.2 To report bugs or issues: https://bugzilla.redhat.com/ Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other sysadministrators or power users. Thanks/congratulations to all people that contributed to achieve this great milestone. 
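For anyone following along, fetching and unpacking the announced tarball looks roughly like this. The extract directory name and the configure/make steps are assumptions about the usual layout of the cluster 3.1.x tree, not official build instructions.

    wget https://fedorahosted.org/releases/c/l/cluster/cluster-3.1.2.tar.xz
    xz -d cluster-3.1.2.tar.xz && tar -xf cluster-3.1.2.tar
    cd cluster-3.1.2         # directory name assumed from the tarball name
    ./configure && make      # check the documentation in the tree for required options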
Happy clustering, Fabio -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/ iQIcBAEBCAAGBQJN+gmrAAoJEAgUGcMLQ3qJiXAQAIh4NyN6ZP66YhZk0lw7Zjz5 80KH5SI/agu1YhXeeXFCfwgJlFdWZj1SBP75Q5f+OxUW6uWsIDOBa26hGzFXj9H6 HdXrHUReb4/5TTat26Nd8aFLJ1jn35ltp3rBTqHqIJ9gYb7wZzHrLnret0HLV9S6 dG3G4uWNciru+Acb1cIW/ANBkioFO+f1GQiBF96txYfKojJNR9R3DRDQy8ysNDmn CCqQwaAFje/JO4w3qggwngFNJ0n0vizSU8kGm1UGFYLcjeqGZE+NDuu4OMWMC6/U KgtEL48VeHqRD/sJD//Tt99LVeL7VuAKBW79pfcYl8KUqVVDMXP9FqIA4okVcEr9 vPK23T3VZ3+6NJaZVEOSuYrjvNXOsi4yAa+rR8EiwnHSG2RXuxzuTyKl90HRrugO TIkvtUj9hqGj97AviBtCFZyRUhAH68sbVFiGDV6X0nLmY2gN1A8o0CpyI6hMhsIS MieJ9DbNjqj0b9GOzzD1EFMp65+wooZJMkku70Tbx3hKaxv28HotPCpvb7yRQUF9 j1AzFVG9YZn7FWbdQS3taPzjZNxvbKEvTpUzEz5I5xUZIRODY3uCbBHRPTNpDcpE J0WYvOqlO7rMCuHYG8tj12ejdgDGexQXJFG/q4lrMId3ATVBV1NuaHAAszDLYN8C +gKWDJ8aCqFKAwCCcQxe =1IUH -----END PGP SIGNATURE----- From gianluca.cecchi at gmail.com Thu Jun 16 14:44:50 2011 From: gianluca.cecchi at gmail.com (Gianluca Cecchi) Date: Thu, 16 Jun 2011 16:44:50 +0200 Subject: [Linux-cluster] [Linux-HA] resource agents 3.9.1 final release In-Reply-To: <4DFA016A.8040708@redhat.com> References: <4DFA016A.8040708@redhat.com> Message-ID: On Thu, Jun 16, 2011 at 3:13 PM, Fabio M. Di Nitto wrote: > Highlights for the rgmanager resource agents set: > > - oracledb: use shutdown immediate hello, from oracledb.sh.in I can see this actually is not a configurable parameter, so that I cannot choose between "immediate" and "abort", and I think it is not the best change. faction "Stopping Oracle Database:" stop_db immediate if [ $? -ne 0 ]; then faction "Stopping Oracle Database (hard):" stop_db abort || return 1 fi There are situations where an occurring problem could let a DB stuck on shutdown immediate, preventing completion of the command itself so you will never arrive to the error code condition to try the abort option... And also: " SHUTDOWN IMMEDIATE No new connections are allowed, nor are new transactions allowed to be started, after the statement is issued. Any uncommitted transactions are rolled back. (If long uncommitted transactions exist, this method of shutdown might not complete quickly, despite its name.) Oracle does not wait for users currently connected to the database to disconnect. Oracle implicitly rolls back active transactions and disconnects all connected users. " it is true that in case of shutdown abort you have anyway to rollback too, during the following crash recovery of startup phase, but I'd prefer to do this on the node where I'm going to land to and not on the node that I'm leaving (possibly because of a problem). In my opinion the only situation where "immediate" is better is for planned maintenance. Just my opininon. Keep on with the good job Gianluca From zagar at arlut.utexas.edu Thu Jun 16 18:17:13 2011 From: zagar at arlut.utexas.edu (Randy Zagar) Date: Thu, 16 Jun 2011 13:17:13 -0500 Subject: [Linux-cluster] Linux-cluster Digest, Vol 86, Issue 15 In-Reply-To: References: Message-ID: <4DFA48A9.6090109@arlut.utexas.edu> On 06/16/2011 11:00 AM, Corey Kovacs wrote: > My appologies for not getting back sooner. I am in the middle of a move. > > I cannot post my configs or logs (yeah, not helpful I know) but > suffice it to say I strongly believe they are correct (I know, > everyone says that). I've had other people look at them just make sure > it wasn't a case of proofreading my own paper etc. and it always comes > down to the umount failing. 
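For the /home umount failure being chased here, the usual first step is to ask the kernel what is still pinning the mount point. These are generic commands, not output or paths from the poster's systems.

    fuser -vm /home            # processes with open files anywhere under the mount
    lsof /home                 # the same view from lsof
    grep /home /proc/mounts    # anything (autofs, bind mounts) stacked on top of it
    exportfs -v | grep /home   # is the kernel NFS server still exporting it?

If nothing shows up in the first two, an active NFS export or an autofs map covering /home are the usual invisible holders, which is exactly the direction the reply further down takes.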
I have 6 other identical NFS services > (save for the mount point/export location) and they all work > flawlessly. That's why I am zeroing in on the use of '/home' as the > culprit. > > Anyway, it's not a lot to go on I know, but I am just looking for > directions to search for now. > > Thanks > > Corey There are several other services that might be interfering with your attempts to umount /home. In addition to NFS, my list of usual suspects includes: Apache, Samba, and Autofs. If these, or any other services, are configured to use users' home directories then you're going to have problems with umount. -RZ From martijn.storck at gmail.com Fri Jun 17 07:26:47 2011 From: martijn.storck at gmail.com (Martijn Storck) Date: Fri, 17 Jun 2011 09:26:47 +0200 Subject: [Linux-cluster] Replacing network switch in a cluster Message-ID: Hi all, Unfortunately I have to swap out the switch that is used for the cluster traffic of our 4-node cluster for a new one. I'm hoping I can do this by connecting the new switch to the old switch and then moving the nodes over one by one. Can I change the cluster configuration so that there is a longer grace period before a node is deemed 'lost' and gets fenced? The only line in my cluster.conf that looks like it has anything to do with it is this one: I think that with faststart enabled the link with a node will be down for only a few seconds. I realize that this probably means the cluster will lock up during that period (since we use a lot of GFS), but it's still better than having to bring the entire cluster down. Kind regards, Martijn Storck -------------- next part -------------- An HTML attachment was scrubbed... URL: From fdinitto at redhat.com Fri Jun 17 07:28:58 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Fri, 17 Jun 2011 09:28:58 +0200 Subject: [Linux-cluster] [Linux-HA] resource agents 3.9.1 final release In-Reply-To: References: <4DFA016A.8040708@redhat.com> Message-ID: <4DFB023A.4030409@redhat.com> Lon, what's your opinion on this one? On 06/16/2011 04:44 PM, Gianluca Cecchi wrote: > On Thu, Jun 16, 2011 at 3:13 PM, Fabio M. Di Nitto wrote: > >> Highlights for the rgmanager resource agents set: >> >> - oracledb: use shutdown immediate > > hello, > from oracledb.sh.in I can see this actually is not a configurable > parameter, so that I cannot choose between "immediate" and "abort", > and I think it is not the best change. > > > faction "Stopping Oracle Database:" stop_db immediate > if [ $? -ne 0 ]; then > faction "Stopping Oracle Database (hard):" stop_db > abort || return 1 > fi > > > There are situations where an occurring problem could let a DB stuck > on shutdown immediate, preventing completion of the command itself so > you will never arrive to the error code condition to try the abort > option... > And also: > " > SHUTDOWN IMMEDIATE > No new connections are allowed, nor are new transactions allowed to be > started, after the statement is issued. > Any uncommitted transactions are rolled back. (If long uncommitted > transactions exist, this method of shutdown might not complete > quickly, despite its name.) > Oracle does not wait for users currently connected to the database to > disconnect. Oracle implicitly rolls back active transactions and > disconnects all connected users. 
> " > > it is true that in case of shutdown abort you have anyway to rollback > too, during the following crash recovery of startup phase, but I'd > prefer to do this on the node where I'm going to land to and not on > the node that I'm leaving (possibly because of a problem). > In my opinion the only situation where "immediate" is better is for > planned maintenance. > > Just my opininon. > Keep on with the good job > Gianluca > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From gianluca.cecchi at gmail.com Fri Jun 17 07:58:47 2011 From: gianluca.cecchi at gmail.com (Gianluca Cecchi) Date: Fri, 17 Jun 2011 09:58:47 +0200 Subject: [Linux-cluster] [Linux-HA] resource agents 3.9.1 final release In-Reply-To: <4DFB023A.4030409@redhat.com> References: <4DFA016A.8040708@redhat.com> <4DFB023A.4030409@redhat.com> Message-ID: On Fri, Jun 17, 2011 at 9:28 AM, Fabio M. Di Nitto wrote: > Lon, what's your opinion on this one? Some other considerations of mine. This of the current "abort" default option (as in RH EL 5 cluster suite base) is indeed a difficulty, in case of planned maintenance, so that a change inside the agent giving choice and flexibility would be a great thing. I was thinking about making myself some change and then propose but had not the time unfortunately. Just to note, nowadays if we have a planned operation for the Oracle DB we go through this workflow: - DB service is DBSRV - clusvcadm -Z DBSRV - Operations on DB, such as shutdown immediate, patching, ecc.. - startup of DB - clusvcadm -U DBSRV If the planned operation involves patching of the OS and eventually cluster suite too, after testing on test cluster, we make sometyhing like this (from memory supposing a monoservice cluster): - detach from cluster and update standby node (eventually update both os and Oracle binaries as we manage their planned maintenance together) - DB service is DBSRV - clusvcadm -Z DBSRV on primary node - shutdown immediate of db - clusvcadm -U DBSRV ; clusvcadm -d DBSRV (*) - shutdown of primary node - startup of the updated node with the service DBSRV modified so that Oracle part is not inside (so only vip, lvm, fs parts are enabled) - verify that oracle startup with new OS and Oracle binaries is ok on the node - shutdown immediate of db - change cluster.conf to insert Oracle too inside DBSRV definition and have it started/monitored from rgmanager - update the ex-primary node too and start it to join the cluster (*) this is risky: it would be better to be able to disable a frozen service, eventually after asking confirmation for that.... An idea could be to have inside the clusvcadm command something like "soft stop" option: -ss And if inside the service there is oracledb.sh it parses this and change its "abort" flag in "immediate" This "soft stop" could be managed by other resources too... Gianluca From miha.valencic at gmail.com Fri Jun 17 08:13:59 2011 From: miha.valencic at gmail.com (Miha Valencic) Date: Fri, 17 Jun 2011 10:13:59 +0200 Subject: [Linux-cluster] Troubleshooting service relocation Message-ID: Hi! I'm trying to troubleshoot service migration, which happens once a day and I don't have a clue why. (i.e.: there is nothing wrong with it and there are no entries in the log file) The system is RHEL4 (Red Hat 4.1.2-46) with cluster version 2.0.52. Cluster software used to log events to /var/log/cluster.log as configured by the syslog facility local4.*, but those messages disappeared on May 6. 
The service we're running on the cluster is Zimbra, if that matters at all. The problem is, that there are no logging entries in the cluster.log file. If I issue 'logger -p local4.info 'test'' I see an entry in the cluster.log file, so syslog is obviously working. In the /etc/cluster/cluster.conf file, I see no logging configuration (and I guess there is none, looking at config schema described at http://sources.redhat.com/cluster/doc/cluster_schema_rhel4.html. How can I turn on logging or what else can I check? Thank you, Miha. -------------- next part -------------- An HTML attachment was scrubbed... URL: From raju.rajsand at gmail.com Fri Jun 17 17:33:22 2011 From: raju.rajsand at gmail.com (Rajagopal Swaminathan) Date: Fri, 17 Jun 2011 23:03:22 +0530 Subject: [Linux-cluster] Replacing network switch in a cluster In-Reply-To: References: Message-ID: Greeetings, On Fri, Jun 17, 2011 at 12:56 PM, Martijn Storck wrote: > Hi all, dunno muchabout the configs. Please makesure tht the cluster traffic ports are cpndfigured to multicast. -- Regards, Rajagopal From noreply at boxbe.com Fri Jun 17 16:05:21 2011 From: noreply at boxbe.com (noreply at boxbe.com) Date: Fri, 17 Jun 2011 09:05:21 -0700 (PDT) Subject: [Linux-cluster] Linux-cluster Digest, Vol 86, Issue 16 (Action Required) Message-ID: <2035445104.167092.1308326721513.JavaMail.prod@app006.boxbe.com> Dear sender, You will not receive any more courtesy notices from our members for two days. Messages you have sent will remain in a lower priority mailbox for our member to review at their leisure. Future messages will be more likely to be viewed if you are on our member's priority Guest List. Thank you, shanavasmca at gmail.com Powered by Boxbe -- "End Email Overload" Visit http://www.boxbe.com/how-it-works?tc=8429770443_542558017 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- An embedded message was scrubbed... From: linux-cluster-request at redhat.com Subject: Linux-cluster Digest, Vol 86, Issue 16 Date: Fri, 17 Jun 2011 12:00:06 -0400 Size: 2088 URL: From michael at ulimit.org Sat Jun 18 09:24:47 2011 From: michael at ulimit.org (Michael Pye) Date: Sat, 18 Jun 2011 10:24:47 +0100 Subject: [Linux-cluster] Troubleshooting service relocation In-Reply-To: References: Message-ID: <4DFC6EDF.5090202@ulimit.org> On 17/06/2011 09:13, Miha Valencic wrote: > How can I turn on logging or what else can I check? Take a look at this knowledgebase article: https://access.redhat.com/kb/docs/DOC-53500 Michael From share2dom at gmail.com Sun Jun 19 16:33:52 2011 From: share2dom at gmail.com (dOminic) Date: Sun, 19 Jun 2011 22:03:52 +0530 Subject: [Linux-cluster] Cluster Failover Failed In-Reply-To: References: Message-ID: Hi Balaji, Yes, the reported message is harmless ... However, you can try following 1) I would suggest you to set the filter setting in lvm.conf to properly scan your mpath* devices and local disks. 2) Enable blacklist section in multipath.conf eg: blacklist { devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*" devnode "^hd[a-z]" } # multipath -v2 Observe the box. Check whether that helps ... Regards, On Wed, Jun 15, 2011 at 12:16 AM, Balaji S wrote: > Hi, > In my setup implemented 10 tow node cluster's which running mysql as > cluster service, ipmi card as fencing device. 
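On the rgmanager logging question above: on RHEL4/5 the resource manager's verbosity and syslog facility are normally raised on the rm tag in cluster.conf, roughly as below. The attribute names and values are from memory and should be checked against the knowledgebase article referenced above; the config version still has to be bumped and the file propagated afterwards.

    <rm log_level="7" log_facility="local4">
        <!-- failoverdomains, resources and services unchanged -->
    </rm>

With local4 already routed to /var/log/cluster.log by syslog (the logger test above shows that path works), level 7 should bring the relocation messages back into that file.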
> > In my /var/log/messages i am keep getting the errors like below, > > Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdm, sector 0 > Jun 14 12:50:48 hostname kernel: sd 3:0:2:2: Device not ready: <6>: > Current: sense key: Not Ready > Jun 14 12:50:48 hostname kernel: Add. Sense: Logical unit not ready, > manual intervention required > Jun 14 12:50:48 hostname kernel: > Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdn, sector 0 > Jun 14 12:50:48 hostname kernel: sd 3:0:2:4: Device not ready: <6>: > Current: sense key: Not Ready > Jun 14 12:50:48 hostname kernel: Add. Sense: Logical unit not ready, > manual intervention required > Jun 14 12:50:48 hostname kernel: > Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdp, sector 0 > Jun 14 12:51:10 hostname kernel: sd 3:0:0:1: Device not ready: <6>: > Current: sense key: Not Ready > Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, > manual intervention required > Jun 14 12:51:10 hostname kernel: > Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdc, sector 0 > Jun 14 12:51:10 hostname kernel: printk: 3 messages suppressed. > Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdc, logical > block 0 > Jun 14 12:51:10 hostname kernel: sd 3:0:0:2: Device not ready: <6>: > Current: sense key: Not Ready > Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, > manual intervention required > Jun 14 12:51:10 hostname kernel: > Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdd, sector 0 > Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdd, logical > block 0 > Jun 14 12:51:10 hostname kernel: sd 3:0:0:4: Device not ready: <6>: > Current: sense key: Not Ready > Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, > manual intervention required > > > when i am checking the multipath -ll , this all devices are in passive > path. > > Environment : > > RHEL 5.4 & EMC SAN > > Please suggest how to overcome this issue. Support will be highly helpful. > Thanks in Advance > > > -- > Thanks, > BSK > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From share2dom at gmail.com Sun Jun 19 16:42:35 2011 From: share2dom at gmail.com (dOminic) Date: Sun, 19 Jun 2011 22:12:35 +0530 Subject: [Linux-cluster] umount failing... In-Reply-To: References: Message-ID: selinux is in Enforced mode ( worth checking audit.log ) ? .If yes, try selinux to permissive or disabled mode and check . Regards, On Thu, Jun 16, 2011 at 1:48 PM, Corey Kovacs wrote: > My appologies for not getting back sooner. I am in the middle of a move. > > I cannot post my configs or logs (yeah, not helpful I know) but > suffice it to say I strongly believe they are correct (I know, > everyone says that). I've had other people look at them just make sure > it wasn't a case of proofreading my own paper etc. and it always comes > down to the umount failing. I have 6 other identical NFS services > (save for the mount point/export location) and they all work > flawlessly. That's why I am zeroing in on the use of '/home' as the > culprit. > > Anyway, it's not a lot to go on I know, but I am just looking for > directions to search for now. 
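To check the SELinux angle suggested above without changing anything permanently, something like this is enough; generic commands, nothing host-specific.

    getenforce                                    # Enforcing / Permissive / Disabled
    grep -i avc /var/log/audit/audit.log | tail   # recent denials, if any
    setenforce 0                                  # permissive for the duration of a test only

If the umount starts working in permissive mode, the avc entries in audit.log identify the offending context.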
> > Thanks > > Corey > > On Mon, Jun 13, 2011 at 11:44 AM, Rajveer Singh > wrote: > > > > > > On Thu, Jun 9, 2011 at 8:22 PM, Corey Kovacs > wrote: > >> > >> Folks, > >> > >> I have a 5 node cluster serving out several NFS exports, one of which is > >> /home. > >> > >> All of the nfs services can be moved from node to node without problem > >> except for the one providing /home. > >> > >> The logs on that node indicate the umount is failing and then the > >> service is disabled (self-fence is not enabled). > >> > >> Even after the service is put into a failed state and then disabled > >> manually, umount fails... > >> > >> I had noticed recently while playing with conga that creating a > >> service for /home on a test cluster a warning was issued about > >> reserved words and as I recall (i could be wrong) /home was among the > >> illegal parameters for the mount point. > >> > >> I have turned everything off that I could think of which might be > >> "holding" the mount and have run the various iterations of lsof, find > >> etc. nothing shows up as having anything being actively used. > >> > >> This particular file system is 1TB. > >> > >> Is there something wrong with using /home as an export? > >> > >> Some specifics. > >> > >> RHEL5.6 (updated as of last week) > >> HA-LVM protecting ext3 using the newer "preferred method" with clvmd > >> Ext3 for exported file systems > >> 5 nodes. > >> > >> > >> Any ideas would be greatly appreciated. > >> > >> -C > >> > > Can you share your log file and cluster.conf file > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From share2dom at gmail.com Sun Jun 19 16:44:56 2011 From: share2dom at gmail.com (dOminic) Date: Sun, 19 Jun 2011 22:14:56 +0530 Subject: [Linux-cluster] Plugged out blade from bladecenter chassis - fence_bladecenter failed In-Reply-To: References: Message-ID: There is a bug related to missing_as_off - https://bugzilla.redhat.com/show_bug.cgi?id=689851 - expects the fix in rhel5u7 . regards, On Wed, Apr 27, 2011 at 1:59 PM, Parvez Shaikh wrote: > Hi all, > > I am using RHCS on IBM bladecenter with blade center fencing. I plugged out > a blade from blade center chassis slot and was hoping that failover to > occur. However when I did so, I get following message - > > fenced[10240]: agent "fence_bladecenter" reports: Failed: Unable to obtain > correct plug status or plug is not available > fenced[10240]: fence "blade1" failed > > Is this supported that if I plug out blade from its slot, then failover > occur without manual intervention? If so, which fencing must I use? > > Thanks, > Parvez > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From parvez.h.shaikh at gmail.com Mon Jun 20 05:16:41 2011 From: parvez.h.shaikh at gmail.com (Parvez Shaikh) Date: Mon, 20 Jun 2011 10:46:41 +0530 Subject: [Linux-cluster] Plugged out blade from bladecenter chassis - fence_bladecenter failed In-Reply-To: References: Message-ID: Hi Thanks Dominic, Do fence_bladecenter "reboot" the blade as a part of fencing always? I have seen it turning the blade off by default. Through fence_bladecenter --missing-as-off...... 
-o off returns me a correct result when run from command line but fencing fails through "fenced". I am using RHEL 5.5 ES and fence_bladecenter version reports following - fence_bladecenter -V 2.0.115 (built Tue Dec 22 10:05:55 EST 2009) Copyright (C) Red Hat, Inc. 2004 All rights reserved. Anyway thanks for bugzilla reference Regards On Sun, Jun 19, 2011 at 10:14 PM, dOminic wrote: > There is a bug related to missing_as_off - > https://bugzilla.redhat.com/show_bug.cgi?id=689851 - expects the fix in > rhel5u7 . > > regards, > > On Wed, Apr 27, 2011 at 1:59 PM, Parvez Shaikh wrote: > >> Hi all, >> >> I am using RHCS on IBM bladecenter with blade center fencing. I plugged >> out a blade from blade center chassis slot and was hoping that failover to >> occur. However when I did so, I get following message - >> >> fenced[10240]: agent "fence_bladecenter" reports: Failed: Unable to obtain >> correct plug status or plug is not available >> fenced[10240]: fence "blade1" failed >> >> Is this supported that if I plug out blade from its slot, then failover >> occur without manual intervention? If so, which fencing must I use? >> >> Thanks, >> Parvez >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From skjbalaji at gmail.com Mon Jun 20 16:13:46 2011 From: skjbalaji at gmail.com (Balaji S) Date: Mon, 20 Jun 2011 21:43:46 +0530 Subject: [Linux-cluster] Linux-cluster Digest, Vol 86, Issue 18 In-Reply-To: References: Message-ID: Thanks dominic, i have added the filter things in lvm.conf, still i am getting same error messages, here below i am mentioning the lines i have added in lvm.conf, still aything need to modify to avoid this kind of error in system messages. filter = [ "a|/dev/mapper|", "a|/dev/sda|", "r/.*/" ] On Mon, Jun 20, 2011 at 9:30 PM, wrote: > Send Linux-cluster mailing list submissions to > linux-cluster at redhat.com > > To subscribe or unsubscribe via the World Wide Web, visit > https://www.redhat.com/mailman/listinfo/linux-cluster > or, via email, send a message with subject or body 'help' to > linux-cluster-request at redhat.com > > You can reach the person managing the list at > linux-cluster-owner at redhat.com > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of Linux-cluster digest..." > > > Today's Topics: > > 1. Re: Cluster Failover Failed (dOminic) > 2. Re: umount failing... (dOminic) > 3. Re: Plugged out blade from bladecenter chassis - > fence_bladecenter failed (dOminic) > 4. Re: Plugged out blade from bladecenter chassis - > fence_bladecenter failed (Parvez Shaikh) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Sun, 19 Jun 2011 22:03:52 +0530 > From: dOminic > To: linux clustering > Subject: Re: [Linux-cluster] Cluster Failover Failed > Message-ID: > Content-Type: text/plain; charset="iso-8859-1" > > Hi Balaji, > > Yes, the reported message is harmless ... However, you can try following > > 1) I would suggest you to set the filter setting in lvm.conf to properly > scan your mpath* devices and local disks. > 2) Enable blacklist section in multipath.conf eg: > > blacklist { > devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*" > devnode "^hd[a-z]" > } > > # multipath -v2 > > Observe the box. 
Check whether that helps ... > > Regards, > > ------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > End of Linux-cluster Digest, Vol 86, Issue 18 > ********************************************* -- Thanks, Balaji S -------------- next part -------------- An HTML attachment was scrubbed... URL: From fdinitto at redhat.com Mon Jun 20 16:46:28 2011 From: fdinitto at redhat.com (Fabio M.
Di Nitto) Date: Mon, 20 Jun 2011 18:46:28 +0200 Subject: [Linux-cluster] cluster 3.1.3 stable release Message-ID: <4DFF7964.6040504@redhat.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA256 Welcome to the cluster 3.1.3 release. This release fixes a build issue in dlm_controld with any kernel older than 3.0. The new source tarball can be downloaded here: https://fedorahosted.org/releases/c/l/cluster/cluster-3.1.3.tar.xz ChangeLog: https://fedorahosted.org/releases/c/l/cluster/Changelog-3.1.3 To report bugs or issues: https://bugzilla.redhat.com/ Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other sysadministrators or power users. Thanks/congratulations to all people that contributed to achieve this great milestone. Happy clustering, Fabio -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iQIcBAEBCAAGBQJN/3lhAAoJEFA6oBJjVJ+O9PkP/073d0Y1ydvLbm7Yui5/ttlI aSsy3dt53QzcI1MY2MEsqWW3T4MlJMaM/kbUWcTKGy83OeZzLC13WtMb/Hyyt6Oo 2cqmnEMAbTZ89rjiJurt0aj42QOscYBBZfMR72njK94WTWGVrQ8q/vSq/BNzdUkw sS0cbb/7BpS3RN1uXxm+x0x2DJMB9uIK6s9So0hXRPL2m/wwANIoD/k9T9D6B/ax 5L2kqAATPtPbl+2H5yk3FxSgJf+bLbcxd2kbHpPxWawG6W3qRsPLi3mBBOYXQF4B y4Mj9N5i97BgrGT/nqCxqMHh2S+pKEe4gulZcQIYl/JaZhZ9LuoODRhKo/UL7i63 3n8RbPCIiIJzI7eJVbmSfcGk11ZRRxJ4nbTdvJVylaFe8bCjvTO+eLHgkTPTlDGj WWqt9uNWdvQuef77G0TOaZcvMphw1VduXLvtU2wejpfVAzz+lEprL+VthSrbNfxf HggKRDxgsrAYbJ4LgJPt/ApkhWx/HhrJYJfSkNTQOXAY3JKuOrWwbJyx9woCQu1c wIUnrQ2VB/CmKNTDT4AFYWA/GV3d/4FuijTvd3LcTKtWoCOVdKGic/MmFjBvJk/R kbSG8JMpTm02w2L0G+WDhMdC58GGQHB6GhQ8Nr6aAu55QPWlijwHtgUYw4xYSKdn 0D9vQsSNlYWiQZAmAwhg =Vvy3 -----END PGP SIGNATURE----- From share2dom at gmail.com Tue Jun 21 12:52:49 2011 From: share2dom at gmail.com (dOminic) Date: Tue, 21 Jun 2011 18:22:49 +0530 Subject: [Linux-cluster] Cluster Failover Failed In-Reply-To: References: Message-ID: Hi, Btw, how many HBAs are present in your box ? . Problem is with scsi3 only ?. Refer https://access.redhat.com/kb/docs/DOC-2991 , then set the filter. Also, I would suggest you to open ticket with Linux vendor if IO errors are belongs to Active paths. Pointed IO errors are belongs to disk that in passive paths group ?. you can verify the same in multipath-ll output . regards, On Sun, Jun 19, 2011 at 10:03 PM, dOminic wrote: > Hi Balaji, > > Yes, the reported message is harmless ... However, you can try following > > 1) I would suggest you to set the filter setting in lvm.conf to properly > scan your mpath* devices and local disks. > 2) Enable blacklist section in multipath.conf eg: > > blacklist { > devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*" > devnode "^hd[a-z]" > } > > # multipath -v2 > > Observe the box. Check whether that helps ... > > > Regards, > > > On Wed, Jun 15, 2011 at 12:16 AM, Balaji S wrote: > >> Hi, >> In my setup implemented 10 tow node cluster's which running mysql as >> cluster service, ipmi card as fencing device. >> >> In my /var/log/messages i am keep getting the errors like below, >> >> Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdm, sector 0 >> Jun 14 12:50:48 hostname kernel: sd 3:0:2:2: Device not ready: <6>: >> Current: sense key: Not Ready >> Jun 14 12:50:48 hostname kernel: Add. 
Sense: Logical unit not ready, >> manual intervention required >> Jun 14 12:50:48 hostname kernel: >> Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdn, sector 0 >> Jun 14 12:50:48 hostname kernel: sd 3:0:2:4: Device not ready: <6>: >> Current: sense key: Not Ready >> Jun 14 12:50:48 hostname kernel: Add. Sense: Logical unit not ready, >> manual intervention required >> Jun 14 12:50:48 hostname kernel: >> Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdp, sector 0 >> Jun 14 12:51:10 hostname kernel: sd 3:0:0:1: Device not ready: <6>: >> Current: sense key: Not Ready >> Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, >> manual intervention required >> Jun 14 12:51:10 hostname kernel: >> Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdc, sector 0 >> Jun 14 12:51:10 hostname kernel: printk: 3 messages suppressed. >> Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdc, logical >> block 0 >> Jun 14 12:51:10 hostname kernel: sd 3:0:0:2: Device not ready: <6>: >> Current: sense key: Not Ready >> Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, >> manual intervention required >> Jun 14 12:51:10 hostname kernel: >> Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdd, sector 0 >> Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdd, logical >> block 0 >> Jun 14 12:51:10 hostname kernel: sd 3:0:0:4: Device not ready: <6>: >> Current: sense key: Not Ready >> Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, >> manual intervention required >> >> >> when i am checking the multipath -ll , this all devices are in passive >> path. >> >> Environment : >> >> RHEL 5.4 & EMC SAN >> >> Please suggest how to overcome this issue. Support will be highly helpful. >> Thanks in Advance >> >> >> -- >> Thanks, >> BSK >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From miha.valencic at gmail.com Tue Jun 21 13:31:13 2011 From: miha.valencic at gmail.com (Miha Valencic) Date: Tue, 21 Jun 2011 15:31:13 +0200 Subject: [Linux-cluster] Troubleshooting service relocation In-Reply-To: <4DFC6EDF.5090202@ulimit.org> References: <4DFC6EDF.5090202@ulimit.org> Message-ID: Michael, I've configured the logging on RM and am now waiting for it to switch nodes. Hopefully, I can see a reason why it is relocating. Thanks, Miha. On Sat, Jun 18, 2011 at 11:24 AM, Michael Pye wrote: > On 17/06/2011 09:13, Miha Valencic wrote: > > How can I turn on logging or what else can I check? > > Take a look at this knowledgebase article: > https://access.redhat.com/kb/docs/DOC-53500 > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rossnick-lists at cybercat.ca Tue Jun 21 13:57:38 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Tue, 21 Jun 2011 09:57:38 -0400 Subject: [Linux-cluster] GFS2 fatal: filesystem consistency error Message-ID: 8 node cluster, fiber channel hbas and disks access trough a qlogic fabric. 
I've got hit 3 times with this error on different nodes : GFS2: fsid=CyberCluster:GizServer.1: fatal: filesystem consistency error GFS2: fsid=CyberCluster:GizServer.1: inode = 9582 6698267 GFS2: fsid=CyberCluster:GizServer.1: function = gfs2_dinode_dealloc, file = fs/gfs2/inode.c, line = 352 GFS2: fsid=CyberCluster:GizServer.1: about to withdraw this file system GFS2: fsid=CyberCluster:GizServer.1: telling LM to unmount GFS2: fsid=CyberCluster:GizServer.1: withdrawn Pid: 2659, comm: delete_workqueu Tainted: G W ---------------- T 2.6.32-131.2.1.el6.x86_64 #1 Call Trace: [] ? gfs2_lm_withdraw+0x102/0x130 [gfs2] [] ? trunc_dealloc+0xa9/0x130 [gfs2] [] ? gfs2_consist_inode_i+0x5d/0x60 [gfs2] [] ? gfs2_dinode_dealloc+0x64/0x210 [gfs2] [] ? gfs2_delete_inode+0x1ba/0x280 [gfs2] [] ? gfs2_delete_inode+0x8d/0x280 [gfs2] [] ? gfs2_delete_inode+0x0/0x280 [gfs2] [] ? generic_delete_inode+0xde/0x1d0 [] ? delete_work_func+0x0/0x80 [gfs2] [] ? generic_drop_inode+0x65/0x80 [] ? gfs2_drop_inode+0x2e/0x30 [gfs2] [] ? iput+0x62/0x70 [] ? delete_work_func+0x54/0x80 [gfs2] [] ? worker_thread+0x170/0x2a0 [] ? autoremove_wake_function+0x0/0x40 [] ? worker_thread+0x0/0x2a0 [] ? kthread+0x96/0xa0 [] ? child_rip+0xa/0x20 [] ? kthread+0x0/0xa0 [] ? child_rip+0x0/0x20 no_formal_ino = 9582 no_addr = 6698267 i_disksize = 6838 blocks = 0 i_goal = 6698304 i_diskflags = 0x00000000 i_height = 1 i_depth = 0 i_entries = 0 i_eattr = 0 GFS2: fsid=CyberCluster:GizServer.1: gfs2_delete_inode: -5 gdlm_unlock 5,66351b err=-22 Only, with different inodes each time. After that event, services running on that filesystem are marked failed and not moved over another node. Any access to that fs yields I/O error. Server needed to be rebooted to properly work again. I did ran a fsck last night on that filesystem, and it did find some errors, but nothing serious. Lots (realy lots) of those : Ondisk and fsck bitmaps differ at block 5771602 (0x581152) Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free) Metadata type is 0 (free) Fix bitmap for block 5771602 (0x581152) ? (y/n) And after completing the fsck, I started back some services, and I got the same error on another filesystem that is practily empty and used for small utilities used troughout the cluster... What should I do to find the source of this problem ? From rpeterso at redhat.com Tue Jun 21 14:42:40 2011 From: rpeterso at redhat.com (Bob Peterson) Date: Tue, 21 Jun 2011 10:42:40 -0400 (EDT) Subject: [Linux-cluster] GFS2 fatal: filesystem consistency error In-Reply-To: Message-ID: <1036238479.689034.1308667360488.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> ----- Original Message ----- | 8 node cluster, fiber channel hbas and disks access trough a qlogic | fabric. | | I've got hit 3 times with this error on different nodes : | | GFS2: fsid=CyberCluster:GizServer.1: fatal: filesystem consistency | error | GFS2: fsid=CyberCluster:GizServer.1: inode = 9582 6698267 | GFS2: fsid=CyberCluster:GizServer.1: function = gfs2_dinode_dealloc, | file = | fs/gfs2/inode.c, line = 352 | GFS2: fsid=CyberCluster:GizServer.1: about to withdraw this file | system | GFS2: fsid=CyberCluster:GizServer.1: telling LM to unmount | GFS2: fsid=CyberCluster:GizServer.1: withdrawn | Pid: 2659, comm: delete_workqueu Tainted: G W ---------------- T | 2.6.32-131.2.1.el6.x86_64 #1 | Call Trace: | [] ? gfs2_lm_withdraw+0x102/0x130 [gfs2] | [] ? trunc_dealloc+0xa9/0x130 [gfs2] | [] ? gfs2_consist_inode_i+0x5d/0x60 [gfs2] | [] ? gfs2_dinode_dealloc+0x64/0x210 [gfs2] | [] ? 
gfs2_delete_inode+0x1ba/0x280 [gfs2] | [] ? gfs2_delete_inode+0x8d/0x280 [gfs2] | [] ? gfs2_delete_inode+0x0/0x280 [gfs2] | [] ? generic_delete_inode+0xde/0x1d0 | [] ? delete_work_func+0x0/0x80 [gfs2] | [] ? generic_drop_inode+0x65/0x80 | [] ? gfs2_drop_inode+0x2e/0x30 [gfs2] | [] ? iput+0x62/0x70 | [] ? delete_work_func+0x54/0x80 [gfs2] | [] ? worker_thread+0x170/0x2a0 | [] ? autoremove_wake_function+0x0/0x40 | [] ? worker_thread+0x0/0x2a0 | [] ? kthread+0x96/0xa0 | [] ? child_rip+0xa/0x20 | [] ? kthread+0x0/0xa0 | [] ? child_rip+0x0/0x20 | no_formal_ino = 9582 | no_addr = 6698267 | i_disksize = 6838 | blocks = 0 | i_goal = 6698304 | i_diskflags = 0x00000000 | i_height = 1 | i_depth = 0 | i_entries = 0 | i_eattr = 0 | GFS2: fsid=CyberCluster:GizServer.1: gfs2_delete_inode: -5 | gdlm_unlock 5,66351b err=-22 | | | Only, with different inodes each time. | | After that event, services running on that filesystem are marked | failed and | not moved over another node. Any access to that fs yields I/O error. | Server | needed to be rebooted to properly work again. | | I did ran a fsck last night on that filesystem, and it did find some | errors, | but nothing serious. Lots (realy lots) of those : | | Ondisk and fsck bitmaps differ at block 5771602 (0x581152) | Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free) | Metadata type is 0 (free) | Fix bitmap for block 5771602 (0x581152) ? (y/n) | | And after completing the fsck, I started back some services, and I got | the | same error on another filesystem that is practily empty and used for | small | utilities used troughout the cluster... | | What should I do to find the source of this problem ? Hi, I believe this is a GFS2 bug we've already solved. Please contact Red Hat Support. Regards, Bob Peterson Red Hat File Systems From swhiteho at redhat.com Tue Jun 21 14:46:07 2011 From: swhiteho at redhat.com (Steven Whitehouse) Date: Tue, 21 Jun 2011 15:46:07 +0100 Subject: [Linux-cluster] GFS2 fatal: filesystem consistency error In-Reply-To: References: Message-ID: <1308667567.2762.15.camel@menhir> Hi, On Tue, 2011-06-21 at 09:57 -0400, Nicolas Ross wrote: > 8 node cluster, fiber channel hbas and disks access trough a qlogic fabric. > > I've got hit 3 times with this error on different nodes : > > GFS2: fsid=CyberCluster:GizServer.1: fatal: filesystem consistency error > GFS2: fsid=CyberCluster:GizServer.1: inode = 9582 6698267 > GFS2: fsid=CyberCluster:GizServer.1: function = gfs2_dinode_dealloc, file = > fs/gfs2/inode.c, line = 352 > GFS2: fsid=CyberCluster:GizServer.1: about to withdraw this file system > GFS2: fsid=CyberCluster:GizServer.1: telling LM to unmount > GFS2: fsid=CyberCluster:GizServer.1: withdrawn > Pid: 2659, comm: delete_workqueu Tainted: G W ---------------- T > 2.6.32-131.2.1.el6.x86_64 #1 > Call Trace: > [] ? gfs2_lm_withdraw+0x102/0x130 [gfs2] > [] ? trunc_dealloc+0xa9/0x130 [gfs2] > [] ? gfs2_consist_inode_i+0x5d/0x60 [gfs2] > [] ? gfs2_dinode_dealloc+0x64/0x210 [gfs2] > [] ? gfs2_delete_inode+0x1ba/0x280 [gfs2] > [] ? gfs2_delete_inode+0x8d/0x280 [gfs2] > [] ? gfs2_delete_inode+0x0/0x280 [gfs2] > [] ? generic_delete_inode+0xde/0x1d0 > [] ? delete_work_func+0x0/0x80 [gfs2] > [] ? generic_drop_inode+0x65/0x80 > [] ? gfs2_drop_inode+0x2e/0x30 [gfs2] > [] ? iput+0x62/0x70 > [] ? delete_work_func+0x54/0x80 [gfs2] > [] ? worker_thread+0x170/0x2a0 > [] ? autoremove_wake_function+0x0/0x40 > [] ? worker_thread+0x0/0x2a0 > [] ? kthread+0x96/0xa0 > [] ? child_rip+0xa/0x20 > [] ? kthread+0x0/0xa0 > [] ? 
child_rip+0x0/0x20 > no_formal_ino = 9582 > no_addr = 6698267 > i_disksize = 6838 > blocks = 0 > i_goal = 6698304 > i_diskflags = 0x00000000 > i_height = 1 > i_depth = 0 > i_entries = 0 > i_eattr = 0 > GFS2: fsid=CyberCluster:GizServer.1: gfs2_delete_inode: -5 > gdlm_unlock 5,66351b err=-22 > > > Only, with different inodes each time. > > After that event, services running on that filesystem are marked failed and > not moved over another node. Any access to that fs yields I/O error. Server > needed to be rebooted to properly work again. > > I did ran a fsck last night on that filesystem, and it did find some errors, > but nothing serious. Lots (realy lots) of those : > > Ondisk and fsck bitmaps differ at block 5771602 (0x581152) > Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free) > Metadata type is 0 (free) > Fix bitmap for block 5771602 (0x581152) ? (y/n) > > And after completing the fsck, I started back some services, and I got the > same error on another filesystem that is practily empty and used for small > utilities used troughout the cluster... > > What should I do to find the source of this problem ? > I suspect that this is a know problem, bz #712139 if you have access to the Red Hat bugzilla. There is a fix available via our usual support channels. Note that this particular bug is highly version specific so it only applies to RHEL 6.1 and no other version (either RHEL or upstream), Steve. From rossnick-lists at cybercat.ca Tue Jun 21 14:58:23 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Tue, 21 Jun 2011 10:58:23 -0400 Subject: [Linux-cluster] GFS2 fatal: filesystem consistency error References: <1308667567.2762.15.camel@menhir> Message-ID: <3C1F816785264B95A73FBD89C03EBAB5@versa> >> And after completing the fsck, I started back some services, and I got >> the >> same error on another filesystem that is practily empty and used for >> small >> utilities used troughout the cluster... >> >> What should I do to find the source of this problem ? >> > > I suspect that this is a know problem, bz #712139 if you have access to > the Red Hat bugzilla. There is a fix available via our usual support > channels. Note that this particular bug is highly version specific so it > only applies to RHEL 6.1 and no other version (either RHEL or upstream), Thanks, I am indeed at 6.1. I did find this bug while googling yesterday for that, I will contact support once I got the why I don't have support for resilient storage enabled cleared... From noreply at boxbe.com Tue Jun 21 14:52:45 2011 From: noreply at boxbe.com (noreply at boxbe.com) Date: Tue, 21 Jun 2011 07:52:45 -0700 (PDT) Subject: [Linux-cluster] GFS2 fatal: filesystem consistency error (Action Required) Message-ID: <1933689685.1398153.1308667965985.JavaMail.prod@app010.dmz> Hello linux clustering, You will not receive any more courtesy notices from our members for two days. Messages you have sent will remain in a lower priority mailbox for our member to review at their leisure. Future messages will be more likely to be viewed if you are on our member's priority Guest List. Thank you, debjyoti.mail at gmail.com Powered by Boxbe -- "End Email Overload" Visit http://www.boxbe.com/how-it-works?tc=8467960205_652083268 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- An embedded message was scrubbed... 
From: Steven Whitehouse Subject: Re: [Linux-cluster] GFS2 fatal: filesystem consistency error Date: Tue, 21 Jun 2011 15:46:07 +0100 Size: 2870 URL: From skjbalaji at gmail.com Wed Jun 22 03:01:06 2011 From: skjbalaji at gmail.com (Balaji S) Date: Wed, 22 Jun 2011 08:31:06 +0530 Subject: [Linux-cluster] Linux-cluster Digest, Vol 86, Issue 19 In-Reply-To: References: Message-ID: Hi Dominic, Yes the errors are only belongs to passive path. > ------------------------------ > > Message: 3 > Date: Tue, 21 Jun 2011 18:22:49 +0530 > From: dOminic > To: linux clustering > Subject: Re: [Linux-cluster] Cluster Failover Failed > Message-ID: > Content-Type: text/plain; charset="iso-8859-1" > > Hi, > > Btw, how many HBAs are present in your box ? . Problem is with scsi3 only > ?. > > Refer https://access.redhat.com/kb/docs/DOC-2991 , then set the filter. > Also, I would suggest you to open ticket with Linux vendor if IO errors are > belongs to Active paths. > > Pointed IO errors are belongs to disk that in passive paths group ?. you > can > verify the same in multipath-ll output . > > regards, > > On Sun, Jun 19, 2011 at 10:03 PM, dOminic wrote: > > > Hi Balaji, > > > > Yes, the reported message is harmless ... However, you can try following > > > > 1) I would suggest you to set the filter setting in lvm.conf to properly > > scan your mpath* devices and local disks. > > 2) Enable blacklist section in multipath.conf eg: > > > > blacklist { > > devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*" > > devnode "^hd[a-z]" > > } > > > > # multipath -v2 > > > > Observe the box. Check whether that helps ... > > > > > > Regards, > > > > > > On Wed, Jun 15, 2011 at 12:16 AM, Balaji S wrote: > > > >> Hi, > >> In my setup implemented 10 tow node cluster's which running mysql as > >> cluster service, ipmi card as fencing device. > >> > >> In my /var/log/messages i am keep getting the errors like below, > >> > >> Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdm, sector > 0 > >> Jun 14 12:50:48 hostname kernel: sd 3:0:2:2: Device not ready: <6>: > >> Current: sense key: Not Ready > >> Jun 14 12:50:48 hostname kernel: Add. Sense: Logical unit not ready, > >> manual intervention required > >> Jun 14 12:50:48 hostname kernel: > >> Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdn, sector > 0 > >> Jun 14 12:50:48 hostname kernel: sd 3:0:2:4: Device not ready: <6>: > >> Current: sense key: Not Ready > >> Jun 14 12:50:48 hostname kernel: Add. Sense: Logical unit not ready, > >> manual intervention required > >> Jun 14 12:50:48 hostname kernel: > >> Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdp, sector > 0 > >> Jun 14 12:51:10 hostname kernel: sd 3:0:0:1: Device not ready: <6>: > >> Current: sense key: Not Ready > >> Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, > >> manual intervention required > >> Jun 14 12:51:10 hostname kernel: > >> Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdc, sector > 0 > >> Jun 14 12:51:10 hostname kernel: printk: 3 messages suppressed. > >> Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdc, logical > >> block 0 > >> Jun 14 12:51:10 hostname kernel: sd 3:0:0:2: Device not ready: <6>: > >> Current: sense key: Not Ready > >> Jun 14 12:51:10 hostname kernel: Add. 
Sense: Logical unit not ready, > >> manual intervention required > >> Jun 14 12:51:10 hostname kernel: > >> Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdd, sector > 0 > >> Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdd, logical > >> block 0 > >> Jun 14 12:51:10 hostname kernel: sd 3:0:0:4: Device not ready: <6>: > >> Current: sense key: Not Ready > >> Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, > >> manual intervention required > >> > >> > >> when i am checking the multipath -ll , this all devices are in passive > >> path. > >> > >> Environment : > >> > >> RHEL 5.4 & EMC SAN > >> > >> Please suggest how to overcome this issue. Support will be highly > helpful. > >> Thanks in Advance > >> > >> > >> -- > >> Thanks, > >> BSK > >> > >> -- > >> Linux-cluster mailing list > >> Linux-cluster at redhat.com > >> https://www.redhat.com/mailman/listinfo/linux-cluster > >> > > > > > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: < > https://www.redhat.com/archives/linux-cluster/attachments/20110621/e41e841c/attachment.html > > > > ------------------------------ > > Message: 4 > Date: Tue, 21 Jun 2011 15:31:13 +0200 > From: Miha Valencic > To: linux clustering > Subject: Re: [Linux-cluster] Troubleshooting service relocation > Message-ID: > Content-Type: text/plain; charset="utf-8" > > Michael, I've configured the logging on RM and am now waiting for it to > switch nodes. Hopefully, I can see a reason why it is relocating. > > Thanks, > Miha. > > On Sat, Jun 18, 2011 at 11:24 AM, Michael Pye wrote: > > > On 17/06/2011 09:13, Miha Valencic wrote: > > > How can I turn on logging or what else can I check? > > > > Take a look at this knowledgebase article: > > https://access.redhat.com/kb/docs/DOC-53500 > > > > > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: < > https://www.redhat.com/archives/linux-cluster/attachments/20110621/19a643fd/attachment.html > > > > ------------------------------ > > Message: 5 > Date: Tue, 21 Jun 2011 09:57:38 -0400 > From: "Nicolas Ross" > To: "linux clustering" > Subject: [Linux-cluster] GFS2 fatal: filesystem consistency error > Message-ID: > Content-Type: text/plain; format=flowed; charset="iso-8859-1"; > reply-type=original > > 8 node cluster, fiber channel hbas and disks access trough a qlogic fabric. > > I've got hit 3 times with this error on different nodes : > > GFS2: fsid=CyberCluster:GizServer.1: fatal: filesystem consistency error > GFS2: fsid=CyberCluster:GizServer.1: inode = 9582 6698267 > GFS2: fsid=CyberCluster:GizServer.1: function = gfs2_dinode_dealloc, file = > fs/gfs2/inode.c, line = 352 > GFS2: fsid=CyberCluster:GizServer.1: about to withdraw this file system > GFS2: fsid=CyberCluster:GizServer.1: telling LM to unmount > GFS2: fsid=CyberCluster:GizServer.1: withdrawn > Pid: 2659, comm: delete_workqueu Tainted: G W ---------------- T > 2.6.32-131.2.1.el6.x86_64 #1 > Call Trace: > [] ? gfs2_lm_withdraw+0x102/0x130 [gfs2] > [] ? trunc_dealloc+0xa9/0x130 [gfs2] > [] ? gfs2_consist_inode_i+0x5d/0x60 [gfs2] > [] ? gfs2_dinode_dealloc+0x64/0x210 [gfs2] > [] ? gfs2_delete_inode+0x1ba/0x280 [gfs2] > [] ? gfs2_delete_inode+0x8d/0x280 [gfs2] > [] ? gfs2_delete_inode+0x0/0x280 [gfs2] > [] ? generic_delete_inode+0xde/0x1d0 > [] ? delete_work_func+0x0/0x80 [gfs2] > [] ? generic_drop_inode+0x65/0x80 > [] ? gfs2_drop_inode+0x2e/0x30 [gfs2] > [] ? iput+0x62/0x70 > [] ? delete_work_func+0x54/0x80 [gfs2] > [] ? 
worker_thread+0x170/0x2a0 > [] ? autoremove_wake_function+0x0/0x40 > [] ? worker_thread+0x0/0x2a0 > [] ? kthread+0x96/0xa0 > [] ? child_rip+0xa/0x20 > [] ? kthread+0x0/0xa0 > [] ? child_rip+0x0/0x20 > no_formal_ino = 9582 > no_addr = 6698267 > i_disksize = 6838 > blocks = 0 > i_goal = 6698304 > i_diskflags = 0x00000000 > i_height = 1 > i_depth = 0 > i_entries = 0 > i_eattr = 0 > GFS2: fsid=CyberCluster:GizServer.1: gfs2_delete_inode: -5 > gdlm_unlock 5,66351b err=-22 > > > Only, with different inodes each time. > > After that event, services running on that filesystem are marked failed and > not moved over another node. Any access to that fs yields I/O error. Server > needed to be rebooted to properly work again. > > I did ran a fsck last night on that filesystem, and it did find some > errors, > but nothing serious. Lots (realy lots) of those : > > Ondisk and fsck bitmaps differ at block 5771602 (0x581152) > Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free) > Metadata type is 0 (free) > Fix bitmap for block 5771602 (0x581152) ? (y/n) > > And after completing the fsck, I started back some services, and I got the > same error on another filesystem that is practily empty and used for small > utilities used troughout the cluster... > > What should I do to find the source of this problem ? > > > > ------------------------------ > > Message: 6 > Date: Tue, 21 Jun 2011 10:42:40 -0400 (EDT) > From: Bob Peterson > To: linux clustering > Subject: Re: [Linux-cluster] GFS2 fatal: filesystem consistency error > Message-ID: > < > 1036238479.689034.1308667360488.JavaMail.root at zmail06.collab.prod.int.phx2.redhat.com > > > > Content-Type: text/plain; charset=utf-8 > > ----- Original Message ----- > | 8 node cluster, fiber channel hbas and disks access trough a qlogic > | fabric. > | > | I've got hit 3 times with this error on different nodes : > | > | GFS2: fsid=CyberCluster:GizServer.1: fatal: filesystem consistency > | error > | GFS2: fsid=CyberCluster:GizServer.1: inode = 9582 6698267 > | GFS2: fsid=CyberCluster:GizServer.1: function = gfs2_dinode_dealloc, > | file = > | fs/gfs2/inode.c, line = 352 > | GFS2: fsid=CyberCluster:GizServer.1: about to withdraw this file > | system > | GFS2: fsid=CyberCluster:GizServer.1: telling LM to unmount > | GFS2: fsid=CyberCluster:GizServer.1: withdrawn > | Pid: 2659, comm: delete_workqueu Tainted: G W ---------------- T > | 2.6.32-131.2.1.el6.x86_64 #1 > | Call Trace: > | [] ? gfs2_lm_withdraw+0x102/0x130 [gfs2] > | [] ? trunc_dealloc+0xa9/0x130 [gfs2] > | [] ? gfs2_consist_inode_i+0x5d/0x60 [gfs2] > | [] ? gfs2_dinode_dealloc+0x64/0x210 [gfs2] > | [] ? gfs2_delete_inode+0x1ba/0x280 [gfs2] > | [] ? gfs2_delete_inode+0x8d/0x280 [gfs2] > | [] ? gfs2_delete_inode+0x0/0x280 [gfs2] > | [] ? generic_delete_inode+0xde/0x1d0 > | [] ? delete_work_func+0x0/0x80 [gfs2] > | [] ? generic_drop_inode+0x65/0x80 > | [] ? gfs2_drop_inode+0x2e/0x30 [gfs2] > | [] ? iput+0x62/0x70 > | [] ? delete_work_func+0x54/0x80 [gfs2] > | [] ? worker_thread+0x170/0x2a0 > | [] ? autoremove_wake_function+0x0/0x40 > | [] ? worker_thread+0x0/0x2a0 > | [] ? kthread+0x96/0xa0 > | [] ? child_rip+0xa/0x20 > | [] ? kthread+0x0/0xa0 > | [] ? 
child_rip+0x0/0x20 > | no_formal_ino = 9582 > | no_addr = 6698267 > | i_disksize = 6838 > | blocks = 0 > | i_goal = 6698304 > | i_diskflags = 0x00000000 > | i_height = 1 > | i_depth = 0 > | i_entries = 0 > | i_eattr = 0 > | GFS2: fsid=CyberCluster:GizServer.1: gfs2_delete_inode: -5 > | gdlm_unlock 5,66351b err=-22 > | > | > | Only, with different inodes each time. > | > | After that event, services running on that filesystem are marked > | failed and > | not moved over another node. Any access to that fs yields I/O error. > | Server > | needed to be rebooted to properly work again. > | > | I did ran a fsck last night on that filesystem, and it did find some > | errors, > | but nothing serious. Lots (realy lots) of those : > | > | Ondisk and fsck bitmaps differ at block 5771602 (0x581152) > | Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free) > | Metadata type is 0 (free) > | Fix bitmap for block 5771602 (0x581152) ? (y/n) > | > | And after completing the fsck, I started back some services, and I got > | the > | same error on another filesystem that is practily empty and used for > | small > | utilities used troughout the cluster... > | > | What should I do to find the source of this problem ? > > Hi, > > I believe this is a GFS2 bug we've already solved. > Please contact Red Hat Support. > > Regards, > > Bob Peterson > Red Hat File Systems > > > > ------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > End of Linux-cluster Digest, Vol 86, Issue 19 > ********************************************* > -- Thanks, Balaji S -------------- next part -------------- An HTML attachment was scrubbed... URL: From henahadu at gmail.com Wed Jun 22 04:12:47 2011 From: henahadu at gmail.com (Peter Sjoberg) Date: Wed, 22 Jun 2011 00:12:47 -0400 Subject: [Linux-cluster] kvm cluster ed guests and virtd fencing Message-ID: <1308715967.23586.36.camel@defiant1.intra.techwiz.ca> I have two KVM hosts with some clustered guests that I'm trying to setup fencing for using fence_virtd and I wonder if this is even suppose to work, that guest on one host tells the other host to kill it's guest. I wonder if I need to add some qpid stuff for the two hosts to work together. Setup: I have two kvm hosts, lets call them host1 & host2. Each hosts has a guest (guest1 on host1 & guest2 on host2) and this guests will be clustered with each other. The hosts normal network is internal only and originates on host eth0/br0 The guests have a separate DMZ network segment, and originates as bridged on host eth1/br1, host has no ip on br1 The guests also have a private link between each other and originates on host eth2/br10 (crossover cable between the two hosts). To bypass multicast routing problems I have on the host side added an ip to the private link and running /usr/sbin/fence_virtd set to listen to br10 The intent is that guest1 running on host1 should be able to fence by telling host2 to kill guest2 but this doesn't work. On the guest side I test this with "fence_xvm -o list" and I get a list of all guests on one of the hosts, I expected combined list. What host list I get depends, mostly I get same as the host I'm running on or the first _virtd started. I think the multicast part works because when I start fence_virtd on one host (host1 or host2) I can issue "fence_xvm -o list" on all 4 nodes and get the a list of guests from the host I started it on. One other thing that fails is the killing part. 
I start fence_virtd on host2 and then on guest1 I issue fence_xvm -H -o restart and it just returns "permission denied" So, first of all, is it suppose to work and I just messed up my config or do I need to figure out how to add qpid (or something else) to my setup? -- ------------------------------------------------------------------- Techwiz, Peter Sjoberg PGP key (12F506C8) on keyserver & homepage Key fingerprint = 3DC2 CEBA 1590 B41A 3780 955A DB42 02BB 12F5 06C8 mailto:peters-redhat AT techwiz.ca http://www.techwiz.ca/~peters -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 190 bytes Desc: This is a digitally signed message part URL: From j.koekkoek at perrit.nl Wed Jun 22 07:55:02 2011 From: j.koekkoek at perrit.nl (Jeroen Koekkoek) Date: Wed, 22 Jun 2011 07:55:02 +0000 Subject: [Linux-cluster] relationship corosync + dlm + cman in cluster 3.1.3 Message-ID: Hi, I have a question regarding the relationship between Corosync, DLM, and CMAN. Is the following statement correct? The DLM is a kernel module, dlm_controld is the control daemon. CMAN is the old messaging layer, and is now stacked on OpenAIS, which in turn is stacked on Corosync. The DLM does not use CMAN (or Corosync for that matter) to communicate, but does fetch node information from CMAN. The filesystem (GFS or OCFS2) speaks to the DLM locally (in kernel?) and the DLM takes care of the communication. Now for the real question. In the 3.1.3 release dlm_controld still depends on CMAN, but is it safe to say that I can just use Pacemaker and Heartbeat resource agents and only install CMAN so that dlm_controld can query node information? Or did I misunderstand the documentation? Regards, Jeroen From martijn.storck at gmail.com Wed Jun 22 09:11:56 2011 From: martijn.storck at gmail.com (Martijn Storck) Date: Wed, 22 Jun 2011 11:11:56 +0200 Subject: [Linux-cluster] Replacing network switch in a cluster In-Reply-To: References: Message-ID: Well, after reading man openais.conf the settings below seemed ok for this operation, so I just went ahead with it. I connected the old and new switch and moved the cluster nodes over one by one (which meant the link was down for 4-5 seconds). There were no problems whatsoever. Cheers, Martijn On Fri, Jun 17, 2011 at 9:26 AM, Martijn Storck wrote: > Hi all, > > Unfortunately I have to swap out the switch that is used for the cluster > traffic of our 4-node cluster for a new one. I'm hoping I can do this by > connecting the new switch to the old switch and then moving the nodes over > one by one. > > Can I change the cluster configuration so that there is a longer grace > period before a node is deemed 'lost' and gets fenced? The only line in my > cluster.conf that looks like it has anything to do with it is this one: > > token_retransmits_before_loss_const="20"/> > > I think that with faststart enabled the link with a node will be down for > only a few seconds. I realize that this probably means the cluster will lock > up during that period (since we use a lot of GFS), but it's still better > than having to bring the entire cluster down. > > Kind regards, > Martijn Storck > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mickael.bourneuf at celeonet.fr Wed Jun 22 11:54:47 2011 From: mickael.bourneuf at celeonet.fr (=?ISO-8859-1?Q?=22=5BCeleonet=5D_Micka=EBl_Bourneuf=22?=) Date: Wed, 22 Jun 2011 13:54:47 +0200 Subject: [Linux-cluster] (no subject) Message-ID: <4E01D807.7070708@celeonet.fr> From mgrac at redhat.com Wed Jun 22 16:12:12 2011 From: mgrac at redhat.com (Marek Grac) Date: Wed, 22 Jun 2011 18:12:12 +0200 Subject: [Linux-cluster] Plugged out blade from bladecenter chassis - fence_bladecenter failed In-Reply-To: References: Message-ID: <4E02145C.5070805@redhat.com> Hi, On 06/20/2011 07:16 AM, Parvez Shaikh wrote: > > Hi Thanks Dominic, > > Do fence_bladecenter "reboot" the blade as a part of fencing always? I > have seen it turning the blade off by default. > fence daemon always tries to reboot a machine. In most of the cases it is done by power off / check if it is really off / power on / check if it is really on. Fencing is successful if we were able to check that system is down. > Through fence_bladecenter --missing-as-off...... -o off returns me a > correct result when run from command line but fencing fails through > "fenced". I am using RHEL 5.5 ES and fence_bladecenter version reports > following - m, From henahadu at gmail.com Wed Jun 22 17:45:24 2011 From: henahadu at gmail.com (Peter Sjoberg) Date: Wed, 22 Jun 2011 13:45:24 -0400 Subject: [Linux-cluster] kvm cluster ed guests and virtd fencing In-Reply-To: References: <1308715967.23586.36.camel@defiant1.intra.techwiz.ca> Message-ID: <1308764725.6739.9.camel@defiant1.intra.techwiz.ca> On Wed, 2011-06-22 at 11:58 -0400, Victor Ramirez wrote: > You dont need qpid Good, makes life easier Forgot to say it before but there is no plans to make the guests move around so guest1 will always be on host1 and I think with that config I don't need qpid. > > First of all, try setting SELinux to permissive on your guests or else > the fenced process will not be allowed to send a multicast packet It's permissive for other reasons (but I like to enable it) > > > Second of all, remember to set the fence_xvm.key files as such: > host1 key = guest2 key > host2 key = guest1 key I have same key on all 4 nodes (for now at least) and I did fix the error I see on all howtos dd if=/dev/random bs=4096 count=1 of=/etc/cluster/fence_xvm.key Is wrong, because it fails way fast and I got a 20-200byte file when /dev/random ran out. dd if=/dev/random bs=1 count=4096 of=/etc/cluster/fence_xvm.key Works, get a 4096byte file but takes forever, specially on a remote server (in which case I would generate the file locally and scp to remote) dd if=/dev/urandom bs=1 count=4096 of=/etc/cluster/fence_xvm.key Goes fast and is good enough for me. > > so that guest1 send a multicast signal to host2 to fence guest2. Right. > > I did find a config error so now I can kill guest2 from guest1 (had dmzip instead of privip in the config file) but it is still a problem with that I only see one hosts guest, not both and that means I can only kill one way, not both ways. /ps > > 2011/6/22 Peter Sjoberg > I have two KVM hosts with some clustered guests that I'm > trying to setup > fencing for using fence_virtd and I wonder if this is even > suppose to > work, that guest on one host tells the other host to kill it's > guest. > I wonder if I need to add some qpid stuff for the two hosts to > work > together. > > Setup: > I have two kvm hosts, lets call them host1 & host2. > Each hosts has a guest (guest1 on host1 & guest2 on host2) and > this > guests will be clustered with each other. 
> The hosts normal network is internal only and originates on > host > eth0/br0 > The guests have a separate DMZ network segment, and originates > as > bridged on host eth1/br1, host has no ip on br1 > The guests also have a private link between each other and > originates on > host eth2/br10 (crossover cable between the two hosts). > > To bypass multicast routing problems I have on the host side > added an ip > to the private link and running /usr/sbin/fence_virtd set to > listen to > br10 > > The intent is that guest1 running on host1 should be able to > fence by > telling host2 to kill guest2 but this doesn't work. > On the guest side I test this with "fence_xvm -o list" and I > get a list > of all guests on one of the hosts, I expected combined list. > What host list I get depends, mostly I get same as the host > I'm running > on or the first _virtd started. > I think the multicast part works because when I start > fence_virtd on one > host (host1 or host2) I can issue "fence_xvm -o list" on all 4 > nodes and > get the a list of guests from the host I started it on. > > One other thing that fails is the killing part. > I start fence_virtd on host2 and then on guest1 I issue > fence_xvm -H -o restart > and it just returns "permission denied" > > So, first of all, is it suppose to work and I just messed up > my config > or do I need to figure out how to add qpid (or something else) > to my > setup? > > -- > ------------------------------------------------------------------- > Techwiz, Peter Sjoberg PGP key (12F506C8) on keyserver & > homepage > Key fingerprint = 3DC2 CEBA 1590 B41A 3780 955A DB42 02BB > 12F5 06C8 > mailto:peters-redhat AT techwiz.ca > http://www.techwiz.ca/~peters > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 190 bytes Desc: This is a digitally signed message part URL: From Ralph.Grothe at itdz-berlin.de Fri Jun 24 07:17:57 2011 From: Ralph.Grothe at itdz-berlin.de (Ralph.Grothe at itdz-berlin.de) Date: Fri, 24 Jun 2011 09:17:57 +0200 Subject: [Linux-cluster] How to achieve a service's "stickyness" to a "preferred" node in RHCS? Message-ID: Hello Clustering Gurus, I need to have a service during normal operation (i.e. not during relocation when loss of stickyness is ok and wanted) stick to a preferred cluster node. (n.b. this is only a two-node cluster) In the redhat cluster admin guide (we're on RHEL 5.6) I think to have read that such thing as a "preffered_node" or similar attribute doesn't exist in the schema any more and that instead one should define ordered="0" and restricted="0" failover domains for the respective service as this would in effect result in the wanted behavior. Is this correct? And how (e.g. a short cluster.conf XML example snippet would be appreciated) would this have to be applied? Regards Ralph From l.santeramo at brgm.fr Fri Jun 24 07:56:36 2011 From: l.santeramo at brgm.fr (Santeramo Luc) Date: Fri, 24 Jun 2011 07:56:36 +0000 Subject: [Linux-cluster] How to achieve a service's "stickyness" to a"preferred" node in RHCS? In-Reply-To: References: Message-ID: <4E04434F.4050305@brgm.fr> Hi, SVC_001 will be sticked to Failover Domain "FOD_srv1", which have node srv1 as priority node. ...and you can have more informations about options on RHCS admin guide. 
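The XML example in this reply was stripped by the list archiver, so here is a sketch of the kind of stanza being described; the names FOD_srv1, SVC_001 and srv1 come from the sentence above, while srv2 and the attribute values are assumptions to be checked against the RHCS admin guide, not the snippet that was actually attached:

# Sketch only: an ordered, unrestricted failover domain that prefers srv1.
# srv2 and all attribute values are assumed, not taken from the stripped attachment.
cat > /tmp/failoverdomain-sketch.xml <<'EOF'
<rm>
  <failoverdomains>
    <failoverdomain name="FOD_srv1" ordered="1" restricted="0" nofailback="1">
      <failoverdomainnode name="srv1" priority="1"/>
      <failoverdomainnode name="srv2" priority="2"/>
    </failoverdomain>
  </failoverdomains>
  <service name="SVC_001" domain="FOD_srv1" autostart="1" recovery="relocate"/>
</rm>
EOF

In a stanza of this shape, nofailback="1" is what keeps the service where it is after a relocation, and per the admin guide the failback behaviour only has an effect when the domain is ordered.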
Luc ________________________________ Le 24/06/2011 09:17, Ralph.Grothe at itdz-berlin.de a ?crit : Hello Clustering Gurus, I need to have a service during normal operation (i.e. not during relocation when loss of stickyness is ok and wanted) stick to a preferred cluster node. (n.b. this is only a two-node cluster) In the redhat cluster admin guide (we're on RHEL 5.6) I think to have read that such thing as a "preffered_node" or similar attribute doesn't exist in the schema any more and that instead one should define ordered="0" and restricted="0" failover domains for the respective service as this would in effect result in the wanted behavior. Is this correct? And how (e.g. a short cluster.conf XML example snippet would be appreciated) would this have to be applied? Regards Ralph -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster ********************************************************************************************** Pensez a l'environnement avant d'imprimer ce message Think Environment before printing Le contenu de ce mel et de ses pieces jointes est destine a l'usage exclusif du (des) destinataire(s) designe (s) comme tel(s). En cas de reception par erreur, le signaler a son expediteur et ne pas en divulguer le contenu. L'absence de virus a ete verifiee a l'emission, il convient neanmoins de s'assurer de l'absence de contamination a sa reception. The contents of this email and any attachments are confidential. They are intended for the named recipient (s) only. If you have received this email in error please notify the system manager or the sender immediately and do not disclose the contents to anyone or make copies. eSafe scanned this email for viruses, vandals and malicious content. ********************************************************************************************** -------------- next part -------------- An HTML attachment was scrubbed... URL: From rossnick-lists at cybercat.ca Fri Jun 24 16:06:51 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Fri, 24 Jun 2011 12:06:51 -0400 Subject: [Linux-cluster] Corosync goes cpu to 95-99% In-Reply-To: <51BB988BCCF547E69BF222BDAF34C4DE@versa> References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com> <4DD2BD7D.5070704@gmail.com> <4DD2CA90.6090802@redhat.com> <3B50BA7445114813AE429BEE51A2BA52@versa> <4DD78908.2030801@gmail.com> <0B1965C8-9807-42B6-9453-01BE0C0B1DCB@cybercat.ca><4DD80D5D.10004@gmail.com> <4DD873C7.8080402@cybercat.ca> <22E7D11CD5E64E338A66811F31F06238@versa> <4DE545D7.1080703@redhat.com> <4DE69786.5010204@gmail.com><4DE6CAF6.4000002@cybercat.ca> <4DE75602.1000408@gmail.com> <51BB988BCCF547E69BF222BDAF34C4DE@versa> Message-ID: <4E04B61B.9070208@cybercat.ca> > Thanks for that, that'll prevent me from modifying a system file... > > And yes, I find it a little disapointing. We're now at 6.1, and our > setup is exactly what RHCS was designed for... A GFS over fiber, httpd > running content from that gfs... Two thing I need to mention in this issue. One, support doesn't think anymore that it's a coro-sync specific issue, they are searching to a driver issue or other source for this problem. Second, I downgraded my kernel to 2.6.32-71.29.1.el6 (pre-6.1, or 6.0), for another issue, and since I did, I don't think I saw that issue again. I saw spikes in my cpu graphs, but I'm not 100% sure that they are caused by this issue. 
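One low-impact way to check whether such spikes really line up with corosync (a sketch only; the log path and one-minute interval are arbitrary, and pidof/top are assumed available on the node) is to sample the daemon's CPU share over time and compare the timestamps against the CPU graphs:

# Sample corosync's CPU usage once a minute so spikes in the host graphs
# can be matched against the process itself.
while true; do
    { date; top -b -n 1 -p "$(pidof corosync)" | tail -n 1; } >> /var/log/corosync-cpu.log
    sleep 60
done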
So, as a temporary work-around for this time, woule be (at your own risks) to downgrade to 2.6.32-71.29.1.el6 kernel : yum install kernel-2.6.32-71.29.1.el6.x86_64 Regards, From Ralph.Grothe at itdz-berlin.de Sat Jun 25 09:00:13 2011 From: Ralph.Grothe at itdz-berlin.de (Ralph.Grothe at itdz-berlin.de) Date: Sat, 25 Jun 2011 11:00:13 +0200 Subject: [Linux-cluster] How to achieve a service's "stickyness" toa"preferred" node in RHCS? In-Reply-To: <4E04434F.4050305@brgm.fr> References: <4E04434F.4050305@brgm.fr> Message-ID: Bonjour Luc, many thanks for the sample snippet. I am afraid, I couldn't reply yesterday. I will give your suggestion a try. Especially, since I want to assure that failback of a relocated service is disabled, what according to the RHCS admin guide is only applicable to ordered failover domains (i.e. cited from RHCS AG "The failback characteristic is applicable only if ordered failover is configured.") http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html /Cluster_Administration/s1-config-failover-domain-CA.html > > ...and you can have more informations about options on RHCS admin guide > I don't agree. In fact the RHCS AD is quite telling the opposite to your proposal by demanding: "To configure a preferred member, you can create an unrestricted failover domain comprising only one cluster member. Doing that causes a cluster service to run on that cluster member primarily (the preferred member), but allows the cluster service to fail over to any of the other members." So here they claim that it has to be an *unordered* failover domain with only *one* member. Sadly, they even don't care to further elaborate by e.g. providing a mor illustrative config code sample. Because I had configuered my failoverdomains according the above cited statement in RHCS AG to achieve stickyness I was more than surprised to observe the service to failback after it already had been relocated after I had rebooted the node it currently and preferredly ran on. Rgds Ralph ________________________________ From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Santeramo Luc Sent: Friday, June 24, 2011 9:57 AM To: linux-cluster at redhat.com Subject: Re: [Linux-cluster] How to achieve a service's "stickyness" toa"preferred" node in RHCS? Hi, SVC_001 will be sticked to Failover Domain "FOD_srv1", which have node srv1 as priority node. ...and you can have more informations about options on RHCS admin guide. Luc ________________________________ Le 24/06/2011 09:17, Ralph.Grothe at itdz-berlin.de a ?crit : Hello Clustering Gurus, I need to have a service during normal operation (i.e. not during relocation when loss of stickyness is ok and wanted) stick to a preferred cluster node. (n.b. this is only a two-node cluster) In the redhat cluster admin guide (we're on RHEL 5.6) I think to have read that such thing as a "preffered_node" or similar attribute doesn't exist in the schema any more and that instead one should define ordered="0" and restricted="0" failover domains for the respective service as this would in effect result in the wanted behavior. Is this correct? And how (e.g. a short cluster.conf XML example snippet would be appreciated) would this have to be applied? Regards Ralph -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster P Pensez ? l'environnement avant d'imprimer ce message Think Environment before printing Le contenu de ce m?l et de ses pi?ces jointes est destin? ? 
l'usage exclusif du (des) destinataire(s) d?sign?(s) comme tel(s). En cas de r?ception par erreur, le signaler ? son exp?diteur et ne pas en divulguer le contenu. L'absence de virus a ?t? v?rifi?e ? l'?mission, il convient n?anmoins de s'assurer de l'absence de contamination ? sa r?ception. The contents of this email and any attachments are confidential. They are intended for the named recipient(s) only. If you have received this email in error please notify the system manager or the sender immediately and do not disclose the contents to anyone or make copies. eSafe scanned this email for viruses, vandals and malicious content. From noreply at boxbe.com Sat Jun 25 16:04:21 2011 From: noreply at boxbe.com (noreply at boxbe.com) Date: Sat, 25 Jun 2011 09:04:21 -0700 (PDT) Subject: [Linux-cluster] Linux-cluster Digest, Vol 86, Issue 24 (Action Required) Message-ID: <710906935.2168317.1309017861695.JavaMail.prod@app010.dmz> Dear sender, You will not receive any more courtesy notices from our members for two days. Messages you have sent will remain in a lower priority mailbox for our member to review at their leisure. Future messages will be more likely to be viewed if you are on our member's priority Guest List. Thank you, shanavasmca at gmail.com Powered by Boxbe -- "End Email Overload" Visit http://www.boxbe.com/how-it-works?tc=8511374584_4610803 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- An embedded message was scrubbed... From: linux-cluster-request at redhat.com Subject: Linux-cluster Digest, Vol 86, Issue 24 Date: Sat, 25 Jun 2011 12:00:05 -0400 Size: 2088 URL: From andrew at beekhof.net Sun Jun 26 23:23:57 2011 From: andrew at beekhof.net (Andrew Beekhof) Date: Mon, 27 Jun 2011 09:23:57 +1000 Subject: [Linux-cluster] relationship corosync + dlm + cman in cluster 3.1.3 In-Reply-To: References: Message-ID: On Wed, Jun 22, 2011 at 5:55 PM, Jeroen Koekkoek wrote: > Hi, > > I have a question regarding the relationship between Corosync, DLM, and CMAN. Is the following statement correct? > > The DLM is a kernel module, dlm_controld is the control daemon. > CMAN is the old messaging layer, and is now stacked on OpenAIS, which in turn is stacked on Corosync. > > The DLM does not use CMAN (or Corosync for that matter) to communicate, but does fetch node information from CMAN. > > The filesystem (GFS or OCFS2) speaks to the DLM locally (in kernel?) and the DLM takes care of the communication. > > > Now for the real question. > > In the 3.1.3 release dlm_controld still depends on CMAN, but is it safe to say that I can just use Pacemaker and Heartbeat resource agents and only install CMAN so that dlm_controld can query node information? > > Or did I misunderstand the documentation? You need to make sure everyone is getting the same membership and quorum information. So yes, install CMAN for dlm_controld but also tell pacemaker to use it too (make sure you're on 1.1.5 or higher). > > Regards, > Jeroen > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From RMartinez-Sanchez at nds.com Mon Jun 27 13:35:19 2011 From: RMartinez-Sanchez at nds.com (Martinez-Sanchez, Raul) Date: Mon, 27 Jun 2011 14:35:19 +0100 Subject: [Linux-cluster] info RHEL 6 Cluster Suite + File System Message-ID: <7370F6F5ED3B874F988F5CE657D801EA13BE7DAB87@UKMA1.UK.NDS.COM> Hi All, I have a very generic question that somehow am unable to answer. 
In the past (RHEL 5) we have been deploying HA Clusters in the following manner: Two to four redhat nodes with the Red Hat cluster suite on them. As well all the nodes are attached to a SAN/Fibre infrastructure with two SAN Switches and two controllers per Storage Array. The storage array was presented to the cluster suite as a GFS resources and services (Oracle) were making use of it by mounting the GFS resource and operating on it. It is my understanding (maybe am wrong) that in RHEL 6 there is no GFS support as well as that GFS2 is not oracle certified and therefore cannot be used. So my question is how can we replicate the same structure/architecture on RHEL 6 if GFS/GFS2 cannot be used? Apologies if this question is too simple but am just trying to get some more understanding on how we could proceed next. Regards, Ra?l Mart?nez S?nchez ************************************************************************************** This message is confidential and intended only for the addressee. If you have received this message in error, please immediately notify the postmaster at nds.com and delete it from your system as well as any copies. The content of e-mails as well as traffic data may be monitored by NDS for employment and security purposes. To protect the environment please do not print this e-mail unless necessary. NDS Limited. Registered Office: One London Road, Staines, Middlesex, TW18 4EX, United Kingdom. A company registered in England and Wales. Registered no. 3080780. VAT no. GB 603 8808 40-00 ************************************************************************************** From andersonlira at gmail.com Tue Jun 28 05:55:09 2011 From: andersonlira at gmail.com (anderson souza) Date: Mon, 27 Jun 2011 23:55:09 -0600 Subject: [Linux-cluster] Hight I/O Wait Rates - RHEL 6.1 + GFS2 + NFS Message-ID: Hi everyone, I have an Active/Passive RHCS 6.1 runing with 8TB of GFS2 with NFS on top and exporting 26 mouting points to 250 NFS clients. The GFS2 mounting points are mounted with noatime, nodiratime, data=writeback and localflocks options, and also the SAN and servers are fast (4Gbps and 8Gb, dual controllers working in LB, H.A... QuadCore, 48GB of memory...). The cluster has been doing its work (failover working fine...), however and unfortunately I have seen hight I/Owait rates, sometimes around 60-70% (on which is very bad), and a couple of glock_workqueue jobs, so I get a bunch of gfs2_quotad, nfsd errors and qdisk latency. The debugfs didn't show me "W", only "G" and "H". Have you guys seen it before? Looks like some glock's contention? How could I get it fixed and what does it mean? Thank you very much Jun 27 18:48:05 kernel: INFO: task gfs2_quotad:19066 blocked for more than 120 seconds. Jun 27 18:48:05 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jun 27 18:48:05 kernel: gfs2_quotad D 0000000000000004 0 19066 2 0x00000080 Jun 27 18:48:05 kernel: ffff880bb01e1c20 0000000000000046 0000000000000000 ffffffffa045ec6d Jun 27 18:48:05 kernel: 0000000000000000 ffff880be6e2b000 ffff880bb01e1c50 00000001051d8b46 Jun 27 18:48:05 kernel: ffff880be4865af8 ffff880bb01e1fd8 000000000000f598 ffff880be4865af8t Jun 27 18:48:05 kernel: Call Trace: Jun 27 18:48:05 kernel: [] ? dlm_put_lockspace+0x1d/0x40 [dlm] Jun 27 18:48:05 kernel: [] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2] Jun 27 18:48:05 kernel: [] gfs2_glock_holder_wait+0xe/0x20 [gfs2] Jun 27 18:48:05 kernel: [] __wait_on_bit+0x5f/0x90 Jun 27 18:48:05 kernel: [] ? 
gfs2_glock_holder_wait+0x0/0x20 [gfs2] Jun 27 18:48:05 kernel: [] out_of_line_wait_on_bit+0x78/0x90 Jun 27 18:48:05 kernel: [] ? wake_bit_function+0x0/0x50 Jun 27 18:48:05 kernel: [] gfs2_glock_wait+0x36/0x40 [gfs2] Jun 27 18:48:05 kernel: [] gfs2_glock_nq+0x191/0x370 [gfs2] Jun 27 18:48:05 kernel: [] ? try_to_del_timer_sync+0x7b/0xe0 Jun 27 18:48:05 kernel: [] gfs2_statfs_sync+0x58/0x1b0 [gfs2] Jun 27 18:48:05 kernel: [] ? schedule_timeout+0x19a/0x2e0 Jun 27 18:48:05 kernel: [] ? gfs2_statfs_sync+0x50/0x1b0 [gfs2] Jun 27 18:48:05 kernel: [] quotad_check_timeo+0x57/0xb0 [gfs2] Jun 27 18:48:05 kernel: [] gfs2_quotad+0x234/0x2b0 [gfs2] Jun 27 18:48:05 kernel: [] ? autoremove_wake_function+0x0/0x40 Jun 27 18:48:05 kernel: [] ? gfs2_quotad+0x0/0x2b0 [gfs2] Jun 27 18:48:05 kernel: [] kthread+0x96/0xa0 Jun 27 18:48:05 kernel: [] child_rip+0xa/0x20 Jun 27 18:48:05 kernel: [] ? kthread+0x0/0xa0 Jun 27 18:48:05 kernel: [] ? child_rip+0x0/0x20 Jun 27 19:49:07 kernel: __ratelimit: 57 callbacks suppressed Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! Jun 27 20:00:58 kernel: rpc-srv/tcp: nfsd: got error -104 when sending 140 bytes - shutting down socket Jun 27 20:00:58 kernel: __ratelimit: 40 callbacks suppressed qdiskd[10078]: qdisk cycle took more than 1 second to complete (1.170000) qdisk cycle took more than 1 second to complete (1.120000) Thanks James S. -------------- next part -------------- An HTML attachment was scrubbed... URL: From omerfsen at gmail.com Tue Jun 28 06:05:36 2011 From: omerfsen at gmail.com (Omer Faruk SEN) Date: Tue, 28 Jun 2011 09:05:36 +0300 Subject: [Linux-cluster] Hight I/O Wait Rates - RHEL 6.1 + GFS2 + NFS In-Reply-To: References: Message-ID: Hi, Open a ticket so Red Hat technical staff can take care of this. I think it is the fastest way to resolve and fix this issue. Regards. On Tue, Jun 28, 2011 at 8:55 AM, anderson souza wrote: > Hi everyone, > > I have an Active/Passive RHCS 6.1 runing with 8TB of GFS2 with NFS on > top and exporting 26 mouting points to 250 NFS clients. The GFS2 mounting > points are mounted with noatime, nodiratime, data=writeback and localflocks > options, and also the SAN and servers are fast (4Gbps and 8Gb, dual > controllers working in LB, H.A... QuadCore, 48GB of memory...). The cluster > has been doing its work (failover working fine...), however > and unfortunately I have seen hight I/Owait rates, sometimes around 60-70% > (on which is very bad), and a couple of glock_workqueue jobs, so I get a > bunch of gfs2_quotad, nfsd errors and qdisk latency. The debugfs didn't show > me "W", only "G" and "H". > > Have you guys seen it before? > Looks like some glock's contention? > How could I get it fixed and what does it mean? > > Thank you very much > > > Jun 27 18:48:05 kernel: INFO: task gfs2_quotad:19066 blocked for more than > 120 seconds. > Jun 27 18:48:05 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. 
> Jun 27 18:48:05 kernel: gfs2_quotad D 0000000000000004 0 19066 > 2 0x00000080 > Jun 27 18:48:05 kernel: ffff880bb01e1c20 0000000000000046 0000000000000000 > ffffffffa045ec6d > Jun 27 18:48:05 kernel: 0000000000000000 ffff880be6e2b000 ffff880bb01e1c50 > 00000001051d8b46 > Jun 27 18:48:05 kernel: ffff880be4865af8 ffff880bb01e1fd8 000000000000f598 > ffff880be4865af8t > Jun 27 18:48:05 kernel: Call Trace: > Jun 27 18:48:05 kernel: [] ? dlm_put_lockspace+0x1d/0x40 > [dlm] > Jun 27 18:48:05 kernel: [] ? > gfs2_glock_holder_wait+0x0/0x20 [gfs2] > Jun 27 18:48:05 kernel: [] > gfs2_glock_holder_wait+0xe/0x20 [gfs2] > Jun 27 18:48:05 kernel: [] __wait_on_bit+0x5f/0x90 > Jun 27 18:48:05 kernel: [] ? > gfs2_glock_holder_wait+0x0/0x20 [gfs2] > Jun 27 18:48:05 kernel: [] > out_of_line_wait_on_bit+0x78/0x90 > Jun 27 18:48:05 kernel: [] ? wake_bit_function+0x0/0x50 > Jun 27 18:48:05 kernel: [] gfs2_glock_wait+0x36/0x40 > [gfs2] > Jun 27 18:48:05 kernel: [] gfs2_glock_nq+0x191/0x370 > [gfs2] > Jun 27 18:48:05 kernel: [] ? > try_to_del_timer_sync+0x7b/0xe0 > Jun 27 18:48:05 kernel: [] gfs2_statfs_sync+0x58/0x1b0 > [gfs2] > Jun 27 18:48:05 kernel: [] ? > schedule_timeout+0x19a/0x2e0 > Jun 27 18:48:05 kernel: [] ? gfs2_statfs_sync+0x50/0x1b0 > [gfs2] > Jun 27 18:48:05 kernel: [] quotad_check_timeo+0x57/0xb0 > [gfs2] > Jun 27 18:48:05 kernel: [] gfs2_quotad+0x234/0x2b0 > [gfs2] > Jun 27 18:48:05 kernel: [] ? > autoremove_wake_function+0x0/0x40 > Jun 27 18:48:05 kernel: [] ? gfs2_quotad+0x0/0x2b0 > [gfs2] > Jun 27 18:48:05 kernel: [] kthread+0x96/0xa0 > Jun 27 18:48:05 kernel: [] child_rip+0xa/0x20 > Jun 27 18:48:05 kernel: [] ? kthread+0x0/0xa0 > Jun 27 18:48:05 kernel: [] ? child_rip+0x0/0x20 > > Jun 27 19:49:07 kernel: __ratelimit: 57 callbacks suppressed > Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! > Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! > Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! > Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! > Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! > Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! > Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! > Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! > Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! > Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! > Jun 27 20:00:58 kernel: rpc-srv/tcp: nfsd: got error -104 when sending 140 > bytes - shutting down socket > Jun 27 20:00:58 kernel: __ratelimit: 40 callbacks suppressed > qdiskd[10078]: qdisk cycle took more than 1 second to complete (1.170000) > qdisk cycle took more than 1 second to complete (1.120000) > > Thanks > James S. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fdinitto at redhat.com Tue Jun 28 06:22:00 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Tue, 28 Jun 2011 08:22:00 +0200 Subject: [Linux-cluster] Hight I/O Wait Rates - RHEL 6.1 + GFS2 + NFS In-Reply-To: References: Message-ID: <4E097308.9090104@redhat.com> On 6/28/2011 7:55 AM, anderson souza wrote: > Hi everyone, > > I have an Active/Passive RHCS 6.1 runing with 8TB of GFS2 with NFS on > top and exporting 26 mouting points to 250 NFS clients. 
The GFS2 > mounting points are mounted with noatime, nodiratime, data=writeback and > localflocks options, and also the SAN and servers are fast (4Gbps and > 8Gb, dual controllers working in LB, H.A... QuadCore, 48GB of > memory...). The cluster has been doing its work (failover working > fine...), however and unfortunately I have seen hight I/Owait rates, > sometimes around 60-70% (on which is very bad), and a couple > of glock_workqueue jobs, so I get a bunch of gfs2_quotad, nfsd errors > and qdisk latency. The debugfs didn't show me "W", only "G" and "H". > > Have you guys seen it before? > Looks like some glock's contention? > How could I get it fixed and what does it mean? Please contact GSS and file a ticket. You are probably experiencing this: https://bugzilla.redhat.com/show_bug.cgi?id=717010 (you might not be able to see the whole content directly, but try downgrading the kernel to 6.0 should make things better) Also, given the nature of your setup, I would recommend to request a cluster architecture review to GSS for GFS2 usage in such environment. Fabio From j.koekkoek at perrit.nl Tue Jun 28 07:12:29 2011 From: j.koekkoek at perrit.nl (Jeroen Koekkoek) Date: Tue, 28 Jun 2011 07:12:29 +0000 Subject: [Linux-cluster] relationship corosync + dlm + cman in cluster 3.1.3 In-Reply-To: References: Message-ID: Hi Andrew, Thanks for answering my question. While looking at the current source tree, I noticed newer versions, at least dlm_controld, will not use cman anymore. So I'll keep using 3.0.12 for now (with dlm_controld.pcmk). Do you have any estimate on the first release without cman? Regards, Jeroen > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster- > bounces at redhat.com] On Behalf Of Andrew Beekhof > Sent: Monday, June 27, 2011 1:24 AM > To: linux clustering > Subject: Re: [Linux-cluster] relationship corosync + dlm + cman in > cluster 3.1.3 > > On Wed, Jun 22, 2011 at 5:55 PM, Jeroen Koekkoek > wrote: > > Hi, > > > > I have a question regarding the relationship between Corosync, DLM, > and CMAN. Is the following statement correct? > > > > The DLM is a kernel module, dlm_controld is the control daemon. > > CMAN is the old messaging layer, and is now stacked on OpenAIS, which > in turn is stacked on Corosync. > > > > The DLM does not use CMAN (or Corosync for that matter) to > communicate, but does fetch node information from CMAN. > > > > The filesystem (GFS or OCFS2) speaks to the DLM locally (in kernel?) > and the DLM takes care of the communication. > > > > > > Now for the real question. > > > > In the 3.1.3 release dlm_controld still depends on CMAN, but is it > safe to say that I can just use Pacemaker and Heartbeat resource agents > and only install CMAN so that dlm_controld can query node information? > > > > Or did I misunderstand the documentation? > > You need to make sure everyone is getting the same membership and quorum > information. > So yes, install CMAN for dlm_controld but also tell pacemaker to use it > too (make sure you're on 1.1.5 or higher). 
> > > > > Regards, > > Jeroen > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From ekuric at redhat.com Tue Jun 28 09:25:26 2011 From: ekuric at redhat.com (Elvir Kuric) Date: Tue, 28 Jun 2011 11:25:26 +0200 Subject: [Linux-cluster] relationship corosync + dlm + cman in cluster 3.1.3 In-Reply-To: References: Message-ID: <4E099E06.9080303@redhat.com> On 06/22/2011 09:55 AM, Jeroen Koekkoek wrote: > Hi, > > I have a question regarding the relationship between Corosync, DLM, and CMAN. Is the following statement correct? > > The DLM is a kernel module, dlm_controld is the control daemon. > CMAN is the old messaging layer, and is now stacked on OpenAIS, which in turn is stacked on Corosync. > > The DLM does not use CMAN (or Corosync for that matter) to communicate, but does fetch node information from CMAN. > > The filesystem (GFS or OCFS2) speaks to the DLM locally (in kernel?) and the DLM takes care of the communication. > > > Now for the real question. > > In the 3.1.3 release dlm_controld still depends on CMAN, but is it safe to say that I can just use Pacemaker and Heartbeat resource agents and only install CMAN so that dlm_controld can query node information? > > Or did I misunderstand the documentation? > > Regards, > Jeroen > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster I think document at below link will give nice overview of relationships between cman, dlm, ... http://people.redhat.com/ccaulfie/docs/ClusterPic.pdf Thanks Kind regards, Elvir From c.mammoli at apra.it Tue Jun 28 16:24:51 2011 From: c.mammoli at apra.it (Cristian Mammoli - Apra Sistemi) Date: Tue, 28 Jun 2011 18:24:51 +0200 Subject: [Linux-cluster] virtual machine resource agent does not gracefully shutdown vms Message-ID: Hi, I have 4 2-node clusters (EL clone, 5.6 release) which provide high availability to KVM virtual machines. Very often when I stop a vm service (with luci or clusvcadm) the virtual machine does not shutdown gracefully but continues to operate normally until the timeout kicks in and force the poweroff. Most of the vms are windows 2008 with virtio drivers and have ACPI enabled, indeed running virsh shutdown works! Any clue about what's going on? cluster.conf: http://pastebin.com/q6tt3gMA examplevm.xml: http://pastebin.com/z1e94h25 -- Cristian Mammoli APRA SISTEMI srl Via Brodolini,6 Jesi (AN) tel dir. +390731719822 Web www.apra.it e-mail c.mammoli at apra.it From c.mammoli at apra.it Tue Jun 28 17:03:08 2011 From: c.mammoli at apra.it (Cristian Mammoli - Apra Sistemi) Date: Tue, 28 Jun 2011 19:03:08 +0200 Subject: [Linux-cluster] virtual machine resource agent does not gracefully shutdown vms In-Reply-To: <4E0A0053.6050803@apra.it> References: <4E0A0053.6050803@apra.it> Message-ID: On 06/28/2011 06:24 PM, Cristian Mammoli - Apra Sistemi wrote: > > Hi, I have 4 2-node clusters (EL clone, 5.6 release) which provide high > availability to KVM virtual machines. > Very often when I stop a vm service (with luci or clusvcadm) the virtual > machine does not shutdown gracefully but continues to operate normally > until the timeout kicks in and force the poweroff. 
I reproduced the issue nad it seems that vm.sh correctly issues "virsh shutdown domain" but the vm does not actually give a f*** :) [root at srvha01 ~]# /usr/share/cluster/vm.sh stop Hypervisor: qemu Management tool: virsh Hypervisor URI: qemu:///system Migration URI format: qemu+ssh://target_host/system Virtual machine srvdc01 is running virsh shutdown srvdc01 ... Domain srvdc01 is being shutdown Nothing happens and the domain keeps running normally. Second try: [root at srvha01 ~]# /usr/share/cluster/vm.sh stop Hypervisor: qemu Management tool: virsh Hypervisor URI: qemu:///system Migration URI format: qemu+ssh://target_host/system Virtual machine srvdc01 is running virsh shutdown srvdc01 ... Domain srvdc01 is being shutdown The domain shuts down correctly At this point I think this is a libvirt/kvm/windows issue... Anyway any help is appreciated. -- Cristian Mammoli APRA SISTEMI srl Via Brodolini,6 Jesi (AN) tel dir. +390731719822 Web www.apra.it e-mail c.mammoli at apra.it From c.mammoli at apra.it Tue Jun 28 17:47:40 2011 From: c.mammoli at apra.it (Cristian Mammoli - Apra Sistemi) Date: Tue, 28 Jun 2011 19:47:40 +0200 Subject: [Linux-cluster] [SOLVED] Virtual machine resource agent does not gracefully shutdown vms In-Reply-To: <4E0A094C.2000606@apra.it> References: <4E0A0053.6050803@apra.it> <4E0A094C.2000606@apra.it> Message-ID: It seems that Windows server in an active directory environment has a default group policy setting that inhibits ACPI shutdown if no user is logged in... It is located in: Computer Configuration\Windows Settings\Security Settings\Local Policies\Security Options\Shutdown: Allow system to be shut down without having to log on After setting this to "on" VMs shutdown gracefully on the first try when I stop them from luci. -- Cristian Mammoli APRA SISTEMI srl Via Brodolini,6 Jesi (AN) tel dir. +390731719822 Web www.apra.it e-mail c.mammoli at apra.it From noreply at boxbe.com Tue Jun 28 17:57:24 2011 From: noreply at boxbe.com (noreply at boxbe.com) Date: Tue, 28 Jun 2011 10:57:24 -0700 (PDT) Subject: [Linux-cluster] [SOLVED] Virtual machine resource agent does not gracefully shutdown vms (Action Required) Message-ID: <854324580.2651655.1309283844276.JavaMail.prod@app010.dmz> Hello linux clustering, You will not receive any more courtesy notices from our members for two days. Messages you have sent will remain in a lower priority mailbox for our member to review at their leisure. Future messages will be more likely to be viewed if you are on our member's priority Guest List. Thank you, debjyoti.mail at gmail.com Powered by Boxbe -- "End Email Overload" Visit http://www.boxbe.com/how-it-works?tc=8539327229_705574608 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- An embedded message was scrubbed... From: Cristian Mammoli - Apra Sistemi Subject: Re: [Linux-cluster] [SOLVED] Virtual machine resource agent does not gracefully shutdown vms Date: Tue, 28 Jun 2011 19:47:40 +0200 Size: 4652 URL: From chris.alexander at kusiri.com Wed Jun 29 14:17:53 2011 From: chris.alexander at kusiri.com (Chris Alexander) Date: Wed, 29 Jun 2011 15:17:53 +0100 Subject: [Linux-cluster] Expected behaviour when service fails to stop Message-ID: Hi, I was wondering what the expected behaviour of the cluster would be when a service cannot be shutdown safely. 
From c.mammoli at apra.it Tue Jun 28 17:47:40 2011 From: c.mammoli at apra.it (Cristian Mammoli - Apra Sistemi) Date: Tue, 28 Jun 2011 19:47:40 +0200 Subject: [Linux-cluster] [SOLVED] Virtual machine resource agent does not gracefully shutdown vms In-Reply-To: <4E0A094C.2000606@apra.it> References: <4E0A0053.6050803@apra.it> <4E0A094C.2000606@apra.it> Message-ID: It seems that Windows Server in an Active Directory environment has a default group policy setting that inhibits ACPI shutdown if no user is logged in... It is located in:
Computer Configuration\Windows Settings\Security Settings\Local Policies\Security Options\Shutdown: Allow system to be shut down without having to log on
After setting this to "on", VMs shut down gracefully on the first try when I stop them from luci. -- Cristian Mammoli APRA SISTEMI srl Via Brodolini,6 Jesi (AN) tel dir. +390731719822 Web www.apra.it e-mail c.mammoli at apra.it
From chris.alexander at kusiri.com Wed Jun 29 14:17:53 2011 From: chris.alexander at kusiri.com (Chris Alexander) Date: Wed, 29 Jun 2011 15:17:53 +0100 Subject: [Linux-cluster] Expected behaviour when service fails to stop Message-ID: Hi, I was wondering what the expected behaviour of the cluster would be when a service cannot be shut down safely. For example, if you request a service group to be relocated to another node in the cluster and one of the services in that group fails to stop (causing a timeout?), what would the result be? I should imagine that the service would be marked as Failed; is this the case? I have been unable to find this particular scenario documented anywhere. Thanks Chris
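(Hedged editorial sketch, not a reply from the thread: in rgmanager a stop failure is generally treated as fatal for the service; it is flagged as failed on its last owner and is not recovered automatically until an administrator intervenes. The recovery below assumes a service named "myservice"; check the underlying resources by hand before re-enabling.)

# Hedged sketch: recovering a service left in the "failed" state after a stop failure.
clustat                            # the service shows state "failed" on its last owner
clusvcadm -d service:myservice     # disable it first; this acknowledges the failure
                                   # (manually verify its IPs/filesystems really are released)
clusvcadm -e service:myservice     # re-enable; add "-m <member>" to start it on a specific node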
From andrew at beekhof.net Thu Jun 30 03:32:09 2011 From: andrew at beekhof.net (Andrew Beekhof) Date: Thu, 30 Jun 2011 13:32:09 +1000 Subject: [Linux-cluster] relationship corosync + dlm + cman in cluster 3.1.3 In-Reply-To: References: Message-ID: On Tue, Jun 28, 2011 at 5:12 PM, Jeroen Koekkoek wrote: > Hi Andrew, > > Thanks for answering my question. While looking at the current source tree, I noticed newer versions, at least dlm_controld, will not use cman anymore. So I'll keep using 3.0.12 for now (with dlm_controld.pcmk). Do you have any estimate on the first release without cman? Of the dlm etc? No. You'd have to talk to the owners of those projects. At a guess maybe a year from now. > > Regards, > Jeroen > >> -----Original Message----- >> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster- >> bounces at redhat.com] On Behalf Of Andrew Beekhof >> Sent: Monday, June 27, 2011 1:24 AM >> To: linux clustering >> Subject: Re: [Linux-cluster] relationship corosync + dlm + cman in >> cluster 3.1.3 >> >> On Wed, Jun 22, 2011 at 5:55 PM, Jeroen Koekkoek >> wrote: >> > Hi, >> > >> > I have a question regarding the relationship between Corosync, DLM, >> and CMAN. Is the following statement correct? >> > >> > The DLM is a kernel module, dlm_controld is the control daemon. >> > CMAN is the old messaging layer, and is now stacked on OpenAIS, which >> in turn is stacked on Corosync. >> > >> > The DLM does not use CMAN (or Corosync for that matter) to >> communicate, but does fetch node information from CMAN. >> > >> > The filesystem (GFS or OCFS2) speaks to the DLM locally (in kernel?) >> and the DLM takes care of the communication. >> > >> > >> > Now for the real question. >> > >> > In the 3.1.3 release dlm_controld still depends on CMAN, but is it >> safe to say that I can just use Pacemaker and Heartbeat resource agents >> and only install CMAN so that dlm_controld can query node information? >> > >> > Or did I misunderstand the documentation? >> >> You need to make sure everyone is getting the same membership and quorum >> information. >> So yes, install CMAN for dlm_controld but also tell pacemaker to use it >> too (make sure you're on 1.1.5 or higher). >> >> > >> > Regards, >> > Jeroen >> > >> > -- >> > Linux-cluster mailing list >> > Linux-cluster at redhat.com >> > https://www.redhat.com/mailman/listinfo/linux-cluster >> > >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster >
From Rahul.Borate at sailpoint.com Thu Jun 30 05:57:43 2011 From: Rahul.Borate at sailpoint.com (Rahul Borate) Date: Thu, 30 Jun 2011 11:27:43 +0530 Subject: [Linux-cluster] Service Recovery Failure Message-ID: <59cb602d2d3493a90a462f595244ea3a@mail.gmail.com> Hi all, I just performed a test which failed miserably. I have two nodes, node-1 and node-2. The global file system /gfs is on node-1, and two HA services are running on node-1. If I unplug the cables for node-1, those two services should transfer to node-2, but node-2 did not take over the services. If I do a proper shutdown/reboot of node-1, however, the two services move to node-2 without a problem. Please help!

clustat from node-2 before unplugging the cable for node-1:

[root at Node-2 ~]# clustat
Member Status: Quorate

 Member Name                        ID   Status
 ------ ----                        ---- ------
 Node-1                             1    Online, rgmanager
 Node-2                             2    Online, Local, rgmanager

 Service Name                       Owner (Last)                   State
 ------- ----                       ----- ------                   -----
 service:nfs                        Node-1                         started
 service:ESS_HA                     Node-1                         started

clustat from node-2 after unplugging the cable for node-1:

[root at Node-2 ~]# clustat
Member Status: Quorate

 Member Name                        ID   Status
 ------ ----                        ---- ------
 Node-1                             1    Offline
 Node-2                             2    Online, Local, rgmanager

 Service Name                       Owner (Last)                   State
 ------- ----                       ----- ------                   -----
 service:nfs                        Node-1                         started
 service:ESS_HA                     Node-1                         started

/etc/cluster/cluster.conf:

[root at Node-2 ~]# cat /etc/cluster/cluster.conf
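(A hedged note, not part of the original thread: on a RHEL 5 style cluster, rgmanager normally will not recover services from a node that has dropped out of membership until that node has been successfully fenced, which is the usual reason a pulled cable behaves differently from a clean reboot. The checks below are only a sketch; the node names come from the clustat output above and the fence agent in use will differ per site.)

# Hedged sketch: confirm that fencing is configured and that the fence of Node-1 succeeded.
ccs_tool lsfence                  # fence devices defined in cluster.conf
cman_tool nodes                   # membership as cman sees it after the cable pull
cman_tool status                  # quorum state and expected votes
group_tool ls                     # a fence group stuck waiting blocks service recovery
grep -i fence /var/log/messages   # fenced logs whether the fence attempt succeeded or failed
fence_node Node-1                 # run from Node-2: manually fence Node-1 to test the fence device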