From mrugeshkarnik at gmail.com Fri May 1 05:15:05 2009 From: mrugeshkarnik at gmail.com (Mrugesh Karnik) Date: Fri, 1 May 2009 10:45:05 +0530 Subject: [Linux-cluster] qdiskd error `does match kernel's reported sector size' In-Reply-To: <1241116045.5206.163.camel@ayanami> References: <200904291547.44205.mrugeshkarnik@gmail.com> <1241116045.5206.163.camel@ayanami> Message-ID: <200905011045.05156.mrugeshkarnik@gmail.com> On Thursday 30 Apr 2009 23:57:25 Lon Hohberger wrote: > What tree did you build on the Debian node? This was a problem awhile > ago but (I thought) has been fixed for some time. I just used the redhat-cluster-suite-2.20081102-1 available in Lenny. Mrugesh From j.buzzard at dundee.ac.uk Fri May 1 09:07:07 2009 From: j.buzzard at dundee.ac.uk (Jonathan Buzzard) Date: Fri, 01 May 2009 10:07:07 +0100 Subject: [Linux-cluster] Hardware options In-Reply-To: References: Message-ID: <1241168827.29906.117.camel@penguin.lifesci.dundee.ac.uk> On Thu, 2009-04-30 at 18:22 +0100, Virginian wrote: > I was hit by a rather large electricity bill recently (at home). My > current cluster set up comprises 2 x HP Proliant DL380 G3s and an MSA > 500 storage array (all three are very heavy on the juice!). I decided > that if I want to continue playing with RHCS at home I needed to look > for a cheaper, greener option. I can easily get a couple of PC's (dual > or quad core cpus and plenty of RAM for running a virtualised cluster) > but the stumbling block has been a cheap low power shared storage > solution. The best that I have come up with so far is an offering from > Maxtor and from LaCie, which basically comprises an esata disk > enclosure that has two firewire 800 ports. I believe that Linux will > support these dual firewire 800 enclosures but I am a little concerned > about the speed (91MB/s) in comparison to a SCSI disk array. Ideally, > I would prefer a disk enclosure / array with dual esata ports but I > haven't been able to find anything. > > My question is, does anybody have a low cost hardware specification > that they are running xen and RHCS on with shared storage that won't > cost the earth and won't hit me in the wallet when it comes to paying > the electricity bill? The cheapest shared storage you are going to get is FireWire. You cannot do shared storage with eSATA. Thing of FireWire as a cheap mans Fibre Channel network in loop mode. However I would roll your own firewire enclosure and fit it out with some Western Digital VelociRaptor 10k RPM drives. As I would say I/O's per second is more important than actual throughput. You could also cheat a bit if you are using more than one physical drive, and use a bridge board per drive, and wire each drive back to the server. I reckon that you could do a couple of quad code nodes, with two 300GB VelociRaptor drives with a power budget under 400W easily, and less if you pay for laptop parts. There are also some nice cases that take two mini-ITX boards if you want to go small. JAB. -- Jonathan A. Buzzard Tel: +441382-386998 Storage Administrator, College of Life Sciences University of Dundee, DD1 5EH From virginian at blueyonder.co.uk Fri May 1 10:42:38 2009 From: virginian at blueyonder.co.uk (Virginian) Date: Fri, 1 May 2009 11:42:38 +0100 Subject: [Linux-cluster] Hardware options References: <1241168827.29906.117.camel@penguin.lifesci.dundee.ac.uk> Message-ID: <94A188CAF75140E0BAB7FE90AD911970@Desktop> Thanks Jonathon, that's very informative and some very good ideas. 
I like the idea of the 10K RPM disks, I will definitely read up on those. Also, I had been thinking of quad core CPU's too, something in a small form factor or as you say even laptop size. What I am looking for is (ideally) 2 physical machines 1 x Quad Core CPU with Intel Virtualization (for KVM) 4GB RAM External shared storage, anything from 250GB upwards (I like the idea of 2.5" disks perhaps two in RAID 1) The above would give me plenty of horse power to run quite a few guests and enable me to set up a physical and virtual cluster. If I can get the whole lot for under 400W I will be more than pleased. At present my two DL 380's run at 500W approximately as does the MSA 500 disk array. Cutting my power consumptiion by nearly 75% definitely appeals!! Anybody else got any example of a lower power set up for home use? Regards John ----- Original Message ----- From: "Jonathan Buzzard" To: "linux clustering" Sent: Friday, May 01, 2009 10:07 AM Subject: Re: [Linux-cluster] Hardware options > > On Thu, 2009-04-30 at 18:22 +0100, Virginian wrote: >> I was hit by a rather large electricity bill recently (at home). My >> current cluster set up comprises 2 x HP Proliant DL380 G3s and an MSA >> 500 storage array (all three are very heavy on the juice!). I decided >> that if I want to continue playing with RHCS at home I needed to look >> for a cheaper, greener option. I can easily get a couple of PC's (dual >> or quad core cpus and plenty of RAM for running a virtualised cluster) >> but the stumbling block has been a cheap low power shared storage >> solution. The best that I have come up with so far is an offering from >> Maxtor and from LaCie, which basically comprises an esata disk >> enclosure that has two firewire 800 ports. I believe that Linux will >> support these dual firewire 800 enclosures but I am a little concerned >> about the speed (91MB/s) in comparison to a SCSI disk array. Ideally, >> I would prefer a disk enclosure / array with dual esata ports but I >> haven't been able to find anything. >> >> My question is, does anybody have a low cost hardware specification >> that they are running xen and RHCS on with shared storage that won't >> cost the earth and won't hit me in the wallet when it comes to paying >> the electricity bill? > > The cheapest shared storage you are going to get is FireWire. You cannot > do shared storage with eSATA. Thing of FireWire as a cheap mans Fibre > Channel network in loop mode. > > However I would roll your own firewire enclosure and fit it out with > some Western Digital VelociRaptor 10k RPM drives. As I would say I/O's > per second is more important than actual throughput. > > You could also cheat a bit if you are using more than one physical > drive, and use a bridge board per drive, and wire each drive back to the > server. > > I reckon that you could do a couple of quad code nodes, with two 300GB > VelociRaptor drives with a power budget under 400W easily, and less if > you pay for laptop parts. There are also some nice cases that take two > mini-ITX boards if you want to go small. > > JAB. > > -- > Jonathan A. 
Buzzard Tel: +441382-386998 > Storage Administrator, College of Life Sciences > University of Dundee, DD1 5EH > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From j.buzzard at dundee.ac.uk Fri May 1 14:14:20 2009 From: j.buzzard at dundee.ac.uk (Jonathan Buzzard) Date: Fri, 01 May 2009 15:14:20 +0100 Subject: [Linux-cluster] Hardware options In-Reply-To: <94A188CAF75140E0BAB7FE90AD911970@Desktop> References: <1241168827.29906.117.camel@penguin.lifesci.dundee.ac.uk> <94A188CAF75140E0BAB7FE90AD911970@Desktop> Message-ID: <1241187260.4554.26.camel@penguin.lifesci.dundee.ac.uk> On Fri, 2009-05-01 at 11:42 +0100, Virginian wrote: > Thanks Jonathon, that's very informative and some very good ideas. > > I like the idea of the 10K RPM disks, I will definitely read up on those. Well worth it, make a big difference to running any virtualization solution. > > Also, I had been thinking of quad core CPU's too, something in a small form > factor or as you say even laptop size. What I am looking for is (ideally) > Have you looked at the ZOTAC GeForce 9300-ITX WiFi Mini-ITX board? It is a mini-ITX board that takes a up to a Core2 Extreme with 8GB of RAM, a 1Gbps Ethernet adaptor. No FireWire, but does have a PCI-E x16 slot, so you could add an adaptor in. You might even be able to squeeze two of these with two drives into the Travla C147 Dual Mini-ITX rackmount case. > 2 physical machines > > 1 x Quad Core CPU with Intel Virtualization (for KVM) > 4GB RAM More RAM, if you want to do virtualization this is what limits the number of guests more than anything. I would say 8GB is a minimum. > External shared storage, anything from 250GB upwards (I like the idea of > 2.5" disks perhaps two in RAID 1) > The VelociRaptor is a 2.5" SATA drive, under 10W a drive. I upgraded my workstation to the old 150GB Raptor drives a couple years back and it made a big difference when running lots of guests on VMware workstations. > The above would give me plenty of horse power to run quite a few guests and > enable me to set up a physical and virtual cluster. If I can get the whole > lot for under 400W I will be more than pleased. At present my two DL 380's > run at 500W approximately as does the MSA 500 disk array. Cutting my power > consumptiion by nearly 75% definitely appeals!! Take a look at the picoPSU power supplies. They are small and efficient, and pick the right one (for your application the M3) and you can do a UPS direct from a lead acid battery in the form of a battery backed power supply. Much more frugal than a normal UPS. Even if you don't want a UPS, one step down from mains to a beefy 12V, is more efficient. > Anybody else got any example of a lower power set up for home use? I have doing it for some time, but on VIA and now Atom boards. My current setup has a power draw *under* 30W at the wall plug, for which I get a 1.2GHz Via C7 with 1GB RAM, with a PCI ADSL card, 1GbE, 100GB of RAID-1 7200RPM, WiFi and with a battery backed PSU. It's role is a home file server, come ADSL gateway, come Wireless access point. I am looking at a new setup which will have an Atom N330 with 2GB of RAM, and a pair of 300GB VelociRaptors and a pair of 2TB 3.5" drives, cause I want to ditch the external firewire drives. The power budget for this will be under 50W and I will reuse the battery backed PSU. I would have thought that under 300W would easily be achievable using desktop processors. 
If 4GB of RAM is definitely enough, then you could go Socket P, pick one of a range of mini-ITX boards and go under 150W, possibly 100W. However this will bump the cost up because you are buying laptop parts. JAB. -- Jonathan A. Buzzard Tel: +441382-386998 Storage Administrator, College of Life Sciences University of Dundee, DD1 5EH From michael.osullivan at auckland.ac.nz Fri May 1 23:33:16 2009 From: michael.osullivan at auckland.ac.nz (Michael O'Sullivan) Date: Sat, 02 May 2009 11:33:16 +1200 Subject: [Linux-cluster] GFS/GFS2 problems with iozone Message-ID: <49FB86BC.104@auckland.ac.nz> Hi everyone, I am having some problems testing a GFS system using iozone. I am running CentOS 2.6.18-128.1.6.el5 and have a two node cluster with a GFS installed on a shared iSCSI target. The GFS sits on top of a 1.79TB clustered logical volume and can be mounted successfully on both cluster nodes. When using iozone to test performance everything goes smoothly until I get to a file size of 2GB and a record length of 2048. Then iozone exits with the error Error fwriting block 250, fd= 7 and (as far as I can tell) the GFS becomes corrupted fatal: invalid metadata block bh = 12912396 (magic) function = gfs_get_meta_buffer file = /builddir/build/BUILD/gfs-kmod-0.1.31/_kmod_build_/src/gfs/dio.c, line = 1225 Can anyone shed some light on what is happening? Kind regards, Mike O'S From theophanis_kontogiannis at yahoo.gr Sat May 2 12:15:00 2009 From: theophanis_kontogiannis at yahoo.gr (Theophanis Kontogiannis) Date: Sat, 2 May 2009 15:15:00 +0300 Subject: [Linux-cluster] Question about controlling the start of services with RIND In-Reply-To: <1241116158.5206.165.camel@ayanami> References: <006901c9c818$e0731ae0$a15950a0$@gr> <1241116158.5206.165.camel@ayanami> Message-ID: <007401c9cb1f$9ec07ed0$dc417c70$@gr> Hello Lon and All, Thank you for your answer. It helps to start getting somewhere. Is there any place I can get all the possible options for RIND and cluster.conf? Anyway, I did not made clear that in my two cluster node, the storage service for node 1 is different then the storage service for node 2. Both start it since both mount a GFS2 filesystem. Some of the other services have as preferred node 1, and some have as preffered node 2. If I put as depend="storage-t1" on a service that should start on node t1, how would the service behave if node t1 never boots, but the service is configured to start on node t2 as second choice? The service will start anyway? Is there any way to put multiple dependencies for the service in an OR way, like ... So if either node has started and if either storage service has started, the service will start? I am attaching my cluster.conf to make my point more clear, and referring to service Apache-t1-t2. Thank you All four your time. > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster- > bounces at redhat.com] On Behalf Of Lon Hohberger > Sent: Thursday, April 30, 2009 9:29 PM > To: linux clustering > Subject: Re: [Linux-cluster] Question about controlling the start of > services with RIND > > On Tue, 2009-04-28 at 18:49 +0300, Theophanis Kontogiannis wrote: > > > Could I use RIND somehow to make the rest of the clustered services to > > start only if the filesystem service has started? > > Yeah, just set: > > > ... > > > -- Lon > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... 
Name: cluster.conf Type: application/octet-stream Size: 2570 bytes Desc: not available URL: From sunhux at gmail.com Sun May 3 03:59:38 2009 From: sunhux at gmail.com (sunhux G) Date: Sun, 3 May 2009 11:59:38 +0800 Subject: [Linux-cluster] 2 node qdiskd cluster gave "Quorum Dissolved" Message-ID: <60f08e700905022059j380a0212gbaad9b351f19b0dc@mail.gmail.com> Hi I have a 2 node qdiskd cluster (OS is RHES 5.0) with 2x heartbeat cross cables between the 2 nodes Currently we manually issue the following commands to start the cluster services in the sequence below : a) cd /etc/init.d b) ./cman start c) ./clvmd ... d) ./qdiskd ... e) ./rgmanager ... & on the primary node, issue "clusvcadm ..... Oracle_Service" to start oracle services which will also mount the SAN partition. Occasionally, we ran into the error below & cluster breaks on both nodes (ie SAN partition unmounted on both and Oracle services stopped on both) : lurgmgrd[5843]: #1: Quorum Dissolved What's wrong? Usually when this happens, I could usually make the first node rejoin the cluster + mount the SAN partition but the 2nd node usually can't rejoin the cluster/mount SAN and has to be rebooted and reissued with the commands a-e for it to rejoin the cluster. Thanks for any insights -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpeterso at redhat.com Mon May 4 15:05:20 2009 From: rpeterso at redhat.com (Bob Peterson) Date: Mon, 4 May 2009 11:05:20 -0400 (EDT) Subject: [Linux-cluster] GFS/GFS2 problems with iozone In-Reply-To: <49FB86BC.104@auckland.ac.nz> Message-ID: <494270654.5591241449520328.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> ----- "Michael O'Sullivan" wrote: | Hi everyone, | | I am having some problems testing a GFS system using iozone. I am | running CentOS 2.6.18-128.1.6.el5 and have a two node cluster with a | GFS | installed on a shared iSCSI target. The GFS sits on top of a 1.79TB | clustered logical volume and can be mounted successfully on both | cluster | nodes. | | When using iozone to test performance everything goes smoothly until I | | get to a file size of 2GB and a record length of 2048. Then iozone | exits | with the error | | Error fwriting block 250, fd= 7 | | and (as far as I can tell) the GFS becomes corrupted | | fatal: invalid metadata block | bh = 12912396 (magic) | function = gfs_get_meta_buffer | file = | /builddir/build/BUILD/gfs-kmod-0.1.31/_kmod_build_/src/gfs/dio.c, | line = 1225 | | Can anyone shed some light on what is happening? | | Kind regards, Mike O'S Hi Mike, Are you running iozone on a single node or both simultaneously? If it's running on two nodes, please make sure that both nodes have the iSCSI target mounted with lock_dlm protocol (not lock_nolock). Also, we need to make sure that they're not trying to use the same files in the file system because I think iozone is not cluster-aware. But even so, the file system should not be corrupted unless one of the nodes is using lock_nolock protocol, or if other boxes are using the iSCSI target without the knowledge of GFS. We regularly run iozone here, in single-node performance trials, and we have never seen this kind of problem. Also, you didn't specify what version of the kmod-gfs package you have installed. I've fixed at least one bug that might account for it, depending on what version of kmod-gfs you're running. 
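(For reference, assuming the stock CentOS/RHEL packaging, you can check the installed versions with something like:

  rpm -q kmod-gfs gfs-utils cman
  uname -r

and include that output in your reply; it makes matching the corruption against known fixes much easier.)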
I'm not aware of any other problems in the GFS kernel code that can account for this kind of corruption, except for possibly this one: https://bugzilla.redhat.com/show_bug.cgi?id=491369 (A gfs bug that really goes well beyond the nfs usage described in the bug). You can find the patch in the attachments, although I won't guarantee it'll solve your problem. There's a slight chance though. My apologies if you don't have permission to see the bug; that sometimes happens and it's out of my control. I can, however, post the patch if needed. If iozone is being run on a single node, this might be a new bug. If you can still recreate the problem with that patch in place, or if you don't want to try the patch for some reason, perhaps you should open up a bugzilla record and we'll investigate the problem. If we can reproduce it, we'll figure it out and fix it. Regards, Bob Peterson Red Hat GFS From michael.osullivan at auckland.ac.nz Mon May 4 21:37:51 2009 From: michael.osullivan at auckland.ac.nz (Michael O'Sullivan) Date: Tue, 05 May 2009 09:37:51 +1200 Subject: [Linux-cluster] Re: Re: GFS/GFS2 problems with iozone In-Reply-To: <20090504160008.07D0B618605@hormel.redhat.com> References: <20090504160008.07D0B618605@hormel.redhat.com> Message-ID: <49FF602F.2060603@auckland.ac.nz> Date: Mon, 4 May 2009 11:05:20 -0400 (EDT) > From: Bob Peterson > Subject: Re: [Linux-cluster] GFS/GFS2 problems with iozone > To: linux clustering > Message-ID: > <494270654.5591241449520328.JavaMail.root at zmail06.collab.prod.int.phx2.redhat.com> > > Content-Type: text/plain; charset=utf-8 > > ----- "Michael O'Sullivan" wrote: > | Hi everyone, > | > | I am having some problems testing a GFS system using iozone. I am > | running CentOS 2.6.18-128.1.6.el5 and have a two node cluster with a > | GFS > | installed on a shared iSCSI target. The GFS sits on top of a 1.79TB > | clustered logical volume and can be mounted successfully on both > | cluster > | nodes. > | > | When using iozone to test performance everything goes smoothly until I > | > | get to a file size of 2GB and a record length of 2048. Then iozone > | exits > | with the error > | > | Error fwriting block 250, fd= 7 > | > | and (as far as I can tell) the GFS becomes corrupted > | > | fatal: invalid metadata block > | bh = 12912396 (magic) > | function = gfs_get_meta_buffer > | file = > | /builddir/build/BUILD/gfs-kmod-0.1.31/_kmod_build_/src/gfs/dio.c, > | line = 1225 > | > | Can anyone shed some light on what is happening? > | > | Kind regards, Mike O'S > > Hi Mike, > > Are you running iozone on a single node or both simultaneously? > If it's running on two nodes, please make sure that both nodes have > the iSCSI target mounted with lock_dlm protocol (not lock_nolock). > Also, we need to make sure that they're not trying to use the same > files in the file system because I think iozone is not cluster-aware. > But even so, the file system should not be corrupted unless one of > the nodes is using lock_nolock protocol, or if other boxes are > using the iSCSI target without the knowledge of GFS. > > We regularly run iozone here, in single-node performance trials, and > we have never seen this kind of problem. > > Also, you didn't specify what version of the kmod-gfs package you have > installed. I've fixed at least one bug that might account for it, > depending on what version of kmod-gfs you're running. 
> > I'm not aware of any other problems in the GFS kernel code that can > account for this kind of corruption, except for possibly this one: > > https://bugzilla.redhat.com/show_bug.cgi?id=491369 > > (A gfs bug that really goes well beyond the nfs usage described in the bug). > You can find the patch in the attachments, although I won't guarantee > it'll solve your problem. There's a slight chance though. > My apologies if you don't have permission to see the bug; that sometimes > happens and it's out of my control. I can, however, post the patch > if needed. > > If iozone is being run on a single node, this might be a new bug. If you can > still recreate the problem with that patch in place, or if you don't want > to try the patch for some reason, perhaps you should open up a bugzilla > record and we'll investigate the problem. If we can reproduce it, we'll > figure it out and fix it. > > Regards, > > Bob Peterson > Red Hat GFS > Hi Bob, I have changed back to GFS2 (as I realised this is now production ready, is that correct?), but I am still having similar problems. I am running iozone on a single node and accessing the mount point of GFS2 running with lock_dlm. Note that the GFS2 is created on a multipathed device created via iSCSI/DRBD. However, I run the following commands: gfs2_fsck # which shows no errors on either node mount -t gfs2 /dev/iscsi_mirror/lvol0 /mnt/iscsi_mirror/ #mounts the file system (on top of iSCSI/DRBD) on both nodes /usr/src/ioszone3_321/src/current/iozone -Ra -g 4G -f /mnt/iscsi_mirror/test # Only on node 1 This gets to 1048576 KB and reclen 256 before giving Error reading block 1018 b6e00000 I can fix the GFS2 using gfs2_fsck (it fixes some dirty journals, but no other changes). I don't have the error messages from this latest test as I ran it over the weekend and /var/log/messages doesn't have the error messages anymore. I can recreate this test and record the error messages if necessary, but I wonder if the patch you talked about also exists for GFS2? Thanks very much for your help, Mike From esggrupos at gmail.com Tue May 5 15:37:56 2009 From: esggrupos at gmail.com (ESGLinux) Date: Tue, 5 May 2009 17:37:56 +0200 Subject: [Linux-cluster] help configuring HP ILO Message-ID: <3128ba140905050837j5654c099k4c94b3cd8a4343de@mail.gmail.com> Hello all, I?m configuring a 2 nodes cluster on 2 servers HP Proliant DL165 G5 This servers have HP ProLiant Lights-Out 100 Remote Management and I want to use it as fencing device. My first idea was to configure them with IPMI and it works almost fine but I have detected that when I have the network down, the fence devices doesn't work because the nodes can't reach the othe node to fence it. I have tried with a dedicated switch and a direct cable but it doesn't work and I begin to think I?m doing something wrong because the interface with the ipmi configured doesnt appear on the servers. I?ll try to explain: I have node1 eth0: IP1 eth1: IP2 In the bios I have configured ethMng : IP3 node2 eth0: IP4 eth1: IP5 In the bios I have configured ethMng : IP6 with the network up all works fine and I can use fence_node to fence the nodes, and the cluster works fine. But, if I disconnect IP1, IP2, IP3, IP4 (This simulate a switch fail) I expect the cluster become fencing with IP3 and IP6 but the system doesn't find these IP?s and all the cluster hungs. 
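(A quick way to check whether the management IPs are reachable at all, assuming ipmitool is installed and using placeholder credentials, is something like:

  ipmitool -I lanplus -H IP6 -U admin -P secret chassis power status
  fence_ipmilan -a IP6 -l admin -p secret -o status

If these fail while the switch is down, the fence agent simply has no path to the other node's management processor.)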
Looking the fence devices avaliable with conga I have seen that there is one called: ** HP iLOwith this parameters to configure: Name Hostname Login Password Password Script (optional) All are self explanatory but I don?t know what to put on Hostname (which hostname? the same machine, the other? FQDN.. IP....) So, I have 2 questions: If I use IPMI what I?m doing wrong? and If I use HP iLO, what I need to configure??? any idea, manual, doc, suggest.... is welcome thanks in advance ESG -------------- next part -------------- An HTML attachment was scrubbed... URL: From cthulhucalling at gmail.com Tue May 5 16:01:56 2009 From: cthulhucalling at gmail.com (Ian Hayes) Date: Tue, 5 May 2009 12:01:56 -0400 Subject: [Linux-cluster] help configuring HP ILO In-Reply-To: <3128ba140905050837j5654c099k4c94b3cd8a4343de@mail.gmail.com> References: <3128ba140905050837j5654c099k4c94b3cd8a4343de@mail.gmail.com> Message-ID: <36df569a0905050901y3722c456i695f89174340d2ac@mail.gmail.com> For hostname you can put the FQDN or IP address... I believe that you're a bit confused what iLO is capable of. IP3 and IP6 are for the iLO, the cluster can't use them for networking. The cluster members need to be able to reach the iLO (IP3 and 6 in this case) from eth0 or eth1. In a 2-node cluster, this can be as simple as connecting eth0 or eth1 on one node to the iLO of the other node via crossover cable. The iLO is its own device that exists outside of the operating system. Here's an example of a cluster that I've built previously that is similar to your setup Host1: eth0 192.168.0.1 (host1) eth1 10.1.1.1 (host1-management) iLO 10.1.1.2 Host 2 eth0 192.168.0.2 (host2) eth1: 10.1.1.3 (host2-management) iLO 10.1.1.4 All cluster management communication in this cluster is via eth1. I specified host1-management and host2-management as the hostnames in the cluster config to partition off cluster traffic from the interfaces that are actually doing the VIP work. The nodes provide a virtual IP on eth0, and a script service, with the daemon bound to the VIP. For the iLOs and eth1, you could either plug them into a switch on their own non-trunked VLAN, or you can connect eth1 of host1 to the iLO of host 2, and eth1 of host2 to the iLO of host1. Both eth1 and iLOs don't need a gateway since they're on the same subnet. To configure the iLO, you just set up the correct IP address, mask and create a username and password that has the appropriate privileges (power). These get put into the cluster.conf file via system-config-cluster or Luci. You would need to create two fence resources. In the above case, I would create a Fence_Host_1 and Fence_Host_2 dence devices, using fence_ilo. Fence_Host_1 would have the IP address of host1's iLO, a valid login and password for that iLO. Host2 is similar, but has the IP address of host2's iLO. Attach Fence_Host_1 to host1 and Fence_Host_2 to host2. This way, the entire cluster knows "to fence host1, I see that I need to use the Fence_Host_1 method. Fence_host_1 uses fence_ilo as its method, target ip address 10.1.1.2, username foo, password bar. To fence host2, it uses fence_ilo as its method, target address 10.1.1.4, username foobar, password barfoo". These get passed to the fence_ilo script and it handles the rest. You can play with this my manually running fence_ilo. 
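For example, a manual status check against host1's iLO, using the addresses and credentials from the layout above (check fence_ilo -h for the exact option names in your version), would look something like:

  fence_ilo -a 10.1.1.2 -l foo -p bar -o status

and the corresponding cluster.conf fragment would be roughly:

  <fencedevices>
    <fencedevice agent="fence_ilo" name="Fence_Host_1" hostname="10.1.1.2" login="foo" passwd="bar"/>
    <fencedevice agent="fence_ilo" name="Fence_Host_2" hostname="10.1.1.4" login="foobar" passwd="barfoo"/>
  </fencedevices>

with each clusternode's <fence> section referencing its own device, e.g. <device name="Fence_Host_1"/> under host1's fence method.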
On Tue, May 5, 2009 at 11:37 AM, ESGLinux wrote: > Hello all, > > I?m configuring a 2 nodes cluster on 2 servers HP Proliant DL165 G5 > This servers have HP ProLiant Lights-Out 100 Remote Management and I want > to use it as fencing device. > > My first idea was to configure them with IPMI and it works almost fine but > I have detected that when I have the network down, the fence devices doesn't > work because the nodes can't reach the othe node to fence it. > > I have tried with a dedicated switch and a direct cable but it doesn't work > and I begin to think I?m doing something wrong because the interface with > the ipmi configured doesnt appear on the servers. I?ll try to explain: > > I have > node1 > eth0: IP1 > eth1: IP2 > In the bios I have configured ethMng : IP3 > > node2 > eth0: IP4 > eth1: IP5 > In the bios I have configured ethMng : IP6 > > > with the network up all works fine and I can use fence_node to fence the > nodes, and the cluster works fine. > > But, if I disconnect IP1, IP2, IP3, IP4 (This simulate a switch fail) I > expect the cluster become fencing with IP3 and IP6 but the system doesn't > find these IP?s and all the cluster hungs. > > Looking the fence devices avaliable with conga I have seen that there is > one called: > ** > HP iLOwith this parameters to configure: > Name > Hostname > Login > Password > Password Script (optional) > > All are self explanatory but I don?t know what to put on Hostname (which > hostname? the same machine, the other? FQDN.. IP....) > > So, I have 2 questions: > If I use IPMI what I?m doing wrong? > and > If I use HP iLO, what I need to configure??? > > any idea, manual, doc, suggest.... is welcome > > thanks in advance > > ESG > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From esggrupos at gmail.com Tue May 5 16:54:04 2009 From: esggrupos at gmail.com (ESGLinux) Date: Tue, 5 May 2009 18:54:04 +0200 Subject: [Linux-cluster] help configuring HP ILO In-Reply-To: <36df569a0905050901y3722c456i695f89174340d2ac@mail.gmail.com> References: <3128ba140905050837j5654c099k4c94b3cd8a4343de@mail.gmail.com> <36df569a0905050901y3722c456i695f89174340d2ac@mail.gmail.com> Message-ID: <3128ba140905050954t648a283ev601e5352a8f6d9f3@mail.gmail.com> Hello, Thanks for your answer... 2009/5/5 Ian Hayes > For hostname you can put the FQDN or IP address... > > I believe that you're a bit confused what iLO is capable of. I absolutelly agree with you ;-) > IP3 and IP6 are for the iLO, the cluster can't use them for networking. I don't use for networking (I think...) I only want to use it to fence.... (i?m begin to think this is my mistake) > The cluster members need to be able to reach the iLO (IP3 and 6 in this > case) from eth0 or eth1. I think I could reach the iLO from the interfaces ILO (In my configuration ethMng, IP3 and IP6) In a 2-node cluster, this can be as simple as connecting eth0 or eth1 on one > node to the iLO of the other node via crossover cable. The iLO is its own > device that exists outside of the operating system. > > Here's an example of a cluster that I've built previously that is similar > to your setup > > Host1: > eth0 192.168.0.1 (host1) > eth1 10.1.1.1 (host1-management) > iLO 10.1.1.2 > > Host 2 > eth0 192.168.0.2 (host2) > eth1: 10.1.1.3 (host2-management) > iLO 10.1.1.4 > > All cluster management communication in this cluster is via eth1. 
I > specified host1-management and host2-management as the hostnames in the > cluster config to partition off cluster traffic from the interfaces that are > actually doing the VIP work. The nodes provide a virtual IP on eth0, and a > script service, with the daemon bound to the VIP. For the iLOs and eth1, you > could either plug them into a switch on their own non-trunked VLAN, or you > can connect eth1 of host1 to the iLO of host 2, and eth1 of host2 to the iLO > of host1. Both eth1 and iLOs don't need a gateway since they're on the same > subnet. > If I have understand, if I use a dedicated switch, I must to connect IP2, IP3, IP5 and IP6 to the same switche and IP1 and IP4 to the service switch, isn?t it? > > To configure the iLO, you just set up the correct IP address, mask and > create a username and password that has the appropriate privileges (power). > These get put into the cluster.conf file via system-config-cluster or Luci. > You would need to create two fence resources. In the above case, I would > create a Fence_Host_1 and Fence_Host_2 dence devices, using fence_ilo. > this is ok, is what I have done but with fence_ipmi > > Fence_Host_1 would have the IP address of host1's iLO, a valid login and > password for that iLO. Host2 is similar, but has the IP address of host2's > iLO. Attach Fence_Host_1 to host1 and Fence_Host_2 to host2. This way, the > entire cluster knows "to fence host1, I see that I need to use the > Fence_Host_1 method. Fence_host_1 uses fence_ilo as its method, target ip > address 10.1.1.2, username foo, password bar. To fence host2, it uses > fence_ilo as its method, target address 10.1.1.4, username foobar, password > barfoo". These get passed to the fence_ilo script and it handles the rest. > You can play with this my manually running fence_ilo. I think I have understood My problem was that I thought I can reach the iLO interfaces only using the iLO interfaces. I?ll try this configuration and I will post my results, thanks for your answer ESG > > > On Tue, May 5, 2009 at 11:37 AM, ESGLinux wrote: > >> Hello all, >> >> I?m configuring a 2 nodes cluster on 2 servers HP Proliant DL165 G5 >> This servers have HP ProLiant Lights-Out 100 Remote Management and I want >> to use it as fencing device. >> >> My first idea was to configure them with IPMI and it works almost fine but >> I have detected that when I have the network down, the fence devices doesn't >> work because the nodes can't reach the othe node to fence it. >> >> I have tried with a dedicated switch and a direct cable but it doesn't >> work and I begin to think I?m doing something wrong because the interface >> with the ipmi configured doesnt appear on the servers. I?ll try to explain: >> >> I have >> node1 >> eth0: IP1 >> eth1: IP2 >> In the bios I have configured ethMng : IP3 >> >> node2 >> eth0: IP4 >> eth1: IP5 >> In the bios I have configured ethMng : IP6 >> >> >> with the network up all works fine and I can use fence_node to fence the >> nodes, and the cluster works fine. >> >> But, if I disconnect IP1, IP2, IP3, IP4 (This simulate a switch fail) I >> expect the cluster become fencing with IP3 and IP6 but the system doesn't >> find these IP?s and all the cluster hungs. >> >> Looking the fence devices avaliable with conga I have seen that there is >> one called: >> ** >> HP iLOwith this parameters to configure: >> Name >> Hostname >> Login >> Password >> Password Script (optional) >> >> All are self explanatory but I don?t know what to put on Hostname (which >> hostname? 
the same machine, the other? FQDN.. IP....) >> >> So, I have 2 questions: >> If I use IPMI what I?m doing wrong? >> and >> If I use HP iLO, what I need to configure??? >> >> any idea, manual, doc, suggest.... is welcome >> >> thanks in advance >> >> ESG >> >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From EliasM at dnb.com Tue May 5 17:48:09 2009 From: EliasM at dnb.com (Elias, Michael) Date: Tue, 5 May 2009 13:48:09 -0400 Subject: [Linux-cluster] Heartbeat time outs in rhel4 understanding Message-ID: <437A36C1327D794D87D207AC80BDD8FD0B05191C@DNBMSXBH002.dnbint.net> I am trying to understand how these timers interact with each other. In a RHEL4 cluster the heartbeat defaults are; hello_timer:5 max_retries:5 deadnode_timeout:21 Meaning a heartbeat message is sent every 5 seconds, if it fails to receive a response it will start a deadnode counter @ 21 seconds. It will also try to send 5 more heartbeat requests. What is the interval of those retries? If none of those requests receive a response. 5 seconds pass.. there is 15 seconds left on the deadnode timer and we try upto 5 times to get a response.... This goes on until we hit the 4th iteration of the hellotimer it tries again upto 5 times and fails... we then hit the 21 second on the deadnode time.. fenced takes over and wham reboot. Is my understanding of this correct???? Thanks for any help.. Michael -------------- next part -------------- An HTML attachment was scrubbed... URL: From cthulhucalling at gmail.com Tue May 5 17:12:31 2009 From: cthulhucalling at gmail.com (Ian Hayes) Date: Tue, 5 May 2009 13:12:31 -0400 Subject: [Linux-cluster] help configuring HP ILO In-Reply-To: <3128ba140905050954t648a283ev601e5352a8f6d9f3@mail.gmail.com> References: <3128ba140905050837j5654c099k4c94b3cd8a4343de@mail.gmail.com> <36df569a0905050901y3722c456i695f89174340d2ac@mail.gmail.com> <3128ba140905050954t648a283ev601e5352a8f6d9f3@mail.gmail.com> Message-ID: <36df569a0905051012n59bf3dcfn340ce96cfc7b62c5@mail.gmail.com> On Tue, May 5, 2009 at 12:54 PM, ESGLinux wrote: > Hello, > > Thanks for your answer... > > 2009/5/5 Ian Hayes >> >> >> >> All cluster management communication in this cluster is via eth1. I >> specified host1-management and host2-management as the hostnames in the >> cluster config to partition off cluster traffic from the interfaces that are >> actually doing the VIP work. The nodes provide a virtual IP on eth0, and a >> script service, with the daemon bound to the VIP. For the iLOs and eth1, you >> could either plug them into a switch on their own non-trunked VLAN, or you >> can connect eth1 of host1 to the iLO of host 2, and eth1 of host2 to the iLO >> of host1. Both eth1 and iLOs don't need a gateway since they're on the same >> subnet. >> > > If I have understand, if I use a dedicated switch, I must to connect IP2, > IP3, IP5 and IP6 to the same switche and IP1 and IP4 to the service switch, > isn?t it? > If you're using a dedicated switch for cluster management, yes. Assuming that IP1 and 4 are your public facing interfaces that will be holding the service. You can use 2 and 5 for cluster management and fencing via IP3 and 6. 
Or you can use crossover cables IP2 to IP6, and IP5 to IP3, but you will have to run your cluster management over eth0, on the same interface that you have your services bound to. Your choice. But you can't reach the iLO interfaces using the other iLOs. Think of them as "recieve only". It's up to one of the hosts to establish a connection to them and issue commands. -------------- next part -------------- An HTML attachment was scrubbed... URL: From garromo at us.ibm.com Tue May 5 19:31:01 2009 From: garromo at us.ibm.com (Gary Romo) Date: Tue, 5 May 2009 13:31:01 -0600 Subject: [Linux-cluster] service failover Message-ID: Hello. How can I tell when a service has failed over? I'm looking for date and time stamps. Thanks. -Gary -------------- next part -------------- An HTML attachment was scrubbed... URL: From cthulhucalling at gmail.com Tue May 5 20:15:43 2009 From: cthulhucalling at gmail.com (Ian Hayes) Date: Tue, 5 May 2009 16:15:43 -0400 Subject: [Linux-cluster] service failover In-Reply-To: References: Message-ID: <36df569a0905051315x3811bb5encbac1a954b43430@mail.gmail.com> /var/log/messages Message from clurgmgrd will say " Service service:servicename started" There will be a bunch of message prior to that announcing that the cluster is in trouble, it is fencing the downed node, and that a node is taking over the service. The service will go throught he normal startup messages. The final message will be the one that says that the service is started. On Tue, May 5, 2009 at 3:31 PM, Gary Romo wrote: > Hello. > > How can I tell when a service has failed over? > I'm looking for date and time stamps. Thanks. > > -Gary > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Harri.Paivaniemi at tieto.com Wed May 6 03:56:15 2009 From: Harri.Paivaniemi at tieto.com (Harri.Paivaniemi at tieto.com) Date: Wed, 6 May 2009 06:56:15 +0300 Subject: [Linux-cluster] service failover References: Message-ID: <41E8D4F07FCE154CBEBAA60FFC92F67754D50E@apollo.eu.tieto.com> If you need a quick-n-dirty way, you can always put something in to services starting script so every time cluster says start|stop to that service, you can send mail to yourself etc and you don't have to grep messages-log ;) -hjp -----Original Message----- From: linux-cluster-bounces at redhat.com on behalf of Gary Romo Sent: Tue 5/5/2009 22:31 To: linux-cluster at redhat.com Subject: [Linux-cluster] service failover Hello. How can I tell when a service has failed over? I'm looking for date and time stamps. Thanks. -Gary -------------- next part -------------- A non-text attachment was scrubbed... 
Name: winmail.dat Type: application/ms-tnef Size: 2681 bytes Desc: not available URL: From esggrupos at gmail.com Wed May 6 06:53:23 2009 From: esggrupos at gmail.com (ESGLinux) Date: Wed, 6 May 2009 08:53:23 +0200 Subject: [Linux-cluster] help configuring HP ILO In-Reply-To: <36df569a0905051012n59bf3dcfn340ce96cfc7b62c5@mail.gmail.com> References: <3128ba140905050837j5654c099k4c94b3cd8a4343de@mail.gmail.com> <36df569a0905050901y3722c456i695f89174340d2ac@mail.gmail.com> <3128ba140905050954t648a283ev601e5352a8f6d9f3@mail.gmail.com> <36df569a0905051012n59bf3dcfn340ce96cfc7b62c5@mail.gmail.com> Message-ID: <3128ba140905052353v58b79832j5905eea90f5b793a@mail.gmail.com> thanks again Ian, the key is what you have said "Think of them as "recieve only"" Now all is clear like water ;-) thank you very much for your help Greetings ESG 2009/5/5 Ian Hayes > On Tue, May 5, 2009 at 12:54 PM, ESGLinux wrote: > >> Hello, >> >> Thanks for your answer... >> >> 2009/5/5 Ian Hayes >> >>> >>> >>> All cluster management communication in this cluster is via eth1. I >>> specified host1-management and host2-management as the hostnames in the >>> cluster config to partition off cluster traffic from the interfaces that are >>> actually doing the VIP work. The nodes provide a virtual IP on eth0, and a >>> script service, with the daemon bound to the VIP. For the iLOs and eth1, you >>> could either plug them into a switch on their own non-trunked VLAN, or you >>> can connect eth1 of host1 to the iLO of host 2, and eth1 of host2 to the iLO >>> of host1. Both eth1 and iLOs don't need a gateway since they're on the same >>> subnet. >>> >> >> If I have understand, if I use a dedicated switch, I must to connect IP2, >> IP3, IP5 and IP6 to the same switche and IP1 and IP4 to the service switch, >> isn?t it? >> > > If you're using a dedicated switch for cluster management, yes. Assuming > that IP1 and 4 are your public facing interfaces that will be holding the > service. You can use 2 and 5 for cluster management and fencing via IP3 and > 6. > > Or you can use crossover cables IP2 to IP6, and IP5 to IP3, but you will > have to run your cluster management over eth0, on the same interface that > you have your services bound to. Your choice. > > But you can't reach the iLO interfaces using the other iLOs. Think of them > as "recieve only". It's up to one of the hosts to establish a connection to > them and issue commands. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ccaulfie at redhat.com Wed May 6 08:06:17 2009 From: ccaulfie at redhat.com (Chrissie Caulfield) Date: Wed, 06 May 2009 09:06:17 +0100 Subject: [Linux-cluster] Heartbeat time outs in rhel4 understanding In-Reply-To: <437A36C1327D794D87D207AC80BDD8FD0B05191C@DNBMSXBH002.dnbint.net> References: <437A36C1327D794D87D207AC80BDD8FD0B05191C@DNBMSXBH002.dnbint.net> Message-ID: <4A0144F9.5040701@redhat.com> Elias, Michael wrote: > I am trying to understand how these timers interact with each other. > > > > In a RHEL4 cluster the heartbeat defaults are; > > hello_timer:5 > > max_retries:5 > > deadnode_timeout:21 > > > > Meaning a heartbeat message is sent every 5 seconds, if it fails to > receive a response it will start a deadnode counter @ 21 seconds. It > will also try to send 5 more heartbeat requests. What is the interval of > those retries? If none of those requests receive a response. 5 seconds > pass.. 
there is 15 seconds left on the deadnode timer and we try upto 5 > times to get a response?. This goes on until we hit the 4^th iteration > of the hellotimer it tries again upto 5 times and fails? we then hit the > 21 second on the deadnode time.. fenced takes over and wham reboot. > > > > Is my understanding of this correct???? > No, I'm afraid it isn't :-) max_retries has nothing to do with the heartbeat. It is to do with cluster messages, such as service join requests, clvmd messages or the messages used in the membership protocol. So the heartbeat system is just a 5 second heartbeat and after 21 seconds the node will be evicted from the cluster and (usually) fenced. The same happens for data messages if max_retries is exceeded. The retry period here starts at 1 second and increases each time to avoid filling the ethernet buffers. I hope this helps, Chrissie From gfs2etis at ensea.fr Wed May 6 10:41:05 2009 From: gfs2etis at ensea.fr (gfs2etis) Date: Wed, 06 May 2009 12:41:05 +0200 Subject: [Linux-cluster] gsf2 Message-ID: <1241606465.8236.39.camel@tyr.ensea.fr> Hello, I am a newbie with gfs2, i have set 4 nodes on a fiber channel SAN synchronized through gfs2. When one machine crash all gfs2 crash, especially when there is a lot of IO on the gfs2 devices. I have no message on the system when it crash (???) version : gfs2-utils-2.03.07-2 thanks -- gfs2etis ETIS From ntadmin at fi.upm.es Wed May 6 10:59:45 2009 From: ntadmin at fi.upm.es (Miguel Sanchez) Date: Wed, 06 May 2009 12:59:45 +0200 Subject: [Linux-cluster] Necessary a delay to restart cman? Message-ID: <4A016DA1.5090404@fi.upm.es> Hi. I have a CentOS 5.3 cluster with two nodes. If I execute service cman restart within a node, or stop + start after few seconds, another node doesn?t recognize this membership return and its fellow stay forever offline. For example: * Before cman restart: node1# cman_tool status Version: 6.1.0 Config Version: 6 Cluster Name: CSVirtualizacion Cluster Id: 42648 Cluster Member: Yes Cluster Generation: 202600 Membership state: Cluster-Member Nodes: 2 Expected votes: 1 Total votes: 2 Quorum: 1 Active subsystems: 7 Flags: 2node Dirty Ports Bound: 0 Node name: patty Node ID: 1 Multicast addresses: 224.0.0.133 Node addresses: 138.100.8.70 * After cman stop for node2 (and before a number seconds < token parameter) node1# cman_tool status Version: 6.1.0 Config Version: 6 Cluster Name: CSVirtualizacion Cluster Id: 42648 Cluster Member: Yes Cluster Generation: 202600 Membership state: Cluster-Member Nodes: 2 Expected votes: 1 Total votes: 1 Quorum: 1 Active subsystems: 7 Flags: 2node Dirty Ports Bound: 0 Node name: patty Node ID: 1 Multicast addresses: 224.0.0.133 Node addresses: 138.100.8.70 Wed May 6 12:29:38 CEST 2009 * After cman stop for node2 (and after a number seconds > token parameter) node1# date; cman_tool status Version: 6.1.0 Config Version: 6 Cluster Name: CSVirtualizacion Cluster Id: 42648 Cluster Member: Yes Cluster Generation: 202604 Membership state: Cluster-Member Nodes: 1 Expected votes: 1 Total votes: 1 Quorum: 1 Active subsystems: 7 Flags: 2node Dirty Ports Bound: 0 Node name: patty Node ID: 1 Multicast addresses: 224.0.0.133 Node addresses: 138.100.8.70 Wed May 6 12:29:47 CEST 2009 /var/log/messages: May 6 12:35:20 node2 openais[17262]: [TOTEM] The token was lost in the OPERATIONAL state. May 6 12:35:20 node2 openais[17262]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes). 
May 6 12:35:20 node2 openais[17262]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes). May 6 12:35:20 node2 openais[17262]: [TOTEM] entering GATHER state from 2. May 6 12:35:25 node2 openais[17262]: [TOTEM] entering GATHER state from 0. May 6 12:35:25 node2 openais[17262]: [TOTEM] Creating commit token because I am the rep. May 6 12:35:25 node2 openais[17262]: [TOTEM] Saving state aru 26 high seq received 26 May 6 12:35:25 node2 openais[17262]: [TOTEM] Storing new sequence id for ring 31780 May 6 12:35:25 node2 openais[17262]: [TOTEM] entering COMMIT state. May 6 12:35:25 node2 openais[17262]: [TOTEM] entering RECOVERY state. May 6 12:35:25 node2 openais[17262]: [TOTEM] position [0] member 10.10.8.70: May 6 12:35:25 node2 openais[17262]: [TOTEM] previous ring seq 202620 rep 10.10.8.70 May 6 12:35:25 node2 openais[17262]: [TOTEM] aru 26 high delivered 26 received flag 1 May 6 12:35:25 node2 openais[17262]: [TOTEM] Did not need to originate any messages in recovery. May 6 12:35:25 node2 openais[17262]: [TOTEM] Sending initial ORF token May 6 12:35:25 node2 openais[17262]: [CLM ] CLM CONFIGURATION CHANGE May 6 12:35:25 node2 openais[17262]: [CLM ] New Configuration: May 6 12:35:25 node2 openais[17262]: [CLM ] r(0) ip(10.10.8.70) May 6 12:35:25 node2 openais[17262]: [CLM ] Members Left: May 6 12:35:25 node2 openais[17262]: [CLM ] r(0) ip(10.10.8.71) May 6 12:35:25 node2 openais[17262]: [CLM ] Members Joined: May 6 12:35:25 node2 openais[17262]: [CLM ] CLM CONFIGURATION CHANGE May 6 12:35:25 node2 openais[17262]: [CLM ] New Configuration: May 6 12:35:25 node2 openais[17262]: [CLM ] r(0) ip(10.10.8.70) May 6 12:35:25 node2 openais[17262]: [CLM ] Members Left: May 6 12:35:25 node2 openais[17262]: [CLM ] Members Joined: May 6 12:35:25 node2 openais[17262]: [SYNC ] This node is within the primary component and will provide service. May 6 12:35:25 node2 openais[17262]: [TOTEM] entering OPERATIONAL state. May 6 12:35:25 node2 kernel: dlm: closing connection to node 2 May 6 12:35:25 node2 openais[17262]: [CLM ] got nodejoin message 10.10.8.70 May 6 12:35:25 node2 openais[17262]: [CPG ] got joinlist message from node 1 if node2 doesn`t wait for run cman start to the detection the operational token's lost, node1 detect node2 like offline forever. Following attempts for cman restarts don`t change this state: node1# cman_tool nodes Node Sts Inc Joined Name 1 M 202616 2009-05-06 12:34:43 node1 2 X 202628 node2 node2# cman_tool nodes Node Sts Inc Joined Name 1 M 202644 2009-05-06 12:51:04 node1 2 M 202640 2009-05-06 12:51:04 node2 Is it necessary a delay for cman stop + start to avoid this inconsistent state or really is it a bug? Regards. From reggaestar at gmail.com Wed May 6 11:37:31 2009 From: reggaestar at gmail.com (remi doubi) Date: Wed, 6 May 2009 11:37:31 +0000 Subject: [Linux-cluster] lock_gulm Message-ID: <3c88c73a0905060437j207a4429lc52b6007330c1163@mail.gmail.com> Hi everyone, i'm trying to work with gfs to create a File System in order to be used by the cluster but i got an error when trying to mount : #gfs_mkfs -p lock_gulm -t cluster-DomU:gfs -j 2 /dev/VolGroup00/test This will destroy any data on /dev/VolGroup00/test. It appears to contain a gfs filesystem. Are you sure you want to proceed? [y/n] y Device: /dev/VolGroup00/test Blocksize: 4096 Filesystem Size: 458692 Journals: 2 Resource Groups: 8 Locking Protocol: lock_gulm Lock Table: cluster-DomU:gfs Syncing... 
All Done #mount /dev/VolGroup00/test /mnt/bob/ /sbin/mount.gfs: error mounting /dev/mapper/VolGroup00-test on /mnt/bob: No such file or directory please anyone can help !! -------------- next part -------------- An HTML attachment was scrubbed... URL: From ccaulfie at redhat.com Wed May 6 12:01:21 2009 From: ccaulfie at redhat.com (Chrissie Caulfield) Date: Wed, 06 May 2009 13:01:21 +0100 Subject: [Linux-cluster] Necessary a delay to restart cman? In-Reply-To: <4A016DA1.5090404@fi.upm.es> References: <4A016DA1.5090404@fi.upm.es> Message-ID: <4A017C11.9010006@redhat.com> Miguel Sanchez wrote: > Hi. I have a CentOS 5.3 cluster with two nodes. If I execute service > cman restart within a node, or stop + start after few seconds, another > node doesn?t recognize this membership return and its fellow stay > forever offline. > > For example: > > * Before cman restart: > > node1# cman_tool status > Version: 6.1.0 > Config Version: 6 > Cluster Name: CSVirtualizacion > Cluster Id: 42648 > Cluster Member: Yes > Cluster Generation: 202600 > Membership state: Cluster-Member > Nodes: 2 > Expected votes: 1 > Total votes: 2 > Quorum: 1 > Active subsystems: 7 > Flags: 2node Dirty > Ports Bound: 0 > Node name: patty > Node ID: 1 > Multicast addresses: 224.0.0.133 > Node addresses: 138.100.8.70 > > * After cman stop for node2 (and before a number seconds < token parameter) > > node1# cman_tool status > Version: 6.1.0 > Config Version: 6 > Cluster Name: CSVirtualizacion > Cluster Id: 42648 > Cluster Member: Yes > Cluster Generation: 202600 > Membership state: Cluster-Member > Nodes: 2 > Expected votes: 1 > Total votes: 1 > Quorum: 1 > Active subsystems: 7 > Flags: 2node Dirty > Ports Bound: 0 > Node name: patty > Node ID: 1 > Multicast addresses: 224.0.0.133 > Node addresses: 138.100.8.70 > Wed May 6 12:29:38 CEST 2009 > > * After cman stop for node2 (and after a number seconds > token parameter) > > node1# date; cman_tool status > Version: 6.1.0 > Config Version: 6 > Cluster Name: CSVirtualizacion > Cluster Id: 42648 > Cluster Member: Yes > Cluster Generation: 202604 > Membership state: Cluster-Member > Nodes: 1 > Expected votes: 1 > Total votes: 1 > Quorum: 1 > Active subsystems: 7 > Flags: 2node Dirty > Ports Bound: 0 > Node name: patty > Node ID: 1 > Multicast addresses: 224.0.0.133 > Node addresses: 138.100.8.70 > Wed May 6 12:29:47 CEST 2009 > > /var/log/messages: > May 6 12:35:20 node2 openais[17262]: [TOTEM] The token was lost in the > OPERATIONAL state. > May 6 12:35:20 node2 openais[17262]: [TOTEM] Receive multicast socket > recv buffer size (288000 bytes). > May 6 12:35:20 node2 openais[17262]: [TOTEM] Transmit multicast socket > send buffer size (262142 bytes). > May 6 12:35:20 node2 openais[17262]: [TOTEM] entering GATHER state from 2. > May 6 12:35:25 node2 openais[17262]: [TOTEM] entering GATHER state from 0. > May 6 12:35:25 node2 openais[17262]: [TOTEM] Creating commit token > because I am the rep. > May 6 12:35:25 node2 openais[17262]: [TOTEM] Saving state aru 26 high > seq received 26 > May 6 12:35:25 node2 openais[17262]: [TOTEM] Storing new sequence id > for ring 31780 > May 6 12:35:25 node2 openais[17262]: [TOTEM] entering COMMIT state. > May 6 12:35:25 node2 openais[17262]: [TOTEM] entering RECOVERY state. 
> May 6 12:35:25 node2 openais[17262]: [TOTEM] position [0] member > 10.10.8.70: > May 6 12:35:25 node2 openais[17262]: [TOTEM] previous ring seq 202620 > rep 10.10.8.70 > May 6 12:35:25 node2 openais[17262]: [TOTEM] aru 26 high delivered 26 > received flag 1 > May 6 12:35:25 node2 openais[17262]: [TOTEM] Did not need to originate > any messages in recovery. > May 6 12:35:25 node2 openais[17262]: [TOTEM] Sending initial ORF token > May 6 12:35:25 node2 openais[17262]: [CLM ] CLM CONFIGURATION CHANGE > May 6 12:35:25 node2 openais[17262]: [CLM ] New Configuration: > May 6 12:35:25 node2 openais[17262]: [CLM ] r(0) ip(10.10.8.70) > May 6 12:35:25 node2 openais[17262]: [CLM ] Members Left: > May 6 12:35:25 node2 openais[17262]: [CLM ] r(0) ip(10.10.8.71) > May 6 12:35:25 node2 openais[17262]: [CLM ] Members Joined: > May 6 12:35:25 node2 openais[17262]: [CLM ] CLM CONFIGURATION CHANGE > May 6 12:35:25 node2 openais[17262]: [CLM ] New Configuration: > May 6 12:35:25 node2 openais[17262]: [CLM ] r(0) ip(10.10.8.70) > May 6 12:35:25 node2 openais[17262]: [CLM ] Members Left: > May 6 12:35:25 node2 openais[17262]: [CLM ] Members Joined: > May 6 12:35:25 node2 openais[17262]: [SYNC ] This node is within the > primary component and will provide service. > May 6 12:35:25 node2 openais[17262]: [TOTEM] entering OPERATIONAL state. > May 6 12:35:25 node2 kernel: dlm: closing connection to node 2 > May 6 12:35:25 node2 openais[17262]: [CLM ] got nodejoin message > 10.10.8.70 > May 6 12:35:25 node2 openais[17262]: [CPG ] got joinlist message from > node 1 > > > if node2 doesn`t wait for run cman start to the detection the > operational token's lost, node1 detect node2 like offline forever. > Following attempts for cman restarts don`t change this state: > node1# cman_tool nodes > Node Sts Inc Joined Name > 1 M 202616 2009-05-06 12:34:43 node1 > 2 X 202628 node2 > node2# cman_tool nodes > Node Sts Inc Joined Name > 1 M 202644 2009-05-06 12:51:04 node1 > 2 M 202640 2009-05-06 12:51:04 node2 > > > Is it necessary a delay for cman stop + start to avoid this inconsistent > state or really is it a bug? I suspect it's an instance of this known bug. Check that CentOS has the appropriate patch available: https://bugzilla.redhat.com/show_bug.cgi?id=485026 Chrissie From swhiteho at redhat.com Wed May 6 12:21:10 2009 From: swhiteho at redhat.com (Steven Whitehouse) Date: Wed, 06 May 2009 13:21:10 +0100 Subject: [Linux-cluster] lock_gulm In-Reply-To: <3c88c73a0905060437j207a4429lc52b6007330c1163@mail.gmail.com> References: <3c88c73a0905060437j207a4429lc52b6007330c1163@mail.gmail.com> Message-ID: <1241612470.29604.221.camel@localhost.localdomain> Hi, On Wed, 2009-05-06 at 11:37 +0000, remi doubi wrote: > Hi everyone, i'm trying to work with gfs to create a File System in > order to be used by the cluster but i got an error when trying to > mount : > > #gfs_mkfs -p lock_gulm -t cluster-DomU:gfs -j 2 /dev/VolGroup00/test > > This will destroy any data on /dev/VolGroup00/test. > It appears to contain a gfs filesystem. > > Are you sure you want to proceed? [y/n] y > > Device: /dev/VolGroup00/test > Blocksize: 4096 > Filesystem Size: 458692 > Journals: 2 > Resource Groups: 8 > Locking Protocol: lock_gulm > Lock Table: cluster-DomU:gfs > > Syncing... > All Done > > #mount /dev/VolGroup00/test /mnt/bob/ > /sbin/mount.gfs: error mounting /dev/mapper/VolGroup00-test > on /mnt/bob: No such file or directory > > please anyone can help !! 
> > I've no idea what version of gfs you are using, but you almost certainly want to be using lock_dlm and not lock_gulm. I'm assuming also, or course, that /mnt/bob does actually exist, is a directory and is accessible to the mount process, Steve. > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From adam at gradientzero.com Wed May 6 12:59:05 2009 From: adam at gradientzero.com (Adam Hough) Date: Wed, 6 May 2009 07:59:05 -0500 Subject: [Linux-cluster] Necessary a delay to restart cman? In-Reply-To: <4A017C11.9010006@redhat.com> References: <4A016DA1.5090404@fi.upm.es> <4A017C11.9010006@redhat.com> Message-ID: On Wed, May 6, 2009 at 7:01 AM, Chrissie Caulfield wrote: > Miguel Sanchez wrote: >> Hi. I have a CentOS 5.3 cluster with two nodes. If I execute service >> cman restart within a node, or stop + start after few seconds, another >> node doesn?t recognize this membership return and its fellow stay >> forever offline. >> >> For example: >> >> * Before cman restart: >> >> node1# cman_tool status >> Version: 6.1.0 >> Config Version: 6 >> Cluster Name: CSVirtualizacion >> Cluster Id: 42648 >> Cluster Member: Yes >> Cluster Generation: 202600 >> Membership state: Cluster-Member >> Nodes: 2 >> Expected votes: 1 >> Total votes: 2 >> Quorum: 1 >> Active subsystems: 7 >> Flags: 2node Dirty >> Ports Bound: 0 >> Node name: patty >> Node ID: 1 >> Multicast addresses: 224.0.0.133 >> Node addresses: 138.100.8.70 >> >> * After cman stop for node2 (and before a number seconds < token parameter) >> >> node1# cman_tool status >> Version: 6.1.0 >> Config Version: 6 >> Cluster Name: CSVirtualizacion >> Cluster Id: 42648 >> Cluster Member: Yes >> Cluster Generation: 202600 >> Membership state: Cluster-Member >> Nodes: 2 >> Expected votes: 1 >> Total votes: 1 >> Quorum: 1 >> Active subsystems: 7 >> Flags: 2node Dirty >> Ports Bound: 0 >> Node name: patty >> Node ID: 1 >> Multicast addresses: 224.0.0.133 >> Node addresses: 138.100.8.70 >> Wed May ?6 12:29:38 CEST 2009 >> >> * After cman stop for node2 (and after a number seconds > token parameter) >> >> node1# date; cman_tool status >> Version: 6.1.0 >> Config Version: 6 >> Cluster Name: CSVirtualizacion >> Cluster Id: 42648 >> Cluster Member: Yes >> Cluster Generation: 202604 >> Membership state: Cluster-Member >> Nodes: 1 >> Expected votes: 1 >> Total votes: 1 >> Quorum: 1 >> Active subsystems: 7 >> Flags: 2node Dirty >> Ports Bound: 0 >> Node name: patty >> Node ID: 1 >> Multicast addresses: 224.0.0.133 >> Node addresses: 138.100.8.70 >> Wed May ?6 12:29:47 CEST 2009 >> >> /var/log/messages: >> May ?6 12:35:20 node2 openais[17262]: [TOTEM] The token was lost in the >> OPERATIONAL state. >> May ?6 12:35:20 node2 openais[17262]: [TOTEM] Receive multicast socket >> recv buffer size (288000 bytes). >> May ?6 12:35:20 node2 openais[17262]: [TOTEM] Transmit multicast socket >> send buffer size (262142 bytes). >> May ?6 12:35:20 node2 openais[17262]: [TOTEM] entering GATHER state from 2. >> May ?6 12:35:25 node2 openais[17262]: [TOTEM] entering GATHER state from 0. >> May ?6 12:35:25 node2 openais[17262]: [TOTEM] Creating commit token >> because I am the rep. >> May ?6 12:35:25 node2 openais[17262]: [TOTEM] Saving state aru 26 high >> seq received 26 >> May ?6 12:35:25 node2 openais[17262]: [TOTEM] Storing new sequence id >> for ring 31780 >> May ?6 12:35:25 node2 openais[17262]: [TOTEM] entering COMMIT state. >> May ?6 12:35:25 node2 openais[17262]: [TOTEM] entering RECOVERY state. 
>> May ?6 12:35:25 node2 openais[17262]: [TOTEM] position [0] member >> 10.10.8.70: >> May ?6 12:35:25 node2 openais[17262]: [TOTEM] previous ring seq 202620 >> rep 10.10.8.70 >> May ?6 12:35:25 node2 openais[17262]: [TOTEM] aru 26 high delivered 26 >> received flag 1 >> May ?6 12:35:25 node2 openais[17262]: [TOTEM] Did not need to originate >> any messages in recovery. >> May ?6 12:35:25 node2 openais[17262]: [TOTEM] Sending initial ORF token >> May ?6 12:35:25 node2 openais[17262]: [CLM ?] CLM CONFIGURATION CHANGE >> May ?6 12:35:25 node2 openais[17262]: [CLM ?] New Configuration: >> May ?6 12:35:25 node2 openais[17262]: [CLM ?] ? r(0) ip(10.10.8.70) >> May ?6 12:35:25 node2 openais[17262]: [CLM ?] Members Left: >> May ?6 12:35:25 node2 openais[17262]: [CLM ?] ? r(0) ip(10.10.8.71) >> May ?6 12:35:25 node2 openais[17262]: [CLM ?] Members Joined: >> May ?6 12:35:25 node2 openais[17262]: [CLM ?] CLM CONFIGURATION CHANGE >> May ?6 12:35:25 node2 openais[17262]: [CLM ?] New Configuration: >> May ?6 12:35:25 node2 openais[17262]: [CLM ?] ? r(0) ip(10.10.8.70) >> May ?6 12:35:25 node2 openais[17262]: [CLM ?] Members Left: >> May ?6 12:35:25 node2 openais[17262]: [CLM ?] Members Joined: >> May ?6 12:35:25 node2 openais[17262]: [SYNC ] This node is within the >> primary component and will provide service. >> May ?6 12:35:25 node2 openais[17262]: [TOTEM] entering OPERATIONAL state. >> May ?6 12:35:25 node2 kernel: dlm: closing connection to node 2 >> May ?6 12:35:25 node2 openais[17262]: [CLM ?] got nodejoin message >> 10.10.8.70 >> May ?6 12:35:25 node2 openais[17262]: [CPG ?] got joinlist message from >> node 1 >> >> >> if node2 doesn`t wait for run cman start to the detection the >> operational token's lost, node1 detect node2 like offline forever. >> Following attempts for cman restarts don`t change this state: >> node1# cman_tool nodes >> Node ?Sts ? Inc ? Joined ? ? ? ? ? ? ? Name >> ? 1 ? M ?202616 ? 2009-05-06 12:34:43 ?node1 >> ? 2 ? X ?202628 ? ? ? ? ? ? ? ? ? ? ? ?node2 >> node2# cman_tool nodes >> Node ?Sts ? Inc ? Joined ? ? ? ? ? ? ? Name >> ? 1 ? M ?202644 ? 2009-05-06 12:51:04 ?node1 >> ? 2 ? M ?202640 ? 2009-05-06 12:51:04 ?node2 >> >> >> Is it necessary a delay for cman stop + start to avoid this inconsistent >> state or really is it a bug? > > > I suspect it's an instance of this known bug. Check that CentOS has the > appropriate patch available: > > https://bugzilla.redhat.com/show_bug.cgi?id=485026 > > Chrissie > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > When restarting cman, I have always had to stop cman and then manually stop openais before trying to start cman again. If I do not follow these steps then the node would never rejoin the cluster or might fence the other node. From ccaulfie at redhat.com Wed May 6 13:05:41 2009 From: ccaulfie at redhat.com (Chrissie Caulfield) Date: Wed, 06 May 2009 14:05:41 +0100 Subject: [Linux-cluster] Necessary a delay to restart cman? In-Reply-To: References: <4A016DA1.5090404@fi.upm.es> <4A017C11.9010006@redhat.com> Message-ID: <4A018B25.1050800@redhat.com> Adam Hough wrote: > On Wed, May 6, 2009 at 7:01 AM, Chrissie Caulfield wrote: >> Miguel Sanchez wrote: >>> Hi. I have a CentOS 5.3 cluster with two nodes. If I execute service >>> cman restart within a node, or stop + start after few seconds, another >>> node doesn?t recognize this membership return and its fellow stay >>> forever offline. 
>>> >>> For example: >>> >>> * Before cman restart: >>> >>> node1# cman_tool status >>> Version: 6.1.0 >>> Config Version: 6 >>> Cluster Name: CSVirtualizacion >>> Cluster Id: 42648 >>> Cluster Member: Yes >>> Cluster Generation: 202600 >>> Membership state: Cluster-Member >>> Nodes: 2 >>> Expected votes: 1 >>> Total votes: 2 >>> Quorum: 1 >>> Active subsystems: 7 >>> Flags: 2node Dirty >>> Ports Bound: 0 >>> Node name: patty >>> Node ID: 1 >>> Multicast addresses: 224.0.0.133 >>> Node addresses: 138.100.8.70 >>> >>> * After cman stop for node2 (and before a number seconds < token parameter) >>> >>> node1# cman_tool status >>> Version: 6.1.0 >>> Config Version: 6 >>> Cluster Name: CSVirtualizacion >>> Cluster Id: 42648 >>> Cluster Member: Yes >>> Cluster Generation: 202600 >>> Membership state: Cluster-Member >>> Nodes: 2 >>> Expected votes: 1 >>> Total votes: 1 >>> Quorum: 1 >>> Active subsystems: 7 >>> Flags: 2node Dirty >>> Ports Bound: 0 >>> Node name: patty >>> Node ID: 1 >>> Multicast addresses: 224.0.0.133 >>> Node addresses: 138.100.8.70 >>> Wed May 6 12:29:38 CEST 2009 >>> >>> * After cman stop for node2 (and after a number seconds > token parameter) >>> >>> node1# date; cman_tool status >>> Version: 6.1.0 >>> Config Version: 6 >>> Cluster Name: CSVirtualizacion >>> Cluster Id: 42648 >>> Cluster Member: Yes >>> Cluster Generation: 202604 >>> Membership state: Cluster-Member >>> Nodes: 1 >>> Expected votes: 1 >>> Total votes: 1 >>> Quorum: 1 >>> Active subsystems: 7 >>> Flags: 2node Dirty >>> Ports Bound: 0 >>> Node name: patty >>> Node ID: 1 >>> Multicast addresses: 224.0.0.133 >>> Node addresses: 138.100.8.70 >>> Wed May 6 12:29:47 CEST 2009 >>> >>> /var/log/messages: >>> May 6 12:35:20 node2 openais[17262]: [TOTEM] The token was lost in the >>> OPERATIONAL state. >>> May 6 12:35:20 node2 openais[17262]: [TOTEM] Receive multicast socket >>> recv buffer size (288000 bytes). >>> May 6 12:35:20 node2 openais[17262]: [TOTEM] Transmit multicast socket >>> send buffer size (262142 bytes). >>> May 6 12:35:20 node2 openais[17262]: [TOTEM] entering GATHER state from 2. >>> May 6 12:35:25 node2 openais[17262]: [TOTEM] entering GATHER state from 0. >>> May 6 12:35:25 node2 openais[17262]: [TOTEM] Creating commit token >>> because I am the rep. >>> May 6 12:35:25 node2 openais[17262]: [TOTEM] Saving state aru 26 high >>> seq received 26 >>> May 6 12:35:25 node2 openais[17262]: [TOTEM] Storing new sequence id >>> for ring 31780 >>> May 6 12:35:25 node2 openais[17262]: [TOTEM] entering COMMIT state. >>> May 6 12:35:25 node2 openais[17262]: [TOTEM] entering RECOVERY state. >>> May 6 12:35:25 node2 openais[17262]: [TOTEM] position [0] member >>> 10.10.8.70: >>> May 6 12:35:25 node2 openais[17262]: [TOTEM] previous ring seq 202620 >>> rep 10.10.8.70 >>> May 6 12:35:25 node2 openais[17262]: [TOTEM] aru 26 high delivered 26 >>> received flag 1 >>> May 6 12:35:25 node2 openais[17262]: [TOTEM] Did not need to originate >>> any messages in recovery. 
>>> May 6 12:35:25 node2 openais[17262]: [TOTEM] Sending initial ORF token >>> May 6 12:35:25 node2 openais[17262]: [CLM ] CLM CONFIGURATION CHANGE >>> May 6 12:35:25 node2 openais[17262]: [CLM ] New Configuration: >>> May 6 12:35:25 node2 openais[17262]: [CLM ] r(0) ip(10.10.8.70) >>> May 6 12:35:25 node2 openais[17262]: [CLM ] Members Left: >>> May 6 12:35:25 node2 openais[17262]: [CLM ] r(0) ip(10.10.8.71) >>> May 6 12:35:25 node2 openais[17262]: [CLM ] Members Joined: >>> May 6 12:35:25 node2 openais[17262]: [CLM ] CLM CONFIGURATION CHANGE >>> May 6 12:35:25 node2 openais[17262]: [CLM ] New Configuration: >>> May 6 12:35:25 node2 openais[17262]: [CLM ] r(0) ip(10.10.8.70) >>> May 6 12:35:25 node2 openais[17262]: [CLM ] Members Left: >>> May 6 12:35:25 node2 openais[17262]: [CLM ] Members Joined: >>> May 6 12:35:25 node2 openais[17262]: [SYNC ] This node is within the >>> primary component and will provide service. >>> May 6 12:35:25 node2 openais[17262]: [TOTEM] entering OPERATIONAL state. >>> May 6 12:35:25 node2 kernel: dlm: closing connection to node 2 >>> May 6 12:35:25 node2 openais[17262]: [CLM ] got nodejoin message >>> 10.10.8.70 >>> May 6 12:35:25 node2 openais[17262]: [CPG ] got joinlist message from >>> node 1 >>> >>> >>> if node2 doesn`t wait for run cman start to the detection the >>> operational token's lost, node1 detect node2 like offline forever. >>> Following attempts for cman restarts don`t change this state: >>> node1# cman_tool nodes >>> Node Sts Inc Joined Name >>> 1 M 202616 2009-05-06 12:34:43 node1 >>> 2 X 202628 node2 >>> node2# cman_tool nodes >>> Node Sts Inc Joined Name >>> 1 M 202644 2009-05-06 12:51:04 node1 >>> 2 M 202640 2009-05-06 12:51:04 node2 >>> >>> >>> Is it necessary a delay for cman stop + start to avoid this inconsistent >>> state or really is it a bug? >> >> I suspect it's an instance of this known bug. Check that CentOS has the >> appropriate patch available: >> >> https://bugzilla.redhat.com/show_bug.cgi?id=485026 >> >> Chrissie >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > When restarting cman, I have always had to stop cman and then manually > stop openais before trying to start cman again. If I do not follow > these steps then the node would never rejoin the cluster or might > fence the other node. That indicates some form of configuration error. You should never have to do that. Make sure that openais is not enabled at boot time using chkconfig openais off Also, I really don't recommend stopping and starting cman without a reboot. Yes you might get away with it a few times, but one day it won't work and you'll be emailing here again ;-) Chrissie From rpeterso at redhat.com Wed May 6 13:15:27 2009 From: rpeterso at redhat.com (Bob Peterson) Date: Wed, 6 May 2009 09:15:27 -0400 (EDT) Subject: [Linux-cluster] gsf2 In-Reply-To: <1241606465.8236.39.camel@tyr.ensea.fr> Message-ID: <973733077.62501241615727925.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> ----- "gfs2etis" wrote: | Hello, | | I am a newbie with gfs2, | i have set 4 nodes on a fiber channel SAN | synchronized through gfs2. | | When one machine crash all gfs2 crash, | especially when there is a lot of IO | on the gfs2 devices. | | I have no message on the system when it crash (???) 
| | version : gfs2-utils-2.03.07-2 | | thanks | -- | gfs2etis | ETIS Hi, When one node crashes, gfs2 should prevent any new locks from being taken until the node it properly fenced and the journal recovered. This can appear as a freeze or lockup, and it is done intentionally to preserve data integrity. The other nodes should not crash and if they do, they should leave console messages stating what happened. Regards, Bob Peterson Red Hat GFS From EliasM at dnb.com Wed May 6 13:33:53 2009 From: EliasM at dnb.com (Elias, Michael) Date: Wed, 6 May 2009 09:33:53 -0400 Subject: [Linux-cluster] Heartbeat time outs in rhel4 understanding In-Reply-To: <4A0144F9.5040701@redhat.com> Message-ID: <437A36C1327D794D87D207AC80BDD8FD0B05191F@DNBMSXBH002.dnbint.net> Ok, so let me ask this. I did a tcpdump between nodes. Is the heartbeat the udp pack I see? I also see an xml doc. Like node1 keeps uptime and other cluster info for itself and node2. node2 keeps uptime and cluster onfo for nodes 1 and 3. Node 3 does the same for 2 and 4 and so on. I assume is a node dies then they next closest node starts watching the uptime for that node until the failed node rejoins. Thanks again -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Chrissie Caulfield Sent: Wednesday, May 06, 2009 4:06 AM To: linux clustering Subject: Re: [Linux-cluster] Heartbeat time outs in rhel4 understanding Elias, Michael wrote: > I am trying to understand how these timers interact with each other. > > > > In a RHEL4 cluster the heartbeat defaults are; > > hello_timer:5 > > max_retries:5 > > deadnode_timeout:21 > > > > Meaning a heartbeat message is sent every 5 seconds, if it fails to > receive a response it will start a deadnode counter @ 21 seconds. It > will also try to send 5 more heartbeat requests. What is the interval of > those retries? If none of those requests receive a response. 5 seconds > pass.. there is 15 seconds left on the deadnode timer and we try upto 5 > times to get a response.... This goes on until we hit the 4^th iteration > of the hellotimer it tries again upto 5 times and fails... we then hit the > 21 second on the deadnode time.. fenced takes over and wham reboot. > > > > Is my understanding of this correct???? > No, I'm afraid it isn't :-) max_retries has nothing to do with the heartbeat. It is to do with cluster messages, such as service join requests, clvmd messages or the messages used in the membership protocol. So the heartbeat system is just a 5 second heartbeat and after 21 seconds the node will be evicted from the cluster and (usually) fenced. The same happens for data messages if max_retries is exceeded. The retry period here starts at 1 second and increases each time to avoid filling the ethernet buffers. I hope this helps, Chrissie -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From Harri.Paivaniemi at tieto.com Wed May 6 14:30:45 2009 From: Harri.Paivaniemi at tieto.com (Harri.Paivaniemi at tieto.com) Date: Wed, 6 May 2009 17:30:45 +0300 Subject: [Linux-cluster] Necessary a delay to restart cman? References: <4A016DA1.5090404@fi.upm.es> <4A017C11.9010006@redhat.com> Message-ID: <41E8D4F07FCE154CBEBAA60FFC92F67754D510@apollo.eu.tieto.com> Hi, Just fyi: I had a similar problem in the past and I made a support request to RH support. They said you have to wait totem token- time after stop before starting again, or it's not giong to work... 
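For reference, the token time being referred to is the openais/totem token timeout, which on a cman cluster can be set in cluster.conf as a child of the cluster element. A purely illustrative example (value in milliseconds, not a recommendation):

  <totem token="21000"/>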
I wonder if this is correct... -hjp -----Original Message----- From: linux-cluster-bounces at redhat.com on behalf of Adam Hough Sent: Wed 5/6/2009 15:59 To: linux clustering Subject: Re: [Linux-cluster] Necessary a delay to restart cman? On Wed, May 6, 2009 at 7:01 AM, Chrissie Caulfield wrote: > Miguel Sanchez wrote: >> Hi. I have a CentOS 5.3 cluster with two nodes. If I execute service >> cman restart within a node, or stop + start after few seconds, another >> node doesn?t recognize this membership return and its fellow stay >> forever offline. >> >> For example: >> >> * Before cman restart: >> >> node1# cman_tool status >> Version: 6.1.0 >> Config Version: 6 >> Cluster Name: CSVirtualizacion >> Cluster Id: 42648 >> Cluster Member: Yes >> Cluster Generation: 202600 >> Membership state: Cluster-Member >> Nodes: 2 >> Expected votes: 1 >> Total votes: 2 >> Quorum: 1 >> Active subsystems: 7 >> Flags: 2node Dirty >> Ports Bound: 0 >> Node name: patty >> Node ID: 1 >> Multicast addresses: 224.0.0.133 >> Node addresses: 138.100.8.70 >> >> * After cman stop for node2 (and before a number seconds < token parameter) >> >> node1# cman_tool status >> Version: 6.1.0 >> Config Version: 6 >> Cluster Name: CSVirtualizacion >> Cluster Id: 42648 >> Cluster Member: Yes >> Cluster Generation: 202600 >> Membership state: Cluster-Member >> Nodes: 2 >> Expected votes: 1 >> Total votes: 1 >> Quorum: 1 >> Active subsystems: 7 >> Flags: 2node Dirty >> Ports Bound: 0 >> Node name: patty >> Node ID: 1 >> Multicast addresses: 224.0.0.133 >> Node addresses: 138.100.8.70 >> Wed May ?6 12:29:38 CEST 2009 >> >> * After cman stop for node2 (and after a number seconds > token parameter) >> >> node1# date; cman_tool status >> Version: 6.1.0 >> Config Version: 6 >> Cluster Name: CSVirtualizacion >> Cluster Id: 42648 >> Cluster Member: Yes >> Cluster Generation: 202604 >> Membership state: Cluster-Member >> Nodes: 1 >> Expected votes: 1 >> Total votes: 1 >> Quorum: 1 >> Active subsystems: 7 >> Flags: 2node Dirty >> Ports Bound: 0 >> Node name: patty >> Node ID: 1 >> Multicast addresses: 224.0.0.133 >> Node addresses: 138.100.8.70 >> Wed May ?6 12:29:47 CEST 2009 >> >> /var/log/messages: >> May ?6 12:35:20 node2 openais[17262]: [TOTEM] The token was lost in the >> OPERATIONAL state. >> May ?6 12:35:20 node2 openais[17262]: [TOTEM] Receive multicast socket >> recv buffer size (288000 bytes). >> May ?6 12:35:20 node2 openais[17262]: [TOTEM] Transmit multicast socket >> send buffer size (262142 bytes). >> May ?6 12:35:20 node2 openais[17262]: [TOTEM] entering GATHER state from 2. >> May ?6 12:35:25 node2 openais[17262]: [TOTEM] entering GATHER state from 0. >> May ?6 12:35:25 node2 openais[17262]: [TOTEM] Creating commit token >> because I am the rep. >> May ?6 12:35:25 node2 openais[17262]: [TOTEM] Saving state aru 26 high >> seq received 26 >> May ?6 12:35:25 node2 openais[17262]: [TOTEM] Storing new sequence id >> for ring 31780 >> May ?6 12:35:25 node2 openais[17262]: [TOTEM] entering COMMIT state. >> May ?6 12:35:25 node2 openais[17262]: [TOTEM] entering RECOVERY state. >> May ?6 12:35:25 node2 openais[17262]: [TOTEM] position [0] member >> 10.10.8.70: >> May ?6 12:35:25 node2 openais[17262]: [TOTEM] previous ring seq 202620 >> rep 10.10.8.70 >> May ?6 12:35:25 node2 openais[17262]: [TOTEM] aru 26 high delivered 26 >> received flag 1 >> May ?6 12:35:25 node2 openais[17262]: [TOTEM] Did not need to originate >> any messages in recovery. 
>> May ?6 12:35:25 node2 openais[17262]: [TOTEM] Sending initial ORF token >> May ?6 12:35:25 node2 openais[17262]: [CLM ?] CLM CONFIGURATION CHANGE >> May ?6 12:35:25 node2 openais[17262]: [CLM ?] New Configuration: >> May ?6 12:35:25 node2 openais[17262]: [CLM ?] ? r(0) ip(10.10.8.70) >> May ?6 12:35:25 node2 openais[17262]: [CLM ?] Members Left: >> May ?6 12:35:25 node2 openais[17262]: [CLM ?] ? r(0) ip(10.10.8.71) >> May ?6 12:35:25 node2 openais[17262]: [CLM ?] Members Joined: >> May ?6 12:35:25 node2 openais[17262]: [CLM ?] CLM CONFIGURATION CHANGE >> May ?6 12:35:25 node2 openais[17262]: [CLM ?] New Configuration: >> May ?6 12:35:25 node2 openais[17262]: [CLM ?] ? r(0) ip(10.10.8.70) >> May ?6 12:35:25 node2 openais[17262]: [CLM ?] Members Left: >> May ?6 12:35:25 node2 openais[17262]: [CLM ?] Members Joined: >> May ?6 12:35:25 node2 openais[17262]: [SYNC ] This node is within the >> primary component and will provide service. >> May ?6 12:35:25 node2 openais[17262]: [TOTEM] entering OPERATIONAL state. >> May ?6 12:35:25 node2 kernel: dlm: closing connection to node 2 >> May ?6 12:35:25 node2 openais[17262]: [CLM ?] got nodejoin message >> 10.10.8.70 >> May ?6 12:35:25 node2 openais[17262]: [CPG ?] got joinlist message from >> node 1 >> >> >> if node2 doesn`t wait for run cman start to the detection the >> operational token's lost, node1 detect node2 like offline forever. >> Following attempts for cman restarts don`t change this state: >> node1# cman_tool nodes >> Node ?Sts ? Inc ? Joined ? ? ? ? ? ? ? Name >> ? 1 ? M ?202616 ? 2009-05-06 12:34:43 ?node1 >> ? 2 ? X ?202628 ? ? ? ? ? ? ? ? ? ? ? ?node2 >> node2# cman_tool nodes >> Node ?Sts ? Inc ? Joined ? ? ? ? ? ? ? Name >> ? 1 ? M ?202644 ? 2009-05-06 12:51:04 ?node1 >> ? 2 ? M ?202640 ? 2009-05-06 12:51:04 ?node2 >> >> >> Is it necessary a delay for cman stop + start to avoid this inconsistent >> state or really is it a bug? > > > I suspect it's an instance of this known bug. Check that CentOS has the > appropriate patch available: > > https://bugzilla.redhat.com/show_bug.cgi?id=485026 > > Chrissie > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > When restarting cman, I have always had to stop cman and then manually stop openais before trying to start cman again. If I do not follow these steps then the node would never rejoin the cluster or might fence the other node. -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: winmail.dat Type: application/ms-tnef Size: 5251 bytes Desc: not available URL: From arwin.tugade at csun.edu Wed May 6 21:18:33 2009 From: arwin.tugade at csun.edu (Arwin L Tugade) Date: Wed, 6 May 2009 14:18:33 -0700 Subject: [Linux-cluster] service failover In-Reply-To: <41E8D4F07FCE154CBEBAA60FFC92F67754D50E@apollo.eu.tieto.com> References: <41E8D4F07FCE154CBEBAA60FFC92F67754D50E@apollo.eu.tieto.com> Message-ID: <6708F96BBF31F846BFA56EC0AE37D62281E94FDE68@CSUN-EX-V01.csun.edu> Yup, or the way I do it, with Swatch (http://sourceforge.net/projects/swatch/). 
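Coming back to the original question (when did a service fail over, with date and time stamps): the simplest answer is to search the syslog on the cluster nodes for rgmanager's service-transition messages, which clurgmgrd logs with a timestamp. A rough example; the exact message wording varies a little between releases, so treat the pattern as a starting point:

  # list rgmanager service transitions together with their timestamps
  grep -E 'clurgmgrd.*([Ss]ervice|[Rr]elocat)' /var/log/messages

The same pattern is what you would feed to a log watcher such as Swatch, or trigger a mail from inside the service script as suggested earlier in the thread.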
-----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Harri.Paivaniemi at tieto.com Sent: Tuesday, May 05, 2009 8:56 PM To: linux-cluster at redhat.com Subject: RE: [Linux-cluster] service failover If you need a quick-n-dirty way, you can always put something in to services starting script so every time cluster says start|stop to that service, you can send mail to yourself etc and you don't have to grep messages-log ;) -hjp -----Original Message----- From: linux-cluster-bounces at redhat.com on behalf of Gary Romo Sent: Tue 5/5/2009 22:31 To: linux-cluster at redhat.com Subject: [Linux-cluster] service failover Hello. How can I tell when a service has failed over? I'm looking for date and time stamps. Thanks. -Gary From billpp at gmail.com Wed May 6 22:44:15 2009 From: billpp at gmail.com (Flavio Junior) Date: Wed, 6 May 2009 19:44:15 -0300 Subject: [Linux-cluster] RHCS 4-node cluster: Networking/Membership issues In-Reply-To: <58aa8d780904300941l6146d881lfe102d8ebd6c8e32@mail.gmail.com> References: <58aa8d780904291321i21a8914fl956db51d61664a51@mail.gmail.com> <1DA7B2AF-E920-4701-A49B-9806478144F7@auckland.ac.nz> <58aa8d780904300941l6146d881lfe102d8ebd6c8e32@mail.gmail.com> Message-ID: <58aa8d780905061544x4bc38908n6613921ac11a756a@mail.gmail.com> Hi again folks... One update here: - I'd removed bonding for cluster heartbeat (bond0) and setup it direct on eth0 for all nodes. This solves the issue for membership. Now I can boot up all 4 nodes, join fence domain, start clvmd on them. Everything is stable and I didn't see random messages about "openais retransmit" anymore. Of course, I still have a problem :). I've 1 GFS filesystem and 16 GFS2 filesystems. I can mount all filesystems on node1 and node2 (same build/switch), but when I try to run "service gfs2 start" on node3 or node4 (another build/switch) the things becomes unstable and whole cluster fail with infinity messages about "cpg_mcast_retry RETRY_NUMBER". Log can be found here: http://pastebin.com/m2f26ab1d What apparently happened is that without bonding setup the network layer becomes more "simple" and could handle with membership but still cant handle with GFS/GFS2 heartbeat. I've set nodes to talk IGMPv2, as said at: http://archives.free.net.ph/message/20081001.223026.9cf6d7bf.de.html Well.. any hints? Thanks again. -- Fl?vio do Carmo J?nior aka waKKu On Thu, Apr 30, 2009 at 1:41 PM, Flavio Junior wrote: > Hi Abraham, thanks for your answer. > > I'd configured your suggestion to cluster.conf but still gets the same > problem. > > Here is what I did: > * Disable cman init script on boot for all nodes > * Edit config file and copy it for all nodes > * reboot all > * start cman on node1 (OK) > * start cman on node2 (OK) > * start cman on node3 (problems to become member, fence node2) > > Here is the log file with this process 'til the fence: > http://pastebin.com/f477e7114 > > PS: node1 and node2 as on the same switch at site1. node3 and node4 as > on the same switch at site2. > > Thanks again, any other suggestions ? > > I dont know if it would help but, is corosync a feasible option for > production use? 
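For readers following this thread: the cluster.conf suggestion being tested above (and described in the quoted reply further below) concerns the fence daemon settings. Purely as an illustration of where those attributes live, with example values rather than recommendations:

  <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="30"/>

clean_start, post_fail_delay and post_join_delay are attributes of the fence_daemon element in cluster.conf; the two delay values are in seconds.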
> > -- > > Fl?vio do Carmo J?nior aka waKKu > > On Wed, Apr 29, 2009 at 10:19 PM, Abraham Alawi > wrote: > > If not tried already, the following settings in cluster.conf might help > > especially "clean_start" > > > > > > clean_start --> assume the cluster is in healthy state upon startup > > post_fail_delay --> seconds to wait before fencing a node that thinks it > > should be fenced (i.e. lost connection with) > > post_join_delay --> seconds to wait before fencing any node that should > be > > fenced upon startup (right after joining) > > > > On 30/04/2009, at 8:21 AM, Flavio Junior wrote: > > > >> Hi folks, > >> > >> I've been trying to set up a 4-node RHCS+GFS cluster for awhile. I've > >> another 2-node cluster using CentOS 5.3 without problem. > >> > >> Well.. My scenario is as follow: > >> > >> * System configuration and info: http://pastebin.com/f41d63624 > >> > >> * Network: > >> > http://www.uploadimagens.com/upload/2ac9074fbb10c2479c59abe419880dc8.jpg > >> * Switches on loop are 3Com 2924 (or 2948)-SFP > >> * Have STP enabled (RSTP auto) > >> * IGMP Snooping Disabled as: > >> > >> > http://magazine.redhat.com/2007/08/23/automated-failover-and-recovery-of-virtualized-guests-in-advanced-platform/ > >> comment 32 > >> * Yellow lines are a fiber link 990ft (330mts) single-mode > >> * I'm using a dedicated tagged VLAN for cluster-heartbeat > >> * I'm using 2 NIC's with bonding mode=1 (active/backup) for > >> heartbeat and 4 NIC's to "public" > >> * Every node has your public four cables plugged on same switch and > >> Link-Aggregation on it > >> * Looking to the picture, that 2 switches with below fiber link is > >> where the nodes are plugged. 2 nodes each build. > >> > >> SAN: http://img139.imageshack.us/img139/642/clusters.jpg > >> * Switches: Brocade TotalStorage 16SAN-B > >> * Storages: IBM DS4700 72A (using ERM for sync replication (storage > >> level)) > >> > >> My problem is: > >> > >> I can't get the 4 nodes up. Every time the fourth (sometimes even the > >> third) node becomes online i got one or two of them fenced. I keep > >> getting messages about openais/cman, cpg_mcast_joined very often: > >> --- snipped --- > >> Apr 29 16:08:23 athos groupd[5393]: cpg_mcast_joined retry 1098900 > >> Apr 29 16:08:23 athos groupd[5393]: cpg_mcast_joined retry 1099000 > >> --- snipped --- > >> > >> Is really seldom the times I can get a node to boot up and join on > >> fence domain, almost every time it hangs and i need to reboot and try > >> again or either reboot, enter single mode, disable cman, reboot, keep > >> trying to service cman start/stop. Sometimes another nodes can see the > >> node in domain but boot keeps hangs on "Starting fenced..." 
> >> > >> ######## > >> [root at athos ~]# cman_tool services > >> type level name id state > >> fence 0 default 00010001 none > >> [1 3 4] > >> dlm 1 clvmd 00020001 none > >> [1 3 4] > >> [root at athos ~]# cman_tool nodes -f > >> Node Sts Inc Joined Name > >> 0 M 0 2009-04-29 15:16:47 > >> /dev/disk/by-id/scsi-3600a0b800048834e000014fb49dcc47b > >> 1 M 7556 2009-04-29 15:16:35 athos-priv > >> Last fenced: 2009-04-29 15:13:49 by athos-ipmi > >> 2 X 7820 porthos-priv > >> Last fenced: 2009-04-29 15:31:01 by porthos-ipmi > >> Node has not been fenced since it went down > >> 3 M 7696 2009-04-29 15:27:15 aramis-priv > >> Last fenced: 2009-04-29 15:24:17 by aramis-ipmi > >> 4 M 8232 2009-04-29 16:12:34 dartagnan-priv > >> Last fenced: 2009-04-29 16:09:53 by dartagnan-ipmi > >> [root at athos ~]# ssh root at aramis-priv > >> ssh: connect to host aramis-priv port 22: Connection refused > >> [root at athos ~]# ssh root at dartagnan-priv > >> ssh: connect to host dartagnan-priv port 22: Connection refused > >> [root at athos ~]# > >> ######### > >> > >> (I know how unreliable is ssh, but I'm seeing the console screen > >> hanged.. Just trying to show it) > >> > >> > >> The BIG log file: http://pastebin.com/f453c220 > >> Every entry on this log after 16:54h is when node2 (porthos-priv > >> 172.16.1.2) was booting and hanged on "Starting fenced..." > >> > >> > >> I've no more ideias to try solve this problem, any hints is > >> appreciated. If you need any other info, just tell me how to get it > >> and I'll post just after I read. > >> > >> > >> Very thanks, in advance. > >> > >> -- > >> > >> Fl?vio do Carmo J?nior aka waKKu > >> > >> -- > >> Linux-cluster mailing list > >> Linux-cluster at redhat.com > >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > > '''''''''''''''''''''''''''''''''''''''''''''''''''''' > > Abraham Alawi > > > > Unix/Linux Systems Administrator > > Science IT > > University of Auckland > > e: a.alawi at auckland.ac.nz > > p: +64-9-373 7599, ext#: 87572 > > > > '''''''''''''''''''''''''''''''''''''''''''''''''''''' > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From CISPLengineer.hz at ril.com Thu May 7 06:13:25 2009 From: CISPLengineer.hz at ril.com (Viral .D. Ahire) Date: Thu, 07 May 2009 11:43:25 +0530 Subject: [Linux-cluster] Reg: node getting fenced during stop, restart or relocate cluster service in RHEL-5 Message-ID: <4A027C05.4010605@ril.com> Hi, I have configured two node cluster on redhat-5. now the problem is when i relocate,restart or stop, running cluster service between nodes (2 nos) ,the node get fenced and restart server . Other side, the server who obtain cluster service leave the cluster and it's cluster service (cman) stop automatically .so it is also fenced by other server. I observed that , this problem occurred while stopping cluster service (oracle). Please help me to resolve this problem. log messages and cluster.conf file are as given as below. ------------------------- /etc/cluster/cluster.conf -------------------------