From griswold at cs.wisc.edu Mon Jun 1 16:03:40 2009 From: griswold at cs.wisc.edu (Nathaniel Griswold) Date: Mon, 1 Jun 2009 11:03:40 -0500 Subject: [Linux-cluster] gfs2: st_size is 0 for symbolic links Message-ID: <1c1b4d3a0906010903m30d2233bv9fab99ecef2ac83@mail.gmail.com> Hi, I had an application fail on gfs2 today because of incorrect stat st_size on a symlink. The application was trying to utilize the fact that st_size on a symlink should be the character length of the destination path. [root at somehost somepath]# touch somefile [root at somehost somepath]# ln -s somefile somelink [root at somehost somepath]# stat somelink |grep Size Size: 0 Blocks: 8 IO Block: 4096 symbolic link [root at somehost somepath]# gfs2_tool getargs /somepath noatime 0 data 2 suiddir 0 quota 0 posix_acl 1 num_glockd 1 upgrade 0 debug 0 localflocks 0 localcaching 0 ignore_local_fs 0 spectator 0 hostdata jid=0:id=196612:first=1 locktable lockproto [root at somehost somepath]# uname -r 2.6.18-128.1.10.el5 [root at somehost somepath]# cat /etc/redhat-release Red Hat Enterprise Linux Server release 5.3 (Tikanga) If i go to some other host or remount the filesystem, then st_size is correct: [root at someotherhost somepath]# stat somelink |grep Size Size: 8 Blocks: 8 IO Block: 4096 symbolic link Searched archives and didn't see anything. Is this a bug? -nate From swhiteho at redhat.com Mon Jun 1 16:07:32 2009 From: swhiteho at redhat.com (Steven Whitehouse) Date: Mon, 01 Jun 2009 17:07:32 +0100 Subject: [Linux-cluster] gfs2: st_size is 0 for symbolic links In-Reply-To: <1c1b4d3a0906010903m30d2233bv9fab99ecef2ac83@mail.gmail.com> References: <1c1b4d3a0906010903m30d2233bv9fab99ecef2ac83@mail.gmail.com> Message-ID: <1243872452.29604.546.camel@localhost.localdomain> Hi, That was bz #492911 and its now fixed both upstream and in RHEL, Steve. On Mon, 2009-06-01 at 11:03 -0500, Nathaniel Griswold wrote: > Hi, > > I had an application fail on gfs2 today because of incorrect stat > st_size on a symlink. The application was trying to utilize the fact > that st_size on a symlink should be the character length of the > destination path. > > [root at somehost somepath]# touch somefile > [root at somehost somepath]# ln -s somefile somelink > [root at somehost somepath]# stat somelink |grep Size > Size: 0 Blocks: 8 IO Block: 4096 symbolic link > [root at somehost somepath]# gfs2_tool getargs /somepath > noatime 0 > data 2 > suiddir 0 > quota 0 > posix_acl 1 > num_glockd 1 > upgrade 0 > debug 0 > localflocks 0 > localcaching 0 > ignore_local_fs 0 > spectator 0 > hostdata jid=0:id=196612:first=1 > locktable > lockproto > > [root at somehost somepath]# uname -r > 2.6.18-128.1.10.el5 > > [root at somehost somepath]# cat /etc/redhat-release > Red Hat Enterprise Linux Server release 5.3 (Tikanga) > > > If i go to some other host or remount the filesystem, then st_size is correct: > > [root at someotherhost somepath]# stat somelink |grep Size > Size: 8 Blocks: 8 IO Block: 4096 symbolic link > > Searched archives and didn't see anything. Is this a bug? 
>
> -nate
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

From rmicmirregs at gmail.com Mon Jun 1 19:17:30 2009
From: rmicmirregs at gmail.com (Rafael Micó Miranda)
Date: Mon, 01 Jun 2009 21:17:30 +0200
Subject: [Linux-Cluster] Submitting two new resource plugins to the project
Message-ID: <1243883850.6761.2.camel@mecatol>

Hi,

I have developed a couple of resources for Linux-Cluster (CMAN + rgmanager) which try to address some needs I see in Linux-Cluster when compared with another cluster solution (namely Linux-HA, a.k.a. Heartbeat). I am a Linux-HA user and I think these two functionalities could be useful in Linux-Cluster.

I would like to contribute both resources to the community so they can become part of the project and perhaps, after testing and quality assurance, be included in the Red Hat Enterprise Linux packages of Linux-Cluster, so that Red Hat can support them and add them to the system-config-cluster tool, giving a GUI that can configure these resources and handle their information.

I'll give you some details of both resources:

1.- ping-group: tries to bring to Linux-Cluster the Ping Group functionality of Linux-HA. For those who don't know Ping Group, the idea is the following: it is a NODE functionality (not a service or a resource) that checks IP communication with a list of given client nodes. When the check fails, Ping Group moves all services running on the affected node to other nodes that have proved their communication is still working, so the service keeps being provided to the clients even when a network problem affects only one node of your cluster and the cluster itself would otherwise not notice it.

I have developed ping-group as a resource to be used inside a service of your cluster, so in the resource arguments you can specify the list of clients that the service should watch.

There is one thing that could be improved: ping-group will mark the service as failed even if the other nodes of the cluster would fail too due to lack of communication with the clients (for example, all clients are powered off). In this situation the service will keep migrating from one node to another according to your service failover policy and will finally be stopped. Ideas to improve this behaviour would be welcome.

2.- lvm-cluster: tries to bring to Linux-Cluster an exclusive shared storage option, using features of LVM2. I got accustomed to this kind of volume when working with the Linux-HA + EVMS solution (using the Cluster Segment Manager plug-in).

When defining a new LVM2 volume for your cluster, you can set it as cluster-disabled (the volume will behave as a local volume even if it is on shared storage) or as cluster-enabled (the LVM volume can be activated on many different cluster nodes at the same time).

Of course, if the filesystem placed on the LVM volume is not a clustered filesystem (GFS2), a cluster-enabled volume allows a careless administrator to mount a non-clustered filesystem (EXT3) on more than one node of the cluster, which may produce filesystem corruption. This is because the LVM "open flag" of the filesystem is not propagated to all the members of the cluster, so there is no knowledge of the state of the filesystem and these situations can happen.

This can be fixed with some of the options of LVM, specifically the "enable exclusively" flag.
This flag, when used over a cluster-enabled volume, allows the VolumeGroup to be imported by all the nodes of the cluster, but the LogicalVolumes inside the VolumeGroup can only be activated by a single node. So, only one node of your cluster will have the LogicalVolume device (for example /dev/VolGrp01/LogVol01) and the problem explained above cannot happen. This is not about propagating the "open flag" through the nodes, this is about making the LogicalVolume available on only one node.

I have developed lvm-cluster as a resource to be used inside a service of your cluster. In the arguments you can specify the name of the VolumeGroup and the LogicalVolume to handle.

So, I would like to receive instructions on how to submit these two resources to the project, so that we can improve them, test them and find any bugs that may still be in the code. I have done some testing, but of course they need much more before they can be put into the main project.

Sincerely yours,

Rafael Micó Miranda

--
Rafael Micó Miranda

From fdinitto at redhat.com Tue Jun 2 05:04:23 2009
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Tue, 02 Jun 2009 07:04:23 +0200
Subject: [Linux-Cluster] Submitting two new resource plugins to the project
In-Reply-To: <1243883850.6761.2.camel@mecatol>
References: <1243883850.6761.2.camel@mecatol>
Message-ID: <1243919063.24866.14.camel@cerberus.int.fabbione.net>

Hi Rafael,

On Mon, 2009-06-01 at 21:17 +0200, Rafael Micó Miranda wrote:
> Hi,
>
> I have developed a couple of resources for Linux-Cluster (CMAN
> +rgmanager) which try to fix some needs I see in Linux-Cluster when
> compared with other cluster solution (concretely, Linux-HA a.k.a.
> Heartbeat). I am a Linux-HA user and I think this two functionalities
> could be useful in Linux-Cluster.
>
> I would like to give them (both resources) to the community to make them
> be into the project, and maybe after testing/quality testing or so be
> included into the RedHat Enterprise Linux packages of Linux-Cluster, so
> RedHat will give support for them and include them into the
> system-config-cluster tool to have a GUI that can configure this
> resources and handle their information.

[SNIP]

> So, I would like to receive the instructions to submit this two
> resources to the project to improve them, test them and find any bugs
> that could still be in the code. I have made some testing but of course
> they need much more to allow them be put into the main project.

The best way to submit is to post the code to the cluster-devel at redhat.com mailing list. We don't have a very formal procedure in place. What we need to know is what it is, what version of the software it has been tested on, and what distribution. The right guys will take care of doing the correct steps (asking for more details, review, commit, etc.). Of course a patch against a git tree is best, but it's not a requirement at all (i.e. don't spend time learning git if you don't need/want to).

Cheers
Fabio

From xavier.montagutelli at unilim.fr Tue Jun 2 09:08:14 2009
From: xavier.montagutelli at unilim.fr (Xavier Montagutelli)
Date: Tue, 2 Jun 2009 11:08:14 +0200
Subject: [Linux-Cluster] Submitting two new resource plugins to the project
In-Reply-To: <1243883850.6761.2.camel@mecatol>
References: <1243883850.6761.2.camel@mecatol>
Message-ID: <200906021108.14625.xavier.montagutelli@unilim.fr>

On Monday 01 June 2009 21:17:30 Rafael Micó
Miranda wrote: > Hi, > > I have developed a couple of resources for Linux-Cluster (CMAN > +rgmanager) which try to fix some needs I see in Linux-Cluster when > compared with other cluster solution (concretely, Linux-HA a.k.a. > Heartbeat). I am a Linux-HA user and I think this two functionalities > could be useful in Linux-Cluster. > > I would like to give them (both resources) to the community to make them > be into the project, and maybe after testing/quality testing or so be > included into the RedHat Enterprise Linux packages of Linux-Cluster, so > RedHat will give support for them and include them into the > system-config-cluster tool to have a GUI that can configure this > resources and handle their information. > > I'll give you some details of both resources: [...] > > 2.- lvm-cluster: tries to bring to Linux-Cluster an exclusive shared > storage option, using features of LVM2. I got accustomed to this kind of > volumes when working with Linux-Ha + EVMS solution (using Cluster > Segment Manager plug-in). > > When defining a new LVM2 volume four your cluster, you can set it as > cluster-disabled (the volume will behave as a local volume even if it is > on shared storage) or as cluster-enabled (the LVM volume can be > activated on many different cluster nodes at the same time). > > Of course, if the filesystem placed into the LVM volume is not a > clustered filesystem (GFS2) a cluster-enabled volume allows a bad > administrator mount a no-clustered filesystem (EXT3) in more than one > node of the cluster which may produce filesystem corruption. This is > because the LVM "open flag" of the filesystem is not propagated through > all the members of the cluster, so there is no knowledge of the state of > the filesystem and this situations can happen. > > This can be fixed with some of the options of LVM, specifically the > "enable exclusively flag". This flag, when used over a cluster-enabled > volume, will allow the VolumeGroup to be imported by all the nodes of > the cluster but the LogicalVolumes into the VolumeGroup can only be > activated by a single node. So, only one node of your cluster will have > the LogicalVolume device (for example /dev/VolGrp01/LogVol01) and the > problem explained above cannot happen. This is not about propagating the > "open flag" through the nodes, this is about making the LogicalVolume be > in only one node. > > I have developed lvm-cluster as a resource to be used into a service of > your cluster. In the arguments you an specify the name of the > VolumeGroup and the LogicalVolume to handle. [...] This looks very useful. We are using a shared storage with CLVM, and a non- clustered FS. I always fear mounting the same FS on different nodes. I have always hoped this feature could exist the LVM layer. It would be great to see this incorporated in CLVM. -- Xavier Montagutelli Tel : +33 (0)5 55 45 77 20 Service Commun Informatique Fax : +33 (0)5 55 45 75 95 Universite de Limoges 123, avenue Albert Thomas 87060 Limoges cedex From brettcave at gmail.com Tue Jun 2 11:53:52 2009 From: brettcave at gmail.com (Brett Cave) Date: Tue, 2 Jun 2009 13:53:52 +0200 Subject: [Linux-cluster] IPVS on 2 servers running the HA services? Message-ID: hi, I am running ipvs on a single node with ipvs configured to load balance to 2 backend servers (mysql). I remember having issues load balancing to the server that HA is running on, due to the IP address being local. ipvs was configured using ldirector, with the real servers using the "gate" redirect method. 
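For reference, that kind of single-director setup maps to an ldirectord.cf roughly like the following sketch (the addresses, port and check credentials are made-up illustration values, not taken from the real configuration):

checktimeout=10
checkinterval=5
quiescent=no

# one virtual service, two MySQL real servers reached via direct routing ("gate")
virtual=192.168.0.10:3306
        real=192.168.0.11:3306 gate
        real=192.168.0.12:3306 gate
        service=mysql
        checktype=negotiate
        login="monitor"
        passwd="secret"
        database="test"
        request="SELECT 1"
        scheduler=wlc
        protocol=tcp

With "gate", replies go straight from the real servers back to the client, which is also why a director that is itself one of the real servers needs special handling (the LVS "local node" case).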
Is it possible to run heartbeat + ipvs + apache on 2 nodes though? Perhaps using masq method? e.g. server1: ip 192.168.0.1 + heartbeat + primary HA ip 192.168.0.10 + apache server2: ip 192.168.0.2 + heartbeat + secondary for HA IP + apache ipvs / ldirector to then direct incoming http requests on 192.168.0.10 to .1 and .2 using masq - would that load balance requests between the servers, or would all requests come in to primary and be served by primary? Regards, Brett From tsengjs at gmail.com Tue Jun 2 13:30:57 2009 From: tsengjs at gmail.com (Jin-Shan Tseng) Date: Tue, 2 Jun 2009 21:30:57 +0800 Subject: [Linux-cluster] compile gnbd-kernel error Message-ID: <2495e3790906020630g34d1b32ckeff5db4b8167169@mail.gmail.com> Hi folks, I tried to compile gnbd-kernel on Gentoo Linux 2.6.29-gentoo-r5 but I got some error messages. :( the error messages are appear on cluster-2.03.09, cluster-2.03.10, cluster-2.03.11 # uname -a Linux node26 2.6.29-gentoo-r5 #3 SMP Mon Jun 1 19:05:23 CST 2009 i686 Intel(R) Xeon(TM) CPU 3.06GHz GenuineIntel GNU/Linux cluster-2.03.11 # make gnbd-kernel [ -n "" ] || make -C gnbd-kernel/src all make[1]: Entering directory `/usr/portage/distfiles/cluster-2.03.11/gnbd-kernel/src' make -C /lib/modules/2.6.29-gentoo-r5/build M=/usr/portage/distfiles/cluster-2.03.11/gnbd-kernel/src symverfile=/lib/modules/2.6.29-gentoo-r5/build/Module.symvers modules USING_KBUILD=yes make[2]: Entering directory `/usr/src/linux-2.6.29-gentoo-r5' CC [M] /usr/portage/distfiles/cluster-2.03.11/gnbd-kernel/src/gnbd.o /usr/portage/distfiles/cluster-2.03.11/gnbd-kernel/src/gnbd.c:933: warning: initialization from incompatible pointer type /usr/portage/distfiles/cluster-2.03.11/gnbd-kernel/src/gnbd.c:934: warning: initialization from incompatible pointer type /usr/portage/distfiles/cluster-2.03.11/gnbd-kernel/src/gnbd.c: In function 'gnbd_init': /usr/portage/distfiles/cluster-2.03.11/gnbd-kernel/src/gnbd.c:1054: error: 'struct gendisk' has no member named 'dev' make[3]: *** [/usr/portage/distfiles/cluster-2.03.11/gnbd-kernel/src/gnbd.o] Error 1 make[2]: *** [_module_/usr/portage/distfiles/cluster-2.03.11/gnbd-kernel/src] Error 2 make[2]: Leaving directory `/usr/src/linux-2.6.29-gentoo-r5' make[1]: *** [gnbd.ko] Error 2 make[1]: Leaving directory `/usr/portage/distfiles/cluster-2.03.11/gnbd-kernel/src' make: *** [gnbd-kernel/src] Error 2 Does anyone have the same problems? Any suggestions are appreciate. Thanks in advanced, Jin-Shan -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbrassow at redhat.com Tue Jun 2 17:28:13 2009 From: jbrassow at redhat.com (Jonathan Brassow) Date: Tue, 2 Jun 2009 12:28:13 -0500 Subject: [Linux-Cluster] Submitting two new resource plugins to the project In-Reply-To: <1243883850.6761.2.camel@mecatol> References: <1243883850.6761.2.camel@mecatol> Message-ID: On Jun 1, 2009, at 2:17 PM, Rafael Mic? Miranda wrote: > This can be fixed with some of the options of LVM, specifically the > "enable exclusively flag". This flag, when used over a cluster-enabled > volume, will allow the VolumeGroup to be imported by all the nodes of > the cluster but the LogicalVolumes into the VolumeGroup can only be > activated by a single node. So, only one node of your cluster will > have > the LogicalVolume device (for example /dev/VolGrp01/LogVol01) and the > problem explained above cannot happen. This is not about propagating > the > "open flag" through the nodes, this is about making the > LogicalVolume be > in only one node. 
This is different from the current approach. We would likely take this if it is cleaner, better, or more advantageous than the current solution. Current solution is described here: http://kbase.redhat.com/faq/docs/DOC-3068 brassow From dougbunger at yahoo.com Tue Jun 2 18:12:31 2009 From: dougbunger at yahoo.com (Doug Bunger) Date: Tue, 2 Jun 2009 11:12:31 -0700 (PDT) Subject: [Linux-cluster] F8/F10 fence_xvm Key Errors Message-ID: <740081.74040.qm@web110216.mail.gq1.yahoo.com> I have a VM running Fedora 8 that I want to connect to a cluster that is all Fedora 10 VMs, running on F10 platforms.? The F8 fails a fence test, reporting: # fence_xvm -H cicero3 -ddd -o null Debugging threshold is now 3 -- args @ 0x7fffb9fd4870 -- ? args->addr = 225.0.0.12 ? args->domain = cicero3 ? args->key_file = /etc/cluster/fence_xvm.key ? args->op = 0 ? args->hash = 2 ? args->auth = 2 ? args->port = 1229 ? args->family = 2 ? args->timeout = 30 ? args->retr_time = 20 ? args->flags = 0 ? args->debug = 3 -- end args -- Reading in key file /etc/cluster/fence_xvm.key into 0x7fffb9fd3820 (4096 len)Actual key length = 4096 bytesSending to 225.0.0.12 via 127.0.0.1 Sending to 225.0.0.12 via 192.168.69.63 Waiting for connection from XVM host daemon. The physical host is reporting: ? [fence_xvmd.c:0691] Key mismatch; dropping packet It seems odd that it doesn't work since the key was gen'd from /dev/random.? Nothing OS or machine specific about the key.? Something different with the transport?? Any suggestions, before I blindly upgrade from F8 to F10? -- Doug Bunger -- dougbunger at yahoo.com -- -------------- next part -------------- An HTML attachment was scrubbed... URL: From jschulz at soapstonenetworks.com Tue Jun 2 23:36:23 2009 From: jschulz at soapstonenetworks.com (Jon Schulz) Date: Tue, 2 Jun 2009 19:36:23 -0400 Subject: [Linux-cluster] Networking guidelines for RHCS across datacenters Message-ID: I'm in the process of doing a concept review with the redhat cluster suite. I've been given a requirement that cluster nodes are able to be located in geographically separated data centers. I realize that this is not an ideal scenario due to latency issues. Does anyone have any papers or articles you could point me to that outline cluster network requirements and best practices? -------------- next part -------------- An HTML attachment was scrubbed... URL: From fajar at fajar.net Wed Jun 3 02:06:08 2009 From: fajar at fajar.net (Fajar A. Nugraha) Date: Wed, 3 Jun 2009 09:06:08 +0700 Subject: [Linux-cluster] Networking guidelines for RHCS across datacenters In-Reply-To: References: Message-ID: <7207d96f0906021906mb731687j6964f4a3466b626e@mail.gmail.com> On Wed, Jun 3, 2009 at 6:36 AM, Jon Schulz wrote: > I'm in the process of doing a concept review with the redhat cluster suite. > I've been given a requirement that cluster nodes are able to be located in > geographically separated data centers. I realize that this is not an ideal > scenario due to latency issues. For most purposes, RHCS would require that all nodes have access to the same storage/disk. That pretty much ruled out the DR feature that one might expect to get from having nodes in geographically separated data centers. I'd suggest you refine your requirements. Perhaps what you need is something like MySQL cluster replication, where there are two geographically separated data centers, each having its own cluster, and the two clusters replicate each other's data asynchronously. 
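As a very rough illustration of that kind of asynchronous, two-site replication (all hostnames, credentials and log coordinates below are placeholders, not a tested configuration):

# my.cnf fragment on the site-A master (placeholder values)
[mysqld]
server-id = 1
log-bin   = mysql-bin
auto_increment_increment = 2
auto_increment_offset    = 1

# my.cnf fragment on the site-B master
[mysqld]
server-id = 2
log-bin   = mysql-bin
auto_increment_increment = 2
auto_increment_offset    = 2

-- then on site B, point replication at site A (and the mirror image on site A):
CHANGE MASTER TO MASTER_HOST='db.site-a.example.com', MASTER_USER='repl',
    MASTER_PASSWORD='secret', MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=4;
START SLAVE;

Each site keeps its own local RHCS cluster for availability; only the binlog stream crosses the WAN.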
-- Fajar From m.nietz-redhat at iplabs.de Wed Jun 3 14:22:30 2009 From: m.nietz-redhat at iplabs.de (Marco Nietz) Date: Wed, 03 Jun 2009 16:22:30 +0200 Subject: [Linux-cluster] Problem with Fenced Message-ID: <4A268726.4060901@iplabs.de> Hi, i have a Problem with (propably) the Communication between fenced and ccsd. After a node-failure, fenced should connect ccsd and then try to fence the failing node. this does not happen on one of our systems. Here's an strace from the fence-daemon. socket(PF_FILE, SOCK_STREAM, 0) = 9 connect(9, {sa_family=AF_FILE, path=@"groupd_socket"}, 16) = 0 write(9, "get_group -1 groupd\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2200) = 2200 read(9, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1128) = 1128 close(9) = 0 write(7, "start_done default 3\0\0\0\0\0\0\0\0\0\0\0\0"..., 2200) = 2200 poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}], 4, -1) = 1 ([{fd=7, revents=POLLIN}]) read(7, "finish default 3\0\0\0\0\0\0\0\0\350\37Y\21\377\177\0\0"..., 2200) = 2200 poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}], 4, -1 At the Connect-Line i expect the Path to the ccsd-socket (/var/run/cluster/ccsd.sock). How can i tell fenced where to find the Socket. Best Regards Marco From teigland at redhat.com Wed Jun 3 15:40:44 2009 From: teigland at redhat.com (David Teigland) Date: Wed, 3 Jun 2009 10:40:44 -0500 Subject: [Linux-cluster] Problem with Fenced In-Reply-To: <4A268726.4060901@iplabs.de> References: <4A268726.4060901@iplabs.de> Message-ID: <20090603154044.GA14469@redhat.com> On Wed, Jun 03, 2009 at 04:22:30PM +0200, Marco Nietz wrote: > Hi, > > i have a Problem with (propably) the Communication between fenced and > ccsd. After a node-failure, fenced should connect ccsd and then try to > fence the failing node. this does not happen on one of our systems. > > Here's an strace from the fence-daemon. > > socket(PF_FILE, SOCK_STREAM, 0) = 9 > connect(9, {sa_family=AF_FILE, path=@"groupd_socket"}, 16) = 0 > write(9, "get_group -1 groupd\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2200) = 2200 > read(9, > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., > 1128) = 1128 > close(9) = 0 > write(7, "start_done default 3\0\0\0\0\0\0\0\0\0\0\0\0"..., 2200) = 2200 > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7, > events=POLLIN}, {fd=-1}], 4, -1) = 1 ([{fd=7, revents=POLLIN}]) > read(7, "finish default 3\0\0\0\0\0\0\0\0\350\37Y\21\377\177\0\0"..., > 2200) = 2200 > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7, > events=POLLIN}, {fd=-1}], 4, -1 > > At the Connect-Line i expect the Path to the ccsd-socket > (/var/run/cluster/ccsd.sock). > > How can i tell fenced where to find the Socket. It's not clear from this that fenced/ccsd communication is the problem. After the node failure, please collect from all nodes the output of - cman_tool nodes - group_tool -v - group_tool dump fence - any messages in /var/log/messages Dave From rmicmirregs at gmail.com Wed Jun 3 16:28:40 2009 From: rmicmirregs at gmail.com (Rafael =?ISO-8859-1?Q?Mic=F3?= Miranda) Date: Wed, 03 Jun 2009 18:28:40 +0200 Subject: [Linux-Cluster] Submitting two new resource plugins to the project In-Reply-To: <1243919063.24866.14.camel@cerberus.int.fabbione.net> References: <1243883850.6761.2.camel@mecatol> <1243919063.24866.14.camel@cerberus.int.fabbione.net> Message-ID: <1244046520.6750.6.camel@mecatol> Hi Fabio, El mar, 02-06-2009 a las 07:04 +0200, Fabio M. 
Di Nitto escribi?: > Hi Rafael, > > On Mon, 2009-06-01 at 21:17 +0200, Rafael Mic? Miranda wrote: [...] > > > The best way to submit is to post the code to cluster-devel at redhat.com > mailing list. We don't have a very formal procedure in place. > What we need to know is what it is, on what version of the software has > been tested and what distribution. > The right guys will take care of doing the correct steps (ask more, > review, commit etc). > Of course a patch against a git tree is the best but it's not a > requirement at all (aka don't spend time learning git if you don't > need/want to). > > Cheers > Fabio > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster I have sent the e-mail to that mail list and i have had no answer yet. Its the only occurrence i have found about the "devel list" on the CMAN Project web page, are you sure this address is right? Thanks in advance. -- Rafael Mic? Miranda From admin1-bua.dage-etd at justice.gouv.fr Thu Jun 4 06:54:07 2009 From: admin1-bua.dage-etd at justice.gouv.fr (Jean Diallo) Date: Thu, 04 Jun 2009 08:54:07 +0200 Subject: [Linux-cluster] Clvm Hang after an node is fenced in a 2 nodes cluster Message-ID: <4A276F8F.70101@justice.gouv.fr> Description of problem: In a 2 nodes cluster, after 1 node is fence, any clvm command hang on the ramaining node. when the fenced node cluster come back in the cluster, any clvm command also hang, moreover the node do not activate any clustered vg, and so do not access any shared device. Version-Release number of selected component (if applicable): redhat 5.2 update device-mapper-1.02.28-2.el5.x86_64.rpm lvm2-2.02.40-6.el5.x86_64.rpm lvm2-cluster-2.02.40-7.el5.x86_64.rpm Steps to Reproduce: 1.2 nodes cluster , quorum formed with qdisk 2.cold boot node 2 3.node 2 is evicted and fenced, service are taken over by node 1 4.node ? come back in cluster, quorate, but no clustered vg are up and any lvm related command hang 5.At this step every lvm command hang on node 1 Expected results: node 2 should be able to get back the lock on clustered lvm volume and node 1 should be able to issue any lvm relate command Here are my cluster.conf and lvm.conf part of lvm.conf: # Type 3 uses built-in clustered locking. locking_type = 3 # If using external locking (type 2) and initialisation fails, # with this set to 1 an attempt will be made to use the built-in # clustered locking. # If you are using a customised locking_library you should set this to 0. fallback_to_clustered_locking = 0 # If an attempt to initialise type 2 or type 3 locking failed, perhaps # because cluster components such as clvmd are not running, with this set # to 1 an attempt will be made to use local file-based locking (type 1). # If this succeeds, only commands against local volume groups will proceed. # Volume Groups marked as clustered will be ignored. fallback_to_local_locking = 1 # Local non-LV directory that holds file-based locks while commands are # in progress. A directory like /tmp that may get wiped on reboot is OK. locking_dir = "/var/lock/lvm" # Other entries can go here to allow you to load shared libraries # e.g. if support for LVM1 metadata was compiled as a shared library use # format_libraries = "liblvm2format1.so" # Full pathnames can be given. # Search this directory first for shared libraries. # library_dir = "/lib" # The external locking library to load if locking_type is set to 2. 
# locking_library = "liblvm2clusterlock.so" part of lvm log on second node : vgchange.c:165 Activated logical volumes in volume group "VolGroup00" vgchange.c:172 7 logical volume(s) in volume group "VolGroup00" now active cache/lvmcache.c:1220 Wiping internal VG cache commands/toolcontext.c:188 Logging initialised at Wed Jun 3 15:17:29 2009 commands/toolcontext.c:209 Set umask to 0077 locking/cluster_locking.c:83 connect() failed on local socket: Connexion refus?e locking/locking.c:259 WARNING: Falling back to local file-based locking. locking/locking.c:261 Volume Groups with the clustered attribute will be inaccessible. toollib.c:578 Finding all volume groups toollib.c:491 Finding volume group "VGhomealfrescoS64" metadata/metadata.c:2379 Skipping clustered volume group VGhomealfrescoS64 toollib.c:491 Finding volume group "VGhomealfS64" metadata/metadata.c:2379 Skipping clustered volume group VGhomealfS64 toollib.c:491 Finding volume group "VGvmalfrescoS64" metadata/metadata.c:2379 Skipping clustered volume group VGvmalfrescoS64 toollib.c:491 Finding volume group "VGvmalfrescoI64" metadata/metadata.c:2379 Skipping clustered volume group VGvmalfrescoI64 toollib.c:491 Finding volume group "VGvmalfrescoP64" metadata/metadata.c:2379 Skipping clustered volume group VGvmalfrescoP64 toollib.c:491 Finding volume group "VolGroup00" libdm-report.c:981 VolGroup00 cache/lvmcache.c:1220 Wiping internal VG cache commands/toolcontext.c:188 Logging initialised at Wed Jun 3 15:17:29 2009 commands/toolcontext.c:209 Set umask to 0077 locking/cluster_locking.c:83 connect() failed on local socket: Connexion refus?e locking/locking.c:259 WARNING: Falling back to local file-based locking. locking/locking.c:261 Volume Groups with the clustered attribute will be inaccessible. toollib.c:542 Using volume group(s) on command line toollib.c:491 Finding volume group "VolGroup00" vgchange.c:117 7 logical volume(s) in volume group "VolGroup00" monitored cache/lvmcache.c:1220 Wiping internal VG cache commands/toolcontext.c:188 Logging initialised at Wed Jun 3 15:20:45 2009 commands/toolcontext.c:209 Set umask to 0077 toollib.c:331 Finding all logical volumes commands/toolcontext.c:188 Logging initialised at Wed Jun 3 15:20:50 2009 commands/toolcontext.c:209 Set umask to 0077 toollib.c:578 Finding all volume groups group_tool on node 1 type level name id state fence 0 default 00010001 none [1 2] dlm 1 clvmd 00010002 none [1 2] dlm 1 rgmanager 00020002 none [1] group_tool on node 2 [root at remus ~]# group_tool type level name id state fence 0 default 00010001 none [1 2] dlm 1 clvmd 00010002 none [1 2] Additional info: From jbrassow at redhat.com Thu Jun 4 17:04:06 2009 From: jbrassow at redhat.com (Jonathan Brassow) Date: Thu, 4 Jun 2009 12:04:06 -0500 Subject: [Linux-Cluster] Submitting two new resource plugins to the project In-Reply-To: <1244046520.6750.6.camel@mecatol> References: <1243883850.6761.2.camel@mecatol> <1243919063.24866.14.camel@cerberus.int.fabbione.net> <1244046520.6750.6.camel@mecatol> Message-ID: On Jun 3, 2009, at 11:28 AM, Rafael Mic? Miranda wrote: > Hi Fabio, > > El mar, 02-06-2009 a las 07:04 +0200, Fabio M. Di Nitto escribi?: >> Hi Rafael, >> >> On Mon, 2009-06-01 at 21:17 +0200, Rafael Mic? Miranda wrote: > [...] >> >> >> The best way to submit is to post the code to cluster- >> devel at redhat.com >> mailing list. We don't have a very formal procedure in place. >> What we need to know is what it is, on what version of the software >> has >> been tested and what distribution. 
>> The right guys will take care of doing the correct steps (ask more, >> review, commit etc). >> Of course a patch against a git tree is the best but it's not a >> requirement at all (aka don't spend time learning git if you don't >> need/want to). >> >> Cheers >> Fabio >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > I have sent the e-mail to that mail list and i have had no answer yet. > Its the only occurrence i have found about the "devel list" on the > CMAN > Project web page, are you sure this address is right? I missed that post. Perhaps you could send it directly to me? brassow From m.nietz-redhat at iplabs.de Thu Jun 4 17:54:22 2009 From: m.nietz-redhat at iplabs.de (Marco Nietz) Date: Thu, 04 Jun 2009 19:54:22 +0200 Subject: [Linux-cluster] Monitoring Multipathd Message-ID: <4A280A4E.5010309@iplabs.de> Hi, we use a Two-Node-Cluster each one assembled with a Dual-Port Fibre-Channel HBA with two Paths to a redundant Storage-Array. When one of the Paths fail the Multipath-Daemon activates the Standby-Paths and everything works fine. We want the cluster to initiate a takeover when both Paths are failed. Is there a way to achieve this ? I think of some kind of Monitor from the Cluster to the Multipathd. Regards Marco From rmicmirregs at gmail.com Thu Jun 4 18:48:47 2009 From: rmicmirregs at gmail.com (Rafael =?ISO-8859-1?Q?Mic=F3?= Miranda) Date: Thu, 04 Jun 2009 20:48:47 +0200 Subject: [Linux-Cluster] Submitting two new resource plugins to the project In-Reply-To: References: <1243883850.6761.2.camel@mecatol> <1243919063.24866.14.camel@cerberus.int.fabbione.net> <1244046520.6750.6.camel@mecatol> Message-ID: <1244141327.6771.6.camel@mecatol> Hi Jonathan, El jue, 04-06-2009 a las 12:04 -0500, Jonathan Brassow escribi?: > I missed that post. Perhaps you could send it directly to me? > > brassow > > I have just send them to you. Thanks in advance, -- Rafael Mic? Miranda From brem.belguebli at gmail.com Thu Jun 4 20:11:51 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Thu, 4 Jun 2009 22:11:51 +0200 Subject: [Linux-Cluster] Submitting two new resource plugins to the project In-Reply-To: References: <1243883850.6761.2.camel@mecatol> Message-ID: <29ae894c0906041311w4f4e8c0aw5fdf55b70e4f39b6@mail.gmail.com> 2009/6/2 Jonathan Brassow > > On Jun 1, 2009, at 2:17 PM, Rafael Mic? Miranda wrote: > > This can be fixed with some of the options of LVM, specifically the >> "enable exclusively flag". This flag, when used over a cluster-enabled >> volume, will allow the VolumeGroup to be imported by all the nodes of >> the cluster but the LogicalVolumes into the VolumeGroup can only be >> activated by a single node. So, only one node of your cluster will have >> the LogicalVolume device (for example /dev/VolGrp01/LogVol01) and the >> problem explained above cannot happen. This is not about propagating the >> "open flag" through the nodes, this is about making the LogicalVolume be >> in only one node. >> > > This is different from the current approach. We would likely take this if > it is cleaner, better, or more advantageous than the current solution. > > Current solution is described here: > http://kbase.redhat.com/faq/docs/DOC-3068 > > brassow > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > Hello, Isn't it how it is supposed to work, the exclusive flag standing for that ? 
>From what I saw on other systems, especially on HP-UX from which Linux LVM was much inspired, on a cluster, when activating exclusively (vgchange -ae VGxx ) a VG on a node, the exclusive flag is set on the VG preventing the other nodes from activating the volume as long as the holding node is alive. Brem -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlieb-linux-cluster at budge.apana.org.au Thu Jun 4 20:23:13 2009 From: charlieb-linux-cluster at budge.apana.org.au (Charlie Brady) Date: Thu, 4 Jun 2009 16:23:13 -0400 (EDT) Subject: [Linux-cluster] "dlm_controld[nnnn]: cluster is down, exiting" on node1 when starting node2 Message-ID: I'm trying to understand a node shutdown during transition from 1 node to 2 node with qdisk cluster. The platform is CentOS 5.3, with versions: cman-2.0.98-1.el5 openais-0.80.3-22.el5 Jun 4 10:55:08 sun4150node1 root[8103]: S10make-event-queue=action|Event|cluster-node-added|Action|S10make-event-queue|Start|1244127 308 610636|End|1244127308 614973|Elapsed|0.004337 Jun 4 10:55:08 sun4150node1 root[8103]: Running event handler: /etc/e-smith/events/cluster-node-added/S15iscsi-adjust Jun 4 10:55:08 sun4150node1 root[8103]: S15iscsi-adjust=action|Event|cluster-node-added|Action|S15iscsi-adjust|Start|1244127308 6153 33|End|1244127308 677757|Elapsed|0.062424 Jun 4 10:55:08 sun4150node1 root[8103]: Running event handler: /etc/e-smith/events/cluster-node-added/S20cluster-conf Jun 4 10:55:08 sun4150node1 ccsd[7879]: Update of cluster.conf complete (version 2 -> 3). Jun 4 10:55:08 sun4150node1 root[8103]: Config file updated from version 2 to 3 Jun 4 10:55:08 sun4150node1 root[8103]: Jun 4 10:55:08 sun4150node1 root[8103]: Update complete. Jun 4 10:55:08 sun4150node1 root[8103]: S20cluster-conf=action|Event|cluster-node-added|Action|S20cluster-conf|Start|1244127308 6781 19|End|1244127308 793629|Elapsed|0.11551 Jun 4 10:55:08 sun4150node1 root[8103]: Running event handler: /etc/e-smith/events/cluster-node-added/S31qdiskd-adjust Jun 4 10:55:08 sun4150node1 qdiskd[8128]: Quorum Daemon Initializing Jun 4 10:55:08 sun4150node1 root[8103]: Starting the Quorum Disk Daemon:[ OK ]^M Jun 4 10:55:08 sun4150node1 root[8103]: S31qdiskd-adjust=action|Event|cluster-node-added|Action|S31qdiskd-adjust|Start|1244127308 79 7450|End|1244127308 928144|Elapsed|0.130694 Jun 4 10:55:08 sun4150node1 root[8103]: Running event handler: /etc/e-smith/events/cluster-node-added/S32cman-adjust Jun 4 10:55:09 sun4150node1 root[8103]: Starting cluster: Jun 4 10:55:09 sun4150node1 root[8103]: Loading modules... done Jun 4 10:55:09 sun4150node1 root[8103]: Mounting configfs... done Jun 4 10:55:09 sun4150node1 root[8103]: Starting ccsd... done Jun 4 10:55:09 sun4150node1 root[8103]: Starting cman... done Jun 4 10:55:09 sun4150node1 root[8103]: Starting daemons... done Jun 4 10:55:10 sun4150node1 root[8103]: Starting fencing... done Jun 4 10:55:10 sun4150node1 root[8103]: [ OK ]^M Jun 4 10:55:10 sun4150node1 root[8103]: S32cman-adjust=action|Event|cluster-node-added|Action|S32cman-adjust|Start|1244127308 928465 |End|1244127310 103254|Elapsed|1.174789 Jun 4 10:55:10 sun4150node1 root[8103]: Running event handler: /etc/e-smith/events/cluster-node-added/S40cluster-join Jun 4 10:55:10 sun4150node1 root[8103]: building file list ... 
done Jun 4 10:55:10 sun4150node1 root[8103]: Jun 4 10:55:10 sun4150node1 root[8103]: sent 64 bytes received 20 bytes 168.00 bytes/sec Jun 4 10:55:10 sun4150node1 root[8103]: total size is 3162 speedup is 37.64 Jun 4 10:55:17 sun4150node1 qdiskd[8128]: Initial score 1/1 Jun 4 10:55:17 sun4150node1 qdiskd[8128]: Initialization complete Jun 4 10:55:17 sun4150node1 qdiskd[8128]: Score sufficient for master operation (1/1; required=1); upgrading Jun 4 10:55:29 sun4150node1 qdiskd[8128]: Assuming master role Jun 4 10:55:34 sun4150node1 dlm_controld[7916]: cluster is down, exiting Jun 4 10:55:34 sun4150node1 fenced[7910]: cluster is down, exiting Jun 4 10:55:34 sun4150node1 gfs_controld[7922]: cluster is down, exiting Jun 4 10:55:34 sun4150node1 kernel: dlm: closing connection to node 2 Jun 4 10:55:34 sun4150node1 kernel: dlm: closing connection to node 1 Jun 4 10:55:35 sun4150node1 qdiskd[8128]: cman_dispatch: Host is down Jun 4 10:55:35 sun4150node1 qdiskd[8128]: Halting qdisk operations Jun 4 10:55:51 sun4150node1 kernel: dlm: FS1: remove fr 0 ID 1 Jun 4 10:56:01 sun4150node1 ccsd[7879]: Unable to connect to cluster infrastructure after 30 seconds. Jun 4 10:56:31 sun4150node1 ccsd[7879]: Unable to connect to cluster infrastructure after 60 seconds. Jun 4 10:57:01 sun4150node1 ccsd[7879]: Unable to connect to cluster infrastructure after 90 seconds. Jun 4 10:57:31 sun4150node1 ccsd[7879]: Unable to connect to cluster infrastructure after 120 seconds. The first thing I see awry is "dlm_controld[7916]: cluster is down, exiting". I can see from source code that that could be from either process_member() or cluster_dead(), both of which would be called via callback from loop(). My best guess is that process_member() called cman_dispatch(ch, CMAN_DISPATCH_ALL) and rv was -1 with errno set to EHOSTDOWN. But I don't know why that would be the case, and in particular why here. cman started fine on node2, and node1 joined without incident after reboot. Any hints on how to debug this would be appreciated. Thanks --- Charlie From brem.belguebli at gmail.com Fri Jun 5 09:22:23 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Fri, 5 Jun 2009 11:22:23 +0200 Subject: [Linux-cluster] Networking guidelines for RHCS across datacenters In-Reply-To: <7207d96f0906021906mb731687j6964f4a3466b626e@mail.gmail.com> References: <7207d96f0906021906mb731687j6964f4a3466b626e@mail.gmail.com> Message-ID: <29ae894c0906050222naf68973r7a68b3c52eeeead@mail.gmail.com> Hello, That sounds pretty much to the question I've asked to this mailing-list last May (https://www.redhat.com/archives/linux-cluster/2009-May/msg00093.html). We are in the same setup, already doing "Geo-cluster" with other technos and we are looking at RHCS to provide us the same service level. Latency could be a problem indeed if too high , but in a lot of cases (many companies for which I've worked), datacenters are a few tens of kilometers far, with a latency max close to 1 ms, which is not a problem. Let's consider this kind of setup, 2 datacenters far from each other by 1 ms delay, each hosting a SAN array, each of them connected to 2 SAN fabrics extended between the 2 sites. What reason would prevent us from building Geo-clusters without having to rely on a database replication mechanism, as the setup I would like to implement would also be used to provide NFS services that are disaster recovery proof. Obviously, such setup should rely on LVM mirroring to allow a node hosting a service to be able to write to both local and distant SAN LUN's. 
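To make that concrete, the layout meant here could be sketched roughly as follows (device, VG and LV names are placeholders, and it assumes clvmd plus the cluster mirror support mentioned elsewhere in this thread are available):

# one PV on the local array, one on the remote array
pvcreate /dev/mapper/san_local_lun1 /dev/mapper/san_remote_lun1
vgcreate -cy vg_geo /dev/mapper/san_local_lun1 /dev/mapper/san_remote_lun1
# mirrored LV with one leg on each array (mirror log options omitted here;
# a small third device or --corelog may be needed for the log)
lvcreate -m 1 -L 100G -n lv_data vg_geo
# activate it exclusively on whichever node currently runs the service
vgchange -aey vg_geo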
Brem 2009/6/3, Fajar A. Nugraha : > > On Wed, Jun 3, 2009 at 6:36 AM, Jon Schulz > wrote: > > I'm in the process of doing a concept review with the redhat cluster > suite. > > I've been given a requirement that cluster nodes are able to be located > in > > geographically separated data centers. I realize that this is not an > ideal > > scenario due to latency issues. > > For most purposes, RHCS would require that all nodes have access to > the same storage/disk. That pretty much ruled out the DR feature that > one might expect to get from having nodes in geographically separated > data centers. > > I'd suggest you refine your requirements. Perhaps what you need is > something like MySQL cluster replication, where there are two > geographically separated data centers, each having its own cluster, > and the two clusters replicate each other's data asynchronously. > > -- > Fajar > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fajar at fajar.net Fri Jun 5 09:47:25 2009 From: fajar at fajar.net (Fajar A. Nugraha) Date: Fri, 5 Jun 2009 16:47:25 +0700 Subject: [Linux-cluster] Networking guidelines for RHCS across datacenters In-Reply-To: <29ae894c0906050222naf68973r7a68b3c52eeeead@mail.gmail.com> References: <7207d96f0906021906mb731687j6964f4a3466b626e@mail.gmail.com> <29ae894c0906050222naf68973r7a68b3c52eeeead@mail.gmail.com> Message-ID: <7207d96f0906050247s54c5c468td31fceeabe1a8da3@mail.gmail.com> On Fri, Jun 5, 2009 at 4:22 PM, brem belguebli wrote: > We are in the same setup, already doing "Geo-cluster" with other technos and > we are looking at RHCS to provide us the same service level. Usually the concepts are the same. What solution are you using? How does it work, replication or real cluster? > Let's consider this kind of setup, 2 datacenters far from each other by 1 ms > delay, each hosting?a SAN array, each of them connected to 2 SAN fabrics > extended between the 2 sites. > > What reason would prevent?us from building Geo-clusters without?having to > rely on?a database replication mechanism, as the setup I would like to > implement would also be used to provide NFS services that are disaster > recovery proof. > > Obviously, such setup should rely on LVM mirroring to allow a node hosting a > service to?be able to write to both local and distant SAN LUN's. Does LVM mirroring work with clustered LVM? -- Fajar From jschulz at soapstonenetworks.com Fri Jun 5 14:37:57 2009 From: jschulz at soapstonenetworks.com (Jon Schulz) Date: Fri, 5 Jun 2009 10:37:57 -0400 Subject: [Linux-cluster] Networking guidelines for RHCS across datacenters In-Reply-To: <7207d96f0906050247s54c5c468td31fceeabe1a8da3@mail.gmail.com> References: <7207d96f0906021906mb731687j6964f4a3466b626e@mail.gmail.com> <29ae894c0906050222naf68973r7a68b3c52eeeead@mail.gmail.com> <7207d96f0906050247s54c5c468td31fceeabe1a8da3@mail.gmail.com> Message-ID: Yes I would be interested to see what products you are currently using to achieve this. In my proposed setup we are actually completely database transaction driven. The problem is the people higher up want active database <-> database replication which will be problematic I know. Outside of the data side of the equation, how tolerant is the cluster network/heartbeat to latency assuming no packet loss? 
Or more to the point, at what point does everyone in their past experience see the heartbeat network become unreliable, latency wise. E.g. anything over 30ms? Most of my experiences with rhcs and linux-ha have always been with the cluster network being within the same LAN :( -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Fajar A. Nugraha Sent: Friday, June 05, 2009 5:47 AM To: linux clustering Subject: Re: [Linux-cluster] Networking guidelines for RHCS across datacenters On Fri, Jun 5, 2009 at 4:22 PM, brem belguebli wrote: > We are in the same setup, already doing "Geo-cluster" with other technos and > we are looking at RHCS to provide us the same service level. Usually the concepts are the same. What solution are you using? How does it work, replication or real cluster? > Let's consider this kind of setup, 2 datacenters far from each other by 1 ms > delay, each hosting?a SAN array, each of them connected to 2 SAN fabrics > extended between the 2 sites. > > What reason would prevent?us from building Geo-clusters without?having to > rely on?a database replication mechanism, as the setup I would like to > implement would also be used to provide NFS services that are disaster > recovery proof. > > Obviously, such setup should rely on LVM mirroring to allow a node hosting a > service to?be able to write to both local and distant SAN LUN's. Does LVM mirroring work with clustered LVM? -- Fajar -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From Jeremy.Eder at mindshift.com Fri Jun 5 14:41:36 2009 From: Jeremy.Eder at mindshift.com (Jeremy Eder) Date: Fri, 5 Jun 2009 10:41:36 -0400 Subject: [Linux-cluster] Networking guidelines for RHCS across datacenters In-Reply-To: References: <7207d96f0906021906mb731687j6964f4a3466b626e@mail.gmail.com> <29ae894c0906050222naf68973r7a68b3c52eeeead@mail.gmail.com> <7207d96f0906050247s54c5c468td31fceeabe1a8da3@mail.gmail.com> Message-ID: <1734CA24F5FC1848880E6B1AB788DD7703774AB156@inv-ex1> I have no relation to this company, but I have heard good stories from people who worked with their products: If you're database is oracle, mysql or postgres check out products on www.continuent.com Best Regards, Jeremy Eder, RHCE, VCP -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jon Schulz Sent: Friday, June 05, 2009 10:38 AM To: linux clustering Subject: RE: [Linux-cluster] Networking guidelines for RHCS across datacenters Yes I would be interested to see what products you are currently using to achieve this. In my proposed setup we are actually completely database transaction driven. The problem is the people higher up want active database <-> database replication which will be problematic I know. Outside of the data side of the equation, how tolerant is the cluster network/heartbeat to latency assuming no packet loss? Or more to the point, at what point does everyone in their past experience see the heartbeat network become unreliable, latency wise. E.g. anything over 30ms? Most of my experiences with rhcs and linux-ha have always been with the cluster network being within the same LAN :( -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Fajar A. 
Nugraha Sent: Friday, June 05, 2009 5:47 AM To: linux clustering Subject: Re: [Linux-cluster] Networking guidelines for RHCS across datacenters On Fri, Jun 5, 2009 at 4:22 PM, brem belguebli wrote: > We are in the same setup, already doing "Geo-cluster" with other technos and > we are looking at RHCS to provide us the same service level. Usually the concepts are the same. What solution are you using? How does it work, replication or real cluster? > Let's consider this kind of setup, 2 datacenters far from each other by 1 ms > delay, each hosting?a SAN array, each of them connected to 2 SAN fabrics > extended between the 2 sites. > > What reason would prevent?us from building Geo-clusters without?having to > rely on?a database replication mechanism, as the setup I would like to > implement would also be used to provide NFS services that are disaster > recovery proof. > > Obviously, such setup should rely on LVM mirroring to allow a node hosting a > service to?be able to write to both local and distant SAN LUN's. Does LVM mirroring work with clustered LVM? -- Fajar -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From apfaffeneder at pfaffeneder.org Fri Jun 5 15:14:06 2009 From: apfaffeneder at pfaffeneder.org (Andreas Pfaffeneder) Date: Fri, 05 Jun 2009 17:14:06 +0200 Subject: [Linux-cluster] Networking guidelines for RHCS across datacenters In-Reply-To: <7207d96f0906050247s54c5c468td31fceeabe1a8da3@mail.gmail.com> References: <7207d96f0906021906mb731687j6964f4a3466b626e@mail.gmail.com> <29ae894c0906050222naf68973r7a68b3c52eeeead@mail.gmail.com> <7207d96f0906050247s54c5c468td31fceeabe1a8da3@mail.gmail.com> Message-ID: <4A29363E.1010108@pfaffeneder.org> Fajar A. Nugraha wrote: > > > Does LVM mirroring work with clustered LVM? > > Since 5.3 it works. Install lvm2-cluster. If you'd like to use mirrored volumes before 5.3 you can do so using lvm-tags (see filters in lvm.conf) but the mirror then is available only to one systeme at a time. Cheers Andreas From teigland at redhat.com Fri Jun 5 15:14:21 2009 From: teigland at redhat.com (David Teigland) Date: Fri, 5 Jun 2009 10:14:21 -0500 Subject: [Linux-cluster] "dlm_controld[nnnn]: cluster is down, exiting" on node1 when starting node2 In-Reply-To: References: Message-ID: <20090605151421.GB28143@redhat.com> On Thu, Jun 04, 2009 at 04:23:13PM -0400, Charlie Brady wrote: > Jun 4 10:55:34 sun4150node1 dlm_controld[7916]: cluster is down, exiting > Jun 4 10:55:34 sun4150node1 fenced[7910]: cluster is down, exiting > Jun 4 10:55:34 sun4150node1 gfs_controld[7922]: cluster is down, exiting > Jun 4 10:55:35 sun4150node1 qdiskd[8128]: cman_dispatch: Host is down They are all complaining that the the cluster is down, which is a polite way of saying that aisexec has died/crashed/failed/killed/gone-away. 
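A quick, purely illustrative way to confirm that on the affected node is something like:

# is aisexec still running at all?
pgrep -l aisexec
# openais normally leaves any core files under its working directory
ls -l /var/lib/openais/core* 2>/dev/null
# check that core dumps aren't being suppressed
ulimit -c
cat /proc/sys/kernel/core_pattern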
Dave From charlieb-linux-cluster at budge.apana.org.au Fri Jun 5 15:42:59 2009 From: charlieb-linux-cluster at budge.apana.org.au (Charlie Brady) Date: Fri, 5 Jun 2009 11:42:59 -0400 (EDT) Subject: [Linux-cluster] "dlm_controld[nnnn]: cluster is down, exiting" on node1 when starting node2 In-Reply-To: <20090605151421.GB28143@redhat.com> References: <20090605151421.GB28143@redhat.com> Message-ID: On Fri, 5 Jun 2009, David Teigland wrote: > On Thu, Jun 04, 2009 at 04:23:13PM -0400, Charlie Brady wrote: >> Jun 4 10:55:34 sun4150node1 dlm_controld[7916]: cluster is down, exiting >> Jun 4 10:55:34 sun4150node1 fenced[7910]: cluster is down, exiting >> Jun 4 10:55:34 sun4150node1 gfs_controld[7922]: cluster is down, exiting >> Jun 4 10:55:35 sun4150node1 qdiskd[8128]: cman_dispatch: Host is down > > They are all complaining that the the cluster is down, which is a polite way > of saying that aisexec has died/crashed/failed/killed/gone-away. Thanks. Why might that have occurred? Where would I look for clues? How can I increase logging output from aisexec? Thanks --- Charlie From teigland at redhat.com Fri Jun 5 16:04:55 2009 From: teigland at redhat.com (David Teigland) Date: Fri, 5 Jun 2009 11:04:55 -0500 Subject: [Linux-cluster] "dlm_controld[nnnn]: cluster is down, exiting" on node1 when starting node2 In-Reply-To: References: <20090605151421.GB28143@redhat.com> Message-ID: <20090605160455.GD28143@redhat.com> On Fri, Jun 05, 2009 at 11:42:59AM -0400, Charlie Brady wrote: > > On Fri, 5 Jun 2009, David Teigland wrote: > > >On Thu, Jun 04, 2009 at 04:23:13PM -0400, Charlie Brady wrote: > >>Jun 4 10:55:34 sun4150node1 dlm_controld[7916]: cluster is down, exiting > >>Jun 4 10:55:34 sun4150node1 fenced[7910]: cluster is down, exiting > >>Jun 4 10:55:34 sun4150node1 gfs_controld[7922]: cluster is down, exiting > >>Jun 4 10:55:35 sun4150node1 qdiskd[8128]: cman_dispatch: Host is > >>down > > > >They are all complaining that the the cluster is down, which is a polite > >way > >of saying that aisexec has died/crashed/failed/killed/gone-away. > > Thanks. Why might that have occurred? Where would I look for clues? How > can I increase logging output from aisexec? If you're lucky it'll leave a core file, otherwise aisexec is notorious for disappearing without leaving any clues about why. Dave From charlieb-linux-cluster at budge.apana.org.au Fri Jun 5 16:50:57 2009 From: charlieb-linux-cluster at budge.apana.org.au (Charlie Brady) Date: Fri, 5 Jun 2009 12:50:57 -0400 (EDT) Subject: [Linux-cluster] "dlm_controld[nnnn]: cluster is down, exiting" on node1 when starting node2 In-Reply-To: <20090605160455.GD28143@redhat.com> References: <20090605151421.GB28143@redhat.com> <20090605160455.GD28143@redhat.com> Message-ID: On Fri, 5 Jun 2009, David Teigland wrote: > On Fri, Jun 05, 2009 at 11:42:59AM -0400, Charlie Brady wrote: >> >> On Fri, 5 Jun 2009, David Teigland wrote: >> >>> On Thu, Jun 04, 2009 at 04:23:13PM -0400, Charlie Brady wrote: >>>> Jun 4 10:55:34 sun4150node1 dlm_controld[7916]: cluster is down, exiting >>>> Jun 4 10:55:34 sun4150node1 fenced[7910]: cluster is down, exiting >>>> Jun 4 10:55:34 sun4150node1 gfs_controld[7922]: cluster is down, exiting >>>> Jun 4 10:55:35 sun4150node1 qdiskd[8128]: cman_dispatch: Host is >>>> down >>> >>> They are all complaining that the the cluster is down, which is a polite >>> way >>> of saying that aisexec has died/crashed/failed/killed/gone-away. >> >> Thanks. Why might that have occurred? Where would I look for clues? 
How >> can I increase logging output from aisexec? > > If you're lucky it'll leave a core file, otherwise aisexec is notorious for > disappearing without leaving any clues about why. That's very disconcerting to hear. Doesn't sound like HA. :-( I don't have any core files. From teigland at redhat.com Fri Jun 5 16:49:51 2009 From: teigland at redhat.com (David Teigland) Date: Fri, 5 Jun 2009 11:49:51 -0500 Subject: [Linux-cluster] "dlm_controld[nnnn]: cluster is down, exiting" on node1 when starting node2 In-Reply-To: References: <20090605151421.GB28143@redhat.com> <20090605160455.GD28143@redhat.com> Message-ID: <20090605164951.GE28143@redhat.com> On Fri, Jun 05, 2009 at 12:50:57PM -0400, Charlie Brady wrote: > > On Fri, 5 Jun 2009, David Teigland wrote: > > >On Fri, Jun 05, 2009 at 11:42:59AM -0400, Charlie Brady wrote: > >> > >>On Fri, 5 Jun 2009, David Teigland wrote: > >> > >>>On Thu, Jun 04, 2009 at 04:23:13PM -0400, Charlie Brady wrote: > >>>>Jun 4 10:55:34 sun4150node1 dlm_controld[7916]: cluster is down, > >>>>exiting > >>>>Jun 4 10:55:34 sun4150node1 fenced[7910]: cluster is down, exiting > >>>>Jun 4 10:55:34 sun4150node1 gfs_controld[7922]: cluster is down, > >>>>exiting > >>>>Jun 4 10:55:35 sun4150node1 qdiskd[8128]: cman_dispatch: Host is > >>>>down > >>> > >>>They are all complaining that the the cluster is down, which is a polite > >>>way > >>>of saying that aisexec has died/crashed/failed/killed/gone-away. > >> > >>Thanks. Why might that have occurred? Where would I look for clues? How > >>can I increase logging output from aisexec? > > > >If you're lucky it'll leave a core file, otherwise aisexec is notorious for > >disappearing without leaving any clues about why. > > That's very disconcerting to hear. Doesn't sound like HA. :-( To clarify, aisexec does not often disappear, it's very reliable. The point was that in the rare case when it does, it's notorious for not leaving any reasons behind. Dave From sdake at redhat.com Fri Jun 5 17:10:38 2009 From: sdake at redhat.com (Steven Dake) Date: Fri, 05 Jun 2009 10:10:38 -0700 Subject: [Linux-cluster] "dlm_controld[nnnn]: cluster is down, exiting" on node1 when starting node2 In-Reply-To: <20090605164951.GE28143@redhat.com> References: <20090605151421.GB28143@redhat.com> <20090605160455.GD28143@redhat.com> <20090605164951.GE28143@redhat.com> Message-ID: <1244221838.2626.29.camel@localhost.localdomain> On Fri, 2009-06-05 at 11:49 -0500, David Teigland wrote: > On Fri, Jun 05, 2009 at 12:50:57PM -0400, Charlie Brady wrote: > > > > On Fri, 5 Jun 2009, David Teigland wrote: > > > > >On Fri, Jun 05, 2009 at 11:42:59AM -0400, Charlie Brady wrote: > > >> > > >>On Fri, 5 Jun 2009, David Teigland wrote: > > >> > > >>>On Thu, Jun 04, 2009 at 04:23:13PM -0400, Charlie Brady wrote: > > >>>>Jun 4 10:55:34 sun4150node1 dlm_controld[7916]: cluster is down, > > >>>>exiting > > >>>>Jun 4 10:55:34 sun4150node1 fenced[7910]: cluster is down, exiting > > >>>>Jun 4 10:55:34 sun4150node1 gfs_controld[7922]: cluster is down, > > >>>>exiting > > >>>>Jun 4 10:55:35 sun4150node1 qdiskd[8128]: cman_dispatch: Host is > > >>>>down > > >>> > > >>>They are all complaining that the the cluster is down, which is a polite > > >>>way > > >>>of saying that aisexec has died/crashed/failed/killed/gone-away. > > >> > > >>Thanks. Why might that have occurred? Where would I look for clues? How > > >>can I increase logging output from aisexec? 
> > > > > >If you're lucky it'll leave a core file, otherwise aisexec is notorious for > > >disappearing without leaving any clues about why. > > > > That's very disconcerting to hear. Doesn't sound like HA. :-( > > To clarify, aisexec does not often disappear, it's very reliable. The point > was that in the rare case when it does, it's notorious for not leaving any > reasons behind. > > Dave > 99.9% of the time there would be a core file in /var/lib/openais/core* if aisexec faults. We have not seen faults during normal operations for years in a released version under typical gfs2 usage scenarios. If there is no core, it means some other component failed, exited, and caused that node to be fenced, or the core file could not be written by the OS because of some other OS specific failure. Another option is that the OOM killer killed aisexec. I would have a hard time believing aisexec would crash without a core file while the operating system was still functional. In the trunk we are enhancing our failure analysis to do fulltime event tracing so failures can be debugged more rapidly then looking at a core file. I hope that helps. regards -steve > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From charlieb-linux-cluster at budge.apana.org.au Fri Jun 5 17:13:13 2009 From: charlieb-linux-cluster at budge.apana.org.au (Charlie Brady) Date: Fri, 5 Jun 2009 13:13:13 -0400 (EDT) Subject: [Linux-cluster] "dlm_controld[nnnn]: cluster is down, exiting" on node1 when starting node2 In-Reply-To: <20090605164951.GE28143@redhat.com> References: <20090605151421.GB28143@redhat.com> <20090605160455.GD28143@redhat.com> <20090605164951.GE28143@redhat.com> Message-ID: On Fri, 5 Jun 2009, David Teigland wrote: >>>>> They are all complaining that the the cluster is down, which is a polite >>>>> way >>>>> of saying that aisexec has died/crashed/failed/killed/gone-away. >>>> >>>> Thanks. Why might that have occurred? Where would I look for clues? How >>>> can I increase logging output from aisexec? >>> >>> If you're lucky it'll leave a core file, otherwise aisexec is notorious for >>> disappearing without leaving any clues about why. >> >> That's very disconcerting to hear. Doesn't sound like HA. :-( > > To clarify, aisexec does not often disappear, it's very reliable. The point > was that in the rare case when it does, it's notorious for not leaving any > reasons behind. Thanks for the clarification. From charlieb-linux-cluster at budge.apana.org.au Fri Jun 5 17:20:11 2009 From: charlieb-linux-cluster at budge.apana.org.au (Charlie Brady) Date: Fri, 5 Jun 2009 13:20:11 -0400 (EDT) Subject: [Linux-cluster] "dlm_controld[nnnn]: cluster is down, exiting" on node1 when starting node2 In-Reply-To: <1244221838.2626.29.camel@localhost.localdomain> References: <20090605151421.GB28143@redhat.com> <20090605160455.GD28143@redhat.com> <20090605164951.GE28143@redhat.com> <1244221838.2626.29.camel@localhost.localdomain> Message-ID: On Fri, 5 Jun 2009, Steven Dake wrote: > On Fri, 2009-06-05 at 11:49 -0500, David Teigland wrote: >> On Fri, Jun 05, 2009 at 12:50:57PM -0400, Charlie Brady wrote: >>> >>> On Fri, 5 Jun 2009, David Teigland wrote: >>> >>>> On Fri, Jun 05, 2009 at 11:42:59AM -0400, Charlie Brady wrote: >>>>> >>>>> On Fri, 5 Jun 2009, David Teigland wrote: >>>>> >>>>>> They are all complaining that the the cluster is down, which is a polite >>>>>> way >>>>>> of saying that aisexec has died/crashed/failed/killed/gone-away. >>>>> >>>>> Thanks. 
Why might that have occurred? Where would I look for clues? How >>>>> can I increase logging output from aisexec? >>>> >>>> If you're lucky it'll leave a core file, otherwise aisexec is notorious for >>>> disappearing without leaving any clues about why. >>> >>> That's very disconcerting to hear. Doesn't sound like HA. :-( >> >> To clarify, aisexec does not often disappear, it's very reliable. The point >> was that in the rare case when it does, it's notorious for not leaving any >> reasons behind. >> >> Dave >> > > 99.9% of the time there would be a core file in /var/lib/openais/core* > if aisexec faults. Only file I have there is named. ringid_10.39.171.212 > We have not seen faults during normal operations for > years in a released version under typical gfs2 usage scenarios. If > there is no core, it means some other component failed, exited, and > caused that node to be fenced, or the core file could not be written by > the OS because of some other OS specific failure. Another option is > that the OOM killer killed aisexec. No sign of the oom killer in the log I quoted yesterday. > I would have a hard time believing > aisexec would crash without a core file while the operating system was > still functional. > > In the trunk we are enhancing our failure analysis to do fulltime event > tracing so failures can be debugged more rapidly then looking at a core > file. I hope that helps. Thanks. I'll try to reproduce the scenario. Meanwhile I'm still looking for hints as to how to get more visibility of what is happening. From brem.belguebli at gmail.com Fri Jun 5 17:17:21 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Fri, 5 Jun 2009 19:17:21 +0200 Subject: [Linux-cluster] Networking guidelines for RHCS across datacenters In-Reply-To: <7207d96f0906050247s54c5c468td31fceeabe1a8da3@mail.gmail.com> References: <7207d96f0906021906mb731687j6964f4a3466b626e@mail.gmail.com> <29ae894c0906050222naf68973r7a68b3c52eeeead@mail.gmail.com> <7207d96f0906050247s54c5c468td31fceeabe1a8da3@mail.gmail.com> Message-ID: <29ae894c0906051017ice9628bw2af3f94de8c126c5@mail.gmail.com> Hello, We are long term HP ServiceGuard on HP-UX users and since a few months HP ServiceGuard on Linux (aka SGLX). The first one (HP-UX) works by using their Cluster LVM (a clvmd-like daemon named cmlvmd on each node) allowing one node of the cluster to activate exclusively (vgchange -a e VGXX) on one node and use a non-clustered FS (vxfs) on top of the LV's. The LV's are mirrored (a leg on each SAN array, one local and the other distant). On Linux (SGLX) is a bit more tricky but when masterized it works well. It relies on non-clustered LVM, with the LVM2 hosttags feature (HA-LVM described by RH) built on top of MD raid1 devices with a cluster module that guarantees the raid device to be consistent on one node at a time. Unfortunately, HP just announced the discontinuation of SGLX, that's why we are looking towards RHCS to see if it can provide the same service, which doesn't seem to be obvious. Concerning LVM mirroring with Clustered LVM, I hope it does or will. The only thing I know about LVM mirror is that, soon (maybe around RH5u5) it will support online resizing without having to break the mirror. Brem 2009/6/5, Fajar A. Nugraha : > > On Fri, Jun 5, 2009 at 4:22 PM, brem belguebli > wrote: > > We are in the same setup, already doing "Geo-cluster" with other technos > and > > we are looking at RHCS to provide us the same service level. > > Usually the concepts are the same. What solution are you using? 
How > does it work, replication or real cluster? > > > Let's consider this kind of setup, 2 datacenters far from each other by 1 > ms > > delay, each hosting a SAN array, each of them connected to 2 SAN fabrics > > extended between the 2 sites. > > > > What reason would prevent us from building Geo-clusters without having to > > rely on a database replication mechanism, as the setup I would like to > > implement would also be used to provide NFS services that are disaster > > recovery proof. > > > > Obviously, such setup should rely on LVM mirroring to allow a node > hosting a > > service to be able to write to both local and distant SAN LUN's. > > Does LVM mirroring work with clustered LVM? > > -- > Fajar > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdake at redhat.com Fri Jun 5 17:26:37 2009 From: sdake at redhat.com (Steven Dake) Date: Fri, 05 Jun 2009 10:26:37 -0700 Subject: [Linux-cluster] "dlm_controld[nnnn]: cluster is down, exiting" on node1 when starting node2 In-Reply-To: References: <20090605151421.GB28143@redhat.com> <20090605160455.GD28143@redhat.com> <20090605164951.GE28143@redhat.com> <1244221838.2626.29.camel@localhost.localdomain> Message-ID: <1244222797.2626.33.camel@localhost.localdomain> On Fri, 2009-06-05 at 13:20 -0400, Charlie Brady wrote: > On Fri, 5 Jun 2009, Steven Dake wrote: > > > On Fri, 2009-06-05 at 11:49 -0500, David Teigland wrote: > >> On Fri, Jun 05, 2009 at 12:50:57PM -0400, Charlie Brady wrote: > >>> > >>> On Fri, 5 Jun 2009, David Teigland wrote: > >>> > >>>> On Fri, Jun 05, 2009 at 11:42:59AM -0400, Charlie Brady wrote: > >>>>> > >>>>> On Fri, 5 Jun 2009, David Teigland wrote: > >>>>> > >>>>>> They are all complaining that the the cluster is down, which is a polite > >>>>>> way > >>>>>> of saying that aisexec has died/crashed/failed/killed/gone-away. > >>>>> > >>>>> Thanks. Why might that have occurred? Where would I look for clues? How > >>>>> can I increase logging output from aisexec? > >>>> > >>>> If you're lucky it'll leave a core file, otherwise aisexec is notorious for > >>>> disappearing without leaving any clues about why. > >>> > >>> That's very disconcerting to hear. Doesn't sound like HA. :-( > >> > >> To clarify, aisexec does not often disappear, it's very reliable. The point > >> was that in the rare case when it does, it's notorious for not leaving any > >> reasons behind. > >> > >> Dave > >> > > > > 99.9% of the time there would be a core file in /var/lib/openais/core* > > if aisexec faults. > > Only file I have there is named. > > ringid_10.39.171.212 > > > We have not seen faults during normal operations for > > years in a released version under typical gfs2 usage scenarios. If > > there is no core, it means some other component failed, exited, and > > caused that node to be fenced, or the core file could not be written by > > the OS because of some other OS specific failure. Another option is > > that the OOM killer killed aisexec. > > No sign of the oom killer in the log I quoted yesterday. > > > I would have a hard time believing > > aisexec would crash without a core file while the operating system was > > still functional. > > > > In the trunk we are enhancing our failure analysis to do fulltime event > > tracing so failures can be debugged more rapidly then looking at a core > > file. I hope that helps. > > Thanks. > > I'll try to reproduce the scenario. 
Meanwhile I'm still looking for hints > as to how to get more visibility of what is happening. some users change their default core file storage location. This would then override the defaults used by openais. another possibility is selinux is enabled. aisexec integration with selinux needs more work and selinux might prevent a core file from being written. You can check selinux by looking /etc/selinux/config. If it is set to enforcing or permisssive, that may be your culprit. Regards -steve From brem.belguebli at gmail.com Fri Jun 5 17:40:33 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Fri, 5 Jun 2009 19:40:33 +0200 Subject: [Linux-cluster] Networking guidelines for RHCS across datacenters In-Reply-To: References: <7207d96f0906021906mb731687j6964f4a3466b626e@mail.gmail.com> <29ae894c0906050222naf68973r7a68b3c52eeeead@mail.gmail.com> <7207d96f0906050247s54c5c468td31fceeabe1a8da3@mail.gmail.com> Message-ID: <29ae894c0906051040v64a92e55je6c3568e14d5a20f@mail.gmail.com> 2009/6/5, Jon Schulz : > > Yes I would be interested to see what products you are currently using to > achieve this. In my proposed setup we are actually completely database > transaction driven. The problem is the people higher up want active database > <-> database replication which will be problematic I know. Still we also use DB (Oracle, Sybase) replication mechanisms to address accidental data corruption, as mirroring being synchonous, if something happens (someone intentionnaly alters the DB or filesystem corruption) it will be on both legs of the mirror. Outside of the data side of the equation, how tolerant is the cluster > network/heartbeat to latency assuming no packet loss? Or more to the point, > at what point does everyone in their past experience see the heartbeat > network become unreliable, latency wise. E.g. anything over 30ms? > > Most of my experiences with rhcs and linux-ha have always been with the > cluster network being within the same LAN :( It is definitely the best solution in case you cannot rely on your network infrastructure. This is not completely my case :-) -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto: > linux-cluster-bounces at redhat.com] On Behalf Of Fajar A. Nugraha > Sent: Friday, June 05, 2009 5:47 AM > To: linux clustering > Subject: Re: [Linux-cluster] Networking guidelines for RHCS across > datacenters > > On Fri, Jun 5, 2009 at 4:22 PM, brem belguebli > wrote: > > We are in the same setup, already doing "Geo-cluster" with other technos > and > > we are looking at RHCS to provide us the same service level. > > Usually the concepts are the same. What solution are you using? How > does it work, replication or real cluster? > > > Let's consider this kind of setup, 2 datacenters far from each other by 1 > ms > > delay, each hosting a SAN array, each of them connected to 2 SAN fabrics > > extended between the 2 sites. > > > > What reason would prevent us from building Geo-clusters without having to > > rely on a database replication mechanism, as the setup I would like to > > implement would also be used to provide NFS services that are disaster > > recovery proof. > > > > Obviously, such setup should rely on LVM mirroring to allow a node > hosting a > > service to be able to write to both local and distant SAN LUN's. > > Does LVM mirroring work with clustered LVM? 
> > -- > Fajar > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdake at redhat.com Fri Jun 5 18:01:10 2009 From: sdake at redhat.com (Steven Dake) Date: Fri, 05 Jun 2009 11:01:10 -0700 Subject: [Linux-cluster] Networking guidelines for RHCS across datacenters In-Reply-To: <29ae894c0906051040v64a92e55je6c3568e14d5a20f@mail.gmail.com> References: <7207d96f0906021906mb731687j6964f4a3466b626e@mail.gmail.com> <29ae894c0906050222naf68973r7a68b3c52eeeead@mail.gmail.com> <7207d96f0906050247s54c5c468td31fceeabe1a8da3@mail.gmail.com> <29ae894c0906051040v64a92e55je6c3568e14d5a20f@mail.gmail.com> Message-ID: <1244224870.2626.37.camel@localhost.localdomain> On Fri, 2009-06-05 at 19:40 +0200, brem belguebli wrote: > > > 2009/6/5, Jon Schulz : > Yes I would be interested to see what products you are > currently using to achieve this. In my proposed setup we are > actually completely database transaction driven. The problem > is the people higher up want active database <-> database > replication which will be problematic I know. > > Still we also use DB (Oracle, Sybase) replication mechanisms > to address accidental data corruption, as mirroring being synchonous, > if something happens (someone intentionnaly alters the DB or > filesystem corruption) it will be on both legs of the mirror. > > > > Outside of the data side of the equation, how tolerant is the > cluster network/heartbeat to latency assuming no packet loss? > Or more to the point, at what point does everyone in their > past experience see the heartbeat network become unreliable, > latency wise. E.g. anything over 30ms? > The default configured timers for failure detection are quite high and retransmit many times for failed packets (for lossy networks). 30msec latency would pose no major problem, except performance. If you used posix locking and your machine->machine latency was 30msec, each posix lock would take 30.03 msec to grant or more, which may not meet your performance requirements. I can't recommend wan connections with totem (the protocol used in rhcs) because of the performance characteristics. If the performance of posix locks is not a high requirement, it should be functional. Regards -steve > From invite+kjdmu_5j51di at facebookmail.com Fri Jun 5 22:07:22 2009 From: invite+kjdmu_5j51di at facebookmail.com (Varun Galande) Date: Fri, 5 Jun 2009 15:07:22 -0700 Subject: [Linux-cluster] Check out my photos on Facebook Message-ID: <7e9d99562107ef1cab28aabcbfd48237@10.22.41.202> Hi linux-cluster at redhat.com, I invited you to join Facebook a while back and wanted to remind you that once you join, we'll be able to connect online, share photos, organize groups and events, and more. Thanks, Varun To sign up for Facebook, follow the link below: http://www.facebook.com/p.php?i=542993879&k=RVBUZ4WSUV2M5BD1QB63URRTSW1&r linux-cluster at redhat.com was invited to join Facebook by Varun Galande. If you do not wish to receive this type of email from Facebook in the future, please click on the link below to unsubscribe. http://www.facebook.com/o.php?k=e846e5&u=100000004637023&mid=939448G5af310c1015fG0G8 Facebook's offices are located at 1601 S. California Ave., Palo Alto, CA 94304. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From darcy.sherwood at gmail.com Mon Jun 8 02:40:38 2009 From: darcy.sherwood at gmail.com (Darcy Sherwood) Date: Sun, 7 Jun 2009 22:40:38 -0400 Subject: [Linux-cluster] Clvm Hang after an node is fenced in a 2 nodes cluster In-Reply-To: <4A276F8F.70101@justice.gouv.fr> References: <4A276F8F.70101@justice.gouv.fr> Message-ID: <7a7f2ea30906071940k19d69dfcs122cdaf6e71f79fc@mail.gmail.com> Do you have all of your cluster services chkconfig'd on at node2 ? Sounds to me like clvmd might be chkconfig'd off On Thu, Jun 4, 2009 at 2:54 AM, Jean Diallo < admin1-bua.dage-etd at justice.gouv.fr> wrote: > Description of problem: In a 2 nodes cluster, after 1 node is fence, any > clvm command hang on the ramaining node. when the fenced node cluster come > back in the cluster, any clvm command also hang, moreover the node do not > activate any clustered vg, and so do not access any shared device. > > > Version-Release number of selected component (if applicable): > redhat 5.2 > update device-mapper-1.02.28-2.el5.x86_64.rpm > lvm2-2.02.40-6.el5.x86_64.rpm > lvm2-cluster-2.02.40-7.el5.x86_64.rpm > > > Steps to Reproduce: > 1.2 nodes cluster , quorum formed with qdisk > 2.cold boot node 2 > 3.node 2 is evicted and fenced, service are taken over by node 1 > 4.node ? come back in cluster, quorate, but no clustered vg are up and any > lvm related command hang > 5.At this step every lvm command hang on node 1 > > > Expected results: node 2 should be able to get back the lock on clustered > lvm volume and node 1 should be able to issue any lvm relate command > > Here are my cluster.conf and lvm.conf > > > post_join_delay="6"/> > > > > > > > > > > > > > > > > > > token_retransmits_before_loss_const="20"/> > > login="Administrator" name="ilo172" passwd="X.X.X.X"/> > login="Administrator" name="ilo173" passwd="XXXX"/> > > > > > name="alfrescoP64" path="/etc/xen" recovery="relocate"/> > name="alfrescoI64" path="/etc/xen" recovery="relocate"/> > name="alfrescoS64" path="/etc/xen" recovery="relocate"/> > > votes="1"> > score="1"/> > > > > part of lvm.conf: > # Type 3 uses built-in clustered locking. > locking_type = 3 > > # If using external locking (type 2) and initialisation fails, > # with this set to 1 an attempt will be made to use the built-in > # clustered locking. > # If you are using a customised locking_library you should set this to 0. > fallback_to_clustered_locking = 0 > > # If an attempt to initialise type 2 or type 3 locking failed, perhaps > # because cluster components such as clvmd are not running, with this set > # to 1 an attempt will be made to use local file-based locking (type 1). > # If this succeeds, only commands against local volume groups will > proceed. > # Volume Groups marked as clustered will be ignored. > fallback_to_local_locking = 1 > > # Local non-LV directory that holds file-based locks while commands are > # in progress. A directory like /tmp that may get wiped on reboot is OK. > locking_dir = "/var/lock/lvm" > > # Other entries can go here to allow you to load shared libraries > # e.g. if support for LVM1 metadata was compiled as a shared library use > # format_libraries = "liblvm2format1.so" > # Full pathnames can be given. > > # Search this directory first for shared libraries. > # library_dir = "/lib" > > # The external locking library to load if locking_type is set to 2. 
> # locking_library = "liblvm2clusterlock.so" > > > part of lvm log on second node : > > vgchange.c:165 Activated logical volumes in volume group "VolGroup00" > vgchange.c:172 7 logical volume(s) in volume group "VolGroup00" now > active > cache/lvmcache.c:1220 Wiping internal VG cache > commands/toolcontext.c:188 Logging initialised at Wed Jun 3 15:17:29 > 2009 > commands/toolcontext.c:209 Set umask to 0077 > locking/cluster_locking.c:83 connect() failed on local socket: Connexion > refus?e > locking/locking.c:259 WARNING: Falling back to local file-based locking. > locking/locking.c:261 Volume Groups with the clustered attribute will be > inaccessible. > toollib.c:578 Finding all volume groups > toollib.c:491 Finding volume group "VGhomealfrescoS64" > metadata/metadata.c:2379 Skipping clustered volume group > VGhomealfrescoS64 > toollib.c:491 Finding volume group "VGhomealfS64" > metadata/metadata.c:2379 Skipping clustered volume group VGhomealfS64 > toollib.c:491 Finding volume group "VGvmalfrescoS64" > metadata/metadata.c:2379 Skipping clustered volume group VGvmalfrescoS64 > toollib.c:491 Finding volume group "VGvmalfrescoI64" > metadata/metadata.c:2379 Skipping clustered volume group VGvmalfrescoI64 > toollib.c:491 Finding volume group "VGvmalfrescoP64" > metadata/metadata.c:2379 Skipping clustered volume group VGvmalfrescoP64 > toollib.c:491 Finding volume group "VolGroup00" > libdm-report.c:981 VolGroup00 > cache/lvmcache.c:1220 Wiping internal VG cache > commands/toolcontext.c:188 Logging initialised at Wed Jun 3 15:17:29 > 2009 > commands/toolcontext.c:209 Set umask to 0077 > locking/cluster_locking.c:83 connect() failed on local socket: Connexion > refus?e > locking/locking.c:259 WARNING: Falling back to local file-based locking. > locking/locking.c:261 Volume Groups with the clustered attribute will be > inaccessible. > toollib.c:542 Using volume group(s) on command line > toollib.c:491 Finding volume group "VolGroup00" > vgchange.c:117 7 logical volume(s) in volume group "VolGroup00" monitored > cache/lvmcache.c:1220 Wiping internal VG cache > commands/toolcontext.c:188 Logging initialised at Wed Jun 3 15:20:45 > 2009 > commands/toolcontext.c:209 Set umask to 0077 > toollib.c:331 Finding all logical volumes > commands/toolcontext.c:188 Logging initialised at Wed Jun 3 15:20:50 > 2009 > commands/toolcontext.c:209 Set umask to 0077 > toollib.c:578 Finding all volume groups > > > group_tool on node 1 > type level name id state fence 0 > default 00010001 none [1 2] > dlm 1 clvmd 00010002 none [1 2] > dlm 1 rgmanager 00020002 none [1] > > > group_tool on node 2 > [root at remus ~]# group_tool > type level name id state fence 0 > default 00010001 none [1 2] > dlm 1 clvmd 00010002 none [1 2] > > Additional info: > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From CISPLengineer.hz at ril.com Mon Jun 8 04:37:34 2009 From: CISPLengineer.hz at ril.com (Viral .D. Ahire) Date: Mon, 08 Jun 2009 10:07:34 +0530 Subject: [Linux-cluster] Node Leave Cluster while Stopping Cluster Application (Oracle) Message-ID: <4A2C958E.1020701@ril.com> Hi, I have configured 2 node Clustering on RHEL-5. During migration of server i have changed host name & IP Address of bothe node and reconfigure the cluster through system-config-cluster. 
Now the problem is that whenever I stop the cluster application (Oracle), the cman service on the node that owns that application also stops, so that node is no longer part of the cluster and gets fenced by the other node. The same thing happens during a restart or relocation of the cluster application, because a restart or relocation stops the application first. Please help. Regards, Viral Ahire "Confidentiality Warning: This message and any attachments are intended only for the use of the intended recipient(s), are confidential, and may be privileged. If you are not the intended recipient, you are hereby notified that any review, re-transmission, conversion to hard copy, copying, circulation or other use of this message and any attachments is strictly prohibited. If you are not the intended recipient, please notify the sender immediately by return email, and delete this message and any attachments from your system. Virus Warning: Although the company has taken reasonable precautions to ensure no viruses are present in this email, the company cannot accept responsibility for any loss or damage arising from the use of this email or attachment."
From fdinitto at redhat.com Mon Jun 8 06:51:20 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Mon, 08 Jun 2009 08:51:20 +0200 Subject: Re: [Linux-Cluster] Submitting two new resource plugins to the project In-Reply-To: <1244046520.6750.6.camel@mecatol> References: <1243883850.6761.2.camel@mecatol> <1243919063.24866.14.camel@cerberus.int.fabbione.net> <1244046520.6750.6.camel@mecatol> Message-ID: <1244443880.3665.3.camel@cerberus.int.fabbione.net> On Wed, 2009-06-03 at 18:28 +0200, Rafael Micó Miranda wrote: > Hi Fabio, > > > I have sent the e-mail to that mail list and i have had no answer yet. > Its the only occurrence i have found about the "devel list" on the CMAN > Project web page, are you sure this address is right? Pretty sure it's right. Did you subscribe to the mailing list before posting? > > Thanks in advance. > No problem at all. Fabio
From fdinitto at redhat.com Mon Jun 8 06:52:17 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Mon, 08 Jun 2009 08:52:17 +0200 Subject: Re: [Linux-cluster] compile gnbd-kernel error In-Reply-To: <2495e3790906020630g34d1b32ckeff5db4b8167169@mail.gmail.com> References: <2495e3790906020630g34d1b32ckeff5db4b8167169@mail.gmail.com> Message-ID: <1244443937.3665.5.camel@cerberus.int.fabbione.net> On Tue, 2009-06-02 at 21:30 +0800, Jin-Shan Tseng wrote: > Hi folks, > > > I tried to compile gnbd-kernel on Gentoo Linux 2.6.29-gentoo-r5 but I > got some error messages. :( > > > the error messages are appear > on cluster-2.03.09, cluster-2.03.10, cluster-2.03.11 > > > # uname -a > Linux node26 2.6.29-gentoo-r5 #3 SMP Mon Jun 1 19:05:23 CST 2009 i686 > Intel(R) Xeon(TM) CPU 3.06GHz GenuineIntel GNU/Linux [SNIP] > > Does anyone have the same problems? > Any suggestions are appreciate. gnbd has not been ported to any kernel > 2.6.27 because it's been deprecated upstream.
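For anyone moving off gnbd, a minimal nbd replacement looks roughly like this (the host name, port and device below are only placeholders, and the exact syntax depends on the nbd-tools version in use):

  # on the machine that owns the disk
  nbd-server 2000 /dev/sdb1

  # on each client node
  modprobe nbd
  nbd-client storage-host 2000 /dev/nbd0

The client then sees /dev/nbd0 as an ordinary block device, so in principle it can take over the role a gnbd import used to play; iSCSI (e.g. tgtd plus open-iscsi) or AoE are the other commonly suggested routes.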
Fabio From esggrupos at gmail.com Mon Jun 8 07:55:41 2009 From: esggrupos at gmail.com (ESGLinux) Date: Mon, 8 Jun 2009 09:55:41 +0200 Subject: [Linux-cluster] all nodes halt when one lose connection In-Reply-To: References: <3128ba140905210444o21959031iff759490ace7c8bc@mail.gmail.com> <3128ba140905210757jd814f52hc1ca97c4da6e3a7a@mail.gmail.com> <1D109AC0-9EE0-419B-A841-D98EA53FF1C8@redhat.com> <3128ba140905210834gdcdf89ahb34bf45d1272861b@mail.gmail.com> <5b192c7e0905220650h7cc737c5k7581972add42e21f@mail.gmail.com> <3128ba140905250228x5577d24eucd68bbd4b1e57e1b@mail.gmail.com> Message-ID: <3128ba140906080055i7ddd67e9h67cccc5d931a6ec8@mail.gmail.com> Thanks for your answers, I have used a separated network for the manage and service networks with 2 switchs and now it works fine. Thanks again, ESG 2009/5/28 Kaerka Phillips > One thing we did not try, but might've worked, would be to bond two network > interfaces together and then use vlan tagging on top of the bond interface > to create a vlan across it to the other node, and then pointing the cluster > to the vlan interfaces, which should still be up if even if the loss of one > network interface or one switch. > > > On Wed, May 27, 2009 at 7:48 PM, Kaerka Phillips wrote: > >> It sounds like they're fencing themselves. We got around this issue on a >> two-node cluster by including the alternate node's internal ip address in >> the /etc/hosts file of both hosts and a cross-over cable for the service >> network with the private ip addresses assigned to that network. If you're >> trying to get them to monitor each other via the public network, in theory >> this could be done with a backup fencing method, but we weren't able to get >> this work since the heartbeat functions only happen on the network that the >> node names are defined to use. >> >> >> On Mon, May 25, 2009 at 5:28 AM, ESGLinux wrote: >> >>> Hi, >>> I think this is not my problem because fencing works fine. The nodes gets >>> fenced inmediatly but I think they fence when they don't must >>> >>> Greetings, >>> >>> ESG >>> >>> 2009/5/22 jorge sanchez >>> >>> Hi, >>>> >>>> try also disable the acpi if is it running , see following: >>>> >>>> >>>> http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Cluster_Administration/s1-acpi-CA.html >>>> >>>> >>>> Regards, >>>> >>>> Jorge Sanchez >>>> >>>> >>>> On Thu, May 21, 2009 at 5:34 PM, ESGLinux wrote: >>>> >>>>> >>>>> >>>>> 2009/5/21 Jonathan Brassow >>>>> >>>>>> >>>>>> On May 21, 2009, at 9:57 AM, ESGLinux wrote: >>>>>> >>>>>> Hello, >>>>>>> >>>>>>> these are the logs I get: >>>>>>> >>>>>>> In node1: >>>>>>> >>>>>>> May 21 11:33:44 NODE1 fenced[3840]: NODE2 not a cluster member after >>>>>>> 5 sec post_fail_delay >>>>>>> May 21 11:33:44 NODE1 fenced[3840]: fencing node "NODE2" >>>>>>> May 21 11:33:44 NODE1 shutdown[5448]: shutting down for system halt >>>>>>> >>>>>>> in node2: >>>>>>> >>>>>>> May 21 11:33:45 NODE2 fenced[3843]: NODE1 not a cluster member after >>>>>>> 5 sec post_fail_delay >>>>>>> May 21 11:33:45 NODE2 fenced[3843]: fencing node "NODE1" >>>>>>> May 21 11:33:45 NODE2 shutdown[5923]: shutting down for system halt >>>>>>> >>>>>>> >>>>>>> what I don?t know is way they lose the connection with the cluster, >>>>>>> they are still connected (I only unplug a cable from the service network) >>>>>>> >>>>>> >>>>>> That may be something worth chasing down, as it appears that your >>>>>> cluster communication is on a network you don't expect? 
>>>>>> >>>>> >>>>> How can I be sure about the network the nodes are using for >>>>> communication? I think they do for the network I have configured to do >>>>> that.... >>>>> >>>>> >>>>>> >>>>>> Also, are the nodes simply "shutting down", or are they being forcibly >>>>>> rebooted. If it is a casual shutdown, then it would appear that both nodes >>>>>> are trying to shutdown simultaneously. >>>>>> >>>>> >>>>> they simply shutdown. They no reboot. >>>>> >>>>> This is what I get every time I unplug the nework cable from eth0 of >>>>> any of the two nodes. (they communicate through eth1...) >>>>> >>>>> Greetings, >>>>> >>>>> ESG >>>>> >>>>> >>>>> >>>>> >>>>> >>>>>> >>>>>> >>>>>> brassow >>>>>> >>>>>> -- >>>>>> Linux-cluster mailing list >>>>>> Linux-cluster at redhat.com >>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>> >>>>> >>>>> >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>> >>>> >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>> >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> >> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From esggrupos at gmail.com Mon Jun 8 07:59:42 2009 From: esggrupos at gmail.com (ESGLinux) Date: Mon, 8 Jun 2009 09:59:42 +0200 Subject: [Linux-cluster] Default params for quorumdisk In-Reply-To: <1243631124.25291.36.camel@ayanami> References: <3128ba140905260342k523cc51v7d8e321907b1d049@mail.gmail.com> <1243631124.25291.36.camel@ayanami> Message-ID: <3128ba140906080059p72509deeoa3dd850de291706a@mail.gmail.com> Hi, I finally have configured a quorom disk and it works fine. Now my 2 nodes cluster is more stable than ever. For anyone who doesnt use it, like me, I recommend it ;-) ESG 2009/5/29 Lon Hohberger > On Tue, 2009-05-26 at 12:42 +0200, ESGLinux wrote: > > for example, > > Interval - The frequency of read/write cycles, in seconds. I have not > > idea what to say to that. Which is the default and how can I answer > > it? > > 'man qdisk' explains all the parameters and their defaults. > > -- Lon > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tsengjs at gmail.com Mon Jun 8 08:27:03 2009 From: tsengjs at gmail.com (Jin-Shan Tseng) Date: Mon, 8 Jun 2009 16:27:03 +0800 Subject: [Linux-cluster] compile gnbd-kernel error In-Reply-To: <1244443937.3665.5.camel@cerberus.int.fabbione.net> References: <2495e3790906020630g34d1b32ckeff5db4b8167169@mail.gmail.com> <1244443937.3665.5.camel@cerberus.int.fabbione.net> Message-ID: <2495e3790906080127q7d939cceh70632899a42aa1d3@mail.gmail.com> On Mon, Jun 8, 2009 at 2:52 PM, Fabio M. Di Nitto wrote: > > gnbd has not been ported to any kernel > 2.6.27 because it's been > deprecated upstream. > > Fabio > Hi Fabio, Thanks for your reply. :) I'll use nbd instead. Regards, Jin-Shan -------------- next part -------------- An HTML attachment was scrubbed... URL: From fajar at fajar.net Mon Jun 8 08:32:44 2009 From: fajar at fajar.net (Fajar A. 
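For completeness, the bonding-plus-VLAN idea suggested earlier in this thread would look something like the following on a RHEL-style system (interface names, addresses and bonding options are just an example; older setups put the bonding options in modprobe.conf instead of BONDING_OPTS):

  # /etc/sysconfig/network-scripts/ifcfg-bond0
  DEVICE=bond0
  BONDING_OPTS="mode=active-backup miimon=100"
  ONBOOT=yes
  BOOTPROTO=none

  # /etc/sysconfig/network-scripts/ifcfg-eth0 (and likewise eth1)
  DEVICE=eth0
  MASTER=bond0
  SLAVE=yes
  ONBOOT=yes
  BOOTPROTO=none

  # /etc/sysconfig/network-scripts/ifcfg-bond0.10  (the cluster VLAN)
  DEVICE=bond0.10
  VLAN=yes
  IPADDR=192.168.10.1
  NETMASK=255.255.255.0
  ONBOOT=yes

The cluster node names would then resolve to the addresses on bond0.10, so heartbeat traffic survives the loss of either NIC or either switch.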
Nugraha) Date: Mon, 8 Jun 2009 15:32:44 +0700 Subject: [Linux-cluster] compile gnbd-kernel error In-Reply-To: <2495e3790906080127q7d939cceh70632899a42aa1d3@mail.gmail.com> References: <2495e3790906020630g34d1b32ckeff5db4b8167169@mail.gmail.com> <1244443937.3665.5.camel@cerberus.int.fabbione.net> <2495e3790906080127q7d939cceh70632899a42aa1d3@mail.gmail.com> Message-ID: <7207d96f0906080132u613d7bffw959d6a5225b15eac@mail.gmail.com> On Mon, Jun 8, 2009 at 3:27 PM, Jin-Shan Tseng wrote: > On Mon, Jun 8, 2009 at 2:52 PM, Fabio M. Di Nitto > wrote: >> >> gnbd has not been ported to any kernel > 2.6.27 because it's been >> deprecated upstream. >> >> Fabio > > Hi Fabio, > Thanks for your reply. :) > I'll use nbd instead. This is interesting. What is the currently recommended method to use for exporting block-device via TCP/IP? nbd? iscsi? use whatever works? -- Fajar From fdinitto at redhat.com Mon Jun 8 08:46:16 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Mon, 08 Jun 2009 10:46:16 +0200 Subject: [Linux-cluster] compile gnbd-kernel error In-Reply-To: <7207d96f0906080132u613d7bffw959d6a5225b15eac@mail.gmail.com> References: <2495e3790906020630g34d1b32ckeff5db4b8167169@mail.gmail.com> <1244443937.3665.5.camel@cerberus.int.fabbione.net> <2495e3790906080127q7d939cceh70632899a42aa1d3@mail.gmail.com> <7207d96f0906080132u613d7bffw959d6a5225b15eac@mail.gmail.com> Message-ID: <1244450776.3665.15.camel@cerberus.int.fabbione.net> On Mon, 2009-06-08 at 15:32 +0700, Fajar A. Nugraha wrote: > On Mon, Jun 8, 2009 at 3:27 PM, Jin-Shan Tseng wrote: > > On Mon, Jun 8, 2009 at 2:52 PM, Fabio M. Di Nitto > > wrote: > >> > >> gnbd has not been ported to any kernel > 2.6.27 because it's been > >> deprecated upstream. > >> > >> Fabio > > > > Hi Fabio, > > Thanks for your reply. :) > > I'll use nbd instead. > > This is interesting. > What is the currently recommended method to use for exporting > block-device via TCP/IP? nbd? iscsi? use whatever works? > There was a short thread discussing this same issue in November when we announced GNDB deprecation: http://www.redhat.com/archives/cluster-devel/2008-November/msg00062.html in short, iscsi/aoe/nbd and others are recognized as defacto standard protocols and supported by different vendors. It makes no sense to carry around yet another network block device protocol/implementation that is not even standard. A lot of people had great deal of success using iSCSI. I personally used AOE for testing for a long time with very little issues. Fabio From grimme at atix.de Mon Jun 8 09:02:19 2009 From: grimme at atix.de (Marc Grimme) Date: Mon, 8 Jun 2009 11:02:19 +0200 Subject: [Linux-cluster] Mountoption _netdev status with gfs/gfs2 Message-ID: <200906081102.20019.grimme@atix.de> Hello, in a few bugs I read that you don't want to support the _netdev mountoption (Dave/Steve) with gfs/gfs2. In order to being able to establish a relyable process to mount filesystems depending on the network (independently from the filesystem itself) for me it looks like a good step to at least support then _netdev mountoption with gfs/gfs2. But if I specify the _netdev option with gfs it is just ignored and will not be shown. Could you please shortly sum up or give me a reference where you described the reasons for not supporting it with gfs? For us (using gfs/gfs2 as rootfs) it would make things much easier if the _netdev option would be available. BTW: ocfs2 sets it as default. 
-- Gruss / Regards, Marc Grimme http://www.atix.de/ http://www.open-sharedroot.org/ From swhiteho at redhat.com Mon Jun 8 09:22:51 2009 From: swhiteho at redhat.com (Steven Whitehouse) Date: Mon, 08 Jun 2009 10:22:51 +0100 Subject: [Linux-cluster] Mountoption _netdev status with gfs/gfs2 In-Reply-To: <200906081102.20019.grimme@atix.de> References: <200906081102.20019.grimme@atix.de> Message-ID: <1244452971.29604.764.camel@localhost.localdomain> Hi, On Mon, 2009-06-08 at 11:02 +0200, Marc Grimme wrote: > Hello, > in a few bugs I read that you don't want to support the _netdev mountoption > (Dave/Steve) with gfs/gfs2. > > In order to being able to establish a relyable process to mount filesystems > depending on the network (independently from the filesystem itself) for me it > looks like a good step to at least support then _netdev mountoption with > gfs/gfs2. > > But if I specify the _netdev option with gfs it is just ignored and will not > be shown. > > Could you please shortly sum up or give me a reference where you described the > reasons for not supporting it with gfs? > > For us (using gfs/gfs2 as rootfs) it would make things much easier if the > _netdev option would be available. > > BTW: ocfs2 sets it as default. The _netdev option is only read by scripts, not by the kernel itself and its an ordering constraint. The issue is that it doesn't make the ordering correct in all cases and there are good reasons for wanting to specify other orderings too. The man page for fstab says that entries will be read in the order in which they appear. Thats not quite true of course as _netdev entries will be read later on, and they are then mounted according to fstype and the fstab ordering is only respected within a particular fstype. Ideally we want to be able to mix ordinary fs mounts, network fs mounts and bind mounts in any order. Although you can use _netdev to solve one particular case with gfs/gfs2 it certainly is not a general solution, so more thought needs to go into this. The upstart project has been suggested as a possible solution. I've not looked into it enough to be certain whether that is the case or not, nor do I know the current state of enthusiasm for it amoung the distros. I do appreciate that this has been a long standing issue and I would be very happy to see it resolved. As you are aware, we have open bugs on this issue: #435906 also related are #480002 and #207697 Steve. From esggrupos at gmail.com Mon Jun 8 09:22:53 2009 From: esggrupos at gmail.com (ESGLinux) Date: Mon, 8 Jun 2009 11:22:53 +0200 Subject: [Linux-cluster] question about 2 nodes cluster Message-ID: <3128ba140906080222w22d5c2b4y63d9ee4df80b38bc@mail.gmail.com> Hi all, I have one existential question about two nodes cluster. I have read that for 2 nodes cluster is necessary a third element to give stabilty to the cluster. One way is to add a third node, so its not a 2 nodes cluster. For me its not an answer because It becomes another kind of cluster ( 3 nodes cluster) Other, is to use qdisk (this is the way I?m trying nowadays) My question is if it?s absollutely necesary this third element in the architecture. and If so which are the options, ? thanks in advance ESG -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From dist-list at LEXUM.UMontreal.CA Mon Jun 8 13:16:41 2009 From: dist-list at LEXUM.UMontreal.CA (FM) Date: Mon, 08 Jun 2009 09:16:41 -0400 Subject: [Linux-cluster] Xen , Out of memory and dom0-min-mem, dom0_mem Message-ID: <4A2D0F39.7050308@lexum.umontreal.ca> Hello 2 of 3 of my xen nodes ( dom0) died tonight because of out of memory error. These servers are only running Xen and cluster suite packages. I googled the error and everything point out to dom0-min-mem and grub dom0_mem options dom0-min-mem : I left it by default : (dom0-min-mem 256) I read about on the internet but still do not understand this parameter and if if should be =0 on servers any advice ? tx ! here are some info about one server : [root at cluster01-node1 xen]# virsh dominfo 0 Id: 0 Name: Domain-0 UUID: 00000000-0000-0000-0000-000000000000 OS Type: linux State: running CPU(s): 8 CPU time: 182.8s Max memory: no limit Used memory: 33554544 kB host : cluster01-node1.cluster.lexum.pri release : 2.6.18-128.1.10.el5xen version : #1 SMP Thu May 7 11:07:18 EDT 2009 machine : x86_64 nr_cpus : 8 nr_nodes : 1 sockets_per_node : 2 cores_per_socket : 4 threads_per_core : 1 cpu_mhz : 2833 hw_caps : bfebfbff:20000800:00000000:00000140:040ce3bd:00000000:00000001 total_memory : 36861 free_memory : 963 node_to_cpu : node0:0-7 xen_major : 3 xen_minor : 1 xen_extra : .2-128.1.10.el5 xen_caps : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64 xen_pagesize : 4096 platform_params : virt_start=0xffff800000000000 xen_changeset : unavailable cc_compiler : gcc version 4.1.2 20080704 (Red Hat 4.1.2-44) cc_compile_by : mockbuild cc_compile_domain : centos.org cc_compile_date : Thu May 7 10:28:47 EDT 2009 xend_config_format : 2 From rmicmirregs at gmail.com Mon Jun 8 14:44:39 2009 From: rmicmirregs at gmail.com (Rafael =?ISO-8859-1?Q?Mic=F3?= Miranda) Date: Mon, 08 Jun 2009 16:44:39 +0200 Subject: [Linux-Cluster] Submitting two new resource plugins to the project In-Reply-To: <31EF80F0-B621-4889-9605-9F50A431F8AF@redhat.com> References: <1243883850.6761.2.camel@mecatol> <1243919063.24866.14.camel@cerberus.int.fabbione.net> <1244046520.6750.6.camel@mecatol> <1244141327.6771.6.camel@mecatol> <31EF80F0-B621-4889-9605-9F50A431F8AF@redhat.com> Message-ID: <1244472279.7104.1.camel@mecatol> Hi Jonathan El jue, 04-06-2009 a las 16:39 -0500, Jonathan Brassow escribi?: > On Jun 4, 2009, at 1:48 PM, Rafael Mic? Miranda wrote: > > I am sorry, I have not received your e-mail yet. I suppose it could > have been caught by my spam filter. Could you please try to send > again to: jbrassow at redhat.com? > > If that doesn't work, then I can send you a web address where you can > upload the code. > > thanks, > brassow > > P.S. It seems I get all your messages on linux-cluster at redhat.com, > but I'm not seeing the others... > I send you a e-mail with the files last Thursday, but 'cause i see no feedback I think you did not receive it. Please tell me any other way I can upload the code. Thanks, -- Rafael Mic? 
Miranda From rmicmirregs at gmail.com Mon Jun 8 14:50:06 2009 From: rmicmirregs at gmail.com (Rafael =?ISO-8859-1?Q?Mic=F3?= Miranda) Date: Mon, 08 Jun 2009 16:50:06 +0200 Subject: [Linux-Cluster] Submitting two new resource plugins to the project In-Reply-To: <1244443880.3665.3.camel@cerberus.int.fabbione.net> References: <1243883850.6761.2.camel@mecatol> <1243919063.24866.14.camel@cerberus.int.fabbione.net> <1244046520.6750.6.camel@mecatol> <1244443880.3665.3.camel@cerberus.int.fabbione.net> Message-ID: <1244472606.7104.8.camel@mecatol> Hi Fabio, El lun, 08-06-2009 a las 08:51 +0200, Fabio M. Di Nitto escribi?: > > Pretty sure it's right. Did you subscribe to the mailing list before > posting? > > > Fabio > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster I tried to explain I found no references to that list, cluster-devel at redhat.com, anywhere in the CMAN Project webpage so no, I have not subscribed to that list yet. Now I have taken a look at Google and I found this URL: http://www.redhat.com/mailman/listinfo/cluster-devel I will subscribe to it just now. Thanks, -- Rafael Mic? Miranda From cthulhucalling at gmail.com Mon Jun 8 15:13:08 2009 From: cthulhucalling at gmail.com (Ian Hayes) Date: Mon, 8 Jun 2009 11:13:08 -0400 Subject: [Linux-cluster] question about 2 nodes cluster In-Reply-To: <3128ba140906080222w22d5c2b4y63d9ee4df80b38bc@mail.gmail.com> References: <3128ba140906080222w22d5c2b4y63d9ee4df80b38bc@mail.gmail.com> Message-ID: <36df569a0906080813p27c9b8ccu5a89e210887bfdf1@mail.gmail.com> You don't need a third node or quorum disk... A 2-node cluster is sort of a special case, which is why there is a special config line for a setup with only 2 nodes. I'm running a couple of 2-node clusters right now with no quorum disk. The main issue I've encountered is that they can go split-brain easily, and you get to watch both nodes fence each other off endlessly. Adding clean_start="1" to the fence_daemon line helps prevent this, but a quorum disk would be better if you're absolutely committed to a 2-node setup. I'm running a test cluster at this moment with 3 nodes and a 1Gb quorum partition that is shared out via iSCSI. I sort of discovered that I needed a quorum disk the hard way after taking 2 nodes down and the suriviving node gave up due to the cluster being inquorate. On Mon, Jun 8, 2009 at 5:22 AM, ESGLinux wrote: > Hi all, > I have one existential question about two nodes cluster. I have read that > for 2 nodes cluster is necessary a third element to give stabilty to the > cluster. > > One way is to add a third node, so its not a 2 nodes cluster. For me its > not an answer because It becomes another kind of cluster ( 3 nodes cluster) > > Other, is to use qdisk (this is the way I?m trying nowadays) > > My question is if it?s absollutely necesary this third element in the > architecture. and If so which are the options, ? > > thanks in advance > > ESG > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From teigland at redhat.com Mon Jun 8 15:06:52 2009 From: teigland at redhat.com (David Teigland) Date: Mon, 8 Jun 2009 10:06:52 -0500 Subject: [Linux-cluster] "dlm_controld[nnnn]: cluster is down, exiting" on node1 when starting node2 In-Reply-To: <1244221838.2626.29.camel@localhost.localdomain> References: <20090605151421.GB28143@redhat.com> <20090605160455.GD28143@redhat.com> <20090605164951.GE28143@redhat.com> <1244221838.2626.29.camel@localhost.localdomain> Message-ID: <20090608150652.GA8734@redhat.com> On Fri, Jun 05, 2009 at 10:10:38AM -0700, Steven Dake wrote: > 99.9% of the time there would be a core file in /var/lib/openais/core* > if aisexec faults. We have not seen faults during normal operations for > years in a released version under typical gfs2 usage scenarios. If > there is no core, it means some other component failed, exited, and > caused that node to be fenced, or the core file could not be written by > the OS because of some other OS specific failure. That's why it would be so valuable to leave a simple "I'm failing" message. That and the fact that people don't naturally know to go looking for a /var/lib/openais/core file when everything falls apart. Dave From esggrupos at gmail.com Mon Jun 8 15:31:03 2009 From: esggrupos at gmail.com (ESGLinux) Date: Mon, 8 Jun 2009 17:31:03 +0200 Subject: [Linux-cluster] question about 2 nodes cluster In-Reply-To: <36df569a0906080813p27c9b8ccu5a89e210887bfdf1@mail.gmail.com> References: <3128ba140906080222w22d5c2b4y63d9ee4df80b38bc@mail.gmail.com> <36df569a0906080813p27c9b8ccu5a89e210887bfdf1@mail.gmail.com> Message-ID: <3128ba140906080831x716faa70i395c0424e0e41cde@mail.gmail.com> Thank you Ian, I?l take your answer in account, Greetings, ESG 2009/6/8 Ian Hayes > You don't need a third node or quorum disk... A 2-node cluster is sort of a > special case, which is why there is a special config line for a setup with > only 2 nodes. I'm running a couple of 2-node clusters right now with no > quorum disk. The main issue I've encountered is that they can go split-brain > easily, and you get to watch both nodes fence each other off endlessly. > Adding clean_start="1" to the fence_daemon line helps prevent this, but a > quorum disk would be better if you're absolutely committed to a 2-node > setup. > > I'm running a test cluster at this moment with 3 nodes and a 1Gb quorum > partition that is shared out via iSCSI. I sort of discovered that I needed a > quorum disk the hard way after taking 2 nodes down and the suriviving node > gave up due to the cluster being inquorate. > > > On Mon, Jun 8, 2009 at 5:22 AM, ESGLinux wrote: > >> Hi all, >> I have one existential question about two nodes cluster. I have read that >> for 2 nodes cluster is necessary a third element to give stabilty to the >> cluster. >> >> One way is to add a third node, so its not a 2 nodes cluster. For me its >> not an answer because It becomes another kind of cluster ( 3 nodes cluster) >> >> Other, is to use qdisk (this is the way I?m trying nowadays) >> >> My question is if it?s absollutely necesary this third element in the >> architecture. and If so which are the options, ? >> >> thanks in advance >> >> ESG >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From brem.belguebli at gmail.com Mon Jun 8 18:06:37 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Mon, 8 Jun 2009 20:06:37 +0200 Subject: [Linux-cluster] Networking guidelines for RHCS across datacenters In-Reply-To: <1244224870.2626.37.camel@localhost.localdomain> References: <7207d96f0906021906mb731687j6964f4a3466b626e@mail.gmail.com> <29ae894c0906050222naf68973r7a68b3c52eeeead@mail.gmail.com> <7207d96f0906050247s54c5c468td31fceeabe1a8da3@mail.gmail.com> <29ae894c0906051040v64a92e55je6c3568e14d5a20f@mail.gmail.com> <1244224870.2626.37.camel@localhost.localdomain> Message-ID: <29ae894c0906081106qda49510he49c558f45d56045@mail.gmail.com> Hello, Here's a link to illustrate the kind of setup I'm trying to setup with RHCS. http://brehak.blogspot.com/2009/06/disaster-recovery-setup.html Regards 2009/6/5, Steven Dake : > > On Fri, 2009-06-05 at 19:40 +0200, brem belguebli wrote: > > > > > > 2009/6/5, Jon Schulz : > > Yes I would be interested to see what products you are > > currently using to achieve this. In my proposed setup we are > > actually completely database transaction driven. The problem > > is the people higher up want active database <-> database > > replication which will be problematic I know. > > > > Still we also use DB (Oracle, Sybase) replication mechanisms > > to address accidental data corruption, as mirroring being synchonous, > > if something happens (someone intentionnaly alters the DB or > > filesystem corruption) it will be on both legs of the mirror. > > > > > > > > Outside of the data side of the equation, how tolerant is the > > cluster network/heartbeat to latency assuming no packet loss? > > Or more to the point, at what point does everyone in their > > past experience see the heartbeat network become unreliable, > > latency wise. E.g. anything over 30ms? > > > > The default configured timers for failure detection are quite high and > retransmit many times for failed packets (for lossy networks). 30msec > latency would pose no major problem, except performance. If you used > posix locking and your machine->machine latency was 30msec, each posix > lock would take 30.03 msec to grant or more, which may not meet your > performance requirements. > > I can't recommend wan connections with totem (the protocol used in rhcs) > because of the performance characteristics. If the performance of posix > locks is not a high requirement, it should be functional. > > Regards > -steve > > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fajar at fajar.net Tue Jun 9 03:15:23 2009 From: fajar at fajar.net (Fajar A. Nugraha) Date: Tue, 9 Jun 2009 10:15:23 +0700 Subject: [Linux-cluster] Xen , Out of memory and dom0-min-mem, dom0_mem In-Reply-To: <4A2D0F39.7050308@lexum.umontreal.ca> References: <4A2D0F39.7050308@lexum.umontreal.ca> Message-ID: <7207d96f0906082015r4e79325j22dc31640e91637d@mail.gmail.com> On Mon, Jun 8, 2009 at 8:16 PM, FM wrote: > Hello > 2 of 3 of my xen nodes ( dom0) died tonight because of out of memory error. > These servers are only running Xen and cluster suite packages. > I googled the error and everything point out to dom0-min-mem and grub > dom0_mem options > > dom0-min-mem : I left it by default : (dom0-min-mem 256) > > I read about on the internet but still do not understand this parameter and > if if should be =0 on servers > > any advice ? 
This might be more suitable on xen-users lists. Anyway, what does xm list show? e.g. how much memory does dom0 currently use? from my experince 256MB (give or take a few) is the bare minimum for RHEL/Centos5 Xen dom0 with phy:/. If you want to use tap:aio, you need to add more. If you want to run other services (snmpd, httpd, cluster) you need to have more. For your usage, my guess is you should start with at least 1GB for dom0, and monitor its usage. -- Fajar From swhiteho at redhat.com Tue Jun 9 07:57:49 2009 From: swhiteho at redhat.com (Steven Whitehouse) Date: Tue, 09 Jun 2009 08:57:49 +0100 Subject: [Linux-cluster] Re: Still having GFS2 mount hang In-Reply-To: <89c397150906081537i661beb55ke5f0284909f9e71f@mail.gmail.com> References: <89c397150906081537i661beb55ke5f0284909f9e71f@mail.gmail.com> Message-ID: <1244534269.29604.789.camel@localhost.localdomain> Hi, On Mon, 2009-06-08 at 18:37 -0400, William A. (Andy) Adamson wrote: > Hello > > I'm still not able to mount GFS2 on both of my 2 node cluster nodes at > once. Please! Any help is welcome... > > Setup: 2 Fedora 10 VM sharing disk over AOE from third vm machine. > 2.6.30-rc7 kernel with latest Fedora 10 rpm updates. The cluster.conf > is attached. > > I gdb'ed mount.gfs2 on the 2nd node - it hangs trying to read in gfsc_fs_result. > > -->Andy > > gfs2_controld -D output on first node: mount /gfs2. > > 1244499915 client connection 6 fd 17 > 1244499915 join: /gfs2 gfs2 lock_dlm androsGFS2:ClusterFS rw,noauto > /dev/etherd/e3.2p1 > 1244499915 ClusterFS join: cluster name matches: androsGFS2 > 1244499915 ClusterFS process_dlmcontrol register nodeid 0 result 0 > 1244499915 ClusterFS add_change cg 1 joined nodeid 2 > 1244499915 ClusterFS add_change cg 1 we joined > 1244499915 ClusterFS add_change cg 1 counts member 1 joined 1 remove 0 failed 0 > 1244499915 ClusterFS wait_conditions skip for zero started_count > 1244499915 ClusterFS send_start cg 1 id_count 1 om 0 nm 1 oj 0 nj 0 > 1244499915 ClusterFS receive_start 2:1 len 92 > 1244499915 ClusterFS match_change 2:1 matches cg 1 > 1244499915 ClusterFS wait_messages cg 1 got all 1 > 1244499915 ClusterFS pick_first_recovery_master low 2 old 0 > 1244499915 ClusterFS sync_state all_nodes_new first_recovery_needed master 2 > 1244499915 ClusterFS create_old_nodes all new > 1244499915 ClusterFS create_new_nodes 2 ro 0 spect 0 > 1244499915 ClusterFS create_failed_journals all new > 1244499915 ClusterFS create_new_journals 2 gets jid 0 > 1244499915 ClusterFS apply_recovery first start_kernel > 1244499915 ClusterFS start_kernel cg 1 member_count 1 > 1244499915 ClusterFS set > /sys/fs/gfs2/androsGFS2:ClusterFS/lock_module/block to 0 > 1244499915 ClusterFS set open > /sys/fs/gfs2/androsGFS2:ClusterFS/lock_module/block error -1 2 This is returning -ENOENT. Do you have sysfs mounted somewhere strange? 
> 1244499915 ClusterFS client_reply_join_full ci 6 result 0 > hostdata=jid=0:id=1562653156:first=1 > 1244499915 client_reply_join ClusterFS ci 6 result 0 > 1244499915 uevent: add@/fs/gfs2/androsGFS2:ClusterFS > 1244499915 kernel: add@ androsGFS2:ClusterFS > 1244499915 uevent: change@/fs/gfs2/androsGFS2:ClusterFS > 1244499915 kernel: change@ androsGFS2:ClusterFS > 1244499915 uevent: change@/fs/gfs2/androsGFS2:ClusterFS > 1244499915 kernel: change@ androsGFS2:ClusterFS > 1244499915 uevent: change@/fs/gfs2/androsGFS2:ClusterFS > 1244499915 kernel: change@ androsGFS2:ClusterFS > 1244499915 mount_done: ClusterFS result 0 > 1244499915 connection 6 read error -1 I'm not sure if this is "normal" or not, but it may well point towards what is going wrong here, Steve. From teigland at redhat.com Tue Jun 9 14:01:12 2009 From: teigland at redhat.com (David Teigland) Date: Tue, 9 Jun 2009 09:01:12 -0500 Subject: [Linux-cluster] Re: Still having GFS2 mount hang In-Reply-To: <1244534269.29604.789.camel@localhost.localdomain> References: <89c397150906081537i661beb55ke5f0284909f9e71f@mail.gmail.com> <1244534269.29604.789.camel@localhost.localdomain> Message-ID: <20090609140112.GB13914@redhat.com> On Tue, Jun 09, 2009 at 08:57:49AM +0100, Steven Whitehouse wrote: > > I gdb'ed mount.gfs2 on the 2nd node - it hangs trying to read in > > gfsc_fs_result. That's an unusual problem, are mount.gfs2 and gfs_controld from the same release? nothing in /var/log/messages? is selinux turned off? > > gfs2_controld -D output on first node: mount /gfs2. > > > > 1244499915 client connection 6 fd 17 > > 1244499915 join: /gfs2 gfs2 lock_dlm androsGFS2:ClusterFS rw,noauto > > /dev/etherd/e3.2p1 > > 1244499915 ClusterFS join: cluster name matches: androsGFS2 > > 1244499915 ClusterFS process_dlmcontrol register nodeid 0 result 0 > > 1244499915 ClusterFS add_change cg 1 joined nodeid 2 > > 1244499915 ClusterFS add_change cg 1 we joined > > 1244499915 ClusterFS add_change cg 1 counts member 1 joined 1 remove 0 failed 0 > > 1244499915 ClusterFS wait_conditions skip for zero started_count > > 1244499915 ClusterFS send_start cg 1 id_count 1 om 0 nm 1 oj 0 nj 0 > > 1244499915 ClusterFS receive_start 2:1 len 92 > > 1244499915 ClusterFS match_change 2:1 matches cg 1 > > 1244499915 ClusterFS wait_messages cg 1 got all 1 > > 1244499915 ClusterFS pick_first_recovery_master low 2 old 0 > > 1244499915 ClusterFS sync_state all_nodes_new first_recovery_needed master 2 > > 1244499915 ClusterFS create_old_nodes all new > > 1244499915 ClusterFS create_new_nodes 2 ro 0 spect 0 > > 1244499915 ClusterFS create_failed_journals all new > > 1244499915 ClusterFS create_new_journals 2 gets jid 0 > > 1244499915 ClusterFS apply_recovery first start_kernel > > 1244499915 ClusterFS start_kernel cg 1 member_count 1 > > 1244499915 ClusterFS set > > /sys/fs/gfs2/androsGFS2:ClusterFS/lock_module/block to 0 > > 1244499915 ClusterFS set open > > /sys/fs/gfs2/androsGFS2:ClusterFS/lock_module/block error -1 2 > This is returning -ENOENT. Do you have sysfs mounted somewhere strange? 
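Those three checks are quick to run on the hanging node; the package names below are the usual Fedora ones, adjust if the tools were built from source:

    rpm -q gfs2-utils cman openais       # mount.gfs2 and gfs_controld should come from matching builds
    getenforce                           # Enforcing / Permissive / Disabled
    tail -n 50 /var/log/messages         # look for gfs_controld or dlm_controld complaints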
that's normal > > > 1244499915 ClusterFS client_reply_join_full ci 6 result 0 > > hostdata=jid=0:id=1562653156:first=1 > > 1244499915 client_reply_join ClusterFS ci 6 result 0 > > 1244499915 uevent: add@/fs/gfs2/androsGFS2:ClusterFS > > 1244499915 kernel: add@ androsGFS2:ClusterFS > > 1244499915 uevent: change@/fs/gfs2/androsGFS2:ClusterFS > > 1244499915 kernel: change@ androsGFS2:ClusterFS > > 1244499915 uevent: change@/fs/gfs2/androsGFS2:ClusterFS > > 1244499915 kernel: change@ androsGFS2:ClusterFS > > 1244499915 uevent: change@/fs/gfs2/androsGFS2:ClusterFS > > 1244499915 kernel: change@ androsGFS2:ClusterFS > > 1244499915 mount_done: ClusterFS result 0 > > 1244499915 connection 6 read error -1 > I'm not sure if this is "normal" or not, but it may well point towards > what is going wrong here, this is all correct Dave From teigland at redhat.com Tue Jun 9 19:36:17 2009 From: teigland at redhat.com (David Teigland) Date: Tue, 9 Jun 2009 14:36:17 -0500 Subject: [Linux-cluster] Re: Still having GFS2 mount hang In-Reply-To: <89c397150906091214k377d7213ud93ce341c3cb1167@mail.gmail.com> References: <89c397150906081537i661beb55ke5f0284909f9e71f@mail.gmail.com> <1244534269.29604.789.camel@localhost.localdomain> <20090609140112.GB13914@redhat.com> <89c397150906091214k377d7213ud93ce341c3cb1167@mail.gmail.com> Message-ID: <20090609193616.GA22800@redhat.com> On Tue, Jun 09, 2009 at 03:14:09PM -0400, William A. (Andy) Adamson wrote: > Hi David > > Thanks for looking at this. The kernel does report a recursive lock that's harmless > issue when running /etc/init.d/cman. Details inline. I can't see anything wrong, I'm going to check whether we have or can get some more recent packages, since 2.99.12 is a bit old, it looks like you're on fedora 10? Dave From gordan at bobich.net Tue Jun 9 21:13:25 2009 From: gordan at bobich.net (Gordan Bobic) Date: Tue, 09 Jun 2009 22:13:25 +0100 Subject: [Linux-cluster] Prototype Fencing Agent for Raritan eRIC G4 Message-ID: <4A2ED075.5020207@bobich.net> As the subject line says. The agent is attached. As all currently included fencing agents, this one is also written in Perl, and has the same requirements and dependencies as the DRAC fencing agent (Net::Telnet, Getopt::Std). What does it take to get it included in the distro? ;) Many thanks. Gordan -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: fence_eric URL: From gordan at bobich.net Tue Jun 9 21:24:56 2009 From: gordan at bobich.net (Gordan Bobic) Date: Tue, 09 Jun 2009 22:24:56 +0100 Subject: [Linux-cluster] Redhat Lists Question Message-ID: <4A2ED328.4010906@bobich.net> Sorry, not related to clustering, but can anybody point me at the best Redhat list to post suggested patches to? I just wrote a (RHEL5 specific) patch aimed at laptops with (cheap) SSDs that aims to reduce the number of disk writes and prolong flash life. I looked at the list of lists here: http://www.redhat.com/mailman/listinfo and it looks like there could be several relevant ones, but I'm not sure which ones are deprecated and no longer used. Can anybody please advise? Many thanks. 
Gordan From tom at netspot.com.au Wed Jun 10 01:27:10 2009 From: tom at netspot.com.au (Tom Lanyon) Date: Wed, 10 Jun 2009 10:57:10 +0930 Subject: [Linux-cluster] Networking guidelines for RHCS across datacenters In-Reply-To: <29ae894c0906050222naf68973r7a68b3c52eeeead@mail.gmail.com> References: <7207d96f0906021906mb731687j6964f4a3466b626e@mail.gmail.com> <29ae894c0906050222naf68973r7a68b3c52eeeead@mail.gmail.com> Message-ID: On 05/06/2009, at 6:52 PM, brem belguebli wrote: > Hello, > > That sounds pretty much to the question I've asked to this mailing- > list last May (https://www.redhat.com/archives/linux-cluster/2009-May/msg00093.html > ). > > We are in the same setup, already doing "Geo-cluster" with other > technos and we are looking at RHCS to provide us the same service > level. > > Latency could be a problem indeed if too high , but in a lot of > cases (many companies for which I've worked), datacenters are a few > tens of kilometers far, with a latency max close to 1 ms, which is > not a problem. > > Let's consider this kind of setup, 2 datacenters far from each other > by 1 ms delay, each hosting a SAN array, each of them connected to 2 > SAN fabrics extended between the 2 sites. > > What reason would prevent us from building Geo-clusters without > having to rely on a database replication mechanism, as the setup I > would like to implement would also be used to provide NFS services > that are disaster recovery proof. > > Obviously, such setup should rely on LVM mirroring to allow a node > hosting a service to be able to write to both local and distant SAN > LUN's. > > Brem I have been wondering whether the same could be done (cross-site RHCS) using SAN replication and multipath, avoiding LVM mirroring. This is going to depend strongly on the storage replication failover time; if the IO to shared storage devices is queued for too long, the cluster will stop. Does anyone have any experience with how quick this would need to happen for RHCS to tolerate it? I have been meaning to test this but have not had a chance... Tom From tom at netspot.com.au Wed Jun 10 01:29:51 2009 From: tom at netspot.com.au (Tom Lanyon) Date: Wed, 10 Jun 2009 10:59:51 +0930 Subject: [Linux-cluster] System load at 1.00 for gfs2? In-Reply-To: <1242655685.29604.345.camel@localhost.localdomain> References: <20090513173511.GA5992@esri.com> <8a5668960905180135p118312bfj6625f8513f477674@mail.gmail.com> <20090518140201.GA7429@esri.com> <1242655685.29604.345.camel@localhost.localdomain> Message-ID: <13E5ADD5-B0C6-4339-8D86-5E46DA37B6A6@netspot.com.au> On 18/05/2009, at 11:38 PM, Steven Whitehouse wrote: > The fix has gone in to RHEL 5.4. I have a feeling that it might also > go > into 5.3.z but I'm not 100% sure what the timescales are there. The > bug > is known and fixed in upstream too. > > It isn't actually using any more CPU, its just that the LA is > incremented by 1. So a fix is already on its way, > > Steve. Great, we experience this bug too. It doesn't cause any problems but confuses some of the administrators... :) Tom From sghosh at redhat.com Wed Jun 10 01:37:52 2009 From: sghosh at redhat.com (Subhendu Ghosh) Date: Tue, 09 Jun 2009 21:37:52 -0400 Subject: [Linux-cluster] Redhat Lists Question In-Reply-To: <4A2ED328.4010906@bobich.net> References: <4A2ED328.4010906@bobich.net> Message-ID: <4A2F0E70.3060407@redhat.com> Gordan Bobic wrote: > Sorry, not related to clustering, but can anybody point me at the best > Redhat list to post suggested patches to? 
I just wrote a (RHEL5 > specific) patch aimed at laptops with (cheap) SSDs that aims to reduce > the number of disk writes and prolong flash life. I looked at the list > of lists here: > > http://www.redhat.com/mailman/listinfo > > and it looks like there could be several relevant ones, but I'm not sure > which ones are deprecated and no longer used. Can anybody please advise? > > Many thanks. > > Gordan > Ideally, you want to post the patch to the upstream component. Posting it to the Red Hat Bugzilla under the approriate component would also help. http://bugzilla.redhat.com For RHEL5, the discussion list is: https://www.redhat.com/mailman/listinfo/rhelv5-list SSDs typically work under SATA chipsets - so libata http://ata.wiki.kernel.org/index.php/Main_Page -regards Subhendu -- Subhendu Ghosh Red Hat From gordan at bobich.net Wed Jun 10 08:27:49 2009 From: gordan at bobich.net (Gordan Bobic) Date: Wed, 10 Jun 2009 09:27:49 +0100 Subject: [Linux-cluster] Redhat Lists Question In-Reply-To: <4A2F0E70.3060407@redhat.com> References: <4A2ED328.4010906@bobich.net> <4A2F0E70.3060407@redhat.com> Message-ID: <4A2F6E85.1090601@bobich.net> Subhendu Ghosh wrote: > Ideally, you want to post the patch to the upstream component. Thanks for responding. It's mostly an initscript patch that checks if for file systems mounted on tmpfs (e.g. if we put /var/lock, /var/run or similar there to save hitting the disk) and saves and restores subtree structure (if changed) at shutdown and startup. It saves about 100-200 writes on startup/shutdown. I thought init scripts are pretty distro specific, and since this is mostly an init script patch... > Posting it to the Red Hat Bugzilla under the approriate component would also help. > http://bugzilla.redhat.com It's not a bug fix, it's a feature addition. > For RHEL5, the discussion list is: > https://www.redhat.com/mailman/listinfo/rhelv5-list OK, I'll post there. Thanks. > SSDs typically work under SATA chipsets - so libata > http://ata.wiki.kernel.org/index.php/Main_Page This patch is not that low a level. :) Gordan From rajpurush at gmail.com Wed Jun 10 08:41:21 2009 From: rajpurush at gmail.com (Rajeev P) Date: Wed, 10 Jun 2009 14:11:21 +0530 Subject: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 Message-ID: <7a271b290906100141o296ff97w72b97fe2de70b0f4@mail.gmail.com> I wanted to know if fence_scsi is supported in a multipath environment for RHEL5.3 release. In earlier releases of RHEL5 fence_scsi was not supported in a multipath environment for RHEL5.3 release. If I am not wrong, this was because the DM-MPIO driver forwarded the registration/unregistration commands on only on one of the physical paths of a LUN. Ideally it should have passed the commands on all physical paths. For RHEL5.3, is this issue resolved so that I can fence_scsi in multipath environment. Thanks in advance. Rajeev -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom at netspot.com.au Wed Jun 10 10:50:17 2009 From: tom at netspot.com.au (Tom Lanyon) Date: Wed, 10 Jun 2009 20:20:17 +0930 Subject: [Linux-cluster] Redhat Lists Question In-Reply-To: <4A2F6E85.1090601@bobich.net> References: <4A2ED328.4010906@bobich.net> <4A2F0E70.3060407@redhat.com> <4A2F6E85.1090601@bobich.net> Message-ID: On 10/06/2009, at 5:57 PM, Gordan Bobic wrote: > Subhendu Ghosh wrote: >> Posting it to the Red Hat Bugzilla under the approriate component >> would also help. >> http://bugzilla.redhat.com > > It's not a bug fix, it's a feature addition. 
The bugzilla is for "defects" which, along with bugs, includes requests for enhancements, etc. A quick search of the bugzilla returns an existing bug which seems to be in line with your requirements: Bug 223722 - RFE: add functionality to persist temporary state back to original location https://bugzilla.redhat.com/show_bug.cgi?id=223722 Cheers, Tom From macbogucki at gmail.com Wed Jun 10 11:16:35 2009 From: macbogucki at gmail.com (Maciej Bogucki) Date: Wed, 10 Jun 2009 13:16:35 +0200 Subject: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 In-Reply-To: <7a271b290906100141o296ff97w72b97fe2de70b0f4@mail.gmail.com> References: <7a271b290906100141o296ff97w72b97fe2de70b0f4@mail.gmail.com> Message-ID: <4A2F9613.8040208@gmail.com> Rajeev P pisze: > I wanted to know if fence_scsi is supported in a multipath environment > for RHEL5.3 release. > > In earlier releases of RHEL5 fence_scsi was not supported in a > multipath environment for RHEL5.3 release. If I am not wrong, this was > because the DM-MPIO driver forwarded the registration/unregistration > commands on only on one of the physical paths of a LUN. Ideally it > should have passed the commands on all physical paths. > > For RHEL5.3, is this issue resolved so that I can fence_scsi in > multipath environment. > Hello, I don't think it's supported [1] [1] - https://www.redhat.com/archives/rhelv5-list/2009-January/msg00092.html Best Regards Maciej Bogucki From gordan at bobich.net Wed Jun 10 12:07:28 2009 From: gordan at bobich.net (Gordan Bobic) Date: Wed, 10 Jun 2009 13:07:28 +0100 Subject: [Linux-cluster] Redhat Lists Question In-Reply-To: References: <4A2ED328.4010906@bobich.net> <4A2F0E70.3060407@redhat.com> <4A2F6E85.1090601@bobich.net> Message-ID: <4A2FA200.5030500@bobich.net> Tom Lanyon wrote: >>> Posting it to the Red Hat Bugzilla under the approriate component >>> would also help. >>> http://bugzilla.redhat.com >> >> It's not a bug fix, it's a feature addition. > > The bugzilla is for "defects" which, along with bugs, includes requests > for enhancements, etc. > > A quick search of the bugzilla returns an existing bug which seems to be > in line with your requirements: > Bug 223722 - RFE: add functionality to persist temporary state back > to original location > https://bugzilla.redhat.com/show_bug.cgi?id=223722 Thanks for that, really appreciated. Gordan From alfredo.moralejo at roche.com Wed Jun 10 12:35:17 2009 From: alfredo.moralejo at roche.com (Moralejo, Alfredo) Date: Wed, 10 Jun 2009 14:35:17 +0200 Subject: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 In-Reply-To: <4A2F9613.8040208@gmail.com> References: <7a271b290906100141o296ff97w72b97fe2de70b0f4@mail.gmail.com> <4A2F9613.8040208@gmail.com> Message-ID: <18106F5AEC2A20499826B0BE8D04F0411DE1D04E@rbamsem701.emea.roche.com> As cluster wiki: http://sources.redhat.com/cluster/wiki/SCSI_FencingConfig "Multipath devices are currently only supported for RHEL 5.0 and later with the use of device-mapper-multipath." Additionally, I found in a HP document info about how to set up cluster. Acconding to that information it's supported with version 5.3: http://docs.hp.com/en/15689/Migrating_SGLX_cluster_to_RHCS_Cluster.pdf It's not a Red Hat document but they are partners so.... 
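For anyone trying SCSI fencing with dm-multipath, the wiki page above comes down to registering a key per node on the shared LUNs at boot and letting fence_scsi remove a failed node's key later. A typical sequence, assuming the scsi_reserve init script shipped with the RHEL5 cman package, would be:

    chkconfig scsi_reserve on              # register this node's key on the cluster LUNs at boot
    service scsi_reserve start
    sg_persist -i -k -d /dev/mapper/mpath0 # read keys back: every cluster node should show up once

The device name is illustrative; run the read-keys check against each clustered LUN.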
-----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Maciej Bogucki Sent: Wednesday, June 10, 2009 1:17 PM To: linux clustering Subject: Re: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 Rajeev P pisze: > I wanted to know if fence_scsi is supported in a multipath environment > for RHEL5.3 release. > > In earlier releases of RHEL5 fence_scsi was not supported in a > multipath environment for RHEL5.3 release. If I am not wrong, this was > because the DM-MPIO driver forwarded the registration/unregistration > commands on only on one of the physical paths of a LUN. Ideally it > should have passed the commands on all physical paths. > > For RHEL5.3, is this issue resolved so that I can fence_scsi in > multipath environment. > Hello, I don't think it's supported [1] [1] - https://www.redhat.com/archives/rhelv5-list/2009-January/msg00092.html Best Regards Maciej Bogucki -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From teigland at redhat.com Wed Jun 10 14:13:51 2009 From: teigland at redhat.com (David Teigland) Date: Wed, 10 Jun 2009 09:13:51 -0500 Subject: [Linux-cluster] Re: Still having GFS2 mount hang In-Reply-To: <89c397150906100633p5a0b15d9vab2da39107d362fc@mail.gmail.com> References: <89c397150906081537i661beb55ke5f0284909f9e71f@mail.gmail.com> <1244534269.29604.789.camel@localhost.localdomain> <20090609140112.GB13914@redhat.com> <89c397150906091214k377d7213ud93ce341c3cb1167@mail.gmail.com> <20090609193616.GA22800@redhat.com> <89c397150906100633p5a0b15d9vab2da39107d362fc@mail.gmail.com> Message-ID: <20090610141351.GA18341@redhat.com> On Wed, Jun 10, 2009 at 09:33:33AM -0400, William A. (Andy) Adamson wrote: > On Tue, Jun 9, 2009 at 3:36 PM, David Teigland wrote: > > On Tue, Jun 09, 2009 at 03:14:09PM -0400, William A. (Andy) Adamson wrote: > >> Hi David > >> > >> Thanks for looking at this. The kernel does report a recursive lock > > > > that's harmless > > > >> issue when running /etc/init.d/cman. Details inline. > > > > I can't see anything wrong, I'm going to check whether we have or can get > > some more recent packages, since 2.99.12 is a bit old, it looks like > > you're on fedora 10? > > yes. I could move to fedora 11. I did some checking, and unfortunately 2.99.12 is the newest version we've packaged for either f10 or f11. It has something to do with the corosync api's changing too rapidly, and the trouble with patching and rebuilding all the packages that depend on it because they are using various versions of the api... the hope is it will all be better when a stable corosync 1.0 release happens. In the mean time, Fabio was kind enough to make a set of srpms of all the latest versions, http://fabbione.fedorapeople.org/srpm/ I just built and installed corosync, openais and cluster srpms from there on my fedora 10 machine. Started the cluster and mounted gfs with the result. I limited what I built/installed to avoid some annoying dependencies, to rpmbuild --rebuild corosync rpm -Uhv corosync* rpmbuild --rebuild openais rpm -Uhv openais* rpmbuild --rebuild cluster rpm -Uhv cluster* rpm -Uhv gfs* rpm -Uhv --nodeps cman* Dave From fdinitto at redhat.com Wed Jun 10 16:59:44 2009 From: fdinitto at redhat.com (Fabio M. 
Di Nitto) Date: Wed, 10 Jun 2009 18:59:44 +0200 Subject: [Linux-cluster] Re: Still having GFS2 mount hang In-Reply-To: <20090610141351.GA18341@redhat.com> References: <89c397150906081537i661beb55ke5f0284909f9e71f@mail.gmail.com> <1244534269.29604.789.camel@localhost.localdomain> <20090609140112.GB13914@redhat.com> <89c397150906091214k377d7213ud93ce341c3cb1167@mail.gmail.com> <20090609193616.GA22800@redhat.com> <89c397150906100633p5a0b15d9vab2da39107d362fc@mail.gmail.com> <20090610141351.GA18341@redhat.com> Message-ID: <1244653184.3665.77.camel@cerberus.int.fabbione.net> On Wed, 2009-06-10 at 09:13 -0500, David Teigland wrote: > On Wed, Jun 10, 2009 at 09:33:33AM -0400, William A. (Andy) Adamson wrote: > > On Tue, Jun 9, 2009 at 3:36 PM, David Teigland wrote: > > > On Tue, Jun 09, 2009 at 03:14:09PM -0400, William A. (Andy) Adamson wrote: > > >> Hi David > > >> > > >> Thanks for looking at this. The kernel does report a recursive lock > > > > > > that's harmless > > > > > >> issue when running /etc/init.d/cman. Details inline. > > > > > > I can't see anything wrong, I'm going to check whether we have or can get > > > some more recent packages, since 2.99.12 is a bit old, it looks like > > > you're on fedora 10? > > > > yes. I could move to fedora 11. > > I did some checking, and unfortunately 2.99.12 is the newest version we've > packaged for either f10 or f11. It has something to do with the corosync > api's changing too rapidly, and the trouble with patching and rebuilding all > the packages that depend on it because they are using various versions of the > api... the hope is it will all be better when a stable corosync 1.0 release > happens. > > In the mean time, Fabio was kind enough to make a set of srpms of all the > latest versions, http://fabbione.fedorapeople.org/srpm/ I just built and > installed corosync, openais and cluster srpms from there on my fedora 10 > machine. Started the cluster and mounted gfs with the result. > > I limited what I built/installed to avoid some annoying dependencies, to > > rpmbuild --rebuild corosync > rpm -Uhv corosync* > rpmbuild --rebuild openais > rpm -Uhv openais* > rpmbuild --rebuild cluster > rpm -Uhv cluster* > rpm -Uhv gfs* > rpm -Uhv --nodeps cman* Just FYI, you can build fence-agents srpm from there after install clusterlib and then install full cman. If you don't need fence-agents, then use the --nodeps. Fabio From garromo at us.ibm.com Wed Jun 10 18:17:01 2009 From: garromo at us.ibm.com (Gary Romo) Date: Wed, 10 Jun 2009 12:17:01 -0600 Subject: [Linux-cluster] gfs_grow Message-ID: Can you increase GFS file systems on the fly, without unmounting or stopping processes? -Gary -------------- next part -------------- An HTML attachment was scrubbed... URL: From bmr at redhat.com Wed Jun 10 18:25:27 2009 From: bmr at redhat.com (Bryn M. Reeves) Date: Wed, 10 Jun 2009 19:25:27 +0100 Subject: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 In-Reply-To: <4A2F9613.8040208@gmail.com> References: <7a271b290906100141o296ff97w72b97fe2de70b0f4@mail.gmail.com> <4A2F9613.8040208@gmail.com> Message-ID: <1244658327.18101.287.camel@breeves.fab.redhat.com> On Wed, 2009-06-10 at 13:16 +0200, Maciej Bogucki wrote: > Rajeev P pisze: > > I wanted to know if fence_scsi is supported in a multipath environment > > for RHEL5.3 release. > > > > In earlier releases of RHEL5 fence_scsi was not supported in a > > multipath environment for RHEL5.3 release. 
If I am not wrong, this was > > because the DM-MPIO driver forwarded the registration/unregistration > > commands on only on one of the physical paths of a LUN. Ideally it > > should have passed the commands on all physical paths. > > > > For RHEL5.3, is this issue resolved so that I can fence_scsi in > > multipath environment. > > > Hello, > > I don't think it's supported [1] > > [1] - https://www.redhat.com/archives/rhelv5-list/2009-January/msg00092.html Doesn't mention it at all. Better to check the kernel ChangeLog: * Tue Oct 10 2006 Don Zickus [2.6.18-1.2725.el5] - kernel dm multipath: ioctl support (Alasdair Kergon) [207575] This was included in the RHEL5 GA kernel (2.6.18-8.el5) so the ioctl passthrough has been there all along in RHEL5. Unfortunately the bug that introduced the change is private, but the RHEL4 bug that it was cloned from is accessible: https://bugzilla.redhat.com/show_bug.cgi?id=168801 Regards, Bryn. From jumanjiman at gmail.com Wed Jun 10 18:17:56 2009 From: jumanjiman at gmail.com (Paul Morgan) Date: Wed, 10 Jun 2009 18:17:56 +0000 Subject: [Linux-cluster] gfs_grow In-Reply-To: References: Message-ID: <931927457-1244658008-cardhu_decombobulator_blackberry.rim.net-637071402-@bxe1110.bisx.prod.on.blackberry> Yes, assuming you have sufficient free extents. Just remember to add any needed journals first. -paul -----Original Message----- From: Gary Romo Date: Wed, 10 Jun 2009 12:17:01 To: Subject: [Linux-cluster] gfs_grow -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From sghosh at redhat.com Wed Jun 10 18:49:24 2009 From: sghosh at redhat.com (Subhendu Ghosh) Date: Wed, 10 Jun 2009 14:49:24 -0400 Subject: [Linux-cluster] Re: [Cluster-devel] Prototype Fencing Agent for Raritan eRIC G4 In-Reply-To: <4A2ED075.5020207@bobich.net> References: <4A2ED075.5020207@bobich.net> Message-ID: <4A300034.9050603@redhat.com> Gordan Bobic wrote: > As the subject line says. The agent is attached. > As all currently included fencing agents, this one is also written in > Perl, and has the same requirements and dependencies as the DRAC fencing > agent (Net::Telnet, Getopt::Std). > > What does it take to get it included in the distro? ;) > > Many thanks. > > Gordan > Hi Gordan Would it be possible to look at migrating this agent to SSH (more secure) or to SNMP (less screen scraping)? Look at fence_cisco as an example of snmp usage. Long term maintainability of screen scraping is an issue with firmware changes. Also it seems that card has IPMI support. If so, can use test with fence_ipmi? Would remove the need for yet-another-agent ;) -regards Subhendu -- Subhendu Ghosh Red Hat Email: sghosh at redhat.com From rohara at redhat.com Wed Jun 10 19:18:02 2009 From: rohara at redhat.com (Ryan O'Hara) Date: Wed, 10 Jun 2009 14:18:02 -0500 Subject: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 In-Reply-To: <7a271b290906100141o296ff97w72b97fe2de70b0f4@mail.gmail.com> References: <7a271b290906100141o296ff97w72b97fe2de70b0f4@mail.gmail.com> Message-ID: <20090610191802.GA12988@redhat.com> On Wed, Jun 10, 2009 at 02:11:21PM +0530, Rajeev P wrote: > I wanted to know if fence_scsi is supported in a multipath environment for > RHEL5.3 release. Yes, it is supported. > In earlier releases of RHEL5 fence_scsi was not supported in a multipath > environment for RHEL5.3 release. 
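On the earlier gfs_grow question that Paul answered: assuming the GFS volume sits on CLVM, a typical online grow looks like this (volume names, sizes and journal count are illustrative):

    lvextend -L +50G /dev/vg_cluster/lv_gfs    # grow the clustered logical volume
    gfs_jadd -j 2 /mnt/gfs                     # add journals first if more nodes will mount it
    gfs_grow /mnt/gfs                          # grow the mounted filesystem in place, no umount needed

Both gfs_jadd and gfs_grow operate on the mount point while the filesystem stays mounted.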
If I am not wrong, this was because the > DM-MPIO driver forwarded the registration/unregistration commands on only on > one of the physical paths of a LUN. Ideally it should have passed the > commands on all physical paths. > > For RHEL5.3, is this issue resolved so that I can fence_scsi in multipath > environment. > > Thanks in advance. > > Rajeev > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From gordan at bobich.net Wed Jun 10 19:24:45 2009 From: gordan at bobich.net (Gordan Bobic) Date: Wed, 10 Jun 2009 20:24:45 +0100 Subject: [Linux-cluster] Re: [Cluster-devel] Prototype Fencing Agent for Raritan eRIC G4 In-Reply-To: <4A300034.9050603@redhat.com> References: <4A2ED075.5020207@bobich.net> <4A300034.9050603@redhat.com> Message-ID: <4A30087D.3060901@bobich.net> Subhendu Ghosh wrote: > Would it be possible to look at migrating this agent to SSH (more secure) I started with the idea of doing it over ssh, but Net::SSH module seemed to be a lot less forgiving about the terminal quirkyness. I can have another go. There's also the issue of manual intervention being required to save the signatures (and where do the known hosts go?). > or to SNMP (less screen scraping)? Hmm, maybe. I haven't looked into the SNMP capability on the device, but it looks like it'll work, and probably be easier to do than SSH. > Look at fence_cisco as an example of snmp usage. Assuming they speak a compatible dialect, which may not be the case. I'll have a look. > Long term maintainability of screen scraping is an issue with firmware changes. Tell me about it. I submitted a patch for fence_drac a while back to address an issue that seems to have arisen from a firmware update inducted pattern match failure. Not only that, but I've discovered a bug on the latest eRIC G4 firmware - 04.02.00-7153 seems to have broken USB keyboard support (you'd think this was important on a remote console device!) and potentially some power button press dodgyness. The previous firmware, however - 04.02.00-6505, works OK. > Also it seems that card has IPMI support. If so, can use test with fence_ipmi? > Would remove the need for yet-another-agent ;) Sadly, my servers with these cards in them don't have IPMI support. The card only proxies it. The card supports direct power/reset button control in addition to IPMI, so this is what I'm using. But as you can see from the code, it operates only on the power on/off even for a reboot because the said servers also don't have a reset connector. I wrote this agent because I _needed_ it. :) But I'll look into the SNMP way of doing it, it sounds like it might be neater. I'll add it as an option since the telnet way is already written. What parameter should/can be used to specify such things, that is available from a cluster.conf reference? Thanks. Gordan From rvandolson at esri.com Wed Jun 10 20:54:43 2009 From: rvandolson at esri.com (Ray Van Dolson) Date: Wed, 10 Jun 2009 13:54:43 -0700 Subject: [Linux-cluster] GFS2 cluster and fencing Message-ID: <20090610205436.GA3215@esri.com> I'm setting up a simple 5 node "cluster" basically just for using a shared GFS2 filesystem between the nodes. I'm not really concerned about HA, I just want to be able to have all the nodes accessing the same block device (iSCSI) In my thinking this sets up a cluster where only one node need be up to have quorum, and manual fencing is done for each node. 
However, when I start up the first node in the cluster, the fencing daemon hangs complaining about not being able to fence the other nodes. I have to run fence_ack_manual -n for all the other nodes, then things start up fine. Is there a way to make the node just assume all the other nodes are fine and start up? Am I really running much risk of the GFS2 filesystem failing out? Thanks, Ray From cthulhucalling at gmail.com Wed Jun 10 21:21:41 2009 From: cthulhucalling at gmail.com (Ian Hayes) Date: Wed, 10 Jun 2009 14:21:41 -0700 Subject: [Linux-cluster] GFS2 cluster and fencing In-Reply-To: <20090610205436.GA3215@esri.com> References: <20090610205436.GA3215@esri.com> Message-ID: <36df569a0906101421k55aeb7ddofe316878cfba86d5@mail.gmail.com> Have you tried changing clean_start="0" to 1? On Wed, Jun 10, 2009 at 1:54 PM, Ray Van Dolson wrote: > I'm setting up a simple 5 node "cluster" basically just for using a > shared GFS2 filesystem between the nodes. > > I'm not really concerned about HA, I just want to be able to have all > the nodes accessing the same block device (iSCSI) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > In my thinking this sets up a cluster where only one node need be up to > have quorum, and manual fencing is done for each node. > > However, when I start up the first node in the cluster, the fencing > daemon hangs complaining about not being able to fence the other nodes. > I have to run fence_ack_manual -n for all the other nodes, > then things start up fine. > > Is there a way to make the node just assume all the other nodes are > fine and start up? Am I really running much risk of the GFS2 > filesystem failing out? > > Thanks, > Ray > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alfredo.moralejo at roche.com Wed Jun 10 21:33:14 2009 From: alfredo.moralejo at roche.com (Moralejo, Alfredo) Date: Wed, 10 Jun 2009 23:33:14 +0200 Subject: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 In-Reply-To: <20090610191802.GA12988@redhat.com> References: <7a271b290906100141o296ff97w72b97fe2de70b0f4@mail.gmail.com> <20090610191802.GA12988@redhat.com> Message-ID: Anyone is successfully using it? I'm testing it with a clariion storage frame on RHEL 5.3, and as soon as I enable scsi_reserve, multipath starts failing and a path goes good and bad in a loop and scsi fencing fails sometimes, should I configure in a specific way multipath.conf?: Jun 10 23:31:58 rmamseslab07 multipathd: mpath0: remaining active paths: 3 Jun 10 23:32:00 rmamseslab07 kernel: sd 1:0:0:1: reservation conflict Jun 10 23:32:00 rmamseslab07 kernel: sd 1:0:0:1: SCSI error: return code = 0x00000018 Jun 10 23:32:00 rmamseslab07 kernel: end_request: I/O error, dev sdm, sector 79 Jun 10 23:32:00 rmamseslab07 kernel: device-mapper: multipath: Failing path 8:192. 
Jun 10 23:32:00 rmamseslab07 multipathd: 8:192: mark as failed Jun 10 23:32:00 rmamseslab07 multipathd: mpathquorum: remaining active paths: 3 Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: add map (uevent) Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: devmap already registered Jun 10 23:32:00 rmamseslab07 multipathd: sdm: emc_clariion_checker: Path healthy Jun 10 23:32:00 rmamseslab07 multipathd: 8:192: reinstated Jun 10 23:32:00 rmamseslab07 multipathd: mpathquorum: remaining active paths: 4 Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: add map (uevent) Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: devmap already registered Jun 10 23:32:02 rmamseslab07 multipathd: sdl: emc_clariion_checker: Path healthy Jun 10 23:32:02 rmamseslab07 multipathd: 8:176: reinstated Jun 10 23:32:02 rmamseslab07 multipathd: mpath0: remaining active paths: 4 Jun 10 23:32:02 rmamseslab07 multipathd: dm-9: add map (uevent) Jun 10 23:32:02 rmamseslab07 multipathd: dm-9: devmap already registered Jun 10 23:32:03 rmamseslab07 kernel: sd 1:0:0:0: reservation conflict Jun 10 23:32:03 rmamseslab07 kernel: sd 1:0:0:0: SCSI error: return code = 0x00000018 Jun 10 23:32:03 rmamseslab07 kernel: end_request: I/O error, dev sdl, sector 25256 Jun 10 23:32:03 rmamseslab07 kernel: device-mapper: multipath: Failing path 8:176. Jun 10 23:32:03 rmamseslab07 multipathd: dm-9: add map (uevent) Jun 10 23:32:03 rmamseslab07 multipathd: dm-9: devmap already registered Jun 10 23:32:03 rmamseslab07 multipathd: 8:176: mark as failed Jun 10 23:32:03 rmamseslab07 multipathd: mpath0: remaining active paths: 3 -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ryan O'Hara Sent: Wednesday, June 10, 2009 9:18 PM To: linux clustering Subject: Re: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 On Wed, Jun 10, 2009 at 02:11:21PM +0530, Rajeev P wrote: > I wanted to know if fence_scsi is supported in a multipath environment for > RHEL5.3 release. Yes, it is supported. > In earlier releases of RHEL5 fence_scsi was not supported in a multipath > environment for RHEL5.3 release. If I am not wrong, this was because the > DM-MPIO driver forwarded the registration/unregistration commands on only on > one of the physical paths of a LUN. Ideally it should have passed the > commands on all physical paths. > > For RHEL5.3, is this issue resolved so that I can fence_scsi in multipath > environment. > > Thanks in advance. > > Rajeev > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From rvandolson at esri.com Wed Jun 10 21:39:02 2009 From: rvandolson at esri.com (Ray Van Dolson) Date: Wed, 10 Jun 2009 14:39:02 -0700 Subject: [Linux-cluster] GFS2 cluster and fencing In-Reply-To: <36df569a0906101421k55aeb7ddofe316878cfba86d5@mail.gmail.com> References: <20090610205436.GA3215@esri.com> <36df569a0906101421k55aeb7ddofe316878cfba86d5@mail.gmail.com> Message-ID: <20090610213902.GB4203@esri.com> On Wed, Jun 10, 2009 at 02:21:41PM -0700, Ian Hayes wrote: > Have you tried changing clean_start="0" to 1? Nope, will do. I misinterpreted the fenced(8) man page thinking that clean_start="0" was the way to do this. 
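For anyone finding this thread in the archives: clean_start is an attribute of the fence_daemon tag in cluster.conf, along these lines (the delay values are illustrative):

    <fence_daemon clean_start="1" post_join_delay="20" post_fail_delay="0"/>

Note that clean_start="1" tells fenced to skip startup fencing entirely, which is only safe if you can be sure no node is still accessing the filesystem unfenced when the cluster comes up.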
Thanks, Ray From brem.belguebli at gmail.com Wed Jun 10 21:45:15 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Wed, 10 Jun 2009 23:45:15 +0200 Subject: [Linux-cluster] Networking guidelines for RHCS across datacenters In-Reply-To: References: <7207d96f0906021906mb731687j6964f4a3466b626e@mail.gmail.com> <29ae894c0906050222naf68973r7a68b3c52eeeead@mail.gmail.com> Message-ID: <29ae894c0906101445k7077184bxacc6964eb790fde7@mail.gmail.com> Indeed, SAN replication could be another way to partially address this. To make it work, one should be able to add sort of external resource in the cluster monitoring the synchronization status between the source LUNs and the target ones, and by the way automatically invert the synchronization in case your resource or service fails over another node on the other site. This can be tricky and your SAN arrays must allow you to do this (HDS/HP command devices, etc...) IMHO, LVM mirror is the simplest way to achieve this if latency constraints are acceptable. When I say partially, there is always the quorum issue, as on a 4 nodes cluster, equally located on 2 sites, in case of a site failure, the 2 remaining nodes are not quorate. Brem 2009/6/10 Tom Lanyon > On 05/06/2009, at 6:52 PM, brem belguebli wrote: > > Hello, >> >> That sounds pretty much to the question I've asked to this mailing-list >> last May ( >> https://www.redhat.com/archives/linux-cluster/2009-May/msg00093.html). >> >> We are in the same setup, already doing "Geo-cluster" with other technos >> and we are looking at RHCS to provide us the same service level. >> >> Latency could be a problem indeed if too high , but in a lot of cases >> (many companies for which I've worked), datacenters are a few tens of >> kilometers far, with a latency max close to 1 ms, which is not a problem. >> >> Let's consider this kind of setup, 2 datacenters far from each other by 1 >> ms delay, each hosting a SAN array, each of them connected to 2 SAN fabrics >> extended between the 2 sites. >> >> What reason would prevent us from building Geo-clusters without having to >> rely on a database replication mechanism, as the setup I would like to >> implement would also be used to provide NFS services that are disaster >> recovery proof. >> >> Obviously, such setup should rely on LVM mirroring to allow a node hosting >> a service to be able to write to both local and distant SAN LUN's. >> >> Brem >> > > > I have been wondering whether the same could be done (cross-site RHCS) > using SAN replication and multipath, avoiding LVM mirroring. This is going > to depend strongly on the storage replication failover time; if the IO to > shared storage devices is queued for too long, the cluster will stop. Does > anyone have any experience with how quick this would need to happen for RHCS > to tolerate it? > > I have been meaning to test this but have not had a chance... > > Tom > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... 
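On the quorum point Brem raises above, one partial workaround is a quorum disk on storage visible to both sites (or a third site), so that a surviving half can stay quorate. A rough cluster.conf fragment with illustrative values only; check the qdisk man page before copying any of it:

    <cman expected_votes="7"/>
    <quorumd interval="1" tko="10" votes="3" label="site_qdisk">
      <heuristic program="ping -c1 -w1 10.0.0.254" score="1" interval="2"/>
    </quorumd>

Here four nodes at one vote each plus a three-vote qdisk give seven expected votes, so two nodes plus the qdisk keep quorum when one site disappears, while two nodes alone do not.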
URL: From rohara at redhat.com Wed Jun 10 22:11:08 2009 From: rohara at redhat.com (Ryan O'Hara) Date: Wed, 10 Jun 2009 17:11:08 -0500 Subject: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 In-Reply-To: References: <7a271b290906100141o296ff97w72b97fe2de70b0f4@mail.gmail.com> <20090610191802.GA12988@redhat.com> Message-ID: <20090610221108.GE12988@redhat.com> On Wed, Jun 10, 2009 at 11:33:14PM +0200, Moralejo, Alfredo wrote: > Anyone is successfully using it? What path checker are you using? I've heard that certain path checkers cause problems, but I honestly don't know enough about dm-multipath to understand the reason for this. I have successfully used it with RDAC. Ryan > I'm testing it with a clariion storage frame on RHEL 5.3, and as soon as I enable scsi_reserve, multipath starts failing and a path goes good and bad in a loop and scsi fencing fails sometimes, should I configure in a specific way multipath.conf?: > > Jun 10 23:31:58 rmamseslab07 multipathd: mpath0: remaining active paths: 3 > Jun 10 23:32:00 rmamseslab07 kernel: sd 1:0:0:1: reservation conflict > Jun 10 23:32:00 rmamseslab07 kernel: sd 1:0:0:1: SCSI error: return code = 0x00000018 > Jun 10 23:32:00 rmamseslab07 kernel: end_request: I/O error, dev sdm, sector 79 > Jun 10 23:32:00 rmamseslab07 kernel: device-mapper: multipath: Failing path 8:192. > Jun 10 23:32:00 rmamseslab07 multipathd: 8:192: mark as failed > Jun 10 23:32:00 rmamseslab07 multipathd: mpathquorum: remaining active paths: 3 > Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: add map (uevent) > Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: devmap already registered > Jun 10 23:32:00 rmamseslab07 multipathd: sdm: emc_clariion_checker: Path healthy > Jun 10 23:32:00 rmamseslab07 multipathd: 8:192: reinstated > Jun 10 23:32:00 rmamseslab07 multipathd: mpathquorum: remaining active paths: 4 > Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: add map (uevent) > Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: devmap already registered > Jun 10 23:32:02 rmamseslab07 multipathd: sdl: emc_clariion_checker: Path healthy > Jun 10 23:32:02 rmamseslab07 multipathd: 8:176: reinstated > Jun 10 23:32:02 rmamseslab07 multipathd: mpath0: remaining active paths: 4 > Jun 10 23:32:02 rmamseslab07 multipathd: dm-9: add map (uevent) > Jun 10 23:32:02 rmamseslab07 multipathd: dm-9: devmap already registered > Jun 10 23:32:03 rmamseslab07 kernel: sd 1:0:0:0: reservation conflict > Jun 10 23:32:03 rmamseslab07 kernel: sd 1:0:0:0: SCSI error: return code = 0x00000018 > Jun 10 23:32:03 rmamseslab07 kernel: end_request: I/O error, dev sdl, sector 25256 > Jun 10 23:32:03 rmamseslab07 kernel: device-mapper: multipath: Failing path 8:176. > Jun 10 23:32:03 rmamseslab07 multipathd: dm-9: add map (uevent) > Jun 10 23:32:03 rmamseslab07 multipathd: dm-9: devmap already registered > Jun 10 23:32:03 rmamseslab07 multipathd: 8:176: mark as failed > Jun 10 23:32:03 rmamseslab07 multipathd: mpath0: remaining active paths: 3 > > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ryan O'Hara > Sent: Wednesday, June 10, 2009 9:18 PM > To: linux clustering > Subject: Re: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 > > On Wed, Jun 10, 2009 at 02:11:21PM +0530, Rajeev P wrote: > > I wanted to know if fence_scsi is supported in a multipath environment for > > RHEL5.3 release. > > Yes, it is supported. 
> > > In earlier releases of RHEL5 fence_scsi was not supported in a multipath > > environment for RHEL5.3 release. If I am not wrong, this was because the > > DM-MPIO driver forwarded the registration/unregistration commands on only on > > one of the physical paths of a LUN. Ideally it should have passed the > > commands on all physical paths. > > > > For RHEL5.3, is this issue resolved so that I can fence_scsi in multipath > > environment. > > > > Thanks in advance. > > > > Rajeev > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From alfredo.moralejo at roche.com Wed Jun 10 22:29:02 2009 From: alfredo.moralejo at roche.com (Moralejo, Alfredo) Date: Thu, 11 Jun 2009 00:29:02 +0200 Subject: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 In-Reply-To: <20090610221108.GE12988@redhat.com> References: <7a271b290906100141o296ff97w72b97fe2de70b0f4@mail.gmail.com> <20090610191802.GA12988@redhat.com> <20090610221108.GE12988@redhat.com> Message-ID: I'm using the config provide by Red Hat by default: device { vendor "DGC" product ".*" product_blacklist "LUN_Z" getuid_callout "/sbin/scsi_id -g -u -s /block/%n" prio_callout "/sbin/mpath_prio_emc /dev/%n" features "1 queue_if_no_path" hardware_handler "1 emc" path_grouping_policy group_by_prio failback immediate rr_weight uniform no_path_retry 300 rr_min_io 1000 path_checker emc_clariion } -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ryan O'Hara Sent: Thursday, June 11, 2009 12:11 AM To: linux clustering Subject: Re: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 On Wed, Jun 10, 2009 at 11:33:14PM +0200, Moralejo, Alfredo wrote: > Anyone is successfully using it? What path checker are you using? I've heard that certain path checkers cause problems, but I honestly don't know enough about dm-multipath to understand the reason for this. I have successfully used it with RDAC. Ryan > I'm testing it with a clariion storage frame on RHEL 5.3, and as soon as I enable scsi_reserve, multipath starts failing and a path goes good and bad in a loop and scsi fencing fails sometimes, should I configure in a specific way multipath.conf?: > > Jun 10 23:31:58 rmamseslab07 multipathd: mpath0: remaining active paths: 3 > Jun 10 23:32:00 rmamseslab07 kernel: sd 1:0:0:1: reservation conflict > Jun 10 23:32:00 rmamseslab07 kernel: sd 1:0:0:1: SCSI error: return code = 0x00000018 > Jun 10 23:32:00 rmamseslab07 kernel: end_request: I/O error, dev sdm, sector 79 > Jun 10 23:32:00 rmamseslab07 kernel: device-mapper: multipath: Failing path 8:192. 
> Jun 10 23:32:00 rmamseslab07 multipathd: 8:192: mark as failed > Jun 10 23:32:00 rmamseslab07 multipathd: mpathquorum: remaining active paths: 3 > Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: add map (uevent) > Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: devmap already registered > Jun 10 23:32:00 rmamseslab07 multipathd: sdm: emc_clariion_checker: Path healthy > Jun 10 23:32:00 rmamseslab07 multipathd: 8:192: reinstated > Jun 10 23:32:00 rmamseslab07 multipathd: mpathquorum: remaining active paths: 4 > Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: add map (uevent) > Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: devmap already registered > Jun 10 23:32:02 rmamseslab07 multipathd: sdl: emc_clariion_checker: Path healthy > Jun 10 23:32:02 rmamseslab07 multipathd: 8:176: reinstated > Jun 10 23:32:02 rmamseslab07 multipathd: mpath0: remaining active paths: 4 > Jun 10 23:32:02 rmamseslab07 multipathd: dm-9: add map (uevent) > Jun 10 23:32:02 rmamseslab07 multipathd: dm-9: devmap already registered > Jun 10 23:32:03 rmamseslab07 kernel: sd 1:0:0:0: reservation conflict > Jun 10 23:32:03 rmamseslab07 kernel: sd 1:0:0:0: SCSI error: return code = 0x00000018 > Jun 10 23:32:03 rmamseslab07 kernel: end_request: I/O error, dev sdl, sector 25256 > Jun 10 23:32:03 rmamseslab07 kernel: device-mapper: multipath: Failing path 8:176. > Jun 10 23:32:03 rmamseslab07 multipathd: dm-9: add map (uevent) > Jun 10 23:32:03 rmamseslab07 multipathd: dm-9: devmap already registered > Jun 10 23:32:03 rmamseslab07 multipathd: 8:176: mark as failed > Jun 10 23:32:03 rmamseslab07 multipathd: mpath0: remaining active paths: 3 > > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ryan O'Hara > Sent: Wednesday, June 10, 2009 9:18 PM > To: linux clustering > Subject: Re: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 > > On Wed, Jun 10, 2009 at 02:11:21PM +0530, Rajeev P wrote: > > I wanted to know if fence_scsi is supported in a multipath environment for > > RHEL5.3 release. > > Yes, it is supported. > > > In earlier releases of RHEL5 fence_scsi was not supported in a multipath > > environment for RHEL5.3 release. If I am not wrong, this was because the > > DM-MPIO driver forwarded the registration/unregistration commands on only on > > one of the physical paths of a LUN. Ideally it should have passed the > > commands on all physical paths. > > > > For RHEL5.3, is this issue resolved so that I can fence_scsi in multipath > > environment. > > > > Thanks in advance. > > > > Rajeev > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From ml at eyes-works.com Thu Jun 11 01:40:17 2009 From: ml at eyes-works.com (Yasuhiro Fujii) Date: Thu, 11 Jun 2009 10:40:17 +0900 Subject: [Linux-cluster] cman_tool leave does not reduce expected votes. Message-ID: <20090611102146.7A77.45046F47@eyes-works.com> Hi. I'm testing 3nodes CentOS5.3 cluster. When 3 nodes joined and one node leaved from cluster,but expected votes did not reduce. 
So when 2 nodes leaved(cman_tool leave),only one node status chaneged to activity blocked. How to Activity blocked 3 nodes joined. This is normal. Version: 6.1.0 Config Version: 1 Cluster Name: cl Cluster Id: 28318 Cluster Member: Yes Cluster Generation: 292 Membership state: Cluster-Member Nodes: 3 Expected votes: 3 Total votes: 3 Quorum: 2 1 node cman_tool leave Version: 6.1.0 Config Version: 1 Cluster Name: cl Cluster Id: 28318 Cluster Member: Yes Cluster Generation: 296 Membership state: Cluster-Member Nodes: 2 Expected votes: 3 Total votes: 2 Quorum: 2 2 nodes cman_tool leave Version: 6.1.0 Config Version: 1 Cluster Name: cl Cluster Id: 28318 Cluster Member: Yes Cluster Generation: 300 Membership state: Cluster-Member Nodes: 1 Expected votes: 3 Total votes: 1 Quorum: 2 Activity blocked I tested cman_tool leave and cman_tool leave remove,but expected votes did no reduced. I think a node cman_tool leave is used, expected votes must be reduced avoiding to activity blocked. I know cman_tool expected -e 1 avoids this activity blocked,but cman_tool leave (remove) should reduce expected votes automatically. -- cman-2.0.98-1.el5_3.1 openais-0.80.3-22.el5_3.4 -- /etc/cluster/cluster.conf -- From amalik at intertechmedia.com Thu Jun 11 02:44:06 2009 From: amalik at intertechmedia.com (Atif Malik) Date: Thu, 11 Jun 2009 02:44:06 +0000 Subject: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 In-Reply-To: References: <7a271b290906100141o296ff97w72b97fe2de70b0f4@mail.gmail.com><20090610191802.GA12988@redhat.com> Message-ID: <932651848-1244688228-cardhu_decombobulator_blackberry.rim.net-1844332269-@bxe1136.bisx.prod.on.blackberry> P -----Original Message----- From: "Moralejo, Alfredo" Date: Wed, 10 Jun 2009 23:33:14 To: linux clustering Subject: RE: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 Anyone is successfully using it? I'm testing it with a clariion storage frame on RHEL 5.3, and as soon as I enable scsi_reserve, multipath starts failing and a path goes good and bad in a loop and scsi fencing fails sometimes, should I configure in a specific way multipath.conf?: Jun 10 23:31:58 rmamseslab07 multipathd: mpath0: remaining active paths: 3 Jun 10 23:32:00 rmamseslab07 kernel: sd 1:0:0:1: reservation conflict Jun 10 23:32:00 rmamseslab07 kernel: sd 1:0:0:1: SCSI error: return code = 0x00000018 Jun 10 23:32:00 rmamseslab07 kernel: end_request: I/O error, dev sdm, sector 79 Jun 10 23:32:00 rmamseslab07 kernel: device-mapper: multipath: Failing path 8:192. 
Jun 10 23:32:00 rmamseslab07 multipathd: 8:192: mark as failed Jun 10 23:32:00 rmamseslab07 multipathd: mpathquorum: remaining active paths: 3 Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: add map (uevent) Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: devmap already registered Jun 10 23:32:00 rmamseslab07 multipathd: sdm: emc_clariion_checker: Path healthy Jun 10 23:32:00 rmamseslab07 multipathd: 8:192: reinstated Jun 10 23:32:00 rmamseslab07 multipathd: mpathquorum: remaining active paths: 4 Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: add map (uevent) Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: devmap already registered Jun 10 23:32:02 rmamseslab07 multipathd: sdl: emc_clariion_checker: Path healthy Jun 10 23:32:02 rmamseslab07 multipathd: 8:176: reinstated Jun 10 23:32:02 rmamseslab07 multipathd: mpath0: remaining active paths: 4 Jun 10 23:32:02 rmamseslab07 multipathd: dm-9: add map (uevent) Jun 10 23:32:02 rmamseslab07 multipathd: dm-9: devmap already registered Jun 10 23:32:03 rmamseslab07 kernel: sd 1:0:0:0: reservation conflict Jun 10 23:32:03 rmamseslab07 kernel: sd 1:0:0:0: SCSI error: return code = 0x00000018 Jun 10 23:32:03 rmamseslab07 kernel: end_request: I/O error, dev sdl, sector 25256 Jun 10 23:32:03 rmamseslab07 kernel: device-mapper: multipath: Failing path 8:176. Jun 10 23:32:03 rmamseslab07 multipathd: dm-9: add map (uevent) Jun 10 23:32:03 rmamseslab07 multipathd: dm-9: devmap already registered Jun 10 23:32:03 rmamseslab07 multipathd: 8:176: mark as failed Jun 10 23:32:03 rmamseslab07 multipathd: mpath0: remaining active paths: 3 -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ryan O'Hara Sent: Wednesday, June 10, 2009 9:18 PM To: linux clustering Subject: Re: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 On Wed, Jun 10, 2009 at 02:11:21PM +0530, Rajeev P wrote: > I wanted to know if fence_scsi is supported in a multipath environment for > RHEL5.3 release. Yes, it is supported. > In earlier releases of RHEL5 fence_scsi was not supported in a multipath > environment for RHEL5.3 release. If I am not wrong, this was because the > DM-MPIO driver forwarded the registration/unregistration commands on only on > one of the physical paths of a LUN. Ideally it should have passed the > commands on all physical paths. > > For RHEL5.3, is this issue resolved so that I can fence_scsi in multipath > environment. > > Thanks in advance. > > Rajeev > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From carlopmart at gmail.com Thu Jun 11 08:44:28 2009 From: carlopmart at gmail.com (carlopmart) Date: Thu, 11 Jun 2009 10:44:28 +0200 Subject: [Linux-cluster] fence_vmware works on vsphere esxi?? Message-ID: <4A30C3EC.80706@gmail.com> Hi all, Sombebody have tried to use fence_vmware (on rhel5.x) on vsphere esxi?? works or not?? Thanks. -- CL Martinez carlopmart {at} gmail {d0t} com From ccaulfie at redhat.com Thu Jun 11 08:52:09 2009 From: ccaulfie at redhat.com (Chrissie Caulfield) Date: Thu, 11 Jun 2009 09:52:09 +0100 Subject: [Linux-cluster] cman_tool leave does not reduce expected votes. 
In-Reply-To: <20090611102146.7A77.45046F47@eyes-works.com> References: <20090611102146.7A77.45046F47@eyes-works.com> Message-ID: <4A30C5B9.1000300@redhat.com> Yasuhiro Fujii wrote: > Hi. > > I'm testing 3nodes CentOS5.3 cluster. > > When 3 nodes joined and one node leaved from cluster,but expected votes > did not reduce. > So when 2 nodes leaved(cman_tool leave),only one node status chaneged to > activity blocked. > Eek! You're right. I've raised a bugzilla report for this: https://bugzilla.redhat.com/show_bug.cgi?id=505258 Chrissie From jfriesse at redhat.com Thu Jun 11 09:31:55 2009 From: jfriesse at redhat.com (Jan Friesse) Date: Thu, 11 Jun 2009 11:31:55 +0200 Subject: [Linux-cluster] Re: [Cluster-devel] Prototype Fencing Agent for Raritan eRIC G4 In-Reply-To: <4A30087D.3060901@bobich.net> References: <4A2ED075.5020207@bobich.net> <4A300034.9050603@redhat.com> <4A30087D.3060901@bobich.net> Message-ID: <4A30CF0B.2030203@redhat.com> Gordan, Gordan Bobic wrote: > Subhendu Ghosh wrote: > >> Would it be possible to look at migrating this agent to SSH (more secure) > > I started with the idea of doing it over ssh, but Net::SSH module seemed > to be a lot less forgiving about the terminal quirkyness. I can have > another go. There's also the issue of manual intervention being required > to save the signatures (and where do the known hosts go?). > >> or to SNMP (less screen scraping)? > > Hmm, maybe. I haven't looked into the SNMP capability on the device, but > it looks like it'll work, and probably be easier to do than SSH. > >> Look at fence_cisco as an example of snmp usage. > > Assuming they speak a compatible dialect, which may not be the case. > I'll have a look. We are using fence agents library, which makes writing agents easier (capable of doing things like command line parsing, implement reboot operation, ...), shorter and easier to maintain. fence_cisco is good example (short, tested, ...) HOW to write such agent. Agents are written in Python, and we are migrating all agents on top of library. > >> Long term maintainability of screen scraping is an issue with firmware >> changes. > > Tell me about it. I submitted a patch for fence_drac a while back to > address an issue that seems to have arisen from a firmware update > inducted pattern match failure. > > Not only that, but I've discovered a bug on the latest eRIC G4 firmware > - 04.02.00-7153 seems to have broken USB keyboard support (you'd think > this was important on a remote console device!) and potentially some > power button press dodgyness. The previous firmware, however - > 04.02.00-6505, works OK. > >> Also it seems that card has IPMI support. If so, can use test with >> fence_ipmi? >> Would remove the need for yet-another-agent ;) > > Sadly, my servers with these cards in them don't have IPMI support. The > card only proxies it. The card supports direct power/reset button > control in addition to IPMI, so this is what I'm using. But as you can > see from the code, it operates only on the power on/off even for a > reboot because the said servers also don't have a reset connector. I > wrote this agent because I _needed_ it. :) > > But I'll look into the SNMP way of doing it, it sounds like it might be > neater. I'll add it as an option since the telnet way is already > written. What parameter should/can be used to specify such things, that > is available from a cluster.conf reference? This question answers you little look to fence_cisco agent (or you can use fence_ifmib, fence_intel_modular, fence_apc_snmp, ...). 
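Before writing an SNMP variant it is worth checking what the card actually exposes; I do not know Raritan's enterprise MIB, so the community strings and the set example below are placeholders only:

    snmpwalk -v2c -c public 10.0.0.10 .1.3.6.1.4.1           # walk the enterprises subtree, look for outlet/power tables
    snmpset -v2c -c private 10.0.0.10 OUTLET-STATE-OID i 0   # hypothetical: turn the outlet off, if such an object exists

If the walk turns up nothing usable, the telnet version stays the fallback.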
In case you will not understand something, please ask. > > Thanks. > > Gordan > Regards, Honza From viral_ahire at yahoo.com Thu Jun 11 10:05:05 2009 From: viral_ahire at yahoo.com (viral ahire) Date: Thu, 11 Jun 2009 15:35:05 +0530 (IST) Subject: [Linux-cluster] Re:Node Leave Cluster while Stopping Cluster Application (Oracle) Message-ID: <594531.32475.qm@web94716.mail.in2.yahoo.com> Still there is no replay from geniuses....... ? Please help for me for this problem ------------------- Regards, VIRAL .D. AHIRE (Mobile- +91 9724507304) Explore and discover exciting holidays and getaways with Yahoo! India Travel http://in.travel.yahoo.com/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From ml at eyes-works.com Thu Jun 11 11:23:55 2009 From: ml at eyes-works.com (Yasuhiro Fujii) Date: Thu, 11 Jun 2009 20:23:55 +0900 Subject: [Linux-cluster] cman_tool leave does not reduce expected votes. In-Reply-To: <4A30C5B9.1000300@redhat.com> References: <20090611102146.7A77.45046F47@eyes-works.com> <4A30C5B9.1000300@redhat.com> Message-ID: <20090611202255.6E31.45046F47@eyes-works.com> Dear Chrissie. Thank you for your reply and reporting redhat bugzilla. I'll check redhat bugzilla,too. On Thu, 11 Jun 2009 09:52:09 +0100 Chrissie Caulfield wrote: > Yasuhiro Fujii wrote: > > Hi. > > > > I'm testing 3nodes CentOS5.3 cluster. > > > > When 3 nodes joined and one node leaved from cluster,but expected votes > > did not reduce. > > So when 2 nodes leaved(cman_tool leave),only one node status chaneged to > > activity blocked. > > > > > Eek! You're right. > > I've raised a bugzilla report for this: > > https://bugzilla.redhat.com/show_bug.cgi?id=505258 > > > Chrissie > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Yasuhiro Fujii From info at lizardkings.nl Thu Jun 11 17:39:18 2009 From: info at lizardkings.nl (LizardKings) Date: Thu, 11 Jun 2009 19:39:18 +0200 Subject: [Linux-cluster] get cluster nodes via XML-RPC Message-ID: <4A314146.4050107@lizardkings.nl> Hi, Is it possible to receive a list of cluster nodes via XML-RPC to one of the ricci's. DG From fdinitto at redhat.com Thu Jun 11 22:03:35 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Fri, 12 Jun 2009 00:03:35 +0200 Subject: [Linux-cluster] Re: Still having GFS2 mount hang In-Reply-To: <89c397150906111208n78c222fhdf3a57e5dbbe9f50@mail.gmail.com> References: <89c397150906081537i661beb55ke5f0284909f9e71f@mail.gmail.com> <1244534269.29604.789.camel@localhost.localdomain> <20090609140112.GB13914@redhat.com> <89c397150906091214k377d7213ud93ce341c3cb1167@mail.gmail.com> <20090609193616.GA22800@redhat.com> <89c397150906100633p5a0b15d9vab2da39107d362fc@mail.gmail.com> <20090610141351.GA18341@redhat.com> <1244653184.3665.77.camel@cerberus.int.fabbione.net> <89c397150906111208n78c222fhdf3a57e5dbbe9f50@mail.gmail.com> Message-ID: <1244757816.3665.109.camel@cerberus.int.fabbione.net> On Thu, 2009-06-11 at 15:08 -0400, William A. (Andy) Adamson wrote: > On Wed, Jun 10, 2009 at 12:59 PM, Fabio M. Di Nitto wrote: > > On Wed, 2009-06-10 at 09:13 -0500, David Teigland wrote: > >> On Wed, Jun 10, 2009 at 09:33:33AM -0400, William A. (Andy) Adamson wrote: > >> > On Tue, Jun 9, 2009 at 3:36 PM, David Teigland wrote: > >> > > On Tue, Jun 09, 2009 at 03:14:09PM -0400, William A. (Andy) Adamson wrote: > >> > >> Hi David > >> > >> > >> > >> Thanks for looking at this. 
The kernel does report a recursive lock > >> > > > >> > > that's harmless > >> > > > >> > >> issue when running /etc/init.d/cman. Details inline. > >> > > > >> > > I can't see anything wrong, I'm going to check whether we have or can get > >> > > some more recent packages, since 2.99.12 is a bit old, it looks like > >> > > you're on fedora 10? > >> > > >> > yes. I could move to fedora 11. > >> > >> I did some checking, and unfortunately 2.99.12 is the newest version we've > >> packaged for either f10 or f11. It has something to do with the corosync > >> api's changing too rapidly, and the trouble with patching and rebuilding all > >> the packages that depend on it because they are using various versions of the > >> api... the hope is it will all be better when a stable corosync 1.0 release > >> happens. > >> > >> In the mean time, Fabio was kind enough to make a set of srpms of all the > >> latest versions, http://fabbione.fedorapeople.org/srpm/ I just built and > >> installed corosync, openais and cluster srpms from there on my fedora 10 > >> machine. Started the cluster and mounted gfs with the result. > >> > >> I limited what I built/installed to avoid some annoying dependencies, to > >> > >> rpmbuild --rebuild corosync > >> rpm -Uhv corosync* > >> rpmbuild --rebuild openais > >> rpm -Uhv openais* > >> rpmbuild --rebuild cluster > >> rpm -Uhv cluster* > >> rpm -Uhv gfs* > >> rpm -Uhv --nodeps cman* > > > > Just FYI, you can build fence-agents srpm from there after install > > clusterlib and then install full cman. > > > Cool. Thanks! I'll try the new rpm's at the NFSv4.1 bakeathon next > week. I'm able to pass the basic connectathon tests with my current > gfs2 setup and the new reworked pnfs server code which can export an > unmodified gfs2 file system. I'll push some more updated srpm tomorrow. I found a couple of issues with the current ones that could be problematic. i'll send you an email with the versions to use. Fabio From fdinitto at redhat.com Fri Jun 12 06:40:10 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Fri, 12 Jun 2009 08:40:10 +0200 Subject: [Linux-cluster] Re: Still having GFS2 mount hang In-Reply-To: <89c397150906111208n78c222fhdf3a57e5dbbe9f50@mail.gmail.com> References: <89c397150906081537i661beb55ke5f0284909f9e71f@mail.gmail.com> <1244534269.29604.789.camel@localhost.localdomain> <20090609140112.GB13914@redhat.com> <89c397150906091214k377d7213ud93ce341c3cb1167@mail.gmail.com> <20090609193616.GA22800@redhat.com> <89c397150906100633p5a0b15d9vab2da39107d362fc@mail.gmail.com> <20090610141351.GA18341@redhat.com> <1244653184.3665.77.camel@cerberus.int.fabbione.net> <89c397150906111208n78c222fhdf3a57e5dbbe9f50@mail.gmail.com> Message-ID: <1244788810.3665.112.camel@cerberus.int.fabbione.net> On Thu, 2009-06-11 at 15:08 -0400, William A. (Andy) Adamson wrote: > On Wed, Jun 10, 2009 at 12:59 PM, Fabio M. Di Nitto wrote: > > On Wed, 2009-06-10 at 09:13 -0500, David Teigland wrote: > >> On Wed, Jun 10, 2009 at 09:33:33AM -0400, William A. (Andy) Adamson wrote: > >> > On Tue, Jun 9, 2009 at 3:36 PM, David Teigland wrote: > >> > > On Tue, Jun 09, 2009 at 03:14:09PM -0400, William A. (Andy) Adamson wrote: > >> > >> Hi David > >> > >> > >> > >> Thanks for looking at this. The kernel does report a recursive lock > >> > > > >> > > that's harmless > >> > > > >> > >> issue when running /etc/init.d/cman. Details inline. 
> >> > > > >> > > I can't see anything wrong, I'm going to check whether we have or can get > >> > > some more recent packages, since 2.99.12 is a bit old, it looks like > >> > > you're on fedora 10? > >> > > >> > yes. I could move to fedora 11. > >> > >> I did some checking, and unfortunately 2.99.12 is the newest version we've > >> packaged for either f10 or f11. It has something to do with the corosync > >> api's changing too rapidly, and the trouble with patching and rebuilding all > >> the packages that depend on it because they are using various versions of the > >> api... the hope is it will all be better when a stable corosync 1.0 release > >> happens. > >> > >> In the mean time, Fabio was kind enough to make a set of srpms of all the > >> latest versions, http://fabbione.fedorapeople.org/srpm/ I just built and > >> installed corosync, openais and cluster srpms from there on my fedora 10 > >> machine. Started the cluster and mounted gfs with the result. Same URL: cluster-3.0.0-17.rc2.fc12.src.rpm corosync-0.97-1.svn2233.fc12.src.rpm fence-agents-3.0.0-11.rc2.fc12.src.rpm lvm2-2.02.47-2.fc12.src.rpm openais-0.96-1.svn1951.fc12.src.rpm resource-agents-3.0.0-9.rc2.fc12.src.rpm I think vs the previous run, there is only a major update (important!) for corosync and cluster. The other packages should be unchanged. Fabio From marco.huang at sit.auckland.ac.nz Fri Jun 12 10:19:17 2009 From: marco.huang at sit.auckland.ac.nz (Marco Huang) Date: Fri, 12 Jun 2009 22:19:17 +1200 Subject: [Linux-cluster] kernel panic on debian lenny GFS2 when exporting via NFS Message-ID: <4A322BA5.30505@sit.auckland.ac.nz> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi, I have two debian lenny nodes (kernel 2.6.26-1-amd64) are running redhat cluster suite. I mount gfs2 with acl option on the two nodes. Everything are looking ok until I export the gfs2 file system to other servers with nfs acl option (I have tried without acl option). It just crashes the cluster when every time I try to edit or cat a file, but I can ls any directory without any problem. Does anyone have suggestion on that? The following is from dmesg [73567.236977] ------------[ cut here ]------------ [73567.236977] kernel BUG at fs/gfs2/glock.c:1134! 
[73567.237483] invalid opcode: 0000 [1] SMP [73567.237483] CPU 3 [73567.237483] Modules linked in: nfsd auth_rpcgss exportfs nfs lockd nfs_acl sunrpc sctp libcrc32c gfs lock_dlm gfs2 dlm configfs ipv6 aoe ext2 loop parport_pc parport snd_pcm snd_timer snd pcspkr soundcore psmouse snd_page_alloc serio_raw container i2c_piix4 ac button i2c_core intel_agp shpchp pci_hotplug evdev ext3 jbd mbcache dm_mirror dm_log dm_snapshot dm_mod sd_mod ide_cd_mod cdrom ata_generic libata dock ide_pci_generic floppy mptspi mptscsih mptbase scsi_transport_spi e1000 scsi_mod piix ide_core thermal processor fan thermal_sys [73567.237483] Pid: 23419, comm: nfsd Not tainted 2.6.26-1-amd64 #1 [73567.237483] RIP: 0010:[] [] :gfs2:gfs2_glock_nq+0x11b/0x1e0 [73567.237483] RSP: 0018:ffff81004b4f3cb0 EFLAGS: 00010282 [73567.237483] RAX: 000000000000002f RBX: ffff81004b4f3cf0 RCX: 0000000000000082 [73567.237483] RDX: 0000000000009f3a RSI: 0000000000000046 RDI: 0000000000000286 [73567.237483] RBP: ffff810057832740 R08: ffff8100d330dd48 R09: ffff81004b4f3800 [73567.237483] R10: 0000000000000000 R11: 0000000000000046 R12: ffff8100d330dd48 [73567.237483] R13: ffff8100d330dd48 R14: 0000000000000000 R15: ffff8100e0c32000 [73567.237483] FS: 00007f83d0fc86e0(0000) GS:ffff8100ef6df9c0(0000) knlGS:0000000000000000 [73567.237483] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [73567.237483] CR2: 0000000002679e08 CR3: 00000000ec8cf000 CR4: 00000000000006e0 [73567.237483] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [73567.237483] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [73567.237483] Process nfsd (pid: 23419, threadinfo ffff81004b4f2000, task ffff81000f9a70a0) [73567.237483] Stack: ffff8100ed4fecc8 ffff8100ed4fecc8 ffff8100ed4fecc8 ffff8100ee097c80 [73567.237483] 0000000000000000 ffff8100a5dd7740 ffff8100ce7f6cb0 ffffffffa02b4ec1 [73567.237483] ffff81004b4f3cf0 ffff81004b4f3cf0 ffff8100d330dd48 ffff8100ecd4e840 [73567.237483] Call Trace: [73567.237483] [] ? :gfs2:gfs2_open+0xc7/0x13c [73567.237483] [] ? :gfs2:gfs2_open+0xbf/0x13c [73567.237483] [] ? :gfs2:gfs2_open+0x0/0x13c [73567.237483] [] ? __dentry_open+0x12c/0x238 [73567.237483] [] ? :nfsd:nfsd_open+0x13c/0x170 [73567.237483] [] ? :nfsd:nfsd_read+0x7f/0xc4 [73567.237483] [] ? _spin_lock_bh+0x9/0x1f [73567.237483] [] ? :nfsd:nfsd3_proc_read+0xfe/0x141 [73567.237483] [] ? :nfsd:nfsd_dispatch+0xde/0x1b6 [73567.237483] [] ? :sunrpc:svc_process+0x408/0x6e9 [73567.237483] [] ? __down_read+0x12/0xa1 [73567.237483] [] ? :nfsd:nfsd+0x0/0x2a4 [73567.237483] [] ? :nfsd:nfsd+0x194/0x2a4 [73567.237483] [] ? schedule_tail+0x27/0x5c [73567.237483] [] ? child_rip+0xa/0x12 [73567.237483] [] ? :nfsd:nfsd+0x0/0x2a4 [73567.237483] [] ? :nfsd:nfsd+0x0/0x2a4 [73567.237483] [] ? :nfsd:nfsd+0x0/0x2a4 [73567.237483] [] ? child_rip+0x0/0x12 [73567.237483] [73567.237483] [73567.237483] Code: 74 03 8b 70 38 48 c7 c7 b0 2f 2c a0 31 c0 e8 72 94 f8 df 41 8b 54 24 30 41 8b 74 24 20 48 c7 c7 bd 2f 2c a0 31 c0 e8 5a 94 f8 df <0f> 0b eb fe 48 39 70 18 74 10 48 89 d0 48 8b 10 48 39 c8 0f 18 [73567.387600] RIP [] :gfs2:gfs2_glock_nq+0x11b/0x1e0 [73567.387600] RSP [73567.394761] ---[ end trace 7902e4725ced022f ]--- [73693.051705] BUG: soft lockup - CPU#1 stuck for 61s! 
[nfsd:23421] Cheers, Marco -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iEYEARECAAYFAkoyK6UACgkQSSHqatd3m2OaEACfexhB38p0InHX1WuvXFyy4st+ yxcAmwQaLeOz63p2rOnsQ0fswrlI4tEk =rDLx -----END PGP SIGNATURE----- From siddiqut at gmail.com Fri Jun 12 13:40:50 2009 From: siddiqut at gmail.com (Tajdar Siddiqui) Date: Fri, 12 Jun 2009 09:40:50 -0400 Subject: [Linux-cluster] gfs2 question Message-ID: <3abaa1ce0906120640v137f612at847e8a1847ee83b2@mail.gmail.com> We are running gfs2 on Red Hat Enterprise Linux Server release 5.3 (Tikanga) This is a 2 node cluster and what we have noticed is that from time to time, one node gets approximately 1/3rd write thruput on gfs2 as compared to the other node. Writing program is in java. Any ideas on what to check for etc.? Many thanx, Tajdar -------------- next part -------------- An HTML attachment was scrubbed... URL: From cthulhucalling at gmail.com Fri Jun 12 15:44:09 2009 From: cthulhucalling at gmail.com (Ian Hayes) Date: Fri, 12 Jun 2009 08:44:09 -0700 Subject: [Linux-cluster] Running additional scripts at service startup Message-ID: <36df569a0906120844h25fa6ac3v6950071e58ee089a@mail.gmail.com> HI all... I've been given the task of setting up a cluster for a service that we run here. The init script for the service calls an outside Perl script to do some administrative tasks once the daemon is started up. The script must be run as a different user so in the init script we have "su - someuser -c adminscript.pl". This all works fine if we start the daemon manually, but it doesn't appear that the script is running or it's failing whenever it is being started up via the cluster. Is there some magic foo that I'm missing? -------------- next part -------------- An HTML attachment was scrubbed... URL: From fdinitto at redhat.com Fri Jun 12 17:15:36 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Fri, 12 Jun 2009 19:15:36 +0200 Subject: [Linux-cluster] Re: Still having GFS2 mount hang In-Reply-To: <89c397150906120830r4ff643baw5f57eb16b57cc6e2@mail.gmail.com> References: <89c397150906081537i661beb55ke5f0284909f9e71f@mail.gmail.com> <1244534269.29604.789.camel@localhost.localdomain> <20090609140112.GB13914@redhat.com> <89c397150906091214k377d7213ud93ce341c3cb1167@mail.gmail.com> <20090609193616.GA22800@redhat.com> <89c397150906100633p5a0b15d9vab2da39107d362fc@mail.gmail.com> <20090610141351.GA18341@redhat.com> <1244653184.3665.77.camel@cerberus.int.fabbione.net> <89c397150906111208n78c222fhdf3a57e5dbbe9f50@mail.gmail.com> <1244788810.3665.112.camel@cerberus.int.fabbione.net> <89c397150906120830r4ff643baw5f57eb16b57cc6e2@mail.gmail.com> Message-ID: <1244826936.3665.126.camel@cerberus.int.fabbione.net> On Fri, 2009-06-12 at 11:30 -0400, William A. (Andy) Adamson wrote: > > Same URL: > > > > cluster-3.0.0-17.rc2.fc12.src.rpm > > corosync-0.97-1.svn2233.fc12.src.rpm > > fence-agents-3.0.0-11.rc2.fc12.src.rpm > > lvm2-2.02.47-2.fc12.src.rpm > > openais-0.96-1.svn1951.fc12.src.rpm > > resource-agents-3.0.0-9.rc2.fc12.src.rpm > > > > I think vs the previous run, there is only a major update (important!) > > for corosync and cluster. The other packages should be unchanged. > > OK. Next week at the bakeathon, I'll first test with what I have > 'cause it's working, and then I'll update to these rpm's and let you > know how it goes. OK cool. Looking forward to feedback. 
Fabio From dougbunger at yahoo.com Fri Jun 12 18:55:58 2009 From: dougbunger at yahoo.com (Doug Bunger) Date: Fri, 12 Jun 2009 11:55:58 -0700 (PDT) Subject: [Linux-cluster] gfs2 question Message-ID: <255434.14635.qm@web110215.mail.gq1.yahoo.com> What's the connectivity?? SAN or NAS?? Is it on physical RAID?? Are you accessing the same file? I have notice inconsistent access across iSCSI, as a result of network bandwidth, buffering, caching, et al. --- On Fri, 6/12/09, Tajdar Siddiqui wrote: From: Tajdar Siddiqui Subject: [Linux-cluster] gfs2 question To: linux-cluster at redhat.com Date: Friday, June 12, 2009, 8:40 AM We are running gfs2 on Red Hat Enterprise Linux Server release 5.3 (Tikanga) This is a 2 node cluster and what we have noticed is that from time to time, one node gets approximately 1/3rd write thruput on gfs2 as compared to the other node. Writing program is in java. Any ideas on what to check for? etc.? Many thanx, Tajdar -----Inline Attachment Follows----- -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From songyu555 at gmail.com Sun Jun 14 01:21:08 2009 From: songyu555 at gmail.com (yu song) Date: Sun, 14 Jun 2009 11:21:08 +1000 Subject: [Linux-cluster] Could some one explain why SCSI_Fence Agent can not be used in 2 nodes cluster? Message-ID: <420241f50906131821u54d12cd6vba75805b7cd82c0b@mail.gmail.com> Hi, I am planning to build a 2 nodes cluster on rhcl 5.3, and looking for what fencing method I could use. On the storage side, it is EMC clarion and supports scsi 3 reservation. So I'm thinking to use fence_scsi agent to do the disk fencing. however, according the redhat website, it states that fence_scsi does not support two nodes cluster. Could anyone kindly explain it why? (never had this issue when use veritas cluster) Another question is what is best practice to have how many Quorum disk for two-nodes cluster? It looks like not compulsory and better have it.. cheers, Yu -------------- next part -------------- An HTML attachment was scrubbed... URL: From rohara at redhat.com Mon Jun 15 03:36:12 2009 From: rohara at redhat.com (Ryan O'Hara) Date: Sun, 14 Jun 2009 22:36:12 -0500 Subject: [Linux-cluster] Could some one explain why SCSI_Fence Agent can not be used in 2 nodes cluster? In-Reply-To: <420241f50906131821u54d12cd6vba75805b7cd82c0b@mail.gmail.com> References: <420241f50906131821u54d12cd6vba75805b7cd82c0b@mail.gmail.com> Message-ID: <20090615033612.GA15883@redhat.com> On Sun, Jun 14, 2009 at 11:21:08AM +1000, yu song wrote: > Hi, > > I am planning to build a 2 nodes cluster on rhcl 5.3, and looking for what > fencing method I could use. > > On the storage side, it is EMC clarion and supports scsi 3 reservation. > > So I'm thinking to use fence_scsi agent to do the disk fencing. however, > according the redhat website, it states that fence_scsi does not support > two nodes cluster. > > Could anyone kindly explain it why? (never had this issue when use veritas > cluster) In a 2 node cluster, fencing becomes a race -- the node fences the other node first wins. This works well with power fencing, but not so well with SAN fencing (eg. fence_scsi). The problem with fence_scsi in a 2 node cluster is this: Suppose we have 2 node, call them A and B. Also assume we have multple LUNs, which we will call lun1, lun2, lun3. Consider what happens when a network partition occurs -- both nodes attempt to fence one another. 
It is possible that A could remove B's key from lun1 and lun2, but node B could remove node A's key from lun3. This is inconsistent and there is no clear "winner". Ryan From songyu555 at gmail.com Mon Jun 15 04:25:32 2009 From: songyu555 at gmail.com (yu song) Date: Mon, 15 Jun 2009 14:25:32 +1000 Subject: [Linux-cluster] Could some one explain why SCSI_Fence Agent can not be used in 2 nodes cluster? In-Reply-To: <20090615033612.GA15883@redhat.com> References: <420241f50906131821u54d12cd6vba75805b7cd82c0b@mail.gmail.com> <20090615033612.GA15883@redhat.com> Message-ID: <420241f50906142125s4828bf70y8310b52825a6986d@mail.gmail.com> thanks Ryan. So in the linux cluster, there is no concept about odd number coordinate disks, which is used to deal with this issue? anyway, probably I have to use power fencing. cheers Yu On Mon, Jun 15, 2009 at 1:36 PM, Ryan O'Hara wrote: > On Sun, Jun 14, 2009 at 11:21:08AM +1000, yu song wrote: > > Hi, > > > > I am planning to build a 2 nodes cluster on rhcl 5.3, and looking for > what > > fencing method I could use. > > > > On the storage side, it is EMC clarion and supports scsi 3 reservation. > > > > So I'm thinking to use fence_scsi agent to do the disk fencing. however, > > according the redhat website, it states that fence_scsi does not support > > two nodes cluster. > > > > Could anyone kindly explain it why? (never had this issue when use > veritas > > cluster) > > In a 2 node cluster, fencing becomes a race -- the node fences the > other node first wins. This works well with power fencing, but not so > well with SAN fencing (eg. fence_scsi). > > The problem with fence_scsi in a 2 node cluster is this: > > Suppose we have 2 node, call them A and B. Also assume we have multple > LUNs, which we will call lun1, lun2, lun3. Consider what happens when > a network partition occurs -- both nodes attempt to fence one > another. It is possible that A could remove B's key from lun1 and > lun2, but node B could remove node A's key from lun3. This is > inconsistent and there is no clear "winner". > > Ryan > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alfredo.moralejo at roche.com Mon Jun 15 14:17:55 2009 From: alfredo.moralejo at roche.com (Moralejo, Alfredo) Date: Mon, 15 Jun 2009 16:17:55 +0200 Subject: [Linux-cluster] cman + qdisk timeouts.... Message-ID: Hi, I'm having what I think is a timeouts issue in my cluster. I have a two node cluster using qdisk. Everytime the node that has the master role for qdisk becomes down (for failure or even stopping qdiskd manually), packages in the sane node are stopped because of the lack of quorum as the qdiskd becames unresponsive until second node becames master node and start working properly. Once qdiskd start working fine (usually 5-6 seconds) packages are started again. I've read in the cluster manual section for "CMAN membership timeout value" and I think this is the case. I've used RHEL 5.3 and I thought this parameter is the token that I set much longer that needed: ... 
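(The quoted cluster.conf snippet did not survive the conversion to plain text. Purely as a hypothetical illustration of the rule being described, with made-up values rather than the poster's settings, the two relevant entries look something like:

   <totem token="70000"/>
   <quorumd interval="3" tko="5" votes="1" label="qdisk"/>

i.e. a 3 s x 5 = 15 s quorum-disk timeout against a 70 s totem token, comfortably more than double.)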
Totem token is much more that double of qdisk timeout, so I guess it should be enough but everytime qdisk dies in the master node I get same result, services restarted in the sane node: Jun 15 16:11:33 rmamseslab07 qdiskd[14130]: Node 1 missed an update (2/3) Jun 15 16:11:38 rmamseslab07 qdiskd[14130]: Node 1 missed an update (3/3) Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: Node 1 missed an update (4/3) Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: Node 1 DOWN Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: Making bid for master Jun 15 16:11:44 rmamseslab07 clurgmgrd: [18510]: Executing /etc/init.d/watchdog status Jun 15 16:11:48 rmamseslab07 qdiskd[14130]: Node 1 missed an update (5/3) Jun 15 16:11:53 rmamseslab07 qdiskd[14130]: Node 1 missed an update (6/3) Jun 15 16:11:53 rmamseslab07 qdiskd[14130]: Assuming master role Message from syslogd at rmamseslab07 at Jun 15 16:11:53 ... clurgmgrd[18510]: #1: Quorum Dissolved Jun 15 16:11:53 rmamseslab07 openais[14087]: [CMAN ] lost contact with quorum device Jun 15 16:11:53 rmamseslab07 openais[14087]: [CMAN ] quorum lost, blocking activity Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: Membership Change Event Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: #1: Quorum Dissolved Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: Emergency stop of service:Cluster_test_2 Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: Emergency stop of service:wdtcscript-rmamseslab05-ic Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: Emergency stop of service:wdtcscript-rmamseslab07-ic Jun 15 16:11:54 rmamseslab07 clurgmgrd[18510]: Emergency stop of service:Logical volume 1 Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: Node 1 missed an update (7/3) Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: Writing eviction notice for node 1 Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: Telling CMAN to kill the node Jun 15 16:11:58 rmamseslab07 openais[14087]: [CMAN ] quorum regained, resuming activity I've just logged a case but... any idea???? Regards, Alfredo Moralejo Business Platforms Engineering - OS Servers - UNIX Senior Specialist F. Hoffmann-La Roche Ltd. Global Informatics Group Infrastructure Josefa Valc?rcel, 40 28027 Madrid SPAIN Phone: +34 91 305 97 87 alfredo.moralejo at roche.com Confidentiality Note: This message is intended only for the use of the named recipient(s) and may contain confidential and/or proprietary information. If you are not the intended recipient, please contact the sender and delete this message. Any unauthorized use of the information contained in this message is prohibited. -------------- next part -------------- An HTML attachment was scrubbed... URL: From anasnajj at gmail.com Mon Jun 15 18:32:24 2009 From: anasnajj at gmail.com (anasnajj) Date: Mon, 15 Jun 2009 21:32:24 +0300 Subject: [Linux-cluster] Service Owner Unknown-after power failure In-Reply-To: References: Message-ID: Hi all I have Redhat cluster with 5 nodes run 5 services for each node with two additional backup nodes when suddenly power failure happened on one node , the cluster state on another nodes show that the service owner of failed node is unknown and when we try to disable or relocate the service its stay keep trying without result .. , so how I can make the cluster start the service again on another node when the first node has power failure and no way to return it Up ???? thanks -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From giuseppe.fuggiano at gmail.com Mon Jun 15 18:41:36 2009 From: giuseppe.fuggiano at gmail.com (Giuseppe Fuggiano) Date: Mon, 15 Jun 2009 20:41:36 +0200 Subject: [Linux-cluster] Service Owner Unknown-after power failure In-Reply-To: References: Message-ID: <1e09d9070906151141x7b872d82x8cf7afbaa2ef5f82@mail.gmail.com> 2009/6/15 anasnajj : > Hi all > > I have Redhat cluster with 5 nodes run 5 services for each node with two > additional backup nodes > > when suddenly power failure happened on ?one node , the cluster state on > another nodes show that the service owner of failed node is unknown and when > we try to disable or relocate the service its stay keep trying without > result .. , so how I can make the cluster start the service again on another > node when the first node has power failure and no way to return it Up ???? What about your cluster.conf? -- Giuseppe From anasnajj at gmail.com Mon Jun 15 18:44:21 2009 From: anasnajj at gmail.com (anasnajj) Date: Mon, 15 Jun 2009 21:44:21 +0300 Subject: [Linux-cluster] Service Owner Unknown-after power failure In-Reply-To: <1e09d9070906151141x7b872d82x8cf7afbaa2ef5f82@mail.gmail.com> References: <1e09d9070906151141x7b872d82x8cf7afbaa2ef5f82@mail.gmail.com> Message-ID: Is the quarum disk will solve this problem ?? -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Giuseppe Fuggiano Sent: Monday, June 15, 2009 9:42 PM To: linux clustering Subject: Re: [Linux-cluster] Service Owner Unknown-after power failure 2009/6/15 anasnajj : > Hi all > > I have Redhat cluster with 5 nodes run 5 services for each node with two > additional backup nodes > > when suddenly power failure happened on ?one node , the cluster state on > another nodes show that the service owner of failed node is unknown and when > we try to disable or relocate the service its stay keep trying without > result .. , so how I can make the cluster start the service again on another > node when the first node has power failure and no way to return it Up ???? What about your cluster.conf? -- Giuseppe -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From dhopp at coreps.com Mon Jun 15 19:00:26 2009 From: dhopp at coreps.com (Dennis B. Hopp) Date: Mon, 15 Jun 2009 14:00:26 -0500 Subject: [Linux-cluster] GFS2 locking issues Message-ID: <20090615140026.z2wnqgqshwgwo80w@mail.coreps.com> We have a three node nfs/samba cluster that we seem to be having very poor performance on GFS2. We have a samba share that is acting as a disk to disk backup share for Backup Exec and during the backup process the load on the server will go through the roof until the network requests timeout and the backup job fails. I downloaded the ping_pong utility and ran it and seem to be getting terrible performance: [root at sc2 ~]# ./ping_ping /mnt/backup/test.dat 4 97 locks/sec The results are the same on all three nodes. I can't seem to figure out why this is so bad. 
Some additional information: [root at sc2 ~]# gfs2_tool gettune /mnt/backup new_files_directio = 0 new_files_jdata = 0 quota_scale = 1.0000 (1, 1) logd_secs = 1 recoverd_secs = 60 statfs_quantum = 30 stall_secs = 600 quota_cache_secs = 300 quota_simul_sync = 64 statfs_slow = 0 complain_secs = 10 max_readahead = 262144 quota_quantum = 60 quota_warn_period = 10 jindex_refresh_secs = 60 log_flush_secs = 60 incore_log_blocks = 1024 demote_secs = 600 [root at sc2 ~]# gfs2_tool getargs /mnt/backup data 2 suiddir 0 quota 0 posix_acl 1 num_glockd 1 upgrade 0 debug 0 localflocks 0 localcaching 0 ignore_local_fs 0 spectator 0 hostdata jid=0:id=262146:first=0 locktable lockproto lock_dlm 97 locks/sec [root at sc2 ~]# rpm -qa | grep gfs kmod-gfs-0.1.31-3.el5 gfs-utils-0.1.18-1.el5 gfs2-utils-0.1.53-1.el5_3.3 [root at sc2 ~]# uname -r 2.6.18-128.1.10.el5 Thanks, --Dennis From adas at redhat.com Mon Jun 15 19:07:28 2009 From: adas at redhat.com (Abhijith Das) Date: Mon, 15 Jun 2009 14:07:28 -0500 Subject: [Linux-cluster] GFS2 locking issues In-Reply-To: <20090615140026.z2wnqgqshwgwo80w@mail.coreps.com> References: <20090615140026.z2wnqgqshwgwo80w@mail.coreps.com> Message-ID: <4A369BF0.3010203@redhat.com> Dennis, You seem to be running plock_rate_limit=100 that limits the number of plocks/sec to 100 to avoid network flooding due to plocks. Setting this as in cluster.conf should give you better plock performance. Hope this helps, Thanks! --Abhi Dennis B. Hopp wrote: > We have a three node nfs/samba cluster that we seem to be having very > poor performance on GFS2. We have a samba share that is acting as a > disk to disk backup share for Backup Exec and during the backup > process the load on the server will go through the roof until the > network requests timeout and the backup job fails. > > I downloaded the ping_pong utility and ran it and seem to be getting > terrible performance: > > [root at sc2 ~]# ./ping_ping /mnt/backup/test.dat 4 > 97 locks/sec > > The results are the same on all three nodes. > > I can't seem to figure out why this is so bad. Some additional information: > > [root at sc2 ~]# gfs2_tool gettune /mnt/backup > new_files_directio = 0 > new_files_jdata = 0 > quota_scale = 1.0000 (1, 1) > logd_secs = 1 > recoverd_secs = 60 > statfs_quantum = 30 > stall_secs = 600 > quota_cache_secs = 300 > quota_simul_sync = 64 > statfs_slow = 0 > complain_secs = 10 > max_readahead = 262144 > quota_quantum = 60 > quota_warn_period = 10 > jindex_refresh_secs = 60 > log_flush_secs = 60 > incore_log_blocks = 1024 > demote_secs = 600 > > [root at sc2 ~]# gfs2_tool getargs /mnt/backup > data 2 > suiddir 0 > quota 0 > posix_acl 1 > num_glockd 1 > upgrade 0 > debug 0 > localflocks 0 > localcaching 0 > ignore_local_fs 0 > spectator 0 > hostdata jid=0:id=262146:first=0 > locktable > lockproto lock_dlm > > 97 locks/sec > [root at sc2 ~]# rpm -qa | grep gfs > kmod-gfs-0.1.31-3.el5 > gfs-utils-0.1.18-1.el5 > gfs2-utils-0.1.53-1.el5_3.3 > > [root at sc2 ~]# uname -r > 2.6.18-128.1.10.el5 > > Thanks, > > --Dennis > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > From dhopp at coreps.com Mon Jun 15 20:09:01 2009 From: dhopp at coreps.com (Dennis B. 
Hopp) Date: Mon, 15 Jun 2009 15:09:01 -0500 Subject: [Linux-cluster] GFS2 locking issues In-Reply-To: <4A369BF0.3010203@redhat.com> References: <20090615140026.z2wnqgqshwgwo80w@mail.coreps.com> <4A369BF0.3010203@redhat.com> Message-ID: <20090615150901.2aapnpgn6sw0cog4@mail.coreps.com> That didn't work, but I changed it to: And I'm getting different results, but still not good performance. Running ping_pong on one node [root at sc2 ~]# ./ping_ping /mnt/backup/test.dat 4 5870 locks/sec I think that should be much higher, but as soon as I start it on another node it drops to 97 locks/sec Any other ideas? --Dennis Quoting Abhijith Das : > Dennis, > > You seem to be running plock_rate_limit=100 that limits the number of > plocks/sec to 100 to avoid network flooding due to plocks. > > Setting this as in cluster.conf > should give you better plock performance. > > Hope this helps, > Thanks! > --Abhi > > Dennis B. Hopp wrote: >> We have a three node nfs/samba cluster that we seem to be having very >> poor performance on GFS2. We have a samba share that is acting as a >> disk to disk backup share for Backup Exec and during the backup >> process the load on the server will go through the roof until the >> network requests timeout and the backup job fails. >> >> I downloaded the ping_pong utility and ran it and seem to be getting >> terrible performance: >> >> [root at sc2 ~]# ./ping_ping /mnt/backup/test.dat 4 >> 97 locks/sec >> >> The results are the same on all three nodes. >> >> I can't seem to figure out why this is so bad. Some additional information: >> >> [root at sc2 ~]# gfs2_tool gettune /mnt/backup >> new_files_directio = 0 >> new_files_jdata = 0 >> quota_scale = 1.0000 (1, 1) >> logd_secs = 1 >> recoverd_secs = 60 >> statfs_quantum = 30 >> stall_secs = 600 >> quota_cache_secs = 300 >> quota_simul_sync = 64 >> statfs_slow = 0 >> complain_secs = 10 >> max_readahead = 262144 >> quota_quantum = 60 >> quota_warn_period = 10 >> jindex_refresh_secs = 60 >> log_flush_secs = 60 >> incore_log_blocks = 1024 >> demote_secs = 600 >> >> [root at sc2 ~]# gfs2_tool getargs /mnt/backup >> data 2 >> suiddir 0 >> quota 0 >> posix_acl 1 >> num_glockd 1 >> upgrade 0 >> debug 0 >> localflocks 0 >> localcaching 0 >> ignore_local_fs 0 >> spectator 0 >> hostdata jid=0:id=262146:first=0 >> locktable >> lockproto lock_dlm >> >> 97 locks/sec >> [root at sc2 ~]# rpm -qa | grep gfs >> kmod-gfs-0.1.31-3.el5 >> gfs-utils-0.1.18-1.el5 >> gfs2-utils-0.1.53-1.el5_3.3 >> >> [root at sc2 ~]# uname -r >> 2.6.18-128.1.10.el5 >> >> Thanks, >> >> --Dennis >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From dhopp at coreps.com Mon Jun 15 20:55:20 2009 From: dhopp at coreps.com (Dennis B. Hopp) Date: Mon, 15 Jun 2009 15:55:20 -0500 Subject: [Linux-cluster] GFS2 locking issues In-Reply-To: <20090615150901.2aapnpgn6sw0cog4@mail.coreps.com> References: <20090615140026.z2wnqgqshwgwo80w@mail.coreps.com> <4A369BF0.3010203@redhat.com> <20090615150901.2aapnpgn6sw0cog4@mail.coreps.com> Message-ID: <20090615155520.0quqdh0gqo4s0cco@mail.coreps.com> Actually...I added both to cluster.conf and rebooted every node. 
Now running ping_pong gives me roughly 3500 locks/sec when running it on more then one node (running it on just one node gives me around 5000 locks/sec) which according to the samba wiki are about in line with what it should be. Thanks, --Dennis Quoting "Dennis B. Hopp" : > That didn't work, but I changed it to: > > > > And I'm getting different results, but still not good performance. > Running ping_pong on one node > > [root at sc2 ~]# ./ping_ping /mnt/backup/test.dat 4 > 5870 locks/sec > > I think that should be much higher, but as soon as I start it on > another node it drops to 97 locks/sec > > Any other ideas? > > --Dennis > > Quoting Abhijith Das : > >> Dennis, >> >> You seem to be running plock_rate_limit=100 that limits the number of >> plocks/sec to 100 to avoid network flooding due to plocks. >> >> Setting this as in cluster.conf >> should give you better plock performance. >> >> Hope this helps, >> Thanks! >> --Abhi >> >> Dennis B. Hopp wrote: >>> We have a three node nfs/samba cluster that we seem to be having very >>> poor performance on GFS2. We have a samba share that is acting as a >>> disk to disk backup share for Backup Exec and during the backup >>> process the load on the server will go through the roof until the >>> network requests timeout and the backup job fails. >>> >>> I downloaded the ping_pong utility and ran it and seem to be getting >>> terrible performance: >>> >>> [root at sc2 ~]# ./ping_ping /mnt/backup/test.dat 4 >>> 97 locks/sec >>> >>> The results are the same on all three nodes. >>> >>> I can't seem to figure out why this is so bad. Some additional >>> information: >>> >>> [root at sc2 ~]# gfs2_tool gettune /mnt/backup >>> new_files_directio = 0 >>> new_files_jdata = 0 >>> quota_scale = 1.0000 (1, 1) >>> logd_secs = 1 >>> recoverd_secs = 60 >>> statfs_quantum = 30 >>> stall_secs = 600 >>> quota_cache_secs = 300 >>> quota_simul_sync = 64 >>> statfs_slow = 0 >>> complain_secs = 10 >>> max_readahead = 262144 >>> quota_quantum = 60 >>> quota_warn_period = 10 >>> jindex_refresh_secs = 60 >>> log_flush_secs = 60 >>> incore_log_blocks = 1024 >>> demote_secs = 600 >>> >>> [root at sc2 ~]# gfs2_tool getargs /mnt/backup >>> data 2 >>> suiddir 0 >>> quota 0 >>> posix_acl 1 >>> num_glockd 1 >>> upgrade 0 >>> debug 0 >>> localflocks 0 >>> localcaching 0 >>> ignore_local_fs 0 >>> spectator 0 >>> hostdata jid=0:id=262146:first=0 >>> locktable >>> lockproto lock_dlm >>> >>> 97 locks/sec >>> [root at sc2 ~]# rpm -qa | grep gfs >>> kmod-gfs-0.1.31-3.el5 >>> gfs-utils-0.1.18-1.el5 >>> gfs2-utils-0.1.53-1.el5_3.3 >>> >>> [root at sc2 ~]# uname -r >>> 2.6.18-128.1.10.el5 >>> >>> Thanks, >>> >>> --Dennis >>> >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From devi at atc.tcs.com Thu Jun 18 06:38:10 2009 From: devi at atc.tcs.com (devi) Date: Thu, 18 Jun 2009 12:08:10 +0530 Subject: [Linux-cluster] managing of resources Message-ID: <1245307090.4090.17.camel@localhost.localdomain> Hi, how can we manage the resources of cluster ? I mean to stop , start, or to find the status of cluster resources Regards, Devi. 
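(The reply below points at clustat and clusvcadm; in day-to-day use they look roughly like this, with the service and node names being placeholders:

# clustat                            # node, quorum and service status
# clusvcadm -e myservice             # enable (start) a service
# clusvcadm -d myservice             # disable (stop) a service
# clusvcadm -r myservice -m node2    # relocate a service to another member
# clusvcadm -R myservice             # restart a service in place
)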
From cthulhucalling at gmail.com Thu Jun 18 07:02:10 2009 From: cthulhucalling at gmail.com (Ian Hayes) Date: Thu, 18 Jun 2009 00:02:10 -0700 Subject: [Linux-cluster] managing of resources In-Reply-To: <1245307090.4090.17.camel@localhost.localdomain> References: <1245307090.4090.17.camel@localhost.localdomain> Message-ID: <36df569a0906180002g2eca2eecu22c5f06bf6207f95@mail.gmail.com> Clustat for the cluster and service status, clusvcadm for starting, stopping and moving services. Luci will also do all that and more with a nice gui frontend On Jun 17, 2009 11:41 PM, "devi" wrote: Hi, how can we manage the resources of cluster ? I mean to stop , start, or to find the status of cluster resources Regards, Devi. -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From brahadambal at gmail.com Thu Jun 18 11:41:42 2009 From: brahadambal at gmail.com (Brahadambal Srinivasan) Date: Thu, 18 Jun 2009 17:11:42 +0530 Subject: [Linux-cluster] Cluster among geographically separated nodes ? Message-ID: <68867af80906180441o4a7fac7fhd8338484cac4db57@mail.gmail.com> Hi, I am trying to figure out if it is possible to create an RHCS cluster among nodes that are in remote locations? If yes, then how are the following handled? : 1. Storage - how is the shared storage acheived? 2. Fencing - any special methods to fence ? 3. Max. number of nodes possible in such a setup 4. any special methods/exceptions/rules to setup this cluster? Pointers to any material in this regard will be great. Thanks much in advance. Thanks and regards, Brahadambal -------------- next part -------------- An HTML attachment was scrubbed... URL: From gordan at bobich.net Thu Jun 18 12:12:12 2009 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 18 Jun 2009 13:12:12 +0100 Subject: [Linux-cluster] Cluster among geographically separated nodes =?UTF-8?Q?=3F?= In-Reply-To: <68867af80906180441o4a7fac7fhd8338484cac4db57@mail.gmail.com> References: <68867af80906180441o4a7fac7fhd8338484cac4db57@mail.gmail.com> Message-ID: <0a2fe75edcf31e38421bed12af83618a@localhost> On Thu, 18 Jun 2009 17:11:42 +0530, Brahadambal Srinivasan wrote: > Hi, > > I am trying to figure out if it is possible to create an RHCS cluster among > nodes that are in remote locations? If yes, then how are the following > handled? : > > 1. Storage - how is the shared storage acheived? Same as it is achieved locally. It is up to your SAN to handle this in a real-time, consistent way. You may want to look into DRBD (http://www.drbd.org) for the block device level replication. Be aware, however, that performance on the disk access front will be terrible, because the latency will end up being limited by your ping time on the WAN. So instead of it having 0.1ms added via a local gigabit interconnect, it'll have 50-100ms added to it. Most applications will not produce usable performance with this kind of disk I/O speed. You may, instead, want to look into something like GlusterFS (http://www.gluster.org) or PeerFS (http://www.radiantdata.com). > 2. Fencing - any special methods to fence ? Just be aware that if your site interconnect goes down, you'll end up with a hung cluster, since the nodes will disconnect and be unable to fence each other. 
You could offset that by having separate cluster and fencing interconnects, but you would also need to look into quorum - you need n/2+1 nodes for quorum, so to make this work sensibly you'd need at least three sites - otherwise if you lose the bigger site you lose the whole cluster anyway. > 3. Max. number of nodes possible in such a setup I don't think there is a difference in this regard between LAN and WAN clusters. Gordan From raju.rajsand at gmail.com Thu Jun 18 12:13:58 2009 From: raju.rajsand at gmail.com (Rajagopal Swaminathan) Date: Thu, 18 Jun 2009 17:43:58 +0530 Subject: [Linux-cluster] Cluster among geographically separated nodes ? In-Reply-To: <68867af80906180441o4a7fac7fhd8338484cac4db57@mail.gmail.com> References: <68867af80906180441o4a7fac7fhd8338484cac4db57@mail.gmail.com> Message-ID: <8786b91c0906180513j6feef774r2a718d238d303fed@mail.gmail.com> Greetings Not an expert in Cluster. But my 2c below All of them below assume that cluster heartbeat network is on a fast network On Thu, Jun 18, 2009 at 5:11 PM, Brahadambal Srinivasan < brahadambal at gmail.com> wrote: > > 1. Storage - how is the shared storage acheived? > By storage replication (preferably with fibre) If two node, DRBD may work with a very-very fast link (It works in a 100mbps lan though like campuses with two nodes in very seperate buildings) > 2. Fencing - any special methods to fence ? > DRC/ILO/ILOM comes to mind > 3. Max. number of nodes possible in such a setup > I have done it with 4 nodes max 4. any special methods/exceptions/rules to setup this cluster? > > > Pointers to any material in this regard will be great. Thanks much in > advance. > http://archives.free.net.ph/message/20090311.200230.23f4f917.pl.html http://www.mail-archive.com/linux-cluster at redhat.com/msg06229.html > > Thanks and regards, > Brahadambal > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From giuseppe.fuggiano at gmail.com Thu Jun 18 12:51:32 2009 From: giuseppe.fuggiano at gmail.com (Giuseppe Fuggiano) Date: Thu, 18 Jun 2009 14:51:32 +0200 Subject: [Linux-cluster] Cluster among geographically separated nodes ? In-Reply-To: <0a2fe75edcf31e38421bed12af83618a@localhost> References: <68867af80906180441o4a7fac7fhd8338484cac4db57@mail.gmail.com> <0a2fe75edcf31e38421bed12af83618a@localhost> Message-ID: <1e09d9070906180551u451edf39q123a8a2a7e16a265@mail.gmail.com> 2009/6/18 Gordan Bobic : > On Thu, 18 Jun 2009 17:11:42 +0530, Brahadambal Srinivasan > wrote: >> Hi, >> >> I am trying to figure out if it is possible to create an RHCS cluster > among >> nodes that are in remote locations? If yes, then how are the following >> handled? : >> >> 1. Storage - how is the shared storage acheived? > > Same as it is achieved locally. It is up to your SAN to handle this in a > real-time, consistent way. You may want to look into DRBD > (http://www.drbd.org) for the block device level replication. Be aware, > however, that performance on the disk access front will be terrible, > because the latency will end up being limited by your ping time on the WAN. > So instead of it having 0.1ms added via a local gigabit interconnect, it'll > have 50-100ms added to it. Most applications will not produce usable > performance with this kind of disk I/O speed. I am wondering if that will affect both read and write requests or only write/verify ones (which DRBD have to replicate using the network). 
-- Giuseppe From gordan at bobich.net Thu Jun 18 13:31:10 2009 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 18 Jun 2009 14:31:10 +0100 Subject: [Linux-cluster] Cluster among geographically separated nodes =?UTF-8?Q?=3F?= In-Reply-To: <1e09d9070906180551u451edf39q123a8a2a7e16a265@mail.gmail.com> References: <68867af80906180441o4a7fac7fhd8338484cac4db57@mail.gmail.com> <0a2fe75edcf31e38421bed12af83618a@localhost> <1e09d9070906180551u451edf39q123a8a2a7e16a265@mail.gmail.com> Message-ID: On Thu, 18 Jun 2009 14:51:32 +0200, Giuseppe Fuggiano wrote: > 2009/6/18 Gordan Bobic : >> On Thu, 18 Jun 2009 17:11:42 +0530, Brahadambal Srinivasan >> wrote: >>> Hi, >>> >>> I am trying to figure out if it is possible to create an RHCS cluster >> among >>> nodes that are in remote locations? If yes, then how are the following >>> handled? : >>> >>> 1. Storage - how is the shared storage acheived? >> >> Same as it is achieved locally. It is up to your SAN to handle this in a >> real-time, consistent way. You may want to look into DRBD >> (http://www.drbd.org) for the block device level replication. Be aware, >> however, that performance on the disk access front will be terrible, >> because the latency will end up being limited by your ping time on the >> WAN. >> So instead of it having 0.1ms added via a local gigabit interconnect, >> it'll >> have 50-100ms added to it. Most applications will not produce usable >> performance with this kind of disk I/O speed. > > I am wondering if that will affect both read and write requests or > only write/verify ones (which DRBD have to replicate using the > network). It'll affect both a lot of the time even if one site is passive/failover, and pretty much all the time if it's an active-active configuration with both sides handling load. DLM will end up bouncing and checking locks back and forth between the sites. This will be the case with any real-time distributed storage system that guarantees full consistency. In other words, load sharing over a WAN will have unusable performance in most cases. Within the same campus, it'd be OK, but between different continents, I don't see it being viable. The real question is whether you really need/want load sharing. If not, you can just use ext3 with DRBD in active-passive mode with failover. Or you can use a more farm-like approach where the servers are mostly serving data, and updates/writes can be streamed from a single master using something like SeznamFS. Gordan From giuseppe.fuggiano at gmail.com Thu Jun 18 19:22:34 2009 From: giuseppe.fuggiano at gmail.com (Giuseppe Fuggiano) Date: Thu, 18 Jun 2009 21:22:34 +0200 Subject: [Linux-cluster] DRBD+GFS - Link is down, Link is up Message-ID: <1e09d9070906181222m6cbacdf0mcb6e97f7dcd33bd0@mail.gmail.com> Hi all, I configured GFS over DRBD (active-active) with RHCS and IPMI as fence device. When I try to mount my GFS resource, my interconnect interface goes down and one node is fenced. This happen every time. DRBD joins and become primary... 
Jun 18 19:04:30 alice kernel: drbd0: Handshake successful: Agreed network protocol version 89 Jun 18 19:04:30 alice kernel: drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC Jun 18 19:04:30 alice kernel: drbd0: conn( WFConnection -> WFReportParams ) Jun 18 19:04:30 alice kernel: drbd0: Starting asender thread (from drbd0_receiver [3315]) Jun 18 19:04:30 alice kernel: drbd0: data-integrity-alg: Jun 18 19:04:30 alice kernel: drbd0: drbd_sync_handshake: Jun 18 19:04:30 alice kernel: drbd0: self 2BA45318C0A122D1:CBAA0E591815072F:3F39591B4EF90EDD:2E40DDEB552666B9 Jun 18 19:04:30 alice kernel: drbd0: peer CBAA0E591815072E:0000000000000000:3F39591B4EF90EDD:2E40DDEB552666B9 Jun 18 19:04:30 alice kernel: drbd0: uuid_compare()=1 by rule 7 Jun 18 19:04:30 alice kernel: drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate ) Jun 18 19:04:30 alice kernel: drbd0: peer( Secondary -> Primary ) Jun 18 19:04:31 alice kernel: drbd0: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent ) Jun 18 19:04:31 alice kernel: drbd0: Began resync as SyncSource (will sync 16384 KB [4096 bits set]). Jun 18 19:04:33 alice kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 16384 K/sec) Jun 18 19:04:33 alice kernel: drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) Then the fence domain is OK: Jun 18 19:04:35 alice openais[3475]: [TOTEM] entering GATHER state from 11. Jun 18 19:04:35 alice openais[3475]: [TOTEM] Creating commit token because I am the rep. Jun 18 19:04:35 alice openais[3475]: [TOTEM] Saving state aru 1b high seq received 1b Jun 18 19:04:35 alice openais[3475]: [TOTEM] Storing new sequence id for ring 34 Jun 18 19:04:35 alice openais[3475]: [TOTEM] entering COMMIT state. Jun 18 19:04:35 alice openais[3475]: [TOTEM] entering RECOVERY state. Jun 18 19:04:35 alice openais[3475]: [TOTEM] position [0] member 10.17.44.116: Jun 18 19:04:35 alice openais[3475]: [TOTEM] previous ring seq 48 rep 10.17.44.116 Jun 18 19:04:35 alice openais[3475]: [TOTEM] aru 1b high delivered 1b received flag 1 Jun 18 19:04:35 alice openais[3475]: [TOTEM] position [1] member 10.17.44.117: Jun 18 19:04:35 alice openais[3475]: [TOTEM] previous ring seq 48 rep 10.17.44.117 Jun 18 19:04:35 alice openais[3475]: [TOTEM] aru a high delivered a received flag 1 Jun 18 19:04:35 alice openais[3475]: [TOTEM] Did not need to originate any messages in recovery. Jun 18 19:04:35 alice openais[3475]: [TOTEM] Sending initial ORF token Jun 18 19:04:35 alice openais[3475]: [CLM ] CLM CONFIGURATION CHANGE Jun 18 19:04:36 alice openais[3475]: [CLM ] New Configuration: Jun 18 19:04:36 alice openais[3475]: [CLM ] r(0) ip(10.17.44.116) Jun 18 19:04:36 alice openais[3475]: [CLM ] Members Left: Jun 18 19:04:36 alice openais[3475]: [CLM ] Members Joined: Jun 18 19:04:36 alice openais[3475]: [CLM ] CLM CONFIGURATION CHANGE Jun 18 19:04:36 alice openais[3475]: [CLM ] New Configuration: Jun 18 19:04:36 alice openais[3475]: [CLM ] r(0) ip(10.17.44.116) Jun 18 19:04:36 alice openais[3475]: [CLM ] r(0) ip(10.17.44.117) Jun 18 19:04:36 alice openais[3475]: [CLM ] Members Left: Jun 18 19:04:36 alice openais[3475]: [CLM ] Members Joined: Jun 18 19:04:36 alice openais[3475]: [CLM ] r(0) ip(10.17.44.117) Jun 18 19:04:36 alice openais[3475]: [SYNC ] This node is within the primary component and will provide service. Jun 18 19:04:36 alice openais[3475]: [TOTEM] entering OPERATIONAL state. 
Jun 18 19:04:36 alice openais[3475]: [CLM ] got nodejoin message 10.17.44.116 Jun 18 19:04:36 alice openais[3475]: [CLM ] got nodejoin message 10.17.44.117 Jun 18 19:04:36 alice openais[3475]: [CPG ] got joinlist message from node 1 Jun 18 19:04:40 alice kernel: dlm: connecting to 2 Jun 18 19:04:40 alice kernel: dlm: got connection from 2 WHY DOWN? Jun 18 19:04:53 alice kernel: eth2: Link is Down Jun 18 19:04:53 alice openais[3475]: [TOTEM] The token was lost in the OPERATIONAL state. Jun 18 19:04:53 alice openais[3475]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes). Jun 18 19:04:53 alice openais[3475]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes). Jun 18 19:04:53 alice openais[3475]: [TOTEM] entering GATHER state from 2. Jun 18 19:04:57 alice kernel: eth2: Link is Up 100 Mbps Full Duplex, Flow Control: None Jun 18 19:04:57 alice kernel: eth2: 10/100 speed: disabling TSO Something goes wrong with DRBD Jun 18 19:04:58 alice kernel: drbd0: PingAck did not arrive in time. Jun 18 19:04:58 alice kernel: drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) Jun 18 19:04:58 alice kernel: drbd0: asender terminated Jun 18 19:04:58 alice kernel: drbd0: Terminating asender thread Jun 18 19:04:58 alice kernel: drbd0: short read expecting header on sock: r=-512 Jun 18 19:04:58 alice kernel: drbd0: Creating new current UUID Jun 18 19:04:58 alice kernel: drbd0: Connection closed Jun 18 19:04:58 alice kernel: drbd0: conn( NetworkFailure -> Unconnected ) Jun 18 19:04:58 alice kernel: drbd0: receiver terminated Jun 18 19:04:58 alice kernel: drbd0: Restarting receiver thread Jun 18 19:04:58 alice kernel: drbd0: receiver (re)started Jun 18 19:04:58 alice kernel: drbd0: conn( Unconnected -> WFConnection ) Something goes wrong in the cluster Jun 18 19:04:58 alice openais[3475]: [TOTEM] entering GATHER state from 0. Jun 18 19:04:58 alice openais[3475]: [TOTEM] Creating commit token because I am the rep. Jun 18 19:04:58 alice openais[3475]: [TOTEM] Saving state aru 3c high seq received 3c Jun 18 19:04:58 alice openais[3475]: [TOTEM] Storing new sequence id for ring 38 Jun 18 19:04:58 alice openais[3475]: [TOTEM] entering COMMIT state. Jun 18 19:04:58 alice openais[3475]: [TOTEM] entering RECOVERY state. Jun 18 19:04:58 alice openais[3475]: [TOTEM] position [0] member 10.17.44.116: Jun 18 19:04:58 alice openais[3475]: [TOTEM] previous ring seq 52 rep 10.17.44.116 Jun 18 19:04:58 alice openais[3475]: [TOTEM] aru 3c high delivered 3c received flag 1 Jun 18 19:04:58 alice openais[3475]: [TOTEM] Did not need to originate any messages in recovery. Jun 18 19:04:58 alice openais[3475]: [TOTEM] Sending initial ORF token Jun 18 19:04:58 alice openais[3475]: [CLM ] CLM CONFIGURATION CHANGE Jun 18 19:04:58 alice openais[3475]: [CLM ] New Configuration: Jun 18 19:04:58 alice kernel: dlm: closing connection to node 2 Jun 18 19:04:58 alice fenced[3494]: bob not a cluster member after 0 sec post_fail_delay Jun 18 19:04:58 alice openais[3475]: [CLM ] r(0) ip(10.17.44.116) "bob" node is fenced (it just joined!) 
Jun 18 19:04:58 alice fenced[3494]: fencing node "bob" Jun 18 19:04:58 alice openais[3475]: [CLM ] Members Left: Jun 18 19:04:58 alice openais[3475]: [CLM ] r(0) ip(10.17.44.117) Jun 18 19:04:58 alice openais[3475]: [CLM ] Members Joined: Jun 18 19:04:58 alice openais[3475]: [CLM ] CLM CONFIGURATION CHANGE Jun 18 19:04:58 alice openais[3475]: [CLM ] New Configuration: Jun 18 19:04:58 alice openais[3475]: [CLM ] r(0) ip(10.17.44.116) Jun 18 19:04:58 alice openais[3475]: [CLM ] Members Left: Jun 18 19:04:58 alice openais[3475]: [CLM ] Members Joined: Jun 18 19:04:58 alice openais[3475]: [SYNC ] This node is within the primary component and will provide service. Jun 18 19:04:58 alice openais[3475]: [TOTEM] entering OPERATIONAL state. Jun 18 19:04:58 alice openais[3475]: [CLM ] got nodejoin message 10.17.44.116 Jun 18 19:04:58 alice openais[3475]: [CPG ] got joinlist message from node 1 Jun 18 19:05:03 alice kernel: eth2: Link is Down Jun 18 19:05:08 alice kernel: eth2: Link is Up 100 Mbps Full Duplex, Flow Control: None Jun 18 19:05:08 alice kernel: eth2: 10/100 speed: disabling TSO Jun 18 19:05:12 alice kernel: eth2: Link is Down Jun 18 19:05:13 alice fenced[3494]: fence "bob" success Jun 18 19:05:13 alice kernel: GFS: fsid=webclima:web.0: jid=1: Trying to acquire journal lock... Jun 18 19:05:13 alice kernel: GFS: fsid=webclima:web.0: jid=1: Looking at journal... Jun 18 19:05:13 alice kernel: GFS: fsid=webclima:web.0: jid=1: Done eth2 is up and down.... Jun 18 19:05:15 alice kernel: eth2: Link is Up 100 Mbps Full Duplex, Flow Control: None Jun 18 19:05:15 alice kernel: eth2: 10/100 speed: disabling TSO Jun 18 19:05:21 alice kernel: eth2: Link is Down Jun 18 19:05:24 alice kernel: eth2: Link is Up 100 Mbps Full Duplex, Flow Control: None Jun 18 19:05:24 alice kernel: eth2: 10/100 speed: disabling TSO Jun 18 19:05:29 alice kernel: eth2: Link is Down Jun 18 19:05:33 alice kernel: eth2: Link is Up 100 Mbps Full Duplex, Flow Control: None Jun 18 19:05:33 alice kernel: eth2: 10/100 speed: disabling TSO Jun 18 19:07:26 alice kernel: eth2: Link is Down Jun 18 19:07:29 alice kernel: eth2: Link is Up 100 Mbps Full Duplex, Flow Control: None Jun 18 19:07:29 alice kernel: eth2: 10/100 speed: disabling TSO Jun 18 19:07:36 alice kernel: eth2: Link is Down Jun 18 19:07:38 alice kernel: eth2: Link is Up 100 Mbps Full Duplex, Flow Control: None Jun 18 19:07:38 alice kernel: eth2: 10/100 speed: disabling TSO Consider that if I don't mount GFS, the node is not fenced and the failover domains becomes active. So, I guess the problem is in GFS... and not for example with the NIC. 
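(Which interface the cluster heartbeat is actually riding on, and whether the bond and its slaves are healthy, can be double-checked with commands along these lines; the output will of course differ from system to system:

# cman_tool status | grep -i 'node addresses'
# cat /proc/net/bonding/bond0
# ethtool eth2
)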
Here is my configuration:

# cat /etc/drbd.conf
global { usage-count no; }
resource r1 {
  protocol C;
  syncer { rate 10M; verify-alg sha1; }
  startup { become-primary-on both; wfc-timeout 150; }
  disk { on-io-error detach; }
  net {
    allow-two-primaries;
    cram-hmac-alg "sha1";
    shared-secret "123456";
    after-sb-0pri discard-least-changes;
    after-sb-1pri violently-as0p;
    after-sb-2pri violently-as0p;
    rr-conflict violently;
    ping-timeout 50;
  }
  on alice {
    device /dev/drbd0;
    disk /dev/sda2;
    address 10.17.44.116:7789;
    meta-disk internal;
  }
  on bob {
    device /dev/drbd0;
    disk /dev/sda2;
    address 10.17.44.117:7789;
    meta-disk internal;
  }
}

# cat /etc/cluster/cluster.conf

# cat /etc/hosts:
127.0.0.1 localhost.localdomain localhost
172.17.44.116 alice
172.17.44.117 bob

# ifconfig
bond0  Link encap:Ethernet HWaddr 00:15:17:51:70:38
       inet addr:10.17.44.116 Bcast:10.17.44.255 Mask:255.255.255.0
       inet6 addr: fe80::215:17ff:fe51:7038/64 Scope:Link
       UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
       RX packets:49984 errors:0 dropped:0 overruns:0 frame:0
       TX packets:83669 errors:0 dropped:0 overruns:0 carrier:0
       collisions:11221 txqueuelen:0
       RX bytes:16151284 (15.4 MiB) TX bytes:102618030 (97.8 MiB)

eth0   Link encap:Ethernet HWaddr 00:15:17:51:70:38
       UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
       RX packets:49984 errors:0 dropped:0 overruns:0 frame:0
       TX packets:83669 errors:0 dropped:0 overruns:0 carrier:0
       collisions:11221 txqueuelen:100
       RX bytes:16151284 (15.4 MiB) TX bytes:102618030 (97.8 MiB)
       Memory:f9140000-f9160000

eth1   Link encap:Ethernet HWaddr 00:15:17:51:70:38
       UP BROADCAST SLAVE MULTICAST MTU:1500 Metric:1
       RX packets:0 errors:0 dropped:0 overruns:0 frame:0
       TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:1000
       RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
       Memory:f91a0000-f91c0000

eth2   Link encap:Ethernet HWaddr 00:19:99:29:08:8B
       inet addr:172.17.44.116 Bcast:172.17.44.255 Mask:255.255.255.0
       inet6 addr: fe80::219:99ff:fe29:88b/64 Scope:Link
       UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
       RX packets:20 errors:0 dropped:0 overruns:0 frame:0
       TX packets:45 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:100
       RX bytes:1200 (1.1 KiB) TX bytes:7902 (7.7 KiB)
       Memory:f9200000-f9220000

lo     Link encap:Local Loopback
       inet addr:127.0.0.1 Mask:255.0.0.0
       inet6 addr: ::1/128 Scope:Host
       UP LOOPBACK RUNNING MTU:16436 Metric:1
       RX packets:3541 errors:0 dropped:0 overruns:0 frame:0
       TX packets:3541 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:0
       RX bytes:464552 (453.6 KiB) TX bytes:464552 (453.6 KiB)

I hope there is someone just experienced this bad issue. Thanks in advance.

-- Giuseppe

From giuseppe.fuggiano at gmail.com Thu Jun 18 20:07:14 2009
From: giuseppe.fuggiano at gmail.com (Giuseppe Fuggiano)
Date: Thu, 18 Jun 2009 22:07:14 +0200
Subject: [Linux-cluster] Re: DRBD+GFS - Link is down, Link is up
In-Reply-To: <1e09d9070906181222m6cbacdf0mcb6e97f7dcd33bd0@mail.gmail.com>
References: <1e09d9070906181222m6cbacdf0mcb6e97f7dcd33bd0@mail.gmail.com>
Message-ID: <1e09d9070906181307y595a3bc9hb2c89d46f3a9d424@mail.gmail.com>

2009/6/18 Giuseppe Fuggiano :
[snip]
> eth2 is up and down....
>
> Jun 18 19:05:15 alice kernel: eth2: Link is Up 100 Mbps Full Duplex,
> Flow Control: None
> Jun 18 19:05:15 alice kernel: eth2: 10/100 speed: disabling TSO
> Jun 18 19:05:21 alice kernel: eth2: Link is Down
> Jun 18 19:05:24 alice kernel: eth2: Link is Up 100 Mbps Full Duplex,
[snip]
>
> Consider that if I don't mount GFS, the node is not fenced and the
> failover domains becomes active.
> So, I guess the problem is in GFS... and not for example with the NIC. Trying to use bond0 as heartbeat, I discovered that eth2 stuff doesn't affect the infinite fencing behaviour... -- Giuseppe From tom at netspot.com.au Fri Jun 19 02:47:59 2009 From: tom at netspot.com.au (Tom Lanyon) Date: Fri, 19 Jun 2009 12:17:59 +0930 Subject: [Linux-cluster] Cluster among geographically separated nodes ? In-Reply-To: <0a2fe75edcf31e38421bed12af83618a@localhost> References: <68867af80906180441o4a7fac7fhd8338484cac4db57@mail.gmail.com> <0a2fe75edcf31e38421bed12af83618a@localhost> Message-ID: <87128C8B-AC72-4E56-9FAC-D4B84E2578DA@netspot.com.au> On 18/06/2009, at 9:42 PM, Gordan Bobic wrote: > On Thu, 18 Jun 2009 17:11:42 +0530, Brahadambal Srinivasan > wrote: >> 2. Fencing - any special methods to fence ? > > Just be aware that if your site interconnect goes down, you'll end > up with > a hung cluster, since the nodes will disconnect and be unable to > fence each > other. You could offset that by having separate cluster and fencing > interconnects, but you would also need to look into quorum - you > need n/2+1 > nodes for quorum, so to make this work sensibly you'd need at least > three > sites - otherwise if you lose the bigger site you lose the whole > cluster > anyway. This question came up last week as well so I have been thinking about the options here. Gordan's suggestion of three sites is a good one but may not be feasible for some. If you are using replicated SAN LUN(s) for your shared storage, the LUN is only ever going to be active at one site. So, if you lose connectivity between sites you obviously want the cluster to remain operational at the site with the active storage LUN. I can imagine a cross-site accessible qdisk *almost* solving this problem. The remaining issue, as I see it, is that if your network connectivity is lost the cluster will pause all services until it has successfully removed the failed nodes -- if it can't fence these nodes due to the lost network connectivity, you may end up with a site that effectively has quorum but all services are still hung. This sort of issue would especially arise if, for example, you lost ethernet connectivity but not FC/storage connectivity - the nodes at the remote site would still be able to access the qdisk. Perhaps a combination of power fencing (via ethernet) + storage fencing (on the local side of the SAN) could make this a workable solution? Regards, Tom -- Tom Lanyon Senior Systems Engineer NetSpot Pty Ltd From vcmarti at sph.emory.edu Fri Jun 19 14:13:54 2009 From: vcmarti at sph.emory.edu (Vernard C. Martin) Date: Fri, 19 Jun 2009 10:13:54 -0400 Subject: [Linux-cluster] fencing Cisco MDS 9134 w/ RHEL5 Message-ID: <4A3B9D22.4080908@sph.emory.edu> I can't seem to find any evidence that this fiber switch has a fencing agent for RHEL4. There seems to be some documentation of it being supported in RHEL 5.4. Is it reasonable to just port the agent or am I missing some technical detail that the agent requires that is in the newer kernel? -- Vernard Martin Applications Developer/Analyst Email: vcmarti at sph.emory.edu Desk:404.727.2076 Office of Information Technology -Rollins School of Public Health From alietsantiesteban at gmail.com Sat Jun 20 03:02:20 2009 From: alietsantiesteban at gmail.com (Aliet Santiesteban Sifontes) Date: Fri, 19 Jun 2009 23:02:20 -0400 Subject: [Linux-cluster] Will redhat release the srpms of cluster suite for rhel-4.8 to the public??? 
Message-ID: <365467590906192002m3e4991d2m2e74ac26b1134fa5@mail.gmail.com> Hi, just wondering if redhat will release the srpms for the cluster suite updated for rhel-4.8???, I have been looking for it in redhat ftp site, but can not find it. Any ideas?? Best regards From fdinitto at redhat.com Sat Jun 20 11:19:49 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Sat, 20 Jun 2009 13:19:49 +0200 Subject: [Linux-cluster] Cluster 3.0.0.rc3 release Message-ID: <1245496789.3665.328.camel@cerberus.int.fabbione.net> The cluster team and its community are proud to announce the 3.0.0.rc3 release candidate from the STABLE3 branch. The development cycle for 3.0.0 is completed. The STABLE3 branch is now collecting only bug fixes and minimal update required to build and run on top of the latest upstream kernel/corosync/openais. Everybody with test equipment and time to spare, is highly encouraged to download, install and test this release candidate and more important report problems. This is the time for people to make a difference and help us testing as much as possible. In order to build the 3.0.0.rc3 release you will need: - corosync 0.98 - openais 0.97 - linux kernel 2.6.29 The new source tarball can be downloaded here: ftp://sources.redhat.com/pub/cluster/releases/cluster-3.0.0.rc3.tar.gz https://fedorahosted.org/releases/c/l/cluster/cluster-3.0.0.rc3.tar.gz At the same location is now possible to find separated tarballs for fence-agents and resource-agents as previously announced (http://www.redhat.com/archives/cluster-devel/2009-February/msg00003.htm) To report bugs or issues: https://bugzilla.redhat.com/ Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other sysadministrators or power users. Happy clustering, Fabio Under the hood (from 3.0.0.rc2): Abhijith Das (2): gfs-kernel: enable FS_HAS_FREEZE gfs-kernel: bz479421 - gfs_tool: page allocation failure. order:4, mode:0xd0 Andrew Price (2): gfs2-utils: Clean up leftover prog_name globals fsck.gfs2: Remove compute_height Bob Peterson (10): mount failure after gfs2_edit restoremeta of GFS file system gfs2_edit savemeta needs to save freemeta blocks gfs2_edit: Fix indirect block scrolling Correction to an earlier commit. Buffers were being updated Removed check for incorrect height GFS2: gfs2_edit savemeta wasn't saving indirect eattribute blocks GFS2: gfs2_edit savemeta wasn't saving ea sub-blocks GFS2: fsck.gfs2 sometimes needs to be run twice fsck.gfs2 writing bitmap when -n specified Fixed compiler warnings and errors that crept in. Christine Caulfield (9): dlm: don't print an error from lockdump if there are no locks. cman: More changes for the latest corosync API cman: call corosync->request_shutdown on cman-tool leave cman: Allow use of broadcast communications gfs2: Fix includes for building on rawhide cman: Change some more ais references to Corosync fence: Allow IP addresses as node names cman: Remove references to ccs in the man pages cman: Catch failure to determine default multicast address David Teigland (8): fenced/dlm_controld/gfs_controld: dlm_controld: remove unused plock_exit dlm_tool: fix shadow warnings gfs_control: fix shadow warnings fenced: avoid static warnings dlm_tool: fix warning fenced: remove const string warnings fenced: fix id_info struct alignment Fabio M. 
Di Nitto (62): gfs: fix most of the warnings spotted by paranoia-cflags dlm: fix function prototypes libdlmcontrol: fix const warning libdlmcontrol: make function static dlm_tool: constify functions dlm_tool: make functions static dlm_tool: fix format warnings libfenced: fix const warning fence_node: fix const warning fence_tool: fix const warning fenced: fix function declaration libgroup: fix const warning dlm_controld: fix function declaration warning dlm_controld: fix const warning dlm_controld: fix return warning in plock dlm_controld: make functions static libgfscontrol: fix const warnings libgfscontrol: make functions static gfs_control: fix const warnings gfs_control: make functions static gfs_controld: fix function declaration gfs_controld: fix const warnings gfs_controld: ifdef out unused code group_tool: fix const warnings group_tool: fix function declaration group_tool: make functions static group daemon: fix function declaration group_daemon: fix const warnings group_tool: fix shadow warning group_daemon: make functions static group_dameon: ifdef out unused code dlm: fix void arithmetic group: fix void arithmetic group: fix print formats fence: fix void arithmetic fence: fix print formats fenced: add const to ccs functions fenced: add const to msg_name fenced: add const to setup_listener fenced: add const bits to recover.c cman: fix logging config and major cleanup gfs2: fix build warnings spotted by paranoia cflags build: set paranoia build warnings by default gfs2: restore libgfs2.h vfprintf call gfs: fix endian conversion gfs2: fix endian conversion gfs2: don't swab in place gfs: don't swab in place cman init: add support for join and leave options qdisk: fix disk scanning check in sysfs build: drop unrequired include dir build: fix build dependency for ccs_tool build: clean up perl bindings .d files config: drop obselete build check in libccs scandisk: remove build debug entry (now unrequired) qdisk: remove build DEBUG option in favour of runtime build: fix clean operation for .pc files dlm: fix libdlm_lt pc file module name build: allow easy build of test tarballs for the whole set build: drop dependency on libvolume_id gfs2: drop leftover file from import cman init: fix groupd check Jan Friesse (1): CMAN: Support for openaisserviceenablestable service loader Lon Hohberger (9): qdisk: Add reporting for I/O hangs to quourm disk rgmanager: Allow reboot if main proc. 
is killed rgmanager: Make vm.sh use libvirt rgmanager: Remove extra checks from Oracle agents rgmanager: Fix up multiple Oracle instance handling rgmanager: Check for all ORA- errors on start/stop group: Make group_tool checks more robust rgmanager: Fix restart-after-migrate issue rgmanager: Fix noise when running in foreground Marc Grimme (1): rgmanager: Implement explicit ordering for failover Marek 'marx' Grac (8): fence_scsi_test.pl: #499871 fence_scsi_test.pl does not check for sg_persist in the path fence_drac5: #496724 - support for modulename in drac5 agent fence_apc: #501586 - fence_apc fails with pexpect exception apache.sh: #489785 - does not handle a valid /etc/httpd/conf/httpd.conf configuration correctly fence_lpar: fence_lpar can't log in to IVM systems fence_agents: #501586 - fence agents fails with pexpect exception fence_lpar: #504705 - fence_lpar: lssyscfg command on HMC can take longer than SHELL_TIMEOUT fence agents: Option for setting port for telnet/ssh/ssl used by fence agent Steven Whitehouse (19): Remove unused code from various places gfs2_tool: gettext support mkfs.gfs2: Add gettext support gfs2_tool: Fix misplaced bracket that bob spotted fsck.gfs2: Add gettext support Makefile: Fix problem which crept in earlier gfs2_tool: Use FIFREEZE/FITHAW ioctl fsck.gfs2: Add gettext support gfs2_tool: Remove obsolete subcommands libgfs2: Remove unused library function gfs2_tool: Remove ref to non-existent sysfs file gfs2_tool: Remove code to read args/* gfs2_tool: Fix help message man: Remove obsolete info from mount.gfs2 man page man: More updates fsck: Fix up merge issue gfs2_tool: Remove df command from gfs2_tool mkfs.gfs2: Remove dep on libvolume_id mkfs.gfs: Remove dep on libvolume_id Makefile | 5 +- cman/cman_tool/join.c | 4 +- cman/cman_tool/main.c | 4 +- cman/daemon/Makefile | 1 - cman/daemon/ais.c | 51 +- cman/daemon/ais.h | 1 + cman/daemon/barrier.c | 13 +- cman/daemon/cman-preconfig.c | 133 ++++-- cman/daemon/cmanconfig.c | 3 +- cman/daemon/commands.c | 98 ++-- cman/daemon/commands.h | 2 +- cman/daemon/daemon.c | 35 +- cman/daemon/daemon.h | 2 +- cman/daemon/logging.c | 29 -- cman/daemon/logging.h | 17 - cman/init.d/cman.in | 22 +- cman/man/cman.5 | 6 +- cman/man/cman_tool.8 | 14 +- cman/qdisk/Makefile | 6 +- cman/qdisk/disk.c | 23 +- cman/qdisk/iostate.c | 142 ++++++ cman/qdisk/iostate.h | 17 + cman/qdisk/main.c | 7 + cman/qdisk/scandisk.c | 6 +- cman/qdisk/scandisk.h | 6 +- config/libs/libccsconfdb/ccs.h | 4 - config/plugins/xml/Makefile | 1 - config/tools/ccs_tool/Makefile | 4 +- configure | 37 +- dlm/libdlm/libdlm.c | 4 +- dlm/libdlm/libdlm.h | 4 +- dlm/libdlm/libdlm_lt.pc.in | 2 +- dlm/libdlmcontrol/main.c | 8 +- dlm/tool/main.c | 56 +-- fence/agents/alom/fence_alom.py | 11 +- fence/agents/apc/fence_apc.py | 4 +- fence/agents/bladecenter/fence_bladecenter.py | 11 +- fence/agents/drac/fence_drac5.py | 11 +- fence/agents/ilo/fence_ilo.py | 2 +- fence/agents/ldom/fence_ldom.py | 11 +- fence/agents/lib/fencing.py.py | 25 +- fence/agents/lpar/fence_lpar.py | 17 +- fence/agents/rsa/fence_rsa.py | 11 +- fence/agents/scsi/fence_scsi_test.pl | 15 + fence/agents/virsh/fence_virsh.py | 11 +- fence/agents/wti/fence_wti.py | 11 +- fence/agents/xvm/Makefile | 1 - fence/fence_node/fence_node.c | 4 +- fence/fence_tool/fence_tool.c | 8 +- fence/fenced/config.c | 8 +- fence/fenced/config.h | 2 +- fence/fenced/cpg.c | 7 +- fence/fenced/fd.h | 8 +- fence/fenced/main.c | 26 +- fence/fenced/member_cman.c | 14 +- fence/fenced/recover.c | 11 +- fence/libfenced/main.c | 6 +- 
gfs-kernel/src/gfs/gfs_ondisk.h | 38 +- gfs-kernel/src/gfs/ioctl.c | 5 +- gfs-kernel/src/gfs/ops_fstype.c | 2 +- gfs/gfs_debug/basic.c | 2 +- gfs/gfs_debug/readfile.c | 2 +- gfs/gfs_debug/util.c | 14 +- gfs/gfs_fsck/eattr.c | 2 +- gfs/gfs_fsck/file.c | 6 +- gfs/gfs_fsck/fs_bits.c | 2 +- gfs/gfs_fsck/fs_dir.c | 56 +- gfs/gfs_fsck/fs_inode.c | 6 +- gfs/gfs_fsck/fs_inode.h | 2 +- gfs/gfs_fsck/initialize.c | 2 +- gfs/gfs_fsck/log.c | 4 +- gfs/gfs_fsck/log.h | 2 +- gfs/gfs_fsck/main.c | 12 +- gfs/gfs_fsck/metawalk.c | 32 +- gfs/gfs_fsck/ondisk.c | 34 +- gfs/gfs_fsck/pass1.c | 10 +- gfs/gfs_fsck/pass1b.c | 9 +- gfs/gfs_fsck/pass1c.c | 18 +- gfs/gfs_fsck/pass2.c | 13 +- gfs/gfs_fsck/pass3.c | 2 +- gfs/gfs_fsck/pass4.c | 4 +- gfs/gfs_fsck/pass5.c | 6 +- gfs/gfs_fsck/super.c | 12 +- gfs/gfs_fsck/util.c | 14 +- gfs/gfs_grow/main.c | 35 +- gfs/gfs_jadd/main.c | 39 +- gfs/gfs_mkfs/Makefile | 5 +- gfs/gfs_mkfs/device_geometry.c | 2 +- gfs/gfs_mkfs/main.c | 136 ++++-- gfs/gfs_mkfs/structures.c | 6 +- gfs/gfs_quota/check.c | 34 +- gfs/gfs_quota/gfs_quota.h | 4 + gfs/gfs_quota/layout.c | 25 +- gfs/gfs_quota/main.c | 45 ++- gfs/gfs_tool/counters.c | 6 +- gfs/gfs_tool/df.c | 40 +- gfs/gfs_tool/gfs_tool.h | 6 +- gfs/gfs_tool/layout.c | 57 ++- gfs/gfs_tool/misc.c | 78 ++- gfs/gfs_tool/tune.c | 12 +- gfs/gfs_tool/util.c | 10 +- gfs/libgfs/file.c | 6 +- gfs/libgfs/fs_bits.c | 2 +- gfs/libgfs/fs_dir.c | 46 +- gfs/libgfs/fs_inode.c | 4 +- gfs/libgfs/libgfs.h | 5 +- gfs/libgfs/log.c | 4 +- gfs/libgfs/ondisk.c | 36 +- gfs/libgfs/super.c | 1 - gfs/libgfs/util.c | 14 +- gfs2/convert/gfs2_convert.c | 53 +- gfs2/edit/gfs2hex.c | 76 ++-- gfs2/edit/gfs2hex.h | 4 + gfs2/edit/hexedit.c | 453 ++++++++---------- gfs2/edit/hexedit.h | 5 +- gfs2/edit/savemeta.c | 204 ++++++--- gfs2/fsck/eattr.c | 9 +- gfs2/fsck/fs_recovery.c | 39 +- gfs2/fsck/initialize.c | 66 ++-- gfs2/fsck/link.c | 28 +- gfs2/fsck/lost_n_found.c | 22 +- gfs2/fsck/main.c | 121 +++--- gfs2/fsck/metawalk.c | 265 ++++++---- gfs2/fsck/pass1.c | 304 +++++++----- gfs2/fsck/pass1b.c | 187 ++++--- gfs2/fsck/pass1c.c | 217 ++++++--- gfs2/fsck/pass2.c | 371 +++++++++------ gfs2/fsck/pass3.c | 105 ++-- gfs2/fsck/pass4.c | 64 ++-- gfs2/fsck/pass5.c | 50 +- gfs2/fsck/rgrepair.c | 88 ++-- gfs2/fsck/test.c | 8 - gfs2/fsck/util.c | 35 +-- gfs2/fsck/util.h | 1 - gfs2/libgfs2/block_list.c | 34 +- gfs2/libgfs2/buf.c | 4 +- gfs2/libgfs2/fs_bits.c | 61 +++ gfs2/libgfs2/fs_geometry.c | 4 +- gfs2/libgfs2/fs_ops.c | 62 ++- gfs2/libgfs2/gfs1.c | 5 +- gfs2/libgfs2/gfs2_log.c | 7 +- gfs2/libgfs2/libgfs2.h | 26 +- gfs2/libgfs2/misc.c | 92 +---- gfs2/libgfs2/rgrp.c | 8 +- gfs2/man/gfs2_convert.8 | 16 +- gfs2/man/gfs2_grow.8 | 7 +- gfs2/man/gfs2_quota.8 | 2 +- gfs2/man/gfs2_tool.8 | 67 +-- gfs2/man/mount.gfs2.8 | 39 +- gfs2/mkfs/Makefile | 2 - gfs2/mkfs/gfs2_mkfs.h | 2 - gfs2/mkfs/main.c | 19 +- gfs2/mkfs/main_grow.c | 64 ++-- gfs2/mkfs/main_jadd.c | 155 +++--- gfs2/mkfs/main_mkfs.c | 290 +++++++---- gfs2/mount/mount.gfs2.c | 35 +-- gfs2/mount/mtab.c | 1 - gfs2/mount/util.c | 11 +- gfs2/mount/util.h | 5 +- gfs2/quota/check.c | 33 +-- gfs2/quota/gfs2_quota.h | 6 +- gfs2/quota/main.c | 12 +- gfs2/tool/Makefile | 3 +- gfs2/tool/df.c | 290 ----------- gfs2/tool/gfs2_tool.h | 16 - gfs2/tool/main.c | 139 ++---- gfs2/tool/misc.c | 257 ++-------- gfs2/tool/sb.c | 62 ++-- gfs2/tool/tune.c | 26 +- group/daemon/app.c | 40 +- group/daemon/cpg.c | 28 +- group/daemon/gd_internal.h | 14 +- group/daemon/joinleave.c | 4 +- group/daemon/main.c | 20 +- group/dlm_controld/action.c | 11 +- 
group/dlm_controld/config.c | 8 +- group/dlm_controld/cpg.c | 6 +- group/dlm_controld/deadlock.c | 2 +- group/dlm_controld/dlm_daemon.h | 12 +- group/dlm_controld/main.c | 14 +- group/dlm_controld/netlink.c | 2 +- group/dlm_controld/plock.c | 11 +- group/gfs_control/main.c | 30 +- group/gfs_controld/config.c | 6 +- group/gfs_controld/cpg-new.c | 6 +- group/gfs_controld/cpg-old.c | 18 +- group/gfs_controld/gfs_daemon.h | 12 +- group/gfs_controld/group.c | 2 +- group/gfs_controld/main.c | 8 +- group/gfs_controld/plock.c | 12 +- group/gfs_controld/util.c | 8 +- group/lib/libgroup.c | 8 +- group/lib/libgroup.h | 2 +- group/libgfscontrol/main.c | 8 +- group/tool/main.c | 18 +- make/clean.mk | 2 +- make/defines.mk.input | 3 - make/perl-binding-common.mk | 2 +- make/release.mk | 50 +- rgmanager/src/clulib/Makefile | 2 +- rgmanager/src/daemons/Makefile | 2 +- rgmanager/src/daemons/groups.c | 1 - rgmanager/src/daemons/restree.c | 13 +- rgmanager/src/daemons/rg_state.c | 16 +- rgmanager/src/daemons/watchdog.c | 24 +- rgmanager/src/resources/apache.sh | 6 +- rgmanager/src/resources/default_event_script.sl | 150 ++++++- rgmanager/src/resources/oracledb.sh.in | 28 +- rgmanager/src/resources/service.sh | 19 +- rgmanager/src/resources/vm.sh | 608 +++++++++++++++------- rgmanager/src/utils/Makefile | 2 +- 211 files changed, 4240 insertions(+), 3606 deletions(-) -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From thomas at sjolshagen.net Sat Jun 20 20:06:11 2009 From: thomas at sjolshagen.net (Thomas Sjolshagen) Date: Sat, 20 Jun 2009 16:06:11 -0400 Subject: [Linux-cluster] clusvcadm -M -m fails with "Invalid operation for resource" Message-ID: <20090620160611.20273n7qqnh4jd4j@www.sjolshagen.net> Hi, I'm trying to do a migration to another member of the cluster but it fails with: # clusvcadm -M samba -m host2 Trying to migrate service:samba to virt0-backup.sjolshagen.net...Invalid operation for resource I'm running Fedora 11 with rgmanager-3.0.0-15.rc1.fc11.x86_64 installed as well as a downloaded copy of the 5/21 version of vm.sh from the git.fedorahosted.org repository. I've configured the resource as follows: And the guest container files are hosted on a (previously mounted) gfs2 file system. Is this a rgmanager shortcoming (rgmanager needs to be coded to support virsh & live migration) or - more likely - user error? Thanks in advance // Thomas ---------------------------------------------------------------- This message was sent using IMP, the Internet Messaging Program. From fdinitto at redhat.com Sun Jun 21 05:43:59 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Sun, 21 Jun 2009 07:43:59 +0200 Subject: [Linux-cluster] Will redhat release the srpms of cluster suite for rhel-4.8 to the public??? In-Reply-To: <365467590906192002m3e4991d2m2e74ac26b1134fa5@mail.gmail.com> References: <365467590906192002m3e4991d2m2e74ac26b1134fa5@mail.gmail.com> Message-ID: <1245563039.3665.360.camel@cerberus.int.fabbione.net> On Fri, 2009-06-19 at 23:02 -0400, Aliet Santiesteban Sifontes wrote: > Hi, just wondering if redhat will release the srpms for the cluster > suite updated for rhel-4.8???, I have been looking for it in redhat > ftp site, but can not find it. > Any ideas?? If you can't find the srpm, you can always use git to get the code. 
http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=shortlog;h=refs/heads/RHEL48 the code is public, no secrets patches or anything like that :) I don't exclude the possibility that the srpm has not been released, but if so, it's probably purely a mistake. I am CC'ing Chris that can investigate where it is. Fabio From henry.robertson at hjrconsulting.com Mon Jun 22 03:57:34 2009 From: henry.robertson at hjrconsulting.com (Henry Robertson) Date: Sun, 21 Jun 2009 23:57:34 -0400 Subject: [Linux-cluster] Re: clusvcadm -M -m fails with "Invalid operation for resource" Message-ID: Today's Topics: > > 1. clusvcadm -M -m fails with "Invalid > operation for resource" (Thomas Sjolshagen) > 2. Re: Will redhat release the srpms of cluster suite for > rhel-4.8 to the public??? (Fabio M. Di Nitto) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Sat, 20 Jun 2009 16:06:11 -0400 > From: Thomas Sjolshagen > Subject: [Linux-cluster] clusvcadm -M -m fails > with "Invalid operation for resource" > To: linux-cluster at redhat.com > Message-ID: <20090620160611.20273n7qqnh4jd4j at www.sjolshagen.net> > Content-Type: text/plain; charset=ISO-8859-1; DelSp="Yes"; > format="flowed" > > Hi, > > I'm trying to do a migration to another member of the cluster but it > fails with: > > # clusvcadm -M samba -m host2 > Trying to migrate service:samba to > virt0-backup.sjolshagen.net...Invalid operation for resource > > I'm running Fedora 11 with rgmanager-3.0.0-15.rc1.fc11.x86_64 > installed as well as a downloaded copy of the 5/21 version of vm.sh > from the git.fedorahosted.org repository. I've configured the resource > as follows: > > > > recovery="relocate" snapshot="/cluster/kvm-guests/snapshots" > use_virsh="1" exclusive="1" hypervisor="qemu" > migration_mapping="host1:host2,host2:host1" > hypervisor_uri="qemu+ssh:///system" /> > > > > > > > And the guest container files are hosted on a (previously mounted) > gfs2 file system. > > Is this a rgmanager shortcoming (rgmanager needs to be coded to > support virsh & live migration) or - more likely - user error? > > Thanks in advance > // Thomas > > ---------------------------------------------------------------- Are you sure you don't mean to relocate the service to host2 instead of Migrate? -R will stop/start a service like samba onto another node. clusvcadm -r -m I wasn't aware that migration worked for anything other than moving VM's around. Henry -------------- next part -------------- An HTML attachment was scrubbed... URL: From thomas at sjolshagen.net Mon Jun 22 13:43:43 2009 From: thomas at sjolshagen.net (Thomas Sjolshagen) Date: Mon, 22 Jun 2009 09:43:43 -0400 Subject: [Linux-cluster] Re: clusvcadm -M -m fails with "Invalid operation for resource" In-Reply-To: References: Message-ID: <20090622094343.10362qjp3x4g1jtb@www.sjolshagen.net> Quoting Henry Robertson : > Today's Topics: >> >> 1. clusvcadm -M -m fails with "Invalid >> operation for resource" (Thomas Sjolshagen) .. > > > Are you sure you don't mean to relocate the service to host2 instead > of Migrate? -R will stop/start a service like samba onto another node. > clusvcadm -r -m > > I wasn't aware that migration worked for anything other than moving > VM's around. > > Henry > If you look at the resource definition, you'll see that I'm trying to migrate a VM (the KVM guest is called samba since it hosts a Samba instance). 
// Thomas ---------------------------------------------------------------- This message was sent using IMP, the Internet Messaging Program. From henry.robertson at hjrconsulting.com Mon Jun 22 22:12:07 2009 From: henry.robertson at hjrconsulting.com (Henry Robertson) Date: Mon, 22 Jun 2009 18:12:07 -0400 Subject: [Linux-cluster] Re: Linux-cluster Digest, Vol 62, Issue 21 In-Reply-To: <20090622160008.589096191B5@hormel.redhat.com> References: <20090622160008.589096191B5@hormel.redhat.com> Message-ID: Ah. Does manual migration through virsh work rather than clusvcadm? If it does -- I'd put rgmanager into some extra logging by editing /etc/init.d/rgmanager with RGMGR_OPTS="-dddd" under the RGMGRD part. Then restart rgmanager and check logs for more info after trying clusvcadm -M again. (add debug to host / target servers and see if you catch anything new) Good luck Henry -------------- next part -------------- An HTML attachment was scrubbed... URL: From thomas at sjolshagen.net Tue Jun 23 02:46:38 2009 From: thomas at sjolshagen.net (Thomas Sjolshagen) Date: Mon, 22 Jun 2009 22:46:38 -0400 Subject: [Linux-cluster] Re: Guest MIgration w/rgmanager - WAS: Linux-cluster Digest, Vol 62, Issue 21 In-Reply-To: References: <20090622160008.589096191B5@hormel.redhat.com> Message-ID: <20090622224638.20492l2x5u5mbpxq@www.sjolshagen.net> Quoting Henry Robertson : > On Mon, Jun 22, 2009 at 12:00 PM, wrote: > ... > > Ah. Does manual migration through virsh work rather than clusvcadm? virsh migrate --live qemu+ssh://2nd node/system Migrates the guest to the other cluster member w/no objections. > If it does -- I'd put rgmanager into some extra logging by editing > /etc/init.d/rgmanager with RGMGR_OPTS="-dddd" under the RGMGRD part. Added "-dddd" to /etc/sysconfig/rgmanager, restarted rgmanager and verified that rgmanager is running w/the option set. I do not see any increase in logging from the default setting of "-d"? > Then restart rgmanager and check logs for more info after trying clusvcadm > -M again. (add debug to host / target servers and see if you catch anything > new) Attempted another clusvcadm -M -m <2nd cluster node>, and I see nothing in either of the /var/log/cluster/rgmanager.log files. Not even that the operation was attempted?!? // Thomas ---------------------------------------------------------------- This message was sent using IMP, the Internet Messaging Program.
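The XML of the vm resource in Thomas's first post was stripped by the list archiver, so only loose attribute fragments survive above. Purely as an illustration, a libvirt-managed guest that rgmanager can migrate is normally declared as a vm resource directly under <rm> in cluster.conf, roughly as below; the element placement, the resource name "samba" and the migrate attribute are reconstructions from those fragments and from the vm.sh agent's documented parameters, not Thomas's actual configuration:

  <rm>
    <vm name="samba" use_virsh="1" hypervisor="qemu"
        hypervisor_uri="qemu+ssh:///system"
        migration_mapping="host1:host2,host2:host1"
        migrate="live" recovery="relocate"
        snapshot="/cluster/kvm-guests/snapshots" exclusive="1"/>
  </rm>

For what it's worth, clusvcadm only knows how to migrate vm-class resources, and the "Trying to migrate service:samba" output shows it operating on service:samba, either because the guest really is wrapped in a <service> block or simply because service: is the default prefix when none is given. Against a plain <vm/> definition of the shape above, "clusvcadm -M vm:samba -m host2" is the form that should reach vm.sh's migrate path.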
From mrugeshkarnik at gmail.com Tue Jun 23 07:01:25 2009 From: mrugeshkarnik at gmail.com (Mrugesh Karnik) Date: Tue, 23 Jun 2009 12:31:25 +0530 Subject: [Linux-cluster] Re: clusvcadm -M -m fails with "Invalid operation for resource" In-Reply-To: <20090622094343.10362qjp3x4g1jtb@www.sjolshagen.net> References: <20090622094343.10362qjp3x4g1jtb@www.sjolshagen.net> Message-ID: <200906231231.25911.mrugeshkarnik@gmail.com> On Monday 22 Jun 2009 19:13:43 Thomas Sjolshagen wrote: > If you look at the resource definition, you'll see that I'm trying to > migrate a VM (the KVM guest is called samba since it hosts a Samba > instance). Is migration supported on KVM? I've tried it with Xen and works fine. The only gotcha was that the `nx' flag on the CPU needed to be available. Mrugesh From ironludo at free.fr Tue Jun 23 09:41:57 2009 From: ironludo at free.fr (LEROUX Ludovic) Date: Tue, 23 Jun 2009 11:41:57 +0200 Subject: [Linux-cluster] redhat cluster installation Message-ID: <8DF9888392AA48D5960BE531138BB981@siim94.local> hello all. I installed two redhat hat 5.2 servers with redhat cluster suite option. When i want to create a cluster with luci i got an error message: An error occurred when trying to contact any of the nodes in the rh-cluster cluster. Do you have any ideas? thanks. Ludovic -------------- next part -------------- An HTML attachment was scrubbed... URL: From reggaestar at gmail.com Tue Jun 23 09:48:42 2009 From: reggaestar at gmail.com (remi doubi) Date: Tue, 23 Jun 2009 09:48:42 +0000 Subject: [Linux-cluster] redhat cluster installation In-Reply-To: <8DF9888392AA48D5960BE531138BB981@siim94.local> References: <8DF9888392AA48D5960BE531138BB981@siim94.local> Message-ID: <3c88c73a0906230248r17c0ec6ct151a484115682e01@mail.gmail.com> is the ricci agent started on the two nodes ?? 2009/6/23 LEROUX Ludovic > hello all. > I installed two redhat hat 5.2 servers with redhat cluster suite option. > When i want to create a cluster with luci i got an error message: *An > error occurred when trying to contact any of the nodes in the rh-cluster > cluster.* > Do you have any ideas? > thanks. > Ludovic > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From amrossi at linux.it Tue Jun 23 09:51:17 2009 From: amrossi at linux.it (Andrea Modesto Rossi) Date: Tue, 23 Jun 2009 11:51:17 +0200 (CEST) Subject: [Linux-cluster] redhat cluster installation In-Reply-To: <8DF9888392AA48D5960BE531138BB981@siim94.local> References: <8DF9888392AA48D5960BE531138BB981@siim94.local> Message-ID: <38070.82.105.99.92.1245750677.squirrel@picard.linux.it> On Mar, 23 Giugno 2009 11:41 am, LEROUX Ludovic wrote: > hello all. > I installed two redhat hat 5.2 servers with redhat cluster suite option. > When i want to create a cluster with luci i got an error message: An error > occurred when trying to contact any of the nodes in the rh-cluster > cluster. > Do you have any ideas? hello, is /etc/hosts configured properly? try with IP address instead of the hostname. -- Andrea Modesto Rossi Fedora Ambassador +---------------------------------------------------------------------+ | Bello. Che gli diciamo? Che sono tutti stronzi monopolisti di merda,| | con i loro protocolli brevettati e i loro driver finestrosi? | | Ci sono! 
| | Alessandro Rubini | +---------------------------------------------------------------------+ From thomas at sjolshagen.net Tue Jun 23 12:07:20 2009 From: thomas at sjolshagen.net (Thomas Sjolshagen) Date: Tue, 23 Jun 2009 08:07:20 -0400 Subject: [Linux-cluster] Re: clusvcadm -M -m fails with "Invalid operation for resource" In-Reply-To: <200906231231.25911.mrugeshkarnik@gmail.com> References: <20090622094343.10362qjp3x4g1jtb@www.sjolshagen.net> <200906231231.25911.mrugeshkarnik@gmail.com> Message-ID: <20090623080720.9695183ennxm3r08@www.sjolshagen.net> Quoting Mrugesh Karnik : > On Monday 22 Jun 2009 19:13:43 Thomas Sjolshagen wrote: >> If you look at the resource definition, you'll see that I'm trying to >> migrate a VM (the KVM guest is called samba since it hosts a Samba >> instance). > > Is migration supported on KVM? I've tried it with Xen and works > fine. The only > gotcha was that the `nx' flag on the CPU needed to be available. > Yes, KVM supports both migration & live migration and with KVM-8* and libvirt 0.6.4 you can use "virsh migrate --live" to move running guests between nodes (live migrate). This fact is reflected in the upstream (git repo) version of the vm.sh resource script, but it seems like something - rgmanager itself? - is blocking it from even trying. //Thomas ---------------------------------------------------------------- This message was sent using IMP, the Internet Messaging Program. From mech at meteo.uni-koeln.de Tue Jun 23 14:32:16 2009 From: mech at meteo.uni-koeln.de (Mario Mech) Date: Tue, 23 Jun 2009 16:32:16 +0200 Subject: [Linux-cluster] system-config-cluster, secure, and fence_drac5 Message-ID: <4A40E770.9050001@meteo.uni-koeln.de> Hi all, I'm configuring a FailOver-cluster with CentOS 5.3 on two Dell PowerEdge 2950 with DRAC5 cards. The basic configuration with system-config-cluster worked fine after enabling telnet on the DRACs and I got the cluster running. Then I switched to ssh by manually editing the cluster.conf file, since system-config-cluster is not aware of fence_drac5. Unfortunately now the cluster.conf file is not readable anymore by system-config-cluster. I still want to use sys-con-clu, since there is still much to configure (services, failover-domains,....). Except of using telnet and fence_drac until the end of the configuration process, I have no other idea how to manage that. DOes anyone know how to include fence_drac5 and the secure="1" attribute in cluster.conf and still using sys-con-clu? All best Mario P.S. Is secure="1" in the right place? cluster.conf: -- Dr. Mario Mech Institute for Geophysics and Meteorology University of Cologne Zuelpicherstr. 
49a 50674 Cologne Germany t: +49 (0)221 - 470 - 1776 f: +49 (0)221 - 470 - 5198 e: mech at meteo.uni-koeln.de w: http://www.meteo.uni-koeln.de/~mmech/
From cfeist at redhat.com Wed Jun 24 13:56:03 2009 From: cfeist at redhat.com (Chris Feist) Date: Wed, 24 Jun 2009 09:56:03 -0400 (EDT) Subject: [Linux-cluster] Will redhat release the srpms of cluster suite for rhel-4.8 to the public??? In-Reply-To: <606164656.436501245851720923.JavaMail.root@zmail04.collab.prod.int.phx2.redhat.com> Message-ID: <1414381385.436541245851763590.JavaMail.root@zmail04.collab.prod.int.phx2.redhat.com> ----- "Aliet Santiesteban Sifontes" wrote: > Hi, just wondering if redhat will release the srpms for the cluster > suite updated for rhel-4.8???, I have been looking for it in redhat > ftp site, but can not find it. You should be able to find the 4.8 RHCS srpms here: /pub/redhat/linux/updates/enterprise/4AS/en/RHCS/SRPMS Let me know if anything appears to be missing. THanks, Chris > Any ideas?? > Best regards > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From mgrac at redhat.com Wed Jun 24 14:55:54 2009 From: mgrac at redhat.com (=?UTF-8?B?TWFyZWsgJ21hcngnIEdyw6Fj?=) Date: Wed, 24 Jun 2009 16:55:54 +0200 Subject: [Linux-cluster] fencing Cisco MDS 9134 w/ RHEL5 In-Reply-To: <4A3B9D22.4080908@sph.emory.edu> References: <4A3B9D22.4080908@sph.emory.edu> Message-ID: <4A423E7A.7000006@redhat.com> Hi, Vernard C. Martin wrote: > I can't seem to find any evidence that this fiber switch has a fencing > agent for RHEL4. There seems to be some documentation of it being > supported in RHEL 5.4. > > Is it reasonable to just port the agent or am I missing some technical > detail that the agent requires that is in the newer kernel? Agents for RHEL 5.4 "should" work also on RHEL 4 but you will have to copy agent together with fencing library (fencing.py and fencing_snmp.py). m, From esggrupos at gmail.com Thu Jun 25 08:06:22 2009 From: esggrupos at gmail.com (ESGLinux) Date: Thu, 25 Jun 2009 10:06:22 +0200 Subject: [Linux-cluster] using lacie ethernet disk raid as shared storage Message-ID: <3128ba140906250106h484d1647r2c8ab5c4ba439c5@mail.gmail.com> Hi all, I have a customer with has a Lacie Ethernet Disk RAID and wants to use it as a shared storage to use in a HA cluster. Which can be the best approach to use this kind of storage?
(iscsi, gnbd, nfs... ???) At first I thought I couldn't use it, but I can't believe that I can't do something with it. Any suggestion? Thanks in advance. ESG -------------- next part -------------- An HTML attachment was scrubbed... URL: From gordan at bobich.net Thu Jun 25 10:12:03 2009 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 25 Jun 2009 11:12:03 +0100 Subject: [Linux-cluster] using lacie ethernet disk raid as shared storage In-Reply-To: <3128ba140906250106h484d1647r2c8ab5c4ba439c5@mail.gmail.com> References: <3128ba140906250106h484d1647r2c8ab5c4ba439c5@mail.gmail.com> Message-ID: <0b3ac5a839d5c805cb88fc592d7dc3e3@localhost> On Thu, 25 Jun 2009 10:06:22 +0200, ESGLinux wrote: > Hi all, > I have a customer with has a Lacie Ethernet Disk RAID and wants to use it > as a shared storage to use in a HA cluster. > > Which can be the best approach to use this kind of storage? (iscsi, gnbd, > nfs... ???) I don't imagine for a moment that any of those would be supported, considering the target audience is unlikely to ever have heard of those protocols. It's likely to give you SMB/CIFS and nothing else. There's no reason why you couldn't use it for shared storage, but that is in no way related to RHCS. Also remember that a single SAN/NAS of whatever description is still a single point of failure, which makes a mockery of the concept of HA. This is also (shockingly) a point (willfully) overlooked by most administrators and architects. From esggrupos at gmail.com Thu Jun 25 10:38:04 2009 From: esggrupos at gmail.com (ESGLinux) Date: Thu, 25 Jun 2009 12:38:04 +0200 Subject: [Linux-cluster] using lacie ethernet disk raid as shared storage In-Reply-To: <0b3ac5a839d5c805cb88fc592d7dc3e3@localhost> References: <3128ba140906250106h484d1647r2c8ab5c4ba439c5@mail.gmail.com> <0b3ac5a839d5c805cb88fc592d7dc3e3@localhost> Message-ID: <3128ba140906250338h63ea6e64k7bb627ed1884d8e9@mail.gmail.com> Hi Gordan, thanks for your answer, I can mount this disk with NFS (also with CIFS, but I'm not using this protocol). My idea was to mount the disk with NFS on the 2 nodes of a Red Hat cluster, but I don't know if it is a good idea. (perhaps no :-( ) The cluster is going to serve an HA httpd service (I know, with this disk I have a SPOF, but that is all I have, no money for more :-(( ) Any suggestion with this scenario? Thanks again, ESG 2009/6/25 Gordan Bobic > On Thu, 25 Jun 2009 10:06:22 +0200, ESGLinux wrote: > > Hi all, > > I have a customer with has a Lacie Ethernet Disk RAID and wants to use it > > as a shared storage to use in a HA cluster. > > > > Which can be the best approach to use this kind of storage? (iscsi, gnbd, > > nfs... ???) > > I don't imagine for a moment that any of those would be supported, > considering the target audience is unlikely to ever have heard of those > protocols. It's likely to give you SMB/CIFS and nothing else. There's no > reason why you couldn't use it for shared storage, but that is in no way > related to RHCS. > > Also remember that a single SAN/NAS of whatever description is still a > single point of failure, which makes a mockery of the concept of HA. This > is also (shockingly) a point (willfully) overlooked by most administrators > and architects. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From gordan at bobich.net Thu Jun 25 11:04:57 2009 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 25 Jun 2009 12:04:57 +0100 Subject: [Linux-cluster] using lacie ethernet disk raid as shared storage In-Reply-To: <3128ba140906250338h63ea6e64k7bb627ed1884d8e9@mail.gmail.com> References: <3128ba140906250106h484d1647r2c8ab5c4ba439c5@mail.gmail.com> <0b3ac5a839d5c805cb88fc592d7dc3e3@localhost> <3128ba140906250338h63ea6e64k7bb627ed1884d8e9@mail.gmail.com> Message-ID: <1ea88057f3ad37b08df6a7e8798465e7@localhost> On Thu, 25 Jun 2009 12:38:04 +0200, ESGLinux wrote: > I can mount this disk with NFS (also with CIFS but I?m not using this > protocol) > > My idea was to mount the disk with NFS on the 2 nodes of a red hat cluster, > but I don?t know if it is a good idea. (perhaps no :-( ) There's no reason why you couldn't or shouldn't do this. If all you want is some shared storage and don't care about the single point of failure, then this is exactly what the device was intended for. :) > The cluster are going to serve a HA httpd service (I know, with this disk I > have a SPOF, but that is all I have, no money for more .-(( ) > > any suggestion with this scenario? It should "just work" as you described. NFS mount it on both nodes and point Apache at it as per usual. It'll probably work faster than a clustered file system solution. For redundancy, however, if you have enough disk space on the web nodes, you could set up mirrored storage using DRBD and run GFS on top of that. You'd end up with full redundancy and no need for the NAS (assuming, as I said, that nodes have enough space). Note that fencing would be absolutely mandatory if you use GFS or else either node failing would halt the cluster to prevent data corruption. Gordan From esggrupos at gmail.com Thu Jun 25 11:15:56 2009 From: esggrupos at gmail.com (ESGLinux) Date: Thu, 25 Jun 2009 13:15:56 +0200 Subject: [Linux-cluster] using lacie ethernet disk raid as shared storage In-Reply-To: <1ea88057f3ad37b08df6a7e8798465e7@localhost> References: <3128ba140906250106h484d1647r2c8ab5c4ba439c5@mail.gmail.com> <0b3ac5a839d5c805cb88fc592d7dc3e3@localhost> <3128ba140906250338h63ea6e64k7bb627ed1884d8e9@mail.gmail.com> <1ea88057f3ad37b08df6a7e8798465e7@localhost> Message-ID: <3128ba140906250415q6796416dy76520515a2cbfed2@mail.gmail.com> 2009/6/25 Gordan Bobic > On Thu, 25 Jun 2009 12:38:04 +0200, ESGLinux wrote: > > > I can mount this disk with NFS (also with CIFS but I?m not using this > > protocol) > > > > My idea was to mount the disk with NFS on the 2 nodes of a red hat > cluster, > > but I don?t know if it is a good idea. (perhaps no :-( ) > > There's no reason why you couldn't or shouldn't do this. If all you want is > some shared storage and don't care about the single point of failure, then > this is exactly what the device was intended for. :) ok, I?m always afraid with data corruption and thougth I will have problems with this, but If you think that there is not problem I?ll folow your advice ( at my own risk of course, ;-) > > > > The cluster are going to serve a HA httpd service (I know, with this disk > I > > have a SPOF, but that is all I have, no money for more .-(( ) > > > > any suggestion with this scenario? > > It should "just work" as you described. NFS mount it on both nodes and > point Apache at it as per usual. It'll probably work faster than a > clustered file system solution. For redundancy, however, if you have enough > disk space on the web nodes, you could set up mirrored storage using DRBD > and run GFS on top of that. 
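For a rough idea of what the DRBD piece of that suggestion involves, a two-node, dual-primary resource (dual-primary is what allows GFS to be mounted on both nodes at once) looks roughly like the sketch below. This is only an illustration against DRBD 8.x syntax: the resource name r0, the hostnames node1/node2, the backing disk /dev/sdb1 and the addresses are placeholders, not anything taken from this thread.

resource r0 {
  protocol C;
  net {
    allow-two-primaries;      # required if GFS is mounted on both nodes
  }
  startup {
    become-primary-on both;
  }
  on node1 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   192.168.0.1:7788;
    meta-disk internal;
  }
  on node2 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   192.168.0.2:7788;
    meta-disk internal;
  }
}

# run on both nodes, then force the initial sync from one side only:
drbdadm create-md r0
drbdadm up r0
drbdadm -- --overwrite-data-of-peer primary r0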
You'd end up with full redundancy and no need > for the NAS (assuming, as I said, that nodes have enough space). Note that > fencing would be absolutely mandatory if you use GFS or else either node > failing would halt the cluster to prevent data corruption. > I was allways looking for an oportunity to test DRBD. I think now is the moment. My reference web about DRBD is http://www.drbd.org/, any advice, read, before I begin to test it? ESG > > Gordan > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gordan at bobich.net Thu Jun 25 12:01:16 2009 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 25 Jun 2009 13:01:16 +0100 Subject: [Linux-cluster] using lacie ethernet disk raid as shared storage In-Reply-To: <3128ba140906250415q6796416dy76520515a2cbfed2@mail.gmail.com> References: <3128ba140906250106h484d1647r2c8ab5c4ba439c5@mail.gmail.com> <0b3ac5a839d5c805cb88fc592d7dc3e3@localhost> <3128ba140906250338h63ea6e64k7bb627ed1884d8e9@mail.gmail.com> <1ea88057f3ad37b08df6a7e8798465e7@localhost> <3128ba140906250415q6796416dy76520515a2cbfed2@mail.gmail.com> Message-ID: On Thu, 25 Jun 2009 13:15:56 +0200, ESGLinux wrote: > 2009/6/25 Gordan Bobic > >> On Thu, 25 Jun 2009 12:38:04 +0200, ESGLinux wrote: >> >> > I can mount this disk with NFS (also with CIFS but I?m not using this >> > protocol) >> > >> > My idea was to mount the disk with NFS on the 2 nodes of a red hat >> cluster, >> > but I don?t know if it is a good idea. (perhaps no :-( ) >> >> There's no reason why you couldn't or shouldn't do this. If all you want >> is >> some shared storage and don't care about the single point of failure, >> then >> this is exactly what the device was intended for. :) > > > ok, I?m always afraid with data corruption and thougth I will have > problems with this, but If you think that there is not problem I?ll folow your > advice ( at my own risk of course, ;-) NFS is designed for concurrent access, it shouldn't cause corruption. And anyway, your apache web data is likely to be read-only in most cases anyway. Don't put things like database files into shared access areas, though - that generally won't work, and even when it does, performance will be appalling. >> > The cluster are going to serve a HA httpd service (I know, with this >> > disk I >> > have a SPOF, but that is all I have, no money for more .-(( ) >> > >> > any suggestion with this scenario? >> >> It should "just work" as you described. NFS mount it on both nodes and >> point Apache at it as per usual. It'll probably work faster than a >> clustered file system solution. For redundancy, however, if you have >> enough >> disk space on the web nodes, you could set up mirrored storage using DRBD >> and run GFS on top of that. You'd end up with full redundancy and no need >> for the NAS (assuming, as I said, that nodes have enough space). Note >> that >> fencing would be absolutely mandatory if you use GFS or else either node >> failing would halt the cluster to prevent data corruption. >> > > I was allways looking for an oportunity to test DRBD. I think now is the > moment. My reference web about DRBD is http://www.drbd.org/, any advice, > read, before I begin to test it? That is, indeed, the right site. Stick to the docs, they are pretty good. If you are going with this solution, you may also want to look into Open Shared Root (http://www.open-sharedroot.org/). 
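To put GFS on top of such a mirrored device, the filesystem is created with the cluster lock manager and one journal per node. A minimal sketch, assuming the hypothetical /dev/drbd0 resource above, a cluster named mycluster in cluster.conf, and a two-node web tier:

gfs_mkfs -p lock_dlm -t mycluster:web -j 2 /dev/drbd0
mount -t gfs /dev/drbd0 /var/www/html     # on both nodes, once DRBD is primary on both

(With GFS2 the equivalent would be mkfs.gfs2 and mounting with -t gfs2.) Open Shared Root then takes the same DRBD-plus-GFS stack one step further and puts the root filesystem itself on it.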
It should save you some admin overhead since you can get away with using a single root fs for multiple nodes. Just make sure your fencing works. But if you are new to clustering, you may not want to dive straight into OSR - there are potential pitfalls that aren't always entirely obvious. There are mailing lists for both DRBD and OSR, so if you run into problems and the docs don't provide an obvious answer, you can always ask there. Gordan From xavier.montagutelli at unilim.fr Thu Jun 25 12:21:57 2009 From: xavier.montagutelli at unilim.fr (Xavier Montagutelli) Date: Thu, 25 Jun 2009 14:21:57 +0200 Subject: [Linux-cluster] using lacie ethernet disk raid as shared storage In-Reply-To: <3128ba140906250338h63ea6e64k7bb627ed1884d8e9@mail.gmail.com> References: <3128ba140906250106h484d1647r2c8ab5c4ba439c5@mail.gmail.com> <0b3ac5a839d5c805cb88fc592d7dc3e3@localhost> <3128ba140906250338h63ea6e64k7bb627ed1884d8e9@mail.gmail.com> Message-ID: <200906251421.57758.xavier.montagutelli@unilim.fr> On Thursday 25 June 2009 12:38:04 ESGLinux wrote: > Hi Gordan, > thanks for your answer, > > I can mount this disk with NFS (also with CIFS but I?m not using this > protocol) > > My idea was to mount the disk with NFS on the 2 nodes of a red hat cluster, > but I don?t know if it is a good idea. (perhaps no :-( ) > > The cluster are going to serve a HA httpd service (I know, with this disk I > have a SPOF, but that is all I have, no money for more .-(( ) > > any suggestion with this scenario? I *know* that's not your question, but have you think about using local disks on each server, with DRBD for the replication ? This would eliminate the SPOF and it's still cost effective (perhaps more than one lassie NAS ... ? I don't know). > > Thanks again, > > ESG > > > 2009/6/25 Gordan Bobic > > > On Thu, 25 Jun 2009 10:06:22 +0200, ESGLinux wrote: > > > Hi all, > > > I have a customer with has a Lacie Ethernet Disk RAID and wants to use > > > it as a shared storage to use in a HA cluster. > > > > > > Which can be the best approach to use this kind of storage? (iscsi, > > > gnbd, nfs... ???) > > > > I don't imagine for a moment that any of those would be supported, > > considering the target audience is unlikely to ever have heard of those > > protocols. It's likely to give you SMB/CIFS and nothing else. There's no > > reason why you couldn't use it for shared storage, but that is in no way > > related to RHCS. > > > > Also remember that a single SAN/NAS of whatever description is still a > > single point of failure, which makes a mockery of the concept of HA. This > > is also (shockingly) a point (willfully) overlooked by most > > administrators and architects. 
> > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster -- Xavier Montagutelli Tel : +33 (0)5 55 45 77 20 Service Commun Informatique Fax : +33 (0)5 55 45 75 95 Universite de Limoges 123, avenue Albert Thomas 87060 Limoges cedex From jeff.sturm at eprize.com Thu Jun 25 13:40:51 2009 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Thu, 25 Jun 2009 09:40:51 -0400 Subject: [Linux-cluster] using lacie ethernet disk raid as shared storage In-Reply-To: References: <3128ba140906250106h484d1647r2c8ab5c4ba439c5@mail.gmail.com> <0b3ac5a839d5c805cb88fc592d7dc3e3@localhost> <3128ba140906250338h63ea6e64k7bb627ed1884d8e9@mail.gmail.com> <1ea88057f3ad37b08df6a7e8798465e7@localhost><3128ba140906250415q6796416dy76520515a2cbfed2@mail.gmail.com> Message-ID: <64D0546C5EBBD147B75DE133D798665F02FDC10B@hugo.eprize.local> > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] > On Behalf Of Gordan Bobic > Sent: Thursday, June 25, 2009 8:01 AM > To: linux clustering > Subject: Re: [Linux-cluster] using lacie ethernet disk raid as shared storage > > On Thu, 25 Jun 2009 13:15:56 +0200, ESGLinux wrote: > > ok, I?m always afraid with data corruption and thougth I will have > > problems with this, but If you think that there is not problem I?ll > folow your > > advice ( at my own risk of course, ;-) > > NFS is designed for concurrent access, it shouldn't cause corruption. And > anyway, your apache web data is likely to be read-only in most cases > anyway. Don't put things like database files into shared access areas, > though - that generally won't work, and even when it does, performance will > be appalling. Or if you still want the redundancy of RHCS and go the DRBD route, you can always use the shared device for backups. (That's the ONLY thing I use NFS for these days.) As a plus, you won't have to tell your customer he can't use his NAS appliance :) Jeff From gordan at bobich.net Thu Jun 25 13:51:17 2009 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 25 Jun 2009 14:51:17 +0100 Subject: [Linux-cluster] using lacie ethernet disk raid as shared storage In-Reply-To: <64D0546C5EBBD147B75DE133D798665F02FDC10B@hugo.eprize.local> References: <3128ba140906250106h484d1647r2c8ab5c4ba439c5@mail.gmail.com> <0b3ac5a839d5c805cb88fc592d7dc3e3@localhost> <3128ba140906250338h63ea6e64k7bb627ed1884d8e9@mail.gmail.com> <1ea88057f3ad37b08df6a7e8798465e7@localhost><3128ba140906250415q6796416dy76520515a2cbfed2@mail.gmail.com> <64D0546C5EBBD147B75DE133D798665F02FDC10B@hugo.eprize.local> Message-ID: <308de57a27fd930e7041defe5a672371@localhost> On Thu, 25 Jun 2009 09:40:51 -0400, Jeff Sturm wrote: >> On Thu, 25 Jun 2009 13:15:56 +0200, ESGLinux wrote: >> > ok, I?m always afraid with data corruption and thougth I will have >> > problems with this, but If you think that there is not problem I?ll >> > folow your advice ( at my own risk of course, ;-) >> >> NFS is designed for concurrent access, it shouldn't cause corruption. And >> anyway, your apache web data is likely to be read-only in most cases >> anyway. Don't put things like database files into shared access areas, >> though - that generally won't work, and even when it does, performance >> will >> be appalling. > > Or if you still want the redundancy of RHCS and go the DRBD route, you can > always use the shared device for backups. (That's the ONLY thing I use NFS > for these days.) 
Don't underestimate NFS performance for heavily concurrent I/O with a significant write load on lots of small file from multiple nodes. There are things for which NFS is a better solution. Gordan From esggrupos at gmail.com Fri Jun 26 07:17:18 2009 From: esggrupos at gmail.com (ESGLinux) Date: Fri, 26 Jun 2009 09:17:18 +0200 Subject: [Linux-cluster] using lacie ethernet disk raid as shared storage In-Reply-To: <308de57a27fd930e7041defe5a672371@localhost> References: <3128ba140906250106h484d1647r2c8ab5c4ba439c5@mail.gmail.com> <0b3ac5a839d5c805cb88fc592d7dc3e3@localhost> <3128ba140906250338h63ea6e64k7bb627ed1884d8e9@mail.gmail.com> <1ea88057f3ad37b08df6a7e8798465e7@localhost> <3128ba140906250415q6796416dy76520515a2cbfed2@mail.gmail.com> <64D0546C5EBBD147B75DE133D798665F02FDC10B@hugo.eprize.local> <308de57a27fd930e7041defe5a672371@localhost> Message-ID: <3128ba140906260017i11b66264o1d848b4f9e3e4ff2@mail.gmail.com> Thanks all for your answers I?m going to try DRBD with all the indications you gave me. I?m going to spend a good summer time. ;-) ESG 2009/6/25 Gordan Bobic > On Thu, 25 Jun 2009 09:40:51 -0400, Jeff Sturm > wrote: > > >> On Thu, 25 Jun 2009 13:15:56 +0200, ESGLinux > wrote: > >> > ok, I?m always afraid with data corruption and thougth I will have > >> > problems with this, but If you think that there is not problem I?ll > >> > folow your advice ( at my own risk of course, ;-) > >> > >> NFS is designed for concurrent access, it shouldn't cause corruption. > And > >> anyway, your apache web data is likely to be read-only in most cases > >> anyway. Don't put things like database files into shared access areas, > >> though - that generally won't work, and even when it does, performance > >> will > >> be appalling. > > > > Or if you still want the redundancy of RHCS and go the DRBD route, you > can > > always use the shared device for backups. (That's the ONLY thing I use > NFS > > for these days.) > > Don't underestimate NFS performance for heavily concurrent I/O with a > significant write load on lots of small file from multiple nodes. There are > things for which NFS is a better solution. > > Gordan > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From esggrupos at gmail.com Mon Jun 29 09:38:39 2009 From: esggrupos at gmail.com (ESGLinux) Date: Mon, 29 Jun 2009 11:38:39 +0200 Subject: [Linux-cluster] quorum disk size recommedation Message-ID: <3128ba140906290238l3e98aac2s2b5ac2bf97418307@mail.gmail.com> Hi all, I?m planning a 2 nodes cluster and I?m going to use quorum disk. My question is which is the best size of this kind of disk. It will be interesting to explain how calculate this size, Thanks in advance ESG -------------- next part -------------- An HTML attachment was scrubbed... URL: From harri.paivaniemi at tieto.com Mon Jun 29 09:43:18 2009 From: harri.paivaniemi at tieto.com (=?iso-8859-1?q?H=2EP=E4iv=E4niemi?=) Date: Mon, 29 Jun 2009 12:43:18 +0300 Subject: [Linux-cluster] quorum disk size recommedation In-Reply-To: <3128ba140906290238l3e98aac2s2b5ac2bf97418307@mail.gmail.com> References: <3128ba140906290238l3e98aac2s2b5ac2bf97418307@mail.gmail.com> Message-ID: <200906291243.18175.harri.paivaniemi@tieto.com> http://sources.redhat.com/cluster/wiki/FAQ/CMAN#quorumdisksize What's the minimum size of a quorum disk/partition? The official answer is 10MB. 
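For context, initialising a quorum partition of roughly that size is a one-liner with mkqdisk; the device and label below are only examples, and the label simply has to match what the <quorumd> entry in cluster.conf refers to:

mkqdisk -c /dev/sdc1 -l myqdisk    # write quorum disk structures and label the partition
mkqdisk -L                         # list the quorum disks each node can see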
The real number is something like 100KB, but we'd like to reserve 10MB for possible future expansion and features. -hjp On Monday 29 June 2009 12:38:39 ESGLinux wrote: > Hi all, > > I?m planning a 2 nodes cluster and I?m going to use quorum disk. My > question is which is the best size of this kind of disk. It will be > interesting to explain how calculate this size, > > Thanks in advance > > ESG From esggrupos at gmail.com Mon Jun 29 09:48:29 2009 From: esggrupos at gmail.com (ESGLinux) Date: Mon, 29 Jun 2009 11:48:29 +0200 Subject: [Linux-cluster] quorum disk size recommedation In-Reply-To: <200906291243.18175.harri.paivaniemi@tieto.com> References: <3128ba140906290238l3e98aac2s2b5ac2bf97418307@mail.gmail.com> <200906291243.18175.harri.paivaniemi@tieto.com> Message-ID: <3128ba140906290248q620ad560m8700f65ab0bd63d8@mail.gmail.com> hi, Thanks for your quick answer. Just for curiosity, why this size? and with 10 MB, what happens if you need more? (the question is why can you need more? perhaps 1000 nodes? or it doesnt matter) Greetings, ESG 2009/6/29 H.P?iv?niemi > > http://sources.redhat.com/cluster/wiki/FAQ/CMAN#quorumdisksize > > What's the minimum size of a quorum disk/partition? > > The official answer is 10MB. The real number is something like 100KB, but > we'd like to reserve 10MB for possible > future expansion and features. > > > -hjp > > > > On Monday 29 June 2009 12:38:39 ESGLinux wrote: > > Hi all, > > > > I?m planning a 2 nodes cluster and I?m going to use quorum disk. My > > question is which is the best size of this kind of disk. It will be > > interesting to explain how calculate this size, > > > > Thanks in advance > > > > ESG > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From agx at sigxcpu.org Mon Jun 29 18:48:48 2009 From: agx at sigxcpu.org (Guido =?iso-8859-1?Q?G=FCnther?=) Date: Mon, 29 Jun 2009 20:48:48 +0200 Subject: [Linux-cluster] Cluster 3.0.0.rc3 release In-Reply-To: <1245496789.3665.328.camel@cerberus.int.fabbione.net> References: <1245496789.3665.328.camel@cerberus.int.fabbione.net> Message-ID: <20090629184848.GA25796@bogon.sigxcpu.org> Hi Fabione, Thanks for rolling this rc candidate! On Sat, Jun 20, 2009 at 01:19:49PM +0200, Fabio M. Di Nitto wrote: [..snip..] > In order to build the 3.0.0.rc3 release you will need: > > - corosync 0.98 > - openais 0.97 We used these without any patches. > - linux kernel 2.6.29 We were running against 2.6.30. 
We observed these issues: fenced segfaults with: (gdb) bt #0 0x00007f8e293508fe in fence_node (victim=0x114b510 "node1.foo.bar", log=0x61e0a0, log_size=32, log_count=0x7fff2e46a634) at /var/home/schmitz/3/redhat-cluster/fence/libfence/agent.c:156 #1 0x000000000040c5cd in fence_victims (fd=0x114f270) at /var/home/schmitz/3/redhat-cluster/fence/fenced/recover.c:319 #2 0x0000000000405f27 in apply_changes (fd=0x114f270) at /var/home/schmitz/3/redhat-cluster/fence/fenced/cpg.c:1056 #3 0x00007f8e2914bcc1 in cpg_dispatch () from /usr/lib/libcpg.so.4 #4 0x0000000000404588 in process_fd_cpg (ci=4) at /var/home/schmitz/3/redhat-cluster/fence/fenced/cpg.c:1351 #5 0x000000000040b0f7 in main (argc=, argv=) at /var/home/schmitz/3/redhat-cluster/fence/fenced/main.c:818 this leads to 1246297857 fenced 3.0.0.rc3 started 1246297857 our_nodeid 1 our_name node2.foo.bar 1246297857 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/fenced.log 1246297857 found uncontrolled entry /sys/kernel/dlm/rgmanager when trying to restart fenced. Since this is not possible one has to reboot the node. We're also seeing: Jun 29 19:29:03 node2 kernel: [ 50.149855] dlm: no local IP address has been set Jun 29 19:29:03 node2 kernel: [ 50.150035] dlm: cannot start dlm lowcomms -107 from time to time. Stopping/starting via cman's init script (as from the Ubuntu package) several times makes this go away. Any ideas what causes this? Cheers, -- Guido From fdinitto at redhat.com Mon Jun 29 20:10:00 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Mon, 29 Jun 2009 22:10:00 +0200 Subject: [Linux-cluster] Cluster 3.0.0.rc3 release In-Reply-To: <20090629184848.GA25796@bogon.sigxcpu.org> References: <1245496789.3665.328.camel@cerberus.int.fabbione.net> <20090629184848.GA25796@bogon.sigxcpu.org> Message-ID: <1246306200.25867.86.camel@cerberus.int.fabbione.net> Hi Guido, On Mon, 2009-06-29 at 20:48 +0200, Guido G?nther wrote: > Hi Fabione, > Thanks for rolling this rc candidate! > > On Sat, Jun 20, 2009 at 01:19:49PM +0200, Fabio M. Di Nitto wrote: > [..snip..] > > In order to build the 3.0.0.rc3 release you will need: > > > > - corosync 0.98 > > - openais 0.97 > We used these without any patches. > > > - linux kernel 2.6.29 > We were running against 2.6.30. Shouldn't be a problem. You simply won't be able to build or use gfs1. > > We observed these issues: > > fenced segfaults with: > > (gdb) bt > #0 0x00007f8e293508fe in fence_node (victim=0x114b510 "node1.foo.bar", log=0x61e0a0, log_size=32, log_count=0x7fff2e46a634) at /var/home/schmitz/3/redhat-cluster/fence/libfence/agent.c:156 > #1 0x000000000040c5cd in fence_victims (fd=0x114f270) at /var/home/schmitz/3/redhat-cluster/fence/fenced/recover.c:319 > #2 0x0000000000405f27 in apply_changes (fd=0x114f270) at /var/home/schmitz/3/redhat-cluster/fence/fenced/cpg.c:1056 > #3 0x00007f8e2914bcc1 in cpg_dispatch () from /usr/lib/libcpg.so.4 #4 0x0000000000404588 in process_fd_cpg (ci=4) at /var/home/schmitz/3/redhat-cluster/fence/fenced/cpg.c:1351 #5 0x000000000040b0f7 in main (argc=, argv=) at /var/home/schmitz/3/redhat-cluster/fence/fenced/main.c:818 > > this leads to > > 1246297857 fenced 3.0.0.rc3 started > 1246297857 our_nodeid 1 our_name node2.foo.bar > 1246297857 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/fenced.log > 1246297857 found uncontrolled entry /sys/kernel/dlm/rgmanager It looks to me the node has not been shutdown properly and an attempt to restart it did fail. The fenced segfault shouldn't happen but I am CC'ing David. 
Maybe he has a better idea. > > when trying to restart fenced. Since this is not possible one has to > reboot the node. > > We're also seeing: > > Jun 29 19:29:03 node2 kernel: [ 50.149855] dlm: no local IP address has been set > Jun 29 19:29:03 node2 kernel: [ 50.150035] dlm: cannot start dlm lowcomms -107 hmm this looks like a bad configuration to me or bad startup. IIRC dlm kernel is configured via configfs and probably it was not mounted by the init script. > > from time to time. Stopping/starting via cman's init script (as from the > Ubuntu package) several times makes this go away. > > Any ideas what causes this? Could you please try to use our upstream init scripts? They work just fine (unchanged) in ubuntu/debian environment and they are for sure a lot more robust than the ones I originally wrote for Ubuntu many years ago. Could you also please summarize your setup and config? I assume you did the normal checks such as cman_tool status, cman_tool nodes and so on... The usual extra things I'd check are: - make sure the hostname doesn't resolve to localhost but to the real ip address of the cluster interface - cman_tool status - cman_tool nodes - Before starting any kind of service, such as rgmanager or gfs*, make sure that the fencing configuration is correct. Test by using fence_node $nodename. Cheers Fabio From brettcave at gmail.com Tue Jun 30 15:18:13 2009 From: brettcave at gmail.com (Brett Cave) Date: Tue, 30 Jun 2009 17:18:13 +0200 Subject: [Linux-cluster] increasing gfs size to add journals on existing file system Message-ID: Hi, I am trying to add an extra node to my GFS cluster, but dont have enough journals. I dont have any more free space to add journals (see this thread http://www.mail-archive.com/linux-cluster at redhat.com/msg05624.html ) What would be the best solution to use (I can increase the SAN vdisk which should allow me to resize, but wondering if there is another way). Regards, Brett -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpeterso at redhat.com Tue Jun 30 16:07:28 2009 From: rpeterso at redhat.com (Bob Peterson) Date: Tue, 30 Jun 2009 12:07:28 -0400 (EDT) Subject: [Linux-cluster] increasing gfs size to add journals on existing file system In-Reply-To: Message-ID: <451634528.646961246378048373.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> ----- "Brett Cave" wrote: | Hi, | | I am trying to add an extra node to my GFS cluster, but dont have | enough journals. I dont have any more free space to add journals (see | this thread | http://www.mail-archive.com/linux-cluster at redhat.com/msg05624.html ) | | What would be the best solution to use (I can increase the SAN vdisk | which should allow me to resize, but wondering if there is another | way). | | Regards, | Brett Hi Brett, That issue has always been a design problem with GFS. You need to increase the size of the device before doing gfs_jadd. Don't make the mistake of running gfs_grow immediately because that will consume your new storage for file system space and still leave you no room for any new journals. Only run gfs_grow after you've added the journals you need. We eliminated the problem in GFS2, so another option would be to use gfs2_convert to convert the file system to GFS2 and then use gfs2_jadd. Of course, GFS2 and gfs2_convert are still pretty new, so they carry a certain amount of risk, as with all new software. 
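To make that ordering concrete for the stay-on-GFS1 route: once the vdisk (and any CLVM logical volume on it) has been extended and rescanned, the journals go in first and the grow comes last. The mount point and journal count below are only an example:

gfs_jadd -Tv -j 1 /mnt/gfs    # dry run: check there is room for one more journal
gfs_jadd -j 1 /mnt/gfs        # add the journal for the new node
gfs_grow /mnt/gfs             # only now give the remaining new space to the filesystem
gfs_tool df /mnt/gfs          # verify journal count and free space

If the gfs2_convert route is taken instead, the journal headache goes away, but the caveats that follow apply.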
Some old versions of gfs2_convert had bad problems, so if you want to go this route, make sure you make a current backup before you do anything. Second, make sure you gfs_fsck before you convert so that your file system is consistent before running gfs2_convert. Third, make sure you have the latest and greatest gfs2_convert, so if you're on RHEL5.3, for example, make sure you've got all the latest z-stream updates. If you build from source, make sure you compile from the most recent source code. Regards, Bob Peterson Red Hat File Systems From tiagocruz at forumgdh.net Tue Jun 30 16:15:23 2009 From: tiagocruz at forumgdh.net (Tiago Cruz) Date: Tue, 30 Jun 2009 13:15:23 -0300 Subject: [Linux-cluster] Did you use GFS with witch technology? Message-ID: <1246378523.7787.12.camel@tuxkiller> Hello, guys.. please... I need to know a little thing: I'm using GFS v1 with ESX 3.5 and I'm not very happy :) High load from vms, freeze and quorum lost, for example. Did you use GFS and witch technology? KVM? Xen? VirtualBox? Not Virtual? Witch version are you using? v1 or v2? Are you a happy people using this? =) Thanks -- Tiago Cruz From andrew at ntsg.umt.edu Tue Jun 30 16:37:45 2009 From: andrew at ntsg.umt.edu (Andrew A. Neuschwander) Date: Tue, 30 Jun 2009 10:37:45 -0600 Subject: [Linux-cluster] Did you use GFS with witch technology? In-Reply-To: <1246378523.7787.12.camel@tuxkiller> References: <1246378523.7787.12.camel@tuxkiller> Message-ID: <4A4A3F59.1080200@ntsg.umt.edu> I'm using GFS1 with CentOS 5.3 on ESX 3.5 and I'm mostly happy with it. If you are using a non-tickless kernel (i.e. RHEL/CentOS 2.6.18-x) be sure you are using the tick divider kernel option on your VMs. Otherwise, you'll see high loads. -A -- Andrew A. Neuschwander, RHCE Systems/Software Engineer College of Forestry and Conservation The University of Montana http://www.ntsg.umt.edu andrew at ntsg.umt.edu - 406.243.6310 Tiago Cruz wrote: > Hello, guys.. please... I need to know a little thing: > > I'm using GFS v1 with ESX 3.5 and I'm not very happy :) > High load from vms, freeze and quorum lost, for example. > > Did you use GFS and witch technology? KVM? Xen? VirtualBox? Not Virtual? > Witch version are you using? v1 or v2? > > Are you a happy people using this? =) > > Thanks > From Eric.Johnson at mtsallstream.com Tue Jun 30 16:40:33 2009 From: Eric.Johnson at mtsallstream.com (Johnson, Eric) Date: Tue, 30 Jun 2009 11:40:33 -0500 Subject: [Linux-cluster] RHEL 5.3 NFSv4 cluster Message-ID: Hi, Is there an up to date document detailing the configuration of an NFSv4 cluster service on a 2-node RHEL 5.3 Cluster Suite setup? Most of the info I find is from 2006/2007 and states that these features are in a state of flux and could change soon. My current configuration is 2 nodes, RHEL 5.3 (kernel 2.6.18-128.1.14.el5PAE), SAN attached shared storage, with GFS2 file systems. I read the documents at: http://wiki.linux-nfs.org/wiki/index.php/NFS_Recovery_and_Client_Migrati on http://www.howtoforge.com/high_availability_nfs_drbd_heartbeat And also the NFS cluster cookbook and Red Hat's NFS cluster example. The former two are fairly old, and the latter two documents seem fairly basic and don't address certain issues like: 1. Is it still recommended to configure /var/lib/nfs/v4recovery on a shared file system between nodes? 2. Do I need to set the "fsid=" parameter for every export in /etc/exports and set it to a unique value? (I currently only have fsid set for nfs root) 3. 
Should I set all of the RPC services in /etc/sysconfig/nfs to listen on a dedicated port? 4. Can I leave the NFS service running on both nodes at the same time and just fail over the IP address, or should I add the nfs service script to the cluster config to start/stop it as part of the service? 5. The NFS Recovery and Client Migration doc above mentions that lock migration is not handled yet and that there needs to be a way to release locks and leases during failover. Has this been addressed somehow? Does stopping/starting the NFS service accomplish this? Also, when mounting my NFS shares using the cluster's virtual IP address or name, I get some errors in my NFS server's logs regarding timed out callbacks: Jun 25 15:00:12 node2 kernel: nfs4_cb: server not responding, timed out Jun 25 17:07:37 node2 kernel: nfs4_cb: server not responding, timed out If I mount the file system using the cluster node's static address/name, these errors don't appear, but for obvious reasons, this is undesirable. Thanks, Eric -------------- next part -------------- An HTML attachment was scrubbed... URL: From brettcave at gmail.com Tue Jun 30 16:45:43 2009 From: brettcave at gmail.com (Brett Cave) Date: Tue, 30 Jun 2009 18:45:43 +0200 Subject: [Linux-cluster] increasing gfs size to add journals on existing file system In-Reply-To: <451634528.646961246378048373.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> References: <451634528.646961246378048373.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: On Tue, Jun 30, 2009 at 6:07 PM, Bob Peterson wrote: > ----- "Brett Cave" wrote: > Hi Brett, > > That issue has always been a design problem with GFS. You need to > increase the size of the device before doing gfs_jadd. Don't make > the mistake of running gfs_grow immediately because that will consume > your new storage for file system space and still leave you no room > for any new journals. Only run gfs_grow after you've added the > journals you need. Thanks Bob, I have increased the relevant vdisks, going to rescan the disks and then add the journals. We ran into some instability issues with gfs2 locking up while we were testing a good few months ago, so going to sacrifice bleeding edge for stability as its a production system. Will keep my eye on gfs2 and see how it runs in our test environment when we get past this phase. 8 months of stable gfs is great :) (we found the older kmod_gfs or cman had a node numbering issue which caused some locking up a while ago, but this has been resolved) how is gfs2 running on your side? > > We eliminated the problem in GFS2, so another option would be to > use gfs2_convert to convert the file system to GFS2 and then use > gfs2_jadd. Of course, GFS2 and gfs2_convert are still pretty new, so > they carry a certain amount of risk, as with all new software. Some > old versions of gfs2_convert had bad problems, so if you want to go > this route, make sure you make a current backup before you do anything. > Second, make sure you gfs_fsck before you convert so that your file system > is consistent before running gfs2_convert. Third, make sure you have the > latest and greatest gfs2_convert, so if you're on RHEL5.3, for example, > make sure you've got all the latest z-stream updates. If you build from > source, make sure you compile from the most recent source code. 
> > Regards, > > Bob Peterson > Red Hat File Systems > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From brettcave at gmail.com Tue Jun 30 16:51:10 2009 From: brettcave at gmail.com (Brett Cave) Date: Tue, 30 Jun 2009 18:51:10 +0200 Subject: [Linux-cluster] Did you use GFS with witch technology? In-Reply-To: <1246378523.7787.12.camel@tuxkiller> References: <1246378523.7787.12.camel@tuxkiller> Message-ID: On Tue, Jun 30, 2009 at 6:15 PM, Tiago Cruz wrote: > Hello, guys.. please... I need to know a little thing: > > I'm using GFS v1 with ESX 3.5 and I'm not very happy :) > High load from vms, freeze and quorum lost, for example. > > Did you use GFS and witch technology? KVM? Xen? VirtualBox? Not Virtual? > Witch version are you using? v1 or v2? GFS 1 on CentOS5 2.6.18, but not using virtualization. very happy with the performance. The GFS volumes are on an enterprise FC SAN had some issues with gfs2 locking up, that was quite a while back though, but didnt have performance issues (and neither do we on gfs1). > > > Are you a happy people using this? =) > > Thanks > > -- > Tiago Cruz > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tiagocruz at forumgdh.net Tue Jun 30 17:28:11 2009 From: tiagocruz at forumgdh.net (Tiago Cruz) Date: Tue, 30 Jun 2009 14:28:11 -0300 Subject: [Linux-cluster] Did you use GFS with witch technology? In-Reply-To: <4A4A3F59.1080200@ntsg.umt.edu> References: <1246378523.7787.12.camel@tuxkiller> <4A4A3F59.1080200@ntsg.umt.edu> Message-ID: <1246382891.7787.25.camel@tuxkiller> Hello Andrew! Many thanks for your reply! It's very good to see an environment like my! I'm using RHEL 5.2 with kernel-2.6.18-92.1.22.el5... can you explain a little bit around this trick divider? I'm usually have 2-3 IBM x3850 (16 cores CPU and 128 GB RAM) with 10-15 virtual machines running under GFS, with a LUN ~500 GB SAN. My problem happens when Multicast: Switch -> GFS -> Switch = OK vSwitch (Box_A) -> Switch-> vSwitch (Box_B) = NOK Did you have some problem with? If I put all VMs inside the same box (vSwitch Box_A -> vSwitch Box_A) I don't have any problem... Thanks a lot! -- Tiago Cruz On Tue, 2009-06-30 at 10:37 -0600, Andrew A. Neuschwander wrote: > I'm using GFS1 with CentOS 5.3 on ESX 3.5 and I'm mostly happy with it. > If you are using a non-tickless kernel (i.e. RHEL/CentOS 2.6.18-x) be > sure you are using the tick divider kernel option on your VMs. > Otherwise, you'll see high loads. > > -A > -- > Andrew A. Neuschwander, RHCE > Systems/Software Engineer > College of Forestry and Conservation > The University of Montana > http://www.ntsg.umt.edu > andrew at ntsg.umt.edu - 406.243.6310 > > > Tiago Cruz wrote: > > Hello, guys.. please... I need to know a little thing: > > > > I'm using GFS v1 with ESX 3.5 and I'm not very happy :) > > High load from vms, freeze and quorum lost, for example. > > > > Did you use GFS and witch technology? KVM? Xen? VirtualBox? Not Virtual? > > Witch version are you using? v1 or v2? > > > > Are you a happy people using this? 
=) > > > > Thanks > > > > -- > Tiago Cruz > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From steve at linuxsuite.org Tue Jun 30 17:32:06 2009 From: steve at linuxsuite.org (steve at linuxsuite.org) Date: Tue, 30 Jun 2009 13:32:06 -0400 (EDT) Subject: [Linux-cluster] Cluster startup weirdness? Message-ID: <56115.205.207.123.130.1246383126.squirrel@webmail.netfirms.com> I am trying to set up a minimal proof of concept with RHCS on CentOS 5.3. Three nodes in a cluster (vz1, vz2, vz3), 2 services, both just the apache default page, as defined in the cluster.conf below. If I do service cman start on vz1 and vz2 they both hang trying to do "fence_tool -w join", yet clustat and cman_tool status show cluster membership and quorum; no services are running. If I run tcpdump on vz3 I see that initially both vz1 and vz2 send out (from port 5149) to the multicast address but then vz2 stops and only vz1 continues. Is this correct behaviour? If I then do service cman start on vz3 everything runs (i.e. fence_tool doesn't hang), and tcpdump on vz3 shows vz1, vz2 and vz3 doing multicast and then vz2 and vz3 drop out and only vz1 continues with multicast. vz3 has taken on the service vz1. Service vz2 never comes up. Ideas? Or how do I get services vz1 and vz2 running with vz3 as a spare failover? thanx - steve Below is cluster.conf generated by system-config-cluster