From jeff.sturm at eprize.com Wed Jul 1 03:57:47 2009
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Tue, 30 Jun 2009 23:57:47 -0400
Subject: [Linux-cluster] Did you use GFS with witch technology?
In-Reply-To: <1246378523.7787.12.camel@tuxkiller>
References: <1246378523.7787.12.camel@tuxkiller>
Message-ID: <64D0546C5EBBD147B75DE133D798665F02FDC1CF@hugo.eprize.local>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com]
> On Behalf Of Tiago Cruz
> Sent: Tuesday, June 30, 2009 12:15 PM
> To: linux-cluster at redhat.com
> Subject: [Linux-cluster] Did you use GFS with witch technology?
>
> Which technology do you use GFS with? KVM? Xen? VirtualBox? Or not
> virtual?
> Which version are you using? v1 or v2?
Xen here, with GFS1. Works great. Pay attention to the performance
optimizations (noatime, etc.), including statfs_fast if you are on GFS1.
We export LUNs from our SAN to each domU using the tap:sync driver.
Performance seems to be limited by our SAN. Each domU in our setup has
two vifs: one for openais, another for everything else, though I can't
say whether that helps or hurts performance.
Jeff
From agx at sigxcpu.org Wed Jul 1 11:57:25 2009
From: agx at sigxcpu.org (Guido =?iso-8859-1?Q?G=FCnther?=)
Date: Wed, 1 Jul 2009 13:57:25 +0200
Subject: [Linux-cluster] Cluster 3.0.0.rc3 release
In-Reply-To: <1246306200.25867.86.camel@cerberus.int.fabbione.net>
References: <1245496789.3665.328.camel@cerberus.int.fabbione.net>
<20090629184848.GA25796@bogon.sigxcpu.org>
<1246306200.25867.86.camel@cerberus.int.fabbione.net>
Message-ID: <20090701115725.GA6565@bogon.sigxcpu.org>
On Mon, Jun 29, 2009 at 10:10:00PM +0200, Fabio M. Di Nitto wrote:
> > 1246297857 fenced 3.0.0.rc3 started
> > 1246297857 our_nodeid 1 our_name node2.foo.bar
> > 1246297857 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/fenced.log
> > 1246297857 found uncontrolled entry /sys/kernel/dlm/rgmanager
And it also leads to:
dlm_controld[14981]: fenced_domain_info error -1
So it's not possible to get the node back without rebooting.
> It looks to me the node has not been shutdown properly and an attempt to
> restart it did fail. The fenced segfault shouldn't happen but I am
> CC'ing David. Maybe he has a better idea.
>
> >
> > when trying to restart fenced. Since this is not possible one has to
> > reboot the node.
> >
> > We're also seeing:
> >
> > Jun 29 19:29:03 node2 kernel: [ 50.149855] dlm: no local IP address has been set
> > Jun 29 19:29:03 node2 kernel: [ 50.150035] dlm: cannot start dlm lowcomms -107
>
> hmm this looks like a bad configuration to me or bad startup.
>
> IIRC dlm kernel is configured via configfs and probably it was not
> mounted by the init script.
It is.
> > from time to time. Stopping/starting via cman's init script (as from the
> > Ubuntu package) several times makes this go away.
> >
> > Any ideas what causes this?
>
> Could you please try to use our upstream init scripts? They work just
> fine (unchanged) in ubuntu/debian environment and they are for sure a
> lot more robust than the ones I originally wrote for Ubuntu many years
> ago.
Tested that without any notable change.
> Could you also please summarize your setup and config? I assume you did
> the normal checks such as cman_tool status, cman_tool nodes and so on...
>
> The usual extra things I'd check are:
>
> - make sure the hostname doesn't resolve to localhost but to the real ip
> address of the cluster interface
> - cman_tool status
> - cman_tool nodes
These all look OK. However:
> - Before starting any kind of service, such as rgmanager or gfs*, make
> sure that the fencing configuration is correct. Test by using fence_node
> $nodename.
fence_node node1
gives the segfault at the same location as described above, which seems
to be the cause of the trouble. (However, "fence_ilo -z -l user -p pass
-a iloip" works as expected.)
The segfault happens in fence/libfence/agent.c's make_args, where the
second XPath lookup (FENCE_DEVICE_ARGS_PATH) returns a bogus (non-NULL)
str. Doing this XPath lookup by hand looks fine, so it seems
ccs_get_list is returning corrupted pointers. I've attached the current
cluster.conf.
Cheers,
-- Guido
-------------- next part --------------
<?xml version="1.0"?>
From fdinitto at redhat.com Wed Jul 1 13:23:56 2009
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Wed, 01 Jul 2009 15:23:56 +0200
Subject: [Linux-cluster] Cluster 3.0.0.rc3 release
In-Reply-To: <20090701115725.GA6565@bogon.sigxcpu.org>
References: <1245496789.3665.328.camel@cerberus.int.fabbione.net>
<20090629184848.GA25796@bogon.sigxcpu.org>
<1246306200.25867.86.camel@cerberus.int.fabbione.net>
<20090701115725.GA6565@bogon.sigxcpu.org>
Message-ID: <1246454636.19414.30.camel@cerberus.int.fabbione.net>
Hi Guido,
On Wed, 2009-07-01 at 13:57 +0200, Guido Günther wrote:
> > - Before starting any kind of service, such as rgmanager or gfs*, make
> > sure that the fencing configuration is correct. Test by using fence_node
> > $nodename.
> fence_node node1
>
> gives the segfault at the same location as described above, which seems
> to be the cause of the trouble. (However, "fence_ilo -z -l user -p pass
> -a iloip" works as expected.)
> The segfault happens in fence/libfence/agent.c's make_args, where the
> second XPath lookup (FENCE_DEVICE_ARGS_PATH) returns a bogus (non-NULL)
> str. Doing this XPath lookup by hand looks fine, so it seems
> ccs_get_list is returning corrupted pointers. I've attached the current
> cluster.conf.
I am having trouble reproducing this problem and I'll need your help.
First of all I replicated your configuration:
as you can see, node names and fencing methods are the same.
I don't have iLO but it shouldn't matter.
Now my question is: did you edit the configuration you sent me by hand?
There is no matching entry between the device referenced by a node and
the fencedevices section, and I get:
[root at node2]# fence_node -vv node1
fence node1 dev 0.0 agent none result: error config agent
agent args:
fence node1 failed
Now if I change device name="fenceX" to name="nodeX" there is a match,
and:
[root at node2 cluster]# fence_node -vv node1
fence node1 dev 0.0 agent fence_virsh result: success
agent args: agent=fence_virsh port=fedora-rh-node1
ipaddr=daikengo.int.fabbione.net login=root secure=1
identity_file=/root/.ssh/id_rsa
fence node1 success
and I still don't see the segfault...
Since you can reproduce the problem regularly, I'd really like to see
some debugging output from libfence to start with. I'd really appreciate
it if you could help us.
test 1:
Please add a bunch of fprintf(stderr, ...) calls to agent.c to see the
created XPath queries and the results coming back from libccs.
Please collect the output and send it to me.
test 2:
If you could please find:
cd = ccs_connect(); (line 287 in agent.c)
and right before that add:
fullxpath=1;
That change will ask libccs to use a different XPath engine internally.
Then re-run test 1.
This should pretty much isolate the problem and give me enough
information to debug the issue.
The next question is: are you running on some fancy architecture? Maybe
something in that environment is not initialized properly (the garbage
string you get back from libccs sounds like that), but on more common
arches like x86/x86_64 gcc takes care of that for us... (really wild
guessing, but still something to fix!).
Thanks
Fabio
From jeff.sturm at eprize.com Wed Jul 1 13:50:36 2009
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Wed, 1 Jul 2009 09:50:36 -0400
Subject: [Linux-cluster] Recovering from "telling LM to withdraw"
Message-ID: <64D0546C5EBBD147B75DE133D798665F02FDC1E2@hugo.eprize.local>
Recently we had a cluster node fail with a failed assertion:
Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: fatal:
assertion "gfs_glock_is_locked_by_me(gl) && gfs_glock_is_held_excl(gl)"
failed
Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: function =
gfs_trans_add_gl
Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: file =
/builddir/build/BUILD/gfs-kmod-0.1.23/_kmod_build_/src/gfs/trans.c, line
= 237
Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: time =
1246022619
Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: about to
withdraw from the cluster
Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: telling LM
to withdraw
This is with CentOS 5.2, GFS1. The cluster had been operating
continuously for about 3 months.
My challenge isn't in preventing assertion failures entirely; I recognize
lurking software bugs and hardware anomalies can lead to a failed node.
Rather, I want to prevent one node from freezing the cluster. When the
above was logged, all nodes in the cluster which access the tb2data
filesystem also froze and did not recover. We recovered with a rolling
cluster restart and a precautionary gfs_fsck.
Most cluster problems can be quickly handled by the fence agents. The
"telling LM to withdraw" condition does not trigger a fence operation or
any other automated recovery. I need a deployment strategy to fix that.
Should I write an agent to scan the syslog, match on the message above,
and fence the node?
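Roughly, I'm imagining something like this (only a sketch, not a tested
agent; it assumes syslog from the cluster nodes is forwarded to a host
that is allowed to fence, and the node name is a placeholder):

  #!/bin/sh
  # Watch syslog for a GFS withdraw message and fence the affected node.
  NODE=mqc02                      # placeholder: node to fence on a match
  tail -F -n 0 /var/log/messages | while read line; do
      case "$line" in
          *"telling LM to withdraw"*)
              fence_node "$NODE"
              ;;
      esac
  done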
Has anyone else encountered the same problem? If so, how did you get
around it?
-Jeff
From garromo at us.ibm.com Wed Jul 1 14:21:26 2009
From: garromo at us.ibm.com (Gary Romo)
Date: Wed, 1 Jul 2009 08:21:26 -0600
Subject: [Linux-cluster] GFS on stand alone
Message-ID:
Can GFS be used on a standalone server without RHCS running?
Any pros or cons to this type of setup? Thanks.
-Gary Romo
From cthulhucalling at gmail.com Wed Jul 1 14:32:26 2009
From: cthulhucalling at gmail.com (Ian Hayes)
Date: Wed, 1 Jul 2009 07:32:26 -0700
Subject: [Linux-cluster] GFS on stand alone
In-Reply-To:
References:
Message-ID: <36df569a0907010732m450ae24eu8e5827ee3a37b93f@mail.gmail.com>
Yes it can. Use lock_nolock as your locking protocol.
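For example (a sketch only; device and mount point are placeholders):

  # create a single-journal filesystem with the no-op lock manager
  gfs_mkfs -p lock_nolock -j 1 /dev/vg0/lv_data
  mount -t gfs /dev/vg0/lv_data /mnt/data

  # or mount an existing filesystem standalone, overriding its lock protocol
  mount -t gfs -o lockproto=lock_nolock /dev/vg0/lv_data /mnt/data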
On Wed, Jul 1, 2009 at 7:21 AM, Gary Romo wrote:
> Can GFS be used on a stand alone server without RHCS running?
>
> Any pro's or con's to this type of setup? Thanks.
>
> -Gary Romo
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
From Andrea.Giussani at nposistemi.it Wed Jul 1 14:59:41 2009
From: Andrea.Giussani at nposistemi.it (Giussani Andrea)
Date: Wed, 1 Jul 2009 16:59:41 +0200
Subject: [Linux-cluster] Package Apache and Mysql Problem
Message-ID:
Hi,
I have a bit of a big problem with RH Cluster Suite.
I have 2 cluster nodes with 1 partition shared between the 2 nodes. There is no SAN.
The nodes have the same hardware and the same partition.
I have 1 partition with DRBD to synchronize the 2 nodes Primary/Primary.
I have tried many configurations of the Apache and MySQL packages, but I always hit the same problem.
The error is:
Jul 1 18:50:39 nodo1 luci[2581]: Unable to retrieve batch 1072342062 status from nodo2.local:11111: clusvcadm start failed to start Httpd:
nodo1 and nodo2 are the 2 nodes, and httpd is the Apache service.
Any idea?
I tried the configuration from this procedure for MySQL: http://kbase.redhat.com/faq/docs/DOC-5648, but the result is the same.
My cluster.conf and drbd.conf are attached.
If you need anything more, please tell me.
Thanks a lot
Andrea Giussani
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: cluster.txt
URL:
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: drbd.txt
URL:
From brettcave at gmail.com Wed Jul 1 15:24:44 2009
From: brettcave at gmail.com (Brett Cave)
Date: Wed, 1 Jul 2009 17:24:44 +0200
Subject: [Linux-cluster] problem with heartbeat + ipvs
Message-ID:
hi all,
I have a problem with an HA / LB system, using heartbeat for HA and
ldirectord / ipvs for load balancing.
When the primary node is shut down or heartbeat is stopped, the migration of
services works fine, but the load balancing does not: the ipvs rules are
active, but clients cannot connect to the HA services. Configs on primary and
secondary are the same:
haresources:
primary 172.16.5.1/16/bond0 ldirectord::ldirectord.cf
ldirectord.cf:
virtual = 172.16.5.1:3306
service = mysql
real = 172.16.10.1:3306 gate 1000
checktype, login, passwd, database, request values all set
scheduler = sed
ip_forward is enabled (checked via /proc, configured via sysctl)
network configs are almost the same except for the IP address (using a
bonded interface in active/passive mode)
I have set the iptables policies to ACCEPT, with rules that would not block
the traffic (99.99% sure on this).
If I try to connect from a server such as 172.16.10.10, I cannot connect while
the secondary is up:
[user at someserver]$ mysql -h 172.16.5.1
ERROR 2003 (HY000): Can't connect to MySQL server on '172.16.5.1' (111)
perror shows that 111 is Connection Refused.
Running a sniffer on the secondary HA box, I don't see the TCP 3306 packets
coming in.
The arp_ignore / arp_announce kernel params are configured on the real
server, the HA IP address is added as a /32 on the lo interface, etc.
(everything works 100% when the primary is up).
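Concretely, the real-server settings I mean are along these lines (the
sysctl values are the commonly recommended LVS-DR ones, not copied from
my configs):

  sysctl -w net.ipv4.conf.all.arp_ignore=1
  sysctl -w net.ipv4.conf.all.arp_announce=2
  ip addr add 172.16.5.1/32 dev lo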
I'm sure it is something I have overlooked; any ideas?
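One thing I still plan to double-check on the director (a generic ipvsadm
invocation, nothing site-specific):

  # list virtual services, real servers and their current weights
  ipvsadm -L -n

If the real server shows up with weight 0, ldirectord has marked its
service check as failed.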
From agx at sigxcpu.org Wed Jul 1 16:40:07 2009
From: agx at sigxcpu.org (Guido =?iso-8859-1?Q?G=FCnther?=)
Date: Wed, 1 Jul 2009 18:40:07 +0200
Subject: [Linux-cluster] Cluster 3.0.0.rc3 release
In-Reply-To: <1246454636.19414.30.camel@cerberus.int.fabbione.net>
References: <1245496789.3665.328.camel@cerberus.int.fabbione.net>
<20090629184848.GA25796@bogon.sigxcpu.org>
<1246306200.25867.86.camel@cerberus.int.fabbione.net>
<20090701115725.GA6565@bogon.sigxcpu.org>
<1246454636.19414.30.camel@cerberus.int.fabbione.net>
Message-ID: <20090701164007.GA10680@bogon.sigxcpu.org>
Hi Fabio,
On Wed, Jul 01, 2009 at 03:23:56PM +0200, Fabio M. Di Nitto wrote:
> Now my question is: did you mangle the configuration you sent me
> manually? because there is no matching entry between device to use for a
> node and the fencedevices section and I get:
Yes, I had to get some internal names out. This is what went wrong:
-
+
^^^^^^
(same for node2/fence2).
> Since you can reproduce the problem regularly I'd really like to see
> some debugging output of libfence to start with. I'd really appreciate
> if you could help us.
>
> test 1:
>
> Please add a bunch of fprintf(stderr, ...) calls to agent.c to see the
> created XPath queries and the results coming back from libccs.
# fence_node -vv node2
make_args(149): /cluster/fencedevices/fencedevice[@name="fence2"]/@*
make_args(156)
Segmentation fault
> test 2:
>
> If you could please find:
>
> cd = ccs_connect(); (line 287 in agent.c)
> and right before that add:
> fullxpath=1;
>
> That change will ask libccs to use a different Xpath engine internally.
>
> And then re-run test1.
# fence_node -vv node2
fence_node(289): fullxpath: 0
fence_node(291): fullxpath: 1
make_args(149): /cluster/fencedevices/fencedevice[@name="fence2"]/@*
make_args(156)
Segmentation fault
make_args(156) is just before the strncmp. Trying to print out str
results in a segfault too (that's why it's missing from the output).
[..snip..]
> the next question is: are you running on some fancy architecture? Maybe
> something in that environment is not initialized properly (the garbage
> string you get back from libccs sounds like that) but on more common
> arches like x86/x86_64 gcc takes care of that for us.... (really wild
> guessing but still something to fix!).
Nothing fancy here:
# uname -a
Linux vm41 2.6.30-1-amd64 #1 SMP Sun Jun 14 15:00:29 UTC 2009 x86_64
GNU/Linux
Cheers,
-- Guido
From adas at redhat.com Wed Jul 1 16:43:26 2009
From: adas at redhat.com (Abhijith Das)
Date: Wed, 01 Jul 2009 11:43:26 -0500
Subject: [Linux-cluster] Recovering from "telling LM to withdraw"
In-Reply-To: <64D0546C5EBBD147B75DE133D798665F02FDC1E2@hugo.eprize.local>
References: <64D0546C5EBBD147B75DE133D798665F02FDC1E2@hugo.eprize.local>
Message-ID: <4A4B922E.5090301@redhat.com>
Jeff Sturm wrote:
>
> Recently we had a cluster node fail with a failed assertion:
>
> Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: fatal:
> assertion "gfs_glock_is_locked_by_me(gl) &&
> gfs_glock_is_held_excl(gl)" failed
>
> Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: function =
> gfs_trans_add_gl
>
> Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: file =
> /builddir/build/BUILD/gfs-kmod-0.1.23/_kmod_build_/src/gfs/trans.c,
> line = 237
>
> Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: time =
> 1246022619
>
> Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: about to
> withdraw from the cluster
>
> Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: telling LM
> to withdraw
>
> This is with CentOS 5.2, GFS1. The cluster had been operating
> continuously for about 3 months.
>
> My challenge isn't in preventing assertion failures entirely; I
> recognize lurking software bugs and hardware anomalies can lead to a
> failed node. Rather, I want to prevent one node from freezing the
> cluster. When the above was logged, all nodes in the cluster which
> access the tb2data filesystem also froze and did not recover. We
> recovered with a rolling cluster restart and a precautionary gfs_fsck.
>
> Most cluster problems can be quickly handled by the fence agents. The
> "telling LM to withdraw" does not trigger a fence operation, or any
> other automated recovery. I need a deployment strategy to fix that.
> Should I write an agent to scan the syslog, match on the message
> above, and fence the node?
>
> Has anyone else encountered the same problem? If so, how did you get
> around it?
>
> -Jeff
>
https://bugzilla.redhat.com/show_bug.cgi?id=471258
The assert+withdraw you're seeing seems to be this bug above. I've tried
to recreate this on my cluster and failed. If you have a recipe to
create this, could you please post it to the bugzilla?
Meanwhile, I'll look at the code again to see if I can spot anything.
Thanks!
--Abhi
From fdinitto at redhat.com Wed Jul 1 17:12:07 2009
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Wed, 01 Jul 2009 19:12:07 +0200
Subject: [Linux-cluster] Cluster 3.0.0.rc3 release
In-Reply-To: <20090701164007.GA10680@bogon.sigxcpu.org>
References: <1245496789.3665.328.camel@cerberus.int.fabbione.net>
<20090629184848.GA25796@bogon.sigxcpu.org>
<1246306200.25867.86.camel@cerberus.int.fabbione.net>
<20090701115725.GA6565@bogon.sigxcpu.org>
<1246454636.19414.30.camel@cerberus.int.fabbione.net>
<20090701164007.GA10680@bogon.sigxcpu.org>
Message-ID: <1246468327.19414.65.camel@cerberus.int.fabbione.net>
On Wed, 2009-07-01 at 18:40 +0200, Guido Günther wrote:
> Hi Fabio,
> On Wed, Jul 01, 2009 at 03:23:56PM +0200, Fabio M. Di Nitto wrote:
> > Now my question is: did you mangle the configuration you sent me
> > manually? because there is no matching entry between device to use for a
> > node and the fencedevices section and I get:
> Yes, I had to get some internal names out. This is what went wrong:
>
> -
> +
Ok perfect thanks.
>
> # fence_node -vv node2
> make_args(149): /cluster/fencedevices/fencedevice[@name="fence2"]/@*
> make_args(156)
> Segmentation fault
>
> > test 2:
> >
> > If you could please find:
> >
> > cd = ccs_connect(); (line 287 in agent.c)
> > and right before that add:
> > fullxpath=1;
> >
> > That change will ask libccs to use a different Xpath engine internally.
> >
> > And then re-run test1.
> # fence_node -vv node2
> fence_node(289): fullxpath: 0
> fence_node(291): fullxpath: 1
> make_args(149): /cluster/fencedevices/fencedevice[@name="fence2"]/@*
> make_args(156)
> Segmentation fault
>
> make_args(156) is just before the strncmp. Trying to print out str
> results in a segfault too (that's why it's missing from the output).
No matter what, I can't trigger this segfault.
Do you have a build log for the package? And could you send me the
make/defines.mk from the build tree?
Also gcc version and the usual toolchain info... maybe it's a gcc bug, or
maybe it's an optimization that behaves differently between Debian and
Fedora.
I have attached a small test case that exercises just libccs. At this point
I don't believe it's a problem in libfence. Could you please run it for me
and send me the output? If the bug is in libccs, this would start
isolating it.
[root at fedora-rh-node4 ~]# gcc -Wall -o testccs main.c -lccs
[root at fedora-rh-node4 ~]# ./testccs
-hopefully some output-
Please also check the XPath query at the top of main.c, as it could be
slightly different given your config.
Thanks
Fabio
-------------- next part --------------
A non-text attachment was scrubbed...
Name: main.c
Type: text/x-csrc
Size: 528 bytes
Desc: not available
URL:
From Luis.Cerezo at pgs.com Wed Jul 1 18:24:07 2009
From: Luis.Cerezo at pgs.com (Luis Cerezo)
Date: Wed, 1 Jul 2009 13:24:07 -0500
Subject: [Linux-cluster] qdisk best practices
Message-ID: <15D5002F61F31A45A82A153D2F73906760FBD3F011@HOUMS26.onshore.pgs.com>
Hi all-
I've got a RHEL 5.3 two-node cluster with qdisk. All works fine, but qdisk
seems to beat on the SAN (IOPS). I adjusted the interval from the default of
1 to 5 and it is still high (the SAN admin is crying).
Does anyone have best practices for this? It's an LSI SAN and both nodes are
multipathed to it via 4 Gb FC.
thanks!
Luis E. Cerezo
PGS
Global IT
From tiagocruz at forumgdh.net Wed Jul 1 19:00:15 2009
From: tiagocruz at forumgdh.net (Tiago Cruz)
Date: Wed, 01 Jul 2009 16:00:15 -0300
Subject: [Linux-cluster] Did you use GFS with witch technology?
In-Reply-To: <64D0546C5EBBD147B75DE133D798665F02FDC1CF@hugo.eprize.local>
References: <1246378523.7787.12.camel@tuxkiller>
<64D0546C5EBBD147B75DE133D798665F02FDC1CF@hugo.eprize.local>
Message-ID: <1246474815.7192.148.camel@tuxkiller>
Thanks guys for all the comments!
Just one more question:
I have 10 VMs in an Apache cluster, and I've compiled one httpd inside
GFS, something like /gfs/httpd_servers/bin-2.2.9.
Do you see any problem with this? How do you use Apache with GFS?
--
Tiago Cruz
On Tue, 2009-06-30 at 23:57 -0400, Jeff Sturm wrote:
> > -----Original Message-----
> > From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com]
> > On Behalf Of Tiago Cruz
> > Sent: Tuesday, June 30, 2009 12:15 PM
> > To: linux-cluster at redhat.com
> > Subject: [Linux-cluster] Did you use GFS with witch technology?
> >
> > Which technology do you use GFS with? KVM? Xen? VirtualBox? Or not
> > virtual?
> > Which version are you using? v1 or v2?
>
> Xen here, with GFS1. Works great. Pay attention to the performance
> optimizations (noatime, etc.) including statfs_fast if you are on GFS1.
>
> We export LUNs from our SAN to each domU using tap:sync driver.
> Performance seems to be limited by our SAN. Each domU in our setup has
> two vif's: one for openais, another for everything else, though I can't
> say if that helps or hurts performance.
>
> Jeff
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
From brem.belguebli at gmail.com Wed Jul 1 21:09:43 2009
From: brem.belguebli at gmail.com (brem belguebli)
Date: Wed, 1 Jul 2009 23:09:43 +0200
Subject: [Linux-cluster] Did you use GFS with witch technology?
In-Reply-To: <1246397778.7787.67.camel@tuxkiller>
References: <1246378523.7787.12.camel@tuxkiller>
<36df569a0906301254p5dcece20g1336aece80bcd708@mail.gmail.com>
<1246396045.7787.45.camel@tuxkiller>
<29ae894c0906301429g6550e907hfe633c28c75c08eb@mail.gmail.com>
<1246397778.7787.67.camel@tuxkiller>
Message-ID: <29ae894c0907011409k470d1405qe23d7041b6c13dce@mail.gmail.com>
Hi,
OK, I understand; that should be supported.
If your problems (freezes, qdisk loss, etc.) occur when your VMs are under
high load (CPU, RAM, disk I/O?), why not configure more vCPUs on your
VMs, split the VMs across more LUNs, and so on?
The thing to keep in mind, if the bottleneck is on the disk subsystem, is
that there is no way under ESX to limit the I/O rate per VM, and ESX 3.5
doesn't support multipathing.
2009/6/30 Tiago Cruz
> Not,
>
> What I did:
>
> I have 10 virtual machines.
>
> I have one LUN of 200 GB formatted by ESX using VMFS.
>
> Inside this LUN, I have a lot of small pieces of 10 GB (the "/" of each
> virtual machine) formatted by RHEL 5.x using ext3.
>
> And my GFS is on another LUN, called DRM (something like Direct Raw
> Mapping), where the LUN is delivered to the VM without passing "inside" ESX.
>
> Did you understand, or have I complicated it even more? :-p
> --
> Tiago Cruz
>
>
> On Tue, 2009-06-30 at 23:29 +0200, brem belguebli wrote:
> > Not really,
> >
> >
> > VMFS is the clustered filesystem shipped with ESX.
> >
> >
> > If I understand well, you got the source code of GFS that you did
> > recompile on your ESX host, is that it ?
> >
> >
> > I think you're already out of support from VMware if so.
> >
> >
> >
> >
> > 2009/6/30 Tiago Cruz
> > Hello Ian,
> >
> > 'cause AFAIK I can't format one block device with VMFS.
> > You can think of VMFS as something like LVM - just an abstraction
> > layer and not a FS itself :)
> >
> > --
> > Tiago Cruz
> >
> >
> >
> >
> > On Tue, 2009-06-30 at 12:54 -0700, Ian Hayes wrote:
> > >
> > >
> > > On Tue, Jun 30, 2009 at 9:15 AM, Tiago Cruz
> >
> > > wrote:
> > > Hello, guys.. please... I need to know a little
> > thing:
> > >
> > > I'm using GFS v1 with ESX 3.5 and I'm not very
> > happy :)
> > > High load from vms, freeze and quorum lost, for
> > example.
> > >
> > > Which technology do you use GFS with? KVM? Xen?
> > > VirtualBox?
> > > Or not virtual?
> > > Which version are you using? v1 or v2?
> > >
> > > Are you a happy people using this? =)
> > >
> > > If you're using ESX, why are you using GFS instead of VMFS?
> > >
> > >
> >
> >
> > > --
> > > Linux-cluster mailing list
> > > Linux-cluster at redhat.com
> > > https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> >
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
From fdinitto at redhat.com Wed Jul 1 23:16:30 2009
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Thu, 02 Jul 2009 01:16:30 +0200
Subject: [Linux-cluster] Cluster 3.0.0.rc4 release
Message-ID: <1246490190.19414.93.camel@cerberus.int.fabbione.net>
The cluster team and its community are proud to announce the
3.0.0.rc4 release candidate from the STABLE3 branch.
This should be the last release candidate unless major problems are
found during the final testing stage.
Everybody with test equipment and time to spare is highly encouraged to
download, install and test this release candidate and, more importantly,
report problems. This is the time for people to make a difference and
help us test as much as possible.
In order to build the 3.0.0.rc4 release you will need:
- corosync 0.100 (1.0.0.rc1)
- openais 0.100 (1.0.0.rc1)
- linux kernel 2.6.29
The new source tarball can be downloaded here:
ftp://sources.redhat.com/pub/cluster/releases/cluster-3.0.0.rc4.tar.gz
https://fedorahosted.org/releases/c/l/cluster/cluster-3.0.0.rc4.tar.gz
At the same location it is now possible to find separate tarballs for
fence-agents and resource-agents, as previously announced
(http://www.redhat.com/archives/cluster-devel/2009-February/msg00003.htm).
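For anyone building from the tarball for the first time, the usual sequence
is roughly the following (a sketch; configure options vary by distribution):

  tar xzf cluster-3.0.0.rc4.tar.gz
  cd cluster-3.0.0.rc4
  ./configure
  make
  make install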
To report bugs or issues:
https://bugzilla.redhat.com/
Would you like to meet the cluster team or members of its community?
Join us on IRC (irc.freenode.net #linux-cluster) and share your
experience with other system administrators or power users.
Happy clustering,
Fabio
Under the hood (from 3.0.0.rc3):
Bob Peterson (4):
GFS2: gfs2_convert, parameter not understood on ppc
/sbin/mount.gfs2: can't find /proc/mounts entry for directory /
Message printed to stderr instead of stdout
gfs_fsck: Segfault in EA leaf repair
Christine Caulfield (3):
cman: use api->shutdown_request instead of api->request_shutdown
cman: Fix some compile-time warning
dlm: Fix some compile warnings
Fabio M. Di Nitto (17):
gfs: kill dead test code
gfs2: drop dead test code
build: enable fence_xvm by default
config: fix warnings in confdb2ldif
config: use HDB_X instead of _D
gfs: add missing format attributes
gfs2: handle output conversion properly
gfs2: add missing casts
gfs2: make functions static
gfs2: backport coding format from master
gfs2: resync internationalization support from master
cman: port to the latest corosync API
cman init: stop qdiskd only if enabled
qdiskd: fix log file name
cman init: don't stop fence_xvmd if we don't know the status
cman init: readd support for fence_xvmd standalone operations
Revert "gfs-kernel: enable FS_HAS_FREEZE"
Federico Simoncelli (1):
rgmanager: Allow vm.sh use of libvirt XML file
Jim Meyering (5):
src/clulib/ckpt_state.c (ds_key_init_nt): detect failed malloc
dlm/tests: handle malloc failure
cman: handle malloc failure (i.e., don't deref NULL)
dlm_controld: handle heap allocation failure and plug leaks
dlm_controld: add comments: mark memory problems
Lon Hohberger (42):
rgmanager: Fix ptr arithmetic and C90 warnings
rgmanager: Fix rg_locks.c build warnings
rgmanager: Fix rg_strings.c build warnings
rgmanager: Fix members.c and related build warnings
rgmanager: Change ccs_read_old_logging to static
rgmanager: Fix daemon_init related warnings
rgmanager: Remove unused function
rgmanager: Remove unused proof-of-concept code
rgmanager: Fix build warnings in cman.c
rgmanager: Fix build warnings in fdops.c
rgmanager: Fix vft.c and related build warnings
rgmanager: Fix msgtest.c build warnings
rgmanager: Fix complier warnings in msg_cluster.c
rgmanager: Fix build warnings in msg_socket.c
rgmanager: Fix build warnings in msgtest.c
rgmanager: Fix fo_domain.c build warnings
rgmanager: Fix fo_domain.c build warnings (part 2)
rgmanager: Fix clufindhostname.c build warnings
rgmanager: Fix clustat.c build warnings
rgmanager: Fix clusvcadm.c build warnings
rgmanager: Fix clulog.c build warnings
rgmanager: groups.c cleanup
rgmanager: Cleanups around main.c
rgmanager: Fix reslist.c complier warnings
rgmanager: Fix resrules.c compiler warnings
rgmanager: Fix restree.c compiler warnings
rgmanager: Clean up rg_event.c and related build warnings
rgmanager: Fix rg_forward.c build warnings
rgmanager: Fix rg_queue.c build warnings
rgmanager: Clean up rg_queue.c and related warnings
rgmanager: Clean up slang_event.c and related warnings
rgmanager: Fix last bits of compiler warnings
rgmanager: Fix leaked context on queue fail
rgmanager: Fix stop/start race
rgmanager: Fix stack overflows on stress testing
rgmanager: Fix small memory leak
rgmanager: Don't push NULL on to the S/Lang stack
rgmanager: Fix error message
rgmanager: Fix --debug build
fence: Make fence_node return 2 for no fencing
rgmanager: follow-service.sl stack cleanup
rgmanager: Allow exit while waiting for fencing
Marek 'marx' Grac (1):
fence_wti: Fence agent for WTI ends with traceback when option is
missing
Steven Dake (1):
fence: Fix missing case in switch statement
Steven Whitehouse (1):
libgfs2: Use -o meta rather than gfs2meta fs type
cman/daemon/ais.c | 7 +-
cman/daemon/commands.c | 6 +-
cman/daemon/daemon.c | 5 +-
cman/daemon/daemon.h | 2 +-
cman/init.d/cman.in | 27 +-
cman/qdisk/main.c | 2 +-
config/tools/ldap/confdb2ldif.c | 6 +-
configure | 8 -
dlm/tests/usertest/alternate-lvb.c | 10 +-
dlm/tests/usertest/asttest.c | 14 +-
dlm/tests/usertest/dlmtest.c | 6 +-
dlm/tests/usertest/dlmtest2.c | 7 +-
dlm/tests/usertest/flood.c | 7 +-
dlm/tests/usertest/joinleave.c | 2 +-
dlm/tests/usertest/lstest.c | 12 +-
dlm/tests/usertest/lvb.c | 11 +-
dlm/tests/usertest/pingtest.c | 8 +-
dlm/tests/usertest/threads.c | 34 +-
fence/agents/Makefile | 13 +-
fence/agents/wti/fence_wti.py | 14 +-
fence/agents/xvm/vm_states.c | 2 +
fence/fence_node/fence_node.c | 6 +-
fence/libfence/agent.c | 2 +-
gfs-kernel/src/gfs/ops_fstype.c | 2 +-
gfs/gfs_fsck/Makefile | 7 -
gfs/gfs_fsck/log.c | 9 +-
gfs/gfs_fsck/metawalk.c | 7 +-
gfs/gfs_fsck/test_bitmap.c | 38 -
gfs/gfs_fsck/test_block_list.c | 91 -
gfs/libgfs/log.c | 9 +-
gfs2/convert/gfs2_convert.c | 2 +-
gfs2/fsck/Makefile | 6 -
gfs2/fsck/fs_recovery.c | 34 +-
gfs2/fsck/initialize.c | 6 +-
gfs2/fsck/main.c | 2 +-
gfs2/fsck/rgrepair.c | 2 +-
gfs2/fsck/test_bitmap.c | 38 -
gfs2/fsck/test_block_list.c | 91 -
gfs2/libgfs2/misc.c | 2 +-
gfs2/mkfs/main.c | 2 +-
gfs2/mkfs/main_grow.c | 4 +-
gfs2/mkfs/main_jadd.c | 11 +-
gfs2/mkfs/main_mkfs.c | 10 +-
gfs2/mount/util.c | 15 +-
gfs2/tool/main.c | 2 +-
group/dlm_controld/pacemaker.c | 15 +-
make/defines.mk.input | 1 -
rgmanager/include/daemon_init.h | 9 +
rgmanager/include/depends.h | 134 --
rgmanager/include/event.h | 10 +
rgmanager/include/fo_domain.h | 48 +
rgmanager/include/groups.h | 42 +
rgmanager/include/lock.h | 4 +-
rgmanager/include/members.h | 1 +
rgmanager/include/message.h | 20 +-
rgmanager/include/resgroup.h | 82 +-
rgmanager/include/reslist.h | 51 +-
rgmanager/include/restart_counter.h | 2 +-
rgmanager/include/rg_locks.h | 9 +
rgmanager/include/rg_queue.h | 6 +-
rgmanager/include/vf.h | 10 +-
rgmanager/src/clulib/ckpt_state.c | 1 +
rgmanager/src/clulib/cman.c | 3 +-
rgmanager/src/clulib/daemon_init.c | 8 +-
rgmanager/src/clulib/fdops.c | 5 +-
rgmanager/src/clulib/lock.c | 4 +-
rgmanager/src/clulib/logging.c | 4 +-
rgmanager/src/clulib/members.c | 66 -
rgmanager/src/clulib/message.c | 22 +-
rgmanager/src/clulib/msg_cluster.c | 13 +-
rgmanager/src/clulib/msg_socket.c | 12 +-
rgmanager/src/clulib/msgtest.c | 19 +-
rgmanager/src/clulib/rg_strings.c | 2 +-
rgmanager/src/clulib/vft.c | 53 +-
rgmanager/src/daemons/Makefile | 6 +-
rgmanager/src/daemons/depends.c | 2512 -----------------------
rgmanager/src/daemons/dtest.c | 810 --------
rgmanager/src/daemons/event_config.c | 19 +-
rgmanager/src/daemons/fo_domain.c | 29 +-
rgmanager/src/daemons/groups.c | 94 +-
rgmanager/src/daemons/main.c | 173 +--
rgmanager/src/daemons/reslist.c | 35 +-
rgmanager/src/daemons/resrules.c | 41 +-
rgmanager/src/daemons/restree.c | 70 +-
rgmanager/src/daemons/rg_event.c | 30 +-
rgmanager/src/daemons/rg_forward.c | 6 +-
rgmanager/src/daemons/rg_locks.c | 12 +-
rgmanager/src/daemons/rg_queue.c | 8 +-
rgmanager/src/daemons/rg_state.c | 145 +-
rgmanager/src/daemons/rg_thread.c | 14 +-
rgmanager/src/daemons/service_op.c | 15 +-
rgmanager/src/daemons/slang_event.c | 266 ++--
rgmanager/src/daemons/test.c | 72 +-
rgmanager/src/daemons/watchdog.c | 5 +
rgmanager/src/resources/default_event_script.sl | 16 +-
rgmanager/src/resources/follow-service.sl | 10 +-
rgmanager/src/resources/vm.sh | 17 +-
rgmanager/src/utils/clufindhostname.c | 2 +-
rgmanager/src/utils/clulog.c | 4 +-
rgmanager/src/utils/clustat.c | 67 +-
rgmanager/src/utils/clusvcadm.c | 16 +-
101 files changed, 939 insertions(+), 4812 deletions(-)
From jeff.sturm at eprize.com Thu Jul 2 03:40:40 2009
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Wed, 1 Jul 2009 23:40:40 -0400
Subject: [Linux-cluster] Did you use GFS with witch technology?
In-Reply-To: <1246474815.7192.148.camel@tuxkiller>
References: <1246378523.7787.12.camel@tuxkiller><64D0546C5EBBD147B75DE133D798665F02FDC1CF@hugo.eprize.local>
<1246474815.7192.148.camel@tuxkiller>
Message-ID: <64D0546C5EBBD147B75DE133D798665F02FDC207@hugo.eprize.local>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com]
> On Behalf Of Tiago Cruz
> Sent: Wednesday, July 01, 2009 3:00 PM
> To: linux clustering
> Subject: RE: [Linux-cluster] Did you use GFS with witch technology?
>
> I have 10 VMs in an Apache cluster, and I've compiled one httpd inside
> GFS, something like /gfs/httpd_servers/bin-2.2.9.
You can do that. It sounds like most of the nodes may be accessing this
httpd instance read-only. If that is the case, consider using
spectator mounts on some of the nodes so you don't have to create 10
individual journals.
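For example (a sketch; device and mount point are placeholders):

  # read-only spectator mount: no journal is needed for this node
  mount -t gfs -o spectator,noatime /dev/vg_web/gfs_httpd /gfs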
> Do you see any problem with this? How do you use Apache with GFS?
We actually use it for several purposes. For one, we keep our document
root on GFS, so when web content is modified, the new content is
immediately visible to all web servers. For another, we have a
file-based session implementation on a GFS mount.
The only real limitations I know of have to do with applications which
are not cluster-aware, and performance of heavy read-write loads.
-Jeff
From jeff.sturm at eprize.com Thu Jul 2 03:45:00 2009
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Wed, 1 Jul 2009 23:45:00 -0400
Subject: [Linux-cluster] Recovering from "telling LM to withdraw"
In-Reply-To: <4A4B922E.5090301@redhat.com>
References: <64D0546C5EBBD147B75DE133D798665F02FDC1E2@hugo.eprize.local>
<4A4B922E.5090301@redhat.com>
Message-ID: <64D0546C5EBBD147B75DE133D798665F02FDC208@hugo.eprize.local>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com]
> On Behalf Of Abhijith Das
> Sent: Wednesday, July 01, 2009 12:43 PM
> To: linux clustering
> Subject: Re: [Linux-cluster] Recovering from "telling LM to withdraw"
>
> https://bugzilla.redhat.com/show_bug.cgi?id=471258
>
> The assert+withdraw you're seeing seems to be this bug above. I've
tried
> to recreate this on my cluster and failed. If you have a recipe to
> create this, could you please post it to the bugzilla?
Thank you for the link. I'm not confident I can easily reproduce this
yet, as we've had months of continuous uptime without such an incident.
However if I do learn more about the circumstances leading up to our
crash, I'll certainly post information to the bugzilla page.
In the meantime I'll see if I can install a nagios agent to scan logs
for any GFS problems. The sooner we know about it, the faster we can
recover if this happens again.
-Jeff
From Emmanuel.Thome at normalesup.org Thu Jul 2 09:56:17 2009
From: Emmanuel.Thome at normalesup.org (Emmanuel =?iso-8859-1?Q?Thom=E9?=)
Date: Thu, 2 Jul 2009 11:56:17 +0200
Subject: [Linux-cluster] ipmi activates session, but no talk.
Message-ID: <20090702095617.GA24015@tiramisu.loria.fr>
Hi.
I'm trying to set up ipmi (1.5) management using the bmc on ibm
eserver326 machines. Yes, these machines are old.
So far, I've been able to access the bmc with ipmitool, and configure it
as correctly as I could for remote access.
When trying to access it from afar, I successfully activate a session,
but further requests are unanswered.
Some dumps of ipmitool commands are included below.
If anybody has an idea of what's going on, that would be greatly
appreciated.
I might also try to flash the bmc firmware, as it seems that ibm released
a newer firmware for these servers. But I'm already a bit puzzled by
the situation so far.
Thanks,
E.
I'm trying to access the BMC with IP 152.81.4.81 from the host with IP
152.81.3.83. The BMC piggy-backs on the eth0 NIC, which has IP
152.81.3.81 on the system side. Thus the BMC and the system have
different MACs and IPs. This seems to work fine, as some kind of
conversation occurs.
Here's the output of a remote ipmi request:
[root at cassandre ~]# IPMI_PASSWORD=xxx ipmitool -vvI lan -L USER -H 152.81.4.81 -E mc info
ipmi_lan_send_cmd:opened=[0], open=[4490512]
IPMI LAN host 152.81.4.81 port 623
Sending IPMI/RMCP presence ping packet
ipmi_lan_send_cmd:opened=[1], open=[4490512]
Channel 01 Authentication Capabilities:
Privilege Level : USER
Auth Types : MD5
Per-msg auth : disabled
User level auth : disabled
Non-null users : enabled
Null users : enabled
Anonymous login : disabled
Proceeding with AuthType MD5
ipmi_lan_send_cmd:opened=[1], open=[4490512]
Opening Session
Session ID : 751168e4
Challenge : e44e37374801833f77701411992dae25
Privilege Level : USER
Auth Type : MD5
ipmi_lan_send_cmd:opened=[1], open=[4490512]
Session Activated
Auth Type : MD5
Max Priv Level : USER
Session ID : 751168e4
Inbound Seq : 00000001
opened=[1], open=[4490512]
No response from remote controller
Get Device ID command failed
ipmi_lan_send_cmd:opened=[1], open=[4490512]
No response from remote controller
Close Session command failed
On the machine I'm trying to talk to, I have in particular:
[root at achille ~]# ipmitool -I open session info all
[...]
session handle : 255
slot count : 4
active sessions : 1
user id : 1
privilege level : USER
session type : IPMIv1.5
channel number : 0x01
console ip : 152.81.3.83
console mac : 00:00:00:00:00:00
console port : 60599
[...]
[root at achille ~]# /usr/bin/ipmitool -I open lan print
Set in Progress : Set Complete
Auth Type Support : NONE MD5 PASSWORD
Auth Type Enable : Callback : MD5
: User : MD5
: Operator : MD5
: Admin : MD5
: OEM : NONE MD5 PASSWORD
IP Address Source : Static Address
IP Address : 152.81.4.81
Subnet Mask : 255.255.240.0
MAC Address : 00:0d:60:18:7c:47
SNMP Community String : public
IP Header : TTL=0x00 Flags=0x00 Precedence=0x00 TOS=0x00
Default Gateway IP : 152.81.1.1
Default Gateway MAC : 00:13:5f:89:14:00
Backup Gateway IP : 192.168.0.2
Backup Gateway MAC : 00:00:00:00:00:02
Cipher Suite Priv Max : Not Available
[root at achille ~]# ipmitool user list 1
ID Name Callin Link Auth IPMI Msg Channel Priv Limit
1 true false true ADMINISTRATOR
2 root true true true OPERATOR
3 USERID true true true ADMINISTRATOR
4 OEM true true true OEM
[root at achille ~]# ipmitool -I open channel info 1
Channel 0x1 info:
Channel Medium Type : 802.3 LAN
Channel Protocol Type : IPMB-1.0
Session Support : multi-session
Active Session Count : 1
Protocol Vendor ID : 7154
Volatile(active) Settings
Alerting : disabled
Per-message Auth : disabled
User Level Auth : disabled
Access Mode : always available
Non-Volatile Settings
Alerting : disabled
Per-message Auth : disabled
User Level Auth : disabled
Access Mode : always available
From j.buzzard at dundee.ac.uk Thu Jul 2 10:15:36 2009
From: j.buzzard at dundee.ac.uk (Jonathan Buzzard)
Date: Thu, 02 Jul 2009 11:15:36 +0100
Subject: [Linux-cluster] ipmi activates session, but no talk.
In-Reply-To: <20090702095617.GA24015@tiramisu.loria.fr>
References: <20090702095617.GA24015@tiramisu.loria.fr>
Message-ID: <1246529736.23585.6.camel@penguin.lifesci.dundee.ac.uk>
On Thu, 2009-07-02 at 11:56 +0200, Emmanuel Thomé wrote:
> Hi.
>
> I'm trying to set up ipmi (1.5) management using the bmc on ibm
> eserver326 machines. Yes, these machines are old.
They are cheap and nasty rebadged MSI boxes.
> So far, I've been able to access the bmc with ipmitool, and configure it
> as correctly as I could for remote access.
>
> When trying to access it from afar, I successfully activate a session,
> but further requests are unanswered.
>
> Some dumps of ipmitool commands are included below.
Well, that's your problem: it doesn't work with ipmitool :-(
> If anybody has an idea of what's going on, that would be greatly
> appreciated.
>
I suggest switching to FreeIPMI, which does work.
> I might also try to flash the bmc firmware, as it seems that ibm released
> a newer firmware for these servers. But I'm already a bit puzzled by
> the situation so far.
I would if I were you. I would also update the BIOS, BMC and hard disk
firmware at a minimum. The diagnostics are optional.
Note that you cannot configure bonding on eth0 and use the IPMI
interface.
Even when you get it working it is not reliable. I have seen boxes hang
and refuse to respond to IPMI commands to reboot.
I have also never been able to get the serial-over-LAN bit working.
They are just cheap and nasty.
JAB.
--
Jonathan A. Buzzard Tel: +441382-386998
Storage Administrator, College of Life Sciences
University of Dundee, DD1 5EH
From Emmanuel.Thome at normalesup.org Thu Jul 2 10:47:45 2009
From: Emmanuel.Thome at normalesup.org (Emmanuel =?iso-8859-1?Q?Thom=E9?=)
Date: Thu, 2 Jul 2009 12:47:45 +0200
Subject: [Linux-cluster] ipmi activates session, but no talk.
In-Reply-To: <1246529736.23585.6.camel@penguin.lifesci.dundee.ac.uk>
References: <20090702095617.GA24015@tiramisu.loria.fr>
<1246529736.23585.6.camel@penguin.lifesci.dundee.ac.uk>
Message-ID: <20090702104745.GA25283@tiramisu.loria.fr>
On Thu, Jul 02, 2009 at 11:15:36AM +0100, Jonathan Buzzard wrote:
> > Some dumps of ipmitool commands are included below.
>
> Well, that's your problem: it doesn't work with ipmitool :-(
thanks a lot. Indeed.
Regards,
E.
From brettcave at gmail.com Thu Jul 2 10:54:35 2009
From: brettcave at gmail.com (Brett Cave)
Date: Thu, 2 Jul 2009 12:54:35 +0200
Subject: [Linux-cluster] Re: [SOLVED] problem with heartbeat + ipvs
In-Reply-To:
References:
Message-ID:
I was missing the DBD::mysql module, so the connection check was failing and
setting the weight to 0.
I only noticed this when I ran ldirectord in debug mode.
On Wed, Jul 1, 2009 at 5:24 PM, Brett Cave wrote:
> hi all,
>
> have a problem with HA / LB system, using heartbeat for HA and ldirector /
> ipvs for load balancing.
>
> When the primary node is shut down or heartbeat is stopped, the migration
> of services works fine, but the load balancing does not: the ipvs rules are
> active, but clients cannot connect to the HA services. Configs on primary and
> secondary are the same:
>
>
> haresources:
> primary 172.16.5.1/16/bond0 ldirectord::ldirectord.cf
>
> ldirectord.cf:
> virtual = 172.16.5.1:3306
> service = mysql
> real = 172.16.10.1:3306 gate 1000
> checktype, login, passwd, database, request values all set
> scheduler = sed
>
> ip_forward is enabled (checked via /proc, configured via sysctl)
>
>
> network configs are almost the same except for the IP address (using a
> bonded interface in active/passive mode)
> have set iptables policies to ACCEPT with rules that would not block the
> traffic (99.99% sure on this).
>
> if i try connect from a server such as 172.16.10.10, i cannot connect if
> the secondary is up:
> [user at someserver]$ mysql -h 172.16.5.1
> ERROR 2003 (HY000): Can't connect to MySQL server on '172.16.5.1' (111)
>
>
> perror shows that 111 is Connection Refused
>
> running a sniffer on the secondary HA box, i dont see the tcp 3306 packets
> coming in.
>
> the arp_ignore / arp_announce kernel params are configured on the real
> server, HA ip address is added on a /32 subnet to the lo interface, etc,
> etc.... (everything works 100% when primary is up).
>
> I'm sure it is something I have overlooked; any ideas?
>
>
>
From ironludo at free.fr Thu Jul 2 12:09:01 2009
From: ironludo at free.fr (LEROUX Ludovic)
Date: Thu, 2 Jul 2009 14:09:01 +0200
Subject: [Linux-cluster] redhat 5 cluster suite with gfs and oracle 11g
References:
Message-ID: <3AA7FB4A03854AE2953DA88714C2DBA4@siim94.local>
I am trying to install Oracle 11g on a Red Hat 5 cluster with 2 nodes.
I have a GFS mount point for the shared datafiles.
Oracle binaries are installed on each node.
I want to create a failover instance (active/passive), but the service with the "Oracle 10g failover instance" resource doesn't start (see the log below).
I think that the resource doesn't work with Oracle 11g.
Do you have any ideas?
Do you have any documents on setting up a Red Hat cluster with Oracle but without Oracle RAC?
Thanks a lot.
Ludo
________________________________________________________________________________________________________
Jul 2 14:11:03 siimlinux13 luci[2956]: Unable to retrieve batch 1273662007 status from siimlinux13.siim:11111: Unable to disable failed service oracle before starting it: clusvcadm failed to stop oracle
Jul 2 14:11:11 siimlinux13 : error getting update info: Cannot retrieve repository metadata (repomd.xml) for repository: Cluster. Please verify its path and try again
Jul 2 14:11:55 siimlinux13 clurgmgrd[3045]: Starting disabled service service:oracle
Jul 2 14:11:55 siimlinux13 clurgmgrd[3045]: start on script "serviceoracle" returned 5 (program not installed)
Jul 2 14:11:55 siimlinux13 clurgmgrd[3045]: #68: Failed to start service:oracle; return value: 1
Jul 2 14:11:55 siimlinux13 clurgmgrd[3045]: Stopping service service:oracle
Jul 2 14:11:55 siimlinux13 luci[2956]: Unable to retrieve batch 710454996 status from siimlinux13.siim:11111: module scheduled for execution
Jul 2 14:11:55 siimlinux13 clurgmgrd[3045]: stop on script "serviceoracle" returned 5 (program not installed)
Jul 2 14:11:55 siimlinux13 clurgmgrd[3045]: #12: RG service:oracle failed to stop; intervention required
Jul 2 14:11:56 siimlinux13 clurgmgrd[3045]: Service service:oracle is failed
Jul 2 14:11:56 siimlinux13 clurgmgrd[3045]: #13: Service service:oracle failed to stop cleanly
Jul 2 14:12:01 siimlinux13 luci[2956]: Unable to retrieve batch 710454996 status from siimlinux13.siim:11111: clusvcadm start failed to start oracle:
Jul 2 15:11:14 siimlinux13 : error getting update info: Cannot retrieve repository metadata (repomd.xml) for repository: Cluster. Please verify its path and try again
From raju.rajsand at gmail.com Thu Jul 2 12:17:02 2009
From: raju.rajsand at gmail.com (Rajagopal Swaminathan)
Date: Thu, 2 Jul 2009 17:47:02 +0530
Subject: [Linux-cluster] redhat 5 cluster suite with gfs and oracle 11g
In-Reply-To: <3AA7FB4A03854AE2953DA88714C2DBA4@siim94.local>
References:
<3AA7FB4A03854AE2953DA88714C2DBA4@siim94.local>
Message-ID: <8786b91c0907020517s51ccd802pfa306b401ad3f07e@mail.gmail.com>
Greetings,
On Thu, Jul 2, 2009 at 5:39 PM, LEROUX Ludovic wrote:
> I try to install Oracle 11g on a redhat 5 cluster with 2 nodes.
> I have a gfs mount point for the shared datafiles.
> Oracle binaries are installed on each node.
> I want to create a failover instance (active/passive) but the service with
> the ressource oracle 10g failover instance doesn't start (see the logfile).
Have you turned the oracle init script off with chkconfig on both nodes,
and added Oracle as a cluster-managed service along with the listener
IP?
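For example (assuming the init script and the cluster service are both
literally named "oracle"; adjust to your setup):

  chkconfig oracle off     # on both nodes, so init no longer starts it
  clusvcadm -e oracle      # enable/start the service under rgmanager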
Regards,
Rajagopal
From esggrupos at gmail.com Thu Jul 2 17:24:03 2009
From: esggrupos at gmail.com (ESGLinux)
Date: Thu, 2 Jul 2009 19:24:03 +0200
Subject: [Linux-cluster] OFF TOPIC: cloud computing
Message-ID: <3128ba140907021024m4e48ff1fl992272d2239b2c0c@mail.gmail.com>
Hi folks,
First, sorry for the off topic, but I'm sure you know a lot about the concept
of cloud computing.
While I have been learning about clustering (with the help of this list) I
have read about using clusters for cloud computing.
I'm a complete newbie with that concept, so I want to ask what you have to
say about it: is it real? Or is it an abstract concept that is not going to
be interesting at all?
What do you think?
By the way, is there any website, book, magazine, article or anything else
that digs deeper into this concept?
greetings
ESG
From brettcave at gmail.com Thu Jul 2 17:35:43 2009
From: brettcave at gmail.com (Brett Cave)
Date: Thu, 2 Jul 2009 19:35:43 +0200
Subject: [Linux-cluster] OFF TOPIC: cloud computing
In-Reply-To: <3128ba140907021024m4e48ff1fl992272d2239b2c0c@mail.gmail.com>
References: <3128ba140907021024m4e48ff1fl992272d2239b2c0c@mail.gmail.com>
Message-ID:
On Thu, Jul 2, 2009 at 7:24 PM, ESGLinux wrote:
> Hi folks,
> First sorry for the off topic but I'm sure you know a lot about the concept
> cloud computing.
>
> While I have been learning about clustering (with the help of this list..)
> I have read about using clusters for cloud computing.
>
> I'm totally newbie about that concept, so I want to ask you what you have
> to say about it, is it real? is an abstract concept and it's not going to be
> interesting at all?
>
It is real; have a look at MPI for development of cloud computing (MPICH is
one implementation). It's used for message passing to farm out components of
a job to various nodes. Last year we implemented a sort using this library
that allocated tasks on a per-core basis across multiple servers.
> what do you think?
>
> by the way, any web, book, magazine, article or any thing to profundice in
> this concept
>
> greetings
>
> ESG
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
From jruemker at redhat.com Thu Jul 2 19:53:58 2009
From: jruemker at redhat.com (John Ruemker)
Date: Thu, 02 Jul 2009 15:53:58 -0400
Subject: [Linux-cluster] redhat 5 cluster suite with gfs and oracle 11g
In-Reply-To: <3AA7FB4A03854AE2953DA88714C2DBA4@siim94.local>
References:
<3AA7FB4A03854AE2953DA88714C2DBA4@siim94.local>
Message-ID: <4A4D1056.6090807@redhat.com>
On 07/02/2009 08:09 AM, LEROUX Ludovic wrote:
> Jul 2 14:11:55 siimlinux13 clurgmgrd[3045]: start on script
> "serviceoracle" returned 5 (program not installed)
> Jul 2 14:11:55 siimlinux13 clurgmgrd[3045]: #68: Failed to
> start service:oracle; return value: 1
The above error is why it's failing, but unfortunately this is pretty
generic. Something returned status code 5, but from these logs there's
no way to be sure what, since the oracle agent does a number of things
during the startup sequence.
Usually the best way to troubleshoot these issues is with rg_test, as it
will be much more verbose. First disable your service
# clusvcadm -d serviceoracle
Now do
# rg_test test /etc/cluster/cluster.conf start service serviceoracle
You should see it logging each operation and it will tell you where it
failed. If this doesn't point you to your answer then post the output
here as well as your cluster.conf.
Also there are some good guidelines and basic steps for setting up an
oracle service here:
http://people.redhat.com/lhh/oracle-rhel5-notes-0.6/oracle-notes.html
HTH
-John
From hlawatschek at atix.de Fri Jul 3 09:25:55 2009
From: hlawatschek at atix.de (Mark Hlawatschek)
Date: Fri, 3 Jul 2009 11:25:55 +0200
Subject: [Linux-cluster] redhat 5 cluster suite with gfs and oracle 11g
In-Reply-To: <3AA7FB4A03854AE2953DA88714C2DBA4@siim94.local>
References:
<3AA7FB4A03854AE2953DA88714C2DBA4@siim94.local>
Message-ID: <200907031125.55435.hlawatschek@atix.de>
Hi Ludo,
Could you please provide your cluster.conf file?
-Mark
On Thursday 02 July 2009 14:09:01 LEROUX Ludovic wrote:
> I try to install Oracle 11g on a redhat 5 cluster with 2 nodes.
> I have a gfs mount point for the shared datafiles.
> Oracle binaries are installed on each node.
> I want to create a failover instance (active/passive) but the service with
> the ressource oracle 10g failover instance doesn't start (see the logfile).
> I think that the resource doesn't work with Oracle 11g.
> Do you have any ideas?
> Do you have any documents to set up a redhat cluster with Oracle but
> without Oracle RAC? Thanks a lot.
> Ludo
>
> ____________________________________________________________________________
>
> Jul 2 14:11:03 siimlinux13 luci[2956]: Unable to retrieve batch 1273662007 status from siimlinux13.siim:11111: Unable to disable failed service oracle before starting it: clusvcadm failed to stop oracle
> Jul 2 14:11:11 siimlinux13 : error getting update info: Cannot retrieve repository metadata (repomd.xml) for repository: Cluster. Please verify its path and try again
> Jul 2 14:11:55 siimlinux13 clurgmgrd[3045]: Starting disabled service service:oracle
> Jul 2 14:11:55 siimlinux13 clurgmgrd[3045]: start on script "serviceoracle" returned 5 (program not installed)
> Jul 2 14:11:55 siimlinux13 clurgmgrd[3045]: #68: Failed to start service:oracle; return value: 1
> Jul 2 14:11:55 siimlinux13 clurgmgrd[3045]: Stopping service service:oracle
> Jul 2 14:11:55 siimlinux13 luci[2956]: Unable to retrieve batch 710454996 status from siimlinux13.siim:11111: module scheduled for execution
> Jul 2 14:11:55 siimlinux13 clurgmgrd[3045]: stop on script "serviceoracle" returned 5 (program not installed)
> Jul 2 14:11:55 siimlinux13 clurgmgrd[3045]: #12: RG service:oracle failed to stop; intervention required
> Jul 2 14:11:56 siimlinux13 clurgmgrd[3045]: Service service:oracle is failed
> Jul 2 14:11:56 siimlinux13 clurgmgrd[3045]: #13: Service service:oracle failed to stop cleanly
> Jul 2 14:12:01 siimlinux13 luci[2956]: Unable to retrieve batch 710454996 status from siimlinux13.siim:11111: clusvcadm start failed to start oracle:
> Jul 2 15:11:14 siimlinux13 : error getting update info: Cannot retrieve repository metadata (repomd.xml) for repository: Cluster. Please verify its path and try again
--
Dipl.-Ing. Mark Hlawatschek
ATIX Informationstechnologie und Consulting AG | Einsteinstrasse 10 |
85716 Unterschleissheim | www.atix.de | www.open-sharedroot.org
From mech at meteo.uni-koeln.de Fri Jul 3 15:06:13 2009
From: mech at meteo.uni-koeln.de (Mario Mech)
Date: Fri, 03 Jul 2009 17:06:13 +0200
Subject: [Linux-cluster] running services as non-root user
Message-ID: <4A4E1E65.2070908@meteo.uni-koeln.de>
Hi,
in my cluster environment some services need to run as a non-root user. What are the necessary settings?
Settings in my cluster.conf like
(not accepted by system-config-cluster) and in /usr/share/cluster/scripts.sh
User name
User name
su - ${OCF_RESKEY_user} -c "${OCF_RESKEY_file} $1"
didn't succeed. The services are started, but as root.
Am I going about this the wrong way?
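As a sketch of the kind of thing that su line is aiming at (the user, daemon
path and script name below are placeholders, not taken from this mail), one
option is to leave scripts.sh untouched and point the script resource at a
small wrapper that does the su itself:

  #!/bin/sh
  # /etc/init.d/myapp-wrapper -- hypothetical wrapper started by a cluster script resource
  APPUSER=appuser
  APP=/usr/local/bin/myappd
  case "$1" in
      start|stop|restart|status)
          # run the real action as the unprivileged user;
          # su hands the command's exit status back to rgmanager
          exec su - "$APPUSER" -c "$APP $1"
          ;;
      *)
          echo "Usage: $0 {start|stop|restart|status}" >&2
          exit 2
          ;;
  esac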
Thank you
Mario
--
From billpp at gmail.com Fri Jul 3 19:30:44 2009
From: billpp at gmail.com (Flavio Junior)
Date: Fri, 3 Jul 2009 16:30:44 -0300
Subject: [Linux-cluster] GFS2 with IMAP Maildir server
Message-ID: <58aa8d780907031230i6d78890fpdd8791521c421348@mail.gmail.com>
Hi folks....
I'm (trying to) use GFS2 in a mailserver scenario with:
- CentOS 5.3 updated
- Dovecot IMAP/Maildir
- Postfix
To make the servers active/active I'm using CTDB (http://ctdb.samba.org).
Some info that could be relevant:
[root at pinky ~]# uname -a
Linux pinky 2.6.18-128.1.16.el5 #1 SMP Tue Jun 30 06:07:26 EDT 2009 x86_64
x86_64 x86_64 GNU/Linux
[root at pinky ~]# rpm -qa | grep -E 'gfs2|clust|kernel|cman|openais'
kernel-2.6.18-128.1.16.el5
gfs2-utils-0.1.53-1.el5_3.3
modcluster-0.12.1-2.el5.centos
cluster-cim-0.12.1-2.el5.centos
kernel-devel-2.6.18-128.1.10.el5
openais-0.80.3-22.el5_3.8
system-config-cluster-1.0.55-1.0
kernel-2.6.18-128.1.6.el5
kernel-2.6.18-128.1.10.el5
kernel-devel-2.6.18-128.1.16.el5
lvm2-cluster-2.02.40-7.el5
cluster-snmp-0.12.1-2.el5.centos
kernel-headers-2.6.18-128.1.16.el5
kernel-devel-2.6.18-128.1.6.el5
cman-2.0.98-1.el5_3.4
[root at pinky ~]# grep /home /etc/fstab
/dev/homeClusterVG/home_vmail /home gfs2
auto,noatime,quota=off,noexec,nodev,_netdev 0 0
Everything works fine for some time, but two or three times a day I get
some dovecot/deliver processes hung in D state, and the only way to recover is
to reboot the node.
I'm not a developer and don't know much about debugging. Having hit other
problems before, I learned to use "sysrq-t"; here is the output related to
two of these processes:
Pastebin: http://pastebin.ca/1483264
Jul 3 15:45:20 cerebro kernel: deliver D ffff81007e442800 0
24420 23846 (NOTLB)
Jul 3 15:45:20 cerebro kernel: ffff810013885e08 0000000000000082
ffff810013885d68 0000000000000092
Jul 3 15:45:20 cerebro kernel: ffff810013885e20 0000000000000001
ffff8100141870c0 ffff81000904b0c0
Jul 3 15:45:20 cerebro kernel: 0000052a72ff2a70 000000000000034a
ffff8100141872a8 000000036caf5000
Jul 3 15:45:20 cerebro kernel: Call Trace:
Jul 3 15:45:20 cerebro kernel: []
:dlm:dlm_posix_lock+0x172/0x210
Jul 3 15:45:20 cerebro kernel: []
autoremove_wake_function+0x0/0x2e
Jul 3 15:45:20 cerebro kernel: []
:gfs2:gfs2_lock+0xc3/0xcf
Jul 3 15:45:20 cerebro kernel: []
fcntl_setlk+0x11e/0x273
Jul 3 15:45:20 cerebro kernel: []
audit_syscall_entry+0x16e/0x1a1
Jul 3 15:45:20 cerebro kernel: [] sys_fcntl+0x269/0x2dc
Jul 3 15:45:20 cerebro kernel: [] tracesys+0xd5/0xe0
Jul 3 15:45:21 cerebro kernel: deliver D ffff81000238f480 0
1358 32225 (NOTLB)
Jul 3 15:45:21 cerebro kernel: ffff8100086cfe08 0000000000000082
ffff8100086cfd68 0000000000000092
Jul 3 15:45:21 cerebro kernel: ffff8100086cfe20 0000000000000001
ffff81000904b0c0 ffff81007ff28100
Jul 3 15:45:21 cerebro kernel: 0000052a72ff2ca2 0000000000000232
ffff81000904b2a8 000000037ed68a00
Jul 3 15:45:21 cerebro kernel: Call Trace:
Jul 3 15:45:21 cerebro kernel: []
:dlm:dlm_posix_lock+0x172/0x210
Jul 3 15:45:21 cerebro kernel: []
autoremove_wake_function+0x0/0x2e
Jul 3 15:45:21 cerebro kernel: []
:gfs2:gfs2_lock+0xc3/0xcf
Jul 3 15:45:21 cerebro kernel: []
fcntl_setlk+0x11e/0x273
Jul 3 15:45:21 cerebro kernel: []
audit_syscall_entry+0x16e/0x1a1
Jul 3 15:45:21 cerebro kernel: [] sys_fcntl+0x269/0x2dc
Jul 3 15:45:21 cerebro kernel: [] tracesys+0xd5/0xe0
Before rebooting the node I went into this user's directory and ran some
"ls" commands, and everything worked as expected. I was pretty sure the command
would hang, but it didn't.
Here is the "ps ax" output:
cicero 24420 0.0 0.0 8960 1220 ? Ds 14:46 0:00
/usr/libexec/dovecot/deliver -f cicero -d cicero
I've already rebooted that node, but if there is some deeper way to
debug this case, just let me know; I'll probably hit the same situation again
by the end of the day.
Thanks in advance.
--
Flávio do Carmo Júnior aka waKKu
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From gordan at bobich.net Fri Jul 3 19:40:11 2009
From: gordan at bobich.net (Gordan Bobic)
Date: Fri, 03 Jul 2009 20:40:11 +0100
Subject: [Linux-cluster] GFS2 with IMAP Maildir server
In-Reply-To: <58aa8d780907031230i6d78890fpdd8791521c421348@mail.gmail.com>
References: <58aa8d780907031230i6d78890fpdd8791521c421348@mail.gmail.com>
Message-ID: <4A4E5E9B.7060906@bobich.net>
Sounds like you are running into the same bug that I ran into with GFS2
on a similar setup nearly 2 years ago, except I could produce a lock-up
in under 2 seconds every time. Solution is to use GFS1 if you really
want to stick with that setup, but bear in mind that, regardless of the
cluster file system (GFS1, GFS2, OCFS2) the performance will scale
_inversely_. Cluster file systems really don't work well with millions
of small files.
You might, instead, want to look into something like DBMail with a MySQL
proxy to serialize all writes to a single node.
You can, of course, still use GFS1 for the root file system to share the
OS install. Look at Open Shared Root project if this is of interest.
Gordan
Flavio Junior wrote:
> Hi folks....
>
> I'm (trying to) using GFS2 with a mailserver scenario using:
>
> - CentOS 5.3 updated
> - Dovecot IMAP/Maildir
> - Postfix
>
> To make servers active/active i'm using CTDB (http://ctdb.samba.org).
>
> Some info that could be relevant:
> [root at pinky ~]# uname -a
> Linux pinky 2.6.18-128.1.16.el5 #1 SMP Tue Jun 30 06:07:26 EDT 2009
> x86_64 x86_64 x86_64 GNU/Linux
> [root at pinky ~]# rpm -qa | grep -E 'gfs2|clust|kernel|cman|openais'
> kernel-2.6.18-128.1.16.el5
> gfs2-utils-0.1.53-1.el5_3.3
> modcluster-0.12.1-2.el5.centos
> cluster-cim-0.12.1-2.el5.centos
> kernel-devel-2.6.18-128.1.10.el5
> openais-0.80.3-22.el5_3.8
> system-config-cluster-1.0.55-1.0
> kernel-2.6.18-128.1.6.el5
> kernel-2.6.18-128.1.10.el5
> kernel-devel-2.6.18-128.1.16.el5
> lvm2-cluster-2.02.40-7.el5
> cluster-snmp-0.12.1-2.el5.centos
> kernel-headers-2.6.18-128.1.16.el5
> kernel-devel-2.6.18-128.1.6.el5
> cman-2.0.98-1.el5_3.4
> [root at pinky ~]# grep /home /etc/fstab
> /dev/homeClusterVG/home_vmail /home gfs2
> auto,noatime,quota=off,noexec,nodev,_netdev 0 0
>
>
> Everything works fine for some time, but two or three times by day I get
> some dovecot/deliver process hanged D state, so the only way to solve it
> is rebooting node.
>
> I'm not a developer and don't know much about debugging. As i've got
> other problems ago I learn to use "sysrq-t" and here is the output
> related with two of these process:
>
> Pastebin: http://pastebin.ca/1483264
>
> Jul 3 15:45:20 cerebro kernel: deliver D ffff81007e442800 0
> 24420 23846 (NOTLB)
> Jul 3 15:45:20 cerebro kernel: ffff810013885e08 0000000000000082
> ffff810013885d68 0000000000000092
> Jul 3 15:45:20 cerebro kernel: ffff810013885e20 0000000000000001
> ffff8100141870c0 ffff81000904b0c0
> Jul 3 15:45:20 cerebro kernel: 0000052a72ff2a70 000000000000034a
> ffff8100141872a8 000000036caf5000
> Jul 3 15:45:20 cerebro kernel: Call Trace:
> Jul 3 15:45:20 cerebro kernel: []
> :dlm:dlm_posix_lock+0x172/0x210
> Jul 3 15:45:20 cerebro kernel: []
> autoremove_wake_function+0x0/0x2e
> Jul 3 15:45:20 cerebro kernel: []
> :gfs2:gfs2_lock+0xc3/0xcf
> Jul 3 15:45:20 cerebro kernel: []
> fcntl_setlk+0x11e/0x273
> Jul 3 15:45:20 cerebro kernel: []
> audit_syscall_entry+0x16e/0x1a1
> Jul 3 15:45:20 cerebro kernel: [] sys_fcntl+0x269/0x2dc
> Jul 3 15:45:20 cerebro kernel: [] tracesys+0xd5/0xe0
>
>
> Jul 3 15:45:21 cerebro kernel: deliver D ffff81000238f480 0
> 1358 32225 (NOTLB)
> Jul 3 15:45:21 cerebro kernel: ffff8100086cfe08 0000000000000082
> ffff8100086cfd68 0000000000000092
> Jul 3 15:45:21 cerebro kernel: ffff8100086cfe20 0000000000000001
> ffff81000904b0c0 ffff81007ff28100
> Jul 3 15:45:21 cerebro kernel: 0000052a72ff2ca2 0000000000000232
> ffff81000904b2a8 000000037ed68a00
> Jul 3 15:45:21 cerebro kernel: Call Trace:
> Jul 3 15:45:21 cerebro kernel: []
> :dlm:dlm_posix_lock+0x172/0x210
> Jul 3 15:45:21 cerebro kernel: []
> autoremove_wake_function+0x0/0x2e
> Jul 3 15:45:21 cerebro kernel: []
> :gfs2:gfs2_lock+0xc3/0xcf
> Jul 3 15:45:21 cerebro kernel: []
> fcntl_setlk+0x11e/0x273
> Jul 3 15:45:21 cerebro kernel: []
> audit_syscall_entry+0x16e/0x1a1
> Jul 3 15:45:21 cerebro kernel: [] sys_fcntl+0x269/0x2dc
> Jul 3 15:45:21 cerebro kernel: [] tracesys+0xd5/0xe0
>
>
> Before reboot the node I went into the directory of this user and run
> some "ls" and everything works as expected. I was pretty sure that
> command will hang, but it don't.
> Here is the "ps ax" output:
> cicero 24420 0.0 0.0 8960 1220 ? Ds 14:46 0:00
> /usr/libexec/dovecot/deliver -f cicero -d cicero
>
> I've already rebooted that node, but if there is someway more deeply to
> perform a debug of this case, just let me know that probably till the
> end of the day i'll get same situation.
>
>
> Thanks in advance.
>
> --
>
> Flávio do Carmo Júnior aka waKKu
>
>
> ------------------------------------------------------------------------
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
From billpp at gmail.com Fri Jul 3 20:02:29 2009
From: billpp at gmail.com (Flavio Junior)
Date: Fri, 3 Jul 2009 17:02:29 -0300
Subject: [Linux-cluster] GFS2 with IMAP Maildir server
In-Reply-To: <4A4E5E9B.7060906@bobich.net>
References: <58aa8d780907031230i6d78890fpdd8791521c421348@mail.gmail.com>
<4A4E5E9B.7060906@bobich.net>
Message-ID: <58aa8d780907031302x2cf76587had74673f962d1e61@mail.gmail.com>
On Fri, Jul 3, 2009 at 4:40 PM, Gordan Bobic wrote:
> Sounds like you are running into the same bug that I ran into with GFS2 on
> a similar setup nearly 2 years ago, except I could produce a lock-up in
> under 2 seconds every time. Solution is to use GFS1 if you really want to
> stick with that setup, but bear in mind that, regardless of the cluster file
> system (GFS1, GFS2, OCFS2) the performance will scale _inversely_. Cluster
> file systems really don't work well with millions of small files.
>
Hi Gordan, thanks for the answer.
But if it was possible to solve this in GFS1, why is it not
feasible for GFS2?
Well, migrating to GFS1 is no problem at all; actually I've already thought
about it, but all those GFS1 tuning options and tests make me a bit
apprehensive.
I'll wait a bit more on the GFS2 community; if they say it can't be done I'll
go to GFS1 or even OCFS2 (the third option, as I already have an RHCS
setup with clvmd).
--
Flávio do Carmo Júnior aka waKKu
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From gordan at bobich.net Fri Jul 3 21:00:13 2009
From: gordan at bobich.net (Gordan Bobic)
Date: Fri, 03 Jul 2009 22:00:13 +0100
Subject: [Linux-cluster] GFS2 with IMAP Maildir server
In-Reply-To: <58aa8d780907031302x2cf76587had74673f962d1e61@mail.gmail.com>
References: <58aa8d780907031230i6d78890fpdd8791521c421348@mail.gmail.com> <4A4E5E9B.7060906@bobich.net>
<58aa8d780907031302x2cf76587had74673f962d1e61@mail.gmail.com>
Message-ID: <4A4E715D.5010204@bobich.net>
Flavio Junior wrote:
> On Fri, Jul 3, 2009 at 4:40 PM, Gordan Bobic > wrote:
>
> Sounds like you are running into the same bug that I ran into with
> GFS2 on a similar setup nearly 2 years ago, except I could produce a
> lock-up in under 2 seconds every time. Solution is to use GFS1 if
> you really want to stick with that setup, but bear in mind that,
> regardless of the cluster file system (GFS1, GFS2, OCFS2) the
> performance will scale _inversely_. Cluster file systems really
> don't work well with millions of small files.
>
>
> Hi Gordan, thanks for answer.
>
> But, if it is "possible" to be solved (as it was with GFS1) why is it
> not feasible to GFS2?
1) Performance will suck regardless of whether it's GFS1 or GFS2. It's
fine for 10-20 users, but if you have 10,000-20,000 users, it will grind
to a halt.
2) GFS2 clearly still isn't stable enough if this sort of crash
still happens.
> Well, no problem at al to migrate to GFS1, actually I've already thinked
> about it, but all those gfs1 tunning options and tests makes me a bit
> apprehensive.
GFS1 doesn't have any more tuning options than GFS2 that I can think of.
And besides, in practice, if the performance isn't in the right ball
park out of the box, no amount of tweaking will help. Just about the
only thing that makes a significant difference is the noatime mount
option. I wouldn't bother with the rest unless you really need those
last few percent.
> I'll wait a bit more for GFS2 community, if they say that it can't be
> done I go to GFS1 or even ocfs2 (what is the third option, as I've
> already a RHCS structure with clvmd).
The problem with GFS2 is that it's still a bit buggy, as you've found.
But there isn't that much difference in performance between various
similar file systems. Sure, GFS2 is faster than GFS1, but it's not an
order of magnitude faster.
Gordan
From cthulhucalling at gmail.com Sat Jul 4 01:48:16 2009
From: cthulhucalling at gmail.com (Ian Hayes)
Date: Fri, 3 Jul 2009 18:48:16 -0700
Subject: [Linux-cluster] GFS2 with IMAP Maildir server
In-Reply-To: <4A4E715D.5010204@bobich.net>
References: <58aa8d780907031230i6d78890fpdd8791521c421348@mail.gmail.com>
<4A4E5E9B.7060906@bobich.net>
<58aa8d780907031302x2cf76587had74673f962d1e61@mail.gmail.com>
<4A4E715D.5010204@bobich.net>
Message-ID: <36df569a0907031848u52b19902s963d26ccea69abb4@mail.gmail.com>
On Fri, Jul 3, 2009 at 2:00 PM, Gordan Bobic wrote:
> Flavio Junior wrote:
>
>> On Fri, Jul 3, 2009 at 4:40 PM, Gordan Bobic > gordan at bobich.net>> wrote:
>>
>>
>
> Well, no problem at al to migrate to GFS1, actually I've already thinked
>> about it, but all those gfs1 tunning options and tests makes me a bit
>> apprehensive.
>>
>
> GFS1 doesn't have any more tuning options than GFS2 that I can think of.
> And besides, in practice, if the performance isn't in the right ball park
> out of the box, no amount of tweaking will help. Just about the only think
> that makes a significant difference is the noatime mount option. I wouldn't
> bother with the rest unless you really need those last few percent.
Noatime helps, but where I've seen some really good performance boosts is in
tweaking the glock_purge and demote_secs parameters. Of course, always start
with a modest setting and tweak from there. Playing around with
statfs_fast=1, noatime, nodiratime and the glock settings, I've
seen a pretty significant jump in performance.
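For what it's worth, a sketch of how those GFS1 tunables are usually set (the
mount point and values are placeholders; start modestly, as above):

  gfs_tool settune /mnt/gfs glock_purge 50    # trim up to 50% of unused glocks per scan
  gfs_tool settune /mnt/gfs demote_secs 200   # demote unused glocks sooner than the default
  gfs_tool settune /mnt/gfs statfs_fast 1     # faster, slightly less exact statfs/df

plus noatime,nodiratime in the mount options. These settune values don't
survive a remount, so they usually end up in an init script.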
>
> I'll wait a bit more for GFS2 community, if they say that it can't be done
>> I go to GFS1 or even ocfs2 (what is the third option, as I've already a RHCS
>> structure with clvmd).
>>
>
> The problem with GFS2 is that it's still a bit buggy, as you've found. But
> there isn't that much difference in performance between various similar file
> systems. Sure, GFS2 is faster than GFS1, but it's not an order of magnitude
> faster.
I've done some GFS vs GFS2 performance benchmarking for a cluster that I
will be putting in soon. I've found that GFS1 performance has been much much
better than GFS2. As far as I can tell, GFS2 lacks a lot of the tunability
that GFS1 has. All the documentation I've seen says that it's supposed to be
self-tuning, so there are fewer performance tuning options you have to play
with. From my tests, I've had almost a 50% reduction in performance using
GFS2.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From brdvss at gmail.com Sat Jul 4 10:19:32 2009
From: brdvss at gmail.com (Brady Vass)
Date: Sat, 4 Jul 2009 15:49:32 +0530
Subject: [Linux-cluster] Re: Commands for communicating among nodes?
In-Reply-To: <995446330907030938t67b4c101y4dde2fbf2c51e8eb@mail.gmail.com>
References: <995446330907030938t67b4c101y4dde2fbf2c51e8eb@mail.gmail.com>
Message-ID: <995446330907040319j57c13bfbhfe8b92a02f2937e4@mail.gmail.com>
> Hi,
>
> I am trying to find out if there are any dedicated commands that allows for
> file copying or command execution among nodes in RHCS. i.e, these commands
> need to be exclusive with the RHCS s/w and the communication should be
> seamlessly without the need for password authentications.
>
> (PS: I dont want to use rsh/ssh genre of commands. Other HA solution comes
> with exclusive set of cluster commands. I am looking for something similar.)
>
>
> Thanks and regards,
>
> Brady
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From robejrm at gmail.com Sat Jul 4 13:13:49 2009
From: robejrm at gmail.com (Juan Ramon Martin Blanco)
Date: Sat, 4 Jul 2009 15:13:49 +0200
Subject: [Linux-cluster] Re: Commands for communicating among nodes?
In-Reply-To: <995446330907040319j57c13bfbhfe8b92a02f2937e4@mail.gmail.com>
References: <995446330907030938t67b4c101y4dde2fbf2c51e8eb@mail.gmail.com>
<995446330907040319j57c13bfbhfe8b92a02f2937e4@mail.gmail.com>
Message-ID: <8a5668960907040613s34eb6046r5009f6bf24573e26@mail.gmail.com>
On Sat, Jul 4, 2009 at 12:19 PM, Brady Vass wrote:
>
> Hi,
>>
>> I am trying to find out if there are any dedicated commands that allows
>> for file copying or command execution among nodes in RHCS. i.e, these
>> commands need to be exclusive with the RHCS s/w and the communication should
>> be seamlessly without the need for password authentications.
>>
>> (PS: I dont want to use rsh/ssh genre of commands. Other HA solution comes
>> with exclusive set of cluster commands. I am looking for something similar.)
>>
>>
You can always use public key authentication with ssh and scp;
communication will be seamless. You can also use dsh (or any parallel shell
on top of ssh) to execute the same command on all the nodes at once.
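A minimal sketch of that approach (the hostnames are invented for the example):

  ssh-keygen -t rsa                                              # generate a key pair; empty passphrase for unattended use
  ssh-copy-id root@node2                                         # push the public key to each of the other nodes
  ssh-copy-id root@node3
  for n in node2 node3; do ssh root@$n 'cman_tool status'; done  # run the same command on every node over ssh

Once the key is distributed, ssh and scp between the nodes stop prompting for a
password, and dsh or any similar wrapper can fan commands out the same way.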
Greetings,
Juanra
> Thanks and regards,
>>
>> Brady
>>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From grimme at atix.de Sun Jul 5 11:29:54 2009
From: grimme at atix.de (Marc Grimme)
Date: Sun, 5 Jul 2009 13:29:54 +0200
Subject: [Linux-cluster] Re: Commands for communicating among nodes?
In-Reply-To: <8a5668960907040613s34eb6046r5009f6bf24573e26@mail.gmail.com>
References: <995446330907030938t67b4c101y4dde2fbf2c51e8eb@mail.gmail.com>
<995446330907040319j57c13bfbhfe8b92a02f2937e4@mail.gmail.com>
<8a5668960907040613s34eb6046r5009f6bf24573e26@mail.gmail.com>
Message-ID: <200907051329.54722.grimme@atix.de>
You might also want to have a look at com-dsh, part of comoonics-cs-py
(a python-based version of pydsh, available from
http://download.atix.de/yum/comoonics/redhat-el5/productive/noarch/RPMS/comoonics-cs-py-0.1-56.noarch.rpm).
It should automatically detect all "online" nodes in the cluster and then
issue the command on all nodes. It also detects the node you are working on
and will issue the command there directly. See below for an example.
[root at generix3 ~]# cman_tool nodes
Node Sts Inc Joined Name
2 M 12 2009-05-05 10:06:24 generix2.local
3 M 4 2009-05-05 09:37:02 generix3.local
4 M 32 2009-05-05 10:14:49 generix4.local
[root at generix3 ~]# com-dsh hostname
Host | Output:
---------------+-------------------------------------------------------------------------------------------------------------------------------------------
localhost | generix3
generix2.local | generix2
generix4.local | generix4
[root at generix3 ~]# rpm -qf $(which com-dsh)
comoonics-cs-py-0.1-56
Hope this helps
Marc.
On Saturday 04 July 2009 15:13:49 Juan Ramon Martin Blanco wrote:
> On Sat, Jul 4, 2009 at 12:19 PM, Brady Vass wrote:
> > Hi,
> >
> >> I am trying to find out if there are any dedicated commands that allows
> >> for file copying or command execution among nodes in RHCS. i.e, these
> >> commands need to be exclusive with the RHCS s/w and the communication
> >> should be seamlessly without the need for password authentications.
> >>
> >> (PS: I dont want to use rsh/ssh genre of commands. Other HA solution
> >> comes with exclusive set of cluster commands. I am looking for something
> >> similar.)
> >
> > You can always use public key authentication with ssh and scp,
>
> communication will be seamless. You can also use dsh (or any parallel shell
> on top of ssh) to execute the same command on all the nodes at once.
>
> Greetings,
> Juanra
>
> > Thanks and regards,
> >
> >> Brady
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
--
Gruss / Regards,
Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/ http://www.open-sharedroot.org/
ATIX Informationstechnologie und Consulting AG | Einsteinstrasse 10 |
85716 Unterschleissheim | www.atix.de | www.open-sharedroot.org
Registergericht: Amtsgericht Muenchen, Registernummer: HRB 168930, USt.-Id.:
DE209485962 | Vorstand: Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.) |
Vorsitzender des Aufsichtsrats: Dr. Martin Buss
From armanets at ill.fr Mon Jul 6 08:08:33 2009
From: armanets at ill.fr (Armanet Stephane)
Date: Mon, 06 Jul 2009 10:08:33 +0200
Subject: [Linux-cluster] force fencing
Message-ID: <4A51B101.2010500@ill.fr>
Hello list
I'm trying to set up a 3-node cluster with 2 failover domains for an HA
mail solution.
I want 1 node active for the IMAP server in the IMAP failover domain, 1
node active for the SMTP server in the SMTP failover domain, and the 3rd node
in both failover domains as a backup node.
I run Centos 5.3
My fence device is a wti power switch
My cluster.conf is attached
My SMTP service is composed of:
1 IP
1 amavisd script
1 postfix script
2 NFS mounts for postfix and amavis
If I manually kill the postfix master process (to simulate a crash), my
node is not fenced and the logs say:
Jul 6 10:00:40 centos-smtp1 clurgmgrd: [4228]: Executing
/etc/init.d/postfix status
Jul 6 10:00:40 centos-smtp1 clurgmgrd: [4228]: script:postfix:
status of /etc/init.d/postfix failed (returned 3)
Jul 6 10:00:40 centos-smtp1 clurgmgrd[4228]: status on script
"postfix" returned 1 (generic error)
Jul 6 10:00:40 centos-smtp1 clurgmgrd[4228]: Stopping service
service:Postfix
Jul 6 10:00:40 centos-smtp1 clurgmgrd: [4228]: Executing
/etc/init.d/amavisd stop
Jul 6 10:00:40 centos-smtp1 kernel: do_vfs_lock: VFS is out of sync
with lock manager!
Jul 6 10:00:40 centos-smtp1 last message repeated 8 times
Jul 6 10:00:41 centos-smtp1 clurgmgrd: [4228]: Executing
/etc/init.d/postfix stop
Jul 6 10:00:41 centos-smtp1 clurgmgrd: [4228]: script:postfix:
stop of /etc/init.d/postfix failed (returned 1)
Jul 6 10:00:41 centos-smtp1 clurgmgrd[4228]: stop on script
"postfix" returned 1 (generic error)
Jul 6 10:00:41 centos-smtp1 clurgmgrd: [4228]: Removing IPv4
address 195.83.126.201/24 from bond0
Jul 6 10:00:41 centos-smtp1 avahi-daemon[3552]: Withdrawing address
record for 195.83.126.201 on bond0.
Jul 6 10:00:51 centos-smtp1 clurgmgrd: [4228]: unmounting
/var/lib/amavis
Jul 6 10:00:51 centos-smtp1 clurgmgrd: [4228]: unmounting
/var/spool/postfix
Jul 6 10:00:51 centos-smtp1 clurgmgrd[4228]: #12: RG
service:Postfix failed to stop; intervention required
Jul 6 10:00:51 centos-smtp1 clurgmgrd[4228]: Service
service:Postfix is failed
Jul 6 10:00:52 centos-smtp1 ntpd[3322]: synchronized to 195.83.126.119,
stratum 1
Clustat said:
Cluster Status for cluster-test @ Mon Jul 6 10:02:39 2009
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 centos-imap1.ill.fr                                                1 Online, Local, rgmanager
 centos-imap2.ill.fr                                                2 Online, rgmanager
 centos-smtp1.ill.fr                                                3 Online, rgmanager
 /dev/disk/by-id/scsi-360a98000567247514634507447594661-part1       0 Online, Quorum Disk

 Service Name          Owner (Last)                    State
 ------- ----          ----- ------                    -----
 service:Imap          centos-imap2.ill.fr             started
 service:Postfix       (centos-smtp1.ill.fr)           failed
So I have to disable the Postfix service with:
clusvcadm -d Postfix
and re-enable it with:
clusvcadm -e Postfix
Could you explain to me why my original smtp node is not fenced, and why my
service does not start on the 2nd node?
Is there a way to force the fencing?
--
ARMANET Stephane
Division Projet Technique
Service Informatique
Groupe Infrastructure
Institut Laue langevin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cluster.conf
Type: text/xml
Size: 3723 bytes
Desc: not available
URL:
From robejrm at gmail.com Mon Jul 6 08:22:23 2009
From: robejrm at gmail.com (Juan Ramon Martin Blanco)
Date: Mon, 6 Jul 2009 10:22:23 +0200
Subject: [Linux-cluster] force fencing
In-Reply-To: <4A51B101.2010500@ill.fr>
References: <4A51B101.2010500@ill.fr>
Message-ID: <8a5668960907060122s8a47dd6rb89f4dade8621efe@mail.gmail.com>
On Mon, Jul 6, 2009 at 10:08 AM, Armanet Stephane wrote:
> Hello list
>
> I'm trying to setup a 3 nodes Cluster with 2 failover Domain for an HA
> mail solution.
> I want 1 run active for the Imap server in the Imap Failover domain , 1
> node active for the Smtp in the Smtp Failover domain and the 3rd in the
> 2 failover domain as a backup node.
>
> I run Centos 5.3
> My fence device is a wti power switch
>
> My cluster.conf is in attachement
>
> My SMTP service is composed of:
> 1 IP
> 1 amavisd scritp
> 1 postfix script
> 2 NFS mount for postfix and amavis
>
> If I manually kill the postfix master process (to simulate a crash), my
> node is not fence and the logs said:
>
> Jul 6 10:00:40 centos-smtp1 clurgmgrd: [4228]: Executing
> /etc/init.d/postfix status
> Jul 6 10:00:40 centos-smtp1 clurgmgrd: [4228]: script:postfix:
> status of /etc/init.d/postfix failed (returned 3)
> Jul 6 10:00:40 centos-smtp1 clurgmgrd[4228]: status on script
> "postfix" returned 1 (generic error)
> Jul 6 10:00:40 centos-smtp1 clurgmgrd[4228]: Stopping service
> service:Postfix
> Jul 6 10:00:40 centos-smtp1 clurgmgrd: [4228]: Executing
> /etc/init.d/amavisd stop
> Jul 6 10:00:40 centos-smtp1 kernel: do_vfs_lock: VFS is out of sync
> with lock manager!
> Jul 6 10:00:40 centos-smtp1 last message repeated 8 times
> Jul 6 10:00:41 centos-smtp1 clurgmgrd: [4228]: Executing
> /etc/init.d/postfix stop
> Jul 6 10:00:41 centos-smtp1 clurgmgrd: [4228]: script:postfix:
> stop of /etc/init.d/postfix failed (returned 1)
> Jul 6 10:00:41 centos-smtp1 clurgmgrd[4228]: stop on script
> "postfix" returned 1 (generic error)
> Jul 6 10:00:41 centos-smtp1 clurgmgrd: [4228]: Removing IPv4
> address 195.83.126.201/24 from bond0
> Jul 6 10:00:41 centos-smtp1 avahi-daemon[3552]: Withdrawing address
> record for 195.83.126.201 on bond0.
> Jul 6 10:00:51 centos-smtp1 clurgmgrd: [4228]: unmounting
> /var/lib/amavis
> Jul 6 10:00:51 centos-smtp1 clurgmgrd: [4228]: unmounting
> /var/spool/postfix
> Jul 6 10:00:51 centos-smtp1 clurgmgrd[4228]: #12: RG
> service:Postfix failed to stop; intervention required
> Jul 6 10:00:51 centos-smtp1 clurgmgrd[4228]: Service
> service:Postfix is failed
> Jul 6 10:00:52 centos-smtp1 ntpd[3322]: synchronized to 195.83.126.119,
> stratum 1
>
> Clustat said:
>
> Cluster Status for cluster-test @ Mon Jul 6 10:02:39 2009
> Member Status: Quorate
>
> Member Name ID
> Status
> ------ ---- ----
> ------
> centos-imap1.ill.fr 1
> Online, Local, rgmanager
> centos-imap2.ill.fr 2
> Online, rgmanager
> centos-smtp1.ill.fr 3
> Online, rgmanager
> /dev/disk/by-id/scsi-360a98000567247514634507447594661-part1 0
> Online, Quorum Disk
>
> Service Name Owner
> (Last) State
> ------- ---- -----
> ------ -----
> service:Imap
> centos-imap2.ill.fr started
>
> service:Postfix
> (centos-smtp1.ill.fr) failed
>
>
>
>
> So I have to disable the Postfix servcie with:
> clusvcadm -d Postfix
> and re-enable
> clusvcadm -e Postfix
>
>
>
> Could you explain my why my original smtp node is not fenced and why my
> service is not start on the 2nd node ???
>
Nodes are fenced only when they lose communication with the other nodes,
not when a service fails.
You should check the init scripts to make sure they work fine outside the
cluster; return values are important. I think in your case it is failing
because you killed postfix in a way that deleted the .pid file, and that made
the init script fail.
BTW you should configure the service with recovery="relocate" if you want it
to be started on a different node.
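Purely as an illustration (the failover domain name and resource layout are
guesses; only the service name and IP come from your log), the relevant
cluster.conf fragment would look something like:

  <service name="Postfix" domain="smtp-failover" autostart="1" recovery="relocate">
      <ip address="195.83.126.201" monitor_link="1"/>
      <script name="postfix" file="/etc/init.d/postfix"/>
  </service>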
Greetings,
Juanra
> Is there a way to force the fencing ???
>
>
> --
> ARMANET Stephane
> Division Projet Technique
> Service Informatique
> Groupe Infrastructure
>
> Institut Laue langevin
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From armanets at ill.fr Mon Jul 6 10:49:42 2009
From: armanets at ill.fr (Armanet Stephane)
Date: Mon, 06 Jul 2009 12:49:42 +0200
Subject: [Linux-cluster] force fencing
In-Reply-To: <8a5668960907060122s8a47dd6rb89f4dade8621efe@mail.gmail.com>
References: <4A51B101.2010500@ill.fr>
<8a5668960907060122s8a47dd6rb89f4dade8621efe@mail.gmail.com>
Message-ID: <4A51D6C6.2030006@ill.fr>
Juan Ramon Martin Blanco a écrit :
>>
> Nodes are fenced only when they lost communications with the other nodes,
> not when a service fails.
> You should check the init scripts to make sure it works fine outside the
> cluster, return values are important. I think in your case is failing
> because you killed postfix in a way it deleted the .pid file, and that made
> the init script fail.
> BTW you should configure the service as recovery="relocate" if you want them
> to be started on a different node.
>
> Greetings,
> Juanra
>
>
>
Thanks for the reply.
I will check my init.d scripts
--
ARMANET Stephane
Division Projet Technique
Service Informatique
Groupe Infrastructure
Institut Laue langevin
38042 Grenoble Cedex 9
France
Tel: 04.76.20.78.56 email: armanets at ill.fr
From esggrupos at gmail.com Mon Jul 6 10:55:23 2009
From: esggrupos at gmail.com (ESGLinux)
Date: Mon, 6 Jul 2009 12:55:23 +0200
Subject: [Linux-cluster] OFF TOPIC: cloud computing
In-Reply-To:
References: <3128ba140907021024m4e48ff1fl992272d2239b2c0c@mail.gmail.com>
Message-ID: <3128ba140907060355x7f110150v25492461b004ff22@mail.gmail.com>
>
> It is real, have a look at MPI for development of cloud computing (MPI CH
> as an implementation). Its used for message passing to queue out components
> of a job to various nodes. We implemented sorting using this library last
> year that allocated tasks on a per-core basis across multiple servers.
>
Hi, thanks for your answer, it looks interesting.
I'm still working out how to start studying this; for now I'm reading about it
and watching videos on YouTube ;-)
Thanks again,
ESG
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From agx at sigxcpu.org Mon Jul 6 12:46:10 2009
From: agx at sigxcpu.org (Guido =?iso-8859-1?Q?G=FCnther?=)
Date: Mon, 6 Jul 2009 14:46:10 +0200
Subject: [Linux-cluster] Cluster 3.0.0.rc3 release
In-Reply-To: <1246468327.19414.65.camel@cerberus.int.fabbione.net>
References: <1245496789.3665.328.camel@cerberus.int.fabbione.net>
<20090629184848.GA25796@bogon.sigxcpu.org>
<1246306200.25867.86.camel@cerberus.int.fabbione.net>
<20090701115725.GA6565@bogon.sigxcpu.org>
<1246454636.19414.30.camel@cerberus.int.fabbione.net>
<20090701164007.GA10680@bogon.sigxcpu.org>
<1246468327.19414.65.camel@cerberus.int.fabbione.net>
Message-ID: <20090706124610.GA2229@bogon.sigxcpu.org>
On Wed, Jul 01, 2009 at 07:12:07PM +0200, Fabio M. Di Nitto wrote:
> Do you have a build log for the package? and could you send me the
> make/defines.mk in the build tree?
No, not from that build we're currently using. I can rebuild though. But
from our libccss modifications:
gcc -Wall -Wformat=2 -Wshadow -Wmissing-prototypes -Wstrict-prototypes
-Wdeclaration-after-statement -Wpointer-arith -Wwrite-strings
-Wcast-align
-Wbad-function-cast -Wmissing-format-attribute -Wformat-security
-Wformat-nonliteral -Wno-long-long -Wno-strict-aliasing
-Wmissing-declarations -O2 -ggdb3 -MMD
-I/var/home/schmitz/3/redhat-cluster/make
-DDEFAULT_CONFIG_DIR=\"/etc/cluster\"
-DDEFAULT_CONFIG_FILE=\"cluster.conf\" -DENABLE_PACEMAKER=
-DLOGDIR=\"/var/log/cluster\" -DSYSLOGFACILITY=LOG_LOCAL4
-DSYSLOGLEVEL=LOG_INFO -DRELEASE_VERSION=\"3.0.0.rc3\" -fPIC
-D_GNU_SOURCE
-D_FILE_OFFSET_BITS=64 -I/usr/include
-I/var/home/schmitz/3/redhat-cluster/common/liblogthread `xml2-config
--cflags` -I/usr/include -c -o libccs.o
/var/home/schmitz/3/redhat-cluster/config/libs/libccsconfdb/libccs.c
ar cru libccs.a libccs.o xpathlite.o fullxpath.o extras.o
ranlib libccs.a
gcc -shared -o libccs.so.3.0 -Wl,-soname=libccs.so.3 libccs.o
xpathlite.o
fullxpath.o extras.o -L/usr/lib/corosync -lconfdb `xml2-config --libs`
-L/usr/lib
ln -sf libccs.so.3.0 libccs.so
ln -sf libccs.so.3.0 libccs.so.3
> gcc versions and usual tool chain info.. maybe it's a gcc bug or maybe
> it's an optimization that behaves differently between debian and fedora.
$ gcc -v
Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.3.2-1.1'
$ ld -v
GNU ld (GNU Binutils for Debian) 2.18.0.20080103
$ dpkg -s libc6-dev
Package: libc6-dev
Status: install ok installed
Priority: optional
Section: libdevel
Installed-Size: 11072
Maintainer: GNU Libc Maintainers
Architecture: amd64
Source: glibc
Version: 2.7-18
Replaces: man-db (<= 2.3.10-41), gettext (<= 0.10.26-1), ppp (<=
2.2.0f-24), libgdbmg1-dev (<= 1.7.3-24)
Provides: libc-dev
Depends: libc6 (= 2.7-18), linux-libc-dev
Recommends: gcc | c-compiler
Suggests: glibc-doc, manpages-dev
Conflicts: libstdc++2.10-dev (<< 1:2.95.2-15), gcc-2.95 (<< 1:2.95.3-8),
binutils (<< 2.17cvs20070426-1), libc-dev
> I have attached a small test case to simply test libccs. At this point I
> don't believe it's a problem in libfence. Could you please run it for me
> and send me the output? If the bug is in libccs this would start
> isolating it.
# ./testccs
xpathlite
agent=fence_ilo
Segmentation fault
# and if I prefer fullxpath over xpathlite:
# ./testccs
fullxpath
agent=fence_ilo
Segmentation fault
Cheers,
-- Guido
From robejrm at gmail.com Mon Jul 6 14:09:17 2009
From: robejrm at gmail.com (Juan Ramon Martin Blanco)
Date: Mon, 6 Jul 2009 16:09:17 +0200
Subject: [Linux-cluster] Problems with cluster-snmp rhel5.3 x86_64
Message-ID: <8a5668960907060709p169cf7fcsc60b39704d68fa29@mail.gmail.com>
Hi,
I would like to use snmp to monitor the service status in my clusters (RHEL
5.3 x86_64), so I installed cluster-snmp and configured snmpd as described in
the cluster-snmp documentation, with the public community "cluster".
The thing is that I cannot obtain any information from the community, only
this:
# snmpwalk -v 2c -c cluster localhost REDHAT-CLUSTER-MIB::RedHatCluster
REDHAT-CLUSTER-MIB::rhcMIBVersion.0 = INTEGER: 1
That's the only information that can be obtained from the MIB...
E.g. if I query the services I get this:
# snmpwalk -v 2c -c cluster localhost
REDHAT-CLUSTER-MIB::rhcClusterServicesNames
REDHAT-CLUSTER-MIB::rhcClusterServicesNames = No Such Instance currently
exists at this OID
Any clues? Is it a bug in the x86_64 version? I also tested this on RHEL 5.1
32-bit and it worked fine.
Thanks in advance,
Juanra
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From agx at sigxcpu.org Mon Jul 6 19:09:54 2009
From: agx at sigxcpu.org (Guido =?iso-8859-1?Q?G=FCnther?=)
Date: Mon, 6 Jul 2009 21:09:54 +0200
Subject: [Linux-cluster] add "force-reload" targets to init scripts
Message-ID: <20090706190954.GA28021@bogon.sigxcpu.org>
Hi,
attached patch adds the force-reload targets to the init scripts as
expected by Debian based distros. Would be nice to have this applied for
3.0.
Cheers,
-- Guido
-------------- next part --------------
A non-text attachment was scrubbed...
Name: force-reload.diff
Type: text/x-diff
Size: 1093 bytes
Desc: not available
URL:
From fdinitto at redhat.com Tue Jul 7 07:21:25 2009
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Tue, 07 Jul 2009 09:21:25 +0200
Subject: [Linux-cluster] add "force-reload" targets to init scripts
In-Reply-To: <20090706190954.GA28021@bogon.sigxcpu.org>
References: <20090706190954.GA28021@bogon.sigxcpu.org>
Message-ID: <1246951285.7993.1.camel@cerberus.int.fabbione.net>
hi Guido,
On Mon, 2009-07-06 at 21:09 +0200, Guido Günther wrote:
> Hi,
> attached patch adds the force-reload targets to the init scripts as
> expected by Debian based distros. Would be nice to have this applied for
> 3.0.
> Cheers,
> -- Guido
http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=192d4e27c36fb263617ad726795f1c8dbc709497
thanks for the patch.
In future could you please send patches to cluster-devel mailing list?
It will be easier to notice them.
Fabio
From robejrm at gmail.com Tue Jul 7 09:57:41 2009
From: robejrm at gmail.com (Juan Ramon Martin Blanco)
Date: Tue, 7 Jul 2009 11:57:41 +0200
Subject: [Linux-cluster] qdisk best practices
In-Reply-To: <15D5002F61F31A45A82A153D2F73906760FBD3F011@HOUMS26.onshore.pgs.com>
References:
<15D5002F61F31A45A82A153D2F73906760FBD3F011@HOUMS26.onshore.pgs.com>
Message-ID: <8a5668960907070257k27567349pbba385cb3329489c@mail.gmail.com>
On Wed, Jul 1, 2009 at 8:24 PM, Luis Cerezo wrote:
> Hi all-
>
> i've got a RHEL 5.3 cluster, 2node with qdisk. All works fine, but the
> qdisk seems to beat on the SAN (I/Ops) I adjusted the interval from the
> default of 1 to 5 and it is still high (san admin is crying)
>
> does anyone have best practices for this? its an LSI san and both nodes are
> mulitpathed to it via 4G FC.
>
If it's really a big problem for the SAN, consider adding a third node to
the cluster and getting rid of the qdisk.
Greetings,
Juanra
>
> thanks!
>
>
>
> Luis E. Cerezo
> PGS
> Global IT
>
> This e-mail, any attachments and response string may contain proprietary
> information, which are confidential and may be legally privileged. It is
> for the intended recipient only and if you are not the intended recipient or
> transmission error has misdirected this e-mail, please notify the author by
> return e-mail and delete this message and any attachment immediately. If
> you are not the intended recipient you must not use, disclose, distribute,
> forward, copy, print or rely in this e-mail in any way except as permitted
> by the author.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From robejrm at gmail.com Tue Jul 7 10:10:51 2009
From: robejrm at gmail.com (Juan Ramon Martin Blanco)
Date: Tue, 7 Jul 2009 12:10:51 +0200
Subject: [Linux-cluster] quorum disk size recommedation
In-Reply-To: <3128ba140906290248q620ad560m8700f65ab0bd63d8@mail.gmail.com>
References: <3128ba140906290238l3e98aac2s2b5ac2bf97418307@mail.gmail.com>
<200906291243.18175.harri.paivaniemi@tieto.com>
<3128ba140906290248q620ad560m8700f65ab0bd63d8@mail.gmail.com>
Message-ID: <8a5668960907070310y4cf0924v16ccec3f892a67f5@mail.gmail.com>
On Mon, Jun 29, 2009 at 11:48 AM, ESGLinux wrote:
> hi,
> Thanks for your quick answer.
>
> Just for curiosity, why this size? and with 10 MB, what happens if you need
> more? (the question is why can you need more? perhaps 1000 nodes? or it
> doesnt matter)
>
Correct me if I'm wrong, but Red Hat does not officially support clusters
with quorum disks that have more than 16 nodes.
Regards,
Juanra
>
> Greetings,
>
> ESG
>
> 2009/6/29 H.Päiväniemi
>
>
>> http://sources.redhat.com/cluster/wiki/FAQ/CMAN#quorumdisksize
>>
>> What's the minimum size of a quorum disk/partition?
>>
>> The official answer is 10MB. The real number is something like 100KB, but
>> we'd like to reserve 10MB for possible
>> future expansion and features.
>>
>>
>> -hjp
>>
>>
>>
>> On Monday 29 June 2009 12:38:39 ESGLinux wrote:
>> > Hi all,
>> >
>> > I'm planning a 2 nodes cluster and I'm going to use quorum disk. My
>> > question is which is the best size of this kind of disk. It will be
>> > interesting to explain how calculate this size,
>> >
>> > Thanks in advance
>> >
>> > ESG
>>
>>
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From robejrm at gmail.com Tue Jul 7 10:21:02 2009
From: robejrm at gmail.com (Juan Ramon Martin Blanco)
Date: Tue, 7 Jul 2009 12:21:02 +0200
Subject: [Linux-cluster] cman + qdisk timeouts....
In-Reply-To:
References:
Message-ID: <8a5668960907070321p5a082091oa7f83fff625dde47@mail.gmail.com>
On Mon, Jun 15, 2009 at 4:17 PM, Moralejo, Alfredo <
alfredo.moralejo at roche.com> wrote:
> Hi,
>
>
>
> I'm having what I think is a timeouts issue in my cluster.
>
>
>
> I have a two node cluster using qdisk. Everytime the node that has the
> master role for qdisk becomes down (for failure or even stopping qdiskd
> manually), packages in the sane node are stopped because of the lack of
> quorum as the qdiskd becames unresponsive until second node becames master
> node and start working properly. Once qdiskd start working fine (usually 5-6
> seconds) packages are started again.
>
>
>
> I've read in the cluster manual the section for "CMAN membership timeout
> value" and I think this is the case. I've used RHEL 5.3 and I thought this
> parameter is the token that I set much longer than needed:
>
>
>
>
>
>
>
> ?
>
>
>
> status_file="/tmp/qdisk" tko="3" votes="5" log_level="7"
> log_facility="local4"/>
>
>
>
>
>
> Totem token is much more that double of qdisk timeout, so I guess it should
> be enough but everytime qdisk dies in the master node I get same result,
> services restarted in the sane node:
>
>
>
> Jun 15 16:11:33 rmamseslab07 qdiskd[14130]: Node 1 missed an update
> (2/3)
>
> Jun 15 16:11:38 rmamseslab07 qdiskd[14130]: Node 1 missed an update
> (3/3)
>
> Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: Node 1 missed an update
> (4/3)
>
> Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: Node 1 DOWN
>
> Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: Making bid for master
>
> Jun 15 16:11:44 rmamseslab07 clurgmgrd: [18510]: Executing
> /etc/init.d/watchdog status
>
> Jun 15 16:11:48 rmamseslab07 qdiskd[14130]: Node 1 missed an update
> (5/3)
>
> Jun 15 16:11:53 rmamseslab07 qdiskd[14130]: Node 1 missed an update
> (6/3)
>
> *Jun 15 16:11:53 rmamseslab07 qdiskd[14130]: Assuming master role*
>
>
>
> Message from syslogd at rmamseslab07 at Jun 15 16:11:53 ...
>
> clurgmgrd[18510]: #1: Quorum Dissolved
>
> Jun 15 16:11:53 rmamseslab07 openais[14087]: [CMAN ] lost contact with
> quorum device
>
> Jun 15 16:11:53 rmamseslab07 openais[14087]: [CMAN ] quorum lost, blocking
> activity
>
> Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: Membership Change
> Event
>
> *Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: #1: Quorum
> Dissolved*
>
> Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: Emergency stop of
> service:Cluster_test_2
>
> Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: Emergency stop of
> service:wdtcscript-rmamseslab05-ic
>
> Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: Emergency stop of
> service:wdtcscript-rmamseslab07-ic
>
> Jun 15 16:11:54 rmamseslab07 clurgmgrd[18510]: Emergency stop of
> service:Logical volume 1
>
> Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: Node 1 missed an update
> (7/3)
>
> Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: Writing eviction
> notice for node 1
>
> Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: Telling CMAN to kill
> the node
>
> *Jun 15 16:11:58 rmamseslab07 openais[14087]: [CMAN ] quorum regained,
> resuming activity*
>
>
>
> I've just logged a case but... any idea?
>
>
>
> Regards,
>
Hi!
Have you set two_node="0" in the cman section?
Why don't you use a heuristic within the quorumd configuration, e.g.
pinging a router?
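Something along these lines, purely as a sketch (the timing values and the
router address are placeholders to adapt to your setup):

  <quorumd interval="2" tko="3" votes="5" label="qdisk">
      <heuristic program="ping -c1 -w1 192.168.1.1" interval="2" score="1"/>
  </quorumd>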
Could you paste us your cluster.conf?
Greetings,
Juanra
>
>
>
>
> *Alfredo Moralejo*
> Business Platforms Engineering - OS Servers - UNIX Senior Specialist
>
> F. Hoffmann-La Roche Ltd.
>
> Global Informatics Group Infrastructure
> Josefa Valcárcel, 40
> 28027 Madrid SPAIN
>
> Phone: +34 91 305 97 87
>
> alfredo.moralejo at roche.com
>
> *Confidentiality Note:* This message is intended only for the use of the
> named recipient(s) and may contain confidential and/or proprietary
> information. If you are not the intended recipient, please contact the
> sender and delete this message. Any unauthorized use of the information
> contained in this message is prohibited.
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From robejrm at gmail.com Tue Jul 7 10:22:29 2009
From: robejrm at gmail.com (Juan Ramon Martin Blanco)
Date: Tue, 7 Jul 2009 12:22:29 +0200
Subject: [Linux-cluster] System load at 1.00 for gfs2?
In-Reply-To: <13E5ADD5-B0C6-4339-8D86-5E46DA37B6A6@netspot.com.au>
References: <20090513173511.GA5992@esri.com>
<8a5668960905180135p118312bfj6625f8513f477674@mail.gmail.com>
<20090518140201.GA7429@esri.com>
<1242655685.29604.345.camel@localhost.localdomain>
<13E5ADD5-B0C6-4339-8D86-5E46DA37B6A6@netspot.com.au>
Message-ID: <8a5668960907070322x4bdbed49nd0ae3712a4069b0e@mail.gmail.com>
On Wed, Jun 10, 2009 at 3:29 AM, Tom Lanyon wrote:
> On 18/05/2009, at 11:38 PM, Steven Whitehouse wrote:
>
> The fix has gone in to RHEL 5.4. I have a feeling that it might also go
>> into 5.3.z but I'm not 100% sure what the timescales are there. The bug
>> is known and fixed in upstream too.
>>
>> It isn't actually using any more CPU, its just that the LA is
>> incremented by 1. So a fix is already on its way,
>>
>> Steve.
>>
>
>
> Great, we experience this bug too. It doesn't cause any problems but
> confuses some of the administrators... :)
>
It's currently fixed in 5.3
Many thanks!
>
> Tom
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From esggrupos at gmail.com Tue Jul 7 10:28:34 2009
From: esggrupos at gmail.com (ESGLinux)
Date: Tue, 7 Jul 2009 12:28:34 +0200
Subject: [Linux-cluster] quorum disk size recommedation
In-Reply-To: <8a5668960907070310y4cf0924v16ccec3f892a67f5@mail.gmail.com>
References: <3128ba140906290238l3e98aac2s2b5ac2bf97418307@mail.gmail.com>
<200906291243.18175.harri.paivaniemi@tieto.com>
<3128ba140906290248q620ad560m8700f65ab0bd63d8@mail.gmail.com>
<8a5668960907070310y4cf0924v16ccec3f892a67f5@mail.gmail.com>
Message-ID: <3128ba140907070328o3ce52d8au1e7c1934a38e0019@mail.gmail.com>
> Correct me if I'm wrong, but Red Hat does not officially support clusters
> with quorum disks, with more than 16 nodes.
>
> Regards,
> Juanra
>
>>
>>
Hi Juanra, I had no idea about this limit; my numbers were only to ask what
happens if you need more...
Greetings,
ESG
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From brdvss at gmail.com Tue Jul 7 11:14:04 2009
From: brdvss at gmail.com (Brady Vass)
Date: Tue, 7 Jul 2009 16:44:04 +0530
Subject: [Linux-cluster] Re: Commands for communicating among nodes?
In-Reply-To: <995446330907040319j57c13bfbhfe8b92a02f2937e4@mail.gmail.com>
References: <995446330907030938t67b4c101y4dde2fbf2c51e8eb@mail.gmail.com>
<995446330907040319j57c13bfbhfe8b92a02f2937e4@mail.gmail.com>
Message-ID: <995446330907070414j10a5a728r69689c6fb2da34a9@mail.gmail.com>
Thanks much for the responses. I will definitely try it out.
Thanks and regards,
Brady.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From brdvss at gmail.com Tue Jul 7 11:21:02 2009
From: brdvss at gmail.com (Brady Vass)
Date: Tue, 7 Jul 2009 16:51:02 +0530
Subject: [Linux-cluster] Disk Monitoring in RHCS
Message-ID: <995446330907070421q5b395772j85ef8860bf7e2552@mail.gmail.com>
Hi,
I am trying to configure a cluster where the resource is on a SCSI disk and
I need to monitor the disk. The failover should happen depending on the
disk-monitoring result. Can someone point me in the right direction? How do I
go about monitoring the disk?
Thanks much in advance.
regards,
Brady.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From rsetchfield at xcalibre.co.uk Tue Jul 7 12:35:24 2009
From: rsetchfield at xcalibre.co.uk (Raymond Setchfield)
Date: Tue, 07 Jul 2009 13:35:24 +0100
Subject: [Linux-cluster] Trying to locate the bottleneck
Message-ID: <4A53410C.3090704@xcalibre.co.uk>
Hi
I am trying to track down a problem with a setup which I am currently
testing.
This is the current setup I have at the moment:
15 web farm servers running the vhost-ldap module with LDAP caching
enabled, behind 2 load balancer servers in failover. The load balancers
are running Piranha.
I am using siege to do some benchmarking on these, basically to test
their availability when pushing high concurrency.
At 100 (99.60 according to siege) concurrent connections it appears to be
all OK with 99.89% availability. At 120 (119.52 according to siege) concurrent
connections I get 99.9%, and at 130 (129.51 according to siege)
concurrent connections I get 100% availability.
However, pushing it any further than this, for example 150 concurrent
connections, it falls over and siege bails out with multiple
connection timeouts. I am trying to find the bottleneck here, and I am
wondering if it is the software I am using for the load balancers or a
limitation of apache.
The command I am using for siege is pretty simple, nothing special:
siege --concurrent=150 --internet --file=urls.txt --benchmark --time=60M
My lvs.cf file can be found here to show you guys the config which I am
using.
http://pastebin.com/m52d6cc23
Any help would be greatly appreciated
Many Thanks
R.
From esggrupos at gmail.com Tue Jul 7 12:52:54 2009
From: esggrupos at gmail.com (ESGLinux)
Date: Tue, 7 Jul 2009 14:52:54 +0200
Subject: [Linux-cluster] Package Apache and Mysql Problem
In-Reply-To:
References:
Message-ID: <3128ba140907070552u71769fdci6201d9fd24d731a5@mail.gmail.com>
Hi,
How are you configuring the cluster? With Conga? With system-config-cluster?
If you run clustat, what does it show?
If you use the clusvcadm command to start the services, what happens?
Any errors in /var/log/messages?
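Concretely, something like this run by hand on the node (the service name Httpd
is taken from your log):

  clustat                     # overall cluster, node and service state
  clusvcadm -e Httpd          # try to enable the service by hand and watch the result
  tail -f /var/log/messages   # follow rgmanager/script errors while it starts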
Greetings
ESG
2009/7/1 Giussani Andrea
> Hi,
>
> i have a little big problem with RH Cluster Suite.
>
> I have 2 cluster nodes with 1 partition to share between the 2 node. There
> is no SAN.
> The node have the same hardware and the same partition.
> I have 1 partition with drbd to sycronize the 2 nodes Primary/Primary.
>
> I try in a lot type of configuration of Apache and Mysql package but i have
> the same problem.
> The error is:
> Jul 1 18:50:39 nodo1 luci[2581]: Unable to retrieve batch 1072342062
> status from nodo2.local:11111: clusvcadm start failed to start Httpd:
>
> nodo1 and nodo2 is the 2 nodes and httpd is the apache service.
>
> Any idea???
>
> I try the configuration in this procedure:
> http://kbase.redhat.com/faq/docs/DOC-5648 for Mysql but the result is the
> same.
>
> In attach my cluster.conf and drbd.conf
>
> If we need more tell me please.
>
> Thanks a lot
>
> Andrea Giussani
>
>
> AVVERTENZE AI SENSI DEL D.LGS. 196/2003 .
>
> Il contenuto di questo messaggio (ed eventuali allegati) e' strettamente
> confidenziale. L'utilizzo del contenuto del messaggio e' riservato
> esclusivamente al destinatario. La modifica, distribuzione, copia del
> messaggio da parte di altri e' severamente proibita. Se non siete i
> destinatari Vi invitiamo ad informare il mittente ed eliminare tutte le
> copie del suddetto messaggio .
>
> The content of this message (and attachment) is closely confidentiality.
> Use of the content of the message is classified exclusively to the
> addressee. The modification, distribution, copy of the message from others
> are forbidden. If you are not the addressees, we invite You to inform the
> sender and to eliminate all the copies of the message.
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From santosh.balan at linuxmail.org Tue Jul 7 13:08:21 2009
From: santosh.balan at linuxmail.org (Santosh Balan)
Date: Tue, 7 Jul 2009 08:08:21 -0500
Subject: [Linux-cluster] Redhat Cluster Problem
Message-ID: <20090707130821.4B5AB10612@ws1-3.us4.outblaze.com>
Hi Friends,
I am getting the following errors on the cluster at my site. I am using
the RHEL 5 cluster suite for HA of my database and web server. At one
point each day my cluster restarts the service automatically
because it cannot find the virtual IP. Can you please guide me on this issue?
Checking my logs, i.e. /var/log/messages, shows me the following
info:
Jul 6 18:18:11 DB01 clurgmgrd: [4350]: Failed to ping
xxx.xxx.xxx.xxx
Jul 6 18:18:11 DB01 clurgmgrd[4350]: status on ip
"xxx.xxx.xxx.xxx" returned 1 (generic error)
Jul 6 18:18:11 DB01 clurgmgrd[4350]: Stopping service
service:mysql
Jul 6 18:18:11 DB01 clurgmgrd: [4350]: Executing
/etc/init.d/mysql stop
Jul 6 18:18:19 DB01 clurgmgrd: [4350]: Removing IPv4 address
xxx.xxx.xxx.xxx from bond0
Jul 6 18:18:19 DB01 snmpd[2238]: Connection from UDP: [127.0.0.1]:36318
Jul 6 18:18:29 DB01 clurgmgrd: [4350]: unmounting /data
Jul 6 18:18:29 DB01 clurgmgrd[4350]: Service service:mysql is
recovering
Jul 6 18:18:29 DB01 clurgmgrd[4350]: Recovering failed service
service:mysql
Jul 6 18:18:30 DB01 clurgmgrd: [4350]: mounting
/dev/mapper/vg01-DB on /data
Jul 6 18:18:30 DB01 kernel: kjournald starting. Commit interval 5
seconds
Jul 6 18:18:30 DB01 kernel: EXT3 FS on dm-3, internal journal
Jul 6 18:18:30 DB01 kernel: EXT3-fs: mounted filesystem with ordered
data mode.
Jul 6 18:18:30 DB01 clurgmgrd: [4350]: Adding IPv4 address
xxx.xxx.xxx.xxx to bond0
Jul 6 18:18:31 DB01 clurgmgrd: [4350]: Executing
/etc/init.d/mysql start
Jul 6 18:18:33 DB01 clurgmgrd[4350]: Service service:mysql
started
Thanks in advance and expecting your reply at the earliest.
Thanks and Regards
Santosh Balan
9819419509
--
Be Yourself @ mail.com!
Choose From 200+ Email Addresses
Get a Free Account at www.mail.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From robejrm at gmail.com Tue Jul 7 13:12:52 2009
From: robejrm at gmail.com (Juan Ramon Martin Blanco)
Date: Tue, 7 Jul 2009 15:12:52 +0200
Subject: [Linux-cluster] Redhat Cluster Problem
In-Reply-To: <20090707130821.4B5AB10612@ws1-3.us4.outblaze.com>
References: <20090707130821.4B5AB10612@ws1-3.us4.outblaze.com>
Message-ID: <8a5668960907070612p788d154ap2a60755caa95ef9d@mail.gmail.com>
On Tue, Jul 7, 2009 at 3:08 PM, Santosh Balan
wrote:
> Hi Friends,
>
> I am getting the following errors on the cluster at my site. I am using the
> RHEL 5 cluster suite on for HA of my database and web server. My cluster
> service at one point in a day restarts the service automatically as it
> cannot find the virtual ip. Can you please guide me on this issue. On
> checking my logs i.e. /var/log/messages it shows me the following info:
>
> Jul 6 18:18:11 DB01 clurgmgrd: [4350]: Failed to ping
> xxx.xxx.xxx.xxx
> Jul 6 18:18:11 DB01 clurgmgrd[4350]: status on ip
> "xxx.xxx.xxx.xxx" returned 1 (generic error)
> Jul 6 18:18:11 DB01 clurgmgrd[4350]: Stopping service
> service:mysql
> Jul 6 18:18:11 DB01 clurgmgrd: [4350]: Executing /etc/init.d/mysql
> stop
> Jul 6 18:18:19 DB01 clurgmgrd: [4350]: Removing IPv4 address
> xxx.xxx.xxx.xxx from bond0
> Jul 6 18:18:19 DB01 snmpd[2238]: Connection from UDP: [127.0.0.1]:36318
> Jul 6 18:18:29 DB01 clurgmgrd: [4350]: unmounting /data
> Jul 6 18:18:29 DB01 clurgmgrd[4350]: Service service:mysql is
> recovering
> Jul 6 18:18:29 DB01 clurgmgrd[4350]: Recovering failed service
> service:mysql
> Jul 6 18:18:30 DB01 clurgmgrd: [4350]: mounting /dev/mapper/vg01-DB
> on /data
> Jul 6 18:18:30 DB01 kernel: kjournald starting. Commit interval 5 seconds
> Jul 6 18:18:30 DB01 kernel: EXT3 FS on dm-3, internal journal
> Jul 6 18:18:30 DB01 kernel: EXT3-fs: mounted filesystem with ordered data
> mode.
> Jul 6 18:18:30 DB01 clurgmgrd: [4350]: Adding IPv4 address
> xxx.xxx.xxx.xxx to bond0
> Jul 6 18:18:31 DB01 clurgmgrd: [4350]: Executing /etc/init.d/mysql
> start
> Jul 6 18:18:33 DB01 clurgmgrd[4350]: Service service:mysql
> started
>
The cluster is doing its job, isn't it? When the IP address is not
reachable, it restarts the service. Are you sure you are using the correct
IP? Maybe another machine on the network is trying to bring that IP up.
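A quick way to check for that is to probe the address over ARP from another
box on the same subnet; this is just a sketch, with xxx.xxx.xxx.xxx and bond0
standing in for the real VIP and interface:

  # Ask who answers ARP for the virtual IP; replies from more than one MAC,
  # or from a MAC that is not the cluster node's, point to a duplicate address
  arping -I bond0 -c 3 xxx.xxx.xxx.xxx
  # Compare with the interface that should be holding the VIP on the node
  ip addr show bond0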
Regards,
Juanra
>
> Thanks in advance and expecting your reply at the earliest.
>
> Thanks and Regards
> Santosh Balan
> 9819419509
>
> -- Be Yourself @ mail.com!
> Choose From 200+ Email Addresses
> Get a *Free* Account at www.mail.com !
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From charlieb-linux-cluster at budge.apana.org.au Tue Jul 7 14:38:39 2009
From: charlieb-linux-cluster at budge.apana.org.au (Charlie Brady)
Date: Tue, 7 Jul 2009 10:38:39 -0400 (EDT)
Subject: [Linux-cluster] GFS2 with IMAP Maildir server
In-Reply-To: <4A4E5E9B.7060906@bobich.net>
References: <58aa8d780907031230i6d78890fpdd8791521c421348@mail.gmail.com>
<4A4E5E9B.7060906@bobich.net>
Message-ID:
On Fri, 3 Jul 2009, Gordan Bobic wrote:
> Sounds like you are running into the same bug that I ran into with GFS2 on a
> similar setup nearly 2 years ago, except I could produce a lock-up in under 2
> seconds every time. Solution is to use GFS1 if you really want to stick with
> that setup, but bear in mind that, regardless of the cluster file system
> (GFS1, GFS2, OCFS2) the performance will scale _inversely_. Cluster file
> systems really don't work well with millions of small files.
Isn't Maildir designed to work reliably with NFS? Do you really need a
cluster file system?
From agx at sigxcpu.org Tue Jul 7 16:28:45 2009
From: agx at sigxcpu.org (Guido =?iso-8859-1?Q?G=FCnther?=)
Date: Tue, 7 Jul 2009 18:28:45 +0200
Subject: [Linux-cluster] Cluster 3.0.0.rc4 release
In-Reply-To: <1246490190.19414.93.camel@cerberus.int.fabbione.net>
References: <1246490190.19414.93.camel@cerberus.int.fabbione.net>
Message-ID: <20090707162845.GA30094@bogon.sigxcpu.org>
Hi,
On Thu, Jul 02, 2009 at 01:16:30AM +0200, Fabio M. Di Nitto wrote:
> The cluster team and its community are proud to announce the
> 3.0.0.rc4 release candidate from the STABLE3 branch.
Based on earlier Debian and Ubuntu packages of corosync, openais and
cluster I have put preliminary Debian packages (built against Debian
Squeeze) here:
http://pkg-libvirt.alioth.debian.org/packages/unstable/
Here are the sources.list entries:
http://wiki.debian.org/Teams/DebianLibvirtTeam#Packages
Cheers,
-- Guido
From fdinitto at redhat.com Tue Jul 7 17:45:16 2009
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Tue, 07 Jul 2009 19:45:16 +0200
Subject: [Linux-cluster] Cluster 3.0.0.rc4 release
In-Reply-To: <20090707162845.GA30094@bogon.sigxcpu.org>
References: <1246490190.19414.93.camel@cerberus.int.fabbione.net>
<20090707162845.GA30094@bogon.sigxcpu.org>
Message-ID: <1246988716.7993.13.camel@cerberus.int.fabbione.net>
Hi Guido,
On Tue, 2009-07-07 at 18:28 +0200, Guido Günther wrote:
> Hi,
> On Thu, Jul 02, 2009 at 01:16:30AM +0200, Fabio M. Di Nitto wrote:
> > The cluster team and its community are proud to announce the
> > 3.0.0.rc4 release candidate from the STABLE3 branch.
> Based on earlier Debian and Ubuntu packages of corosync, openais and
> cluster I have put preliminary Debian packages (built against Debian
> Squeeze) here:
> http://pkg-libvirt.alioth.debian.org/packages/unstable/
> Here are the sources.list entries:
> http://wiki.debian.org/Teams/DebianLibvirtTeam#Packages
> Cheers,
> -- Guido
awesome! thanks!
I didn't check them out... anyway I am adding the Ubuntu HA team in
CC... it's worth sharing the effort.
For some time I have been thinking of pulling the debian/ and .spec files
upstream and involving the maintainers so they work directly with us.
This would happen for corosync/openais (they already have spec files)
and cluster.
Is anybody in contact with the Debian team who could ask if they would like
to work more closely with us?
Cheers
Fabio
From jeff.sturm at eprize.com Tue Jul 7 17:58:45 2009
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Tue, 7 Jul 2009 13:58:45 -0400
Subject: [Linux-cluster] Trying to locate the bottleneck
In-Reply-To: <4A53410C.3090704@xcalibre.co.uk>
References: <4A53410C.3090704@xcalibre.co.uk>
Message-ID: <64D0546C5EBBD147B75DE133D798665F02FDC2C5@hugo.eprize.local>
Hi Raymond,
At those concurrency levels I would suspect network tuning may help.
Does dmesg show anything interesting on the load balancers during your
testing?
For high levels of concurrency on a NAT'd firewall or load balancer I
specifically remember having to adjust ip_conntrack_max upwards.
Perhaps network buffers as well.
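To give a concrete (purely illustrative) idea of what I mean, assuming a
RHEL 5 era box with the ip_conntrack module loaded:

  # See the current ceiling and how close you get to it under load
  cat /proc/sys/net/ipv4/ip_conntrack_max
  wc -l /proc/net/ip_conntrack
  # Raise the ceiling for this boot; put it in /etc/sysctl.conf if it helps
  sysctl -w net.ipv4.ip_conntrack_max=131072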
-Jeff
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com]
> On Behalf Of Raymond Setchfield
> Sent: Tuesday, July 07, 2009 8:35 AM
> To: linux-cluster at redhat.com
> Subject: [Linux-cluster] Trying to locate the bottleneck
>
> Hi
>
> I am trying to find a problem here with a setup which I am currently
> testing.
>
> This is the current setup which I have at the moment
>
> 15 web farm servers which are running vhost-ldap module and also have
> ldap caching enabled. Which are behind 2 Load balancer servers which
are
> in fail over. The software which it is currently running is Piranha on
> the load balancers.
>
> I am using siege to get some benchmarking done on these to test
> basically their availability when pushing high concurrency.
>
> At 100 (99.60 according to siege) Concurrent Connection it appears to
be
> all ok with 99.89%. At 120 (119.52 according to siege) Concurrent
> connections I get 99.9%, and at 130 (129.51 according to siege)
> Concurrent Connections I get 100% availability.
>
> However pushing it any further than this, for example 150 concurrent
> connections it is falling over and siege bails out with multiple
> connection time outs. I am trying to find the bottle neck here and I
am
> wondering if it is software which I am using for the load balancers or
a
> limitation with apache.
>
> The command I am using for siege is pretty simple nothing special;
>
> siege --concurrent=150 --internet --file=urls.txt --benchmark
--time=60M
>
> My lvs.cf file can be found here to show you guys the config which I
am
> using.
>
> http://pastebin.com/m52d6cc23
>
> Any help would be greatly appreciated
>
> Many Thanks
>
> R.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
From abednegoyulo at yahoo.com Wed Jul 8 06:53:37 2009
From: abednegoyulo at yahoo.com (Abed-nego G. Escobal, Jr.)
Date: Tue, 7 Jul 2009 23:53:37 -0700 (PDT)
Subject: [Linux-cluster] Cannot make cluster after upgrade
Message-ID: <614645.81236.qm@web110403.mail.gq1.yahoo.com>
After an upgrade from 5.2 to 5.3, the cluster, named GFSCluster, seems to stop being a cluster. GFSCluster is a 2 node cluster using iscsi, cman, clvm, and gfs, and it was working fine when it was on 5.2. The configuration on both of the nodes (passwords removed):
When starting the service cman, they both hang on the part starting fencing
Starting cluster:
Loading modules... done
Mounting configfs... done
Starting ccsd... done
Starting cman... done
Starting daemons... done
Starting fencing...
After 5 minutes the task finishes with "done" but clustat says
==== As root on web01.company.com ====
Cluster Status for GFSCluster @ Wed Jul 8 01:00:24 2009
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
node01.company.com 1 Online, Local
node02.company.com 2 Offline
==== As root on web02.company.com ====
Cluster Status for GFSCluster @ Wed Jul 8 01:00:26 2009
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
node01.company.com 1 Offline
node02.company.com 2 Online, Local
They are both quorate with their own cluster
In the logs of web01 I found repeating messages
Jul 8 00:55:27 web01 fenced[21872]: node02.company.com not a cluster member after 6 sec post_join_delay
Jul 8 00:55:27 web01 fenced[21872]: fencing node "node02.company.com"
Jul 8 00:55:52 web01 fenced[21872]: agent "fence_ipmilan" reports: Rebooting machine @ IPMI:10.1.0.7...ipmilan: Failed to connect after 30 seconds Failed
In the logs of web02 I also found the same repeating messages
Jul 8 00:55:27 web02 fenced[6363]: node01.company.com not a cluster member after 6 sec post_join_delay
Jul 8 00:55:27 web02 fenced[6363]: fencing node "node01.company.com"
Jul 8 00:55:53 web02 fenced[6363]: agent "fence_ipmilan" reports: Rebooting machine @ IPMI:10.1.0.5...ipmilan: Failed to connect after 30 seconds Failed
Is there a bug in 5.3 with regard to clustering?
Are there any workarounds?
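One thing worth trying, purely as a sketch using the IPMI addresses and root
login that appear in the logs above (substitute the real password), is to run
the fence agent by hand from each node:

  # From node01, check whether node02's BMC is reachable at all (and vice versa)
  fence_ipmilan -a 10.1.0.7 -l root -p '********' -o status
  # If this also times out, the problem is IPMI connectivity rather than fenced
  ping -c 3 10.1.0.7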
Feel safer online. Upgrade to the new, safer Internet Explorer 8 optimized for Yahoo! to put your mind at peace. It's free. Get IE8 here! http://downloads.yahoo.com/sg/internetexplorer/
From cthulhucalling at gmail.com Wed Jul 8 06:59:24 2009
From: cthulhucalling at gmail.com (Ian Hayes)
Date: Tue, 7 Jul 2009 23:59:24 -0700
Subject: [Linux-cluster] Cannot make cluster after upgrade
In-Reply-To: <614645.81236.qm@web110403.mail.gq1.yahoo.com>
References: <614645.81236.qm@web110403.mail.gq1.yahoo.com>
Message-ID: <36df569a0907072359m2fce04d3h3c437b219eb73a9e@mail.gmail.com>
Sounds a little split-brainish....... have you tried the clean_start=1
option?
On Jul 7, 2009 11:54 PM, "Abed-nego G. Escobal, Jr."
wrote:
After an upgrade from 5.2 to 5.3, the cluster, named GFSCluster, seems to
stop being a cluster. GFSCluster is a 2 node cluster using iscsi, cman,
clvm, and gfs and it was working fine when it was on 5.2 The configuration
on both of the nodes (passwords removed)
When starting the service cman, they both hang on the part starting fencing
Starting cluster:
Loading modules... done
Mounting configfs... done
Starting ccsd... done
Starting cman... done
Starting daemons... done
Starting fencing...
After 5 minutes the task finishes with "done" but clustat says
==== As root on web01.company.com ====
Cluster Status for GFSCluster @ Wed Jul 8 01:00:24 2009
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
node01.company.com 1 Online, Local
node02.company.com 2 Offline
==== As root on web02.company.com ====
Cluster Status for GFSCluster @ Wed Jul 8 01:00:26 2009
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
node01.company.com 1 Offline
node02.company.com 2 Online, Local
They are both quorate with their own cluster
In the logs of web01 I found repeating messages
Jul 8 00:55:27 web01 fenced[21872]: node02.company.com not a cluster member
after 6 sec post_join_delay
Jul 8 00:55:27 web01 fenced[21872]: fencing node "node02.company.com"
Jul 8 00:55:52 web01 fenced[21872]: agent "fence_ipmilan" reports:
Rebooting machine @ IPMI:10.1.0.7...ipmilan: Failed to connect after 30
seconds Failed
In the logs of web02 I also found the same repeating messages
Jul 8 00:55:27 web02 fenced[6363]: node01.company.com not a cluster member
after 6 sec post_join_delay
Jul 8 00:55:27 web02 fenced[6363]: fencing node "node01.company.com"
Jul 8 00:55:53 web02 fenced[6363]: agent "fence_ipmilan" reports: Rebooting
machine @ IPMI:10.1.0.5...ipmilan: Failed to connect after 30 seconds Failed
Is there a bug on 5.3 with regards to clustering?
Is there any workarounds?
Feel safer online. Upgrade to the new, safer Internet Explorer 8
optimized for Yahoo! to put your mind at peace. It's free. Get IE8 here!
http://downloads.yahoo.com/sg/internetexplorer/
--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From alfredo.moralejo at roche.com Wed Jul 8 08:40:21 2009
From: alfredo.moralejo at roche.com (Moralejo, Alfredo)
Date: Wed, 8 Jul 2009 10:40:21 +0200
Subject: [Linux-cluster] cman + qdisk timeouts....
In-Reply-To: <8a5668960907070321p5a082091oa7f83fff625dde47@mail.gmail.com>
References:
<8a5668960907070321p5a082091oa7f83fff625dde47@mail.gmail.com>
Message-ID:
Hi,
I added a heuristic that checks network status, and it does help in network failure scenarios.
However, I still face the same problem as soon as I stop the services in an orderly way on the node holding the qdisk master role, or reboot it.
If I execute in master qdisk node:
# service rgmanager stop
# service clvmd stop
# service qdiskd stop
# service cman stop
As Red Hat says, quorum is lost on the other node until it takes the master role (a few seconds), and the services there are stopped.
I'm working around that by adding a sleep after stopping qdiskd, long enough for the other node to become master, and only then stopping cman.
I understand this is a bug.
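For the record, a sketch of the ordering I use; the sleep value is a guess
and should really be sized from your qdisk interval and tko:

  service rgmanager stop
  service clvmd stop
  service qdiskd stop
  # give the surviving node time to notice and claim the qdisk master role
  sleep 60
  service cman stop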
My cluster.conf file:
Best regards,
Alfredo
________________________________
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Juan Ramon Martin Blanco
Sent: Tuesday, July 07, 2009 12:21 PM
To: linux clustering
Subject: Re: [Linux-cluster] cman + qdisk timeouts....
On Mon, Jun 15, 2009 at 4:17 PM, Moralejo, Alfredo > wrote:
Hi,
I'm having what I think is a timeouts issue in my cluster.
I have a two node cluster using qdisk. Every time the node that holds the qdisk master role goes down (because of a failure, or even stopping qdiskd manually), services on the healthy node are stopped for lack of quorum, because qdiskd is unresponsive until the second node becomes master and starts working properly. Once qdiskd works again (usually 5-6 seconds) the services are started again.
I've read the cluster manual section on the "CMAN membership timeout value" and I think this is the case. I'm using RHEL 5.3, and I understand this parameter is the totem token, which I set much longer than needed:
...
The totem token is much more than double the qdisk timeout, so I guess it should be enough, but every time qdisk dies on the master node I get the same result, services restarted on the healthy node:
Jun 15 16:11:33 rmamseslab07 qdiskd[14130]: Node 1 missed an update (2/3)
Jun 15 16:11:38 rmamseslab07 qdiskd[14130]: Node 1 missed an update (3/3)
Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: Node 1 missed an update (4/3)
Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: Node 1 DOWN
Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: Making bid for master
Jun 15 16:11:44 rmamseslab07 clurgmgrd: [18510]: Executing /etc/init.d/watchdog status
Jun 15 16:11:48 rmamseslab07 qdiskd[14130]: Node 1 missed an update (5/3)
Jun 15 16:11:53 rmamseslab07 qdiskd[14130]: Node 1 missed an update (6/3)
Jun 15 16:11:53 rmamseslab07 qdiskd[14130]: Assuming master role
Message from syslogd at rmamseslab07 at Jun 15 16:11:53 ...
clurgmgrd[18510]: #1: Quorum Dissolved
Jun 15 16:11:53 rmamseslab07 openais[14087]: [CMAN ] lost contact with quorum device
Jun 15 16:11:53 rmamseslab07 openais[14087]: [CMAN ] quorum lost, blocking activity
Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: Membership Change Event
Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: #1: Quorum Dissolved
Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: Emergency stop of service:Cluster_test_2
Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: Emergency stop of service:wdtcscript-rmamseslab05-ic
Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: Emergency stop of service:wdtcscript-rmamseslab07-ic
Jun 15 16:11:54 rmamseslab07 clurgmgrd[18510]: Emergency stop of service:Logical volume 1
Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: Node 1 missed an update (7/3)
Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: Writing eviction notice for node 1
Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: Telling CMAN to kill the node
Jun 15 16:11:58 rmamseslab07 openais[14087]: [CMAN ] quorum regained, resuming activity
I've just logged a case but... any idea????
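To illustrate the relationship with made-up numbers (these are not my real
values): with interval="3" and tko="9" the qdisk timeout is about 27 seconds,
so the totem token would be set to well over twice that.

  # Illustrative fragments only, not my actual cluster.conf:
  #   <totem token="70000"/>
  #   <quorumd interval="3" tko="9" votes="1" label="qdisk"/>
  # Quick check of what is configured on a node:
  grep -E '<totem|<quorumd' /etc/cluster/cluster.conf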
Regards,
Hi!
Have you set two_node="0" in cman section?
Why don't you use any heuristics within the quorumd configuration? E.g. pinging a router...
Could you paste us your cluster.conf?
Greetings,
Juanra
Alfredo Moralejo
Business Platforms Engineering - OS Servers - UNIX Senior Specialist
F. Hoffmann-La Roche Ltd.
Global Informatics Group Infrastructure
Josefa Valcárcel, 40
28027 Madrid SPAIN
Phone: +34 91 305 97 87
alfredo.moralejo at roche.com
Confidentiality Note: This message is intended only for the use of the named recipient(s) and may contain confidential and/or proprietary information. If you are not the intended recipient, please contact the sender and delete this message. Any unauthorized use of the information contained in this message is prohibited.
--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From rsetchfield at xcalibre.co.uk Wed Jul 8 09:21:23 2009
From: rsetchfield at xcalibre.co.uk (Raymond Setchfield)
Date: Wed, 08 Jul 2009 10:21:23 +0100
Subject: [Linux-cluster] Trying to locate the bottleneck
In-Reply-To: <64D0546C5EBBD147B75DE133D798665F02FDC2C5@hugo.eprize.local>
References: <4A53410C.3090704@xcalibre.co.uk>
<64D0546C5EBBD147B75DE133D798665F02FDC2C5@hugo.eprize.local>
Message-ID: <4A546513.1070408@xcalibre.co.uk>
Hi Jeff
Many Thanks for your reply.
I have had a look to see if there is anything suspicious in dmesg and in
messages, and unfortunately there isn't anything at all apart from one
timeout.
Jul 8 10:15:51 loadbalancer-01 nanny[5427]: [inactive] shutting down
192.168.10.36:80 due to connection failure
Jul 8 10:16:03 loadbalancer-01 nanny[5427]: [ active ] making
192.168.10.36:80 available
I'll check out the possibility of any network related issues which may
cause this problem though.
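In case it is useful, this is roughly what I plan to watch on the load
balancers during a siege run (assuming the ip_conntrack module; the paths
differ if nf_conntrack is in use):

  # Current connection-tracking usage versus the configured ceiling
  wc -l /proc/net/ip_conntrack
  cat /proc/sys/net/ipv4/ip_conntrack_max
  # A full table typically logs "table full, dropping packet" messages
  dmesg | grep -i conntrack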
Thanks for all your help!
R.
Jeff Sturm wrote:
> Hi Raymond,
>
> At those concurrency levels I would suspect network tuning may help.
> Does dmesg show anything interesting on the load balancers during your
> testing?
>
> For high levels of concurrency on a NAT'd firewall or load balancer I
> specifically remember having to adjust ip_conntrack_max upwards.
> Perhaps network buffers as well.
>
> -Jeff
>
>
>> -----Original Message-----
>> From: linux-cluster-bounces at redhat.com
>>
> [mailto:linux-cluster-bounces at redhat.com]
>
>> On Behalf Of Raymond Setchfield
>> Sent: Tuesday, July 07, 2009 8:35 AM
>> To: linux-cluster at redhat.com
>> Subject: [Linux-cluster] Trying to locate the bottleneck
>>
>> Hi
>>
>> I am trying to find a problem here with a setup which I am currently
>> testing.
>>
>> This is the current setup which I have at the moment
>>
>> 15 web farm servers which are running vhost-ldap module and also have
>> ldap caching enabled. Which are behind 2 Load balancer servers which
>>
> are
>
>> in fail over. The software which it is currently running is Piranha on
>> the load balancers.
>>
>> I am using siege to get some benchmarking done on these to test
>> basically their availability when pushing high concurrency.
>>
>> At 100 (99.60 according to siege) Concurrent Connection it appears to
>>
> be
>
>> all ok with 99.89%. At 120 (119.52 according to siege) Concurrent
>> connections I get 99.9%, and at 130 (129.51 according to siege)
>> Concurrent Connections I get 100% availability.
>>
>> However pushing it any further than this, for example 150 concurrent
>> connections it is falling over and siege bails out with multiple
>> connection time outs. I am trying to find the bottle neck here and I
>>
> am
>
>> wondering if it is software which I am using for the load balancers or
>>
> a
>
>> limitation with apache.
>>
>> The command I am using for siege is pretty simple nothing special;
>>
>> siege --concurrent=150 --internet --file=urls.txt --benchmark
>>
> --time=60M
>
>> My lvs.cf file can be found here to show you guys the config which I
>>
> am
>
>> using.
>>
>> http://pastebin.com/m52d6cc23
>>
>> Any help would be greatly appreciated
>>
>> Many Thanks
>>
>> R.
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>
From abednegoyulo at yahoo.com Wed Jul 8 09:50:35 2009
From: abednegoyulo at yahoo.com (Abed-nego G. Escobal, Jr.)
Date: Wed, 8 Jul 2009 02:50:35 -0700 (PDT)
Subject: [Linux-cluster] Cannot make cluster after upgrade
Message-ID: <407893.33658.qm@web110415.mail.gq1.yahoo.com>
I haven't tried it yet. To which part of the cluster.conf should I be inserting clean_start=1 ?
--- On Wed, 7/8/09, Ian Hayes wrote:
> From: Ian Hayes
> Subject: Re: [Linux-cluster] Cannot make cluster after upgrade
> To: "linux clustering"
> Date: Wednesday, 8 July, 2009, 2:59 PM
> Sounds a little
> split-brainish....... have you tried the clean_start=1
> option?
> On Jul 7, 2009 11:54 PM,
> "Abed-nego G. Escobal, Jr."
> wrote:
>
>
>
> After an upgrade from 5.2 to 5.3, the cluster, named
> GFSCluster, seems to stop being a cluster. GFSCluster is a 2
> node cluster using iscsi, cman, clvm, and gfs and it was
> working fine when it was on 5.2 The configuration on both of
> the nodes (passwords removed)
>
>
>
>
>
>
> [cluster.conf was pasted here, but the XML markup was stripped when the
> message was archived. The surviving attribute fragments show
> config_version="5", two_node="1", two clusternodes with one vote each
> (node01/node02), and fence_ipmilan devices node01_ipmi (10.1.0.5) and
> node02_ipmi (10.1.0.7), both using login="root".]
>
> When starting the service cman, they both hang on the part
> starting fencing
>
>
>
> Starting cluster:
>
> Loading modules... done
>
> Mounting configfs... done
>
> Starting ccsd... done
>
> Starting cman... done
>
> Starting daemons... done
>
> Starting fencing...
>
>
>
> After 5 minutes the task finishes with "done" but
> clustat says
>
>
>
> ==== As root on web01.company.com ====
>
> Cluster Status for GFSCluster @ Wed Jul 8 01:00:24 2009
>
> Member Status: Quorate
>
> Member Name                             ID   Status
> ------ ----                             ---- ------
> node01.company.com                          1 Online, Local
> node02.company.com                          2 Offline
>
> ==== As root on web02.company.com ====
>
> Cluster Status for GFSCluster @ Wed Jul 8 01:00:26 2009
>
> Member Status: Quorate
>
> Member Name                             ID   Status
> ------ ----                             ---- ------
> node01.company.com                          1 Offline
> node02.company.com                          2 Online, Local
>
>
> They are both quorate with their own cluster
>
>
>
> In the logs of web01 I found repeating messages
>
>
>
> Jul 8 00:55:27 web01 fenced[21872]: node02.company.com not
> a cluster member after 6 sec post_join_delay
>
> Jul 8 00:55:27 web01 fenced[21872]: fencing node "node02.company.com"
>
> Jul 8 00:55:52 web01 fenced[21872]: agent
> "fence_ipmilan" reports: Rebooting machine @
> IPMI:10.1.0.7...ipmilan: Failed to connect after 30 seconds
> Failed
>
>
>
>
>
> In the logs of web02 I also found the same repeating
> messages
>
>
>
> Jul 8 00:55:27 web02 fenced[6363]: node01.company.com not
> a cluster member after 6 sec post_join_delay
>
> Jul 8 00:55:27 web02 fenced[6363]: fencing node "node01.company.com"
>
> Jul 8 00:55:53 web02 fenced[6363]: agent
> "fence_ipmilan" reports: Rebooting machine @
> IPMI:10.1.0.5...ipmilan: Failed to connect after 30 seconds
> Failed
>
>
>
>
>
> Is there a bug on 5.3 with regards to clustering?
>
> Is there any workarounds?
>
>
>
>
>
>
>
> Feel safer online. Upgrade to the new, safer
> Internet Explorer 8 optimized for Yahoo! to put your mind at
> peace. It's free. Get IE8 here! http://downloads.yahoo.com/sg/internetexplorer/
>
>
>
>
> --
>
> Linux-cluster mailing list
>
> Linux-cluster at redhat.com
>
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>
> -----Inline Attachment Follows-----
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
From svictor.titus at gmail.com Wed Jul 8 11:05:52 2009
From: svictor.titus at gmail.com (victor titus)
Date: Wed, 8 Jul 2009 19:05:52 +0800
Subject: [Linux-cluster] " Inconsistent NVRAM detected" ERROR
Message-ID: <8374e0ba0907080405haee5317h3653bff38c1c21@mail.gmail.com>
Hi All,
Below are the messages found in the log /var/log/messages.
There seems to be some problem with the NVRAM on the HBA. Because of this
the LVM volumes in the cluster are not detected by the server; commands like
lvdisplay and pvdisplay just show no output.
******************************************************************************
Jul 7 11:58:12 lxxxx kernel: QLogic Fibre Channel HBA Driver
Jul 7 11:58:12 lxxxx kernel: ACPI: PCI Interrupt 0000:08:00.0[A] ->
GSI 18 (level, high) -> IRQ 185
Jul 7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Found an ISP2432,
irq 185, iobase 0xffffff0000006000
Jul 7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Configuring PCI space...
Jul 7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Configure NVRAM
parameters...
Jul 7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Inconsistent NVRAM
detected: checksum=0xd46cae00 id=I version=0x1.
Jul 7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Falling back to
functioning (yet invalid -- WWPN) defaults.
Jul 7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Verifying loaded
RISC code...
Jul 7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Allocated (64 KB) for EFT...
Jul 7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Allocated (1413
KB) for firmware dump...
Jul 7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Waiting for LIP to
complete...
Jul 7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Cable is unplugged...
Jul 7 11:58:12 lxxxx kernel: scsi0 : qla2xxx
Jul 7 11:58:12 lxxx kernel: qla2400 0000:08:00.0:
*********************************************************************************
Thanks,
Victor
From muruganlnx at gmail.com Wed Jul 8 11:58:33 2009
From: muruganlnx at gmail.com (Murugan P)
Date: Wed, 8 Jul 2009 17:28:33 +0530
Subject: [Linux-cluster] RHCS with GFS2
Message-ID: <52868b3e0907080458ud37c4ffsc4d27a00e8e53d2d@mail.gmail.com>
Hi Friends,
I want to install RHCS with GFS2 on CentOS 5.3.
Kindly provide the list of package names needed for this, and confirm
whether DLM is built into the 5.3 kernel.
Thanks & Regards,
P. Murugan
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From robejrm at gmail.com Wed Jul 8 12:02:03 2009
From: robejrm at gmail.com (Juan Ramon Martin Blanco)
Date: Wed, 8 Jul 2009 14:02:03 +0200
Subject: [Linux-cluster] " Inconsistent NVRAM detected" ERROR
In-Reply-To: <8374e0ba0907080405haee5317h3653bff38c1c21@mail.gmail.com>
References: <8374e0ba0907080405haee5317h3653bff38c1c21@mail.gmail.com>
Message-ID: <8a5668960907080502x7371849ds15ec406729d48995@mail.gmail.com>
On Wed, Jul 8, 2009 at 1:05 PM, victor titus wrote:
> Hi All,
> Below are the messages found from the Log "/var/log/messages".
> Seems to be some problem with the release of NVRAM memory. Due to this
> the LVM in the cluster are not detected by the server, commands like
> lvdisplay, pvdisplay just show no output.
>
>
> ******************************************************************************
> Jul 7 11:58:12 lxxxx kernel: QLogic Fibre Channel HBA Driver
> Jul 7 11:58:12 lxxxx kernel: ACPI: PCI Interrupt 0000:08:00.0[A] ->
> GSI 18 (level, high) -> IRQ 185
> Jul 7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Found an ISP2432,
> irq 185, iobase 0xffffff0000006000
> Jul 7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Configuring PCI
> space...
> Jul 7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Configure NVRAM
> parameters...
> Jul 7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Inconsistent NVRAM
> detected: checksum=0xd46cae00 id=I version=0x1.
> Jul 7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Falling back to
> functioning (yet invalid -- WWPN) defaults.
> Jul 7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Verifying loaded
> RISC code...
> Jul 7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Allocated (64 KB) for
> EFT...
> Jul 7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Allocated (1413
> KB) for firmware dump...
> Jul 7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Waiting for LIP to
> complete...
> Jul 7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Cable is unplugged...
> Jul 7 11:58:12 lxxxx kernel: scsi0 : qla2xxx
> Jul 7 11:58:12 lxxx kernel: qla2400 0000:08:00.0:
>
> *********************************************************************************
>
Hi!
It seems that your fibre connection is failing, or maybe the HBA, or the
switch. Do you have a redundant path to the SAN? If so, have you configured
multipath?
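A couple of quick checks, sketched on the assumption of a single qla2xxx HBA
at host0 and device-mapper-multipath installed (adjust the host number):

  # Link state as the driver sees it (the log above already says the cable is unplugged)
  cat /sys/class/fc_host/host0/port_state
  # Whether any paths/LUNs are visible at all
  multipath -ll
  cat /proc/scsi/scsi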
Greetings,
Juanra
>
> Thanks,
> Victor
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From giuseppe.fuggiano at gmail.com Wed Jul 8 12:04:49 2009
From: giuseppe.fuggiano at gmail.com (Giuseppe Fuggiano)
Date: Wed, 8 Jul 2009 14:04:49 +0200
Subject: [Linux-cluster] RHCS with GFS2
In-Reply-To: <52868b3e0907080458ud37c4ffsc4d27a00e8e53d2d@mail.gmail.com>
References: <52868b3e0907080458ud37c4ffsc4d27a00e8e53d2d@mail.gmail.com>
Message-ID: <1e09d9070907080504p588c2514q18b4a3f59b5fd62d@mail.gmail.com>
2009/7/8 Murugan P :
> Hi Friends,
>
> I want to install the RHCS with GFS2 on Centos 5.3.
>
> Kindly provide the list of packages(NAME) which is need for my requirement
> and confirm whether DLM is inbuild with 5.3 Kernel.
http://www.centos.org/docs/5/html/5.2/Cluster_Suite_Overview/
--
Giuseppe
From muruganlnx at gmail.com Wed Jul 8 13:10:01 2009
From: muruganlnx at gmail.com (Murugan P)
Date: Wed, 8 Jul 2009 18:40:01 +0530
Subject: [Linux-cluster] RHCS with GFS2 on centos 5.3
Message-ID: <52868b3e0907080610u30451c85n25d726d277259237@mail.gmail.com>
Hi Friends,
I need a small clarification from you guys...
While installing CentOS 5.3, which software groups need to be selected for
RHCS (Cluster service), and which one contains the cman package?
Thanks & Regards,
P. Murugan
muruganlnx at gmail.com
9841705767
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From robejrm at gmail.com Wed Jul 8 13:22:50 2009
From: robejrm at gmail.com (Juan Ramon Martin Blanco)
Date: Wed, 8 Jul 2009 15:22:50 +0200
Subject: [Linux-cluster] RHCS with GFS2 on centos 5.3
In-Reply-To: <52868b3e0907080610u30451c85n25d726d277259237@mail.gmail.com>
References: <52868b3e0907080610u30451c85n25d726d277259237@mail.gmail.com>
Message-ID: <8a5668960907080622o2a5f47dfjd1c87abfcd6a404d@mail.gmail.com>
On Wed, Jul 8, 2009 at 3:10 PM, Murugan P wrote:
> HI Friends,
>
> I need small clarification from u guys...
>
> Whille installing the centos 5.3 which software needs to select for
> RHCS(Cluster service) and clarify which is having the CMAN package.
Hi,
rgmanager
cman
openais
and if you are using gfs2 and/or clustered lvm:
gfs2-utils
lvm2-cluster
#rpm -ql cman
In summary: fenced qdiskd ccsd groupd and associated tools
Greetings,
Juanra
P.S: I don't mean to be rude, but please read some documentation before
asking...
http://www.centos.org/docs/5/html/5.2/Cluster_Suite_Overview/s1-ha-components-CSO.html
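On an installed system the whole set can also be pulled in with yum; a sketch
assuming the stock CentOS repositories (check the names against your mirrors):

  yum install cman rgmanager openais gfs2-utils lvm2-cluster
  # the dlm and gfs2 kernel modules ship with the stock 5.3 kernel
  modinfo dlm | head -n 3
  modinfo gfs2 | head -n 3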
>
> Thanks & Regards,
> P. Murugan
> muruganlnx at gmail.com
> 9841705767
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From muruganlnx at gmail.com Wed Jul 8 13:58:50 2009
From: muruganlnx at gmail.com (Murugan P)
Date: Wed, 8 Jul 2009 19:28:50 +0530
Subject: [Linux-cluster] RHCS with GFS2 on centos 5.3
In-Reply-To: <8a5668960907080622o2a5f47dfjd1c87abfcd6a404d@mail.gmail.com>
References: <52868b3e0907080610u30451c85n25d726d277259237@mail.gmail.com>
<8a5668960907080622o2a5f47dfjd1c87abfcd6a404d@mail.gmail.com>
Message-ID: <52868b3e0907080658r20ee64e3sab2285c67512fd36@mail.gmail.com>
I have installed the OS (CentOS 5.3) with the cluster software, and after
installation I can see:
[root at testgfs ~]# rpm -qa | grep cman
cman-2.0.98-1.el5
**********
My question is: while selecting the software at installation time, I could
not find the cman package (using F2 for package details) in the
Clustering/Cluster Storage or Base groups.
Kindly clarify how to tell which group contains the cman package, since I
haven't seen it in Clustering/Cluster Storage.
On Wed, Jul 8, 2009 at 6:52 PM, Juan Ramon Martin Blanco
wrote:
>
> On Wed, Jul 8, 2009 at 3:10 PM, Murugan P wrote:
>
>> HI Friends,
>>
>> I need small clarification from u guys...
>>
>> Whille installing the centos 5.3 which software needs to select for
>> RHCS(Cluster service) and clarify which is having the CMAN package.
>
> Hi,
>
> rgmanager
> cman
> openais
> and if you are using gfs2 and/or clustered lvm:
> gfs2-utils
> lvm2-cluster
>
> #rpm -ql cman
> In summary: fenced qdiskd ccsd groupd and associated tools
>
> Greetings,
> Juanra
>
> P.S: I don't pretend to be rude, but read some documentation before
> asking...
> http://www.centos.org/docs/5/html/5.2/Cluster_Suite_Overview/s1-ha-components-CSO.html
>
>>
>> Thanks & Regards,
>> P. Murugan
>> muruganlnx at gmail.com
>> 9841705767
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From giuseppe.fuggiano at gmail.com Wed Jul 8 14:10:56 2009
From: giuseppe.fuggiano at gmail.com (Giuseppe Fuggiano)
Date: Wed, 8 Jul 2009 16:10:56 +0200
Subject: [Linux-cluster] RHCS with GFS2 on centos 5.3
In-Reply-To: <52868b3e0907080658r20ee64e3sab2285c67512fd36@mail.gmail.com>
References: <52868b3e0907080610u30451c85n25d726d277259237@mail.gmail.com>
<8a5668960907080622o2a5f47dfjd1c87abfcd6a404d@mail.gmail.com>
<52868b3e0907080658r20ee64e3sab2285c67512fd36@mail.gmail.com>
Message-ID: <1e09d9070907080710l312e256elfbdd0e3f4b273840@mail.gmail.com>
2009/7/8 Murugan P :
> Kindly clarify friends , how to know that which software is having the CMAN
> packages since i haven't seen the same in clustering/ClusterStorage.
At installation time, you can install the software by selecting "Groups"
of packages. These groups can be tuned as you prefer by clicking a
button to edit them ("Details", IIRC). Doing so, a window with
detailed information is shown.
Cheers
--
Giuseppe
From robejrm at gmail.com Wed Jul 8 14:17:04 2009
From: robejrm at gmail.com (Juan Ramon Martin Blanco)
Date: Wed, 8 Jul 2009 16:17:04 +0200
Subject: [Linux-cluster] RHCS with GFS2 on centos 5.3
In-Reply-To: <52868b3e0907080658r20ee64e3sab2285c67512fd36@mail.gmail.com>
References: <52868b3e0907080610u30451c85n25d726d277259237@mail.gmail.com>
<8a5668960907080622o2a5f47dfjd1c87abfcd6a404d@mail.gmail.com>
<52868b3e0907080658r20ee64e3sab2285c67512fd36@mail.gmail.com>
Message-ID: <8a5668960907080717i33bd5a4fxf1096c2129a2c11b@mail.gmail.com>
On Wed, Jul 8, 2009 at 3:58 PM, Murugan P wrote:
>
> I have installed the OS(centos 5.3) with cluster software and after
> installation i can able to see
>
> [root at testgfs ~]# rpm -qa | grep cman
> cman-2.0.98-1.el5
>
> **********
>
> My question is, while selecting the software at the installation time i
> don't find the CMAN packages using F2 on the software
> clusters/clusterStorage & base.
>
Those are "groups" (sorry for the spanish):
# yum groupinfo "Clustering"
Loaded plugins: downloadonly, rhnplugin, security
Setting up Group Process
Group: Agrupamiento (clustering)
Description: Soporte para clustering (agrupamiento). [i.e. clustering support]
Default Packages:
Cluster_Administration-en-US
cluster-cim
cluster-snmp
clustermon
conga-devel
ipvsadm
luci
modcluster
piranha
rgmanager
ricci
ricci-modcluster
system-config-cluster
#yum groupinfo "Cluster Storage"
Loaded plugins: downloadonly, rhnplugin, security
Setting up Group Process
Group: Almacenamiento del Cluster [Cluster Storage]
Description: Paquetes que proveen soporte para el almacenamiento de
cluster. [i.e. packages providing cluster storage support]
Default Packages:
Global_File_System-en-US
gfs
gfs-utils
gnbd
kmod-gfs
kmod-gfs-kdump
kmod-gnbd
kmod-gnbd-kdump
lvm2-cluster
Optional Packages:
kmod-gfs-PAE
kmod-gfs-xen
kmod-gnbd-PAE
kmod-gnbd-xen
The cman package is not in any group, but its installation was probably
pulled in as a dependency when you selected some of the groups above.
> Kindly clarify friends , how to know that which software is having the CMAN
> packages since i haven't seen the same in clustering/ClusterStorage.
>
I don't think I understand you...
There are no CMAN packages, just one cman package.
In your installation process you should select the cman and rgmanager
packages, but they aren't in any group.
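If it helps, a small after-the-fact check of what owns the cman bits and
what pulled them in (plain rpm, nothing exotic):

  rpm -qf /etc/init.d/cman      # should report the cman package
  rpm -q --whatrequires cman    # shows which installed packages depend on it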
Regards,
Juanra
>
>
>
>
>
>
> On Wed, Jul 8, 2009 at 6:52 PM, Juan Ramon Martin Blanco <
> robejrm at gmail.com> wrote:
>
>
>
>>
>> On Wed, Jul 8, 2009 at 3:10 PM, Murugan P wrote:
>>
>>> HI Friends,
>>>
>>> I need small clarification from u guys...
>>>
>>> Whille installing the centos 5.3 which software needs to select for
>>> RHCS(Cluster service) and clarify which is having the CMAN package.
>>
>> Hi,
>>
>> rgmanager
>> cman
>> openais
>> and if you are using gfs2 and/or clustered lvm:
>> gfs2-utils
>> lvm2-cluster
>>
>> #rpm -ql cman
>> In summary: fenced qdiskd ccsd groupd and associated tools
>>
>> Greetings,
>> Juanra
>>
>> P.S: I don't pretend to be rude, but read some documentation before
>> asking...
>> http://www.centos.org/docs/5/html/5.2/Cluster_Suite_Overview/s1-ha-components-CSO.html
>>
>>>
>>> Thanks & Regards,
>>> P. Murugan
>>> muruganlnx at gmail.com
>>> 9841705767
>>>
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>
>>
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From cthulhucalling at gmail.com Wed Jul 8 14:45:38 2009
From: cthulhucalling at gmail.com (Ian Hayes)
Date: Wed, 8 Jul 2009 07:45:38 -0700
Subject: [Linux-cluster] Cannot make cluster after upgrade
In-Reply-To: <407893.33658.qm@web110415.mail.gq1.yahoo.com>
References: <407893.33658.qm@web110415.mail.gq1.yahoo.com>
Message-ID: <36df569a0907080745u1a498a96oc8853f37f093ea08@mail.gmail.com>
In the fence_daemon tag. Like this:
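(A rough sketch follows; the surrounding attributes are illustrative, keep
whatever your cluster.conf already has.)

  #   <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="3"/>
  # then bump config_version and propagate the file from one node, e.g.:
  ccs_tool update /etc/cluster/cluster.conf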
On Wed, Jul 8, 2009 at 2:50 AM, Abed-nego G. Escobal, Jr. <
abednegoyulo at yahoo.com> wrote:
>
> I haven't tried it yet. To which part of the cluster.conf should I be
> inserting clean_start=1 ?
>
> --- On Wed, 7/8/09, Ian Hayes wrote:
>
> > From: Ian Hayes
> > Subject: Re: [Linux-cluster] Cannot make cluster after upgrade
> > To: "linux clustering"
> > Date: Wednesday, 8 July, 2009, 2:59 PM
> > Sounds a little
> > split-brainish....... have you tried the clean_start=1
> > option?
> > On Jul 7, 2009 11:54 PM,
> > "Abed-nego G. Escobal, Jr."
> > wrote:
> >
> >
> >
> > After an upgrade from 5.2 to 5.3, the cluster, named
> > GFSCluster, seems to stop being a cluster. GFSCluster is a 2
> > node cluster using iscsi, cman, clvm, and gfs and it was
> > working fine when it was on 5.2 The configuration on both of
> > the nodes (passwords removed)
> >
> >
> >
> >
> >
> >
> > > config_version="5">
> >
> >