From jeff.sturm at eprize.com Wed Jul 1 03:57:47 2009 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Tue, 30 Jun 2009 23:57:47 -0400 Subject: [Linux-cluster] Did you use GFS with witch technology? In-Reply-To: <1246378523.7787.12.camel@tuxkiller> References: <1246378523.7787.12.camel@tuxkiller> Message-ID: <64D0546C5EBBD147B75DE133D798665F02FDC1CF@hugo.eprize.local> > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] > On Behalf Of Tiago Cruz > Sent: Tuesday, June 30, 2009 12:15 PM > To: linux-cluster at redhat.com > Subject: [Linux-cluster] Did you use GFS with witch technology? > > Did you use GFS and witch technology? KVM? Xen? VirtualBox? Not Virtual? > Witch version are you using? v1 or v2? Xen here, with GFS1. Works great. Pay attention to the performance optimizations (noatime, etc.) including statfs_fast if you are on GFS1. We export LUNs from our SAN to each domU using tap:sync driver. Performance seems to be limited by our SAN. Each domU in our setup has two vif's: one for openais, another for everything else, though I can't say if that helps or hurts performance. Jeff From agx at sigxcpu.org Wed Jul 1 11:57:25 2009 From: agx at sigxcpu.org (Guido =?iso-8859-1?Q?G=FCnther?=) Date: Wed, 1 Jul 2009 13:57:25 +0200 Subject: [Linux-cluster] Cluster 3.0.0.rc3 release In-Reply-To: <1246306200.25867.86.camel@cerberus.int.fabbione.net> References: <1245496789.3665.328.camel@cerberus.int.fabbione.net> <20090629184848.GA25796@bogon.sigxcpu.org> <1246306200.25867.86.camel@cerberus.int.fabbione.net> Message-ID: <20090701115725.GA6565@bogon.sigxcpu.org> On Mon, Jun 29, 2009 at 10:10:00PM +0200, Fabio M. Di Nitto wrote: > > 1246297857 fenced 3.0.0.rc3 started > > 1246297857 our_nodeid 1 our_name node2.foo.bar > > 1246297857 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/fenced.log > > 1246297857 found uncontrolled entry /sys/kernel/dlm/rgmanager And it also leads to: dlm_controld[14981]: fenced_domain_info error -1 so it's not possible to get the node back without rebooting. > It looks to me the node has not been shutdown properly and an attempt to > restart it did fail. The fenced segfault shouldn't happen but I am > CC'ing David. Maybe he has a better idea. > > > > > when trying to restart fenced. Since this is not possible one has to > > reboot the node. > > > > We're also seeing: > > > > Jun 29 19:29:03 node2 kernel: [ 50.149855] dlm: no local IP address has been set > > Jun 29 19:29:03 node2 kernel: [ 50.150035] dlm: cannot start dlm lowcomms -107 > > hmm this looks like a bad configuration to me or bad startup. > > IIRC dlm kernel is configured via configfs and probably it was not > mounted by the init script. It is. > > from time to time. Stopping/starting via cman's init script (as from the > > Ubuntu package) several times makes this go away. > > > > Any ideas what causes this? > > Could you please try to use our upstream init scripts? They work just > fine (unchanged) in ubuntu/debian environment and they are for sure a > lot more robust than the ones I originally wrote for Ubuntu many years > ago. Tested that without any notable change. > Could you also please summarize your setup and config? I assume you did > the normal checks such as cman_tool status, cman_tool nodes and so on... > > The usual extra things I'd check are: > > - make sure the hostname doesn't resolve to localhost but to the real ip > address of the cluster interface > - cman_tool status > - cman_tool nodes These all do look o.k. 
However: > - Before starting any kind of service, such as rgmanager or gfs*, make > sure that the fencing configuration is correct. Test by using fence_node > $nodename. fence_node node1 gives the segfaults at the same location as described above, which seems to be the cause of the trouble. (However "fence_ilo -z -l user -p pass -a iloip" works as expected). The segfault happens in fence/libfence/agent.c's make_args where the second XPath lookup (FENCE_DEVICE_ARGS_PATH) returns a bogus (non-NULL) str. Doing this XPath lookup by hand looks fine. So it seems ccs_get_list is returning corrupted pointers. I've attached the current cluster.conf. Cheers, -- Guido -------------- next part -------------- <?xml version="1.0"?> From fdinitto at redhat.com Wed Jul 1 13:23:56 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Wed, 01 Jul 2009 15:23:56 +0200 Subject: [Linux-cluster] Cluster 3.0.0.rc3 release In-Reply-To: <20090701115725.GA6565@bogon.sigxcpu.org> References: <1245496789.3665.328.camel@cerberus.int.fabbione.net> <20090629184848.GA25796@bogon.sigxcpu.org> <1246306200.25867.86.camel@cerberus.int.fabbione.net> <20090701115725.GA6565@bogon.sigxcpu.org> Message-ID: <1246454636.19414.30.camel@cerberus.int.fabbione.net> Hi Guido, On Wed, 2009-07-01 at 13:57 +0200, Guido Günther wrote: > > - Before starting any kind of service, such as rgmanager or gfs*, make > > sure that the fencing configuration is correct. Test by using fence_node > > $nodename. > fence_node node1 > > gives the segfaults at the same location as described above which seems > to be the cause of the trouble. (However "fence_ilo -z -l user -p pass > -a iloip" works as expected). > The segfault happens in fence/libfence/agent.c's make_args where the > second XPath lookup (FENCE_DEVICE_ARGS_PATH) returns a bogus (non-NULL) > str. Doing this XPath lookup by hand looks fine. So it seems > ccs_get_list is returning corrupted pointers. I've attached the current > cluster.conf. I am having problems reproducing this and I'll need your help. First of all I replicated your configuration: as you can see, node names and fencing methods are the same. I don't have ilo but it shouldn't matter. Now my question is: did you mangle the configuration you sent me manually? Because there is no matching entry between the device used for a node and the fencedevices section, and I get: [root at node2]# fence_node -vv node1 fence node1 dev 0.0 agent none result: error config agent agent args: fence node1 failed Now if I change device name="fenceX" to name="nodeX" there is a match and: [root at node2 cluster]# fence_node -vv node1 fence node1 dev 0.0 agent fence_virsh result: success agent args: agent=fence_virsh port=fedora-rh-node1 ipaddr=daikengo.int.fabbione.net login=root secure=1 identity_file=/root/.ssh/id_rsa fence node1 success and I still don't see the segfault... Since you can reproduce the problem regularly I'd really like to see some debugging output of libfence to start with. I'd really appreciate it if you could help us. test 1: Please add a bunch of fprintf(stderr, ...) calls to agent.c to see the created XPath queries and the result coming back from libccs. If you could please collect the output and send it to me. test 2: If you could please find: cd = ccs_connect(); (line 287 in agent.c) and right before that add: fullxpath=1; That change will ask libccs to use a different XPath engine internally. And then re-run test1. This should be able to isolate pretty much the problem and give me enough information to debug the issue.
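A backtrace from the crashing fence_node would also help to narrow this down. Assuming the binaries still carry debugging symbols (or the matching debuginfo packages are installed), something along these lines should be enough:

# gdb --args fence_node -vv node2
(gdb) run
(gdb) bt full

The frames around make_args() should show whether the string is already bogus when it comes back from libccs or only gets corrupted later inside libfence.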
the next question is: are you running on some fancy architecture? Maybe something in that environment is not initialized properly (the garbage string you get back from libccs sounds like that) but on more common arches like x86/x86_64 gcc takes care of that for us.... (really wild guessing but still something to fix!). Thanks Fabio From jeff.sturm at eprize.com Wed Jul 1 13:50:36 2009 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Wed, 1 Jul 2009 09:50:36 -0400 Subject: [Linux-cluster] Recovering from "telling LM to withdraw" Message-ID: <64D0546C5EBBD147B75DE133D798665F02FDC1E2@hugo.eprize.local> Recently we had a cluster node fail with a failed assertion: Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: fatal: assertion "gfs_glock_is_locked_by_me(gl) && gfs_glock_is_held_excl(gl)" failed Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: function = gfs_trans_add_gl Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: file = /builddir/build/BUILD/gfs-kmod-0.1.23/_kmod_build_/src/gfs/trans.c, line = 237 Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: time = 1246022619 Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: about to withdraw from the cluster Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: telling LM to withdraw This is with CentOS 5.2, GFS1. The cluster had been operating continuously for about 3 months. My challenge isn't in preventing assertion failures entirely-I recognize lurking software bugs and hardware anomalies can lead to a failed node. Rather, I want to prevent one node from freezing the cluster. When the above was logged, all nodes in the cluster which access the tb2data filesystem also froze and did not recover. We recovered with a rolling cluster restart and a precautionary gfs_fsck. Most cluster problems can be quickly handled by the fence agents. The "telling LM to withdraw" does not trigger a fence operation, or any other automated recovery. I need a deployment strategy to fix that. Should I write an agent to scan the syslog, match on the message above, and fence the node? Has anyone else encountered the same problem? If so, how did you get around it? -Jeff -------------- next part -------------- An HTML attachment was scrubbed... URL: From garromo at us.ibm.com Wed Jul 1 14:21:26 2009 From: garromo at us.ibm.com (Gary Romo) Date: Wed, 1 Jul 2009 08:21:26 -0600 Subject: [Linux-cluster] GFS on stand alone Message-ID: Can GFS be used on a stand alone server without RHCS running? Any pro's or con's to this type of setup? Thanks. -Gary Romo -------------- next part -------------- An HTML attachment was scrubbed... URL: From cthulhucalling at gmail.com Wed Jul 1 14:32:26 2009 From: cthulhucalling at gmail.com (Ian Hayes) Date: Wed, 1 Jul 2009 07:32:26 -0700 Subject: [Linux-cluster] GFS on stand alone In-Reply-To: References: Message-ID: <36df569a0907010732m450ae24eu8e5827ee3a37b93f@mail.gmail.com> Yes it can. Use lock_nolock as your locking protocol. On Wed, Jul 1, 2009 at 7:21 AM, Gary Romo wrote: > Can GFS be used on a stand alone server without RHCS running? > > Any pro's or con's to this type of setup? Thanks. > > -Gary Romo > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... 
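A minimal single-node setup along those lines (the device path and mount point here are only examples) looks roughly like this:

# gfs_mkfs -p lock_nolock -j 1 /dev/vg0/gfslv
# mount -t gfs /dev/vg0/gfslv /mnt/gfs

An existing filesystem that was created for a cluster can also be mounted standalone by overriding the lock protocol with -o lockproto=lock_nolock. Either way there is no cluster locking at all with lock_nolock, so the filesystem must only ever be mounted on one node at a time.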
URL: From Andrea.Giussani at nposistemi.it Wed Jul 1 14:59:41 2009 From: Andrea.Giussani at nposistemi.it (Giussani Andrea) Date: Wed, 1 Jul 2009 16:59:41 +0200 Subject: [Linux-cluster] Package Apache and Mysql Problem Message-ID: Hi, i have a little big problem with RH Cluster Suite. I have 2 cluster nodes with 1 partition to share between the 2 node. There is no SAN. The node have the same hardware and the same partition. I have 1 partition with drbd to sycronize the 2 nodes Primary/Primary. I try in a lot type of configuration of Apache and Mysql package but i have the same problem. The error is: Jul 1 18:50:39 nodo1 luci[2581]: Unable to retrieve batch 1072342062 status from nodo2.local:11111: clusvcadm start failed to start Httpd: nodo1 and nodo2 is the 2 nodes and httpd is the apache service. Any idea??? I try the configuration in this procedure: http://kbase.redhat.com/faq/docs/DOC-5648 for Mysql but the result is the same. In attach my cluster.conf and drbd.conf If we need more tell me please. Thanks a lot Andrea Giussani AVVERTENZE AI SENSI DEL D.LGS. 196/2003 . Il contenuto di questo messaggio (ed eventuali allegati) e' strettamente confidenziale. L'utilizzo del contenuto del messaggio e' riservato esclusivamente al destinatario. La modifica, distribuzione, copia del messaggio da parte di altri e' severamente proibita. Se non siete i destinatari Vi invitiamo ad informare il mittente ed eliminare tutte le copie del suddetto messaggio . The content of this message (and attachment) is closely confidentiality. Use of the content of the message is classified exclusively to the addressee. The modification, distribution, copy of the message from others are forbidden. If you are not the addressees, we invite You to inform the sender and to eliminate all the copies of the message. -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: cluster.txt URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: drbd.txt URL: From brettcave at gmail.com Wed Jul 1 15:24:44 2009 From: brettcave at gmail.com (Brett Cave) Date: Wed, 1 Jul 2009 17:24:44 +0200 Subject: [Linux-cluster] problem with heartbeat + ipvs Message-ID: hi all, have a problem with HA / LB system, using heartbeat for HA and ldirector / ipvs for load balancing. When the primary node is shut down or heartbeat is stopped, the migration of services works fine, but the loadbalancing does not work (ipvs rules are active, but connect connect to HA services). Configs on primary and secondary are the same: haresources: primary 172.16.5.1/16/bond0 ldirectord::ldirectord.cf ldirectord.cf: virtual = 172.16.5.1:3306 service = mysql real = 172.16.10.1:3306 gate 1000 checktype, login, passwd, database, request values all set scheduler = sed ip_forward is enabled (checked via /proc, configured via sysctl) network configs are almost the same except for the IP address (using a bonded interface in active/passive mode) have set iptables policies to ACCEPT with rules that would not block the traffic (99.99% sure on this). if i try connect from a server such as 172.16.10.10, i cannot connect if the secondary is up: [user at someserver]$ mysql -h 172.16.5.1 ERROR 2003 (HY000): Can't connect to MySQL server on '172.16.5.1' (111) perror shows that 111 is Connection Refused running a sniffer on the secondary HA box, i dont see the tcp 3306 packets coming in. 
the arp_ignore / arp_announce kernel params are configured on teh real server, HA ip address is added on a /32 subnet to the lo interface, etc, etc.... (everything works 100% when primary is up). sure it is something i have overlooked, any idea's? -------------- next part -------------- An HTML attachment was scrubbed... URL: From agx at sigxcpu.org Wed Jul 1 16:40:07 2009 From: agx at sigxcpu.org (Guido =?iso-8859-1?Q?G=FCnther?=) Date: Wed, 1 Jul 2009 18:40:07 +0200 Subject: [Linux-cluster] Cluster 3.0.0.rc3 release In-Reply-To: <1246454636.19414.30.camel@cerberus.int.fabbione.net> References: <1245496789.3665.328.camel@cerberus.int.fabbione.net> <20090629184848.GA25796@bogon.sigxcpu.org> <1246306200.25867.86.camel@cerberus.int.fabbione.net> <20090701115725.GA6565@bogon.sigxcpu.org> <1246454636.19414.30.camel@cerberus.int.fabbione.net> Message-ID: <20090701164007.GA10680@bogon.sigxcpu.org> Hi Fabio, On Wed, Jul 01, 2009 at 03:23:56PM +0200, Fabio M. Di Nitto wrote: > Now my question is: did you mangle the configuration you sent me > manually? because there is no matching entry between device to use for a > node and the fencedevices section and I get: Yes, I had to get some internal names out. This is what went wrong: - + ^^^^^^ (same for node2/fence2). > Since you can reproduce the problem regularly I'd really like to see > some debugging output of libfence to start with. I'd really appreciate > if you could help us. > > test 1: > > Please add a bunch fprintf(stderr, to agents.c to see the created XPath > queries and the result coming back from libccs. # fence_node -vv node2 make_args(149): /cluster/fencedevices/fencedevice[@name="fence2"]/@* make_args(156) Segmentation fault > test 2: > > If you could please find: > > cd = ccs_connect(); (line 287 in agent.c) > and right before that add: > fullxpath=1; > > That change will ask libccs to use a different Xpath engine internally. > > And then re-run test1. # fence_node -vv node2 fence_node(289): fullxpath: 0 fence_node(291): fullxpath: 1 make_args(149): /cluster/fencedevices/fencedevice[@name="fence2"]/@* make_args(156) Segmentation fault make_args(156) is just before the strncmp. Trying to print out str results in a segfault too (that's why it's missing from the output). [..snip..] > the next question is: are you running on some fancy architecture? Maybe > something in that environment is not initialized properly (the garbage > string you get back from libccs sounds like that) but on more common > arches like x86/x86_64 gcc takes care of that for us.... (really wild > guessing but still something to fix!). 
Nothing fancy here: # uname -a Linux vm41 2.6.30-1-amd64 #1 SMP Sun Jun 14 15:00:29 UTC 2009 x86_64 GNU/Linux Cheers, -- Guido From adas at redhat.com Wed Jul 1 16:43:26 2009 From: adas at redhat.com (Abhijith Das) Date: Wed, 01 Jul 2009 11:43:26 -0500 Subject: [Linux-cluster] Recovering from "telling LM to withdraw" In-Reply-To: <64D0546C5EBBD147B75DE133D798665F02FDC1E2@hugo.eprize.local> References: <64D0546C5EBBD147B75DE133D798665F02FDC1E2@hugo.eprize.local> Message-ID: <4A4B922E.5090301@redhat.com> Jeff Sturm wrote: > > Recently we had a cluster node fail with a failed assertion: > > Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: fatal: > assertion "gfs_glock_is_locked_by_me(gl) && > gfs_glock_is_held_excl(gl)" failed > > Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: function = > gfs_trans_add_gl > > Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: file = > /builddir/build/BUILD/gfs-kmod-0.1.23/_kmod_build_/src/gfs/trans.c, > line = 237 > > Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: time = > 1246022619 > > Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: about to > withdraw from the cluster > > Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: telling LM > to withdraw > > This is with CentOS 5.2, GFS1. The cluster had been operating > continuously for about 3 months. > > My challenge isn't in preventing assertion failures entirely?I > recognize lurking software bugs and hardware anomalies can lead to a > failed node. Rather, I want to prevent one node from freezing the > cluster. When the above was logged, all nodes in the cluster which > access the tb2data filesystem also froze and did not recover. We > recovered with a rolling cluster restart and a precautionary gfs_fsck. > > Most cluster problems can be quickly handled by the fence agents. The > "telling LM to withdraw" does not trigger a fence operation, or any > other automated recovery. I need a deployment strategy to fix that. > Should I write an agent to scan the syslog, match on the message > above, and fence the node? > > Has anyone else encountered the same problem? If so, how did you get > around it? > > -Jeff > https://bugzilla.redhat.com/show_bug.cgi?id=471258 The assert+withdraw you're seeing seems to be this bug above. I've tried to recreate this on my cluster and failed. If you have a recipe to create this, could you please post it to the bugzilla? Meanwhile, I'll look at the code again to see if I can spot anything. Thanks! --Abhi From fdinitto at redhat.com Wed Jul 1 17:12:07 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Wed, 01 Jul 2009 19:12:07 +0200 Subject: [Linux-cluster] Cluster 3.0.0.rc3 release In-Reply-To: <20090701164007.GA10680@bogon.sigxcpu.org> References: <1245496789.3665.328.camel@cerberus.int.fabbione.net> <20090629184848.GA25796@bogon.sigxcpu.org> <1246306200.25867.86.camel@cerberus.int.fabbione.net> <20090701115725.GA6565@bogon.sigxcpu.org> <1246454636.19414.30.camel@cerberus.int.fabbione.net> <20090701164007.GA10680@bogon.sigxcpu.org> Message-ID: <1246468327.19414.65.camel@cerberus.int.fabbione.net> On Wed, 2009-07-01 at 18:40 +0200, Guido G?nther wrote: > Hi Fabio, > On Wed, Jul 01, 2009 at 03:23:56PM +0200, Fabio M. Di Nitto wrote: > > Now my question is: did you mangle the configuration you sent me > > manually? because there is no matching entry between device to use for a > > node and the fencedevices section and I get: > Yes, I had to get some internal names out. 
This is what went wrong: > > - > + Ok perfect thanks. > > # fence_node -vv node2 > make_args(149): /cluster/fencedevices/fencedevice[@name="fence2"]/@* > make_args(156) > Segmentation fault > > > test 2: > > > > If you could please find: > > > > cd = ccs_connect(); (line 287 in agent.c) > > and right before that add: > > fullxpath=1; > > > > That change will ask libccs to use a different Xpath engine internally. > > > > And then re-run test1. > # fence_node -vv node2 > fence_node(289): fullxpath: 0 > fence_node(291): fullxpath: 1 > make_args(149): /cluster/fencedevices/fencedevice[@name="fence2"]/@* > make_args(156) > Segmentation fault > > make_args(156) is just before the strncmp. Trying to print out str > results in a segfault too (that's why it's missing from the output). No matter what, I can't trigger this segfault. Do you have a build log for the package? and could you send me the make/defines.mk in the build tree? gcc versions and usual tool chain info.. maybe it's a gcc bug or maybe it's an optimization that behaves differently between debian and fedora. I have attached a small test case to simply test libccs. At this point I don't believe it's a problem in libfence. Could you please run it for me and send me the output? If the bug is in libccs this would start isolating it. [root at fedora-rh-node4 ~]# gcc -Wall -o testccs main.c -lccs [root at fedora-rh-node4 ~]# ./testccs -hopefully some output- and please check the XPath query at the top of main.c as it could be slightly different given your config. Thanks Fabio -------------- next part -------------- A non-text attachment was scrubbed... Name: main.c Type: text/x-csrc Size: 528 bytes Desc: not available URL: From Luis.Cerezo at pgs.com Wed Jul 1 18:24:07 2009 From: Luis.Cerezo at pgs.com (Luis Cerezo) Date: Wed, 1 Jul 2009 13:24:07 -0500 Subject: [Linux-cluster] qdisk best practices Message-ID: <15D5002F61F31A45A82A153D2F73906760FBD3F011@HOUMS26.onshore.pgs.com> Hi all- i've got a RHEL 5.3 cluster, 2node with qdisk. All works fine, but the qdisk seems to beat on the SAN (I/Ops) I adjusted the interval from the default of 1 to 5 and it is still high (san admin is crying) does anyone have best practices for this? its an LSI san and both nodes are mulitpathed to it via 4G FC. thanks! Luis E. Cerezo PGS Global IT This e-mail, any attachments and response string may contain proprietary information, which are confidential and may be legally privileged. It is for the intended recipient only and if you are not the intended recipient or transmission error has misdirected this e-mail, please notify the author by return e-mail and delete this message and any attachment immediately. If you are not the intended recipient you must not use, disclose, distribute, forward, copy, print or rely in this e-mail in any way except as permitted by the author. From tiagocruz at forumgdh.net Wed Jul 1 19:00:15 2009 From: tiagocruz at forumgdh.net (Tiago Cruz) Date: Wed, 01 Jul 2009 16:00:15 -0300 Subject: [Linux-cluster] Did you use GFS with witch technology? In-Reply-To: <64D0546C5EBBD147B75DE133D798665F02FDC1CF@hugo.eprize.local> References: <1246378523.7787.12.camel@tuxkiller> <64D0546C5EBBD147B75DE133D798665F02FDC1CF@hugo.eprize.local> Message-ID: <1246474815.7192.148.camel@tuxkiller> Thanks guys for all the comments! Just one more question: I have 10 VM inside a apache cluster, and I've compiled one httpd inside GFS, some like (/gfs/httpd_servers/bin-2.2.9). Did you see any problem with this? How do you use Apache with GFS? 
-- Tiago Cruz On Tue, 2009-06-30 at 23:57 -0400, Jeff Sturm wrote: > > -----Original Message----- > > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] > > On Behalf Of Tiago Cruz > > Sent: Tuesday, June 30, 2009 12:15 PM > > To: linux-cluster at redhat.com > > Subject: [Linux-cluster] Did you use GFS with witch technology? > > > > Did you use GFS and witch technology? KVM? Xen? VirtualBox? Not > Virtual? > > Witch version are you using? v1 or v2? > > Xen here, with GFS1. Works great. Pay attention to the performance > optimizations (noatime, etc.) including statfs_fast if you are on GFS1. > > We export LUNs from our SAN to each domU using tap:sync driver. > Performance seems to be limited by our SAN. Each domU in our setup has > two vif's: one for openais, another for everything else, though I can't > say if that helps or hurts performance. > > Jeff > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From brem.belguebli at gmail.com Wed Jul 1 21:09:43 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Wed, 1 Jul 2009 23:09:43 +0200 Subject: [Linux-cluster] Did you use GFS with witch technology? In-Reply-To: <1246397778.7787.67.camel@tuxkiller> References: <1246378523.7787.12.camel@tuxkiller> <36df569a0906301254p5dcece20g1336aece80bcd708@mail.gmail.com> <1246396045.7787.45.camel@tuxkiller> <29ae894c0906301429g6550e907hfe633c28c75c08eb@mail.gmail.com> <1246397778.7787.67.camel@tuxkiller> Message-ID: <29ae894c0907011409k470d1405qe23d7041b6c13dce@mail.gmail.com> Hi, Ok I understand, that should be supported. If your problems (freeze, qdisk loose, etc..;) occur when your VMs are under high load (CPU, RAM, disk IOs ?) why don't you configure more VCPUs on your VMs, split the number of VMs accross more LUNs, etc.. The thing to keep in mind, there is no way under ESX to limit the IO rate per VM, and ESX 3.5 doesn't support multipathing, if the bottleneck is more located on the disk subsystem. 2009/6/30 Tiago Cruz > Not, > > What I did: > > I have 10 virtual machines. > > I have one LUN with 200 GB formated by ESX using VMFS. > > Inside this lun, I have a lot of small pieces of 10 GB (the "/" os each > one virtual machine) formated by RHEL 5.x using EXT3. > > And my GFS was in another LUN, called DRM (some like Direct Raw Mapping) > where the LUN was delivered to VM without pass "inside" of ESX. > > Can you understood of I've complicated ever more? :-p > -- > Tiago Cruz > > > On Tue, 2009-06-30 at 23:29 +0200, brem belguebli wrote: > > Not really, > > > > > > VMFS is the clustered filesystem shipped with ESX. > > > > > > If I understand well, you got the source code of GFS that you did > > recompile on your ESX host, is that it ? > > > > > > I think you're already out of support from VMware if so. > > > > > > > > > > 2009/6/30 Tiago Cruz > > Hello Ian, > > > > 'cause AFAIK I can't format one block device with VMFS. > > You can think in VMFS in some like LVM - just one abstraction > > layer and > > not a FS itself :) > > > > -- > > Tiago Cruz > > > > > > > > > > On Tue, 2009-06-30 at 12:54 -0700, Ian Hayes wrote: > > > > > > > > > On Tue, Jun 30, 2009 at 9:15 AM, Tiago Cruz > > > > > wrote: > > > Hello, guys.. please... I need to know a little > > thing: > > > > > > I'm using GFS v1 with ESX 3.5 and I'm not very > > happy :) > > > High load from vms, freeze and quorum lost, for > > example. > > > > > > Did you use GFS and witch technology? KVM? Xen? > > VirtualBox? 
> > > Not Virtual? > > > Witch version are you using? v1 or v2? > > > > > > Are you a happy people using this? =) > > > > > > If you're using ESX, why are you using GFS instead of VMFS? > > > > > > > > > > > > > -- > > > Linux-cluster mailing list > > > Linux-cluster at redhat.com > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fdinitto at redhat.com Wed Jul 1 23:16:30 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Thu, 02 Jul 2009 01:16:30 +0200 Subject: [Linux-cluster] Cluster 3.0.0.rc4 release Message-ID: <1246490190.19414.93.camel@cerberus.int.fabbione.net> The cluster team and its community are proud to announce the 3.0.0.rc4 release candidate from the STABLE3 branch. This should be the last release candidate unless major problems will be found during the final testing stage. Everybody with test equipment and time to spare, is highly encouraged to download, install and test this release candidate and more important report problems. This is the time for people to make a difference and help us testing as much as possible. In order to build the 3.0.0.rc4 release you will need: - corosync 0.100 (1.0.0.rc1) - openais 0.100 (1.0.0.rc1) - linux kernel 2.6.29 The new source tarball can be downloaded here: ftp://sources.redhat.com/pub/cluster/releases/cluster-3.0.0.rc4.tar.gz https://fedorahosted.org/releases/c/l/cluster/cluster-3.0.0.rc5.tar.gz At the same location is now possible to find separated tarballs for fence-agents and resource-agents as previously announced (http://www.redhat.com/archives/cluster-devel/2009-February/msg00003.htm) To report bugs or issues: https://bugzilla.redhat.com/ Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other sysadministrators or power users. Happy clustering, Fabio Under the hood (from 3.0.0.rc3): Bob Peterson (4): GFS2: gfs2_convert, parameter not understood on ppc /sbin/mount.gfs2: can't find /proc/mounts entry for directory / Message printed to stderr instead of stdout gfs_fsck: Segfault in EA leaf repair Christine Caulfield (3): cman: use api->shutdown_request instead of api->request_shutdown cman: Fix some compile-time warning dlm: Fix some compile warnings Fabio M. 
Di Nitto (17): gfs: kill dead test code gfs2: drop dead test code build: enable fence_xvm by default config: fix warnings in confdb2ldif config: use HDB_X instead of _D gfs: add missing format attributes gfs2: handle output conversion properly gfs2: add missing casts gfs2: make functions static gfs2: backport coding format from master gfs2: resync internationalization support from master cman: port to the latest corosync API cman init: stop qdiskd only if enabled qdiskd: fix log file name cman init: don't stop fence_xvmd if we don't know the status cman init: readd support for fence_xvmd standalone operations Revert "gfs-kernel: enable FS_HAS_FREEZE" Federico Simoncelli (1): rgmanager: Allow vm.sh use of libvirt XML file Jim Meyering (5): src/clulib/ckpt_state.c (ds_key_init_nt): detect failed malloc dlm/tests: handle malloc failure cman: handle malloc failure (i.e., don't deref NULL) dlm_controld: handle heap allocation failure and plug leaks dlm_controld: add comments: mark memory problems Lon Hohberger (42): rgmanager: Fix ptr arithmetic and C90 warnings rgmanager: Fix rg_locks.c build warnings rgmanager: Fix rg_strings.c build warnings rgmanager: Fix members.c and related build warnings rgmanager: Change ccs_read_old_logging to static rgmanager: Fix daemon_init related warnings rgmanager: Remove unused function rgmanager: Remove unused proof-of-concept code rgmanager: Fix build warnings in cman.c rgmanager: Fix build warnings in fdops.c rgmanager: Fix vft.c and related build warnings rgmanager: Fix msgtest.c build warnings rgmanager: Fix complier warnings in msg_cluster.c rgmanager: Fix build warnings in msg_socket.c rgmanager: Fix build warnings in msgtest.c rgmanager: Fix fo_domain.c build warnings rgmanager: Fix fo_domain.c build warnings (part 2) rgmanager: Fix clufindhostname.c build warnings rgmanager: Fix clustat.c build warnings rgmanager: Fix clusvcadm.c build warnings rgmanager: Fix clulog.c build warnings rgmanager: groups.c cleanup rgmanager: Cleanups around main.c rgmanager: Fix reslist.c complier warnings rgmanager: Fix resrules.c compiler warnings rgmanager: Fix restree.c compiler warnings rgmanager: Clean up rg_event.c and related build warnings rgmanager: Fix rg_forward.c build warnings rgmanager: Fix rg_queue.c build warnings rgmanager: Clean up rg_queue.c and related warnings rgmanager: Clean up slang_event.c and related warnings rgmanager: Fix last bits of compiler warnings rgmanager: Fix leaked context on queue fail rgmanager: Fix stop/start race rgmanager: Fix stack overflows on stress testing rgmanager: Fix small memory leak rgmanager: Don't push NULL on to the S/Lang stack rgmanager: Fix error message rgmanager: Fix --debug build fence: Make fence_node return 2 for no fencing rgmanager: follow-service.sl stack cleanup rgmanager: Allow exit while waiting for fencing Marek 'marx' Grac (1): fence_wti: Fence agent for WTI ends with traceback when option is missing Steven Dake (1): fence: Fix missing case in switch statement Steven Whitehouse (1): libgfs2: Use -o meta rather than gfs2meta fs type cman/daemon/ais.c | 7 +- cman/daemon/commands.c | 6 +- cman/daemon/daemon.c | 5 +- cman/daemon/daemon.h | 2 +- cman/init.d/cman.in | 27 +- cman/qdisk/main.c | 2 +- config/tools/ldap/confdb2ldif.c | 6 +- configure | 8 - dlm/tests/usertest/alternate-lvb.c | 10 +- dlm/tests/usertest/asttest.c | 14 +- dlm/tests/usertest/dlmtest.c | 6 +- dlm/tests/usertest/dlmtest2.c | 7 +- dlm/tests/usertest/flood.c | 7 +- dlm/tests/usertest/joinleave.c | 2 +- dlm/tests/usertest/lstest.c | 12 +- 
dlm/tests/usertest/lvb.c | 11 +- dlm/tests/usertest/pingtest.c | 8 +- dlm/tests/usertest/threads.c | 34 +- fence/agents/Makefile | 13 +- fence/agents/wti/fence_wti.py | 14 +- fence/agents/xvm/vm_states.c | 2 + fence/fence_node/fence_node.c | 6 +- fence/libfence/agent.c | 2 +- gfs-kernel/src/gfs/ops_fstype.c | 2 +- gfs/gfs_fsck/Makefile | 7 - gfs/gfs_fsck/log.c | 9 +- gfs/gfs_fsck/metawalk.c | 7 +- gfs/gfs_fsck/test_bitmap.c | 38 - gfs/gfs_fsck/test_block_list.c | 91 - gfs/libgfs/log.c | 9 +- gfs2/convert/gfs2_convert.c | 2 +- gfs2/fsck/Makefile | 6 - gfs2/fsck/fs_recovery.c | 34 +- gfs2/fsck/initialize.c | 6 +- gfs2/fsck/main.c | 2 +- gfs2/fsck/rgrepair.c | 2 +- gfs2/fsck/test_bitmap.c | 38 - gfs2/fsck/test_block_list.c | 91 - gfs2/libgfs2/misc.c | 2 +- gfs2/mkfs/main.c | 2 +- gfs2/mkfs/main_grow.c | 4 +- gfs2/mkfs/main_jadd.c | 11 +- gfs2/mkfs/main_mkfs.c | 10 +- gfs2/mount/util.c | 15 +- gfs2/tool/main.c | 2 +- group/dlm_controld/pacemaker.c | 15 +- make/defines.mk.input | 1 - rgmanager/include/daemon_init.h | 9 + rgmanager/include/depends.h | 134 -- rgmanager/include/event.h | 10 + rgmanager/include/fo_domain.h | 48 + rgmanager/include/groups.h | 42 + rgmanager/include/lock.h | 4 +- rgmanager/include/members.h | 1 + rgmanager/include/message.h | 20 +- rgmanager/include/resgroup.h | 82 +- rgmanager/include/reslist.h | 51 +- rgmanager/include/restart_counter.h | 2 +- rgmanager/include/rg_locks.h | 9 + rgmanager/include/rg_queue.h | 6 +- rgmanager/include/vf.h | 10 +- rgmanager/src/clulib/ckpt_state.c | 1 + rgmanager/src/clulib/cman.c | 3 +- rgmanager/src/clulib/daemon_init.c | 8 +- rgmanager/src/clulib/fdops.c | 5 +- rgmanager/src/clulib/lock.c | 4 +- rgmanager/src/clulib/logging.c | 4 +- rgmanager/src/clulib/members.c | 66 - rgmanager/src/clulib/message.c | 22 +- rgmanager/src/clulib/msg_cluster.c | 13 +- rgmanager/src/clulib/msg_socket.c | 12 +- rgmanager/src/clulib/msgtest.c | 19 +- rgmanager/src/clulib/rg_strings.c | 2 +- rgmanager/src/clulib/vft.c | 53 +- rgmanager/src/daemons/Makefile | 6 +- rgmanager/src/daemons/depends.c | 2512 ----------------------- rgmanager/src/daemons/dtest.c | 810 -------- rgmanager/src/daemons/event_config.c | 19 +- rgmanager/src/daemons/fo_domain.c | 29 +- rgmanager/src/daemons/groups.c | 94 +- rgmanager/src/daemons/main.c | 173 +-- rgmanager/src/daemons/reslist.c | 35 +- rgmanager/src/daemons/resrules.c | 41 +- rgmanager/src/daemons/restree.c | 70 +- rgmanager/src/daemons/rg_event.c | 30 +- rgmanager/src/daemons/rg_forward.c | 6 +- rgmanager/src/daemons/rg_locks.c | 12 +- rgmanager/src/daemons/rg_queue.c | 8 +- rgmanager/src/daemons/rg_state.c | 145 +- rgmanager/src/daemons/rg_thread.c | 14 +- rgmanager/src/daemons/service_op.c | 15 +- rgmanager/src/daemons/slang_event.c | 266 ++-- rgmanager/src/daemons/test.c | 72 +- rgmanager/src/daemons/watchdog.c | 5 + rgmanager/src/resources/default_event_script.sl | 16 +- rgmanager/src/resources/follow-service.sl | 10 +- rgmanager/src/resources/vm.sh | 17 +- rgmanager/src/utils/clufindhostname.c | 2 +- rgmanager/src/utils/clulog.c | 4 +- rgmanager/src/utils/clustat.c | 67 +- rgmanager/src/utils/clusvcadm.c | 16 +- 101 files changed, 939 insertions(+), 4812 deletions(-) -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From jeff.sturm at eprize.com Thu Jul 2 03:40:40 2009 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Wed, 1 Jul 2009 23:40:40 -0400 Subject: [Linux-cluster] Did you use GFS with witch technology? In-Reply-To: <1246474815.7192.148.camel@tuxkiller> References: <1246378523.7787.12.camel@tuxkiller><64D0546C5EBBD147B75DE133D798665F02FDC1CF@hugo.eprize.local> <1246474815.7192.148.camel@tuxkiller> Message-ID: <64D0546C5EBBD147B75DE133D798665F02FDC207@hugo.eprize.local> > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] > On Behalf Of Tiago Cruz > Sent: Wednesday, July 01, 2009 3:00 PM > To: linux clustering > Subject: RE: [Linux-cluster] Did you use GFS with witch technology? > > I have 10 VM inside a apache cluster, and I've compiled one httpd inside > GFS, some like (/gfs/httpd_servers/bin-2.2.9). You can do that. It sounds like most of the nodes may be accessing this httpd instance read-only. If that will be the case, consider using spectator mounts on some of the nodes so you don't have to create 10 individual journals. > Did you see any problem with this? How do you use Apache with GFS? We actually use it for several purposes. For one, we keep our document root on GFS, so when web content is modified, the new content is immediately visible to all web servers. For another, we have a file-based session implementation on a GFS mount. The only real limitations I know of have to do with applications which are not cluster-aware, and performance of heavy read-write loads. -Jeff From jeff.sturm at eprize.com Thu Jul 2 03:45:00 2009 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Wed, 1 Jul 2009 23:45:00 -0400 Subject: [Linux-cluster] Recovering from "telling LM to withdraw" In-Reply-To: <4A4B922E.5090301@redhat.com> References: <64D0546C5EBBD147B75DE133D798665F02FDC1E2@hugo.eprize.local> <4A4B922E.5090301@redhat.com> Message-ID: <64D0546C5EBBD147B75DE133D798665F02FDC208@hugo.eprize.local> > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] > On Behalf Of Abhijith Das > Sent: Wednesday, July 01, 2009 12:43 PM > To: linux clustering > Subject: Re: [Linux-cluster] Recovering from "telling LM to withdraw" > > https://bugzilla.redhat.com/show_bug.cgi?id=471258 > > The assert+withdraw you're seeing seems to be this bug above. I've tried > to recreate this on my cluster and failed. If you have a recipe to > create this, could you please post it to the bugzilla? Thank you for the link. I'm not confident I can easily reproduce this yet, as we've had months of continuous uptime without such an incident. However if I do learn more about the circumstances leading up to our crash, I'll certainly post information to the bugzilla page. In the meantime I'll see if I can install a nagios agent to scan logs for any GFS problems. The sooner we know about it, the faster we can recover if this happens again. -Jeff From Emmanuel.Thome at normalesup.org Thu Jul 2 09:56:17 2009 From: Emmanuel.Thome at normalesup.org (Emmanuel =?iso-8859-1?Q?Thom=E9?=) Date: Thu, 2 Jul 2009 11:56:17 +0200 Subject: [Linux-cluster] ipmi activates session, but no talk. Message-ID: <20090702095617.GA24015@tiramisu.loria.fr> Hi. I'm trying to set up ipmi (1.5) management using the bmc on ibm eserver326 machines. Yes, these machines are old. 
So far, I've been able to access the bmc with ipmitool, and configure it as correctly as I could for remote access. When trying to access it from afar, I successfully activate a session, but further requests are unanswered. Some dumps of ipmitool commands are included below. If anybody has an idea of what's going on, that would be greatly appreciated. I might also try to flash the bmc firmware, as it seems that ibm released a newer firmware for these servers. But I'm already a bit puzzled by the situation so far. Thanks, E. I'm trying to access the bmc with IP 152.81.4.81 from the host with IP 152.81.3.83. The BMC piggies-back on the eth0 NIC which has IP 152.81.3.81 for the system. Thus the BMC and the system both have different MAC and IPs. Seems to work fine as some kind of conversation occurs. Here's the output of a remote ipmi request: [root at cassandre ~]# IPMI_PASSWORD=xxx ipmitool -vvI lan -L USER -H 152.81.4.81 -E mc info ipmi_lan_send_cmd:opened=[0], open=[4490512] IPMI LAN host 152.81.4.81 port 623 Sending IPMI/RMCP presence ping packet ipmi_lan_send_cmd:opened=[1], open=[4490512] Channel 01 Authentication Capabilities: Privilege Level : USER Auth Types : MD5 Per-msg auth : disabled User level auth : disabled Non-null users : enabled Null users : enabled Anonymous login : disabled Proceeding with AuthType MD5 ipmi_lan_send_cmd:opened=[1], open=[4490512] Opening Session Session ID : 751168e4 Challenge : e44e37374801833f77701411992dae25 Privilege Level : USER Auth Type : MD5 ipmi_lan_send_cmd:opened=[1], open=[4490512] Session Activated Auth Type : MD5 Max Priv Level : USER Session ID : 751168e4 Inbound Seq : 00000001 opened=[1], open=[4490512] No response from remote controller Get Device ID command failed ipmi_lan_send_cmd:opened=[1], open=[4490512] No response from remote controller Close Session command failed On the machine I'm trying to talk to, I have in particular: [root at achille ~]# ipmitool -I open session info all [...] session handle : 255 slot count : 4 active sessions : 1 user id : 1 privilege level : USER session type : IPMIv1.5 channel number : 0x01 console ip : 152.81.3.83 console mac : 00:00:00:00:00:00 console port : 60599 [...] 
[root at achille ~]# /usr/bin/ipmitool -I open lan print Set in Progress : Set Complete Auth Type Support : NONE MD5 PASSWORD Auth Type Enable : Callback : MD5 : User : MD5 : Operator : MD5 : Admin : MD5 : OEM : NONE MD5 PASSWORD IP Address Source : Static Address IP Address : 152.81.4.81 Subnet Mask : 255.255.240.0 MAC Address : 00:0d:60:18:7c:47 SNMP Community String : public IP Header : TTL=0x00 Flags=0x00 Precedence=0x00 TOS=0x00 Default Gateway IP : 152.81.1.1 Default Gateway MAC : 00:13:5f:89:14:00 Backup Gateway IP : 192.168.0.2 Backup Gateway MAC : 00:00:00:00:00:02 Cipher Suite Priv Max : Not Available [root at achille ~]# ipmitool user list 1 ID Name Callin Link Auth IPMI Msg Channel Priv Limit 1 true false true ADMINISTRATOR 2 root true true true OPERATOR 3 USERID true true true ADMINISTRATOR 4 OEM true true true OEM [root at achille ~]# ipmitool -I open channel info 1 Channel 0x1 info: Channel Medium Type : 802.3 LAN Channel Protocol Type : IPMB-1.0 Session Support : multi-session Active Session Count : 1 Protocol Vendor ID : 7154 Volatile(active) Settings Alerting : disabled Per-message Auth : disabled User Level Auth : disabled Access Mode : always available Non-Volatile Settings Alerting : disabled Per-message Auth : disabled User Level Auth : disabled Access Mode : always available From j.buzzard at dundee.ac.uk Thu Jul 2 10:15:36 2009 From: j.buzzard at dundee.ac.uk (Jonathan Buzzard) Date: Thu, 02 Jul 2009 11:15:36 +0100 Subject: [Linux-cluster] ipmi activates session, but no talk. In-Reply-To: <20090702095617.GA24015@tiramisu.loria.fr> References: <20090702095617.GA24015@tiramisu.loria.fr> Message-ID: <1246529736.23585.6.camel@penguin.lifesci.dundee.ac.uk> On Thu, 2009-07-02 at 11:56 +0200, Emmanuel Thom? wrote: > Hi. > > I'm trying to set up ipmi (1.5) management using the bmc on ibm > eserver326 machines. Yes, these machines are old. they are cheap and nasty rebadged MSI boxes. > So far, I've been able to access the bmc with ipmitool, and configure it > as correctly as I could for remote access. > > When trying to access it from afar, I successfully activate a session, > but further requests are unanswered. > > Some dumps of ipmitool commands are included below. Well that's your problem, it don't work with ipmitools :-( > If anybody has an idea of what's going on, that would be greatly > appreciated. > I suggest switching to freeipmi which does work. > I might also try to flash the bmc firmware, as it seems that ibm released > a newer firmware for these servers. But I'm already a bit puzzled by > the situation so far. I would if I where you. I would also do the BIOS, BMC and hard disk firmware at a minimum if I where you. The diagnostics are option. Note that you cannot configure bonding on eth0 and use the IPMI interface. Even when you get it working it is not reliable. I have seen boxes hang and refuse to respond to IPMI commands to reboot. I have also never been able to get the serial over LAN bit working either. They are just cheap and nasty. JAB. -- Jonathan A. Buzzard Tel: +441382-386998 Storage Administrator, College of Life Sciences University of Dundee, DD1 5EH From Emmanuel.Thome at normalesup.org Thu Jul 2 10:47:45 2009 From: Emmanuel.Thome at normalesup.org (Emmanuel =?iso-8859-1?Q?Thom=E9?=) Date: Thu, 2 Jul 2009 12:47:45 +0200 Subject: [Linux-cluster] ipmi activates session, but no talk. 
In-Reply-To: <1246529736.23585.6.camel@penguin.lifesci.dundee.ac.uk> References: <20090702095617.GA24015@tiramisu.loria.fr> <1246529736.23585.6.camel@penguin.lifesci.dundee.ac.uk> Message-ID: <20090702104745.GA25283@tiramisu.loria.fr> On Thu, Jul 02, 2009 at 11:15:36AM +0100, Jonathan Buzzard wrote: > > Some dumps of ipmitool commands are included below. > > Well that's your problem, it don't work with ipmitools :-( thanks a lot. Indeed. Regards, E. From brettcave at gmail.com Thu Jul 2 10:54:35 2009 From: brettcave at gmail.com (Brett Cave) Date: Thu, 2 Jul 2009 12:54:35 +0200 Subject: [Linux-cluster] Re: [SOLVED] problem with heartbeat + ipvs In-Reply-To: References: Message-ID: Was missing the DBD::mysql module so the connectioncheck was failing and setting weight to 0. only noticed this when i ran ldirector in debug mode. On Wed, Jul 1, 2009 at 5:24 PM, Brett Cave wrote: > hi all, > > have a problem with HA / LB system, using heartbeat for HA and ldirector / > ipvs for load balancing. > > When the primary node is shut down or heartbeat is stopped, the migration > of services works fine, but the loadbalancing does not work (ipvs rules are > active, but connect connect to HA services). Configs on primary and > secondary are the same: > > > haresources: > primary 172.16.5.1/16/bond0 ldirectord::ldirectord.cf > > ldirectord.cf: > virtual = 172.16.5.1:3306 > service = mysql > real = 172.16.10.1:3306 gate 1000 > checktype, login, passwd, database, request values all set > scheduler = sed > > ip_forward is enabled (checked via /proc, configured via sysctl) > > > network configs are almost the same except for the IP address (using a > bonded interface in active/passive mode) > have set iptables policies to ACCEPT with rules that would not block the > traffic (99.99% sure on this). > > if i try connect from a server such as 172.16.10.10, i cannot connect if > the secondary is up: > [user at someserver]$ mysql -h 172.16.5.1 > ERROR 2003 (HY000): Can't connect to MySQL server on '172.16.5.1' (111) > > > perror shows that 111 is Connection Refused > > running a sniffer on the secondary HA box, i dont see the tcp 3306 packets > coming in. > > the arp_ignore / arp_announce kernel params are configured on teh real > server, HA ip address is added on a /32 subnet to the lo interface, etc, > etc.... (everything works 100% when primary is up). > > sure it is something i have overlooked, any idea's? > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ironludo at free.fr Thu Jul 2 12:09:01 2009 From: ironludo at free.fr (LEROUX Ludovic) Date: Thu, 2 Jul 2009 14:09:01 +0200 Subject: [Linux-cluster] redhat 5 cluster suite with gfs and oracle 11g References: Message-ID: <3AA7FB4A03854AE2953DA88714C2DBA4@siim94.local> I try to install Oracle 11g on a redhat 5 cluster with 2 nodes. I have a gfs mount point for the shared datafiles. Oracle binaries are installed on each node. I want to create a failover instance (active/passive) but the service with the ressource oracle 10g failover instance doesn't start (see the logfile). I think that the resource doesn't work with Oracle 11g. Do you have any ideas? Do you have any documents to set up a redhat cluster with Oracle but without Oracle RAC? Thanks a lot. 
Ludo ________________________________________________________________________________________________________ Jul 2 14:11:03 siimlinux13 luci[2956]: Unable to retrieve batch 1273662007 status from siimlinux13.siim:11111: Unable to disable failed service oracle before starting it: clusvcadm failed to stop oracle Jul 2 14:11:11 siimlinux13 : error getting update info: Cannot retrieve repository metadata (repomd.xml) for repository: Cluster. Please verify its path and try again Jul 2 14:11:55 siimlinux13 clurgmgrd[3045]: Starting disabled service service:oracle Jul 2 14:11:55 siimlinux13 clurgmgrd[3045]: start on script "serviceoracle" returned 5 (program not installed) Jul 2 14:11:55 siimlinux13 clurgmgrd[3045]: #68: Failed to start service:oracle; return value: 1 Jul 2 14:11:55 siimlinux13 clurgmgrd[3045]: Stopping service service:oracle Jul 2 14:11:55 siimlinux13 luci[2956]: Unable to retrieve batch 710454996 status from siimlinux13.siim:11111: module scheduled for execution Jul 2 14:11:55 siimlinux13 clurgmgrd[3045]: stop on script "serviceoracle" returned 5 (program not installed) Jul 2 14:11:55 siimlinux13 clurgmgrd[3045]: #12: RG service:oracle failed to stop; intervention required Jul 2 14:11:56 siimlinux13 clurgmgrd[3045]: Service service:oracle is failed Jul 2 14:11:56 siimlinux13 clurgmgrd[3045]: #13: Service service:oracle failed to stop cleanly Jul 2 14:12:01 siimlinux13 luci[2956]: Unable to retrieve batch 710454996 status from siimlinux13.siim:11111: clusvcadm start failed to start oracle: Jul 2 15:11:14 siimlinux13 : error getting update info: Cannot retrieve repository metadata (repomd.xml) for repository: Cluster. Please verify its path and try again -------------- next part -------------- An HTML attachment was scrubbed... URL: From raju.rajsand at gmail.com Thu Jul 2 12:17:02 2009 From: raju.rajsand at gmail.com (Rajagopal Swaminathan) Date: Thu, 2 Jul 2009 17:47:02 +0530 Subject: [Linux-cluster] redhat 5 cluster suite with gfs and oracle 11g In-Reply-To: <3AA7FB4A03854AE2953DA88714C2DBA4@siim94.local> References: <3AA7FB4A03854AE2953DA88714C2DBA4@siim94.local> Message-ID: <8786b91c0907020517s51ccd802pfa306b401ad3f07e@mail.gmail.com> Greetings, On Thu, Jul 2, 2009 at 5:39 PM, LEROUX Ludovic wrote: > I try to install Oracle 11g on a redhat 5 cluster with 2 nodes. > I have a gfs mount point for the shared datafiles. > Oracle binaries are installed on each node. > I want to create a failover instance (active/passive) but the service with > the ressource oracle 10g failover instance doesn't start (see the logfile). Have you done chkconfig --off for the oracle script on both the nodes and added Oracle to cluster managed service along with the listener IP. Regards, Rajagopal From esggrupos at gmail.com Thu Jul 2 17:24:03 2009 From: esggrupos at gmail.com (ESGLinux) Date: Thu, 2 Jul 2009 19:24:03 +0200 Subject: [Linux-cluster] OFF TOPIC: cloud computing Message-ID: <3128ba140907021024m4e48ff1fl992272d2239b2c0c@mail.gmail.com> Hi folks, First sorry for the off topic but I?m sure you know a lot about the concept cloud computing. While I have been learning about clustering (with the help of this list..) I have read about using clusters for cloud computing. I?m totally newbie about that concept, so I want to ask you what you have to say about it, is it real? is an abstract concept and it?s not going to be interesting at all? what do you think? 
by the way, any web, book, magazine, article or any thing to profundice in this concept greetings ESG -------------- next part -------------- An HTML attachment was scrubbed... URL: From brettcave at gmail.com Thu Jul 2 17:35:43 2009 From: brettcave at gmail.com (Brett Cave) Date: Thu, 2 Jul 2009 19:35:43 +0200 Subject: [Linux-cluster] OFF TOPIC: cloud computing In-Reply-To: <3128ba140907021024m4e48ff1fl992272d2239b2c0c@mail.gmail.com> References: <3128ba140907021024m4e48ff1fl992272d2239b2c0c@mail.gmail.com> Message-ID: On Thu, Jul 2, 2009 at 7:24 PM, ESGLinux wrote: > Hi folks, > First sorry for the off topic but I?m sure you know a lot about the concept > cloud computing. > > While I have been learning about clustering (with the help of this list..) > I have read about using clusters for cloud computing. > > I?m totally newbie about that concept, so I want to ask you what you have > to say about it, is it real? is an abstract concept and it?s not going to be > interesting at all? > It is real, have a look at MPI for development of cloud computing (MPI CH as an implementation). Its used for message passing to queue out components of a job to various nodes. We implemented sorting using this library last year that allocated tasks on a per-core basis across multiple servers. > what do you think? > > by the way, any web, book, magazine, article or any thing to profundice in > this concept > > greetings > > ESG > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jruemker at redhat.com Thu Jul 2 19:53:58 2009 From: jruemker at redhat.com (John Ruemker) Date: Thu, 02 Jul 2009 15:53:58 -0400 Subject: [Linux-cluster] redhat 5 cluster suite with gfs and oracle 11g In-Reply-To: <3AA7FB4A03854AE2953DA88714C2DBA4@siim94.local> References: <3AA7FB4A03854AE2953DA88714C2DBA4@siim94.local> Message-ID: <4A4D1056.6090807@redhat.com> On 07/02/2009 08:09 AM, LEROUX Ludovic wrote: > Jul 2 14:11:55 siimlinux13 clurgmgrd[3045]: start on script > "serviceoracle" returned 5 (program not installed) > Jul 2 14:11:55 siimlinux13 clurgmgrd[3045]: #68: Failed to > start service:oracle; return value: 1 The above error is why its failing, but unfortunately this is pretty generic. Something returned a status code 5, but from these logs theres no way to be sure what since the oracle agent does a number of things during the startup sequence. Usually the best way to troubleshoot these issues is with rg_test, as it will be much more verbose. First disable your service # clusvcadm -d serviceoracle Now do # rg_test test /etc/cluster/cluster.conf start service serviceoracle You should see it logging each operation and it will tell you where it failed. If this doesn't point you to your answer then post the output here as well as your cluster.conf. Also there are some good guidelines and basic steps for setting up an oracle service here: http://people.redhat.com/lhh/oracle-rhel5-notes-0.6/oracle-notes.html HTH -John From hlawatschek at atix.de Fri Jul 3 09:25:55 2009 From: hlawatschek at atix.de (Mark Hlawatschek) Date: Fri, 3 Jul 2009 11:25:55 +0200 Subject: [Linux-cluster] redhat 5 cluster suite with gfs and oracle 11g In-Reply-To: <3AA7FB4A03854AE2953DA88714C2DBA4@siim94.local> References: <3AA7FB4A03854AE2953DA88714C2DBA4@siim94.local> Message-ID: <200907031125.55435.hlawatschek@atix.de> Hi Ludo, could you please provide your cluster.conf file ? 
-Mark On Thursday 02 July 2009 14:09:01 LEROUX Ludovic wrote: > I try to install Oracle 11g on a redhat 5 cluster with 2 nodes. > I have a gfs mount point for the shared datafiles. > Oracle binaries are installed on each node. > I want to create a failover instance (active/passive) but the service with > the ressource oracle 10g failover instance doesn't start (see the logfile). > I think that the resource doesn't work with Oracle 11g. > Do you have any ideas? > Do you have any documents to set up a redhat cluster with Oracle but > without Oracle RAC? Thanks a lot. > Ludo > > ___________________________________________________________________________ >_____________________________ > > Jul 2 14:11:03 siimlinux13 luci[2956]: Unable to retrieve batch 1273662007 > status from siimlinux13.siim:11111: Unable to disable failed service oracle > before starting it: clusvcadm failed to stop oracle Jul 2 14:11:11 > siimlinux13 : error getting update info: Cannot retrieve repository > metadata (repomd.xml) for repository: Cluster. Please verify its path and > try again Jul 2 14:11:55 siimlinux13 clurgmgrd[3045]: Starting > disabled service service:oracle Jul 2 14:11:55 siimlinux13 > clurgmgrd[3045]: start on script "serviceoracle" returned 5 > (program not installed) Jul 2 14:11:55 siimlinux13 clurgmgrd[3045]: > #68: Failed to start service:oracle; return value: 1 Jul 2 > 14:11:55 siimlinux13 clurgmgrd[3045]: Stopping service > service:oracle Jul 2 14:11:55 siimlinux13 luci[2956]: Unable to retrieve > batch 710454996 status from siimlinux13.siim:11111: module scheduled for > execution Jul 2 14:11:55 siimlinux13 clurgmgrd[3045]: stop on > script "serviceoracle" returned 5 (program not installed) Jul 2 14:11:55 > siimlinux13 clurgmgrd[3045]: #12: RG service:oracle failed to stop; > intervention required Jul 2 14:11:56 siimlinux13 clurgmgrd[3045]: > Service service:oracle is failed Jul 2 14:11:56 siimlinux13 > clurgmgrd[3045]: #13: Service service:oracle failed to stop cleanly > Jul 2 14:12:01 siimlinux13 luci[2956]: Unable to retrieve batch 710454996 > status from siimlinux13.siim:11111: clusvcadm start failed to start oracle: > Jul 2 15:11:14 siimlinux13 : error getting update info: Cannot retrieve > repository metadata (repomd.xml) for repository: Cluster. Please verify its > path and try again -- Dipl.-Ing. Mark Hlawatschek ATIX Informationstechnologie und Consulting AG | Einsteinstrasse 10 | 85716 Unterschleissheim | www.atix.de | www.open-sharedroot.org From mech at meteo.uni-koeln.de Fri Jul 3 15:06:13 2009 From: mech at meteo.uni-koeln.de (Mario Mech) Date: Fri, 03 Jul 2009 17:06:13 +0200 Subject: [Linux-cluster] running services as non-root user Message-ID: <4A4E1E65.2070908@meteo.uni-koeln.de> Hi, in my cluster environment some services need to run as a non-root user. What are the necessary settings? Settings in my cluster.conf like Thank you, - G. Felix -------------- next part -------------- An HTML attachment was scrubbed... URL: From crosa at redhat.com Wed Jul 15 22:27:08 2009 From: crosa at redhat.com (Cleber Rosa) Date: Wed, 15 Jul 2009 18:27:08 -0400 (EDT) Subject: [Linux-cluster] DRAC 4 In-Reply-To: <8b711df40907151304g68d4fbdqf835a70aeee03f59@mail.gmail.com> Message-ID: <14860381.31247696672994.JavaMail.cleber@localhost.localdomain> Paras, Maybe you should try using the "secure" (ssh) option. CR. --- Cleber Rodrigues < crosa at redhat.com > Solutions Architect - Red Hat, Inc. 
Mobile: +55 61 9185.3454 ----- Original Message ----- From: "Paras pradhan" To: linux-poweredge at lists.us.dell.com, "linux clustering" Sent: Wednesday, July 15, 2009 5:04:38 PM GMT -03:00 Brasilia Subject: [Linux-cluster] DRAC 4 hi, I am using centos 5.3 in dell poweredge 1850 servers . Drac is 4/i. I am working on to create a cluster using 3 poweredge nodes and I am in need to use DRAC as fencing device. While testing fencing using fence_drac it complains me as: root at tst1 ~]# fence_drac -a 10.10.10.2 -l user -p calvin failed: telnet open failed: problem connecting to "10.10.10.2", port 23: Connection refused I tried to enable telnet using racadm but got the following error. [root at tst1 ~]# racadm config -g cfgSerial -o cfgSerialSshEnable 1 ERROR: RACADM is unable to process the requested subcommand because there is no local RAC configuration to communicate with. Local RACADM subcommand execution requires the following: 1. A Remote Access Controller (RAC) must be present on the managed server 2. Appropriate managed node software must be installed and running on the server I am really stuck here. Any one having the similar problem? Thanks Paras. -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From pradhanparas at gmail.com Wed Jul 15 22:30:18 2009 From: pradhanparas at gmail.com (Paras pradhan) Date: Wed, 15 Jul 2009 17:30:18 -0500 Subject: [Linux-cluster] DRAC 4 In-Reply-To: <14860381.31247696672994.JavaMail.cleber@localhost.localdomain> References: <8b711df40907151304g68d4fbdqf835a70aeee03f59@mail.gmail.com> <14860381.31247696672994.JavaMail.cleber@localhost.localdomain> Message-ID: <8b711df40907151530g139f13e7wf70586e61494626c@mail.gmail.com> On Wed, Jul 15, 2009 at 5:27 PM, Cleber Rosa wrote: > Paras, > > Maybe you should try using the "secure" (ssh) option. Is it possible in DRAC 4? Thanks Paras. > > CR. > > --- > Cleber Rodrigues > Solutions Architect - Red Hat, Inc. > Mobile: +55 61 9185.3454 > > ----- Original Message ----- > From: "Paras pradhan" > To: linux-poweredge at lists.us.dell.com, "linux clustering" > > Sent: Wednesday, July 15, 2009 5:04:38 PM GMT -03:00 Brasilia > Subject: [Linux-cluster] DRAC 4 > > hi, > > I am using centos 5.3 in dell poweredge 1850 servers . Drac is 4/i. ?I > am working on to create a cluster using 3 poweredge nodes and I am in > need to use DRAC as fencing device. > > While testing fencing using fence_drac it complains me as: > > root at tst1 ~]# fence_drac -a 10.10.10.2 -l user -p calvin > failed: telnet open failed: problem connecting to "10.10.10.2", port > 23: Connection refused > > I tried to enable telnet using racadm but got the following error. > > [root at tst1 ~]# racadm config -g cfgSerial -o cfgSerialSshEnable 1 > ERROR: RACADM is unable to process the requested subcommand because there is > no > local RAC configuration to communicate with. > > Local RACADM subcommand execution requires the following: > > ?1. A Remote Access Controller (RAC) must be present on the managed server > ?2. Appropriate managed node software must be installed and running on the > ?? ?server > > > I am really stuck here. Any one having the similar problem? > > Thanks > Paras. 
> > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster >

From abednegoyulo at yahoo.com Thu Jul 16 02:46:24 2009 From: abednegoyulo at yahoo.com (Abed-nego G. Escobal, Jr.) Date: Wed, 15 Jul 2009 19:46:24 -0700 (PDT) Subject: [Linux-cluster] Starting two-node cluster with only one node Message-ID: <408681.82392.qm@web110416.mail.gq1.yahoo.com>

Using the config file below:

Is it possible to start the cluster by only bringing up one node? The reason why I ask is because currently bringing them up together produces a split brain, each of them a member of its own cluster GFSCluster, fencing the other. My plan is to bring up only one node to create a quorum, then bring the other one up and manually join it to the existing cluster.

I have already done the start_clean approach but it seems it does not work.

From abednegoyulo at yahoo.com Thu Jul 16 06:16:46 2009 From: abednegoyulo at yahoo.com (Abed-nego G. Escobal, Jr.) Date: Wed, 15 Jul 2009 23:16:46 -0700 (PDT) Subject: [Linux-cluster] Starting two-node cluster with only one node In-Reply-To: <408681.82392.qm@web110416.mail.gq1.yahoo.com> Message-ID: <813634.56331.qm@web110409.mail.gq1.yahoo.com>

Tried it and now the two node cluster is running with only one node. My problem right now is how to force the second node to join the first node's cluster. Right now it is creating its own cluster and trying to fence the first node. I tried cman_tool leave on the second node but I got

cman_tool: Error leaving cluster: Device or resource busy

clvmd and gfs are not running on the second node. What is running on the second node is cman. When I did

service cman start

it took approximately 5 minutes before I got the [ok] message. Am I missing something here? Am I not doing it right? Should I be doing something else?

--- On Thu, 7/16/09, Abed-nego G. Escobal, Jr. wrote: > From: Abed-nego G. Escobal, Jr. > Subject: [Linux-cluster] Starting two-node cluster with only one node > To: "linux clustering" > Date: Thursday, 16 July, 2009, 10:46 AM > > Using the config file below > > > > > ? name="node01.company.com" votes="1" > nodeid="1"> name="single"> name="node01_ipmi"/> name="node02.company.com" votes="1" > nodeid="2"> name="single"> name="node02_ipmi"/> > ? name="node01_ipmi" agent="fence_ipmilan" ipaddr="10.1.0.5" > login="root" passwd="********"/> name="node02_ipmi" agent="fence_ipmilan" ipaddr="10.1.0.7" > login="root" passwd="********"/> > ? > ? ? > ? ? > ? > > > Is it possible to start the cluster by only bringing up one > node? The reason why I asked is because currently bringing > them up together produces a split brain, each of them member > of the cluster GFSCluster of their own fencing each other. > My plan is to bring up only one node to create a quorum then > bring the other one up and manually join it to the existing > cluster. > > I have already don the start_clean approach but it seems it > does not work. > > > ? ? ? > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster >
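Since the list software strips raw XML, the configuration quoted above only survives in fragments. Reconstructed against the stock RHCS 5 schema (the element names and the two_node/expected_votes line are assumed; the attribute values are exactly the ones that survived in the fragments), it reads roughly as follows. Whatever followed the fencedevices section, an rm block for services for example, did not survive and is not shown.

<?xml version="1.0"?>
<cluster name="GFSCluster" config_version="5">
    <cman two_node="1" expected_votes="1"/>
    <clusternodes>
        <clusternode name="node01.company.com" votes="1" nodeid="1">
            <fence>
                <method name="single">
                    <device name="node01_ipmi"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="node02.company.com" votes="1" nodeid="2">
            <fence>
                <method name="single">
                    <device name="node02_ipmi"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <fencedevices>
        <fencedevice name="node01_ipmi" agent="fence_ipmilan" ipaddr="10.1.0.5" login="root" passwd="********"/>
        <fencedevice name="node02_ipmi" agent="fence_ipmilan" ipaddr="10.1.0.7" login="root" passwd="********"/>
    </fencedevices>
</cluster>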
From esggrupos at gmail.com Thu Jul 16 08:27:09 2009 From: esggrupos at gmail.com (ESGLinux) Date: Thu, 16 Jul 2009 10:27:09 +0200 Subject: [Linux-cluster] EMC vs HP EVA In-Reply-To: <1247670305.725.63.camel@penguin.lifesci.dundee.ac.uk> References: <3128ba140907150000l8475a7fx44374523df51b73d@mail.gmail.com> <1247670305.725.63.camel@penguin.lifesci.dundee.ac.uk> Message-ID: <3128ba140907160127p79279fd7uc7ebb662fa8831d9@mail.gmail.com>

Hi All,

First, thanks for your answers. My budget for buying the shared storage is about 10.000 EUR. With this in mind, I want something good enough to make a good cluster. One of the things I want from this is to get knowledge about the products used in big implementations. This is the reason I'm thinking of EMC or EVA (I don't know if I can afford this with my budget, or if I'm being a little naive). By the way, nobody has said which of these product families is best (or if there is something better; I have also seen Dell EqualLogic).

Greetings

ESG

2009/7/15 Jonathan Buzzard > > On Wed, 2009-07-15 at 08:00 -0600, Aaron Benner wrote: > > We have an EMC AX150i here and it works very well, but EMC's policy > > makes it pretty expensive. We just added 8 720 GB drives to our > > unit. The drives could be purchased commodity for about $80-120, but > > to put them into the EMC box we had to purchase "certified" units with > > the mounting sleds for $500 per. Seemed like a pretty steep markup at > > the time. > > That is par for the course, they all come in platinum sleds whether you > get EMC, HP, IBM etc. > > Last time I looked list price for a 1TB SATA drive for an IBM DS5000 was > something like 2000 USD. > > JAB. > > -- > Jonathan A. Buzzard Tel: +441382-386998 > Storage Administrator, College of Life Sciences > University of Dundee, DD1 5EH > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster >

From tharindub at millenniumit.com Thu Jul 16 08:06:53 2009 From: tharindub at millenniumit.com (Tharindu Rukshan Bamunuarachchi) Date: Thu, 16 Jul 2009 14:06:53 +0600 Subject: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1 In-Reply-To: <21E1652F6AA39A4B89489DCFC399C1FD171BDD@gbplamail.genband.com> References: <005301ca050a$7a469ba0$6ed3d2e0$@com> <21E1652F6AA39A4B89489DCFC399C1FD171BDB@gbplamail.genband.com> <21E1652F6AA39A4B89489DCFC399C1FD171BDD@gbplamail.genband.com> Message-ID: <000501ca05ec$633467c0$299d3740$@com>

Are you going to deploy SuSE 11 soon? We could not use Red Hat, as RHEL 5.x does not support High Resolution Timers. Now we cannot continue, as GFS is not supported on SuSE 11 :)

From: Ed Sanborn [mailto:Ed.Sanborn at genband.com] Sent: Wednesday, July 15, 2009 8:32 PM To: Tharindu Bamunuarachchi; linux clustering Subject: RE: RE: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1

No, a great question and one I am interested in the answer along with you.

From: Tharindu Bamunuarachchi [mailto:tharindub at millenniumit.com] Sent: Wednesday, July 15, 2009 10:30 AM To: linux clustering Cc: Ed Sanborn Subject: Re: RE: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1

mmm ... not yet ... is it a wrong question ??

----- Original Message ----- From: Ed Sanborn Date: Wednesday, July 15, 2009 7:44 pm Subject: RE: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1 To: linux clustering > > > > > > > > > > > > > > > > > > > > > Tharindu, > > > > Did anyone answer you on this > question?
> > > > Ed > > > > > > > > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Tharindu Rukshan > Bamunuarachchi > > Sent: Wednesday, July 15, 2009 1:10 AM > > To: Linux-cluster at redhat.com > > Subject: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1 > > > > > > > > hi All, > > > > SuSE 11 kernel comes with GFS2 module. > > Where I can find cluster software for GFS2 deployment. > > > > In source.redhat.com there are few cluster versions. Which > version should be compatible with 2.6.27* kernel. > > Do I have to build module from cluster source package ? > > > > Have you guys tried GFS on SuSE 11 or openSuSE 11.1 ? > > > > cheers > > __ > > tharindu > > > > > > > **************************************************************************** **************************************************************************** *********** > > > > "The information contained in this email including in any attachment is > confidential and is meant to be read only by the person to whom it is > addressed. If you are not the intended recipient(s), you are prohibited from > printing, forwarding, saving or copying this email. If you have received this > e-mail in error, please immediately notify the sender and delete this e-mail > and its attachments from your computer." > > > > **************************************************************************** **************************************************************************** *********** > > > > > > > > > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster **************************************************************************** **************************************************************************** *********** "The information contained in this email including in any attachment is confidential and is meant to be read only by the person to whom it is addressed. If you are not the intended recipient(s), you are prohibited from printing, forwarding, saving or copying this email. If you have received this e-mail in error, please immediately notify the sender and delete this e-mail and its attachments from your computer." **************************************************************************** **************************************************************************** *********** ******************************************************************************************************************************************************************* "The information contained in this email including in any attachment is confidential and is meant to be read only by the person to whom it is addressed. If you are not the intended recipient(s), you are prohibited from printing, forwarding, saving or copying this email. If you have received this e-mail in error, please immediately notify the sender and delete this e-mail and its attachments from your computer." ******************************************************************************************************************************************************************* From rsetchfield at xcalibre.co.uk Thu Jul 16 09:05:52 2009 From: rsetchfield at xcalibre.co.uk (Raymond Setchfield) Date: Thu, 16 Jul 2009 10:05:52 +0100 Subject: [Linux-cluster] keepalived on centos Message-ID: <4A5EED70.7080904@xcalibre.co.uk> Hey Guys Just looking to see if any of you know of good websites which will help on setting and troubleshooting keepalived? 
As I am coming across a problem for which, when having a look on Google, I am not getting very much help. I don't know if this is a common occurrence within the logs, a configuration error, or possibly a bug. I am getting errors like this showing in the logs:

Jul 16 10:01:56 loadbalancer-03 Keepalived: VRRP child process(5847) died: Respawning
Jul 16 10:01:56 loadbalancer-03 Keepalived: Remove a zombie pid file /var/run/vrrp.pid
Jul 16 10:01:56 loadbalancer-03 Keepalived: Starting VRRP child process, pid=5848
Jul 16 10:01:56 loadbalancer-03 Keepalived_vrrp: Using MII-BMSR NIC polling thread...
Jul 16 10:01:56 loadbalancer-03 Keepalived_vrrp: Registering Kernel netlink reflector
Jul 16 10:01:56 loadbalancer-03 Keepalived_vrrp: Registering Kernel netlink command channel
Jul 16 10:01:56 loadbalancer-03 Keepalived_vrrp: Registering gratutious ARP shared channel
Jul 16 10:01:56 loadbalancer-03 Keepalived_vrrp: Opening file '/etc/keepalived/keepalived.conf'

If anyone can help and point me in the right direction it would be greatly appreciated.

Thanks in advance

Raymond

From abednegoyulo at yahoo.com Thu Jul 16 10:07:54 2009 From: abednegoyulo at yahoo.com (Abed-nego G. Escobal, Jr.) Date: Thu, 16 Jul 2009 03:07:54 -0700 (PDT) Subject: [Linux-cluster] Eliminating split brain Message-ID: <472005.90546.qm@web110415.mail.gq1.yahoo.com>

Does converting a two node cluster to a three node cluster (adding one more member) eliminate the possibility of split brain?

From glaurence at networkenablers.com.au Thu Jul 16 10:29:49 2009 From: glaurence at networkenablers.com.au (Geoffrey Laurence) Date: Thu, 16 Jul 2009 20:29:49 +1000 Subject: [Linux-cluster] DRAC 4 In-Reply-To: <8b711df40907151304g68d4fbdqf835a70aeee03f59@mail.gmail.com> References: <8b711df40907151304g68d4fbdqf835a70aeee03f59@mail.gmail.com> Message-ID:

Hi,

I have had a similar problem with Fedora-10, I think some kernels have a problem detecting the Drac. Anyway I found that you can enable telnet from the web interface.

From the 'command box' on the Diagnostics tab you can type the following commands:

'd3debug propget ENABLE_TELNET' - Prints if telnet is enabled
'd3debug propset ENABLE_TELNET=TRUE' - Enables telnet
'd3debug racadm racreset' - Reboots the drac.

After you have enabled telnet and rebooted the drac, you should be able to telnet to the drac card.

Hope this helps, Geoffrey.

-----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Paras pradhan Sent: Thursday, 16 July 2009 6:05 AM To: linux-poweredge at lists.us.dell.com; linux clustering Subject: [Linux-cluster] DRAC 4 hi, I am using centos 5.3 in dell poweredge 1850 servers . Drac is 4/i. I am working on to create a cluster using 3 poweredge nodes and I am in need to use DRAC as fencing device. While testing fencing using fence_drac it complains me as: root at tst1 ~]# fence_drac -a 10.10.10.2 -l user -p calvin failed: telnet open failed: problem connecting to "10.10.10.2", port 23: Connection refused I tried to enable telnet using racadm but got the following error. [root at tst1 ~]# racadm config -g cfgSerial -o cfgSerialSshEnable 1 ERROR: RACADM is unable to process the requested subcommand because there is no local RAC configuration to communicate with. Local RACADM subcommand execution requires the following: 1.
A Remote Access Controller (RAC) must be present on the managed server 2. Appropriate managed node software must be installed and running on the server I am really stuck here. Any one having the similar problem? Thanks Paras. -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From Ed.Sanborn at genband.com Thu Jul 16 13:31:51 2009 From: Ed.Sanborn at genband.com (Ed Sanborn) Date: Thu, 16 Jul 2009 08:31:51 -0500 Subject: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1 In-Reply-To: <000001ca05ec$63108c10$2931a430$@com> References: <21E1652F6AA39A4B89489DCFC399C1FD171BDD@gbplamail.genband.com> <000001ca05ec$63108c10$2931a430$@com> Message-ID: <21E1652F6AA39A4B89489DCFC399C1FD171BEA@gbplamail.genband.com> Well, that's a bummer to here that GFS is not supported by SUSE 11. I think I'll end up installing SUSE 11 on another machine then and using NFS to access our GFS filesystem. Ed From: Tharindu Rukshan Bamunuarachchi [mailto:tharindub at millenniumit.com] Sent: Thursday, July 16, 2009 4:07 AM To: Ed Sanborn; 'linux clustering' Subject: RE: RE: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1 Are you going to deploy SuSE 11 soon ... We could not use RedHat as RHEL 5.x do not support High Resolution Timers ... Now we can not continue, as GFS is not supported on SuSE 11 J From: Ed Sanborn [mailto:Ed.Sanborn at genband.com] Sent: Wednesday, July 15, 2009 8:32 PM To: Tharindu Bamunuarachchi; linux clustering Subject: RE: RE: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1 No, a great question and one I am interested in the answer along with you. From: Tharindu Bamunuarachchi [mailto:tharindub at millenniumit.com] Sent: Wednesday, July 15, 2009 10:30 AM To: linux clustering Cc: Ed Sanborn Subject: Re: RE: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1 mmm ... not yet ... is it a wrong question ?? ----- Original Message ----- From: Ed Sanborn Date: Wednesday, July 15, 2009 7:44 pm Subject: RE: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1 To: linux clustering > > > > > > > > > > > > > > > > > > > > > Tharindu, > > > > Did anyone answer you on this > question? > > > > Ed > > > > > > > > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Tharindu Rukshan > Bamunuarachchi > > Sent: Wednesday, July 15, 2009 1:10 AM > > To: Linux-cluster at redhat.com > > Subject: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1 > > > > > > > > hi All, > > > > SuSE 11 kernel comes with GFS2 module. > > Where I can find cluster software for GFS2 deployment. > > > > In source.redhat.com there are few cluster versions. Which > version should be compatible with 2.6.27* kernel. > > Do I have to build module from cluster source package ? > > > > Have you guys tried GFS on SuSE 11 or openSuSE 11.1 ? > > > > cheers > > __ > > tharindu > > > > > > > ************************************************************************ ************************************************************************ ******************* > > > > "The information contained in this email including in any attachment is > confidential and is meant to be read only by the person to whom it is > addressed. If you are not the intended recipient(s), you are prohibited from > printing, forwarding, saving or copying this email. If you have received this > e-mail in error, please immediately notify the sender and delete this e-mail > and its attachments from your computer." 
> > > > ************************************************************************ ************************************************************************ ******************* > > > > > > > > > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster ************************************************************************ ************************************************************************ ******************* "The information contained in this email including in any attachment is confidential and is meant to be read only by the person to whom it is addressed. If you are not the intended recipient(s), you are prohibited from printing, forwarding, saving or copying this email. If you have received this e-mail in error, please immediately notify the sender and delete this e-mail and its attachments from your computer." ************************************************************************ ************************************************************************ ******************* ************************************************************************ ************************************************************************ ******************* "The information contained in this email including in any attachment is confidential and is meant to be read only by the person to whom it is addressed. If you are not the intended recipient(s), you are prohibited from printing, forwarding, saving or copying this email. If you have received this e-mail in error, please immediately notify the sender and delete this e-mail and its attachments from your computer." ************************************************************************ ************************************************************************ ******************* -------------- next part -------------- An HTML attachment was scrubbed... URL: From gordan at bobich.net Thu Jul 16 13:50:05 2009 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 16 Jul 2009 14:50:05 +0100 Subject: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1 In-Reply-To: <21E1652F6AA39A4B89489DCFC399C1FD171BEA@gbplamail.genband.com> References: <21E1652F6AA39A4B89489DCFC399C1FD171BDD@gbplamail.genband.com> <000001ca05ec$63108c10$2931a430$@com> <21E1652F6AA39A4B89489DCFC399C1FD171BEA@gbplamail.genband.com> Message-ID: <39426d35b502eb5ceccb659385880d6d@localhost> > We could not use RedHat as RHEL 5.x do not support High Resolution > Timers ... Are you saying that changing the distribution is deemed less of a problem than rebuilding the kernel with the high-res timer option ticked? The kernel src.rpm package is there for a reason. Gordan From tharindub at millenniumit.com Thu Jul 16 13:32:09 2009 From: tharindub at millenniumit.com (Tharindu Rukshan Bamunuarachchi) Date: Thu, 16 Jul 2009 19:32:09 +0600 Subject: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1 In-Reply-To: <39426d35b502eb5ceccb659385880d6d@localhost> References: <21E1652F6AA39A4B89489DCFC399C1FD171BDD@gbplamail.genband.com> <000001ca05ec$63108c10$2931a430$@com> <21E1652F6AA39A4B89489DCFC399C1FD171BEA@gbplamail.genband.com> <39426d35b502eb5ceccb659385880d6d@localhost> Message-ID: <000801ca0619$d2febf60$78fc3e20$@com> Why did not RedHat release 5.x kernel with High Resolution Timers enabled ? We thought that particular kernel is not much stable with High Resolution Timers. However, SuSE 11 supported High Resolution Timers AS-IS. Initially, we tried RedHat 5 on our system. But, In production environment we needed to have stable version. 
I am not sure, but I think vendor does not support custom kernels. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Gordan Bobic Sent: Thursday, July 16, 2009 7:50 PM To: linux clustering Subject: RE: RE: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1 > We could not use RedHat as RHEL 5.x do not support High Resolution > Timers ... Are you saying that changing the distribution is deemed less of a problem than rebuilding the kernel with the high-res timer option ticked? The kernel src.rpm package is there for a reason. Gordan -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster ******************************************************************************************************************************************************************* "The information contained in this email including in any attachment is confidential and is meant to be read only by the person to whom it is addressed. If you are not the intended recipient(s), you are prohibited from printing, forwarding, saving or copying this email. If you have received this e-mail in error, please immediately notify the sender and delete this e-mail and its attachments from your computer." ******************************************************************************************************************************************************************* From tfrumbacher at gmail.com Thu Jul 16 14:04:45 2009 From: tfrumbacher at gmail.com (Aaron Benner) Date: Thu, 16 Jul 2009 08:04:45 -0600 Subject: [Linux-cluster] Starting two-node cluster with only one node In-Reply-To: <813634.56331.qm@web110409.mail.gq1.yahoo.com> References: <813634.56331.qm@web110409.mail.gq1.yahoo.com> Message-ID: <4D278A4A-3120-4114-A374-CC087A02897A@gmail.com> Have you tried setting the "post_join_delay" value in the declaration to -1? This is a hint I picked up from the fenced man page section on avoiding boot time fencing. It tells fenced to wait until all of the nodes have joined the cluster before starting up. We use this on a couple of 2 node clusters (with qdisk) to allow them to start up without the first node to grab the quorum disk fencing the other node. --Aaron On Jul 16, 2009, at 12:16 AM, Abed-nego G. Escobal, Jr. wrote: > > > Tried it and now the two node cluster is running with only one node. > My problem right now is how to force the second node to join the > first node's cluster. Right now it is creating its own cluster and > trying to fence the first node. I tried cman_tool leave on the > second node but I got > > cman_tool: Error leaving cluster: Device or resource busy > > clvmd and gfs are not running on the second node. What is running on > the second node is cman. When I did > > service cman start > > It took 5 approximately 5 minutes before I got the [ok] meassage. Am > I missing something here? Not doing right? Should be doing something? > > > --- On Thu, 7/16/09, Abed-nego G. Escobal, Jr. > wrote: > >> From: Abed-nego G. Escobal, Jr. 
>> Subject: [Linux-cluster] Starting two-node cluster with only one node >> To: "linux clustering" >> Date: Thursday, 16 July, 2009, 10:46 AM >> >> Using the config file below >> >> >> >> >> > name="node01.company.com" votes="1" >> nodeid="1">> name="single">> name="node01_ipmi"/>> name="node02.company.com" votes="1" >> nodeid="2">> name="single">> name="node02_ipmi"/> >> > name="node01_ipmi" agent="fence_ipmilan" ipaddr="10.1.0.5" >> login="root" passwd="********"/>> name="node02_ipmi" agent="fence_ipmilan" ipaddr="10.1.0.7" >> login="root" passwd="********"/> >> >> >> >> >> >> >> Is it possible to start the cluster by only bringing up one >> node? The reason why I asked is because currently bringing >> them up together produces a split brain, each of them member >> of the cluster GFSCluster of their own fencing each other. >> My plan is to bring up only one node to create a quorum then >> bring the other one up and manually join it to the existing >> cluster. >> >> I have already don the start_clean approach but it seems it >> does not work. >> >> >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > Try the new Yahoo! Messenger. Now with all you love about > messenger and more! http://ph.messenger.yahoo.com > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From pradhanparas at gmail.com Thu Jul 16 14:32:33 2009 From: pradhanparas at gmail.com (Paras pradhan) Date: Thu, 16 Jul 2009 09:32:33 -0500 Subject: [Linux-cluster] DRAC 4 In-Reply-To: References: <8b711df40907151304g68d4fbdqf835a70aeee03f59@mail.gmail.com> Message-ID: <8b711df40907160732n59486bd4tbecdffcd02fe02fb@mail.gmail.com> On Thu, Jul 16, 2009 at 5:29 AM, Geoffrey Laurence wrote: > Hi, > > I have had a simular problem with Fedora-10, I think some kernels have a > problem detecting the Drac. ?Anyway I found that you can enable telnet > from the web interface. > > >From the 'command box' on the Diagnostics tab you can type the following > commands, > > 'd3debug propget ENABLE_TELNET' ?- ?Prints if telnet is enabled > 'd3debug propset ENABLE_TELNET=TRUE' ?- ?Enables telnet > 'd3debug racadm racreset' ?- Reboots the drac. Geoffrey, Thanks a lot. It worked for me ! Paras. > > After you have enabled telnet and rebooted the drac, you should be able > to telnet to the drac card. > > Hope this helps, > Geoffrey. > > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Paras pradhan > Sent: Thursday, 16 July 2009 6:05 AM > To: linux-poweredge at lists.us.dell.com; linux clustering > Subject: [Linux-cluster] DRAC 4 > > hi, > > I am using centos 5.3 in dell poweredge 1850 servers . Drac is 4/i. ?I > am working on to create a cluster using 3 poweredge nodes and I am in > need to use DRAC as fencing device. > > While testing fencing using fence_drac it complains me as: > > root at tst1 ~]# fence_drac -a 10.10.10.2 -l user -p calvin > failed: telnet open failed: problem connecting to "10.10.10.2", port > 23: Connection refused > > I tried to enable telnet using racadm but got the following error. > > [root at tst1 ~]# racadm config -g cfgSerial -o cfgSerialSshEnable 1 > ERROR: RACADM is unable to process the requested subcommand because > there is no > local RAC configuration to communicate with. > > Local RACADM subcommand execution requires the following: > > ?1. 
A Remote Access Controller (RAC) must be present on the managed > server > ?2. Appropriate managed node software must be installed and running on > the > ? ?server > > > I am really stuck here. Any one having the similar problem? > > Thanks > Paras. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From lhh at redhat.com Thu Jul 16 21:31:18 2009 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 16 Jul 2009 17:31:18 -0400 Subject: [Linux-cluster] clurgmgrd hang/stuck In-Reply-To: <6b83838d0907151441w5fe9dd24ve7f45058e4c12dae@mail.gmail.com> References: <6b83838d0907151441w5fe9dd24ve7f45058e4c12dae@mail.gmail.com> Message-ID: <1247779878.5229.29.camel@localhost> On Wed, 2009-07-15 at 18:41 -0300, Guilherme G. Felix wrote: > Hi all, > > I'm having a odd problem with my 2 node cluster. What package n-v-r / git checkout are you using? -- Lon From garromo at us.ibm.com Thu Jul 16 22:19:15 2009 From: garromo at us.ibm.com (Gary Romo) Date: Thu, 16 Jul 2009 16:19:15 -0600 Subject: [Linux-cluster] RHCS config - Conga or system-config-cluster Message-ID: Any known issues with configuring RHCS with both Conga and system-config-cluster? Gary Romo -------------- next part -------------- An HTML attachment was scrubbed... URL: From ggfelix at gmail.com Thu Jul 16 22:47:18 2009 From: ggfelix at gmail.com (Guilherme G. Felix) Date: Thu, 16 Jul 2009 19:47:18 -0300 Subject: [Linux-cluster] clurgmgrd hang/stuck In-Reply-To: <1247779878.5229.29.camel@localhost> References: <6b83838d0907151441w5fe9dd24ve7f45058e4c12dae@mail.gmail.com> <1247779878.5229.29.camel@localhost> Message-ID: <6b83838d0907161547g6e521e1du9d944ad052a8634b@mail.gmail.com> Howdy Lon, I'm using the standard RHES 5.3 (Tikanga), with the following package n-v-r-arch - no updates were applied to the system yet: cman.2.0.98.1.el5-i386 rgmanager.2.0.46.1.el5-i386 modcluster.0.12.1.2.el5-i386 system-config-cluster.1.0.55.1.0-noarch openais.0.80.3.22.el5-i386 cluster-cim.0.12.1.2.el5-i386 Thank you, - G. Felix On Thu, Jul 16, 2009 at 6:31 PM, Lon Hohberger wrote: > On Wed, 2009-07-15 at 18:41 -0300, Guilherme G. Felix wrote: > > Hi all, > > > > I'm having a odd problem with my 2 node cluster. > > What package n-v-r / git checkout are you using? > > -- Lon > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From nagarurnaren at gmail.com Fri Jul 17 05:04:38 2009 From: nagarurnaren at gmail.com (narendra reddy naren) Date: Fri, 17 Jul 2009 10:34:38 +0530 Subject: [Linux-cluster] New to cluster Message-ID: HI : I am new to clustering concept .Please send me some good PDF for HA and HPC,,,, -- Thanks & Regards, G.Narendra Reddy , Mob : 9000500132 -------------- next part -------------- An HTML attachment was scrubbed... URL: From abednegoyulo at yahoo.com Fri Jul 17 06:41:41 2009 From: abednegoyulo at yahoo.com (Abed-nego G. Escobal, Jr.) Date: Thu, 16 Jul 2009 23:41:41 -0700 (PDT) Subject: [Linux-cluster] Starting two-node cluster with only one node In-Reply-To: <4D278A4A-3120-4114-A374-CC087A02897A@gmail.com> Message-ID: <384614.10667.qm@web110407.mail.gq1.yahoo.com> Thanks for the tip. It helped by stopping each node kicking each other, as per the logs, but still I have a split brain status. 
On node01 # /usr/sbin/cman_tool nodes Node Sts Inc Joined Name 1 M 680 2009-07-17 00:30:42 node01.company.com 2 X 0 node02.company.com # /usr/sbin/clustat Cluster Status for GFSCluster @ Fri Jul 17 01:01:09 2009 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ node01.company.com 1 Online, Local node02.company.com 2 Offline On node02 # /usr/sbin/cman_tool nodes Node Sts Inc Joined Name 1 X 0 node01.company.com 2 M 676 2009-07-17 00:30:43 node02.company.com # /usr/sbin/clustat Cluster Status for GFSCluster @ Fri Jul 17 01:01:22 2009 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ node01.company.com 1 Offline node02.company.com 2 Online, Local Another thing that I have noticed, 1. Start node01 with only itself as the member of the cluster 2. Update cluster.conf to have node02 as an additional member 3. Start node02 Yields both nodes being quorate (split brain) but only node02 tries to fence out node01. After some time, clustat will yield both of them being in the same cluster. Then I will be starting clvmd on node02 but will not be successful. After trying to start the clvmd service, clustat will yield split brain again. Are there some troubleshootings that I should be doing? --- On Thu, 7/16/09, Aaron Benner wrote: > From: Aaron Benner > Subject: Re: [Linux-cluster] Starting two-node cluster with only one node > To: "linux clustering" > Date: Thursday, 16 July, 2009, 10:04 PM > Have you tried setting the > "post_join_delay" value in the > declaration to -1? > > post_join_delay="-1" /> > > This is a hint I picked up from the fenced man page section > on avoiding boot time fencing.? It tells fenced to wait > until all of the nodes have joined the cluster before > starting up.? We use this on a couple of 2 node > clusters (with qdisk) to allow them to start up without the > first node to grab the quorum disk fencing the other node. > > --Aaron > > On Jul 16, 2009, at 12:16 AM, Abed-nego G. Escobal, Jr. > wrote: > > > > > > > Tried it and now the two node cluster is running with > only one node. My problem right now is how to force the > second node to join the first node's cluster. Right now it > is creating its own cluster and trying to fence the first > node. I tried cman_tool leave on the second node but I got > > > > cman_tool: Error leaving cluster: Device or resource > busy > > > > clvmd and gfs are not running on the second node. What > is running on the second node is cman. When I did > > > > service cman start > > > > It took 5 approximately 5 minutes before I got the > [ok] meassage. Am I missing something here? Not doing right? > Should be doing something? > > > > > > --- On Thu, 7/16/09, Abed-nego G. Escobal, Jr. > wrote: > > > >> From: Abed-nego G. Escobal, Jr. > >> Subject: [Linux-cluster] Starting two-node cluster > with only one node > >> To: "linux clustering" > >> Date: Thursday, 16 July, 2009, 10:46 AM > >> > >> Using the config file below > >> > >> > >> config_version="5"> > >> > >>??? >> name="node01.company.com" votes="1" > >> nodeid="1"> >> name="single"> >> > name="node01_ipmi"/> >> name="node02.company.com" votes="1" > >> nodeid="2"> >> name="single"> >> > name="node02_ipmi"/> > >>??? >> name="node01_ipmi" agent="fence_ipmilan" > ipaddr="10.1.0.5" > >> login="root" > passwd="********"/> >> name="node02_ipmi" agent="fence_ipmilan" > ipaddr="10.1.0.7" > >> login="root" > passwd="********"/> > >>??? > >>? ??? > >>? ??? > >>??? > >> > >> > >> Is it possible to start the cluster by only > bringing up one > >> node? 
The reason why I asked is because currently > bringing > >> them up together produces a split brain, each of > them member > >> of the cluster GFSCluster of their own fencing > each other. > >> My plan is to bring up only one node to create a > quorum then > >> bring the other one up and manually join it to the > existing > >> cluster. > >> > >> I have already don the start_clean approach but it > seems it > >> does not work. > >> > >> > >> > >> > >> -- > >> Linux-cluster mailing list > >> Linux-cluster at redhat.com > >> https://www.redhat.com/mailman/listinfo/linux-cluster > >> > > > > > >? ? ? Try the new Yahoo! Messenger. Now > with all you love about messenger and more! http://ph.messenger.yahoo.com > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > Get connected with chat on network profile, blog, or any personal website! Yahoo! allows you to IM with Pingbox. Check it out! http://ph.messenger.yahoo.com/pingbox From agx at sigxcpu.org Fri Jul 17 08:46:23 2009 From: agx at sigxcpu.org (Guido =?iso-8859-1?Q?G=FCnther?=) Date: Fri, 17 Jul 2009 10:46:23 +0200 Subject: [Linux-cluster] Cluster 3.0.0.rc4 release In-Reply-To: <1247501173.10789.2.camel@localhost.localdomain> References: <1246490190.19414.93.camel@cerberus.int.fabbione.net> <20090707162845.GA30094@bogon.sigxcpu.org> <1246988716.7993.13.camel@cerberus.int.fabbione.net> <20090709121958.GB18140@bogon.sigxcpu.org> <20090713155324.GA20030@bogon.sigxcpu.org> <1247501173.10789.2.camel@localhost.localdomain> Message-ID: <20090717084623.GA16661@bogon.sigxcpu.org> On Mon, Jul 13, 2009 at 09:06:13AM -0700, Steven Dake wrote: > On Mon, 2009-07-13 at 17:53 +0200, Guido G?nther wrote: > > On Thu, Jul 09, 2009 at 02:19:58PM +0200, Guido G?nther wrote: > > > This stuff needs soome more work to be uploadable but it's mostly there > > > I think. I've pushed the git archives here: > > > > [..snip..] > > I've updated the packages at: > > > > deb http://pkg-libvirt.alioth.debian.org/packages unstable/i386/ > > deb http://pkg-libvirt.alioth.debian.org/packages unstable/all/ > > > > to cluster 3.0.0. > > Cheers, > > -- Guido > > > Guido, > > Thanks for the work! If there is anything we can do in upstream > corosync or openais to help simplify these efforts in the future, please > share your ideas. I think we're fine from the packaging point of view at the moment, thanks! The switch to autoconf helped a lot. Cheers, -- Guido From cluster at xinet.it Fri Jul 17 08:55:27 2009 From: cluster at xinet.it (Francesco Gallo) Date: Fri, 17 Jul 2009 10:55:27 +0200 Subject: [Linux-cluster] Clone/Snapshot Lun based VM Message-ID: <001001ca06bc$54c4ea90$fe4ebfb0$@it> Hi all, i would like to clone/make a snapshot of a lun-based VM. I use lvm2 clustered and an ISCSI storage (Dell MD3000i). Is there a particular software who could help me? Any suggestion? Any link? Thanks a lot, Francesco -------------- next part -------------- An HTML attachment was scrubbed... URL: From mad at wol.de Fri Jul 17 09:56:22 2009 From: mad at wol.de (Marc - A. 
Dahlhaus [ Administration | Westermann GmbH ]) Date: Fri, 17 Jul 2009 11:56:22 +0200 Subject: [Linux-cluster] Starting two-node cluster with only one node In-Reply-To: <384614.10667.qm@web110407.mail.gq1.yahoo.com> References: <384614.10667.qm@web110407.mail.gq1.yahoo.com> Message-ID: <1247824582.973.48.camel@marc> Hello, can you give us some hard facts on what versions of cluster-suite packages you are using in your environment and also the related logs? Have you read the corresponding parts of the cluster suites manual, man pages, FAQ and also searched the list-archives for similar problems already? If not -> do it, there are may good hints to find there. The nodes find each other and create a cluster very fast IF they can talk to each other. As no cluster networking is involved in fencing a remote node if the fencing node by itself is quorate this could be your problem. You should change to fence_manual and switch back to your real fencing devices after you have debuged your problem. Also get rid of the tag in your cluster.conf as fenced does the right thing by default if the remaining configuration is right and now it is just hiding a part of the problem. Also the 5 minute break on cman start smells like a DNS-lookup problem or other network related problem to me. Here is a short check-list to be sure the nodes can talk to each other: Can the individual nodes ping each other? Can the individual nodes dns-lookup the other node-names (which you used in your cluster.conf)? (Try to add them to your etc/hosts file, that way you have a working cluster even if your dns-system is going on vacation.) Is your switch allowing multicast communication on all ports that are used for cluster communication? (This is a prerequisite for openais / corosync based cman which would be anything >= RHEL 5. Search the archives on this if you need more info...) Can you trace (eg. with wiresharks tshark) incoming cluster communication from remote nodes? (If you don't changed your fencing to fence_manual your listening system will get fenced before you can get any useful information out of it. Try with and without active firewall.) If all above could be answered with "yes" your cluster should form just fine. You could try to add a qdisk-device as tiebreaker after that and test it just to be sure you have a working last man standing setup... Hope that helps, Marc Am Donnerstag, den 16.07.2009, 23:41 -0700 schrieb Abed-nego G. Escobal, Jr.: > > Thanks for the tip. It helped by stopping each node kicking each other, as per the logs, but still I have a split brain status. > > On node01 > > # /usr/sbin/cman_tool nodes > Node Sts Inc Joined Name > 1 M 680 2009-07-17 00:30:42 node01.company.com > 2 X 0 node02.company.com > > # /usr/sbin/clustat > Cluster Status for GFSCluster @ Fri Jul 17 01:01:09 2009 > Member Status: Quorate > > Member Name ID Status > ------ ---- ---- ------ > node01.company.com 1 Online, Local > node02.company.com 2 Offline > > > On node02 > > # /usr/sbin/cman_tool nodes > Node Sts Inc Joined Name > 1 X 0 node01.company.com > 2 M 676 2009-07-17 00:30:43 node02.company.com > > > # /usr/sbin/clustat > Cluster Status for GFSCluster @ Fri Jul 17 01:01:22 2009 > Member Status: Quorate > > Member Name ID Status > ------ ---- ---- ------ > node01.company.com 1 Offline > node02.company.com 2 Online, Local > > > Another thing that I have noticed, > > 1. Start node01 with only itself as the member of the cluster > 2. Update cluster.conf to have node02 as an additional member > 3. 
Start node02 > > Yields both nodes being quorate (split brain) but only node02 tries to fence out node01. After some time, clustat will yield both of them being in the same cluster. Then I will be starting clvmd on node02 but will not be successful. After trying to start the clvmd service, clustat will yield split brain again. > > Are there some troubleshootings that I should be doing? > > > --- On Thu, 7/16/09, Aaron Benner wrote: > > > From: Aaron Benner > > Subject: Re: [Linux-cluster] Starting two-node cluster with only one node > > To: "linux clustering" > > Date: Thursday, 16 July, 2009, 10:04 PM > > Have you tried setting the > > "post_join_delay" value in the > > declaration to -1? > > > > > post_join_delay="-1" /> > > > > This is a hint I picked up from the fenced man page section > > on avoiding boot time fencing. It tells fenced to wait > > until all of the nodes have joined the cluster before > > starting up. We use this on a couple of 2 node > > clusters (with qdisk) to allow them to start up without the > > first node to grab the quorum disk fencing the other node. > > > > --Aaron > > > > On Jul 16, 2009, at 12:16 AM, Abed-nego G. Escobal, Jr. > > wrote: > > > > > > > > > > > Tried it and now the two node cluster is running with > > only one node. My problem right now is how to force the > > second node to join the first node's cluster. Right now it > > is creating its own cluster and trying to fence the first > > node. I tried cman_tool leave on the second node but I got > > > > > > cman_tool: Error leaving cluster: Device or resource > > busy > > > > > > clvmd and gfs are not running on the second node. What > > is running on the second node is cman. When I did > > > > > > service cman start > > > > > > It took 5 approximately 5 minutes before I got the > > [ok] meassage. Am I missing something here? Not doing right? > > Should be doing something? > > > > > > > > > --- On Thu, 7/16/09, Abed-nego G. Escobal, Jr. > > wrote: > > > > > >> From: Abed-nego G. Escobal, Jr. > > >> Subject: [Linux-cluster] Starting two-node cluster > > with only one node > > >> To: "linux clustering" > > >> Date: Thursday, 16 July, 2009, 10:46 AM > > >> > > >> Using the config file below > > >> > > >> > > >> > config_version="5"> > > >> > > >> > >> name="node01.company.com" votes="1" > > >> nodeid="1"> > >> name="single"> > >> > > name="node01_ipmi"/> > >> name="node02.company.com" votes="1" > > >> nodeid="2"> > >> name="single"> > >> > > name="node02_ipmi"/> > > >> > >> name="node01_ipmi" agent="fence_ipmilan" > > ipaddr="10.1.0.5" > > >> login="root" > > passwd="********"/> > >> name="node02_ipmi" agent="fence_ipmilan" > > ipaddr="10.1.0.7" > > >> login="root" > > passwd="********"/> > > >> > > >> > > >> > > >> > > >> > > >> > > >> Is it possible to start the cluster by only > > bringing up one > > >> node? The reason why I asked is because currently > > bringing > > >> them up together produces a split brain, each of > > them member > > >> of the cluster GFSCluster of their own fencing > > each other. > > >> My plan is to bring up only one node to create a > > quorum then > > >> bring the other one up and manually join it to the > > existing > > >> cluster. > > >> > > >> I have already don the start_clean approach but it > > seems it > > >> does not work. 
From td3201 at gmail.com Fri Jul 17 16:05:19 2009 From: td3201 at gmail.com (Terry) Date: Fri, 17 Jul 2009 11:05:19 -0500 Subject: [Linux-cluster] determining fsid for fs resource Message-ID: <8ee061010907170905x3468960foc938215865056a10@mail.gmail.com>

Hello,

When I create a fs resource using redhat's luci, it is able to find the fsid for a fs and life is good. However, I am not crazy about luci and would prefer to manually create the resources from the command line, but how do I find the fsid for a filesystem? Here's an example of a fs resource created using luci:

Thanks!

From fxmulder at gmail.com Fri Jul 17 16:20:49 2009 From: fxmulder at gmail.com (James Devine) Date: Fri, 17 Jul 2009 10:20:49 -0600 Subject: [Linux-cluster] disk fencing Message-ID:

Has anybody looked into using the network for heartbeat only, and disk for fencing in GFS? i.e. using the disk to communicate quorum when network heartbeat is lost between 1 or more nodes. If the disk is still accessible to all nodes, this should be a valid way to communicate quorum; if not, then the remaining nodes, assuming enough for quorum, should be able to continue, knowing that the nodes they can't communicate with either have been fenced or can't read/write to disk anyway. Does this sound like a valid approach?

From ggfelix at gmail.com Fri Jul 17 17:45:28 2009 From: ggfelix at gmail.com (Guilherme G. Felix) Date: Fri, 17 Jul 2009 14:45:28 -0300 Subject: [Linux-cluster] clurgmgrd hang/stuck In-Reply-To: <6b83838d0907161547g6e521e1du9d944ad052a8634b@mail.gmail.com> References: <6b83838d0907151441w5fe9dd24ve7f45058e4c12dae@mail.gmail.com> <1247779878.5229.29.camel@localhost> <6b83838d0907161547g6e521e1du9d944ad052a8634b@mail.gmail.com> Message-ID: <6b83838d0907171045q3c01eb89r8fd1ab2af6dd8201@mail.gmail.com>

Some more info... I removed everything from the cluster and left only apache and the shared IPs, and I also made a huge change to cluster.conf and started to use "Shared Resources". Even after that rgmanager is freezing. When rgmanager enters this state, system-config-cluster cannot show the service when it's opened and clusvcadm won't work either; the only solution is to restart rgmanager on both nodes. The hardware vendor has done a full health check and upgraded firmwares to eliminate this possibility.

Thank you,

- G. Felix

On Thu, Jul 16, 2009 at 7:47 PM, Guilherme G. Felix wrote: > Howdy Lon, > > I'm using the standard RHES 5.3 (Tikanga), with the following package > n-v-r-arch - no updates were applied to the system yet: > > cman.2.0.98.1.el5-i386 > rgmanager.2.0.46.1.el5-i386 > modcluster.0.12.1.2.el5-i386 > system-config-cluster.1.0.55.1.0-noarch > openais.0.80.3.22.el5-i386 > cluster-cim.0.12.1.2.el5-i386 > > Thank you, > > - G. Felix > > > On Thu, Jul 16, 2009 at 6:31 PM, Lon Hohberger wrote: > >> On Wed, 2009-07-15 at 18:41 -0300, Guilherme G. Felix wrote: >> > Hi all, >> > >> > I'm having a odd problem with my 2 node cluster. >> >> What package n-v-r / git checkout are you using? >> >> -- Lon >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > >
From pradhanparas at gmail.com Fri Jul 17 20:15:25 2009 From: pradhanparas at gmail.com (Paras pradhan) Date: Fri, 17 Jul 2009 15:15:25 -0500 Subject: [Linux-cluster] Cluster failover Message-ID: <8b711df40907171315n70057804rbc44d7fb255523af@mail.gmail.com>

hi,

I have 3 nodes of CentOS 5.3 running xen virtual machines as a virtual machine service. This cluster is working fine. One thing I would like to know is how to make failover go only to the third node. What I mean to say is: I have 3 virtual machines running on node 1 and 2 virtual machines running on node 2. Now if node 1 fails I want the node 1 virtual machines to be started only on node 3 but not on node 2. Similarly, if node 2 breaks, I want its virtual machines to be started on node 3 but never on node 1.

Thanks ! Paras.

From raju.rajsand at gmail.com Sat Jul 18 03:26:35 2009 From: raju.rajsand at gmail.com (Rajagopal Swaminathan) Date: Sat, 18 Jul 2009 08:56:35 +0530 Subject: [Linux-cluster] Cluster failover In-Reply-To: <8b711df40907171315n70057804rbc44d7fb255523af@mail.gmail.com> References: <8b711df40907171315n70057804rbc44d7fb255523af@mail.gmail.com> Message-ID: <8786b91c0907172026g3cdec94bw946187e2fdf3181a@mail.gmail.com>

Greetings,

You basically need to define multiple failover domains.

On Sat, Jul 18, 2009 at 1:45 AM, Paras pradhan wrote: > I have 3 virtual machine running on node 1 and 2 virtual > machines running on node 2.

Failover domain 1 consisting of Node 1 and Node 3

> Similary if node2 breaks, I want virtual machines to be started on > node3 but never on node 1.

Failover domain 2 consisting of Node 2 and Node 3

HTH

Thanks and Regards

Rajagopal

From cthulhucalling at gmail.com Sat Jul 18 06:50:01 2009 From: cthulhucalling at gmail.com (Ian Hayes) Date: Fri, 17 Jul 2009 23:50:01 -0700 Subject: [Linux-cluster] disk fencing In-Reply-To: References: Message-ID: <36df569a0907172350g5b381e3bt87abd7fbe6fe8677@mail.gmail.com>

I'm not sure what you're asking here, but it sounds like you're describing a qdisk. If a node loses heartbeat with the rest of the cluster, that's a fencin'. Doesn't matter if it can still access the shared storage, and if it has lost communication with the rest of the cluster, you probably don't want it accessing your data anyway.

On Fri, Jul 17, 2009 at 9:20 AM, James Devine wrote: > Has anybody looked into using the network for heartbeat only, and disk > for fencing in GFS? i.e. using the disk to communicate quorum when > network heartbeat is lost between 1 or more nodes. If the disk is > still accessible to all nodes, this should be a valid way to > communicate quorum, if not, then the remaining nodes, assuming enough > for quorum, should be able to continue knowing that nodes it can't > communicate with either have been fenced or can't read/write to disk > anyway. Does this sound like a valid approach? > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster >
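If the goal is to let the shared disk act as the tiebreaker when the network heartbeat is lost, that is what qdiskd provides on top of cman. A rough sketch of the relevant cluster.conf piece follows; the label, timings and the ping heuristic are only placeholders to tune for the actual environment:

<quorumd interval="1" tko="10" votes="1" label="gfs_qdisk">
    <heuristic program="ping -c1 -w1 192.168.1.254" score="1" interval="2"/>
</quorumd>

A node that fails the heuristic stops contributing the quorum disk vote, so the partition that still has working networking stays quorate and fences the other, instead of both sides carrying on through the disk.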
From cthulhucalling at gmail.com Sat Jul 18 06:56:08 2009 From: cthulhucalling at gmail.com (Ian Hayes) Date: Fri, 17 Jul 2009 23:56:08 -0700 Subject: [Linux-cluster] Cluster failover In-Reply-To: <8b711df40907171315n70057804rbc44d7fb255523af@mail.gmail.com> References: <8b711df40907171315n70057804rbc44d7fb255523af@mail.gmail.com> Message-ID: <36df569a0907172356g55aab93fwdd12a660f36e49f9@mail.gmail.com>

Specify 2 different failover domains for the services. I have a similar setup for a project with a 3 node cluster. Node 1 runs Service A, Node 2 runs Service B and Node 3 is the floater.

Failover Domain 1: Node 1, Node 3
Failover Domain 2: Node 2, Node 3

Service A: Failover Domain 1
Service B: Failover Domain 2

On Fri, Jul 17, 2009 at 1:15 PM, Paras pradhan wrote: > hi, > > I have 3 nodes of CentOS 5.3 running xen virtual machines as virtual > machine service. This cluster is working fine. One thing I would like > to know that how to make failover only to third node. What I mean to > say is: I have 3 virtual machine running on node 1 and 2 virtual > machines running on node 2. Now if node 1 fails I want my the node1 > virtual machines to be stared only on node 3 but not on node2. > Similary if node2 breaks, I want virtual machines to be started on > node3 but never on node 1. > > Thanks ! > Paras. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster >

From abednegoyulo at yahoo.com Sat Jul 18 07:38:14 2009 From: abednegoyulo at yahoo.com (Abed-nego G. Escobal, Jr.) Date: Sat, 18 Jul 2009 00:38:14 -0700 (PDT) Subject: [Linux-cluster] Starting two-node cluster with only one node In-Reply-To: <1247824582.973.48.camel@marc> Message-ID: <273907.46020.qm@web110412.mail.gq1.yahoo.com>

Thanks for giving the pointers!

uname -r on both nodes: 2.6.18-128.1.16.el5

on node01:

rpm -q cman gfs-utils kmod-gfs modcluster ricci luci cluster-snmp iscsi-initiator-utils lvm2-cluster openais oddjob rgmanager
cman-2.0.98-2chrissie
gfs-utils-0.1.18-1.el5
kmod-gfs-0.1.23-5.el5_2.4
kmod-gfs-0.1.31-3.el5
modcluster-0.12.1-2.el5.centos
ricci-0.12.1-7.3.el5.centos.1
luci-0.12.1-7.3.el5.centos.1
cluster-snmp-0.12.1-2.el5.centos
iscsi-initiator-utils-6.2.0.868-0.18.el5_3.1
lvm2-cluster-2.02.40-7.el5
openais-0.80.3-22.el5_3.8
oddjob-0.27-9.el5
rgmanager-2.0.46-1.el5.centos.3

on node02:

rpm -q cman gfs-utils kmod-gfs modcluster ricci luci cluster-snmp iscsi-initiator-utils lvm2-cluster openais oddjob rgmanager
cman-2.0.98-2chrissie
gfs-utils-0.1.18-1.el5
kmod-gfs-0.1.31-3.el5
modcluster-0.12.1-2.el5.centos
ricci-0.12.1-7.3.el5.centos.1
luci-0.12.1-7.3.el5.centos.1
cluster-snmp-0.12.1-2.el5.centos
iscsi-initiator-utils-6.2.0.868-0.18.el5_3.1
lvm2-cluster-2.02.40-7.el5
openais-0.80.3-22.el5_3.8
oddjob-0.27-9.el5
rgmanager-2.0.46-1.el5.centos.3

I used http://knowledgelayer.softlayer.com/questions/443/GFS+howto to configure my cluster. When it was still on 5.2 the cluster worked, but after the recent update to 5.3 it broke. One of the threads that I found in the archive states that there is a problem with the most current official version of cman, bug id 485026. I replaced the most current cman package with cman-2.0.98-2chrissie to test whether this was my problem; it seems not, so I will be moving back to the official package.
I also found on another thread that openais was the culprit, changed it back to openais-0.80.3-15.el5 even though the change log indicates a lot of bug fixes were done on the most current official package. After doing it, it still did not work. I tried clean_start="1" with caution. I unmounted the iscsi then started cman but still it did not work. The most recent is post_join_delay="-1", I did not noticed that there was a man for fenced, which is much safer than clean_start="1" but still it did not fixed it. The man pages that I have read over and over again is cman and cluster.conf. Some pages in the online manual is somewhat not suitable for my situation because I do not have X installed on the machines and some pages in the online manual used system-config-cluster. As I understand in the online manual and FAQ, qdisk is not required if I have two_nodes="1" so I did not create any. I have removed the fence_daemon tag since I only used it for trying the solutions that were suggested. The hosts are present in each others hosts with correct ips. The ping results ping node02.company.com --- node01.company.com ping statistics --- 10 packets transmitted, 10 received, 0% packet loss, time 8999ms rtt min/avg/max/mdev = 0.010/0.016/0.034/0.007 ms ping node01.company.com --- node01.company.com ping statistics --- 10 packets transmitted, 10 received, 0% packet loss, time 9003ms rtt min/avg/max/mdev = 0.341/0.668/1.084/0.273 ms According to the people in the data center, the switch supports multicast communication on all ports that are used for cluster communication because they are in the same VLAN. For the logs, I will sending fresh logs as soon as possible. Currently I have not enough time window to bring down the machine. For the wireshark, I will be reading the man pages on how to use it. Please advise if any other information is needed to solve this. I am very grateful for the very detailed pointers. Thank you very much! --- On Fri, 7/17/09, Marc - A. Dahlhaus [ Administration | Westermann GmbH ] wrote: > From: Marc - A. Dahlhaus [ Administration | Westermann GmbH ] > Subject: Re: [Linux-cluster] Starting two-node cluster with only one node > To: "linux clustering" > Date: Friday, 17 July, 2009, 5:56 PM > Hello, > > > can you give us some hard facts on what versions of > cluster-suite > packages you are using in your environment and also the > related logs? > > Have you read the corresponding parts of the cluster suites > manual, man > pages, FAQ and also searched the list-archives for similar > problems > already? If not -> do it, there are may good hints to > find there. > > > The nodes find each other and create a cluster very fast IF > they can > talk to each other. As no cluster networking is involved in > fencing a > remote node if the fencing node by itself is quorate this > could be your > problem. > > You should change to fence_manual and switch back to your > real fencing > devices after you have debuged your problem. Also get rid > of the > tag in your cluster.conf as > fenced does the right > thing by default if the remaining configuration is right > and now it is > just hiding a part of the problem. > > Also the 5 minute break on cman start smells like a > DNS-lookup problem > or other network related problem to me. > > Here is a short check-list to be sure the nodes can talk to > each other: > > Can the individual nodes ping each other? > > Can the individual nodes dns-lookup the other node-names > (which you used > in your cluster.conf)? 
(Try to add them to your etc/hosts > file, that way > you have a working cluster even if your dns-system is going > on > vacation.) > > Is your switch allowing multicast communication on all > ports that are > used for cluster communication? (This is a prerequisite for > openais / > corosync based cman which would be anything >= RHEL 5. > Search the > archives on this if you need more info...) > > Can you trace (eg. with wiresharks tshark) incoming > cluster > communication from remote nodes? (If you don't changed your > fencing to > fence_manual your listening system will get fenced before > you can get > any useful information out of it. Try with and without > active firewall.) > > If all above could be answered with "yes" your cluster > should form just > fine. You could try to add a qdisk-device as tiebreaker > after that and > test it just to be sure you have a working last man > standing setup... > > Hope that helps, > > Marc > > Am Donnerstag, den 16.07.2009, 23:41 -0700 schrieb > Abed-nego G. Escobal, > Jr.: > > > > Thanks for the tip. It helped by stopping each node > kicking each other, as per the logs, but still I have a > split brain status. > > > > On node01 > > > > # /usr/sbin/cman_tool nodes > > Node? > Sts???Inc???Joined? > ? ? ? ? ? ???Name > >? ? 1???M? ? > 680???2009-07-17 00:30:42? > node01.company.com > >? ? 2???X? ? ? > 0? ? ? ? ? ? ? ? > ? ? ? ? node02.company.com > > > > # /usr/sbin/clustat > > Cluster Status for GFSCluster @ Fri Jul 17 01:01:09 > 2009 > > Member Status: Quorate > > > >? Member Name? ? ? ? ? > ? ? ? ? ? ? ? ? > ???ID???Status > >? ------ ----? ? ? ? ? > ? ? ? ? ? ? ? ? > ???---- ------ > >? node01.company.com? ? ? ? > ? ? ? ? ? ? ? > ???1 Online, Local > >? node02.company.com? ? ? ? > ? ? ? ? ? ? ? > ???2 Offline > > > > > > On node02 > > > > # /usr/sbin/cman_tool nodes > > Node? > Sts???Inc???Joined? > ? ? ? ? ? ???Name > >? ? 1???X? ? ? > 0? ? ? ? ? ? ? ? > ? ? ? ? node01.company.com > >? ? 2???M? ? > 676???2009-07-17 00:30:43? > node02.company.com > >? > > > > # /usr/sbin/clustat > > Cluster Status for GFSCluster @ Fri Jul 17 01:01:22 > 2009 > > Member Status: Quorate > > > >? Member Name? ? ? ? ? > ? ? ? ? ? ? ? ? > ???ID???Status > >? ------ ----? ? ? ? ? > ? ? ? ? ? ? ? ? > ???---- ------ > >? node01.company.com? ? ? ? > ? ? ? ? ? ? ? > ???1 Offline > >? node02.company.com? ? ? ? > ? ? ? ? ? ? ? > ???2 Online, Local > > > > > > Another thing that I have noticed, > > > > 1. Start node01 with only itself as the member of the > cluster > > 2. Update cluster.conf to have node02 as an additional > member > > 3. Start node02 > > > > Yields both nodes being quorate (split brain) but only > node02 tries to fence out node01. After some time, clustat > will yield both of them being in the same cluster. Then I > will be starting clvmd on node02 but will not be successful. > After trying to start the clvmd service, clustat will yield > split brain again. > > > > Are there some troubleshootings that I should be > doing? > > > > > > --- On Thu, 7/16/09, Aaron Benner > wrote: > > > > > From: Aaron Benner > > > Subject: Re: [Linux-cluster] Starting two-node > cluster with only one node > > > To: "linux clustering" > > > Date: Thursday, 16 July, 2009, 10:04 PM > > > Have you tried setting the > > > "post_join_delay" value in the ...> > > > declaration to -1? > > > > > > post_fail_delay="0" > > > post_join_delay="-1" /> > > > > > > This is a hint I picked up from the fenced man > page section > > > on avoiding boot time fencing.? 
It tells > fenced to wait > > > until all of the nodes have joined the cluster > before > > > starting up.? We use this on a couple of 2 > node > > > clusters (with qdisk) to allow them to start up > without the > > > first node to grab the quorum disk fencing the > other node. > > > > > > --Aaron > > > > > > On Jul 16, 2009, at 12:16 AM, Abed-nego G. > Escobal, Jr. > > > wrote: > > > > > > > > > > > > > > > Tried it and now the two node cluster is > running with > > > only one node. My problem right now is how to > force the > > > second node to join the first node's cluster. > Right now it > > > is creating its own cluster and trying to fence > the first > > > node. I tried cman_tool leave on the second node > but I got > > > > > > > > cman_tool: Error leaving cluster: Device or > resource > > > busy > > > > > > > > clvmd and gfs are not running on the second > node. What > > > is running on the second node is cman. When I > did > > > > > > > > service cman start > > > > > > > > It took 5 approximately 5 minutes before I > got the > > > [ok] meassage. Am I missing something here? Not > doing right? > > > Should be doing something? > > > > > > > > > > > > --- On Thu, 7/16/09, Abed-nego G. Escobal, > Jr. > > > wrote: > > > > > > > >> From: Abed-nego G. Escobal, Jr. > > > >> Subject: [Linux-cluster] Starting > two-node cluster > > > with only one node > > > >> To: "linux clustering" > > > >> Date: Thursday, 16 July, 2009, 10:46 AM > > > >> > > > >> Using the config file below > > > >> > > > >> > > > >> > > config_version="5"> > > > >> two_node="1"/> > > > > >>??? > > >> name="node01.company.com" votes="1" > > > >> nodeid="1"> > > >> name="single"> > > >> > > > > name="node01_ipmi"/> > > >> name="node02.company.com" votes="1" > > > >> nodeid="2"> > > >> name="single"> > > >> > > > > name="node02_ipmi"/> > > > > >>??? > > >> name="node01_ipmi" > agent="fence_ipmilan" > > > ipaddr="10.1.0.5" > > > >> login="root" > > > passwd="********"/> > > >> name="node02_ipmi" > agent="fence_ipmilan" > > > ipaddr="10.1.0.7" > > > >> login="root" > > > passwd="********"/> > > > >>??? > > > >>? > ??? > > > >>? > ??? > > > >>??? > > > >> > > > >> > > > >> Is it possible to start the cluster by > only > > > bringing up one > > > >> node? The reason why I asked is because > currently > > > bringing > > > >> them up together produces a split brain, > each of > > > them member > > > >> of the cluster GFSCluster of their own > fencing > > > each other. > > > >> My plan is to bring up only one node to > create a > > > quorum then > > > >> bring the other one up and manually join > it to the > > > existing > > > >> cluster. > > > >> > > > >> I have already don the start_clean > approach but it > > > seems it > > > >> does not work. > > > >> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From mad at wol.de Sat Jul 18 14:02:13 2009 From: mad at wol.de (Marc - A. Dahlhaus) Date: Sat, 18 Jul 2009 16:02:13 +0200 Subject: [Linux-cluster] Starting two-node cluster with only one node In-Reply-To: <273907.46020.qm@web110412.mail.gq1.yahoo.com> References: <273907.46020.qm@web110412.mail.gq1.yahoo.com> Message-ID: <4A61D5E5.4030702@wol.de> Hello, as your cluster worked well on centos 5.2 the networking hardware components couldn't be the culprit in this case but is still think that it is an cluster communication related problem. It could be your iptables ruleset... Try to disable the firewall and check again... 
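If the ruleset does turn out to be the culprit, the ports to open on the cluster interface are the usual suspects. A rough sketch with the RHEL 5 / CentOS 5 default port numbers -- the interface name and source network are placeholders, and the ccsd ports are quoted from the Red Hat cluster documentation rather than from this thread, so double-check them locally:

    # allow cluster traffic from the cluster subnet on the cluster interface
    iptables -A INPUT -i eth1 -s 192.168.1.0/24 -p udp --dport 5404:5405   -j ACCEPT  # openais/cman totem
    iptables -A INPUT -i eth1 -s 192.168.1.0/24 -p tcp --dport 21064       -j ACCEPT  # dlm
    iptables -A INPUT -i eth1 -s 192.168.1.0/24 -p tcp --dport 50006:50009 -j ACCEPT  # ccsd (TCP)
    iptables -A INPUT -i eth1 -s 192.168.1.0/24 -p udp --dport 50007       -j ACCEPT  # ccsd (UDP)
    service iptables save    # persist the rules on RHEL/CentOS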
You can use tshark to check this as well in this case by using something like this: tshark -i -f 'host ' -V | less Have you checked that openais is still chkconfig off after your upgrade? Abed-nego G. Escobal, Jr. schrieb: > Thanks for giving the pointers! > > uname -r on both nodes > > 2.6.18-128.1.16.el5 > > on node01 > > rpm -q cman gfs-utils kmod-gfs modcluster ricci luci cluster-snmp iscsi-initiator-utils lvm2-cluster openais oddjob rgmanager > cman-2.0.98-2chrissie > gfs-utils-0.1.18-1.el5 > kmod-gfs-0.1.23-5.el5_2.4 > kmod-gfs-0.1.31-3.el5 > modcluster-0.12.1-2.el5.centos > ricci-0.12.1-7.3.el5.centos.1 > luci-0.12.1-7.3.el5.centos.1 > cluster-snmp-0.12.1-2.el5.centos > iscsi-initiator-utils-6.2.0.868-0.18.el5_3.1 > lvm2-cluster-2.02.40-7.el5 > openais-0.80.3-22.el5_3.8 > oddjob-0.27-9.el5 > rgmanager-2.0.46-1.el5.centos.3 > > on node02 > > rpm -q cman gfs-utils kmod-gfs modcluster ricci luci cluster-snmp iscsi-initiator-utils lvm2-cluster openais oddjob rgmanager > cman-2.0.98-2chrissie > gfs-utils-0.1.18-1.el5 > kmod-gfs-0.1.31-3.el5 > modcluster-0.12.1-2.el5.centos > ricci-0.12.1-7.3.el5.centos.1 > luci-0.12.1-7.3.el5.centos.1 > cluster-snmp-0.12.1-2.el5.centos > iscsi-initiator-utils-6.2.0.868-0.18.el5_3.1 > lvm2-cluster-2.02.40-7.el5 > openais-0.80.3-22.el5_3.8 > oddjob-0.27-9.el5 > rgmanager-2.0.46-1.el5.centos.3 > > I used http://knowledgelayer.softlayer.com/questions/443/GFS+howto to configure my cluster. When it was still on 5.2 the cluster worked, but after the recent update to 5.3, it broke. > > On one of the threads that I have found in the archive, it states that there is a problem with the most current official version of cman, bug id 485026. I replaced the most current cman package with cman-2.0.98-2chrissie because I tested if this was my problem, seems not so I will be moving back to the official package. > I also found on another thread that openais was the culprit, changed it back to openais-0.80.3-15.el5 even though the change log indicates a lot of bug fixes were done on the most current official package. After doing it, it still did not work. I tried clean_start="1" with caution. I unmounted the iscsi then started cman but still it did not work. The most recent is post_join_delay="-1", I did not noticed that there was a man for fenced, which is much safer than clean_start="1" but still it did not fixed it. The man pages that I have read over and over again is cman and cluster.conf. Some pages in the online manual is somewhat not suitable for my situation because I do not have X installed on the machines and some pages in the online manual used system-config-cluster. > > As I understand in the online manual and FAQ, qdisk is not required if I have two_nodes="1" so I did not create any. I have removed the fence_daemon tag since I only used it for trying the solutions that were suggested. The hosts are present in each others hosts with correct ips. > > > The ping results > > ping node02.company.com > > --- node01.company.com ping statistics --- > 10 packets transmitted, 10 received, 0% packet loss, time 8999ms > rtt min/avg/max/mdev = 0.010/0.016/0.034/0.007 ms > > ping node01.company.com > > --- node01.company.com ping statistics --- > 10 packets transmitted, 10 received, 0% packet loss, time 9003ms > rtt min/avg/max/mdev = 0.341/0.668/1.084/0.273 ms > > According to the people in the data center, the switch supports multicast communication on all ports that are used for cluster communication because they are in the same VLAN. 
> > For the logs, I will sending fresh logs as soon as possible. Currently I have not enough time window to bring down the machine. > > For the wireshark, I will be reading the man pages on how to use it. > > Please advise if any other information is needed to solve this. I am very grateful for the very detailed pointers. Thank you very much! > > > --- On Fri, 7/17/09, Marc - A. Dahlhaus [ Administration | Westermann GmbH ] wrote: > > >> From: Marc - A. Dahlhaus [ Administration | Westermann GmbH ] >> Subject: Re: [Linux-cluster] Starting two-node cluster with only one node >> To: "linux clustering" >> Date: Friday, 17 July, 2009, 5:56 PM >> Hello, >> >> >> can you give us some hard facts on what versions of >> cluster-suite >> packages you are using in your environment and also the >> related logs? >> >> Have you read the corresponding parts of the cluster suites >> manual, man >> pages, FAQ and also searched the list-archives for similar >> problems >> already? If not -> do it, there are may good hints to >> find there. >> >> >> The nodes find each other and create a cluster very fast IF >> they can >> talk to each other. As no cluster networking is involved in >> fencing a >> remote node if the fencing node by itself is quorate this >> could be your >> problem. >> >> You should change to fence_manual and switch back to your >> real fencing >> devices after you have debuged your problem. Also get rid >> of the >> tag in your cluster.conf as >> fenced does the right >> thing by default if the remaining configuration is right >> and now it is >> just hiding a part of the problem. >> >> Also the 5 minute break on cman start smells like a >> DNS-lookup problem >> or other network related problem to me. >> >> Here is a short check-list to be sure the nodes can talk to >> each other: >> >> Can the individual nodes ping each other? >> >> Can the individual nodes dns-lookup the other node-names >> (which you used >> in your cluster.conf)? (Try to add them to your etc/hosts >> file, that way >> you have a working cluster even if your dns-system is going >> on >> vacation.) >> >> Is your switch allowing multicast communication on all >> ports that are >> used for cluster communication? (This is a prerequisite for >> openais / >> corosync based cman which would be anything >= RHEL 5. >> Search the >> archives on this if you need more info...) >> >> Can you trace (eg. with wiresharks tshark) incoming >> cluster >> communication from remote nodes? (If you don't changed your >> fencing to >> fence_manual your listening system will get fenced before >> you can get >> any useful information out of it. Try with and without >> active firewall.) >> >> If all above could be answered with "yes" your cluster >> should form just >> fine. You could try to add a qdisk-device as tiebreaker >> after that and >> test it just to be sure you have a working last man >> standing setup... >> >> Hope that helps, >> >> Marc >> From abednegoyulo at yahoo.com Sat Jul 18 15:47:28 2009 From: abednegoyulo at yahoo.com (Abed-nego G. Escobal, Jr.) Date: Sat, 18 Jul 2009 08:47:28 -0700 (PDT) Subject: [Linux-cluster] Starting two-node cluster with only one node In-Reply-To: <4A61D5E5.4030702@wol.de> Message-ID: <667406.91442.qm@web110403.mail.gq1.yahoo.com> Hi! I am very sorry that I did not mention that when I am testing different suggestions on solving this, I always temporarily disable the firewall. Then turning it back on after the testing. Thank you very much on the tip for the tshark! 
I will post the output as as soon as I get a maintenance window to restart the cman service. With regards to openais, it is still off on both servers. Should it be turned on on boot? I am very sorry but I haven't read it in the manuals that it should be "on". --- On Sat, 7/18/09, Marc - A. Dahlhaus wrote: > From: Marc - A. Dahlhaus > Subject: Re: [Linux-cluster] Starting two-node cluster with only one node > To: "linux clustering" > Date: Saturday, 18 July, 2009, 10:02 PM > Hello, > > as your cluster worked well on centos 5.2 the networking > hardware > components couldn't be the culprit in this case but is > still think that > it is an cluster communication related problem. > > It could be your iptables ruleset... Try to disable the > firewall and > check again... > > You can use tshark to check this as well in this case by > using something > like this: > > tshark -i -f 'host > is useing>' -V | less > > Have you checked that openais is still chkconfig off after > your upgrade? > > Abed-nego G. Escobal, Jr. schrieb: > > Thanks for giving the pointers! > > > > uname -r on both nodes > > > > 2.6.18-128.1.16.el5 > > > > on node01 > > > > rpm -q cman gfs-utils kmod-gfs modcluster ricci luci > cluster-snmp iscsi-initiator-utils lvm2-cluster openais > oddjob rgmanager > > cman-2.0.98-2chrissie > > gfs-utils-0.1.18-1.el5 > > kmod-gfs-0.1.23-5.el5_2.4 > > kmod-gfs-0.1.31-3.el5 > > modcluster-0.12.1-2.el5.centos > > ricci-0.12.1-7.3.el5.centos.1 > > luci-0.12.1-7.3.el5.centos.1 > > cluster-snmp-0.12.1-2.el5.centos > > iscsi-initiator-utils-6.2.0.868-0.18.el5_3.1 > > lvm2-cluster-2.02.40-7.el5 > > openais-0.80.3-22.el5_3.8 > > oddjob-0.27-9.el5 > > rgmanager-2.0.46-1.el5.centos.3 > > > > on node02 > > > > rpm -q cman gfs-utils kmod-gfs modcluster ricci luci > cluster-snmp iscsi-initiator-utils lvm2-cluster openais > oddjob rgmanager > > cman-2.0.98-2chrissie > > gfs-utils-0.1.18-1.el5 > > kmod-gfs-0.1.31-3.el5 > > modcluster-0.12.1-2.el5.centos > > ricci-0.12.1-7.3.el5.centos.1 > > luci-0.12.1-7.3.el5.centos.1 > > cluster-snmp-0.12.1-2.el5.centos > > iscsi-initiator-utils-6.2.0.868-0.18.el5_3.1 > > lvm2-cluster-2.02.40-7.el5 > > openais-0.80.3-22.el5_3.8 > > oddjob-0.27-9.el5 > > rgmanager-2.0.46-1.el5.centos.3 > > > > I used http://knowledgelayer.softlayer.com/questions/443/GFS+howto > to configure my cluster. When it was still on 5.2 the > cluster worked, but after the recent update to 5.3, it > broke. > > > > On one of the threads that I have found in the > archive, it states that there is a problem with the most > current official version of cman, bug id 485026. I replaced > the most current cman package with cman-2.0.98-2chrissie > because I tested if this was my problem, seems not so I will > be moving back to the official package. > > I also found on another thread that openais was the > culprit, changed it back to openais-0.80.3-15.el5 even > though the change log indicates a lot of bug fixes were done > on the most current official package. After doing it, it > still did not work. I tried clean_start="1" with caution. I > unmounted the iscsi then started cman but still it did not > work. The most recent is post_join_delay="-1", I did not > noticed that there was a man for fenced, which is much safer > than clean_start="1" but still it did not fixed it. The man > pages that I have read over and over again is cman and > cluster.conf. 
Some pages in the online manual is somewhat > not suitable for my situation because I do not have X > installed on the machines and some pages in the online > manual used system-config-cluster. > > > > As I understand in the online manual and FAQ, qdisk is > not required if I have two_nodes="1" so I did not create > any. I have removed the fence_daemon tag since I only used > it for trying the solutions that were suggested. The hosts > are present in each others hosts with correct ips. > > > > > > The ping results > > > > ping node02.company.com > > > > --- node01.company.com ping statistics --- > > 10 packets transmitted, 10 received, 0% packet loss, > time 8999ms > > rtt min/avg/max/mdev = 0.010/0.016/0.034/0.007 ms > > > > ping node01.company.com > > > > --- node01.company.com ping statistics --- > > 10 packets transmitted, 10 received, 0% packet loss, > time 9003ms > > rtt min/avg/max/mdev = 0.341/0.668/1.084/0.273 ms > > > > According to the people in the data center, the switch > supports multicast communication on all ports that are used > for cluster communication because they are in the same > VLAN. > > > > For the logs, I will sending fresh logs as soon as > possible. Currently I have not enough time window to bring > down the machine. > > > > For the wireshark, I will be reading the man pages on > how to use it. > > > > Please advise if any other information is needed to > solve this. I am very grateful for the very detailed > pointers. Thank you very much! > > > > > > --- On Fri, 7/17/09, Marc - A. Dahlhaus [ > Administration | Westermann GmbH ] wrote: > > > >??? > >> From: Marc - A. Dahlhaus [ Administration | > Westermann GmbH ] > >> Subject: Re: [Linux-cluster] Starting two-node > cluster with only one node > >> To: "linux clustering" > >> Date: Friday, 17 July, 2009, 5:56 PM > >> Hello, > >> > >> > >> can you give us some hard facts on what versions > of > >> cluster-suite > >> packages you are using in your environment and > also the > >> related logs? > >> > >> Have you read the corresponding parts of the > cluster suites > >> manual, man > >> pages, FAQ and also searched the list-archives for > similar > >> problems > >> already? If not -> do it, there are may good > hints to > >> find there. > >> > >> > >> The nodes find each other and create a cluster > very fast IF > >> they can > >> talk to each other. As no cluster networking is > involved in > >> fencing a > >> remote node if the fencing node by itself is > quorate this > >> could be your > >> problem. > >> > >> You should change to fence_manual and switch back > to your > >> real fencing > >> devices after you have debuged your problem. Also > get rid > >> of the > >> tag in your > cluster.conf as > >> fenced does the right > >> thing by default if the remaining configuration is > right > >> and now it is > >> just hiding a part of the problem. > >> > >> Also the 5 minute break on cman start smells like > a > >> DNS-lookup problem > >> or other network related problem to me. > >> > >> Here is a short check-list to be sure the nodes > can talk to > >> each other: > >> > >> Can the individual nodes ping each other? > >> > >> Can the individual nodes dns-lookup the other > node-names > >> (which you used > >> in your cluster.conf)? (Try to add them to your > etc/hosts > >> file, that way > >> you have a working cluster even if your dns-system > is going > >> on > >> vacation.) > >> > >> Is your switch allowing multicast communication on > all > >> ports that are > >> used for cluster communication? 
(This is a > prerequisite for > >> openais / > >> corosync based cman which would be anything >= > RHEL 5. > >> Search the > >> archives on this if you need more info...) > >> > >> Can you trace (eg. with wiresharks tshark) > incoming > >> cluster > >> communication from remote nodes? (If you don't > changed your > >> fencing to > >> fence_manual your listening system will get fenced > before > >> you can get > >> any useful information out of it. Try with and > without > >> active firewall.) > >> > >> If all above could be answered with "yes" your > cluster > >> should form just > >> fine. You could try to add a qdisk-device as > tiebreaker > >> after that and > >> test it just to be sure you have a working last > man > >> standing setup... > >> > >> Hope that helps, > >> > >> Marc > >>? ??? > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > "Try the new FASTER Yahoo! Mail. Experience it today at http://ph.mail.yahoo.com" From mad at wol.de Sat Jul 18 23:03:46 2009 From: mad at wol.de (Marc - A. Dahlhaus) Date: Sun, 19 Jul 2009 01:03:46 +0200 Subject: [Linux-cluster] Starting two-node cluster with only one node In-Reply-To: <667406.91442.qm@web110403.mail.gq1.yahoo.com> References: <667406.91442.qm@web110403.mail.gq1.yahoo.com> Message-ID: <4A6254D2.1050003@wol.de> Hello, Abed-nego G. Escobal, Jr. schrieb: > Hi! > > I am very sorry that I did not mention that when I am testing different suggestions on solving this, I always temporarily disable the firewall. Then turning it back on after the testing. > > Thank you very much on the tip for the tshark! I will post the output as as soon as I get a maintenance window to restart the cman service. > > With regards to openais, it is still off on both servers. Should it be turned on on boot? I am very sorry but I haven't read it in the manuals that it should be "on". > No leave it off as cman starts openais with configuration based on your cluster.conf. If openais starts via init your cluster will not work like it should... Marc From fxmulder at gmail.com Mon Jul 20 15:43:14 2009 From: fxmulder at gmail.com (James Devine) Date: Mon, 20 Jul 2009 09:43:14 -0600 Subject: [Linux-cluster] disk fencing In-Reply-To: <36df569a0907172350g5b381e3bt87abd7fbe6fe8677@mail.gmail.com> References: <36df569a0907172350g5b381e3bt87abd7fbe6fe8677@mail.gmail.com> Message-ID: I had looked at qdisk, it looks like qdisk is just a way for the nodes to share information about what it thinks the current cluster status is. It looked like external fencing still needed to take place, which was normally some power or network intervention. I was thinking of using the disk to do the fencing also. It could use the status information provided in the quorum disk for nodes to determine if they are fenced off or not. In the case of complete cutoff from disk, the remaining nodes would have to work under the assumption that the failed node(s) were no longer trying to access disk as they were making no more status updates to the quorum disk for a period of time. So they would be "fenced off" as it were, and the remaining nodes could continue on without them until the node(s) came back and made further status updates to the quorum disk. This way, fencing could be done completely independent of the hardware running, no need for network or power management. On Sat, Jul 18, 2009 at 12:50 AM, Ian Hayes wrote: > I'm not sure what you're asking here, but it sounds like you're describing a > qdisk. 
> > If a node loses heartbeat with the rest of the cluster, that's a fencin'. > Doesn't matter if it can still access the shared storage, and if it has lost > communication with the rest of the cluster, you probably don't want it > accessing your data anyway. > > On Fri, Jul 17, 2009 at 9:20 AM, James Devine wrote: >> >> Has anybody looked into using the network for heartbeat only, and disk >> for fencing in GFS? ?i.e. using the disk to communicate quorum when >> network heartbeat is lost between 1 or more nodes. ?If the disk is >> still accessible to all nodes, this should be a valid way to >> communicate quorum, if not, then the remaining nodes, assuming enough >> for quorum, should be able to continue knowing that nodes it can't >> communicate with either have been fenced or can't read/write to disk >> anyway. ?Does this sound like a valid approach? >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From pradhanparas at gmail.com Mon Jul 20 15:43:17 2009 From: pradhanparas at gmail.com (Paras pradhan) Date: Mon, 20 Jul 2009 10:43:17 -0500 Subject: [Linux-cluster] Cluster failover In-Reply-To: <8786b91c0907172026g3cdec94bw946187e2fdf3181a@mail.gmail.com> References: <8b711df40907171315n70057804rbc44d7fb255523af@mail.gmail.com> <8786b91c0907172026g3cdec94bw946187e2fdf3181a@mail.gmail.com> Message-ID: <8b711df40907200843k63a37d07l6adca62aceae0478@mail.gmail.com> On Fri, Jul 17, 2009 at 10:26 PM, Rajagopal Swaminathan wrote: > Greetings, > > You basically need to define multiple failover domains. > > On Sat, Jul 18, 2009 at 1:45 AM, Paras pradhan wrote: >> I have 3 virtual machine ?running on node 1 and 2 virtual >> machines running on node 2. > >> Now if node 1 fails I want my the node1 >> virtual machines ?to be stared only on node 3 but not on node2. > Failover domain 1 consisting of Node 1 and Node 3 > >> Similary if node2 breaks, I want virtual machines to be started on >> node3 but never on node 1. > Failover domain 2 consisting of Node 2 and Node 3 > > HTH > > Thanks and Regards > > Rajagopal > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > Thanks ! Will try it out. Paras. From pradhanparas at gmail.com Mon Jul 20 15:43:40 2009 From: pradhanparas at gmail.com (Paras pradhan) Date: Mon, 20 Jul 2009 10:43:40 -0500 Subject: [Linux-cluster] Cluster failover In-Reply-To: <36df569a0907172356g55aab93fwdd12a660f36e49f9@mail.gmail.com> References: <8b711df40907171315n70057804rbc44d7fb255523af@mail.gmail.com> <36df569a0907172356g55aab93fwdd12a660f36e49f9@mail.gmail.com> Message-ID: <8b711df40907200843s20585d2fo3929baaea9f32125@mail.gmail.com> On Sat, Jul 18, 2009 at 1:56 AM, Ian Hayes wrote: > Specify 2 different failover domains for the services. I have a similar > setup for a project with a 3 node cluster. Node 1 runs Service A, Node 2 > runs Service B and Node 3 is the floater > > Failover Domain 1: Node 1, Node 3 > Failover Domain 2: Node 2, Node 3 > > > Service A: Failover Domain1 > Service B: Failover Domain2 > > On Fri, Jul 17, 2009 at 1:15 PM, Paras pradhan > wrote: >> >> hi, >> >> I have 3 nodes of CentOS 5.3 running xen virtual machines as virtual >> machine service. This cluster is working fine. One thing I would like >> to know that how to make failover only to ?third node. 
What I mean to >> say is: I have 3 virtual machine ?running on node 1 and 2 virtual >> machines running on node 2. Now if node 1 fails I want my the node1 >> virtual machines ?to be stared only on node 3 but not on node2. >> Similary if node2 breaks, I want virtual machines to be started on >> node3 but never on node 1. >> >> Thanks ! >> Paras. >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > Thanks ! Will try it out. From pschobel at 1iopen.net Mon Jul 20 17:20:44 2009 From: pschobel at 1iopen.net (Peter Schobel) Date: Mon, 20 Jul 2009 10:20:44 -0700 Subject: [Linux-cluster] Re: rm -r on gfs2 filesystem is very slow In-Reply-To: References: Message-ID: In followup on this thread, We discovered that the reason why we didn't notice this performance problem in our initial proof of concept was because some directories were recently added to our source repository which contain a large number of files and subdirectories. I tested a number of directories and discovered that other similar sized directories (1 to 2 G) could remove in 1 to 3 seconds but the 1.6 G directory referenced below took ~ 9 minutes. This particular directory contained ~ 10,000 subdirectories containing a total of 65,000+ files. Further tests with touching empty files revealed that a directory with 65,000 empty files removed relatively quickly while a directory with 65,000 subdirectories containing 1 empty file each took a very long time to remove. Using the find command to identify files and piping the results to rm was much faster than using rm -r. Also, mounting the fs with the nodiratime option and setting: in cluster.conf reduced rm -r time on the directory referenced below from ~ 9 mins to ~ 5 mins. Since we now understand the problem a little bit better we are able to implement some workarounds to get us by. Peter ~ On Wed, Jul 8, 2009 at 1:58 PM, Peter Schobel wrote: > I am trying to set up a four node cluster but am getting very poor > performance when removing large directories. A directory approximately > 1.6G ?in size takes around 5 mins to remove from the gfs2 filesystem > but removes in around 10 seconds from the local disk. > > I am using CentOS 5.3 with kernel 2.6.18-128.1.16.el5PAE. > > The filesystem was formatted in the following manner: mkfs.gfs2 -t > wtl_build:dev_home00 -p lock_dlm -j 10 > /dev/mapper/VolGroupGFS-LogVolDevHome00 and is being mounted with the > following options: _netdev,noatime,defaults. > > If anyone knows what could be causing this please let me know. I'm > happy to provide any other information. > > Regards, > > -- > Peter Schobel > ~ > -- Peter Schobel ~ From pschobel at 1iopen.net Mon Jul 20 21:20:25 2009 From: pschobel at 1iopen.net (Peter Schobel) Date: Mon, 20 Jul 2009 14:20:25 -0700 Subject: [Linux-cluster] kernel BUG at fs/gfs2/rgrp.c:1458! Message-ID: We are experiencing fatal exceptions on a four node Linux cluster using a gfs2 filesystem. Any help would be appreciated. Am happy to provide additional info. kernel BUG at fs/gfs2/rgrp.c:1458! 
invalid opcode: 0000 [#1] SMP last sysfs file: /devices/pci0000:00/0000:00:00.0/irq Modules linked in: ipv6 xfrm_nalgo crypto_api lock_dlm gfs2 dlm configfs sunrpc dm_round_robin dm_multipath scsi_dh video hwmon backlight sbs i2c_ec i2c_cord CPU: 0 EIP: 0060:[] Not tainted VLI EFLAGS: 00010246 (2.6.18-128.1.16.el5PAE #1) EIP is at gfs2_alloc_data+0x75/0x155 [gfs2] eax: ffffffff ebx: 00000000 ecx: 00000000 edx: 00000001 esi: 05ec1513 edi: 00000000 ebp: f51aa114 esp: edf96c74 ds: 007b es: 007b ss: 0068 Process p4v.bin (pid: 8895, ti=edf96000 task=ccfe0550 task.ti=edf96000) Stack: d3b34548 f7242000 d9663380 00000000 d3b34548 00000000 d4575000 f96c4db2 cf449378 d4575140 00001000 00000000 cf449378 d3b34548 edf96cf4 f96c50a0 edf96cf4 00000001 edf96d18 edf96d10 00000000 0000000c 00000000 0000c000 Call Trace: [] lookup_block+0xb4/0x153 [gfs2] [] gfs2_block_map+0x24f/0x392 [gfs2] [] set_bh_page+0x43/0x4c [] alloc_page_buffers+0x74/0xba [] __block_prepare_write+0x1a2/0x439 [] do_promote+0xe8/0x10b [gfs2] [] block_prepare_write+0x16/0x23 [] gfs2_block_map+0x0/0x392 [gfs2] [] gfs2_write_begin+0x2af/0x359 [gfs2] [] gfs2_block_map+0x0/0x392 [gfs2] [] gfs2_file_buffered_write+0x10d/0x287 [gfs2] [] current_fs_time+0x4a/0x55 [] __gfs2_file_aio_write_nolock+0x2d4/0x32d [gfs2] [] sock_aio_read+0x53/0x61 [] gfs2_file_write_nolock+0xb0/0x111 [gfs2] [] autoremove_wake_function+0x0/0x2d [] autoremove_wake_function+0x0/0x2d [] gfs2_file_write+0x0/0x94 [gfs2] [] gfs2_file_write+0x3a/0x94 [gfs2] [] gfs2_file_write+0x0/0x94 [gfs2] [] vfs_write+0xa1/0x143 [] sys_write+0x3c/0x63 [] sysenter_past_esp+0x56/0x79 ======================= Code: 16 31 d2 01 f0 11 fa 39 d3 77 0c 72 04 39 c1 73 06 89 ca 29 f2 eb 03 8b 55 70 31 c9 89 e8 6a 01 e8 39 e8 ff ff 5a 83 f8 ff 75 08 <0f> 0b b2 05 7c 5d 6 EIP: [] gfs2_alloc_data+0x75/0x155 [gfs2] SS:ESP 0068:edf96c74 <0>Kernel panic - not syncing: Fatal exception kernel BUG at fs/gfs2/rgrp.c:1458! 
invalid opcode: 0000 [#1] SMP last sysfs file: /devices/pci0000:00/0000:00:02.0/0000:04:00.0/0000:05:00.0/0000:06:00.0/0000:07:00.0/irq Modules linked in: ipv6 xfrm_nalgo crypto_api lock_dlm gfs2 dlm configfs sunrpc dm_round_robin dm_multipath scsi_dh video hwmon backlight sbs i2c_ec i2c_cord CPU: 1 EIP: 0060:[] Not tainted VLI EFLAGS: 00010246 (2.6.18-128.1.16.el5PAE #1) EIP is at gfs2_alloc_data+0x75/0x155 [gfs2] eax: ffffffff ebx: 00000000 ecx: 00000000 edx: 00000001 esi: 05ec1513 edi: 00000000 ebp: f0e11858 esp: f01f1c74 ds: 007b es: 007b ss: 0068 Process cp (pid: 31700, ti=f01f1000 task=f2e83550 task.ti=f01f1000) Stack: f4361d68 f6cb1000 e45b7dc0 00000000 f4361d68 00000001 d25b8000 f96c3db2 d12d53ac d25b8e10 00001000 00000001 e40686ec f4361d68 f01f1cf4 f96c40a0 f01f1cf4 00000001 f01f1d18 f01f1d10 00000000 000013a5 00000000 013a5000 Call Trace: [] lookup_block+0xb4/0x153 [gfs2] [] gfs2_block_map+0x24f/0x392 [gfs2] [] set_bh_page+0x43/0x4c [] alloc_page_buffers+0x74/0xba [] __block_prepare_write+0x1a2/0x439 [] do_promote+0xe8/0x10b [gfs2] [] block_prepare_write+0x16/0x23 [] gfs2_block_map+0x0/0x392 [gfs2] [] gfs2_write_begin+0x2af/0x359 [gfs2] [] gfs2_block_map+0x0/0x392 [gfs2] [] gfs2_file_buffered_write+0x10d/0x287 [gfs2] [] current_fs_time+0x4a/0x55 [] __gfs2_file_aio_write_nolock+0x2d4/0x32d [gfs2] [] gfs2_file_write_nolock+0xb0/0x111 [gfs2] [] autoremove_wake_function+0x0/0x2d [] autoremove_wake_function+0x0/0x2d [] gfs2_file_write+0x0/0x94 [gfs2] [] gfs2_file_write+0x3a/0x94 [gfs2] [] gfs2_file_write+0x0/0x94 [gfs2] [] vfs_write+0xa1/0x143 [] sys_write+0x3c/0x63 [] sysenter_past_esp+0x56/0x79 ======================= Code: 16 31 d2 01 f0 11 fa 39 d3 77 0c 72 04 39 c1 73 06 89 ca 29 f2 eb 03 8b 55 70 31 c9 89 e8 6a 01 e8 39 e8 ff ff 5a 83 f8 ff 75 08 <0f> 0b b2 05 7c 4d 6 EIP: [] gfs2_alloc_data+0x75/0x155 [gfs2] SS:ESP 0068:f01f1c74 <0>Kernel panic - not syncing: Fatal exception Thanks in advance, -- Peter Schobel ~ From brem.belguebli at gmail.com Tue Jul 21 09:21:23 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Tue, 21 Jul 2009 11:21:23 +0200 Subject: [Linux-cluster] CLVMD without GFS Message-ID: <29ae894c0907210221y395ab4bew95f78769a84eb141@mail.gmail.com> Hi all, I think there is something to clarify about using CLVM across a cluster in a active/passive mode without GFS. >From my understanding, CLVM keeps LVM metadata coherent among the cluster nodes and provides a cluster wide locking mechanism that can prevent any node from trying to activate a volume group if it has been activated exclusively (vgchange -a e VGXXX) by another node (which needs to be up). I have been playing with it to check this behaviour but it doesn't seem to make what is expected. I have 2 nodes (RHEL 5.3 X86_64, cluster installed and configured) , A and B using a SAN shared storage. I have a LUN from this SAN seen by both nodes, pvcreate'd /dev/mpath/mpath0 , vgcreate'd vg10 and lvcreate'd lvol1 (on one node), created an ext3 FS on /dev/vg10/lvol1 CLVM is running in debug mode (clvmd -d2 ) (but it complains about locking disabled though locking set to 3 on both nodes) On node A: vgchange -c y vg10 returns OK (vgs --> vg10 1 1 0 wz--nc) vgchange -a e --> OK lvs returns lvol1 vg10 -wi-a- On node B (while things are active on A, A is UP and member of the cluster ): vgchange -a e --> Error locking on node B: Volume is busy on another node 1 logical volume(s) in volume group "vg10" now active It activates vg10 even if it sees it busy on another node . 
on B, lvs returns lvol1 vg10 -wi-a- as well as on A. I think the main problem comes from the fact that, as it is said when starting CLVM in debug mode, WARNING: Locking disabled. Be careful! This could corrupt your metadata. IMHO, the algorithm should be as follows: VG is tagged as clustered (vgchange -c y VGXXX) if a node (node P) tries to activate the VG exclusively (vgchange -a VGXXX) ask the lock manager to check if VG is not already locked by another node (node X) if so, check if node X is up if node X is down, return OK to node P else return NOK to node P (explicitely that VG is held exclusively by node X) Brem PS: this shouldn't be a problem with GFS or other clustered FS (OCFS, etc...) as no node should try to activate exclusively any VG. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ccaulfie at redhat.com Tue Jul 21 09:55:58 2009 From: ccaulfie at redhat.com (Christine Caulfield) Date: Tue, 21 Jul 2009 10:55:58 +0100 Subject: [Linux-cluster] CLVMD without GFS In-Reply-To: <29ae894c0907210221y395ab4bew95f78769a84eb141@mail.gmail.com> References: <29ae894c0907210221y395ab4bew95f78769a84eb141@mail.gmail.com> Message-ID: <4A6590AE.2040808@redhat.com> Hiya, I've just tried this on my cluster and it works fine. What you need to remember is that lvcreate on one node will also activate the LV on all nodes in the cluster - it does an implicit lvchange -ay when you create it. What I can't explain is why vgchange -ae seemed to work fine on node A, it should give the same error as on node B because LVs are open shared on both nodes. Its not clear to me when you tagged the VG as clustered, so that might be contributing to the problem. When I create a new VG on shared storage it automatically gets labelled clustered so I have never needed to do this explicitly. If you create a non-clustered VG you probably ought to deactivate it on all nodes first as it could mess up the locking otherwise. This *might* be the cause of your troubles. The error on clvmd startup can be ignored. It's caused by clvmd ussing a background command with --no_locking so that it can check which LVs (if any) are already active and re-acquire locks for them Sorry this isn't conclusive, The exact order in which things are happening is not clear to me. Chrissie. On 07/21/2009 10:21 AM, brem belguebli wrote: > Hi all, > I think there is something to clarify about using CLVM across a cluster > in a active/passive mode without GFS. > From my understanding, CLVM keeps LVM metadata coherent among the > cluster nodes and provides a cluster wide locking mechanism that can > prevent any node from trying to activate a volume group if it has been > activated exclusively (vgchange -a e VGXXX) by another node (which > needs to be up). > I have been playing with it to check this behaviour but it doesn't seem > to make what is expected. > I have 2 nodes (RHEL 5.3 X86_64, cluster installed and configured) , A > and B using a SAN shared storage. 
> I have a LUN from this SAN seen by both nodes, pvcreate'd > /dev/mpath/mpath0 , vgcreate'd vg10 and lvcreate'd lvol1 (on one node), > created an ext3 FS on /dev/vg10/lvol1 > CLVM is running in debug mode (clvmd -d2 ) (but it complains about > locking disabled though locking set to 3 on both nodes) > On node A: > vgchange -c y vg10 returns OK (vgs --> vg10 1 1 0 > wz--nc) > vgchange -a e --> OK > lvs returns lvol1 vg10 -wi-a- > On node B (while things are active on A, A is UP and member of the > cluster ): > vgchange -a e --> Error locking on node B: Volume is busy on > another node > 1 logical volume(s) in volume group > "vg10" now active > It activates vg10 even if it sees it busy on another node . > on B, lvs returns lvol1 vg10 -wi-a- > as well as on A. > I think the main problem comes from the fact that, as it is said when > starting CLVM in debug mode, WARNING: Locking disabled. Be careful! > This could corrupt your metadata. > IMHO, the algorithm should be as follows: > VG is tagged as clustered (vgchange -c y VGXXX) > if a node (node P) tries to activate the VG exclusively (vgchange -a VGXXX) > ask the lock manager to check if VG is not already locked by another > node (node X) > if so, check if node X is up > if node X is down, return OK to node P > else > return NOK to node P (explicitely that VG is held exclusively by node X) > Brem > PS: this shouldn't be a problem with GFS or other clustered FS (OCFS, > etc...) as no node should try to activate exclusively any VG. > > > ------------------------------------------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From brem.belguebli at gmail.com Tue Jul 21 11:16:19 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Tue, 21 Jul 2009 13:16:19 +0200 Subject: [Linux-cluster] CLVMD without GFS In-Reply-To: <4A6590AE.2040808@redhat.com> References: <29ae894c0907210221y395ab4bew95f78769a84eb141@mail.gmail.com> <4A6590AE.2040808@redhat.com> Message-ID: <29ae894c0907210416l523d8cf4t71f9b46adfbd4adf@mail.gmail.com> Hi Chrissie, Indeed, by default when creating the VG, it is clustered, thus when creating the LV it is active on all nodes. To avoid data corruption, I have re created the VG as non clustered (vgcreate -c n vgXX) then created the LV which got activated only on the node where it got created. Then changed the VG to clustered (vgchange -c y VGXX) and activated it exclusively on this node. But, I could reproduce the behaviour of bypassing the exclusive flag: On node B, re change the VG to non clustered though it is activated exclusively on node A. and then activate it on node B and it works. The thing I'm trying to point is that simply by erasing the clustered flag you can bypass the exclusive activation. I think a barrier is necessary to prevent this to happen, removing the clustered flag from a VG should be possible only if the node holding the VG exclusively is down (does the lock manager DLM report which node holds exclusively a VG ?) Thanks Brem 2009/7/21, Christine Caulfield : > > Hiya, > > I've just tried this on my cluster and it works fine. > > What you need to remember is that lvcreate on one node will also activate > the LV on all nodes in the cluster - it does an implicit lvchange -ay when > you create it. What I can't explain is why vgchange -ae seemed to work fine > on node A, it should give the same error as on node B because LVs are open > shared on both nodes. 
> > Its not clear to me when you tagged the VG as clustered, so that might be > contributing to the problem. When I create a new VG on shared storage it > automatically gets labelled clustered so I have never needed to do this > explicitly. If you create a non-clustered VG you probably ought to > deactivate it on all nodes first as it could mess up the locking otherwise. > This *might* be the cause of your troubles. > > The error on clvmd startup can be ignored. It's caused by clvmd ussing a > background command with --no_locking so that it can check which LVs (if any) > are already active and re-acquire locks for them > > Sorry this isn't conclusive, The exact order in which things are happening > is not clear to me. > > Chrissie. > > On 07/21/2009 10:21 AM, brem belguebli wrote: > >> Hi all, >> I think there is something to clarify about using CLVM across a cluster >> in a active/passive mode without GFS. >> From my understanding, CLVM keeps LVM metadata coherent among the >> cluster nodes and provides a cluster wide locking mechanism that can >> prevent any node from trying to activate a volume group if it has been >> activated exclusively (vgchange -a e VGXXX) by another node (which >> needs to be up). >> I have been playing with it to check this behaviour but it doesn't seem >> to make what is expected. >> I have 2 nodes (RHEL 5.3 X86_64, cluster installed and configured) , A >> and B using a SAN shared storage. >> I have a LUN from this SAN seen by both nodes, pvcreate'd >> /dev/mpath/mpath0 , vgcreate'd vg10 and lvcreate'd lvol1 (on one node), >> created an ext3 FS on /dev/vg10/lvol1 >> CLVM is running in debug mode (clvmd -d2 ) (but it complains about >> locking disabled though locking set to 3 on both nodes) >> On node A: >> vgchange -c y vg10 returns OK (vgs --> vg10 1 1 0 >> wz--nc) >> vgchange -a e --> OK >> lvs returns lvol1 vg10 -wi-a- >> On node B (while things are active on A, A is UP and member of the >> cluster ): >> vgchange -a e --> Error locking on node B: Volume is busy on >> another node >> 1 logical volume(s) in volume group >> "vg10" now active >> It activates vg10 even if it sees it busy on another node . >> on B, lvs returns lvol1 vg10 -wi-a- >> as well as on A. >> I think the main problem comes from the fact that, as it is said when >> starting CLVM in debug mode, WARNING: Locking disabled. Be careful! >> This could corrupt your metadata. >> IMHO, the algorithm should be as follows: >> VG is tagged as clustered (vgchange -c y VGXXX) >> if a node (node P) tries to activate the VG exclusively (vgchange -a >> VGXXX) >> ask the lock manager to check if VG is not already locked by another >> node (node X) >> if so, check if node X is up >> if node X is down, return OK to node P >> else >> return NOK to node P (explicitely that VG is held exclusively by node X) >> Brem >> PS: this shouldn't be a problem with GFS or other clustered FS (OCFS, >> etc...) as no node should try to activate exclusively any VG. >> >> >> ------------------------------------------------------------------------ >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... 
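For what it's worth, the sequence that normally gives the behaviour brem is after (a clustered VG with ext3 on top, active on exactly one node at a time) looks roughly like the following, keeping the vg10/lvol1 names from the thread; the mount point is a placeholder, and this assumes locking_type = 3 in lvm.conf and a running clvmd on every node:

    vgchange -an vg10               # deactivate everywhere first (run on every node)

    # on the node that should own the volume group:
    vgchange -aey vg10              # 'e' = exclusive activation, backed by a cluster-wide DLM lock
    mount /dev/vg10/lvol1 /srv/data

    # on any other node, while the owner is up and quorate, this should fail with
    # "Error locking on node ...: Volume is busy on another node":
    vgchange -aey vg10

    # failover: only after the owning node is down and has been fenced
    vgchange -aey vg10 && mount /dev/vg10/lvol1 /srv/data

As Chrissie explains in the follow-up below, all of this only holds while the VG keeps its clustered flag; clearing the flag removes the cluster locking and with it the protection.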
URL: From ccaulfie at redhat.com Tue Jul 21 11:48:50 2009 From: ccaulfie at redhat.com (Christine Caulfield) Date: Tue, 21 Jul 2009 12:48:50 +0100 Subject: [Linux-cluster] CLVMD without GFS In-Reply-To: <29ae894c0907210416l523d8cf4t71f9b46adfbd4adf@mail.gmail.com> References: <29ae894c0907210221y395ab4bew95f78769a84eb141@mail.gmail.com> <4A6590AE.2040808@redhat.com> <29ae894c0907210416l523d8cf4t71f9b46adfbd4adf@mail.gmail.com> Message-ID: <4A65AB22.1030601@redhat.com> Hiya, If you make a VG non-clustered then you, by definition, forfeit all cluster locking protection on all of the LVs in that group. In practise you should always make shared volumes clustered and non-shared volumes non-clustered. If you make a shared volume non-clustered then it's up to you to manage the protection of it. It might be possible to put some protection in for when a volume group is changed from clustered to non-clustered, but really, you're not supposed to do that! Look at it this way, if you create a shared VG and mark it non-clustered to start with, then you can corrupt it as much as you like by mounting its filesystems on multiple nodes. It's just the same as a SAN volume then. On 07/21/2009 12:16 PM, brem belguebli wrote: > Hi Chrissie, > Indeed, by default when creating the VG, it is clustered, thus when > creating the LV it is active on all nodes. > To avoid data corruption, I have re created the VG as non clustered > (vgcreate -c n vgXX) then created the LV which got activated only on the > node where it got created. > Then changed the VG to clustered (vgchange -c y VGXX) and activated it > exclusively on this node. > But, I could reproduce the behaviour of bypassing the exclusive flag: > On node B, re change the VG to non clustered though it is activated > exclusively on node A. > and then activate it on node B and it works. > The thing I'm trying to point is that simply by erasing the clustered > flag you can bypass the exclusive activation. > I think a barrier is necessary to prevent this to happen, removing the > clustered flag from a VG should be possible only if the node holding the > VG exclusively is down (does the lock manager DLM report which node > holds exclusively a VG ?) > Thanks > Brem > > > 2009/7/21, Christine Caulfield >: > > Hiya, > > I've just tried this on my cluster and it works fine. > > What you need to remember is that lvcreate on one node will also > activate the LV on all nodes in the cluster - it does an implicit > lvchange -ay when you create it. What I can't explain is why > vgchange -ae seemed to work fine on node A, it should give the same > error as on node B because LVs are open shared on both nodes. > > Its not clear to me when you tagged the VG as clustered, so that > might be contributing to the problem. When I create a new VG on > shared storage it automatically gets labelled clustered so I have > never needed to do this explicitly. If you create a non-clustered VG > you probably ought to deactivate it on all nodes first as it could > mess up the locking otherwise. This *might* be the cause of your > troubles. > > The error on clvmd startup can be ignored. It's caused by clvmd > ussing a background command with --no_locking so that it can check > which LVs (if any) are already active and re-acquire locks for them > > Sorry this isn't conclusive, The exact order in which things are > happening is not clear to me. > > Chrissie. 
> > > On 07/21/2009 10:21 AM, brem belguebli wrote: > > Hi all, > I think there is something to clarify about using CLVM across a > cluster > in a active/passive mode without GFS. > From my understanding, CLVM keeps LVM metadata coherent among the > cluster nodes and provides a cluster wide locking mechanism that can > prevent any node from trying to activate a volume group if it > has been > activated exclusively (vgchange -a e VGXXX) by another node (which > needs to be up). > I have been playing with it to check this behaviour but it > doesn't seem > to make what is expected. > I have 2 nodes (RHEL 5.3 X86_64, cluster installed and > configured) , A > and B using a SAN shared storage. > I have a LUN from this SAN seen by both nodes, pvcreate'd > /dev/mpath/mpath0 , vgcreate'd vg10 and lvcreate'd lvol1 (on one > node), > created an ext3 FS on /dev/vg10/lvol1 > CLVM is running in debug mode (clvmd -d2 ) (but it complains about > locking disabled though locking set to 3 on both nodes) > On node A: > vgchange -c y vg10 returns OK (vgs --> vg10 1 > 1 0 > wz--nc) > vgchange -a e --> OK > lvs returns lvol1 vg10 -wi-a- > On node B (while things are active on A, A is UP and member of the > cluster ): > vgchange -a e --> Error locking on node B: Volume is > busy on > another node > 1 logical volume(s) in > volume group > "vg10" now active > It activates vg10 even if it sees it busy on another node . > on B, lvs returns lvol1 vg10 -wi-a- > as well as on A. > I think the main problem comes from the fact that, as it is said > when > starting CLVM in debug mode, WARNING: Locking disabled. Be careful! > This could corrupt your metadata. > IMHO, the algorithm should be as follows: > VG is tagged as clustered (vgchange -c y VGXXX) > if a node (node P) tries to activate the VG exclusively > (vgchange -a VGXXX) > ask the lock manager to check if VG is not already locked by another > node (node X) > if so, check if node X is up > if node X is down, return OK to node P > else > return NOK to node P (explicitely that VG is held exclusively by > node X) > Brem > PS: this shouldn't be a problem with GFS or other clustered FS > (OCFS, > etc...) as no node should try to activate exclusively any VG. > > > ------------------------------------------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > ------------------------------------------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From listener at may.co.at Tue Jul 21 11:51:22 2009 From: listener at may.co.at (Wolfgang Hotwagner) Date: Tue, 21 Jul 2009 13:51:22 +0200 Subject: [Linux-cluster] Waiting for fenced to join the fence group Message-ID: <4A65ABBA.3030305@may.co.at> Hello, i am not able to make a gfs2-cluster on a drbd-device. I always have the problem with joining the fence group. I am using a debian stable(lenny) system. On eth0 there is also a ctdb-service which enables 2 additional ip's. Maybe someone could help me to get it working.. 
Greetings Wolfgang dslin1: eth0: 172.30.50.83 eth1: 10.13.13.2 /etc/hosts: 127.0.0.1 localhost 172.30.50.83 dslin1 172.30.50.84 dslin2 10.13.13.2 node1 10.13.13.3 node2 /proc/drbd: version: 8.0.14 (api:86/proto:86) GIT-hash: bb447522fc9a87d0069b7e14f0234911ebdab0f7 build by phil at fat-tyre, 2008-11-12 16:40:33 0: cs:Connected st:Primary/Primary ds:UpToDate/UpToDate C r--- ns:0 nr:12288 dw:12288 dr:0 al:0 bm:3 lo:0 pe:0 ua:0 ap:0 resync: used:0/61 hits:765 misses:3 starving:0 dirty:0 changed:3 act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0 syslog: Jul 21 13:38:27 dslin1 ccsd[14975]: Starting ccsd 2.03.09: Jul 21 13:38:27 dslin1 ccsd[14975]: Built: Nov 3 2008 18:22:21 Jul 21 13:38:27 dslin1 ccsd[14975]: Copyright (C) Red Hat, Inc. 2004-2008 All rights reserved. Jul 21 13:38:28 dslin1 ccsd[14975]: /etc/cluster/cluster.conf (cluster name = cluster, version = 1) found. Jul 21 13:38:31 dslin1 ccsd[14975]: Initial status:: Quorate Jul 21 13:38:35 dslin1 openais[14980]: cman killed by node 2 because we rejoined the cluster without a full restart Jul 21 13:38:35 dslin1 groupd[14984]: cman_get_nodes error -1 104 Jul 21 13:38:35 dslin1 gfs_controld[14992]: cluster is down, exiting Jul 21 13:39:00 dslin1 ccsd[14975]: Unable to connect to cluster infrastructure after 30 seconds. Jul 21 13:39:30 dslin1 ccsd[14975]: Unable to connect to cluster infrastructure after 60 seconds. Jul 21 13:40:00 dslin1 ccsd[14975]: Unable to connect to cluster infrastructure after 90 seconds. Jul 21 13:40:30 dslin1 ccsd[14975]: Unable to connect to cluster infrastructure after 120 seconds. and so on.. dslin2: eth0: 172.30.50.84 eth1: 10.13.13.3 /etc/hosts: 127.0.0.1 localhost 172.30.50.83 dslin1 172.30.50.84 dslin2 10.13.13.2 node1 10.13.13.3 node2 /proc/drbd version: 8.0.14 (api:86/proto:86) GIT-hash: bb447522fc9a87d0069b7e14f0234911ebdab0f7 build by phil at fat-tyre, 2008-11-12 16:40:33 0: cs:Connected st:Primary/Primary ds:UpToDate/UpToDate C r--- ns:12292 nr:0 dw:0 dr:12296 al:0 bm:6 lo:0 pe:0 ua:0 ap:0 resync: used:0/61 hits:765 misses:3 starving:0 dirty:0 changed:3 act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0 syslog: Jul 21 13:38:27 dslin1 ccsd[14975]: Starting ccsd 2.03.09: Jul 21 13:38:27 dslin1 ccsd[14975]: Built: Nov 3 2008 18:22:21 Jul 21 13:38:27 dslin1 ccsd[14975]: Copyright (C) Red Hat, Inc. 2004-2008 All rights reserved. Jul 21 13:38:28 dslin1 ccsd[14975]: /etc/cluster/cluster.conf (cluster name = cluster, version = 1) found. Jul 21 13:38:31 dslin1 ccsd[14975]: Initial status:: Quorate Jul 21 13:38:35 dslin1 openais[14980]: cman killed by node 2 because we rejoined the cluster without a full restart Jul 21 13:38:35 dslin1 groupd[14984]: cman_get_nodes error -1 104 Jul 21 13:38:35 dslin1 gfs_controld[14992]: cluster is down, exiting Jul 21 13:39:00 dslin1 ccsd[14975]: Unable to connect to cluster infrastructure after 30 seconds. Jul 21 13:39:30 dslin1 ccsd[14975]: Unable to connect to cluster infrastructure after 60 seconds. Jul 21 13:40:00 dslin1 ccsd[14975]: Unable to connect to cluster infrastructure after 90 seconds. Jul 21 13:40:30 dslin1 ccsd[14975]: Unable to connect to cluster infrastructure after 120 seconds. Jul 21 13:41:00 dslin1 ccsd[14975]: Unable to connect to cluster infrastructure after 150 seconds. Jul 21 13:41:30 dslin1 ccsd[14975]: Unable to connect to cluster infrastructure after 180 seconds. Jul 21 13:42:00 dslin1 ccsd[14975]: Unable to connect to cluster infrastructure after 210 seconds. and so on.. 
/etc/cluster/cluster.conf: From brem.belguebli at gmail.com Tue Jul 21 12:11:47 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Tue, 21 Jul 2009 14:11:47 +0200 Subject: [Linux-cluster] CLVMD without GFS In-Reply-To: <4A65AB22.1030601@redhat.com> References: <29ae894c0907210221y395ab4bew95f78769a84eb141@mail.gmail.com> <4A6590AE.2040808@redhat.com> <29ae894c0907210416l523d8cf4t71f9b46adfbd4adf@mail.gmail.com> <4A65AB22.1030601@redhat.com> Message-ID: <29ae894c0907210511o28fa6a80j2b403d2ae6866d79@mail.gmail.com> Hi, When creating the VG by default clustered, you implicitely assume that it will be used with a clustered FS on top of it (gfs, ocfs, etc...) that will handle the active/active mode. As I do not intend to use GFS in this particular case, but ext3 and raw devices, I need to make sure the vg is exclusively activated on one node, preventing the other nodes to access it unless it is the failover procedure (node holding the VG crashed) and then re activate it exclusively on the failover node. Thanks Brem 2009/7/21, Christine Caulfield : > > Hiya, > > If you make a VG non-clustered then you, by definition, forfeit all cluster > locking protection on all of the LVs in that group. > > In practise you should always make shared volumes clustered and non-shared > volumes non-clustered. If you make a shared volume non-clustered then it's > up to you to manage the protection of it. > > It might be possible to put some protection in for when a volume group is > changed from clustered to non-clustered, but really, you're not supposed to > do that! > > Look at it this way, if you create a shared VG and mark it non-clustered to > start with, then you can corrupt it as much as you like by mounting its > filesystems on multiple nodes. It's just the same as a SAN volume then. > > > On 07/21/2009 12:16 PM, brem belguebli wrote: > >> Hi Chrissie, >> Indeed, by default when creating the VG, it is clustered, thus when >> creating the LV it is active on all nodes. >> To avoid data corruption, I have re created the VG as non clustered >> (vgcreate -c n vgXX) then created the LV which got activated only on the >> node where it got created. >> Then changed the VG to clustered (vgchange -c y VGXX) and activated it >> exclusively on this node. >> But, I could reproduce the behaviour of bypassing the exclusive flag: >> On node B, re change the VG to non clustered though it is activated >> exclusively on node A. >> and then activate it on node B and it works. >> The thing I'm trying to point is that simply by erasing the clustered >> flag you can bypass the exclusive activation. >> I think a barrier is necessary to prevent this to happen, removing the >> clustered flag from a VG should be possible only if the node holding the >> VG exclusively is down (does the lock manager DLM report which node >> holds exclusively a VG ?) >> Thanks >> Brem >> >> >> 2009/7/21, Christine Caulfield > >: >> >> Hiya, >> >> I've just tried this on my cluster and it works fine. >> >> What you need to remember is that lvcreate on one node will also >> activate the LV on all nodes in the cluster - it does an implicit >> lvchange -ay when you create it. What I can't explain is why >> vgchange -ae seemed to work fine on node A, it should give the same >> error as on node B because LVs are open shared on both nodes. >> >> Its not clear to me when you tagged the VG as clustered, so that >> might be contributing to the problem. 
When I create a new VG on >> shared storage it automatically gets labelled clustered so I have >> never needed to do this explicitly. If you create a non-clustered VG >> you probably ought to deactivate it on all nodes first as it could >> mess up the locking otherwise. This *might* be the cause of your >> troubles. >> >> The error on clvmd startup can be ignored. It's caused by clvmd >> ussing a background command with --no_locking so that it can check >> which LVs (if any) are already active and re-acquire locks for them >> >> Sorry this isn't conclusive, The exact order in which things are >> happening is not clear to me. >> >> Chrissie. >> >> >> On 07/21/2009 10:21 AM, brem belguebli wrote: >> >> Hi all, >> I think there is something to clarify about using CLVM across a >> cluster >> in a active/passive mode without GFS. >> From my understanding, CLVM keeps LVM metadata coherent among the >> cluster nodes and provides a cluster wide locking mechanism that >> can >> prevent any node from trying to activate a volume group if it >> has been >> activated exclusively (vgchange -a e VGXXX) by another node (which >> needs to be up). >> I have been playing with it to check this behaviour but it >> doesn't seem >> to make what is expected. >> I have 2 nodes (RHEL 5.3 X86_64, cluster installed and >> configured) , A >> and B using a SAN shared storage. >> I have a LUN from this SAN seen by both nodes, pvcreate'd >> /dev/mpath/mpath0 , vgcreate'd vg10 and lvcreate'd lvol1 (on one >> node), >> created an ext3 FS on /dev/vg10/lvol1 >> CLVM is running in debug mode (clvmd -d2 ) (but it complains about >> locking disabled though locking set to 3 on both nodes) >> On node A: >> vgchange -c y vg10 returns OK (vgs --> vg10 1 >> 1 0 >> wz--nc) >> vgchange -a e --> OK >> lvs returns lvol1 vg10 -wi-a- >> On node B (while things are active on A, A is UP and member of the >> cluster ): >> vgchange -a e --> Error locking on node B: Volume is >> busy on >> another node >> 1 logical volume(s) in >> volume group >> "vg10" now active >> It activates vg10 even if it sees it busy on another node . >> on B, lvs returns lvol1 vg10 -wi-a- >> as well as on A. >> I think the main problem comes from the fact that, as it is said >> when >> starting CLVM in debug mode, WARNING: Locking disabled. Be >> careful! >> This could corrupt your metadata. >> IMHO, the algorithm should be as follows: >> VG is tagged as clustered (vgchange -c y VGXXX) >> if a node (node P) tries to activate the VG exclusively >> (vgchange -a VGXXX) >> ask the lock manager to check if VG is not already locked by >> another >> node (node X) >> if so, check if node X is up >> if node X is down, return OK to node P >> else >> return NOK to node P (explicitely that VG is held exclusively by >> node X) >> Brem >> PS: this shouldn't be a problem with GFS or other clustered FS >> (OCFS, >> etc...) as no node should try to activate exclusively any VG. 
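[For readers following the exchange above, the behaviour brem expects can be reproduced with a short test. This is a sketch of the intended sequence, using vg10 from the original post; it describes what the poster expects clvmd to enforce, not a guarantee of what every release actually does:]

    # node A: mark the VG clustered and activate it exclusively
    vgchange -c y vg10
    vgchange -a e vg10              # exclusive activation through clvmd/DLM
    vgs -o vg_name,vg_attr vg10     # trailing 'c' in the attr column (wz--nc) = clustered
    # node B, while node A is up and holding the exclusive lock:
    vgchange -a e vg10              # expected: "Error locking on node B: Volume is busy on another node"
    lvs -o lv_name,lv_attr vg10     # lvol1 should NOT carry the 'a' (active) flag here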
>> >> >> >> ------------------------------------------------------------------------ >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> >> ------------------------------------------------------------------------ >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ccaulfie at redhat.com Tue Jul 21 13:23:56 2009 From: ccaulfie at redhat.com (Christine Caulfield) Date: Tue, 21 Jul 2009 14:23:56 +0100 Subject: [Linux-cluster] CLVMD without GFS In-Reply-To: <29ae894c0907210511o28fa6a80j2b403d2ae6866d79@mail.gmail.com> References: <29ae894c0907210221y395ab4bew95f78769a84eb141@mail.gmail.com> <4A6590AE.2040808@redhat.com> <29ae894c0907210416l523d8cf4t71f9b46adfbd4adf@mail.gmail.com> <4A65AB22.1030601@redhat.com> <29ae894c0907210511o28fa6a80j2b403d2ae6866d79@mail.gmail.com> Message-ID: <4A65C16C.20104@redhat.com> On 07/21/2009 01:11 PM, brem belguebli wrote: > Hi, > When creating the VG by default clustered, you implicitely assume that > it will be used with a clustered FS on top of it (gfs, ocfs, etc...) > that will handle the active/active mode. > As I do not intend to use GFS in this particular case, but ext3 and raw > devices, I need to make sure the vg is exclusively activated on one > node, preventing the other nodes to access it unless it is the failover > procedure (node holding the VG crashed) and then re activate it > exclusively on the failover node. > Thanks In that case you probably ought to be using rgmanager to do the failover for you. It has a script for doing exactly this :-) Chrissie From brem.belguebli at gmail.com Tue Jul 21 14:40:16 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Tue, 21 Jul 2009 16:40:16 +0200 Subject: [Linux-cluster] CLVMD without GFS In-Reply-To: <4A65C16C.20104@redhat.com> References: <29ae894c0907210221y395ab4bew95f78769a84eb141@mail.gmail.com> <4A6590AE.2040808@redhat.com> <29ae894c0907210416l523d8cf4t71f9b46adfbd4adf@mail.gmail.com> <4A65AB22.1030601@redhat.com> <29ae894c0907210511o28fa6a80j2b403d2ae6866d79@mail.gmail.com> <4A65C16C.20104@redhat.com> Message-ID: <29ae894c0907210740n35ce71d6r2c3ba7478118c6a6@mail.gmail.com> Hi, That's what I 'm trying to do. If you mean lvm.sh, well, I've been playing with it, but it does some "sanity" checks that are wierd 1. It expects HA LVM to be setup (why such check if we want to use CLVM). 2. it exits if it finds a CLVM VG (kind of funny !) 3. it exits if the lvm.conf is newer than /boot/*.img (about this one, we tend to prevent the cluster from automatically starting ...) I was looking to find some doc on how to write my own resources, ie CLVM resource that checks if the vg is clustered, if so by which node is it exclusively held, and if the node is down to activate exclusively the VG. If you have some good links to provide me, that'll be great. Thanks 2009/7/21, Christine Caulfield : > On 07/21/2009 01:11 PM, brem belguebli wrote: > >> Hi, >> When creating the VG by default clustered, you implicitely assume that >> it will be used with a clustered FS on top of it (gfs, ocfs, etc...) 
>> that will handle the active/active mode. >> As I do not intend to use GFS in this particular case, but ext3 and raw >> devices, I need to make sure the vg is exclusively activated on one >> node, preventing the other nodes to access it unless it is the failover >> procedure (node holding the VG crashed) and then re activate it >> exclusively on the failover node. >> Thanks >> > > > In that case you probably ought to be using rgmanager to do the failover > for you. It has a script for doing exactly this :-) > > Chrissie > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ccaulfie at redhat.com Tue Jul 21 14:55:32 2009 From: ccaulfie at redhat.com (Christine Caulfield) Date: Tue, 21 Jul 2009 15:55:32 +0100 Subject: [Linux-cluster] CLVMD without GFS In-Reply-To: <29ae894c0907210740n35ce71d6r2c3ba7478118c6a6@mail.gmail.com> References: <29ae894c0907210221y395ab4bew95f78769a84eb141@mail.gmail.com> <4A6590AE.2040808@redhat.com> <29ae894c0907210416l523d8cf4t71f9b46adfbd4adf@mail.gmail.com> <4A65AB22.1030601@redhat.com> <29ae894c0907210511o28fa6a80j2b403d2ae6866d79@mail.gmail.com> <4A65C16C.20104@redhat.com> <29ae894c0907210740n35ce71d6r2c3ba7478118c6a6@mail.gmail.com> Message-ID: <4A65D6E4.1090905@redhat.com> It seems a little pointless to integrate clvmd with a failover system. They're almost totally different ways of running a cluster. clvmd assumes a symmetrical cluster (as you've found out) and is designed so that the LVs are available on all nodes for a cluster filesystem. Trying to make that sort of system work for a failover installation is always going to be awkward, it's not what it was designed for. That, in part I think, is why HA-LVM checks for a clustered VGs and declines to manage them. A resource should be controlled by one manager, not two, it's just asking for confusion. Basically you either use clvmd or HA-LVM; not both together. If you really want to write a resource manager to use clvmd then feel free, I don't have any references but others might. It's not an area I have ever had to go into. Good luck ;-) Chrissie On 07/21/2009 03:40 PM, brem belguebli wrote: > Hi, > That's what I 'm trying to do. > If you mean lvm.sh, well, I've been playing with it, but it does some > "sanity" checks that are wierd > > 1. It expects HA LVM to be setup (why such check if we want to use CLVM). > 2. it exits if it finds a CLVM VG (kind of funny !) > 3. it exits if the lvm.conf is newer than /boot/*.img (about this > one, we tend to prevent the cluster from automatically starting ...) > > I was looking to find some doc on how to write my own resources, ie CLVM > resource that checks if the vg is clustered, if so by which node is it > exclusively held, and if the node is down to activate exclusively the VG. > If you have some good links to provide me, that'll be great. > Thanks > > > 2009/7/21, Christine Caulfield >: > > On 07/21/2009 01:11 PM, brem belguebli wrote: > > Hi, > When creating the VG by default clustered, you implicitely > assume that > it will be used with a clustered FS on top of it (gfs, ocfs, etc...) > that will handle the active/active mode. 
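[Since writing a custom rgmanager resource comes up here: rgmanager agents are ordinary scripts driven by start/stop/status actions, so a custom CLVM agent can be sketched roughly as below. This is a hypothetical skeleton only (the vg_name parameter and the activation policy are assumptions, and a real agent also needs a meta-data action describing its parameters); it is not the lvm-cluster.sh script referenced later in this thread:]

    #!/bin/bash
    # hypothetical rgmanager-style agent: exclusively activate a clustered VG
    # OCF_RESKEY_vg_name is assumed to be supplied by rgmanager
    VG="${OCF_RESKEY_vg_name:-vg10}"

    case "$1" in
        start)
            vgchange -a e "$VG"      # take the exclusive activation lock via clvmd
            ;;
        stop)
            vgchange -a n "$VG"      # deactivate locally and release the lock
            ;;
        status|monitor)
            # succeed only if at least one LV of the VG is active on this node
            lvs --noheadings -o lv_attr "$VG" | grep -q a
            ;;
        *)
            echo "usage: $0 {start|stop|status|monitor}"
            exit 1
            ;;
    esac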
> As I do not intend to use GFS in this particular case, but ext3 > and raw > devices, I need to make sure the vg is exclusively activated on one > node, preventing the other nodes to access it unless it is the > failover > procedure (node holding the VG crashed) and then re activate it > exclusively on the failover node. > Thanks > > > > In that case you probably ought to be using rgmanager to do the > failover for you. It has a script for doing exactly this :-) > > Chrissie > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > ------------------------------------------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From rmicmirregs at gmail.com Tue Jul 21 14:56:25 2009 From: rmicmirregs at gmail.com (Rafael =?ISO-8859-1?Q?Mic=F3?= Miranda) Date: Tue, 21 Jul 2009 16:56:25 +0200 Subject: [Linux-cluster] CLVMD without GFS In-Reply-To: <29ae894c0907210740n35ce71d6r2c3ba7478118c6a6@mail.gmail.com> References: <29ae894c0907210221y395ab4bew95f78769a84eb141@mail.gmail.com> <4A6590AE.2040808@redhat.com> <29ae894c0907210416l523d8cf4t71f9b46adfbd4adf@mail.gmail.com> <4A65AB22.1030601@redhat.com> <29ae894c0907210511o28fa6a80j2b403d2ae6866d79@mail.gmail.com> <4A65C16C.20104@redhat.com> <29ae894c0907210740n35ce71d6r2c3ba7478118c6a6@mail.gmail.com> Message-ID: <1248188185.6464.2.camel@mecatol> Hi Brem, El mar, 21-07-2009 a las 16:40 +0200, brem belguebli escribi?: > Hi, > > That's what I 'm trying to do. > > If you mean lvm.sh, well, I've been playing with it, but it does some > "sanity" checks that are wierd > 1. It expects HA LVM to be setup (why such check if we want to > use CLVM). > 2. it exits if it finds a CLVM VG (kind of funny !) > 3. it exits if the lvm.conf is newer than /boot/*.img (about this > one, we tend to prevent the cluster from automatically > starting ...) > I was looking to find some doc on how to write my own resources, ie > CLVM resource that checks if the vg is clustered, if so by which node > is it exclusively held, and if the node is down to activate > exclusively the VG. > > If you have some good links to provide me, that'll be great. > > Thanks > > > 2009/7/21, Christine Caulfield : > On 07/21/2009 01:11 PM, brem belguebli wrote: > Hi, > When creating the VG by default clustered, you > implicitely assume that > it will be used with a clustered FS on top of it (gfs, > ocfs, etc...) > that will handle the active/active mode. > As I do not intend to use GFS in this particular case, > but ext3 and raw > devices, I need to make sure the vg is exclusively > activated on one > node, preventing the other nodes to access it unless > it is the failover > procedure (node holding the VG crashed) and then re > activate it > exclusively on the failover node. > Thanks > > > In that case you probably ought to be using rgmanager to do > the failover for you. It has a script for doing exactly > this :-) > > Chrissie > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Please, check this link: https://www.redhat.com/archives/cluster-devel/2009-June/msg00020.html I found exactly the same problem as you, and i developed the "lvm-cluster.sh" script to solve the needs I had. 
You can find the script on the last message of the thread. I submitted it to make it part of the main project, but i have no news about that yet. I hope this helps. Cheers, Rafael -- Rafael Mic? Miranda From brem.belguebli at gmail.com Tue Jul 21 15:24:43 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Tue, 21 Jul 2009 17:24:43 +0200 Subject: [Linux-cluster] CLVMD without GFS In-Reply-To: <1248188185.6464.2.camel@mecatol> References: <29ae894c0907210221y395ab4bew95f78769a84eb141@mail.gmail.com> <4A6590AE.2040808@redhat.com> <29ae894c0907210416l523d8cf4t71f9b46adfbd4adf@mail.gmail.com> <4A65AB22.1030601@redhat.com> <29ae894c0907210511o28fa6a80j2b403d2ae6866d79@mail.gmail.com> <4A65C16C.20104@redhat.com> <29ae894c0907210740n35ce71d6r2c3ba7478118c6a6@mail.gmail.com> <1248188185.6464.2.camel@mecatol> Message-ID: <29ae894c0907210824m751038fak861f5f9d2a8ca76b@mail.gmail.com> Hola Rafael, Thanks a lot, that'll avoid me going from scratch. I'll have a look at them and keep you updated. Brem 2009/7/21, Rafael Mic? Miranda : > > Hi Brem, > > El mar, 21-07-2009 a las 16:40 +0200, brem belguebli escribi?: > > Hi, > > > > That's what I 'm trying to do. > > > > If you mean lvm.sh, well, I've been playing with it, but it does some > > "sanity" checks that are wierd > > 1. It expects HA LVM to be setup (why such check if we want to > > use CLVM). > > 2. it exits if it finds a CLVM VG (kind of funny !) > > 3. it exits if the lvm.conf is newer than /boot/*.img (about this > > one, we tend to prevent the cluster from automatically > > starting ...) > > I was looking to find some doc on how to write my own resources, ie > > CLVM resource that checks if the vg is clustered, if so by which node > > is it exclusively held, and if the node is down to activate > > exclusively the VG. > > > > If you have some good links to provide me, that'll be great. > > > > Thanks > > > > > > 2009/7/21, Christine Caulfield : > > On 07/21/2009 01:11 PM, brem belguebli wrote: > > Hi, > > When creating the VG by default clustered, you > > implicitely assume that > > it will be used with a clustered FS on top of it (gfs, > > ocfs, etc...) > > that will handle the active/active mode. > > As I do not intend to use GFS in this particular case, > > but ext3 and raw > > devices, I need to make sure the vg is exclusively > > activated on one > > node, preventing the other nodes to access it unless > > it is the failover > > procedure (node holding the VG crashed) and then re > > activate it > > exclusively on the failover node. > > Thanks > > > > > > In that case you probably ought to be using rgmanager to do > > the failover for you. It has a script for doing exactly > > this :-) > > > > Chrissie > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > Please, check this link: > > https://www.redhat.com/archives/cluster-devel/2009-June/msg00020.html > > I found exactly the same problem as you, and i developed the > "lvm-cluster.sh" script to solve the needs I had. You can find the > script on the last message of the thread. > > I submitted it to make it part of the main project, but i have no news > about that yet. > > I hope this helps. > > Cheers, > > Rafael > > -- > Rafael Mic? 
Miranda > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From brem.belguebli at gmail.com Tue Jul 21 15:19:56 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Tue, 21 Jul 2009 17:19:56 +0200 Subject: [Linux-cluster] CLVMD without GFS In-Reply-To: <4A65D6E4.1090905@redhat.com> References: <29ae894c0907210221y395ab4bew95f78769a84eb141@mail.gmail.com> <4A6590AE.2040808@redhat.com> <29ae894c0907210416l523d8cf4t71f9b46adfbd4adf@mail.gmail.com> <4A65AB22.1030601@redhat.com> <29ae894c0907210511o28fa6a80j2b403d2ae6866d79@mail.gmail.com> <4A65C16C.20104@redhat.com> <29ae894c0907210740n35ce71d6r2c3ba7478118c6a6@mail.gmail.com> <4A65D6E4.1090905@redhat.com> Message-ID: <29ae894c0907210819g6ccf00f0o79fc3d85c20dbe14@mail.gmail.com> Well, Pointless, I'm not sure as you take advantage of having all the other nodes in the cluster updated if a LVM metadata is modified by the node holding the VG. Second point, HA-LVM aka hosttags has, IMHO, a security problem as anyone could modify the hosttag on a VG without any problem (no locking mechanisms as CLVM). I have nothing against Clustered FS, but in my specific case, I have to host serveral Sybase Dataservers on some clusters, and the only acceptable option for my DBA's is to use raw devices. I never meant to combine HA-LVM and CLVM, I consider them mutualy exclusive. Regards 2009/7/21, Christine Caulfield : > > It seems a little pointless to integrate clvmd with a failover system. > They're almost totally different ways of running a cluster. clvmd assumes a > symmetrical cluster (as you've found out) and is designed so that the LVs > are available on all nodes for a cluster filesystem. Trying to make that > sort of system work for a failover installation is always going to be > awkward, it's not what it was designed for. > > That, in part I think, is why HA-LVM checks for a clustered VGs and > declines to manage them. A resource should be controlled by one manager, not > two, it's just asking for confusion. > > Basically you either use clvmd or HA-LVM; not both together. > > If you really want to write a resource manager to use clvmd then feel free, > I don't have any references but others might. It's not an area I have ever > had to go into. > > Good luck ;-) > > Chrissie > > > > On 07/21/2009 03:40 PM, brem belguebli wrote: > >> Hi, >> That's what I 'm trying to do. >> If you mean lvm.sh, well, I've been playing with it, but it does some >> "sanity" checks that are wierd >> >> 1. It expects HA LVM to be setup (why such check if we want to use >> CLVM). >> 2. it exits if it finds a CLVM VG (kind of funny !) >> 3. it exits if the lvm.conf is newer than /boot/*.img (about this >> one, we tend to prevent the cluster from automatically starting ...) >> >> I was looking to find some doc on how to write my own resources, ie CLVM >> resource that checks if the vg is clustered, if so by which node is it >> exclusively held, and if the node is down to activate exclusively the VG. >> If you have some good links to provide me, that'll be great. >> Thanks >> >> >> 2009/7/21, Christine Caulfield > >: >> >> On 07/21/2009 01:11 PM, brem belguebli wrote: >> >> Hi, >> When creating the VG by default clustered, you implicitely >> assume that >> it will be used with a clustered FS on top of it (gfs, ocfs, >> etc...) >> that will handle the active/active mode. 
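[brem's concern about host tags is easy to demonstrate in practice. The commands below are a sketch (vgXX and the node names are placeholders) showing that nothing in plain LVM stops a second node from re-tagging a volume group it can see:]

    vgs -o vg_name,vg_tags vgXX     # see which host tag currently claims the VG
    vgchange --deltag nodeA vgXX    # any node with access to the PVs may drop the tag...
    vgchange --addtag nodeB vgXX    # ...and add its own, defeating the HA-LVM convention

[By contrast, the exclusive clvmd lock is arbitrated by the DLM, which is the property brem wants to keep.]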
>> As I do not intend to use GFS in this particular case, but ext3 >> and raw >> devices, I need to make sure the vg is exclusively activated on one >> node, preventing the other nodes to access it unless it is the >> failover >> procedure (node holding the VG crashed) and then re activate it >> exclusively on the failover node. >> Thanks >> >> >> >> In that case you probably ought to be using rgmanager to do the >> failover for you. It has a script for doing exactly this :-) >> >> Chrissie >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> >> ------------------------------------------------------------------------ >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From azzopardi at eib.org Wed Jul 22 09:33:32 2009 From: azzopardi at eib.org (AZZOPARDI Konrad) Date: Wed, 22 Jul 2009 11:33:32 +0200 Subject: [Linux-cluster] redhat cluster - veritas Message-ID: <0KN600ED7FVXVID0@comexp1-srv.lux.eib.org> Dear all, I am new to RedHat cluster and I am looking into different architectural models. In my workplace we have two datacentres with a SAN in each DC. Now we do not have the luxury of SAN replication so I was looking into doing this using host based mirroring. From what I read, the new RedHat 5.3 supports CLVM host based mirrors but to be honest I am quite reluctant to use something so new and I cannot find any references of anyone using it in PROD, which leads me to another question. Can RedHat cluster work with other volume managers such as veritas. Probably the related question would be whether GFS can with veritas volume manager. Thank you for any responses. konrad -------------------------------------------------------------------- Les informations contenues dans ce message et/ou ses annexes sont reservees a l'attention et a l'utilisation de leur destinataire et peuvent etre confidentielles. Si vous n'etes pas destinataire de ce message, vous etes informes que vous l'avez recu par erreur et que toute utilisation en est interdite. Dans ce cas, vous etes pries de le detruire et d'en informer la Banque Europeenne d'Investissement. The information in this message and/or attachments is intended solely for the attention and use of the named addressee and may be confidential. If you are not the intended recipient, you are hereby notified that you have received this transmittal in error and that any use of it is prohibited. In such a case please delete this message and kindly notify the European Investment Bank accordingly. -------------------------------------------------------------------- -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From pasik at iki.fi Wed Jul 22 12:30:07 2009 From: pasik at iki.fi (Pasi =?iso-8859-1?Q?K=E4rkk=E4inen?=) Date: Wed, 22 Jul 2009 15:30:07 +0300 Subject: [Linux-cluster] Virtual service using CLVM not migrating In-Reply-To: <2B9A98D7-A9C6-4210-B1B6-9B6306E21362@netspot.com.au> References: <2B9A98D7-A9C6-4210-B1B6-9B6306E21362@netspot.com.au> Message-ID: <20090722123007.GW24960@edu.joroinen.fi> On Mon, Aug 25, 2008 at 03:49:49PM +0930, Tom Lanyon wrote: > Hi list, > > (let me know if this should be on the xen list, but I think it's an > issue with clvm locking a logical volume) > > I have a three node RHEL5 cluster running some virtual machines. The > virtual machines use a LVM LV as their root which is available cluster- > wide via clvmd. > > Live migration between cluster nodes seems to work well when running > one-vm-per-node exclusively, but fails when a node is running more > than one virtual machine. > Hi! Did you ever figure out the reason for this behaviour? Have you tried it again with RHEL 5.3 ? -- Pasi From gianluca.cecchi at gmail.com Wed Jul 22 15:12:40 2009 From: gianluca.cecchi at gmail.com (Gianluca Cecchi) Date: Wed, 22 Jul 2009 17:12:40 +0200 Subject: [Linux-cluster] lvm2: cluster request failed: Unknown error 65538 Message-ID: <561c252c0907220812l152462d9o27cd64ba37c017a4@mail.gmail.com> Hello, by mistake I previously sent this to fedora-list. I resend to the appropriate list I wanted... Excuse in advance for eventual cross-posting effects for anyone... fedora11 x86_64 with lvm2, device-mapper and related packages updated at : lvm2-2.02.48-1.fc11.x86_64 lvm2-cluster-2.02.48-1.fc11.x86_64 device-mapper-1.02.33-1.fc11.x86_64 device-mapper-libs-1.02.33-1.fc11.x86_64 I have 2 VGs: vg_virtfed that is a system vg (with root lv ans swap lv) and vg_qemu01 that is a clustered vg. At the moment only one node active [root virtfed ~]# service clvmd status clvmd (pid 2581) is running... active volumes: centos53 centos53_cldisk1 centos53_cldisk2 centos53_disk2 centos53_node02 centos53_node02_disk2 centos53_qdisk test_vm_drbd w2k3_01 lv_root lv_swap - it is strange that between the active volumes I have also the ones owned by vg_virtfed, that is not a clustered vg vgs command gives [root virtfed ~]# cman_tool status Version: 6.2.0 Config Version: 3 Cluster Name: kvm Cluster Id: 773 Cluster Member: Yes Cluster Generation: 196 Membership state: Cluster-Member Nodes: 1 Expected votes: 1 Total votes: 1 Node votes: 1 Quorum: 1 Active subsystems: 10 Flags: 2node HaveState Ports Bound: 0 177 Node name: kvm1 Node ID: 1 Multicast addresses: 239.192.3.8 Node addresses: 192.168.10.1 I have no services or filesystem on cluster. Only the infra with cman and clvmd [root virtfed ~]# clustat Cluster Status for kvm @ Wed Jul 22 16:00:14 2009 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ kvm1 1 Online, Local kvm2 2 Offline [root virtfed ~]# vgs cluster request failed: Unknown error 65538 cluster request failed: Unknown error 65538 VG #PV #LV #SN Attr VSize VFree vg_qemu01 1 9 0 wz--nc 52.12G 13.07G vg_virtfed 1 2 0 wz--n- 16.00G 0 Internal error: Volume Group vg_qemu01 was not unlocked Internal error: Volume Group vg_virtfed was not unlocked Device '/dev/drbd0' has been left open. Device '/dev/block/104:2' has been left open. Device '/dev/drbd0' has been left open. Device '/dev/block/104:2' has been left open. Device '/dev/block/104:2' has been left open. Device '/dev/drbd0' has been left open. Device '/dev/block/104:2' has been left open. 
so it seems that the cluster attribute is correctly present only for vg_qemu01 It seems that I can do normal operations such as adding an LV and so on, but I don't understand the errors about Internal error: Volume Group vg_qemu01 was not unlocked and Device '/dev/drbd0' has been left open. Any clues? dumpconfig from lvm> prompt gives: devices { dir="/dev" scan="/dev" preferred_names=[] filter=["a|^/dev/cciss/c0d0p2$|", "a|drbd.*|", "r|.*|"] cache_dir="/etc/lvm/cache" cache_file_prefix="" write_cache_state=1 sysfs_scan=1 md_component_detection=1 md_chunk_alignment=1 data_alignment=0 ignore_suspended_devices=0 } activation { missing_stripe_filler="error" reserved_stack=256 reserved_memory=8192 process_priority=-18 mirror_region_size=512 readahead="auto" mirror_log_fault_policy="allocate" mirror_device_fault_policy="remove" } global { umask=63 test=0 units="h" activation=1 proc="/proc" locking_type=3 fallback_to_clustered_locking=1 fallback_to_local_locking=1 locking_dir="/var/lock/lvm" } shell { history_size=100 } backup { backup=1 backup_dir="/etc/lvm/backup" archive=1 archive_dir="/etc/lvm/archive" retain_min=10 retain_days=30 } log { verbose=0 syslog=1 file="/var/log/lvm2.log" overwrite=0 level=6 indent=1 command_names=0 prefix=" " } I set log verbosse (level 6) and when I run vgs command i get inside the file /var/log/lvm2.log: config/config.c:955 log/activation not found in config: defaulting to 0 commands/toolcontext.c:189 Logging initialised at Wed Jul 22 15:58:19 2009 config/config.c:950 Setting global/umask to 63 commands/toolcontext.c:210 Set umask to 0077 config/config.c:927 Setting devices/dir to /dev config/config.c:927 Setting global/proc to /proc config/config.c:950 Setting global/activation to 1 config/config.c:955 global/suffix not found in config: defaulting to 1 config/config.c:927 Setting global/units to h config/config.c:927 Setting activation/readahead to auto config/config.c:927 Setting activation/missing_stripe_filler to error device/dev-cache.c:485 devices/preferred_names not found in config file: using built-in preferences config/config.c:950 Setting devices/ignore_suspended_devices to 0 config/config.c:927 Setting devices/cache_dir to /etc/lvm/cache config/config.c:950 Setting devices/write_cache_state to 1 filters/filter-persistent.c:131 Loaded persistent filter cache from /etc/lvm/cache/.cache config/config.c:950 Setting activation/reserved_stack to 256 config/config.c:950 Setting activation/reserved_memory to 8192 config/config.c:950 Setting activation/process_priority to -18 format1/format1.c:532 Initialised format: lvm1 format_pool/format_pool.c:333 Initialised format: pool format_text/format-text.c:2015 Initialised format: lvm2 config/config.c:933 global/format not found in config: defaulting to lvm2 striped/striped.c:228 Initialised segtype: striped zero/zero.c:110 Initialised segtype: zero error/errseg.c:113 Initialised segtype: error freeseg/freeseg.c:57 Initialised segtype: free snapshot/snapshot.c:312 Initialised segtype: snapshot mirror/mirrored.c:579 Initialised segtype: mirror config/config.c:950 Setting backup/retain_days to 30 config/config.c:950 Setting backup/retain_min to 10 config/config.c:927 Setting backup/archive_dir to /etc/lvm/archive config/config.c:927 Setting backup/backup_dir to /etc/lvm/backup config/config.c:955 global/fallback_to_lvm1 not found in config: defaulting to 1 config/config.c:950 Setting global/locking_type to 3 locking/locking.c:253 Cluster locking selected. 
config/config.c:955 report/aligned not found in config: defaulting to 1 config/config.c:955 report/buffered not found in config: defaulting to 1 config/config.c:955 report/headings not found in config: defaulting to 1 config/config.c:933 report/separator not found in config: defaulting to config/config.c:955 report/prefixes not found in config: defaulting to 0 config/config.c:955 report/quoted not found in config: defaulting to 1 config/config.c:955 report/columns_as_rows not found in config: defaulting to 0 config/config.c:933 report/vgs_sort not found in config: defaulting to vg_name config/config.c:933 report/vgs_cols not found in config: defaulting to vg_name,pv_count,lv_count,snap_count,vg_attr,vg_size,vg_free toollib.c:572 Finding all volume groups label/label.c:160 /dev/drbd0: lvm2 label detected label/label.c:160 /dev/block/104:2: lvm2 label detected label/label.c:184 /dev/dm-10: No label detected locking/cluster_locking.c:458 Locking VG V_vg_virtfed CR B (0x1) toollib.c:468 Finding volume group "vg_virtfed" label/label.c:160 /dev/block/104:2: lvm2 label detected config/config.c:933 description not found in config: defaulting to config/config.c:927 Setting description to Created *after* executing 'lvextend -l +100%FREE /dev/vg_virtfed/lv_root' locking/cluster_locking.c:458 Locking VG V_vg_virtfed UN B (0x6) locking/cluster_locking.c:159 cluster request failed: Unknown error 65538 locking/cluster_locking.c:458 Locking VG V_vg_qemu01 CR B (0x1) toollib.c:468 Finding volume group "vg_qemu01" label/label.c:160 /dev/drbd0: lvm2 label detected config/config.c:933 description not found in config: defaulting to config/config.c:927 Setting description to Created *after* executing '/sbin/lvcreate --name test_vm_drbd -L 4194304K /dev/vg_qemu01' locking/cluster_locking.c:458 Locking VG V_vg_qemu01 UN B (0x6) locking/cluster_locking.c:159 cluster request failed: Unknown error 65538 libdm-report.c:805 VG #PV #LV #SN Attr VSize VFree libdm-report.c:1065 vg_qemu01 1 9 0 wz--nc 52.12G 13.07G libdm-report.c:1065 vg_virtfed 1 2 0 wz--n- 16.00G 0 filters/filter-persistent.c:199 Dumping persistent device cache to /etc/lvm/cache/.cache misc/lvm-file.c:236 Locking /etc/lvm/cache/.cache (F_WRLCK, 1) misc/lvm-file.c:265 Unlocking fd 5 cache/lvmcache.c:1241 Wiping internal VG cache cache/lvmcache.c:1235 Internal error: Volume Group vg_qemu01 was not unlocked cache/lvmcache.c:1235 Internal error: Volume Group vg_virtfed was not unlocked device/dev-cache.c:564 Device '/dev/drbd0' has been left open. device/dev-cache.c:564 Device '/dev/block/104:2' has been left open. device/dev-cache.c:564 Device '/dev/drbd0' has been left open. device/dev-cache.c:564 Device '/dev/block/104:2' has been left open. device/dev-cache.c:564 Device '/dev/block/104:2' has been left open. device/dev-cache.c:564 Device '/dev/drbd0' has been left open. device/dev-cache.c:564 Device '/dev/block/104:2' has been left open. 
Thanks, Gianluca From brem.belguebli at gmail.com Wed Jul 22 17:40:40 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Wed, 22 Jul 2009 19:40:40 +0200 Subject: [Linux-cluster] redhat cluster - veritas In-Reply-To: <0KN600ED7FVXVID0@comexp1-srv.lux.eib.org> References: <0KN600ED7FVXVID0@comexp1-srv.lux.eib.org> Message-ID: <29ae894c0907221040j3edd77dex37a8b42ed2cd2286@mail.gmail.com> Hello Konrad, LVM mirroring has some stuff to be aware of before using it in production: 1) till now, no live extension, you need to break the mirror before growing it 2) the mirror metadata has to reside on a 3rd disk if you do not want to resynchronise entirely your mirror at boot time, if it is not a problem for you, you can put the metadata in memory (lost at reboot) I'm not sure anyone has already experienced VX storage foundation on top of RHCS, I'm ready to bet you would be the first. 2009/7/22, AZZOPARDI Konrad : > > Dear all, > > I am new to RedHat cluster and I am looking into different architectural > models. In my workplace we have two datacentres with a SAN in each DC. Now > we do not have the luxury of SAN replication so I was looking into doing > this using host based mirroring. From what I read, the new RedHat 5.3 > supports CLVM host based mirrors but to be honest I am quite reluctant to > use something so new and I cannot find any references of anyone using it in > PROD, which leads me to another question. Can RedHat cluster work with other > volume managers such as veritas. Probably the related question would be > whether GFS can with veritas volume manager. > > Thank you for any responses. > > konrad > > -------------------------------------------------------------------- > > Les informations contenues dans ce message et/ou ses annexes sont > reservees a l'attention et a l'utilisation de leur destinataire et peuvent etre > confidentielles. Si vous n'etes pas destinataire de ce message, vous etes > informes que vous l'avez recu par erreur et que toute utilisation en est > interdite. Dans ce cas, vous etes pries de le detruire et d'en informer la > Banque Europeenne d'Investissement. > > The information in this message and/or attachments is intended solely for > the attention and use of the named addressee and may be confidential. If > you are not the intended recipient, you are hereby notified that you have > received this transmittal in error and that any use of it is prohibited. In > such a case please delete this message and kindly notify the European > Investment Bank accordingly. > -------------------------------------------------------------------- > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From td3201 at gmail.com Wed Jul 22 18:26:15 2009 From: td3201 at gmail.com (Terry) Date: Wed, 22 Jul 2009 13:26:15 -0500 Subject: [Linux-cluster] Re: determining fsid for fs resource In-Reply-To: <8ee061010907170905x3468960foc938215865056a10@mail.gmail.com> References: <8ee061010907170905x3468960foc938215865056a10@mail.gmail.com> Message-ID: <8ee061010907221126r4329f2bbp19851aea365320ba@mail.gmail.com> On Fri, Jul 17, 2009 at 11:05 AM, Terry wrote: > Hello, > > When I create a fs resource using redhat's luci, it is able to find > the fsid for a fs and life is good. ?However, I am not crazy about > luci and would prefer to manually create the resources from the > command line but how do I find the fsid for a filesystem? 
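[On the fsid question: as Wendy Cheng's reply below explains, the value only has to be a unique integer per export, so one workable habit is to derive it from the mount point rather than inventing numbers by hand. A sketch of a hypothetical helper (the modulo merely keeps the number small, it is not a requirement, and uniqueness across all fs resources is still your responsibility):]

    # derive a stable fsid from the mount point
    MOUNTPOINT=/data01i
    FSID=$(( $(echo -n "$MOUNTPOINT" | cksum | cut -d' ' -f1) % 65536 ))
    echo "fsid=$FSID"   # paste into the <fs .../> resource; keep it unique per export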
?Here's an > example of a fs resource created using luci: > > force_unmount="1" fsid="49256" fstype="ext3" mountpoint="/data01i" > name="omadvnfs01-data01i" > options="noatime,nodiratime,data=writeback,commit=30" self_fence="0"/> > > Thanks! > Anyone have an idea for this? From s.wendy.cheng at gmail.com Wed Jul 22 19:53:31 2009 From: s.wendy.cheng at gmail.com (Wendy Cheng) Date: Wed, 22 Jul 2009 12:53:31 -0700 Subject: [Linux-cluster] Re: determining fsid for fs resource In-Reply-To: <8ee061010907221126r4329f2bbp19851aea365320ba@mail.gmail.com> References: <8ee061010907170905x3468960foc938215865056a10@mail.gmail.com> <8ee061010907221126r4329f2bbp19851aea365320ba@mail.gmail.com> Message-ID: <4A676E3B.3010507@gmail.com> Terry wrote: > On Fri, Jul 17, 2009 at 11:05 AM, Terry wrote: > >> Hello, >> >> When I create a fs resource using redhat's luci, it is able to find >> the fsid for a fs and life is good. However, I am not crazy about >> luci and would prefer to manually create the resources from the >> command line but how do I find the fsid for a filesystem? Here's an >> example of a fs resource created using luci: >> >> > force_unmount="1" fsid="49256" fstype="ext3" mountpoint="/data01i" >> name="omadvnfs01-data01i" >> options="noatime,nodiratime,data=writeback,commit=30" self_fence="0"/> >> >> Thanks! >> >> > > Anyone have an idea for this? > IIRC, you basically have to make up the key (fsid) by yourself. Just pick any number (integer) that is less then 2**32 - but make sure it is unique per-filesystem-per-export while NFS service is up and running. That is, if you plan to export the same filesystem via two export entries (or say, export two different directories from the very same filesystem) , you need two fsids. If you have x exports (regardless they are from the same filesystem or different filessytems) at the same time, you would need x fsid(s). This is mostly to do with NFS export (internally represented by an unsigned integer) - don't confuse it with filesystem id (that is obtained via stat system call family). -- Wendy From jason at monsterjam.org Wed Jul 22 23:15:40 2009 From: jason at monsterjam.org (Jason Welsh) Date: Wed, 22 Jul 2009 19:15:40 -0400 Subject: [Linux-cluster] quickie GFS with lock_nolock question Message-ID: <4A679D9C.4010208@monsterjam.org> we have a funky custom application that doesnt seem to like reading/writing to the shared GFS filesystem we have on our 2node cluster. someone at the vendor suggested just having the main node mount the GFS filesystem with "lock_nolock" to see if that would fix the problem. What is the correct way to go through the motions to do this? because currently they both mount up the gfs partition and I believe its a dependency for the other services. feel free to direct me to the right FM to RTFM. Jason From rhurst at bidmc.harvard.edu Thu Jul 23 01:21:13 2009 From: rhurst at bidmc.harvard.edu (rhurst at bidmc.harvard.edu) Date: Wed, 22 Jul 2009 21:21:13 -0400 Subject: [Linux-cluster] quickie GFS with lock_nolock question References: <4A679D9C.4010208@monsterjam.org> Message-ID: For example, in /etc/fstab an entry could look like: /dev/VGCCC/lvolshare /cluster/share gfs lockproto=lock_nolock 0 0 An ad-hoc mount could look like: mount -t gfs -o lockproto=lock_nolock /dev/VGCCC/lvolshare /cluster/share Make certain that this filesystem is not mounted anywhere else before overriding, because it will corrupt. Robert Hurst, Sr. Cach? 
Administrator Beth Israel Deaconess Medical Center 1135 Tremont Street, REN-7 Boston, Massachusetts 02120-2140 617-754-8754 ? Fax: 617-754-8730 ? Cell: 401-787-3154 Any technology distinguishable from magic is insufficiently advanced. ________________________________ From: linux-cluster-bounces at redhat.com on behalf of Jason Welsh Sent: Wed 7/22/2009 7:15 PM To: linux clustering Subject: [Linux-cluster] quickie GFS with lock_nolock question we have a funky custom application that doesnt seem to like reading/writing to the shared GFS filesystem we have on our 2node cluster. someone at the vendor suggested just having the main node mount the GFS filesystem with "lock_nolock" to see if that would fix the problem. What is the correct way to go through the motions to do this? because currently they both mount up the gfs partition and I believe its a dependency for the other services. feel free to direct me to the right FM to RTFM. Jason -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlieb-linux-cluster at budge.apana.org.au Thu Jul 23 16:23:46 2009 From: charlieb-linux-cluster at budge.apana.org.au (Charlie Brady) Date: Thu, 23 Jul 2009 12:23:46 -0400 (EDT) Subject: [Linux-cluster] quickie GFS with lock_nolock question In-Reply-To: <4A679D9C.4010208@monsterjam.org> References: <4A679D9C.4010208@monsterjam.org> Message-ID: On Wed, 22 Jul 2009, Jason Welsh wrote: > we have a funky custom application that doesnt seem to like > reading/writing to the shared GFS filesystem ... What is your definition of "doesn't seem to like reading/writing"? > we have on our 2node > cluster. someone at the vendor suggested just having the main node mount > the GFS filesystem with "lock_nolock" to see if that would fix the problem. "fix" the problem? Or just hide it? From charlieb-linux-cluster at budge.apana.org.au Thu Jul 23 16:25:18 2009 From: charlieb-linux-cluster at budge.apana.org.au (Charlie Brady) Date: Thu, 23 Jul 2009 12:25:18 -0400 (EDT) Subject: [Linux-cluster] kernel BUG at fs/gfs2/rgrp.c:1458! In-Reply-To: References: Message-ID: On Mon, 20 Jul 2009, Peter Schobel wrote: > We are experiencing fatal exceptions on a four node Linux cluster > using a gfs2 filesystem. Any help would be appreciated. Am happy to > provide additional info. > > kernel BUG at fs/gfs2/rgrp.c:1458! > invalid opcode: 0000 [#1] > SMP > last sysfs file: /devices/pci0000:00/0000:00:00.0/irq > Modules linked in: ipv6 xfrm_nalgo crypto_api lock_dlm gfs2 dlm > configfs sunrpc dm_round_robin dm_multipath scsi_dh video hwmon > backlight sbs i2c_ec > i2c_cord > CPU: 0 > EIP: 0060:[] Not tainted VLI > EFLAGS: 00010246 (2.6.18-128.1.16.el5PAE #1) > EIP is at gfs2_alloc_data+0x75/0x155 [gfs2] I think you should open a bug at https://bugzilla.redhat.com/. From jason at monsterjam.org Thu Jul 23 17:08:36 2009 From: jason at monsterjam.org (Jason Welsh) Date: Thu, 23 Jul 2009 13:08:36 -0400 Subject: [Linux-cluster] quickie GFS with lock_nolock question In-Reply-To: References: <4A679D9C.4010208@monsterjam.org> Message-ID: <4A689914.7050104@monsterjam.org> Charlie Brady wrote: > > On Wed, 22 Jul 2009, Jason Welsh wrote: > >> we have a funky custom application that doesnt seem to like >> reading/writing to the shared GFS filesystem ... > > What is your definition of "doesn't seem to like reading/writing"? > >> we have on our 2node >> cluster. 
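[To tie the lock_nolock suggestion together: the usual way to end up with a single node mounting the filesystem locally is to stop whatever cluster services use it, unmount it on every node, and only then remount with the lockproto override. A sketch reusing the fstab example quoted above (the service name is a placeholder):]

    # stop the rgmanager service that uses the GFS mount, if one exists
    clusvcadm -d my_gfs_service
    # unmount on EVERY node -- lock_nolock on a filesystem still mounted elsewhere will corrupt it
    umount /cluster/share
    # on the one node that keeps it, remount without cluster locking
    mount -t gfs -o lockproto=lock_nolock /dev/VGCCC/lvolshare /cluster/share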
someone at the vendor suggested just having the main node mount >> the GFS filesystem with "lock_nolock" to see if that would fix the >> problem. > > "fix" the problem? Or just hide it? well, we are trying to test to see if thats the real problem or not.. It might not be. to make it simple, I just added that argument in the fstab on the first server and shutdown the second server. the question I was trying to ask was without shutting down and rebooting the servers, whats the right way to back both servers out of the clustered service and manually mount up the GFS partition on the first node.. I tried doing the "cman_tool leave remove" but I got the error that there were still services running, I tried shutting down everything, but still had like 3 lingering.. at this point, I just shut one down and rebooted the other with the correct mount options and everything is fine now for our testing.. I will hopefully know later today if the application will behave with its data on GFS volume mounted by the single node. thanks/regards, Jason From pschobel at 1iopen.net Thu Jul 23 17:19:38 2009 From: pschobel at 1iopen.net (Peter Schobel) Date: Thu, 23 Jul 2009 10:19:38 -0700 Subject: [Linux-cluster] kernel BUG at fs/gfs2/rgrp.c:1458! In-Reply-To: References: Message-ID: I actually have found the bug filed at https://bugzilla.redhat.com/show_bug.cgi?id=499333 and posted my comments. Regards, Peter Schobel ~ On Thu, Jul 23, 2009 at 9:25 AM, Charlie Brady wrote: > > On Mon, 20 Jul 2009, Peter Schobel wrote: > >> We are experiencing fatal exceptions on a four node Linux cluster >> using a gfs2 filesystem. Any help would be appreciated. Am happy to >> provide additional info. >> >> kernel BUG at fs/gfs2/rgrp.c:1458! >> invalid opcode: 0000 [#1] >> SMP >> last sysfs file: /devices/pci0000:00/0000:00:00.0/irq >> Modules linked in: ipv6 xfrm_nalgo crypto_api lock_dlm gfs2 dlm >> configfs sunrpc dm_round_robin dm_multipath scsi_dh video hwmon >> backlight sbs i2c_ec >> i2c_cord >> CPU: ? ?0 >> EIP: ? ?0060:[] ? ?Not tainted VLI >> EFLAGS: 00010246 ? (2.6.18-128.1.16.el5PAE #1) >> EIP is at gfs2_alloc_data+0x75/0x155 [gfs2] > > I think you should open a bug at https://bugzilla.redhat.com/. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Peter Schobel ~ From scooter at cgl.ucsf.edu Thu Jul 23 21:08:21 2009 From: scooter at cgl.ucsf.edu (Scooter Morris) Date: Thu, 23 Jul 2009 14:08:21 -0700 Subject: [Linux-cluster] Dependencies in resources Message-ID: <4A68D145.6070706@cgl.ucsf.edu> Hi all, I saw in a message on the net about a *depends="service:xxxx" *option for services in cluster.conf for 5.3beta. Did this single-level dependency support make it into the 5.3 release? It would be really, really useful if it did! If not, can anyone suggest a way that I can have a multiple services depend on a single IP? Thanks in advance! -- scooter -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
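[One way to avoid the version conflict Guido describes is to make sure the node that was down comes back with the current cluster.conf before its cluster stack starts, instead of joining with the stale copy. A sketch (node2 is a placeholder hostname; the version numbers are the ones from the log above):]

    # on the node that stayed up, after editing and running "cman_tool version -r 7":
    grep config_version /etc/cluster/cluster.conf     # should report config_version="7"
    scp /etc/cluster/cluster.conf node2:/etc/cluster/cluster.conf
    # on node2, confirm the copy before starting the cluster stack
    grep config_version /etc/cluster/cluster.conf
    /etc/init.d/cman start    # or however cman/corosync is started on your distribution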
Name: scooter.vcf Type: text/x-vcard Size: 378 bytes Desc: not available URL: From alfredo.moralejo at roche.com Thu Jul 23 21:15:21 2009 From: alfredo.moralejo at roche.com (Moralejo, Alfredo) Date: Thu, 23 Jul 2009 23:15:21 +0200 Subject: [Linux-cluster] Dependencies in resources In-Reply-To: <4A68D145.6070706@cgl.ucsf.edu> References: <4A68D145.6070706@cgl.ucsf.edu> Message-ID: Take a look into https://inquiries.redhat.com/go/redhat/ReferenceArchitectureSAPClusterSuite Deploying Highly Available SAP Servers using Red Hat Cluster Suite Gives some example about dependencies on services etc.. A very good document ________________________________ From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Scooter Morris Sent: Thursday, July 23, 2009 11:08 PM To: linux clustering Subject: [Linux-cluster] Dependencies in resources Hi all, I saw in a message on the net about a depends="service:xxxx" option for services in cluster.conf for 5.3beta. Did this single-level dependency support make it into the 5.3 release? It would be really, really useful if it did! If not, can anyone suggest a way that I can have a multiple services depend on a single IP? Thanks in advance! -- scooter -------------- next part -------------- An HTML attachment was scrubbed... URL: From agx at sigxcpu.org Fri Jul 24 00:35:43 2009 From: agx at sigxcpu.org (Guido =?iso-8859-1?Q?G=FCnther?=) Date: Fri, 24 Jul 2009 02:35:43 +0200 Subject: [Linux-cluster] updating cluster.conf on one node, when the other is down Message-ID: <20090724003543.GA19353@bogon.sigxcpu.org> Hi, taking node2 down, updating the cluster configuration on node1 then using "cman_tool version -r 7" on node1 and then booting node2 gives the error below: corosync[1790]: [QUORUM] This node is within the primary component and will provide service. corosync[1790]: [QUORUM] Members[1]: corosync[1790]: [QUORUM] 2 corosync[1790]: [CLM ] CLM CONFIGURATION CHANGE corosync[1790]: [CLM ] New Configuration: corosync[1790]: [CLM ] r(0) ip(192.168.122.228) corosync[1790]: [CLM ] Members Left: corosync[1790]: [CLM ] Members Joined: corosync[1790]: [CLM ] CLM CONFIGURATION CHANGE corosync[1790]: [CLM ] New Configuration: corosync[1790]: [CLM ] r(0) ip(192.168.122.82) corosync[1790]: [CLM ] r(0) ip(192.168.122.228) corosync[1790]: [CLM ] Members Left: corosync[1790]: [CLM ] Members Joined: corosync[1790]: [CLM ] r(0) ip(192.168.122.82) corosync[1790]: [TOTEM ] A processor joined or left the membership and a new membership was formed. corosync[1790]: [CMAN ] Can't get updated config version 7, config file is version 5. corosync[1790]: [QUORUM] This node is within the primary component and will provide service. corosync[1790]: [QUORUM] Members[1]: corosync[1790]: [QUORUM] 2 corosync[1790]: [CMAN ] Node 1 conflict, remote config version id=7, local=5 corosync[1790]: [MAIN ] Completed service synchronization, ready to provide service. corosync[1790]: [CMAN ] Can't get updated config version 7, config file is version 5. Afterwards corosync is spinning with 100% cpu usage. This is cluster 3.0.0 with corosync/openais 1.0.0. cluster.conf is attached. Any ideas? 
Cheers, -- Guido -------------- next part -------------- From agx at sigxcpu.org Fri Jul 24 00:27:40 2009 From: agx at sigxcpu.org (Guido =?iso-8859-1?Q?G=FCnther?=) Date: Fri, 24 Jul 2009 02:27:40 +0200 Subject: [Linux-cluster] Cluster 3.0.0 final stable release In-Reply-To: <1247087412.7941.37.camel@cerberus.int.fabbione.net> References: <1247087412.7941.37.camel@cerberus.int.fabbione.net> Message-ID: <20090724002740.GA19266@bogon.sigxcpu.org> On Wed, Jul 08, 2009 at 11:10:12PM +0200, Fabio M. Di Nitto wrote: > The cluster team and its community are proud to announce the 3.0.0 final > release from the STABLE3 branch. > > "And now what?" > > The STABLE3 branch will continue to receive bug fixes and improvements > as feedback from our community and users will flow in. > Regular update releases will be available to sync with corosync/openais > releases and new kernels (for gfs1-kernel module). > > In order to build the 3.0.0 release you will need: > > - corosync 1.0.0 > - openais 1.0.0 > - linux kernel 2.6.29 > > The new source tarball can be downloaded here: > > ftp://sources.redhat.com/pub/cluster/releases/cluster-3.0.0.tar.gz > https://fedorahosted.org/releases/c/l/cluster/cluster-3.0.0.tar.gz > > At the same location is now possible to find separated tarballs for > fence-agents and resource-agents as previously announced > (http://www.redhat.com/archives/cluster-devel/2009-February/msg00003.htm) > > To report bugs or issues: > > https://bugzilla.redhat.com/ What would be the right component if not running on RHEL? -- Guido From scooter at cgl.ucsf.edu Fri Jul 24 16:59:46 2009 From: scooter at cgl.ucsf.edu (Scooter Morris) Date: Fri, 24 Jul 2009 09:59:46 -0700 Subject: [Linux-cluster] Dependencies in resources In-Reply-To: References: <4A68D145.6070706@cgl.ucsf.edu> Message-ID: <4A69E882.9080300@cgl.ucsf.edu> Alfredo, Thanks! That was very helpful, although I have a couple more questions for the list. What I am trying to do seems well suited for the existing "depend" and "depend-mode" capability. Essentially, I'm trying to use this to "group" services. So, assume I have 3 services: A, B, and C (in my case, A is an IP service, B and C both have an fs and a script resource). I want B and C to depend on A such that B and C will only start on the node where A is running. I assume I can do this by: