From linux at vfemail.net Mon Sep 1 08:54:43 2008 From: linux at vfemail.net (Alex) Date: Mon, 1 Sep 2008 11:54:43 +0300 Subject: [Linux-cluster] one click to start httpd on all nodes - possible? In-Reply-To: <38A48FA2F0103444906AD22E14F1B5A307F20245@mailxchg01.corp.opsource.net> References: <200808271301.57414.linux@vfemail.net> <38A48FA2F0103444906AD22E14F1B5A307F20245@mailxchg01.corp.opsource.net> Message-ID: <200809011154.43460.linux@vfemail.net> On Friday 29 August 2008 18:44, Jeff Stoner wrote: > If you are running httpd on all nodes, why are you managing httpd with > RHCS in the first place? Simply "chkconfig httpd on" and when the server > starts, httpd will start, too. > Hello Jeff, Here, http has been used just as an example. Could be any other service (ftp, proxy cache, etc) or shared resurce (a gfs volume mounted on all our servers). Here i am talking about functionality... > If you want to build a simple web server farm, this is more easily > accomplished using a load balancer (hardware or software) in front of > the web servers than with Cluster Services. > > Perhaps you could explain in more detail what you are trying to > accomplish. Are there an additional resources (file system mount, ip > address, etc.) associated with httpd. Under what conditions would you > start or stop httpd on a node? Simple: supposing that i have 3 thiered cluster model (1st thier configured for HA and load balancing, 2nd thier with N nodes acting as real servers, all running the same service and accesing a shared volume via iscsi, and 3rd thier acting as SAN), i need a "command center" to control (start/stop) a resourse/service globally in 2nd thier, on all N nodes. For example, currently i am using fstab to mount at boot time a shared GFS volume on all our cluster nodes. There are some cases, when i want that resource to be unmounted at the same time on all nodes... Is difficult and take time to ssh inside each node and stop a service accessing that resource and after that umount resource on each node (eg: service httpd stop && umount /dev/my_shared_volume). Maybe, what i want is not possible using cluster configuration, but i would like to know, how other peoples are are achieving this task. Regards, Alx > > > --Jeff > Sr. Systems Engineer > > OpSource, Inc. > http://www.opsource.net > "Your Success is Our Success" > > > -----Original Message----- > > From: linux-cluster-bounces at redhat.com > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Alex > > Sent: Wednesday, August 27, 2008 6:02 AM > > To: linux clustering > > Subject: [Linux-cluster] one click to start httpd on all > > nodes - possible? > > > > Hi all, > > > > I have 3 nodes, forming a cluster. How sould be configured a > > service in > > cluster.conf file in order to be able to stop or to start > > httpd daemon on all > > our nodes at the same time? All i can find in docs is related > > to failover > > scenario (stoping httpd on one node wil cause starting httpd > > on other node) > > which is not what i need. For nodes management i am using > > conga, so, i would > > like to have a service to do that? Is possible? If not, > > should i use other > > external tools (like nagios) to do that? 
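[Editor's note] One low-tech way to get the "command center" effect described above, without touching cluster.conf at all, is to drive the same two commands over ssh from a single admin host (a parallel shell such as pdsh or cssh does the same job). A minimal sketch, assuming passwordless root ssh; the node names are placeholders and the commands are the ones quoted in the thread:

    #!/bin/sh
    # Minimal sketch: stop httpd and unmount the shared GFS volume on every
    # node in one pass.  The node list and passwordless root ssh are
    # assumptions; substitute the real values for your cluster.
    NODES="node1 node2 node3"
    for n in $NODES; do
        echo "=== $n ==="
        ssh "root@$n" 'service httpd stop && umount /dev/my_shared_volume'
    done

Bringing everything back is the same loop with the mount and the httpd start in the opposite order.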
> > > > Regards, > > Alx > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From gpbuono at gmail.com Mon Sep 1 09:29:49 2008 From: gpbuono at gmail.com (Gian Paolo Buono) Date: Mon, 1 Sep 2008 11:29:49 +0200 Subject: [Linux-cluster] the cluster don't restart (clvmd) Message-ID: Hi, I have a cluster configuration with two node..this is my cluster.conf: ####################cluster.conf#################### ####################cluster.conf#################### I have tried to restart cluster without reboot because the command clustat on node 2 don't work ... but ther is a problem on fence device..this is the messages.. [root at yoda1 cluster]# /etc/init.d/cman start Starting cluster: Enabling workaround for Xend bridged networking... done Loading modules... done Mounting configfs... done Starting ccsd... done Starting cman... done Starting daemons... done Starting fencing... failed The follow the log: Sep 1 11:27:47 yoda1 groupd[8162]: found uncontrolled kernel object clvmd in /sys/kernel/dlm Sep 1 11:27:47 yoda1 groupd[8162]: local node must be reset to clear 1 uncontrolled instances of gfs and/or dlm Sep 1 11:27:47 yoda1 openais[8154]: [CMAN ] cman killed by node 2 because we were killed by cman_tool or other application Sep 1 11:27:47 yoda1 fence_node[8163]: Fence of "yoda1.cs.tin.it" was unsuccessful Sep 1 11:27:47 yoda1 fenced[8169]: cman_init error (nil) 111 Sep 1 11:27:47 yoda1 gfs_controld[8181]: cman_init error 111 Sep 1 11:27:57 yoda1 dlm_controld[8175]: group_init error (nil) 111 best regards -------------- next part -------------- An HTML attachment was scrubbed... URL: From jakub.suchy at enlogit.cz Mon Sep 1 15:03:49 2008 From: jakub.suchy at enlogit.cz (Jakub Suchy) Date: Mon, 1 Sep 2008 17:03:49 +0200 Subject: [Linux-cluster] the cluster don't restart (clvmd) In-Reply-To: References: Message-ID: <20080901150349.GA28998@nazorne.cz> Fence may be failing because you have NO fencing devices :) Jakub Suchy Gian Paolo Buono wrote: > Hi, > I have a cluster configuration with two node..this is my cluster.conf: > > ####################cluster.conf#################### > > > post_join_delay="3"/> > > > > > > > > > > > > > > > > ####################cluster.conf#################### > > I have tried to restart cluster without reboot because the command clustat > on node 2 don't work ... but ther is a problem on fence device..this is the > messages.. > > [root at yoda1 cluster]# /etc/init.d/cman start > Starting cluster: > Enabling workaround for Xend bridged networking... done > Loading modules... done > Mounting configfs... done > Starting ccsd... done > Starting cman... done > Starting daemons... done > Starting fencing... 
failed > > The follow the log: > Sep 1 11:27:47 yoda1 groupd[8162]: found uncontrolled kernel object clvmd > in /sys/kernel/dlm > Sep 1 11:27:47 yoda1 groupd[8162]: local node must be reset to clear 1 > uncontrolled instances of gfs and/or dlm > Sep 1 11:27:47 yoda1 openais[8154]: [CMAN ] cman killed by node 2 because > we were killed by cman_tool or other application > Sep 1 11:27:47 yoda1 fence_node[8163]: Fence of "yoda1.cs.tin.it" was > unsuccessful > Sep 1 11:27:47 yoda1 fenced[8169]: cman_init error (nil) 111 > Sep 1 11:27:47 yoda1 gfs_controld[8181]: cman_init error 111 > Sep 1 11:27:57 yoda1 dlm_controld[8175]: group_init error (nil) 111 > > best regards > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Jakub Such? GSM: +420 - 777 817 949 Enlogit s.r.o, U Cukrovaru 509/4, 400 07 ?st? nad Labem tel.: +420 - 474 745 159, fax: +420 - 474 745 160 e-mail: info at enlogit.cz, web: http://www.enlogit.cz Energy & Logic in IT From ccaulfie at redhat.com Mon Sep 1 15:14:15 2008 From: ccaulfie at redhat.com (Christine Caulfield) Date: Mon, 01 Sep 2008 16:14:15 +0100 Subject: [Linux-cluster] the cluster don't restart (clvmd) In-Reply-To: References: Message-ID: <48BC06C7.7010800@redhat.com> Gian Paolo Buono wrote: > Hi, > I have a cluster configuration with two node..this is my cluster.conf: > > ####################cluster.conf#################### > > > post_join_delay="3"/> > > > > > > > > > > > > > > > > ####################cluster.conf#################### > > I have tried to restart cluster without reboot because the command > clustat on node 2 don't work ... but ther is a problem on fence > device..this is the messages.. > > [root at yoda1 cluster]# /etc/init.d/cman start > Starting cluster: > Enabling workaround for Xend bridged networking... done > Loading modules... done > Mounting configfs... done > Starting ccsd... done > Starting cman... done > Starting daemons... done > Starting fencing... failed > > The follow the log: > Sep 1 11:27:47 yoda1 groupd[8162]: found uncontrolled kernel object > clvmd in /sys/kernel/dlm > Sep 1 11:27:47 yoda1 groupd[8162]: local node must be reset to clear 1 > uncontrolled instances of gfs and/or dlm > Sep 1 11:27:47 yoda1 openais[8154]: [CMAN ] cman killed by node 2 > because we were killed by cman_tool or other application > Sep 1 11:27:47 yoda1 fence_node[8163]: Fence of "yoda1.cs.tin.it > " was unsuccessful > Sep 1 11:27:47 yoda1 fenced[8169]: cman_init error (nil) 111 > Sep 1 11:27:47 yoda1 gfs_controld[8181]: cman_init error 111 > Sep 1 11:27:57 yoda1 dlm_controld[8175]: group_init error (nil) 111 > It's all failing to start because the cluster software wasn't shut down properly originally. ALL the daemons must be shut down and GFS filesystems mounted etc. Only then can you restart the cluster software. Looking at the messages I would guess that either clvmd was killed with -9 (there is a stray clvmd lockspace in existance) or the cluster was shutdown with "cman_tool leave force". Or maybe the daemons were killed by hand. In the event it's often easier to reboot ... Chrissie From jamesc at exa.com Mon Sep 1 23:55:48 2008 From: jamesc at exa.com (James Chamberlain) Date: Mon, 1 Sep 2008 19:55:48 -0400 Subject: [Linux-cluster] lm_dlm_cancel In-Reply-To: <81D8B57D-B9C8-4AA0-8BEC-F45212795FB6@exa.com> References: <81D8B57D-B9C8-4AA0-8BEC-F45212795FB6@exa.com> Message-ID: Hi all, Since I sent the below, the aforementioned cluster crashed. 
Now I can't mount the scratch112 filesystem. Attempts to do so crash the node trying to mount it. If I run gfs_fsck against it, I see the following: # gfs_fsck -nv /dev/s12/scratch112 Initializing fsck Initializing lists... Initializing special inodes... Validating Resource Group index. Level 1 check. 5834 resource groups found. (passed) Setting block ranges... Can't seek to last block in file system: 4969529913 Unable to determine the boundaries of the file system. Freeing buffers. Not being able to determine the boundaries of the file system seems like a very bad thing. However, LVM didn't complain in the slightest when I expanded the logical volume. How can I recover from this? Thanks, James On Aug 29, 2008, at 9:19 PM, James Chamberlain wrote: > Hi all, > > I'm trying to grow a GFS filesystem. I've grown this filesystem > before and everything went fine. However, when I issued gfs_grow > this time, I saw the following messages in my logs: > > Aug 29 21:04:13 s12n02 kernel: lock_dlm: lm_dlm_cancel 2,17 flags 80 > Aug 29 21:04:13 s12n02 kernel: lock_dlm: lm_dlm_cancel skip 2,17 > flags 100 > Aug 29 21:04:14 s12n02 kernel: lock_dlm: lm_dlm_cancel 2,17 flags 80 > Aug 29 21:04:14 s12n02 kernel: dlm: scratch112: (14239) dlm_unlock: > 10241 busy 2 > Aug 29 21:04:14 s12n02 kernel: lock_dlm: lm_dlm_cancel rv -16 2,17 > flags 40080 > > The last three lines of these log entries repeat themselves once a > second until I hit ^C. The filesystem appears to still be up and > accessible. Any thoughts on what's going on here and what I can do > about it? > > Thanks, > > James > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From gregory at steulet.org Tue Sep 2 12:49:17 2008 From: gregory at steulet.org (gregory steulet) Date: Tue, 02 Sep 2008 14:49:17 +0200 Subject: [Linux-cluster] Unable to retrieve batch 576881287 status from xxxx Service Manager not running on this node Message-ID: <1220359757-0f96762ead54f0b961f5fb5c743dba0a@steulet.org> Hi folks, I've a problem with luci, maybe did you already encounter this kind of problem. I add an IP ressource and a VIP service. Unfortunately my VIP service is monitored down. if I try to enable this service I get an error message like below Sep 2 14:31:51 emperor01 luci[8115]: Unable to retrieve batch 576881287 status from emperor01.high-availability.eu:11111: Service Manager not running on this node [root at emperor01 ~]# hostname emperor01.high-availability.eu [root at emperor01 ~]# vi /etc/hosts # Do not remove the following line, or various programs # that require network functionality will fail. 127.0.0.1 localhost.localdomain localhost 192.168.1.102 emperor02.high-availability.eu emperor02 192.168.1.101 emperor01.high-availability.eu emperor01 1.0.0.1 emperor01.int 1.0.0.2 emperor02.int vi /etc/cluster/cluster.conf Have a great day, best regards Greg From teigland at redhat.com Tue Sep 2 13:55:33 2008 From: teigland at redhat.com (David Teigland) Date: Tue, 2 Sep 2008 08:55:33 -0500 Subject: [Linux-cluster] lm_dlm_cancel In-Reply-To: References: <81D8B57D-B9C8-4AA0-8BEC-F45212795FB6@exa.com> Message-ID: <20080902135533.GB21199@redhat.com> On Mon, Sep 01, 2008 at 07:55:48PM -0400, James Chamberlain wrote: > Hi all, > > Since I sent the below, the aforementioned cluster crashed. Now I > can't mount the scratch112 filesystem. Attempts to do so crash the > node trying to mount it. 
If I run gfs_fsck against it, I see the > following: > > # gfs_fsck -nv /dev/s12/scratch112 > Initializing fsck > Initializing lists... > Initializing special inodes... > Validating Resource Group index. > Level 1 check. > 5834 resource groups found. > (passed) > Setting block ranges... > Can't seek to last block in file system: 4969529913 > Unable to determine the boundaries of the file system. > Freeing buffers. > > Not being able to determine the boundaries of the file system seems > like a very bad thing. However, LVM didn't complain in the slightest > when I expanded the logical volume. How can I recover from this? Looks like the killed gfs_grow left your fs is a bad condition. I believe Bob Peterson has addressed that recently. > >I'm trying to grow a GFS filesystem. I've grown this filesystem > >before and everything went fine. However, when I issued gfs_grow > >this time, I saw the following messages in my logs: > > > >Aug 29 21:04:13 s12n02 kernel: lock_dlm: lm_dlm_cancel 2,17 flags 80 > >Aug 29 21:04:13 s12n02 kernel: lock_dlm: lm_dlm_cancel skip 2,17 > >flags 100 > >Aug 29 21:04:14 s12n02 kernel: lock_dlm: lm_dlm_cancel 2,17 flags 80 > >Aug 29 21:04:14 s12n02 kernel: dlm: scratch112: (14239) dlm_unlock: > >10241 busy 2 > >Aug 29 21:04:14 s12n02 kernel: lock_dlm: lm_dlm_cancel rv -16 2,17 > >flags 40080 > > > >The last three lines of these log entries repeat themselves once a > >second until I hit ^C. The filesystem appears to still be up and > >accessible. Any thoughts on what's going on here and what I can do > >about it? Should be fixed by https://bugzilla.redhat.com/show_bug.cgi?id=438268 Dave From jamesc at exa.com Tue Sep 2 14:15:25 2008 From: jamesc at exa.com (James Chamberlain) Date: Tue, 2 Sep 2008 10:15:25 -0400 (EDT) Subject: [Linux-cluster] lm_dlm_cancel In-Reply-To: <20080902135533.GB21199@redhat.com> References: <81D8B57D-B9C8-4AA0-8BEC-F45212795FB6@exa.com> <20080902135533.GB21199@redhat.com> Message-ID: On Tue, 2 Sep 2008, David Teigland wrote: > On Mon, Sep 01, 2008 at 07:55:48PM -0400, James Chamberlain wrote: >> Hi all, >> >> Since I sent the below, the aforementioned cluster crashed. Now I >> can't mount the scratch112 filesystem. Attempts to do so crash the >> node trying to mount it. If I run gfs_fsck against it, I see the >> following: >> >> # gfs_fsck -nv /dev/s12/scratch112 >> Initializing fsck >> Initializing lists... >> Initializing special inodes... >> Validating Resource Group index. >> Level 1 check. >> 5834 resource groups found. >> (passed) >> Setting block ranges... >> Can't seek to last block in file system: 4969529913 >> Unable to determine the boundaries of the file system. >> Freeing buffers. >> >> Not being able to determine the boundaries of the file system seems >> like a very bad thing. However, LVM didn't complain in the slightest >> when I expanded the logical volume. How can I recover from this? > > Looks like the killed gfs_grow left your fs is a bad condition. > I believe Bob Peterson has addressed that recently. I think it was in a bad condition before I hit ^C rather than because I did. As I mentioned, I was getting the lm_dlm_cancel messages before I hit ^C. But I'd agree that one way or another, the gfs_grow operation somehow left the fs in a bad state. >>> I'm trying to grow a GFS filesystem. I've grown this filesystem >>> before and everything went fine. 
However, when I issued gfs_grow >>> this time, I saw the following messages in my logs: >>> >>> Aug 29 21:04:13 s12n02 kernel: lock_dlm: lm_dlm_cancel 2,17 flags 80 >>> Aug 29 21:04:13 s12n02 kernel: lock_dlm: lm_dlm_cancel skip 2,17 >>> flags 100 >>> Aug 29 21:04:14 s12n02 kernel: lock_dlm: lm_dlm_cancel 2,17 flags 80 >>> Aug 29 21:04:14 s12n02 kernel: dlm: scratch112: (14239) dlm_unlock: >>> 10241 busy 2 >>> Aug 29 21:04:14 s12n02 kernel: lock_dlm: lm_dlm_cancel rv -16 2,17 >>> flags 40080 >>> >>> The last three lines of these log entries repeat themselves once a >>> second until I hit ^C. The filesystem appears to still be up and >>> accessible. Any thoughts on what's going on here and what I can do >>> about it? > > Should be fixed by > https://bugzilla.redhat.com/show_bug.cgi?id=438268 Thanks Dave. Any idea if there's a corresponding patch for RHEL 4? Regards, James From pradhanparas at gmail.com Tue Sep 2 19:54:52 2008 From: pradhanparas at gmail.com (Paras pradhan) Date: Tue, 2 Sep 2008 14:54:52 -0500 Subject: [Linux-cluster] VM migration Message-ID: <8b711df40809021254k38457dbdi7ebf6e4c4a6e5df@mail.gmail.com> Hi, I am running a cluster having 2 nodes using red hat cluster suite in CentOS 5.2. node 1 has a para virtualized guest(centOS) running under Xen. My question is when node1 is rebooted, guest is automatically relocated to node 2 . Instead of relocation, is migration possible in this case which can result in Zero down time? Thanks in adv Paras. -------------- next part -------------- An HTML attachment was scrubbed... URL: From macscr at macscr.com Tue Sep 2 20:03:15 2008 From: macscr at macscr.com (Mark Chaney) Date: Tue, 2 Sep 2008 15:03:15 -0500 Subject: [Linux-cluster] VM migration In-Reply-To: <8b711df40809021254k38457dbdi7ebf6e4c4a6e5df@mail.gmail.com> References: <8b711df40809021254k38457dbdi7ebf6e4c4a6e5df@mail.gmail.com> Message-ID: <036001c90d36$f151e140$d3f5a3c0$@com> Are you using shared storage? Is far as I know, the current cluster suite wont do an automatic live migration to another server during a reboot, but it will do the live migration back after the reboot. Does that make sense? From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Paras pradhan Sent: Tuesday, September 02, 2008 2:55 PM To: linux-cluster at redhat.com Subject: [Linux-cluster] VM migration Hi, I am running a cluster having 2 nodes using red hat cluster suite in CentOS 5.2. node 1 has a para virtualized guest(centOS) running under Xen. My question is when node1 is rebooted, guest is automatically relocated to node 2 . Instead of relocation, is migration possible in this case which can result in Zero down time? Thanks in adv Paras. -------------- next part -------------- An HTML attachment was scrubbed... URL: From pradhanparas at gmail.com Tue Sep 2 20:08:52 2008 From: pradhanparas at gmail.com (Paras pradhan) Date: Tue, 2 Sep 2008 15:08:52 -0500 Subject: [Linux-cluster] VM migration In-Reply-To: <036001c90d36$f151e140$d3f5a3c0$@com> References: <8b711df40809021254k38457dbdi7ebf6e4c4a6e5df@mail.gmail.com> <036001c90d36$f151e140$d3f5a3c0$@com> Message-ID: <8b711df40809021308k53c511c7n3fb944fa94bbff1@mail.gmail.com> 2008/9/2 Mark Chaney > Are you using shared storage? Is far as I know, the current cluster suite > wont do an automatic live migration to another server during a reboot, but > it will do the live migration back after the reboot. Does that make sense? 
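[Editor's note] As an aside on migration versus relocation: rgmanager can live-migrate a Xen guest that is defined as a <vm> service, provided live migration is enabled for that resource and xend on both hosts accepts relocations. A sketch of the manual step, assuming the installed clusvcadm supports the -M (migrate) operation; "vm:guest1" and "node2" stand in for the real service and member names:

    # Live-migrate the Xen guest off the node that is about to be rebooted.
    clusvcadm -M vm:guest1 -m node2

    # confirm where the service ended up
    clustat

Running this by hand before rebooting a node avoids the stop/start that a plain relocation implies.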
> > > > *From:* linux-cluster-bounces at redhat.com [mailto: > linux-cluster-bounces at redhat.com] *On Behalf Of *Paras pradhan > *Sent:* Tuesday, September 02, 2008 2:55 PM > *To:* linux-cluster at redhat.com > *Subject:* [Linux-cluster] VM migration > > > > Hi, > > > > I am running a cluster having 2 nodes using red hat cluster suite in CentOS > 5.2. node 1 has a para virtualized guest(centOS) running under Xen. My > question is when node1 is rebooted, guest is automatically relocated to node > 2 . Instead of relocation, is migration possible in this case which can > result in Zero down time? > > > > > > Thanks in adv > > Paras. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Exactly as you said. It is relocating and as soon as the original node comes back it is migrated automatically. Yes I am using shared storage using SAN/GFS2. Any workaround on this to make migration possible inserted of relocation Paras. -------------- next part -------------- An HTML attachment was scrubbed... URL: From emuller at engineyard.com Tue Sep 2 22:11:35 2008 From: emuller at engineyard.com (Edward Muller) Date: Tue, 2 Sep 2008 15:11:35 -0700 Subject: [Linux-cluster] lock_dlm: gdlm_cancel messages? Message-ID: <590F9A59-C4EC-42F7-B036-34E3245FB9B8@engineyard.com> We have a customer who we believe is putting excessive locking pressure on one of several gfs volumes (9 total across 5 systems). They've started to get occasional load spikes that seem to show that the gfs is "locking" for a minute or two. Without any action on our part the load spikes clear and everything continues as normal. And we've recently seen the following log entries: Sep 2 12:57:57 xc88-s00007 kernel: lock_dlm: gdlm_cancel 1,2 flags 0 Sep 2 12:57:57 xc88-s00007 kernel: lock_dlm: gdlm_cancel skip 1,2 flags 0 Sep 2 12:57:58 xc88-s00007 kernel: lock_dlm: gdlm_cancel 1,2 flags 0 Sep 2 12:57:58 xc88-s00007 kernel: lock_dlm: gdlm_cancel skip 1,2 flags 0 Sep 2 12:58:40 xc88-s00007 kernel: lock_dlm: gdlm_cancel 1,2 flags 0 Sep 2 12:58:40 xc88-s00007 kernel: lock_dlm: gdlm_cancel skip 1,2 flags 0 Sep 2 12:58:58 xc88-s00007 kernel: lock_dlm: gdlm_cancel 1,2 flags 0 Sep 2 12:58:58 xc88-s00007 kernel: lock_dlm: gdlm_cancel skip 1,2 flags 0 Sep 2 12:59:14 xc88-s00007 kernel: lock_dlm: gdlm_cancel 1,2 flags 0 Sep 2 12:59:14 xc88-s00007 kernel: lock_dlm: gdlm_cancel skip 1,2 flags 0 For all intents and purposes we're running RHCS2 from RHEL 5.2 w/ the RHEL 5.2 kernel (2.6.18-92.1.10) This used to happen to this customer a lot more frequently on RHCS1 (1.03), but we upgraded them to the above RHCS2 packages and kernel and things have been much better. I'm going to start dumping gfs_tool counters data for the various gfs filesystems. Any advice tracking this down would be useful. Thanks! -- Edward Muller Engine Yard Inc. 
: Support, Scalability, Reliability +1.866.518.9273 x209 - Mobile: +1.417.844.2435 IRC: edwardam - XMPP/GTalk: emuller at engineyard.com Pacific/US From macscr at macscr.com Wed Sep 3 10:05:02 2008 From: macscr at macscr.com (Mark Chaney) Date: Wed, 3 Sep 2008 05:05:02 -0500 Subject: [Linux-cluster] dlm: no local IP address has been set Message-ID: <001101c90dac$89f858a0$9de909e0$@com> grr, I am still getting this error every so often: dlm: no local IP address has been set dlm: cannot start dlm lowcomms -107 but my hosts files are fine and actually all servers are members of the clutser error is only showing on 1 of 3 servers and its always right after they come back up after being fenced. I am running CentOS 5.2 and am using the newest stable packages, using yum. Here is an example of my hosts files: 127.0.0.1 localhost.localdomain localhost #::1 localhost6.localdomain6 localhost6 67.xxx.159.xx wheeljac.blah.com wheeljack 67.xxx.159.xx skydive.blah.com skydive 67.xxx.159.xx ratchet.blah.com ratchet 192.168.1.11 wheeljack.local 192.168.1.10 ratchet.local 192.168.1.12 skydive.local The .local hostnames are the names of my nodes in my cluster.conf. Like I said, the server joined the cluster fine, but since DLM had the issue, CLVMD wasn't able to start. Any help would sincerely be appreciated. -Mark From teigland at redhat.com Wed Sep 3 13:34:48 2008 From: teigland at redhat.com (David Teigland) Date: Wed, 3 Sep 2008 08:34:48 -0500 Subject: [Linux-cluster] dlm: no local IP address has been set In-Reply-To: <001101c90dac$89f858a0$9de909e0$@com> References: <001101c90dac$89f858a0$9de909e0$@com> Message-ID: <20080903133448.GA22775@redhat.com> On Wed, Sep 03, 2008 at 05:05:02AM -0500, Mark Chaney wrote: > grr, I am still getting this error every so often: > > dlm: no local IP address has been set > dlm: cannot start dlm lowcomms -107 This is generally caused by some previous step failing or not happening in the whole startup process. The most proximate cause is dlm_controld not starting, which would usually be the result of something else like configfs not being mounted or the dlm kernel module not being loaded. Dave From teigland at redhat.com Wed Sep 3 13:44:33 2008 From: teigland at redhat.com (David Teigland) Date: Wed, 3 Sep 2008 08:44:33 -0500 Subject: [Linux-cluster] lock_dlm: gdlm_cancel messages? In-Reply-To: <590F9A59-C4EC-42F7-B036-34E3245FB9B8@engineyard.com> References: <590F9A59-C4EC-42F7-B036-34E3245FB9B8@engineyard.com> Message-ID: <20080903134433.GB22775@redhat.com> On Tue, Sep 02, 2008 at 03:11:35PM -0700, Edward Muller wrote: > We have a customer who we believe is putting excessive locking > pressure on one of several gfs volumes (9 total across 5 systems). > > They've started to get occasional load spikes that seem to show that > the gfs is "locking" for a minute or two. Without any action on our > part the load spikes clear and everything continues as normal. 
> > And we've recently seen the following log entries: > > Sep 2 12:57:57 xc88-s00007 kernel: lock_dlm: gdlm_cancel 1,2 flags 0 > Sep 2 12:57:57 xc88-s00007 kernel: lock_dlm: gdlm_cancel skip 1,2 > flags 0 > Sep 2 12:57:58 xc88-s00007 kernel: lock_dlm: gdlm_cancel 1,2 flags 0 > Sep 2 12:57:58 xc88-s00007 kernel: lock_dlm: gdlm_cancel skip 1,2 > flags 0 > Sep 2 12:58:40 xc88-s00007 kernel: lock_dlm: gdlm_cancel 1,2 flags 0 > Sep 2 12:58:40 xc88-s00007 kernel: lock_dlm: gdlm_cancel skip 1,2 > flags 0 > Sep 2 12:58:58 xc88-s00007 kernel: lock_dlm: gdlm_cancel 1,2 flags 0 > Sep 2 12:58:58 xc88-s00007 kernel: lock_dlm: gdlm_cancel skip 1,2 > flags 0 > Sep 2 12:59:14 xc88-s00007 kernel: lock_dlm: gdlm_cancel 1,2 flags 0 > Sep 2 12:59:14 xc88-s00007 kernel: lock_dlm: gdlm_cancel skip 1,2 > flags 0 FS activity will block while gfs does recovery, and the cancel messages are also usually due to recovery. If gfs is doing recovery, you'd see clear messages about it in /var/log/messages. Otherwise, I'd check whether they're using any gfs administrative commands like gfs_tool. Dave From macscr at macscr.com Wed Sep 3 18:33:46 2008 From: macscr at macscr.com (Mark Chaney) Date: Wed, 3 Sep 2008 13:33:46 -0500 Subject: [Linux-cluster] dlm: no local IP address has been set In-Reply-To: <20080903133448.GA22775@redhat.com> References: <001101c90dac$89f858a0$9de909e0$@com> <20080903133448.GA22775@redhat.com> Message-ID: <005f01c90df3$9b30f220$d192d660$@com> Still not sure what exactly isn't loading, I did find these details quite odd: im getting a few different results on each server when I run lsmod: http://pastebin.ca/1192806 -----Original Message----- From: David Teigland [mailto:teigland at redhat.com] Sent: Wednesday, September 03, 2008 8:35 AM To: Mark Chaney Cc: Linux-cluster at redhat.com Subject: Re: [Linux-cluster] dlm: no local IP address has been set On Wed, Sep 03, 2008 at 05:05:02AM -0500, Mark Chaney wrote: > grr, I am still getting this error every so often: > > dlm: no local IP address has been set > dlm: cannot start dlm lowcomms -107 This is generally caused by some previous step failing or not happening in the whole startup process. The most proximate cause is dlm_controld not starting, which would usually be the result of something else like configfs not being mounted or the dlm kernel module not being loaded. Dave From jbrassow at redhat.com Wed Sep 3 19:21:56 2008 From: jbrassow at redhat.com (Jonathan Brassow) Date: Wed, 3 Sep 2008 14:21:56 -0500 Subject: [Linux-cluster] lvcreate: Error locking on node In-Reply-To: <1220125884.3124.28.camel@blackhouse> References: <1220125884.3124.28.camel@blackhouse> Message-ID: clvm is a way to control/configure your storage in a cluster. It does not provide storage from one machine /to/ a cluster. There are several ways of making this happen though. You can use GNBD, iSCSI, AOE, fibre channel, or pretty much any other method of sharing block devices. Once all machines in your cluster can see the same storage, then you will be able to use CLVM to configure/manage it. brassow On Aug 30, 2008, at 2:51 PM, Alexander Vorobiyov wrote: > I try to create lvm logical volume that it was accessible through any > node of my cluster, at refusal of any cluster node. I try to organise > this storage through gigabit ethernet. clvm can provide these > storage in > this case or I am mistaken? > > -- > Alexander Vorobiyov > NTC NOC > The engineer of communications > Russia, Ryazan > +7(4912)901553 ext. 
630 > mailto:alexander.vorobiyov at rzn.nex3.ru > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From quickshiftin at gmail.com Wed Sep 3 21:56:29 2008 From: quickshiftin at gmail.com (Nathan Nobbe) Date: Wed, 3 Sep 2008 15:56:29 -0600 Subject: [Linux-cluster] practial dlm usage inquiry Message-ID: <7dd2dc0b0809031456i7fd421d6qb2e59ee621f751e7@mail.gmail.com> hi all, my first post on this list. say, ive been checking out dlm a little bit lately. we are trying to build (or leverage) a distributed lock to manage access to files mounted via nfs. some of the folks here are telling me that dlm is not suitable for this, however, my suspicion is that the dlm is independent of a filesystem. is this accurate? for example i imagine the code could ask for a lock on a given resource, and then if it acquires a write lock, modify the file. am i way off base here? just trying to get my footing before investing too much time in the wrong approach... tia, -nathan -------------- next part -------------- An HTML attachment was scrubbed... URL: From egerlach at feds.uwaterloo.ca Thu Sep 4 01:19:04 2008 From: egerlach at feds.uwaterloo.ca (Eric Gerlach) Date: Wed, 03 Sep 2008 21:19:04 -0400 Subject: [Linux-cluster] Can't create files bigger than 3864 bytes on GFS Message-ID: <48BF3788.9090800@feds.uwaterloo.ca> Hi, I'm currently trying to set up GFS on a couple of Debian Testing boxes. I've got the GFS setup and mounted, however I'm having trouble writing to files. Using vi, cp, whatever. If the file is up to 3864 bytes (one inode) it's fine. But if I try to edit a file to make it bigger than that size, or copy in a file larger than that, the result is a zero byte file. I've also tried doing GFS with the lock_nolock mechanism, and I get the same result. The kernel is 2.6.26, and the Debian version of the tools is based off of cluster-2.03.06. My filesystem is GFS, not GFS2. Has anyone seen anything like this before? I'm a newbie to GFS, so I'm not even sure where to start looking for an answer. Even some help in that direction would be helpful. Thanks in advance. Cheers, -- Eric Gerlach, Network Administrator Federation of Students University of Waterloo p: (519) 888-4567 x36329 e: egerlach at feds.uwaterloo.ca From jeff.sturm at eprize.com Thu Sep 4 01:57:58 2008 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Wed, 3 Sep 2008 21:57:58 -0400 Subject: [Linux-cluster] practial dlm usage inquiry In-Reply-To: <7dd2dc0b0809031456i7fd421d6qb2e59ee621f751e7@mail.gmail.com> References: <7dd2dc0b0809031456i7fd421d6qb2e59ee621f751e7@mail.gmail.com> Message-ID: <64D0546C5EBBD147B75DE133D798665FE92B73@hugo.eprize.local> Nathan, I believe it's possible to use DLM as a general-purpose lock manager. You may find the following helpful: http://people.redhat.com/ccaulfie/docs/rhdlmbook.pdf Jeff ________________________________ From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Nathan Nobbe Sent: Wednesday, September 03, 2008 5:56 PM To: linux-cluster at redhat.com Subject: [Linux-cluster] practial dlm usage inquiry hi all, my first post on this list. say, ive been checking out dlm a little bit lately. we are trying to build (or leverage) a distributed lock to manage access to files mounted via nfs. some of the folks here are telling me that dlm is not suitable for this, however, my suspicion is that the dlm is independent of a filesystem. is this accurate? 
for example i imagine the code could ask for a lock on a given resource, and then if it acquires a write lock, modify the file. am i way off base here? just trying to get my footing before investing too much time in the wrong approach... tia, -nathan -------------- next part -------------- An HTML attachment was scrubbed... URL: From quickshiftin at gmail.com Thu Sep 4 03:59:45 2008 From: quickshiftin at gmail.com (Nathan Nobbe) Date: Wed, 3 Sep 2008 21:59:45 -0600 Subject: [Linux-cluster] practial dlm usage inquiry In-Reply-To: <64D0546C5EBBD147B75DE133D798665FE92B73@hugo.eprize.local> References: <7dd2dc0b0809031456i7fd421d6qb2e59ee621f751e7@mail.gmail.com> <64D0546C5EBBD147B75DE133D798665FE92B73@hugo.eprize.local> Message-ID: <7dd2dc0b0809032059j6b248491i3200bef3045fd428@mail.gmail.com> 2008/9/3 Jeff Sturm > Nathan, > > I believe it's possible to use DLM as a general-purpose lock manager. > thanks for corroborating :) > You may find the following helpful: > > http://people.redhat.com/ccaulfie/docs/rhdlmbook.pdf > i was reading the book earlier today, and im going to start experimenting tonight. my intention is to develop a small php extension, to wrap some of the methods in the dlm api. -nathan -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris-m-lists at joelly.net Thu Sep 4 22:08:05 2008 From: chris-m-lists at joelly.net (Chris Joelly) Date: Fri, 5 Sep 2008 00:08:05 +0200 Subject: [Linux-cluster] Handling of shutdown and mounting of GFS Message-ID: <20080904220805.GB12807@joysn.joelly.net> Hello, i try to get RHCS up and running and have some success so far. The cluster with 2 nodes is running, but i don't know how to remove one node the correct way. I can move the active service (an IP address by now) to the second node and then want to remove the other node from the running cluster. cman_tool leave remove should be used for this which is recommended on the RH documentation. But if i try that i get the error message: root at store02:/etc/cluster# cman_tool leave remove cman_tool: Error leaving cluster: Device or resource busy I cannot figure out which device is busy so that the node is not able to leave the cluster. The service (IP address) moved to the other node correctly as i can see using clustat ... The only way to get out of this problem is to restart the whole cluster which brings down the service(s) and results in unnecessary fencing... Is there a known way to remove one node from the cluster without bringing down the whole cluster? Another strange thing comes up when i try to use GFS: i have configured DRBD on a backing HW Raid10 device, use LVM2 to build a clusteraware VG, and on top of that use LVs and GFS across the two cluster nodes. Using the GFS filesystems without noauto in fstab doesn't mount the filesystems on boot using /etc/init.d/gfs-tools. I think this is due to the ordering the sysv init scripts are started. All RHCS stuff is started from within rcS, and drbd is startet from within rc2. I read the section of the debian-policy to figure out if rcS is meant to run before rc2, but this isn't mentioned in the policy. So i assume that drbd is started in rc2 after rcS, which would mean that every filesystem on top of drbd is not able to mount on boot time... Can anybody prove this? The reason why i try to mount a GFS filesystem at boottime is that i want to build cluster services on top of it, and that services (more than one) are relying on one fs. 
A better solution would be to define a shared GFS filesystem resource which could be used across more than one cluster services, but the cluster take care that the filesystem is only mounted once... Can this be achieved with RHCS? thanks for any advice ... -- Ubuntu 8.04 LTS 64bit RHCS 2.0 cluster.conf attached! -------------- next part -------------- From anujhere at gmail.com Fri Sep 5 11:32:28 2008 From: anujhere at gmail.com (=?UTF-8?Q?Anuj_Singh_(=E0=A4=85=E0=A4=A8=E0=A5=81=E0=A4=9C)_?=) Date: Fri, 5 Sep 2008 17:02:28 +0530 Subject: [Linux-cluster] how to get my cluster working if my /dev/sda becomes /dev/sdb (CLVM, iscsi, ERROR: Module iscsi_sfnet in use) Message-ID: <3120c9e30809050432r66da26b2rfc5df10cbf241145@mail.gmail.com> Hi, I configured a cluster using gfs1 on rhel-4 kernel version 2.6.9-55.16.EL. Using iscsi-target and initiator. gfs1 mount is exported via nfs service. I can manually stop all services in following sequence: nfs, portmap, rgmanager, gfs, clvmd, fenced, cman, ccsd. to stop my iscsi service first I give 'vgchange -aln' then I stop iscsi service, otherwise i get an error of module in use, as I have an clusterd lvm over iscsi device (/dev/sda1) Everything works fine, but when i am trying to simulate a possible problem, f.e. iscsi service is stopped I get following error. Test1: When cluster is working I stop iscsi service with /etc/init.d/iscsi stop Searching for iscsi-based multipath maps Found 0 maps Stopping iscsid: [ OK ] Removing iscsi driver: ERROR: Module iscsi_sfnet is in use [FAILED] To stop my iscsi service without a failure, I stop all cluster services as follows. /etc/init.d/nfs stop /etc/init.d/portmap stop /etc/init.d/rgmanager stop /etc/init.d/gfs stop /etc/init.d/clvmd stop /etc/init.d/fenced stop /etc/init.d/cman stop /etc/init.d/ccsd stop Every service stops with a ok message. 
now again when i stop my iscsi service I get same error /etc/init.d/iscsi stop Removing iscsi driver: ERROR: Module iscsi_sfnet is in use [FAILED] On my iscsi device (which is /dev/sd1), i have a LVM with gfs1 file-system, as all the cluster services are stopped, I try to deactivate the lvm with: vgchange -aln /dev/dm-0: read failed after 0 of 4096 at 0: Input/output error No volume groups found At the moment if I start my iscsi service, my /dev/sda becomes /dev/sdb as well as iscsi service gives me following error: [root at pr0031 new]# /sbin/service iscsi start Checking iscsi config: [ OK ] Loading iscsi driver: [ OK ] mknod: `/dev/iscsictl': File exists Starting iscsid: [ OK ] Sep 5 16:42:37 pr0031 iscsi: iscsi config check succeeded Sep 5 16:42:37 pr0031 iscsi: Loading iscsi driver: succeeded Sep 5 16:42:42 pr0031 iscsid[20732]: version 4:0.1.11-7 variant (14-Apr-2008) Sep 5 16:42:42 pr0031 iscsi: iscsid startup succeeded Sep 5 16:42:42 pr0031 iscsid[20736]: Connected to Discovery Address 192.168.10.199 Sep 5 16:42:42 pr0031 kernel: iscsi-sfnet:host16: Session established Sep 5 16:42:42 pr0031 kernel: scsi16 : SFNet iSCSI driver Sep 5 16:42:42 pr0031 kernel: Vendor: IET Model: VIRTUAL-DISK Rev: 0 Sep 5 16:42:42 pr0031 kernel: Type: Direct-Access ANSI SCSI revision: 04 Sep 5 16:42:42 pr0031 kernel: SCSI device sdb: 1975932 512-byte hdwr sectors (1012 MB) Sep 5 16:42:42 pr0031 kernel: SCSI device sdb: drive cache: write through Sep 5 16:42:42 pr0031 kernel: SCSI device sdb: 1975932 512-byte hdwr sectors (1012 MB) Sep 5 16:42:42 pr0031 kernel: SCSI device sdb: drive cache: write through Sep 5 16:42:42 pr0031 kernel: sdb: sdb1 Sep 5 16:42:42 pr0031 kernel: Attached scsi disk sdb at scsi16, channel 0, id 0, lun 0 Sep 5 16:42:43 pr0031 scsi.agent[20764]: disk at /devices/platform/host16/target16:0:0/16:0:0:0 As my /dev/sda1 became /dev/sdb1, if i start cluster services, I have no gfs mount. clurgmgrd[21062]: Starting stopped service flx Sep 5 16:47:16 pr0031 kernel: scsi15 (0:0): rejecting I/O to dead device Sep 5 16:47:16 pr0031 clurgmgrd: [21062]: 'mount -t gfs /dev/mapper/VG01-LV01 /u01' failed, error=32 Sep 5 16:47:16 pr0031 clurgmgrd[21062]: start on clusterfs:gfsmount_u01 returned 2 (invalid argument(s)) Sep 5 16:47:16 pr0031 clurgmgrd[21062]: #68: Failed to start flx; return value: 1 Sep 5 16:47:16 pr0031 clurgmgrd[21062]: Stopping service flx After the above situation I need to restart the nodes, which I don't want to, I created a script to handle all this, in which if i restart all the services first, first I get the same /dev/sdb ( which should be /dev/sda so that my cluster can have a gfs mount). When I restart all the services second time, I get no error (this time iscsi disk is attached with /dev/sda device name and I don't see any /dev/iscsctl exist error at the iscsi startup time) and cluster starts working. my script : http://www.grex.org/~anuj/cluster.txt So, how to get my cluster working if my /dev/sda becomes /dev/sdb? Thanks and Regards Anuj Singh -------------- next part -------------- An HTML attachment was scrubbed... 
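[Editor's note] One way to take the sda/sdb ordering out of the picture is to address the storage only through the clustered VG and re-activate it once the iSCSI session is back, mounting by the LVM path rather than by sdX. The VG, LV and mount point names below are the ones from the post; the sequence itself is only a sketch, not a tested fix:

    # Rough sketch: after the iSCSI login (the disk may now appear as sdb),
    # refresh LVM's view of the physical volumes, re-activate the clustered
    # VG, then mount by the LVM path, which does not depend on sdX naming.
    /etc/init.d/iscsi start
    pvscan
    vgscan
    vgchange -ay VG01      # clvmd must already be running for a clustered VG
    mount -t gfs /dev/VG01/LV01 /u01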
URL: From theophanis_kontogiannis at yahoo.gr Fri Sep 5 12:34:09 2008 From: theophanis_kontogiannis at yahoo.gr (Theophanis Kontogiannis) Date: Fri, 5 Sep 2008 15:34:09 +0300 Subject: [Linux-cluster] Problem with disk geometry Message-ID: <00b101c90f53$b39d9140$1ad8b3c0$@gr> Hello All, Though I have posted this to centos hardware forum, I also post it here since it affects my cluster and drbd setup (the drbd complains that "drbd0: The peer's disk size is too small!", so the LV cannot become available, so the cluster services on node-1 do not start). I have 2.6.18-92.1.10.el5.centos.plus and fdisk (util-linux 2.13-pre7) My old 80GB disk failed and I had replaced. However the vendor replaced it with a different model. My old disk was WD800JB-00JJC0 but the new one is a WD800JB-22JJC0. Since I must have the new disk with the same partitions as the old one (apart from RAID-1 software, I also have installed DRBD mirrored with a second node), I use fdisk to partition it. But there start the problems: 1. BIOS reports C38308, H16, S255, Landing Zone 38307, Precomp 0 2. Kernel reports hda: 156301488 sectors (80026 MB) w/8192KiB Cache, CHS=65535/16/63, UDMA(100) 3. fdisk reports Disk /dev/hda: 80.0 GB, 80026361856 bytes 16 heads, 63 sectors/track, 155061 cylinders Units = cylinders of 1008 * 512 = 516096 bytes and the disk turns up like: Device Boot Start End Blocks Id System /dev/hda1 * 1 208 104800+ fd Linux raid autodetect /dev/hda2 209 20554 10254384 fd Linux raid autodetect /dev/hda3 20555 24500 1988784 83 Linux /dev/hda4 24501 155061 65802744 83 Linux The result is that I cannot create the same layout on the disk as the old disk. The old disk was like that: Disk /dev/hda: 80.0 GB, 80026361856 bytes 255 heads, 63 sectors/track, 9729 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes Device Boot Start End Blocks Id System /dev/hda1 * 1 13 104391 fd Linux raid autodetect /dev/hda2 14 1288 10241437+ fd Linux raid autodetect /dev/hda3 1289 1543 2048287+ 82 Linux swap / Solaris /dev/hda4 1544 9729 65754045 83 Linux Any help? Why the is the new disk known in different ways by BIOS, kernel and fdisk? And how can I fix this? Thank you All for your Time, Theophanis Kontogiannis -------------- next part -------------- An HTML attachment was scrubbed... URL: From anujhere at gmail.com Fri Sep 5 13:10:53 2008 From: anujhere at gmail.com (=?UTF-8?Q?Anuj_Singh_(=E0=A4=85=E0=A4=A8=E0=A5=81=E0=A4=9C)_?=) Date: Fri, 5 Sep 2008 18:40:53 +0530 Subject: [Linux-cluster] Re: how to get my cluster working if my /dev/sda becomes /dev/sdb (CLVM, iscsi, ERROR: Module iscsi_sfnet in use) In-Reply-To: <3120c9e30809050432r66da26b2rfc5df10cbf241145@mail.gmail.com> References: <3120c9e30809050432r66da26b2rfc5df10cbf241145@mail.gmail.com> Message-ID: <3120c9e30809050610o20c46493v3cd6a669eb409990@mail.gmail.com> Thanks, changed script a bit, things working now. resetting iscsi service. But device name order independent will be better. Thanks and regards Anuj Singh On Fri, Sep 5, 2008 at 5:02 PM, Anuj Singh (????) wrote: > Hi, > I configured a cluster using gfs1 on rhel-4 kernel version 2.6.9-55.16.EL. > Using iscsi-target and initiator. > gfs1 mount is exported via nfs service. > > I can manually stop all services in following sequence: > nfs, portmap, rgmanager, gfs, clvmd, fenced, cman, ccsd. 
> to stop my iscsi service first I give 'vgchange -aln' then I stop iscsi > service, otherwise i get an error of module in use, as I have an clusterd > lvm over iscsi device (/dev/sda1) > > Everything works fine, but when i am trying to simulate a possible problem, > f.e. iscsi service is stopped I get following error. > > Test1: > When cluster is working I stop iscsi service with > /etc/init.d/iscsi stop > Searching for iscsi-based multipath maps > Found 0 maps > Stopping iscsid: [ OK ] > Removing iscsi driver: ERROR: Module iscsi_sfnet is in use > [FAILED] > To stop my iscsi service without a failure, I stop all cluster services as > follows. > /etc/init.d/nfs stop > /etc/init.d/portmap stop > /etc/init.d/rgmanager stop > /etc/init.d/gfs stop > /etc/init.d/clvmd stop > /etc/init.d/fenced stop > /etc/init.d/cman stop > /etc/init.d/ccsd stop > Every service stops with a ok message. now again when i stop my iscsi > service I get same error > /etc/init.d/iscsi stop > Removing iscsi driver: ERROR: Module iscsi_sfnet is in > use [FAILED] > > On my iscsi device (which is /dev/sd1), i have a LVM with gfs1 file-system, > as all the cluster services are stopped, I try to deactivate the lvm with: > > vgchange -aln > /dev/dm-0: read failed after 0 of 4096 at 0: Input/output error > No volume groups found > > At the moment if I start my iscsi service, my /dev/sda becomes /dev/sdb as > well as iscsi service gives me following error: > > [root at pr0031 new]# /sbin/service iscsi start > Checking iscsi config: [ OK ] > Loading iscsi driver: [ OK ] > mknod: `/dev/iscsictl': File exists > Starting iscsid: [ OK ] > > Sep 5 16:42:37 pr0031 iscsi: iscsi config check succeeded > Sep 5 16:42:37 pr0031 iscsi: Loading iscsi driver: succeeded > Sep 5 16:42:42 pr0031 iscsid[20732]: version 4:0.1.11-7 variant > (14-Apr-2008) > Sep 5 16:42:42 pr0031 iscsi: iscsid startup succeeded > Sep 5 16:42:42 pr0031 iscsid[20736]: Connected to Discovery Address > 192.168.10.199 > Sep 5 16:42:42 pr0031 kernel: iscsi-sfnet:host16: Session established > Sep 5 16:42:42 pr0031 kernel: scsi16 : SFNet iSCSI driver > Sep 5 16:42:42 pr0031 kernel: Vendor: IET Model: VIRTUAL-DISK > Rev: 0 > Sep 5 16:42:42 pr0031 kernel: Type: Direct-Access > ANSI SCSI revision: 04 > Sep 5 16:42:42 pr0031 kernel: SCSI device sdb: 1975932 512-byte hdwr > sectors (1012 MB) > Sep 5 16:42:42 pr0031 kernel: SCSI device sdb: drive cache: write through > Sep 5 16:42:42 pr0031 kernel: SCSI device sdb: 1975932 512-byte hdwr > sectors (1012 MB) > Sep 5 16:42:42 pr0031 kernel: SCSI device sdb: drive cache: write through > Sep 5 16:42:42 pr0031 kernel: sdb: sdb1 > Sep 5 16:42:42 pr0031 kernel: Attached scsi disk sdb at scsi16, channel 0, > id 0, lun 0 > Sep 5 16:42:43 pr0031 scsi.agent[20764]: disk at > /devices/platform/host16/target16:0:0/16:0:0:0 > > As my /dev/sda1 became /dev/sdb1, if i start cluster services, I have no > gfs mount. 
> > clurgmgrd[21062]: Starting stopped service flx > Sep 5 16:47:16 pr0031 kernel: scsi15 (0:0): rejecting I/O to dead device > Sep 5 16:47:16 pr0031 clurgmgrd: [21062]: 'mount -t gfs > /dev/mapper/VG01-LV01 /u01' failed, error=32 > Sep 5 16:47:16 pr0031 clurgmgrd[21062]: start on > clusterfs:gfsmount_u01 returned 2 (invalid argument(s)) > Sep 5 16:47:16 pr0031 clurgmgrd[21062]: #68: Failed to start > flx; return value: 1 > Sep 5 16:47:16 pr0031 clurgmgrd[21062]: Stopping service flx > > > After the above situation I need to restart the nodes, which I don't want > to, I created a script to handle all this, in which if i restart all the > services first, first I get the same /dev/sdb ( which should be /dev/sda so > that my cluster can have a gfs mount). When I restart all the services > second time, I get no error (this time iscsi disk is attached with /dev/sda > device name and I don't see any /dev/iscsctl exist error at the iscsi > startup time) and cluster starts working. > my script : http://www.grex.org/~anuj/cluster.txt > > So, how to get my cluster working if my /dev/sda becomes /dev/sdb? > > Thanks and Regards > Anuj Singh > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mharrington at eons.com Fri Sep 5 13:30:15 2008 From: mharrington at eons.com (Matt Harrington) Date: Fri, 05 Sep 2008 09:30:15 -0400 Subject: [Linux-cluster] Re: how to get my cluster working if my /dev/sda becomes /dev/sdb (CLVM, iscsi, ERROR: Module iscsi_sfnet in use) In-Reply-To: <3120c9e30809050610o20c46493v3cd6a669eb409990@mail.gmail.com> References: <3120c9e30809050432r66da26b2rfc5df10cbf241145@mail.gmail.com> <3120c9e30809050610o20c46493v3cd6a669eb409990@mail.gmail.com> Message-ID: <48C13467.7050803@eons.com> If the problem is device naming, use multipath to create a /dev/mapper/something static device name which will always map to a particular disk independent of load order. Anuj Singh (????) wrote: > Thanks, > changed script a bit, things working now. resetting iscsi service. > But device name order independent will be better. > > Thanks and regards > Anuj Singh > > > > On Fri, Sep 5, 2008 at 5:02 PM, Anuj Singh (????) > wrote: > > Hi, > I configured a cluster using gfs1 on rhel-4 kernel version > 2.6.9-55.16.EL. > Using iscsi-target and initiator. > gfs1 mount is exported via nfs service. > > I can manually stop all services in following sequence: > nfs, portmap, rgmanager, gfs, clvmd, fenced, cman, ccsd. > to stop my iscsi service first I give 'vgchange -aln' then I stop > iscsi service, otherwise i get an error of module in use, as I > have an clusterd lvm over iscsi device (/dev/sda1) > > Everything works fine, but when i am trying to simulate a possible > problem, f.e. iscsi service is stopped I get following error. > > Test1: > When cluster is working I stop iscsi service with > /etc/init.d/iscsi stop > Searching for iscsi-based multipath maps > Found 0 maps > Stopping iscsid: [ OK ] > Removing iscsi driver: ERROR: Module iscsi_sfnet is in use > [FAILED] > To stop my iscsi service without a failure, I stop all cluster > services as follows. > /etc/init.d/nfs stop > /etc/init.d/portmap stop > /etc/init.d/rgmanager stop > /etc/init.d/gfs stop > /etc/init.d/clvmd stop > /etc/init.d/fenced stop > /etc/init.d/cman stop > /etc/init.d/ccsd stop > Every service stops with a ok message. 
now again when i stop my > iscsi service I get same error > /etc/init.d/iscsi stop > Removing iscsi driver: ERROR: Module iscsi_sfnet is in > use [FAILED] > > On my iscsi device (which is /dev/sd1), i have a LVM with gfs1 > file-system, > as all the cluster services are stopped, I try to deactivate the > lvm with: > > vgchange -aln > /dev/dm-0: read failed after 0 of 4096 at 0: Input/output error > No volume groups found > > At the moment if I start my iscsi service, my /dev/sda becomes > /dev/sdb as well as iscsi service gives me following error: > > [root at pr0031 new]# /sbin/service iscsi start > Checking iscsi config: [ OK ] > Loading iscsi driver: [ OK ] > mknod: `/dev/iscsictl': File exists > Starting iscsid: [ OK ] > > Sep 5 16:42:37 pr0031 iscsi: iscsi config check succeeded > Sep 5 16:42:37 pr0031 iscsi: Loading iscsi driver: succeeded > Sep 5 16:42:42 pr0031 iscsid[20732]: version 4:0.1.11-7 variant > (14-Apr-2008) > Sep 5 16:42:42 pr0031 iscsi: iscsid startup succeeded > Sep 5 16:42:42 pr0031 iscsid[20736]: Connected to Discovery > Address 192.168.10.199 > Sep 5 16:42:42 pr0031 kernel: iscsi-sfnet:host16: Session established > Sep 5 16:42:42 pr0031 kernel: scsi16 : SFNet iSCSI driver > Sep 5 16:42:42 pr0031 kernel: Vendor: IET Model: > VIRTUAL-DISK Rev: 0 > Sep 5 16:42:42 pr0031 kernel: Type: > Direct-Access ANSI SCSI revision: 04 > Sep 5 16:42:42 pr0031 kernel: SCSI device sdb: 1975932 512-byte > hdwr sectors (1012 MB) > Sep 5 16:42:42 pr0031 kernel: SCSI device sdb: drive cache: write > through > Sep 5 16:42:42 pr0031 kernel: SCSI device sdb: 1975932 512-byte > hdwr sectors (1012 MB) > Sep 5 16:42:42 pr0031 kernel: SCSI device sdb: drive cache: write > through > Sep 5 16:42:42 pr0031 kernel: sdb: sdb1 > Sep 5 16:42:42 pr0031 kernel: Attached scsi disk sdb at scsi16, > channel 0, id 0, lun 0 > Sep 5 16:42:43 pr0031 scsi.agent[20764]: disk at > /devices/platform/host16/target16:0:0/16:0:0:0 > > As my /dev/sda1 became /dev/sdb1, if i start cluster services, I > have no gfs mount. > > clurgmgrd[21062]: Starting stopped service flx > Sep 5 16:47:16 pr0031 kernel: scsi15 (0:0): rejecting I/O to dead > device > Sep 5 16:47:16 pr0031 clurgmgrd: [21062]: 'mount -t gfs > /dev/mapper/VG01-LV01 /u01' failed, error=32 > Sep 5 16:47:16 pr0031 clurgmgrd[21062]: start on > clusterfs:gfsmount_u01 returned 2 (invalid argument(s)) > Sep 5 16:47:16 pr0031 clurgmgrd[21062]: #68: Failed to > start flx; return value: 1 > Sep 5 16:47:16 pr0031 clurgmgrd[21062]: Stopping service > flx > > > After the above situation I need to restart the nodes, which I > don't want to, I created a script to handle all this, in which if > i restart all the services first, first I get the same /dev/sdb ( > which should be /dev/sda so that my cluster can have a gfs mount). > When I restart all the services second time, I get no error (this > time iscsi disk is attached with /dev/sda device name and I don't > see any /dev/iscsctl exist error at the iscsi startup time) and > cluster starts working. > my script : http://www.grex.org/~anuj/cluster.txt > > > So, how to get my cluster working if my /dev/sda becomes /dev/sdb? > > Thanks and Regards > Anuj Singh > > > > > > > ------------------------------------------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ajeet.singh.raina at logica.com Mon Sep 8 14:25:46 2008 From: ajeet.singh.raina at logica.com (Singh Raina, Ajeet) Date: Mon, 8 Sep 2008 19:55:46 +0530 Subject: [Linux-cluster] RPC Issue? Message-ID: <0139539A634FD04A99C9B8880AB70CB209B17B00@in-ex004.groupinfra.com> _____________________________________________ From: Singh Raina, Ajeet Sent: Monday, September 08, 2008 6:45 PM To: 'linux clustering' Subject: RPC: failed to contact portmap (errno -5) I have setup two Node Cluster Setup and while trying to failover manually,its throwing the following error: [code] Sep 8 15:00:57 10.14.26.133 rgmanager: [31864]: Shutting down Cluster Service Manager... Sep 8 15:00:57 10.14.26.133 clurgmgrd[13694]: Shutting down Sep 8 15:00:57 10.14.26.133 clurgmgrd[13694]: Stopping service Mplus Sep 8 15:00:57 10.14.26.133 clurgmgrd: [13694]: Executing /home/fsadmin/featureserver/scripts/mplus_control_script.sh stop Sep 8 15:01:00 10.14.26.133 nfsd: last server has exited Sep 8 15:01:00 10.14.26.133 nfsd: unexporting all filesystems Sep 8 15:01:00 10.14.26.133 rpciod: active tasks at shutdown?! Sep 8 15:01:00 10.14.26.133 RPC: failed to contact portmap (errno -5). Sep 8 15:01:00 10.14.26.133 clurgmgrd: [13694]: Removing IPv4 address 10.14.26.139 from bond0 Sep 8 15:01:00 10.14.26.133 clurgmgrd: [13694]: unmounting /data/Xml Sep 8 15:01:00 10.14.26.133 clurgmgrd: [13694]: Device /dev/sda2 is mounted on /usr/users/fsadmin/archive instead of /home/fsadmin/archive Sep 8 15:01:00 10.14.26.133 clurgmgrd: [13694]: unmounting /home/fsadmin/archive Sep 8 15:01:01 10.14.26.133 clurgmgrd: [13694]: Device /dev/sda8 is mounted on /usr/users/fsadmin/featureserver/config instead of /home/fsadmin/featureserver/config Sep 8 15:01:01 10.14.26.133 clurgmgrd: [13694]: unmounting /home/fsadmin/featureserver/config Sep 8 15:01:01 10.14.26.133 clurgmgrd: [13694]: unmounting /var/lib/mysql Sep 8 15:01:01 10.14.26.133 clurgmgrd: [13694]: Device /dev/sda6 is mounted on /usr/users/fsadmin/mysql/logs instead of /home/fsadmin/mysql/logs Sep 8 15:01:01 10.14.26.133 clurgmgrd: [13694]: unmounting /home/fsadmin/mysql/logs Sep 8 15:01:01 10.14.26.133 clurgmgrd: [13694]: Device /dev/sda5 is mounted on /usr/users/fsadmin/mysql/data instead of /home/fsadmin/mysql/data Sep 8 15:01:01 10.14.26.133 clurgmgrd: [13694]: unmounting /home/fsadmin/mysql/data Sep 8 15:01:01 10.14.26.133 clurgmgrd: [13694]: Device /dev/sda3 is mounted on /usr/users/fsadmin/cdrs instead of /home/fsadmin/cdrs Sep 8 15:01:01 10.14.26.133 clurgmgrd: [13694]: unmounting /home/fsadmin/cdrs Sep 8 15:01:01 10.14.26.133 clurgmgrd[13694]: Service Mplus is stopped Sep 8 15:01:01 10.14.26.133 clurgmgrd[13694]: Shutdown complete, exiting Sep 8 15:01:02 10.14.26.133 rgmanager: [31864]: Cluster Service Manager is stopped. Sep 8 15:01:02 10.14.26.133 WARNING: dlm_emergency_shutdown Sep 8 15:01:02 10.14.26.133 WARNING: dlm_emergency_shutdown Sep 8 15:01:07 10.14.26.133 CMAN 2.6.9-53.16 (built Jul 15 2008 14:07:56) installed Sep 8 15:01:07 10.14.26.133 DLM 2.6.9-52.12 (built Jul 15 2008 14:34:18) installed [/code] The Above Bold Letters indicates the Error Message I am not able to trace out. I have only one service called Mplus which through the script is starting up. Please Help me with the Issue. This e-mail and any attachment is for authorised use by the intended recipient(s) only. It may contain proprietary material, confidential information and/or be subject to legal privilege. It should not be copied, disclosed to, retained or used by, any other party. 
If you are not an intended recipient then please promptly delete this e-mail and any attachment and all copies and inform the sender. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... URL: From federico.simoncelli at gmail.com Mon Sep 8 15:23:27 2008 From: federico.simoncelli at gmail.com (Federico Simoncelli) Date: Mon, 8 Sep 2008 17:23:27 +0200 Subject: [Linux-cluster] Cluster Logwatch Message-ID: Hi all. Do you know if anyone has ever written a logwatch configuration/script for the cluster services? If not I'm going to start writing one on my own (is anyone interested in helping?). Thanks. -- Federico. From lhh at redhat.com Mon Sep 8 15:36:24 2008 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 08 Sep 2008 11:36:24 -0400 Subject: [Linux-cluster] [PATCH][RESEND] Add network interface select option for fence_xvmd In-Reply-To: <20080822081013.GA15358@localhost.localdomain> References: <20080822081013.GA15358@localhost.localdomain> Message-ID: <1220888184.4540.0.camel@ayanami> On Fri, 2008-08-22 at 17:10 +0900, Satoru SATOH wrote: > Hello, > > > I updated my patch for fence_xvmd to add network interface select option > posted before. > > This patch fixes the following issues ATST: > > 1. fence_xvmd selects wrong network interface to listen on if host has > multiple interfaces and target interface is not for default route. > As a result, fence_xvmd does not repond to fence_xvm's request. > 2. fence_xvmd cannot start if default route is not set. > > The following patch is for cluster-3 HEAD. > > The same problem exists in cluster-2 (rhel5's cluster) and I opened > bugzilla bug for that version: rhbz#459720. > Merged. Sorry for the delay. -- Lon From mockey.chen at nsn.com Tue Sep 9 06:58:40 2008 From: mockey.chen at nsn.com (Chen, Mockey (NSN - CN/Cheng Du)) Date: Tue, 9 Sep 2008 14:58:40 +0800 Subject: [Linux-cluster] How to config tie-breaker IP in RHEL 5.2 Message-ID: <174CED94DD8DC54AB888B56E103B118725D1A2@CNBEEXC007.nsn-intra.net> OS: RHEL 5.2. I have configured a two node cluster without shared disk, actually I have no shared disk in deployment. I want to even one of the node down, the other node can still provide service. I know there is trick called quorum disk and tie-breaker IP, since I have no shared disk resource, I want to use tie-breaker IP, but I can not find any information about tie-breaker IP in RHCS document. Any idea ? Thanks. Chen Ming -------------- next part -------------- An HTML attachment was scrubbed... URL: From cryptogrid at gmail.com Tue Sep 9 12:55:56 2008 From: cryptogrid at gmail.com (crypto grid) Date: Tue, 9 Sep 2008 09:55:56 -0300 Subject: [Linux-cluster] Two node cluster, quorum disk? Message-ID: Hi all, can anyone tell me if it's mandatory the use of a quorum disk on a two node cluster. Both nodes are connected to a SAN via two HBA's (I'm using multipath). The cluster will be configured in active/passive mode, supporting a database service. >From red hat documentation: "Configuring qdiskd is not required unless you have special requirements for node health." Any recommendations or suggestions will be appreciated. Thanks. -------------- next part -------------- An HTML attachment was scrubbed... URL: From stpierre at NebrWesleyan.edu Tue Sep 9 14:56:42 2008 From: stpierre at NebrWesleyan.edu (Chris St. Pierre) Date: Tue, 9 Sep 2008 09:56:42 -0500 (CDT) Subject: [Linux-cluster] Two node cluster, quorum disk? 
In-Reply-To: References: Message-ID: On Tue, 9 Sep 2008, crypto grid wrote: > Hi all, can anyone tell me if it's mandatory the use of a quorum disk on a > two node cluster. http://sources.redhat.com/cluster/wiki/FAQ/CMAN#quorumdiskneeded Chris St. Pierre Unix Systems Administrator Nebraska Wesleyan University From gspiegl at gmx.at Tue Sep 9 15:51:15 2008 From: gspiegl at gmx.at (Gerhard Spiegl) Date: Tue, 09 Sep 2008 17:51:15 +0200 Subject: [Linux-cluster] Use qdisk heuristics w/o a quorum device/partition Message-ID: <48C69B73.4000805@gmx.at> Hello all! We are trying to set up a 20 node cluster and want to use a "ping-heuristic" and a heuristic that checks the state of the fiberchannel ports. Is it possible to use qdisk heuristics without a dedicated quorum partition, as this setup would only support 16 nodes? qdiskd fails starting: Sep 9 14:37:15 ols017c qdiskd[25339]: Quorum Daemon Initializing Sep 9 14:37:15 ols017c qdiskd[25339]: Initialization failed Although "service qdiskd start" seems to be successfull (OK). # qdiskd -fd [4248] debug: Loading configuration information [4248] debug: Heuristic: 'ping -c1 -t3 172.27.111.254' score=1 interval=2 tko=1 [4248] debug: 1 heuristics loaded [4248] debug: Quorum Daemon: 1 heuristics, 1 interval, 10 tko, 1 votes [4248] debug: Run Flags: 00000035 [4248] info: Quorum Daemon Initializing stat: Bad address [4248] crit: Initialization failed [[snip cluster.conf]] [[/snip]] Already tried experimenting with interval and tko but without success. Any help would be appreciated. cheers Gerhard From kanderso at redhat.com Tue Sep 9 18:08:59 2008 From: kanderso at redhat.com (Kevin Anderson) Date: Tue, 09 Sep 2008 13:08:59 -0500 Subject: [Linux-cluster] Use qdisk heuristics w/o a quorum device/partition In-Reply-To: <48C69B73.4000805@gmx.at> References: <48C69B73.4000805@gmx.at> Message-ID: <1220983739.4295.10.camel@dhcp80-204.msp.redhat.com> On Tue, 2008-09-09 at 17:51 +0200, Gerhard Spiegl wrote: > Hello all! > > We are trying to set up a 20 node cluster and want to use a > "ping-heuristic" and a heuristic that checks the state of > the fiberchannel ports. What actions do you want to take place based on these heuristics? > > Is it possible to use qdisk heuristics without a dedicated > quorum partition, as this setup would only support 16 nodes? > There is a 16 node limitation to qdisk primarily because we think performance hitting the same small number of blocks on the disk by that many nodes will be abysmal. Lon would know, but probably a value you could change and play with in the code. Am more interested in what problem you are trying to solve with the heuristics? It doesn't seem to be quorum related as the normal cman/openais capabilities will work fine with that number of nodes. If you are worried about split sites, just add an additional node to the cluster that is some other location. The node would only be used for quorum votes. Kevin From gspiegl at gmx.at Tue Sep 9 19:19:15 2008 From: gspiegl at gmx.at (Gerhard Spiegl) Date: Tue, 09 Sep 2008 21:19:15 +0200 Subject: [Linux-cluster] Use qdisk heuristics w/o a quorum device/partition In-Reply-To: <1220983739.4295.10.camel@dhcp80-204.msp.redhat.com> References: <48C69B73.4000805@gmx.at> <1220983739.4295.10.camel@dhcp80-204.msp.redhat.com> Message-ID: <48C6CC33.10604@gmx.at> Hello Kevin, thanks for your reply. Kevin Anderson wrote: > On Tue, 2008-09-09 at 17:51 +0200, Gerhard Spiegl wrote: >> Hello all! 
>> >> We are trying to set up a 20 node cluster and want to use a >> "ping-heuristic" and a heuristic that checks the state of >> the fiberchannel ports. > > What actions do you want to take place based on these heuristics? The node should get fenced (or fence/reboot itself) if the public interface (bond0) looses connection - or both paths (dm-multipath) to the storage get lost. Without quorum device: We faced the problem of complete loss of storage connectivity resulting the GFS to withdraw (only when IO is issued on it (we only use GFS for xen vm definition files)), causing GFS and CLVM to lockup and never released. Only the manual reboot/halt solves the situation (in addition the specific node gets fenced after poweroff - trifle to late ;)). With quorum device: The node loosing the storage gets fenced because it looses the qdisk. Obviousley, but >16 nodes qdisk is not an option, so I wrote a small shell script to check the fiberchannel paths. The idea is when FC is lost, the node fences/reboots itself. But the heuristics only work when a "device=/dev/dm-8" is specified in cluster.conf tag. Without it the qdiskd refuses to start. >> Is it possible to use qdisk heuristics without a dedicated >> quorum partition, as this setup would only support 16 nodes? >> > There is a 16 node limitation to qdisk primarily because we think > performance hitting the same small number of blocks on the disk by that > many nodes will be abysmal. Lon would know, but probably a value you > could change and play with in the code. I read about this in the cluster wiki/FAQ and it sounds comprehensible, also we dont want to play around in the source as our goal is a fully supported configuration by RedHat.com > Am more interested in what problem you are trying to solve with the > heuristics? It doesn't seem to be quorum related as the normal > cman/openais capabilities will work fine with that number of nodes. It seems not to, as stated the loss of storage connectivity causes the whole cluster to disfunction, wich is not expected. If it helps I will send cluster.conf tomorrow as I m off the office today ( CET ). Maybe there is another way of detecting the storage failure but I couldnt find any docs. Also I would be glad if you could point me to a more comprehensive documentation anywhere on the net. > you are worried about split sites, just add an additional node to the > cluster that is some other location. The node would only be used for > quorum votes. I am not sure what you mean with split sites (split brain?), but thats not the issue. Do you mean an additional node without any service or failoverdomain configured? regards Gerhard From markwag at u.washington.edu Tue Sep 9 19:24:30 2008 From: markwag at u.washington.edu (Mark Wagner) Date: Tue, 9 Sep 2008 12:24:30 -0700 Subject: [Linux-cluster] Re: Two node cluster, quorum disk? In-Reply-To: <20080909160016.01AF68E07C6@hormel.redhat.com> References: <20080909160016.01AF68E07C6@hormel.redhat.com> Message-ID: <20080909192430.GM3470@n-its-markwag2.mcis.washington.edu> On Tue, Sep 09, 2008 at 12:00:16PM -0400, linux-cluster-request at redhat.com wrote: > On Tue, 9 Sep 2008, crypto grid wrote: > > > Hi all, can anyone tell me if it's mandatory the use of a quorum disk on a > > two node cluster. > > http://sources.redhat.com/cluster/wiki/FAQ/CMAN#quorumdiskneeded That answer may need a little editing since it seems to be saying two different things. 
Note that if you configure a quorum disk/partition, you want two_node="1" or expected_votes="2" since the quorum disk solves the voting imbalance. You want two_node="0" and expected_votes="3" (or nodes + 1 if it's not a two-node cluster). -- Mark Wagner System Administrator, UW Medicine IT Services 206-616-6119 From kanderso at redhat.com Tue Sep 9 19:35:01 2008 From: kanderso at redhat.com (Kevin Anderson) Date: Tue, 09 Sep 2008 14:35:01 -0500 Subject: [Linux-cluster] Use qdisk heuristics w/o a quorum device/partition In-Reply-To: <48C6CC33.10604@gmx.at> References: <48C69B73.4000805@gmx.at> <1220983739.4295.10.camel@dhcp80-204.msp.redhat.com> <48C6CC33.10604@gmx.at> Message-ID: <1220988901.4295.31.camel@dhcp80-204.msp.redhat.com> On Tue, 2008-09-09 at 21:19 +0200, Gerhard Spiegl wrote: > Hello Kevin, > > thanks for your reply. > > Kevin Anderson wrote: > > On Tue, 2008-09-09 at 17:51 +0200, Gerhard Spiegl wrote: > >> Hello all! > >> > >> We are trying to set up a 20 node cluster and want to use a > >> "ping-heuristic" and a heuristic that checks the state of > >> the fiberchannel ports. > > > > What actions do you want to take place based on these heuristics? > > The node should get fenced (or fence/reboot itself) if the public > interface (bond0) looses connection - or both paths (dm-multipath) > to the storage get lost. > > Without quorum device: > We faced the problem of complete loss of storage connectivity resulting > the GFS to withdraw (only when IO is issued on it (we only use GFS for > xen vm definition files)), causing GFS and CLVM to lockup and never > released. Only the manual reboot/halt solves the situation (in addition > the specific node gets fenced after poweroff - trifle to late ;)). You can avoid the withdraw and force a panic by using the debug mount option for your GFS filesystems. With debug set, GFS will when getting an I/O error, panic the system effectively self fencing the node. The reason behind withdraw was to give the operator a chance to gracefully remove the node from the cluster after filesystem failure. This is useful when multiple filesystems are mounted with multiple storage devices. A withdraw always requires rebooting the node to recover. However, in your case, panic action is probably what you want. We recently opened a new bugzilla for a new feature to give you better control of the options in this case. https://bugzilla.redhat.com/show_bug.cgi?id=461065 Anyway, the debug mount option should avoid the situation you are describing. > > > you are worried about split sites, just add an additional node to the > > cluster that is some other location. The node would only be used for > > quorum votes. > > I am not sure what you mean with split sites (split brain?), but thats not the > issue. Do you mean an additional node without any service or failoverdomain > configured? With split sites and an even number of nodes, you could end up in the situation that if an entire site goes down, you no longer have cluster quorum. Having an extra and therefor odd number of nodes in the cluster would enable the cluster to continue to operate at the remaining site. But, not the problem you were trying to solve in this case. Try -o debug on your GFS mount options. 
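For example, either directly on the command line or via fstab (the device and mount point below are only placeholders):

[code]
# one-off test mount with debug set
mount -t gfs -o debug /dev/mapper/myvg-mylv /mnt/gfs

# or persistently in /etc/fstab
/dev/mapper/myvg-mylv  /mnt/gfs  gfs  defaults,debug  0 0
[/code]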
Thanks Kevin From gspiegl at gmx.at Tue Sep 9 19:59:38 2008 From: gspiegl at gmx.at (Gerhard Spiegl) Date: Tue, 09 Sep 2008 21:59:38 +0200 Subject: [Linux-cluster] Use qdisk heuristics w/o a quorum device/partition In-Reply-To: <1220988901.4295.31.camel@dhcp80-204.msp.redhat.com> References: <48C69B73.4000805@gmx.at> <1220983739.4295.10.camel@dhcp80-204.msp.redhat.com> <48C6CC33.10604@gmx.at> <1220988901.4295.31.camel@dhcp80-204.msp.redhat.com> Message-ID: <48C6D5AA.8080706@gmx.at> Kevin Anderson wrote: > You can avoid the withdraw and force a panic by using the debug mount > option for your GFS filesystems. With debug set, GFS will when getting > an I/O error, panic the system effectively self fencing the node. The > reason behind withdraw was to give the operator a chance to gracefully > remove the node from the cluster after filesystem failure. This is > useful when multiple filesystems are mounted with multiple storage > devices. A withdraw always requires rebooting the node to recover. > However, in your case, panic action is probably what you want. We > recently opened a new bugzilla for a new feature to give you better > control of the options in this case. > https://bugzilla.redhat.com/show_bug.cgi?id=461065 > > Anyway, the debug mount option should avoid the situation you are > describing. If it does, it is exactly what we were looking for. In fact GFS reported an IO error in syslog (and on "ls" "df" ...), but only the "nice" withdraw happened. The only thing we found out was that passing the -w option to gfs_controld (init.d/cman) would avoid withdrawing GFS. We expected the kernel to panic but the only result was a puny syslog message :) Tomorrow I will try adding "debug" to the mount opts in fstab. > With split sites and an even number of nodes, you could end up in the > situation that if an entire site goes down, you no longer have cluster > quorum. Having an extra and therefor odd number of nodes in the cluster > would enable the cluster to continue to operate at the remaining site. Will keep this in mind, may become handy someday. > > Thanks > Kevin Thank You! Gerhard From Gerhardus.Geldenhuis at gta-travel.com Wed Sep 10 09:34:12 2008 From: Gerhardus.Geldenhuis at gta-travel.com (Gerhardus.Geldenhuis at gta-travel.com) Date: Wed, 10 Sep 2008 10:34:12 +0100 Subject: [Linux-cluster] Unable to retrieve batch 1776334432 status from.... Message-ID: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD240E@LONSEXC01.gta.travel.lcl> Hi I have justed started using cluster tools for the first time so my question might be a bit obvious to the expierenced eye. I am getting the following error when trying to do config changes using luci. Unable to retrieve batch 423452352 status from longapa02alt.gta.travel:11111: ccs_tool failed to propagate conf: unable to connect to the CCS daemon: connection refused Failed to update config file. I get the following error in /var/log/messages Sep 10 10:29:56 LONGAPA02ALT ccsd[11151]: Unable to connect to cluster infrastructure after 3570 seconds I am unsure what is causing this. I was able to initially create the cluster config and then it stopped working. Any suggestions as to what is causing this would be very much appreciated. I do get results from the ccs_tool so some parts are at least working... 
/sbin/ccs_tool lsnode Cluster name: apache-ha, config_version: 1 Nodename Votes Nodeid Fencetype longapa02alt.gta.travel.lcl 1 1 longapa02blt.gta.travel.lcl 1 2 Regards ______________________________________________________________________ This email has been scanned by the MessageLabs Email Security System. For more information please visit http://www.messagelabs.com/email ______________________________________________________________________ From mockey.chen at nsn.com Wed Sep 10 09:56:51 2008 From: mockey.chen at nsn.com (Chen, Mockey (NSN - CN/Cheng Du)) Date: Wed, 10 Sep 2008 17:56:51 +0800 Subject: [Linux-cluster] Unable to retrieve batch 1776334432 status from.... In-Reply-To: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD240E@LONSEXC01.gta.travel.lcl> References: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD240E@LONSEXC01.gta.travel.lcl> Message-ID: <174CED94DD8DC54AB888B56E103B118728C48B@CNBEEXC007.nsn-intra.net> >-----Original Message----- >From: linux-cluster-bounces at redhat.com >[mailto:linux-cluster-bounces at redhat.com] On Behalf Of ext >Gerhardus.Geldenhuis at gta-travel.com >Sent: Wednesday, September 10, 2008 5:34 PM >To: linux-cluster at redhat.com >Subject: [Linux-cluster] Unable to retrieve batch 1776334432 >status from.... > >Hi >I have justed started using cluster tools for the first time >so my question might be a bit obvious to the expierenced eye. > >I am getting the following error when trying to do config >changes using luci. >Unable to retrieve batch 423452352 status from >longapa02alt.gta.travel:11111: ccs_tool failed to propagate >conf: unable to connect to the CCS daemon: connection refused >Failed to update config file. > I maybe due to you not start ccsd daemon. try to use ccs_test connect to test whether you can connect your ccsd daemon. In RHCS 5, you can start ccsd by following command: service cman start From Gerhardus.Geldenhuis at gta-travel.com Wed Sep 10 10:05:34 2008 From: Gerhardus.Geldenhuis at gta-travel.com (Gerhardus.Geldenhuis at gta-travel.com) Date: Wed, 10 Sep 2008 11:05:34 +0100 Subject: [Linux-cluster] Unable to retrieve batch 1776334432 statusfrom.... In-Reply-To: <174CED94DD8DC54AB888B56E103B118728C48B@CNBEEXC007.nsn-intra.net> References: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD240E@LONSEXC01.gta.travel.lcl> <174CED94DD8DC54AB888B56E103B118728C48B@CNBEEXC007.nsn-intra.net> Message-ID: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2411@LONSEXC01.gta.travel.lcl> Thanks, I get the following error: service cman start Starting cluster: Loading modules... done Mounting configfs... done Starting ccsd... done Starting cman... failed cman not started: Can't find local node name in cluster.conf /usr/sbin/cman_tool: aisexec daemon didn't start [FAILED] I am looking into this atm, but any suggestions welcomed. Regards > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Chen, > Mockey (NSN - CN/Cheng Du) > Sent: 10 September 2008 10:57 > To: linux clustering > Subject: RE: [Linux-cluster] Unable to retrieve batch > 1776334432 statusfrom.... > > > > > >-----Original Message----- > >From: linux-cluster-bounces at redhat.com > >[mailto:linux-cluster-bounces at redhat.com] On Behalf Of ext > >Gerhardus.Geldenhuis at gta-travel.com > >Sent: Wednesday, September 10, 2008 5:34 PM > >To: linux-cluster at redhat.com > >Subject: [Linux-cluster] Unable to retrieve batch 1776334432 status > >from.... 
> > > >Hi > >I have justed started using cluster tools for the first time so my > >question might be a bit obvious to the expierenced eye. > > > >I am getting the following error when trying to do config > changes using > >luci. > >Unable to retrieve batch 423452352 status from > >longapa02alt.gta.travel:11111: ccs_tool failed to propagate > >conf: unable to connect to the CCS daemon: connection > refused Failed to > >update config file. > > > > I maybe due to you not start ccsd daemon. try to use > ccs_test connect > to test whether you can connect your ccsd daemon. > > In RHCS 5, you can start ccsd by following command: > service cman start > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > ______________________________________________________________________ This email has been scanned by the MessageLabs Email Security System. For more information please visit http://www.messagelabs.com/email ______________________________________________________________________ From Gerhardus.Geldenhuis at gta-travel.com Wed Sep 10 11:14:12 2008 From: Gerhardus.Geldenhuis at gta-travel.com (Gerhardus.Geldenhuis at gta-travel.com) Date: Wed, 10 Sep 2008 12:14:12 +0100 Subject: [Linux-cluster] Unable to retrieve batch 1776334432 statusfrom.... In-Reply-To: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2411@LONSEXC01.gta.travel.lcl> References: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD240E@LONSEXC01.gta.travel.lcl><174CED94DD8DC54AB888B56E103B118728C48B@CNBEEXC007.nsn-intra.net> <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2411@LONSEXC01.gta.travel.lcl> Message-ID: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2412@LONSEXC01.gta.travel.lcl> I am not sure what needs to be added to the config file for it to work. My config file looks like this atm: The hostname has been set in uppercase on the system, which I thought might be a problem but changing the config file to uppercase hostnames has not made a difference. Regards > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of > Gerhardus.Geldenhuis at gta-travel.com > Sent: 10 September 2008 11:06 > To: linux-cluster at redhat.com > Subject: RE: [Linux-cluster] Unable to retrieve batch > 1776334432 statusfrom.... > > Thanks, > I get the following error: > > service cman start > Starting cluster: > Loading modules... done > Mounting configfs... done > Starting ccsd... done > Starting cman... failed > cman not started: Can't find local node name in cluster.conf > /usr/sbin/cman_tool: aisexec daemon didn't start > [FAILED] > > I am looking into this atm, but any suggestions welcomed. > > Regards > > > > -----Original Message----- > > From: linux-cluster-bounces at redhat.com > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Chen, Mockey > > (NSN - CN/Cheng Du) > > Sent: 10 September 2008 10:57 > > To: linux clustering > > Subject: RE: [Linux-cluster] Unable to retrieve batch > > 1776334432 statusfrom.... > > > > > > > > > > >-----Original Message----- > > >From: linux-cluster-bounces at redhat.com > > >[mailto:linux-cluster-bounces at redhat.com] On Behalf Of ext > > >Gerhardus.Geldenhuis at gta-travel.com > > >Sent: Wednesday, September 10, 2008 5:34 PM > > >To: linux-cluster at redhat.com > > >Subject: [Linux-cluster] Unable to retrieve batch > 1776334432 status > > >from.... > > > > > >Hi > > >I have justed started using cluster tools for the first time so my > > >question might be a bit obvious to the expierenced eye. 
> > > > > >I am getting the following error when trying to do config > > changes using > > >luci. > > >Unable to retrieve batch 423452352 status from > > >longapa02alt.gta.travel:11111: ccs_tool failed to propagate > > >conf: unable to connect to the CCS daemon: connection > > refused Failed to > > >update config file. > > > > > > > I maybe due to you not start ccsd daemon. try to use > ccs_test connect > > to test whether you can connect your ccsd daemon. > > > > In RHCS 5, you can start ccsd by following command: > > service cman start > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > ______________________________________________________________________ > This email has been scanned by the MessageLabs Email Security System. > For more information please visit > http://www.messagelabs.com/email > ______________________________________________________________________ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > ______________________________________________________________________ This email has been scanned by the MessageLabs Email Security System. For more information please visit http://www.messagelabs.com/email ______________________________________________________________________ From Gerhardus.Geldenhuis at gta-travel.com Wed Sep 10 12:45:16 2008 From: Gerhardus.Geldenhuis at gta-travel.com (Gerhardus.Geldenhuis at gta-travel.com) Date: Wed, 10 Sep 2008 13:45:16 +0100 Subject: [Linux-cluster] Unable to retrieve batch 1776334432 statusfrom.... In-Reply-To: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2412@LONSEXC01.gta.travel.lcl> References: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD240E@LONSEXC01.gta.travel.lcl><174CED94DD8DC54AB888B56E103B118728C48B@CNBEEXC007.nsn-intra.net><1BCA52CF5845E543B4B81AAFEF2AFD7904AD2411@LONSEXC01.gta.travel.lcl> <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2412@LONSEXC01.gta.travel.lcl> Message-ID: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2413@LONSEXC01.gta.travel.lcl> Progress report, It looks like for some or other reason the name of the cluster is not set properly in the cman startup script. I have set it manually. Next problem is: cman not started: Can't connect to CCSD /usr/sbin/cman_tool: aisexec daemon didn't start Regards > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of > Gerhardus.Geldenhuis at gta-travel.com > Sent: 10 September 2008 12:14 > To: linux-cluster at redhat.com > Subject: RE: [Linux-cluster] Unable to retrieve batch > 1776334432 statusfrom.... > > I am not sure what needs to be added to the config file for > it to work. > > My config file looks like this atm: > > > > > nodeid="1" votes="1"/> > nodeid="2" votes="1"/> > > > > > > > The hostname has been set in uppercase on the system, which I > thought might be a problem but changing the config file to > uppercase hostnames has not made a difference. > > Regards > > > -----Original Message----- > > From: linux-cluster-bounces at redhat.com > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of > > Gerhardus.Geldenhuis at gta-travel.com > > Sent: 10 September 2008 11:06 > > To: linux-cluster at redhat.com > > Subject: RE: [Linux-cluster] Unable to retrieve batch > > 1776334432 statusfrom.... > > > > Thanks, > > I get the following error: > > > > service cman start > > Starting cluster: > > Loading modules... done > > Mounting configfs... 
done > > Starting ccsd... done > > Starting cman... failed > > cman not started: Can't find local node name in cluster.conf > > /usr/sbin/cman_tool: aisexec daemon didn't start > > [FAILED] > > > > I am looking into this atm, but any suggestions welcomed. > > > > Regards > > > > > > > -----Original Message----- > > > From: linux-cluster-bounces at redhat.com > > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of > Chen, Mockey > > > (NSN - CN/Cheng Du) > > > Sent: 10 September 2008 10:57 > > > To: linux clustering > > > Subject: RE: [Linux-cluster] Unable to retrieve batch > > > 1776334432 statusfrom.... > > > > > > > > > > > > > > > >-----Original Message----- > > > >From: linux-cluster-bounces at redhat.com > > > >[mailto:linux-cluster-bounces at redhat.com] On Behalf Of ext > > > >Gerhardus.Geldenhuis at gta-travel.com > > > >Sent: Wednesday, September 10, 2008 5:34 PM > > > >To: linux-cluster at redhat.com > > > >Subject: [Linux-cluster] Unable to retrieve batch > > 1776334432 status > > > >from.... > > > > > > > >Hi > > > >I have justed started using cluster tools for the first > time so my > > > >question might be a bit obvious to the expierenced eye. > > > > > > > >I am getting the following error when trying to do config > > > changes using > > > >luci. > > > >Unable to retrieve batch 423452352 status from > > > >longapa02alt.gta.travel:11111: ccs_tool failed to propagate > > > >conf: unable to connect to the CCS daemon: connection > > > refused Failed to > > > >update config file. > > > > > > > > > > I maybe due to you not start ccsd daemon. try to use > > ccs_test connect > > > to test whether you can connect your ccsd daemon. > > > > > > In RHCS 5, you can start ccsd by following command: > > > service cman start > > > > > > > > > -- > > > Linux-cluster mailing list > > > Linux-cluster at redhat.com > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > ______________________________________________________________________ > > This email has been scanned by the MessageLabs Email > Security System. > > For more information please visit > > http://www.messagelabs.com/email > > > ______________________________________________________________________ > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > ______________________________________________________________________ > This email has been scanned by the MessageLabs Email Security System. > For more information please visit > http://www.messagelabs.com/email > ______________________________________________________________________ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > ______________________________________________________________________ This email has been scanned by the MessageLabs Email Security System. For more information please visit http://www.messagelabs.com/email ______________________________________________________________________ From Gerhardus.Geldenhuis at gta-travel.com Wed Sep 10 13:23:03 2008 From: Gerhardus.Geldenhuis at gta-travel.com (Gerhardus.Geldenhuis at gta-travel.com) Date: Wed, 10 Sep 2008 14:23:03 +0100 Subject: [Linux-cluster] Unable to retrieve batch 1776334432 statusfrom.... 
In-Reply-To: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2413@LONSEXC01.gta.travel.lcl> References: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD240E@LONSEXC01.gta.travel.lcl><174CED94DD8DC54AB888B56E103B118728C48B@CNBEEXC007.nsn-intra.net><1BCA52CF5845E543B4B81AAFEF2AFD7904AD2411@LONSEXC01.gta.travel.lcl><1BCA52CF5845E543B4B81AAFEF2AFD7904AD2412@LONSEXC01.gta.travel.lcl> <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2413@LONSEXC01.gta.travel.lcl> Message-ID: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2414@LONSEXC01.gta.travel.lcl> For the benefit of future googlers. I managed to get the cman service started without any error messages. I also removed my customizations from the init script. It looks like the cman service is not able to detect whether services it starts up has been started and then fails. I ended up stopping all services having only the following running: Active Internet connections (servers and established) Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name tcp 0 0 0.0.0.0:11111 0.0.0.0:* LISTEN 3918/ricci tcp 0 0 127.0.0.1:6010 0.0.0.0:* LISTEN 5721/0 tcp 0 0 :::22 :::* LISTEN 3601/sshd tcp 0 0 ::1:6010 :::* LISTEN 5721/0 tcp 0 0 ::ffff:10.x.x.x:22 ::ffff:10.x.x.x:40884 ESTABLISHED 5718/sshd: lcp13o [ I then ran service cman start Which started up without a problem Regards > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of > Gerhardus.Geldenhuis at gta-travel.com > Sent: 10 September 2008 13:45 > To: linux-cluster at redhat.com > Subject: RE: [Linux-cluster] Unable to retrieve batch > 1776334432 statusfrom.... > > Progress report, > It looks like for some or other reason the name of the > cluster is not set properly in the cman startup script. I > have set it manually. > > Next problem is: > cman not started: Can't connect to CCSD /usr/sbin/cman_tool: > aisexec daemon didn't start > > Regards > > > -----Original Message----- > > From: linux-cluster-bounces at redhat.com > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of > > Gerhardus.Geldenhuis at gta-travel.com > > Sent: 10 September 2008 12:14 > > To: linux-cluster at redhat.com > > Subject: RE: [Linux-cluster] Unable to retrieve batch > > 1776334432 statusfrom.... > > > > I am not sure what needs to be added to the config file for it to > > work. > > > > My config file looks like this atm: > > > > > > > > > > > nodeid="1" votes="1"/> > > > nodeid="2" votes="1"/> > > > > > > > > > > > > > > The hostname has been set in uppercase on the system, which > I thought > > might be a problem but changing the config file to > uppercase hostnames > > has not made a difference. > > > > Regards > > > > > -----Original Message----- > > > From: linux-cluster-bounces at redhat.com > > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of > > > Gerhardus.Geldenhuis at gta-travel.com > > > Sent: 10 September 2008 11:06 > > > To: linux-cluster at redhat.com > > > Subject: RE: [Linux-cluster] Unable to retrieve batch > > > 1776334432 statusfrom.... > > > > > > Thanks, > > > I get the following error: > > > > > > service cman start > > > Starting cluster: > > > Loading modules... done > > > Mounting configfs... done > > > Starting ccsd... done > > > Starting cman... failed > > > cman not started: Can't find local node name in cluster.conf > > > /usr/sbin/cman_tool: aisexec daemon didn't start > > > > [FAILED] > > > > > > I am looking into this atm, but any suggestions welcomed. 
> > > > > > Regards > > > > > > > > > > -----Original Message----- > > > > From: linux-cluster-bounces at redhat.com > > > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of > > Chen, Mockey > > > > (NSN - CN/Cheng Du) > > > > Sent: 10 September 2008 10:57 > > > > To: linux clustering > > > > Subject: RE: [Linux-cluster] Unable to retrieve batch > > > > 1776334432 statusfrom.... > > > > > > > > > > > > > > > > > > > > >-----Original Message----- > > > > >From: linux-cluster-bounces at redhat.com > > > > >[mailto:linux-cluster-bounces at redhat.com] On Behalf Of ext > > > > >Gerhardus.Geldenhuis at gta-travel.com > > > > >Sent: Wednesday, September 10, 2008 5:34 PM > > > > >To: linux-cluster at redhat.com > > > > >Subject: [Linux-cluster] Unable to retrieve batch > > > 1776334432 status > > > > >from.... > > > > > > > > > >Hi > > > > >I have justed started using cluster tools for the first > > time so my > > > > >question might be a bit obvious to the expierenced eye. > > > > > > > > > >I am getting the following error when trying to do config > > > > changes using > > > > >luci. > > > > >Unable to retrieve batch 423452352 status from > > > > >longapa02alt.gta.travel:11111: ccs_tool failed to propagate > > > > >conf: unable to connect to the CCS daemon: connection > > > > refused Failed to > > > > >update config file. > > > > > > > > > > > > > I maybe due to you not start ccsd daemon. try to use > > > ccs_test connect > > > > to test whether you can connect your ccsd daemon. > > > > > > > > In RHCS 5, you can start ccsd by following command: > > > > service cman start > > > > > > > > > > > > -- > > > > Linux-cluster mailing list > > > > Linux-cluster at redhat.com > > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > > > > > > ______________________________________________________________________ > > > This email has been scanned by the MessageLabs Email > > Security System. > > > For more information please visit > > > http://www.messagelabs.com/email > > > > > > ______________________________________________________________________ > > > > > > -- > > > Linux-cluster mailing list > > > Linux-cluster at redhat.com > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > ______________________________________________________________________ > > This email has been scanned by the MessageLabs Email > Security System. > > For more information please visit > > http://www.messagelabs.com/email > > > ______________________________________________________________________ > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > ______________________________________________________________________ > This email has been scanned by the MessageLabs Email Security System. > For more information please visit > http://www.messagelabs.com/email > ______________________________________________________________________ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > ______________________________________________________________________ This email has been scanned by the MessageLabs Email Security System. 
For more information please visit http://www.messagelabs.com/email ______________________________________________________________________ From nick at javacat.f2s.com Wed Sep 10 14:42:29 2008 From: nick at javacat.f2s.com (Nick Lunt) Date: Wed, 10 Sep 2008 15:42:29 +0100 Subject: [Linux-cluster] Unable to retrieve batch 1776334432 statusfrom.... In-Reply-To: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2413@LONSEXC01.gta.travel.lcl> References: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD240E@LONSEXC01.gta.travel.lcl><174CED94DD8DC54AB888B56E103B118728C48B@CNBEEXC007.nsn-intra.net><1BCA52CF5845E543B4B81AAFEF2AFD7904AD2411@LONSEXC01.gta.travel.lcl> <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2412@LONSEXC01.gta.travel.lcl> <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2413@LONSEXC01.gta.travel.lcl> Message-ID: <000301c91353$733bc500$59b34f00$@f2s.com> Hi Gerhardus > cman not started: Can't connect to CCSD /usr/sbin/cman_tool: aisexec I had the same issue, caused by openais being chkconfiged on. Try to chkconfig openais off and see how it goes. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Gerhardus.Geldenhuis at gta-travel.com Sent: 10 September 2008 13:45 To: linux-cluster at redhat.com Subject: RE: [Linux-cluster] Unable to retrieve batch 1776334432 statusfrom.... Progress report, It looks like for some or other reason the name of the cluster is not set properly in the cman startup script. I have set it manually. Next problem is: cman not started: Can't connect to CCSD /usr/sbin/cman_tool: aisexec daemon didn't start Regards > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of > Gerhardus.Geldenhuis at gta-travel.com > Sent: 10 September 2008 12:14 > To: linux-cluster at redhat.com > Subject: RE: [Linux-cluster] Unable to retrieve batch > 1776334432 statusfrom.... > > I am not sure what needs to be added to the config file for > it to work. > > My config file looks like this atm: > > > > > nodeid="1" votes="1"/> > nodeid="2" votes="1"/> > > > > > > > The hostname has been set in uppercase on the system, which I > thought might be a problem but changing the config file to > uppercase hostnames has not made a difference. > > Regards > > > -----Original Message----- > > From: linux-cluster-bounces at redhat.com > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of > > Gerhardus.Geldenhuis at gta-travel.com > > Sent: 10 September 2008 11:06 > > To: linux-cluster at redhat.com > > Subject: RE: [Linux-cluster] Unable to retrieve batch > > 1776334432 statusfrom.... > > > > Thanks, > > I get the following error: > > > > service cman start > > Starting cluster: > > Loading modules... done > > Mounting configfs... done > > Starting ccsd... done > > Starting cman... failed > > cman not started: Can't find local node name in cluster.conf > > /usr/sbin/cman_tool: aisexec daemon didn't start > > [FAILED] > > > > I am looking into this atm, but any suggestions welcomed. > > > > Regards > > > > > > > -----Original Message----- > > > From: linux-cluster-bounces at redhat.com > > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of > Chen, Mockey > > > (NSN - CN/Cheng Du) > > > Sent: 10 September 2008 10:57 > > > To: linux clustering > > > Subject: RE: [Linux-cluster] Unable to retrieve batch > > > 1776334432 statusfrom.... 
--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

From Gerhardus.Geldenhuis at gta-travel.com  Wed Sep 10 16:09:50 2008
From: Gerhardus.Geldenhuis at gta-travel.com (Gerhardus.Geldenhuis at gta-travel.com)
Date: Wed, 10 Sep 2008 17:09:50 +0100
Subject: [Linux-cluster] Monitoring services/customize failure criteria
Message-ID: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2415@LONSEXC01.gta.travel.lcl>

Hi,
I have managed to successfully install and configure a basic failover cluster consisting of two Apache boxes and a VIP. I am still a bit unclear about how and where the cluster software monitors a resource I have added. I have added a script resource that points to the Apache init script. The GUI appears to be clever enough to issue a start command for the Apache service, but I am not sure whether it does this for every script; what if I use a custom script that does not take a start/stop parameter? How can I customize the "failure criteria" for a cluster resource?
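For reference, my working assumption is that a script resource is treated like an init script and that rgmanager also calls it periodically with a "status" argument, so a custom script would need to look roughly like the skeleton below (mydaemon is just a placeholder); please correct me if that assumption is wrong:

[code]
#!/bin/sh
# skeleton of what I assume rgmanager expects from a script resource
case "$1" in
  start)
        /usr/local/bin/mydaemon &
        ;;
  stop)
        killall mydaemon
        ;;
  status)
        # presumably this is where "failure criteria" come in:
        # a non-zero exit here should mark the resource as failed
        pidof mydaemon >/dev/null
        ;;
esac
exit $?
[/code]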
Regards

From jstoner at opsource.net  Wed Sep 10 17:58:19 2008
From: jstoner at opsource.net (Jeff Stoner)
Date: Wed, 10 Sep 2008 18:58:19 +0100
Subject: [Linux-cluster] Monitoring services/customize failure criteria
In-Reply-To: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2415@LONSEXC01.gta.travel.lcl>
References: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2415@LONSEXC01.gta.travel.lcl>
Message-ID: <38A48FA2F0103444906AD22E14F1B5A30802AF6D@mailxchg01.corp.opsource.net>

> I am still a bit unclear about how/where the cluster software
> monitors a resource I have added.

/usr/share/cluster has the shell scripts used to manage resources, however...

> I have added a script resource that points to the
> init script of apache.

The Script resource is a generic resource (as opposed to the more specific mysql, nfs, ip, file system, etc. resources.) The script identified in the