From pcaulfie at redhat.com Thu Nov 1 08:46:12 2007
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Thu, 01 Nov 2007 08:46:12 +0000
Subject: [Linux-cluster] DLM Document
Message-ID: <47299254.4040709@redhat.com>

For those wanting more detail on writing applications to use the Red Hat DLM,
I have prepared this document:

http://people.redhat.com/pcaulfie/docs/rhdlmbook.pdf

It's based quite heavily on the IBM dlmbook document, so many thanks to
Kristin Thomas for that work. It has been updated and modified to include
things specific to our DLM, including the API reference.

Any comments gratefully received.

Patrick

From sanelson at gmail.com Thu Nov 1 11:03:12 2007
From: sanelson at gmail.com (Stephen Nelson-Smith)
Date: Thu, 1 Nov 2007 11:03:12 +0000
Subject: [Linux-cluster] High Availability Virtualisation
Message-ID:

Hello,

I presently run a bunch of OpenVZ VEs on a fairly beefy machine.

I am somewhat concerned that if this machine fails, the VMs fail too.

Other than using redundant hardware (multiple PSUs, mirrored disks, etc.),
how can I increase availability? I could put the virtual environments on a
shared filesystem, but really I'd like some kind of failover mechanism.
Is this asking too much?

This looks interesting: http://www.pro-linux.de/work/virtual-ha/virtual-ha5.html

But my German is very rusty, so it's heavy going!

Any ideas?

S.

From mike at technomonk.com Thu Nov 1 12:28:18 2007
From: mike at technomonk.com (Mike Preston - Technomonk Industries)
Date: Thu, 01 Nov 2007 12:28:18 +0000
Subject: [Linux-cluster] High Availability Virtualisation
In-Reply-To:
References:
Message-ID: <4729C662.7010605@technomonk.com>

Stephen Nelson-Smith wrote:
> Hello,
>
> I presently run a bunch of OpenVZ VEs on a fairly beefy machine.
>
> I am somewhat concerned that if this machine fails, the VMs fail too.
>
> Other than using redundant hardware (multiple PSUs, mirrored disks, etc.),
> how can I increase availability? I could put the virtual environments on a
> shared filesystem, but really I'd like some kind of failover mechanism.
> Is this asking too much?
>
> This looks interesting: http://www.pro-linux.de/work/virtual-ha/virtual-ha5.html
>
> But my German is very rusty, so it's heavy going!

Google Translate helps:
http://translate.google.com/translate?u=http%3A%2F%2Fwww.pro-linux.de%2Fwork%2Fvirtual-ha%2Fvirtual-ha5.html&langpair=de%7Cen&hl=en&ie=UTF8

How I would do it is with at least one other server. Partition the machines
into multiple Xen domains, each Xen domain running something like VServer
(supported in Debian) or, if you prefer it, OpenVZ. With shared storage
(which is where DRBD comes in; it works like a network RAID 1) you can
seamlessly migrate Xen domains from machine to machine, or restart them on
the other machine in the event of failure. This allows a failed machine to
have all of its Xen domains started up on another server (since DRBD has
kept the storage in sync between them) or, if the downtime is scheduled,
live migrated to other boxes.

Mike

> Any ideas?
>
> S.

--
Mike Preston
mike at technomonk.com
Technomonk Industries
T: +44 (0) 116 2 988 433
M: +44 (0) 7849 72 68 27
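[Editor's sketch: the commands below illustrate the DRBD-plus-Xen approach Mike
describes. The resource name "ve01", target host "nodeb" and config path are
hypothetical, and details such as enabling xend relocation and dual-primary
DRBD for live migration are glossed over.]

    # Check that the DRBD resource backing the guest is connected and in sync:
    drbdadm cstate ve01      # expect "Connected"
    drbdadm dstate ve01      # expect "UpToDate/UpToDate"

    # Scheduled downtime: move the running guest to the other host
    xm migrate --live ve01 nodeb

    # Unplanned failure of the first host: promote the replica on the
    # survivor and restart the guest from its config file
    drbdadm primary ve01
    xm create /etc/xen/ve01.cfg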
From bmarzins at redhat.com Thu Nov 1 17:49:39 2007
From: bmarzins at redhat.com (Benjamin Marzinski)
Date: Thu, 1 Nov 2007 12:49:39 -0500
Subject: [Linux-cluster] fence gnbd doesn't work as expected
In-Reply-To: <4726E4B7.9000700@gmail.com>
References: <4726E4B7.9000700@gmail.com>
Message-ID: <20071101174939.GD3435@ether.msp.redhat.com>

On Tue, Oct 30, 2007 at 09:00:55AM +0100, carlopmart wrote:
> Hi all,
>
> I have already installed a two-node cluster using gnbd as a fence device.
> When the two nodes come up at the same time everything works OK, but when
> I need to start only one node, GFS doesn't mount because the fence device
> doesn't work. The error is:
>
>   Mounting GFS filesystems: /sbin/mount.gfs: lock_dlm_join: gfs_controld
>   join error: -22
>   /sbin/mount.gfs: error mounting lockproto lock_dlm.
>
> I am using a third server as a GNBD server without serving disks. Why
> doesn't this work? Perhaps I need a quorum disk?

Let me see if I understand what you are doing. You want to use fence_gnbd as
your fence device, but the nodes in your cluster aren't actually using gnbd
devices for their shared storage. If this is true, it won't work at all.

All fence_gnbd guarantees is that the fenced node will not be able to access
its gnbd devices. If the GFS filesystems are on the gnbd devices, this will
keep the fenced node from being able to corrupt them. If a GFS filesystem is
not on a GNBD device, fence_gnbd does nothing at all to protect it from
corruption.

You really need a quorum disk to deal with this.

-Ben

> My cluster.conf:
>
> [quoted cluster.conf omitted; the XML did not survive the archive]
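[Editor's sketch: a minimal quorum-disk setup of the kind Ben recommends for a
two-node cluster. The device path, label, votes and timings are illustrative
and would need tuning for a real deployment.]

    # Initialise the quorum disk on a small shared LUN visible to both nodes
    mkqdisk -c /dev/sdc1 -l hpulabs_qdisk

    # Then, in cluster.conf, add something along the lines of:
    #   <quorumd interval="1" tko="10" votes="1" label="hpulabs_qdisk"/>
    # and raise <cman expected_votes="..."/> to account for the extra vote.

    # Finally, start the daemon on every node
    service qdiskd start
    chkconfig qdiskd on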
From lhh at redhat.com Fri Nov 30 01:39:54 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 29 Nov 2007 20:39:54 -0500
Subject: [Linux-cluster] I give up
In-Reply-To: <474EED84.8050506@efacec.pt>
References: <474D9334.3020602@bxwa.com> <1196268313.2827.21.camel@localhost.localdomain> <474DA9BF.7050006@bxwa.com> <1196281964.2827.76.camel@localhost.localdomain> <474DDD32.1020908@bxwa.com> <474DEB48.3000209@redhat.com> <474DF79D.6040403@bxwa.com> <474EEADB.9040908@bxwa.com> <474EED84.8050506@efacec.pt>
Message-ID: <1196386794.10025.13.camel@ayanami.boston.devel.redhat.com>

On Thu, 2007-11-29 at 16:49 +0000, Marcos David wrote:
> There are no failover domains defined in your cluster.conf.
> This could explain why no other nodes take over the service....

His logs didn't indicate a failover domain problem - one of the nodes didn't
try to fence the other when it should have, as far as I can tell.

-- Lon

From lhh at redhat.com Fri Nov 30 01:40:51 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 29 Nov 2007 20:40:51 -0500
Subject: [Linux-cluster] I give up
In-Reply-To: <474EF015.10103@bxwa.com>
References: <474D9334.3020602@bxwa.com> <1196268313.2827.21.camel@localhost.localdomain> <474DA9BF.7050006@bxwa.com> <1196281964.2827.76.camel@localhost.localdomain> <474DDD32.1020908@bxwa.com> <474DEB48.3000209@redhat.com> <474DF79D.6040403@bxwa.com> <474EEADB.9040908@bxwa.com> <474EED84.8050506@efacec.pt> <474EF015.10103@bxwa.com>
Message-ID: <1196386851.10025.15.camel@ayanami.boston.devel.redhat.com>

On Thu, 2007-11-29 at 09:00 -0800, Scott Becker wrote:
> Marcos David wrote:
> > There are no failover domains defined in your cluster.conf.
> > This could explain why no other nodes take over the service....
>
> It's my understanding from the man pages that failover domains are an
> option to configure a service to run on only a subset of the nodes.

Correct. Or an ordered set of nodes, e.g. if you want a service to prefer
node 1, for example.

-- Lon

From lhh at redhat.com Fri Nov 30 01:41:37 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 29 Nov 2007 20:41:37 -0500
Subject: [Linux-cluster] I give up
In-Reply-To: <1196356312.16961.5.camel@localhost.localdomain>
References: <474D9334.3020602@bxwa.com> <1196268313.2827.21.camel@localhost.localdomain> <474DA9BF.7050006@bxwa.com> <1196281964.2827.76.camel@localhost.localdomain> <474DDD32.1020908@bxwa.com> <474DEB48.3000209@redhat.com> <474DF79D.6040403@bxwa.com> <474EEADB.9040908@bxwa.com> <474EED84.8050506@efacec.pt> <474EF015.10103@bxwa.com> <1196356312.16961.5.camel@localhost.localdomain>
Message-ID: <1196386897.10025.17.camel@ayanami.boston.devel.redhat.com>

On Thu, 2007-11-29 at 18:11 +0100, jr wrote:
> from my understanding a failover domain is required whenever you want
> other nodes to take over a service. the subset is if you make it
> restricted, isn't it?
> regards,

They're optional. If you don't define one, it's the same as saying
"unordered, unrestricted failover domain of all nodes in the cluster".

-- Lon

From gsrlinux at gmail.com Fri Nov 30 04:03:06 2007
From: gsrlinux at gmail.com (GS R)
Date: Fri, 30 Nov 2007 09:33:06 +0530
Subject: [Linux-cluster] Conga and Ricci certificate
In-Reply-To: <50352.62.101.100.5.1196349932.squirrel@picard.linux.it>
References: <50352.62.101.100.5.1196349932.squirrel@picard.linux.it>
Message-ID:

> With system-config-cluster I made a failover domain with 2 Xen VMs and an
> Apache HTTP server. This cluster works.
>
> Now I'm installing luci and ricci on dom0, and ricci on both domUs.
>
> In Conga:
>
> HOMEBASE ----> Add an existing cluster
>
> I insert the IP of dom0, but when I insert the domU IP I get this error:
>
> The following errors occurred:
>
> * Unable to add the key for node vm03-dadmin.example.prv to the trusted
>   keys list.
> * Unable to connect to the ricci agent on vm03-dadmin.example.prv:
>   Unable to establish an SSL connection to vm03-dadmin.example.prv:11111:
>   ricci's certificate is not trusted
>
> I don't understand. With dom0 and the first domU it's OK.... Only on the
> second domU do I see this error ...
> :-(
>
> On dom0:
>
> tail -f /var/log/messages
>
> Nov 29 16:12:43 zeus03-dom0 luci[28835]: Unable to establish an SSL
> connection to zeus03-dadmin.replynet.prv:11111: ricci's certificate is not
> trusted
> Nov 29 16:26:37 zeus03-dom0 luci[28835]: Error reading from
> zeus03-dadmin.replynet.prv:11111: timeout
> Nov 29 16:26:37 zeus03-dom0 luci[28835]: The SSL certificate for host
> "zeus03-dadmin.replynet.prv" is not trusted. Aborting connection attempt.
> Nov 29 16:26:37 zeus03-dom0 luci[28835]: Unable to establish an SSL
> connection to zeus03-dadmin.replynet.prv:11111: ricci's certificate is not
> trusted
>
> I don't understand.

Hi,

1. Make sure you have the correct entries in /etc/hosts for all the nodes.
   luci refers to /etc/hosts.
2. Start ricci on all the nodes before adding them to the cluster using
   Conga.

luci is not able to identify zeus03-dadmin.replynet.prv. Adding the entries
in /etc/hosts should fix your problem.

One more thing: since you are dealing with Xen VMs, I hope you are aware that
your dom0 should not be a part of its domU cluster and vice versa.

Thanks
GSR

From marcos.david at efacec.pt Fri Nov 30 09:16:11 2007
From: marcos.david at efacec.pt (Marcos David)
Date: Fri, 30 Nov 2007 09:16:11 +0000
Subject: [Linux-cluster] I give up
In-Reply-To: <1196386897.10025.17.camel@ayanami.boston.devel.redhat.com>
References: <474D9334.3020602@bxwa.com> <1196268313.2827.21.camel@localhost.localdomain> <474DA9BF.7050006@bxwa.com> <1196281964.2827.76.camel@localhost.localdomain> <474DDD32.1020908@bxwa.com> <474DEB48.3000209@redhat.com> <474DF79D.6040403@bxwa.com> <474EEADB.9040908@bxwa.com> <474EED84.8050506@efacec.pt> <474EF015.10103@bxwa.com> <1196356312.16961.5.camel@localhost.localdomain> <1196386897.10025.17.camel@ayanami.boston.devel.redhat.com>
Message-ID: <474FD4DB.6030704@efacec.pt>

Lon Hohberger wrote:
> On Thu, 2007-11-29 at 18:11 +0100, jr wrote:
>> from my understanding a failover domain is required whenever you want
>> other nodes to take over a service. the subset is if you make it
>> restricted, isn't it?
>> regards,
>
> They're optional. If you don't define one, it's the same as saying
> "unordered, unrestricted failover domain of all nodes in the cluster".
>
> -- Lon

Ok, thanks for clearing that up.

Greets,
Marcos David

From johannes.russek at io-consulting.net Fri Nov 30 10:23:09 2007
From: johannes.russek at io-consulting.net (jr)
Date: Fri, 30 Nov 2007 11:23:09 +0100
Subject: [Linux-cluster] Live migration of VMs instead of relocation
Message-ID: <1196418189.16961.9.camel@localhost.localdomain>

Hello everybody,

I was wondering if I could somehow get rgmanager to use live migration of VMs
when the preferred member of a failover domain for a certain VM service comes
up again after a failure.

The way it is right now is that if rgmanager detects a failure of a node, the
virtual machine gets taken over by a different node with a lower priority. As
soon as the primary node comes back into the cluster, rgmanager relocates the
VM to that node, which means shutting it down and starting it on that node
again. As I managed to get live migration working in the cluster, I'd like to
have rgmanager make use of that.

Is there a known configuration for this?

Best regards,
johannes russek
From mousavi.ehsan at gmail.com Fri Nov 30 11:30:20 2007
From: mousavi.ehsan at gmail.com (Ehsan Mousavi)
Date: Fri, 30 Nov 2007 15:00:20 +0330
Subject: [Linux-cluster] C-Sharifi
Message-ID:

C-Sharifi Cluster Engine: The Second Success Story on the "Kernel-Level
Paradigm" for Distributed Computing Support

Contrary to the two schools of thought on providing system software support
for distributed computation, which advocate either the development of a whole
new distributed operating system (like Mach) or the development of
library-based or patch-based middleware on top of existing operating systems
(like MPI, Kerrighed and Mosix), Dr. Mohsen Sharifi hypothesized another
school of thought as his thesis in 1986: that all distributed systems software
requirements and supports can be, and must be, built at the kernel level of
existing operating systems; requirements like ease of programming, simplicity,
efficiency, accessibility, etc., which may be coined as "usability".

Although the latter belief was hard to realize, a sample byproduct called DIPC
was built purely on this thesis and openly announced to the Linux community
worldwide in 1993. It was admired for providing the necessary support for
distributed communication at the kernel level of Linux for the first time in
the world, and for providing ease of programming as a consequence of being
realized at the kernel level. However, it was criticized at the same time as
being inefficient. This did not force the school to trade ease of programming
for efficiency; instead, it tried hard to achieve efficiency, alongside ease
of programming and simplicity, without abandoning the school that advocates
the provision of all needs at the kernel level. The result of this effort is
now manifested in the C-Sharifi Cluster Engine.

C-Sharifi is a cost-effective distributed system software engine in support of
high-performance computing on clusters of off-the-shelf computers. It is
wholly implemented in the kernel and, as a consequence of following this
school, it offers ease of programming, ease of clustering and simplicity, and
it can be configured to fit as closely as possible the efficiency requirements
of applications that need high performance. It supports both distributed
shared memory and message passing styles, it is built in Linux, and its
cost/performance ratio in some scientific applications (like meteorology and
cryptanalysis) has shown to be far better than that of non-kernel-based
solutions and engines (like MPI, Kerrighed and Mosix).

Best regards,
Leili Mirtaheri
Ehsan Mousavi
C-Sharifi Development Team

From xbfair at citistreetonline.com Fri Nov 30 14:34:45 2007
From: xbfair at citistreetonline.com (Fair, Brian)
Date: Fri, 30 Nov 2007 09:34:45 -0500
Subject: [Linux-cluster] Adding new file system caused problems
In-Reply-To: <474C5260.6030908@noaa.gov>
References: <474C5260.6030908@noaa.gov>
Message-ID: <97F238EA86B5704DBAD740518CF829100394AE0C@hwpms600.tbo.citistreet.org>

I think this is something we see.
The workaround has basically been to disable clustering (LVM-wise) when doing
this kind of change, and to handle it manually, i.e.:

vgchange -c n              to disable the cluster flag
lvmconf --disable-cluster  on all nodes
rescan/discover the LUN, whatever, on all nodes
lvcreate                   on one node
lvchange --refresh         on every node
lvchange -a y              on one node
gfs_grow                   on one host (you can run this on the other to
                           confirm; it should say it can't grow any more)

When done, I've been putting things back how they were with vgchange -c y and
lvmconf --enable-cluster, though I think if you just left it unclustered it'd
be fine... What you won't want to do is leave the VG clustered but not
--enable-cluster... if you do this, the clustered volume groups won't be
activated when you reboot.

Hope this helps... if anyone knows of a definitive fix for this I'd like to
hear about it. We haven't pushed for it since it isn't too big of a hassle and
we aren't constantly adding new volumes, but it is a pain.

Brian Fair, UNIX Administrator, CitiStreet
904.791.2662

From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Randy Brown
Sent: Tuesday, November 27, 2007 12:23 PM
To: linux clustering
Subject: [Linux-cluster] Adding new file system caused problems

I am running a two-node cluster using CentOS 5 that is basically being used as
a NAS head for our iSCSI-based storage. Here are the related RPMs and their
versions I am using:

kmod-gfs-0.1.16-5.2.6.18_8.1.14.el5
kmod-gfs-0.1.16-6.2.6.18_8.1.15.el5
system-config-lvm-1.0.22-1.0.el5
cman-2.0.64-1.0.1.el5
rgmanager-2.0.24-1.el5.centos
gfs-utils-0.1.11-3.el5
lvm2-2.02.16-3.el5
lvm2-cluster-2.02.16-3.el5

This morning I created a 100GB volume on our storage unit and proceeded to
make it available to the cluster so it could be served via NFS to a client on
our network. I used pvcreate and vgcreate as I always do and created a new
volume group. When I went to create the logical volume I saw this message:

Error locking on node nfs1-cluster.nws.noaa.gov: Volume group for uuid not
found: 9crOQoM3V0fcuZ1E2163k9vdRLK7njfvnIIMTLPGreuvGmdB1aqx6KR4t7mmDRDs

I figured I had done something wrong and tried to remove the lvol and
couldn't. lvdisplay showed that the logvol had been created, and vgdisplay
looked good with the exception of the volume not being activated. So I ran
vgchange -aly, which didn't return any error but also did not activate the
volume. I then rebooted the node, which made everything OK. I could now see
the VG and lvol, both were active, and I could now create the GFS filesystem
on the lvol. The filesystem mounted and I thought I was in the clear.

However, node #2 wasn't picking this new filesystem up at all. I stopped the
cluster services on this node, which all stopped cleanly, and then tried to
restart them. cman started fine but clvmd didn't: it hung on the vgscan. Even
after a reboot of node #2, clvmd would not start and would hang on the vgscan.
It wasn't until I shut down both nodes completely and restarted the cluster
that both nodes could see the new filesystem.

I'm sure it's my own ignorance that's making this more difficult than it needs
to be. Am I missing a step? Is more information required to help? Any
assistance in figuring out what happened here would be greatly appreciated. I
know I'm going to need to do similar tasks in the future and obviously can't
afford to bring everything down in order for the cluster to see a new
filesystem.

Thank you,

Randy

P.S. Here is my cluster.conf:
[root at nfs2-cluster ~]# cat /etc/cluster/cluster.conf
[cluster.conf XML not preserved in the archive]
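[Editor's sketch: a concrete version of the sequence Brian describes at the top
of his message, assuming a new LUN behind /sys/class/scsi_host/host0, a volume
group vg_data, a 100G logical volume lv_new and a GFS mount at /fs/shared; all
of these names and sizes are illustrative.]

    # on every node: take LVM out of cluster-locking mode and pick up the LUN
    lvmconf --disable-cluster
    vgchange -c n vg_data
    echo "- - -" > /sys/class/scsi_host/host0/scan   # or your HBA's rescan tool

    # on one node only: create the new logical volume
    lvcreate -L 100G -n lv_new vg_data

    # on every node: refresh the metadata; then activate on one node
    lvchange --refresh vg_data/lv_new
    lvchange -a y vg_data/lv_new

    # grow an existing GFS (or gfs_mkfs a brand-new one), then put the
    # cluster flags back the way they were
    gfs_grow /fs/shared
    lvmconf --enable-cluster
    vgchange -c y vg_data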
From balajisundar at midascomm.com Fri Nov 30 14:59:18 2007
From: balajisundar at midascomm.com (Balaji)
Date: Fri, 30 Nov 2007 20:29:18 +0530
Subject: [Linux-cluster] RHEL4 Update 4 Cluster Suite Download for Testing
Message-ID: <47502546.3070205@midascomm.com>

Dear All,

I have downloaded the Red Hat Enterprise Linux 4 Update 4 AS 30-day evaluation
copy, installed it, and am testing it, and I need the Cluster Suite for it.
The Cluster Suite is not available on the Red Hat site.

Can anyone please send me the link for the Cluster Suite supported on Red Hat
Enterprise Linux 4 Update 4 AS?

Regards
-S.Balaji

From lhh at redhat.com Fri Nov 30 10:18:26 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 30 Nov 2007 05:18:26 -0500
Subject: [Linux-cluster] Live migration of VMs instead of relocation
In-Reply-To: <1196418189.16961.9.camel@localhost.localdomain>
References: <1196418189.16961.9.camel@localhost.localdomain>
Message-ID: <1196417906.2454.18.camel@localhost.localdomain>

On Fri, 2007-11-30 at 11:23 +0100, jr wrote:
> Hello everybody,
> I was wondering if I could somehow get rgmanager to use live migration of
> VMs when the preferred member of a failover domain for a certain VM service
> comes up again after a failure. The way it is right now is that if
> rgmanager detects a failure of a node, the virtual machine gets taken over
> by a different node with a lower priority. As soon as the primary node
> comes back into the cluster, rgmanager relocates the VM to that node, which
> means shutting it down and starting it on that node again. As I managed to
> get live migration working in the cluster, I'd like to have rgmanager make
> use of that.
> Is there a known configuration for this?
> Best regards,

5.1 (+updates) does (or should do?) "migrate-or-nothing" when relocating VMs
back to the preferred node. That is, if it can't do a migrate, it leaves the
VM where it is.

The caveat is of course that the VM is at the top level, with no parent node /
no children in the resource tree (i.e. it shouldn't be a child of a
<service>), like so:

[example snippet not preserved in the archive]

Parent/child dependencies aren't allowed because of the stop/start nature of
other resources: to stop a node, its children must be stopped, but to start a
node, its parents must be started.

Note that currently, as of 5.1, it's pause-migration, not live-migration - to
change this, you need to edit vm.sh and change the "xm migrate ..." command
line to "xm migrate -l ...".

The upside of pause-migration is that it's a simpler and faster overall
operation to transfer the VM from one machine to another. The downside is of
course that your downtime is several seconds during the migrate rather than
the typical <1 sec for live migration.

We plan to switch to live migration as the default instead of pause-migration
(with the ability to select pause migration if desired) in the next update.
Actually, the change is in CVS if you don't want to hax around with the
resource agent:

http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/cluster/rgmanager/src/resources/vm.sh?rev=1.1.2.9&content-type=text/plain&cvsroot=cluster&only_with_tag=RHEL5

... it hasn't had a lot of testing though. :)

-- Lon
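[Editor's sketch: a small illustration of what Lon describes. The guest name,
target host and cluster.conf fragment are hypothetical, and the exact contents
of vm.sh differ between releases.]

    # In cluster.conf the VM resource sits directly under <rm>, not inside a
    # <service> element, along the lines of:
    #   <rm>
    #     <vm name="guest1" path="/etc/xen" autostart="1"/>
    #   </rm>

    # What vm.sh runs for the default pause (stop-and-copy) migration:
    xm migrate guest1 node2.example.com

    # The change Lon mentions for live migration:
    xm migrate -l guest1 node2.example.com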
From lhh at redhat.com Fri Nov 30 10:19:31 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 30 Nov 2007 05:19:31 -0500
Subject: [Linux-cluster] on bundling http and https
In-Reply-To: <1196367991.5923.19.camel@ubuntu>
References: <1196367991.5923.19.camel@ubuntu>
Message-ID: <1196417971.2454.20.camel@localhost.localdomain>

On Thu, 2007-11-29 at 15:26 -0500, Yanik Doucet wrote:
> Hello
>
> I'm trying Piranha to see if we could throw out our current closed-source
> solution.
>
> My test setup consists of a client, 2 LVS directors and 2 webservers.
>
> I first made a virtual HTTP server and it's working great. Nothing too
> fancy, but I can pull the switch on a director or a webserver with little
> impact on availability.
>
> Now I'm trying to bundle HTTP and HTTPS to make sure the client connects to
> the same server for both protocols. This is where it fails. I have the
> exact same problem as this guy:
>
> http://osdir.com/ml/linux.redhat.piranha/2006-03/msg00014.html
>
> I set up the firewall marks with Piranha, then did the same thing with
> iptables, but when I restart pulse, ipvsadm fails to start the virtual
> service HTTPS as explained in the above link.

If that email is right, it looks like a bug in piranha.

-- Lon

From johannes.russek at io-consulting.net Fri Nov 30 15:23:26 2007
From: johannes.russek at io-consulting.net (jr)
Date: Fri, 30 Nov 2007 16:23:26 +0100
Subject: [Linux-cluster] Live migration of VMs instead of relocation
In-Reply-To: <1196417906.2454.18.camel@localhost.localdomain>
References: <1196418189.16961.9.camel@localhost.localdomain> <1196417906.2454.18.camel@localhost.localdomain>
Message-ID: <1196436206.2437.4.camel@localhost.localdomain>

Hi Lon,

Thank you for your detailed answer. That's very good news; I'm going to update
to 5.1 as soon as that is possible here. I already did the "hax", i.e. added
-l in the resource agent. :)

Thanks!

Regards,
johannes

> We plan to switch to live migration as the default instead of
> pause-migration (with the ability to select pause migration if desired) in
> the next update. Actually, the change is in CVS if you don't want to hax
> around with the resource agent:
>
> http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/cluster/rgmanager/src/resources/vm.sh?rev=1.1.2.9&content-type=text/plain&cvsroot=cluster&only_with_tag=RHEL5
>
> ... it hasn't had a lot of testing though. :)
>
> -- Lon

From johannes.russek at io-consulting.net Fri Nov 30 17:05:22 2007
From: johannes.russek at io-consulting.net (jr)
Date: Fri, 30 Nov 2007 18:05:22 +0100
Subject: [Linux-cluster] Adding new file system caused problems
In-Reply-To: <97F238EA86B5704DBAD740518CF829100394AE0C@hwpms600.tbo.citistreet.org>
References: <474C5260.6030908@noaa.gov> <97F238EA86B5704DBAD740518CF829100394AE0C@hwpms600.tbo.citistreet.org>
Message-ID: <1196442322.2437.8.camel@localhost.localdomain>

Is this a bug? I'm getting the exact same thing, only during setup of a new
clustered volume group - no resize or anything. What are the odds of having
the LVM under the GFS not clustered? I can't restart the whole cluster when I
add a new clustered filesystem.

Regards,
johannes

On Friday, 2007-11-30 at 09:34 -0500, Fair, Brian wrote:
> I think this is something we see.
> The workaround has basically been to disable clustering (LVM-wise) when
> doing this kind of change, and to handle it manually, i.e.:
>
> vgchange -c n              to disable the cluster flag
> lvmconf --disable-cluster  on all nodes
> rescan/discover the LUN, whatever, on all nodes
> lvcreate                   on one node
> lvchange --refresh         on every node
> lvchange -a y              on one node
> gfs_grow                   on one host (you can run this on the other to
>                            confirm; it should say it can't grow any more)
>
> When done, I've been putting things back how they were with vgchange -c y
> and lvmconf --enable-cluster, though I think if you just left it
> unclustered it'd be fine... What you won't want to do is leave the VG
> clustered but not --enable-cluster... if you do this, the clustered volume
> groups won't be activated when you reboot.
>
> Hope this helps... if anyone knows of a definitive fix for this I'd like to
> hear about it. We haven't pushed for it since it isn't too big of a hassle
> and we aren't constantly adding new volumes, but it is a pain.
>
> Brian Fair, UNIX Administrator, CitiStreet
> 904.791.2662
>
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Randy Brown
> Sent: Tuesday, November 27, 2007 12:23 PM
> To: linux clustering
> Subject: [Linux-cluster] Adding new file system caused problems
>
> I am running a two-node cluster using CentOS 5 that is basically being used
> as a NAS head for our iSCSI-based storage. Here are the related RPMs and
> their versions I am using:
>
> kmod-gfs-0.1.16-5.2.6.18_8.1.14.el5
> kmod-gfs-0.1.16-6.2.6.18_8.1.15.el5
> system-config-lvm-1.0.22-1.0.el5
> cman-2.0.64-1.0.1.el5
> rgmanager-2.0.24-1.el5.centos
> gfs-utils-0.1.11-3.el5
> lvm2-2.02.16-3.el5
> lvm2-cluster-2.02.16-3.el5
>
> This morning I created a 100GB volume on our storage unit and proceeded to
> make it available to the cluster so it could be served via NFS to a client
> on our network. I used pvcreate and vgcreate as I always do and created a
> new volume group. When I went to create the logical volume I saw this
> message:
>
> Error locking on node nfs1-cluster.nws.noaa.gov: Volume group for uuid not
> found: 9crOQoM3V0fcuZ1E2163k9vdRLK7njfvnIIMTLPGreuvGmdB1aqx6KR4t7mmDRDs
>
> I figured I had done something wrong and tried to remove the lvol and
> couldn't. lvdisplay showed that the logvol had been created, and vgdisplay
> looked good with the exception of the volume not being activated. So I ran
> vgchange -aly, which didn't return any error but also did not activate the
> volume. I then rebooted the node, which made everything OK. I could now see
> the VG and lvol, both were active, and I could now create the GFS
> filesystem on the lvol. The filesystem mounted and I thought I was in the
> clear.
>
> However, node #2 wasn't picking this new filesystem up at all. I stopped
> the cluster services on this node, which all stopped cleanly, and then
> tried to restart them. cman started fine but clvmd didn't: it hung on the
> vgscan. Even after a reboot of node #2, clvmd would not start and would
> hang on the vgscan. It wasn't until I shut down both nodes completely and
> restarted the cluster that both nodes could see the new filesystem.
>
> I'm sure it's my own ignorance that's making this more difficult than it
> needs to be. Am I missing a step? Is more information required to help?
> Any assistance in figuring out what happened here would be greatly
> appreciated.
> I know I'm going to need to do similar tasks in the future and obviously
> can't afford to bring everything down in order for the cluster to see a new
> filesystem.
>
> Thank you,
>
> Randy
>
> P.S. Here is my cluster.conf:
> [root at nfs2-cluster ~]# cat /etc/cluster/cluster.conf
> [quoted cluster.conf XML not preserved in the archive]

From pillai at mathstat.dal.ca Fri Nov 30 18:07:33 2007
From: pillai at mathstat.dal.ca (Balagopal Pillai)
Date: Fri, 30 Nov 2007 14:07:33 -0400
Subject: [Linux-cluster] RHEL4 Update 4 Cluster Suite Download for Testing
In-Reply-To: <47502546.3070205@midascomm.com>
References: <47502546.3070205@midascomm.com>
Message-ID: <47505165.4030905@mathstat.dal.ca>

It can be downloaded from CentOS (http://www.centos.org/):

http://centos.arcticnetwork.ca/4.5/csgfs/

This is for 4.5. The 4.4 one is at http://vault.centos.org/4.4/csgfs/

Balaji wrote:
> Dear All,
>
> I have downloaded the Red Hat Enterprise Linux 4 Update 4 AS 30-day
> evaluation copy, installed it, and am testing it, and I need the Cluster
> Suite for it. The Cluster Suite is not available on the Red Hat site.
>
> Can anyone please send me the link for the Cluster Suite supported on Red
> Hat Enterprise Linux 4 Update 4 AS?
>
> Regards
> -S.Balaji

From scottb at bxwa.com Fri Nov 30 22:57:44 2007
From: scottb at bxwa.com (Scott Becker)
Date: Fri, 30 Nov 2007 14:57:44 -0800
Subject: [Linux-cluster] File system checking
Message-ID: <47509568.905@bxwa.com>

Does anybody know the best way to check that a filesystem is healthy?

I'm working on a light self-check script (to be run once a minute), and
creating a file and checking its existence may not work because of write
caching. Checking the mount status is probably better, but I don't know.

I've had full filesystems, and once the kernel detected an error and
remounted read-only. Other times, when a drive in the RAID array was slowly
failing, it would hang on all I/O for a spell.

If there's an existing source module or a script somebody is aware of, that
would be great.

thanks
scottb
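[Editor's sketch: one possible shape for the kind of once-a-minute self-check
Scott describes - look for a read-only remount in /proc/mounts and push a small
fsync'd write through with a bounded wait. The mount point, timeout and probe
file name are illustrative only.]

    #!/bin/bash
    # Crude health probe for a mounted filesystem (illustrative).
    MNT=/fs/shared          # filesystem to check
    TIMEOUT=15              # seconds to wait for the write probe

    # 1) Is it mounted at all, and has the kernel remounted it read-only
    #    after detecting an error?
    opts=$(awk -v m="$MNT" '$2 == m { print $4 }' /proc/mounts)
    [ -n "$opts" ] || { echo "CRITICAL: $MNT is not mounted"; exit 2; }
    case ",$opts," in
        *,ro,*) echo "CRITICAL: $MNT is mounted read-only"; exit 2 ;;
    esac

    # 2) Can a small write actually reach the disk (not just the page cache)
    #    within a bounded time? conv=fsync forces the data out; the probe is
    #    abandoned if the I/O hangs the way Scott describes.
    probe="$MNT/.healthcheck.$$"
    dd if=/dev/zero of="$probe" bs=4k count=1 conv=fsync >/dev/null 2>&1 &
    pid=$!
    for i in $(seq "$TIMEOUT"); do
        kill -0 "$pid" 2>/dev/null || break
        sleep 1
    done
    if kill -0 "$pid" 2>/dev/null; then
        kill -9 "$pid"
        echo "CRITICAL: write to $MNT hung for ${TIMEOUT}s"
        exit 2
    fi
    wait "$pid" || { echo "CRITICAL: write to $MNT failed"; exit 2; }
    rm -f "$probe"

    echo "OK: $MNT is mounted read-write and accepting writes"
    exit 0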