From griswold at cs.wisc.edu Mon Jun 1 16:03:40 2009 From: griswold at cs.wisc.edu (Nathaniel Griswold) Date: Mon, 1 Jun 2009 11:03:40 -0500 Subject: [Linux-cluster] gfs2: st_size is 0 for symbolic links Message-ID: <1c1b4d3a0906010903m30d2233bv9fab99ecef2ac83@mail.gmail.com> Hi, I had an application fail on gfs2 today because of incorrect stat st_size on a symlink. The application was trying to utilize the fact that st_size on a symlink should be the character length of the destination path. [root at somehost somepath]# touch somefile [root at somehost somepath]# ln -s somefile somelink [root at somehost somepath]# stat somelink |grep Size Size: 0 Blocks: 8 IO Block: 4096 symbolic link [root at somehost somepath]# gfs2_tool getargs /somepath noatime 0 data 2 suiddir 0 quota 0 posix_acl 1 num_glockd 1 upgrade 0 debug 0 localflocks 0 localcaching 0 ignore_local_fs 0 spectator 0 hostdata jid=0:id=196612:first=1 locktable lockproto [root at somehost somepath]# uname -r 2.6.18-128.1.10.el5 [root at somehost somepath]# cat /etc/redhat-release Red Hat Enterprise Linux Server release 5.3 (Tikanga) If i go to some other host or remount the filesystem, then st_size is correct: [root at someotherhost somepath]# stat somelink |grep Size Size: 8 Blocks: 8 IO Block: 4096 symbolic link Searched archives and didn't see anything. Is this a bug? -nate From swhiteho at redhat.com Mon Jun 1 16:07:32 2009 From: swhiteho at redhat.com (Steven Whitehouse) Date: Mon, 01 Jun 2009 17:07:32 +0100 Subject: [Linux-cluster] gfs2: st_size is 0 for symbolic links In-Reply-To: <1c1b4d3a0906010903m30d2233bv9fab99ecef2ac83@mail.gmail.com> References: <1c1b4d3a0906010903m30d2233bv9fab99ecef2ac83@mail.gmail.com> Message-ID: <1243872452.29604.546.camel@localhost.localdomain> Hi, That was bz #492911 and its now fixed both upstream and in RHEL, Steve. On Mon, 2009-06-01 at 11:03 -0500, Nathaniel Griswold wrote: > Hi, > > I had an application fail on gfs2 today because of incorrect stat > st_size on a symlink. The application was trying to utilize the fact > that st_size on a symlink should be the character length of the > destination path. > > [root at somehost somepath]# touch somefile > [root at somehost somepath]# ln -s somefile somelink > [root at somehost somepath]# stat somelink |grep Size > Size: 0 Blocks: 8 IO Block: 4096 symbolic link > [root at somehost somepath]# gfs2_tool getargs /somepath > noatime 0 > data 2 > suiddir 0 > quota 0 > posix_acl 1 > num_glockd 1 > upgrade 0 > debug 0 > localflocks 0 > localcaching 0 > ignore_local_fs 0 > spectator 0 > hostdata jid=0:id=196612:first=1 > locktable > lockproto > > [root at somehost somepath]# uname -r > 2.6.18-128.1.10.el5 > > [root at somehost somepath]# cat /etc/redhat-release > Red Hat Enterprise Linux Server release 5.3 (Tikanga) > > > If i go to some other host or remount the filesystem, then st_size is correct: > > [root at someotherhost somepath]# stat somelink |grep Size > Size: 8 Blocks: 8 IO Block: 4096 symbolic link > > Searched archives and didn't see anything. Is this a bug? 
>
> -nate
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

From rmicmirregs at gmail.com Mon Jun 1 19:17:30 2009
From: rmicmirregs at gmail.com (Rafael Micó Miranda)
Date: Mon, 01 Jun 2009 21:17:30 +0200
Subject: [Linux-Cluster] Submitting two new resource plugins to the project
Message-ID: <1243883850.6761.2.camel@mecatol>

Hi,

I have developed a couple of resources for Linux-Cluster (CMAN + rgmanager) which try to address some needs I see in Linux-Cluster when compared with another cluster solution (namely Linux-HA, a.k.a. Heartbeat). I am a Linux-HA user and I think these two functionalities could be useful in Linux-Cluster.

I would like to contribute both resources to the community so they can become part of the project and perhaps, after testing and quality assurance, be included in the Red Hat Enterprise Linux packages of Linux-Cluster, so that Red Hat can support them and add them to the system-config-cluster tool, giving a GUI that can configure these resources and handle their information.

I'll give you some details of both resources:

1.- ping-group: tries to bring to Linux-Cluster the Ping Group functionality of Linux-HA. For those who don't know Ping Group, the idea is the following: it is a NODE functionality (not a service or a resource) that checks IP communication with a list of given client nodes. When the check fails, Ping Group moves all services running on the affected node to other nodes that have proved their communication is still working, so the service keeps being provided to the clients even when a network problem affects only one node of your cluster and the cluster itself would otherwise not notice it.

I have developed ping-group as a resource to be used inside a service of your cluster, so in the resource arguments you can specify the list of clients that the service should watch.

There is one thing that could be improved: ping-group will mark the service as failed even if the other nodes of the cluster would fail too due to lack of communication with the clients (for example, all clients are powered off). In this situation the service will keep migrating from one node to another according to your service failover policy and will finally be stopped. Ideas to improve this behaviour would be welcome.

2.- lvm-cluster: tries to bring to Linux-Cluster an exclusive shared storage option, using features of LVM2. I got accustomed to this kind of volume when working with the Linux-HA + EVMS solution (using the Cluster Segment Manager plug-in).

When defining a new LVM2 volume for your cluster, you can set it as cluster-disabled (the volume will behave as a local volume even if it is on shared storage) or as cluster-enabled (the LVM volume can be activated on many different cluster nodes at the same time).

Of course, if the filesystem placed on the LVM volume is not a clustered filesystem (GFS2), a cluster-enabled volume allows a careless administrator to mount a non-clustered filesystem (EXT3) on more than one node of the cluster, which may produce filesystem corruption. This is because the LVM "open flag" of the filesystem is not propagated to all the members of the cluster, so there is no knowledge of the state of the filesystem and these situations can happen.

This can be fixed with some of the options of LVM, specifically the "enable exclusively" flag.
This flag, when used over a cluster-enabled volume, allows the VolumeGroup to be imported by all the nodes of the cluster, but the LogicalVolumes inside the VolumeGroup can only be activated by a single node. So, only one node of your cluster will have the LogicalVolume device (for example /dev/VolGrp01/LogVol01) and the problem explained above cannot happen. This is not about propagating the "open flag" through the nodes, this is about making the LogicalVolume available on only one node.

I have developed lvm-cluster as a resource to be used inside a service of your cluster. In the arguments you can specify the name of the VolumeGroup and the LogicalVolume to handle.

So, I would like to receive instructions on how to submit these two resources to the project, so that we can improve them, test them and find any bugs that may still be in the code. I have done some testing, but of course they need much more before they can be put into the main project.

Sincerely yours,

Rafael Micó Miranda

--
Rafael Micó Miranda

From fdinitto at redhat.com Tue Jun 2 05:04:23 2009
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Tue, 02 Jun 2009 07:04:23 +0200
Subject: [Linux-Cluster] Submitting two new resource plugins to the project
In-Reply-To: <1243883850.6761.2.camel@mecatol>
References: <1243883850.6761.2.camel@mecatol>
Message-ID: <1243919063.24866.14.camel@cerberus.int.fabbione.net>

Hi Rafael,

On Mon, 2009-06-01 at 21:17 +0200, Rafael Micó Miranda wrote:
> Hi,
>
> I have developed a couple of resources for Linux-Cluster (CMAN
> +rgmanager) which try to fix some needs I see in Linux-Cluster when
> compared with other cluster solution (concretely, Linux-HA a.k.a.
> Heartbeat). I am a Linux-HA user and I think this two functionalities
> could be useful in Linux-Cluster.
>
> I would like to give them (both resources) to the community to make them
> be into the project, and maybe after testing/quality testing or so be
> included into the RedHat Enterprise Linux packages of Linux-Cluster, so
> RedHat will give support for them and include them into the
> system-config-cluster tool to have a GUI that can configure this
> resources and handle their information.

[SNIP]

> So, I would like to receive the instructions to submit this two
> resources to the project to improve them, test them and find any bugs
> that could still be in the code. I have made some testing but of course
> they need much more to allow them be put into the main project.

The best way to submit is to post the code to the cluster-devel at redhat.com mailing list. We don't have a very formal procedure in place. What we need to know is what it is, what version of the software it has been tested on, and what distribution. The right guys will take care of doing the correct steps (asking for more details, review, commit, etc.). Of course a patch against a git tree is best, but it's not a requirement at all (i.e. don't spend time learning git if you don't need/want to).

Cheers
Fabio

From xavier.montagutelli at unilim.fr Tue Jun 2 09:08:14 2009
From: xavier.montagutelli at unilim.fr (Xavier Montagutelli)
Date: Tue, 2 Jun 2009 11:08:14 +0200
Subject: [Linux-Cluster] Submitting two new resource plugins to the project
In-Reply-To: <1243883850.6761.2.camel@mecatol>
References: <1243883850.6761.2.camel@mecatol>
Message-ID: <200906021108.14625.xavier.montagutelli@unilim.fr>

On Monday 01 June 2009 21:17:30 Rafael Micó
Miranda wrote: > Hi, > > I have developed a couple of resources for Linux-Cluster (CMAN > +rgmanager) which try to fix some needs I see in Linux-Cluster when > compared with other cluster solution (concretely, Linux-HA a.k.a. > Heartbeat). I am a Linux-HA user and I think this two functionalities > could be useful in Linux-Cluster. > > I would like to give them (both resources) to the community to make them > be into the project, and maybe after testing/quality testing or so be > included into the RedHat Enterprise Linux packages of Linux-Cluster, so > RedHat will give support for them and include them into the > system-config-cluster tool to have a GUI that can configure this > resources and handle their information. > > I'll give you some details of both resources: [...] > > 2.- lvm-cluster: tries to bring to Linux-Cluster an exclusive shared > storage option, using features of LVM2. I got accustomed to this kind of > volumes when working with Linux-Ha + EVMS solution (using Cluster > Segment Manager plug-in). > > When defining a new LVM2 volume four your cluster, you can set it as > cluster-disabled (the volume will behave as a local volume even if it is > on shared storage) or as cluster-enabled (the LVM volume can be > activated on many different cluster nodes at the same time). > > Of course, if the filesystem placed into the LVM volume is not a > clustered filesystem (GFS2) a cluster-enabled volume allows a bad > administrator mount a no-clustered filesystem (EXT3) in more than one > node of the cluster which may produce filesystem corruption. This is > because the LVM "open flag" of the filesystem is not propagated through > all the members of the cluster, so there is no knowledge of the state of > the filesystem and this situations can happen. > > This can be fixed with some of the options of LVM, specifically the > "enable exclusively flag". This flag, when used over a cluster-enabled > volume, will allow the VolumeGroup to be imported by all the nodes of > the cluster but the LogicalVolumes into the VolumeGroup can only be > activated by a single node. So, only one node of your cluster will have > the LogicalVolume device (for example /dev/VolGrp01/LogVol01) and the > problem explained above cannot happen. This is not about propagating the > "open flag" through the nodes, this is about making the LogicalVolume be > in only one node. > > I have developed lvm-cluster as a resource to be used into a service of > your cluster. In the arguments you an specify the name of the > VolumeGroup and the LogicalVolume to handle. [...] This looks very useful. We are using a shared storage with CLVM, and a non- clustered FS. I always fear mounting the same FS on different nodes. I have always hoped this feature could exist the LVM layer. It would be great to see this incorporated in CLVM. -- Xavier Montagutelli Tel : +33 (0)5 55 45 77 20 Service Commun Informatique Fax : +33 (0)5 55 45 75 95 Universite de Limoges 123, avenue Albert Thomas 87060 Limoges cedex From brettcave at gmail.com Tue Jun 2 11:53:52 2009 From: brettcave at gmail.com (Brett Cave) Date: Tue, 2 Jun 2009 13:53:52 +0200 Subject: [Linux-cluster] IPVS on 2 servers running the HA services? Message-ID: hi, I am running ipvs on a single node with ipvs configured to load balance to 2 backend servers (mysql). I remember having issues load balancing to the server that HA is running on, due to the IP address being local. ipvs was configured using ldirector, with the real servers using the "gate" redirect method. 
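For reference, that kind of single-director setup maps to an ldirectord.cf roughly like the following sketch (the addresses, port and check credentials are made-up illustration values, not taken from the real configuration):

checktimeout=10
checkinterval=5
quiescent=no

# one virtual service, two MySQL real servers reached via direct routing ("gate")
virtual=192.168.0.10:3306
        real=192.168.0.11:3306 gate
        real=192.168.0.12:3306 gate
        service=mysql
        checktype=negotiate
        login="monitor"
        passwd="secret"
        database="test"
        request="SELECT 1"
        scheduler=wlc
        protocol=tcp

With "gate", replies go straight from the real servers back to the client, which is also why a director that is itself one of the real servers needs special handling (the LVS "local node" case).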
Is it possible to run heartbeat + ipvs + apache on 2 nodes though? Perhaps using masq method? e.g. server1: ip 192.168.0.1 + heartbeat + primary HA ip 192.168.0.10 + apache server2: ip 192.168.0.2 + heartbeat + secondary for HA IP + apache ipvs / ldirector to then direct incoming http requests on 192.168.0.10 to .1 and .2 using masq - would that load balance requests between the servers, or would all requests come in to primary and be served by primary? Regards, Brett From tsengjs at gmail.com Tue Jun 2 13:30:57 2009 From: tsengjs at gmail.com (Jin-Shan Tseng) Date: Tue, 2 Jun 2009 21:30:57 +0800 Subject: [Linux-cluster] compile gnbd-kernel error Message-ID: <2495e3790906020630g34d1b32ckeff5db4b8167169@mail.gmail.com> Hi folks, I tried to compile gnbd-kernel on Gentoo Linux 2.6.29-gentoo-r5 but I got some error messages. :( the error messages are appear on cluster-2.03.09, cluster-2.03.10, cluster-2.03.11 # uname -a Linux node26 2.6.29-gentoo-r5 #3 SMP Mon Jun 1 19:05:23 CST 2009 i686 Intel(R) Xeon(TM) CPU 3.06GHz GenuineIntel GNU/Linux cluster-2.03.11 # make gnbd-kernel [ -n "" ] || make -C gnbd-kernel/src all make[1]: Entering directory `/usr/portage/distfiles/cluster-2.03.11/gnbd-kernel/src' make -C /lib/modules/2.6.29-gentoo-r5/build M=/usr/portage/distfiles/cluster-2.03.11/gnbd-kernel/src symverfile=/lib/modules/2.6.29-gentoo-r5/build/Module.symvers modules USING_KBUILD=yes make[2]: Entering directory `/usr/src/linux-2.6.29-gentoo-r5' CC [M] /usr/portage/distfiles/cluster-2.03.11/gnbd-kernel/src/gnbd.o /usr/portage/distfiles/cluster-2.03.11/gnbd-kernel/src/gnbd.c:933: warning: initialization from incompatible pointer type /usr/portage/distfiles/cluster-2.03.11/gnbd-kernel/src/gnbd.c:934: warning: initialization from incompatible pointer type /usr/portage/distfiles/cluster-2.03.11/gnbd-kernel/src/gnbd.c: In function 'gnbd_init': /usr/portage/distfiles/cluster-2.03.11/gnbd-kernel/src/gnbd.c:1054: error: 'struct gendisk' has no member named 'dev' make[3]: *** [/usr/portage/distfiles/cluster-2.03.11/gnbd-kernel/src/gnbd.o] Error 1 make[2]: *** [_module_/usr/portage/distfiles/cluster-2.03.11/gnbd-kernel/src] Error 2 make[2]: Leaving directory `/usr/src/linux-2.6.29-gentoo-r5' make[1]: *** [gnbd.ko] Error 2 make[1]: Leaving directory `/usr/portage/distfiles/cluster-2.03.11/gnbd-kernel/src' make: *** [gnbd-kernel/src] Error 2 Does anyone have the same problems? Any suggestions are appreciate. Thanks in advanced, Jin-Shan -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbrassow at redhat.com Tue Jun 2 17:28:13 2009 From: jbrassow at redhat.com (Jonathan Brassow) Date: Tue, 2 Jun 2009 12:28:13 -0500 Subject: [Linux-Cluster] Submitting two new resource plugins to the project In-Reply-To: <1243883850.6761.2.camel@mecatol> References: <1243883850.6761.2.camel@mecatol> Message-ID: On Jun 1, 2009, at 2:17 PM, Rafael Mic? Miranda wrote: > This can be fixed with some of the options of LVM, specifically the > "enable exclusively flag". This flag, when used over a cluster-enabled > volume, will allow the VolumeGroup to be imported by all the nodes of > the cluster but the LogicalVolumes into the VolumeGroup can only be > activated by a single node. So, only one node of your cluster will > have > the LogicalVolume device (for example /dev/VolGrp01/LogVol01) and the > problem explained above cannot happen. This is not about propagating > the > "open flag" through the nodes, this is about making the > LogicalVolume be > in only one node. 
This is different from the current approach. We would likely take this if it is cleaner, better, or more advantageous than the current solution. Current solution is described here: http://kbase.redhat.com/faq/docs/DOC-3068 brassow From dougbunger at yahoo.com Tue Jun 2 18:12:31 2009 From: dougbunger at yahoo.com (Doug Bunger) Date: Tue, 2 Jun 2009 11:12:31 -0700 (PDT) Subject: [Linux-cluster] F8/F10 fence_xvm Key Errors Message-ID: <740081.74040.qm@web110216.mail.gq1.yahoo.com> I have a VM running Fedora 8 that I want to connect to a cluster that is all Fedora 10 VMs, running on F10 platforms.? The F8 fails a fence test, reporting: # fence_xvm -H cicero3 -ddd -o null Debugging threshold is now 3 -- args @ 0x7fffb9fd4870 -- ? args->addr = 225.0.0.12 ? args->domain = cicero3 ? args->key_file = /etc/cluster/fence_xvm.key ? args->op = 0 ? args->hash = 2 ? args->auth = 2 ? args->port = 1229 ? args->family = 2 ? args->timeout = 30 ? args->retr_time = 20 ? args->flags = 0 ? args->debug = 3 -- end args -- Reading in key file /etc/cluster/fence_xvm.key into 0x7fffb9fd3820 (4096 len)Actual key length = 4096 bytesSending to 225.0.0.12 via 127.0.0.1 Sending to 225.0.0.12 via 192.168.69.63 Waiting for connection from XVM host daemon. The physical host is reporting: ? [fence_xvmd.c:0691] Key mismatch; dropping packet It seems odd that it doesn't work since the key was gen'd from /dev/random.? Nothing OS or machine specific about the key.? Something different with the transport?? Any suggestions, before I blindly upgrade from F8 to F10? -- Doug Bunger -- dougbunger at yahoo.com -- -------------- next part -------------- An HTML attachment was scrubbed... URL: From jschulz at soapstonenetworks.com Tue Jun 2 23:36:23 2009 From: jschulz at soapstonenetworks.com (Jon Schulz) Date: Tue, 2 Jun 2009 19:36:23 -0400 Subject: [Linux-cluster] Networking guidelines for RHCS across datacenters Message-ID: I'm in the process of doing a concept review with the redhat cluster suite. I've been given a requirement that cluster nodes are able to be located in geographically separated data centers. I realize that this is not an ideal scenario due to latency issues. Does anyone have any papers or articles you could point me to that outline cluster network requirements and best practices? -------------- next part -------------- An HTML attachment was scrubbed... URL: From fajar at fajar.net Wed Jun 3 02:06:08 2009 From: fajar at fajar.net (Fajar A. Nugraha) Date: Wed, 3 Jun 2009 09:06:08 +0700 Subject: [Linux-cluster] Networking guidelines for RHCS across datacenters In-Reply-To: References: Message-ID: <7207d96f0906021906mb731687j6964f4a3466b626e@mail.gmail.com> On Wed, Jun 3, 2009 at 6:36 AM, Jon Schulz wrote: > I'm in the process of doing a concept review with the redhat cluster suite. > I've been given a requirement that cluster nodes are able to be located in > geographically separated data centers. I realize that this is not an ideal > scenario due to latency issues. For most purposes, RHCS would require that all nodes have access to the same storage/disk. That pretty much ruled out the DR feature that one might expect to get from having nodes in geographically separated data centers. I'd suggest you refine your requirements. Perhaps what you need is something like MySQL cluster replication, where there are two geographically separated data centers, each having its own cluster, and the two clusters replicate each other's data asynchronously. 
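As a very rough illustration of that kind of asynchronous, two-site replication (all hostnames, credentials and log coordinates below are placeholders, not a tested configuration):

# my.cnf fragment on the site-A master (placeholder values)
[mysqld]
server-id = 1
log-bin   = mysql-bin
auto_increment_increment = 2
auto_increment_offset    = 1

# my.cnf fragment on the site-B master
[mysqld]
server-id = 2
log-bin   = mysql-bin
auto_increment_increment = 2
auto_increment_offset    = 2

-- then on site B, point replication at site A (and the mirror image on site A):
CHANGE MASTER TO MASTER_HOST='db.site-a.example.com', MASTER_USER='repl',
    MASTER_PASSWORD='secret', MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=4;
START SLAVE;

Each site keeps its own local RHCS cluster for availability; only the binlog stream crosses the WAN.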
-- Fajar From m.nietz-redhat at iplabs.de Wed Jun 3 14:22:30 2009 From: m.nietz-redhat at iplabs.de (Marco Nietz) Date: Wed, 03 Jun 2009 16:22:30 +0200 Subject: [Linux-cluster] Problem with Fenced Message-ID: <4A268726.4060901@iplabs.de> Hi, i have a Problem with (propably) the Communication between fenced and ccsd. After a node-failure, fenced should connect ccsd and then try to fence the failing node. this does not happen on one of our systems. Here's an strace from the fence-daemon. socket(PF_FILE, SOCK_STREAM, 0) = 9 connect(9, {sa_family=AF_FILE, path=@"groupd_socket"}, 16) = 0 write(9, "get_group -1 groupd\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2200) = 2200 read(9, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1128) = 1128 close(9) = 0 write(7, "start_done default 3\0\0\0\0\0\0\0\0\0\0\0\0"..., 2200) = 2200 poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}], 4, -1) = 1 ([{fd=7, revents=POLLIN}]) read(7, "finish default 3\0\0\0\0\0\0\0\0\350\37Y\21\377\177\0\0"..., 2200) = 2200 poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}], 4, -1 At the Connect-Line i expect the Path to the ccsd-socket (/var/run/cluster/ccsd.sock). How can i tell fenced where to find the Socket. Best Regards Marco From teigland at redhat.com Wed Jun 3 15:40:44 2009 From: teigland at redhat.com (David Teigland) Date: Wed, 3 Jun 2009 10:40:44 -0500 Subject: [Linux-cluster] Problem with Fenced In-Reply-To: <4A268726.4060901@iplabs.de> References: <4A268726.4060901@iplabs.de> Message-ID: <20090603154044.GA14469@redhat.com> On Wed, Jun 03, 2009 at 04:22:30PM +0200, Marco Nietz wrote: > Hi, > > i have a Problem with (propably) the Communication between fenced and > ccsd. After a node-failure, fenced should connect ccsd and then try to > fence the failing node. this does not happen on one of our systems. > > Here's an strace from the fence-daemon. > > socket(PF_FILE, SOCK_STREAM, 0) = 9 > connect(9, {sa_family=AF_FILE, path=@"groupd_socket"}, 16) = 0 > write(9, "get_group -1 groupd\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2200) = 2200 > read(9, > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., > 1128) = 1128 > close(9) = 0 > write(7, "start_done default 3\0\0\0\0\0\0\0\0\0\0\0\0"..., 2200) = 2200 > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7, > events=POLLIN}, {fd=-1}], 4, -1) = 1 ([{fd=7, revents=POLLIN}]) > read(7, "finish default 3\0\0\0\0\0\0\0\0\350\37Y\21\377\177\0\0"..., > 2200) = 2200 > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7, > events=POLLIN}, {fd=-1}], 4, -1 > > At the Connect-Line i expect the Path to the ccsd-socket > (/var/run/cluster/ccsd.sock). > > How can i tell fenced where to find the Socket. It's not clear from this that fenced/ccsd communication is the problem. After the node failure, please collect from all nodes the output of - cman_tool nodes - group_tool -v - group_tool dump fence - any messages in /var/log/messages Dave From rmicmirregs at gmail.com Wed Jun 3 16:28:40 2009 From: rmicmirregs at gmail.com (Rafael =?ISO-8859-1?Q?Mic=F3?= Miranda) Date: Wed, 03 Jun 2009 18:28:40 +0200 Subject: [Linux-Cluster] Submitting two new resource plugins to the project In-Reply-To: <1243919063.24866.14.camel@cerberus.int.fabbione.net> References: <1243883850.6761.2.camel@mecatol> <1243919063.24866.14.camel@cerberus.int.fabbione.net> Message-ID: <1244046520.6750.6.camel@mecatol> Hi Fabio, El mar, 02-06-2009 a las 07:04 +0200, Fabio M. 
Di Nitto escribi?: > Hi Rafael, > > On Mon, 2009-06-01 at 21:17 +0200, Rafael Mic? Miranda wrote: [...] > > > The best way to submit is to post the code to cluster-devel at redhat.com > mailing list. We don't have a very formal procedure in place. > What we need to know is what it is, on what version of the software has > been tested and what distribution. > The right guys will take care of doing the correct steps (ask more, > review, commit etc). > Of course a patch against a git tree is the best but it's not a > requirement at all (aka don't spend time learning git if you don't > need/want to). > > Cheers > Fabio > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster I have sent the e-mail to that mail list and i have had no answer yet. Its the only occurrence i have found about the "devel list" on the CMAN Project web page, are you sure this address is right? Thanks in advance. -- Rafael Mic? Miranda From admin1-bua.dage-etd at justice.gouv.fr Thu Jun 4 06:54:07 2009 From: admin1-bua.dage-etd at justice.gouv.fr (Jean Diallo) Date: Thu, 04 Jun 2009 08:54:07 +0200 Subject: [Linux-cluster] Clvm Hang after an node is fenced in a 2 nodes cluster Message-ID: <4A276F8F.70101@justice.gouv.fr> Description of problem: In a 2 nodes cluster, after 1 node is fence, any clvm command hang on the ramaining node. when the fenced node cluster come back in the cluster, any clvm command also hang, moreover the node do not activate any clustered vg, and so do not access any shared device. Version-Release number of selected component (if applicable): redhat 5.2 update device-mapper-1.02.28-2.el5.x86_64.rpm lvm2-2.02.40-6.el5.x86_64.rpm lvm2-cluster-2.02.40-7.el5.x86_64.rpm Steps to Reproduce: 1.2 nodes cluster , quorum formed with qdisk 2.cold boot node 2 3.node 2 is evicted and fenced, service are taken over by node 1 4.node ? come back in cluster, quorate, but no clustered vg are up and any lvm related command hang 5.At this step every lvm command hang on node 1 Expected results: node 2 should be able to get back the lock on clustered lvm volume and node 1 should be able to issue any lvm relate command Here are my cluster.conf and lvm.conf part of lvm.conf: # Type 3 uses built-in clustered locking. locking_type = 3 # If using external locking (type 2) and initialisation fails, # with this set to 1 an attempt will be made to use the built-in # clustered locking. # If you are using a customised locking_library you should set this to 0. fallback_to_clustered_locking = 0 # If an attempt to initialise type 2 or type 3 locking failed, perhaps # because cluster components such as clvmd are not running, with this set # to 1 an attempt will be made to use local file-based locking (type 1). # If this succeeds, only commands against local volume groups will proceed. # Volume Groups marked as clustered will be ignored. fallback_to_local_locking = 1 # Local non-LV directory that holds file-based locks while commands are # in progress. A directory like /tmp that may get wiped on reboot is OK. locking_dir = "/var/lock/lvm" # Other entries can go here to allow you to load shared libraries # e.g. if support for LVM1 metadata was compiled as a shared library use # format_libraries = "liblvm2format1.so" # Full pathnames can be given. # Search this directory first for shared libraries. # library_dir = "/lib" # The external locking library to load if locking_type is set to 2. 
# locking_library = "liblvm2clusterlock.so" part of lvm log on second node : vgchange.c:165 Activated logical volumes in volume group "VolGroup00" vgchange.c:172 7 logical volume(s) in volume group "VolGroup00" now active cache/lvmcache.c:1220 Wiping internal VG cache commands/toolcontext.c:188 Logging initialised at Wed Jun 3 15:17:29 2009 commands/toolcontext.c:209 Set umask to 0077 locking/cluster_locking.c:83 connect() failed on local socket: Connexion refus?e locking/locking.c:259 WARNING: Falling back to local file-based locking. locking/locking.c:261 Volume Groups with the clustered attribute will be inaccessible. toollib.c:578 Finding all volume groups toollib.c:491 Finding volume group "VGhomealfrescoS64" metadata/metadata.c:2379 Skipping clustered volume group VGhomealfrescoS64 toollib.c:491 Finding volume group "VGhomealfS64" metadata/metadata.c:2379 Skipping clustered volume group VGhomealfS64 toollib.c:491 Finding volume group "VGvmalfrescoS64" metadata/metadata.c:2379 Skipping clustered volume group VGvmalfrescoS64 toollib.c:491 Finding volume group "VGvmalfrescoI64" metadata/metadata.c:2379 Skipping clustered volume group VGvmalfrescoI64 toollib.c:491 Finding volume group "VGvmalfrescoP64" metadata/metadata.c:2379 Skipping clustered volume group VGvmalfrescoP64 toollib.c:491 Finding volume group "VolGroup00" libdm-report.c:981 VolGroup00 cache/lvmcache.c:1220 Wiping internal VG cache commands/toolcontext.c:188 Logging initialised at Wed Jun 3 15:17:29 2009 commands/toolcontext.c:209 Set umask to 0077 locking/cluster_locking.c:83 connect() failed on local socket: Connexion refus?e locking/locking.c:259 WARNING: Falling back to local file-based locking. locking/locking.c:261 Volume Groups with the clustered attribute will be inaccessible. toollib.c:542 Using volume group(s) on command line toollib.c:491 Finding volume group "VolGroup00" vgchange.c:117 7 logical volume(s) in volume group "VolGroup00" monitored cache/lvmcache.c:1220 Wiping internal VG cache commands/toolcontext.c:188 Logging initialised at Wed Jun 3 15:20:45 2009 commands/toolcontext.c:209 Set umask to 0077 toollib.c:331 Finding all logical volumes commands/toolcontext.c:188 Logging initialised at Wed Jun 3 15:20:50 2009 commands/toolcontext.c:209 Set umask to 0077 toollib.c:578 Finding all volume groups group_tool on node 1 type level name id state fence 0 default 00010001 none [1 2] dlm 1 clvmd 00010002 none [1 2] dlm 1 rgmanager 00020002 none [1] group_tool on node 2 [root at remus ~]# group_tool type level name id state fence 0 default 00010001 none [1 2] dlm 1 clvmd 00010002 none [1 2] Additional info: From jbrassow at redhat.com Thu Jun 4 17:04:06 2009 From: jbrassow at redhat.com (Jonathan Brassow) Date: Thu, 4 Jun 2009 12:04:06 -0500 Subject: [Linux-Cluster] Submitting two new resource plugins to the project In-Reply-To: <1244046520.6750.6.camel@mecatol> References: <1243883850.6761.2.camel@mecatol> <1243919063.24866.14.camel@cerberus.int.fabbione.net> <1244046520.6750.6.camel@mecatol> Message-ID: On Jun 3, 2009, at 11:28 AM, Rafael Mic? Miranda wrote: > Hi Fabio, > > El mar, 02-06-2009 a las 07:04 +0200, Fabio M. Di Nitto escribi?: >> Hi Rafael, >> >> On Mon, 2009-06-01 at 21:17 +0200, Rafael Mic? Miranda wrote: > [...] >> >> >> The best way to submit is to post the code to cluster- >> devel at redhat.com >> mailing list. We don't have a very formal procedure in place. >> What we need to know is what it is, on what version of the software >> has >> been tested and what distribution. 
>> The right guys will take care of doing the correct steps (ask more, >> review, commit etc). >> Of course a patch against a git tree is the best but it's not a >> requirement at all (aka don't spend time learning git if you don't >> need/want to). >> >> Cheers >> Fabio >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > I have sent the e-mail to that mail list and i have had no answer yet. > Its the only occurrence i have found about the "devel list" on the > CMAN > Project web page, are you sure this address is right? I missed that post. Perhaps you could send it directly to me? brassow From m.nietz-redhat at iplabs.de Thu Jun 4 17:54:22 2009 From: m.nietz-redhat at iplabs.de (Marco Nietz) Date: Thu, 04 Jun 2009 19:54:22 +0200 Subject: [Linux-cluster] Monitoring Multipathd Message-ID: <4A280A4E.5010309@iplabs.de> Hi, we use a Two-Node-Cluster each one assembled with a Dual-Port Fibre-Channel HBA with two Paths to a redundant Storage-Array. When one of the Paths fail the Multipath-Daemon activates the Standby-Paths and everything works fine. We want the cluster to initiate a takeover when both Paths are failed. Is there a way to achieve this ? I think of some kind of Monitor from the Cluster to the Multipathd. Regards Marco From rmicmirregs at gmail.com Thu Jun 4 18:48:47 2009 From: rmicmirregs at gmail.com (Rafael =?ISO-8859-1?Q?Mic=F3?= Miranda) Date: Thu, 04 Jun 2009 20:48:47 +0200 Subject: [Linux-Cluster] Submitting two new resource plugins to the project In-Reply-To: References: <1243883850.6761.2.camel@mecatol> <1243919063.24866.14.camel@cerberus.int.fabbione.net> <1244046520.6750.6.camel@mecatol> Message-ID: <1244141327.6771.6.camel@mecatol> Hi Jonathan, El jue, 04-06-2009 a las 12:04 -0500, Jonathan Brassow escribi?: > I missed that post. Perhaps you could send it directly to me? > > brassow > > I have just send them to you. Thanks in advance, -- Rafael Mic? Miranda From brem.belguebli at gmail.com Thu Jun 4 20:11:51 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Thu, 4 Jun 2009 22:11:51 +0200 Subject: [Linux-Cluster] Submitting two new resource plugins to the project In-Reply-To: References: <1243883850.6761.2.camel@mecatol> Message-ID: <29ae894c0906041311w4f4e8c0aw5fdf55b70e4f39b6@mail.gmail.com> 2009/6/2 Jonathan Brassow > > On Jun 1, 2009, at 2:17 PM, Rafael Mic? Miranda wrote: > > This can be fixed with some of the options of LVM, specifically the >> "enable exclusively flag". This flag, when used over a cluster-enabled >> volume, will allow the VolumeGroup to be imported by all the nodes of >> the cluster but the LogicalVolumes into the VolumeGroup can only be >> activated by a single node. So, only one node of your cluster will have >> the LogicalVolume device (for example /dev/VolGrp01/LogVol01) and the >> problem explained above cannot happen. This is not about propagating the >> "open flag" through the nodes, this is about making the LogicalVolume be >> in only one node. >> > > This is different from the current approach. We would likely take this if > it is cleaner, better, or more advantageous than the current solution. > > Current solution is described here: > http://kbase.redhat.com/faq/docs/DOC-3068 > > brassow > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > Hello, Isn't it how it is supposed to work, the exclusive flag standing for that ? 
>From what I saw on other systems, especially on HP-UX from which Linux LVM was much inspired, on a cluster, when activating exclusively (vgchange -ae VGxx ) a VG on a node, the exclusive flag is set on the VG preventing the other nodes from activating the volume as long as the holding node is alive. Brem -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlieb-linux-cluster at budge.apana.org.au Thu Jun 4 20:23:13 2009 From: charlieb-linux-cluster at budge.apana.org.au (Charlie Brady) Date: Thu, 4 Jun 2009 16:23:13 -0400 (EDT) Subject: [Linux-cluster] "dlm_controld[nnnn]: cluster is down, exiting" on node1 when starting node2 Message-ID: I'm trying to understand a node shutdown during transition from 1 node to 2 node with qdisk cluster. The platform is CentOS 5.3, with versions: cman-2.0.98-1.el5 openais-0.80.3-22.el5 Jun 4 10:55:08 sun4150node1 root[8103]: S10make-event-queue=action|Event|cluster-node-added|Action|S10make-event-queue|Start|1244127 308 610636|End|1244127308 614973|Elapsed|0.004337 Jun 4 10:55:08 sun4150node1 root[8103]: Running event handler: /etc/e-smith/events/cluster-node-added/S15iscsi-adjust Jun 4 10:55:08 sun4150node1 root[8103]: S15iscsi-adjust=action|Event|cluster-node-added|Action|S15iscsi-adjust|Start|1244127308 6153 33|End|1244127308 677757|Elapsed|0.062424 Jun 4 10:55:08 sun4150node1 root[8103]: Running event handler: /etc/e-smith/events/cluster-node-added/S20cluster-conf Jun 4 10:55:08 sun4150node1 ccsd[7879]: Update of cluster.conf complete (version 2 -> 3). Jun 4 10:55:08 sun4150node1 root[8103]: Config file updated from version 2 to 3 Jun 4 10:55:08 sun4150node1 root[8103]: Jun 4 10:55:08 sun4150node1 root[8103]: Update complete. Jun 4 10:55:08 sun4150node1 root[8103]: S20cluster-conf=action|Event|cluster-node-added|Action|S20cluster-conf|Start|1244127308 6781 19|End|1244127308 793629|Elapsed|0.11551 Jun 4 10:55:08 sun4150node1 root[8103]: Running event handler: /etc/e-smith/events/cluster-node-added/S31qdiskd-adjust Jun 4 10:55:08 sun4150node1 qdiskd[8128]: Quorum Daemon Initializing Jun 4 10:55:08 sun4150node1 root[8103]: Starting the Quorum Disk Daemon:[ OK ]^M Jun 4 10:55:08 sun4150node1 root[8103]: S31qdiskd-adjust=action|Event|cluster-node-added|Action|S31qdiskd-adjust|Start|1244127308 79 7450|End|1244127308 928144|Elapsed|0.130694 Jun 4 10:55:08 sun4150node1 root[8103]: Running event handler: /etc/e-smith/events/cluster-node-added/S32cman-adjust Jun 4 10:55:09 sun4150node1 root[8103]: Starting cluster: Jun 4 10:55:09 sun4150node1 root[8103]: Loading modules... done Jun 4 10:55:09 sun4150node1 root[8103]: Mounting configfs... done Jun 4 10:55:09 sun4150node1 root[8103]: Starting ccsd... done Jun 4 10:55:09 sun4150node1 root[8103]: Starting cman... done Jun 4 10:55:09 sun4150node1 root[8103]: Starting daemons... done Jun 4 10:55:10 sun4150node1 root[8103]: Starting fencing... done Jun 4 10:55:10 sun4150node1 root[8103]: [ OK ]^M Jun 4 10:55:10 sun4150node1 root[8103]: S32cman-adjust=action|Event|cluster-node-added|Action|S32cman-adjust|Start|1244127308 928465 |End|1244127310 103254|Elapsed|1.174789 Jun 4 10:55:10 sun4150node1 root[8103]: Running event handler: /etc/e-smith/events/cluster-node-added/S40cluster-join Jun 4 10:55:10 sun4150node1 root[8103]: building file list ... 
done Jun 4 10:55:10 sun4150node1 root[8103]: Jun 4 10:55:10 sun4150node1 root[8103]: sent 64 bytes received 20 bytes 168.00 bytes/sec Jun 4 10:55:10 sun4150node1 root[8103]: total size is 3162 speedup is 37.64 Jun 4 10:55:17 sun4150node1 qdiskd[8128]: Initial score 1/1 Jun 4 10:55:17 sun4150node1 qdiskd[8128]: Initialization complete Jun 4 10:55:17 sun4150node1 qdiskd[8128]: Score sufficient for master operation (1/1; required=1); upgrading Jun 4 10:55:29 sun4150node1 qdiskd[8128]: Assuming master role Jun 4 10:55:34 sun4150node1 dlm_controld[7916]: cluster is down, exiting Jun 4 10:55:34 sun4150node1 fenced[7910]: cluster is down, exiting Jun 4 10:55:34 sun4150node1 gfs_controld[7922]: cluster is down, exiting Jun 4 10:55:34 sun4150node1 kernel: dlm: closing connection to node 2 Jun 4 10:55:34 sun4150node1 kernel: dlm: closing connection to node 1 Jun 4 10:55:35 sun4150node1 qdiskd[8128]: cman_dispatch: Host is down Jun 4 10:55:35 sun4150node1 qdiskd[8128]: Halting qdisk operations Jun 4 10:55:51 sun4150node1 kernel: dlm: FS1: remove fr 0 ID 1 Jun 4 10:56:01 sun4150node1 ccsd[7879]: Unable to connect to cluster infrastructure after 30 seconds. Jun 4 10:56:31 sun4150node1 ccsd[7879]: Unable to connect to cluster infrastructure after 60 seconds. Jun 4 10:57:01 sun4150node1 ccsd[7879]: Unable to connect to cluster infrastructure after 90 seconds. Jun 4 10:57:31 sun4150node1 ccsd[7879]: Unable to connect to cluster infrastructure after 120 seconds. The first thing I see awry is "dlm_controld[7916]: cluster is down, exiting". I can see from source code that that could be from either process_member() or cluster_dead(), both of which would be called via callback from loop(). My best guess is that process_member() called cman_dispatch(ch, CMAN_DISPATCH_ALL) and rv was -1 with errno set to EHOSTDOWN. But I don't know why that would be the case, and in particular why here. cman started fine on node2, and node1 joined without incident after reboot. Any hints on how to debug this would be appreciated. Thanks --- Charlie From brem.belguebli at gmail.com Fri Jun 5 09:22:23 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Fri, 5 Jun 2009 11:22:23 +0200 Subject: [Linux-cluster] Networking guidelines for RHCS across datacenters In-Reply-To: <7207d96f0906021906mb731687j6964f4a3466b626e@mail.gmail.com> References: <7207d96f0906021906mb731687j6964f4a3466b626e@mail.gmail.com> Message-ID: <29ae894c0906050222naf68973r7a68b3c52eeeead@mail.gmail.com> Hello, That sounds pretty much to the question I've asked to this mailing-list last May (https://www.redhat.com/archives/linux-cluster/2009-May/msg00093.html). We are in the same setup, already doing "Geo-cluster" with other technos and we are looking at RHCS to provide us the same service level. Latency could be a problem indeed if too high , but in a lot of cases (many companies for which I've worked), datacenters are a few tens of kilometers far, with a latency max close to 1 ms, which is not a problem. Let's consider this kind of setup, 2 datacenters far from each other by 1 ms delay, each hosting a SAN array, each of them connected to 2 SAN fabrics extended between the 2 sites. What reason would prevent us from building Geo-clusters without having to rely on a database replication mechanism, as the setup I would like to implement would also be used to provide NFS services that are disaster recovery proof. Obviously, such setup should rely on LVM mirroring to allow a node hosting a service to be able to write to both local and distant SAN LUN's. 
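To make that concrete, the layout meant here could be sketched roughly as follows (device, VG and LV names are placeholders, and it assumes clvmd plus the cluster mirror support mentioned elsewhere in this thread are available):

# one PV on the local array, one on the remote array
pvcreate /dev/mapper/san_local_lun1 /dev/mapper/san_remote_lun1
vgcreate -cy vg_geo /dev/mapper/san_local_lun1 /dev/mapper/san_remote_lun1
# mirrored LV with one leg on each array (mirror log options omitted here;
# a small third device or --corelog may be needed for the log)
lvcreate -m 1 -L 100G -n lv_data vg_geo
# activate it exclusively on whichever node currently runs the service
vgchange -aey vg_geo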
Brem 2009/6/3, Fajar A. Nugraha : > > On Wed, Jun 3, 2009 at 6:36 AM, Jon Schulz > wrote: > > I'm in the process of doing a concept review with the redhat cluster > suite. > > I've been given a requirement that cluster nodes are able to be located > in > > geographically separated data centers. I realize that this is not an > ideal > > scenario due to latency issues. > > For most purposes, RHCS would require that all nodes have access to > the same storage/disk. That pretty much ruled out the DR feature that > one might expect to get from having nodes in geographically separated > data centers. > > I'd suggest you refine your requirements. Perhaps what you need is > something like MySQL cluster replication, where there are two > geographically separated data centers, each having its own cluster, > and the two clusters replicate each other's data asynchronously. > > -- > Fajar > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fajar at fajar.net Fri Jun 5 09:47:25 2009 From: fajar at fajar.net (Fajar A. Nugraha) Date: Fri, 5 Jun 2009 16:47:25 +0700 Subject: [Linux-cluster] Networking guidelines for RHCS across datacenters In-Reply-To: <29ae894c0906050222naf68973r7a68b3c52eeeead@mail.gmail.com> References: <7207d96f0906021906mb731687j6964f4a3466b626e@mail.gmail.com> <29ae894c0906050222naf68973r7a68b3c52eeeead@mail.gmail.com> Message-ID: <7207d96f0906050247s54c5c468td31fceeabe1a8da3@mail.gmail.com> On Fri, Jun 5, 2009 at 4:22 PM, brem belguebli wrote: > We are in the same setup, already doing "Geo-cluster" with other technos and > we are looking at RHCS to provide us the same service level. Usually the concepts are the same. What solution are you using? How does it work, replication or real cluster? > Let's consider this kind of setup, 2 datacenters far from each other by 1 ms > delay, each hosting?a SAN array, each of them connected to 2 SAN fabrics > extended between the 2 sites. > > What reason would prevent?us from building Geo-clusters without?having to > rely on?a database replication mechanism, as the setup I would like to > implement would also be used to provide NFS services that are disaster > recovery proof. > > Obviously, such setup should rely on LVM mirroring to allow a node hosting a > service to?be able to write to both local and distant SAN LUN's. Does LVM mirroring work with clustered LVM? -- Fajar From jschulz at soapstonenetworks.com Fri Jun 5 14:37:57 2009 From: jschulz at soapstonenetworks.com (Jon Schulz) Date: Fri, 5 Jun 2009 10:37:57 -0400 Subject: [Linux-cluster] Networking guidelines for RHCS across datacenters In-Reply-To: <7207d96f0906050247s54c5c468td31fceeabe1a8da3@mail.gmail.com> References: <7207d96f0906021906mb731687j6964f4a3466b626e@mail.gmail.com> <29ae894c0906050222naf68973r7a68b3c52eeeead@mail.gmail.com> <7207d96f0906050247s54c5c468td31fceeabe1a8da3@mail.gmail.com> Message-ID: Yes I would be interested to see what products you are currently using to achieve this. In my proposed setup we are actually completely database transaction driven. The problem is the people higher up want active database <-> database replication which will be problematic I know. Outside of the data side of the equation, how tolerant is the cluster network/heartbeat to latency assuming no packet loss? 
Or more to the point, at what point does everyone in their past experience see the heartbeat network become unreliable, latency wise. E.g. anything over 30ms? Most of my experiences with rhcs and linux-ha have always been with the cluster network being within the same LAN :( -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Fajar A. Nugraha Sent: Friday, June 05, 2009 5:47 AM To: linux clustering Subject: Re: [Linux-cluster] Networking guidelines for RHCS across datacenters On Fri, Jun 5, 2009 at 4:22 PM, brem belguebli wrote: > We are in the same setup, already doing "Geo-cluster" with other technos and > we are looking at RHCS to provide us the same service level. Usually the concepts are the same. What solution are you using? How does it work, replication or real cluster? > Let's consider this kind of setup, 2 datacenters far from each other by 1 ms > delay, each hosting?a SAN array, each of them connected to 2 SAN fabrics > extended between the 2 sites. > > What reason would prevent?us from building Geo-clusters without?having to > rely on?a database replication mechanism, as the setup I would like to > implement would also be used to provide NFS services that are disaster > recovery proof. > > Obviously, such setup should rely on LVM mirroring to allow a node hosting a > service to?be able to write to both local and distant SAN LUN's. Does LVM mirroring work with clustered LVM? -- Fajar -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From Jeremy.Eder at mindshift.com Fri Jun 5 14:41:36 2009 From: Jeremy.Eder at mindshift.com (Jeremy Eder) Date: Fri, 5 Jun 2009 10:41:36 -0400 Subject: [Linux-cluster] Networking guidelines for RHCS across datacenters In-Reply-To: References: <7207d96f0906021906mb731687j6964f4a3466b626e@mail.gmail.com> <29ae894c0906050222naf68973r7a68b3c52eeeead@mail.gmail.com> <7207d96f0906050247s54c5c468td31fceeabe1a8da3@mail.gmail.com> Message-ID: <1734CA24F5FC1848880E6B1AB788DD7703774AB156@inv-ex1> I have no relation to this company, but I have heard good stories from people who worked with their products: If you're database is oracle, mysql or postgres check out products on www.continuent.com Best Regards, Jeremy Eder, RHCE, VCP -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jon Schulz Sent: Friday, June 05, 2009 10:38 AM To: linux clustering Subject: RE: [Linux-cluster] Networking guidelines for RHCS across datacenters Yes I would be interested to see what products you are currently using to achieve this. In my proposed setup we are actually completely database transaction driven. The problem is the people higher up want active database <-> database replication which will be problematic I know. Outside of the data side of the equation, how tolerant is the cluster network/heartbeat to latency assuming no packet loss? Or more to the point, at what point does everyone in their past experience see the heartbeat network become unreliable, latency wise. E.g. anything over 30ms? Most of my experiences with rhcs and linux-ha have always been with the cluster network being within the same LAN :( -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Fajar A. 
Nugraha Sent: Friday, June 05, 2009 5:47 AM To: linux clustering Subject: Re: [Linux-cluster] Networking guidelines for RHCS across datacenters On Fri, Jun 5, 2009 at 4:22 PM, brem belguebli wrote: > We are in the same setup, already doing "Geo-cluster" with other technos and > we are looking at RHCS to provide us the same service level. Usually the concepts are the same. What solution are you using? How does it work, replication or real cluster? > Let's consider this kind of setup, 2 datacenters far from each other by 1 ms > delay, each hosting?a SAN array, each of them connected to 2 SAN fabrics > extended between the 2 sites. > > What reason would prevent?us from building Geo-clusters without?having to > rely on?a database replication mechanism, as the setup I would like to > implement would also be used to provide NFS services that are disaster > recovery proof. > > Obviously, such setup should rely on LVM mirroring to allow a node hosting a > service to?be able to write to both local and distant SAN LUN's. Does LVM mirroring work with clustered LVM? -- Fajar -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From apfaffeneder at pfaffeneder.org Fri Jun 5 15:14:06 2009 From: apfaffeneder at pfaffeneder.org (Andreas Pfaffeneder) Date: Fri, 05 Jun 2009 17:14:06 +0200 Subject: [Linux-cluster] Networking guidelines for RHCS across datacenters In-Reply-To: <7207d96f0906050247s54c5c468td31fceeabe1a8da3@mail.gmail.com> References: <7207d96f0906021906mb731687j6964f4a3466b626e@mail.gmail.com> <29ae894c0906050222naf68973r7a68b3c52eeeead@mail.gmail.com> <7207d96f0906050247s54c5c468td31fceeabe1a8da3@mail.gmail.com> Message-ID: <4A29363E.1010108@pfaffeneder.org> Fajar A. Nugraha wrote: > > > Does LVM mirroring work with clustered LVM? > > Since 5.3 it works. Install lvm2-cluster. If you'd like to use mirrored volumes before 5.3 you can do so using lvm-tags (see filters in lvm.conf) but the mirror then is available only to one systeme at a time. Cheers Andreas From teigland at redhat.com Fri Jun 5 15:14:21 2009 From: teigland at redhat.com (David Teigland) Date: Fri, 5 Jun 2009 10:14:21 -0500 Subject: [Linux-cluster] "dlm_controld[nnnn]: cluster is down, exiting" on node1 when starting node2 In-Reply-To: References: Message-ID: <20090605151421.GB28143@redhat.com> On Thu, Jun 04, 2009 at 04:23:13PM -0400, Charlie Brady wrote: > Jun 4 10:55:34 sun4150node1 dlm_controld[7916]: cluster is down, exiting > Jun 4 10:55:34 sun4150node1 fenced[7910]: cluster is down, exiting > Jun 4 10:55:34 sun4150node1 gfs_controld[7922]: cluster is down, exiting > Jun 4 10:55:35 sun4150node1 qdiskd[8128]: cman_dispatch: Host is down They are all complaining that the the cluster is down, which is a polite way of saying that aisexec has died/crashed/failed/killed/gone-away. 
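A quick, purely illustrative way to confirm that on the affected node is something like:

# is aisexec still running at all?
pgrep -l aisexec
# openais normally leaves any core files under its working directory
ls -l /var/lib/openais/core* 2>/dev/null
# check that core dumps aren't being suppressed
ulimit -c
cat /proc/sys/kernel/core_pattern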
Dave From charlieb-linux-cluster at budge.apana.org.au Fri Jun 5 15:42:59 2009 From: charlieb-linux-cluster at budge.apana.org.au (Charlie Brady) Date: Fri, 5 Jun 2009 11:42:59 -0400 (EDT) Subject: [Linux-cluster] "dlm_controld[nnnn]: cluster is down, exiting" on node1 when starting node2 In-Reply-To: <20090605151421.GB28143@redhat.com> References: <20090605151421.GB28143@redhat.com> Message-ID: On Fri, 5 Jun 2009, David Teigland wrote: > On Thu, Jun 04, 2009 at 04:23:13PM -0400, Charlie Brady wrote: >> Jun 4 10:55:34 sun4150node1 dlm_controld[7916]: cluster is down, exiting >> Jun 4 10:55:34 sun4150node1 fenced[7910]: cluster is down, exiting >> Jun 4 10:55:34 sun4150node1 gfs_controld[7922]: cluster is down, exiting >> Jun 4 10:55:35 sun4150node1 qdiskd[8128]: cman_dispatch: Host is down > > They are all complaining that the the cluster is down, which is a polite way > of saying that aisexec has died/crashed/failed/killed/gone-away. Thanks. Why might that have occurred? Where would I look for clues? How can I increase logging output from aisexec? Thanks --- Charlie From teigland at redhat.com Fri Jun 5 16:04:55 2009 From: teigland at redhat.com (David Teigland) Date: Fri, 5 Jun 2009 11:04:55 -0500 Subject: [Linux-cluster] "dlm_controld[nnnn]: cluster is down, exiting" on node1 when starting node2 In-Reply-To: References: <20090605151421.GB28143@redhat.com> Message-ID: <20090605160455.GD28143@redhat.com> On Fri, Jun 05, 2009 at 11:42:59AM -0400, Charlie Brady wrote: > > On Fri, 5 Jun 2009, David Teigland wrote: > > >On Thu, Jun 04, 2009 at 04:23:13PM -0400, Charlie Brady wrote: > >>Jun 4 10:55:34 sun4150node1 dlm_controld[7916]: cluster is down, exiting > >>Jun 4 10:55:34 sun4150node1 fenced[7910]: cluster is down, exiting > >>Jun 4 10:55:34 sun4150node1 gfs_controld[7922]: cluster is down, exiting > >>Jun 4 10:55:35 sun4150node1 qdiskd[8128]: cman_dispatch: Host is > >>down > > > >They are all complaining that the the cluster is down, which is a polite > >way > >of saying that aisexec has died/crashed/failed/killed/gone-away. > > Thanks. Why might that have occurred? Where would I look for clues? How > can I increase logging output from aisexec? If you're lucky it'll leave a core file, otherwise aisexec is notorious for disappearing without leaving any clues about why. Dave From charlieb-linux-cluster at budge.apana.org.au Fri Jun 5 16:50:57 2009 From: charlieb-linux-cluster at budge.apana.org.au (Charlie Brady) Date: Fri, 5 Jun 2009 12:50:57 -0400 (EDT) Subject: [Linux-cluster] "dlm_controld[nnnn]: cluster is down, exiting" on node1 when starting node2 In-Reply-To: <20090605160455.GD28143@redhat.com> References: <20090605151421.GB28143@redhat.com> <20090605160455.GD28143@redhat.com> Message-ID: On Fri, 5 Jun 2009, David Teigland wrote: > On Fri, Jun 05, 2009 at 11:42:59AM -0400, Charlie Brady wrote: >> >> On Fri, 5 Jun 2009, David Teigland wrote: >> >>> On Thu, Jun 04, 2009 at 04:23:13PM -0400, Charlie Brady wrote: >>>> Jun 4 10:55:34 sun4150node1 dlm_controld[7916]: cluster is down, exiting >>>> Jun 4 10:55:34 sun4150node1 fenced[7910]: cluster is down, exiting >>>> Jun 4 10:55:34 sun4150node1 gfs_controld[7922]: cluster is down, exiting >>>> Jun 4 10:55:35 sun4150node1 qdiskd[8128]: cman_dispatch: Host is >>>> down >>> >>> They are all complaining that the the cluster is down, which is a polite >>> way >>> of saying that aisexec has died/crashed/failed/killed/gone-away. >> >> Thanks. Why might that have occurred? Where would I look for clues? 
How >> can I increase logging output from aisexec? > > If you're lucky it'll leave a core file, otherwise aisexec is notorious for > disappearing without leaving any clues about why. That's very disconcerting to hear. Doesn't sound like HA. :-( I don't have any core files. From teigland at redhat.com Fri Jun 5 16:49:51 2009 From: teigland at redhat.com (David Teigland) Date: Fri, 5 Jun 2009 11:49:51 -0500 Subject: [Linux-cluster] "dlm_controld[nnnn]: cluster is down, exiting" on node1 when starting node2 In-Reply-To: References: <20090605151421.GB28143@redhat.com> <20090605160455.GD28143@redhat.com> Message-ID: <20090605164951.GE28143@redhat.com> On Fri, Jun 05, 2009 at 12:50:57PM -0400, Charlie Brady wrote: > > On Fri, 5 Jun 2009, David Teigland wrote: > > >On Fri, Jun 05, 2009 at 11:42:59AM -0400, Charlie Brady wrote: > >> > >>On Fri, 5 Jun 2009, David Teigland wrote: > >> > >>>On Thu, Jun 04, 2009 at 04:23:13PM -0400, Charlie Brady wrote: > >>>>Jun 4 10:55:34 sun4150node1 dlm_controld[7916]: cluster is down, > >>>>exiting > >>>>Jun 4 10:55:34 sun4150node1 fenced[7910]: cluster is down, exiting > >>>>Jun 4 10:55:34 sun4150node1 gfs_controld[7922]: cluster is down, > >>>>exiting > >>>>Jun 4 10:55:35 sun4150node1 qdiskd[8128]: cman_dispatch: Host is > >>>>down > >>> > >>>They are all complaining that the the cluster is down, which is a polite > >>>way > >>>of saying that aisexec has died/crashed/failed/killed/gone-away. > >> > >>Thanks. Why might that have occurred? Where would I look for clues? How > >>can I increase logging output from aisexec? > > > >If you're lucky it'll leave a core file, otherwise aisexec is notorious for > >disappearing without leaving any clues about why. > > That's very disconcerting to hear. Doesn't sound like HA. :-( To clarify, aisexec does not often disappear, it's very reliable. The point was that in the rare case when it does, it's notorious for not leaving any reasons behind. Dave From sdake at redhat.com Fri Jun 5 17:10:38 2009 From: sdake at redhat.com (Steven Dake) Date: Fri, 05 Jun 2009 10:10:38 -0700 Subject: [Linux-cluster] "dlm_controld[nnnn]: cluster is down, exiting" on node1 when starting node2 In-Reply-To: <20090605164951.GE28143@redhat.com> References: <20090605151421.GB28143@redhat.com> <20090605160455.GD28143@redhat.com> <20090605164951.GE28143@redhat.com> Message-ID: <1244221838.2626.29.camel@localhost.localdomain> On Fri, 2009-06-05 at 11:49 -0500, David Teigland wrote: > On Fri, Jun 05, 2009 at 12:50:57PM -0400, Charlie Brady wrote: > > > > On Fri, 5 Jun 2009, David Teigland wrote: > > > > >On Fri, Jun 05, 2009 at 11:42:59AM -0400, Charlie Brady wrote: > > >> > > >>On Fri, 5 Jun 2009, David Teigland wrote: > > >> > > >>>On Thu, Jun 04, 2009 at 04:23:13PM -0400, Charlie Brady wrote: > > >>>>Jun 4 10:55:34 sun4150node1 dlm_controld[7916]: cluster is down, > > >>>>exiting > > >>>>Jun 4 10:55:34 sun4150node1 fenced[7910]: cluster is down, exiting > > >>>>Jun 4 10:55:34 sun4150node1 gfs_controld[7922]: cluster is down, > > >>>>exiting > > >>>>Jun 4 10:55:35 sun4150node1 qdiskd[8128]: cman_dispatch: Host is > > >>>>down > > >>> > > >>>They are all complaining that the the cluster is down, which is a polite > > >>>way > > >>>of saying that aisexec has died/crashed/failed/killed/gone-away. > > >> > > >>Thanks. Why might that have occurred? Where would I look for clues? How > > >>can I increase logging output from aisexec? 
> > > > > >If you're lucky it'll leave a core file, otherwise aisexec is notorious for > > >disappearing without leaving any clues about why. > > > > That's very disconcerting to hear. Doesn't sound like HA. :-( > > To clarify, aisexec does not often disappear, it's very reliable. The point > was that in the rare case when it does, it's notorious for not leaving any > reasons behind. > > Dave > 99.9% of the time there would be a core file in /var/lib/openais/core* if aisexec faults. We have not seen faults during normal operations for years in a released version under typical gfs2 usage scenarios. If there is no core, it means some other component failed, exited, and caused that node to be fenced, or the core file could not be written by the OS because of some other OS specific failure. Another option is that the OOM killer killed aisexec. I would have a hard time believing aisexec would crash without a core file while the operating system was still functional. In the trunk we are enhancing our failure analysis to do fulltime event tracing so failures can be debugged more rapidly then looking at a core file. I hope that helps. regards -steve > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From charlieb-linux-cluster at budge.apana.org.au Fri Jun 5 17:13:13 2009 From: charlieb-linux-cluster at budge.apana.org.au (Charlie Brady) Date: Fri, 5 Jun 2009 13:13:13 -0400 (EDT) Subject: [Linux-cluster] "dlm_controld[nnnn]: cluster is down, exiting" on node1 when starting node2 In-Reply-To: <20090605164951.GE28143@redhat.com> References: <20090605151421.GB28143@redhat.com> <20090605160455.GD28143@redhat.com> <20090605164951.GE28143@redhat.com> Message-ID: On Fri, 5 Jun 2009, David Teigland wrote: >>>>> They are all complaining that the the cluster is down, which is a polite >>>>> way >>>>> of saying that aisexec has died/crashed/failed/killed/gone-away. >>>> >>>> Thanks. Why might that have occurred? Where would I look for clues? How >>>> can I increase logging output from aisexec? >>> >>> If you're lucky it'll leave a core file, otherwise aisexec is notorious for >>> disappearing without leaving any clues about why. >> >> That's very disconcerting to hear. Doesn't sound like HA. :-( > > To clarify, aisexec does not often disappear, it's very reliable. The point > was that in the rare case when it does, it's notorious for not leaving any > reasons behind. Thanks for the clarification. From charlieb-linux-cluster at budge.apana.org.au Fri Jun 5 17:20:11 2009 From: charlieb-linux-cluster at budge.apana.org.au (Charlie Brady) Date: Fri, 5 Jun 2009 13:20:11 -0400 (EDT) Subject: [Linux-cluster] "dlm_controld[nnnn]: cluster is down, exiting" on node1 when starting node2 In-Reply-To: <1244221838.2626.29.camel@localhost.localdomain> References: <20090605151421.GB28143@redhat.com> <20090605160455.GD28143@redhat.com> <20090605164951.GE28143@redhat.com> <1244221838.2626.29.camel@localhost.localdomain> Message-ID: On Fri, 5 Jun 2009, Steven Dake wrote: > On Fri, 2009-06-05 at 11:49 -0500, David Teigland wrote: >> On Fri, Jun 05, 2009 at 12:50:57PM -0400, Charlie Brady wrote: >>> >>> On Fri, 5 Jun 2009, David Teigland wrote: >>> >>>> On Fri, Jun 05, 2009 at 11:42:59AM -0400, Charlie Brady wrote: >>>>> >>>>> On Fri, 5 Jun 2009, David Teigland wrote: >>>>> >>>>>> They are all complaining that the the cluster is down, which is a polite >>>>>> way >>>>>> of saying that aisexec has died/crashed/failed/killed/gone-away. >>>>> >>>>> Thanks. 
Why might that have occurred? Where would I look for clues? How >>>>> can I increase logging output from aisexec? >>>> >>>> If you're lucky it'll leave a core file, otherwise aisexec is notorious for >>>> disappearing without leaving any clues about why. >>> >>> That's very disconcerting to hear. Doesn't sound like HA. :-( >> >> To clarify, aisexec does not often disappear, it's very reliable. The point >> was that in the rare case when it does, it's notorious for not leaving any >> reasons behind. >> >> Dave >> > > 99.9% of the time there would be a core file in /var/lib/openais/core* > if aisexec faults. Only file I have there is named. ringid_10.39.171.212 > We have not seen faults during normal operations for > years in a released version under typical gfs2 usage scenarios. If > there is no core, it means some other component failed, exited, and > caused that node to be fenced, or the core file could not be written by > the OS because of some other OS specific failure. Another option is > that the OOM killer killed aisexec. No sign of the oom killer in the log I quoted yesterday. > I would have a hard time believing > aisexec would crash without a core file while the operating system was > still functional. > > In the trunk we are enhancing our failure analysis to do fulltime event > tracing so failures can be debugged more rapidly then looking at a core > file. I hope that helps. Thanks. I'll try to reproduce the scenario. Meanwhile I'm still looking for hints as to how to get more visibility of what is happening. From brem.belguebli at gmail.com Fri Jun 5 17:17:21 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Fri, 5 Jun 2009 19:17:21 +0200 Subject: [Linux-cluster] Networking guidelines for RHCS across datacenters In-Reply-To: <7207d96f0906050247s54c5c468td31fceeabe1a8da3@mail.gmail.com> References: <7207d96f0906021906mb731687j6964f4a3466b626e@mail.gmail.com> <29ae894c0906050222naf68973r7a68b3c52eeeead@mail.gmail.com> <7207d96f0906050247s54c5c468td31fceeabe1a8da3@mail.gmail.com> Message-ID: <29ae894c0906051017ice9628bw2af3f94de8c126c5@mail.gmail.com> Hello, We are long term HP ServiceGuard on HP-UX users and since a few months HP ServiceGuard on Linux (aka SGLX). The first one (HP-UX) works by using their Cluster LVM (a clvmd-like daemon named cmlvmd on each node) allowing one node of the cluster to activate exclusively (vgchange -a e VGXX) on one node and use a non-clustered FS (vxfs) on top of the LV's. The LV's are mirrored (a leg on each SAN array, one local and the other distant). On Linux (SGLX) is a bit more tricky but when masterized it works well. It relies on non-clustered LVM, with the LVM2 hosttags feature (HA-LVM described by RH) built on top of MD raid1 devices with a cluster module that guarantees the raid device to be consistent on one node at a time. Unfortunately, HP just announced the discontinuation of SGLX, that's why we are looking towards RHCS to see if it can provide the same service, which doesn't seem to be obvious. Concerning LVM mirroring with Clustered LVM, I hope it does or will. The only thing I know about LVM mirror is that, soon (maybe around RH5u5) it will support online resizing without having to break the mirror. Brem 2009/6/5, Fajar A. Nugraha : > > On Fri, Jun 5, 2009 at 4:22 PM, brem belguebli > wrote: > > We are in the same setup, already doing "Geo-cluster" with other technos > and > > we are looking at RHCS to provide us the same service level. > > Usually the concepts are the same. What solution are you using? 
How > does it work, replication or real cluster? > > > Let's consider this kind of setup, 2 datacenters far from each other by 1 > ms > > delay, each hosting a SAN array, each of them connected to 2 SAN fabrics > > extended between the 2 sites. > > > > What reason would prevent us from building Geo-clusters without having to > > rely on a database replication mechanism, as the setup I would like to > > implement would also be used to provide NFS services that are disaster > > recovery proof. > > > > Obviously, such setup should rely on LVM mirroring to allow a node > hosting a > > service to be able to write to both local and distant SAN LUN's. > > Does LVM mirroring work with clustered LVM? > > -- > Fajar > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdake at redhat.com Fri Jun 5 17:26:37 2009 From: sdake at redhat.com (Steven Dake) Date: Fri, 05 Jun 2009 10:26:37 -0700 Subject: [Linux-cluster] "dlm_controld[nnnn]: cluster is down, exiting" on node1 when starting node2 In-Reply-To: References: <20090605151421.GB28143@redhat.com> <20090605160455.GD28143@redhat.com> <20090605164951.GE28143@redhat.com> <1244221838.2626.29.camel@localhost.localdomain> Message-ID: <1244222797.2626.33.camel@localhost.localdomain> On Fri, 2009-06-05 at 13:20 -0400, Charlie Brady wrote: > On Fri, 5 Jun 2009, Steven Dake wrote: > > > On Fri, 2009-06-05 at 11:49 -0500, David Teigland wrote: > >> On Fri, Jun 05, 2009 at 12:50:57PM -0400, Charlie Brady wrote: > >>> > >>> On Fri, 5 Jun 2009, David Teigland wrote: > >>> > >>>> On Fri, Jun 05, 2009 at 11:42:59AM -0400, Charlie Brady wrote: > >>>>> > >>>>> On Fri, 5 Jun 2009, David Teigland wrote: > >>>>> > >>>>>> They are all complaining that the the cluster is down, which is a polite > >>>>>> way > >>>>>> of saying that aisexec has died/crashed/failed/killed/gone-away. > >>>>> > >>>>> Thanks. Why might that have occurred? Where would I look for clues? How > >>>>> can I increase logging output from aisexec? > >>>> > >>>> If you're lucky it'll leave a core file, otherwise aisexec is notorious for > >>>> disappearing without leaving any clues about why. > >>> > >>> That's very disconcerting to hear. Doesn't sound like HA. :-( > >> > >> To clarify, aisexec does not often disappear, it's very reliable. The point > >> was that in the rare case when it does, it's notorious for not leaving any > >> reasons behind. > >> > >> Dave > >> > > > > 99.9% of the time there would be a core file in /var/lib/openais/core* > > if aisexec faults. > > Only file I have there is named. > > ringid_10.39.171.212 > > > We have not seen faults during normal operations for > > years in a released version under typical gfs2 usage scenarios. If > > there is no core, it means some other component failed, exited, and > > caused that node to be fenced, or the core file could not be written by > > the OS because of some other OS specific failure. Another option is > > that the OOM killer killed aisexec. > > No sign of the oom killer in the log I quoted yesterday. > > > I would have a hard time believing > > aisexec would crash without a core file while the operating system was > > still functional. > > > > In the trunk we are enhancing our failure analysis to do fulltime event > > tracing so failures can be debugged more rapidly then looking at a core > > file. I hope that helps. > > Thanks. > > I'll try to reproduce the scenario. 
Meanwhile I'm still looking for hints > as to how to get more visibility of what is happening. some users change their default core file storage location. This would then override the defaults used by openais. another possibility is selinux is enabled. aisexec integration with selinux needs more work and selinux might prevent a core file from being written. You can check selinux by looking /etc/selinux/config. If it is set to enforcing or permisssive, that may be your culprit. Regards -steve From brem.belguebli at gmail.com Fri Jun 5 17:40:33 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Fri, 5 Jun 2009 19:40:33 +0200 Subject: [Linux-cluster] Networking guidelines for RHCS across datacenters In-Reply-To: References: <7207d96f0906021906mb731687j6964f4a3466b626e@mail.gmail.com> <29ae894c0906050222naf68973r7a68b3c52eeeead@mail.gmail.com> <7207d96f0906050247s54c5c468td31fceeabe1a8da3@mail.gmail.com> Message-ID: <29ae894c0906051040v64a92e55je6c3568e14d5a20f@mail.gmail.com> 2009/6/5, Jon Schulz : > > Yes I would be interested to see what products you are currently using to > achieve this. In my proposed setup we are actually completely database > transaction driven. The problem is the people higher up want active database > <-> database replication which will be problematic I know. Still we also use DB (Oracle, Sybase) replication mechanisms to address accidental data corruption, as mirroring being synchonous, if something happens (someone intentionnaly alters the DB or filesystem corruption) it will be on both legs of the mirror. Outside of the data side of the equation, how tolerant is the cluster > network/heartbeat to latency assuming no packet loss? Or more to the point, > at what point does everyone in their past experience see the heartbeat > network become unreliable, latency wise. E.g. anything over 30ms? > > Most of my experiences with rhcs and linux-ha have always been with the > cluster network being within the same LAN :( It is definitely the best solution in case you cannot rely on your network infrastructure. This is not completely my case :-) -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto: > linux-cluster-bounces at redhat.com] On Behalf Of Fajar A. Nugraha > Sent: Friday, June 05, 2009 5:47 AM > To: linux clustering > Subject: Re: [Linux-cluster] Networking guidelines for RHCS across > datacenters > > On Fri, Jun 5, 2009 at 4:22 PM, brem belguebli > wrote: > > We are in the same setup, already doing "Geo-cluster" with other technos > and > > we are looking at RHCS to provide us the same service level. > > Usually the concepts are the same. What solution are you using? How > does it work, replication or real cluster? > > > Let's consider this kind of setup, 2 datacenters far from each other by 1 > ms > > delay, each hosting a SAN array, each of them connected to 2 SAN fabrics > > extended between the 2 sites. > > > > What reason would prevent us from building Geo-clusters without having to > > rely on a database replication mechanism, as the setup I would like to > > implement would also be used to provide NFS services that are disaster > > recovery proof. > > > > Obviously, such setup should rely on LVM mirroring to allow a node > hosting a > > service to be able to write to both local and distant SAN LUN's. > > Does LVM mirroring work with clustered LVM? 
> > -- > Fajar > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdake at redhat.com Fri Jun 5 18:01:10 2009 From: sdake at redhat.com (Steven Dake) Date: Fri, 05 Jun 2009 11:01:10 -0700 Subject: [Linux-cluster] Networking guidelines for RHCS across datacenters In-Reply-To: <29ae894c0906051040v64a92e55je6c3568e14d5a20f@mail.gmail.com> References: <7207d96f0906021906mb731687j6964f4a3466b626e@mail.gmail.com> <29ae894c0906050222naf68973r7a68b3c52eeeead@mail.gmail.com> <7207d96f0906050247s54c5c468td31fceeabe1a8da3@mail.gmail.com> <29ae894c0906051040v64a92e55je6c3568e14d5a20f@mail.gmail.com> Message-ID: <1244224870.2626.37.camel@localhost.localdomain> On Fri, 2009-06-05 at 19:40 +0200, brem belguebli wrote: > > > 2009/6/5, Jon Schulz : > Yes I would be interested to see what products you are > currently using to achieve this. In my proposed setup we are > actually completely database transaction driven. The problem > is the people higher up want active database <-> database > replication which will be problematic I know. > > Still we also use DB (Oracle, Sybase) replication mechanisms > to address accidental data corruption, as mirroring being synchonous, > if something happens (someone intentionnaly alters the DB or > filesystem corruption) it will be on both legs of the mirror. > > > > Outside of the data side of the equation, how tolerant is the > cluster network/heartbeat to latency assuming no packet loss? > Or more to the point, at what point does everyone in their > past experience see the heartbeat network become unreliable, > latency wise. E.g. anything over 30ms? > The default configured timers for failure detection are quite high and retransmit many times for failed packets (for lossy networks). 30msec latency would pose no major problem, except performance. If you used posix locking and your machine->machine latency was 30msec, each posix lock would take 30.03 msec to grant or more, which may not meet your performance requirements. I can't recommend wan connections with totem (the protocol used in rhcs) because of the performance characteristics. If the performance of posix locks is not a high requirement, it should be functional. Regards -steve > From invite+kjdmu_5j51di at facebookmail.com Fri Jun 5 22:07:22 2009 From: invite+kjdmu_5j51di at facebookmail.com (Varun Galande) Date: Fri, 5 Jun 2009 15:07:22 -0700 Subject: [Linux-cluster] Check out my photos on Facebook Message-ID: <7e9d99562107ef1cab28aabcbfd48237@10.22.41.202> Hi linux-cluster at redhat.com, I invited you to join Facebook a while back and wanted to remind you that once you join, we'll be able to connect online, share photos, organize groups and events, and more. Thanks, Varun To sign up for Facebook, follow the link below: http://www.facebook.com/p.php?i=542993879&k=RVBUZ4WSUV2M5BD1QB63URRTSW1&r linux-cluster at redhat.com was invited to join Facebook by Varun Galande. If you do not wish to receive this type of email from Facebook in the future, please click on the link below to unsubscribe. http://www.facebook.com/o.php?k=e846e5&u=100000004637023&mid=939448G5af310c1015fG0G8 Facebook's offices are located at 1601 S. California Ave., Palo Alto, CA 94304. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From darcy.sherwood at gmail.com Mon Jun 8 02:40:38 2009 From: darcy.sherwood at gmail.com (Darcy Sherwood) Date: Sun, 7 Jun 2009 22:40:38 -0400 Subject: [Linux-cluster] Clvm Hang after an node is fenced in a 2 nodes cluster In-Reply-To: <4A276F8F.70101@justice.gouv.fr> References: <4A276F8F.70101@justice.gouv.fr> Message-ID: <7a7f2ea30906071940k19d69dfcs122cdaf6e71f79fc@mail.gmail.com> Do you have all of your cluster services chkconfig'd on at node2 ? Sounds to me like clvmd might be chkconfig'd off On Thu, Jun 4, 2009 at 2:54 AM, Jean Diallo < admin1-bua.dage-etd at justice.gouv.fr> wrote: > Description of problem: In a 2 nodes cluster, after 1 node is fence, any > clvm command hang on the ramaining node. when the fenced node cluster come > back in the cluster, any clvm command also hang, moreover the node do not > activate any clustered vg, and so do not access any shared device. > > > Version-Release number of selected component (if applicable): > redhat 5.2 > update device-mapper-1.02.28-2.el5.x86_64.rpm > lvm2-2.02.40-6.el5.x86_64.rpm > lvm2-cluster-2.02.40-7.el5.x86_64.rpm > > > Steps to Reproduce: > 1.2 nodes cluster , quorum formed with qdisk > 2.cold boot node 2 > 3.node 2 is evicted and fenced, service are taken over by node 1 > 4.node ? come back in cluster, quorate, but no clustered vg are up and any > lvm related command hang > 5.At this step every lvm command hang on node 1 > > > Expected results: node 2 should be able to get back the lock on clustered > lvm volume and node 1 should be able to issue any lvm relate command > > Here are my cluster.conf and lvm.conf > > > post_join_delay="6"/> > > > > > > > > > > > > > > > > > > token_retransmits_before_loss_const="20"/> > > login="Administrator" name="ilo172" passwd="X.X.X.X"/> > login="Administrator" name="ilo173" passwd="XXXX"/> > > > > > name="alfrescoP64" path="/etc/xen" recovery="relocate"/> > name="alfrescoI64" path="/etc/xen" recovery="relocate"/> > name="alfrescoS64" path="/etc/xen" recovery="relocate"/> > > votes="1"> > score="1"/> > > > > part of lvm.conf: > # Type 3 uses built-in clustered locking. > locking_type = 3 > > # If using external locking (type 2) and initialisation fails, > # with this set to 1 an attempt will be made to use the built-in > # clustered locking. > # If you are using a customised locking_library you should set this to 0. > fallback_to_clustered_locking = 0 > > # If an attempt to initialise type 2 or type 3 locking failed, perhaps > # because cluster components such as clvmd are not running, with this set > # to 1 an attempt will be made to use local file-based locking (type 1). > # If this succeeds, only commands against local volume groups will > proceed. > # Volume Groups marked as clustered will be ignored. > fallback_to_local_locking = 1 > > # Local non-LV directory that holds file-based locks while commands are > # in progress. A directory like /tmp that may get wiped on reboot is OK. > locking_dir = "/var/lock/lvm" > > # Other entries can go here to allow you to load shared libraries > # e.g. if support for LVM1 metadata was compiled as a shared library use > # format_libraries = "liblvm2format1.so" > # Full pathnames can be given. > > # Search this directory first for shared libraries. > # library_dir = "/lib" > > # The external locking library to load if locking_type is set to 2. 
> # locking_library = "liblvm2clusterlock.so" > > > part of lvm log on second node : > > vgchange.c:165 Activated logical volumes in volume group "VolGroup00" > vgchange.c:172 7 logical volume(s) in volume group "VolGroup00" now > active > cache/lvmcache.c:1220 Wiping internal VG cache > commands/toolcontext.c:188 Logging initialised at Wed Jun 3 15:17:29 > 2009 > commands/toolcontext.c:209 Set umask to 0077 > locking/cluster_locking.c:83 connect() failed on local socket: Connexion > refus?e > locking/locking.c:259 WARNING: Falling back to local file-based locking. > locking/locking.c:261 Volume Groups with the clustered attribute will be > inaccessible. > toollib.c:578 Finding all volume groups > toollib.c:491 Finding volume group "VGhomealfrescoS64" > metadata/metadata.c:2379 Skipping clustered volume group > VGhomealfrescoS64 > toollib.c:491 Finding volume group "VGhomealfS64" > metadata/metadata.c:2379 Skipping clustered volume group VGhomealfS64 > toollib.c:491 Finding volume group "VGvmalfrescoS64" > metadata/metadata.c:2379 Skipping clustered volume group VGvmalfrescoS64 > toollib.c:491 Finding volume group "VGvmalfrescoI64" > metadata/metadata.c:2379 Skipping clustered volume group VGvmalfrescoI64 > toollib.c:491 Finding volume group "VGvmalfrescoP64" > metadata/metadata.c:2379 Skipping clustered volume group VGvmalfrescoP64 > toollib.c:491 Finding volume group "VolGroup00" > libdm-report.c:981 VolGroup00 > cache/lvmcache.c:1220 Wiping internal VG cache > commands/toolcontext.c:188 Logging initialised at Wed Jun 3 15:17:29 > 2009 > commands/toolcontext.c:209 Set umask to 0077 > locking/cluster_locking.c:83 connect() failed on local socket: Connexion > refus?e > locking/locking.c:259 WARNING: Falling back to local file-based locking. > locking/locking.c:261 Volume Groups with the clustered attribute will be > inaccessible. > toollib.c:542 Using volume group(s) on command line > toollib.c:491 Finding volume group "VolGroup00" > vgchange.c:117 7 logical volume(s) in volume group "VolGroup00" monitored > cache/lvmcache.c:1220 Wiping internal VG cache > commands/toolcontext.c:188 Logging initialised at Wed Jun 3 15:20:45 > 2009 > commands/toolcontext.c:209 Set umask to 0077 > toollib.c:331 Finding all logical volumes > commands/toolcontext.c:188 Logging initialised at Wed Jun 3 15:20:50 > 2009 > commands/toolcontext.c:209 Set umask to 0077 > toollib.c:578 Finding all volume groups > > > group_tool on node 1 > type level name id state fence 0 > default 00010001 none [1 2] > dlm 1 clvmd 00010002 none [1 2] > dlm 1 rgmanager 00020002 none [1] > > > group_tool on node 2 > [root at remus ~]# group_tool > type level name id state fence 0 > default 00010001 none [1 2] > dlm 1 clvmd 00010002 none [1 2] > > Additional info: > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From CISPLengineer.hz at ril.com Mon Jun 8 04:37:34 2009 From: CISPLengineer.hz at ril.com (Viral .D. Ahire) Date: Mon, 08 Jun 2009 10:07:34 +0530 Subject: [Linux-cluster] Node Leave Cluster while Stopping Cluster Application (Oracle) Message-ID: <4A2C958E.1020701@ril.com> Hi, I have configured 2 node Clustering on RHEL-5. During migration of server i have changed host name & IP Address of bothe node and reconfigure the cluster through system-config-cluster. 
Now the problem is that whenever I stop the cluster application (Oracle), the cman service on the node that owns that application also stops, so that node is no longer part of the cluster and gets fenced by the other node. The same thing happens during a restart or relocation of the cluster application, because a restart or relocation stops the application first. Please help. Regards, Viral Ahire "Confidentiality Warning: This message and any attachments are intended only for the use of the intended recipient(s), are confidential, and may be privileged. If you are not the intended recipient, you are hereby notified that any review, re-transmission, conversion to hard copy, copying, circulation or other use of this message and any attachments is strictly prohibited. If you are not the intended recipient, please notify the sender immediately by return email, and delete this message and any attachments from your system. Virus Warning: Although the company has taken reasonable precautions to ensure no viruses are present in this email, the company cannot accept responsibility for any loss or damage arising from the use of this email or attachment."
From fdinitto at redhat.com Mon Jun 8 06:51:20 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Mon, 08 Jun 2009 08:51:20 +0200 Subject: Re: [Linux-Cluster] Submitting two new resource plugins to the project In-Reply-To: <1244046520.6750.6.camel@mecatol> References: <1243883850.6761.2.camel@mecatol> <1243919063.24866.14.camel@cerberus.int.fabbione.net> <1244046520.6750.6.camel@mecatol> Message-ID: <1244443880.3665.3.camel@cerberus.int.fabbione.net> On Wed, 2009-06-03 at 18:28 +0200, Rafael Micó Miranda wrote: > Hi Fabio, > > > I have sent the e-mail to that mail list and i have had no answer yet. > Its the only occurrence i have found about the "devel list" on the CMAN > Project web page, are you sure this address is right? Pretty sure it's right. Did you subscribe to the mailing list before posting? > > Thanks in advance. > No problem at all. Fabio
From fdinitto at redhat.com Mon Jun 8 06:52:17 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Mon, 08 Jun 2009 08:52:17 +0200 Subject: Re: [Linux-cluster] compile gnbd-kernel error In-Reply-To: <2495e3790906020630g34d1b32ckeff5db4b8167169@mail.gmail.com> References: <2495e3790906020630g34d1b32ckeff5db4b8167169@mail.gmail.com> Message-ID: <1244443937.3665.5.camel@cerberus.int.fabbione.net> On Tue, 2009-06-02 at 21:30 +0800, Jin-Shan Tseng wrote: > Hi folks, > > > I tried to compile gnbd-kernel on Gentoo Linux 2.6.29-gentoo-r5 but I > got some error messages. :( > > > the error messages are appear > on cluster-2.03.09, cluster-2.03.10, cluster-2.03.11 > > > # uname -a > Linux node26 2.6.29-gentoo-r5 #3 SMP Mon Jun 1 19:05:23 CST 2009 i686 > Intel(R) Xeon(TM) CPU 3.06GHz GenuineIntel GNU/Linux [SNIP] > > Does anyone have the same problems? > Any suggestions are appreciate. gnbd has not been ported to any kernel > 2.6.27 because it's been deprecated upstream.
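For anyone moving off gnbd, a minimal nbd replacement looks roughly like this (the host name, port and device below are only placeholders, and the exact syntax depends on the nbd-tools version in use):

  # on the machine that owns the disk
  nbd-server 2000 /dev/sdb1

  # on each client node
  modprobe nbd
  nbd-client storage-host 2000 /dev/nbd0

The client then sees /dev/nbd0 as an ordinary block device, so in principle it can take over the role a gnbd import used to play; iSCSI (e.g. tgtd plus open-iscsi) or AoE are the other commonly suggested routes.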
Fabio From esggrupos at gmail.com Mon Jun 8 07:55:41 2009 From: esggrupos at gmail.com (ESGLinux) Date: Mon, 8 Jun 2009 09:55:41 +0200 Subject: [Linux-cluster] all nodes halt when one lose connection In-Reply-To: References: <3128ba140905210444o21959031iff759490ace7c8bc@mail.gmail.com> <3128ba140905210757jd814f52hc1ca97c4da6e3a7a@mail.gmail.com> <1D109AC0-9EE0-419B-A841-D98EA53FF1C8@redhat.com> <3128ba140905210834gdcdf89ahb34bf45d1272861b@mail.gmail.com> <5b192c7e0905220650h7cc737c5k7581972add42e21f@mail.gmail.com> <3128ba140905250228x5577d24eucd68bbd4b1e57e1b@mail.gmail.com> Message-ID: <3128ba140906080055i7ddd67e9h67cccc5d931a6ec8@mail.gmail.com> Thanks for your answers, I have used a separated network for the manage and service networks with 2 switchs and now it works fine. Thanks again, ESG 2009/5/28 Kaerka Phillips > One thing we did not try, but might've worked, would be to bond two network > interfaces together and then use vlan tagging on top of the bond interface > to create a vlan across it to the other node, and then pointing the cluster > to the vlan interfaces, which should still be up if even if the loss of one > network interface or one switch. > > > On Wed, May 27, 2009 at 7:48 PM, Kaerka Phillips wrote: > >> It sounds like they're fencing themselves. We got around this issue on a >> two-node cluster by including the alternate node's internal ip address in >> the /etc/hosts file of both hosts and a cross-over cable for the service >> network with the private ip addresses assigned to that network. If you're >> trying to get them to monitor each other via the public network, in theory >> this could be done with a backup fencing method, but we weren't able to get >> this work since the heartbeat functions only happen on the network that the >> node names are defined to use. >> >> >> On Mon, May 25, 2009 at 5:28 AM, ESGLinux wrote: >> >>> Hi, >>> I think this is not my problem because fencing works fine. The nodes gets >>> fenced inmediatly but I think they fence when they don't must >>> >>> Greetings, >>> >>> ESG >>> >>> 2009/5/22 jorge sanchez >>> >>> Hi, >>>> >>>> try also disable the acpi if is it running , see following: >>>> >>>> >>>> http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Cluster_Administration/s1-acpi-CA.html >>>> >>>> >>>> Regards, >>>> >>>> Jorge Sanchez >>>> >>>> >>>> On Thu, May 21, 2009 at 5:34 PM, ESGLinux wrote: >>>> >>>>> >>>>> >>>>> 2009/5/21 Jonathan Brassow >>>>> >>>>>> >>>>>> On May 21, 2009, at 9:57 AM, ESGLinux wrote: >>>>>> >>>>>> Hello, >>>>>>> >>>>>>> these are the logs I get: >>>>>>> >>>>>>> In node1: >>>>>>> >>>>>>> May 21 11:33:44 NODE1 fenced[3840]: NODE2 not a cluster member after >>>>>>> 5 sec post_fail_delay >>>>>>> May 21 11:33:44 NODE1 fenced[3840]: fencing node "NODE2" >>>>>>> May 21 11:33:44 NODE1 shutdown[5448]: shutting down for system halt >>>>>>> >>>>>>> in node2: >>>>>>> >>>>>>> May 21 11:33:45 NODE2 fenced[3843]: NODE1 not a cluster member after >>>>>>> 5 sec post_fail_delay >>>>>>> May 21 11:33:45 NODE2 fenced[3843]: fencing node "NODE1" >>>>>>> May 21 11:33:45 NODE2 shutdown[5923]: shutting down for system halt >>>>>>> >>>>>>> >>>>>>> what I don?t know is way they lose the connection with the cluster, >>>>>>> they are still connected (I only unplug a cable from the service network) >>>>>>> >>>>>> >>>>>> That may be something worth chasing down, as it appears that your >>>>>> cluster communication is on a network you don't expect? 
>>>>>> >>>>> >>>>> How can I be sure about the network the nodes are using for >>>>> communication? I think they do for the network I have configured to do >>>>> that.... >>>>> >>>>> >>>>>> >>>>>> Also, are the nodes simply "shutting down", or are they being forcibly >>>>>> rebooted. If it is a casual shutdown, then it would appear that both nodes >>>>>> are trying to shutdown simultaneously. >>>>>> >>>>> >>>>> they simply shutdown. They no reboot. >>>>> >>>>> This is what I get every time I unplug the nework cable from eth0 of >>>>> any of the two nodes. (they communicate through eth1...) >>>>> >>>>> Greetings, >>>>> >>>>> ESG >>>>> >>>>> >>>>> >>>>> >>>>> >>>>>> >>>>>> >>>>>> brassow >>>>>> >>>>>> -- >>>>>> Linux-cluster mailing list >>>>>> Linux-cluster at redhat.com >>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>> >>>>> >>>>> >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>> >>>> >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>> >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> >> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From esggrupos at gmail.com Mon Jun 8 07:59:42 2009 From: esggrupos at gmail.com (ESGLinux) Date: Mon, 8 Jun 2009 09:59:42 +0200 Subject: [Linux-cluster] Default params for quorumdisk In-Reply-To: <1243631124.25291.36.camel@ayanami> References: <3128ba140905260342k523cc51v7d8e321907b1d049@mail.gmail.com> <1243631124.25291.36.camel@ayanami> Message-ID: <3128ba140906080059p72509deeoa3dd850de291706a@mail.gmail.com> Hi, I finally have configured a quorom disk and it works fine. Now my 2 nodes cluster is more stable than ever. For anyone who doesnt use it, like me, I recommend it ;-) ESG 2009/5/29 Lon Hohberger > On Tue, 2009-05-26 at 12:42 +0200, ESGLinux wrote: > > for example, > > Interval - The frequency of read/write cycles, in seconds. I have not > > idea what to say to that. Which is the default and how can I answer > > it? > > 'man qdisk' explains all the parameters and their defaults. > > -- Lon > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tsengjs at gmail.com Mon Jun 8 08:27:03 2009 From: tsengjs at gmail.com (Jin-Shan Tseng) Date: Mon, 8 Jun 2009 16:27:03 +0800 Subject: [Linux-cluster] compile gnbd-kernel error In-Reply-To: <1244443937.3665.5.camel@cerberus.int.fabbione.net> References: <2495e3790906020630g34d1b32ckeff5db4b8167169@mail.gmail.com> <1244443937.3665.5.camel@cerberus.int.fabbione.net> Message-ID: <2495e3790906080127q7d939cceh70632899a42aa1d3@mail.gmail.com> On Mon, Jun 8, 2009 at 2:52 PM, Fabio M. Di Nitto wrote: > > gnbd has not been ported to any kernel > 2.6.27 because it's been > deprecated upstream. > > Fabio > Hi Fabio, Thanks for your reply. :) I'll use nbd instead. Regards, Jin-Shan -------------- next part -------------- An HTML attachment was scrubbed... URL: From fajar at fajar.net Mon Jun 8 08:32:44 2009 From: fajar at fajar.net (Fajar A. 
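For completeness, the bonding-plus-VLAN idea suggested earlier in this thread would look something like the following on a RHEL-style system (interface names, addresses and bonding options are just an example; older setups put the bonding options in modprobe.conf instead of BONDING_OPTS):

  # /etc/sysconfig/network-scripts/ifcfg-bond0
  DEVICE=bond0
  BONDING_OPTS="mode=active-backup miimon=100"
  ONBOOT=yes
  BOOTPROTO=none

  # /etc/sysconfig/network-scripts/ifcfg-eth0 (and likewise eth1)
  DEVICE=eth0
  MASTER=bond0
  SLAVE=yes
  ONBOOT=yes
  BOOTPROTO=none

  # /etc/sysconfig/network-scripts/ifcfg-bond0.10  (the cluster VLAN)
  DEVICE=bond0.10
  VLAN=yes
  IPADDR=192.168.10.1
  NETMASK=255.255.255.0
  ONBOOT=yes

The cluster node names would then resolve to the addresses on bond0.10, so heartbeat traffic survives the loss of either NIC or either switch.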
Nugraha) Date: Mon, 8 Jun 2009 15:32:44 +0700 Subject: [Linux-cluster] compile gnbd-kernel error In-Reply-To: <2495e3790906080127q7d939cceh70632899a42aa1d3@mail.gmail.com> References: <2495e3790906020630g34d1b32ckeff5db4b8167169@mail.gmail.com> <1244443937.3665.5.camel@cerberus.int.fabbione.net> <2495e3790906080127q7d939cceh70632899a42aa1d3@mail.gmail.com> Message-ID: <7207d96f0906080132u613d7bffw959d6a5225b15eac@mail.gmail.com> On Mon, Jun 8, 2009 at 3:27 PM, Jin-Shan Tseng wrote: > On Mon, Jun 8, 2009 at 2:52 PM, Fabio M. Di Nitto > wrote: >> >> gnbd has not been ported to any kernel > 2.6.27 because it's been >> deprecated upstream. >> >> Fabio > > Hi Fabio, > Thanks for your reply. :) > I'll use nbd instead. This is interesting. What is the currently recommended method to use for exporting block-device via TCP/IP? nbd? iscsi? use whatever works? -- Fajar From fdinitto at redhat.com Mon Jun 8 08:46:16 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Mon, 08 Jun 2009 10:46:16 +0200 Subject: [Linux-cluster] compile gnbd-kernel error In-Reply-To: <7207d96f0906080132u613d7bffw959d6a5225b15eac@mail.gmail.com> References: <2495e3790906020630g34d1b32ckeff5db4b8167169@mail.gmail.com> <1244443937.3665.5.camel@cerberus.int.fabbione.net> <2495e3790906080127q7d939cceh70632899a42aa1d3@mail.gmail.com> <7207d96f0906080132u613d7bffw959d6a5225b15eac@mail.gmail.com> Message-ID: <1244450776.3665.15.camel@cerberus.int.fabbione.net> On Mon, 2009-06-08 at 15:32 +0700, Fajar A. Nugraha wrote: > On Mon, Jun 8, 2009 at 3:27 PM, Jin-Shan Tseng wrote: > > On Mon, Jun 8, 2009 at 2:52 PM, Fabio M. Di Nitto > > wrote: > >> > >> gnbd has not been ported to any kernel > 2.6.27 because it's been > >> deprecated upstream. > >> > >> Fabio > > > > Hi Fabio, > > Thanks for your reply. :) > > I'll use nbd instead. > > This is interesting. > What is the currently recommended method to use for exporting > block-device via TCP/IP? nbd? iscsi? use whatever works? > There was a short thread discussing this same issue in November when we announced GNDB deprecation: http://www.redhat.com/archives/cluster-devel/2008-November/msg00062.html in short, iscsi/aoe/nbd and others are recognized as defacto standard protocols and supported by different vendors. It makes no sense to carry around yet another network block device protocol/implementation that is not even standard. A lot of people had great deal of success using iSCSI. I personally used AOE for testing for a long time with very little issues. Fabio From grimme at atix.de Mon Jun 8 09:02:19 2009 From: grimme at atix.de (Marc Grimme) Date: Mon, 8 Jun 2009 11:02:19 +0200 Subject: [Linux-cluster] Mountoption _netdev status with gfs/gfs2 Message-ID: <200906081102.20019.grimme@atix.de> Hello, in a few bugs I read that you don't want to support the _netdev mountoption (Dave/Steve) with gfs/gfs2. In order to being able to establish a relyable process to mount filesystems depending on the network (independently from the filesystem itself) for me it looks like a good step to at least support then _netdev mountoption with gfs/gfs2. But if I specify the _netdev option with gfs it is just ignored and will not be shown. Could you please shortly sum up or give me a reference where you described the reasons for not supporting it with gfs? For us (using gfs/gfs2 as rootfs) it would make things much easier if the _netdev option would be available. BTW: ocfs2 sets it as default. 
-- Gruss / Regards, Marc Grimme http://www.atix.de/ http://www.open-sharedroot.org/ From swhiteho at redhat.com Mon Jun 8 09:22:51 2009 From: swhiteho at redhat.com (Steven Whitehouse) Date: Mon, 08 Jun 2009 10:22:51 +0100 Subject: [Linux-cluster] Mountoption _netdev status with gfs/gfs2 In-Reply-To: <200906081102.20019.grimme@atix.de> References: <200906081102.20019.grimme@atix.de> Message-ID: <1244452971.29604.764.camel@localhost.localdomain> Hi, On Mon, 2009-06-08 at 11:02 +0200, Marc Grimme wrote: > Hello, > in a few bugs I read that you don't want to support the _netdev mountoption > (Dave/Steve) with gfs/gfs2. > > In order to being able to establish a relyable process to mount filesystems > depending on the network (independently from the filesystem itself) for me it > looks like a good step to at least support then _netdev mountoption with > gfs/gfs2. > > But if I specify the _netdev option with gfs it is just ignored and will not > be shown. > > Could you please shortly sum up or give me a reference where you described the > reasons for not supporting it with gfs? > > For us (using gfs/gfs2 as rootfs) it would make things much easier if the > _netdev option would be available. > > BTW: ocfs2 sets it as default. The _netdev option is only read by scripts, not by the kernel itself and its an ordering constraint. The issue is that it doesn't make the ordering correct in all cases and there are good reasons for wanting to specify other orderings too. The man page for fstab says that entries will be read in the order in which they appear. Thats not quite true of course as _netdev entries will be read later on, and they are then mounted according to fstype and the fstab ordering is only respected within a particular fstype. Ideally we want to be able to mix ordinary fs mounts, network fs mounts and bind mounts in any order. Although you can use _netdev to solve one particular case with gfs/gfs2 it certainly is not a general solution, so more thought needs to go into this. The upstart project has been suggested as a possible solution. I've not looked into it enough to be certain whether that is the case or not, nor do I know the current state of enthusiasm for it amoung the distros. I do appreciate that this has been a long standing issue and I would be very happy to see it resolved. As you are aware, we have open bugs on this issue: #435906 also related are #480002 and #207697 Steve. From esggrupos at gmail.com Mon Jun 8 09:22:53 2009 From: esggrupos at gmail.com (ESGLinux) Date: Mon, 8 Jun 2009 11:22:53 +0200 Subject: [Linux-cluster] question about 2 nodes cluster Message-ID: <3128ba140906080222w22d5c2b4y63d9ee4df80b38bc@mail.gmail.com> Hi all, I have one existential question about two nodes cluster. I have read that for 2 nodes cluster is necessary a third element to give stabilty to the cluster. One way is to add a third node, so its not a 2 nodes cluster. For me its not an answer because It becomes another kind of cluster ( 3 nodes cluster) Other, is to use qdisk (this is the way I?m trying nowadays) My question is if it?s absollutely necesary this third element in the architecture. and If so which are the options, ? thanks in advance ESG -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From dist-list at LEXUM.UMontreal.CA Mon Jun 8 13:16:41 2009 From: dist-list at LEXUM.UMontreal.CA (FM) Date: Mon, 08 Jun 2009 09:16:41 -0400 Subject: [Linux-cluster] Xen , Out of memory and dom0-min-mem, dom0_mem Message-ID: <4A2D0F39.7050308@lexum.umontreal.ca> Hello 2 of 3 of my xen nodes ( dom0) died tonight because of out of memory error. These servers are only running Xen and cluster suite packages. I googled the error and everything point out to dom0-min-mem and grub dom0_mem options dom0-min-mem : I left it by default : (dom0-min-mem 256) I read about on the internet but still do not understand this parameter and if if should be =0 on servers any advice ? tx ! here are some info about one server : [root at cluster01-node1 xen]# virsh dominfo 0 Id: 0 Name: Domain-0 UUID: 00000000-0000-0000-0000-000000000000 OS Type: linux State: running CPU(s): 8 CPU time: 182.8s Max memory: no limit Used memory: 33554544 kB host : cluster01-node1.cluster.lexum.pri release : 2.6.18-128.1.10.el5xen version : #1 SMP Thu May 7 11:07:18 EDT 2009 machine : x86_64 nr_cpus : 8 nr_nodes : 1 sockets_per_node : 2 cores_per_socket : 4 threads_per_core : 1 cpu_mhz : 2833 hw_caps : bfebfbff:20000800:00000000:00000140:040ce3bd:00000000:00000001 total_memory : 36861 free_memory : 963 node_to_cpu : node0:0-7 xen_major : 3 xen_minor : 1 xen_extra : .2-128.1.10.el5 xen_caps : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64 xen_pagesize : 4096 platform_params : virt_start=0xffff800000000000 xen_changeset : unavailable cc_compiler : gcc version 4.1.2 20080704 (Red Hat 4.1.2-44) cc_compile_by : mockbuild cc_compile_domain : centos.org cc_compile_date : Thu May 7 10:28:47 EDT 2009 xend_config_format : 2 From rmicmirregs at gmail.com Mon Jun 8 14:44:39 2009 From: rmicmirregs at gmail.com (Rafael =?ISO-8859-1?Q?Mic=F3?= Miranda) Date: Mon, 08 Jun 2009 16:44:39 +0200 Subject: [Linux-Cluster] Submitting two new resource plugins to the project In-Reply-To: <31EF80F0-B621-4889-9605-9F50A431F8AF@redhat.com> References: <1243883850.6761.2.camel@mecatol> <1243919063.24866.14.camel@cerberus.int.fabbione.net> <1244046520.6750.6.camel@mecatol> <1244141327.6771.6.camel@mecatol> <31EF80F0-B621-4889-9605-9F50A431F8AF@redhat.com> Message-ID: <1244472279.7104.1.camel@mecatol> Hi Jonathan El jue, 04-06-2009 a las 16:39 -0500, Jonathan Brassow escribi?: > On Jun 4, 2009, at 1:48 PM, Rafael Mic? Miranda wrote: > > I am sorry, I have not received your e-mail yet. I suppose it could > have been caught by my spam filter. Could you please try to send > again to: jbrassow at redhat.com? > > If that doesn't work, then I can send you a web address where you can > upload the code. > > thanks, > brassow > > P.S. It seems I get all your messages on linux-cluster at redhat.com, > but I'm not seeing the others... > I send you a e-mail with the files last Thursday, but 'cause i see no feedback I think you did not receive it. Please tell me any other way I can upload the code. Thanks, -- Rafael Mic? 
Miranda From rmicmirregs at gmail.com Mon Jun 8 14:50:06 2009 From: rmicmirregs at gmail.com (Rafael =?ISO-8859-1?Q?Mic=F3?= Miranda) Date: Mon, 08 Jun 2009 16:50:06 +0200 Subject: [Linux-Cluster] Submitting two new resource plugins to the project In-Reply-To: <1244443880.3665.3.camel@cerberus.int.fabbione.net> References: <1243883850.6761.2.camel@mecatol> <1243919063.24866.14.camel@cerberus.int.fabbione.net> <1244046520.6750.6.camel@mecatol> <1244443880.3665.3.camel@cerberus.int.fabbione.net> Message-ID: <1244472606.7104.8.camel@mecatol> Hi Fabio, El lun, 08-06-2009 a las 08:51 +0200, Fabio M. Di Nitto escribi?: > > Pretty sure it's right. Did you subscribe to the mailing list before > posting? > > > Fabio > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster I tried to explain I found no references to that list, cluster-devel at redhat.com, anywhere in the CMAN Project webpage so no, I have not subscribed to that list yet. Now I have taken a look at Google and I found this URL: http://www.redhat.com/mailman/listinfo/cluster-devel I will subscribe to it just now. Thanks, -- Rafael Mic? Miranda From cthulhucalling at gmail.com Mon Jun 8 15:13:08 2009 From: cthulhucalling at gmail.com (Ian Hayes) Date: Mon, 8 Jun 2009 11:13:08 -0400 Subject: [Linux-cluster] question about 2 nodes cluster In-Reply-To: <3128ba140906080222w22d5c2b4y63d9ee4df80b38bc@mail.gmail.com> References: <3128ba140906080222w22d5c2b4y63d9ee4df80b38bc@mail.gmail.com> Message-ID: <36df569a0906080813p27c9b8ccu5a89e210887bfdf1@mail.gmail.com> You don't need a third node or quorum disk... A 2-node cluster is sort of a special case, which is why there is a special config line for a setup with only 2 nodes. I'm running a couple of 2-node clusters right now with no quorum disk. The main issue I've encountered is that they can go split-brain easily, and you get to watch both nodes fence each other off endlessly. Adding clean_start="1" to the fence_daemon line helps prevent this, but a quorum disk would be better if you're absolutely committed to a 2-node setup. I'm running a test cluster at this moment with 3 nodes and a 1Gb quorum partition that is shared out via iSCSI. I sort of discovered that I needed a quorum disk the hard way after taking 2 nodes down and the suriviving node gave up due to the cluster being inquorate. On Mon, Jun 8, 2009 at 5:22 AM, ESGLinux wrote: > Hi all, > I have one existential question about two nodes cluster. I have read that > for 2 nodes cluster is necessary a third element to give stabilty to the > cluster. > > One way is to add a third node, so its not a 2 nodes cluster. For me its > not an answer because It becomes another kind of cluster ( 3 nodes cluster) > > Other, is to use qdisk (this is the way I?m trying nowadays) > > My question is if it?s absollutely necesary this third element in the > architecture. and If so which are the options, ? > > thanks in advance > > ESG > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From teigland at redhat.com Mon Jun 8 15:06:52 2009 From: teigland at redhat.com (David Teigland) Date: Mon, 8 Jun 2009 10:06:52 -0500 Subject: [Linux-cluster] "dlm_controld[nnnn]: cluster is down, exiting" on node1 when starting node2 In-Reply-To: <1244221838.2626.29.camel@localhost.localdomain> References: <20090605151421.GB28143@redhat.com> <20090605160455.GD28143@redhat.com> <20090605164951.GE28143@redhat.com> <1244221838.2626.29.camel@localhost.localdomain> Message-ID: <20090608150652.GA8734@redhat.com> On Fri, Jun 05, 2009 at 10:10:38AM -0700, Steven Dake wrote: > 99.9% of the time there would be a core file in /var/lib/openais/core* > if aisexec faults. We have not seen faults during normal operations for > years in a released version under typical gfs2 usage scenarios. If > there is no core, it means some other component failed, exited, and > caused that node to be fenced, or the core file could not be written by > the OS because of some other OS specific failure. That's why it would be so valuable to leave a simple "I'm failing" message. That and the fact that people don't naturally know to go looking for a /var/lib/openais/core file when everything falls apart. Dave From esggrupos at gmail.com Mon Jun 8 15:31:03 2009 From: esggrupos at gmail.com (ESGLinux) Date: Mon, 8 Jun 2009 17:31:03 +0200 Subject: [Linux-cluster] question about 2 nodes cluster In-Reply-To: <36df569a0906080813p27c9b8ccu5a89e210887bfdf1@mail.gmail.com> References: <3128ba140906080222w22d5c2b4y63d9ee4df80b38bc@mail.gmail.com> <36df569a0906080813p27c9b8ccu5a89e210887bfdf1@mail.gmail.com> Message-ID: <3128ba140906080831x716faa70i395c0424e0e41cde@mail.gmail.com> Thank you Ian, I?l take your answer in account, Greetings, ESG 2009/6/8 Ian Hayes > You don't need a third node or quorum disk... A 2-node cluster is sort of a > special case, which is why there is a special config line for a setup with > only 2 nodes. I'm running a couple of 2-node clusters right now with no > quorum disk. The main issue I've encountered is that they can go split-brain > easily, and you get to watch both nodes fence each other off endlessly. > Adding clean_start="1" to the fence_daemon line helps prevent this, but a > quorum disk would be better if you're absolutely committed to a 2-node > setup. > > I'm running a test cluster at this moment with 3 nodes and a 1Gb quorum > partition that is shared out via iSCSI. I sort of discovered that I needed a > quorum disk the hard way after taking 2 nodes down and the suriviving node > gave up due to the cluster being inquorate. > > > On Mon, Jun 8, 2009 at 5:22 AM, ESGLinux wrote: > >> Hi all, >> I have one existential question about two nodes cluster. I have read that >> for 2 nodes cluster is necessary a third element to give stabilty to the >> cluster. >> >> One way is to add a third node, so its not a 2 nodes cluster. For me its >> not an answer because It becomes another kind of cluster ( 3 nodes cluster) >> >> Other, is to use qdisk (this is the way I?m trying nowadays) >> >> My question is if it?s absollutely necesary this third element in the >> architecture. and If so which are the options, ? >> >> thanks in advance >> >> ESG >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From brem.belguebli at gmail.com Mon Jun 8 18:06:37 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Mon, 8 Jun 2009 20:06:37 +0200 Subject: [Linux-cluster] Networking guidelines for RHCS across datacenters In-Reply-To: <1244224870.2626.37.camel@localhost.localdomain> References: <7207d96f0906021906mb731687j6964f4a3466b626e@mail.gmail.com> <29ae894c0906050222naf68973r7a68b3c52eeeead@mail.gmail.com> <7207d96f0906050247s54c5c468td31fceeabe1a8da3@mail.gmail.com> <29ae894c0906051040v64a92e55je6c3568e14d5a20f@mail.gmail.com> <1244224870.2626.37.camel@localhost.localdomain> Message-ID: <29ae894c0906081106qda49510he49c558f45d56045@mail.gmail.com> Hello, Here's a link to illustrate the kind of setup I'm trying to setup with RHCS. http://brehak.blogspot.com/2009/06/disaster-recovery-setup.html Regards 2009/6/5, Steven Dake : > > On Fri, 2009-06-05 at 19:40 +0200, brem belguebli wrote: > > > > > > 2009/6/5, Jon Schulz : > > Yes I would be interested to see what products you are > > currently using to achieve this. In my proposed setup we are > > actually completely database transaction driven. The problem > > is the people higher up want active database <-> database > > replication which will be problematic I know. > > > > Still we also use DB (Oracle, Sybase) replication mechanisms > > to address accidental data corruption, as mirroring being synchonous, > > if something happens (someone intentionnaly alters the DB or > > filesystem corruption) it will be on both legs of the mirror. > > > > > > > > Outside of the data side of the equation, how tolerant is the > > cluster network/heartbeat to latency assuming no packet loss? > > Or more to the point, at what point does everyone in their > > past experience see the heartbeat network become unreliable, > > latency wise. E.g. anything over 30ms? > > > > The default configured timers for failure detection are quite high and > retransmit many times for failed packets (for lossy networks). 30msec > latency would pose no major problem, except performance. If you used > posix locking and your machine->machine latency was 30msec, each posix > lock would take 30.03 msec to grant or more, which may not meet your > performance requirements. > > I can't recommend wan connections with totem (the protocol used in rhcs) > because of the performance characteristics. If the performance of posix > locks is not a high requirement, it should be functional. > > Regards > -steve > > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fajar at fajar.net Tue Jun 9 03:15:23 2009 From: fajar at fajar.net (Fajar A. Nugraha) Date: Tue, 9 Jun 2009 10:15:23 +0700 Subject: [Linux-cluster] Xen , Out of memory and dom0-min-mem, dom0_mem In-Reply-To: <4A2D0F39.7050308@lexum.umontreal.ca> References: <4A2D0F39.7050308@lexum.umontreal.ca> Message-ID: <7207d96f0906082015r4e79325j22dc31640e91637d@mail.gmail.com> On Mon, Jun 8, 2009 at 8:16 PM, FM wrote: > Hello > 2 of 3 of my xen nodes ( dom0) died tonight because of out of memory error. > These servers are only running Xen and cluster suite packages. > I googled the error and everything point out to dom0-min-mem and grub > dom0_mem options > > dom0-min-mem : I left it by default : (dom0-min-mem 256) > > I read about on the internet but still do not understand this parameter and > if if should be =0 on servers > > any advice ? 
This might be more suitable on xen-users lists. Anyway, what does xm list show? e.g. how much memory does dom0 currently use? from my experince 256MB (give or take a few) is the bare minimum for RHEL/Centos5 Xen dom0 with phy:/. If you want to use tap:aio, you need to add more. If you want to run other services (snmpd, httpd, cluster) you need to have more. For your usage, my guess is you should start with at least 1GB for dom0, and monitor its usage. -- Fajar From swhiteho at redhat.com Tue Jun 9 07:57:49 2009 From: swhiteho at redhat.com (Steven Whitehouse) Date: Tue, 09 Jun 2009 08:57:49 +0100 Subject: [Linux-cluster] Re: Still having GFS2 mount hang In-Reply-To: <89c397150906081537i661beb55ke5f0284909f9e71f@mail.gmail.com> References: <89c397150906081537i661beb55ke5f0284909f9e71f@mail.gmail.com> Message-ID: <1244534269.29604.789.camel@localhost.localdomain> Hi, On Mon, 2009-06-08 at 18:37 -0400, William A. (Andy) Adamson wrote: > Hello > > I'm still not able to mount GFS2 on both of my 2 node cluster nodes at > once. Please! Any help is welcome... > > Setup: 2 Fedora 10 VM sharing disk over AOE from third vm machine. > 2.6.30-rc7 kernel with latest Fedora 10 rpm updates. The cluster.conf > is attached. > > I gdb'ed mount.gfs2 on the 2nd node - it hangs trying to read in gfsc_fs_result. > > -->Andy > > gfs2_controld -D output on first node: mount /gfs2. > > 1244499915 client connection 6 fd 17 > 1244499915 join: /gfs2 gfs2 lock_dlm androsGFS2:ClusterFS rw,noauto > /dev/etherd/e3.2p1 > 1244499915 ClusterFS join: cluster name matches: androsGFS2 > 1244499915 ClusterFS process_dlmcontrol register nodeid 0 result 0 > 1244499915 ClusterFS add_change cg 1 joined nodeid 2 > 1244499915 ClusterFS add_change cg 1 we joined > 1244499915 ClusterFS add_change cg 1 counts member 1 joined 1 remove 0 failed 0 > 1244499915 ClusterFS wait_conditions skip for zero started_count > 1244499915 ClusterFS send_start cg 1 id_count 1 om 0 nm 1 oj 0 nj 0 > 1244499915 ClusterFS receive_start 2:1 len 92 > 1244499915 ClusterFS match_change 2:1 matches cg 1 > 1244499915 ClusterFS wait_messages cg 1 got all 1 > 1244499915 ClusterFS pick_first_recovery_master low 2 old 0 > 1244499915 ClusterFS sync_state all_nodes_new first_recovery_needed master 2 > 1244499915 ClusterFS create_old_nodes all new > 1244499915 ClusterFS create_new_nodes 2 ro 0 spect 0 > 1244499915 ClusterFS create_failed_journals all new > 1244499915 ClusterFS create_new_journals 2 gets jid 0 > 1244499915 ClusterFS apply_recovery first start_kernel > 1244499915 ClusterFS start_kernel cg 1 member_count 1 > 1244499915 ClusterFS set > /sys/fs/gfs2/androsGFS2:ClusterFS/lock_module/block to 0 > 1244499915 ClusterFS set open > /sys/fs/gfs2/androsGFS2:ClusterFS/lock_module/block error -1 2 This is returning -ENOENT. Do you have sysfs mounted somewhere strange? 
> 1244499915 ClusterFS client_reply_join_full ci 6 result 0 > hostdata=jid=0:id=1562653156:first=1 > 1244499915 client_reply_join ClusterFS ci 6 result 0 > 1244499915 uevent: add@/fs/gfs2/androsGFS2:ClusterFS > 1244499915 kernel: add@ androsGFS2:ClusterFS > 1244499915 uevent: change@/fs/gfs2/androsGFS2:ClusterFS > 1244499915 kernel: change@ androsGFS2:ClusterFS > 1244499915 uevent: change@/fs/gfs2/androsGFS2:ClusterFS > 1244499915 kernel: change@ androsGFS2:ClusterFS > 1244499915 uevent: change@/fs/gfs2/androsGFS2:ClusterFS > 1244499915 kernel: change@ androsGFS2:ClusterFS > 1244499915 mount_done: ClusterFS result 0 > 1244499915 connection 6 read error -1 I'm not sure if this is "normal" or not, but it may well point towards what is going wrong here, Steve. From teigland at redhat.com Tue Jun 9 14:01:12 2009 From: teigland at redhat.com (David Teigland) Date: Tue, 9 Jun 2009 09:01:12 -0500 Subject: [Linux-cluster] Re: Still having GFS2 mount hang In-Reply-To: <1244534269.29604.789.camel@localhost.localdomain> References: <89c397150906081537i661beb55ke5f0284909f9e71f@mail.gmail.com> <1244534269.29604.789.camel@localhost.localdomain> Message-ID: <20090609140112.GB13914@redhat.com> On Tue, Jun 09, 2009 at 08:57:49AM +0100, Steven Whitehouse wrote: > > I gdb'ed mount.gfs2 on the 2nd node - it hangs trying to read in > > gfsc_fs_result. That's an unusual problem, are mount.gfs2 and gfs_controld from the same release? nothing in /var/log/messages? is selinux turned off? > > gfs2_controld -D output on first node: mount /gfs2. > > > > 1244499915 client connection 6 fd 17 > > 1244499915 join: /gfs2 gfs2 lock_dlm androsGFS2:ClusterFS rw,noauto > > /dev/etherd/e3.2p1 > > 1244499915 ClusterFS join: cluster name matches: androsGFS2 > > 1244499915 ClusterFS process_dlmcontrol register nodeid 0 result 0 > > 1244499915 ClusterFS add_change cg 1 joined nodeid 2 > > 1244499915 ClusterFS add_change cg 1 we joined > > 1244499915 ClusterFS add_change cg 1 counts member 1 joined 1 remove 0 failed 0 > > 1244499915 ClusterFS wait_conditions skip for zero started_count > > 1244499915 ClusterFS send_start cg 1 id_count 1 om 0 nm 1 oj 0 nj 0 > > 1244499915 ClusterFS receive_start 2:1 len 92 > > 1244499915 ClusterFS match_change 2:1 matches cg 1 > > 1244499915 ClusterFS wait_messages cg 1 got all 1 > > 1244499915 ClusterFS pick_first_recovery_master low 2 old 0 > > 1244499915 ClusterFS sync_state all_nodes_new first_recovery_needed master 2 > > 1244499915 ClusterFS create_old_nodes all new > > 1244499915 ClusterFS create_new_nodes 2 ro 0 spect 0 > > 1244499915 ClusterFS create_failed_journals all new > > 1244499915 ClusterFS create_new_journals 2 gets jid 0 > > 1244499915 ClusterFS apply_recovery first start_kernel > > 1244499915 ClusterFS start_kernel cg 1 member_count 1 > > 1244499915 ClusterFS set > > /sys/fs/gfs2/androsGFS2:ClusterFS/lock_module/block to 0 > > 1244499915 ClusterFS set open > > /sys/fs/gfs2/androsGFS2:ClusterFS/lock_module/block error -1 2 > This is returning -ENOENT. Do you have sysfs mounted somewhere strange? 
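Those three checks are quick to run on the hanging node; the package names below are the usual Fedora ones, adjust if the tools were built from source:

    rpm -q gfs2-utils cman openais       # mount.gfs2 and gfs_controld should come from matching builds
    getenforce                           # Enforcing / Permissive / Disabled
    tail -n 50 /var/log/messages         # look for gfs_controld or dlm_controld complaints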
that's normal > > > 1244499915 ClusterFS client_reply_join_full ci 6 result 0 > > hostdata=jid=0:id=1562653156:first=1 > > 1244499915 client_reply_join ClusterFS ci 6 result 0 > > 1244499915 uevent: add@/fs/gfs2/androsGFS2:ClusterFS > > 1244499915 kernel: add@ androsGFS2:ClusterFS > > 1244499915 uevent: change@/fs/gfs2/androsGFS2:ClusterFS > > 1244499915 kernel: change@ androsGFS2:ClusterFS > > 1244499915 uevent: change@/fs/gfs2/androsGFS2:ClusterFS > > 1244499915 kernel: change@ androsGFS2:ClusterFS > > 1244499915 uevent: change@/fs/gfs2/androsGFS2:ClusterFS > > 1244499915 kernel: change@ androsGFS2:ClusterFS > > 1244499915 mount_done: ClusterFS result 0 > > 1244499915 connection 6 read error -1 > I'm not sure if this is "normal" or not, but it may well point towards > what is going wrong here, this is all correct Dave From teigland at redhat.com Tue Jun 9 19:36:17 2009 From: teigland at redhat.com (David Teigland) Date: Tue, 9 Jun 2009 14:36:17 -0500 Subject: [Linux-cluster] Re: Still having GFS2 mount hang In-Reply-To: <89c397150906091214k377d7213ud93ce341c3cb1167@mail.gmail.com> References: <89c397150906081537i661beb55ke5f0284909f9e71f@mail.gmail.com> <1244534269.29604.789.camel@localhost.localdomain> <20090609140112.GB13914@redhat.com> <89c397150906091214k377d7213ud93ce341c3cb1167@mail.gmail.com> Message-ID: <20090609193616.GA22800@redhat.com> On Tue, Jun 09, 2009 at 03:14:09PM -0400, William A. (Andy) Adamson wrote: > Hi David > > Thanks for looking at this. The kernel does report a recursive lock that's harmless > issue when running /etc/init.d/cman. Details inline. I can't see anything wrong, I'm going to check whether we have or can get some more recent packages, since 2.99.12 is a bit old, it looks like you're on fedora 10? Dave From gordan at bobich.net Tue Jun 9 21:13:25 2009 From: gordan at bobich.net (Gordan Bobic) Date: Tue, 09 Jun 2009 22:13:25 +0100 Subject: [Linux-cluster] Prototype Fencing Agent for Raritan eRIC G4 Message-ID: <4A2ED075.5020207@bobich.net> As the subject line says. The agent is attached. As all currently included fencing agents, this one is also written in Perl, and has the same requirements and dependencies as the DRAC fencing agent (Net::Telnet, Getopt::Std). What does it take to get it included in the distro? ;) Many thanks. Gordan -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: fence_eric URL: From gordan at bobich.net Tue Jun 9 21:24:56 2009 From: gordan at bobich.net (Gordan Bobic) Date: Tue, 09 Jun 2009 22:24:56 +0100 Subject: [Linux-cluster] Redhat Lists Question Message-ID: <4A2ED328.4010906@bobich.net> Sorry, not related to clustering, but can anybody point me at the best Redhat list to post suggested patches to? I just wrote a (RHEL5 specific) patch aimed at laptops with (cheap) SSDs that aims to reduce the number of disk writes and prolong flash life. I looked at the list of lists here: http://www.redhat.com/mailman/listinfo and it looks like there could be several relevant ones, but I'm not sure which ones are deprecated and no longer used. Can anybody please advise? Many thanks. 
Gordan From tom at netspot.com.au Wed Jun 10 01:27:10 2009 From: tom at netspot.com.au (Tom Lanyon) Date: Wed, 10 Jun 2009 10:57:10 +0930 Subject: [Linux-cluster] Networking guidelines for RHCS across datacenters In-Reply-To: <29ae894c0906050222naf68973r7a68b3c52eeeead@mail.gmail.com> References: <7207d96f0906021906mb731687j6964f4a3466b626e@mail.gmail.com> <29ae894c0906050222naf68973r7a68b3c52eeeead@mail.gmail.com> Message-ID: On 05/06/2009, at 6:52 PM, brem belguebli wrote: > Hello, > > That sounds pretty much to the question I've asked to this mailing- > list last May (https://www.redhat.com/archives/linux-cluster/2009-May/msg00093.html > ). > > We are in the same setup, already doing "Geo-cluster" with other > technos and we are looking at RHCS to provide us the same service > level. > > Latency could be a problem indeed if too high , but in a lot of > cases (many companies for which I've worked), datacenters are a few > tens of kilometers far, with a latency max close to 1 ms, which is > not a problem. > > Let's consider this kind of setup, 2 datacenters far from each other > by 1 ms delay, each hosting a SAN array, each of them connected to 2 > SAN fabrics extended between the 2 sites. > > What reason would prevent us from building Geo-clusters without > having to rely on a database replication mechanism, as the setup I > would like to implement would also be used to provide NFS services > that are disaster recovery proof. > > Obviously, such setup should rely on LVM mirroring to allow a node > hosting a service to be able to write to both local and distant SAN > LUN's. > > Brem I have been wondering whether the same could be done (cross-site RHCS) using SAN replication and multipath, avoiding LVM mirroring. This is going to depend strongly on the storage replication failover time; if the IO to shared storage devices is queued for too long, the cluster will stop. Does anyone have any experience with how quick this would need to happen for RHCS to tolerate it? I have been meaning to test this but have not had a chance... Tom From tom at netspot.com.au Wed Jun 10 01:29:51 2009 From: tom at netspot.com.au (Tom Lanyon) Date: Wed, 10 Jun 2009 10:59:51 +0930 Subject: [Linux-cluster] System load at 1.00 for gfs2? In-Reply-To: <1242655685.29604.345.camel@localhost.localdomain> References: <20090513173511.GA5992@esri.com> <8a5668960905180135p118312bfj6625f8513f477674@mail.gmail.com> <20090518140201.GA7429@esri.com> <1242655685.29604.345.camel@localhost.localdomain> Message-ID: <13E5ADD5-B0C6-4339-8D86-5E46DA37B6A6@netspot.com.au> On 18/05/2009, at 11:38 PM, Steven Whitehouse wrote: > The fix has gone in to RHEL 5.4. I have a feeling that it might also > go > into 5.3.z but I'm not 100% sure what the timescales are there. The > bug > is known and fixed in upstream too. > > It isn't actually using any more CPU, its just that the LA is > incremented by 1. So a fix is already on its way, > > Steve. Great, we experience this bug too. It doesn't cause any problems but confuses some of the administrators... :) Tom From sghosh at redhat.com Wed Jun 10 01:37:52 2009 From: sghosh at redhat.com (Subhendu Ghosh) Date: Tue, 09 Jun 2009 21:37:52 -0400 Subject: [Linux-cluster] Redhat Lists Question In-Reply-To: <4A2ED328.4010906@bobich.net> References: <4A2ED328.4010906@bobich.net> Message-ID: <4A2F0E70.3060407@redhat.com> Gordan Bobic wrote: > Sorry, not related to clustering, but can anybody point me at the best > Redhat list to post suggested patches to? 
I just wrote a (RHEL5 > specific) patch aimed at laptops with (cheap) SSDs that aims to reduce > the number of disk writes and prolong flash life. I looked at the list > of lists here: > > http://www.redhat.com/mailman/listinfo > > and it looks like there could be several relevant ones, but I'm not sure > which ones are deprecated and no longer used. Can anybody please advise? > > Many thanks. > > Gordan > Ideally, you want to post the patch to the upstream component. Posting it to the Red Hat Bugzilla under the approriate component would also help. http://bugzilla.redhat.com For RHEL5, the discussion list is: https://www.redhat.com/mailman/listinfo/rhelv5-list SSDs typically work under SATA chipsets - so libata http://ata.wiki.kernel.org/index.php/Main_Page -regards Subhendu -- Subhendu Ghosh Red Hat From gordan at bobich.net Wed Jun 10 08:27:49 2009 From: gordan at bobich.net (Gordan Bobic) Date: Wed, 10 Jun 2009 09:27:49 +0100 Subject: [Linux-cluster] Redhat Lists Question In-Reply-To: <4A2F0E70.3060407@redhat.com> References: <4A2ED328.4010906@bobich.net> <4A2F0E70.3060407@redhat.com> Message-ID: <4A2F6E85.1090601@bobich.net> Subhendu Ghosh wrote: > Ideally, you want to post the patch to the upstream component. Thanks for responding. It's mostly an initscript patch that checks if for file systems mounted on tmpfs (e.g. if we put /var/lock, /var/run or similar there to save hitting the disk) and saves and restores subtree structure (if changed) at shutdown and startup. It saves about 100-200 writes on startup/shutdown. I thought init scripts are pretty distro specific, and since this is mostly an init script patch... > Posting it to the Red Hat Bugzilla under the approriate component would also help. > http://bugzilla.redhat.com It's not a bug fix, it's a feature addition. > For RHEL5, the discussion list is: > https://www.redhat.com/mailman/listinfo/rhelv5-list OK, I'll post there. Thanks. > SSDs typically work under SATA chipsets - so libata > http://ata.wiki.kernel.org/index.php/Main_Page This patch is not that low a level. :) Gordan From rajpurush at gmail.com Wed Jun 10 08:41:21 2009 From: rajpurush at gmail.com (Rajeev P) Date: Wed, 10 Jun 2009 14:11:21 +0530 Subject: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 Message-ID: <7a271b290906100141o296ff97w72b97fe2de70b0f4@mail.gmail.com> I wanted to know if fence_scsi is supported in a multipath environment for RHEL5.3 release. In earlier releases of RHEL5 fence_scsi was not supported in a multipath environment for RHEL5.3 release. If I am not wrong, this was because the DM-MPIO driver forwarded the registration/unregistration commands on only on one of the physical paths of a LUN. Ideally it should have passed the commands on all physical paths. For RHEL5.3, is this issue resolved so that I can fence_scsi in multipath environment. Thanks in advance. Rajeev -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom at netspot.com.au Wed Jun 10 10:50:17 2009 From: tom at netspot.com.au (Tom Lanyon) Date: Wed, 10 Jun 2009 20:20:17 +0930 Subject: [Linux-cluster] Redhat Lists Question In-Reply-To: <4A2F6E85.1090601@bobich.net> References: <4A2ED328.4010906@bobich.net> <4A2F0E70.3060407@redhat.com> <4A2F6E85.1090601@bobich.net> Message-ID: On 10/06/2009, at 5:57 PM, Gordan Bobic wrote: > Subhendu Ghosh wrote: >> Posting it to the Red Hat Bugzilla under the approriate component >> would also help. >> http://bugzilla.redhat.com > > It's not a bug fix, it's a feature addition. 
The bugzilla is for "defects" which, along with bugs, includes requests for enhancements, etc. A quick search of the bugzilla returns an existing bug which seems to be in line with your requirements: Bug 223722 - RFE: add functionality to persist temporary state back to original location https://bugzilla.redhat.com/show_bug.cgi?id=223722 Cheers, Tom From macbogucki at gmail.com Wed Jun 10 11:16:35 2009 From: macbogucki at gmail.com (Maciej Bogucki) Date: Wed, 10 Jun 2009 13:16:35 +0200 Subject: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 In-Reply-To: <7a271b290906100141o296ff97w72b97fe2de70b0f4@mail.gmail.com> References: <7a271b290906100141o296ff97w72b97fe2de70b0f4@mail.gmail.com> Message-ID: <4A2F9613.8040208@gmail.com> Rajeev P pisze: > I wanted to know if fence_scsi is supported in a multipath environment > for RHEL5.3 release. > > In earlier releases of RHEL5 fence_scsi was not supported in a > multipath environment for RHEL5.3 release. If I am not wrong, this was > because the DM-MPIO driver forwarded the registration/unregistration > commands on only on one of the physical paths of a LUN. Ideally it > should have passed the commands on all physical paths. > > For RHEL5.3, is this issue resolved so that I can fence_scsi in > multipath environment. > Hello, I don't think it's supported [1] [1] - https://www.redhat.com/archives/rhelv5-list/2009-January/msg00092.html Best Regards Maciej Bogucki From gordan at bobich.net Wed Jun 10 12:07:28 2009 From: gordan at bobich.net (Gordan Bobic) Date: Wed, 10 Jun 2009 13:07:28 +0100 Subject: [Linux-cluster] Redhat Lists Question In-Reply-To: References: <4A2ED328.4010906@bobich.net> <4A2F0E70.3060407@redhat.com> <4A2F6E85.1090601@bobich.net> Message-ID: <4A2FA200.5030500@bobich.net> Tom Lanyon wrote: >>> Posting it to the Red Hat Bugzilla under the approriate component >>> would also help. >>> http://bugzilla.redhat.com >> >> It's not a bug fix, it's a feature addition. > > The bugzilla is for "defects" which, along with bugs, includes requests > for enhancements, etc. > > A quick search of the bugzilla returns an existing bug which seems to be > in line with your requirements: > Bug 223722 - RFE: add functionality to persist temporary state back > to original location > https://bugzilla.redhat.com/show_bug.cgi?id=223722 Thanks for that, really appreciated. Gordan From alfredo.moralejo at roche.com Wed Jun 10 12:35:17 2009 From: alfredo.moralejo at roche.com (Moralejo, Alfredo) Date: Wed, 10 Jun 2009 14:35:17 +0200 Subject: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 In-Reply-To: <4A2F9613.8040208@gmail.com> References: <7a271b290906100141o296ff97w72b97fe2de70b0f4@mail.gmail.com> <4A2F9613.8040208@gmail.com> Message-ID: <18106F5AEC2A20499826B0BE8D04F0411DE1D04E@rbamsem701.emea.roche.com> As cluster wiki: http://sources.redhat.com/cluster/wiki/SCSI_FencingConfig "Multipath devices are currently only supported for RHEL 5.0 and later with the use of device-mapper-multipath." Additionally, I found in a HP document info about how to set up cluster. Acconding to that information it's supported with version 5.3: http://docs.hp.com/en/15689/Migrating_SGLX_cluster_to_RHCS_Cluster.pdf It's not a Red Hat document but they are partners so.... 
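For anyone trying SCSI fencing with dm-multipath, the wiki page above comes down to registering a key per node on the shared LUNs at boot and letting fence_scsi remove a failed node's key later. A typical sequence, assuming the scsi_reserve init script shipped with the RHEL5 cman package, would be:

    chkconfig scsi_reserve on              # register this node's key on the cluster LUNs at boot
    service scsi_reserve start
    sg_persist -i -k -d /dev/mapper/mpath0 # read keys back: every cluster node should show up once

The device name is illustrative; run the read-keys check against each clustered LUN.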
-----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Maciej Bogucki Sent: Wednesday, June 10, 2009 1:17 PM To: linux clustering Subject: Re: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 Rajeev P pisze: > I wanted to know if fence_scsi is supported in a multipath environment > for RHEL5.3 release. > > In earlier releases of RHEL5 fence_scsi was not supported in a > multipath environment for RHEL5.3 release. If I am not wrong, this was > because the DM-MPIO driver forwarded the registration/unregistration > commands on only on one of the physical paths of a LUN. Ideally it > should have passed the commands on all physical paths. > > For RHEL5.3, is this issue resolved so that I can fence_scsi in > multipath environment. > Hello, I don't think it's supported [1] [1] - https://www.redhat.com/archives/rhelv5-list/2009-January/msg00092.html Best Regards Maciej Bogucki -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From teigland at redhat.com Wed Jun 10 14:13:51 2009 From: teigland at redhat.com (David Teigland) Date: Wed, 10 Jun 2009 09:13:51 -0500 Subject: [Linux-cluster] Re: Still having GFS2 mount hang In-Reply-To: <89c397150906100633p5a0b15d9vab2da39107d362fc@mail.gmail.com> References: <89c397150906081537i661beb55ke5f0284909f9e71f@mail.gmail.com> <1244534269.29604.789.camel@localhost.localdomain> <20090609140112.GB13914@redhat.com> <89c397150906091214k377d7213ud93ce341c3cb1167@mail.gmail.com> <20090609193616.GA22800@redhat.com> <89c397150906100633p5a0b15d9vab2da39107d362fc@mail.gmail.com> Message-ID: <20090610141351.GA18341@redhat.com> On Wed, Jun 10, 2009 at 09:33:33AM -0400, William A. (Andy) Adamson wrote: > On Tue, Jun 9, 2009 at 3:36 PM, David Teigland wrote: > > On Tue, Jun 09, 2009 at 03:14:09PM -0400, William A. (Andy) Adamson wrote: > >> Hi David > >> > >> Thanks for looking at this. The kernel does report a recursive lock > > > > that's harmless > > > >> issue when running /etc/init.d/cman. Details inline. > > > > I can't see anything wrong, I'm going to check whether we have or can get > > some more recent packages, since 2.99.12 is a bit old, it looks like > > you're on fedora 10? > > yes. I could move to fedora 11. I did some checking, and unfortunately 2.99.12 is the newest version we've packaged for either f10 or f11. It has something to do with the corosync api's changing too rapidly, and the trouble with patching and rebuilding all the packages that depend on it because they are using various versions of the api... the hope is it will all be better when a stable corosync 1.0 release happens. In the mean time, Fabio was kind enough to make a set of srpms of all the latest versions, http://fabbione.fedorapeople.org/srpm/ I just built and installed corosync, openais and cluster srpms from there on my fedora 10 machine. Started the cluster and mounted gfs with the result. I limited what I built/installed to avoid some annoying dependencies, to rpmbuild --rebuild corosync rpm -Uhv corosync* rpmbuild --rebuild openais rpm -Uhv openais* rpmbuild --rebuild cluster rpm -Uhv cluster* rpm -Uhv gfs* rpm -Uhv --nodeps cman* Dave From fdinitto at redhat.com Wed Jun 10 16:59:44 2009 From: fdinitto at redhat.com (Fabio M. 
Di Nitto) Date: Wed, 10 Jun 2009 18:59:44 +0200 Subject: [Linux-cluster] Re: Still having GFS2 mount hang In-Reply-To: <20090610141351.GA18341@redhat.com> References: <89c397150906081537i661beb55ke5f0284909f9e71f@mail.gmail.com> <1244534269.29604.789.camel@localhost.localdomain> <20090609140112.GB13914@redhat.com> <89c397150906091214k377d7213ud93ce341c3cb1167@mail.gmail.com> <20090609193616.GA22800@redhat.com> <89c397150906100633p5a0b15d9vab2da39107d362fc@mail.gmail.com> <20090610141351.GA18341@redhat.com> Message-ID: <1244653184.3665.77.camel@cerberus.int.fabbione.net> On Wed, 2009-06-10 at 09:13 -0500, David Teigland wrote: > On Wed, Jun 10, 2009 at 09:33:33AM -0400, William A. (Andy) Adamson wrote: > > On Tue, Jun 9, 2009 at 3:36 PM, David Teigland wrote: > > > On Tue, Jun 09, 2009 at 03:14:09PM -0400, William A. (Andy) Adamson wrote: > > >> Hi David > > >> > > >> Thanks for looking at this. The kernel does report a recursive lock > > > > > > that's harmless > > > > > >> issue when running /etc/init.d/cman. Details inline. > > > > > > I can't see anything wrong, I'm going to check whether we have or can get > > > some more recent packages, since 2.99.12 is a bit old, it looks like > > > you're on fedora 10? > > > > yes. I could move to fedora 11. > > I did some checking, and unfortunately 2.99.12 is the newest version we've > packaged for either f10 or f11. It has something to do with the corosync > api's changing too rapidly, and the trouble with patching and rebuilding all > the packages that depend on it because they are using various versions of the > api... the hope is it will all be better when a stable corosync 1.0 release > happens. > > In the mean time, Fabio was kind enough to make a set of srpms of all the > latest versions, http://fabbione.fedorapeople.org/srpm/ I just built and > installed corosync, openais and cluster srpms from there on my fedora 10 > machine. Started the cluster and mounted gfs with the result. > > I limited what I built/installed to avoid some annoying dependencies, to > > rpmbuild --rebuild corosync > rpm -Uhv corosync* > rpmbuild --rebuild openais > rpm -Uhv openais* > rpmbuild --rebuild cluster > rpm -Uhv cluster* > rpm -Uhv gfs* > rpm -Uhv --nodeps cman* Just FYI, you can build fence-agents srpm from there after install clusterlib and then install full cman. If you don't need fence-agents, then use the --nodeps. Fabio From garromo at us.ibm.com Wed Jun 10 18:17:01 2009 From: garromo at us.ibm.com (Gary Romo) Date: Wed, 10 Jun 2009 12:17:01 -0600 Subject: [Linux-cluster] gfs_grow Message-ID: Can you increase GFS file systems on the fly, without unmounting or stopping processes? -Gary -------------- next part -------------- An HTML attachment was scrubbed... URL: From bmr at redhat.com Wed Jun 10 18:25:27 2009 From: bmr at redhat.com (Bryn M. Reeves) Date: Wed, 10 Jun 2009 19:25:27 +0100 Subject: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 In-Reply-To: <4A2F9613.8040208@gmail.com> References: <7a271b290906100141o296ff97w72b97fe2de70b0f4@mail.gmail.com> <4A2F9613.8040208@gmail.com> Message-ID: <1244658327.18101.287.camel@breeves.fab.redhat.com> On Wed, 2009-06-10 at 13:16 +0200, Maciej Bogucki wrote: > Rajeev P pisze: > > I wanted to know if fence_scsi is supported in a multipath environment > > for RHEL5.3 release. > > > > In earlier releases of RHEL5 fence_scsi was not supported in a > > multipath environment for RHEL5.3 release. 
If I am not wrong, this was > > because the DM-MPIO driver forwarded the registration/unregistration > > commands on only on one of the physical paths of a LUN. Ideally it > > should have passed the commands on all physical paths. > > > > For RHEL5.3, is this issue resolved so that I can fence_scsi in > > multipath environment. > > > Hello, > > I don't think it's supported [1] > > [1] - https://www.redhat.com/archives/rhelv5-list/2009-January/msg00092.html Doesn't mention it at all. Better to check the kernel ChangeLog: * Tue Oct 10 2006 Don Zickus [2.6.18-1.2725.el5] - kernel dm multipath: ioctl support (Alasdair Kergon) [207575] This was included in the RHEL5 GA kernel (2.6.18-8.el5) so the ioctl passthrough has been there all along in RHEL5. Unfortunately the bug that introduced the change is private, but the RHEL4 bug that it was cloned from is accessible: https://bugzilla.redhat.com/show_bug.cgi?id=168801 Regards, Bryn. From jumanjiman at gmail.com Wed Jun 10 18:17:56 2009 From: jumanjiman at gmail.com (Paul Morgan) Date: Wed, 10 Jun 2009 18:17:56 +0000 Subject: [Linux-cluster] gfs_grow In-Reply-To: References: Message-ID: <931927457-1244658008-cardhu_decombobulator_blackberry.rim.net-637071402-@bxe1110.bisx.prod.on.blackberry> Yes, assuming you have sufficient free extents. Just remember to add any needed journals first. -paul -----Original Message----- From: Gary Romo Date: Wed, 10 Jun 2009 12:17:01 To: Subject: [Linux-cluster] gfs_grow -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From sghosh at redhat.com Wed Jun 10 18:49:24 2009 From: sghosh at redhat.com (Subhendu Ghosh) Date: Wed, 10 Jun 2009 14:49:24 -0400 Subject: [Linux-cluster] Re: [Cluster-devel] Prototype Fencing Agent for Raritan eRIC G4 In-Reply-To: <4A2ED075.5020207@bobich.net> References: <4A2ED075.5020207@bobich.net> Message-ID: <4A300034.9050603@redhat.com> Gordan Bobic wrote: > As the subject line says. The agent is attached. > As all currently included fencing agents, this one is also written in > Perl, and has the same requirements and dependencies as the DRAC fencing > agent (Net::Telnet, Getopt::Std). > > What does it take to get it included in the distro? ;) > > Many thanks. > > Gordan > Hi Gordan Would it be possible to look at migrating this agent to SSH (more secure) or to SNMP (less screen scraping)? Look at fence_cisco as an example of snmp usage. Long term maintainability of screen scraping is an issue with firmware changes. Also it seems that card has IPMI support. If so, can use test with fence_ipmi? Would remove the need for yet-another-agent ;) -regards Subhendu -- Subhendu Ghosh Red Hat Email: sghosh at redhat.com From rohara at redhat.com Wed Jun 10 19:18:02 2009 From: rohara at redhat.com (Ryan O'Hara) Date: Wed, 10 Jun 2009 14:18:02 -0500 Subject: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 In-Reply-To: <7a271b290906100141o296ff97w72b97fe2de70b0f4@mail.gmail.com> References: <7a271b290906100141o296ff97w72b97fe2de70b0f4@mail.gmail.com> Message-ID: <20090610191802.GA12988@redhat.com> On Wed, Jun 10, 2009 at 02:11:21PM +0530, Rajeev P wrote: > I wanted to know if fence_scsi is supported in a multipath environment for > RHEL5.3 release. Yes, it is supported. > In earlier releases of RHEL5 fence_scsi was not supported in a multipath > environment for RHEL5.3 release. 
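On the earlier gfs_grow question that Paul answered: assuming the GFS volume sits on CLVM, a typical online grow looks like this (volume names, sizes and journal count are illustrative):

    lvextend -L +50G /dev/vg_cluster/lv_gfs    # grow the clustered logical volume
    gfs_jadd -j 2 /mnt/gfs                     # add journals first if more nodes will mount it
    gfs_grow /mnt/gfs                          # grow the mounted filesystem in place, no umount needed

Both gfs_jadd and gfs_grow operate on the mount point while the filesystem stays mounted.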
If I am not wrong, this was because the > DM-MPIO driver forwarded the registration/unregistration commands on only on > one of the physical paths of a LUN. Ideally it should have passed the > commands on all physical paths. > > For RHEL5.3, is this issue resolved so that I can fence_scsi in multipath > environment. > > Thanks in advance. > > Rajeev > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From gordan at bobich.net Wed Jun 10 19:24:45 2009 From: gordan at bobich.net (Gordan Bobic) Date: Wed, 10 Jun 2009 20:24:45 +0100 Subject: [Linux-cluster] Re: [Cluster-devel] Prototype Fencing Agent for Raritan eRIC G4 In-Reply-To: <4A300034.9050603@redhat.com> References: <4A2ED075.5020207@bobich.net> <4A300034.9050603@redhat.com> Message-ID: <4A30087D.3060901@bobich.net> Subhendu Ghosh wrote: > Would it be possible to look at migrating this agent to SSH (more secure) I started with the idea of doing it over ssh, but Net::SSH module seemed to be a lot less forgiving about the terminal quirkyness. I can have another go. There's also the issue of manual intervention being required to save the signatures (and where do the known hosts go?). > or to SNMP (less screen scraping)? Hmm, maybe. I haven't looked into the SNMP capability on the device, but it looks like it'll work, and probably be easier to do than SSH. > Look at fence_cisco as an example of snmp usage. Assuming they speak a compatible dialect, which may not be the case. I'll have a look. > Long term maintainability of screen scraping is an issue with firmware changes. Tell me about it. I submitted a patch for fence_drac a while back to address an issue that seems to have arisen from a firmware update inducted pattern match failure. Not only that, but I've discovered a bug on the latest eRIC G4 firmware - 04.02.00-7153 seems to have broken USB keyboard support (you'd think this was important on a remote console device!) and potentially some power button press dodgyness. The previous firmware, however - 04.02.00-6505, works OK. > Also it seems that card has IPMI support. If so, can use test with fence_ipmi? > Would remove the need for yet-another-agent ;) Sadly, my servers with these cards in them don't have IPMI support. The card only proxies it. The card supports direct power/reset button control in addition to IPMI, so this is what I'm using. But as you can see from the code, it operates only on the power on/off even for a reboot because the said servers also don't have a reset connector. I wrote this agent because I _needed_ it. :) But I'll look into the SNMP way of doing it, it sounds like it might be neater. I'll add it as an option since the telnet way is already written. What parameter should/can be used to specify such things, that is available from a cluster.conf reference? Thanks. Gordan From rvandolson at esri.com Wed Jun 10 20:54:43 2009 From: rvandolson at esri.com (Ray Van Dolson) Date: Wed, 10 Jun 2009 13:54:43 -0700 Subject: [Linux-cluster] GFS2 cluster and fencing Message-ID: <20090610205436.GA3215@esri.com> I'm setting up a simple 5 node "cluster" basically just for using a shared GFS2 filesystem between the nodes. I'm not really concerned about HA, I just want to be able to have all the nodes accessing the same block device (iSCSI) In my thinking this sets up a cluster where only one node need be up to have quorum, and manual fencing is done for each node. 
However, when I start up the first node in the cluster, the fencing daemon hangs complaining about not being able to fence the other nodes. I have to run fence_ack_manual -n for all the other nodes, then things start up fine. Is there a way to make the node just assume all the other nodes are fine and start up? Am I really running much risk of the GFS2 filesystem failing out? Thanks, Ray From cthulhucalling at gmail.com Wed Jun 10 21:21:41 2009 From: cthulhucalling at gmail.com (Ian Hayes) Date: Wed, 10 Jun 2009 14:21:41 -0700 Subject: [Linux-cluster] GFS2 cluster and fencing In-Reply-To: <20090610205436.GA3215@esri.com> References: <20090610205436.GA3215@esri.com> Message-ID: <36df569a0906101421k55aeb7ddofe316878cfba86d5@mail.gmail.com> Have you tried changing clean_start="0" to 1? On Wed, Jun 10, 2009 at 1:54 PM, Ray Van Dolson wrote: > I'm setting up a simple 5 node "cluster" basically just for using a > shared GFS2 filesystem between the nodes. > > I'm not really concerned about HA, I just want to be able to have all > the nodes accessing the same block device (iSCSI) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > In my thinking this sets up a cluster where only one node need be up to > have quorum, and manual fencing is done for each node. > > However, when I start up the first node in the cluster, the fencing > daemon hangs complaining about not being able to fence the other nodes. > I have to run fence_ack_manual -n for all the other nodes, > then things start up fine. > > Is there a way to make the node just assume all the other nodes are > fine and start up? Am I really running much risk of the GFS2 > filesystem failing out? > > Thanks, > Ray > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alfredo.moralejo at roche.com Wed Jun 10 21:33:14 2009 From: alfredo.moralejo at roche.com (Moralejo, Alfredo) Date: Wed, 10 Jun 2009 23:33:14 +0200 Subject: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 In-Reply-To: <20090610191802.GA12988@redhat.com> References: <7a271b290906100141o296ff97w72b97fe2de70b0f4@mail.gmail.com> <20090610191802.GA12988@redhat.com> Message-ID: Anyone is successfully using it? I'm testing it with a clariion storage frame on RHEL 5.3, and as soon as I enable scsi_reserve, multipath starts failing and a path goes good and bad in a loop and scsi fencing fails sometimes, should I configure in a specific way multipath.conf?: Jun 10 23:31:58 rmamseslab07 multipathd: mpath0: remaining active paths: 3 Jun 10 23:32:00 rmamseslab07 kernel: sd 1:0:0:1: reservation conflict Jun 10 23:32:00 rmamseslab07 kernel: sd 1:0:0:1: SCSI error: return code = 0x00000018 Jun 10 23:32:00 rmamseslab07 kernel: end_request: I/O error, dev sdm, sector 79 Jun 10 23:32:00 rmamseslab07 kernel: device-mapper: multipath: Failing path 8:192. 
Jun 10 23:32:00 rmamseslab07 multipathd: 8:192: mark as failed Jun 10 23:32:00 rmamseslab07 multipathd: mpathquorum: remaining active paths: 3 Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: add map (uevent) Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: devmap already registered Jun 10 23:32:00 rmamseslab07 multipathd: sdm: emc_clariion_checker: Path healthy Jun 10 23:32:00 rmamseslab07 multipathd: 8:192: reinstated Jun 10 23:32:00 rmamseslab07 multipathd: mpathquorum: remaining active paths: 4 Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: add map (uevent) Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: devmap already registered Jun 10 23:32:02 rmamseslab07 multipathd: sdl: emc_clariion_checker: Path healthy Jun 10 23:32:02 rmamseslab07 multipathd: 8:176: reinstated Jun 10 23:32:02 rmamseslab07 multipathd: mpath0: remaining active paths: 4 Jun 10 23:32:02 rmamseslab07 multipathd: dm-9: add map (uevent) Jun 10 23:32:02 rmamseslab07 multipathd: dm-9: devmap already registered Jun 10 23:32:03 rmamseslab07 kernel: sd 1:0:0:0: reservation conflict Jun 10 23:32:03 rmamseslab07 kernel: sd 1:0:0:0: SCSI error: return code = 0x00000018 Jun 10 23:32:03 rmamseslab07 kernel: end_request: I/O error, dev sdl, sector 25256 Jun 10 23:32:03 rmamseslab07 kernel: device-mapper: multipath: Failing path 8:176. Jun 10 23:32:03 rmamseslab07 multipathd: dm-9: add map (uevent) Jun 10 23:32:03 rmamseslab07 multipathd: dm-9: devmap already registered Jun 10 23:32:03 rmamseslab07 multipathd: 8:176: mark as failed Jun 10 23:32:03 rmamseslab07 multipathd: mpath0: remaining active paths: 3 -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ryan O'Hara Sent: Wednesday, June 10, 2009 9:18 PM To: linux clustering Subject: Re: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 On Wed, Jun 10, 2009 at 02:11:21PM +0530, Rajeev P wrote: > I wanted to know if fence_scsi is supported in a multipath environment for > RHEL5.3 release. Yes, it is supported. > In earlier releases of RHEL5 fence_scsi was not supported in a multipath > environment for RHEL5.3 release. If I am not wrong, this was because the > DM-MPIO driver forwarded the registration/unregistration commands on only on > one of the physical paths of a LUN. Ideally it should have passed the > commands on all physical paths. > > For RHEL5.3, is this issue resolved so that I can fence_scsi in multipath > environment. > > Thanks in advance. > > Rajeev > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From rvandolson at esri.com Wed Jun 10 21:39:02 2009 From: rvandolson at esri.com (Ray Van Dolson) Date: Wed, 10 Jun 2009 14:39:02 -0700 Subject: [Linux-cluster] GFS2 cluster and fencing In-Reply-To: <36df569a0906101421k55aeb7ddofe316878cfba86d5@mail.gmail.com> References: <20090610205436.GA3215@esri.com> <36df569a0906101421k55aeb7ddofe316878cfba86d5@mail.gmail.com> Message-ID: <20090610213902.GB4203@esri.com> On Wed, Jun 10, 2009 at 02:21:41PM -0700, Ian Hayes wrote: > Have you tried changing clean_start="0" to 1? Nope, will do. I misinterpreted the fenced(8) man page thinking that clean_start="0" was the way to do this. 
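For anyone finding this thread in the archives: clean_start is an attribute of the fence_daemon tag in cluster.conf, along these lines (the delay values are illustrative):

    <fence_daemon clean_start="1" post_join_delay="20" post_fail_delay="0"/>

Note that clean_start="1" tells fenced to skip startup fencing entirely, which is only safe if you can be sure no node is still accessing the filesystem unfenced when the cluster comes up.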
Thanks, Ray From brem.belguebli at gmail.com Wed Jun 10 21:45:15 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Wed, 10 Jun 2009 23:45:15 +0200 Subject: [Linux-cluster] Networking guidelines for RHCS across datacenters In-Reply-To: References: <7207d96f0906021906mb731687j6964f4a3466b626e@mail.gmail.com> <29ae894c0906050222naf68973r7a68b3c52eeeead@mail.gmail.com> Message-ID: <29ae894c0906101445k7077184bxacc6964eb790fde7@mail.gmail.com> Indeed, SAN replication could be another way to partially address this. To make it work, one should be able to add sort of external resource in the cluster monitoring the synchronization status between the source LUNs and the target ones, and by the way automatically invert the synchronization in case your resource or service fails over another node on the other site. This can be tricky and your SAN arrays must allow you to do this (HDS/HP command devices, etc...) IMHO, LVM mirror is the simplest way to achieve this if latency constraints are acceptable. When I say partially, there is always the quorum issue, as on a 4 nodes cluster, equally located on 2 sites, in case of a site failure, the 2 remaining nodes are not quorate. Brem 2009/6/10 Tom Lanyon > On 05/06/2009, at 6:52 PM, brem belguebli wrote: > > Hello, >> >> That sounds pretty much to the question I've asked to this mailing-list >> last May ( >> https://www.redhat.com/archives/linux-cluster/2009-May/msg00093.html). >> >> We are in the same setup, already doing "Geo-cluster" with other technos >> and we are looking at RHCS to provide us the same service level. >> >> Latency could be a problem indeed if too high , but in a lot of cases >> (many companies for which I've worked), datacenters are a few tens of >> kilometers far, with a latency max close to 1 ms, which is not a problem. >> >> Let's consider this kind of setup, 2 datacenters far from each other by 1 >> ms delay, each hosting a SAN array, each of them connected to 2 SAN fabrics >> extended between the 2 sites. >> >> What reason would prevent us from building Geo-clusters without having to >> rely on a database replication mechanism, as the setup I would like to >> implement would also be used to provide NFS services that are disaster >> recovery proof. >> >> Obviously, such setup should rely on LVM mirroring to allow a node hosting >> a service to be able to write to both local and distant SAN LUN's. >> >> Brem >> > > > I have been wondering whether the same could be done (cross-site RHCS) > using SAN replication and multipath, avoiding LVM mirroring. This is going > to depend strongly on the storage replication failover time; if the IO to > shared storage devices is queued for too long, the cluster will stop. Does > anyone have any experience with how quick this would need to happen for RHCS > to tolerate it? > > I have been meaning to test this but have not had a chance... > > Tom > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... 
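On the quorum point Brem raises above, one partial workaround is a quorum disk on storage visible to both sites (or a third site), so that a surviving half can stay quorate. A rough cluster.conf fragment with illustrative values only; check the qdisk man page before copying any of it:

    <cman expected_votes="7"/>
    <quorumd interval="1" tko="10" votes="3" label="site_qdisk">
      <heuristic program="ping -c1 -w1 10.0.0.254" score="1" interval="2"/>
    </quorumd>

Here four nodes at one vote each plus a three-vote qdisk give seven expected votes, so two nodes plus the qdisk keep quorum when one site disappears, while two nodes alone do not.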
URL: From rohara at redhat.com Wed Jun 10 22:11:08 2009 From: rohara at redhat.com (Ryan O'Hara) Date: Wed, 10 Jun 2009 17:11:08 -0500 Subject: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 In-Reply-To: References: <7a271b290906100141o296ff97w72b97fe2de70b0f4@mail.gmail.com> <20090610191802.GA12988@redhat.com> Message-ID: <20090610221108.GE12988@redhat.com> On Wed, Jun 10, 2009 at 11:33:14PM +0200, Moralejo, Alfredo wrote: > Anyone is successfully using it? What path checker are you using? I've heard that certain path checkers cause problems, but I honestly don't know enough about dm-multipath to understand the reason for this. I have successfully used it with RDAC. Ryan > I'm testing it with a clariion storage frame on RHEL 5.3, and as soon as I enable scsi_reserve, multipath starts failing and a path goes good and bad in a loop and scsi fencing fails sometimes, should I configure in a specific way multipath.conf?: > > Jun 10 23:31:58 rmamseslab07 multipathd: mpath0: remaining active paths: 3 > Jun 10 23:32:00 rmamseslab07 kernel: sd 1:0:0:1: reservation conflict > Jun 10 23:32:00 rmamseslab07 kernel: sd 1:0:0:1: SCSI error: return code = 0x00000018 > Jun 10 23:32:00 rmamseslab07 kernel: end_request: I/O error, dev sdm, sector 79 > Jun 10 23:32:00 rmamseslab07 kernel: device-mapper: multipath: Failing path 8:192. > Jun 10 23:32:00 rmamseslab07 multipathd: 8:192: mark as failed > Jun 10 23:32:00 rmamseslab07 multipathd: mpathquorum: remaining active paths: 3 > Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: add map (uevent) > Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: devmap already registered > Jun 10 23:32:00 rmamseslab07 multipathd: sdm: emc_clariion_checker: Path healthy > Jun 10 23:32:00 rmamseslab07 multipathd: 8:192: reinstated > Jun 10 23:32:00 rmamseslab07 multipathd: mpathquorum: remaining active paths: 4 > Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: add map (uevent) > Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: devmap already registered > Jun 10 23:32:02 rmamseslab07 multipathd: sdl: emc_clariion_checker: Path healthy > Jun 10 23:32:02 rmamseslab07 multipathd: 8:176: reinstated > Jun 10 23:32:02 rmamseslab07 multipathd: mpath0: remaining active paths: 4 > Jun 10 23:32:02 rmamseslab07 multipathd: dm-9: add map (uevent) > Jun 10 23:32:02 rmamseslab07 multipathd: dm-9: devmap already registered > Jun 10 23:32:03 rmamseslab07 kernel: sd 1:0:0:0: reservation conflict > Jun 10 23:32:03 rmamseslab07 kernel: sd 1:0:0:0: SCSI error: return code = 0x00000018 > Jun 10 23:32:03 rmamseslab07 kernel: end_request: I/O error, dev sdl, sector 25256 > Jun 10 23:32:03 rmamseslab07 kernel: device-mapper: multipath: Failing path 8:176. > Jun 10 23:32:03 rmamseslab07 multipathd: dm-9: add map (uevent) > Jun 10 23:32:03 rmamseslab07 multipathd: dm-9: devmap already registered > Jun 10 23:32:03 rmamseslab07 multipathd: 8:176: mark as failed > Jun 10 23:32:03 rmamseslab07 multipathd: mpath0: remaining active paths: 3 > > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ryan O'Hara > Sent: Wednesday, June 10, 2009 9:18 PM > To: linux clustering > Subject: Re: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 > > On Wed, Jun 10, 2009 at 02:11:21PM +0530, Rajeev P wrote: > > I wanted to know if fence_scsi is supported in a multipath environment for > > RHEL5.3 release. > > Yes, it is supported. 
> > > In earlier releases of RHEL5 fence_scsi was not supported in a multipath > > environment for RHEL5.3 release. If I am not wrong, this was because the > > DM-MPIO driver forwarded the registration/unregistration commands on only on > > one of the physical paths of a LUN. Ideally it should have passed the > > commands on all physical paths. > > > > For RHEL5.3, is this issue resolved so that I can fence_scsi in multipath > > environment. > > > > Thanks in advance. > > > > Rajeev > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From alfredo.moralejo at roche.com Wed Jun 10 22:29:02 2009 From: alfredo.moralejo at roche.com (Moralejo, Alfredo) Date: Thu, 11 Jun 2009 00:29:02 +0200 Subject: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 In-Reply-To: <20090610221108.GE12988@redhat.com> References: <7a271b290906100141o296ff97w72b97fe2de70b0f4@mail.gmail.com> <20090610191802.GA12988@redhat.com> <20090610221108.GE12988@redhat.com> Message-ID: I'm using the config provide by Red Hat by default: device { vendor "DGC" product ".*" product_blacklist "LUN_Z" getuid_callout "/sbin/scsi_id -g -u -s /block/%n" prio_callout "/sbin/mpath_prio_emc /dev/%n" features "1 queue_if_no_path" hardware_handler "1 emc" path_grouping_policy group_by_prio failback immediate rr_weight uniform no_path_retry 300 rr_min_io 1000 path_checker emc_clariion } -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ryan O'Hara Sent: Thursday, June 11, 2009 12:11 AM To: linux clustering Subject: Re: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 On Wed, Jun 10, 2009 at 11:33:14PM +0200, Moralejo, Alfredo wrote: > Anyone is successfully using it? What path checker are you using? I've heard that certain path checkers cause problems, but I honestly don't know enough about dm-multipath to understand the reason for this. I have successfully used it with RDAC. Ryan > I'm testing it with a clariion storage frame on RHEL 5.3, and as soon as I enable scsi_reserve, multipath starts failing and a path goes good and bad in a loop and scsi fencing fails sometimes, should I configure in a specific way multipath.conf?: > > Jun 10 23:31:58 rmamseslab07 multipathd: mpath0: remaining active paths: 3 > Jun 10 23:32:00 rmamseslab07 kernel: sd 1:0:0:1: reservation conflict > Jun 10 23:32:00 rmamseslab07 kernel: sd 1:0:0:1: SCSI error: return code = 0x00000018 > Jun 10 23:32:00 rmamseslab07 kernel: end_request: I/O error, dev sdm, sector 79 > Jun 10 23:32:00 rmamseslab07 kernel: device-mapper: multipath: Failing path 8:192. 
> Jun 10 23:32:00 rmamseslab07 multipathd: 8:192: mark as failed > Jun 10 23:32:00 rmamseslab07 multipathd: mpathquorum: remaining active paths: 3 > Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: add map (uevent) > Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: devmap already registered > Jun 10 23:32:00 rmamseslab07 multipathd: sdm: emc_clariion_checker: Path healthy > Jun 10 23:32:00 rmamseslab07 multipathd: 8:192: reinstated > Jun 10 23:32:00 rmamseslab07 multipathd: mpathquorum: remaining active paths: 4 > Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: add map (uevent) > Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: devmap already registered > Jun 10 23:32:02 rmamseslab07 multipathd: sdl: emc_clariion_checker: Path healthy > Jun 10 23:32:02 rmamseslab07 multipathd: 8:176: reinstated > Jun 10 23:32:02 rmamseslab07 multipathd: mpath0: remaining active paths: 4 > Jun 10 23:32:02 rmamseslab07 multipathd: dm-9: add map (uevent) > Jun 10 23:32:02 rmamseslab07 multipathd: dm-9: devmap already registered > Jun 10 23:32:03 rmamseslab07 kernel: sd 1:0:0:0: reservation conflict > Jun 10 23:32:03 rmamseslab07 kernel: sd 1:0:0:0: SCSI error: return code = 0x00000018 > Jun 10 23:32:03 rmamseslab07 kernel: end_request: I/O error, dev sdl, sector 25256 > Jun 10 23:32:03 rmamseslab07 kernel: device-mapper: multipath: Failing path 8:176. > Jun 10 23:32:03 rmamseslab07 multipathd: dm-9: add map (uevent) > Jun 10 23:32:03 rmamseslab07 multipathd: dm-9: devmap already registered > Jun 10 23:32:03 rmamseslab07 multipathd: 8:176: mark as failed > Jun 10 23:32:03 rmamseslab07 multipathd: mpath0: remaining active paths: 3 > > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ryan O'Hara > Sent: Wednesday, June 10, 2009 9:18 PM > To: linux clustering > Subject: Re: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 > > On Wed, Jun 10, 2009 at 02:11:21PM +0530, Rajeev P wrote: > > I wanted to know if fence_scsi is supported in a multipath environment for > > RHEL5.3 release. > > Yes, it is supported. > > > In earlier releases of RHEL5 fence_scsi was not supported in a multipath > > environment for RHEL5.3 release. If I am not wrong, this was because the > > DM-MPIO driver forwarded the registration/unregistration commands on only on > > one of the physical paths of a LUN. Ideally it should have passed the > > commands on all physical paths. > > > > For RHEL5.3, is this issue resolved so that I can fence_scsi in multipath > > environment. > > > > Thanks in advance. > > > > Rajeev > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From ml at eyes-works.com Thu Jun 11 01:40:17 2009 From: ml at eyes-works.com (Yasuhiro Fujii) Date: Thu, 11 Jun 2009 10:40:17 +0900 Subject: [Linux-cluster] cman_tool leave does not reduce expected votes. Message-ID: <20090611102146.7A77.45046F47@eyes-works.com> Hi. I'm testing 3nodes CentOS5.3 cluster. When 3 nodes joined and one node leaved from cluster,but expected votes did not reduce. 
So when 2 nodes leaved(cman_tool leave),only one node status chaneged to activity blocked. How to Activity blocked 3 nodes joined. This is normal. Version: 6.1.0 Config Version: 1 Cluster Name: cl Cluster Id: 28318 Cluster Member: Yes Cluster Generation: 292 Membership state: Cluster-Member Nodes: 3 Expected votes: 3 Total votes: 3 Quorum: 2 1 node cman_tool leave Version: 6.1.0 Config Version: 1 Cluster Name: cl Cluster Id: 28318 Cluster Member: Yes Cluster Generation: 296 Membership state: Cluster-Member Nodes: 2 Expected votes: 3 Total votes: 2 Quorum: 2 2 nodes cman_tool leave Version: 6.1.0 Config Version: 1 Cluster Name: cl Cluster Id: 28318 Cluster Member: Yes Cluster Generation: 300 Membership state: Cluster-Member Nodes: 1 Expected votes: 3 Total votes: 1 Quorum: 2 Activity blocked I tested cman_tool leave and cman_tool leave remove,but expected votes did no reduced. I think a node cman_tool leave is used, expected votes must be reduced avoiding to activity blocked. I know cman_tool expected -e 1 avoids this activity blocked,but cman_tool leave (remove) should reduce expected votes automatically. -- cman-2.0.98-1.el5_3.1 openais-0.80.3-22.el5_3.4 -- /etc/cluster/cluster.conf -- From amalik at intertechmedia.com Thu Jun 11 02:44:06 2009 From: amalik at intertechmedia.com (Atif Malik) Date: Thu, 11 Jun 2009 02:44:06 +0000 Subject: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 In-Reply-To: References: <7a271b290906100141o296ff97w72b97fe2de70b0f4@mail.gmail.com><20090610191802.GA12988@redhat.com> Message-ID: <932651848-1244688228-cardhu_decombobulator_blackberry.rim.net-1844332269-@bxe1136.bisx.prod.on.blackberry> P -----Original Message----- From: "Moralejo, Alfredo" Date: Wed, 10 Jun 2009 23:33:14 To: linux clustering Subject: RE: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 Anyone is successfully using it? I'm testing it with a clariion storage frame on RHEL 5.3, and as soon as I enable scsi_reserve, multipath starts failing and a path goes good and bad in a loop and scsi fencing fails sometimes, should I configure in a specific way multipath.conf?: Jun 10 23:31:58 rmamseslab07 multipathd: mpath0: remaining active paths: 3 Jun 10 23:32:00 rmamseslab07 kernel: sd 1:0:0:1: reservation conflict Jun 10 23:32:00 rmamseslab07 kernel: sd 1:0:0:1: SCSI error: return code = 0x00000018 Jun 10 23:32:00 rmamseslab07 kernel: end_request: I/O error, dev sdm, sector 79 Jun 10 23:32:00 rmamseslab07 kernel: device-mapper: multipath: Failing path 8:192. 
Jun 10 23:32:00 rmamseslab07 multipathd: 8:192: mark as failed Jun 10 23:32:00 rmamseslab07 multipathd: mpathquorum: remaining active paths: 3 Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: add map (uevent) Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: devmap already registered Jun 10 23:32:00 rmamseslab07 multipathd: sdm: emc_clariion_checker: Path healthy Jun 10 23:32:00 rmamseslab07 multipathd: 8:192: reinstated Jun 10 23:32:00 rmamseslab07 multipathd: mpathquorum: remaining active paths: 4 Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: add map (uevent) Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: devmap already registered Jun 10 23:32:02 rmamseslab07 multipathd: sdl: emc_clariion_checker: Path healthy Jun 10 23:32:02 rmamseslab07 multipathd: 8:176: reinstated Jun 10 23:32:02 rmamseslab07 multipathd: mpath0: remaining active paths: 4 Jun 10 23:32:02 rmamseslab07 multipathd: dm-9: add map (uevent) Jun 10 23:32:02 rmamseslab07 multipathd: dm-9: devmap already registered Jun 10 23:32:03 rmamseslab07 kernel: sd 1:0:0:0: reservation conflict Jun 10 23:32:03 rmamseslab07 kernel: sd 1:0:0:0: SCSI error: return code = 0x00000018 Jun 10 23:32:03 rmamseslab07 kernel: end_request: I/O error, dev sdl, sector 25256 Jun 10 23:32:03 rmamseslab07 kernel: device-mapper: multipath: Failing path 8:176. Jun 10 23:32:03 rmamseslab07 multipathd: dm-9: add map (uevent) Jun 10 23:32:03 rmamseslab07 multipathd: dm-9: devmap already registered Jun 10 23:32:03 rmamseslab07 multipathd: 8:176: mark as failed Jun 10 23:32:03 rmamseslab07 multipathd: mpath0: remaining active paths: 3 -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ryan O'Hara Sent: Wednesday, June 10, 2009 9:18 PM To: linux clustering Subject: Re: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 On Wed, Jun 10, 2009 at 02:11:21PM +0530, Rajeev P wrote: > I wanted to know if fence_scsi is supported in a multipath environment for > RHEL5.3 release. Yes, it is supported. > In earlier releases of RHEL5 fence_scsi was not supported in a multipath > environment for RHEL5.3 release. If I am not wrong, this was because the > DM-MPIO driver forwarded the registration/unregistration commands on only on > one of the physical paths of a LUN. Ideally it should have passed the > commands on all physical paths. > > For RHEL5.3, is this issue resolved so that I can fence_scsi in multipath > environment. > > Thanks in advance. > > Rajeev > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From carlopmart at gmail.com Thu Jun 11 08:44:28 2009 From: carlopmart at gmail.com (carlopmart) Date: Thu, 11 Jun 2009 10:44:28 +0200 Subject: [Linux-cluster] fence_vmware works on vsphere esxi?? Message-ID: <4A30C3EC.80706@gmail.com> Hi all, Sombebody have tried to use fence_vmware (on rhel5.x) on vsphere esxi?? works or not?? Thanks. -- CL Martinez carlopmart {at} gmail {d0t} com From ccaulfie at redhat.com Thu Jun 11 08:52:09 2009 From: ccaulfie at redhat.com (Chrissie Caulfield) Date: Thu, 11 Jun 2009 09:52:09 +0100 Subject: [Linux-cluster] cman_tool leave does not reduce expected votes. 
In-Reply-To: <20090611102146.7A77.45046F47@eyes-works.com> References: <20090611102146.7A77.45046F47@eyes-works.com> Message-ID: <4A30C5B9.1000300@redhat.com> Yasuhiro Fujii wrote: > Hi. > > I'm testing 3nodes CentOS5.3 cluster. > > When 3 nodes joined and one node leaved from cluster,but expected votes > did not reduce. > So when 2 nodes leaved(cman_tool leave),only one node status chaneged to > activity blocked. > Eek! You're right. I've raised a bugzilla report for this: https://bugzilla.redhat.com/show_bug.cgi?id=505258 Chrissie From jfriesse at redhat.com Thu Jun 11 09:31:55 2009 From: jfriesse at redhat.com (Jan Friesse) Date: Thu, 11 Jun 2009 11:31:55 +0200 Subject: [Linux-cluster] Re: [Cluster-devel] Prototype Fencing Agent for Raritan eRIC G4 In-Reply-To: <4A30087D.3060901@bobich.net> References: <4A2ED075.5020207@bobich.net> <4A300034.9050603@redhat.com> <4A30087D.3060901@bobich.net> Message-ID: <4A30CF0B.2030203@redhat.com> Gordan, Gordan Bobic wrote: > Subhendu Ghosh wrote: > >> Would it be possible to look at migrating this agent to SSH (more secure) > > I started with the idea of doing it over ssh, but Net::SSH module seemed > to be a lot less forgiving about the terminal quirkyness. I can have > another go. There's also the issue of manual intervention being required > to save the signatures (and where do the known hosts go?). > >> or to SNMP (less screen scraping)? > > Hmm, maybe. I haven't looked into the SNMP capability on the device, but > it looks like it'll work, and probably be easier to do than SSH. > >> Look at fence_cisco as an example of snmp usage. > > Assuming they speak a compatible dialect, which may not be the case. > I'll have a look. We are using fence agents library, which makes writing agents easier (capable of doing things like command line parsing, implement reboot operation, ...), shorter and easier to maintain. fence_cisco is good example (short, tested, ...) HOW to write such agent. Agents are written in Python, and we are migrating all agents on top of library. > >> Long term maintainability of screen scraping is an issue with firmware >> changes. > > Tell me about it. I submitted a patch for fence_drac a while back to > address an issue that seems to have arisen from a firmware update > inducted pattern match failure. > > Not only that, but I've discovered a bug on the latest eRIC G4 firmware > - 04.02.00-7153 seems to have broken USB keyboard support (you'd think > this was important on a remote console device!) and potentially some > power button press dodgyness. The previous firmware, however - > 04.02.00-6505, works OK. > >> Also it seems that card has IPMI support. If so, can use test with >> fence_ipmi? >> Would remove the need for yet-another-agent ;) > > Sadly, my servers with these cards in them don't have IPMI support. The > card only proxies it. The card supports direct power/reset button > control in addition to IPMI, so this is what I'm using. But as you can > see from the code, it operates only on the power on/off even for a > reboot because the said servers also don't have a reset connector. I > wrote this agent because I _needed_ it. :) > > But I'll look into the SNMP way of doing it, it sounds like it might be > neater. I'll add it as an option since the telnet way is already > written. What parameter should/can be used to specify such things, that > is available from a cluster.conf reference? This question answers you little look to fence_cisco agent (or you can use fence_ifmib, fence_intel_modular, fence_apc_snmp, ...). 
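Before writing an SNMP variant it is worth checking what the card actually exposes; I do not know Raritan's enterprise MIB, so the community strings and the set example below are placeholders only:

    snmpwalk -v2c -c public 10.0.0.10 .1.3.6.1.4.1           # walk the enterprises subtree, look for outlet/power tables
    snmpset -v2c -c private 10.0.0.10 OUTLET-STATE-OID i 0   # hypothetical: turn the outlet off, if such an object exists

If the walk turns up nothing usable, the telnet version stays the fallback.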
In case you will not understand something, please ask. > > Thanks. > > Gordan > Regards, Honza From viral_ahire at yahoo.com Thu Jun 11 10:05:05 2009 From: viral_ahire at yahoo.com (viral ahire) Date: Thu, 11 Jun 2009 15:35:05 +0530 (IST) Subject: [Linux-cluster] Re:Node Leave Cluster while Stopping Cluster Application (Oracle) Message-ID: <594531.32475.qm@web94716.mail.in2.yahoo.com> Still there is no replay from geniuses....... ? Please help for me for this problem ------------------- Regards, VIRAL .D. AHIRE (Mobile- +91 9724507304) Explore and discover exciting holidays and getaways with Yahoo! India Travel http://in.travel.yahoo.com/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From ml at eyes-works.com Thu Jun 11 11:23:55 2009 From: ml at eyes-works.com (Yasuhiro Fujii) Date: Thu, 11 Jun 2009 20:23:55 +0900 Subject: [Linux-cluster] cman_tool leave does not reduce expected votes. In-Reply-To: <4A30C5B9.1000300@redhat.com> References: <20090611102146.7A77.45046F47@eyes-works.com> <4A30C5B9.1000300@redhat.com> Message-ID: <20090611202255.6E31.45046F47@eyes-works.com> Dear Chrissie. Thank you for your reply and reporting redhat bugzilla. I'll check redhat bugzilla,too. On Thu, 11 Jun 2009 09:52:09 +0100 Chrissie Caulfield wrote: > Yasuhiro Fujii wrote: > > Hi. > > > > I'm testing 3nodes CentOS5.3 cluster. > > > > When 3 nodes joined and one node leaved from cluster,but expected votes > > did not reduce. > > So when 2 nodes leaved(cman_tool leave),only one node status chaneged to > > activity blocked. > > > > > Eek! You're right. > > I've raised a bugzilla report for this: > > https://bugzilla.redhat.com/show_bug.cgi?id=505258 > > > Chrissie > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Yasuhiro Fujii From info at lizardkings.nl Thu Jun 11 17:39:18 2009 From: info at lizardkings.nl (LizardKings) Date: Thu, 11 Jun 2009 19:39:18 +0200 Subject: [Linux-cluster] get cluster nodes via XML-RPC Message-ID: <4A314146.4050107@lizardkings.nl> Hi, Is it possible to receive a list of cluster nodes via XML-RPC to one of the ricci's. DG From fdinitto at redhat.com Thu Jun 11 22:03:35 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Fri, 12 Jun 2009 00:03:35 +0200 Subject: [Linux-cluster] Re: Still having GFS2 mount hang In-Reply-To: <89c397150906111208n78c222fhdf3a57e5dbbe9f50@mail.gmail.com> References: <89c397150906081537i661beb55ke5f0284909f9e71f@mail.gmail.com> <1244534269.29604.789.camel@localhost.localdomain> <20090609140112.GB13914@redhat.com> <89c397150906091214k377d7213ud93ce341c3cb1167@mail.gmail.com> <20090609193616.GA22800@redhat.com> <89c397150906100633p5a0b15d9vab2da39107d362fc@mail.gmail.com> <20090610141351.GA18341@redhat.com> <1244653184.3665.77.camel@cerberus.int.fabbione.net> <89c397150906111208n78c222fhdf3a57e5dbbe9f50@mail.gmail.com> Message-ID: <1244757816.3665.109.camel@cerberus.int.fabbione.net> On Thu, 2009-06-11 at 15:08 -0400, William A. (Andy) Adamson wrote: > On Wed, Jun 10, 2009 at 12:59 PM, Fabio M. Di Nitto wrote: > > On Wed, 2009-06-10 at 09:13 -0500, David Teigland wrote: > >> On Wed, Jun 10, 2009 at 09:33:33AM -0400, William A. (Andy) Adamson wrote: > >> > On Tue, Jun 9, 2009 at 3:36 PM, David Teigland wrote: > >> > > On Tue, Jun 09, 2009 at 03:14:09PM -0400, William A. (Andy) Adamson wrote: > >> > >> Hi David > >> > >> > >> > >> Thanks for looking at this. 
The kernel does report a recursive lock > >> > > > >> > > that's harmless > >> > > > >> > >> issue when running /etc/init.d/cman. Details inline. > >> > > > >> > > I can't see anything wrong, I'm going to check whether we have or can get > >> > > some more recent packages, since 2.99.12 is a bit old, it looks like > >> > > you're on fedora 10? > >> > > >> > yes. I could move to fedora 11. > >> > >> I did some checking, and unfortunately 2.99.12 is the newest version we've > >> packaged for either f10 or f11. It has something to do with the corosync > >> api's changing too rapidly, and the trouble with patching and rebuilding all > >> the packages that depend on it because they are using various versions of the > >> api... the hope is it will all be better when a stable corosync 1.0 release > >> happens. > >> > >> In the mean time, Fabio was kind enough to make a set of srpms of all the > >> latest versions, http://fabbione.fedorapeople.org/srpm/ I just built and > >> installed corosync, openais and cluster srpms from there on my fedora 10 > >> machine. Started the cluster and mounted gfs with the result. > >> > >> I limited what I built/installed to avoid some annoying dependencies, to > >> > >> rpmbuild --rebuild corosync > >> rpm -Uhv corosync* > >> rpmbuild --rebuild openais > >> rpm -Uhv openais* > >> rpmbuild --rebuild cluster > >> rpm -Uhv cluster* > >> rpm -Uhv gfs* > >> rpm -Uhv --nodeps cman* > > > > Just FYI, you can build fence-agents srpm from there after install > > clusterlib and then install full cman. > > > Cool. Thanks! I'll try the new rpm's at the NFSv4.1 bakeathon next > week. I'm able to pass the basic connectathon tests with my current > gfs2 setup and the new reworked pnfs server code which can export an > unmodified gfs2 file system. I'll push some more updated srpm tomorrow. I found a couple of issues with the current ones that could be problematic. i'll send you an email with the versions to use. Fabio From fdinitto at redhat.com Fri Jun 12 06:40:10 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Fri, 12 Jun 2009 08:40:10 +0200 Subject: [Linux-cluster] Re: Still having GFS2 mount hang In-Reply-To: <89c397150906111208n78c222fhdf3a57e5dbbe9f50@mail.gmail.com> References: <89c397150906081537i661beb55ke5f0284909f9e71f@mail.gmail.com> <1244534269.29604.789.camel@localhost.localdomain> <20090609140112.GB13914@redhat.com> <89c397150906091214k377d7213ud93ce341c3cb1167@mail.gmail.com> <20090609193616.GA22800@redhat.com> <89c397150906100633p5a0b15d9vab2da39107d362fc@mail.gmail.com> <20090610141351.GA18341@redhat.com> <1244653184.3665.77.camel@cerberus.int.fabbione.net> <89c397150906111208n78c222fhdf3a57e5dbbe9f50@mail.gmail.com> Message-ID: <1244788810.3665.112.camel@cerberus.int.fabbione.net> On Thu, 2009-06-11 at 15:08 -0400, William A. (Andy) Adamson wrote: > On Wed, Jun 10, 2009 at 12:59 PM, Fabio M. Di Nitto wrote: > > On Wed, 2009-06-10 at 09:13 -0500, David Teigland wrote: > >> On Wed, Jun 10, 2009 at 09:33:33AM -0400, William A. (Andy) Adamson wrote: > >> > On Tue, Jun 9, 2009 at 3:36 PM, David Teigland wrote: > >> > > On Tue, Jun 09, 2009 at 03:14:09PM -0400, William A. (Andy) Adamson wrote: > >> > >> Hi David > >> > >> > >> > >> Thanks for looking at this. The kernel does report a recursive lock > >> > > > >> > > that's harmless > >> > > > >> > >> issue when running /etc/init.d/cman. Details inline. 
> >> > > > >> > > I can't see anything wrong, I'm going to check whether we have or can get > >> > > some more recent packages, since 2.99.12 is a bit old, it looks like > >> > > you're on fedora 10? > >> > > >> > yes. I could move to fedora 11. > >> > >> I did some checking, and unfortunately 2.99.12 is the newest version we've > >> packaged for either f10 or f11. It has something to do with the corosync > >> api's changing too rapidly, and the trouble with patching and rebuilding all > >> the packages that depend on it because they are using various versions of the > >> api... the hope is it will all be better when a stable corosync 1.0 release > >> happens. > >> > >> In the mean time, Fabio was kind enough to make a set of srpms of all the > >> latest versions, http://fabbione.fedorapeople.org/srpm/ I just built and > >> installed corosync, openais and cluster srpms from there on my fedora 10 > >> machine. Started the cluster and mounted gfs with the result. Same URL: cluster-3.0.0-17.rc2.fc12.src.rpm corosync-0.97-1.svn2233.fc12.src.rpm fence-agents-3.0.0-11.rc2.fc12.src.rpm lvm2-2.02.47-2.fc12.src.rpm openais-0.96-1.svn1951.fc12.src.rpm resource-agents-3.0.0-9.rc2.fc12.src.rpm I think vs the previous run, there is only a major update (important!) for corosync and cluster. The other packages should be unchanged. Fabio From marco.huang at sit.auckland.ac.nz Fri Jun 12 10:19:17 2009 From: marco.huang at sit.auckland.ac.nz (Marco Huang) Date: Fri, 12 Jun 2009 22:19:17 +1200 Subject: [Linux-cluster] kernel panic on debian lenny GFS2 when exporting via NFS Message-ID: <4A322BA5.30505@sit.auckland.ac.nz> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi, I have two debian lenny nodes (kernel 2.6.26-1-amd64) are running redhat cluster suite. I mount gfs2 with acl option on the two nodes. Everything are looking ok until I export the gfs2 file system to other servers with nfs acl option (I have tried without acl option). It just crashes the cluster when every time I try to edit or cat a file, but I can ls any directory without any problem. Does anyone have suggestion on that? The following is from dmesg [73567.236977] ------------[ cut here ]------------ [73567.236977] kernel BUG at fs/gfs2/glock.c:1134! 
[73567.237483] invalid opcode: 0000 [1] SMP [73567.237483] CPU 3 [73567.237483] Modules linked in: nfsd auth_rpcgss exportfs nfs lockd nfs_acl sunrpc sctp libcrc32c gfs lock_dlm gfs2 dlm configfs ipv6 aoe ext2 loop parport_pc parport snd_pcm snd_timer snd pcspkr soundcore psmouse snd_page_alloc serio_raw container i2c_piix4 ac button i2c_core intel_agp shpchp pci_hotplug evdev ext3 jbd mbcache dm_mirror dm_log dm_snapshot dm_mod sd_mod ide_cd_mod cdrom ata_generic libata dock ide_pci_generic floppy mptspi mptscsih mptbase scsi_transport_spi e1000 scsi_mod piix ide_core thermal processor fan thermal_sys [73567.237483] Pid: 23419, comm: nfsd Not tainted 2.6.26-1-amd64 #1 [73567.237483] RIP: 0010:[] [] :gfs2:gfs2_glock_nq+0x11b/0x1e0 [73567.237483] RSP: 0018:ffff81004b4f3cb0 EFLAGS: 00010282 [73567.237483] RAX: 000000000000002f RBX: ffff81004b4f3cf0 RCX: 0000000000000082 [73567.237483] RDX: 0000000000009f3a RSI: 0000000000000046 RDI: 0000000000000286 [73567.237483] RBP: ffff810057832740 R08: ffff8100d330dd48 R09: ffff81004b4f3800 [73567.237483] R10: 0000000000000000 R11: 0000000000000046 R12: ffff8100d330dd48 [73567.237483] R13: ffff8100d330dd48 R14: 0000000000000000 R15: ffff8100e0c32000 [73567.237483] FS: 00007f83d0fc86e0(0000) GS:ffff8100ef6df9c0(0000) knlGS:0000000000000000 [73567.237483] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [73567.237483] CR2: 0000000002679e08 CR3: 00000000ec8cf000 CR4: 00000000000006e0 [73567.237483] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [73567.237483] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [73567.237483] Process nfsd (pid: 23419, threadinfo ffff81004b4f2000, task ffff81000f9a70a0) [73567.237483] Stack: ffff8100ed4fecc8 ffff8100ed4fecc8 ffff8100ed4fecc8 ffff8100ee097c80 [73567.237483] 0000000000000000 ffff8100a5dd7740 ffff8100ce7f6cb0 ffffffffa02b4ec1 [73567.237483] ffff81004b4f3cf0 ffff81004b4f3cf0 ffff8100d330dd48 ffff8100ecd4e840 [73567.237483] Call Trace: [73567.237483] [] ? :gfs2:gfs2_open+0xc7/0x13c [73567.237483] [] ? :gfs2:gfs2_open+0xbf/0x13c [73567.237483] [] ? :gfs2:gfs2_open+0x0/0x13c [73567.237483] [] ? __dentry_open+0x12c/0x238 [73567.237483] [] ? :nfsd:nfsd_open+0x13c/0x170 [73567.237483] [] ? :nfsd:nfsd_read+0x7f/0xc4 [73567.237483] [] ? _spin_lock_bh+0x9/0x1f [73567.237483] [] ? :nfsd:nfsd3_proc_read+0xfe/0x141 [73567.237483] [] ? :nfsd:nfsd_dispatch+0xde/0x1b6 [73567.237483] [] ? :sunrpc:svc_process+0x408/0x6e9 [73567.237483] [] ? __down_read+0x12/0xa1 [73567.237483] [] ? :nfsd:nfsd+0x0/0x2a4 [73567.237483] [] ? :nfsd:nfsd+0x194/0x2a4 [73567.237483] [] ? schedule_tail+0x27/0x5c [73567.237483] [] ? child_rip+0xa/0x12 [73567.237483] [] ? :nfsd:nfsd+0x0/0x2a4 [73567.237483] [] ? :nfsd:nfsd+0x0/0x2a4 [73567.237483] [] ? :nfsd:nfsd+0x0/0x2a4 [73567.237483] [] ? child_rip+0x0/0x12 [73567.237483] [73567.237483] [73567.237483] Code: 74 03 8b 70 38 48 c7 c7 b0 2f 2c a0 31 c0 e8 72 94 f8 df 41 8b 54 24 30 41 8b 74 24 20 48 c7 c7 bd 2f 2c a0 31 c0 e8 5a 94 f8 df <0f> 0b eb fe 48 39 70 18 74 10 48 89 d0 48 8b 10 48 39 c8 0f 18 [73567.387600] RIP [] :gfs2:gfs2_glock_nq+0x11b/0x1e0 [73567.387600] RSP [73567.394761] ---[ end trace 7902e4725ced022f ]--- [73693.051705] BUG: soft lockup - CPU#1 stuck for 61s! 
[nfsd:23421] Cheers, Marco -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iEYEARECAAYFAkoyK6UACgkQSSHqatd3m2OaEACfexhB38p0InHX1WuvXFyy4st+ yxcAmwQaLeOz63p2rOnsQ0fswrlI4tEk =rDLx -----END PGP SIGNATURE----- From siddiqut at gmail.com Fri Jun 12 13:40:50 2009 From: siddiqut at gmail.com (Tajdar Siddiqui) Date: Fri, 12 Jun 2009 09:40:50 -0400 Subject: [Linux-cluster] gfs2 question Message-ID: <3abaa1ce0906120640v137f612at847e8a1847ee83b2@mail.gmail.com> We are running gfs2 on Red Hat Enterprise Linux Server release 5.3 (Tikanga) This is a 2 node cluster and what we have noticed is that from time to time, one node gets approximately 1/3rd write thruput on gfs2 as compared to the other node. Writing program is in java. Any ideas on what to check for etc.? Many thanx, Tajdar -------------- next part -------------- An HTML attachment was scrubbed... URL: From cthulhucalling at gmail.com Fri Jun 12 15:44:09 2009 From: cthulhucalling at gmail.com (Ian Hayes) Date: Fri, 12 Jun 2009 08:44:09 -0700 Subject: [Linux-cluster] Running additional scripts at service startup Message-ID: <36df569a0906120844h25fa6ac3v6950071e58ee089a@mail.gmail.com> HI all... I've been given the task of setting up a cluster for a service that we run here. The init script for the service calls an outside Perl script to do some administrative tasks once the daemon is started up. The script must be run as a different user so in the init script we have "su - someuser -c adminscript.pl". This all works fine if we start the daemon manually, but it doesn't appear that the script is running or it's failing whenever it is being started up via the cluster. Is there some magic foo that I'm missing? -------------- next part -------------- An HTML attachment was scrubbed... URL: From fdinitto at redhat.com Fri Jun 12 17:15:36 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Fri, 12 Jun 2009 19:15:36 +0200 Subject: [Linux-cluster] Re: Still having GFS2 mount hang In-Reply-To: <89c397150906120830r4ff643baw5f57eb16b57cc6e2@mail.gmail.com> References: <89c397150906081537i661beb55ke5f0284909f9e71f@mail.gmail.com> <1244534269.29604.789.camel@localhost.localdomain> <20090609140112.GB13914@redhat.com> <89c397150906091214k377d7213ud93ce341c3cb1167@mail.gmail.com> <20090609193616.GA22800@redhat.com> <89c397150906100633p5a0b15d9vab2da39107d362fc@mail.gmail.com> <20090610141351.GA18341@redhat.com> <1244653184.3665.77.camel@cerberus.int.fabbione.net> <89c397150906111208n78c222fhdf3a57e5dbbe9f50@mail.gmail.com> <1244788810.3665.112.camel@cerberus.int.fabbione.net> <89c397150906120830r4ff643baw5f57eb16b57cc6e2@mail.gmail.com> Message-ID: <1244826936.3665.126.camel@cerberus.int.fabbione.net> On Fri, 2009-06-12 at 11:30 -0400, William A. (Andy) Adamson wrote: > > Same URL: > > > > cluster-3.0.0-17.rc2.fc12.src.rpm > > corosync-0.97-1.svn2233.fc12.src.rpm > > fence-agents-3.0.0-11.rc2.fc12.src.rpm > > lvm2-2.02.47-2.fc12.src.rpm > > openais-0.96-1.svn1951.fc12.src.rpm > > resource-agents-3.0.0-9.rc2.fc12.src.rpm > > > > I think vs the previous run, there is only a major update (important!) > > for corosync and cluster. The other packages should be unchanged. > > OK. Next week at the bakeathon, I'll first test with what I have > 'cause it's working, and then I'll update to these rpm's and let you > know how it goes. OK cool. Looking forward to feedback. 
Fabio From dougbunger at yahoo.com Fri Jun 12 18:55:58 2009 From: dougbunger at yahoo.com (Doug Bunger) Date: Fri, 12 Jun 2009 11:55:58 -0700 (PDT) Subject: [Linux-cluster] gfs2 question Message-ID: <255434.14635.qm@web110215.mail.gq1.yahoo.com> What's the connectivity?? SAN or NAS?? Is it on physical RAID?? Are you accessing the same file? I have notice inconsistent access across iSCSI, as a result of network bandwidth, buffering, caching, et al. --- On Fri, 6/12/09, Tajdar Siddiqui wrote: From: Tajdar Siddiqui Subject: [Linux-cluster] gfs2 question To: linux-cluster at redhat.com Date: Friday, June 12, 2009, 8:40 AM We are running gfs2 on Red Hat Enterprise Linux Server release 5.3 (Tikanga) This is a 2 node cluster and what we have noticed is that from time to time, one node gets approximately 1/3rd write thruput on gfs2 as compared to the other node. Writing program is in java. Any ideas on what to check for? etc.? Many thanx, Tajdar -----Inline Attachment Follows----- -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From songyu555 at gmail.com Sun Jun 14 01:21:08 2009 From: songyu555 at gmail.com (yu song) Date: Sun, 14 Jun 2009 11:21:08 +1000 Subject: [Linux-cluster] Could some one explain why SCSI_Fence Agent can not be used in 2 nodes cluster? Message-ID: <420241f50906131821u54d12cd6vba75805b7cd82c0b@mail.gmail.com> Hi, I am planning to build a 2 nodes cluster on rhcl 5.3, and looking for what fencing method I could use. On the storage side, it is EMC clarion and supports scsi 3 reservation. So I'm thinking to use fence_scsi agent to do the disk fencing. however, according the redhat website, it states that fence_scsi does not support two nodes cluster. Could anyone kindly explain it why? (never had this issue when use veritas cluster) Another question is what is best practice to have how many Quorum disk for two-nodes cluster? It looks like not compulsory and better have it.. cheers, Yu -------------- next part -------------- An HTML attachment was scrubbed... URL: From rohara at redhat.com Mon Jun 15 03:36:12 2009 From: rohara at redhat.com (Ryan O'Hara) Date: Sun, 14 Jun 2009 22:36:12 -0500 Subject: [Linux-cluster] Could some one explain why SCSI_Fence Agent can not be used in 2 nodes cluster? In-Reply-To: <420241f50906131821u54d12cd6vba75805b7cd82c0b@mail.gmail.com> References: <420241f50906131821u54d12cd6vba75805b7cd82c0b@mail.gmail.com> Message-ID: <20090615033612.GA15883@redhat.com> On Sun, Jun 14, 2009 at 11:21:08AM +1000, yu song wrote: > Hi, > > I am planning to build a 2 nodes cluster on rhcl 5.3, and looking for what > fencing method I could use. > > On the storage side, it is EMC clarion and supports scsi 3 reservation. > > So I'm thinking to use fence_scsi agent to do the disk fencing. however, > according the redhat website, it states that fence_scsi does not support > two nodes cluster. > > Could anyone kindly explain it why? (never had this issue when use veritas > cluster) In a 2 node cluster, fencing becomes a race -- the node fences the other node first wins. This works well with power fencing, but not so well with SAN fencing (eg. fence_scsi). The problem with fence_scsi in a 2 node cluster is this: Suppose we have 2 node, call them A and B. Also assume we have multple LUNs, which we will call lun1, lun2, lun3. Consider what happens when a network partition occurs -- both nodes attempt to fence one another. 
It is possible that A could remove B's key from lun1 and lun2, but node B could remove node A's key from lun3. This is inconsistent and there is no clear "winner". Ryan From songyu555 at gmail.com Mon Jun 15 04:25:32 2009 From: songyu555 at gmail.com (yu song) Date: Mon, 15 Jun 2009 14:25:32 +1000 Subject: [Linux-cluster] Could some one explain why SCSI_Fence Agent can not be used in 2 nodes cluster? In-Reply-To: <20090615033612.GA15883@redhat.com> References: <420241f50906131821u54d12cd6vba75805b7cd82c0b@mail.gmail.com> <20090615033612.GA15883@redhat.com> Message-ID: <420241f50906142125s4828bf70y8310b52825a6986d@mail.gmail.com> thanks Ryan. So in the linux cluster, there is no concept about odd number coordinate disks, which is used to deal with this issue? anyway, probably I have to use power fencing. cheers Yu On Mon, Jun 15, 2009 at 1:36 PM, Ryan O'Hara wrote: > On Sun, Jun 14, 2009 at 11:21:08AM +1000, yu song wrote: > > Hi, > > > > I am planning to build a 2 nodes cluster on rhcl 5.3, and looking for > what > > fencing method I could use. > > > > On the storage side, it is EMC clarion and supports scsi 3 reservation. > > > > So I'm thinking to use fence_scsi agent to do the disk fencing. however, > > according the redhat website, it states that fence_scsi does not support > > two nodes cluster. > > > > Could anyone kindly explain it why? (never had this issue when use > veritas > > cluster) > > In a 2 node cluster, fencing becomes a race -- the node fences the > other node first wins. This works well with power fencing, but not so > well with SAN fencing (eg. fence_scsi). > > The problem with fence_scsi in a 2 node cluster is this: > > Suppose we have 2 node, call them A and B. Also assume we have multple > LUNs, which we will call lun1, lun2, lun3. Consider what happens when > a network partition occurs -- both nodes attempt to fence one > another. It is possible that A could remove B's key from lun1 and > lun2, but node B could remove node A's key from lun3. This is > inconsistent and there is no clear "winner". > > Ryan > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alfredo.moralejo at roche.com Mon Jun 15 14:17:55 2009 From: alfredo.moralejo at roche.com (Moralejo, Alfredo) Date: Mon, 15 Jun 2009 16:17:55 +0200 Subject: [Linux-cluster] cman + qdisk timeouts.... Message-ID: Hi, I'm having what I think is a timeouts issue in my cluster. I have a two node cluster using qdisk. Everytime the node that has the master role for qdisk becomes down (for failure or even stopping qdiskd manually), packages in the sane node are stopped because of the lack of quorum as the qdiskd becames unresponsive until second node becames master node and start working properly. Once qdiskd start working fine (usually 5-6 seconds) packages are started again. I've read in the cluster manual section for "CMAN membership timeout value" and I think this is the case. I've used RHEL 5.3 and I thought this parameter is the token that I set much longer that needed: ... 
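(The quoted cluster.conf snippet did not survive the conversion to plain text. Purely as a hypothetical illustration of the rule being described, with made-up values rather than the poster's settings, the two relevant entries look something like:

   <totem token="70000"/>
   <quorumd interval="3" tko="5" votes="1" label="qdisk"/>

i.e. a 3 s x 5 = 15 s quorum-disk timeout against a 70 s totem token, comfortably more than double.)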
Totem token is much more that double of qdisk timeout, so I guess it should be enough but everytime qdisk dies in the master node I get same result, services restarted in the sane node: Jun 15 16:11:33 rmamseslab07 qdiskd[14130]: Node 1 missed an update (2/3) Jun 15 16:11:38 rmamseslab07 qdiskd[14130]: Node 1 missed an update (3/3) Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: Node 1 missed an update (4/3) Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: Node 1 DOWN Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: Making bid for master Jun 15 16:11:44 rmamseslab07 clurgmgrd: [18510]: Executing /etc/init.d/watchdog status Jun 15 16:11:48 rmamseslab07 qdiskd[14130]: Node 1 missed an update (5/3) Jun 15 16:11:53 rmamseslab07 qdiskd[14130]: Node 1 missed an update (6/3) Jun 15 16:11:53 rmamseslab07 qdiskd[14130]: Assuming master role Message from syslogd at rmamseslab07 at Jun 15 16:11:53 ... clurgmgrd[18510]: #1: Quorum Dissolved Jun 15 16:11:53 rmamseslab07 openais[14087]: [CMAN ] lost contact with quorum device Jun 15 16:11:53 rmamseslab07 openais[14087]: [CMAN ] quorum lost, blocking activity Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: Membership Change Event Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: #1: Quorum Dissolved Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: Emergency stop of service:Cluster_test_2 Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: Emergency stop of service:wdtcscript-rmamseslab05-ic Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: Emergency stop of service:wdtcscript-rmamseslab07-ic Jun 15 16:11:54 rmamseslab07 clurgmgrd[18510]: Emergency stop of service:Logical volume 1 Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: Node 1 missed an update (7/3) Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: Writing eviction notice for node 1 Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: Telling CMAN to kill the node Jun 15 16:11:58 rmamseslab07 openais[14087]: [CMAN ] quorum regained, resuming activity I've just logged a case but... any idea???? Regards, Alfredo Moralejo Business Platforms Engineering - OS Servers - UNIX Senior Specialist F. Hoffmann-La Roche Ltd. Global Informatics Group Infrastructure Josefa Valc?rcel, 40 28027 Madrid SPAIN Phone: +34 91 305 97 87 alfredo.moralejo at roche.com Confidentiality Note: This message is intended only for the use of the named recipient(s) and may contain confidential and/or proprietary information. If you are not the intended recipient, please contact the sender and delete this message. Any unauthorized use of the information contained in this message is prohibited. -------------- next part -------------- An HTML attachment was scrubbed... URL: From anasnajj at gmail.com Mon Jun 15 18:32:24 2009 From: anasnajj at gmail.com (anasnajj) Date: Mon, 15 Jun 2009 21:32:24 +0300 Subject: [Linux-cluster] Service Owner Unknown-after power failure In-Reply-To: References: Message-ID: Hi all I have Redhat cluster with 5 nodes run 5 services for each node with two additional backup nodes when suddenly power failure happened on one node , the cluster state on another nodes show that the service owner of failed node is unknown and when we try to disable or relocate the service its stay keep trying without result .. , so how I can make the cluster start the service again on another node when the first node has power failure and no way to return it Up ???? thanks -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From giuseppe.fuggiano at gmail.com Mon Jun 15 18:41:36 2009 From: giuseppe.fuggiano at gmail.com (Giuseppe Fuggiano) Date: Mon, 15 Jun 2009 20:41:36 +0200 Subject: [Linux-cluster] Service Owner Unknown-after power failure In-Reply-To: References: Message-ID: <1e09d9070906151141x7b872d82x8cf7afbaa2ef5f82@mail.gmail.com> 2009/6/15 anasnajj : > Hi all > > I have Redhat cluster with 5 nodes run 5 services for each node with two > additional backup nodes > > when suddenly power failure happened on ?one node , the cluster state on > another nodes show that the service owner of failed node is unknown and when > we try to disable or relocate the service its stay keep trying without > result .. , so how I can make the cluster start the service again on another > node when the first node has power failure and no way to return it Up ???? What about your cluster.conf? -- Giuseppe From anasnajj at gmail.com Mon Jun 15 18:44:21 2009 From: anasnajj at gmail.com (anasnajj) Date: Mon, 15 Jun 2009 21:44:21 +0300 Subject: [Linux-cluster] Service Owner Unknown-after power failure In-Reply-To: <1e09d9070906151141x7b872d82x8cf7afbaa2ef5f82@mail.gmail.com> References: <1e09d9070906151141x7b872d82x8cf7afbaa2ef5f82@mail.gmail.com> Message-ID: Is the quarum disk will solve this problem ?? -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Giuseppe Fuggiano Sent: Monday, June 15, 2009 9:42 PM To: linux clustering Subject: Re: [Linux-cluster] Service Owner Unknown-after power failure 2009/6/15 anasnajj : > Hi all > > I have Redhat cluster with 5 nodes run 5 services for each node with two > additional backup nodes > > when suddenly power failure happened on ?one node , the cluster state on > another nodes show that the service owner of failed node is unknown and when > we try to disable or relocate the service its stay keep trying without > result .. , so how I can make the cluster start the service again on another > node when the first node has power failure and no way to return it Up ???? What about your cluster.conf? -- Giuseppe -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From dhopp at coreps.com Mon Jun 15 19:00:26 2009 From: dhopp at coreps.com (Dennis B. Hopp) Date: Mon, 15 Jun 2009 14:00:26 -0500 Subject: [Linux-cluster] GFS2 locking issues Message-ID: <20090615140026.z2wnqgqshwgwo80w@mail.coreps.com> We have a three node nfs/samba cluster that we seem to be having very poor performance on GFS2. We have a samba share that is acting as a disk to disk backup share for Backup Exec and during the backup process the load on the server will go through the roof until the network requests timeout and the backup job fails. I downloaded the ping_pong utility and ran it and seem to be getting terrible performance: [root at sc2 ~]# ./ping_ping /mnt/backup/test.dat 4 97 locks/sec The results are the same on all three nodes. I can't seem to figure out why this is so bad. 
Some additional information: [root at sc2 ~]# gfs2_tool gettune /mnt/backup new_files_directio = 0 new_files_jdata = 0 quota_scale = 1.0000 (1, 1) logd_secs = 1 recoverd_secs = 60 statfs_quantum = 30 stall_secs = 600 quota_cache_secs = 300 quota_simul_sync = 64 statfs_slow = 0 complain_secs = 10 max_readahead = 262144 quota_quantum = 60 quota_warn_period = 10 jindex_refresh_secs = 60 log_flush_secs = 60 incore_log_blocks = 1024 demote_secs = 600 [root at sc2 ~]# gfs2_tool getargs /mnt/backup data 2 suiddir 0 quota 0 posix_acl 1 num_glockd 1 upgrade 0 debug 0 localflocks 0 localcaching 0 ignore_local_fs 0 spectator 0 hostdata jid=0:id=262146:first=0 locktable lockproto lock_dlm 97 locks/sec [root at sc2 ~]# rpm -qa | grep gfs kmod-gfs-0.1.31-3.el5 gfs-utils-0.1.18-1.el5 gfs2-utils-0.1.53-1.el5_3.3 [root at sc2 ~]# uname -r 2.6.18-128.1.10.el5 Thanks, --Dennis From adas at redhat.com Mon Jun 15 19:07:28 2009 From: adas at redhat.com (Abhijith Das) Date: Mon, 15 Jun 2009 14:07:28 -0500 Subject: [Linux-cluster] GFS2 locking issues In-Reply-To: <20090615140026.z2wnqgqshwgwo80w@mail.coreps.com> References: <20090615140026.z2wnqgqshwgwo80w@mail.coreps.com> Message-ID: <4A369BF0.3010203@redhat.com> Dennis, You seem to be running plock_rate_limit=100 that limits the number of plocks/sec to 100 to avoid network flooding due to plocks. Setting this as in cluster.conf should give you better plock performance. Hope this helps, Thanks! --Abhi Dennis B. Hopp wrote: > We have a three node nfs/samba cluster that we seem to be having very > poor performance on GFS2. We have a samba share that is acting as a > disk to disk backup share for Backup Exec and during the backup > process the load on the server will go through the roof until the > network requests timeout and the backup job fails. > > I downloaded the ping_pong utility and ran it and seem to be getting > terrible performance: > > [root at sc2 ~]# ./ping_ping /mnt/backup/test.dat 4 > 97 locks/sec > > The results are the same on all three nodes. > > I can't seem to figure out why this is so bad. Some additional information: > > [root at sc2 ~]# gfs2_tool gettune /mnt/backup > new_files_directio = 0 > new_files_jdata = 0 > quota_scale = 1.0000 (1, 1) > logd_secs = 1 > recoverd_secs = 60 > statfs_quantum = 30 > stall_secs = 600 > quota_cache_secs = 300 > quota_simul_sync = 64 > statfs_slow = 0 > complain_secs = 10 > max_readahead = 262144 > quota_quantum = 60 > quota_warn_period = 10 > jindex_refresh_secs = 60 > log_flush_secs = 60 > incore_log_blocks = 1024 > demote_secs = 600 > > [root at sc2 ~]# gfs2_tool getargs /mnt/backup > data 2 > suiddir 0 > quota 0 > posix_acl 1 > num_glockd 1 > upgrade 0 > debug 0 > localflocks 0 > localcaching 0 > ignore_local_fs 0 > spectator 0 > hostdata jid=0:id=262146:first=0 > locktable > lockproto lock_dlm > > 97 locks/sec > [root at sc2 ~]# rpm -qa | grep gfs > kmod-gfs-0.1.31-3.el5 > gfs-utils-0.1.18-1.el5 > gfs2-utils-0.1.53-1.el5_3.3 > > [root at sc2 ~]# uname -r > 2.6.18-128.1.10.el5 > > Thanks, > > --Dennis > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > From dhopp at coreps.com Mon Jun 15 20:09:01 2009 From: dhopp at coreps.com (Dennis B. 
Hopp) Date: Mon, 15 Jun 2009 15:09:01 -0500 Subject: [Linux-cluster] GFS2 locking issues In-Reply-To: <4A369BF0.3010203@redhat.com> References: <20090615140026.z2wnqgqshwgwo80w@mail.coreps.com> <4A369BF0.3010203@redhat.com> Message-ID: <20090615150901.2aapnpgn6sw0cog4@mail.coreps.com> That didn't work, but I changed it to: And I'm getting different results, but still not good performance. Running ping_pong on one node [root at sc2 ~]# ./ping_ping /mnt/backup/test.dat 4 5870 locks/sec I think that should be much higher, but as soon as I start it on another node it drops to 97 locks/sec Any other ideas? --Dennis Quoting Abhijith Das : > Dennis, > > You seem to be running plock_rate_limit=100 that limits the number of > plocks/sec to 100 to avoid network flooding due to plocks. > > Setting this as in cluster.conf > should give you better plock performance. > > Hope this helps, > Thanks! > --Abhi > > Dennis B. Hopp wrote: >> We have a three node nfs/samba cluster that we seem to be having very >> poor performance on GFS2. We have a samba share that is acting as a >> disk to disk backup share for Backup Exec and during the backup >> process the load on the server will go through the roof until the >> network requests timeout and the backup job fails. >> >> I downloaded the ping_pong utility and ran it and seem to be getting >> terrible performance: >> >> [root at sc2 ~]# ./ping_ping /mnt/backup/test.dat 4 >> 97 locks/sec >> >> The results are the same on all three nodes. >> >> I can't seem to figure out why this is so bad. Some additional information: >> >> [root at sc2 ~]# gfs2_tool gettune /mnt/backup >> new_files_directio = 0 >> new_files_jdata = 0 >> quota_scale = 1.0000 (1, 1) >> logd_secs = 1 >> recoverd_secs = 60 >> statfs_quantum = 30 >> stall_secs = 600 >> quota_cache_secs = 300 >> quota_simul_sync = 64 >> statfs_slow = 0 >> complain_secs = 10 >> max_readahead = 262144 >> quota_quantum = 60 >> quota_warn_period = 10 >> jindex_refresh_secs = 60 >> log_flush_secs = 60 >> incore_log_blocks = 1024 >> demote_secs = 600 >> >> [root at sc2 ~]# gfs2_tool getargs /mnt/backup >> data 2 >> suiddir 0 >> quota 0 >> posix_acl 1 >> num_glockd 1 >> upgrade 0 >> debug 0 >> localflocks 0 >> localcaching 0 >> ignore_local_fs 0 >> spectator 0 >> hostdata jid=0:id=262146:first=0 >> locktable >> lockproto lock_dlm >> >> 97 locks/sec >> [root at sc2 ~]# rpm -qa | grep gfs >> kmod-gfs-0.1.31-3.el5 >> gfs-utils-0.1.18-1.el5 >> gfs2-utils-0.1.53-1.el5_3.3 >> >> [root at sc2 ~]# uname -r >> 2.6.18-128.1.10.el5 >> >> Thanks, >> >> --Dennis >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From dhopp at coreps.com Mon Jun 15 20:55:20 2009 From: dhopp at coreps.com (Dennis B. Hopp) Date: Mon, 15 Jun 2009 15:55:20 -0500 Subject: [Linux-cluster] GFS2 locking issues In-Reply-To: <20090615150901.2aapnpgn6sw0cog4@mail.coreps.com> References: <20090615140026.z2wnqgqshwgwo80w@mail.coreps.com> <4A369BF0.3010203@redhat.com> <20090615150901.2aapnpgn6sw0cog4@mail.coreps.com> Message-ID: <20090615155520.0quqdh0gqo4s0cco@mail.coreps.com> Actually...I added both to cluster.conf and rebooted every node. 
Now running ping_pong gives me roughly 3500 locks/sec when running it on more then one node (running it on just one node gives me around 5000 locks/sec) which according to the samba wiki are about in line with what it should be. Thanks, --Dennis Quoting "Dennis B. Hopp" : > That didn't work, but I changed it to: > > > > And I'm getting different results, but still not good performance. > Running ping_pong on one node > > [root at sc2 ~]# ./ping_ping /mnt/backup/test.dat 4 > 5870 locks/sec > > I think that should be much higher, but as soon as I start it on > another node it drops to 97 locks/sec > > Any other ideas? > > --Dennis > > Quoting Abhijith Das : > >> Dennis, >> >> You seem to be running plock_rate_limit=100 that limits the number of >> plocks/sec to 100 to avoid network flooding due to plocks. >> >> Setting this as in cluster.conf >> should give you better plock performance. >> >> Hope this helps, >> Thanks! >> --Abhi >> >> Dennis B. Hopp wrote: >>> We have a three node nfs/samba cluster that we seem to be having very >>> poor performance on GFS2. We have a samba share that is acting as a >>> disk to disk backup share for Backup Exec and during the backup >>> process the load on the server will go through the roof until the >>> network requests timeout and the backup job fails. >>> >>> I downloaded the ping_pong utility and ran it and seem to be getting >>> terrible performance: >>> >>> [root at sc2 ~]# ./ping_ping /mnt/backup/test.dat 4 >>> 97 locks/sec >>> >>> The results are the same on all three nodes. >>> >>> I can't seem to figure out why this is so bad. Some additional >>> information: >>> >>> [root at sc2 ~]# gfs2_tool gettune /mnt/backup >>> new_files_directio = 0 >>> new_files_jdata = 0 >>> quota_scale = 1.0000 (1, 1) >>> logd_secs = 1 >>> recoverd_secs = 60 >>> statfs_quantum = 30 >>> stall_secs = 600 >>> quota_cache_secs = 300 >>> quota_simul_sync = 64 >>> statfs_slow = 0 >>> complain_secs = 10 >>> max_readahead = 262144 >>> quota_quantum = 60 >>> quota_warn_period = 10 >>> jindex_refresh_secs = 60 >>> log_flush_secs = 60 >>> incore_log_blocks = 1024 >>> demote_secs = 600 >>> >>> [root at sc2 ~]# gfs2_tool getargs /mnt/backup >>> data 2 >>> suiddir 0 >>> quota 0 >>> posix_acl 1 >>> num_glockd 1 >>> upgrade 0 >>> debug 0 >>> localflocks 0 >>> localcaching 0 >>> ignore_local_fs 0 >>> spectator 0 >>> hostdata jid=0:id=262146:first=0 >>> locktable >>> lockproto lock_dlm >>> >>> 97 locks/sec >>> [root at sc2 ~]# rpm -qa | grep gfs >>> kmod-gfs-0.1.31-3.el5 >>> gfs-utils-0.1.18-1.el5 >>> gfs2-utils-0.1.53-1.el5_3.3 >>> >>> [root at sc2 ~]# uname -r >>> 2.6.18-128.1.10.el5 >>> >>> Thanks, >>> >>> --Dennis >>> >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From devi at atc.tcs.com Thu Jun 18 06:38:10 2009 From: devi at atc.tcs.com (devi) Date: Thu, 18 Jun 2009 12:08:10 +0530 Subject: [Linux-cluster] managing of resources Message-ID: <1245307090.4090.17.camel@localhost.localdomain> Hi, how can we manage the resources of cluster ? I mean to stop , start, or to find the status of cluster resources Regards, Devi. 
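(The reply below points at clustat and clusvcadm; in day-to-day use they look roughly like this, with the service and node names being placeholders:

# clustat                            # node, quorum and service status
# clusvcadm -e myservice             # enable (start) a service
# clusvcadm -d myservice             # disable (stop) a service
# clusvcadm -r myservice -m node2    # relocate a service to another member
# clusvcadm -R myservice             # restart a service in place
)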
From cthulhucalling at gmail.com Thu Jun 18 07:02:10 2009 From: cthulhucalling at gmail.com (Ian Hayes) Date: Thu, 18 Jun 2009 00:02:10 -0700 Subject: [Linux-cluster] managing of resources In-Reply-To: <1245307090.4090.17.camel@localhost.localdomain> References: <1245307090.4090.17.camel@localhost.localdomain> Message-ID: <36df569a0906180002g2eca2eecu22c5f06bf6207f95@mail.gmail.com> Clustat for the cluster and service status, clusvcadm for starting, stopping and moving services. Luci will also do all that and more with a nice gui frontend On Jun 17, 2009 11:41 PM, "devi" wrote: Hi, how can we manage the resources of cluster ? I mean to stop , start, or to find the status of cluster resources Regards, Devi. -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From brahadambal at gmail.com Thu Jun 18 11:41:42 2009 From: brahadambal at gmail.com (Brahadambal Srinivasan) Date: Thu, 18 Jun 2009 17:11:42 +0530 Subject: [Linux-cluster] Cluster among geographically separated nodes ? Message-ID: <68867af80906180441o4a7fac7fhd8338484cac4db57@mail.gmail.com> Hi, I am trying to figure out if it is possible to create an RHCS cluster among nodes that are in remote locations? If yes, then how are the following handled? : 1. Storage - how is the shared storage acheived? 2. Fencing - any special methods to fence ? 3. Max. number of nodes possible in such a setup 4. any special methods/exceptions/rules to setup this cluster? Pointers to any material in this regard will be great. Thanks much in advance. Thanks and regards, Brahadambal -------------- next part -------------- An HTML attachment was scrubbed... URL: From gordan at bobich.net Thu Jun 18 12:12:12 2009 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 18 Jun 2009 13:12:12 +0100 Subject: [Linux-cluster] Cluster among geographically separated nodes =?UTF-8?Q?=3F?= In-Reply-To: <68867af80906180441o4a7fac7fhd8338484cac4db57@mail.gmail.com> References: <68867af80906180441o4a7fac7fhd8338484cac4db57@mail.gmail.com> Message-ID: <0a2fe75edcf31e38421bed12af83618a@localhost> On Thu, 18 Jun 2009 17:11:42 +0530, Brahadambal Srinivasan wrote: > Hi, > > I am trying to figure out if it is possible to create an RHCS cluster among > nodes that are in remote locations? If yes, then how are the following > handled? : > > 1. Storage - how is the shared storage acheived? Same as it is achieved locally. It is up to your SAN to handle this in a real-time, consistent way. You may want to look into DRBD (http://www.drbd.org) for the block device level replication. Be aware, however, that performance on the disk access front will be terrible, because the latency will end up being limited by your ping time on the WAN. So instead of it having 0.1ms added via a local gigabit interconnect, it'll have 50-100ms added to it. Most applications will not produce usable performance with this kind of disk I/O speed. You may, instead, want to look into something like GlusterFS (http://www.gluster.org) or PeerFS (http://www.radiantdata.com). > 2. Fencing - any special methods to fence ? Just be aware that if your site interconnect goes down, you'll end up with a hung cluster, since the nodes will disconnect and be unable to fence each other. 
You could offset that by having separate cluster and fencing interconnects, but you would also need to look into quorum - you need n/2+1 nodes for quorum, so to make this work sensibly you'd need at least three sites - otherwise if you lose the bigger site you lose the whole cluster anyway. > 3. Max. number of nodes possible in such a setup I don't think there is a difference in this regard between LAN and WAN clusters. Gordan From raju.rajsand at gmail.com Thu Jun 18 12:13:58 2009 From: raju.rajsand at gmail.com (Rajagopal Swaminathan) Date: Thu, 18 Jun 2009 17:43:58 +0530 Subject: [Linux-cluster] Cluster among geographically separated nodes ? In-Reply-To: <68867af80906180441o4a7fac7fhd8338484cac4db57@mail.gmail.com> References: <68867af80906180441o4a7fac7fhd8338484cac4db57@mail.gmail.com> Message-ID: <8786b91c0906180513j6feef774r2a718d238d303fed@mail.gmail.com> Greetings Not an expert in Cluster. But my 2c below All of them below assume that cluster heartbeat network is on a fast network On Thu, Jun 18, 2009 at 5:11 PM, Brahadambal Srinivasan < brahadambal at gmail.com> wrote: > > 1. Storage - how is the shared storage acheived? > By storage replication (preferably with fibre) If two node, DRBD may work with a very-very fast link (It works in a 100mbps lan though like campuses with two nodes in very seperate buildings) > 2. Fencing - any special methods to fence ? > DRC/ILO/ILOM comes to mind > 3. Max. number of nodes possible in such a setup > I have done it with 4 nodes max 4. any special methods/exceptions/rules to setup this cluster? > > > Pointers to any material in this regard will be great. Thanks much in > advance. > http://archives.free.net.ph/message/20090311.200230.23f4f917.pl.html http://www.mail-archive.com/linux-cluster at redhat.com/msg06229.html > > Thanks and regards, > Brahadambal > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From giuseppe.fuggiano at gmail.com Thu Jun 18 12:51:32 2009 From: giuseppe.fuggiano at gmail.com (Giuseppe Fuggiano) Date: Thu, 18 Jun 2009 14:51:32 +0200 Subject: [Linux-cluster] Cluster among geographically separated nodes ? In-Reply-To: <0a2fe75edcf31e38421bed12af83618a@localhost> References: <68867af80906180441o4a7fac7fhd8338484cac4db57@mail.gmail.com> <0a2fe75edcf31e38421bed12af83618a@localhost> Message-ID: <1e09d9070906180551u451edf39q123a8a2a7e16a265@mail.gmail.com> 2009/6/18 Gordan Bobic : > On Thu, 18 Jun 2009 17:11:42 +0530, Brahadambal Srinivasan > wrote: >> Hi, >> >> I am trying to figure out if it is possible to create an RHCS cluster > among >> nodes that are in remote locations? If yes, then how are the following >> handled? : >> >> 1. Storage - how is the shared storage acheived? > > Same as it is achieved locally. It is up to your SAN to handle this in a > real-time, consistent way. You may want to look into DRBD > (http://www.drbd.org) for the block device level replication. Be aware, > however, that performance on the disk access front will be terrible, > because the latency will end up being limited by your ping time on the WAN. > So instead of it having 0.1ms added via a local gigabit interconnect, it'll > have 50-100ms added to it. Most applications will not produce usable > performance with this kind of disk I/O speed. I am wondering if that will affect both read and write requests or only write/verify ones (which DRBD have to replicate using the network). 
-- Giuseppe From gordan at bobich.net Thu Jun 18 13:31:10 2009 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 18 Jun 2009 14:31:10 +0100 Subject: [Linux-cluster] Cluster among geographically separated nodes =?UTF-8?Q?=3F?= In-Reply-To: <1e09d9070906180551u451edf39q123a8a2a7e16a265@mail.gmail.com> References: <68867af80906180441o4a7fac7fhd8338484cac4db57@mail.gmail.com> <0a2fe75edcf31e38421bed12af83618a@localhost> <1e09d9070906180551u451edf39q123a8a2a7e16a265@mail.gmail.com> Message-ID: On Thu, 18 Jun 2009 14:51:32 +0200, Giuseppe Fuggiano wrote: > 2009/6/18 Gordan Bobic : >> On Thu, 18 Jun 2009 17:11:42 +0530, Brahadambal Srinivasan >> wrote: >>> Hi, >>> >>> I am trying to figure out if it is possible to create an RHCS cluster >> among >>> nodes that are in remote locations? If yes, then how are the following >>> handled? : >>> >>> 1. Storage - how is the shared storage acheived? >> >> Same as it is achieved locally. It is up to your SAN to handle this in a >> real-time, consistent way. You may want to look into DRBD >> (http://www.drbd.org) for the block device level replication. Be aware, >> however, that performance on the disk access front will be terrible, >> because the latency will end up being limited by your ping time on the >> WAN. >> So instead of it having 0.1ms added via a local gigabit interconnect, >> it'll >> have 50-100ms added to it. Most applications will not produce usable >> performance with this kind of disk I/O speed. > > I am wondering if that will affect both read and write requests or > only write/verify ones (which DRBD have to replicate using the > network). It'll affect both a lot of the time even if one site is passive/failover, and pretty much all the time if it's an active-active configuration with both sides handling load. DLM will end up bouncing and checking locks back and forth between the sites. This will be the case with any real-time distributed storage system that guarantees full consistency. In other words, load sharing over a WAN will have unusable performance in most cases. Within the same campus, it'd be OK, but between different continents, I don't see it being viable. The real question is whether you really need/want load sharing. If not, you can just use ext3 with DRBD in active-passive mode with failover. Or you can use a more farm-like approach where the servers are mostly serving data, and updates/writes can be streamed from a single master using something like SeznamFS. Gordan From giuseppe.fuggiano at gmail.com Thu Jun 18 19:22:34 2009 From: giuseppe.fuggiano at gmail.com (Giuseppe Fuggiano) Date: Thu, 18 Jun 2009 21:22:34 +0200 Subject: [Linux-cluster] DRBD+GFS - Link is down, Link is up Message-ID: <1e09d9070906181222m6cbacdf0mcb6e97f7dcd33bd0@mail.gmail.com> Hi all, I configured GFS over DRBD (active-active) with RHCS and IPMI as fence device. When I try to mount my GFS resource, my interconnect interface goes down and one node is fenced. This happen every time. DRBD joins and become primary... 
Jun 18 19:04:30 alice kernel: drbd0: Handshake successful: Agreed network protocol version 89 Jun 18 19:04:30 alice kernel: drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC Jun 18 19:04:30 alice kernel: drbd0: conn( WFConnection -> WFReportParams ) Jun 18 19:04:30 alice kernel: drbd0: Starting asender thread (from drbd0_receiver [3315]) Jun 18 19:04:30 alice kernel: drbd0: data-integrity-alg: Jun 18 19:04:30 alice kernel: drbd0: drbd_sync_handshake: Jun 18 19:04:30 alice kernel: drbd0: self 2BA45318C0A122D1:CBAA0E591815072F:3F39591B4EF90EDD:2E40DDEB552666B9 Jun 18 19:04:30 alice kernel: drbd0: peer CBAA0E591815072E:0000000000000000:3F39591B4EF90EDD:2E40DDEB552666B9 Jun 18 19:04:30 alice kernel: drbd0: uuid_compare()=1 by rule 7 Jun 18 19:04:30 alice kernel: drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate ) Jun 18 19:04:30 alice kernel: drbd0: peer( Secondary -> Primary ) Jun 18 19:04:31 alice kernel: drbd0: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent ) Jun 18 19:04:31 alice kernel: drbd0: Began resync as SyncSource (will sync 16384 KB [4096 bits set]). Jun 18 19:04:33 alice kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 16384 K/sec) Jun 18 19:04:33 alice kernel: drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) Then the fence domain is OK: Jun 18 19:04:35 alice openais[3475]: [TOTEM] entering GATHER state from 11. Jun 18 19:04:35 alice openais[3475]: [TOTEM] Creating commit token because I am the rep. Jun 18 19:04:35 alice openais[3475]: [TOTEM] Saving state aru 1b high seq received 1b Jun 18 19:04:35 alice openais[3475]: [TOTEM] Storing new sequence id for ring 34 Jun 18 19:04:35 alice openais[3475]: [TOTEM] entering COMMIT state. Jun 18 19:04:35 alice openais[3475]: [TOTEM] entering RECOVERY state. Jun 18 19:04:35 alice openais[3475]: [TOTEM] position [0] member 10.17.44.116: Jun 18 19:04:35 alice openais[3475]: [TOTEM] previous ring seq 48 rep 10.17.44.116 Jun 18 19:04:35 alice openais[3475]: [TOTEM] aru 1b high delivered 1b received flag 1 Jun 18 19:04:35 alice openais[3475]: [TOTEM] position [1] member 10.17.44.117: Jun 18 19:04:35 alice openais[3475]: [TOTEM] previous ring seq 48 rep 10.17.44.117 Jun 18 19:04:35 alice openais[3475]: [TOTEM] aru a high delivered a received flag 1 Jun 18 19:04:35 alice openais[3475]: [TOTEM] Did not need to originate any messages in recovery. Jun 18 19:04:35 alice openais[3475]: [TOTEM] Sending initial ORF token Jun 18 19:04:35 alice openais[3475]: [CLM ] CLM CONFIGURATION CHANGE Jun 18 19:04:36 alice openais[3475]: [CLM ] New Configuration: Jun 18 19:04:36 alice openais[3475]: [CLM ] r(0) ip(10.17.44.116) Jun 18 19:04:36 alice openais[3475]: [CLM ] Members Left: Jun 18 19:04:36 alice openais[3475]: [CLM ] Members Joined: Jun 18 19:04:36 alice openais[3475]: [CLM ] CLM CONFIGURATION CHANGE Jun 18 19:04:36 alice openais[3475]: [CLM ] New Configuration: Jun 18 19:04:36 alice openais[3475]: [CLM ] r(0) ip(10.17.44.116) Jun 18 19:04:36 alice openais[3475]: [CLM ] r(0) ip(10.17.44.117) Jun 18 19:04:36 alice openais[3475]: [CLM ] Members Left: Jun 18 19:04:36 alice openais[3475]: [CLM ] Members Joined: Jun 18 19:04:36 alice openais[3475]: [CLM ] r(0) ip(10.17.44.117) Jun 18 19:04:36 alice openais[3475]: [SYNC ] This node is within the primary component and will provide service. Jun 18 19:04:36 alice openais[3475]: [TOTEM] entering OPERATIONAL state. 
Jun 18 19:04:36 alice openais[3475]: [CLM ] got nodejoin message 10.17.44.116 Jun 18 19:04:36 alice openais[3475]: [CLM ] got nodejoin message 10.17.44.117 Jun 18 19:04:36 alice openais[3475]: [CPG ] got joinlist message from node 1 Jun 18 19:04:40 alice kernel: dlm: connecting to 2 Jun 18 19:04:40 alice kernel: dlm: got connection from 2 WHY DOWN? Jun 18 19:04:53 alice kernel: eth2: Link is Down Jun 18 19:04:53 alice openais[3475]: [TOTEM] The token was lost in the OPERATIONAL state. Jun 18 19:04:53 alice openais[3475]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes). Jun 18 19:04:53 alice openais[3475]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes). Jun 18 19:04:53 alice openais[3475]: [TOTEM] entering GATHER state from 2. Jun 18 19:04:57 alice kernel: eth2: Link is Up 100 Mbps Full Duplex, Flow Control: None Jun 18 19:04:57 alice kernel: eth2: 10/100 speed: disabling TSO Something goes wrong with DRBD Jun 18 19:04:58 alice kernel: drbd0: PingAck did not arrive in time. Jun 18 19:04:58 alice kernel: drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) Jun 18 19:04:58 alice kernel: drbd0: asender terminated Jun 18 19:04:58 alice kernel: drbd0: Terminating asender thread Jun 18 19:04:58 alice kernel: drbd0: short read expecting header on sock: r=-512 Jun 18 19:04:58 alice kernel: drbd0: Creating new current UUID Jun 18 19:04:58 alice kernel: drbd0: Connection closed Jun 18 19:04:58 alice kernel: drbd0: conn( NetworkFailure -> Unconnected ) Jun 18 19:04:58 alice kernel: drbd0: receiver terminated Jun 18 19:04:58 alice kernel: drbd0: Restarting receiver thread Jun 18 19:04:58 alice kernel: drbd0: receiver (re)started Jun 18 19:04:58 alice kernel: drbd0: conn( Unconnected -> WFConnection ) Something goes wrong in the cluster Jun 18 19:04:58 alice openais[3475]: [TOTEM] entering GATHER state from 0. Jun 18 19:04:58 alice openais[3475]: [TOTEM] Creating commit token because I am the rep. Jun 18 19:04:58 alice openais[3475]: [TOTEM] Saving state aru 3c high seq received 3c Jun 18 19:04:58 alice openais[3475]: [TOTEM] Storing new sequence id for ring 38 Jun 18 19:04:58 alice openais[3475]: [TOTEM] entering COMMIT state. Jun 18 19:04:58 alice openais[3475]: [TOTEM] entering RECOVERY state. Jun 18 19:04:58 alice openais[3475]: [TOTEM] position [0] member 10.17.44.116: Jun 18 19:04:58 alice openais[3475]: [TOTEM] previous ring seq 52 rep 10.17.44.116 Jun 18 19:04:58 alice openais[3475]: [TOTEM] aru 3c high delivered 3c received flag 1 Jun 18 19:04:58 alice openais[3475]: [TOTEM] Did not need to originate any messages in recovery. Jun 18 19:04:58 alice openais[3475]: [TOTEM] Sending initial ORF token Jun 18 19:04:58 alice openais[3475]: [CLM ] CLM CONFIGURATION CHANGE Jun 18 19:04:58 alice openais[3475]: [CLM ] New Configuration: Jun 18 19:04:58 alice kernel: dlm: closing connection to node 2 Jun 18 19:04:58 alice fenced[3494]: bob not a cluster member after 0 sec post_fail_delay Jun 18 19:04:58 alice openais[3475]: [CLM ] r(0) ip(10.17.44.116) "bob" node is fenced (it just joined!) 
Jun 18 19:04:58 alice fenced[3494]: fencing node "bob" Jun 18 19:04:58 alice openais[3475]: [CLM ] Members Left: Jun 18 19:04:58 alice openais[3475]: [CLM ] r(0) ip(10.17.44.117) Jun 18 19:04:58 alice openais[3475]: [CLM ] Members Joined: Jun 18 19:04:58 alice openais[3475]: [CLM ] CLM CONFIGURATION CHANGE Jun 18 19:04:58 alice openais[3475]: [CLM ] New Configuration: Jun 18 19:04:58 alice openais[3475]: [CLM ] r(0) ip(10.17.44.116) Jun 18 19:04:58 alice openais[3475]: [CLM ] Members Left: Jun 18 19:04:58 alice openais[3475]: [CLM ] Members Joined: Jun 18 19:04:58 alice openais[3475]: [SYNC ] This node is within the primary component and will provide service. Jun 18 19:04:58 alice openais[3475]: [TOTEM] entering OPERATIONAL state. Jun 18 19:04:58 alice openais[3475]: [CLM ] got nodejoin message 10.17.44.116 Jun 18 19:04:58 alice openais[3475]: [CPG ] got joinlist message from node 1 Jun 18 19:05:03 alice kernel: eth2: Link is Down Jun 18 19:05:08 alice kernel: eth2: Link is Up 100 Mbps Full Duplex, Flow Control: None Jun 18 19:05:08 alice kernel: eth2: 10/100 speed: disabling TSO Jun 18 19:05:12 alice kernel: eth2: Link is Down Jun 18 19:05:13 alice fenced[3494]: fence "bob" success Jun 18 19:05:13 alice kernel: GFS: fsid=webclima:web.0: jid=1: Trying to acquire journal lock... Jun 18 19:05:13 alice kernel: GFS: fsid=webclima:web.0: jid=1: Looking at journal... Jun 18 19:05:13 alice kernel: GFS: fsid=webclima:web.0: jid=1: Done eth2 is up and down.... Jun 18 19:05:15 alice kernel: eth2: Link is Up 100 Mbps Full Duplex, Flow Control: None Jun 18 19:05:15 alice kernel: eth2: 10/100 speed: disabling TSO Jun 18 19:05:21 alice kernel: eth2: Link is Down Jun 18 19:05:24 alice kernel: eth2: Link is Up 100 Mbps Full Duplex, Flow Control: None Jun 18 19:05:24 alice kernel: eth2: 10/100 speed: disabling TSO Jun 18 19:05:29 alice kernel: eth2: Link is Down Jun 18 19:05:33 alice kernel: eth2: Link is Up 100 Mbps Full Duplex, Flow Control: None Jun 18 19:05:33 alice kernel: eth2: 10/100 speed: disabling TSO Jun 18 19:07:26 alice kernel: eth2: Link is Down Jun 18 19:07:29 alice kernel: eth2: Link is Up 100 Mbps Full Duplex, Flow Control: None Jun 18 19:07:29 alice kernel: eth2: 10/100 speed: disabling TSO Jun 18 19:07:36 alice kernel: eth2: Link is Down Jun 18 19:07:38 alice kernel: eth2: Link is Up 100 Mbps Full Duplex, Flow Control: None Jun 18 19:07:38 alice kernel: eth2: 10/100 speed: disabling TSO Consider that if I don't mount GFS, the node is not fenced and the failover domains becomes active. So, I guess the problem is in GFS... and not for example with the NIC. 
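(Which interface the cluster heartbeat is actually riding on, and whether the bond and its slaves are healthy, can be double-checked with commands along these lines; the output will of course differ from system to system:

# cman_tool status | grep -i 'node addresses'
# cat /proc/net/bonding/bond0
# ethtool eth2
)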
Here is my configuration:

# cat /etc/drbd.conf
global { usage-count no; }
resource r1 {
  protocol C;
  syncer { rate 10M; verify-alg sha1; }
  startup { become-primary-on both; wfc-timeout 150; }
  disk { on-io-error detach; }
  net {
    allow-two-primaries;
    cram-hmac-alg "sha1";
    shared-secret "123456";
    after-sb-0pri discard-least-changes;
    after-sb-1pri violently-as0p;
    after-sb-2pri violently-as0p;
    rr-conflict violently;
    ping-timeout 50;
  }
  on alice {
    device /dev/drbd0;
    disk /dev/sda2;
    address 10.17.44.116:7789;
    meta-disk internal;
  }
  on bob {
    device /dev/drbd0;
    disk /dev/sda2;
    address 10.17.44.117:7789;
    meta-disk internal;
  }
}

# cat /etc/cluster/cluster.conf

# cat /etc/hosts:
127.0.0.1 localhost.localdomain localhost
172.17.44.116 alice
172.17.44.117 bob

# ifconfig
bond0  Link encap:Ethernet HWaddr 00:15:17:51:70:38
       inet addr:10.17.44.116 Bcast:10.17.44.255 Mask:255.255.255.0
       inet6 addr: fe80::215:17ff:fe51:7038/64 Scope:Link
       UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
       RX packets:49984 errors:0 dropped:0 overruns:0 frame:0
       TX packets:83669 errors:0 dropped:0 overruns:0 carrier:0
       collisions:11221 txqueuelen:0
       RX bytes:16151284 (15.4 MiB) TX bytes:102618030 (97.8 MiB)

eth0   Link encap:Ethernet HWaddr 00:15:17:51:70:38
       UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
       RX packets:49984 errors:0 dropped:0 overruns:0 frame:0
       TX packets:83669 errors:0 dropped:0 overruns:0 carrier:0
       collisions:11221 txqueuelen:100
       RX bytes:16151284 (15.4 MiB) TX bytes:102618030 (97.8 MiB)
       Memory:f9140000-f9160000

eth1   Link encap:Ethernet HWaddr 00:15:17:51:70:38
       UP BROADCAST SLAVE MULTICAST MTU:1500 Metric:1
       RX packets:0 errors:0 dropped:0 overruns:0 frame:0
       TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:1000
       RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
       Memory:f91a0000-f91c0000

eth2   Link encap:Ethernet HWaddr 00:19:99:29:08:8B
       inet addr:172.17.44.116 Bcast:172.17.44.255 Mask:255.255.255.0
       inet6 addr: fe80::219:99ff:fe29:88b/64 Scope:Link
       UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
       RX packets:20 errors:0 dropped:0 overruns:0 frame:0
       TX packets:45 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:100
       RX bytes:1200 (1.1 KiB) TX bytes:7902 (7.7 KiB)
       Memory:f9200000-f9220000

lo     Link encap:Local Loopback
       inet addr:127.0.0.1 Mask:255.0.0.0
       inet6 addr: ::1/128 Scope:Host
       UP LOOPBACK RUNNING MTU:16436 Metric:1
       RX packets:3541 errors:0 dropped:0 overruns:0 frame:0
       TX packets:3541 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:0
       RX bytes:464552 (453.6 KiB) TX bytes:464552 (453.6 KiB)

I hope there is someone just experienced this bad issue. Thanks in advance.

-- Giuseppe

From giuseppe.fuggiano at gmail.com Thu Jun 18 20:07:14 2009
From: giuseppe.fuggiano at gmail.com (Giuseppe Fuggiano)
Date: Thu, 18 Jun 2009 22:07:14 +0200
Subject: [Linux-cluster] Re: DRBD+GFS - Link is down, Link is up
In-Reply-To: <1e09d9070906181222m6cbacdf0mcb6e97f7dcd33bd0@mail.gmail.com>
References: <1e09d9070906181222m6cbacdf0mcb6e97f7dcd33bd0@mail.gmail.com>
Message-ID: <1e09d9070906181307y595a3bc9hb2c89d46f3a9d424@mail.gmail.com>

2009/6/18 Giuseppe Fuggiano :
[snip]
> eth2 is up and down....
>
> Jun 18 19:05:15 alice kernel: eth2: Link is Up 100 Mbps Full Duplex,
> Flow Control: None
> Jun 18 19:05:15 alice kernel: eth2: 10/100 speed: disabling TSO
> Jun 18 19:05:21 alice kernel: eth2: Link is Down
> Jun 18 19:05:24 alice kernel: eth2: Link is Up 100 Mbps Full Duplex,
[snip]
>
> Consider that if I don't mount GFS, the node is not fenced and the
> failover domains becomes active.
> So, I guess the problem is in GFS... and not for example with the NIC. Trying to use bond0 as heartbeat, I discovered that eth2 stuff doesn't affect the infinite fencing behaviour... -- Giuseppe From tom at netspot.com.au Fri Jun 19 02:47:59 2009 From: tom at netspot.com.au (Tom Lanyon) Date: Fri, 19 Jun 2009 12:17:59 +0930 Subject: [Linux-cluster] Cluster among geographically separated nodes ? In-Reply-To: <0a2fe75edcf31e38421bed12af83618a@localhost> References: <68867af80906180441o4a7fac7fhd8338484cac4db57@mail.gmail.com> <0a2fe75edcf31e38421bed12af83618a@localhost> Message-ID: <87128C8B-AC72-4E56-9FAC-D4B84E2578DA@netspot.com.au> On 18/06/2009, at 9:42 PM, Gordan Bobic wrote: > On Thu, 18 Jun 2009 17:11:42 +0530, Brahadambal Srinivasan > wrote: >> 2. Fencing - any special methods to fence ? > > Just be aware that if your site interconnect goes down, you'll end > up with > a hung cluster, since the nodes will disconnect and be unable to > fence each > other. You could offset that by having separate cluster and fencing > interconnects, but you would also need to look into quorum - you > need n/2+1 > nodes for quorum, so to make this work sensibly you'd need at least > three > sites - otherwise if you lose the bigger site you lose the whole > cluster > anyway. This question came up last week as well so I have been thinking about the options here. Gordan's suggestion of three sites is a good one but may not be feasible for some. If you are using replicated SAN LUN(s) for your shared storage, the LUN is only ever going to be active at one site. So, if you lose connectivity between sites you obviously want the cluster to remain operational at the site with the active storage LUN. I can imagine a cross-site accessible qdisk *almost* solving this problem. The remaining issue, as I see it, is that if your network connectivity is lost the cluster will pause all services until it has successfully removed the failed nodes -- if it can't fence these nodes due to the lost network connectivity, you may end up with a site that effectively has quorum but all services are still hung. This sort of issue would especially arise if, for example, you lost ethernet connectivity but not FC/storage connectivity - the nodes at the remote site would still be able to access the qdisk. Perhaps a combination of power fencing (via ethernet) + storage fencing (on the local side of the SAN) could make this a workable solution? Regards, Tom -- Tom Lanyon Senior Systems Engineer NetSpot Pty Ltd From vcmarti at sph.emory.edu Fri Jun 19 14:13:54 2009 From: vcmarti at sph.emory.edu (Vernard C. Martin) Date: Fri, 19 Jun 2009 10:13:54 -0400 Subject: [Linux-cluster] fencing Cisco MDS 9134 w/ RHEL5 Message-ID: <4A3B9D22.4080908@sph.emory.edu> I can't seem to find any evidence that this fiber switch has a fencing agent for RHEL4. There seems to be some documentation of it being supported in RHEL 5.4. Is it reasonable to just port the agent or am I missing some technical detail that the agent requires that is in the newer kernel? -- Vernard Martin Applications Developer/Analyst Email: vcmarti at sph.emory.edu Desk:404.727.2076 Office of Information Technology -Rollins School of Public Health From alietsantiesteban at gmail.com Sat Jun 20 03:02:20 2009 From: alietsantiesteban at gmail.com (Aliet Santiesteban Sifontes) Date: Fri, 19 Jun 2009 23:02:20 -0400 Subject: [Linux-cluster] Will redhat release the srpms of cluster suite for rhel-4.8 to the public??? 
Message-ID: <365467590906192002m3e4991d2m2e74ac26b1134fa5@mail.gmail.com> Hi, just wondering if redhat will release the srpms for the cluster suite updated for rhel-4.8???, I have been looking for it in redhat ftp site, but can not find it. Any ideas?? Best regards From fdinitto at redhat.com Sat Jun 20 11:19:49 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Sat, 20 Jun 2009 13:19:49 +0200 Subject: [Linux-cluster] Cluster 3.0.0.rc3 release Message-ID: <1245496789.3665.328.camel@cerberus.int.fabbione.net> The cluster team and its community are proud to announce the 3.0.0.rc3 release candidate from the STABLE3 branch. The development cycle for 3.0.0 is completed. The STABLE3 branch is now collecting only bug fixes and minimal update required to build and run on top of the latest upstream kernel/corosync/openais. Everybody with test equipment and time to spare, is highly encouraged to download, install and test this release candidate and more important report problems. This is the time for people to make a difference and help us testing as much as possible. In order to build the 3.0.0.rc3 release you will need: - corosync 0.98 - openais 0.97 - linux kernel 2.6.29 The new source tarball can be downloaded here: ftp://sources.redhat.com/pub/cluster/releases/cluster-3.0.0.rc3.tar.gz https://fedorahosted.org/releases/c/l/cluster/cluster-3.0.0.rc3.tar.gz At the same location is now possible to find separated tarballs for fence-agents and resource-agents as previously announced (http://www.redhat.com/archives/cluster-devel/2009-February/msg00003.htm) To report bugs or issues: https://bugzilla.redhat.com/ Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other sysadministrators or power users. Happy clustering, Fabio Under the hood (from 3.0.0.rc2): Abhijith Das (2): gfs-kernel: enable FS_HAS_FREEZE gfs-kernel: bz479421 - gfs_tool: page allocation failure. order:4, mode:0xd0 Andrew Price (2): gfs2-utils: Clean up leftover prog_name globals fsck.gfs2: Remove compute_height Bob Peterson (10): mount failure after gfs2_edit restoremeta of GFS file system gfs2_edit savemeta needs to save freemeta blocks gfs2_edit: Fix indirect block scrolling Correction to an earlier commit. Buffers were being updated Removed check for incorrect height GFS2: gfs2_edit savemeta wasn't saving indirect eattribute blocks GFS2: gfs2_edit savemeta wasn't saving ea sub-blocks GFS2: fsck.gfs2 sometimes needs to be run twice fsck.gfs2 writing bitmap when -n specified Fixed compiler warnings and errors that crept in. Christine Caulfield (9): dlm: don't print an error from lockdump if there are no locks. cman: More changes for the latest corosync API cman: call corosync->request_shutdown on cman-tool leave cman: Allow use of broadcast communications gfs2: Fix includes for building on rawhide cman: Change some more ais references to Corosync fence: Allow IP addresses as node names cman: Remove references to ccs in the man pages cman: Catch failure to determine default multicast address David Teigland (8): fenced/dlm_controld/gfs_controld: dlm_controld: remove unused plock_exit dlm_tool: fix shadow warnings gfs_control: fix shadow warnings fenced: avoid static warnings dlm_tool: fix warning fenced: remove const string warnings fenced: fix id_info struct alignment Fabio M. 
Di Nitto (62): gfs: fix most of the warnings spotted by paranoia-cflags dlm: fix function prototypes libdlmcontrol: fix const warning libdlmcontrol: make function static dlm_tool: constify functions dlm_tool: make functions static dlm_tool: fix format warnings libfenced: fix const warning fence_node: fix const warning fence_tool: fix const warning fenced: fix function declaration libgroup: fix const warning dlm_controld: fix function declaration warning dlm_controld: fix const warning dlm_controld: fix return warning in plock dlm_controld: make functions static libgfscontrol: fix const warnings libgfscontrol: make functions static gfs_control: fix const warnings gfs_control: make functions static gfs_controld: fix function declaration gfs_controld: fix const warnings gfs_controld: ifdef out unused code group_tool: fix const warnings group_tool: fix function declaration group_tool: make functions static group daemon: fix function declaration group_daemon: fix const warnings group_tool: fix shadow warning group_daemon: make functions static group_dameon: ifdef out unused code dlm: fix void arithmetic group: fix void arithmetic group: fix print formats fence: fix void arithmetic fence: fix print formats fenced: add const to ccs functions fenced: add const to msg_name fenced: add const to setup_listener fenced: add const bits to recover.c cman: fix logging config and major cleanup gfs2: fix build warnings spotted by paranoia cflags build: set paranoia build warnings by default gfs2: restore libgfs2.h vfprintf call gfs: fix endian conversion gfs2: fix endian conversion gfs2: don't swab in place gfs: don't swab in place cman init: add support for join and leave options qdisk: fix disk scanning check in sysfs build: drop unrequired include dir build: fix build dependency for ccs_tool build: clean up perl bindings .d files config: drop obselete build check in libccs scandisk: remove build debug entry (now unrequired) qdisk: remove build DEBUG option in favour of runtime build: fix clean operation for .pc files dlm: fix libdlm_lt pc file module name build: allow easy build of test tarballs for the whole set build: drop dependency on libvolume_id gfs2: drop leftover file from import cman init: fix groupd check Jan Friesse (1): CMAN: Support for openaisserviceenablestable service loader Lon Hohberger (9): qdisk: Add reporting for I/O hangs to quourm disk rgmanager: Allow reboot if main proc. 
is killed rgmanager: Make vm.sh use libvirt rgmanager: Remove extra checks from Oracle agents rgmanager: Fix up multiple Oracle instance handling rgmanager: Check for all ORA- errors on start/stop group: Make group_tool checks more robust rgmanager: Fix restart-after-migrate issue rgmanager: Fix noise when running in foreground Marc Grimme (1): rgmanager: Implement explicit ordering for failover Marek 'marx' Grac (8): fence_scsi_test.pl: #499871 fence_scsi_test.pl does not check for sg_persist in the path fence_drac5: #496724 - support for modulename in drac5 agent fence_apc: #501586 - fence_apc fails with pexpect exception apache.sh: #489785 - does not handle a valid /etc/httpd/conf/httpd.conf configuration correctly fence_lpar: fence_lpar can't log in to IVM systems fence_agents: #501586 - fence agents fails with pexpect exception fence_lpar: #504705 - fence_lpar: lssyscfg command on HMC can take longer than SHELL_TIMEOUT fence agents: Option for setting port for telnet/ssh/ssl used by fence agent Steven Whitehouse (19): Remove unused code from various places gfs2_tool: gettext support mkfs.gfs2: Add gettext support gfs2_tool: Fix misplaced bracket that bob spotted fsck.gfs2: Add gettext support Makefile: Fix problem which crept in earlier gfs2_tool: Use FIFREEZE/FITHAW ioctl fsck.gfs2: Add gettext support gfs2_tool: Remove obsolete subcommands libgfs2: Remove unused library function gfs2_tool: Remove ref to non-existent sysfs file gfs2_tool: Remove code to read args/* gfs2_tool: Fix help message man: Remove obsolete info from mount.gfs2 man page man: More updates fsck: Fix up merge issue gfs2_tool: Remove df command from gfs2_tool mkfs.gfs2: Remove dep on libvolume_id mkfs.gfs: Remove dep on libvolume_id Makefile | 5 +- cman/cman_tool/join.c | 4 +- cman/cman_tool/main.c | 4 +- cman/daemon/Makefile | 1 - cman/daemon/ais.c | 51 +- cman/daemon/ais.h | 1 + cman/daemon/barrier.c | 13 +- cman/daemon/cman-preconfig.c | 133 ++++-- cman/daemon/cmanconfig.c | 3 +- cman/daemon/commands.c | 98 ++-- cman/daemon/commands.h | 2 +- cman/daemon/daemon.c | 35 +- cman/daemon/daemon.h | 2 +- cman/daemon/logging.c | 29 -- cman/daemon/logging.h | 17 - cman/init.d/cman.in | 22 +- cman/man/cman.5 | 6 +- cman/man/cman_tool.8 | 14 +- cman/qdisk/Makefile | 6 +- cman/qdisk/disk.c | 23 +- cman/qdisk/iostate.c | 142 ++++++ cman/qdisk/iostate.h | 17 + cman/qdisk/main.c | 7 + cman/qdisk/scandisk.c | 6 +- cman/qdisk/scandisk.h | 6 +- config/libs/libccsconfdb/ccs.h | 4 - config/plugins/xml/Makefile | 1 - config/tools/ccs_tool/Makefile | 4 +- configure | 37 +- dlm/libdlm/libdlm.c | 4 +- dlm/libdlm/libdlm.h | 4 +- dlm/libdlm/libdlm_lt.pc.in | 2 +- dlm/libdlmcontrol/main.c | 8 +- dlm/tool/main.c | 56 +-- fence/agents/alom/fence_alom.py | 11 +- fence/agents/apc/fence_apc.py | 4 +- fence/agents/bladecenter/fence_bladecenter.py | 11 +- fence/agents/drac/fence_drac5.py | 11 +- fence/agents/ilo/fence_ilo.py | 2 +- fence/agents/ldom/fence_ldom.py | 11 +- fence/agents/lib/fencing.py.py | 25 +- fence/agents/lpar/fence_lpar.py | 17 +- fence/agents/rsa/fence_rsa.py | 11 +- fence/agents/scsi/fence_scsi_test.pl | 15 + fence/agents/virsh/fence_virsh.py | 11 +- fence/agents/wti/fence_wti.py | 11 +- fence/agents/xvm/Makefile | 1 - fence/fence_node/fence_node.c | 4 +- fence/fence_tool/fence_tool.c | 8 +- fence/fenced/config.c | 8 +- fence/fenced/config.h | 2 +- fence/fenced/cpg.c | 7 +- fence/fenced/fd.h | 8 +- fence/fenced/main.c | 26 +- fence/fenced/member_cman.c | 14 +- fence/fenced/recover.c | 11 +- fence/libfenced/main.c | 6 +- 
gfs-kernel/src/gfs/gfs_ondisk.h | 38 +- gfs-kernel/src/gfs/ioctl.c | 5 +- gfs-kernel/src/gfs/ops_fstype.c | 2 +- gfs/gfs_debug/basic.c | 2 +- gfs/gfs_debug/readfile.c | 2 +- gfs/gfs_debug/util.c | 14 +- gfs/gfs_fsck/eattr.c | 2 +- gfs/gfs_fsck/file.c | 6 +- gfs/gfs_fsck/fs_bits.c | 2 +- gfs/gfs_fsck/fs_dir.c | 56 +- gfs/gfs_fsck/fs_inode.c | 6 +- gfs/gfs_fsck/fs_inode.h | 2 +- gfs/gfs_fsck/initialize.c | 2 +- gfs/gfs_fsck/log.c | 4 +- gfs/gfs_fsck/log.h | 2 +- gfs/gfs_fsck/main.c | 12 +- gfs/gfs_fsck/metawalk.c | 32 +- gfs/gfs_fsck/ondisk.c | 34 +- gfs/gfs_fsck/pass1.c | 10 +- gfs/gfs_fsck/pass1b.c | 9 +- gfs/gfs_fsck/pass1c.c | 18 +- gfs/gfs_fsck/pass2.c | 13 +- gfs/gfs_fsck/pass3.c | 2 +- gfs/gfs_fsck/pass4.c | 4 +- gfs/gfs_fsck/pass5.c | 6 +- gfs/gfs_fsck/super.c | 12 +- gfs/gfs_fsck/util.c | 14 +- gfs/gfs_grow/main.c | 35 +- gfs/gfs_jadd/main.c | 39 +- gfs/gfs_mkfs/Makefile | 5 +- gfs/gfs_mkfs/device_geometry.c | 2 +- gfs/gfs_mkfs/main.c | 136 ++++-- gfs/gfs_mkfs/structures.c | 6 +- gfs/gfs_quota/check.c | 34 +- gfs/gfs_quota/gfs_quota.h | 4 + gfs/gfs_quota/layout.c | 25 +- gfs/gfs_quota/main.c | 45 ++- gfs/gfs_tool/counters.c | 6 +- gfs/gfs_tool/df.c | 40 +- gfs/gfs_tool/gfs_tool.h | 6 +- gfs/gfs_tool/layout.c | 57 ++- gfs/gfs_tool/misc.c | 78 ++- gfs/gfs_tool/tune.c | 12 +- gfs/gfs_tool/util.c | 10 +- gfs/libgfs/file.c | 6 +- gfs/libgfs/fs_bits.c | 2 +- gfs/libgfs/fs_dir.c | 46 +- gfs/libgfs/fs_inode.c | 4 +- gfs/libgfs/libgfs.h | 5 +- gfs/libgfs/log.c | 4 +- gfs/libgfs/ondisk.c | 36 +- gfs/libgfs/super.c | 1 - gfs/libgfs/util.c | 14 +- gfs2/convert/gfs2_convert.c | 53 +- gfs2/edit/gfs2hex.c | 76 ++-- gfs2/edit/gfs2hex.h | 4 + gfs2/edit/hexedit.c | 453 ++++++++---------- gfs2/edit/hexedit.h | 5 +- gfs2/edit/savemeta.c | 204 ++++++--- gfs2/fsck/eattr.c | 9 +- gfs2/fsck/fs_recovery.c | 39 +- gfs2/fsck/initialize.c | 66 ++-- gfs2/fsck/link.c | 28 +- gfs2/fsck/lost_n_found.c | 22 +- gfs2/fsck/main.c | 121 +++--- gfs2/fsck/metawalk.c | 265 ++++++---- gfs2/fsck/pass1.c | 304 +++++++----- gfs2/fsck/pass1b.c | 187 ++++--- gfs2/fsck/pass1c.c | 217 ++++++--- gfs2/fsck/pass2.c | 371 +++++++++------ gfs2/fsck/pass3.c | 105 ++-- gfs2/fsck/pass4.c | 64 ++-- gfs2/fsck/pass5.c | 50 +- gfs2/fsck/rgrepair.c | 88 ++-- gfs2/fsck/test.c | 8 - gfs2/fsck/util.c | 35 +-- gfs2/fsck/util.h | 1 - gfs2/libgfs2/block_list.c | 34 +- gfs2/libgfs2/buf.c | 4 +- gfs2/libgfs2/fs_bits.c | 61 +++ gfs2/libgfs2/fs_geometry.c | 4 +- gfs2/libgfs2/fs_ops.c | 62 ++- gfs2/libgfs2/gfs1.c | 5 +- gfs2/libgfs2/gfs2_log.c | 7 +- gfs2/libgfs2/libgfs2.h | 26 +- gfs2/libgfs2/misc.c | 92 +---- gfs2/libgfs2/rgrp.c | 8 +- gfs2/man/gfs2_convert.8 | 16 +- gfs2/man/gfs2_grow.8 | 7 +- gfs2/man/gfs2_quota.8 | 2 +- gfs2/man/gfs2_tool.8 | 67 +-- gfs2/man/mount.gfs2.8 | 39 +- gfs2/mkfs/Makefile | 2 - gfs2/mkfs/gfs2_mkfs.h | 2 - gfs2/mkfs/main.c | 19 +- gfs2/mkfs/main_grow.c | 64 ++-- gfs2/mkfs/main_jadd.c | 155 +++--- gfs2/mkfs/main_mkfs.c | 290 +++++++---- gfs2/mount/mount.gfs2.c | 35 +-- gfs2/mount/mtab.c | 1 - gfs2/mount/util.c | 11 +- gfs2/mount/util.h | 5 +- gfs2/quota/check.c | 33 +-- gfs2/quota/gfs2_quota.h | 6 +- gfs2/quota/main.c | 12 +- gfs2/tool/Makefile | 3 +- gfs2/tool/df.c | 290 ----------- gfs2/tool/gfs2_tool.h | 16 - gfs2/tool/main.c | 139 ++---- gfs2/tool/misc.c | 257 ++-------- gfs2/tool/sb.c | 62 ++-- gfs2/tool/tune.c | 26 +- group/daemon/app.c | 40 +- group/daemon/cpg.c | 28 +- group/daemon/gd_internal.h | 14 +- group/daemon/joinleave.c | 4 +- group/daemon/main.c | 20 +- group/dlm_controld/action.c | 11 +- 
group/dlm_controld/config.c | 8 +- group/dlm_controld/cpg.c | 6 +- group/dlm_controld/deadlock.c | 2 +- group/dlm_controld/dlm_daemon.h | 12 +- group/dlm_controld/main.c | 14 +- group/dlm_controld/netlink.c | 2 +- group/dlm_controld/plock.c | 11 +- group/gfs_control/main.c | 30 +- group/gfs_controld/config.c | 6 +- group/gfs_controld/cpg-new.c | 6 +- group/gfs_controld/cpg-old.c | 18 +- group/gfs_controld/gfs_daemon.h | 12 +- group/gfs_controld/group.c | 2 +- group/gfs_controld/main.c | 8 +- group/gfs_controld/plock.c | 12 +- group/gfs_controld/util.c | 8 +- group/lib/libgroup.c | 8 +- group/lib/libgroup.h | 2 +- group/libgfscontrol/main.c | 8 +- group/tool/main.c | 18 +- make/clean.mk | 2 +- make/defines.mk.input | 3 - make/perl-binding-common.mk | 2 +- make/release.mk | 50 +- rgmanager/src/clulib/Makefile | 2 +- rgmanager/src/daemons/Makefile | 2 +- rgmanager/src/daemons/groups.c | 1 - rgmanager/src/daemons/restree.c | 13 +- rgmanager/src/daemons/rg_state.c | 16 +- rgmanager/src/daemons/watchdog.c | 24 +- rgmanager/src/resources/apache.sh | 6 +- rgmanager/src/resources/default_event_script.sl | 150 ++++++- rgmanager/src/resources/oracledb.sh.in | 28 +- rgmanager/src/resources/service.sh | 19 +- rgmanager/src/resources/vm.sh | 608 +++++++++++++++------- rgmanager/src/utils/Makefile | 2 +- 211 files changed, 4240 insertions(+), 3606 deletions(-) -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From thomas at sjolshagen.net Sat Jun 20 20:06:11 2009 From: thomas at sjolshagen.net (Thomas Sjolshagen) Date: Sat, 20 Jun 2009 16:06:11 -0400 Subject: [Linux-cluster] clusvcadm -M -m fails with "Invalid operation for resource" Message-ID: <20090620160611.20273n7qqnh4jd4j@www.sjolshagen.net> Hi, I'm trying to do a migration to another member of the cluster but it fails with: # clusvcadm -M samba -m host2 Trying to migrate service:samba to virt0-backup.sjolshagen.net...Invalid operation for resource I'm running Fedora 11 with rgmanager-3.0.0-15.rc1.fc11.x86_64 installed as well as a downloaded copy of the 5/21 version of vm.sh from the git.fedorahosted.org repository. I've configured the resource as follows: And the guest container files are hosted on a (previously mounted) gfs2 file system. Is this a rgmanager shortcoming (rgmanager needs to be coded to support virsh & live migration) or - more likely - user error? Thanks in advance // Thomas ---------------------------------------------------------------- This message was sent using IMP, the Internet Messaging Program. From fdinitto at redhat.com Sun Jun 21 05:43:59 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Sun, 21 Jun 2009 07:43:59 +0200 Subject: [Linux-cluster] Will redhat release the srpms of cluster suite for rhel-4.8 to the public??? In-Reply-To: <365467590906192002m3e4991d2m2e74ac26b1134fa5@mail.gmail.com> References: <365467590906192002m3e4991d2m2e74ac26b1134fa5@mail.gmail.com> Message-ID: <1245563039.3665.360.camel@cerberus.int.fabbione.net> On Fri, 2009-06-19 at 23:02 -0400, Aliet Santiesteban Sifontes wrote: > Hi, just wondering if redhat will release the srpms for the cluster > suite updated for rhel-4.8???, I have been looking for it in redhat > ftp site, but can not find it. > Any ideas?? If you can't find the srpm, you can always use git to get the code. 
http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=shortlog;h=refs/heads/RHEL48 the code is public, no secrets patches or anything like that :) I don't exclude the possibility that the srpm has not been released, but if so, it's probably purely a mistake. I am CC'ing Chris that can investigate where it is. Fabio From henry.robertson at hjrconsulting.com Mon Jun 22 03:57:34 2009 From: henry.robertson at hjrconsulting.com (Henry Robertson) Date: Sun, 21 Jun 2009 23:57:34 -0400 Subject: [Linux-cluster] Re: clusvcadm -M -m fails with "Invalid operation for resource" Message-ID: Today's Topics: > > 1. clusvcadm -M -m fails with "Invalid > operation for resource" (Thomas Sjolshagen) > 2. Re: Will redhat release the srpms of cluster suite for > rhel-4.8 to the public??? (Fabio M. Di Nitto) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Sat, 20 Jun 2009 16:06:11 -0400 > From: Thomas Sjolshagen > Subject: [Linux-cluster] clusvcadm -M -m fails > with "Invalid operation for resource" > To: linux-cluster at redhat.com > Message-ID: <20090620160611.20273n7qqnh4jd4j at www.sjolshagen.net> > Content-Type: text/plain; charset=ISO-8859-1; DelSp="Yes"; > format="flowed" > > Hi, > > I'm trying to do a migration to another member of the cluster but it > fails with: > > # clusvcadm -M samba -m host2 > Trying to migrate service:samba to > virt0-backup.sjolshagen.net...Invalid operation for resource > > I'm running Fedora 11 with rgmanager-3.0.0-15.rc1.fc11.x86_64 > installed as well as a downloaded copy of the 5/21 version of vm.sh > from the git.fedorahosted.org repository. I've configured the resource > as follows: > > > > recovery="relocate" snapshot="/cluster/kvm-guests/snapshots" > use_virsh="1" exclusive="1" hypervisor="qemu" > migration_mapping="host1:host2,host2:host1" > hypervisor_uri="qemu+ssh:///system" /> > > > > > > > And the guest container files are hosted on a (previously mounted) > gfs2 file system. > > Is this a rgmanager shortcoming (rgmanager needs to be coded to > support virsh & live migration) or - more likely - user error? > > Thanks in advance > // Thomas > > ---------------------------------------------------------------- Are you sure you don't mean to relocate the service to host2 instead of Migrate? -R will stop/start a service like samba onto another node. clusvcadm -r -m I wasn't aware that migration worked for anything other than moving VM's around. Henry -------------- next part -------------- An HTML attachment was scrubbed... URL: From thomas at sjolshagen.net Mon Jun 22 13:43:43 2009 From: thomas at sjolshagen.net (Thomas Sjolshagen) Date: Mon, 22 Jun 2009 09:43:43 -0400 Subject: [Linux-cluster] Re: clusvcadm -M -m fails with "Invalid operation for resource" In-Reply-To: References: Message-ID: <20090622094343.10362qjp3x4g1jtb@www.sjolshagen.net> Quoting Henry Robertson : > Today's Topics: >> >> 1. clusvcadm -M -m fails with "Invalid >> operation for resource" (Thomas Sjolshagen) .. > > > Are you sure you don't mean to relocate the service to host2 instead > of Migrate? -R will stop/start a service like samba onto another node. > clusvcadm -r -m > > I wasn't aware that migration worked for anything other than moving > VM's around. > > Henry > If you look at the resource definition, you'll see that I'm trying to migrate a VM (the KVM guest is called samba since it hosts a Samba instance). 
// Thomas ---------------------------------------------------------------- This message was sent using IMP, the Internet Messaging Program. From henry.robertson at hjrconsulting.com Mon Jun 22 22:12:07 2009 From: henry.robertson at hjrconsulting.com (Henry Robertson) Date: Mon, 22 Jun 2009 18:12:07 -0400 Subject: [Linux-cluster] Re: Linux-cluster Digest, Vol 62, Issue 21 In-Reply-To: <20090622160008.589096191B5@hormel.redhat.com> References: <20090622160008.589096191B5@hormel.redhat.com> Message-ID: Ah. Does manual migration through virsh work rather than clusvcadm? If it does -- I'd put rgmanager into some extra logging by editing /etc/init.d/rgmanager with RGMGR_OPTS="-dddd" under the RGMGRD part. Then restart rgmanager and check logs for more info after trying clusvcadm -M again. (add debug to host / target servers and see if you catch anything new) Good luck Henry -------------- next part -------------- An HTML attachment was scrubbed... URL: From thomas at sjolshagen.net Tue Jun 23 02:46:38 2009 From: thomas at sjolshagen.net (Thomas Sjolshagen) Date: Mon, 22 Jun 2009 22:46:38 -0400 Subject: [Linux-cluster] Re: Guest MIgration w/rgmanager - WAS: Linux-cluster Digest, Vol 62, Issue 21 In-Reply-To: References: <20090622160008.589096191B5@hormel.redhat.com> Message-ID: <20090622224638.20492l2x5u5mbpxq@www.sjolshagen.net> Quoting Henry Robertson : > On Mon, Jun 22, 2009 at 12:00 PM, wrote: > ... > > Ah. Does manual migration through virsh work rather than clusvcadm? virsh migrate --live qemu+ssh://2nd node/system Migrates the guest to the other cluster member w/no objections. > If it does -- I'd put rgmanager into some extra logging by editing > /etc/init.d/rgmanager with RGMGR_OPTS="-dddd" under the RGMGRD part. Added "-dddd" to /etc/sysconfig/rgmanager, restarted rgmanager and verified that rgmanager is running w/the option set. I do not see any increase in logging from the default setting of "-d"? > Then restart rgmanager and check logs for more info after trying clusvcadm > -M again. (add debug to host / target servers and see if you catch anything > new) Attempted another clusvcadm -M -m <2nd cluster node>, and I see nothing in either of the /var/log/cluster/rgmanager.log files. Not even that the operation was attempted?!? // Thomas ---------------------------------------------------------------- This message was sent using IMP, the Internet Messaging Program.
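The XML of the vm resource in Thomas's first post was stripped by the list archiver, so only loose attribute fragments survive above. Purely as an illustration, a libvirt-managed guest that rgmanager can migrate is normally declared as a vm resource directly under <rm> in cluster.conf, roughly as below; the element placement, the resource name "samba" and the migrate attribute are reconstructions from those fragments and from the vm.sh agent's documented parameters, not Thomas's actual configuration:

  <rm>
    <vm name="samba" use_virsh="1" hypervisor="qemu"
        hypervisor_uri="qemu+ssh:///system"
        migration_mapping="host1:host2,host2:host1"
        migrate="live" recovery="relocate"
        snapshot="/cluster/kvm-guests/snapshots" exclusive="1"/>
  </rm>

For what it's worth, clusvcadm only knows how to migrate vm-class resources, and the "Trying to migrate service:samba" output shows it operating on service:samba, either because the guest really is wrapped in a <service> block or simply because service: is the default prefix when none is given. Against a plain <vm/> definition of the shape above, "clusvcadm -M vm:samba -m host2" is the form that should reach vm.sh's migrate path.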
From mrugeshkarnik at gmail.com Tue Jun 23 07:01:25 2009 From: mrugeshkarnik at gmail.com (Mrugesh Karnik) Date: Tue, 23 Jun 2009 12:31:25 +0530 Subject: [Linux-cluster] Re: clusvcadm -M -m fails with "Invalid operation for resource" In-Reply-To: <20090622094343.10362qjp3x4g1jtb@www.sjolshagen.net> References: <20090622094343.10362qjp3x4g1jtb@www.sjolshagen.net> Message-ID: <200906231231.25911.mrugeshkarnik@gmail.com> On Monday 22 Jun 2009 19:13:43 Thomas Sjolshagen wrote: > If you look at the resource definition, you'll see that I'm trying to > migrate a VM (the KVM guest is called samba since it hosts a Samba > instance). Is migration supported on KVM? I've tried it with Xen and works fine. The only gotcha was that the `nx' flag on the CPU needed to be available. Mrugesh From ironludo at free.fr Tue Jun 23 09:41:57 2009 From: ironludo at free.fr (LEROUX Ludovic) Date: Tue, 23 Jun 2009 11:41:57 +0200 Subject: [Linux-cluster] redhat cluster installation Message-ID: <8DF9888392AA48D5960BE531138BB981@siim94.local> hello all. I installed two redhat hat 5.2 servers with redhat cluster suite option. When i want to create a cluster with luci i got an error message: An error occurred when trying to contact any of the nodes in the rh-cluster cluster. Do you have any ideas? thanks. Ludovic -------------- next part -------------- An HTML attachment was scrubbed... URL: From reggaestar at gmail.com Tue Jun 23 09:48:42 2009 From: reggaestar at gmail.com (remi doubi) Date: Tue, 23 Jun 2009 09:48:42 +0000 Subject: [Linux-cluster] redhat cluster installation In-Reply-To: <8DF9888392AA48D5960BE531138BB981@siim94.local> References: <8DF9888392AA48D5960BE531138BB981@siim94.local> Message-ID: <3c88c73a0906230248r17c0ec6ct151a484115682e01@mail.gmail.com> is the ricci agent started on the two nodes ?? 2009/6/23 LEROUX Ludovic > hello all. > I installed two redhat hat 5.2 servers with redhat cluster suite option. > When i want to create a cluster with luci i got an error message: *An > error occurred when trying to contact any of the nodes in the rh-cluster > cluster.* > Do you have any ideas? > thanks. > Ludovic > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From amrossi at linux.it Tue Jun 23 09:51:17 2009 From: amrossi at linux.it (Andrea Modesto Rossi) Date: Tue, 23 Jun 2009 11:51:17 +0200 (CEST) Subject: [Linux-cluster] redhat cluster installation In-Reply-To: <8DF9888392AA48D5960BE531138BB981@siim94.local> References: <8DF9888392AA48D5960BE531138BB981@siim94.local> Message-ID: <38070.82.105.99.92.1245750677.squirrel@picard.linux.it> On Mar, 23 Giugno 2009 11:41 am, LEROUX Ludovic wrote: > hello all. > I installed two redhat hat 5.2 servers with redhat cluster suite option. > When i want to create a cluster with luci i got an error message: An error > occurred when trying to contact any of the nodes in the rh-cluster > cluster. > Do you have any ideas? hello, is /etc/hosts configured properly? try with IP address instead of the hostname. -- Andrea Modesto Rossi Fedora Ambassador +---------------------------------------------------------------------+ | Bello. Che gli diciamo? Che sono tutti stronzi monopolisti di merda,| | con i loro protocolli brevettati e i loro driver finestrosi? | | Ci sono! 
| | Alessandro Rubini | +---------------------------------------------------------------------+ From thomas at sjolshagen.net Tue Jun 23 12:07:20 2009 From: thomas at sjolshagen.net (Thomas Sjolshagen) Date: Tue, 23 Jun 2009 08:07:20 -0400 Subject: [Linux-cluster] Re: clusvcadm -M -m fails with "Invalid operation for resource" In-Reply-To: <200906231231.25911.mrugeshkarnik@gmail.com> References: <20090622094343.10362qjp3x4g1jtb@www.sjolshagen.net> <200906231231.25911.mrugeshkarnik@gmail.com> Message-ID: <20090623080720.9695183ennxm3r08@www.sjolshagen.net> Quoting Mrugesh Karnik : > On Monday 22 Jun 2009 19:13:43 Thomas Sjolshagen wrote: >> If you look at the resource definition, you'll see that I'm trying to >> migrate a VM (the KVM guest is called samba since it hosts a Samba >> instance). > > Is migration supported on KVM? I've tried it with Xen and works > fine. The only > gotcha was that the `nx' flag on the CPU needed to be available. > Yes, KVM supports both migration & live migration and with KVM-8* and libvirt 0.6.4 you can use "virsh migrate --live" to move running guests between nodes (live migrate). This fact is reflected in the upstream (git repo) version of the vm.sh resource script, but it seems like something - rgmanager itself? - is blocking it from even trying. //Thomas ---------------------------------------------------------------- This message was sent using IMP, the Internet Messaging Program. From mech at meteo.uni-koeln.de Tue Jun 23 14:32:16 2009 From: mech at meteo.uni-koeln.de (Mario Mech) Date: Tue, 23 Jun 2009 16:32:16 +0200 Subject: [Linux-cluster] system-config-cluster, secure, and fence_drac5 Message-ID: <4A40E770.9050001@meteo.uni-koeln.de> Hi all, I'm configuring a FailOver-cluster with CentOS 5.3 on two Dell PowerEdge 2950 with DRAC5 cards. The basic configuration with system-config-cluster worked fine after enabling telnet on the DRACs and I got the cluster running. Then I switched to ssh by manually editing the cluster.conf file, since system-config-cluster is not aware of fence_drac5. Unfortunately now the cluster.conf file is not readable anymore by system-config-cluster. I still want to use sys-con-clu, since there is still much to configure (services, failover-domains,....). Except of using telnet and fence_drac until the end of the configuration process, I have no other idea how to manage that. DOes anyone know how to include fence_drac5 and the secure="1" attribute in cluster.conf and still using sys-con-clu? All best Mario P.S. Is secure="1" in the right place? cluster.conf: -- Dr. Mario Mech Institute for Geophysics and Meteorology University of Cologne Zuelpicherstr. 
49a 50674 Cologne Germany t: +49 (0)221 - 470 - 1776 f: +49 (0)221 - 470 - 5198 e: mech at meteo.uni-koeln.de w: http://www.meteo.uni-koeln.de/~mmech/
From cfeist at redhat.com Wed Jun 24 13:56:03 2009 From: cfeist at redhat.com (Chris Feist) Date: Wed, 24 Jun 2009 09:56:03 -0400 (EDT) Subject: [Linux-cluster] Will redhat release the srpms of cluster suite for rhel-4.8 to the public??? In-Reply-To: <606164656.436501245851720923.JavaMail.root@zmail04.collab.prod.int.phx2.redhat.com> Message-ID: <1414381385.436541245851763590.JavaMail.root@zmail04.collab.prod.int.phx2.redhat.com> ----- "Aliet Santiesteban Sifontes" wrote: > Hi, just wondering if redhat will release the srpms for the cluster > suite updated for rhel-4.8???, I have been looking for it in redhat > ftp site, but can not find it. You should be able to find the 4.8 RHCS srpms here: /pub/redhat/linux/updates/enterprise/4AS/en/RHCS/SRPMS Let me know if anything appears to be missing. THanks, Chris > Any ideas?? > Best regards > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From mgrac at redhat.com Wed Jun 24 14:55:54 2009 From: mgrac at redhat.com (=?UTF-8?B?TWFyZWsgJ21hcngnIEdyw6Fj?=) Date: Wed, 24 Jun 2009 16:55:54 +0200 Subject: [Linux-cluster] fencing Cisco MDS 9134 w/ RHEL5 In-Reply-To: <4A3B9D22.4080908@sph.emory.edu> References: <4A3B9D22.4080908@sph.emory.edu> Message-ID: <4A423E7A.7000006@redhat.com> Hi, Vernard C. Martin wrote: > I can't seem to find any evidence that this fiber switch has a fencing > agent for RHEL4. There seems to be some documentation of it being > supported in RHEL 5.4. > > Is it reasonable to just port the agent or am I missing some technical > detail that the agent requires that is in the newer kernel? Agents for RHEL 5.4 "should" work also on RHEL 4 but you will have to copy agent together with fencing library (fencing.py and fencing_snmp.py). m, From esggrupos at gmail.com Thu Jun 25 08:06:22 2009 From: esggrupos at gmail.com (ESGLinux) Date: Thu, 25 Jun 2009 10:06:22 +0200 Subject: [Linux-cluster] using lacie ethernet disk raid as shared storage Message-ID: <3128ba140906250106h484d1647r2c8ab5c4ba439c5@mail.gmail.com> Hi all, I have a customer with has a Lacie Ethernet Disk RAID and wants to use it as a shared storage to use in a HA cluster. Which can be the best approach to use this kind of storage?
(iscsi, gnbd, nfs... ???) At first I thought I couldn't use it, but I can't believe that I can't do something with it. Any suggestion? Thanks in advance. ESG -------------- next part -------------- An HTML attachment was scrubbed... URL: From gordan at bobich.net Thu Jun 25 10:12:03 2009 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 25 Jun 2009 11:12:03 +0100 Subject: [Linux-cluster] using lacie ethernet disk raid as shared storage In-Reply-To: <3128ba140906250106h484d1647r2c8ab5c4ba439c5@mail.gmail.com> References: <3128ba140906250106h484d1647r2c8ab5c4ba439c5@mail.gmail.com> Message-ID: <0b3ac5a839d5c805cb88fc592d7dc3e3@localhost> On Thu, 25 Jun 2009 10:06:22 +0200, ESGLinux wrote: > Hi all, > I have a customer with has a Lacie Ethernet Disk RAID and wants to use it > as a shared storage to use in a HA cluster. > > Which can be the best approach to use this kind of storage? (iscsi, gnbd, > nfs... ???) I don't imagine for a moment that any of those would be supported, considering the target audience is unlikely to ever have heard of those protocols. It's likely to give you SMB/CIFS and nothing else. There's no reason why you couldn't use it for shared storage, but that is in no way related to RHCS. Also remember that a single SAN/NAS of whatever description is still a single point of failure, which makes a mockery of the concept of HA. This is also (shockingly) a point (willfully) overlooked by most administrators and architects. From esggrupos at gmail.com Thu Jun 25 10:38:04 2009 From: esggrupos at gmail.com (ESGLinux) Date: Thu, 25 Jun 2009 12:38:04 +0200 Subject: [Linux-cluster] using lacie ethernet disk raid as shared storage In-Reply-To: <0b3ac5a839d5c805cb88fc592d7dc3e3@localhost> References: <3128ba140906250106h484d1647r2c8ab5c4ba439c5@mail.gmail.com> <0b3ac5a839d5c805cb88fc592d7dc3e3@localhost> Message-ID: <3128ba140906250338h63ea6e64k7bb627ed1884d8e9@mail.gmail.com> Hi Gordan, thanks for your answer, I can mount this disk with NFS (also with CIFS, but I'm not using this protocol). My idea was to mount the disk with NFS on the 2 nodes of a Red Hat cluster, but I don't know if it is a good idea. (perhaps no :-( ) The cluster is going to serve an HA httpd service (I know, with this disk I have a SPOF, but that is all I have, no money for more :-(( ) Any suggestion with this scenario? Thanks again, ESG 2009/6/25 Gordan Bobic > On Thu, 25 Jun 2009 10:06:22 +0200, ESGLinux wrote: > > Hi all, > > I have a customer with has a Lacie Ethernet Disk RAID and wants to use it > > as a shared storage to use in a HA cluster. > > > > Which can be the best approach to use this kind of storage? (iscsi, gnbd, > > nfs... ???) > > I don't imagine for a moment that any of those would be supported, > considering the target audience is unlikely to ever have heard of those > protocols. It's likely to give you SMB/CIFS and nothing else. There's no > reason why you couldn't use it for shared storage, but that is in no way > related to RHCS. > > Also remember that a single SAN/NAS of whatever description is still a > single point of failure, which makes a mockery of the concept of HA. This > is also (shockingly) a point (willfully) overlooked by most administrators > and architects. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From gordan at bobich.net Thu Jun 25 11:04:57 2009 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 25 Jun 2009 12:04:57 +0100 Subject: [Linux-cluster] using lacie ethernet disk raid as shared storage In-Reply-To: <3128ba140906250338h63ea6e64k7bb627ed1884d8e9@mail.gmail.com> References: <3128ba140906250106h484d1647r2c8ab5c4ba439c5@mail.gmail.com> <0b3ac5a839d5c805cb88fc592d7dc3e3@localhost> <3128ba140906250338h63ea6e64k7bb627ed1884d8e9@mail.gmail.com> Message-ID: <1ea88057f3ad37b08df6a7e8798465e7@localhost> On Thu, 25 Jun 2009 12:38:04 +0200, ESGLinux wrote: > I can mount this disk with NFS (also with CIFS but I?m not using this > protocol) > > My idea was to mount the disk with NFS on the 2 nodes of a red hat cluster, > but I don?t know if it is a good idea. (perhaps no :-( ) There's no reason why you couldn't or shouldn't do this. If all you want is some shared storage and don't care about the single point of failure, then this is exactly what the device was intended for. :) > The cluster are going to serve a HA httpd service (I know, with this disk I > have a SPOF, but that is all I have, no money for more .-(( ) > > any suggestion with this scenario? It should "just work" as you described. NFS mount it on both nodes and point Apache at it as per usual. It'll probably work faster than a clustered file system solution. For redundancy, however, if you have enough disk space on the web nodes, you could set up mirrored storage using DRBD and run GFS on top of that. You'd end up with full redundancy and no need for the NAS (assuming, as I said, that nodes have enough space). Note that fencing would be absolutely mandatory if you use GFS or else either node failing would halt the cluster to prevent data corruption. Gordan From esggrupos at gmail.com Thu Jun 25 11:15:56 2009 From: esggrupos at gmail.com (ESGLinux) Date: Thu, 25 Jun 2009 13:15:56 +0200 Subject: [Linux-cluster] using lacie ethernet disk raid as shared storage In-Reply-To: <1ea88057f3ad37b08df6a7e8798465e7@localhost> References: <3128ba140906250106h484d1647r2c8ab5c4ba439c5@mail.gmail.com> <0b3ac5a839d5c805cb88fc592d7dc3e3@localhost> <3128ba140906250338h63ea6e64k7bb627ed1884d8e9@mail.gmail.com> <1ea88057f3ad37b08df6a7e8798465e7@localhost> Message-ID: <3128ba140906250415q6796416dy76520515a2cbfed2@mail.gmail.com> 2009/6/25 Gordan Bobic > On Thu, 25 Jun 2009 12:38:04 +0200, ESGLinux wrote: > > > I can mount this disk with NFS (also with CIFS but I?m not using this > > protocol) > > > > My idea was to mount the disk with NFS on the 2 nodes of a red hat > cluster, > > but I don?t know if it is a good idea. (perhaps no :-( ) > > There's no reason why you couldn't or shouldn't do this. If all you want is > some shared storage and don't care about the single point of failure, then > this is exactly what the device was intended for. :) ok, I?m always afraid with data corruption and thougth I will have problems with this, but If you think that there is not problem I?ll folow your advice ( at my own risk of course, ;-) > > > > The cluster are going to serve a HA httpd service (I know, with this disk > I > > have a SPOF, but that is all I have, no money for more .-(( ) > > > > any suggestion with this scenario? > > It should "just work" as you described. NFS mount it on both nodes and > point Apache at it as per usual. It'll probably work faster than a > clustered file system solution. For redundancy, however, if you have enough > disk space on the web nodes, you could set up mirrored storage using DRBD > and run GFS on top of that. 
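For a rough idea of what the DRBD piece of that suggestion involves, a two-node, dual-primary resource (dual-primary is what allows GFS to be mounted on both nodes at once) looks roughly like the sketch below. This is only an illustration against DRBD 8.x syntax: the resource name r0, the hostnames node1/node2, the backing disk /dev/sdb1 and the addresses are placeholders, not anything taken from this thread.

resource r0 {
  protocol C;
  net {
    allow-two-primaries;      # required if GFS is mounted on both nodes
  }
  startup {
    become-primary-on both;
  }
  on node1 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   192.168.0.1:7788;
    meta-disk internal;
  }
  on node2 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   192.168.0.2:7788;
    meta-disk internal;
  }
}

# run on both nodes, then force the initial sync from one side only:
drbdadm create-md r0
drbdadm up r0
drbdadm -- --overwrite-data-of-peer primary r0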
You'd end up with full redundancy and no need > for the NAS (assuming, as I said, that nodes have enough space). Note that > fencing would be absolutely mandatory if you use GFS or else either node > failing would halt the cluster to prevent data corruption. > I was allways looking for an oportunity to test DRBD. I think now is the moment. My reference web about DRBD is http://www.drbd.org/, any advice, read, before I begin to test it? ESG > > Gordan > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gordan at bobich.net Thu Jun 25 12:01:16 2009 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 25 Jun 2009 13:01:16 +0100 Subject: [Linux-cluster] using lacie ethernet disk raid as shared storage In-Reply-To: <3128ba140906250415q6796416dy76520515a2cbfed2@mail.gmail.com> References: <3128ba140906250106h484d1647r2c8ab5c4ba439c5@mail.gmail.com> <0b3ac5a839d5c805cb88fc592d7dc3e3@localhost> <3128ba140906250338h63ea6e64k7bb627ed1884d8e9@mail.gmail.com> <1ea88057f3ad37b08df6a7e8798465e7@localhost> <3128ba140906250415q6796416dy76520515a2cbfed2@mail.gmail.com> Message-ID: On Thu, 25 Jun 2009 13:15:56 +0200, ESGLinux wrote: > 2009/6/25 Gordan Bobic > >> On Thu, 25 Jun 2009 12:38:04 +0200, ESGLinux wrote: >> >> > I can mount this disk with NFS (also with CIFS but I?m not using this >> > protocol) >> > >> > My idea was to mount the disk with NFS on the 2 nodes of a red hat >> cluster, >> > but I don?t know if it is a good idea. (perhaps no :-( ) >> >> There's no reason why you couldn't or shouldn't do this. If all you want >> is >> some shared storage and don't care about the single point of failure, >> then >> this is exactly what the device was intended for. :) > > > ok, I?m always afraid with data corruption and thougth I will have > problems with this, but If you think that there is not problem I?ll folow your > advice ( at my own risk of course, ;-) NFS is designed for concurrent access, it shouldn't cause corruption. And anyway, your apache web data is likely to be read-only in most cases anyway. Don't put things like database files into shared access areas, though - that generally won't work, and even when it does, performance will be appalling. >> > The cluster are going to serve a HA httpd service (I know, with this >> > disk I >> > have a SPOF, but that is all I have, no money for more .-(( ) >> > >> > any suggestion with this scenario? >> >> It should "just work" as you described. NFS mount it on both nodes and >> point Apache at it as per usual. It'll probably work faster than a >> clustered file system solution. For redundancy, however, if you have >> enough >> disk space on the web nodes, you could set up mirrored storage using DRBD >> and run GFS on top of that. You'd end up with full redundancy and no need >> for the NAS (assuming, as I said, that nodes have enough space). Note >> that >> fencing would be absolutely mandatory if you use GFS or else either node >> failing would halt the cluster to prevent data corruption. >> > > I was allways looking for an oportunity to test DRBD. I think now is the > moment. My reference web about DRBD is http://www.drbd.org/, any advice, > read, before I begin to test it? That is, indeed, the right site. Stick to the docs, they are pretty good. If you are going with this solution, you may also want to look into Open Shared Root (http://www.open-sharedroot.org/). 
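To put GFS on top of such a mirrored device, the filesystem is created with the cluster lock manager and one journal per node. A minimal sketch, assuming the hypothetical /dev/drbd0 resource above, a cluster named mycluster in cluster.conf, and a two-node web tier:

gfs_mkfs -p lock_dlm -t mycluster:web -j 2 /dev/drbd0
mount -t gfs /dev/drbd0 /var/www/html     # on both nodes, once DRBD is primary on both

(With GFS2 the equivalent would be mkfs.gfs2 and mounting with -t gfs2.) Open Shared Root then takes the same DRBD-plus-GFS stack one step further and puts the root filesystem itself on it.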
It should save you some admin overhead since you can get away with using a single root fs for multiple nodes. Just make sure your fencing works. But if you are new to clustering, you may not want to dive straight into OSR - there are potential pitfalls that aren't always entirely obvious. There are mailing lists for both DRBD and OSR, so if you run into problems and the docs don't provide an obvious answer, you can always ask there. Gordan From xavier.montagutelli at unilim.fr Thu Jun 25 12:21:57 2009 From: xavier.montagutelli at unilim.fr (Xavier Montagutelli) Date: Thu, 25 Jun 2009 14:21:57 +0200 Subject: [Linux-cluster] using lacie ethernet disk raid as shared storage In-Reply-To: <3128ba140906250338h63ea6e64k7bb627ed1884d8e9@mail.gmail.com> References: <3128ba140906250106h484d1647r2c8ab5c4ba439c5@mail.gmail.com> <0b3ac5a839d5c805cb88fc592d7dc3e3@localhost> <3128ba140906250338h63ea6e64k7bb627ed1884d8e9@mail.gmail.com> Message-ID: <200906251421.57758.xavier.montagutelli@unilim.fr> On Thursday 25 June 2009 12:38:04 ESGLinux wrote: > Hi Gordan, > thanks for your answer, > > I can mount this disk with NFS (also with CIFS but I?m not using this > protocol) > > My idea was to mount the disk with NFS on the 2 nodes of a red hat cluster, > but I don?t know if it is a good idea. (perhaps no :-( ) > > The cluster are going to serve a HA httpd service (I know, with this disk I > have a SPOF, but that is all I have, no money for more .-(( ) > > any suggestion with this scenario? I *know* that's not your question, but have you think about using local disks on each server, with DRBD for the replication ? This would eliminate the SPOF and it's still cost effective (perhaps more than one lassie NAS ... ? I don't know). > > Thanks again, > > ESG > > > 2009/6/25 Gordan Bobic > > > On Thu, 25 Jun 2009 10:06:22 +0200, ESGLinux wrote: > > > Hi all, > > > I have a customer with has a Lacie Ethernet Disk RAID and wants to use > > > it as a shared storage to use in a HA cluster. > > > > > > Which can be the best approach to use this kind of storage? (iscsi, > > > gnbd, nfs... ???) > > > > I don't imagine for a moment that any of those would be supported, > > considering the target audience is unlikely to ever have heard of those > > protocols. It's likely to give you SMB/CIFS and nothing else. There's no > > reason why you couldn't use it for shared storage, but that is in no way > > related to RHCS. > > > > Also remember that a single SAN/NAS of whatever description is still a > > single point of failure, which makes a mockery of the concept of HA. This > > is also (shockingly) a point (willfully) overlooked by most > > administrators and architects. 
> > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster -- Xavier Montagutelli Tel : +33 (0)5 55 45 77 20 Service Commun Informatique Fax : +33 (0)5 55 45 75 95 Universite de Limoges 123, avenue Albert Thomas 87060 Limoges cedex From jeff.sturm at eprize.com Thu Jun 25 13:40:51 2009 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Thu, 25 Jun 2009 09:40:51 -0400 Subject: [Linux-cluster] using lacie ethernet disk raid as shared storage In-Reply-To: References: <3128ba140906250106h484d1647r2c8ab5c4ba439c5@mail.gmail.com> <0b3ac5a839d5c805cb88fc592d7dc3e3@localhost> <3128ba140906250338h63ea6e64k7bb627ed1884d8e9@mail.gmail.com> <1ea88057f3ad37b08df6a7e8798465e7@localhost><3128ba140906250415q6796416dy76520515a2cbfed2@mail.gmail.com> Message-ID: <64D0546C5EBBD147B75DE133D798665F02FDC10B@hugo.eprize.local> > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] > On Behalf Of Gordan Bobic > Sent: Thursday, June 25, 2009 8:01 AM > To: linux clustering > Subject: Re: [Linux-cluster] using lacie ethernet disk raid as shared storage > > On Thu, 25 Jun 2009 13:15:56 +0200, ESGLinux wrote: > > ok, I?m always afraid with data corruption and thougth I will have > > problems with this, but If you think that there is not problem I?ll > folow your > > advice ( at my own risk of course, ;-) > > NFS is designed for concurrent access, it shouldn't cause corruption. And > anyway, your apache web data is likely to be read-only in most cases > anyway. Don't put things like database files into shared access areas, > though - that generally won't work, and even when it does, performance will > be appalling. Or if you still want the redundancy of RHCS and go the DRBD route, you can always use the shared device for backups. (That's the ONLY thing I use NFS for these days.) As a plus, you won't have to tell your customer he can't use his NAS appliance :) Jeff From gordan at bobich.net Thu Jun 25 13:51:17 2009 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 25 Jun 2009 14:51:17 +0100 Subject: [Linux-cluster] using lacie ethernet disk raid as shared storage In-Reply-To: <64D0546C5EBBD147B75DE133D798665F02FDC10B@hugo.eprize.local> References: <3128ba140906250106h484d1647r2c8ab5c4ba439c5@mail.gmail.com> <0b3ac5a839d5c805cb88fc592d7dc3e3@localhost> <3128ba140906250338h63ea6e64k7bb627ed1884d8e9@mail.gmail.com> <1ea88057f3ad37b08df6a7e8798465e7@localhost><3128ba140906250415q6796416dy76520515a2cbfed2@mail.gmail.com> <64D0546C5EBBD147B75DE133D798665F02FDC10B@hugo.eprize.local> Message-ID: <308de57a27fd930e7041defe5a672371@localhost> On Thu, 25 Jun 2009 09:40:51 -0400, Jeff Sturm wrote: >> On Thu, 25 Jun 2009 13:15:56 +0200, ESGLinux wrote: >> > ok, I?m always afraid with data corruption and thougth I will have >> > problems with this, but If you think that there is not problem I?ll >> > folow your advice ( at my own risk of course, ;-) >> >> NFS is designed for concurrent access, it shouldn't cause corruption. And >> anyway, your apache web data is likely to be read-only in most cases >> anyway. Don't put things like database files into shared access areas, >> though - that generally won't work, and even when it does, performance >> will >> be appalling. > > Or if you still want the redundancy of RHCS and go the DRBD route, you can > always use the shared device for backups. (That's the ONLY thing I use NFS > for these days.) 
Don't underestimate NFS performance for heavily concurrent I/O with a significant write load on lots of small file from multiple nodes. There are things for which NFS is a better solution. Gordan From esggrupos at gmail.com Fri Jun 26 07:17:18 2009 From: esggrupos at gmail.com (ESGLinux) Date: Fri, 26 Jun 2009 09:17:18 +0200 Subject: [Linux-cluster] using lacie ethernet disk raid as shared storage In-Reply-To: <308de57a27fd930e7041defe5a672371@localhost> References: <3128ba140906250106h484d1647r2c8ab5c4ba439c5@mail.gmail.com> <0b3ac5a839d5c805cb88fc592d7dc3e3@localhost> <3128ba140906250338h63ea6e64k7bb627ed1884d8e9@mail.gmail.com> <1ea88057f3ad37b08df6a7e8798465e7@localhost> <3128ba140906250415q6796416dy76520515a2cbfed2@mail.gmail.com> <64D0546C5EBBD147B75DE133D798665F02FDC10B@hugo.eprize.local> <308de57a27fd930e7041defe5a672371@localhost> Message-ID: <3128ba140906260017i11b66264o1d848b4f9e3e4ff2@mail.gmail.com> Thanks all for your answers I?m going to try DRBD with all the indications you gave me. I?m going to spend a good summer time. ;-) ESG 2009/6/25 Gordan Bobic > On Thu, 25 Jun 2009 09:40:51 -0400, Jeff Sturm > wrote: > > >> On Thu, 25 Jun 2009 13:15:56 +0200, ESGLinux > wrote: > >> > ok, I?m always afraid with data corruption and thougth I will have > >> > problems with this, but If you think that there is not problem I?ll > >> > folow your advice ( at my own risk of course, ;-) > >> > >> NFS is designed for concurrent access, it shouldn't cause corruption. > And > >> anyway, your apache web data is likely to be read-only in most cases > >> anyway. Don't put things like database files into shared access areas, > >> though - that generally won't work, and even when it does, performance > >> will > >> be appalling. > > > > Or if you still want the redundancy of RHCS and go the DRBD route, you > can > > always use the shared device for backups. (That's the ONLY thing I use > NFS > > for these days.) > > Don't underestimate NFS performance for heavily concurrent I/O with a > significant write load on lots of small file from multiple nodes. There are > things for which NFS is a better solution. > > Gordan > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From esggrupos at gmail.com Mon Jun 29 09:38:39 2009 From: esggrupos at gmail.com (ESGLinux) Date: Mon, 29 Jun 2009 11:38:39 +0200 Subject: [Linux-cluster] quorum disk size recommedation Message-ID: <3128ba140906290238l3e98aac2s2b5ac2bf97418307@mail.gmail.com> Hi all, I?m planning a 2 nodes cluster and I?m going to use quorum disk. My question is which is the best size of this kind of disk. It will be interesting to explain how calculate this size, Thanks in advance ESG -------------- next part -------------- An HTML attachment was scrubbed... URL: From harri.paivaniemi at tieto.com Mon Jun 29 09:43:18 2009 From: harri.paivaniemi at tieto.com (=?iso-8859-1?q?H=2EP=E4iv=E4niemi?=) Date: Mon, 29 Jun 2009 12:43:18 +0300 Subject: [Linux-cluster] quorum disk size recommedation In-Reply-To: <3128ba140906290238l3e98aac2s2b5ac2bf97418307@mail.gmail.com> References: <3128ba140906290238l3e98aac2s2b5ac2bf97418307@mail.gmail.com> Message-ID: <200906291243.18175.harri.paivaniemi@tieto.com> http://sources.redhat.com/cluster/wiki/FAQ/CMAN#quorumdisksize What's the minimum size of a quorum disk/partition? The official answer is 10MB. 
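For context, initialising a quorum partition of roughly that size is a one-liner with mkqdisk; the device and label below are only examples, and the label simply has to match what the <quorumd> entry in cluster.conf refers to:

mkqdisk -c /dev/sdc1 -l myqdisk    # write quorum disk structures and label the partition
mkqdisk -L                         # list the quorum disks each node can see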
The real number is something like 100KB, but we'd like to reserve 10MB for possible future expansion and features. -hjp On Monday 29 June 2009 12:38:39 ESGLinux wrote: > Hi all, > > I?m planning a 2 nodes cluster and I?m going to use quorum disk. My > question is which is the best size of this kind of disk. It will be > interesting to explain how calculate this size, > > Thanks in advance > > ESG From esggrupos at gmail.com Mon Jun 29 09:48:29 2009 From: esggrupos at gmail.com (ESGLinux) Date: Mon, 29 Jun 2009 11:48:29 +0200 Subject: [Linux-cluster] quorum disk size recommedation In-Reply-To: <200906291243.18175.harri.paivaniemi@tieto.com> References: <3128ba140906290238l3e98aac2s2b5ac2bf97418307@mail.gmail.com> <200906291243.18175.harri.paivaniemi@tieto.com> Message-ID: <3128ba140906290248q620ad560m8700f65ab0bd63d8@mail.gmail.com> hi, Thanks for your quick answer. Just for curiosity, why this size? and with 10 MB, what happens if you need more? (the question is why can you need more? perhaps 1000 nodes? or it doesnt matter) Greetings, ESG 2009/6/29 H.P?iv?niemi > > http://sources.redhat.com/cluster/wiki/FAQ/CMAN#quorumdisksize > > What's the minimum size of a quorum disk/partition? > > The official answer is 10MB. The real number is something like 100KB, but > we'd like to reserve 10MB for possible > future expansion and features. > > > -hjp > > > > On Monday 29 June 2009 12:38:39 ESGLinux wrote: > > Hi all, > > > > I?m planning a 2 nodes cluster and I?m going to use quorum disk. My > > question is which is the best size of this kind of disk. It will be > > interesting to explain how calculate this size, > > > > Thanks in advance > > > > ESG > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From agx at sigxcpu.org Mon Jun 29 18:48:48 2009 From: agx at sigxcpu.org (Guido =?iso-8859-1?Q?G=FCnther?=) Date: Mon, 29 Jun 2009 20:48:48 +0200 Subject: [Linux-cluster] Cluster 3.0.0.rc3 release In-Reply-To: <1245496789.3665.328.camel@cerberus.int.fabbione.net> References: <1245496789.3665.328.camel@cerberus.int.fabbione.net> Message-ID: <20090629184848.GA25796@bogon.sigxcpu.org> Hi Fabione, Thanks for rolling this rc candidate! On Sat, Jun 20, 2009 at 01:19:49PM +0200, Fabio M. Di Nitto wrote: [..snip..] > In order to build the 3.0.0.rc3 release you will need: > > - corosync 0.98 > - openais 0.97 We used these without any patches. > - linux kernel 2.6.29 We were running against 2.6.30. 
We observed these issues: fenced segfaults with: (gdb) bt #0 0x00007f8e293508fe in fence_node (victim=0x114b510 "node1.foo.bar", log=0x61e0a0, log_size=32, log_count=0x7fff2e46a634) at /var/home/schmitz/3/redhat-cluster/fence/libfence/agent.c:156 #1 0x000000000040c5cd in fence_victims (fd=0x114f270) at /var/home/schmitz/3/redhat-cluster/fence/fenced/recover.c:319 #2 0x0000000000405f27 in apply_changes (fd=0x114f270) at /var/home/schmitz/3/redhat-cluster/fence/fenced/cpg.c:1056 #3 0x00007f8e2914bcc1 in cpg_dispatch () from /usr/lib/libcpg.so.4 #4 0x0000000000404588 in process_fd_cpg (ci=4) at /var/home/schmitz/3/redhat-cluster/fence/fenced/cpg.c:1351 #5 0x000000000040b0f7 in main (argc=, argv=) at /var/home/schmitz/3/redhat-cluster/fence/fenced/main.c:818 this leads to 1246297857 fenced 3.0.0.rc3 started 1246297857 our_nodeid 1 our_name node2.foo.bar 1246297857 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/fenced.log 1246297857 found uncontrolled entry /sys/kernel/dlm/rgmanager when trying to restart fenced. Since this is not possible one has to reboot the node. We're also seeing: Jun 29 19:29:03 node2 kernel: [ 50.149855] dlm: no local IP address has been set Jun 29 19:29:03 node2 kernel: [ 50.150035] dlm: cannot start dlm lowcomms -107 from time to time. Stopping/starting via cman's init script (as from the Ubuntu package) several times makes this go away. Any ideas what causes this? Cheers, -- Guido From fdinitto at redhat.com Mon Jun 29 20:10:00 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Mon, 29 Jun 2009 22:10:00 +0200 Subject: [Linux-cluster] Cluster 3.0.0.rc3 release In-Reply-To: <20090629184848.GA25796@bogon.sigxcpu.org> References: <1245496789.3665.328.camel@cerberus.int.fabbione.net> <20090629184848.GA25796@bogon.sigxcpu.org> Message-ID: <1246306200.25867.86.camel@cerberus.int.fabbione.net> Hi Guido, On Mon, 2009-06-29 at 20:48 +0200, Guido G?nther wrote: > Hi Fabione, > Thanks for rolling this rc candidate! > > On Sat, Jun 20, 2009 at 01:19:49PM +0200, Fabio M. Di Nitto wrote: > [..snip..] > > In order to build the 3.0.0.rc3 release you will need: > > > > - corosync 0.98 > > - openais 0.97 > We used these without any patches. > > > - linux kernel 2.6.29 > We were running against 2.6.30. Shouldn't be a problem. You simply won't be able to build or use gfs1. > > We observed these issues: > > fenced segfaults with: > > (gdb) bt > #0 0x00007f8e293508fe in fence_node (victim=0x114b510 "node1.foo.bar", log=0x61e0a0, log_size=32, log_count=0x7fff2e46a634) at /var/home/schmitz/3/redhat-cluster/fence/libfence/agent.c:156 > #1 0x000000000040c5cd in fence_victims (fd=0x114f270) at /var/home/schmitz/3/redhat-cluster/fence/fenced/recover.c:319 > #2 0x0000000000405f27 in apply_changes (fd=0x114f270) at /var/home/schmitz/3/redhat-cluster/fence/fenced/cpg.c:1056 > #3 0x00007f8e2914bcc1 in cpg_dispatch () from /usr/lib/libcpg.so.4 #4 0x0000000000404588 in process_fd_cpg (ci=4) at /var/home/schmitz/3/redhat-cluster/fence/fenced/cpg.c:1351 #5 0x000000000040b0f7 in main (argc=, argv=) at /var/home/schmitz/3/redhat-cluster/fence/fenced/main.c:818 > > this leads to > > 1246297857 fenced 3.0.0.rc3 started > 1246297857 our_nodeid 1 our_name node2.foo.bar > 1246297857 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/fenced.log > 1246297857 found uncontrolled entry /sys/kernel/dlm/rgmanager It looks to me the node has not been shutdown properly and an attempt to restart it did fail. The fenced segfault shouldn't happen but I am CC'ing David. 
Maybe he has a better idea. > > when trying to restart fenced. Since this is not possible one has to > reboot the node. > > We're also seeing: > > Jun 29 19:29:03 node2 kernel: [ 50.149855] dlm: no local IP address has been set > Jun 29 19:29:03 node2 kernel: [ 50.150035] dlm: cannot start dlm lowcomms -107 hmm this looks like a bad configuration to me or bad startup. IIRC dlm kernel is configured via configfs and probably it was not mounted by the init script. > > from time to time. Stopping/starting via cman's init script (as from the > Ubuntu package) several times makes this go away. > > Any ideas what causes this? Could you please try to use our upstream init scripts? They work just fine (unchanged) in ubuntu/debian environment and they are for sure a lot more robust than the ones I originally wrote for Ubuntu many years ago. Could you also please summarize your setup and config? I assume you did the normal checks such as cman_tool status, cman_tool nodes and so on... The usual extra things I'd check are: - make sure the hostname doesn't resolve to localhost but to the real ip address of the cluster interface - cman_tool status - cman_tool nodes - Before starting any kind of service, such as rgmanager or gfs*, make sure that the fencing configuration is correct. Test by using fence_node $nodename. Cheers Fabio From brettcave at gmail.com Tue Jun 30 15:18:13 2009 From: brettcave at gmail.com (Brett Cave) Date: Tue, 30 Jun 2009 17:18:13 +0200 Subject: [Linux-cluster] increasing gfs size to add journals on existing file system Message-ID: Hi, I am trying to add an extra node to my GFS cluster, but dont have enough journals. I dont have any more free space to add journals (see this thread http://www.mail-archive.com/linux-cluster at redhat.com/msg05624.html ) What would be the best solution to use (I can increase the SAN vdisk which should allow me to resize, but wondering if there is another way). Regards, Brett -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpeterso at redhat.com Tue Jun 30 16:07:28 2009 From: rpeterso at redhat.com (Bob Peterson) Date: Tue, 30 Jun 2009 12:07:28 -0400 (EDT) Subject: [Linux-cluster] increasing gfs size to add journals on existing file system In-Reply-To: Message-ID: <451634528.646961246378048373.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> ----- "Brett Cave" wrote: | Hi, | | I am trying to add an extra node to my GFS cluster, but dont have | enough journals. I dont have any more free space to add journals (see | this thread | http://www.mail-archive.com/linux-cluster at redhat.com/msg05624.html ) | | What would be the best solution to use (I can increase the SAN vdisk | which should allow me to resize, but wondering if there is another | way). | | Regards, | Brett Hi Brett, That issue has always been a design problem with GFS. You need to increase the size of the device before doing gfs_jadd. Don't make the mistake of running gfs_grow immediately because that will consume your new storage for file system space and still leave you no room for any new journals. Only run gfs_grow after you've added the journals you need. We eliminated the problem in GFS2, so another option would be to use gfs2_convert to convert the file system to GFS2 and then use gfs2_jadd. Of course, GFS2 and gfs2_convert are still pretty new, so they carry a certain amount of risk, as with all new software. 
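To make that ordering concrete for the stay-on-GFS1 route: once the vdisk (and any CLVM logical volume on it) has been extended and rescanned, the journals go in first and the grow comes last. The mount point and journal count below are only an example:

gfs_jadd -Tv -j 1 /mnt/gfs    # dry run: check there is room for one more journal
gfs_jadd -j 1 /mnt/gfs        # add the journal for the new node
gfs_grow /mnt/gfs             # only now give the remaining new space to the filesystem
gfs_tool df /mnt/gfs          # verify journal count and free space

If the gfs2_convert route is taken instead, the journal headache goes away, but the caveats that follow apply.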
Some old versions of gfs2_convert had bad problems, so if you want to go this route, make sure you make a current backup before you do anything. Second, make sure you gfs_fsck before you convert so that your file system is consistent before running gfs2_convert. Third, make sure you have the latest and greatest gfs2_convert, so if you're on RHEL5.3, for example, make sure you've got all the latest z-stream updates. If you build from source, make sure you compile from the most recent source code. Regards, Bob Peterson Red Hat File Systems From tiagocruz at forumgdh.net Tue Jun 30 16:15:23 2009 From: tiagocruz at forumgdh.net (Tiago Cruz) Date: Tue, 30 Jun 2009 13:15:23 -0300 Subject: [Linux-cluster] Did you use GFS with witch technology? Message-ID: <1246378523.7787.12.camel@tuxkiller> Hello, guys.. please... I need to know a little thing: I'm using GFS v1 with ESX 3.5 and I'm not very happy :) High load from vms, freeze and quorum lost, for example. Did you use GFS and witch technology? KVM? Xen? VirtualBox? Not Virtual? Witch version are you using? v1 or v2? Are you a happy people using this? =) Thanks -- Tiago Cruz From andrew at ntsg.umt.edu Tue Jun 30 16:37:45 2009 From: andrew at ntsg.umt.edu (Andrew A. Neuschwander) Date: Tue, 30 Jun 2009 10:37:45 -0600 Subject: [Linux-cluster] Did you use GFS with witch technology? In-Reply-To: <1246378523.7787.12.camel@tuxkiller> References: <1246378523.7787.12.camel@tuxkiller> Message-ID: <4A4A3F59.1080200@ntsg.umt.edu> I'm using GFS1 with CentOS 5.3 on ESX 3.5 and I'm mostly happy with it. If you are using a non-tickless kernel (i.e. RHEL/CentOS 2.6.18-x) be sure you are using the tick divider kernel option on your VMs. Otherwise, you'll see high loads. -A -- Andrew A. Neuschwander, RHCE Systems/Software Engineer College of Forestry and Conservation The University of Montana http://www.ntsg.umt.edu andrew at ntsg.umt.edu - 406.243.6310 Tiago Cruz wrote: > Hello, guys.. please... I need to know a little thing: > > I'm using GFS v1 with ESX 3.5 and I'm not very happy :) > High load from vms, freeze and quorum lost, for example. > > Did you use GFS and witch technology? KVM? Xen? VirtualBox? Not Virtual? > Witch version are you using? v1 or v2? > > Are you a happy people using this? =) > > Thanks > From Eric.Johnson at mtsallstream.com Tue Jun 30 16:40:33 2009 From: Eric.Johnson at mtsallstream.com (Johnson, Eric) Date: Tue, 30 Jun 2009 11:40:33 -0500 Subject: [Linux-cluster] RHEL 5.3 NFSv4 cluster Message-ID: Hi, Is there an up to date document detailing the configuration of an NFSv4 cluster service on a 2-node RHEL 5.3 Cluster Suite setup? Most of the info I find is from 2006/2007 and states that these features are in a state of flux and could change soon. My current configuration is 2 nodes, RHEL 5.3 (kernel 2.6.18-128.1.14.el5PAE), SAN attached shared storage, with GFS2 file systems. I read the documents at: http://wiki.linux-nfs.org/wiki/index.php/NFS_Recovery_and_Client_Migrati on http://www.howtoforge.com/high_availability_nfs_drbd_heartbeat And also the NFS cluster cookbook and Red Hat's NFS cluster example. The former two are fairly old, and the latter two documents seem fairly basic and don't address certain issues like: 1. Is it still recommended to configure /var/lib/nfs/v4recovery on a shared file system between nodes? 2. Do I need to set the "fsid=" parameter for every export in /etc/exports and set it to a unique value? (I currently only have fsid set for nfs root) 3. 
Should I set all of the RPC services in /etc/sysconfig/nfs to listen on a dedicated port? 4. Can I leave the NFS service running on both nodes at the same time and just fail over the IP address, or should I add the nfs service script to the cluster config to start/stop it as part of the service? 5. The NFS Recovery and Client Migration doc above mentions that lock migration is not handled yet and that there needs to be a way to release locks and leases during failover. Has this been addressed somehow? Does stopping/starting the NFS service accomplish this? Also, when mounting my NFS shares using the cluster's virtual IP address or name, I get some errors in my NFS server's logs regarding timed out callbacks: Jun 25 15:00:12 node2 kernel: nfs4_cb: server not responding, timed out Jun 25 17:07:37 node2 kernel: nfs4_cb: server not responding, timed out If I mount the file system using the cluster node's static address/name, these errors don't appear, but for obvious reasons, this is undesirable. Thanks, Eric -------------- next part -------------- An HTML attachment was scrubbed... URL: From brettcave at gmail.com Tue Jun 30 16:45:43 2009 From: brettcave at gmail.com (Brett Cave) Date: Tue, 30 Jun 2009 18:45:43 +0200 Subject: [Linux-cluster] increasing gfs size to add journals on existing file system In-Reply-To: <451634528.646961246378048373.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> References: <451634528.646961246378048373.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: On Tue, Jun 30, 2009 at 6:07 PM, Bob Peterson wrote: > ----- "Brett Cave" wrote: > Hi Brett, > > That issue has always been a design problem with GFS. You need to > increase the size of the device before doing gfs_jadd. Don't make > the mistake of running gfs_grow immediately because that will consume > your new storage for file system space and still leave you no room > for any new journals. Only run gfs_grow after you've added the > journals you need. Thanks Bob, I have increased the relevant vdisks, going to rescan the disks and then add the journals. We ran into some instability issues with gfs2 locking up while we were testing a good few months ago, so going to sacrifice bleeding edge for stability as its a production system. Will keep my eye on gfs2 and see how it runs in our test environment when we get past this phase. 8 months of stable gfs is great :) (we found the older kmod_gfs or cman had a node numbering issue which caused some locking up a while ago, but this has been resolved) how is gfs2 running on your side? > > We eliminated the problem in GFS2, so another option would be to > use gfs2_convert to convert the file system to GFS2 and then use > gfs2_jadd. Of course, GFS2 and gfs2_convert are still pretty new, so > they carry a certain amount of risk, as with all new software. Some > old versions of gfs2_convert had bad problems, so if you want to go > this route, make sure you make a current backup before you do anything. > Second, make sure you gfs_fsck before you convert so that your file system > is consistent before running gfs2_convert. Third, make sure you have the > latest and greatest gfs2_convert, so if you're on RHEL5.3, for example, > make sure you've got all the latest z-stream updates. If you build from > source, make sure you compile from the most recent source code. 
> > Regards, > > Bob Peterson > Red Hat File Systems > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From brettcave at gmail.com Tue Jun 30 16:51:10 2009 From: brettcave at gmail.com (Brett Cave) Date: Tue, 30 Jun 2009 18:51:10 +0200 Subject: [Linux-cluster] Did you use GFS with witch technology? In-Reply-To: <1246378523.7787.12.camel@tuxkiller> References: <1246378523.7787.12.camel@tuxkiller> Message-ID: On Tue, Jun 30, 2009 at 6:15 PM, Tiago Cruz wrote: > Hello, guys.. please... I need to know a little thing: > > I'm using GFS v1 with ESX 3.5 and I'm not very happy :) > High load from vms, freeze and quorum lost, for example. > > Did you use GFS and witch technology? KVM? Xen? VirtualBox? Not Virtual? > Witch version are you using? v1 or v2? GFS 1 on CentOS5 2.6.18, but not using virtualization. very happy with the performance. The GFS volumes are on an enterprise FC SAN had some issues with gfs2 locking up, that was quite a while back though, but didnt have performance issues (and neither do we on gfs1). > > > Are you a happy people using this? =) > > Thanks > > -- > Tiago Cruz > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tiagocruz at forumgdh.net Tue Jun 30 17:28:11 2009 From: tiagocruz at forumgdh.net (Tiago Cruz) Date: Tue, 30 Jun 2009 14:28:11 -0300 Subject: [Linux-cluster] Did you use GFS with witch technology? In-Reply-To: <4A4A3F59.1080200@ntsg.umt.edu> References: <1246378523.7787.12.camel@tuxkiller> <4A4A3F59.1080200@ntsg.umt.edu> Message-ID: <1246382891.7787.25.camel@tuxkiller> Hello Andrew! Many thanks for your reply! It's very good to see an environment like my! I'm using RHEL 5.2 with kernel-2.6.18-92.1.22.el5... can you explain a little bit around this trick divider? I'm usually have 2-3 IBM x3850 (16 cores CPU and 128 GB RAM) with 10-15 virtual machines running under GFS, with a LUN ~500 GB SAN. My problem happens when Multicast: Switch -> GFS -> Switch = OK vSwitch (Box_A) -> Switch-> vSwitch (Box_B) = NOK Did you have some problem with? If I put all VMs inside the same box (vSwitch Box_A -> vSwitch Box_A) I don't have any problem... Thanks a lot! -- Tiago Cruz On Tue, 2009-06-30 at 10:37 -0600, Andrew A. Neuschwander wrote: > I'm using GFS1 with CentOS 5.3 on ESX 3.5 and I'm mostly happy with it. > If you are using a non-tickless kernel (i.e. RHEL/CentOS 2.6.18-x) be > sure you are using the tick divider kernel option on your VMs. > Otherwise, you'll see high loads. > > -A > -- > Andrew A. Neuschwander, RHCE > Systems/Software Engineer > College of Forestry and Conservation > The University of Montana > http://www.ntsg.umt.edu > andrew at ntsg.umt.edu - 406.243.6310 > > > Tiago Cruz wrote: > > Hello, guys.. please... I need to know a little thing: > > > > I'm using GFS v1 with ESX 3.5 and I'm not very happy :) > > High load from vms, freeze and quorum lost, for example. > > > > Did you use GFS and witch technology? KVM? Xen? VirtualBox? Not Virtual? > > Witch version are you using? v1 or v2? > > > > Are you a happy people using this? 
=) > > > > Thanks > > > > -- > Tiago Cruz > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From steve at linuxsuite.org Tue Jun 30 17:32:06 2009 From: steve at linuxsuite.org (steve at linuxsuite.org) Date: Tue, 30 Jun 2009 13:32:06 -0400 (EDT) Subject: [Linux-cluster] Cluster startup weirdness? Message-ID: <56115.205.207.123.130.1246383126.squirrel@webmail.netfirms.com> I am trying to set up a minimal proof of concept with RHCS on CentOS 5.3. Three nodes in a cluster (vz1, vz2, vz3), 2 services, both just the apache default page, as defined in the cluster.conf below. If I do service cman start on vz1 and vz2 they both hang trying to do "fence_tool -w join", yet clustat and cman_tool status show cluster membership and quorum; no services are running. If I run tcpdump on vz3 I see that initially both vz1 and vz2 send out (from port 5149) to the multicast address but then vz2 stops and only vz1 continues. Is this correct behaviour? If I then do service cman start on vz3 everything runs (i.e. fence_tool doesn't hang), and tcpdump on vz3 shows vz1, vz2 and vz3 doing multicast and then vz2 and vz3 drop out and only vz1 continues with multicast. vz3 has taken on the service vz1. Service vz2 never comes up. Ideas? Or how do I get services vz1 and vz2 running with vz3 as a spare failover? thanx - steve Below is cluster.conf generated by system-config-cluster