From kkovachev at varna.net Wed Jun 1 08:19:31 2011 From: kkovachev at varna.net (Kaloyan Kovachev) Date: Wed, 01 Jun 2011 11:19:31 +0300 Subject: [Linux-cluster] quorum dissolved but resources are still alive In-Reply-To: <4DE515A9.40003@abilene.it> References: <4DE515A9.40003@abilene.it> Message-ID: <527e664e6a568f3049f42559cebf8359@mx.varna.net> Hi, replying to your original email ... the problem i can see in the logs is the line: openais[971]: [SYNC ] This node is within the primary component and will provide service. as you have expected_votes=2 and node votes=1 this shouldn't happen, so it looks as a bug P.S. If you had fencing configured - when node2 is back it would fence node1 and start the services On Tue, 31 May 2011 18:22:01 +0200, Martin Claudio wrote: > Hi, > > i have a problem with a 2 node cluster with this conf: > > > > > > > > > > > > > > all is ok but when node 2 goes down quorum dissolved but resources is > not stopped, here log: > > > clurgmgrd[1302]: #1: Quorum Dissolved > kernel: dlm: closing connection to node 2 > openais[971]: [CLM ] r(0) ip(10.1.1.11) > openais[971]: [CLM ] Members Left: > openais[971]: [CLM ] r(0) ip(10.1.1.12) > openais[971]: [CLM ] Members Joined: > openais[971]: [CMAN ] quorum lost, blocking activity > openais[971]: [CLM ] CLM CONFIGURATION CHANGE > openais[971]: [CLM ] New Configuration: > openais[971]: [CLM ] r(0) ip(10.1.1.11) > openais[971]: [CLM ] Members Left: > openais[971]: [CLM ] Members Joined: > openais[971]: [SYNC ] This node is within the primary component and will > provide service. > openais[971]: [TOTEM] entering OPERATIONAL state. > openais[971]: [CLM ] got nodejoin message 10.1.1.11 > openais[971]: [CPG ] got joinlist message from node 1 > ccsd[964]: Cluster is not quorate. Refusing connection. > > > cluster recognized that quorum is dissolved but resource manager doesn't > stop resource, ip address is still alive, filesystem is still mount, > i'll expect an emergency shutdown but it does not happen.... From carlopmart at gmail.com Wed Jun 1 19:48:22 2011 From: carlopmart at gmail.com (carlopmart) Date: Wed, 01 Jun 2011 21:48:22 +0200 Subject: [Linux-cluster] Corosync goes cpu to 95-99% In-Reply-To: <4DE545D7.1080703@redhat.com> References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com> <4DD2BD7D.5070704@gmail.com> <4DD2CA90.6090802@redhat.com> <3B50BA7445114813AE429BEE51A2BA52@versa> <4DD78908.2030801@gmail.com> <0B1965C8-9807-42B6-9453-01BE0C0B1DCB@cybercat.ca><4DD80D5D.10004@gmail.com> <4DD873C7.8080402@cybercat.ca> <22E7D11CD5E64E338A66811F31F06238@versa> <4DE545D7.1080703@redhat.com> Message-ID: <4DE69786.5010204@gmail.com> On 05/31/2011 09:47 PM, Steven Dake wrote: > On 05/31/2011 12:00 PM, Nicolas Ross wrote: >>>>> I've opened a support case at redhat for this. While collecting the >>>>> sosreport for redhat, I found ot in my var/log/message file something >>>>> about gfs2_quotad being stalled for more than 120 seconds. Tought I >>>>> disabled quotas with the noquota option. It appears that it's >>>>> "quota=off". Since I cannot chane thecluster config and remount the >>>>> filessystems at the moment, I did not made the change to tes it. >>>> >>>> Thanks Nicolas. what bugzilla id is?? >>> >>> It's not a bugzilla, it's a support case. >> >> Hi ! >> >> FYI, my support ticket is still open, and GSS are searching to find the >> cause of the problem. In the mean time, they suggested that I start >> corosync with -p option and see if that changes anything. 
>> >> I wanted to know how to do that since it's cman that does start corosync ? >> > > cman_tool join is called in /etc/rc.d/init.d/cman I believe. Add a -P > option to it. > > Regards > -steve Where is "-P" option under cman_tool manpage?? I didn't see it. Appears "-S", "-X", "-A", "-D" ... but not -P ... Is it correct to put this option under /etc/sysconfig/cman config file on RHEL6?? Thanks. -- CL Martinez carlopmart {at} gmail {d0t} com From rossnick-lists at cybercat.ca Wed Jun 1 23:27:50 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Wed, 01 Jun 2011 19:27:50 -0400 Subject: [Linux-cluster] Corosync goes cpu to 95-99% In-Reply-To: <4DE69786.5010204@gmail.com> References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com> <4DD2BD7D.5070704@gmail.com> <4DD2CA90.6090802@redhat.com> <3B50BA7445114813AE429BEE51A2BA52@versa> <4DD78908.2030801@gmail.com> <0B1965C8-9807-42B6-9453-01BE0C0B1DCB@cybercat.ca><4DD80D5D.10004@gmail.com> <4DD873C7.8080402@cybercat.ca> <22E7D11CD5E64E338A66811F31F06238@versa> <4DE545D7.1080703@redhat.com> <4DE69786.5010204@gmail.com> Message-ID: <4DE6CAF6.4000002@cybercat.ca> >> >> cman_tool join is called in /etc/rc.d/init.d/cman I believe. Add a -P >> option to it. >> >> Regards >> -steve > > Where is "-P" option under cman_tool manpage?? I didn't see it. Appears > "-S", "-X", "-A", "-D" ... but not -P ... > > Is it correct to put this option under /etc/sysconfig/cman config file > on RHEL6?? I had to modify my /etc/rc.d/init.d/cman script on each node and add -P (undocumented) at line 500, after $cman_join_opts And it did not solve the problem, but it help verry little bit to aliviate it. While a node is experiencing it, it's still not usable by ssh, but response time to service seems a very little better, barely noticable. GSS asked me today to produce a core dump of corosync while it's eating up CPU. Regards, From carlopmart at gmail.com Thu Jun 2 09:21:06 2011 From: carlopmart at gmail.com (carlopmart) Date: Thu, 02 Jun 2011 11:21:06 +0200 Subject: [Linux-cluster] Corosync goes cpu to 95-99% In-Reply-To: <4DE6CAF6.4000002@cybercat.ca> References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com> <4DD2BD7D.5070704@gmail.com> <4DD2CA90.6090802@redhat.com> <3B50BA7445114813AE429BEE51A2BA52@versa> <4DD78908.2030801@gmail.com> <0B1965C8-9807-42B6-9453-01BE0C0B1DCB@cybercat.ca><4DD80D5D.10004@gmail.com> <4DD873C7.8080402@cybercat.ca> <22E7D11CD5E64E338A66811F31F06238@versa> <4DE545D7.1080703@redhat.com> <4DE69786.5010204@gmail.com> <4DE6CAF6.4000002@cybercat.ca> Message-ID: <4DE75602.1000408@gmail.com> On 06/02/2011 01:27 AM, Nicolas Ross wrote: > >>> >>> cman_tool join is called in /etc/rc.d/init.d/cman I believe. Add a -P >>> option to it. >>> >>> Regards >>> -steve >> >> Where is "-P" option under cman_tool manpage?? I didn't see it. Appears >> "-S", "-X", "-A", "-D" ... but not -P ... >> >> Is it correct to put this option under /etc/sysconfig/cman config file >> on RHEL6?? > > I had to modify my /etc/rc.d/init.d/cman script on each node and add -P > (undocumented) at line 500, after $cman_join_opts > > And it did not solve the problem, but it help verry little bit to > aliviate it. While a node is experiencing it, it's still not usable by > ssh, but response time to service seems a very little better, barely > noticable. > > GSS asked me today to produce a core dump of corosync while it's eating > up CPU. > > Regards, > Oops .. Bad, bad, very bad news, almost for me. 
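A rough sketch of how the core dump GSS asked for can be captured without stopping the daemon, assuming the gdb package is installed so that gcore is available; the output path here is only a placeholder:

    # show which corosync thread is burning CPU
    top -H -b -n 1 -p $(pidof corosync)
    # write a core file of the running process; produces /tmp/corosync.<pid>
    gcore -o /tmp/corosync $(pidof corosync)

gcore attaches via ptrace, so the process is only paused for a moment rather than killed.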
Nicolas, I have found the option to pass "-p" to corosync without modifying cman startup script. In /etc/sysconfig/cman config file, I have put a line with this: CMAN_JOIN_OPTS="-P" .. and it works ok. [root at rhelnode01 sysconfig]# ps xa |grep corosync 1033 ? SLsl 0:00 corosync -f -p 1494 pts/1 S+ 0:00 grep corosync I will do some tests with two nodes, but I think RHEL 6.x is not yet ready for production environments, at least not RHCS. -- CL Martinez carlopmart {at} gmail {d0t} com
From ajb2 at mssl.ucl.ac.uk Thu Jun 2 09:34:43 2011 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Thu, 02 Jun 2011 10:34:43 +0100 Subject: [Linux-cluster] defragmentation..... Message-ID: <4DE75933.4030302@mssl.ucl.ac.uk> GFS2 seems horribly prone to fragmentation. I have a filesystem which has been written to once (data archive, migrated from a GFS1 filesystem to a clean GFS2 fs) and a lot of the files are composed of hundreds of extents - most of these are only 1-2Mb so this is a bit over the top and it badly affects backup performance. Has there been any progress on tools to help with this kind of problem? Alan
From swhiteho at redhat.com Thu Jun 2 09:46:51 2011 From: swhiteho at redhat.com (Steven Whitehouse) Date: Thu, 02 Jun 2011 10:46:51 +0100 Subject: [Linux-cluster] defragmentation..... In-Reply-To: <4DE75933.4030302@mssl.ucl.ac.uk> References: <4DE75933.4030302@mssl.ucl.ac.uk> Message-ID: <1307008011.2823.22.camel@menhir> Hi, On Thu, 2011-06-02 at 10:34 +0100, Alan Brown wrote: > GFS2 seems horribly prone to fragmentation. > > I have a filesystem which has been written to once (data archive, > migrated from a GFS1 filesystem to a clean GFS2 fs) and a lot of the > files are composed of hundreds of extents - most of these are only 1-2Mb > so this is a bit over the top and it badly affects backup performance. > > Has there been any progress on tools to help with this kind of problem? > > Alan > The thing to check is what size the extents are... the on-disk layout is designed so that you should have a metadata block separating each data extent at exactly the place where we would need to read a new metadata block in order to continue reading the file in a streaming fashion. That means on a 4k block size filesystem, the data extents are usually around 509 blocks in length, and if you see a number of these with (mostly) a single metadata block between them (sometimes more if the height of the metadata tree grows) then that is the expected layout. Fragmentation tends to be more of an issue with directories than with regular files, and that is something that we are looking into at the moment, Steve.
From ajb2 at mssl.ucl.ac.uk Thu Jun 2 10:47:32 2011 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Thu, 02 Jun 2011 11:47:32 +0100 Subject: [Linux-cluster] defragmentation..... In-Reply-To: <1307008011.2823.22.camel@menhir> References: <4DE75933.4030302@mssl.ucl.ac.uk> <1307008011.2823.22.camel@menhir> Message-ID: <4DE76A44.2000902@mssl.ucl.ac.uk> Steven Whitehouse wrote: > The thing to check is what size the extents are... filefrag doesn't show this. > the on-disk layout is > designed so that you should have a metadata block separating each data > extent at exactly the place where we would need to read a new metadata > block in order to continue reading the file in a streaming fashion.
> > That means on a 4k block size filesystem, the data extents are usually > around 509 blocks in length, and if you see a number of these with > (mostly) a single metadata block between them (sometimes more if the > height of the metadata tree grows) then that is the expected layout. 4k*509 = 2024k - most of these files are 800-1010k (there isn't a file on this FS larger than 2Mb) I've just taken one directory (225 entries, all 880-900k), copied each file and moved the copy back to the original spot. Filefrag says they're now 1-3 extents (50% 1 extent, 30% 2 extents) This filesystem is 700G and was originally populated in a single rsync pass. Filesystem Size Used Avail Use% Mounted on /dev/mapper/VolGroupBeast03-LogVolSarch01--GFS2 700G 660G 41G 95% /stage/sarch01 Filesystem Inodes IUsed IFree IUse% Mounted on /dev/mapper/VolGroupBeast03-LogVolSarch01--GFS2 13072686 2542375 10530311 20% /stage/sarch01 I'd understand if the last files written were like this, but it's right across the entire FS. From ajb2 at mssl.ucl.ac.uk Thu Jun 2 10:58:10 2011 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Thu, 02 Jun 2011 11:58:10 +0100 Subject: [Linux-cluster] defragmentation..... In-Reply-To: <1307008011.2823.22.camel@menhir> References: <4DE75933.4030302@mssl.ucl.ac.uk> <1307008011.2823.22.camel@menhir> Message-ID: <4DE76CC2.8010201@mssl.ucl.ac.uk> This is interesting too. note the variation in extents (the file is a piece of marketing fluff, name is unimportant) $ df -h . Filesystem Size Used Avail Use% Mounted on /dev/mapper/VolGroupBeast03-LogVolUser1 250G 113G 138G 45% /stage/user1 $ ls -l SUMO-SATA-Competitive-Positioning-v1.ppt -rw-r--r-- 1 ajb2 computing 3746304 Nov 8 2007 SUMO-SATA-Competitive-Positioning-v1.ppt $ rsync SUMO-SATA-Competitive-Positioning-v1.ppt SUMO-SATA-Competitive-Positioning-v1.ppt.new $ filefrag SUMO-SATA-Competitive-Positioning-v1.ppt* SUMO-SATA-Competitive-Positioning-v1.ppt: 915 extents found SUMO-SATA-Competitive-Positioning-v1.ppt.new: 153 extents found $ rsync SUMO-SATA-Competitive-Positioning-v1.ppt SUMO-SATA-Competitive-Positioning-v1.ppt.new $ filefrag SUMO-SATA-Competitive-Positioning-v1.ppt* SUMO-SATA-Competitive-Positioning-v1.ppt: 915 extents found SUMO-SATA-Competitive-Positioning-v1.ppt.new: 73 extents found $ rsync SUMO-SATA-Competitive-Positioning-v1.ppt SUMO-SATA-Competitive-Positioning-v1.ppt.new $ filefrag SUMO-SATA-Competitive-Positioning-v1.ppt* SUMO-SATA-Competitive-Positioning-v1.ppt: 915 extents found SUMO-SATA-Competitive-Positioning-v1.ppt.new: 12 extents found $ rsync SUMO-SATA-Competitive-Positioning-v1.ppt SUMO-SATA-Competitive-Positioning-v1.ppt.new $ filefrag SUMO-SATA-Competitive-Positioning-v1.ppt* SUMO-SATA-Competitive-Positioning-v1.ppt: 915 extents found SUMO-SATA-Competitive-Positioning-v1.ppt.new: 16 extents found $ rsync SUMO-SATA-Competitive-Positioning-v1.ppt SUMO-SATA-Competitive-Positioning-v1.ppt.new $ filefrag SUMO-SATA-Competitive-Positioning-v1.ppt* SUMO-SATA-Competitive-Positioning-v1.ppt: 915 extents found SUMO-SATA-Competitive-Positioning-v1.ppt.new: 9 extents found $ rsync SUMO-SATA-Competitive-Positioning-v1.ppt SUMO-SATA-Competitive-Positioning-v1.ppt.new $ filefrag SUMO-SATA-Competitive-Positioning-v1.ppt* SUMO-SATA-Competitive-Positioning-v1.ppt: 915 extents found SUMO-SATA-Competitive-Positioning-v1.ppt.new: 16 extents found $ cp SUMO-SATA-Competitive-Positioning-v1.ppt SUMO-SATA-Competitive-Positioning-v1.ppt.new cp: overwrite `SUMO-SATA-Competitive-Positioning-v1.ppt.new'? 
y $ filefrag SUMO-SATA-Competitive-Positioning-v1.ppt* SUMO-SATA-Competitive-Positioning-v1.ppt: 915 extents found SUMO-SATA-Competitive-Positioning-v1.ppt.new: 5 extents found $ cp SUMO-SATA-Competitive-Positioning-v1.ppt SUMO-SATA-Competitive-Positioning-v1.ppt.new cp: overwrite `SUMO-SATA-Competitive-Positioning-v1.ppt.new'? y $ filefrag SUMO-SATA-Competitive-Positioning-v1.ppt* SUMO-SATA-Competitive-Positioning-v1.ppt: 915 extents found SUMO-SATA-Competitive-Positioning-v1.ppt.new: 16 extents found All these commands were executed in a 30 second period. From swhiteho at redhat.com Thu Jun 2 11:03:39 2011 From: swhiteho at redhat.com (Steven Whitehouse) Date: Thu, 02 Jun 2011 12:03:39 +0100 Subject: [Linux-cluster] defragmentation..... In-Reply-To: <4DE76A44.2000902@mssl.ucl.ac.uk> References: <4DE75933.4030302@mssl.ucl.ac.uk> <1307008011.2823.22.camel@menhir> <4DE76A44.2000902@mssl.ucl.ac.uk> Message-ID: <1307012619.2823.31.camel@menhir> Hi, On Thu, 2011-06-02 at 11:47 +0100, Alan Brown wrote: > Steven Whitehouse wrote: > > > The thing to check is what size the extents are... > > filefrag doesn't show this. > Yes it does. You need the -v flag > > the on-disk layout is > > designed so that you should have a metadata block separating each data > > extent at exactly the place where we would need to read a new metadata > > block in order to continue reading the file in a streaming fashion. > > > > That means on a 4k block size filesystem, the data extents are usually > > around 509 blocks in length, and if you see a number of these with > > (mostly) a single metadata block between them (sometimes more if the > > height of the metadata tree grows) then that is the expected layout. > > 4k*509 = 2024k - most of these files are 800-1010k (there isn't a file > on this FS larger than 2Mb) > > I've just taken one directory (225 entries, all 880-900k), copied each > file and moved the copy back to the original spot. > > Filefrag says they're now 1-3 extents (50% 1 extent, 30% 2 extents) > That doesn't sound too unreasonable to me. Usually the best way to defrag is simply to copy the file elsewhere and copy it back as you've done. That is why there is no specific tool to do this. > This filesystem is 700G and was originally populated in a single rsync pass. > > Filesystem Size Used Avail Use% Mounted on > /dev/mapper/VolGroupBeast03-LogVolSarch01--GFS2 > 700G 660G 41G 95% /stage/sarch01 > > Filesystem Inodes IUsed IFree IUse% Mounted on > /dev/mapper/VolGroupBeast03-LogVolSarch01--GFS2 > 13072686 2542375 10530311 20% /stage/sarch01 > > I'd understand if the last files written were like this, but it's right > across the entire FS. > > If rsync is writing only a single file at a time, it should be pretty good wrt to fragmentation. If it is trying to write multiple files at the same time, bit by bit, then that is the kind of thing which might increase fragmentation a bit depending on the exact pattern in this case, Steve. From ajb2 at mssl.ucl.ac.uk Thu Jun 2 11:50:33 2011 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Thu, 02 Jun 2011 12:50:33 +0100 Subject: [Linux-cluster] defragmentation..... In-Reply-To: <4DE76CC2.8010201@mssl.ucl.ac.uk> References: <4DE75933.4030302@mssl.ucl.ac.uk> <1307008011.2823.22.camel@menhir> <4DE76CC2.8010201@mssl.ucl.ac.uk> Message-ID: <4DE77909.6000405@mssl.ucl.ac.uk> Alan Brown wrote: > This is interesting too. 
note the variation in extents (the file is a > piece of marketing fluff, name is unimportant) I'm getting the same thing in sarch01 and that's mounted read-only by the clients - there's zero write activity going on. From rossnick-lists at cybercat.ca Thu Jun 2 13:42:39 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Thu, 2 Jun 2011 09:42:39 -0400 Subject: [Linux-cluster] Corosync goes cpu to 95-99% References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com> <4DD2BD7D.5070704@gmail.com> <4DD2CA90.6090802@redhat.com> <3B50BA7445114813AE429BEE51A2BA52@versa> <4DD78908.2030801@gmail.com> <0B1965C8-9807-42B6-9453-01BE0C0B1DCB@cybercat.ca><4DD80D5D.10004@gmail.com> <4DD873C7.8080402@cybercat.ca> <22E7D11CD5E64E338A66811F31F06238@versa> <4DE545D7.1080703@redhat.com> <4DE69786.5010204@gmail.com><4DE6CAF6.4000002@cybercat.ca> <4DE75602.1000408@gmail.com> Message-ID: <51BB988BCCF547E69BF222BDAF34C4DE@versa> > Oops .. Bad, bad, very bad news, almost for me. Nicolas, I have found the > option to pass "-p" to corosync without modifying cman startup script. In > /etc/sysconfig/cman config file, I have put a line with this: > > CMAN_JOIN_OPTS="-P" > > .. and works ok. > > [root at rhelnode01 sysconfig]# ps xa |grep corosync > 1033 ? SLsl 0:00 corosync -f -p > 1494 pts/1 S+ 0:00 grep corosync > > I will do some tests with two nodes, But I think RHEL6.x is not yet ready > for production environments, almost RHCS. Thanks for that, that'll prevent me from modifying a system file... And yes, I find it a little disapointing. We're now at 6.1, and our setup is exactly what RHCS was designed for... A GFS over fiber, httpd running content from that gfs... From swap_project at yahoo.com Thu Jun 2 15:37:07 2011 From: swap_project at yahoo.com (Srija) Date: Thu, 2 Jun 2011 08:37:07 -0700 (PDT) Subject: [Linux-cluster] Cluster environment issue In-Reply-To: <9537f2a4eb5ae1c11038deed2e3fe40f@mx.varna.net> Message-ID: <748430.45768.qm@web112815.mail.gq1.yahoo.com> Thank you so much for your reply again. --- On Tue, 5/31/11, Kaloyan Kovachev wrote: Thanks for your reply again. > > If it is a switch restart you will have in your logs the > interface going > down/up, but more problematic is to find a short drop of > the multicast I checked all nodes did not find anything about interface, but in all the nodes it is reporting that server19(node 12) /server18 (node 11) is the problematic, here I am mentioning the logs from three nodes (out of 16 nodes) May 24 18:04:59 server7 openais[6113]: [TOTEM] entering GATHER state from 12. May 24 18:05:01 server7 crond[5068]: (root) CMD ( /opt/hp/hp-health/bin/check-for-restart-requests) May 24 18:05:19 server7 openais[6113]: [TOTEM] entering GATHER state from 11. May 24 18:04:59 server1 openais[6148]: [TOTEM] entering GATHER state from 12. May 24 18:05:01 server1 crond[2275]: (root) CMD ( /opt/hp/hp-health/bin/check-for-restart-requests) May 24 18:05:19 server1 openais[6148]: [TOTEM] entering GATHER state from 11. May 24 18:04:59 server8 openais[6279]: [TOTEM] entering GATHER state from 12. May 24 18:05:01 server8 crond[11125]: (root) CMD ( /opt/hp/hp-health/bin/check-for-restart-requests) May 24 18:05:19 server8 openais[6279]: [TOTEM] entering GATHER state from 11. Here is some lines from node12 , at the same time ___________________________________________________ May 24 18:04:59 server19 openais[5950]: [TOTEM] The token was lost in the OPERATIONAL state. May 24 18:04:59 server19 openais[5950]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes). 
May 24 18:04:59 server19 openais[5950]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes). May 24 18:04:59 server19 openais[5950]: [TOTEM] entering GATHER state from 2. May 24 18:05:19 server19 openais[5950]: [TOTEM] entering GATHER state from 11. May 24 18:05:20 server19 openais[5950]: [TOTEM] Saving state aru 39a8f high seq received 39a8f May 24 18:05:20 server19 openais[5950]: [TOTEM] Storing new sequence id for ring 2af0 May 24 18:05:20 server19 openais[5950]: [TOTEM] entering COMMIT state. May 24 18:05:20 server19 openais[5950]: [TOTEM] entering RECOVERY state. Here is few lines on node11 ie server18 ------------------------------------------ ay 24 18:04:48 server18 May 24 18:10:14 server18 syslog-ng[5619]: syslog-ng starting up; version='2.0.10' May 24 18:10:14 server18 Bootdata ok (command line is ro root=/dev/vgroot_xen/lvroot rhgb quiet) So it seems that node11 is rebooting just after few mintues we get all the problems in the logs of all nodes. > You may ask the network people to check for STP changes and > double check > the multicast configuration and you may also try to use > broadcast instead > of multicast or use a dedicated switch. As per the dedicated switch, I don't think it is possible as per the network team. I asked the STP chanes related. their answer is "there are no stp changes for the private network as there are no redundant devices in the environment. the multicast configs is igmp snooping with Pim" I have talked to the network team for using the broadcast instead of multicast, as per them , they can set.. Pl. comment on this... > your interface and multicast address) > ??? ping -I ethX -b -L 239.x.x.x -c 1 > and finaly run this script until the cluster gets broken Yes , I have checked it , it is working fine now. I have also set a cron for this script and set in one node. I have few questions regarding the cluster configuration ... - We are using clvm in the cluster environment. As I understand it is active-active. The environment is xen . all the xen hosts are in the cluster and each host have the guests. We are keeping the options to live migrate the guests from one host to another. - I was looking into the redhat knowledgebase https://access.redhat.com/kb/docs/DOC-3068, as per the document , what do you think using CLVM or HA-LVM will be the best choice? Pl. advice. Thanks and regards again. From bergman at merctech.com Thu Jun 2 20:05:04 2011 From: bergman at merctech.com (bergman at merctech.com) Date: Thu, 02 Jun 2011 16:05:04 -0400 Subject: [Linux-cluster] recommended method for changing quorum device In-Reply-To: Your message of "Tue, 31 May 2011 22:22:44 +0200." <215272920.3337.1306873364406.JavaMail.root@axgroupware01-1.gallien.atix> References: <215272920.3337.1306873364406.JavaMail.root@axgroupware01-1.gallien.atix> Message-ID: <2865.1307045104@localhost> In the message dated: Tue, 31 May 2011 22:22:44 +0200, The pithy ruminations from Mark Hlawatschek on were: => Mark, => => without guarantee ;-) I believe that the following method should work: Thanks for the suggestion. Here's what I did: => => 1. make sure that all 3 nodes are running and part of the cluster Yes. 1a. Decrement the number of expected votes to the expected quorum value without a quorum disk (for a 3-node cluster): cman_tool expected -e 2 1b. Change the cluster config to remove the quorum disk and decrease the number of expected votes to 2; then run "ccs_tool update" "clustat" shows the old quorum device as being "offline" the cluster remains quorate => 2. 
stop qdiskd on all nodes (#service qdiskd stop) Yes. => 3. create new quorum disk (#mkqdisk ...) Yes. => 4. modify cluster.conf => 5. #ccs_tool update /etc/cluster/cluster.conf Yes. Modified to use the new quorum disk. Did NOT change the expected number of votes back to 5. The cluster remains quorate. At this point, "mkqdisk -L" shows two quorum devices. => 6. start qdiskd on all nodes (#service qdiskd start) Yes. At this point, "cman_tool status" shows 2 votes from the quorum disk (5 votes total, 2 needed for quorum). 6a. Modify the cluster config to use the new quorum disk and to use the previous number of expected votes (3, to allow the 3-node cluster to function with 1 node + the quorum device). The cluster remains quorate. The expected number of votes is 3, the actual number of votes is 5. ---------------------------------------------------------------- The good news: No errors, no sudden cluster failures. However, "clustat" shows the path to the old quorum device, and doesn't show the new disk. The [old] quorum disk is shown as being "Online". Running "qdiskd -f -d" shows that the quorum device is functioning (hueristic checks, etc.), but doesn't give information about which device is being used. Running: strace -o /tmp/qdisk.strace -f /usr/sbin/qdiskd -d -f and examining the system calls shows that the new quorum device is in use. So, aside from the incorrect information from "clustat", it looks like the change in quorum device was successful. Now the old array hardware can continue failing. :) Thanks, Mark => => Kind regards, => Mark => => => ----- bergman at merctech.com wrote: => => > I've got a 3-node RHCS cluster and the quorum device is on a SAN disk => > array that needs to be replaced. The relevent versions are: => > => > CentOS 5.6 (2.6.18-238.9.1.el5) => > openais-0.80.6-28.el5_6.1 => > cman-2.0.115-68.el5_6.3 => > rgmanager-2.0.52-9.el5.centos.1 => > => > => > Currently the cluster is configured with each node having one vote => > and => > the quorum device having 2 votes, to allow operation in the event of => > multiple node failures. => > => > I'd like to know if there's any recommended method for changing the => > quorum disk "in place", without shutting down the cluster. => > => > The following approaches come to mind: => > => > 1. Create a new quorum device (multipath, mkqdisk). => > => > Ensure that at least 2 of the 3 nodes are up. => > => > Change the cluster configuration to use the new path to => > the new device instead of the old device. => > => > Commit the change to the cluster. => > => > 2. Create a new quorum device (multipath, mkqdisk). => > => > Ensure that at least 2 of the 3 nodes are up. => > => > Change the cluster configuration to not use any quorum => > device. => > => > Commit the change to the cluster. => > => > Change the cluster configuration to use the new quorum => > device. => > => > Commit the change to the cluster. => > => > 3. Create a new quorum device (multipath, mkqdisk). => > => > Change the cluster configuration to use both quorum => > devices. => > => > Commit the change to the cluster. => > => > -------------------------------------------------- => > Note: the 'mkqdisk' manual page (dated July 2006) => > states: => > using multiple different devices is currently => > not supported => > Is that still accurate? => > -------------------------------------------------- => > => > Change the cluster configuration to use just the => > new quorum device instead of the old device. => > => > Commit the change to the cluster. 
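Condensed into commands, the sequence described above looks roughly like the following on a RHEL/CentOS 5 cluster; the device path and label are placeholders, and each ccs_tool update assumes cluster.conf has already been edited and its config_version bumped:

    # keep the 3-node cluster quorate without the quorum disk (2 votes needed)
    cman_tool expected -e 2
    # push the cluster.conf that no longer references the old quorum disk
    ccs_tool update /etc/cluster/cluster.conf
    # on every node
    service qdiskd stop
    # label the replacement device (device and label are placeholders)
    mkqdisk -c /dev/mapper/new-qdisk -l NEWQUORUM
    # push the cluster.conf that references the new label, then restart qdiskd on every node
    ccs_tool update /etc/cluster/cluster.conf
    service qdiskd start
    # sanity checks
    mkqdisk -L
    cman_tool status
    clustat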
=> > => > Thanks for any suggestions. => > => > Mark => > => > -- => > Linux-cluster mailing list => > Linux-cluster at redhat.com => > https://www.redhat.com/mailman/listinfo/linux-cluster => => -- => Mark Hlawatschek => => ATIX Informationstechnologie und Consulting AG | Einsteinstrasse 10 | => 85716 Unterschleissheim | www.atix.de => => http://www.linux-subscriptions.com => From kkovachev at varna.net Fri Jun 3 08:48:31 2011 From: kkovachev at varna.net (Kaloyan Kovachev) Date: Fri, 03 Jun 2011 11:48:31 +0300 Subject: [Linux-cluster] Cluster environment issue In-Reply-To: <748430.45768.qm@web112815.mail.gq1.yahoo.com> References: <748430.45768.qm@web112815.mail.gq1.yahoo.com> Message-ID: Hi, On Thu, 2 Jun 2011 08:37:07 -0700 (PDT), Srija wrote: > Thank you so much for your reply again. > > --- On Tue, 5/31/11, Kaloyan Kovachev wrote: > Thanks for your reply again. > > > > >> If it is a switch restart you will have in your logs the >> interface going >> down/up, but more problematic is to find a short drop of >> the multicast > > I checked all nodes did not find anything about interface, but in all the > nodes it is reporting that server19(node 12) /server18 (node 11) is the > problematic, here I am mentioning the logs from three nodes (out of 16 > nodes) > > May 24 18:04:59 server7 openais[6113]: [TOTEM] entering GATHER state > from 12. > May 24 18:05:01 server7 crond[5068]: (root) CMD ( > /opt/hp/hp-health/bin/check-for-restart-requests) > May 24 18:05:19 server7 openais[6113]: [TOTEM] entering GATHER state > from 11. > > May 24 18:04:59 server1 openais[6148]: [TOTEM] entering GATHER state > from 12. > May 24 18:05:01 server1 crond[2275]: (root) CMD ( > /opt/hp/hp-health/bin/check-for-restart-requests) > May 24 18:05:19 server1 openais[6148]: [TOTEM] entering GATHER state > from 11. > > May 24 18:04:59 server8 openais[6279]: [TOTEM] entering GATHER state > from 12. > May 24 18:05:01 server8 crond[11125]: (root) CMD ( > /opt/hp/hp-health/bin/check-for-restart-requests) > May 24 18:05:19 server8 openais[6279]: [TOTEM] entering GATHER state > from 11. > > > Here is some lines from node12 , at the same time > ___________________________________________________ > > > May 24 18:04:59 server19 openais[5950]: [TOTEM] The token was lost in the > OPERATIONAL state. > May 24 18:04:59 server19 openais[5950]: [TOTEM] Receive multicast socket > recv buffer size (320000 bytes). > May 24 18:04:59 server19 openais[5950]: [TOTEM] Transmit multicast socket > send buffer size (262142 bytes). > May 24 18:04:59 server19 openais[5950]: [TOTEM] entering GATHER state from > 2. > May 24 18:05:19 server19 openais[5950]: [TOTEM] entering GATHER state from > 11. > May 24 18:05:20 server19 openais[5950]: [TOTEM] Saving state aru 39a8f > high seq received 39a8f > May 24 18:05:20 server19 openais[5950]: [TOTEM] Storing new sequence id > for ring 2af0 > May 24 18:05:20 server19 openais[5950]: [TOTEM] entering COMMIT state. > May 24 18:05:20 server19 openais[5950]: [TOTEM] entering RECOVERY state. > > > Here is few lines on node11 ie server18 > ------------------------------------------ > > ay 24 18:04:48 server18 > May 24 18:10:14 server18 syslog-ng[5619]: syslog-ng starting up; > version='2.0.10' > May 24 18:10:14 server18 Bootdata ok (command line is ro > root=/dev/vgroot_xen/lvroot rhgb quiet) > > > So it seems that node11 is rebooting just after few mintues we get all > the problems in the logs of all nodes. 
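The multicast check that comes up again just below (the ping test Kaloyan had suggested earlier) can be left running in a loop so the exact moment of a multicast drop gets logged; a minimal sketch, where the interface and group address are placeholders that must match the cluster's totem settings, and which assumes the other nodes answer multicast echo requests:

    #!/bin/bash
    # log a timestamp as soon as a 1-packet multicast ping stops being answered
    IFACE=eth1            # placeholder: cluster interconnect interface
    GROUP=239.192.0.1     # placeholder: totem multicast address from cluster.conf
    while ping -I "$IFACE" -L -b -c 1 -w 2 "$GROUP" >/dev/null 2>&1; do
        sleep 1
    done
    echo "$(date) multicast to $GROUP on $IFACE lost" >> /var/log/mcast-watch.log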
> > > > You may ask the network people to check for STP changes and >> double check >> the multicast configuration and you may also try to use >> broadcast instead >> of multicast or use a dedicated switch. > > As per the dedicated switch, I don't think it is possible as per the > network team. I asked the STP chanes related. their answer is > > "there are no stp changes for the private network as there are no > redundant devices in the environment. the multicast configs is igmp > snooping with Pim" > > I have talked to the network team for using the broadcast instead of > multicast, as per them , they can set.. > > Pl. comment on this... > to use broadcast (if private addresses are in the same VLAN/subnet) you just need to set it in cluster.conf - cman section, but not sure if it can be done on a running cluster (without stopping or braking it) > > your interface and multicast address) >> ping -I ethX -b -L 239.x.x.x -c 1 >> and finaly run this script until the cluster gets broken > > Yes , I have checked it , it is working fine now. I have also set a cron > for this script and set in one node. no need for cron if you haven't changed the script - this will start several processes and your network will be overloaded !!! the script was made to run on a console (or via screen) and it will exit _only_ when multicast is lost > > I have few questions regarding the cluster configuration ... > > > - We are using clvm in the cluster environment. As I understand it > is active-active. > The environment is xen . all the xen hosts are in the cluster and > each host have > the guests. We are keeping the options to live migrate the guests > from one host to another. > > - I was looking into the redhat knowledgebase > https://access.redhat.com/kb/docs/DOC-3068, > as per the document , what do you think using CLVM or HA-LVM will be > the best choice? > > Pl. advice. can't comment on this sorry > > > Thanks and regards again. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From jpeteb at gmail.com Fri Jun 3 14:01:53 2011 From: jpeteb at gmail.com (Pete) Date: Fri, 3 Jun 2011 10:01:53 -0400 Subject: [Linux-cluster] Some nodes not starting groupd correctly Message-ID: Hello, I have a startup issue with a cluster that we've set up. We have 34 HP G7 servers running in a cluster to share one SAN resource, a HP (Lefthand) P4500. All the servers are running RHEL 5.4. When we reboot the cluster, a small, random number of nodes will not mount the SAN. On inspection, the failing nodes are members of the cluster (looking at clustat). When I run a "service cman status" on them, they say that groupd is not running. I'm assuming that because of this, clvmd does not run correctly (I see a "clvmd: Can't open cluster manager socket: No such file or directory" in the messages log), so no SAN VG and no SAN mount. If I do a "service cman restart; service clvmd restart; mount -a" the SAN will mount correctly. I've created a sample cluster.conf below. It only contains 4 nodes, but it is identical to the 34 node system. We use IPMI for the fencing, as the HP G7 systems are iLO3, and we could not get fence_ilo to work with them. Any help is appreciated - thanks! --pete From swap_project at yahoo.com Fri Jun 3 15:27:58 2011 From: swap_project at yahoo.com (Srija) Date: Fri, 3 Jun 2011 08:27:58 -0700 (PDT) Subject: [Linux-cluster] Cluster environment issue In-Reply-To: Message-ID: <807701.11474.qm@web112805.mail.gq1.yahoo.com> Thanks for your reply. 
--- On Fri, 6/3/11, Kaloyan Kovachev wrote: > > to use broadcast (if private addresses are in the same > VLAN/subnet) you > just need to set it in cluster.conf - cman section, but not > sure if it can > be done on a running cluster (without stopping or braking > it) Yes all the ips are in the same vlan. I will test it in the lab with the 3 nodes cluster. If I want to check the difference between multicast setting and broadcast setting, how to test ? My plan is, already the test environment is set with multicast. I will test it. Then I will change the cluster.conf with broadcast setting then test. Pl. let me know. Thanks and regards. From mkathuria at tuxtechnologies.co.in Mon Jun 6 07:53:55 2011 From: mkathuria at tuxtechnologies.co.in (Manish Kathuria) Date: Mon, 6 Jun 2011 13:23:55 +0530 Subject: [Linux-cluster] Fencing Issues: fence_node fails but fence_ipmilan works Message-ID: I am facing a strange problem configuring a two node cluster using RHCS 4.8. Both nodes are HP Proliant DL 180 G6 servers using HP LO 100i (IPMI Based). When I run the fence_node command to check the fence device configuration for either of the nodes, it fails giving the following message in the logs: fence_node[nnnn]: Fence of "node1" was unsuccessful fence_node[nnnn]: Fence of "node2" was unsuccessful However, when I run the fence_impilan command using the same credentials, it executes successfully and is able to switch on, off and reboot the nodes. The cluster configuration for the fence devices is: IPMI Lan Type Name: lo1 IP Address: 172.16.1.x Login admin Password passone Auth Type password Name: lo2 IP Address: 172.16.1.y Login admin Password passtwo Auth Type password I have already tried different options for Auth Type (blank, password, md5). Have also tried using / not using lanplus for both the fence devices in the Manage Fencing dialog without success. Any suggestions ? Thanks, Manish Kathuria From sakect at gmail.com Mon Jun 6 10:47:03 2011 From: sakect at gmail.com (POWERBALL ONLINE) Date: Mon, 6 Jun 2011 17:47:03 +0700 Subject: [Linux-cluster] Fencing Issues: fence_node fails but fence_ipmilan works In-Reply-To: References: Message-ID: Please give me the cluster.conf file On Mon, Jun 6, 2011 at 2:53 PM, Manish Kathuria < mkathuria at tuxtechnologies.co.in> wrote: > I am facing a strange problem configuring a two node cluster using > RHCS 4.8. Both nodes are HP Proliant DL 180 G6 servers using HP LO > 100i (IPMI Based). > > When I run the fence_node command to check the fence device > configuration for either of the nodes, it fails giving the following > message in the logs: > > fence_node[nnnn]: Fence of "node1" was unsuccessful > fence_node[nnnn]: Fence of "node2" was unsuccessful > > However, when I run the fence_impilan command using the same > credentials, it executes successfully and is able to switch on, off > and reboot the nodes. The cluster configuration for the fence devices > is: > > IPMI Lan Type > Name: lo1 > IP Address: 172.16.1.x > Login admin > Password passone > Auth Type password > > Name: lo2 > IP Address: 172.16.1.y > Login admin > Password passtwo > Auth Type password > > I have already tried different options for Auth Type (blank, password, > md5). Have also tried using / not using lanplus for both the fence > devices in the Manage Fencing dialog without success. > > Any suggestions ? 
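One way to narrow this kind of problem down is to drive the fence agent by hand with exactly the values configured in cluster.conf; a rough sketch using the addresses and credentials from the report above (whether -P/lanplus and which -A auth type are needed depends on the BMC firmware):

    # ask the agent itself for the power state, verbosely
    fence_ipmilan -a 172.16.1.x -l admin -p passone -A password -o status -v
    # cross-check with ipmitool to rule the agent in or out
    ipmitool -I lanplus -H 172.16.1.x -U admin -P passone chassis power status

If both work but fence_node still fails, the mismatch is usually between the node name in the clusternode/fence block and what fence_node is being asked to fence.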
> > Thanks, > > Manish Kathuria > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From RMartinez-Sanchez at nds.com Mon Jun 6 13:52:47 2011 From: RMartinez-Sanchez at nds.com (Martinez-Sanchez, Raul) Date: Mon, 6 Jun 2011 14:52:47 +0100 Subject: [Linux-cluster] Fencing Issues: fence_node fails but fence_ipmilan works In-Reply-To: References: Message-ID: <7370F6F5ED3B874F988F5CE657D801EA13A9309268@UKMA1.UK.NDS.COM> Hi, I had a similar issue on an IBM M3 system although it was on a rhel5u6. The way we got it fix was by changing the ipmilam configuration *location*(attribute lanplus="1" in device element to fencedevice element) in the cluster.conf file, I am not sure if this would be relevant to you but just in case .... I.E. Regards, Ra?l Mart?nez S?nchez From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of POWERBALL ONLINE Sent: Monday, June 06, 2011 11:47 AM To: linux clustering Subject: Re: [Linux-cluster] Fencing Issues: fence_node fails but fence_ipmilan works Please give me the cluster.conf file On Mon, Jun 6, 2011 at 2:53 PM, Manish Kathuria > wrote: I am facing a strange problem configuring a two node cluster using RHCS 4.8. Both nodes are HP Proliant DL 180 G6 servers using HP LO 100i (IPMI Based). When I run the fence_node command to check the fence device configuration for either of the nodes, it fails giving the following message in the logs: fence_node[nnnn]: Fence of "node1" was unsuccessful fence_node[nnnn]: Fence of "node2" was unsuccessful However, when I run the fence_impilan command using the same credentials, it executes successfully and is able to switch on, off and reboot the nodes. The cluster configuration for the fence devices is: IPMI Lan Type Name: lo1 IP Address: 172.16.1.x Login admin Password passone Auth Type password Name: lo2 IP Address: 172.16.1.y Login admin Password passtwo Auth Type password I have already tried different options for Auth Type (blank, password, md5). Have also tried using / not using lanplus for both the fence devices in the Manage Fencing dialog without success. Any suggestions ? Thanks, Manish Kathuria -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster ________________________________ ************************************************************************************** This message is confidential and intended only for the addressee. If you have received this message in error, please immediately notify the postmaster at nds.com and delete it from your system as well as any copies. The content of e-mails as well as traffic data may be monitored by NDS for employment and security purposes. To protect the environment please do not print this e-mail unless necessary. NDS Limited. Registered Office: One London Road, Staines, Middlesex, TW18 4EX, United Kingdom. A company registered in England and Wales. Registered no. 3080780. VAT no. GB 603 8808 40-00 ************************************************************************************** -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mkathuria at tuxtechnologies.co.in Mon Jun 6 15:13:32 2011 From: mkathuria at tuxtechnologies.co.in (Manish Kathuria) Date: Mon, 6 Jun 2011 20:43:32 +0530 Subject: [Linux-cluster] Fencing Issues: fence_node fails but fence_ipmilan works In-Reply-To: <7370F6F5ED3B874F988F5CE657D801EA13A9309268@UKMA1.UK.NDS.COM> References: <7370F6F5ED3B874F988F5CE657D801EA13A9309268@UKMA1.UK.NDS.COM> Message-ID: On Mon, Jun 6, 2011 at 7:22 PM, Martinez-Sanchez, Raul wrote: > Hi, > > > > I had a similar issue on an IBM M3 system although it was on a rhel5u6. The > way we got it fix was by changing the ipmilam configuration > *location*(attribute lanplus="1" in device element to fencedevice element) > in the cluster.conf file, I am not sure if this would be relevant to you but > just in case ?. > > > > I.E. > > > > > > > > > > > > > > > > > lanplus="1" login="Admin" name="m3vgc1b-ilo" passwd="***"/> > lanplus="1" login="Admin" name="m3vgc1a-ilo" passwd="***"/> > > Regards, > > Ra?l Mart?nez S?nchez > > > > Please give me the cluster.conf file > > On Mon, Jun 6, 2011 at 2:53 PM, Manish Kathuria > wrote: > > I am facing a strange problem configuring a two node cluster using > RHCS 4.8. Both nodes are HP Proliant DL 180 G6 servers using HP LO > 100i (IPMI Based). > > When I run the fence_node command to check the fence device > configuration for either of the nodes, it fails giving the following > message in the logs: > > fence_node[nnnn]: Fence of "node1" was unsuccessful > fence_node[nnnn]: Fence of "node2" was unsuccessful > > However, when I run the fence_impilan command using the same > credentials, it executes successfully and is able to switch on, off > and reboot the nodes. The cluster configuration for the fence devices > is: > > IPMI Lan Type > Name: ? ? ? ? ? lo1 > IP Address: ? ? 172.16.1.x > Login ? ? ? ? ? admin > Password ? ? ? ?passone > Auth Type ? ? ? password > > Name: ? ? ? ? ? lo2 > IP Address: ? ? 172.16.1.y > Login ? ? ? ? ? admin > Password ? ? ? ?passtwo > Auth Type ? ? ? password > > I have already tried different options for Auth Type (blank, password, > md5). Have also tried using / not using lanplus for both the fence > devices in the Manage Fencing dialog without success. > > Any suggestions ? > Thanks for the tip, I will try that out. Another interesting thing which I discovered subsequently was that the nodes were being fenced by the cluster during testing and its just the command fence_node which fails to execute giving the error message mentioned in the initial mail. Quite surprising. -- Manish From fdinitto at redhat.com Mon Jun 6 17:56:44 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Mon, 06 Jun 2011 19:56:44 +0200 Subject: [Linux-cluster] resource agents 3.9.1rc1 release Message-ID: <4DED14DC.8070604@redhat.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA256 Hi everybody, The current resource agent repository [1] has been tagged to v3.9.1rc1. Tarballs are also available [2]. This is the very first release of the common resource agent repository. It is a big milestone towards eliminating duplication of effort with the goal of improving the overall quality and user experience. There is still a long way to go but the first stone has been laid down. 
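For anyone building from the tarball, choosing between the two agent sets covered below happens at configure time; a hypothetical invocation is sketched here, with the flag name assumed to be --with-ras-set (worth confirming against ./configure --help in the tarball):

    ./autogen.sh                             # only needed when building from git
    ./configure --with-ras-set=rgmanager     # assumed values: all, linux-ha, rgmanager
    make
    make install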
Highlights for the LHA resource agents set: - - lxc, symlink: new resource agents - - db2: major rewrite and support for master/slave mode of operation - - exportfs: backup/restore of rmtab is back - - mysql: multiple improvements for master/slave and replication - - ocft: new tests for pgsql, postfix, and iscsi Highlights for the rgmanager resource agents set: - - oracledb: use shutdown immediate - - tomcat5: fix generated XML - - nfsclient: fix client name mismatch - - halvm: fix mirror dev failure - - nfs: fix selinux integration Several changes have been made to the build system and the spec file to accommodate both projects? needs. The most noticeable change is the option to select "all", "linux-ha" or "rgmanager" resource agents at configuration time, which will also set the default for the spec file. The full list of changes is available in the "ChangeLog" file for users, and in an auto-generated git-to-changelog file called "ChangeLog.devel". NOTE: About the 3.9.x version (particularly for linux-ha folks): This version was chosen simply because the rgmanager set was already at 3.1.x. In order to make it easier for distribution, and to keep package upgrades linear, we decided to bump the number higher than both projects. There is no other special meaning associated with it. The final 3.9.1 release will take place soon. Many thanks to everybody who helped with this release, in particular to the numerous contributors. Without you, the release would certainly not be possible. Cheers, The RAS Tribe [1] https://github.com/ClusterLabs/resource-agents/tarball/v3.9.1rc1 [2] https://fedorahosted.org/releases/r/e/resource-agents/ PS: I am absolutely sure that URL [2] might give some people a fit, but we are still working to get a common release area. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iQIcBAEBCAAGBQJN7RTaAAoJEFA6oBJjVJ+OLd0QAJsNrNxjwDOuHAIt8LW6pOPL WZ7kR0S/S4rXzMC93jFAx3c4UE+7WBUHAEnOqZKQBLpkCti+o6lGG31EsM8sqk94 cHa7P4sLZ7OqnjbulBvORaGFkRrtewxugQMzX03UOnDplTluDaE4duBWou9uCZI1 mq6hd9EyDHXZJCrCN5BRAWAV2JbRb2cp9Wu4HaSFVb/1662mOuaUVLvYoAkmriF8 p5URVsJ09qJVRCpyLFIO4Xd51x46B807naRJMVEclNu5qv6IzL+HvqsR0KLL7CCv cDAzNMqOGYRi3PQlywPaC/D+/PWw5LspmdepizooyIwleUK0O9d8dl3PuMjtewfn 4uMPdp2Vc9OqpAcZpcSIBwrK9zRH+JOQDUJmCL4dRZtsukU2qxAT4f7pX66hTVts DkCkuDcX+xhi/y5eTu5cMKvsfrdcpNaDmIimKtq6T34Axncp8TYaLBfaoSB/2LIm RD7MDXxY9tLD6b/e2gK6xtSXT4A+YQm7eXsBMhjYu30Ozq9Jvjz58V3bivMDtp+E aUI/vxRnxOMjw9io8w2ltnCU9oLI3T9dDkj1Dilnl+HI0ju1flzsW8mhCA0c0GsY tqZ1Em7js1Mp4PcoI57wS4f0INfU32KTkhPBViRn+o8GNJ9wFLd6XtwMFYrinqhS mZxO0uDsvQ9gTnoVTUvL =2KKW -----END PGP SIGNATURE----- From zaeem.arshad at gmail.com Tue Jun 7 18:44:51 2011 From: zaeem.arshad at gmail.com (Zaeem Arshad) Date: Tue, 7 Jun 2011 23:44:51 +0500 Subject: [Linux-cluster] Mixing kernel versions in a GFS cluster In-Reply-To: <30487.1303931741@datil.uphs.upenn.edu> References: <4DB1C7A5.10307@ntsg.umt.edu> <30487.1303931741@datil.uphs.upenn.edu> Message-ID: On Thu, Apr 28, 2011 at 12:15 AM, wrote: > In the message dated: Fri, 22 Apr 2011 12:23:33 MDT, > The pithy ruminations from "Andrew A. Neuschwander" on > <[Linux-cluster] Mixing kernel versions in a GFS cluster> were: > => Would it be a problem to mix CentOS 5.5 and CentOS 5.6 nodes in a GFS(1) > cluster? > => > > Any information? > > Has anyone tried this? I'm trying to figure out the best update path for a > 3-node CentOS 5.5 cluster (with GFS(1) and GFS2). 
> > Not sure if it's relevant but we had two nodes running CentOS 5.4 and 5.5 respectively for quite a while without any issues. We evetually got around to updating the second node but never experienced any issues. HTH -- Zaeem -------------- next part -------------- An HTML attachment was scrubbed... URL: From swap_project at yahoo.com Tue Jun 7 18:57:02 2011 From: swap_project at yahoo.com (Srija) Date: Tue, 7 Jun 2011 11:57:02 -0700 (PDT) Subject: [Linux-cluster] Cluster environment issue In-Reply-To: <807701.11474.qm@web112805.mail.gq1.yahoo.com> Message-ID: <687930.36765.qm@web112809.mail.gq1.yahoo.com> Hi Kaloyan > --- On Fri, 6/3/11, Kaloyan Kovachev > wrote: > > > > > to use broadcast (if private addresses are in the > same > > VLAN/subnet) you > > just need to set it in cluster.conf - cman section, > but not > > sure if it can > > be done on a running cluster (without stopping or > braking > > it) I have configured the cluster in the lab ( with three nodes) and set the broadcast. Here is the configuration -- #--------------------------------------- #--------------------------------------- When I am executeing the cman_tool status command , getting the following output [root ~]# cman_tool status Version: 6.2.0 Config Version: 61 Cluster Name: test Cluster Id: 25790 Cluster Member: Yes Cluster Generation: 968 Membership state: Cluster-Member Nodes: 3 Expected votes: 3 Total votes: 3 Quorum: 2 Active subsystems: 7 Flags: Dirty Ports Bound: 0 Node name: node1 Node ID: 1 Multicast addresses: 239.192.xxx.xx Node addresses: 192.168.205.1 Would you pl. confirm the broadcast configuration!! Again ,in the following document https://access.redhat.com/kb/docs/DOC-40821 under Unsopported items/ Netowrking, it is telling that broadcast is not supportive... Thanks and regards. From kkovachev at varna.net Wed Jun 8 08:33:01 2011 From: kkovachev at varna.net (Kaloyan Kovachev) Date: Wed, 08 Jun 2011 11:33:01 +0300 Subject: [Linux-cluster] Cluster environment issue In-Reply-To: <687930.36765.qm@web112809.mail.gq1.yahoo.com> References: <687930.36765.qm@web112809.mail.gq1.yahoo.com> Message-ID: <8a55259aef507690dbae1bd902e0dc83@mx.varna.net> Hi, On Tue, 7 Jun 2011 11:57:02 -0700 (PDT), Srija wrote: > Hi Kaloyan > >> --- On Fri, 6/3/11, Kaloyan Kovachev >> wrote: >> >> > >> > to use broadcast (if private addresses are in the >> same >> > VLAN/subnet) you >> > just need to set it in cluster.conf - cman section, >> but not >> > sure if it can >> > be done on a running cluster (without stopping or >> braking >> > it) > > > I have configured the cluster in the lab ( with three nodes) and set the > broadcast. Here is the configuration -- > > > > post_join_delay="3"/> > > > > > > > > > > > > > > > > > > > > > > > > #--------------------------------------- > > > try to replace this with just > #--------------------------------------- > > > login="Admin" name="ilo-node1r" passwd="xxx"/> > login="Admin" name="ilo-node2r" passwd="xxx"/> > login="Admin" name="ilo-node3r" passwd="xxx"/> > > > > > > > > > > > When I am executeing the cman_tool status command , getting the > following output > > [root ~]# cman_tool status > Version: 6.2.0 > Config Version: 61 > Cluster Name: test > Cluster Id: 25790 > Cluster Member: Yes > Cluster Generation: 968 > Membership state: Cluster-Member > Nodes: 3 > Expected votes: 3 > Total votes: 3 > Quorum: 2 > Active subsystems: 7 > Flags: Dirty > Ports Bound: 0 > Node name: node1 > Node ID: 1 > Multicast addresses: 239.192.xxx.xx > Node addresses: 192.168.205.1 > > Would you pl. 
confirm the broadcast configuration!! > > Again ,in the following document > > https://access.redhat.com/kb/docs/DOC-40821 unfortunately i can't access the document, but using broadcast is just to confirm the problem is with multicast (like the script i've sent earlier). > > under Unsopported items/ Netowrking, it is telling that broadcast is not > supportive... > > Thanks and regards. > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From yamato at redhat.com Thu Jun 9 08:55:29 2011 From: yamato at redhat.com (Masatake YAMATO) Date: Thu, 09 Jun 2011 17:55:29 +0900 (JST) Subject: [Linux-cluster] [RFC] Read access to /config/dlm//comms//addr Message-ID: <20110609.175529.646090028440251828.yamato@redhat.com> Hi, I've found /config/dlm//comms//addr is readable (in meaning of ls -l) but no handler(comm_addr_read) is defined in dlm/fs/dlm/config.c. If cat command works fine with /config/dlm//comms//addr, it will be nice to understand the status of dlm. So I'm thinking about writing a patch. But after reading the source code, I've found its difficulties; /config/dlm//comms//addr holds 'struct sockaddr_storage'. I'd like to get your comment before going ahead. I think we have three choice. Which do you think the best? 1. When 'cat /config/dlm//comms//addr' is invoked, it converts the held sockaddr_storage to human readable text and provids it to userland. e.g. # cat /config/dlm//comms//addr AF_INET 192.168.151.1 # Advantage: human readable Disadvantage: data asymmetry in writing and reading When writing to /config/dlm//comms//addr, it expects binary format of sockaddr_storage. 2. When 'cat /config/dlm//comms//addr' is invoked, it provides the held sockaddr_storage to userland. Advantage: data symmetry in writing and reading. Disadvantage: not human readable. It needs something effort to understanding the returned binary data. 3. Make /config/dlm//comms//addr unreadable (in meaning of ls -l) e.g. # ls -l /config/dlm//comms//addr --w-------. 1 root root 4096 Jun 9 08:51 /config/dlm//comms//addr Advantage: easy to implement. Disadvantage: no way to know the value of node addr of dlm view. Regards, Masatake YAMATO From laszlo.budai at gmail.com Thu Jun 9 09:45:29 2011 From: laszlo.budai at gmail.com (Budai Laszlo) Date: Thu, 09 Jun 2011 12:45:29 +0300 Subject: [Linux-cluster] Remove GFS journal Message-ID: <4DF09639.70607@gmail.com> Hi, I would like to know if it is possible to remove a journal from GFS. I have tried to google for it, but did not found anything conclusive. I've read the documentation on the following address: http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html-single/Global_File_System/index.html but I did not found any mention about the possibility or impossibility of removing journals. Thank you, Laszlo From swhiteho at redhat.com Thu Jun 9 09:56:53 2011 From: swhiteho at redhat.com (Steven Whitehouse) Date: Thu, 09 Jun 2011 10:56:53 +0100 Subject: [Linux-cluster] Remove GFS journal In-Reply-To: <4DF09639.70607@gmail.com> References: <4DF09639.70607@gmail.com> Message-ID: <1307613413.2821.1.camel@menhir> Hi, On Thu, 2011-06-09 at 12:45 +0300, Budai Laszlo wrote: > Hi, > > I would like to know if it is possible to remove a journal from GFS. I > have tried to google for it, but did not found anything conclusive. 
I've > read the documentation on the following address: > http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html-single/Global_File_System/index.html > but I did not found any mention about the possibility or impossibility > of removing journals. > > Thank you, > Laszlo > There is no tool to do this, I'm afraid. Theoretically it could be done by editing the fs directly, but it would be a pretty tricky thing to do, and certainly not a recommended procedure, Steve. From shankar.jha at gmail.com Thu Jun 9 10:27:22 2011 From: shankar.jha at gmail.com (Shankar Jha) Date: Thu, 9 Jun 2011 15:57:22 +0530 Subject: [Linux-cluster] cluster is not relocation on second node. Message-ID: Hi, I have problem in rhel5.5 cluster. Mysqld service is on cluster. when there is any issue with cluster, services(hell) not relocation automatically. Even I have tried to enable on second node but fails. In that case we need to reboot both nodes and enable it on manually on anyone. HP-ILO fencing is not working. Please find the below /var/log/message and suggest. Jun 9 02:46:25 indls0040 clurgmgrd[6530]: Stopping service service:hell Jun 9 02:46:27 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 Jun 9 02:46:44 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 Jun 9 02:46:45 indls0040 ccsd[5222]: Unable to connect to cluster infrastructure after 19710 seconds. Jun 9 02:46:55 indls0040 clurgmgrd[6530]: #52: Failed changing RG status Jun 9 02:47:03 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 Jun 9 02:47:05 indls0040 clurgmgrd: [6530]: 10.48.64.82 is not configured Jun 9 02:47:05 indls0040 clurgmgrd[6530]: Stopping service service:hell Jun 9 02:47:15 indls0040 ccsd[5222]: Unable to connect to cluster infrastructure after 19740 seconds. Jun 9 02:47:20 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 Jun 9 02:47:35 indls0040 clurgmgrd[6530]: #52: Failed changing RG status Jun 9 02:47:38 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 Jun 9 02:47:45 indls0040 clurgmgrd: [6530]: 10.48.64.82 is not configured Jun 9 02:47:45 indls0040 clurgmgrd[6530]: Stopping service service:hell Jun 9 02:47:45 indls0040 ccsd[5222]: Unable to connect to cluster infrastructure after 19770 seconds. Jun 9 02:47:50 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 Jun 9 02:48:14 indls0040 last message repeated 2 times Jun 9 02:48:15 indls0040 ccsd[5222]: Unable to connect to cluster infrastructure after 19800 seconds. Jun 9 02:48:15 indls0040 clurgmgrd[6530]: #52: Failed changing RG status Jun 9 02:48:23 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 Jun 9 02:48:25 indls0040 clurgmgrd: [6530]: 10.48.64.82 is not configured Jun 9 02:48:25 indls0040 clurgmgrd[6530]: Stopping service service:hell Jun 9 02:48:37 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 Jun 9 02:48:45 indls0040 ccsd[5222]: Unable to connect to cluster infrastructure after 19830 seconds. Jun 9 02:48:55 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 Jun 9 02:48:55 indls0040 clurgmgrd[6530]: #52: Failed changing RG status Jun 9 02:49:05 indls0040 clurgmgrd: [6530]: 10.48.64.82 is not configured Jun 9 02:49:05 indls0040 clurgmgrd[6530]: Stopping service service:hell Jun 9 02:49:13 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 Jun 9 02:49:15 indls0040 ccsd[5222]: Unable to connect to cluster infrastructure after 19860 seconds. 
Jun 9 02:49:26 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 Jun 9 02:49:35 indls0040 clurgmgrd[6530]: #52: Failed changing RG status Jun 9 02:49:45 indls0040 clurgmgrd: [6530]: 10.48.64.82 is not configured Jun 9 02:49:45 indls0040 clurgmgrd[6530]: Stopping service service:hell Jun 9 02:49:45 indls0040 ccsd[5222]: Unable to connect to cluster infrastructure after 19890 seconds. Jun 9 02:49:47 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 Jun 9 02:50:10 indls0040 last message repeated 2 times Jun 9 02:50:15 indls0040 clurgmgrd[6530]: #52: Failed changing RG status Jun 9 10:03:59 indls0040 openais[23169]: [MAIN ] Using default multicast address of 239.192.67.158 Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] Token Timeout (10000 ms) retransmit timeout (495 ms) Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] token hold (386 ms) retransmits before loss (20 retrans) Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] join (60 ms) send_join (0 ms) consensus (4800 ms) merge (200 ms) Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] downcheck (1000 ms) fail to recv const (50 msgs) Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] seqno unchanged const (30 rotations) Maximum network MTU 1402 Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] window size per rotation (50 messages) maximum messages per rotation (1 7 messages) Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] missed count const (5 messages) Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] send threads (0 threads) Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] RRP token expired timeout (495 ms) Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] RRP token problem counter (2000 ms) Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] RRP threshold (10 problem count) Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] RRP mode set to none. Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] heartbeat_failures_allowed (0) Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] max_network_delay (50 ms) Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] HeartBeat is Disabled. To enable set heartbeat_failures_allowed > 0 Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes). Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes). Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] The network interface [10.48.65.54] is now up. Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] Created or loaded sequence id 7136704.10.48.65.54 for this ring. Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] entering GATHER state from 15. 
Jun 9 10:04:00 indls0040 openais[23169]: [CMAN ] CMAN 2.0.115 (built Jul 28 2010 19:18:41) started Jun 9 10:04:00 indls0040 openais[23169]: [MAIN ] Service initialized 'openais CMAN membership service 2.01' Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized 'openais extended virtual synchrony service' Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized 'openais cluster membership service B.01.01' Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized 'openais availability management framework B.01.01' Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized 'openais checkpoint service B.01.01' Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized 'openais event service B.01.01' Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized 'openais distributed locking service B.01.01' Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized 'openais message service B.01.01' Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized 'openais configuration service' Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized 'openais cluster closed process group service v1.01 ' Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized 'openais cluster config database access v1.01' Jun 9 10:04:00 indls0040 openais[23169]: [SYNC ] Not using a virtual synchrony filter. Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] Creating commit token because I am the rep. --More-- Thanks- Shankar Jun 9 10:04:01 indls0040 openais[23169]: [CLM ] r(0) ip(10.48.64.67) Jun 9 10:04:01 indls0040 openais[23169]: [SYNC ] This node is within the primary component and will provide service. Jun 9 10:04:01 indls0040 openais[23169]: [TOTEM] entering OPERATIONAL state. 
Jun 9 10:04:02 indls0040 openais[23169]: [CLM ] got nodejoin message 10.48.64.67 Jun 9 10:04:02 indls0040 openais[23169]: [CLM ] got nodejoin message 10.48.65.54 Jun 9 10:04:02 indls0040 openais[23169]: [CMAN ] cman killed by node 2 because we were killed by cman_tool or other appl ication Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading all openais components Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais component: openais_confdb v0 (19/10) Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais component: openais_cpg v0 (18/8) Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais component: openais_cfg v0 (17/7) Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais component: openais_msg v0 (16/6) Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais component: openais_lck v0 (15/5) Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais component: openais_evt v0 (14/4) Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais component: openais_ckpt v0 (13/3) Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais component: openais_amf v0 (12/2) Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais component: openais_clm v0 (11/1) Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais component: openais_evs v0 (10/0) Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais component: openais_cman v0 (9/9) Jun 9 10:04:03 indls0040 dlm_controld[23196]: cluster is down, exiting Jun 9 10:04:03 indls0040 fenced[23188]: cluster is down, exiting Jun 9 10:04:03 indls0040 kernel: dlm: closing connection to node 1 Jun 9 10:04:03 indls0040 gfs_controld[23203]: cpg_join error 2 Jun 9 10:04:06 indls0040 fence_node[23194]: Fence of "indls0040.qdx.in" was unsuccessful Jun 9 10:04:15 indls0040 ccsd[5222]: Unable to connect to cluster infrastructure after 45930 seconds. Jun 9 10:04:16 indls0040 clurgmgrd[6530]: #52: Failed changing RG status -------------- next part -------------- A non-text attachment was scrubbed... Name: logs.docx Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document Size: 147479 bytes Desc: not available URL: From ccaulfie at redhat.com Thu Jun 9 10:40:28 2011 From: ccaulfie at redhat.com (Christine Caulfield) Date: Thu, 09 Jun 2011 11:40:28 +0100 Subject: [Linux-cluster] cluster is not relocation on second node. In-Reply-To: References: Message-ID: <4DF0A31C.70605@redhat.com> On 09/06/11 11:27, Shankar Jha wrote: > Hi, > > I have problem in rhel5.5 cluster. > Mysqld service is on cluster. when there is any issue with cluster, > services(hell) not relocation automatically. Even I have tried to > enable on second node but fails. In that case we need to reboot both > nodes and enable it on manually on anyone. HP-ILO fencing is not > working. You answered your own question. Fix fencing and the failover should work fine :-) Chrissie > Please find the below /var/log/message and suggest. > > > Jun 9 02:46:25 indls0040 clurgmgrd[6530]: Stopping service > service:hell > Jun 9 02:46:27 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 > Jun 9 02:46:44 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 > Jun 9 02:46:45 indls0040 ccsd[5222]: Unable to connect to cluster > infrastructure after 19710 seconds. 
> Jun 9 02:46:55 indls0040 clurgmgrd[6530]: #52: Failed changing RG status > Jun 9 02:47:03 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 > Jun 9 02:47:05 indls0040 clurgmgrd: [6530]: 10.48.64.82 is > not configured > Jun 9 02:47:05 indls0040 clurgmgrd[6530]: Stopping service > service:hell > Jun 9 02:47:15 indls0040 ccsd[5222]: Unable to connect to cluster > infrastructure after 19740 seconds. > Jun 9 02:47:20 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 > Jun 9 02:47:35 indls0040 clurgmgrd[6530]: #52: Failed changing RG status > Jun 9 02:47:38 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 > Jun 9 02:47:45 indls0040 clurgmgrd: [6530]: 10.48.64.82 is > not configured > Jun 9 02:47:45 indls0040 clurgmgrd[6530]: Stopping service > service:hell > Jun 9 02:47:45 indls0040 ccsd[5222]: Unable to connect to cluster > infrastructure after 19770 seconds. > Jun 9 02:47:50 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 > Jun 9 02:48:14 indls0040 last message repeated 2 times > Jun 9 02:48:15 indls0040 ccsd[5222]: Unable to connect to cluster > infrastructure after 19800 seconds. > Jun 9 02:48:15 indls0040 clurgmgrd[6530]: #52: Failed changing RG status > Jun 9 02:48:23 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 > Jun 9 02:48:25 indls0040 clurgmgrd: [6530]: 10.48.64.82 is > not configured > Jun 9 02:48:25 indls0040 clurgmgrd[6530]: Stopping service > service:hell > Jun 9 02:48:37 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 > Jun 9 02:48:45 indls0040 ccsd[5222]: Unable to connect to cluster > infrastructure after 19830 seconds. > Jun 9 02:48:55 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 > Jun 9 02:48:55 indls0040 clurgmgrd[6530]: #52: Failed changing RG status > Jun 9 02:49:05 indls0040 clurgmgrd: [6530]: 10.48.64.82 is > not configured > Jun 9 02:49:05 indls0040 clurgmgrd[6530]: Stopping service > service:hell > Jun 9 02:49:13 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 > Jun 9 02:49:15 indls0040 ccsd[5222]: Unable to connect to cluster > infrastructure after 19860 seconds. > Jun 9 02:49:26 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 > Jun 9 02:49:35 indls0040 clurgmgrd[6530]: #52: Failed changing RG status > Jun 9 02:49:45 indls0040 clurgmgrd: [6530]: 10.48.64.82 is > not configured > Jun 9 02:49:45 indls0040 clurgmgrd[6530]: Stopping service > service:hell > Jun 9 02:49:45 indls0040 ccsd[5222]: Unable to connect to cluster > infrastructure after 19890 seconds. 
> Jun 9 02:49:47 indls0040 dhclient: DHCPREQUEST on eth7 to 10.48.64.13 port 67 > Jun 9 02:50:10 indls0040 last message repeated 2 times > Jun 9 02:50:15 indls0040 clurgmgrd[6530]: #52: Failed changing RG status > > > Jun 9 10:03:59 indls0040 openais[23169]: [MAIN ] Using default > multicast address of 239.192.67.158 > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] Token Timeout (10000 > ms) retransmit timeout (495 ms) > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] token hold (386 ms) > retransmits before loss (20 retrans) > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] join (60 ms) > send_join (0 ms) consensus (4800 ms) merge (200 ms) > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] downcheck (1000 ms) > fail to recv const (50 msgs) > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] seqno unchanged > const (30 rotations) Maximum network MTU 1402 > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] window size per > rotation (50 messages) maximum messages per rotation (1 > 7 messages) > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] missed count const > (5 messages) > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] send threads (0 threads) > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] RRP token expired > timeout (495 ms) > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] RRP token problem > counter (2000 ms) > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] RRP threshold (10 > problem count) > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] RRP mode set to none. > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] heartbeat_failures_allowed (0) > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] max_network_delay (50 ms) > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] HeartBeat is > Disabled. To enable set heartbeat_failures_allowed> 0 > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] Receive multicast > socket recv buffer size (320000 bytes). > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] Transmit multicast > socket send buffer size (262142 bytes). > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] The network > interface [10.48.65.54] is now up. > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] Created or loaded > sequence id 7136704.10.48.65.54 for this ring. > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] entering GATHER state from 15. 
> Jun 9 10:04:00 indls0040 openais[23169]: [CMAN ] CMAN 2.0.115 (built > Jul 28 2010 19:18:41) started > Jun 9 10:04:00 indls0040 openais[23169]: [MAIN ] Service initialized > 'openais CMAN membership service 2.01' > Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized > 'openais extended virtual synchrony service' > Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized > 'openais cluster membership service B.01.01' > Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized > 'openais availability management framework B.01.01' > > Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized > 'openais checkpoint service B.01.01' > Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized > 'openais event service B.01.01' > Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized > 'openais distributed locking service B.01.01' > Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized > 'openais message service B.01.01' > Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized > 'openais configuration service' > Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized > 'openais cluster closed process group service v1.01 > ' > Jun 9 10:04:00 indls0040 openais[23169]: [SERV ] Service initialized > 'openais cluster config database access v1.01' > Jun 9 10:04:00 indls0040 openais[23169]: [SYNC ] Not using a virtual > synchrony filter. > Jun 9 10:04:00 indls0040 openais[23169]: [TOTEM] Creating commit > token because I am the rep. > --More-- > > > Thanks- > Shankar > > > > Jun 9 10:04:01 indls0040 openais[23169]: [CLM ] r(0) ip(10.48.64.67) > Jun 9 10:04:01 indls0040 openais[23169]: [SYNC ] This node is within > the primary component and will provide service. > Jun 9 10:04:01 indls0040 openais[23169]: [TOTEM] entering OPERATIONAL state. 
> Jun 9 10:04:02 indls0040 openais[23169]: [CLM ] got nodejoin message > 10.48.64.67 > Jun 9 10:04:02 indls0040 openais[23169]: [CLM ] got nodejoin message > 10.48.65.54 > Jun 9 10:04:02 indls0040 openais[23169]: [CMAN ] cman killed by node > 2 because we were killed by cman_tool or other appl > ication > Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading all > openais components > Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais > component: openais_confdb v0 (19/10) > Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais > component: openais_cpg v0 (18/8) > Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais > component: openais_cfg v0 (17/7) > Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais > component: openais_msg v0 (16/6) > Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais > component: openais_lck v0 (15/5) > Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais > component: openais_evt v0 (14/4) > Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais > component: openais_ckpt v0 (13/3) > Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais > component: openais_amf v0 (12/2) > Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais > component: openais_clm v0 (11/1) > Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais > component: openais_evs v0 (10/0) > Jun 9 10:04:03 indls0040 openais[23169]: [SERV ] Unloading openais > component: openais_cman v0 (9/9) > Jun 9 10:04:03 indls0040 dlm_controld[23196]: cluster is down, exiting > Jun 9 10:04:03 indls0040 fenced[23188]: cluster is down, exiting > Jun 9 10:04:03 indls0040 kernel: dlm: closing connection to node 1 > Jun 9 10:04:03 indls0040 gfs_controld[23203]: cpg_join error 2 > Jun 9 10:04:06 indls0040 fence_node[23194]: Fence of > "indls0040.qdx.in" was unsuccessful > Jun 9 10:04:15 indls0040 ccsd[5222]: Unable to connect to cluster > infrastructure after 45930 seconds. > Jun 9 10:04:16 indls0040 clurgmgrd[6530]: #52: Failed changing RG status > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From teigland at redhat.com Thu Jun 9 14:05:46 2011 From: teigland at redhat.com (David Teigland) Date: Thu, 9 Jun 2011 10:05:46 -0400 Subject: [Linux-cluster] [RFC] Read access to /config/dlm//comms//addr In-Reply-To: <20110609.175529.646090028440251828.yamato@redhat.com> References: <20110609.175529.646090028440251828.yamato@redhat.com> Message-ID: <20110609140546.GA30732@redhat.com> On Thu, Jun 09, 2011 at 05:55:29PM +0900, Masatake YAMATO wrote: > Hi, > > I've found /config/dlm//comms//addr is readable > (in meaning of ls -l) but no handler(comm_addr_read) is defined in > dlm/fs/dlm/config.c. > > If cat command works fine with /config/dlm//comms//addr, > it will be nice to understand the status of dlm. So I'm thinking about > writing a patch. > > But after reading the source code, I've found its difficulties; > /config/dlm//comms//addr holds 'struct > sockaddr_storage'. Another problem is that you can write multiple addr's to that file sequentially when using SCTP, so which do you get when you read it? > 3. Make /config/dlm//comms//addr unreadable (in meaning of ls -l) > > e.g. > # ls -l /config/dlm//comms//addr > --w-------. 1 root root 4096 Jun 9 08:51 /config/dlm//comms//addr > > Advantage: easy to implement. > Disadvantage: no way to know the value of node addr of dlm view. I suggest this. 
If you want a way to read them, I'd add a new readonly file addr_list, # cat /config/dlm//comms//addr_list AF_INET 192.168.151.1 AF_INET 192.168.151.2 Dave From yamato at redhat.com Thu Jun 9 14:39:30 2011 From: yamato at redhat.com (Masatake YAMATO) Date: Thu, 09 Jun 2011 23:39:30 +0900 (JST) Subject: [Linux-cluster] [RFC] Read access to /config/dlm//comms//addr In-Reply-To: <20110609140546.GA30732@redhat.com> References: <20110609.175529.646090028440251828.yamato@redhat.com> <20110609140546.GA30732@redhat.com> Message-ID: <20110609.233930.815852573745836394.yamato@redhat.com> > On Thu, Jun 09, 2011 at 05:55:29PM +0900, Masatake YAMATO wrote: >> Hi, >> >> I've found /config/dlm//comms//addr is readable >> (in meaning of ls -l) but no handler(comm_addr_read) is defined in >> dlm/fs/dlm/config.c. >> >> If cat command works fine with /config/dlm//comms//addr, >> it will be nice to understand the status of dlm. So I'm thinking about >> writing a patch. >> >> But after reading the source code, I've found its difficulties; >> /config/dlm//comms//addr holds 'struct >> sockaddr_storage'. > > Another problem is that you can write multiple addr's to that file > sequentially when using SCTP, so which do you get when you read it? > >> 3. Make /config/dlm//comms//addr unreadable (in meaning of ls -l) >> >> e.g. >> # ls -l /config/dlm//comms//addr >> --w-------. 1 root root 4096 Jun 9 08:51 /config/dlm//comms//addr >> >> Advantage: easy to implement. >> Disadvantage: no way to know the value of node addr of dlm view. > > I suggest this. If you want a way to read them, I'd add a new readonly > file addr_list, Of course, I want:) > # cat /config/dlm//comms//addr_list > AF_INET 192.168.151.1 > AF_INET 192.168.151.2 This is what I want. > Dave > From laszlo.budai at gmail.com Thu Jun 9 14:46:57 2011 From: laszlo.budai at gmail.com (Budai Laszlo) Date: Thu, 09 Jun 2011 17:46:57 +0300 Subject: [Linux-cluster] gfs mount at boot Message-ID: <4DF0DCE1.2000406@gmail.com> Hi, What should be done in order to mount a gfs file system at boot? I've created the following line in /etc/fstab: /dev/clvg/gfsvol /mnt/testgfs gfs defaults 0 0 but it is not mounting the fs at boot. If I run "mount -a" then the fs will get mounted. Is there any option for fstab to specify that this mount should be delayed until the cluster is up and running? Thank you, Laszlo From corey.kovacs at gmail.com Thu Jun 9 14:52:46 2011 From: corey.kovacs at gmail.com (Corey Kovacs) Date: Thu, 9 Jun 2011 15:52:46 +0100 Subject: [Linux-cluster] umount failing... Message-ID: Folks, I have a 5 node cluster serving out several NFS exports, one of which is /home. All of the nfs services can be moved from node to node without problem except for the one providing /home. The logs on that node indicate the umount is failing and then the service is disabled (self-fence is not enabled). Even after the service is put into a failed state and then disabled manually, umount fails... I had noticed recently while playing with conga that creating a service for /home on a test cluster a warning was issued about reserved words and as I recall (i could be wrong) /home was among the illegal parameters for the mount point. I have turned everything off that I could think of which might be "holding" the mount and have run the various iterations of lsof, find etc. nothing shows up as having anything being actively used. This particular file system is 1TB. Is there something wrong with using /home as an export? Some specifics. 
RHEL5.6 (updated as of last week) HA-LVM protecting ext3 using the newer "preferred method" with clvmd Ext3 for exported file systems 5 nodes. Any ideas would be greatly appreciated. -C From thomas at sjolshagen.net Thu Jun 9 15:04:29 2011 From: thomas at sjolshagen.net (Thomas Sjolshagen) Date: Thu, 9 Jun 2011 11:04:29 -0400 Subject: [Linux-cluster] gfs mount at boot In-Reply-To: <4DF0DCE1.2000406@gmail.com> References: <4DF0DCE1.2000406@gmail.com> Message-ID: Usually, there's a gfs boot service or network filesystem boot service you may need to enable. On Jun 9, 2011, at 10:46, Budai Laszlo wrote: > Hi, > > What should be done in order to mount a gfs file system at boot? > I've created the following line in /etc/fstab: > > /dev/clvg/gfsvol /mnt/testgfs gfs defaults 0 0 > > but it is not mounting the fs at boot. If I run "mount -a" then the fs > will get mounted. > Is there any option for fstab to specify that this mount should be > delayed until the cluster is up and running? > > Thank you, > Laszlo > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From corey.kovacs at gmail.com Thu Jun 9 15:12:35 2011 From: corey.kovacs at gmail.com (Corey Kovacs) Date: Thu, 9 Jun 2011 16:12:35 +0100 Subject: [Linux-cluster] gfs mount at boot In-Reply-To: References: <4DF0DCE1.2000406@gmail.com> Message-ID: Put "_netfs" in the options line. GFS is dependent on the network so once the network is up, it should try to mount again, but not before. On Thu, Jun 9, 2011 at 4:04 PM, Thomas Sjolshagen wrote: > Usually, there's a gfs boot service or network filesystem boot service you may need to enable. > > On Jun 9, 2011, at 10:46, Budai Laszlo wrote: > >> Hi, >> >> What should be done in order to mount a gfs file system at boot? >> I've created the following line in /etc/fstab: >> >> /dev/clvg/gfsvol ? ? ? ?/mnt/testgfs ? ? ? ? ? ?gfs ? ? defaults ? ? ? ?0 0 >> >> but it is not mounting the fs at boot. If I run "mount -a" then the fs >> will get mounted. >> Is there any option for fstab to specify that this mount should be >> delayed ?until the cluster is up and running? >> >> Thank you, >> Laszlo >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From linux at alteeve.com Thu Jun 9 15:20:18 2011 From: linux at alteeve.com (Digimer) Date: Thu, 09 Jun 2011 11:20:18 -0400 Subject: [Linux-cluster] gfs mount at boot In-Reply-To: <4DF0DCE1.2000406@gmail.com> References: <4DF0DCE1.2000406@gmail.com> Message-ID: <4DF0E4B2.1070704@alteeve.com> On 06/09/2011 10:46 AM, Budai Laszlo wrote: > Hi, > > What should be done in order to mount a gfs file system at boot? > I've created the following line in /etc/fstab: > > /dev/clvg/gfsvol /mnt/testgfs gfs defaults 0 0 > > but it is not mounting the fs at boot. If I run "mount -a" then the fs > will get mounted. > Is there any option for fstab to specify that this mount should be > delayed until the cluster is up and running? > > Thank you, > Laszlo The trick is that you need to setup the GFS2 partition with "rw,suid,dev,exec,nouser,async" instead of "defaults". This is because "defaults" implies "auto", and the cluster is not online that early in the boot process. To have it mount on boot, start the cluster (chkconfig cman on). If you defined GFS2 as a managed resource, then also enable rgmanager at boot. 
If not, then instead, enable "gfs2" at boot.

If you're not using RHCS, then the same should still work. You just need
to ensure that the service that provides quorum (corosync in pacemaker)
starts so that the cluster can form and provide DLM, which is needed by
GFS2. With DLM, then it's a matter of starting the resource manager
(pacemaker/rgmanager) if the partitions are managed, or starting GFS2
which will consult /etc/fstab and mount any found GFS2 partitions.

--
Digimer
E-Mail:              digimer at alteeve.com
Freenode handle:     digimer
Papers and Projects: http://alteeve.com
Node Assassin:       http://nodeassassin.org
"I feel confined, only free to expand myself within boundaries."

From spaulo05 at hotmail.com Thu Jun 9 15:08:47 2011
From: spaulo05 at hotmail.com (Sergio Paulo)
Date: Thu, 9 Jun 2011 16:08:47 +0100
Subject: [Linux-cluster] gfs mount at boot
In-Reply-To: <4DF0DCE1.2000406@gmail.com>
References: <4DF0DCE1.2000406@gmail.com>
Message-ID:

Hi!

look at this example and try to adapt it on /etc/fstab

/dev/VG01/LV00 /oracle gfs _netdev,defaults 0 0

manually I use

mount.gfs /dev/VG01/LV00 /oracle

Sérgio Paulo Fonseca

> Date: Thu, 9 Jun 2011 17:46:57 +0300
> From: laszlo.budai at gmail.com
> To: linux-cluster at redhat.com
> Subject: [Linux-cluster] gfs mount at boot
>
> Hi,
>
> What should be done in order to mount a gfs file system at boot?
> I've created the following line in /etc/fstab:
>
> /dev/clvg/gfsvol        /mnt/testgfs            gfs     defaults        0 0
>
> but it is not mounting the fs at boot. If I run "mount -a" then the fs
> will get mounted.
> Is there any option for fstab to specify that this mount should be
> delayed until the cluster is up and running?
>
> Thank you,
> Laszlo
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

From linux at alteeve.com Thu Jun 9 15:23:40 2011
From: linux at alteeve.com (Digimer)
Date: Thu, 09 Jun 2011 11:23:40 -0400
Subject: [Linux-cluster] gfs mount at boot
In-Reply-To:
References: <4DF0DCE1.2000406@gmail.com>
Message-ID: <4DF0E57C.8000204@alteeve.com>

On 06/09/2011 11:12 AM, Corey Kovacs wrote:
> Put "_netfs" in the options line. GFS is dependent on the network so
> once the network is up, it should try to mount again, but not before.

GFS2 is dependent on the cluster's distributed lock manager. It can't
come up until: network -> cluster engine -> resource manager or gfs2
daemon.

--
Digimer
E-Mail:              digimer at alteeve.com
Freenode handle:     digimer
Papers and Projects: http://alteeve.com
Node Assassin:       http://nodeassassin.org
"I feel confined, only free to expand myself within boundaries."

From corey.kovacs at gmail.com Thu Jun 9 15:27:27 2011
From: corey.kovacs at gmail.com (Corey Kovacs)
Date: Thu, 9 Jun 2011 16:27:27 +0100
Subject: [Linux-cluster] gfs mount at boot
In-Reply-To: <4DF0E4B2.1070704@alteeve.com>
References: <4DF0DCE1.2000406@gmail.com> <4DF0E4B2.1070704@alteeve.com>
Message-ID:

Ahh, forgot about the gfs2 service. Been a long time since I've set
GFS1/2 up. I'll go crawl back into my cave now...

-C

On Thu, Jun 9, 2011 at 4:20 PM, Digimer wrote:
> On 06/09/2011 10:46 AM, Budai Laszlo wrote:
>>
>> Hi,
>>
>> What should be done in order to mount a gfs file system at boot?
>> I've created the following line in /etc/fstab:
>>
>> /dev/clvg/gfsvol        /mnt/testgfs            gfs     defaults        0
>> 0
>>
>> but it is not mounting the fs at boot. If I run "mount -a" then the fs
>> will get mounted.
>> Is there any option for fstab to specify that this mount should be >> delayed ?until the cluster is up and running? >> >> Thank you, >> Laszlo > > The trick is that you need to setup the GFS2 partition with > "rw,suid,dev,exec,nouser,async" instead of "defaults". This is because > "defaults" implies "auto", and the cluster is not online that early in the > boot process. > > To have it mount on boot, start the cluster (chkconfig cman on). If you > defined GFS2 as a managed resource, then also enable rgmanager at boot. If > not, then instead, enable "gfs2" at boot. > > If you're not using RHCS, then the same should still work. You just need to > ensure that the service that provides quorum (corosync in pacemaker) starts > so that the cluster can form and provide DLM, which is needed by GFS2. With > DLM, then it's a matter of starting the resource manager > (pacemaker/rgmanager) if the partitions are managed, or starting GFS2 which > will consult /etc/fstab and mount any found GFS2 partitions. > > -- > Digimer > E-Mail: ? ? ? ? ? ? ?digimer at alteeve.com > Freenode handle: ? ? digimer > Papers and Projects: http://alteeve.com > Node Assassin: ? ? ? http://nodeassassin.org > "I feel confined, only free to expand myself within boundaries." > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From ajb2 at mssl.ucl.ac.uk Thu Jun 9 18:48:04 2011 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Thu, 09 Jun 2011 19:48:04 +0100 Subject: [Linux-cluster] gfs mount at boot In-Reply-To: <4DF0DCE1.2000406@gmail.com> References: <4DF0DCE1.2000406@gmail.com> Message-ID: <4DF11564.7090108@mssl.ucl.ac.uk> On 09/06/11 15:46, Budai Laszlo wrote: > Hi, > > What should be done in order to mount a gfs file system at boot? > I've created the following line in /etc/fstab: > > /dev/clvg/gfsvol /mnt/testgfs gfs defaults 0 0 > > but it is not mounting the fs at boot. If I run "mount -a" then the fs > will get mounted. > Is there any option for fstab to specify that this mount should be > delayed until the cluster is up and running? Add _netfs after defaults. From rossnick-lists at cybercat.ca Sat Jun 11 02:43:22 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Fri, 10 Jun 2011 22:43:22 -0400 Subject: [Linux-cluster] clustat -x Message-ID: <4DF2D64A.3040702@cybercat.ca> Hi ! I am make scripts to help monitor and administer our cluster. And I wounder where I could find info about the xml output of clustat ? Notably what are the different state and flags ? I know that state of 112 means started, but what else ? thanks, Nicolas From lcl at nym.hush.com Sun Jun 12 23:54:00 2011 From: lcl at nym.hush.com (lcl at nym.hush.com) Date: Sun, 12 Jun 2011 16:54:00 -0700 Subject: [Linux-cluster] GFS2 reads eventually cause writes to slow Message-ID: <20110612235400.4273F6F437@smtp.hushmail.com> Hello, My team has been having a problem while testing a cluster in a lab in which write operations are extremely slow after reads have been performed continuously for some period of time. We eventually isolated the problem to where we can replicate it on only one node. The other two nodes are powered on, but no filesystems are mounted on those nodes and no operations are performed on those nodes. To replicate, we first reboot everything, then start a number of threads (300+) doing random reads of 8K files in a large directory structure. This goes well for up to 45 minutes. 
(We're not expecting the reads to be that fast, given they are not cached at that point, but they are within expectations.) We don't do any writes at this time. Then something changes, and we can see that glock_manager is generally at 98-99% in iotop. At this point, however, the reads are still fast enough. Once the node has gotten into this state, an attempt to write an 8K file will usually take several seconds. Note that it takes several seconds to write a file even if it is written on a different filesystem from the one on which we are doing the reads. This bad condition persists until reads are stopped. After reads are stopped, the node recovers in a few minutes, after which writes can be performed quickly. After that, once the test is restarted, it will once again take up to 45 minutes to get the node into the bad state again. Our hypothesis at this point is that there is some cleanup that is not getting performed as long as intensive reads are ongoing. Because that cleanup has not been done, writes are extremely slow. Once the reads stop, the necessary cleanup gets performed, and then it is a long time to cause the problem again. We've tried various tuning options and are starting to dig into source code to find out more, but I thought I'd find out if anyone has any insight into this. We're testing on CentOS 5 with kernel 2.6.18-238.12.1.el5, with gfs2-kmod-debuginfo.x86_64 1.92-1.1.el5_2.2. Thanks, Brian From torajveersingh at gmail.com Mon Jun 13 10:44:12 2011 From: torajveersingh at gmail.com (Rajveer Singh) Date: Mon, 13 Jun 2011 16:14:12 +0530 Subject: [Linux-cluster] umount failing... In-Reply-To: References: Message-ID: On Thu, Jun 9, 2011 at 8:22 PM, Corey Kovacs wrote: > Folks, > > I have a 5 node cluster serving out several NFS exports, one of which is > /home. > > All of the nfs services can be moved from node to node without problem > except for the one providing /home. > > The logs on that node indicate the umount is failing and then the > service is disabled (self-fence is not enabled). > > Even after the service is put into a failed state and then disabled > manually, umount fails... > > I had noticed recently while playing with conga that creating a > service for /home on a test cluster a warning was issued about > reserved words and as I recall (i could be wrong) /home was among the > illegal parameters for the mount point. > > I have turned everything off that I could think of which might be > "holding" the mount and have run the various iterations of lsof, find > etc. nothing shows up as having anything being actively used. > > This particular file system is 1TB. > > Is there something wrong with using /home as an export? > > Some specifics. > > RHEL5.6 (updated as of last week) > HA-LVM protecting ext3 using the newer "preferred method" with clvmd > Ext3 for exported file systems > 5 nodes. > > > Any ideas would be greatly appreciated. > > -C > > Can you share your log file and cluster.conf file -------------- next part -------------- An HTML attachment was scrubbed... URL: From laszlo.budai at gmail.com Tue Jun 14 11:13:13 2011 From: laszlo.budai at gmail.com (Budai Laszlo) Date: Tue, 14 Jun 2011 14:13:13 +0300 Subject: [Linux-cluster] gfs mount at boot In-Reply-To: References: <4DF0DCE1.2000406@gmail.com> Message-ID: <4DF74249.6020906@gmail.com> Hi all, Indeed enabling the gfs service has mounted the file system after reboot. 
I have also tried the other suggestions, but none of them worked for me
(the most probable cause is that the cluster stack was not yet ready when
the system tried to do the mount).

So my conclusion is that if one needs a gfs at boot without configuring
any cluster resource to mount it, then the gfs system service needs to be
enabled (chkconfig gfs on).

Thank you all for your ideas and time.
Laszlo

On 06/09/2011 06:04 PM, Thomas Sjolshagen wrote:
> Usually, there's a gfs boot service or network filesystem boot service you may need to enable.
>
> On Jun 9, 2011, at 10:46, Budai Laszlo wrote:
>
>> Hi,
>>
>> What should be done in order to mount a gfs file system at boot?
>> I've created the following line in /etc/fstab:
>>
>> /dev/clvg/gfsvol        /mnt/testgfs            gfs     defaults        0 0
>>
>> but it is not mounting the fs at boot. If I run "mount -a" then the fs
>> will get mounted.
>> Is there any option for fstab to specify that this mount should be
>> delayed until the cluster is up and running?
>>
>> Thank you,
>> Laszlo
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>

From benpro82 at gmail.com Tue Jun 14 15:31:12 2011
From: benpro82 at gmail.com (benpro)
Date: Tue, 14 Jun 2011 17:31:12 +0200
Subject: [Linux-cluster] CLVM Documentation.
Message-ID:

Hi there,

I'm actually studying some solutions to have a shared FS for KVM and live
migration. I've already tested DRBD+OCFS2 with success.

I wanted to take a look at CLVM, but I don't find any explicit
documentation, like how to configure lvm.conf and how to set up CLVM?

Do you have any links which talk about CLVM? I've already found the Red Hat
documentation [1], but it still doesn't explain the subject in terms of
software configuration.

Thanks in advance.

[1] : http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Logical_Volume_Manager_Administration/LVM_Cluster_Overview.html

Regards,

---
Benoît.S
alias Benpro

From linux at alteeve.com Tue Jun 14 15:41:52 2011
From: linux at alteeve.com (Digimer)
Date: Tue, 14 Jun 2011 11:41:52 -0400
Subject: [Linux-cluster] CLVM Documentation.
In-Reply-To:
References:
Message-ID: <4DF78140.9080203@alteeve.com>

On 06/14/2011 11:31 AM, benpro wrote:
> Hi there,
>
> I'm actually studying some solutions to have a shared FS for KVM and live
> migration.
> I've already tested DRBD+OCFS2 with success.
>
> I wanted to take a look at CLVM, but I don't find any explicit
> documentation, like how to configure lvm.conf and how to set up CLVM?
>
> Do you have any links which talk about CLVM? I've already found the Red Hat
> documentation [1], but it still doesn't explain the subject in terms of
> software configuration.
>
> Thanks in advance.
>
> [1] : http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Logical_Volume_Manager_Administration/LVM_Cluster_Overview.html
>
>
> Regards,
>
> ---
> Benoît.S
> alias Benpro

In the end, all that is really needed is to change locking_type to "3"
and fallback_to_local_locking to "0" (and, of course, have DLM). I've
got a bit of documentation on implementing CLVM on EL5 with DRBD here:

http://wiki.alteeve.com/index.php/Red_Hat_Cluster_Service_2_Tutorial#Setting_Up_Clustered_LVM

It is not at all extensive, but hopefully it's sufficient to help.
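A minimal sketch of those lvm.conf changes, assuming a stock EL5 lvm.conf
(defaults and init script names can differ between releases):

# /etc/lvm/lvm.conf -- switch LVM to cluster-wide (DLM-backed) locking
locking_type = 3                  # 3 = clustered locking via clvmd/DLM
fallback_to_local_locking = 0     # fail rather than silently fall back to local locking

# make sure the cluster stack and clvmd come up at boot:
# chkconfig cman on ; chkconfig clvmd on
# service clvmd start

Volume groups meant to be shared also need the clustered flag, e.g.
"vgcreate -cy ..." for new ones or "vgchange -cy <vg>" for existing ones,
and every node must see the same physical volumes.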
-- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "I feel confined, only free to expand myself within boundaries." From skjbalaji at gmail.com Tue Jun 14 18:46:18 2011 From: skjbalaji at gmail.com (Balaji S) Date: Wed, 15 Jun 2011 00:16:18 +0530 Subject: [Linux-cluster] Cluster Failover Failed Message-ID: Hi, In my setup implemented 10 tow node cluster's which running mysql as cluster service, ipmi card as fencing device. In my /var/log/messages i am keep getting the errors like below, Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdm, sector 0 Jun 14 12:50:48 hostname kernel: sd 3:0:2:2: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:50:48 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:50:48 hostname kernel: Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdn, sector 0 Jun 14 12:50:48 hostname kernel: sd 3:0:2:4: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:50:48 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:50:48 hostname kernel: Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdp, sector 0 Jun 14 12:51:10 hostname kernel: sd 3:0:0:1: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:51:10 hostname kernel: Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdc, sector 0 Jun 14 12:51:10 hostname kernel: printk: 3 messages suppressed. Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdc, logical block 0 Jun 14 12:51:10 hostname kernel: sd 3:0:0:2: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:51:10 hostname kernel: Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdd, sector 0 Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdd, logical block 0 Jun 14 12:51:10 hostname kernel: sd 3:0:0:4: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required when i am checking the multipath -ll , this all devices are in passive path. Environment : RHEL 5.4 & EMC SAN Please suggest how to overcome this issue. Support will be highly helpful. Thanks in Advance -- Thanks, BSK -------------- next part -------------- An HTML attachment was scrubbed... URL: From parvez.h.shaikh at gmail.com Wed Jun 15 05:45:03 2011 From: parvez.h.shaikh at gmail.com (Parvez Shaikh) Date: Wed, 15 Jun 2011 11:15:03 +0530 Subject: [Linux-cluster] Plugged out blade from bladecenter chassis - fence_bladecenter failed In-Reply-To: References: <4DBA71EA.9070303@redhat.com> <4DBE5E66.80802@redhat.com> Message-ID: Hi, Has anyone used missing_as_off in cluster.conf file? Any help where to put this option in cluster.conf would be greatly appreciated Thanks, Parvez On Mon, May 2, 2011 at 6:49 PM, Parvez Shaikh wrote: > Hi Marek, > > I tried the option missing_as_off="1" and now I get an another error - > > fenced[18433]: fence "node5.sscdomain" failed > fenced[18433]: fencing node "node5.sscdomain" > > Sniplet of cluster.conf file is - > .... > > > > > > > > > .... > > login="USERID" name="BladeCenterFencing" passwd="PASSW0RD"/> > > > Did I miss something? 
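For reference, a hedged sketch of where missing_as_off usually sits: as an
attribute on the fencedevice definition in the fencedevices section of
cluster.conf. The ipaddr value below is a placeholder; only the attributes
still visible in the quote above come from the original config:

<fencedevice agent="fence_bladecenter" ipaddr="x.x.x.x" login="USERID" missing_as_off="1" name="BladeCenterFencing" passwd="PASSW0RD"/>

The per-node device entries inside each fence method then reference this
device by its name ("BladeCenterFencing").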
> > Thanks > Parvez > > > > On Mon, May 2, 2011 at 1:03 PM, Marek Grac wrote: > >> Hi, >> >> >> On 04/29/2011 10:15 AM, Parvez Shaikh wrote: >> >>> Hi Marek, >>> >>> Can we give this option in cluster.conf file for bladecenter fencing >>> device or method >>> >> >> for cluster.conf you should add ... missing_as_off="1" ... to fence >> configuration >> >> >> >>> For IPMI, fencing is there similar option? >>> >>> >> There is no such method for IPMI. >> >> >> m, >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From RMartinez-Sanchez at nds.com Wed Jun 15 11:11:03 2011 From: RMartinez-Sanchez at nds.com (Martinez-Sanchez, Raul) Date: Wed, 15 Jun 2011 12:11:03 +0100 Subject: [Linux-cluster] Cluster Failover Failed In-Reply-To: References: Message-ID: <7370F6F5ED3B874F988F5CE657D801EA13A9309289@UKMA1.UK.NDS.COM> Hi Balaji, According to RedHat documentation some Storage Array Devices configured in active/passive mode and using multipath will display this I/O error messages, so this might also be your case (see https://access.redhat.com/kb/docs/DOC-35489), this link indicates that the messages are harmless and can be avoided following its instructions. The logs you sent do not indicate anything related to fencing, so you might need to send the relevant info for that. Cheers, Ra?l From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Balaji S Sent: Tuesday, June 14, 2011 7:46 PM To: Linux-cluster at redhat.com Subject: [Linux-cluster] Cluster Failover Failed Hi, In my setup implemented 10 tow node cluster's which running mysql as cluster service, ipmi card as fencing device. In my /var/log/messages i am keep getting the errors like below, Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdm, sector 0 Jun 14 12:50:48 hostname kernel: sd 3:0:2:2: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:50:48 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:50:48 hostname kernel: Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdn, sector 0 Jun 14 12:50:48 hostname kernel: sd 3:0:2:4: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:50:48 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:50:48 hostname kernel: Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdp, sector 0 Jun 14 12:51:10 hostname kernel: sd 3:0:0:1: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:51:10 hostname kernel: Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdc, sector 0 Jun 14 12:51:10 hostname kernel: printk: 3 messages suppressed. Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdc, logical block 0 Jun 14 12:51:10 hostname kernel: sd 3:0:0:2: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:51:10 hostname kernel: Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdd, sector 0 Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdd, logical block 0 Jun 14 12:51:10 hostname kernel: sd 3:0:0:4: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:51:10 hostname kernel: Add. 
Sense: Logical unit not ready, manual intervention required when i am checking the multipath -ll , this all devices are in passive path. Environment : RHEL 5.4 & EMC SAN Please suggest how to overcome this issue. Support will be highly helpful. Thanks in Advance -- Thanks, BSK ________________________________ ************************************************************************************** This message is confidential and intended only for the addressee. If you have received this message in error, please immediately notify the postmaster at nds.com and delete it from your system as well as any copies. The content of e-mails as well as traffic data may be monitored by NDS for employment and security purposes. To protect the environment please do not print this e-mail unless necessary. NDS Limited. Registered Office: One London Road, Staines, Middlesex, TW18 4EX, United Kingdom. A company registered in England and Wales. Registered no. 3080780. VAT no. GB 603 8808 40-00 ************************************************************************************** -------------- next part -------------- An HTML attachment was scrubbed... URL: From alvaro.fernandez at sivsa.com Wed Jun 15 12:14:44 2011 From: alvaro.fernandez at sivsa.com (Alvaro Jose Fernandez) Date: Wed, 15 Jun 2011 14:14:44 +0200 Subject: [Linux-cluster] Cluster Failover Failed References: <7370F6F5ED3B874F988F5CE657D801EA13A9309289@UKMA1.UK.NDS.COM> Message-ID: <607D6181D9919041BE792D70EF2AEC4801A50506@LIMENS.sivsa.int> Hi, DOC-35489 only partionally approaches the problem. I have it too, on a passive/active IBM DS4000 array and RHEL5.5. I've excluded from lvm.conf any SAN partitions as per the note (and also made a new initrd boot, as lvm.conf is included at boot time as the / partition I have it LVM'ed) , but messages still apears on bootup. They always dissapear when multipathd service starts and its scsi_dh_rdac discipline is loaded. Even opened a case with Redhat, and obtained the same response (but not workaround): "it's entirely harmless, they are normal". Alvaro ________________________________ De: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] En nombre de Martinez-Sanchez, Raul Enviado el: mi?rcoles, 15 de junio de 2011 13:11 Para: 'Linux-cluster at redhat.com' Asunto: Re: [Linux-cluster] Cluster Failover Failed Hi Balaji, According to RedHat documentation some Storage Array Devices configured in active/passive mode and using multipath will display this I/O error messages, so this might also be your case (see https://access.redhat.com/kb/docs/DOC-35489), this link indicates that the messages are harmless and can be avoided following its instructions. The logs you sent do not indicate anything related to fencing, so you might need to send the relevant info for that. Cheers, Ra?l From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Balaji S Sent: Tuesday, June 14, 2011 7:46 PM To: Linux-cluster at redhat.com Subject: [Linux-cluster] Cluster Failover Failed Hi, In my setup implemented 10 tow node cluster's which running mysql as cluster service, ipmi card as fencing device. In my /var/log/messages i am keep getting the errors like below, Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdm, sector 0 Jun 14 12:50:48 hostname kernel: sd 3:0:2:2: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:50:48 hostname kernel: Add. 
Sense: Logical unit not ready, manual intervention required Jun 14 12:50:48 hostname kernel: Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdn, sector 0 Jun 14 12:50:48 hostname kernel: sd 3:0:2:4: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:50:48 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:50:48 hostname kernel: Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdp, sector 0 Jun 14 12:51:10 hostname kernel: sd 3:0:0:1: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:51:10 hostname kernel: Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdc, sector 0 Jun 14 12:51:10 hostname kernel: printk: 3 messages suppressed. Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdc, logical block 0 Jun 14 12:51:10 hostname kernel: sd 3:0:0:2: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:51:10 hostname kernel: Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdd, sector 0 Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdd, logical block 0 Jun 14 12:51:10 hostname kernel: sd 3:0:0:4: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required when i am checking the multipath -ll , this all devices are in passive path. Environment : RHEL 5.4 & EMC SAN Please suggest how to overcome this issue. Support will be highly helpful. Thanks in Advance -- Thanks, BSK ________________________________ ************************************************************************************** This message is confidential and intended only for the addressee. If you have received this message in error, please immediately notify the postmaster at nds.com and delete it from your system as well as any copies. The content of e-mails as well as traffic data may be monitored by NDS for employment and security purposes. To protect the environment please do not print this e-mail unless necessary. NDS Limited. Registered Office: One London Road, Staines, Middlesex, TW18 4EX, United Kingdom. A company registered in England and Wales. Registered no. 3080780. VAT no. GB 603 8808 40-00 ************************************************************************************** -------------- next part -------------- An HTML attachment was scrubbed... URL: From RMartinez-Sanchez at nds.com Wed Jun 15 12:54:47 2011 From: RMartinez-Sanchez at nds.com (Martinez-Sanchez, Raul) Date: Wed, 15 Jun 2011 13:54:47 +0100 Subject: [Linux-cluster] Cluster Failover Failed In-Reply-To: <607D6181D9919041BE792D70EF2AEC4801A50506@LIMENS.sivsa.int> References: <7370F6F5ED3B874F988F5CE657D801EA13A9309289@UKMA1.UK.NDS.COM> <607D6181D9919041BE792D70EF2AEC4801A50506@LIMENS.sivsa.int> Message-ID: <7370F6F5ED3B874F988F5CE657D801EA13A930928B@UKMA1.UK.NDS.COM> Hi Alvaro, I have also opened a ticket with RedHat for the same reasons on rhel5u6 and a DS5020 and a DS3524 which I believe they are both active/active and multipath seems to treat them as active/passive, but I guess this is for another mailing list. 
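As a reference for the lvm.conf exclusion Alvaro mentions, it is typically
a filter rule of this form (a sketch only; the accept/reject patterns are
assumptions and have to match the local device names):

# /etc/lvm/lvm.conf -- scan multipath devices and the local disk,
# skip the raw sd* SAN paths (first matching pattern wins)
filter = [ "a|/dev/mapper/mpath.*|", "a|/dev/sda.*|", "r|/dev/sd.*|" ]

If the root filesystem is on LVM, the initrd has to be rebuilt afterwards
(e.g. with mkinitrd) so the boot-time copy of lvm.conf carries the same
filter.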
Ra?l From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Alvaro Jose Fernandez Sent: Wednesday, June 15, 2011 1:15 PM To: linux clustering Subject: Re: [Linux-cluster] Cluster Failover Failed Hi, DOC-35489 only partionally approaches the problem. I have it too, on a passive/active IBM DS4000 array and RHEL5.5. I've excluded from lvm.conf any SAN partitions as per the note (and also made a new initrd boot, as lvm.conf is included at boot time as the / partition I have it LVM'ed) , but messages still apears on bootup. They always dissapear when multipathd service starts and its scsi_dh_rdac discipline is loaded. Even opened a case with Redhat, and obtained the same response (but not workaround): "it's entirely harmless, they are normal". Alvaro ________________________________ De: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] En nombre de Martinez-Sanchez, Raul Enviado el: mi?rcoles, 15 de junio de 2011 13:11 Para: 'Linux-cluster at redhat.com' Asunto: Re: [Linux-cluster] Cluster Failover Failed Hi Balaji, According to RedHat documentation some Storage Array Devices configured in active/passive mode and using multipath will display this I/O error messages, so this might also be your case (see https://access.redhat.com/kb/docs/DOC-35489), this link indicates that the messages are harmless and can be avoided following its instructions. The logs you sent do not indicate anything related to fencing, so you might need to send the relevant info for that. Cheers, Ra?l From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Balaji S Sent: Tuesday, June 14, 2011 7:46 PM To: Linux-cluster at redhat.com Subject: [Linux-cluster] Cluster Failover Failed Hi, In my setup implemented 10 tow node cluster's which running mysql as cluster service, ipmi card as fencing device. In my /var/log/messages i am keep getting the errors like below, Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdm, sector 0 Jun 14 12:50:48 hostname kernel: sd 3:0:2:2: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:50:48 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:50:48 hostname kernel: Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdn, sector 0 Jun 14 12:50:48 hostname kernel: sd 3:0:2:4: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:50:48 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:50:48 hostname kernel: Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdp, sector 0 Jun 14 12:51:10 hostname kernel: sd 3:0:0:1: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:51:10 hostname kernel: Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdc, sector 0 Jun 14 12:51:10 hostname kernel: printk: 3 messages suppressed. Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdc, logical block 0 Jun 14 12:51:10 hostname kernel: sd 3:0:0:2: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:51:10 hostname kernel: Add. 
Sense: Logical unit not ready, manual intervention required Jun 14 12:51:10 hostname kernel: Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdd, sector 0 Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdd, logical block 0 Jun 14 12:51:10 hostname kernel: sd 3:0:0:4: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required when i am checking the multipath -ll , this all devices are in passive path. Environment : RHEL 5.4 & EMC SAN Please suggest how to overcome this issue. Support will be highly helpful. Thanks in Advance -- Thanks, BSK ________________________________ ************************************************************************************** This message is confidential and intended only for the addressee. If you have received this message in error, please immediately notify the postmaster at nds.com and delete it from your system as well as any copies. The content of e-mails as well as traffic data may be monitored by NDS for employment and security purposes. To protect the environment please do not print this e-mail unless necessary. NDS Limited. Registered Office: One London Road, Staines, Middlesex, TW18 4EX, United Kingdom. A company registered in England and Wales. Registered no. 3080780. VAT no. GB 603 8808 40-00 ************************************************************************************** -------------- next part -------------- An HTML attachment was scrubbed... URL: From alvaro.fernandez at sivsa.com Wed Jun 15 13:16:14 2011 From: alvaro.fernandez at sivsa.com (Alvaro Jose Fernandez) Date: Wed, 15 Jun 2011 15:16:14 +0200 Subject: [Linux-cluster] Cluster Failover Failed References: <7370F6F5ED3B874F988F5CE657D801EA13A9309289@UKMA1.UK.NDS.COM><607D6181D9919041BE792D70EF2AEC4801A50506@LIMENS.sivsa.int> <7370F6F5ED3B874F988F5CE657D801EA13A930928B@UKMA1.UK.NDS.COM> Message-ID: <607D6181D9919041BE792D70EF2AEC4801A50524@LIMENS.sivsa.int> Hi Raul, Yes, it seems like-stuff. Thanks for pointing out the same still applies to RHEL5.6 . There is a opened bugzilla at https://bugzilla.redhat.com/show_bug.cgi?id=649705 . Low priority, of course (for Redhat), as no response at all. They seem to ignore that sometimes we have to do demostrations to prospective customers, etc, and the image of all these messages popping out from the console and the logs are unforgettable. Alvaro ________________________________ De: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] En nombre de Martinez-Sanchez, Raul Enviado el: mi?rcoles, 15 de junio de 2011 14:55 Para: 'linux clustering' Asunto: Re: [Linux-cluster] Cluster Failover Failed Hi Alvaro, I have also opened a ticket with RedHat for the same reasons on rhel5u6 and a DS5020 and a DS3524 which I believe they are both active/active and multipath seems to treat them as active/passive, but I guess this is for another mailing list. Ra?l From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Alvaro Jose Fernandez Sent: Wednesday, June 15, 2011 1:15 PM To: linux clustering Subject: Re: [Linux-cluster] Cluster Failover Failed Hi, DOC-35489 only partionally approaches the problem. I have it too, on a passive/active IBM DS4000 array and RHEL5.5. I've excluded from lvm.conf any SAN partitions as per the note (and also made a new initrd boot, as lvm.conf is included at boot time as the / partition I have it LVM'ed) , but messages still apears on bootup. 
They always dissapear when multipathd service starts and its scsi_dh_rdac discipline is loaded. Even opened a case with Redhat, and obtained the same response (but not workaround): "it's entirely harmless, they are normal". Alvaro ________________________________ De: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] En nombre de Martinez-Sanchez, Raul Enviado el: mi?rcoles, 15 de junio de 2011 13:11 Para: 'Linux-cluster at redhat.com' Asunto: Re: [Linux-cluster] Cluster Failover Failed Hi Balaji, According to RedHat documentation some Storage Array Devices configured in active/passive mode and using multipath will display this I/O error messages, so this might also be your case (see https://access.redhat.com/kb/docs/DOC-35489), this link indicates that the messages are harmless and can be avoided following its instructions. The logs you sent do not indicate anything related to fencing, so you might need to send the relevant info for that. Cheers, Ra?l From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Balaji S Sent: Tuesday, June 14, 2011 7:46 PM To: Linux-cluster at redhat.com Subject: [Linux-cluster] Cluster Failover Failed Hi, In my setup implemented 10 tow node cluster's which running mysql as cluster service, ipmi card as fencing device. In my /var/log/messages i am keep getting the errors like below, Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdm, sector 0 Jun 14 12:50:48 hostname kernel: sd 3:0:2:2: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:50:48 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:50:48 hostname kernel: Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdn, sector 0 Jun 14 12:50:48 hostname kernel: sd 3:0:2:4: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:50:48 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:50:48 hostname kernel: Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdp, sector 0 Jun 14 12:51:10 hostname kernel: sd 3:0:0:1: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:51:10 hostname kernel: Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdc, sector 0 Jun 14 12:51:10 hostname kernel: printk: 3 messages suppressed. Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdc, logical block 0 Jun 14 12:51:10 hostname kernel: sd 3:0:0:2: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required Jun 14 12:51:10 hostname kernel: Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdd, sector 0 Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdd, logical block 0 Jun 14 12:51:10 hostname kernel: sd 3:0:0:4: Device not ready: <6>: Current: sense key: Not Ready Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, manual intervention required when i am checking the multipath -ll , this all devices are in passive path. Environment : RHEL 5.4 & EMC SAN Please suggest how to overcome this issue. Support will be highly helpful. Thanks in Advance -- Thanks, BSK ________________________________ ************************************************************************************** This message is confidential and intended only for the addressee. 
If you have received this message in error, please immediately notify the postmaster at nds.com and delete it from your system as well as any copies. The content of e-mails as well as traffic data may be monitored by NDS for employment and security purposes. To protect the environment please do not print this e-mail unless necessary. NDS Limited. Registered Office: One London Road, Staines, Middlesex, TW18 4EX, United Kingdom. A company registered in England and Wales. Registered no. 3080780. VAT no. GB 603 8808 40-00 ************************************************************************************** -------------- next part -------------- An HTML attachment was scrubbed... URL: From corey.kovacs at gmail.com Thu Jun 16 08:18:19 2011 From: corey.kovacs at gmail.com (Corey Kovacs) Date: Thu, 16 Jun 2011 09:18:19 +0100 Subject: [Linux-cluster] umount failing... In-Reply-To: References: Message-ID: My appologies for not getting back sooner. I am in the middle of a move. I cannot post my configs or logs (yeah, not helpful I know) but suffice it to say I strongly believe they are correct (I know, everyone says that). I've had other people look at them just make sure it wasn't a case of proofreading my own paper etc. and it always comes down to the umount failing. I have 6 other identical NFS services (save for the mount point/export location) and they all work flawlessly. That's why I am zeroing in on the use of '/home' as the culprit. Anyway, it's not a lot to go on I know, but I am just looking for directions to search for now. Thanks Corey On Mon, Jun 13, 2011 at 11:44 AM, Rajveer Singh wrote: > > > On Thu, Jun 9, 2011 at 8:22 PM, Corey Kovacs wrote: >> >> Folks, >> >> I have a 5 node cluster serving out several NFS exports, one of which is >> /home. >> >> All of the nfs services can be moved from node to node without problem >> except for the one providing /home. >> >> The logs on that node indicate the umount is failing and then the >> service is disabled (self-fence is not enabled). >> >> Even after the service is put into a failed state and then disabled >> manually, umount fails... >> >> I had noticed recently while playing with conga that creating a >> service for /home on a test cluster a warning was issued about >> reserved words and as I recall (i could be wrong) /home was among the >> illegal parameters for the mount point. >> >> I have turned everything off that I could think of which might be >> "holding" the mount and have run the various iterations of lsof, find >> etc. nothing shows up as having anything being actively used. >> >> This particular file system is 1TB. >> >> Is there something wrong with using /home as an export? >> >> Some specifics. >> >> RHEL5.6 (updated as of last week) >> HA-LVM protecting ext3 using the newer "preferred method" with clvmd >> Ext3 for exported file systems >> 5 nodes. >> >> >> Any ideas would be greatly appreciated. >> >> -C >> > Can you share your log file and cluster.conf file > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From fdinitto at redhat.com Thu Jun 16 13:13:14 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Thu, 16 Jun 2011 15:13:14 +0200 Subject: [Linux-cluster] resource agents 3.9.1 final release Message-ID: <4DFA016A.8040708@redhat.com> Hi everybody, The current resource agent repository [1] has been tagged to v3.9.1. Tarballs are also available [2]. This is the very first release of the common resource agent repository. 
It is a big milestone towards eliminating duplication of effort with the goal of improving the overall quality and user experience. There is still a long way to go but the first stone has been laid down. Highlights for the LHA resource agents set: - lxc, symlink: new resource agents - db2: major rewrite and support for master/slave mode of operation - exportfs: backup/restore of rmtab is back - mysql: multiple improvements for master/slave and replication - ocft: new tests for pgsql, postfix, and iscsi - CTDB: minor bug fixes - pgsql: improve configuration check and probe handling Highlights for the rgmanager resource agents set: - oracledb: use shutdown immediate - tomcat5: fix generated XML - nfsclient: fix client name mismatch - halvm: fix mirror dev failure - nfs: fix selinux integration Several changes have been made to the build system and the spec file to accommodate both projects? needs. The most noticeable change is the option to select "all", "linux-ha" or "rgmanager" resource agents at configuration time, which will also set the default for the spec file. Also several improvements have been made to correctly build srpm/rpms on different distributions in different versions. The full list of changes is available in the "ChangeLog" file for users, and in an auto-generated git-to-changelog file called "ChangeLog.devel". NOTE: About the 3.9.x version (particularly for linux-ha folks): This version was chosen simply because the rgmanager set was already at 3.1.x. In order to make it easier for distribution, and to keep package upgrades linear, we decided to bump the number higher than both projects. There is no other special meaning associated with it. Many thanks to everybody who helped with this release, in particular to the numerous contributors. Without you, the release would certainly not be possible. Cheers, The RAS Tribe [1] https://github.com/ClusterLabs/resource-agents/tarball/v3.9.1 [2] https://fedorahosted.org/releases/r/e/resource-agents/ PS: I am absolutely sure that URL [2] might give some people a fit, but we are still working to get a common release area. From fdinitto at redhat.com Thu Jun 16 13:48:32 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Thu, 16 Jun 2011 15:48:32 +0200 Subject: [Linux-cluster] cluster 3.1.2 stable release Message-ID: <4DFA09B0.2020001@redhat.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA256 Welcome to the cluster 3.1.2 release. This release contains several bug fixes and improvements. This version must be used in conjunction with resource-agents 3.9.1. The new source tarball can be downloaded here: https://fedorahosted.org/releases/c/l/cluster/cluster-3.1.2.tar.xz ChangeLog: https://fedorahosted.org/releases/c/l/cluster/Changelog-3.1.2 To report bugs or issues: https://bugzilla.redhat.com/ Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other sysadministrators or power users. Thanks/congratulations to all people that contributed to achieve this great milestone. 
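For anyone following along, fetching and unpacking the announced tarball looks roughly like this. The extract directory name and the configure/make steps are assumptions about the usual layout of the cluster 3.1.x tree, not official build instructions.

    wget https://fedorahosted.org/releases/c/l/cluster/cluster-3.1.2.tar.xz
    xz -d cluster-3.1.2.tar.xz && tar -xf cluster-3.1.2.tar
    cd cluster-3.1.2         # directory name assumed from the tarball name
    ./configure && make      # check the documentation in the tree for required options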
Happy clustering, Fabio -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/ iQIcBAEBCAAGBQJN+gmrAAoJEAgUGcMLQ3qJiXAQAIh4NyN6ZP66YhZk0lw7Zjz5 80KH5SI/agu1YhXeeXFCfwgJlFdWZj1SBP75Q5f+OxUW6uWsIDOBa26hGzFXj9H6 HdXrHUReb4/5TTat26Nd8aFLJ1jn35ltp3rBTqHqIJ9gYb7wZzHrLnret0HLV9S6 dG3G4uWNciru+Acb1cIW/ANBkioFO+f1GQiBF96txYfKojJNR9R3DRDQy8ysNDmn CCqQwaAFje/JO4w3qggwngFNJ0n0vizSU8kGm1UGFYLcjeqGZE+NDuu4OMWMC6/U KgtEL48VeHqRD/sJD//Tt99LVeL7VuAKBW79pfcYl8KUqVVDMXP9FqIA4okVcEr9 vPK23T3VZ3+6NJaZVEOSuYrjvNXOsi4yAa+rR8EiwnHSG2RXuxzuTyKl90HRrugO TIkvtUj9hqGj97AviBtCFZyRUhAH68sbVFiGDV6X0nLmY2gN1A8o0CpyI6hMhsIS MieJ9DbNjqj0b9GOzzD1EFMp65+wooZJMkku70Tbx3hKaxv28HotPCpvb7yRQUF9 j1AzFVG9YZn7FWbdQS3taPzjZNxvbKEvTpUzEz5I5xUZIRODY3uCbBHRPTNpDcpE J0WYvOqlO7rMCuHYG8tj12ejdgDGexQXJFG/q4lrMId3ATVBV1NuaHAAszDLYN8C +gKWDJ8aCqFKAwCCcQxe =1IUH -----END PGP SIGNATURE----- From gianluca.cecchi at gmail.com Thu Jun 16 14:44:50 2011 From: gianluca.cecchi at gmail.com (Gianluca Cecchi) Date: Thu, 16 Jun 2011 16:44:50 +0200 Subject: [Linux-cluster] [Linux-HA] resource agents 3.9.1 final release In-Reply-To: <4DFA016A.8040708@redhat.com> References: <4DFA016A.8040708@redhat.com> Message-ID: On Thu, Jun 16, 2011 at 3:13 PM, Fabio M. Di Nitto wrote: > Highlights for the rgmanager resource agents set: > > - oracledb: use shutdown immediate hello, from oracledb.sh.in I can see this actually is not a configurable parameter, so that I cannot choose between "immediate" and "abort", and I think it is not the best change. faction "Stopping Oracle Database:" stop_db immediate if [ $? -ne 0 ]; then faction "Stopping Oracle Database (hard):" stop_db abort || return 1 fi There are situations where an occurring problem could let a DB stuck on shutdown immediate, preventing completion of the command itself so you will never arrive to the error code condition to try the abort option... And also: " SHUTDOWN IMMEDIATE No new connections are allowed, nor are new transactions allowed to be started, after the statement is issued. Any uncommitted transactions are rolled back. (If long uncommitted transactions exist, this method of shutdown might not complete quickly, despite its name.) Oracle does not wait for users currently connected to the database to disconnect. Oracle implicitly rolls back active transactions and disconnects all connected users. " it is true that in case of shutdown abort you have anyway to rollback too, during the following crash recovery of startup phase, but I'd prefer to do this on the node where I'm going to land to and not on the node that I'm leaving (possibly because of a problem). In my opinion the only situation where "immediate" is better is for planned maintenance. Just my opininon. Keep on with the good job Gianluca From zagar at arlut.utexas.edu Thu Jun 16 18:17:13 2011 From: zagar at arlut.utexas.edu (Randy Zagar) Date: Thu, 16 Jun 2011 13:17:13 -0500 Subject: [Linux-cluster] Linux-cluster Digest, Vol 86, Issue 15 In-Reply-To: References: Message-ID: <4DFA48A9.6090109@arlut.utexas.edu> On 06/16/2011 11:00 AM, Corey Kovacs wrote: > My appologies for not getting back sooner. I am in the middle of a move. > > I cannot post my configs or logs (yeah, not helpful I know) but > suffice it to say I strongly believe they are correct (I know, > everyone says that). I've had other people look at them just make sure > it wasn't a case of proofreading my own paper etc. and it always comes > down to the umount failing. 
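For the /home umount failure being chased here, the usual first step is to ask the kernel what is still pinning the mount point. These are generic commands, not output or paths from the poster's systems.

    fuser -vm /home            # processes with open files anywhere under the mount
    lsof /home                 # the same view from lsof
    grep /home /proc/mounts    # anything (autofs, bind mounts) stacked on top of it
    exportfs -v | grep /home   # is the kernel NFS server still exporting it?

If nothing shows up in the first two, an active NFS export or an autofs map covering /home are the usual invisible holders, which is exactly the direction the reply further down takes.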
I have 6 other identical NFS services > (save for the mount point/export location) and they all work > flawlessly. That's why I am zeroing in on the use of '/home' as the > culprit. > > Anyway, it's not a lot to go on I know, but I am just looking for > directions to search for now. > > Thanks > > Corey There are several other services that might be interfering with your attempts to umount /home. In addition to NFS, my list of usual suspects includes: Apache, Samba, and Autofs. If these, or any other services, are configured to use users' home directories then you're going to have problems with umount. -RZ From martijn.storck at gmail.com Fri Jun 17 07:26:47 2011 From: martijn.storck at gmail.com (Martijn Storck) Date: Fri, 17 Jun 2011 09:26:47 +0200 Subject: [Linux-cluster] Replacing network switch in a cluster Message-ID: Hi all, Unfortunately I have to swap out the switch that is used for the cluster traffic of our 4-node cluster for a new one. I'm hoping I can do this by connecting the new switch to the old switch and then moving the nodes over one by one. Can I change the cluster configuration so that there is a longer grace period before a node is deemed 'lost' and gets fenced? The only line in my cluster.conf that looks like it has anything to do with it is this one: I think that with faststart enabled the link with a node will be down for only a few seconds. I realize that this probably means the cluster will lock up during that period (since we use a lot of GFS), but it's still better than having to bring the entire cluster down. Kind regards, Martijn Storck -------------- next part -------------- An HTML attachment was scrubbed... URL: From fdinitto at redhat.com Fri Jun 17 07:28:58 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Fri, 17 Jun 2011 09:28:58 +0200 Subject: [Linux-cluster] [Linux-HA] resource agents 3.9.1 final release In-Reply-To: References: <4DFA016A.8040708@redhat.com> Message-ID: <4DFB023A.4030409@redhat.com> Lon, what's your opinion on this one? On 06/16/2011 04:44 PM, Gianluca Cecchi wrote: > On Thu, Jun 16, 2011 at 3:13 PM, Fabio M. Di Nitto wrote: > >> Highlights for the rgmanager resource agents set: >> >> - oracledb: use shutdown immediate > > hello, > from oracledb.sh.in I can see this actually is not a configurable > parameter, so that I cannot choose between "immediate" and "abort", > and I think it is not the best change. > > > faction "Stopping Oracle Database:" stop_db immediate > if [ $? -ne 0 ]; then > faction "Stopping Oracle Database (hard):" stop_db > abort || return 1 > fi > > > There are situations where an occurring problem could let a DB stuck > on shutdown immediate, preventing completion of the command itself so > you will never arrive to the error code condition to try the abort > option... > And also: > " > SHUTDOWN IMMEDIATE > No new connections are allowed, nor are new transactions allowed to be > started, after the statement is issued. > Any uncommitted transactions are rolled back. (If long uncommitted > transactions exist, this method of shutdown might not complete > quickly, despite its name.) > Oracle does not wait for users currently connected to the database to > disconnect. Oracle implicitly rolls back active transactions and > disconnects all connected users. 
> " > > it is true that in case of shutdown abort you have anyway to rollback > too, during the following crash recovery of startup phase, but I'd > prefer to do this on the node where I'm going to land to and not on > the node that I'm leaving (possibly because of a problem). > In my opinion the only situation where "immediate" is better is for > planned maintenance. > > Just my opininon. > Keep on with the good job > Gianluca > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From gianluca.cecchi at gmail.com Fri Jun 17 07:58:47 2011 From: gianluca.cecchi at gmail.com (Gianluca Cecchi) Date: Fri, 17 Jun 2011 09:58:47 +0200 Subject: [Linux-cluster] [Linux-HA] resource agents 3.9.1 final release In-Reply-To: <4DFB023A.4030409@redhat.com> References: <4DFA016A.8040708@redhat.com> <4DFB023A.4030409@redhat.com> Message-ID: On Fri, Jun 17, 2011 at 9:28 AM, Fabio M. Di Nitto wrote: > Lon, what's your opinion on this one? Some other considerations of mine. This of the current "abort" default option (as in RH EL 5 cluster suite base) is indeed a difficulty, in case of planned maintenance, so that a change inside the agent giving choice and flexibility would be a great thing. I was thinking about making myself some change and then propose but had not the time unfortunately. Just to note, nowadays if we have a planned operation for the Oracle DB we go through this workflow: - DB service is DBSRV - clusvcadm -Z DBSRV - Operations on DB, such as shutdown immediate, patching, ecc.. - startup of DB - clusvcadm -U DBSRV If the planned operation involves patching of the OS and eventually cluster suite too, after testing on test cluster, we make sometyhing like this (from memory supposing a monoservice cluster): - detach from cluster and update standby node (eventually update both os and Oracle binaries as we manage their planned maintenance together) - DB service is DBSRV - clusvcadm -Z DBSRV on primary node - shutdown immediate of db - clusvcadm -U DBSRV ; clusvcadm -d DBSRV (*) - shutdown of primary node - startup of the updated node with the service DBSRV modified so that Oracle part is not inside (so only vip, lvm, fs parts are enabled) - verify that oracle startup with new OS and Oracle binaries is ok on the node - shutdown immediate of db - change cluster.conf to insert Oracle too inside DBSRV definition and have it started/monitored from rgmanager - update the ex-primary node too and start it to join the cluster (*) this is risky: it would be better to be able to disable a frozen service, eventually after asking confirmation for that.... An idea could be to have inside the clusvcadm command something like "soft stop" option: -ss And if inside the service there is oracledb.sh it parses this and change its "abort" flag in "immediate" This "soft stop" could be managed by other resources too... Gianluca From miha.valencic at gmail.com Fri Jun 17 08:13:59 2011 From: miha.valencic at gmail.com (Miha Valencic) Date: Fri, 17 Jun 2011 10:13:59 +0200 Subject: [Linux-cluster] Troubleshooting service relocation Message-ID: Hi! I'm trying to troubleshoot service migration, which happens once a day and I don't have a clue why. (i.e.: there is nothing wrong with it and there are no entries in the log file) The system is RHEL4 (Red Hat 4.1.2-46) with cluster version 2.0.52. Cluster software used to log events to /var/log/cluster.log as configured by the syslog facility local4.*, but those messages disappeared on May 6. 
The service we're running on the cluster is Zimbra, if that matters at all. The problem is, that there are no logging entries in the cluster.log file. If I issue 'logger -p local4.info 'test'' I see an entry in the cluster.log file, so syslog is obviously working. In the /etc/cluster/cluster.conf file, I see no logging configuration (and I guess there is none, looking at config schema described at http://sources.redhat.com/cluster/doc/cluster_schema_rhel4.html. How can I turn on logging or what else can I check? Thank you, Miha. -------------- next part -------------- An HTML attachment was scrubbed... URL: From raju.rajsand at gmail.com Fri Jun 17 17:33:22 2011 From: raju.rajsand at gmail.com (Rajagopal Swaminathan) Date: Fri, 17 Jun 2011 23:03:22 +0530 Subject: [Linux-cluster] Replacing network switch in a cluster In-Reply-To: References: Message-ID: Greeetings, On Fri, Jun 17, 2011 at 12:56 PM, Martijn Storck wrote: > Hi all, dunno muchabout the configs. Please makesure tht the cluster traffic ports are cpndfigured to multicast. -- Regards, Rajagopal From noreply at boxbe.com Fri Jun 17 16:05:21 2011 From: noreply at boxbe.com (noreply at boxbe.com) Date: Fri, 17 Jun 2011 09:05:21 -0700 (PDT) Subject: [Linux-cluster] Linux-cluster Digest, Vol 86, Issue 16 (Action Required) Message-ID: <2035445104.167092.1308326721513.JavaMail.prod@app006.boxbe.com> Dear sender, You will not receive any more courtesy notices from our members for two days. Messages you have sent will remain in a lower priority mailbox for our member to review at their leisure. Future messages will be more likely to be viewed if you are on our member's priority Guest List. Thank you, shanavasmca at gmail.com Powered by Boxbe -- "End Email Overload" Visit http://www.boxbe.com/how-it-works?tc=8429770443_542558017 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- An embedded message was scrubbed... From: linux-cluster-request at redhat.com Subject: Linux-cluster Digest, Vol 86, Issue 16 Date: Fri, 17 Jun 2011 12:00:06 -0400 Size: 2088 URL: From michael at ulimit.org Sat Jun 18 09:24:47 2011 From: michael at ulimit.org (Michael Pye) Date: Sat, 18 Jun 2011 10:24:47 +0100 Subject: [Linux-cluster] Troubleshooting service relocation In-Reply-To: References: Message-ID: <4DFC6EDF.5090202@ulimit.org> On 17/06/2011 09:13, Miha Valencic wrote: > How can I turn on logging or what else can I check? Take a look at this knowledgebase article: https://access.redhat.com/kb/docs/DOC-53500 Michael From share2dom at gmail.com Sun Jun 19 16:33:52 2011 From: share2dom at gmail.com (dOminic) Date: Sun, 19 Jun 2011 22:03:52 +0530 Subject: [Linux-cluster] Cluster Failover Failed In-Reply-To: References: Message-ID: Hi Balaji, Yes, the reported message is harmless ... However, you can try following 1) I would suggest you to set the filter setting in lvm.conf to properly scan your mpath* devices and local disks. 2) Enable blacklist section in multipath.conf eg: blacklist { devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*" devnode "^hd[a-z]" } # multipath -v2 Observe the box. Check whether that helps ... Regards, On Wed, Jun 15, 2011 at 12:16 AM, Balaji S wrote: > Hi, > In my setup implemented 10 tow node cluster's which running mysql as > cluster service, ipmi card as fencing device. 
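On the rgmanager logging question above: on RHEL4/5 the resource manager's verbosity and syslog facility are normally raised on the rm tag in cluster.conf, roughly as below. The attribute names and values are from memory and should be checked against the knowledgebase article referenced above; the config version still has to be bumped and the file propagated afterwards.

    <rm log_level="7" log_facility="local4">
        <!-- failoverdomains, resources and services unchanged -->
    </rm>

With local4 already routed to /var/log/cluster.log by syslog (the logger test above shows that path works), level 7 should bring the relocation messages back into that file.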
> > In my /var/log/messages i am keep getting the errors like below, > > Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdm, sector 0 > Jun 14 12:50:48 hostname kernel: sd 3:0:2:2: Device not ready: <6>: > Current: sense key: Not Ready > Jun 14 12:50:48 hostname kernel: Add. Sense: Logical unit not ready, > manual intervention required > Jun 14 12:50:48 hostname kernel: > Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdn, sector 0 > Jun 14 12:50:48 hostname kernel: sd 3:0:2:4: Device not ready: <6>: > Current: sense key: Not Ready > Jun 14 12:50:48 hostname kernel: Add. Sense: Logical unit not ready, > manual intervention required > Jun 14 12:50:48 hostname kernel: > Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdp, sector 0 > Jun 14 12:51:10 hostname kernel: sd 3:0:0:1: Device not ready: <6>: > Current: sense key: Not Ready > Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, > manual intervention required > Jun 14 12:51:10 hostname kernel: > Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdc, sector 0 > Jun 14 12:51:10 hostname kernel: printk: 3 messages suppressed. > Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdc, logical > block 0 > Jun 14 12:51:10 hostname kernel: sd 3:0:0:2: Device not ready: <6>: > Current: sense key: Not Ready > Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, > manual intervention required > Jun 14 12:51:10 hostname kernel: > Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdd, sector 0 > Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdd, logical > block 0 > Jun 14 12:51:10 hostname kernel: sd 3:0:0:4: Device not ready: <6>: > Current: sense key: Not Ready > Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, > manual intervention required > > > when i am checking the multipath -ll , this all devices are in passive > path. > > Environment : > > RHEL 5.4 & EMC SAN > > Please suggest how to overcome this issue. Support will be highly helpful. > Thanks in Advance > > > -- > Thanks, > BSK > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From share2dom at gmail.com Sun Jun 19 16:42:35 2011 From: share2dom at gmail.com (dOminic) Date: Sun, 19 Jun 2011 22:12:35 +0530 Subject: [Linux-cluster] umount failing... In-Reply-To: References: Message-ID: selinux is in Enforced mode ( worth checking audit.log ) ? .If yes, try selinux to permissive or disabled mode and check . Regards, On Thu, Jun 16, 2011 at 1:48 PM, Corey Kovacs wrote: > My appologies for not getting back sooner. I am in the middle of a move. > > I cannot post my configs or logs (yeah, not helpful I know) but > suffice it to say I strongly believe they are correct (I know, > everyone says that). I've had other people look at them just make sure > it wasn't a case of proofreading my own paper etc. and it always comes > down to the umount failing. I have 6 other identical NFS services > (save for the mount point/export location) and they all work > flawlessly. That's why I am zeroing in on the use of '/home' as the > culprit. > > Anyway, it's not a lot to go on I know, but I am just looking for > directions to search for now. 
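To check the SELinux angle suggested above without changing anything permanently, something like this is enough; generic commands, nothing host-specific.

    getenforce                                    # Enforcing / Permissive / Disabled
    grep -i avc /var/log/audit/audit.log | tail   # recent denials, if any
    setenforce 0                                  # permissive for the duration of a test only

If the umount starts working in permissive mode, the avc entries in audit.log identify the offending context.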
> > Thanks > > Corey > > On Mon, Jun 13, 2011 at 11:44 AM, Rajveer Singh > wrote: > > > > > > On Thu, Jun 9, 2011 at 8:22 PM, Corey Kovacs > wrote: > >> > >> Folks, > >> > >> I have a 5 node cluster serving out several NFS exports, one of which is > >> /home. > >> > >> All of the nfs services can be moved from node to node without problem > >> except for the one providing /home. > >> > >> The logs on that node indicate the umount is failing and then the > >> service is disabled (self-fence is not enabled). > >> > >> Even after the service is put into a failed state and then disabled > >> manually, umount fails... > >> > >> I had noticed recently while playing with conga that creating a > >> service for /home on a test cluster a warning was issued about > >> reserved words and as I recall (i could be wrong) /home was among the > >> illegal parameters for the mount point. > >> > >> I have turned everything off that I could think of which might be > >> "holding" the mount and have run the various iterations of lsof, find > >> etc. nothing shows up as having anything being actively used. > >> > >> This particular file system is 1TB. > >> > >> Is there something wrong with using /home as an export? > >> > >> Some specifics. > >> > >> RHEL5.6 (updated as of last week) > >> HA-LVM protecting ext3 using the newer "preferred method" with clvmd > >> Ext3 for exported file systems > >> 5 nodes. > >> > >> > >> Any ideas would be greatly appreciated. > >> > >> -C > >> > > Can you share your log file and cluster.conf file > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From share2dom at gmail.com Sun Jun 19 16:44:56 2011 From: share2dom at gmail.com (dOminic) Date: Sun, 19 Jun 2011 22:14:56 +0530 Subject: [Linux-cluster] Plugged out blade from bladecenter chassis - fence_bladecenter failed In-Reply-To: References: Message-ID: There is a bug related to missing_as_off - https://bugzilla.redhat.com/show_bug.cgi?id=689851 - expects the fix in rhel5u7 . regards, On Wed, Apr 27, 2011 at 1:59 PM, Parvez Shaikh wrote: > Hi all, > > I am using RHCS on IBM bladecenter with blade center fencing. I plugged out > a blade from blade center chassis slot and was hoping that failover to > occur. However when I did so, I get following message - > > fenced[10240]: agent "fence_bladecenter" reports: Failed: Unable to obtain > correct plug status or plug is not available > fenced[10240]: fence "blade1" failed > > Is this supported that if I plug out blade from its slot, then failover > occur without manual intervention? If so, which fencing must I use? > > Thanks, > Parvez > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From parvez.h.shaikh at gmail.com Mon Jun 20 05:16:41 2011 From: parvez.h.shaikh at gmail.com (Parvez Shaikh) Date: Mon, 20 Jun 2011 10:46:41 +0530 Subject: [Linux-cluster] Plugged out blade from bladecenter chassis - fence_bladecenter failed In-Reply-To: References: Message-ID: Hi Thanks Dominic, Do fence_bladecenter "reboot" the blade as a part of fencing always? I have seen it turning the blade off by default. Through fence_bladecenter --missing-as-off...... 
-o off returns me a correct result when run from command line but fencing fails through "fenced". I am using RHEL 5.5 ES and fence_bladecenter version reports following - fence_bladecenter -V 2.0.115 (built Tue Dec 22 10:05:55 EST 2009) Copyright (C) Red Hat, Inc. 2004 All rights reserved. Anyway thanks for bugzilla reference Regards On Sun, Jun 19, 2011 at 10:14 PM, dOminic wrote: > There is a bug related to missing_as_off - > https://bugzilla.redhat.com/show_bug.cgi?id=689851 - expects the fix in > rhel5u7 . > > regards, > > On Wed, Apr 27, 2011 at 1:59 PM, Parvez Shaikh wrote: > >> Hi all, >> >> I am using RHCS on IBM bladecenter with blade center fencing. I plugged >> out a blade from blade center chassis slot and was hoping that failover to >> occur. However when I did so, I get following message - >> >> fenced[10240]: agent "fence_bladecenter" reports: Failed: Unable to obtain >> correct plug status or plug is not available >> fenced[10240]: fence "blade1" failed >> >> Is this supported that if I plug out blade from its slot, then failover >> occur without manual intervention? If so, which fencing must I use? >> >> Thanks, >> Parvez >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From skjbalaji at gmail.com Mon Jun 20 16:13:46 2011 From: skjbalaji at gmail.com (Balaji S) Date: Mon, 20 Jun 2011 21:43:46 +0530 Subject: [Linux-cluster] Linux-cluster Digest, Vol 86, Issue 18 In-Reply-To: References: Message-ID: Thanks dominic, i have added the filter things in lvm.conf, still i am getting same error messages, here below i am mentioning the lines i have added in lvm.conf, still aything need to modify to avoid this kind of error in system messages. filter = [ "a|/dev/mapper|", "a|/dev/sda|", "r/.*/" ] On Mon, Jun 20, 2011 at 9:30 PM, wrote: > Send Linux-cluster mailing list submissions to > linux-cluster at redhat.com > > To subscribe or unsubscribe via the World Wide Web, visit > https://www.redhat.com/mailman/listinfo/linux-cluster > or, via email, send a message with subject or body 'help' to > linux-cluster-request at redhat.com > > You can reach the person managing the list at > linux-cluster-owner at redhat.com > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of Linux-cluster digest..." > > > Today's Topics: > > 1. Re: Cluster Failover Failed (dOminic) > 2. Re: umount failing... (dOminic) > 3. Re: Plugged out blade from bladecenter chassis - > fence_bladecenter failed (dOminic) > 4. Re: Plugged out blade from bladecenter chassis - > fence_bladecenter failed (Parvez Shaikh) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Sun, 19 Jun 2011 22:03:52 +0530 > From: dOminic > To: linux clustering > Subject: Re: [Linux-cluster] Cluster Failover Failed > Message-ID: > Content-Type: text/plain; charset="iso-8859-1" > > Hi Balaji, > > Yes, the reported message is harmless ... However, you can try following > > 1) I would suggest you to set the filter setting in lvm.conf to properly > scan your mpath* devices and local disks. > 2) Enable blacklist section in multipath.conf eg: > > blacklist { > devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*" > devnode "^hd[a-z]" > } > > # multipath -v2 > > Observe the box. 
Check whether that helps ... > > Regards, > > ------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > End of Linux-cluster Digest, Vol 86, Issue 18 > ********************************************* -- Thanks, Balaji S -------------- next part -------------- An HTML attachment was scrubbed... URL: From fdinitto at redhat.com Mon Jun 20 16:46:28 2011 From: fdinitto at redhat.com (Fabio M.
Di Nitto) Date: Mon, 20 Jun 2011 18:46:28 +0200 Subject: [Linux-cluster] cluster 3.1.3 stable release Message-ID: <4DFF7964.6040504@redhat.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA256 Welcome to the cluster 3.1.3 release. This release fixes a build issue in dlm_controld with any kernel older than 3.0. The new source tarball can be downloaded here: https://fedorahosted.org/releases/c/l/cluster/cluster-3.1.3.tar.xz ChangeLog: https://fedorahosted.org/releases/c/l/cluster/Changelog-3.1.3 To report bugs or issues: https://bugzilla.redhat.com/ Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other sysadministrators or power users. Thanks/congratulations to all people that contributed to achieve this great milestone. Happy clustering, Fabio -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iQIcBAEBCAAGBQJN/3lhAAoJEFA6oBJjVJ+O9PkP/073d0Y1ydvLbm7Yui5/ttlI aSsy3dt53QzcI1MY2MEsqWW3T4MlJMaM/kbUWcTKGy83OeZzLC13WtMb/Hyyt6Oo 2cqmnEMAbTZ89rjiJurt0aj42QOscYBBZfMR72njK94WTWGVrQ8q/vSq/BNzdUkw sS0cbb/7BpS3RN1uXxm+x0x2DJMB9uIK6s9So0hXRPL2m/wwANIoD/k9T9D6B/ax 5L2kqAATPtPbl+2H5yk3FxSgJf+bLbcxd2kbHpPxWawG6W3qRsPLi3mBBOYXQF4B y4Mj9N5i97BgrGT/nqCxqMHh2S+pKEe4gulZcQIYl/JaZhZ9LuoODRhKo/UL7i63 3n8RbPCIiIJzI7eJVbmSfcGk11ZRRxJ4nbTdvJVylaFe8bCjvTO+eLHgkTPTlDGj WWqt9uNWdvQuef77G0TOaZcvMphw1VduXLvtU2wejpfVAzz+lEprL+VthSrbNfxf HggKRDxgsrAYbJ4LgJPt/ApkhWx/HhrJYJfSkNTQOXAY3JKuOrWwbJyx9woCQu1c wIUnrQ2VB/CmKNTDT4AFYWA/GV3d/4FuijTvd3LcTKtWoCOVdKGic/MmFjBvJk/R kbSG8JMpTm02w2L0G+WDhMdC58GGQHB6GhQ8Nr6aAu55QPWlijwHtgUYw4xYSKdn 0D9vQsSNlYWiQZAmAwhg =Vvy3 -----END PGP SIGNATURE----- From share2dom at gmail.com Tue Jun 21 12:52:49 2011 From: share2dom at gmail.com (dOminic) Date: Tue, 21 Jun 2011 18:22:49 +0530 Subject: [Linux-cluster] Cluster Failover Failed In-Reply-To: References: Message-ID: Hi, Btw, how many HBAs are present in your box ? . Problem is with scsi3 only ?. Refer https://access.redhat.com/kb/docs/DOC-2991 , then set the filter. Also, I would suggest you to open ticket with Linux vendor if IO errors are belongs to Active paths. Pointed IO errors are belongs to disk that in passive paths group ?. you can verify the same in multipath-ll output . regards, On Sun, Jun 19, 2011 at 10:03 PM, dOminic wrote: > Hi Balaji, > > Yes, the reported message is harmless ... However, you can try following > > 1) I would suggest you to set the filter setting in lvm.conf to properly > scan your mpath* devices and local disks. > 2) Enable blacklist section in multipath.conf eg: > > blacklist { > devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*" > devnode "^hd[a-z]" > } > > # multipath -v2 > > Observe the box. Check whether that helps ... > > > Regards, > > > On Wed, Jun 15, 2011 at 12:16 AM, Balaji S wrote: > >> Hi, >> In my setup implemented 10 tow node cluster's which running mysql as >> cluster service, ipmi card as fencing device. >> >> In my /var/log/messages i am keep getting the errors like below, >> >> Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdm, sector 0 >> Jun 14 12:50:48 hostname kernel: sd 3:0:2:2: Device not ready: <6>: >> Current: sense key: Not Ready >> Jun 14 12:50:48 hostname kernel: Add. 
Sense: Logical unit not ready, >> manual intervention required >> Jun 14 12:50:48 hostname kernel: >> Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdn, sector 0 >> Jun 14 12:50:48 hostname kernel: sd 3:0:2:4: Device not ready: <6>: >> Current: sense key: Not Ready >> Jun 14 12:50:48 hostname kernel: Add. Sense: Logical unit not ready, >> manual intervention required >> Jun 14 12:50:48 hostname kernel: >> Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdp, sector 0 >> Jun 14 12:51:10 hostname kernel: sd 3:0:0:1: Device not ready: <6>: >> Current: sense key: Not Ready >> Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, >> manual intervention required >> Jun 14 12:51:10 hostname kernel: >> Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdc, sector 0 >> Jun 14 12:51:10 hostname kernel: printk: 3 messages suppressed. >> Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdc, logical >> block 0 >> Jun 14 12:51:10 hostname kernel: sd 3:0:0:2: Device not ready: <6>: >> Current: sense key: Not Ready >> Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, >> manual intervention required >> Jun 14 12:51:10 hostname kernel: >> Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdd, sector 0 >> Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdd, logical >> block 0 >> Jun 14 12:51:10 hostname kernel: sd 3:0:0:4: Device not ready: <6>: >> Current: sense key: Not Ready >> Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, >> manual intervention required >> >> >> when i am checking the multipath -ll , this all devices are in passive >> path. >> >> Environment : >> >> RHEL 5.4 & EMC SAN >> >> Please suggest how to overcome this issue. Support will be highly helpful. >> Thanks in Advance >> >> >> -- >> Thanks, >> BSK >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From miha.valencic at gmail.com Tue Jun 21 13:31:13 2011 From: miha.valencic at gmail.com (Miha Valencic) Date: Tue, 21 Jun 2011 15:31:13 +0200 Subject: [Linux-cluster] Troubleshooting service relocation In-Reply-To: <4DFC6EDF.5090202@ulimit.org> References: <4DFC6EDF.5090202@ulimit.org> Message-ID: Michael, I've configured the logging on RM and am now waiting for it to switch nodes. Hopefully, I can see a reason why it is relocating. Thanks, Miha. On Sat, Jun 18, 2011 at 11:24 AM, Michael Pye wrote: > On 17/06/2011 09:13, Miha Valencic wrote: > > How can I turn on logging or what else can I check? > > Take a look at this knowledgebase article: > https://access.redhat.com/kb/docs/DOC-53500 > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rossnick-lists at cybercat.ca Tue Jun 21 13:57:38 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Tue, 21 Jun 2011 09:57:38 -0400 Subject: [Linux-cluster] GFS2 fatal: filesystem consistency error Message-ID: 8 node cluster, fiber channel hbas and disks access trough a qlogic fabric. 
I've got hit 3 times with this error on different nodes : GFS2: fsid=CyberCluster:GizServer.1: fatal: filesystem consistency error GFS2: fsid=CyberCluster:GizServer.1: inode = 9582 6698267 GFS2: fsid=CyberCluster:GizServer.1: function = gfs2_dinode_dealloc, file = fs/gfs2/inode.c, line = 352 GFS2: fsid=CyberCluster:GizServer.1: about to withdraw this file system GFS2: fsid=CyberCluster:GizServer.1: telling LM to unmount GFS2: fsid=CyberCluster:GizServer.1: withdrawn Pid: 2659, comm: delete_workqueu Tainted: G W ---------------- T 2.6.32-131.2.1.el6.x86_64 #1 Call Trace: [] ? gfs2_lm_withdraw+0x102/0x130 [gfs2] [] ? trunc_dealloc+0xa9/0x130 [gfs2] [] ? gfs2_consist_inode_i+0x5d/0x60 [gfs2] [] ? gfs2_dinode_dealloc+0x64/0x210 [gfs2] [] ? gfs2_delete_inode+0x1ba/0x280 [gfs2] [] ? gfs2_delete_inode+0x8d/0x280 [gfs2] [] ? gfs2_delete_inode+0x0/0x280 [gfs2] [] ? generic_delete_inode+0xde/0x1d0 [] ? delete_work_func+0x0/0x80 [gfs2] [] ? generic_drop_inode+0x65/0x80 [] ? gfs2_drop_inode+0x2e/0x30 [gfs2] [] ? iput+0x62/0x70 [] ? delete_work_func+0x54/0x80 [gfs2] [] ? worker_thread+0x170/0x2a0 [] ? autoremove_wake_function+0x0/0x40 [] ? worker_thread+0x0/0x2a0 [] ? kthread+0x96/0xa0 [] ? child_rip+0xa/0x20 [] ? kthread+0x0/0xa0 [] ? child_rip+0x0/0x20 no_formal_ino = 9582 no_addr = 6698267 i_disksize = 6838 blocks = 0 i_goal = 6698304 i_diskflags = 0x00000000 i_height = 1 i_depth = 0 i_entries = 0 i_eattr = 0 GFS2: fsid=CyberCluster:GizServer.1: gfs2_delete_inode: -5 gdlm_unlock 5,66351b err=-22 Only, with different inodes each time. After that event, services running on that filesystem are marked failed and not moved over another node. Any access to that fs yields I/O error. Server needed to be rebooted to properly work again. I did ran a fsck last night on that filesystem, and it did find some errors, but nothing serious. Lots (realy lots) of those : Ondisk and fsck bitmaps differ at block 5771602 (0x581152) Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free) Metadata type is 0 (free) Fix bitmap for block 5771602 (0x581152) ? (y/n) And after completing the fsck, I started back some services, and I got the same error on another filesystem that is practily empty and used for small utilities used troughout the cluster... What should I do to find the source of this problem ? From rpeterso at redhat.com Tue Jun 21 14:42:40 2011 From: rpeterso at redhat.com (Bob Peterson) Date: Tue, 21 Jun 2011 10:42:40 -0400 (EDT) Subject: [Linux-cluster] GFS2 fatal: filesystem consistency error In-Reply-To: Message-ID: <1036238479.689034.1308667360488.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> ----- Original Message ----- | 8 node cluster, fiber channel hbas and disks access trough a qlogic | fabric. | | I've got hit 3 times with this error on different nodes : | | GFS2: fsid=CyberCluster:GizServer.1: fatal: filesystem consistency | error | GFS2: fsid=CyberCluster:GizServer.1: inode = 9582 6698267 | GFS2: fsid=CyberCluster:GizServer.1: function = gfs2_dinode_dealloc, | file = | fs/gfs2/inode.c, line = 352 | GFS2: fsid=CyberCluster:GizServer.1: about to withdraw this file | system | GFS2: fsid=CyberCluster:GizServer.1: telling LM to unmount | GFS2: fsid=CyberCluster:GizServer.1: withdrawn | Pid: 2659, comm: delete_workqueu Tainted: G W ---------------- T | 2.6.32-131.2.1.el6.x86_64 #1 | Call Trace: | [] ? gfs2_lm_withdraw+0x102/0x130 [gfs2] | [] ? trunc_dealloc+0xa9/0x130 [gfs2] | [] ? gfs2_consist_inode_i+0x5d/0x60 [gfs2] | [] ? gfs2_dinode_dealloc+0x64/0x210 [gfs2] | [] ? 
gfs2_delete_inode+0x1ba/0x280 [gfs2] | [] ? gfs2_delete_inode+0x8d/0x280 [gfs2] | [] ? gfs2_delete_inode+0x0/0x280 [gfs2] | [] ? generic_delete_inode+0xde/0x1d0 | [] ? delete_work_func+0x0/0x80 [gfs2] | [] ? generic_drop_inode+0x65/0x80 | [] ? gfs2_drop_inode+0x2e/0x30 [gfs2] | [] ? iput+0x62/0x70 | [] ? delete_work_func+0x54/0x80 [gfs2] | [] ? worker_thread+0x170/0x2a0 | [] ? autoremove_wake_function+0x0/0x40 | [] ? worker_thread+0x0/0x2a0 | [] ? kthread+0x96/0xa0 | [] ? child_rip+0xa/0x20 | [] ? kthread+0x0/0xa0 | [] ? child_rip+0x0/0x20 | no_formal_ino = 9582 | no_addr = 6698267 | i_disksize = 6838 | blocks = 0 | i_goal = 6698304 | i_diskflags = 0x00000000 | i_height = 1 | i_depth = 0 | i_entries = 0 | i_eattr = 0 | GFS2: fsid=CyberCluster:GizServer.1: gfs2_delete_inode: -5 | gdlm_unlock 5,66351b err=-22 | | | Only, with different inodes each time. | | After that event, services running on that filesystem are marked | failed and | not moved over another node. Any access to that fs yields I/O error. | Server | needed to be rebooted to properly work again. | | I did ran a fsck last night on that filesystem, and it did find some | errors, | but nothing serious. Lots (realy lots) of those : | | Ondisk and fsck bitmaps differ at block 5771602 (0x581152) | Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free) | Metadata type is 0 (free) | Fix bitmap for block 5771602 (0x581152) ? (y/n) | | And after completing the fsck, I started back some services, and I got | the | same error on another filesystem that is practily empty and used for | small | utilities used troughout the cluster... | | What should I do to find the source of this problem ? Hi, I believe this is a GFS2 bug we've already solved. Please contact Red Hat Support. Regards, Bob Peterson Red Hat File Systems From swhiteho at redhat.com Tue Jun 21 14:46:07 2011 From: swhiteho at redhat.com (Steven Whitehouse) Date: Tue, 21 Jun 2011 15:46:07 +0100 Subject: [Linux-cluster] GFS2 fatal: filesystem consistency error In-Reply-To: References: Message-ID: <1308667567.2762.15.camel@menhir> Hi, On Tue, 2011-06-21 at 09:57 -0400, Nicolas Ross wrote: > 8 node cluster, fiber channel hbas and disks access trough a qlogic fabric. > > I've got hit 3 times with this error on different nodes : > > GFS2: fsid=CyberCluster:GizServer.1: fatal: filesystem consistency error > GFS2: fsid=CyberCluster:GizServer.1: inode = 9582 6698267 > GFS2: fsid=CyberCluster:GizServer.1: function = gfs2_dinode_dealloc, file = > fs/gfs2/inode.c, line = 352 > GFS2: fsid=CyberCluster:GizServer.1: about to withdraw this file system > GFS2: fsid=CyberCluster:GizServer.1: telling LM to unmount > GFS2: fsid=CyberCluster:GizServer.1: withdrawn > Pid: 2659, comm: delete_workqueu Tainted: G W ---------------- T > 2.6.32-131.2.1.el6.x86_64 #1 > Call Trace: > [] ? gfs2_lm_withdraw+0x102/0x130 [gfs2] > [] ? trunc_dealloc+0xa9/0x130 [gfs2] > [] ? gfs2_consist_inode_i+0x5d/0x60 [gfs2] > [] ? gfs2_dinode_dealloc+0x64/0x210 [gfs2] > [] ? gfs2_delete_inode+0x1ba/0x280 [gfs2] > [] ? gfs2_delete_inode+0x8d/0x280 [gfs2] > [] ? gfs2_delete_inode+0x0/0x280 [gfs2] > [] ? generic_delete_inode+0xde/0x1d0 > [] ? delete_work_func+0x0/0x80 [gfs2] > [] ? generic_drop_inode+0x65/0x80 > [] ? gfs2_drop_inode+0x2e/0x30 [gfs2] > [] ? iput+0x62/0x70 > [] ? delete_work_func+0x54/0x80 [gfs2] > [] ? worker_thread+0x170/0x2a0 > [] ? autoremove_wake_function+0x0/0x40 > [] ? worker_thread+0x0/0x2a0 > [] ? kthread+0x96/0xa0 > [] ? child_rip+0xa/0x20 > [] ? kthread+0x0/0xa0 > [] ? 
child_rip+0x0/0x20 > no_formal_ino = 9582 > no_addr = 6698267 > i_disksize = 6838 > blocks = 0 > i_goal = 6698304 > i_diskflags = 0x00000000 > i_height = 1 > i_depth = 0 > i_entries = 0 > i_eattr = 0 > GFS2: fsid=CyberCluster:GizServer.1: gfs2_delete_inode: -5 > gdlm_unlock 5,66351b err=-22 > > > Only, with different inodes each time. > > After that event, services running on that filesystem are marked failed and > not moved over another node. Any access to that fs yields I/O error. Server > needed to be rebooted to properly work again. > > I did ran a fsck last night on that filesystem, and it did find some errors, > but nothing serious. Lots (realy lots) of those : > > Ondisk and fsck bitmaps differ at block 5771602 (0x581152) > Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free) > Metadata type is 0 (free) > Fix bitmap for block 5771602 (0x581152) ? (y/n) > > And after completing the fsck, I started back some services, and I got the > same error on another filesystem that is practily empty and used for small > utilities used troughout the cluster... > > What should I do to find the source of this problem ? > I suspect that this is a know problem, bz #712139 if you have access to the Red Hat bugzilla. There is a fix available via our usual support channels. Note that this particular bug is highly version specific so it only applies to RHEL 6.1 and no other version (either RHEL or upstream), Steve. From rossnick-lists at cybercat.ca Tue Jun 21 14:58:23 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Tue, 21 Jun 2011 10:58:23 -0400 Subject: [Linux-cluster] GFS2 fatal: filesystem consistency error References: <1308667567.2762.15.camel@menhir> Message-ID: <3C1F816785264B95A73FBD89C03EBAB5@versa> >> And after completing the fsck, I started back some services, and I got >> the >> same error on another filesystem that is practily empty and used for >> small >> utilities used troughout the cluster... >> >> What should I do to find the source of this problem ? >> > > I suspect that this is a know problem, bz #712139 if you have access to > the Red Hat bugzilla. There is a fix available via our usual support > channels. Note that this particular bug is highly version specific so it > only applies to RHEL 6.1 and no other version (either RHEL or upstream), Thanks, I am indeed at 6.1. I did find this bug while googling yesterday for that, I will contact support once I got the why I don't have support for resilient storage enabled cleared... From noreply at boxbe.com Tue Jun 21 14:52:45 2011 From: noreply at boxbe.com (noreply at boxbe.com) Date: Tue, 21 Jun 2011 07:52:45 -0700 (PDT) Subject: [Linux-cluster] GFS2 fatal: filesystem consistency error (Action Required) Message-ID: <1933689685.1398153.1308667965985.JavaMail.prod@app010.dmz> Hello linux clustering, You will not receive any more courtesy notices from our members for two days. Messages you have sent will remain in a lower priority mailbox for our member to review at their leisure. Future messages will be more likely to be viewed if you are on our member's priority Guest List. Thank you, debjyoti.mail at gmail.com Powered by Boxbe -- "End Email Overload" Visit http://www.boxbe.com/how-it-works?tc=8467960205_652083268 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- An embedded message was scrubbed... 
From: Steven Whitehouse Subject: Re: [Linux-cluster] GFS2 fatal: filesystem consistency error Date: Tue, 21 Jun 2011 15:46:07 +0100 Size: 2870 URL: From skjbalaji at gmail.com Wed Jun 22 03:01:06 2011 From: skjbalaji at gmail.com (Balaji S) Date: Wed, 22 Jun 2011 08:31:06 +0530 Subject: [Linux-cluster] Linux-cluster Digest, Vol 86, Issue 19 In-Reply-To: References: Message-ID: Hi Dominic, Yes the errors are only belongs to passive path. > ------------------------------ > > Message: 3 > Date: Tue, 21 Jun 2011 18:22:49 +0530 > From: dOminic > To: linux clustering > Subject: Re: [Linux-cluster] Cluster Failover Failed > Message-ID: > Content-Type: text/plain; charset="iso-8859-1" > > Hi, > > Btw, how many HBAs are present in your box ? . Problem is with scsi3 only > ?. > > Refer https://access.redhat.com/kb/docs/DOC-2991 , then set the filter. > Also, I would suggest you to open ticket with Linux vendor if IO errors are > belongs to Active paths. > > Pointed IO errors are belongs to disk that in passive paths group ?. you > can > verify the same in multipath-ll output . > > regards, > > On Sun, Jun 19, 2011 at 10:03 PM, dOminic wrote: > > > Hi Balaji, > > > > Yes, the reported message is harmless ... However, you can try following > > > > 1) I would suggest you to set the filter setting in lvm.conf to properly > > scan your mpath* devices and local disks. > > 2) Enable blacklist section in multipath.conf eg: > > > > blacklist { > > devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*" > > devnode "^hd[a-z]" > > } > > > > # multipath -v2 > > > > Observe the box. Check whether that helps ... > > > > > > Regards, > > > > > > On Wed, Jun 15, 2011 at 12:16 AM, Balaji S wrote: > > > >> Hi, > >> In my setup implemented 10 tow node cluster's which running mysql as > >> cluster service, ipmi card as fencing device. > >> > >> In my /var/log/messages i am keep getting the errors like below, > >> > >> Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdm, sector > 0 > >> Jun 14 12:50:48 hostname kernel: sd 3:0:2:2: Device not ready: <6>: > >> Current: sense key: Not Ready > >> Jun 14 12:50:48 hostname kernel: Add. Sense: Logical unit not ready, > >> manual intervention required > >> Jun 14 12:50:48 hostname kernel: > >> Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdn, sector > 0 > >> Jun 14 12:50:48 hostname kernel: sd 3:0:2:4: Device not ready: <6>: > >> Current: sense key: Not Ready > >> Jun 14 12:50:48 hostname kernel: Add. Sense: Logical unit not ready, > >> manual intervention required > >> Jun 14 12:50:48 hostname kernel: > >> Jun 14 12:50:48 hostname kernel: end_request: I/O error, dev sdp, sector > 0 > >> Jun 14 12:51:10 hostname kernel: sd 3:0:0:1: Device not ready: <6>: > >> Current: sense key: Not Ready > >> Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, > >> manual intervention required > >> Jun 14 12:51:10 hostname kernel: > >> Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdc, sector > 0 > >> Jun 14 12:51:10 hostname kernel: printk: 3 messages suppressed. > >> Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdc, logical > >> block 0 > >> Jun 14 12:51:10 hostname kernel: sd 3:0:0:2: Device not ready: <6>: > >> Current: sense key: Not Ready > >> Jun 14 12:51:10 hostname kernel: Add. 
Sense: Logical unit not ready, > >> manual intervention required > >> Jun 14 12:51:10 hostname kernel: > >> Jun 14 12:51:10 hostname kernel: end_request: I/O error, dev sdd, sector > 0 > >> Jun 14 12:51:10 hostname kernel: Buffer I/O error on device sdd, logical > >> block 0 > >> Jun 14 12:51:10 hostname kernel: sd 3:0:0:4: Device not ready: <6>: > >> Current: sense key: Not Ready > >> Jun 14 12:51:10 hostname kernel: Add. Sense: Logical unit not ready, > >> manual intervention required > >> > >> > >> when i am checking the multipath -ll , this all devices are in passive > >> path. > >> > >> Environment : > >> > >> RHEL 5.4 & EMC SAN > >> > >> Please suggest how to overcome this issue. Support will be highly > helpful. > >> Thanks in Advance > >> > >> > >> -- > >> Thanks, > >> BSK > >> > >> -- > >> Linux-cluster mailing list > >> Linux-cluster at redhat.com > >> https://www.redhat.com/mailman/listinfo/linux-cluster > >> > > > > > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: < > https://www.redhat.com/archives/linux-cluster/attachments/20110621/e41e841c/attachment.html > > > > ------------------------------ > > Message: 4 > Date: Tue, 21 Jun 2011 15:31:13 +0200 > From: Miha Valencic > To: linux clustering > Subject: Re: [Linux-cluster] Troubleshooting service relocation > Message-ID: > Content-Type: text/plain; charset="utf-8" > > Michael, I've configured the logging on RM and am now waiting for it to > switch nodes. Hopefully, I can see a reason why it is relocating. > > Thanks, > Miha. > > On Sat, Jun 18, 2011 at 11:24 AM, Michael Pye wrote: > > > On 17/06/2011 09:13, Miha Valencic wrote: > > > How can I turn on logging or what else can I check? > > > > Take a look at this knowledgebase article: > > https://access.redhat.com/kb/docs/DOC-53500 > > > > > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: < > https://www.redhat.com/archives/linux-cluster/attachments/20110621/19a643fd/attachment.html > > > > ------------------------------ > > Message: 5 > Date: Tue, 21 Jun 2011 09:57:38 -0400 > From: "Nicolas Ross" > To: "linux clustering" > Subject: [Linux-cluster] GFS2 fatal: filesystem consistency error > Message-ID: > Content-Type: text/plain; format=flowed; charset="iso-8859-1"; > reply-type=original > > 8 node cluster, fiber channel hbas and disks access trough a qlogic fabric. > > I've got hit 3 times with this error on different nodes : > > GFS2: fsid=CyberCluster:GizServer.1: fatal: filesystem consistency error > GFS2: fsid=CyberCluster:GizServer.1: inode = 9582 6698267 > GFS2: fsid=CyberCluster:GizServer.1: function = gfs2_dinode_dealloc, file = > fs/gfs2/inode.c, line = 352 > GFS2: fsid=CyberCluster:GizServer.1: about to withdraw this file system > GFS2: fsid=CyberCluster:GizServer.1: telling LM to unmount > GFS2: fsid=CyberCluster:GizServer.1: withdrawn > Pid: 2659, comm: delete_workqueu Tainted: G W ---------------- T > 2.6.32-131.2.1.el6.x86_64 #1 > Call Trace: > [] ? gfs2_lm_withdraw+0x102/0x130 [gfs2] > [] ? trunc_dealloc+0xa9/0x130 [gfs2] > [] ? gfs2_consist_inode_i+0x5d/0x60 [gfs2] > [] ? gfs2_dinode_dealloc+0x64/0x210 [gfs2] > [] ? gfs2_delete_inode+0x1ba/0x280 [gfs2] > [] ? gfs2_delete_inode+0x8d/0x280 [gfs2] > [] ? gfs2_delete_inode+0x0/0x280 [gfs2] > [] ? generic_delete_inode+0xde/0x1d0 > [] ? delete_work_func+0x0/0x80 [gfs2] > [] ? generic_drop_inode+0x65/0x80 > [] ? gfs2_drop_inode+0x2e/0x30 [gfs2] > [] ? iput+0x62/0x70 > [] ? delete_work_func+0x54/0x80 [gfs2] > [] ? 
worker_thread+0x170/0x2a0 > [] ? autoremove_wake_function+0x0/0x40 > [] ? worker_thread+0x0/0x2a0 > [] ? kthread+0x96/0xa0 > [] ? child_rip+0xa/0x20 > [] ? kthread+0x0/0xa0 > [] ? child_rip+0x0/0x20 > no_formal_ino = 9582 > no_addr = 6698267 > i_disksize = 6838 > blocks = 0 > i_goal = 6698304 > i_diskflags = 0x00000000 > i_height = 1 > i_depth = 0 > i_entries = 0 > i_eattr = 0 > GFS2: fsid=CyberCluster:GizServer.1: gfs2_delete_inode: -5 > gdlm_unlock 5,66351b err=-22 > > > Only, with different inodes each time. > > After that event, services running on that filesystem are marked failed and > not moved over another node. Any access to that fs yields I/O error. Server > needed to be rebooted to properly work again. > > I did ran a fsck last night on that filesystem, and it did find some > errors, > but nothing serious. Lots (realy lots) of those : > > Ondisk and fsck bitmaps differ at block 5771602 (0x581152) > Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free) > Metadata type is 0 (free) > Fix bitmap for block 5771602 (0x581152) ? (y/n) > > And after completing the fsck, I started back some services, and I got the > same error on another filesystem that is practily empty and used for small > utilities used troughout the cluster... > > What should I do to find the source of this problem ? > > > > ------------------------------ > > Message: 6 > Date: Tue, 21 Jun 2011 10:42:40 -0400 (EDT) > From: Bob Peterson > To: linux clustering > Subject: Re: [Linux-cluster] GFS2 fatal: filesystem consistency error > Message-ID: > < > 1036238479.689034.1308667360488.JavaMail.root at zmail06.collab.prod.int.phx2.redhat.com > > > > Content-Type: text/plain; charset=utf-8 > > ----- Original Message ----- > | 8 node cluster, fiber channel hbas and disks access trough a qlogic > | fabric. > | > | I've got hit 3 times with this error on different nodes : > | > | GFS2: fsid=CyberCluster:GizServer.1: fatal: filesystem consistency > | error > | GFS2: fsid=CyberCluster:GizServer.1: inode = 9582 6698267 > | GFS2: fsid=CyberCluster:GizServer.1: function = gfs2_dinode_dealloc, > | file = > | fs/gfs2/inode.c, line = 352 > | GFS2: fsid=CyberCluster:GizServer.1: about to withdraw this file > | system > | GFS2: fsid=CyberCluster:GizServer.1: telling LM to unmount > | GFS2: fsid=CyberCluster:GizServer.1: withdrawn > | Pid: 2659, comm: delete_workqueu Tainted: G W ---------------- T > | 2.6.32-131.2.1.el6.x86_64 #1 > | Call Trace: > | [] ? gfs2_lm_withdraw+0x102/0x130 [gfs2] > | [] ? trunc_dealloc+0xa9/0x130 [gfs2] > | [] ? gfs2_consist_inode_i+0x5d/0x60 [gfs2] > | [] ? gfs2_dinode_dealloc+0x64/0x210 [gfs2] > | [] ? gfs2_delete_inode+0x1ba/0x280 [gfs2] > | [] ? gfs2_delete_inode+0x8d/0x280 [gfs2] > | [] ? gfs2_delete_inode+0x0/0x280 [gfs2] > | [] ? generic_delete_inode+0xde/0x1d0 > | [] ? delete_work_func+0x0/0x80 [gfs2] > | [] ? generic_drop_inode+0x65/0x80 > | [] ? gfs2_drop_inode+0x2e/0x30 [gfs2] > | [] ? iput+0x62/0x70 > | [] ? delete_work_func+0x54/0x80 [gfs2] > | [] ? worker_thread+0x170/0x2a0 > | [] ? autoremove_wake_function+0x0/0x40 > | [] ? worker_thread+0x0/0x2a0 > | [] ? kthread+0x96/0xa0 > | [] ? child_rip+0xa/0x20 > | [] ? kthread+0x0/0xa0 > | [] ? 
child_rip+0x0/0x20 > | no_formal_ino = 9582 > | no_addr = 6698267 > | i_disksize = 6838 > | blocks = 0 > | i_goal = 6698304 > | i_diskflags = 0x00000000 > | i_height = 1 > | i_depth = 0 > | i_entries = 0 > | i_eattr = 0 > | GFS2: fsid=CyberCluster:GizServer.1: gfs2_delete_inode: -5 > | gdlm_unlock 5,66351b err=-22 > | > | > | Only, with different inodes each time. > | > | After that event, services running on that filesystem are marked > | failed and > | not moved over another node. Any access to that fs yields I/O error. > | Server > | needed to be rebooted to properly work again. > | > | I did ran a fsck last night on that filesystem, and it did find some > | errors, > | but nothing serious. Lots (realy lots) of those : > | > | Ondisk and fsck bitmaps differ at block 5771602 (0x581152) > | Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free) > | Metadata type is 0 (free) > | Fix bitmap for block 5771602 (0x581152) ? (y/n) > | > | And after completing the fsck, I started back some services, and I got > | the > | same error on another filesystem that is practily empty and used for > | small > | utilities used troughout the cluster... > | > | What should I do to find the source of this problem ? > > Hi, > > I believe this is a GFS2 bug we've already solved. > Please contact Red Hat Support. > > Regards, > > Bob Peterson > Red Hat File Systems > > > > ------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > End of Linux-cluster Digest, Vol 86, Issue 19 > ********************************************* > -- Thanks, Balaji S -------------- next part -------------- An HTML attachment was scrubbed... URL: From henahadu at gmail.com Wed Jun 22 04:12:47 2011 From: henahadu at gmail.com (Peter Sjoberg) Date: Wed, 22 Jun 2011 00:12:47 -0400 Subject: [Linux-cluster] kvm cluster ed guests and virtd fencing Message-ID: <1308715967.23586.36.camel@defiant1.intra.techwiz.ca> I have two KVM hosts with some clustered guests that I'm trying to setup fencing for using fence_virtd and I wonder if this is even suppose to work, that guest on one host tells the other host to kill it's guest. I wonder if I need to add some qpid stuff for the two hosts to work together. Setup: I have two kvm hosts, lets call them host1 & host2. Each hosts has a guest (guest1 on host1 & guest2 on host2) and this guests will be clustered with each other. The hosts normal network is internal only and originates on host eth0/br0 The guests have a separate DMZ network segment, and originates as bridged on host eth1/br1, host has no ip on br1 The guests also have a private link between each other and originates on host eth2/br10 (crossover cable between the two hosts). To bypass multicast routing problems I have on the host side added an ip to the private link and running /usr/sbin/fence_virtd set to listen to br10 The intent is that guest1 running on host1 should be able to fence by telling host2 to kill guest2 but this doesn't work. On the guest side I test this with "fence_xvm -o list" and I get a list of all guests on one of the hosts, I expected combined list. What host list I get depends, mostly I get same as the host I'm running on or the first _virtd started. I think the multicast part works because when I start fence_virtd on one host (host1 or host2) I can issue "fence_xvm -o list" on all 4 nodes and get the a list of guests from the host I started it on. One other thing that fails is the killing part. 
I start fence_virtd on host2 and then on guest1 I issue fence_xvm -H -o restart and it just returns "permission denied" So, first of all, is it suppose to work and I just messed up my config or do I need to figure out how to add qpid (or something else) to my setup? -- ------------------------------------------------------------------- Techwiz, Peter Sjoberg PGP key (12F506C8) on keyserver & homepage Key fingerprint = 3DC2 CEBA 1590 B41A 3780 955A DB42 02BB 12F5 06C8 mailto:peters-redhat AT techwiz.ca http://www.techwiz.ca/~peters -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 190 bytes Desc: This is a digitally signed message part URL: From j.koekkoek at perrit.nl Wed Jun 22 07:55:02 2011 From: j.koekkoek at perrit.nl (Jeroen Koekkoek) Date: Wed, 22 Jun 2011 07:55:02 +0000 Subject: [Linux-cluster] relationship corosync + dlm + cman in cluster 3.1.3 Message-ID: Hi, I have a question regarding the relationship between Corosync, DLM, and CMAN. Is the following statement correct? The DLM is a kernel module, dlm_controld is the control daemon. CMAN is the old messaging layer, and is now stacked on OpenAIS, which in turn is stacked on Corosync. The DLM does not use CMAN (or Corosync for that matter) to communicate, but does fetch node information from CMAN. The filesystem (GFS or OCFS2) speaks to the DLM locally (in kernel?) and the DLM takes care of the communication. Now for the real question. In the 3.1.3 release dlm_controld still depends on CMAN, but is it safe to say that I can just use Pacemaker and Heartbeat resource agents and only install CMAN so that dlm_controld can query node information? Or did I misunderstand the documentation? Regards, Jeroen From martijn.storck at gmail.com Wed Jun 22 09:11:56 2011 From: martijn.storck at gmail.com (Martijn Storck) Date: Wed, 22 Jun 2011 11:11:56 +0200 Subject: [Linux-cluster] Replacing network switch in a cluster In-Reply-To: References: Message-ID: Well, after reading man openais.conf the settings below seemed ok for this operation, so I just went ahead with it. I connected the old and new switch and moved the cluster nodes over one by one (which meant the link was down for 4-5 seconds). There were no problems whatsoever. Cheers, Martijn On Fri, Jun 17, 2011 at 9:26 AM, Martijn Storck wrote: > Hi all, > > Unfortunately I have to swap out the switch that is used for the cluster > traffic of our 4-node cluster for a new one. I'm hoping I can do this by > connecting the new switch to the old switch and then moving the nodes over > one by one. > > Can I change the cluster configuration so that there is a longer grace > period before a node is deemed 'lost' and gets fenced? The only line in my > cluster.conf that looks like it has anything to do with it is this one: > > token_retransmits_before_loss_const="20"/> > > I think that with faststart enabled the link with a node will be down for > only a few seconds. I realize that this probably means the cluster will lock > up during that period (since we use a lot of GFS), but it's still better > than having to bring the entire cluster down. > > Kind regards, > Martijn Storck > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mickael.bourneuf at celeonet.fr Wed Jun 22 11:54:47 2011 From: mickael.bourneuf at celeonet.fr (=?ISO-8859-1?Q?=22=5BCeleonet=5D_Micka=EBl_Bourneuf=22?=) Date: Wed, 22 Jun 2011 13:54:47 +0200 Subject: [Linux-cluster] (no subject) Message-ID: <4E01D807.7070708@celeonet.fr> From mgrac at redhat.com Wed Jun 22 16:12:12 2011 From: mgrac at redhat.com (Marek Grac) Date: Wed, 22 Jun 2011 18:12:12 +0200 Subject: [Linux-cluster] Plugged out blade from bladecenter chassis - fence_bladecenter failed In-Reply-To: References: Message-ID: <4E02145C.5070805@redhat.com> Hi, On 06/20/2011 07:16 AM, Parvez Shaikh wrote: > > Hi Thanks Dominic, > > Do fence_bladecenter "reboot" the blade as a part of fencing always? I > have seen it turning the blade off by default. > fence daemon always tries to reboot a machine. In most of the cases it is done by power off / check if it is really off / power on / check if it is really on. Fencing is successful if we were able to check that system is down. > Through fence_bladecenter --missing-as-off...... -o off returns me a > correct result when run from command line but fencing fails through > "fenced". I am using RHEL 5.5 ES and fence_bladecenter version reports > following - m, From henahadu at gmail.com Wed Jun 22 17:45:24 2011 From: henahadu at gmail.com (Peter Sjoberg) Date: Wed, 22 Jun 2011 13:45:24 -0400 Subject: [Linux-cluster] kvm cluster ed guests and virtd fencing In-Reply-To: References: <1308715967.23586.36.camel@defiant1.intra.techwiz.ca> Message-ID: <1308764725.6739.9.camel@defiant1.intra.techwiz.ca> On Wed, 2011-06-22 at 11:58 -0400, Victor Ramirez wrote: > You dont need qpid Good, makes life easier Forgot to say it before but there is no plans to make the guests move around so guest1 will always be on host1 and I think with that config I don't need qpid. > > First of all, try setting SELinux to permissive on your guests or else > the fenced process will not be allowed to send a multicast packet It's permissive for other reasons (but I like to enable it) > > > Second of all, remember to set the fence_xvm.key files as such: > host1 key = guest2 key > host2 key = guest1 key I have same key on all 4 nodes (for now at least) and I did fix the error I see on all howtos dd if=/dev/random bs=4096 count=1 of=/etc/cluster/fence_xvm.key Is wrong, because it fails way fast and I got a 20-200byte file when /dev/random ran out. dd if=/dev/random bs=1 count=4096 of=/etc/cluster/fence_xvm.key Works, get a 4096byte file but takes forever, specially on a remote server (in which case I would generate the file locally and scp to remote) dd if=/dev/urandom bs=1 count=4096 of=/etc/cluster/fence_xvm.key Goes fast and is good enough for me. > > so that guest1 send a multicast signal to host2 to fence guest2. Right. > > I did find a config error so now I can kill guest2 from guest1 (had dmzip instead of privip in the config file) but it is still a problem with that I only see one hosts guest, not both and that means I can only kill one way, not both ways. /ps > > 2011/6/22 Peter Sjoberg > I have two KVM hosts with some clustered guests that I'm > trying to setup > fencing for using fence_virtd and I wonder if this is even > suppose to > work, that guest on one host tells the other host to kill it's > guest. > I wonder if I need to add some qpid stuff for the two hosts to > work > together. > > Setup: > I have two kvm hosts, lets call them host1 & host2. > Each hosts has a guest (guest1 on host1 & guest2 on host2) and > this > guests will be clustered with each other. 
> The hosts normal network is internal only and originates on > host > eth0/br0 > The guests have a separate DMZ network segment, and originates > as > bridged on host eth1/br1, host has no ip on br1 > The guests also have a private link between each other and > originates on > host eth2/br10 (crossover cable between the two hosts). > > To bypass multicast routing problems I have on the host side > added an ip > to the private link and running /usr/sbin/fence_virtd set to > listen to > br10 > > The intent is that guest1 running on host1 should be able to > fence by > telling host2 to kill guest2 but this doesn't work. > On the guest side I test this with "fence_xvm -o list" and I > get a list > of all guests on one of the hosts, I expected combined list. > What host list I get depends, mostly I get same as the host > I'm running > on or the first _virtd started. > I think the multicast part works because when I start > fence_virtd on one > host (host1 or host2) I can issue "fence_xvm -o list" on all 4 > nodes and > get the a list of guests from the host I started it on. > > One other thing that fails is the killing part. > I start fence_virtd on host2 and then on guest1 I issue > fence_xvm -H -o restart > and it just returns "permission denied" > > So, first of all, is it suppose to work and I just messed up > my config > or do I need to figure out how to add qpid (or something else) > to my > setup? > > -- > ------------------------------------------------------------------- > Techwiz, Peter Sjoberg PGP key (12F506C8) on keyserver & > homepage > Key fingerprint = 3DC2 CEBA 1590 B41A 3780 955A DB42 02BB > 12F5 06C8 > mailto:peters-redhat AT techwiz.ca > http://www.techwiz.ca/~peters > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 190 bytes Desc: This is a digitally signed message part URL: From Ralph.Grothe at itdz-berlin.de Fri Jun 24 07:17:57 2011 From: Ralph.Grothe at itdz-berlin.de (Ralph.Grothe at itdz-berlin.de) Date: Fri, 24 Jun 2011 09:17:57 +0200 Subject: [Linux-cluster] How to achieve a service's "stickyness" to a "preferred" node in RHCS? Message-ID: Hello Clustering Gurus, I need to have a service during normal operation (i.e. not during relocation when loss of stickyness is ok and wanted) stick to a preferred cluster node. (n.b. this is only a two-node cluster) In the redhat cluster admin guide (we're on RHEL 5.6) I think to have read that such thing as a "preffered_node" or similar attribute doesn't exist in the schema any more and that instead one should define ordered="0" and restricted="0" failover domains for the respective service as this would in effect result in the wanted behavior. Is this correct? And how (e.g. a short cluster.conf XML example snippet would be appreciated) would this have to be applied? Regards Ralph From l.santeramo at brgm.fr Fri Jun 24 07:56:36 2011 From: l.santeramo at brgm.fr (Santeramo Luc) Date: Fri, 24 Jun 2011 07:56:36 +0000 Subject: [Linux-cluster] How to achieve a service's "stickyness" to a"preferred" node in RHCS? In-Reply-To: References: Message-ID: <4E04434F.4050305@brgm.fr> Hi, SVC_001 will be sticked to Failover Domain "FOD_srv1", which have node srv1 as priority node. ...and you can have more informations about options on RHCS admin guide. 
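The XML example in this reply was stripped by the list archiver, so here is a sketch of the kind of stanza being described; the names FOD_srv1, SVC_001 and srv1 come from the sentence above, while srv2 and the attribute values are assumptions to be checked against the RHCS admin guide, not the snippet that was actually attached:

# Sketch only: an ordered, unrestricted failover domain that prefers srv1.
# srv2 and all attribute values are assumed, not taken from the stripped attachment.
cat > /tmp/failoverdomain-sketch.xml <<'EOF'
<rm>
  <failoverdomains>
    <failoverdomain name="FOD_srv1" ordered="1" restricted="0" nofailback="1">
      <failoverdomainnode name="srv1" priority="1"/>
      <failoverdomainnode name="srv2" priority="2"/>
    </failoverdomain>
  </failoverdomains>
  <service name="SVC_001" domain="FOD_srv1" autostart="1" recovery="relocate"/>
</rm>
EOF

In a stanza of this shape, nofailback="1" is what keeps the service where it is after a relocation, and per the admin guide the failback behaviour only has an effect when the domain is ordered.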
Luc ________________________________ Le 24/06/2011 09:17, Ralph.Grothe at itdz-berlin.de a ?crit : Hello Clustering Gurus, I need to have a service during normal operation (i.e. not during relocation when loss of stickyness is ok and wanted) stick to a preferred cluster node. (n.b. this is only a two-node cluster) In the redhat cluster admin guide (we're on RHEL 5.6) I think to have read that such thing as a "preffered_node" or similar attribute doesn't exist in the schema any more and that instead one should define ordered="0" and restricted="0" failover domains for the respective service as this would in effect result in the wanted behavior. Is this correct? And how (e.g. a short cluster.conf XML example snippet would be appreciated) would this have to be applied? Regards Ralph -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster ********************************************************************************************** Pensez a l'environnement avant d'imprimer ce message Think Environment before printing Le contenu de ce mel et de ses pieces jointes est destine a l'usage exclusif du (des) destinataire(s) designe (s) comme tel(s). En cas de reception par erreur, le signaler a son expediteur et ne pas en divulguer le contenu. L'absence de virus a ete verifiee a l'emission, il convient neanmoins de s'assurer de l'absence de contamination a sa reception. The contents of this email and any attachments are confidential. They are intended for the named recipient (s) only. If you have received this email in error please notify the system manager or the sender immediately and do not disclose the contents to anyone or make copies. eSafe scanned this email for viruses, vandals and malicious content. ********************************************************************************************** -------------- next part -------------- An HTML attachment was scrubbed... URL: From rossnick-lists at cybercat.ca Fri Jun 24 16:06:51 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Fri, 24 Jun 2011 12:06:51 -0400 Subject: [Linux-cluster] Corosync goes cpu to 95-99% In-Reply-To: <51BB988BCCF547E69BF222BDAF34C4DE@versa> References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com> <4DD2BD7D.5070704@gmail.com> <4DD2CA90.6090802@redhat.com> <3B50BA7445114813AE429BEE51A2BA52@versa> <4DD78908.2030801@gmail.com> <0B1965C8-9807-42B6-9453-01BE0C0B1DCB@cybercat.ca><4DD80D5D.10004@gmail.com> <4DD873C7.8080402@cybercat.ca> <22E7D11CD5E64E338A66811F31F06238@versa> <4DE545D7.1080703@redhat.com> <4DE69786.5010204@gmail.com><4DE6CAF6.4000002@cybercat.ca> <4DE75602.1000408@gmail.com> <51BB988BCCF547E69BF222BDAF34C4DE@versa> Message-ID: <4E04B61B.9070208@cybercat.ca> > Thanks for that, that'll prevent me from modifying a system file... > > And yes, I find it a little disapointing. We're now at 6.1, and our > setup is exactly what RHCS was designed for... A GFS over fiber, httpd > running content from that gfs... Two thing I need to mention in this issue. One, support doesn't think anymore that it's a coro-sync specific issue, they are searching to a driver issue or other source for this problem. Second, I downgraded my kernel to 2.6.32-71.29.1.el6 (pre-6.1, or 6.0), for another issue, and since I did, I don't think I saw that issue again. I saw spikes in my cpu graphs, but I'm not 100% sure that they are caused by this issue. 
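One low-impact way to check whether such spikes really line up with corosync (a sketch only; the log path and one-minute interval are arbitrary, and pidof/top are assumed available on the node) is to sample the daemon's CPU share over time and compare the timestamps against the CPU graphs:

# Sample corosync's CPU usage once a minute so spikes in the host graphs
# can be matched against the process itself.
while true; do
    { date; top -b -n 1 -p "$(pidof corosync)" | tail -n 1; } >> /var/log/corosync-cpu.log
    sleep 60
done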
So, as a temporary work-around for this time, woule be (at your own risks) to downgrade to 2.6.32-71.29.1.el6 kernel : yum install kernel-2.6.32-71.29.1.el6.x86_64 Regards, From Ralph.Grothe at itdz-berlin.de Sat Jun 25 09:00:13 2011 From: Ralph.Grothe at itdz-berlin.de (Ralph.Grothe at itdz-berlin.de) Date: Sat, 25 Jun 2011 11:00:13 +0200 Subject: [Linux-cluster] How to achieve a service's "stickyness" toa"preferred" node in RHCS? In-Reply-To: <4E04434F.4050305@brgm.fr> References: <4E04434F.4050305@brgm.fr> Message-ID: Bonjour Luc, many thanks for the sample snippet. I am afraid, I couldn't reply yesterday. I will give your suggestion a try. Especially, since I want to assure that failback of a relocated service is disabled, what according to the RHCS admin guide is only applicable to ordered failover domains (i.e. cited from RHCS AG "The failback characteristic is applicable only if ordered failover is configured.") http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html /Cluster_Administration/s1-config-failover-domain-CA.html > > ...and you can have more informations about options on RHCS admin guide > I don't agree. In fact the RHCS AD is quite telling the opposite to your proposal by demanding: "To configure a preferred member, you can create an unrestricted failover domain comprising only one cluster member. Doing that causes a cluster service to run on that cluster member primarily (the preferred member), but allows the cluster service to fail over to any of the other members." So here they claim that it has to be an *unordered* failover domain with only *one* member. Sadly, they even don't care to further elaborate by e.g. providing a mor illustrative config code sample. Because I had configuered my failoverdomains according the above cited statement in RHCS AG to achieve stickyness I was more than surprised to observe the service to failback after it already had been relocated after I had rebooted the node it currently and preferredly ran on. Rgds Ralph ________________________________ From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Santeramo Luc Sent: Friday, June 24, 2011 9:57 AM To: linux-cluster at redhat.com Subject: Re: [Linux-cluster] How to achieve a service's "stickyness" toa"preferred" node in RHCS? Hi, SVC_001 will be sticked to Failover Domain "FOD_srv1", which have node srv1 as priority node. ...and you can have more informations about options on RHCS admin guide. Luc ________________________________ Le 24/06/2011 09:17, Ralph.Grothe at itdz-berlin.de a ?crit : Hello Clustering Gurus, I need to have a service during normal operation (i.e. not during relocation when loss of stickyness is ok and wanted) stick to a preferred cluster node. (n.b. this is only a two-node cluster) In the redhat cluster admin guide (we're on RHEL 5.6) I think to have read that such thing as a "preffered_node" or similar attribute doesn't exist in the schema any more and that instead one should define ordered="0" and restricted="0" failover domains for the respective service as this would in effect result in the wanted behavior. Is this correct? And how (e.g. a short cluster.conf XML example snippet would be appreciated) would this have to be applied? Regards Ralph -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster P Pensez ? l'environnement avant d'imprimer ce message Think Environment before printing Le contenu de ce m?l et de ses pi?ces jointes est destin? ? 
l'usage exclusif du (des) destinataire(s) d?sign?(s) comme tel(s). En cas de r?ception par erreur, le signaler ? son exp?diteur et ne pas en divulguer le contenu. L'absence de virus a ?t? v?rifi?e ? l'?mission, il convient n?anmoins de s'assurer de l'absence de contamination ? sa r?ception. The contents of this email and any attachments are confidential. They are intended for the named recipient(s) only. If you have received this email in error please notify the system manager or the sender immediately and do not disclose the contents to anyone or make copies. eSafe scanned this email for viruses, vandals and malicious content. From noreply at boxbe.com Sat Jun 25 16:04:21 2011 From: noreply at boxbe.com (noreply at boxbe.com) Date: Sat, 25 Jun 2011 09:04:21 -0700 (PDT) Subject: [Linux-cluster] Linux-cluster Digest, Vol 86, Issue 24 (Action Required) Message-ID: <710906935.2168317.1309017861695.JavaMail.prod@app010.dmz> Dear sender, You will not receive any more courtesy notices from our members for two days. Messages you have sent will remain in a lower priority mailbox for our member to review at their leisure. Future messages will be more likely to be viewed if you are on our member's priority Guest List. Thank you, shanavasmca at gmail.com Powered by Boxbe -- "End Email Overload" Visit http://www.boxbe.com/how-it-works?tc=8511374584_4610803 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- An embedded message was scrubbed... From: linux-cluster-request at redhat.com Subject: Linux-cluster Digest, Vol 86, Issue 24 Date: Sat, 25 Jun 2011 12:00:05 -0400 Size: 2088 URL: From andrew at beekhof.net Sun Jun 26 23:23:57 2011 From: andrew at beekhof.net (Andrew Beekhof) Date: Mon, 27 Jun 2011 09:23:57 +1000 Subject: [Linux-cluster] relationship corosync + dlm + cman in cluster 3.1.3 In-Reply-To: References: Message-ID: On Wed, Jun 22, 2011 at 5:55 PM, Jeroen Koekkoek wrote: > Hi, > > I have a question regarding the relationship between Corosync, DLM, and CMAN. Is the following statement correct? > > The DLM is a kernel module, dlm_controld is the control daemon. > CMAN is the old messaging layer, and is now stacked on OpenAIS, which in turn is stacked on Corosync. > > The DLM does not use CMAN (or Corosync for that matter) to communicate, but does fetch node information from CMAN. > > The filesystem (GFS or OCFS2) speaks to the DLM locally (in kernel?) and the DLM takes care of the communication. > > > Now for the real question. > > In the 3.1.3 release dlm_controld still depends on CMAN, but is it safe to say that I can just use Pacemaker and Heartbeat resource agents and only install CMAN so that dlm_controld can query node information? > > Or did I misunderstand the documentation? You need to make sure everyone is getting the same membership and quorum information. So yes, install CMAN for dlm_controld but also tell pacemaker to use it too (make sure you're on 1.1.5 or higher). > > Regards, > Jeroen > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From RMartinez-Sanchez at nds.com Mon Jun 27 13:35:19 2011 From: RMartinez-Sanchez at nds.com (Martinez-Sanchez, Raul) Date: Mon, 27 Jun 2011 14:35:19 +0100 Subject: [Linux-cluster] info RHEL 6 Cluster Suite + File System Message-ID: <7370F6F5ED3B874F988F5CE657D801EA13BE7DAB87@UKMA1.UK.NDS.COM> Hi All, I have a very generic question that somehow am unable to answer. 
In the past (RHEL 5) we have been deploying HA Clusters in the following manner: Two to four redhat nodes with the Red Hat cluster suite on them. As well all the nodes are attached to a SAN/Fibre infrastructure with two SAN Switches and two controllers per Storage Array. The storage array was presented to the cluster suite as a GFS resources and services (Oracle) were making use of it by mounting the GFS resource and operating on it. It is my understanding (maybe am wrong) that in RHEL 6 there is no GFS support as well as that GFS2 is not oracle certified and therefore cannot be used. So my question is how can we replicate the same structure/architecture on RHEL 6 if GFS/GFS2 cannot be used? Apologies if this question is too simple but am just trying to get some more understanding on how we could proceed next. Regards, Ra?l Mart?nez S?nchez ************************************************************************************** This message is confidential and intended only for the addressee. If you have received this message in error, please immediately notify the postmaster at nds.com and delete it from your system as well as any copies. The content of e-mails as well as traffic data may be monitored by NDS for employment and security purposes. To protect the environment please do not print this e-mail unless necessary. NDS Limited. Registered Office: One London Road, Staines, Middlesex, TW18 4EX, United Kingdom. A company registered in England and Wales. Registered no. 3080780. VAT no. GB 603 8808 40-00 ************************************************************************************** From andersonlira at gmail.com Tue Jun 28 05:55:09 2011 From: andersonlira at gmail.com (anderson souza) Date: Mon, 27 Jun 2011 23:55:09 -0600 Subject: [Linux-cluster] Hight I/O Wait Rates - RHEL 6.1 + GFS2 + NFS Message-ID: Hi everyone, I have an Active/Passive RHCS 6.1 runing with 8TB of GFS2 with NFS on top and exporting 26 mouting points to 250 NFS clients. The GFS2 mounting points are mounted with noatime, nodiratime, data=writeback and localflocks options, and also the SAN and servers are fast (4Gbps and 8Gb, dual controllers working in LB, H.A... QuadCore, 48GB of memory...). The cluster has been doing its work (failover working fine...), however and unfortunately I have seen hight I/Owait rates, sometimes around 60-70% (on which is very bad), and a couple of glock_workqueue jobs, so I get a bunch of gfs2_quotad, nfsd errors and qdisk latency. The debugfs didn't show me "W", only "G" and "H". Have you guys seen it before? Looks like some glock's contention? How could I get it fixed and what does it mean? Thank you very much Jun 27 18:48:05 kernel: INFO: task gfs2_quotad:19066 blocked for more than 120 seconds. Jun 27 18:48:05 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jun 27 18:48:05 kernel: gfs2_quotad D 0000000000000004 0 19066 2 0x00000080 Jun 27 18:48:05 kernel: ffff880bb01e1c20 0000000000000046 0000000000000000 ffffffffa045ec6d Jun 27 18:48:05 kernel: 0000000000000000 ffff880be6e2b000 ffff880bb01e1c50 00000001051d8b46 Jun 27 18:48:05 kernel: ffff880be4865af8 ffff880bb01e1fd8 000000000000f598 ffff880be4865af8t Jun 27 18:48:05 kernel: Call Trace: Jun 27 18:48:05 kernel: [] ? dlm_put_lockspace+0x1d/0x40 [dlm] Jun 27 18:48:05 kernel: [] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2] Jun 27 18:48:05 kernel: [] gfs2_glock_holder_wait+0xe/0x20 [gfs2] Jun 27 18:48:05 kernel: [] __wait_on_bit+0x5f/0x90 Jun 27 18:48:05 kernel: [] ? 
gfs2_glock_holder_wait+0x0/0x20 [gfs2] Jun 27 18:48:05 kernel: [] out_of_line_wait_on_bit+0x78/0x90 Jun 27 18:48:05 kernel: [] ? wake_bit_function+0x0/0x50 Jun 27 18:48:05 kernel: [] gfs2_glock_wait+0x36/0x40 [gfs2] Jun 27 18:48:05 kernel: [] gfs2_glock_nq+0x191/0x370 [gfs2] Jun 27 18:48:05 kernel: [] ? try_to_del_timer_sync+0x7b/0xe0 Jun 27 18:48:05 kernel: [] gfs2_statfs_sync+0x58/0x1b0 [gfs2] Jun 27 18:48:05 kernel: [] ? schedule_timeout+0x19a/0x2e0 Jun 27 18:48:05 kernel: [] ? gfs2_statfs_sync+0x50/0x1b0 [gfs2] Jun 27 18:48:05 kernel: [] quotad_check_timeo+0x57/0xb0 [gfs2] Jun 27 18:48:05 kernel: [] gfs2_quotad+0x234/0x2b0 [gfs2] Jun 27 18:48:05 kernel: [] ? autoremove_wake_function+0x0/0x40 Jun 27 18:48:05 kernel: [] ? gfs2_quotad+0x0/0x2b0 [gfs2] Jun 27 18:48:05 kernel: [] kthread+0x96/0xa0 Jun 27 18:48:05 kernel: [] child_rip+0xa/0x20 Jun 27 18:48:05 kernel: [] ? kthread+0x0/0xa0 Jun 27 18:48:05 kernel: [] ? child_rip+0x0/0x20 Jun 27 19:49:07 kernel: __ratelimit: 57 callbacks suppressed Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! Jun 27 20:00:58 kernel: rpc-srv/tcp: nfsd: got error -104 when sending 140 bytes - shutting down socket Jun 27 20:00:58 kernel: __ratelimit: 40 callbacks suppressed qdiskd[10078]: qdisk cycle took more than 1 second to complete (1.170000) qdisk cycle took more than 1 second to complete (1.120000) Thanks James S. -------------- next part -------------- An HTML attachment was scrubbed... URL: From omerfsen at gmail.com Tue Jun 28 06:05:36 2011 From: omerfsen at gmail.com (Omer Faruk SEN) Date: Tue, 28 Jun 2011 09:05:36 +0300 Subject: [Linux-cluster] Hight I/O Wait Rates - RHEL 6.1 + GFS2 + NFS In-Reply-To: References: Message-ID: Hi, Open a ticket so Red Hat technical staff can take care of this. I think it is the fastest way to resolve and fix this issue. Regards. On Tue, Jun 28, 2011 at 8:55 AM, anderson souza wrote: > Hi everyone, > > I have an Active/Passive RHCS 6.1 runing with 8TB of GFS2 with NFS on > top and exporting 26 mouting points to 250 NFS clients. The GFS2 mounting > points are mounted with noatime, nodiratime, data=writeback and localflocks > options, and also the SAN and servers are fast (4Gbps and 8Gb, dual > controllers working in LB, H.A... QuadCore, 48GB of memory...). The cluster > has been doing its work (failover working fine...), however > and unfortunately I have seen hight I/Owait rates, sometimes around 60-70% > (on which is very bad), and a couple of glock_workqueue jobs, so I get a > bunch of gfs2_quotad, nfsd errors and qdisk latency. The debugfs didn't show > me "W", only "G" and "H". > > Have you guys seen it before? > Looks like some glock's contention? > How could I get it fixed and what does it mean? > > Thank you very much > > > Jun 27 18:48:05 kernel: INFO: task gfs2_quotad:19066 blocked for more than > 120 seconds. > Jun 27 18:48:05 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. 
> Jun 27 18:48:05 kernel: gfs2_quotad D 0000000000000004 0 19066 > 2 0x00000080 > Jun 27 18:48:05 kernel: ffff880bb01e1c20 0000000000000046 0000000000000000 > ffffffffa045ec6d > Jun 27 18:48:05 kernel: 0000000000000000 ffff880be6e2b000 ffff880bb01e1c50 > 00000001051d8b46 > Jun 27 18:48:05 kernel: ffff880be4865af8 ffff880bb01e1fd8 000000000000f598 > ffff880be4865af8t > Jun 27 18:48:05 kernel: Call Trace: > Jun 27 18:48:05 kernel: [] ? dlm_put_lockspace+0x1d/0x40 > [dlm] > Jun 27 18:48:05 kernel: [] ? > gfs2_glock_holder_wait+0x0/0x20 [gfs2] > Jun 27 18:48:05 kernel: [] > gfs2_glock_holder_wait+0xe/0x20 [gfs2] > Jun 27 18:48:05 kernel: [] __wait_on_bit+0x5f/0x90 > Jun 27 18:48:05 kernel: [] ? > gfs2_glock_holder_wait+0x0/0x20 [gfs2] > Jun 27 18:48:05 kernel: [] > out_of_line_wait_on_bit+0x78/0x90 > Jun 27 18:48:05 kernel: [] ? wake_bit_function+0x0/0x50 > Jun 27 18:48:05 kernel: [] gfs2_glock_wait+0x36/0x40 > [gfs2] > Jun 27 18:48:05 kernel: [] gfs2_glock_nq+0x191/0x370 > [gfs2] > Jun 27 18:48:05 kernel: [] ? > try_to_del_timer_sync+0x7b/0xe0 > Jun 27 18:48:05 kernel: [] gfs2_statfs_sync+0x58/0x1b0 > [gfs2] > Jun 27 18:48:05 kernel: [] ? > schedule_timeout+0x19a/0x2e0 > Jun 27 18:48:05 kernel: [] ? gfs2_statfs_sync+0x50/0x1b0 > [gfs2] > Jun 27 18:48:05 kernel: [] quotad_check_timeo+0x57/0xb0 > [gfs2] > Jun 27 18:48:05 kernel: [] gfs2_quotad+0x234/0x2b0 > [gfs2] > Jun 27 18:48:05 kernel: [] ? > autoremove_wake_function+0x0/0x40 > Jun 27 18:48:05 kernel: [] ? gfs2_quotad+0x0/0x2b0 > [gfs2] > Jun 27 18:48:05 kernel: [] kthread+0x96/0xa0 > Jun 27 18:48:05 kernel: [] child_rip+0xa/0x20 > Jun 27 18:48:05 kernel: [] ? kthread+0x0/0xa0 > Jun 27 18:48:05 kernel: [] ? child_rip+0x0/0x20 > > Jun 27 19:49:07 kernel: __ratelimit: 57 callbacks suppressed > Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! > Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! > Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! > Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! > Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! > Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! > Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! > Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! > Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! > Jun 27 19:49:07 kernel: nfsd: peername failed (err 107)! > Jun 27 20:00:58 kernel: rpc-srv/tcp: nfsd: got error -104 when sending 140 > bytes - shutting down socket > Jun 27 20:00:58 kernel: __ratelimit: 40 callbacks suppressed > qdiskd[10078]: qdisk cycle took more than 1 second to complete (1.170000) > qdisk cycle took more than 1 second to complete (1.120000) > > Thanks > James S. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fdinitto at redhat.com Tue Jun 28 06:22:00 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Tue, 28 Jun 2011 08:22:00 +0200 Subject: [Linux-cluster] Hight I/O Wait Rates - RHEL 6.1 + GFS2 + NFS In-Reply-To: References: Message-ID: <4E097308.9090104@redhat.com> On 6/28/2011 7:55 AM, anderson souza wrote: > Hi everyone, > > I have an Active/Passive RHCS 6.1 runing with 8TB of GFS2 with NFS on > top and exporting 26 mouting points to 250 NFS clients. 
The GFS2 > mounting points are mounted with noatime, nodiratime, data=writeback and > localflocks options, and also the SAN and servers are fast (4Gbps and > 8Gb, dual controllers working in LB, H.A... QuadCore, 48GB of > memory...). The cluster has been doing its work (failover working > fine...), however and unfortunately I have seen hight I/Owait rates, > sometimes around 60-70% (on which is very bad), and a couple > of glock_workqueue jobs, so I get a bunch of gfs2_quotad, nfsd errors > and qdisk latency. The debugfs didn't show me "W", only "G" and "H". > > Have you guys seen it before? > Looks like some glock's contention? > How could I get it fixed and what does it mean? Please contact GSS and file a ticket. You are probably experiencing this: https://bugzilla.redhat.com/show_bug.cgi?id=717010 (you might not be able to see the whole content directly, but try downgrading the kernel to 6.0 should make things better) Also, given the nature of your setup, I would recommend to request a cluster architecture review to GSS for GFS2 usage in such environment. Fabio From j.koekkoek at perrit.nl Tue Jun 28 07:12:29 2011 From: j.koekkoek at perrit.nl (Jeroen Koekkoek) Date: Tue, 28 Jun 2011 07:12:29 +0000 Subject: [Linux-cluster] relationship corosync + dlm + cman in cluster 3.1.3 In-Reply-To: References: Message-ID: Hi Andrew, Thanks for answering my question. While looking at the current source tree, I noticed newer versions, at least dlm_controld, will not use cman anymore. So I'll keep using 3.0.12 for now (with dlm_controld.pcmk). Do you have any estimate on the first release without cman? Regards, Jeroen > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster- > bounces at redhat.com] On Behalf Of Andrew Beekhof > Sent: Monday, June 27, 2011 1:24 AM > To: linux clustering > Subject: Re: [Linux-cluster] relationship corosync + dlm + cman in > cluster 3.1.3 > > On Wed, Jun 22, 2011 at 5:55 PM, Jeroen Koekkoek > wrote: > > Hi, > > > > I have a question regarding the relationship between Corosync, DLM, > and CMAN. Is the following statement correct? > > > > The DLM is a kernel module, dlm_controld is the control daemon. > > CMAN is the old messaging layer, and is now stacked on OpenAIS, which > in turn is stacked on Corosync. > > > > The DLM does not use CMAN (or Corosync for that matter) to > communicate, but does fetch node information from CMAN. > > > > The filesystem (GFS or OCFS2) speaks to the DLM locally (in kernel?) > and the DLM takes care of the communication. > > > > > > Now for the real question. > > > > In the 3.1.3 release dlm_controld still depends on CMAN, but is it > safe to say that I can just use Pacemaker and Heartbeat resource agents > and only install CMAN so that dlm_controld can query node information? > > > > Or did I misunderstand the documentation? > > You need to make sure everyone is getting the same membership and quorum > information. > So yes, install CMAN for dlm_controld but also tell pacemaker to use it > too (make sure you're on 1.1.5 or higher). 
> > > > > Regards, > > Jeroen > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From ekuric at redhat.com Tue Jun 28 09:25:26 2011 From: ekuric at redhat.com (Elvir Kuric) Date: Tue, 28 Jun 2011 11:25:26 +0200 Subject: [Linux-cluster] relationship corosync + dlm + cman in cluster 3.1.3 In-Reply-To: References: Message-ID: <4E099E06.9080303@redhat.com> On 06/22/2011 09:55 AM, Jeroen Koekkoek wrote: > Hi, > > I have a question regarding the relationship between Corosync, DLM, and CMAN. Is the following statement correct? > > The DLM is a kernel module, dlm_controld is the control daemon. > CMAN is the old messaging layer, and is now stacked on OpenAIS, which in turn is stacked on Corosync. > > The DLM does not use CMAN (or Corosync for that matter) to communicate, but does fetch node information from CMAN. > > The filesystem (GFS or OCFS2) speaks to the DLM locally (in kernel?) and the DLM takes care of the communication. > > > Now for the real question. > > In the 3.1.3 release dlm_controld still depends on CMAN, but is it safe to say that I can just use Pacemaker and Heartbeat resource agents and only install CMAN so that dlm_controld can query node information? > > Or did I misunderstand the documentation? > > Regards, > Jeroen > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster I think document at below link will give nice overview of relationships between cman, dlm, ... http://people.redhat.com/ccaulfie/docs/ClusterPic.pdf Thanks Kind regards, Elvir From c.mammoli at apra.it Tue Jun 28 16:24:51 2011 From: c.mammoli at apra.it (Cristian Mammoli - Apra Sistemi) Date: Tue, 28 Jun 2011 18:24:51 +0200 Subject: [Linux-cluster] virtual machine resource agent does not gracefully shutdown vms Message-ID: Hi, I have 4 2-node clusters (EL clone, 5.6 release) which provide high availability to KVM virtual machines. Very often when I stop a vm service (with luci or clusvcadm) the virtual machine does not shutdown gracefully but continues to operate normally until the timeout kicks in and force the poweroff. Most of the vms are windows 2008 with virtio drivers and have ACPI enabled, indeed running virsh shutdown works! Any clue about what's going on? cluster.conf: http://pastebin.com/q6tt3gMA examplevm.xml: http://pastebin.com/z1e94h25 -- Cristian Mammoli APRA SISTEMI srl Via Brodolini,6 Jesi (AN) tel dir. +390731719822 Web www.apra.it e-mail c.mammoli at apra.it From c.mammoli at apra.it Tue Jun 28 17:03:08 2011 From: c.mammoli at apra.it (Cristian Mammoli - Apra Sistemi) Date: Tue, 28 Jun 2011 19:03:08 +0200 Subject: [Linux-cluster] virtual machine resource agent does not gracefully shutdown vms In-Reply-To: <4E0A0053.6050803@apra.it> References: <4E0A0053.6050803@apra.it> Message-ID: On 06/28/2011 06:24 PM, Cristian Mammoli - Apra Sistemi wrote: > > Hi, I have 4 2-node clusters (EL clone, 5.6 release) which provide high > availability to KVM virtual machines. > Very often when I stop a vm service (with luci or clusvcadm) the virtual > machine does not shutdown gracefully but continues to operate normally > until the timeout kicks in and force the poweroff. 
I reproduced the issue nad it seems that vm.sh correctly issues "virsh shutdown domain" but the vm does not actually give a f*** :) [root at srvha01 ~]# /usr/share/cluster/vm.sh stop Hypervisor: qemu Management tool: virsh Hypervisor URI: qemu:///system Migration URI format: qemu+ssh://target_host/system Virtual machine srvdc01 is running virsh shutdown srvdc01 ... Domain srvdc01 is being shutdown Nothing happens and the domain keeps running normally. Second try: [root at srvha01 ~]# /usr/share/cluster/vm.sh stop Hypervisor: qemu Management tool: virsh Hypervisor URI: qemu:///system Migration URI format: qemu+ssh://target_host/system Virtual machine srvdc01 is running virsh shutdown srvdc01 ... Domain srvdc01 is being shutdown The domain shuts down correctly At this point I think this is a libvirt/kvm/windows issue... Anyway any help is appreciated. -- Cristian Mammoli APRA SISTEMI srl Via Brodolini,6 Jesi (AN) tel dir. +390731719822 Web www.apra.it e-mail c.mammoli at apra.it From c.mammoli at apra.it Tue Jun 28 17:47:40 2011 From: c.mammoli at apra.it (Cristian Mammoli - Apra Sistemi) Date: Tue, 28 Jun 2011 19:47:40 +0200 Subject: [Linux-cluster] [SOLVED] Virtual machine resource agent does not gracefully shutdown vms In-Reply-To: <4E0A094C.2000606@apra.it> References: <4E0A0053.6050803@apra.it> <4E0A094C.2000606@apra.it> Message-ID: It seems that Windows server in an active directory environment has a default group policy setting that inhibits ACPI shutdown if no user is logged in... It is located in: Computer Configuration\Windows Settings\Security Settings\Local Policies\Security Options\Shutdown: Allow system to be shut down without having to log on After setting this to "on" VMs shutdown gracefully on the first try when I stop them from luci. -- Cristian Mammoli APRA SISTEMI srl Via Brodolini,6 Jesi (AN) tel dir. +390731719822 Web www.apra.it e-mail c.mammoli at apra.it From noreply at boxbe.com Tue Jun 28 17:57:24 2011 From: noreply at boxbe.com (noreply at boxbe.com) Date: Tue, 28 Jun 2011 10:57:24 -0700 (PDT) Subject: [Linux-cluster] [SOLVED] Virtual machine resource agent does not gracefully shutdown vms (Action Required) Message-ID: <854324580.2651655.1309283844276.JavaMail.prod@app010.dmz> Hello linux clustering, You will not receive any more courtesy notices from our members for two days. Messages you have sent will remain in a lower priority mailbox for our member to review at their leisure. Future messages will be more likely to be viewed if you are on our member's priority Guest List. Thank you, debjyoti.mail at gmail.com Powered by Boxbe -- "End Email Overload" Visit http://www.boxbe.com/how-it-works?tc=8539327229_705574608 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- An embedded message was scrubbed... From: Cristian Mammoli - Apra Sistemi Subject: Re: [Linux-cluster] [SOLVED] Virtual machine resource agent does not gracefully shutdown vms Date: Tue, 28 Jun 2011 19:47:40 +0200 Size: 4652 URL: From chris.alexander at kusiri.com Wed Jun 29 14:17:53 2011 From: chris.alexander at kusiri.com (Chris Alexander) Date: Wed, 29 Jun 2011 15:17:53 +0100 Subject: [Linux-cluster] Expected behaviour when service fails to stop Message-ID: Hi, I was wondering what the expected behaviour of the cluster would be when a service cannot be shutdown safely. 
From c.mammoli at apra.it Tue Jun 28 17:47:40 2011 From: c.mammoli at apra.it (Cristian Mammoli - Apra Sistemi) Date: Tue, 28 Jun 2011 19:47:40 +0200 Subject: [Linux-cluster] [SOLVED] Virtual machine resource agent does not gracefully shutdown vms In-Reply-To: <4E0A094C.2000606@apra.it> References: <4E0A0053.6050803@apra.it> <4E0A094C.2000606@apra.it> Message-ID: It seems that Windows Server in an Active Directory environment has a default group policy setting that inhibits ACPI shutdown if no user is logged in... It is located in:
Computer Configuration\Windows Settings\Security Settings\Local Policies\Security Options\Shutdown: Allow system to be shut down without having to log on
After setting this to "on", VMs shut down gracefully on the first try when I stop them from luci. -- Cristian Mammoli APRA SISTEMI srl Via Brodolini,6 Jesi (AN) tel dir. +390731719822 Web www.apra.it e-mail c.mammoli at apra.it
From chris.alexander at kusiri.com Wed Jun 29 14:17:53 2011 From: chris.alexander at kusiri.com (Chris Alexander) Date: Wed, 29 Jun 2011 15:17:53 +0100 Subject: [Linux-cluster] Expected behaviour when service fails to stop Message-ID: Hi, I was wondering what the expected behaviour of the cluster would be when a service cannot be shut down safely. For example, if you request a service group to be relocated to another node in the cluster and one of the services in that group fails to stop (causing a timeout?), what would the result be? I should imagine that the service would be marked as Failed; is this the case? I have been unable to find this particular scenario documented anywhere. Thanks Chris
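(Hedged editorial sketch, not a reply from the thread: in rgmanager a stop failure is generally treated as fatal for the service; it is flagged as failed on its last owner and is not recovered automatically until an administrator intervenes. The recovery below assumes a service named "myservice"; check the underlying resources by hand before re-enabling.)

# Hedged sketch: recovering a service left in the "failed" state after a stop failure.
clustat                            # the service shows state "failed" on its last owner
clusvcadm -d service:myservice     # disable it first; this acknowledges the failure
                                   # (manually verify its IPs/filesystems really are released)
clusvcadm -e service:myservice     # re-enable; add "-m <member>" to start it on a specific node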
From andrew at beekhof.net Thu Jun 30 03:32:09 2011 From: andrew at beekhof.net (Andrew Beekhof) Date: Thu, 30 Jun 2011 13:32:09 +1000 Subject: [Linux-cluster] relationship corosync + dlm + cman in cluster 3.1.3 In-Reply-To: References: Message-ID: On Tue, Jun 28, 2011 at 5:12 PM, Jeroen Koekkoek wrote: > Hi Andrew, > > Thanks for answering my question. While looking at the current source tree, I noticed newer versions, at least dlm_controld, will not use cman anymore. So I'll keep using 3.0.12 for now (with dlm_controld.pcmk). Do you have any estimate on the first release without cman? Of the dlm etc? No. You'd have to talk to the owners of those projects. At a guess maybe a year from now. > > Regards, > Jeroen > >> -----Original Message----- >> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster- >> bounces at redhat.com] On Behalf Of Andrew Beekhof >> Sent: Monday, June 27, 2011 1:24 AM >> To: linux clustering >> Subject: Re: [Linux-cluster] relationship corosync + dlm + cman in >> cluster 3.1.3 >> >> On Wed, Jun 22, 2011 at 5:55 PM, Jeroen Koekkoek >> wrote: >> > Hi, >> > >> > I have a question regarding the relationship between Corosync, DLM, >> and CMAN. Is the following statement correct? >> > >> > The DLM is a kernel module, dlm_controld is the control daemon. >> > CMAN is the old messaging layer, and is now stacked on OpenAIS, which >> in turn is stacked on Corosync. >> > >> > The DLM does not use CMAN (or Corosync for that matter) to >> communicate, but does fetch node information from CMAN. >> > >> > The filesystem (GFS or OCFS2) speaks to the DLM locally (in kernel?) >> and the DLM takes care of the communication. >> > >> > >> > Now for the real question. >> > >> > In the 3.1.3 release dlm_controld still depends on CMAN, but is it >> safe to say that I can just use Pacemaker and Heartbeat resource agents >> and only install CMAN so that dlm_controld can query node information? >> > >> > Or did I misunderstand the documentation? >> >> You need to make sure everyone is getting the same membership and quorum >> information. >> So yes, install CMAN for dlm_controld but also tell pacemaker to use it >> too (make sure you're on 1.1.5 or higher). >> >> > >> > Regards, >> > Jeroen >> > >> > -- >> > Linux-cluster mailing list >> > Linux-cluster at redhat.com >> > https://www.redhat.com/mailman/listinfo/linux-cluster >> > >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster >
From Rahul.Borate at sailpoint.com Thu Jun 30 05:57:43 2011 From: Rahul.Borate at sailpoint.com (Rahul Borate) Date: Thu, 30 Jun 2011 11:27:43 +0530 Subject: [Linux-cluster] Service Recovery Failure Message-ID: <59cb602d2d3493a90a462f595244ea3a@mail.gmail.com> Hi all, I just performed a test which failed miserably. I have two nodes, node-1 and node-2. The global file system /gfs is on node-1, and two HA services are running on node-1. If I unplug the cables for node-1, those two services should transfer to node-2, but node-2 did not take over the services. If I do a proper shutdown/reboot of node-1, however, the two services move to node-2 without a problem. Please help!

clustat from node-2 before unplugging the cable for node-1:

[root at Node-2 ~]# clustat
Member Status: Quorate

 Member Name                        ID   Status
 ------ ----                        ---- ------
 Node-1                             1    Online, rgmanager
 Node-2                             2    Online, Local, rgmanager

 Service Name                       Owner (Last)                   State
 ------- ----                       ----- ------                   -----
 service:nfs                        Node-1                         started
 service:ESS_HA                     Node-1                         started

clustat from node-2 after unplugging the cable for node-1:

[root at Node-2 ~]# clustat
Member Status: Quorate

 Member Name                        ID   Status
 ------ ----                        ---- ------
 Node-1                             1    Offline
 Node-2                             2    Online, Local, rgmanager

 Service Name                       Owner (Last)                   State
 ------- ----                       ----- ------                   -----
 service:nfs                        Node-1                         started
 service:ESS_HA                     Node-1                         started

/etc/cluster/cluster.conf:

[root at Node-2 ~]# cat /etc/cluster/cluster.conf
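(A hedged note, not part of the original thread: on a RHEL 5 style cluster, rgmanager normally will not recover services from a node that has dropped out of membership until that node has been successfully fenced, which is the usual reason a pulled cable behaves differently from a clean reboot. The checks below are only a sketch; the node names come from the clustat output above and the fence agent in use will differ per site.)

# Hedged sketch: confirm that fencing is configured and that the fence of Node-1 succeeded.
ccs_tool lsfence                  # fence devices defined in cluster.conf
cman_tool nodes                   # membership as cman sees it after the cable pull
cman_tool status                  # quorum state and expected votes
group_tool ls                     # a fence group stuck waiting blocks service recovery
grep -i fence /var/log/messages   # fenced logs whether the fence attempt succeeded or failed
fence_node Node-1                 # run from Node-2: manually fence Node-1 to test the fence device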