From xavier.montagutelli at unilim.fr Wed Dec 1 08:42:51 2010 From: xavier.montagutelli at unilim.fr (Xavier Montagutelli) Date: Wed, 1 Dec 2010 09:42:51 +0100 Subject: [Linux-cluster] cluster without fencing device In-Reply-To: <4CF3BC18.4020405@alteeve.com> References: <4CF3BC18.4020405@alteeve.com> Message-ID: <201012010942.51213.xavier.montagutelli@unilim.fr> On Monday 29 November 2010 15:43:36 Digimer wrote: > On 11/29/2010 03:42 AM, Mohamed Arif Khan wrote: > > How to configure cluster without fencing device ? > > In RHCS, it is not possible. > > http://wiki.alteeve.com/index.php/Red_Hat_Cluster_Service_3_Tutorial#Concep > t.3B_Fencing > I suppose you can create a "fake" fence device which responds "ok" (/bin/true ?). But you are warned, you will live in a dangerous, unsupported configuration ;-) -- Xavier Montagutelli Tel : +33 (0)5 55 45 77 20 Service Commun Informatique Fax : +33 (0)5 55 45 75 95 Universite de Limoges 123, avenue Albert Thomas 87060 Limoges cedex From laszlo at beres.me Wed Dec 1 12:45:01 2010 From: laszlo at beres.me (Laszlo Beres) Date: Wed, 1 Dec 2010 13:45:01 +0100 Subject: [Linux-cluster] OT: where is the wiki? Message-ID: Hi, just recognized that http://sources.redhat.com/cluster/wiki/ does not exist anymore. Is there a new location? Regards, -- L?szl? B?res? ? ? ? ? ? Unix system engineer http://www.google.com/profiles/beres.laszlo From bmr at redhat.com Wed Dec 1 13:30:15 2010 From: bmr at redhat.com (Bryn M. Reeves) Date: Wed, 01 Dec 2010 13:30:15 +0000 Subject: [Linux-cluster] OT: where is the wiki? In-Reply-To: References: Message-ID: <4CF64DE7.6010101@redhat.com> On 12/01/2010 12:45 PM, Laszlo Beres wrote: > Hi, > > just recognized that http://sources.redhat.com/cluster/wiki/ does not > exist anymore. Is there a new location? > > Regards, > It's working for me (also via the redirect from http://sourceware.org/cluster). Are you seeing an error loading the page? Cheers, Bryn. From fdinitto at redhat.com Wed Dec 1 13:34:31 2010 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Wed, 01 Dec 2010 14:34:31 +0100 Subject: [Linux-cluster] OT: where is the wiki? In-Reply-To: References: Message-ID: <4CF64EE7.1010200@redhat.com> On 12/1/2010 1:45 PM, Laszlo Beres wrote: > Hi, > > just recognized that http://sources.redhat.com/cluster/wiki/ does not > exist anymore. Is there a new location? > > Regards, > Sorry? what do you mean it doesn?t exist....? I just opened it after reading this email.. Fabio From thomas at sjolshagen.net Wed Dec 1 13:56:59 2010 From: thomas at sjolshagen.net (Thomas Sjolshagen) Date: Wed, 01 Dec 2010 08:56:59 -0500 Subject: [Linux-cluster] =?utf-8?q?OT=3A_where_is_the_wiki=3F?= In-Reply-To: <4CF64EE7.1010200@redhat.com> References: <4CF64EE7.1010200@redhat.com> Message-ID: On Wed, 01 Dec 2010 14:34:31 +0100, "Fabio M. Di Nitto" wrote: [SNIP] > > Sorry? what do you mean it doesn?t exist....? > > I just opened it after reading this email.. > > Fabio > Attached is what I see. -------------- next part -------------- A non-text attachment was scrubbed... Name: cluster-wiki-page-404.png Type: image/png Size: 186313 bytes Desc: not available URL: From linko22 at gmail.com Wed Dec 1 14:01:48 2010 From: linko22 at gmail.com (Lynx Ginger) Date: Wed, 1 Dec 2010 17:01:48 +0300 Subject: [Linux-cluster] OT: where is the wiki? In-Reply-To: <4CF64EE7.1010200@redhat.com> References: <4CF64EE7.1010200@redhat.com> Message-ID: 404 - not found. 2010/12/1 Fabio M. 
Di Nitto > On 12/1/2010 1:45 PM, Laszlo Beres wrote: > > Hi, > > > > just recognized that http://sources.redhat.com/cluster/wiki/ does not > > exist anymore. Is there a new location? > > > > Regards, > > > > Sorry? what do you mean it doesn?t exist....? > > I just opened it after reading this email.. > > Fabio > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From oheinz at fbihome.de Wed Dec 1 14:04:52 2010 From: oheinz at fbihome.de (Oliver Heinz) Date: Wed, 1 Dec 2010 15:04:52 +0100 Subject: [Linux-cluster] OT: where is the wiki? In-Reply-To: <4CF64EE7.1010200@redhat.com> References: <4CF64EE7.1010200@redhat.com> Message-ID: <201012011504.52518.oheinz@fbihome.de> Am Mittwoch, 1. Dezember 2010, um 14:34:31 schrieb Fabio M. Di Nitto: > On 12/1/2010 1:45 PM, Laszlo Beres wrote: > > Hi, > > > > just recognized that http://sources.redhat.com/cluster/wiki/ does not > > exist anymore. Is there a new location? > > > > Regards, > > Sorry? what do you mean it doesn?t exist....? I get a 404: Page Not Found (404) Sorry! The page you are looking for has been moved or no longer exists. You may search for it, or try looking in one of these areas: Oliver > > I just opened it after reading this email.. > > Fabio From crosa at redhat.com Wed Dec 1 14:11:33 2010 From: crosa at redhat.com (Cleber Rosa) Date: Wed, 01 Dec 2010 12:11:33 -0200 Subject: [Linux-cluster] OT: where is the wiki? In-Reply-To: References: <4CF64EE7.1010200@redhat.com> Message-ID: <4CF65795.9090401@redhat.com> Works from inside (our firewall, via VPN, etc). Does *not* work from outside. On 12/01/2010 12:01 PM, Lynx Ginger wrote: > 404 - not found. > > 2010/12/1 Fabio M. Di Nitto > > > On 12/1/2010 1:45 PM, Laszlo Beres wrote: > > Hi, > > > > just recognized that http://sources.redhat.com/cluster/wiki/ > does not > > exist anymore. Is there a new location? > > > > Regards, > > > > Sorry? what do you mean it doesn?t exist....? > > I just opened it after reading this email.. > > Fabio > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From cos at aaaaa.org Wed Dec 1 14:12:07 2010 From: cos at aaaaa.org (Ofer Inbar) Date: Wed, 1 Dec 2010 09:12:07 -0500 Subject: [Linux-cluster] OT: where is the wiki? In-Reply-To: References: Message-ID: <20101201141207.GZ18254@mip.aaaaa.org> Laszlo Beres wrote: > just recognized that http://sources.redhat.com/cluster/wiki/ does not > exist anymore. Is there a new location? The new location appears to be: http://sourceware.org/cluster/wiki Unfortunately http://sourceware.org/cluster/ redirects to redhat.com which gives the 404. But if you add /wiki/ you get the wiki. -- Cos From arif4linux at gmail.com Wed Dec 1 14:17:13 2010 From: arif4linux at gmail.com (Mohamed Arif Khan) Date: Wed, 1 Dec 2010 19:47:13 +0530 Subject: [Linux-cluster] cluster without fencing device Message-ID: Thanks for reply Can we configure cluster without shared storage, means can we make replicated database on individual nodes ? -- *Thanks & Regards* *M.Arif Khan* -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From thomas at sjolshagen.net Wed Dec 1 14:27:46 2010 From: thomas at sjolshagen.net (Thomas Sjolshagen) Date: Wed, 01 Dec 2010 09:27:46 -0500 Subject: [Linux-cluster] cluster without fencing device In-Reply-To: References: Message-ID: <6be5cdc415f018b0e5f0b3378be27193@www.sjolshagen.net> On Wed, 1 Dec 2010 19:47:13 +0530, Mohamed Arif Khan wrote: Thanks for reply Can we configure cluster without shared storage, means can we make replicated database on individual nodes ? Absolutely, but why would you even need to use the cluster stack if your only purpose is to have a group (cluster) of DB servers that replicate between them? If you've got DB replication configured and using a DB proxy, you'll get the same result with (much) less overhead - imho - in terms of system management overhead, etc. // Thomas -------------- next part -------------- An HTML attachment was scrubbed... URL: From bmr at redhat.com Wed Dec 1 14:33:12 2010 From: bmr at redhat.com (Bryn M. Reeves) Date: Wed, 01 Dec 2010 14:33:12 +0000 Subject: [Linux-cluster] OT: where is the wiki? In-Reply-To: <4CF65795.9090401@redhat.com> References: <4CF64EE7.1010200@redhat.com> <4CF65795.9090401@redhat.com> Message-ID: <4CF65CA8.10701@redhat.com> On 12/01/2010 02:11 PM, Cleber Rosa wrote: > Works from inside (our firewall, via VPN, etc). Does *not* work from outside. Confirmed; same failure via my 3G provider. Regards, Bryn. From linux at alteeve.com Wed Dec 1 15:38:33 2010 From: linux at alteeve.com (Digimer) Date: Wed, 01 Dec 2010 10:38:33 -0500 Subject: [Linux-cluster] cluster without fencing device In-Reply-To: <201012010942.51213.xavier.montagutelli@unilim.fr> References: <4CF3BC18.4020405@alteeve.com> <201012010942.51213.xavier.montagutelli@unilim.fr> Message-ID: <4CF66BF9.5050100@alteeve.com> On 12/01/2010 03:42 AM, Xavier Montagutelli wrote: > On Monday 29 November 2010 15:43:36 Digimer wrote: >> On 11/29/2010 03:42 AM, Mohamed Arif Khan wrote: >>> How to configure cluster without fencing device ? >> >> In RHCS, it is not possible. >> >> http://wiki.alteeve.com/index.php/Red_Hat_Cluster_Service_3_Tutorial#Concep >> t.3B_Fencing >> > > I suppose you can create a "fake" fence device which responds "ok" (/bin/true > ?). But you are warned, you will live in a dangerous, unsupported configuration > ;-) That is exceedingly unwise. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From linux at alteeve.com Wed Dec 1 20:12:18 2010 From: linux at alteeve.com (Digimer) Date: Wed, 01 Dec 2010 15:12:18 -0500 Subject: [Linux-cluster] OT: where is the wiki? In-Reply-To: <20101201141207.GZ18254@mip.aaaaa.org> References: <20101201141207.GZ18254@mip.aaaaa.org> Message-ID: <4CF6AC22.9000502@alteeve.com> On 12/01/2010 09:12 AM, Ofer Inbar wrote: > Laszlo Beres wrote: >> just recognized that http://sources.redhat.com/cluster/wiki/ does not >> exist anymore. Is there a new location? > > The new location appears to be: http://sourceware.org/cluster/wiki > > Unfortunately http://sourceware.org/cluster/ redirects to redhat.com > which gives the 404. But if you add /wiki/ you get the wiki. > -- Cos Might want to put some forwarders into your web server. :) -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From fdinitto at redhat.com Thu Dec 2 13:24:06 2010 From: fdinitto at redhat.com (Fabio M. 
Di Nitto) Date: Thu, 02 Dec 2010 14:24:06 +0100 Subject: [Linux-cluster] Announcing 3.1.0 releases (cluster, fence-agents, resource-agents, gfs2-utils) Message-ID: <4CF79DF6.8060605@redhat.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA256 The cluster team and its community are proud to announce the 3.1.0 stable releases. As previously announced (https://www.redhat.com/archives/linux-cluster/2010-October/msg00012.html), this release is the first step towards the split of the main source tree into separate trees. The cluster, fence-agents, resource-agents and gfs2-utils projects will be released independently from each other from now on, so stay tuned for announcements from the different maintainers (see also wiki for details). cluster 3.1.0: requires: - - corosync 1.3.0 (or higher) - - openais 1.1.4 (or higher) - - any recent kernel header will work just fine (required for dlm) download: https://fedorahosted.org/releases/c/l/cluster/cluster-3.1.0.tar.xz fence-agents 3.1.0: requires: - - cluster 3.1.0 (or higher) download: https://fedorahosted.org/releases/f/e/fence-agents/fence-agents-3.1.0.tar.xz resource-agents 3.1.0: requires: - - cluster 3.1.0 (or higher) download: https://fedorahosted.org/releases/r/e/resource-agents/resource-agents-3.1.0.tar.xz gfs2-utils 3.1.0: requires: - - cluster 3.1.0 (or higher) - - openais 1.1.4 (or higher) - - corosync 1.3.0 (or higher) download: http://www.kernel.org/pub/linux/kernel/people/steve/gfs2/gfs2-utils-3.1.0.tar.bz2 To report bugs or issues: https://bugzilla.redhat.com/ Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other sysadministrators or power users. Thanks/congratulations to all people that contributed to achieve this great milestone. Happy clustering, Fabio -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iQIcBAEBCAAGBQJM953zAAoJEFA6oBJjVJ+Ow/kP/jEOMFM3+SSZ/jlTrVGQY2YT 61FVmE/CMWfjNLTe1blaMGQqqXBxhl3gjZ1fTqZ600fH/F2Ge7dpsaX7tPM/O2ug 9olXqg+/5prjUTeLOMpwsoQ5gBNNoVOYzAFxR75gtjDsgONMeFQLI//SYIRrdeJ0 3KYHsmQozwEwRYDfvWxO0saUj5HdOLvdFksGlgkpeOAEP3SwcC5gWo4vhKlF9jf8 CCMxu4/WWQyAReCv2kvIYgAqYAKbljG1UZDVe4GKZl9TORN7JabCZEEXmex6K5Nk Rn/yn/Jvo1eZMF+n3ZzjF084GtznUipfKEWBLBJmcxUXUTsBvl2hYm28Ky+ZoUxd 5tbe772bIHzOvy2hCMNy97C+OoMkyJhaHVJfqXclwCS2YqYTeHXJw4OFMeAz3KJh pA5ECbUqqWpOmssatPnohV3UFs3qo3vY3vOogLCqe9edPVD0lfZyTvHrRoZOEUUx k14f67c2o7KqSymz6+hdbiNZrTh9FAu9Kit/j1gN+gv1AgSUPLcjb2hDEOLbz00I 5w41NhKBIcF7jkdfdAgD3q/pCnCIWEV17XLG5IOuoDdSMjxxxQ917V2Uv1bbMyRR /S1F4+rRSCLdxZxoU5TlvLwnzYBcqG68BJi3Pj7ro/zyRLxMte333BdiIazngLmS n7dhtmNm9pftHqliV+7V =gsuQ -----END PGP SIGNATURE----- From scooter at cgl.ucsf.edu Thu Dec 2 18:27:59 2010 From: scooter at cgl.ucsf.edu (Scooter Morris) Date: Thu, 02 Dec 2010 10:27:59 -0800 Subject: [Linux-cluster] Question about gfs2_tools lockdump Message-ID: <4CF7E52F.9010501@cgl.ucsf.edu> Hi all, I've got a 5 node RHEL 5.5 cluster with a number of gfs2 filesystems. After a lot of effort (and help from RedHat) we've gotten to the stage where the cluster is quite stable, but now we're starting to see some performance degradation. In investigating this, I've been poking around and I'm seeing some things that I can't explain. In particular, on a quite filesystem (no processes according to lsof on all nodes), a gfs2_tool lockdump gives 1,000's of lock entries (G: lines). Of those several have R: entries (resource group?) and several have H: entries. 
The H: entries are particularly strange because all H: entries are of the form: H: s:EX f:H e:0 p:8953 [(ended)] ... My understanding is that this indicates a lock holder with an exclusive lock, but the process has ended (?). Why aren't these locks going away? Shouldn't they be cleared after the process ends (particularly since some of them are exclusive locks...)? Any help in understanding these entries would be very helpful. -- scooter From rpeterso at redhat.com Thu Dec 2 19:22:49 2010 From: rpeterso at redhat.com (Bob Peterson) Date: Thu, 2 Dec 2010 14:22:49 -0500 (EST) Subject: [Linux-cluster] Question about gfs2_tools lockdump In-Reply-To: <1551976847.1121861291317464590.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: <1374982786.1122461291317769916.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> ----- "Scooter Morris" wrote: | Hi all, | I've got a 5 node RHEL 5.5 cluster with a number of gfs2 | filesystems. After a lot of effort (and help from RedHat) we've | gotten | to the stage where the cluster is quite stable, but now we're starting | | to see some performance degradation. In investigating this, I've | been | poking around and I'm seeing some things that I can't explain. In | particular, on a quite filesystem (no processes according to lsof on | all | nodes), a gfs2_tool lockdump gives 1,000's of lock entries (G: lines). | | Of those several have R: entries (resource group?) and several have H: | | entries. The H: entries are particularly strange because all H: | entries | are of the form: | H: s:EX f:H e:0 p:8953 [(ended)] ... | | My understanding is that this indicates a lock holder with an | exclusive | lock, but the process has ended (?). Why aren't these locks going | away? Shouldn't they be cleared after the process ends (particularly | | since some of them are exclusive locks...)? Any help in understanding | | these entries would be very helpful. | | -- scooter | | -- | Linux-cluster mailing list | Linux-cluster at redhat.com | https://www.redhat.com/mailman/listinfo/linux-cluster Hi Scooter, There are lots of different types of glocks, and the type is given before the slash. Type 2 is inode, so 2/9009 is for a disk inode located at block 0x9009 (in hex). Type 3 is for resource groups, so 3/170003 is for the resource group starting at block 0x170003. Type 5 is for i_open glocks, which also correspond mostly to files. So if you open a file and write some data, you can get both a inode glock for 2/9009 and a corresponding i_open glock for 5/9009. The inode glocks will also have a corresponding "I:" entry. The resource group glocks may have an R: entry as well. Each "H:" corresponds to a process that is holding or trying to hold that particular glock. A holder may persist even after a process has ended. For example, if I'm the first process to write to a gfs2 file system, I could cause all the resource groups to be read in, but the resource groups and their corresponding glocks will stay in memory. A holder record is said to be holding the glock if it has the f:H flag. It's waiting for the lock if it has the f:W flag. If it says "s:SH", that's a shared hold. If it says "s:EX" that's an exclusive hold on the glock, etc. So for example, "s:EX f:W" corresponds to someone waiting for an exclusive lock for that glock. Another complication is that some versions of gfs2 sometimes did not keep track of the process id (pid) when a glock was transferred. So some older versions report the pid as the old pid, which would have ended, and not the correct holder. 
That made debugging glock issues difficult, but it didn't hurt anything. I think that issue is fixed in 5.5 or 5.6. It's a lot more complicated than that, but those are the basics. I think Steve Whitehouse wrote a paper on glocks, but I don't have the info handy. Regards, Bob Peterson Red Hat File Systems From corey.kovacs at gmail.com Thu Dec 2 21:31:13 2010 From: corey.kovacs at gmail.com (Corey Kovacs) Date: Thu, 2 Dec 2010 21:31:13 +0000 Subject: [Linux-cluster] Clarification... Message-ID: I've been watching the development of the cluster stack from the sidelines for quite some time but somewhere things got a bit mixed up for me. It appears to me the following is true... openais, heartbeat and corosync are equivalent in terms of purpose. rgmanager and pacemaker are equivalent in terms of purpose. If these are true, can someone point me to a run-down of the differences and similarities or point me to a document? How does this all relate to what ships with RHEL6? Finally, is the wiki woefully out of date are is there a better place to be getting information other than git repos? Corey From zachar at awst.at Thu Dec 2 22:14:44 2010 From: zachar at awst.at (Balazs Zachar) Date: Thu, 02 Dec 2010 23:14:44 +0100 Subject: [Linux-cluster] Clarification... In-Reply-To: References: Message-ID: <4CF81A54.80103@awst.at> On 12/02/2010 10:31 PM, Corey Kovacs wrote: > I've been watching the development of the cluster stack from the > sidelines for quite some time but somewhere things got a bit mixed up > for me. > > It appears to me the following is true... > > openais, heartbeat and corosync are equivalent in terms of purpose. > AFAIK not true anymore: heartbeat is equivalent with corosync + openais (corosync was a fork of openais but now openais is an additional part for corosync) Corosync + openais is recommended. (pacemaker website) > rgmanager and pacemaker are equivalent in terms of purpose. > True: http://sources.redhat.com/cluster/wiki/RGManagerVsPacemaker > If these are true, can someone point me to a run-down of the > differences and similarities or point me to a document? > > How does this all relate to what ships with RHEL6? > RHCS is using rgmanager in RHEL6. Pacemaker is in technology preview state (from release notes: "not fully integrated with the RHCS stack"). I heard that Pacemaker is going to replace rgmanager in the future. (the source wasn't official! Maybe we will get some more official answer for this here :) ) Regards, Bal?zs > Finally, is the wiki woefully out of date are is there a better place > to be getting information other than git repos? > > > > > > Corey > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From Chris.Jankowski at hp.com Fri Dec 3 05:33:37 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Fri, 3 Dec 2010 05:33:37 +0000 Subject: [Linux-cluster] Validation failure of cluster.conf. Message-ID: <036B68E61A28CA49AC2767596576CD596F5A04EBC2@GVW1113EXC.americas.hpqcorp.net> Hi, I am in a process of building a cluster on RHEL6. I elected to build the /etc/cluster/cluster.conf (attached) by hand i.e. no Conga. After I added fencing and fence devices the configuration file no longer passes validation check. 
ccs_config_validate reports the following error: [root at booboo1 cluster]# ccs_config_validate -f cluster.conf.3.XX Relax-NG validity error : Extra element fencedevices in interleave tempfile:27: element fencedevices: Relax-NG validity error : Element cluster failed to validate content tempfile:18: element device: validity error : IDREF attribute name references an unknown ID "booboo2-ilo" Configuration fails to validate No matter how long I look at the file I cannot find any mistake in it. I would appreciate if you could run the file through your validation tools and tell me what am I doing wrong. Thanks and regards, Chris Jankowski -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ccs_config_validate.out Type: application/octet-stream Size: 374 bytes Desc: ccs_config_validate.out URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: cluster.conf.3.XX Type: application/octet-stream Size: 1156 bytes Desc: cluster.conf.3.XX URL: From andrew at beekhof.net Fri Dec 3 07:14:09 2010 From: andrew at beekhof.net (Andrew Beekhof) Date: Fri, 3 Dec 2010 08:14:09 +0100 Subject: [Linux-cluster] Clarification... In-Reply-To: <4CF81A54.80103@awst.at> References: <4CF81A54.80103@awst.at> Message-ID: On Thu, Dec 2, 2010 at 11:14 PM, Balazs Zachar wrote: > > On 12/02/2010 10:31 PM, Corey Kovacs wrote: >> >> I've been watching the development of the cluster stack from the >> sidelines for quite some time but somewhere things got a bit mixed up >> for me. >> >> It appears to me the following is true... >> >> openais, heartbeat and corosync are equivalent in terms of purpose. >> > > AFAIK not true anymore: > heartbeat is equivalent with corosync + openais (corosync was a fork of > openais but now openais is an additional part for corosync) > Corosync + openais is recommended. (pacemaker website) >> >> rgmanager and pacemaker are equivalent in terms of purpose. >> > > True: > http://sources.redhat.com/cluster/wiki/RGManagerVsPacemaker > >> If these are true, can someone point me to a run-down of the >> differences and similarities or point me to a document? >> >> How does this all relate to what ships with RHEL6? >> > > RHCS is using rgmanager in RHEL6. Pacemaker is in technology preview state > (from release notes: "not fully integrated with the RHCS stack"). Specifically there is no integration with luci yet. Other than that its works just fine with the rest of the stack > I heard that Pacemaker is going to replace rgmanager in the future. (the > source wasn't official! Maybe we will get some more official answer for this > here :) ) That is the current intention > > Regards, > Bal?zs >> >> Finally, is the wiki woefully out of date are is there a better place >> to be getting information other than git repos? >> >> >> >> >> >> Corey >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From corey.kovacs at gmail.com Fri Dec 3 07:55:22 2010 From: corey.kovacs at gmail.com (Corey Kovacs) Date: Fri, 3 Dec 2010 07:55:22 +0000 Subject: [Linux-cluster] Clarification... In-Reply-To: References: <4CF81A54.80103@awst.at> Message-ID: Folks, thanks for the info. If openais and corosync were at one time, serving the same function but now are separate, what is the division? 
Thanks again -C On Fri, Dec 3, 2010 at 7:14 AM, Andrew Beekhof wrote: > On Thu, Dec 2, 2010 at 11:14 PM, Balazs Zachar wrote: >> >> On 12/02/2010 10:31 PM, Corey Kovacs wrote: >>> >>> I've been watching the development of the cluster stack from the >>> sidelines for quite some time but somewhere things got a bit mixed up >>> for me. >>> >>> It appears to me the following is true... >>> >>> openais, heartbeat and corosync are equivalent in terms of purpose. >>> >> >> AFAIK not true anymore: >> heartbeat is equivalent with corosync + openais (corosync was a fork of >> openais but now openais is an additional part for corosync) >> Corosync + openais is recommended. (pacemaker website) >>> >>> rgmanager and pacemaker are equivalent in terms of purpose. >>> >> >> True: >> http://sources.redhat.com/cluster/wiki/RGManagerVsPacemaker >> >>> If these are true, can someone point me to a run-down of the >>> differences and similarities or point me to a document? >>> >>> How does this all relate to what ships with RHEL6? >>> >> >> RHCS is using rgmanager in RHEL6. Pacemaker is in technology preview state >> (from release notes: "not fully integrated with the RHCS stack"). > > Specifically there is no integration with luci yet. > Other than that its works just fine with the rest of the stack > >> I heard that Pacemaker is going to replace rgmanager in the future. (the >> source wasn't official! Maybe we will get some more official answer for this >> here :) ) > > That is the current intention > >> >> Regards, >> Bal?zs >>> >>> Finally, is the wiki woefully out of date are is there a better place >>> to be getting information other than git repos? >>> >>> >>> >>> >>> >>> Corey >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From fdinitto at redhat.com Fri Dec 3 08:26:49 2010 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Fri, 03 Dec 2010 09:26:49 +0100 Subject: [Linux-cluster] Validation failure of cluster.conf. In-Reply-To: <036B68E61A28CA49AC2767596576CD596F5A04EBC2@GVW1113EXC.americas.hpqcorp.net> References: <036B68E61A28CA49AC2767596576CD596F5A04EBC2@GVW1113EXC.americas.hpqcorp.net> Message-ID: <4CF8A9C9.80307@redhat.com> On 12/3/2010 6:33 AM, Jankowski, Chris wrote: > Hi, > I am in a process of building a cluster on RHEL6. > I elected to build the /etc/cluster/cluster.conf (attached) by hand i.e. > no Conga. > After I added fencing and fence devices the configuration file no longer > passes validation check. > > ccs_config_validate reports the following error: > > [root at booboo1 cluster]# ccs_config_validate -f cluster.conf.3.XX > Relax-NG validity error : Extra element fencedevices in interleave > tempfile:27: element fencedevices: Relax-NG validity error : Element > cluster failed to validate content > tempfile:18: element device: validity error : IDREF attribute name > references an unknown ID "booboo2-ilo" > Configuration fails to validate > > No matter how long I look at the file I cannot find any mistake in it. > > I would appreciate if you could run the file through your validation > tools and tell me what am I doing wrong. 
> > Thanks and regards, > > Chris Jankowski > > > hostname="booboo1-ilo.XXXX" login="XXXXX" passwd="XXXXX"/> > hostname="booboo2-ilo.XXXX" login="XXXXX" passwd="XXXXX"/> > > looking at man fence_ilo.8 (STDIN PARAMETERS section), you probably want (untested as I don?t have ilo here): Fabio From Chris.Jankowski at hp.com Fri Dec 3 09:08:38 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Fri, 3 Dec 2010 09:08:38 +0000 Subject: [Linux-cluster] Validation failure of cluster.conf. In-Reply-To: <4CF8A9C9.80307@redhat.com> References: <036B68E61A28CA49AC2767596576CD596F5A04EBC2@GVW1113EXC.americas.hpqcorp.net> <4CF8A9C9.80307@redhat.com> Message-ID: <036B68E61A28CA49AC2767596576CD596F5A04ED45@GVW1113EXC.americas.hpqcorp.net> Fabio, Indeed, you are 100% right. I should have ipaddr= and instead I had hostname= in the list of attributes for the fence_ilo device. Syntax must have changed between RHEL5 and RHEL6. I changed hostname= to ipaddr= and everything works as expected. Thank you very much for your help. I really appreciate it. Regards, Chris Jankowski -----Original Message----- From: Fabio M. Di Nitto [mailto:fdinitto at redhat.com] Sent: Friday, 3 December 2010 19:27 To: linux clustering Cc: Jankowski, Chris Subject: Re: [Linux-cluster] Validation failure of cluster.conf. On 12/3/2010 6:33 AM, Jankowski, Chris wrote: > Hi, > I am in a process of building a cluster on RHEL6. > I elected to build the /etc/cluster/cluster.conf (attached) by hand i.e. > no Conga. > After I added fencing and fence devices the configuration file no longer > passes validation check. > > ccs_config_validate reports the following error: > > [root at booboo1 cluster]# ccs_config_validate -f cluster.conf.3.XX > Relax-NG validity error : Extra element fencedevices in interleave > tempfile:27: element fencedevices: Relax-NG validity error : Element > cluster failed to validate content > tempfile:18: element device: validity error : IDREF attribute name > references an unknown ID "booboo2-ilo" > Configuration fails to validate > > No matter how long I look at the file I cannot find any mistake in it. > > I would appreciate if you could run the file through your validation > tools and tell me what am I doing wrong. > > Thanks and regards, > > Chris Jankowski > > > hostname="booboo1-ilo.XXXX" login="XXXXX" passwd="XXXXX"/> > hostname="booboo2-ilo.XXXX" login="XXXXX" passwd="XXXXX"/> > > looking at man fence_ilo.8 (STDIN PARAMETERS section), you probably want (untested as I don?t have ilo here): Fabio From andrew at beekhof.net Fri Dec 3 09:16:32 2010 From: andrew at beekhof.net (Andrew Beekhof) Date: Fri, 3 Dec 2010 10:16:32 +0100 Subject: [Linux-cluster] Clarification... In-Reply-To: References: <4CF81A54.80103@awst.at> Message-ID: On Fri, Dec 3, 2010 at 8:55 AM, Corey Kovacs wrote: > Folks, thanks for the info. > > If openais and corosync were at one time, serving the same function > but now are separate, what is the division? Different parts of the puzzle. Corosync is core functionality, Openais has the implementation of the SAF APIs: http://www.openais.org/doku.php > > > Thanks again > > -C > > On Fri, Dec 3, 2010 at 7:14 AM, Andrew Beekhof wrote: >> On Thu, Dec 2, 2010 at 11:14 PM, Balazs Zachar wrote: >>> >>> On 12/02/2010 10:31 PM, Corey Kovacs wrote: >>>> >>>> I've been watching the development of the cluster stack from the >>>> sidelines for quite some time but somewhere things got a bit mixed up >>>> for me. >>>> >>>> It appears to me the following is true... 
>>>> >>>> openais, heartbeat and corosync are equivalent in terms of purpose. >>>> >>> >>> AFAIK not true anymore: >>> heartbeat is equivalent with corosync + openais (corosync was a fork of >>> openais but now openais is an additional part for corosync) >>> Corosync + openais is recommended. (pacemaker website) >>>> >>>> rgmanager and pacemaker are equivalent in terms of purpose. >>>> >>> >>> True: >>> http://sources.redhat.com/cluster/wiki/RGManagerVsPacemaker >>> >>>> If these are true, can someone point me to a run-down of the >>>> differences and similarities or point me to a document? >>>> >>>> How does this all relate to what ships with RHEL6? >>>> >>> >>> RHCS is using rgmanager in RHEL6. Pacemaker is in technology preview state >>> (from release notes: "not fully integrated with the RHCS stack"). >> >> Specifically there is no integration with luci yet. >> Other than that its works just fine with the rest of the stack >> >>> I heard that Pacemaker is going to replace rgmanager in the future. (the >>> source wasn't official! Maybe we will get some more official answer for this >>> here :) ) >> >> That is the current intention >> >>> >>> Regards, >>> Bal?zs >>>> >>>> Finally, is the wiki woefully out of date are is there a better place >>>> to be getting information other than git repos? >>>> >>>> >>>> >>>> >>>> >>>> Corey >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From zachar at awst.at Fri Dec 3 09:49:38 2010 From: zachar at awst.at (zachar at awst.at) Date: Fri, 03 Dec 2010 10:49:38 +0100 (CET) Subject: [Linux-cluster] =?utf-8?q?Clarification=2E=2E=2E?= Message-ID: Andrew Beekhof schrieb: > On Thu, Dec 2, 2010 at 11:14 PM, Balazs Zachar wrote: > > > > On 12/02/2010 10:31 PM, Corey Kovacs wrote: > >> > >> I've been watching the development of the cluster stack from the > >> sidelines for quite some time but somewhere things got a bit mixed up > >> for me. > >> > >> It appears to me the following is true... > >> > >> openais, heartbeat and corosync are equivalent in terms of purpose. > >> > > > > AFAIK not true anymore: > > heartbeat is equivalent with corosync + openais (corosync was a fork > of > > openais but now openais is an additional part for corosync) > > Corosync + openais is recommended. (pacemaker website) > >> > >> rgmanager and pacemaker are equivalent in terms of purpose. > >> > > > > True: > > http://sources.redhat.com/cluster/wiki/RGManagerVsPacemaker > > > >> If these are true, can someone point me to a run-down of the > >> differences and similarities or point me to a document? > >> > >> How does this all relate to what ships with RHEL6? > >> > > > > RHCS is using rgmanager in RHEL6. Pacemaker is in technology preview > state > > (from release notes: "not fully integrated with the RHCS stack"). > > Specifically there is no integration with luci yet. > Other than that its works just fine with the rest of the stack > > > I heard that Pacemaker is going to replace rgmanager in the future. > (the > > source wasn't official! 
Maybe we will get some more official answer > for this > > here :) ) > > That is the current intention Andrew, What are the plans about in which version of RHEL will RedHat support pacemaker? By the way, nice job ;) > > > > > Regards, > > Bal?zs > >> > >> Finally, is the wiki woefully out of date are is there a better place > >> to be getting information other than git repos? > >> > >> > >> > >> > >> > >> Corey > >> > >> -- > >> Linux-cluster mailing list > >> Linux-cluster at redhat.com > >> https://www.redhat.com/mailman/listinfo/linux-cluster > >> > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From andrew at beekhof.net Fri Dec 3 10:01:26 2010 From: andrew at beekhof.net (Andrew Beekhof) Date: Fri, 3 Dec 2010 11:01:26 +0100 Subject: [Linux-cluster] Clarification... In-Reply-To: References: Message-ID: On Fri, Dec 3, 2010 at 10:49 AM, wrote: > Andrew Beekhof schrieb: >> On Thu, Dec 2, 2010 at 11:14 PM, Balazs Zachar wrote: >> > >> > On 12/02/2010 10:31 PM, Corey Kovacs wrote: >> >> >> >> I've been watching the development of the cluster stack from the >> >> sidelines for quite some time but somewhere things got a bit mixed up >> >> for me. >> >> >> >> It appears to me the following is true... >> >> >> >> openais, heartbeat and corosync are equivalent in terms of purpose. >> >> >> > >> > AFAIK not true anymore: >> > heartbeat is equivalent with corosync + openais (corosync was a fork >> of >> > openais but now openais is an additional part for corosync) >> > Corosync + openais is recommended. (pacemaker website) >> >> >> >> rgmanager and pacemaker are equivalent in terms of purpose. >> >> >> > >> > True: >> > http://sources.redhat.com/cluster/wiki/RGManagerVsPacemaker >> > >> >> If these are true, can someone point me to a run-down of the >> >> differences and similarities or point me to a document? >> >> >> >> How does this all relate to what ships with RHEL6? >> >> >> > >> > RHCS is using rgmanager in RHEL6. Pacemaker is in technology preview >> state >> > (from release notes: "not fully integrated with the RHCS stack"). >> >> Specifically there is no integration with luci yet. >> Other than that its works just fine with the rest of the stack >> >> > I heard that Pacemaker is going to replace rgmanager in the future. >> (the >> > source wasn't official! Maybe we will get some more official answer >> for this >> > here :) ) >> >> That is the current intention > > Andrew, > What are the plans about in which version of RHEL will RedHat support pacemaker? Alas we're not allowed to publicly discuss those kinds of details. > > By the way, nice job ;) Thanks :) > >> >> > >> > Regards, >> > Bal?zs >> >> >> >> Finally, is the wiki woefully out of date are is there a better place >> >> to be getting information other than git repos? 
>> >> >> >> >> >> >> >> >> >> >> >> Corey >> >> >> >> -- >> >> Linux-cluster mailing list >> >> Linux-cluster at redhat.com >> >> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> > >> > -- >> > Linux-cluster mailing list >> > Linux-cluster at redhat.com >> > https://www.redhat.com/mailman/listinfo/linux-cluster >> > >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From Chris.Jankowski at hp.com Fri Dec 3 10:10:10 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Fri, 3 Dec 2010 10:10:10 +0000 Subject: [Linux-cluster] Heuristics for quorum disk used as a tiebreaker in a two node cluster. Message-ID: <036B68E61A28CA49AC2767596576CD596F5A04ED98@GVW1113EXC.americas.hpqcorp.net> Hi, I am configuring a two node HA cluster that has only one service. The sole purpose of the cluster is to keep the service up with minimum disruption for the widest possible range of failure scenarios. I configured a quorum disk to make sure that after a failure of a node, the cluster (now consisting of only one node) continues to have quorum. I am considering a partitioned cluster scenario. Partitioned means to me that the cluster nodes lost the cluster communication path. Without quorum disk each of the nodes in the cluster will fence the other. However the manual page for qdisk gives premise of solving the problem in the list of design requirement that it apparently fulfils: Quote: Ability to use the external reasons for deciding which partition is the quorate partition in a partitioned cluster. For example, a user may have a service running on one node, and that node must always be the master in the event of a network partition. Unquote. This is exactly what I would like to achieve. I know which node should stay alive - the one running my service, and it is trivial for me to find this out directly, as I can query for its status locally on a node. I do not have use the network. This can be used as a heuristic for the quorum disc. What I am missing is how to make that into a workable whole. Specifically the following aspects are of concern: 1. I do not want the other node to be ejected from the cluster just because it does not run the service. But the test is binary, so it looks like it will be ejected. 2. Startup time, before the service started. As no node has the service, both will be candidates for ejection. 3. Service migration time. During service migration from one node to another, there is a transient period of time when the service is not active on either node. Questions: 1. How do I put all of this together to achieve the overall objective of the node with the service surviving the partitioning event uninterrupted? 2. What is the relationship between fencing and node suicide due to communication through quorum disk? 3. How does the master election relate to this? I would be grateful for any insights, pointers to documentation, etc. Thanks and regards, Chris Jankowski -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux-cluster at redhat.com Sat Dec 4 07:49:08 2010 From: linux-cluster at redhat.com (Mailbot for etexusa.com) Date: Fri, 3 Dec 2010 23:49:08 -0800 Subject: [Linux-cluster] DSN: failed (mspss@gto.net.om) Message-ID: This is a Delivery Status Notification (DSN). I was unable to deliver your message to mspss at gto.net.om. 
The error was; Domain "gto.net.om" can't receive email -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: text/rfc822-headers Size: 483 bytes Desc: not available URL: From Chris.Jankowski at hp.com Mon Dec 6 01:23:42 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Mon, 6 Dec 2010 01:23:42 +0000 Subject: [Linux-cluster] Difference between -d and -s options of clusvcadm Message-ID: <036B68E61A28CA49AC2767596576CD596F5A04EF6D@GVW1113EXC.americas.hpqcorp.net> Hi, What is the difference between -d and -s options of clusvcadm? When would I prefer using one over the other? The manual page for clusvcadm(8) says: -d Stops and disables the user service named -s Stops the service named until a member transition or until it is enabled again. I also read the manual page for rgmanager(8), but the usefulness of the distinction between stopped and disabled states escapes me. Thanks and regards, Chris -------------- next part -------------- An HTML attachment was scrubbed... URL: From gcharles at ups.com Mon Dec 6 12:18:26 2010 From: gcharles at ups.com (gcharles at ups.com) Date: Mon, 6 Dec 2010 07:18:26 -0500 Subject: [Linux-cluster] Difference between -d and -s options of clusvcadm In-Reply-To: <036B68E61A28CA49AC2767596576CD596F5A04EF6D@GVW1113EXC.americas.hpqcorp.net> References: <036B68E61A28CA49AC2767596576CD596F5A04EF6D@GVW1113EXC.americas.hpqcorp.net> Message-ID: <49CCA172B74C1B4D916CB9B71FB952DA27941AF745@njrarsvr3bef.us.ups.com> If you "disable" a service it won't start up again without manual intervention, like with "clusvcadm -e...". If you "stop" a service and let's say the node it was running on was rebooted, your service will start up on another node in the cluster if it was configured to do so. Greg Charles gcharles at ups.com ________________________________ From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jankowski, Chris Sent: Sunday, December 05, 2010 8:24 PM To: linux clustering Subject: [Linux-cluster] Difference between -d and -s options of clusvcadm Hi, What is the difference between -d and -s options of clusvcadm? When would I prefer using one over the other? The manual page for clusvcadm(8) says: -d Stops and disables the user service named -s Stops the service named until a member transition or until it is enabled again. I also read the manual page for rgmanager(8), but the usefulness of the distinction between stopped and disabled states escapes me. Thanks and regards, Chris -------------- next part -------------- An HTML attachment was scrubbed... URL: From Chris.Jankowski at hp.com Mon Dec 6 12:27:58 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Mon, 6 Dec 2010 12:27:58 +0000 Subject: [Linux-cluster] How do I implement an unmount only filesystem resource agent Message-ID: <036B68E61A28CA49AC2767596576CD596F5A04F2FF@GVW1113EXC.americas.hpqcorp.net> Hi, I am configuring a service that uses HA-LVM and XFS filesystem on top of it. The filesystem will be backed up by a separate script run from cron(8) creating an LVM snapshot of the filesystem and mounting it on a mountpoint. To have a foolproof HA service I need to: - Check, if the snapshot filesystem is mounted - If it is, all processes running in it need to be killed - Then the snapshot filesystem needs to be unmounted. All of that is a prerequisite for HA-LVM to be able to do its work on the volume group. HA-LVM needs to deactivate the volume group. 
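Something along these lines, run from the stop action of a script resource, is what I have in mind (the mount point is made up and this is an untested sketch only):

    #!/bin/sh
    # stop-side cleanup for the backup snapshot mount -- sketch only
    SNAP_MNT=/mnt/backup-snapshot          # hypothetical snapshot mount point
    case "$1" in
      stop)
        if mountpoint -q "$SNAP_MNT"; then
            fuser -km "$SNAP_MNT"          # kill whatever is still running in the snapshot
            sleep 2
            umount "$SNAP_MNT"             # unmount so the volume group can be deactivated
        fi
        ;;
      start|status)
        exit 0                             # nothing to start or monitor here
        ;;
    esac
    exit 0
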
Once this is done the relocation of the service to another node will succeed I could configure a script resource with a script that would do the 3 steps listed above as part of its stop action. It would have essentially null start and status actions. Is there a better, more elegant way of achieving the same result e.g. using the filesystem resource? Thanks and regards, Chris Jankowski -------------- next part -------------- An HTML attachment was scrubbed... URL: From alvaro.fernandez at sivsa.com Mon Dec 6 21:11:24 2010 From: alvaro.fernandez at sivsa.com (Alvaro Jose Fernandez) Date: Mon, 6 Dec 2010 22:11:24 +0100 Subject: [Linux-cluster] question about number of fencing devices needed for a two node cluster Message-ID: <607D6181D9919041BE792D70EF2AEC48014D4071@LIMENS.sivsa.int> Hi, I would like to know about wheter it would suffice for a two-node RHCS cluster a single power switch (APC 7921) fencing device. The power switch has 8 power outlets and I intend to use four of them for each node's dual power supplies. I know it would be desirable to have two devices for a fully redundant configuration, but after reading some examples from the docs (they are meant for two power switch), I still cannot understand why a single power switch connected to both servers and the switch taking power from the UPS, would not be a good configuration. There is a single UPS in the environment. ?any experiences over this issue? regards, alvaro -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux at alteeve.com Mon Dec 6 21:35:16 2010 From: linux at alteeve.com (Digimer) Date: Mon, 06 Dec 2010 16:35:16 -0500 Subject: [Linux-cluster] question about number of fencing devices needed for a two node cluster In-Reply-To: <607D6181D9919041BE792D70EF2AEC48014D4071@LIMENS.sivsa.int> References: <607D6181D9919041BE792D70EF2AEC48014D4071@LIMENS.sivsa.int> Message-ID: <4CFD5714.8080305@alteeve.com> On 12/06/2010 04:11 PM, Alvaro Jose Fernandez wrote: > Hi, > > I would like to know about wheter it would suffice for a two-node RHCS > cluster a single power switch (APC 7921) fencing device. The power > switch has 8 power outlets and I intend to use four of them for each > node's dual power supplies. > > I know it would be desirable to have two devices for a fully redundant > configuration, but after reading some examples from the docs (they are > meant for two power switch), I still cannot understand why a single > power switch connected to both servers and the switch taking power from > the UPS, would not be a good configuration. There is a single UPS in the > environment. > > ?any experiences over this issue? > > regards, > > alvaro That is sufficient. The only concern is that the PDU doesn't verify node death, so success is returned when the power is cut. This requires a little extra testing to make sure that your config is accurate. Once setup, use 'fence_node ' against either node and ensure that they really do go down. 
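With dual power supplies on a single PDU, keep all of a node's outlets in one fence method, with the "off" actions listed before the "on" actions, so the node can never power back up on its second feed while the first is still live. Very roughly -- outlet numbers, addresses and the option/action attribute name are guesses for your release, so check fence_apc(8) before using this:

    <clusternode name="node1" nodeid="1">
      <fence>
        <method name="pdu">
          <device name="apc" port="1" option="off"/>
          <device name="apc" port="2" option="off"/>
          <device name="apc" port="1" option="on"/>
          <device name="apc" port="2" option="on"/>
        </method>
      </fence>
    </clusternode>
    ...
    <fencedevices>
      <fencedevice agent="fence_apc" name="apc" ipaddr="192.168.1.10" login="apc" passwd="apc"/>
    </fencedevices>
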
-- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From jakov.sosic at srce.hr Mon Dec 6 23:40:40 2010 From: jakov.sosic at srce.hr (Jakov Sosic) Date: Tue, 07 Dec 2010 00:40:40 +0100 Subject: [Linux-cluster] question about number of fencing devices needed for a two node cluster In-Reply-To: <607D6181D9919041BE792D70EF2AEC48014D4071@LIMENS.sivsa.int> References: <607D6181D9919041BE792D70EF2AEC48014D4071@LIMENS.sivsa.int> Message-ID: <4CFD7478.8050009@srce.hr> On 12/06/2010 10:11 PM, Alvaro Jose Fernandez wrote: > Hi, > > > > I would like to know about wheter it would suffice for a two-node RHCS > cluster a single power switch (APC 7921) fencing device. The power > switch has 8 power outlets and I intend to use four of them for each > node's dual power supplies. > > > > I know it would be desirable to have two devices for a fully redundant > configuration, but after reading some examples from the docs (they are > meant for two power switch), I still cannot understand why a single > power switch connected to both servers and the switch taking power from > the UPS, would not be a good configuration. There is a single UPS in the > environment. > > > > ?any experiences over this issue? It's because you still have SPOF. In this case, SPOF is the electronic module of the powerswitch, so, if the electronics go down, there's no way to fence the node. It would be better to have for example iDRAC or IPMI as primary fencing device and APC as secondary. But, as in many things in IT, you are back to price/performance ratio. If you really must achieve 5x9 uptime, or else you have money penalty, then you'll invest in secondary fencing device. For most clusters, one fencing device is enough, though. -- Jakov Sosic From alvaro.fernandez at sivsa.com Tue Dec 7 00:37:45 2010 From: alvaro.fernandez at sivsa.com (Alvaro Jose Fernandez) Date: Tue, 7 Dec 2010 01:37:45 +0100 Subject: [Linux-cluster] question about number of fencing devices needed for a two node cluster References: <607D6181D9919041BE792D70EF2AEC48014D4071@LIMENS.sivsa.int> <4CFD7478.8050009@srce.hr> Message-ID: <607D6181D9919041BE792D70EF2AEC48014D407A@LIMENS.sivsa.int> Thanks for the tip, Jakov. regards, alvaro > ?any experiences over this issue? It's because you still have SPOF. In this case, SPOF is the electronic module of the powerswitch, so, if the electronics go down, there's no way to fence the node. It would be better to have for example iDRAC or IPMI as primary fencing device and APC as secondary. But, as in many things in IT, you are back to price/performance ratio. If you really must achieve 5x9 uptime, or else you have money penalty, then you'll invest in secondary fencing device. For most clusters, one fencing device is enough, though. -- Jakov Sosic -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From alvaro.fernandez at sivsa.com Tue Dec 7 00:40:12 2010 From: alvaro.fernandez at sivsa.com (Alvaro Jose Fernandez) Date: Tue, 7 Dec 2010 01:40:12 +0100 Subject: [Linux-cluster] question about number of fencing devices needed for a two node cluster References: <607D6181D9919041BE792D70EF2AEC48014D4071@LIMENS.sivsa.int> <4CFD5714.8080305@alteeve.com> Message-ID: <607D6181D9919041BE792D70EF2AEC48014D407B@LIMENS.sivsa.int> Many thanks for the advice, Digimer. regards. 
> I know it would be desirable to have two devices for a fully redundant > configuration, but after reading some examples from the docs (they are > meant for two power switch), I still cannot understand why a single > power switch connected to both servers and the switch taking power from > the UPS, would not be a good configuration. There is a single UPS in the > environment. > > ?any experiences over this issue? > > regards, > > alvaro That is sufficient. The only concern is that the PDU doesn't verify node death, so success is returned when the power is cut. This requires a little extra testing to make sure that your config is accurate. Once setup, use 'fence_node ' against either node and ensure that they really do go down. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From lamshuyin at gmail.com Tue Dec 7 07:48:10 2010 From: lamshuyin at gmail.com (Jacky Lam) Date: Tue, 7 Dec 2010 15:48:10 +0800 Subject: [Linux-cluster] GFS on AOE Message-ID: Dear all, I am new to GFS. I search through web but could not get a definite answer. I have 2 pc (A and B) connecting by Ethernet. 1 harddisk is attaching on A and sharing through ATA over Ethernet. Is it possible for B to access hardisk using GFS over AOE? Any know issue (like caching). I suppose A must need to access the harddisk through GFS as well, am I correct? If any, is there comparison between GFS (on AOE?) and NFS on throughput and CPU loading? Thanks a lot. Best Regards, Jacky -------------- next part -------------- An HTML attachment was scrubbed... URL: From swhiteho at redhat.com Tue Dec 7 09:52:35 2010 From: swhiteho at redhat.com (Steven Whitehouse) Date: Tue, 07 Dec 2010 09:52:35 +0000 Subject: [Linux-cluster] GFS on AOE In-Reply-To: References: Message-ID: <1291715555.2451.2.camel@dolmen> Hi, On Tue, 2010-12-07 at 15:48 +0800, Jacky Lam wrote: > Dear all, > > I am new to GFS. I search through web but could not get a > definite answer. > > I have 2 pc (A and B) connecting by Ethernet. 1 harddisk is > attaching on A and sharing through ATA over Ethernet. Is it possible > for B to access hardisk using GFS over AOE? Any know issue (like > caching). I suppose A must need to access the harddisk through GFS as > well, am I correct? > Yes. Both machines would need direct access to the shared disk. That should be possible using AoE, although I've not tried it myself. > If any, is there comparison between GFS (on AOE?) and NFS on > throughput and CPU loading? > Thanks a lot. > > Best Regards, > Jacky > -- Bearing in mind that AoE is a really simple protocol, I'd expect that NFS would create more cpu loading. However, that is a bit of an odd way to compare the two solutions. Normally the cpu is not the limiting factor, especially with lower end solutions such as AoE, it is more likely that the shared disk will be the bottleneck, Steve. From fdinitto at redhat.com Tue Dec 7 10:41:39 2010 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Tue, 07 Dec 2010 11:41:39 +0100 Subject: [Linux-cluster] GFS on AOE In-Reply-To: References: Message-ID: <4CFE0F63.4040009@redhat.com> On 12/07/2010 08:48 AM, Jacky Lam wrote: > Dear all, > > I am new to GFS. I search through web but could not get a definite > answer. > > I have 2 pc (A and B) connecting by Ethernet. 1 harddisk is > attaching on A and sharing through ATA over Ethernet. Is it possible for > B to access hardisk using GFS over AOE? Any know issue (like caching). 
I > suppose A must need to access the harddisk through GFS as well, am I > correct? This won't work. It is also part of the official FAQ. The problem being that AOE (as you suspect) adds a different level of caching. All nodes need to have a consistent access path to the disk. Fabio From mad at wol.de Tue Dec 7 11:18:01 2010 From: mad at wol.de (Marc - A. Dahlhaus) Date: Tue, 07 Dec 2010 12:18:01 +0100 Subject: [Linux-cluster] GFS on AOE In-Reply-To: References: Message-ID: <1291720682.7239.68.camel@marc> Hello Jacky, Am Dienstag, den 07.12.2010, 15:48 +0800 schrieb Jacky Lam: > Dear all, > > I am new to GFS. I search through web but could not get a > definite answer. > > I have 2 pc (A and B) connecting by Ethernet. 1 harddisk is > attaching on A and sharing through ATA over Ethernet. Is it possible > for B to access hardisk using GFS over AOE? Any know issue (like > caching). I suppose A must need to access the harddisk through GFS as > well, am I correct? Should work without problems. My test-clusters are using this setup and i faced no problems even under bonnie++ load... I use ggaoed because other target-creators didn't allowed (i last checked this over a year ago) access to the same target over lo and eth interfaces... I can give more details (eg. configs) if you need them. > If any, is there comparison between GFS (on AOE?) and NFS on > throughput and CPU loading? > Thanks a lot. GFS as blockdevice filesystem and NFS as network protocol can't be compared easily... An NFS-share is hosted on some random (even GFS is possible) blockdevice filesystem hidden behind the protocol of the NFS-server. So this NFS-servers architecture plays a huge role in such a comparison of client-performance... > Best Regards, > Jacky Marc From linux-cluster at redhat.com Tue Dec 7 11:56:01 2010 From: linux-cluster at redhat.com (Mailbot for etexusa.com) Date: Tue, 7 Dec 2010 03:56:01 -0800 Subject: [Linux-cluster] DSN: failed (delivery failed) Message-ID: This is a Delivery Status Notification (DSN). I was unable to deliver your message to vallalar2006 at vsnl.com. I said RCPT TO: And they gave me the error; 550 5.1.1 unknown or illegal alias: vallalar2006 at vsnl.com -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: text/rfc822-headers Size: 492 bytes Desc: not available URL: From jeff.sturm at eprize.com Tue Dec 7 15:39:13 2010 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Tue, 7 Dec 2010 10:39:13 -0500 Subject: [Linux-cluster] GFS on AOE In-Reply-To: <4CFE0F63.4040009@redhat.com> References: <4CFE0F63.4040009@redhat.com> Message-ID: <64D0546C5EBBD147B75DE133D798665F06A12877@hugo.eprize.local> > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] > On Behalf Of Fabio M. Di Nitto > Sent: Tuesday, December 07, 2010 5:42 AM > To: linux-cluster at redhat.com > Subject: Re: [Linux-cluster] GFS on AOE > > The problem being that AOE (as you suspect) adds a different level of caching. Note however that the AoE protocol does not specify caching, except for optional asynchronous writes. (The aoe Linux module does not utilize asynchronous writes.) Nevertheless, the configuration suggested by the OP is unusual, and won't be very useful in my opinion. Having node B rely on a hard disk in node A leaves node A as a single point of failure. We use GFS over AoE extensively, and find it works well. 
However we use an AoE target that runs independent of the cluster and provides high-availability on its own. -Jeff From gordan at bobich.net Tue Dec 7 16:24:17 2010 From: gordan at bobich.net (Gordan Bobic) Date: Tue, 07 Dec 2010 16:24:17 +0000 Subject: [Linux-cluster] GFS on AOE In-Reply-To: <64D0546C5EBBD147B75DE133D798665F06A12877@hugo.eprize.local> References: <4CFE0F63.4040009@redhat.com> <64D0546C5EBBD147B75DE133D798665F06A12877@hugo.eprize.local> Message-ID: <4CFE5FB1.5010404@bobich.net> Jeff Sturm wrote: >> -----Original Message----- >> From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] >> On Behalf Of Fabio M. Di Nitto >> Sent: Tuesday, December 07, 2010 5:42 AM >> To: linux-cluster at redhat.com >> Subject: Re: [Linux-cluster] GFS on AOE >> >> The problem being that AOE (as you suspect) adds a different level of > caching. > > Note however that the AoE protocol does not specify caching, except for > optional asynchronous writes. (The aoe Linux module does not utilize > asynchronous writes.) It's still an unusual setup. Rather than use a lopsided setup of one node using the disk directly and the other via AoE, it would probably be safer and more reasonable to have the physical disk only accessed by the AoE server daemon and have both nodes connect to that.. > Nevertheless, the configuration suggested by the OP is unusual, and > won't be very useful in my opinion. Having node B rely on a hard disk > in node A leaves node A as a single point of failure. Arguably a "proper" SAN would also be a SPOF itself - unless you have two mirrored in real-time. DRBD is good for a "poor man's SAN" that does away with the SPOF, unlike most "enterprise grade" SANs that are based on the assumption that the SAN will never fail. Gordan From jeff.sturm at eprize.com Tue Dec 7 18:18:08 2010 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Tue, 7 Dec 2010 13:18:08 -0500 Subject: [Linux-cluster] GFS on AOE In-Reply-To: <4CFE5FB1.5010404@bobich.net> References: <4CFE0F63.4040009@redhat.com><64D0546C5EBBD147B75DE133D798665F06A12877@hugo.eprize.local> <4CFE5FB1.5010404@bobich.net> Message-ID: <64D0546C5EBBD147B75DE133D798665F06A1287C@hugo.eprize.local> > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] > On Behalf Of Gordan Bobic > Sent: Tuesday, December 07, 2010 11:24 AM > To: linux clustering > Subject: Re: [Linux-cluster] GFS on AOE > > > Note however that the AoE protocol does not specify caching, except > > for optional asynchronous writes. (The aoe Linux module does not > > utilize asynchronous writes.) > > It's still an unusual setup. Rather than use a lopsided setup of one node using the disk > directly and the other via AoE, it would probably be safer and more reasonable to have > the physical disk only accessed by the AoE server daemon and have both nodes > connect to that.. No question about it... I was commenting on one aspect of AoE, while you're giving the OP better advice as to how he can configure a good 2-node cluster. > DRBD is good for a "poor man's SAN" that does away with the SPOF, unlike most > "enterprise grade" SANs that are based on the assumption that the SAN will never fail. Agreed, DRBD works well for that. If you need more than a 2-node cluster, it might make sense to run AoE (or iSCSI) over DRBD. 
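For illustration only, here is a minimal sketch of the AoE-over-DRBD arrangement mentioned above. It is not taken from the thread; the resource name, device paths, network interface and shelf/slot numbers are all invented. It exports a DRBD-backed device from the storage node with vblade so that the GFS cluster nodes can reach the same block device over Ethernet:

# On the DRBD peer that currently serves the storage (resource "r0"
# assumed to be defined in drbd.conf):
drbdadm up r0
drbdadm primary r0

# Export the DRBD device as AoE shelf 0, slot 1 on eth1
# (vblade/vbladed ship in the vblade package):
vbladed 0 1 eth1 /dev/drbd0

# On each GFS node (aoetools package):
modprobe aoe
aoe-discover
aoe-stat                 # the LUN should appear as /dev/etherd/e0.1

Failover of the export itself (moving the vblade process to the other DRBD peer, or running dual-primary DRBD with allow-two-primaries) still has to be handled separately, which is exactly the kind of independent, highly available AoE target described above.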
Most "enterprise grade" SANs have some provisions for failover/redundancy, but you make a good point--even if a single SAN chassis is indeed bulletproof, you'll need to take them offline for maintenance (e.g. firmware updates) from time to time. (Then, there's human error to deal with as well.) -Jeff From yvette at dbtgroup.com Tue Dec 7 19:03:24 2010 From: yvette at dbtgroup.com (yvette hirth) Date: Tue, 07 Dec 2010 19:03:24 +0000 Subject: [Linux-cluster] gfs2 tuning Message-ID: <4CFE84FC.3090307@dbtgroup.com> hi all, we've now defined three nodes with two more being added soon, and the GFS2 filesystems are shared between all nodes. and, of course, i have questions. 8^O the servers are HP DL380 G6's. initially i used ipmi_lan as the fence manager, with limited success; now i'm using ILO as the fence manager, and at boot, fenced takes forever (well, 5 min or so, which in IT time is forever) to start. is this normal? the ilo2 connections are all on a separate unmanaged dell 2624 switch, which has only the three ILO2 node connections, and nothing else. next, we've added SANbox2 as a backup fencing agent, and the fibre switch is an HP 8/20q (QLogic). i'm not sure if the SANbox2 support is usable on the 8/20q. anyone have any experience with this? if this is supported, wouldn't it be faster to fence/unfence than ip-based fencing? we've got ping_pong downloaded and tested the cluster. we're getting about 2500-3000 locks/sec when ping_pong runs on one node; on two, the locks/sec drops a bit; and on all three nodes, the most we've seen with ping_pong running on all three nodes is ~1800 locks/sec. googling has produced claims of 200k-300k locks/sec when running ping_pong on one node... most of the GFS2 filesystems (600-6000 resource groups) store a relatively small number of very large (2GB+) files. the extremes among the GFS2 filesystems are: 86 files comprising 800GB, to ~98k files comprising 256GB. we've googled "gfs2 tuning" but don't seem to be coming up with anything specific, and rather than "experiment" - which on GFS2 filesystems can take "a while" - i thought i'd ask, "have we done something wrong?" finally, how does the cluster.conf resource definitions interact with GFS2? is it only for "cluster operation"; i.e., only when fencing / unfencing? we specified "noatime,noquota,data=writeback" on all GFS2 filesytems (journals = 5). is this causing our lock rate to fall? and even tho we've changed the resource definition in cluster.conf and set the same parms on /etc/fstab, when mounts are displayed, we do not see "noquota" anywhere... thanks in advance for any info y'all can provide us! yvette From swhiteho at redhat.com Tue Dec 7 20:03:14 2010 From: swhiteho at redhat.com (Steven Whitehouse) Date: Tue, 07 Dec 2010 20:03:14 +0000 Subject: [Linux-cluster] gfs2 tuning In-Reply-To: <4CFE84FC.3090307@dbtgroup.com> References: <4CFE84FC.3090307@dbtgroup.com> Message-ID: <1291752194.2451.58.camel@dolmen> Hi, On Tue, 2010-12-07 at 19:03 +0000, yvette hirth wrote: > hi all, > > we've now defined three nodes with two more being added soon, and the > GFS2 filesystems are shared between all nodes. > > and, of course, i have questions. 8^O > > the servers are HP DL380 G6's. initially i used ipmi_lan as the fence > manager, with limited success; now i'm using ILO as the fence manager, > and at boot, fenced takes forever (well, 5 min or so, which in IT time > is forever) to start. is this normal? 
the ilo2 connections are all on > a separate unmanaged dell 2624 switch, which has only the three ILO2 > node connections, and nothing else. > > next, we've added SANbox2 as a backup fencing agent, and the fibre > switch is an HP 8/20q (QLogic). i'm not sure if the SANbox2 support is > usable on the 8/20q. anyone have any experience with this? if this is > supported, wouldn't it be faster to fence/unfence than ip-based fencing? > > we've got ping_pong downloaded and tested the cluster. we're getting > about 2500-3000 locks/sec when ping_pong runs on one node; on two, the > locks/sec drops a bit; and on all three nodes, the most we've seen with > ping_pong running on all three nodes is ~1800 locks/sec. googling has > produced claims of 200k-300k locks/sec when running ping_pong on one node... > Don't worry too much about the performance of this test. It probably isn't that important for most real applications, particularly since you seem to be using larger files. The total time is likely to be dominated by the actual data operation on the file, rather than fcntl locking overhead. > most of the GFS2 filesystems (600-6000 resource groups) store a > relatively small number of very large (2GB+) files. the extremes among > the GFS2 filesystems are: 86 files comprising 800GB, to ~98k files > comprising 256GB. we've googled "gfs2 tuning" but don't seem to be > coming up with anything specific, and rather than "experiment" - which > on GFS2 filesystems can take "a while" - i thought i'd ask, "have we > done something wrong?" > Normally performance issues tend to relate to the way in which the workload is distributed across the nodes and the I/O pattern which arises. That can result in a bottleneck of a single resource. The locking is done on a per-inode basis, so sometimes directories can be the source of contention if there are lots of creates/deletes in that directory from multiple nodes in a relatively short period. > finally, how does the cluster.conf resource definitions interact with > GFS2? is it only for "cluster operation"; i.e., only when fencing / > unfencing? we specified "noatime,noquota,data=writeback" on all GFS2 > filesytems (journals = 5). is this causing our lock rate to fall? and > even tho we've changed the resource definition in cluster.conf and set > the same parms on /etc/fstab, when mounts are displayed, we do not see > "noquota" anywhere... > > thanks in advance for any info y'all can provide us! > > yvette > You might find that the default data=ordered is faster than writeback, depending on the workload. There shouldn't be anything in cluster.conf which is likely to affect the filesystem's performance beyond the limit on fcntl locks, which you must have already set correctly in order to get the fcntl locking rates that you mention above, Steve. > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From fdinitto at redhat.com Wed Dec 8 02:53:19 2010 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Wed, 08 Dec 2010 03:53:19 +0100 Subject: [Linux-cluster] GFS on AOE In-Reply-To: <64D0546C5EBBD147B75DE133D798665F06A12877@hugo.eprize.local> References: <4CFE0F63.4040009@redhat.com> <64D0546C5EBBD147B75DE133D798665F06A12877@hugo.eprize.local> Message-ID: <4CFEF31F.6000405@redhat.com> On 12/07/2010 04:39 PM, Jeff Sturm wrote: >> -----Original Message----- >> From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] >> On Behalf Of Fabio M. 
Di Nitto >> Sent: Tuesday, December 07, 2010 5:42 AM >> To: linux-cluster at redhat.com >> Subject: Re: [Linux-cluster] GFS on AOE >> >> The problem being that AOE (as you suspect) adds a different level of > caching. > > Note however that the AoE protocol does not specify caching, except for > optional asynchronous writes. (The aoe Linux module does not utilize > asynchronous writes.) In our testing we did have several issues with the setup described above and trimmed down the problem to have: node A -> controller/driver X -> harddisk node B -> (any network block device, including AOE) -> controller/driver X -> harddisk. And isolated the issue to the asymmetry of the setup. > > Nevertheless, the configuration suggested by the OP is unusual, and > won't be very useful in my opinion. Having node B rely on a hard disk > in node A leaves node A as a single point of failure. Yes absolutely. It does not make any sense, but for basic testing is "good enough". > > We use GFS over AoE extensively, and find it works well. However we use > an AoE target that runs independent of the cluster and provides > high-availability on its own. Yes, this is also tested and works fine. As you might have noticed in the FAQ, we only describe the asymmetric setup as "not-working". Fabio > > -Jeff > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From Chris.Jankowski at hp.com Wed Dec 8 03:11:39 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Wed, 8 Dec 2010 03:11:39 +0000 Subject: [Linux-cluster] rgmanager gets stuck on shutdown, if no services are running on its node. Message-ID: <036B68E61A28CA49AC2767596576CD596F5A0DEF29@GVW1113EXC.americas.hpqcorp.net> Hi, I configured a cluster of 2 RHEL6 nodes. The cluster has only one HA service defined. I have a problem with rgmanager getting stuck on shutdown when certain set of conditions are met. The details follow. 1. If I execute "shutdown -h now" on the node that is *not* running the HA service then the shutdown process gets stuck with the last message in the /var/log/messages being: 'date' my_node_name rgmanager[PID#]: Shutting down The shutdown never completes, until I send terminate signal to the two instances of the rgmanager process. Then shutdown completes normally. 2. By comparison, if I execute "shutdown -h now" on a node that *is* running the HA service, then shutdown proceeds normally. 3. The problem walks with the absence of the service i.e. each of the two nodes has the problem when the service is *not* running on it and does not have the problem when the service *is* running on it. 4. I have set the following debug level in the cluster.conf: But I am not getting any additional messages when the rgmanager is stuck during shutdown. Questions: Is this a known problem? How can I avoid it short of having some dummy service running on each node, as a workaround? Thanks and regards, Chris Jankowski -------------- next part -------------- An HTML attachment was scrubbed... URL: From fdinitto at redhat.com Wed Dec 8 03:59:36 2010 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Wed, 08 Dec 2010 04:59:36 +0100 Subject: [Linux-cluster] rgmanager gets stuck on shutdown, if no services are running on its node. 
In-Reply-To: <036B68E61A28CA49AC2767596576CD596F5A0DEF29@GVW1113EXC.americas.hpqcorp.net>
References: <036B68E61A28CA49AC2767596576CD596F5A0DEF29@GVW1113EXC.americas.hpqcorp.net>
Message-ID: <4CFF02A8.1080407@redhat.com>

Hi,

On 12/08/2010 04:11 AM, Jankowski, Chris wrote:
> Hi,
>
> I configured a cluster of 2 RHEL6 nodes.
> The cluster has only one HA service defined.
>
> I have a problem with rgmanager getting stuck on shutdown when certain
> set of conditions are met. The details follow.
>
> 1.
> If I execute "shutdown -h now" on the node that is **not** running the
> HA service then the shutdown process gets stuck with the last message in
> the /var/log/messages being:
>
> 'date' my_node_name rgmanager[PID#]: Shutting down
>
> The shutdown never completes, until I send terminate signal to the two
> instances of the rgmanager process. Then shutdown completes normally.
>
> 2.
> By comparison, if I execute "shutdown -h now" on a node that **is**
> running the HA service, then shutdown proceeds normally.
>
> 3.
> The problem walks with the absence of the service i.e. each of the two
> nodes has the problem when the service is **not** running on it and does
> not have the problem when the service **is** running on it.
>
> 4.
> I have set the following debug level in the cluster.conf:
>
>

Try also: and collect logs from all daemons.

The rgmanager being stuck could be only a consequence of something else being blocked and not necessarily the root cause of the problem.

>
> But I am not getting any additional messages when the rgmanager is stuck
> during shutdown.
>
> Questions:
> Is this a known problem?

No, can you please follow the standard procedure and report the issue through support? or file at least a bugzilla?

> How can I avoid it short of having some dummy service running on each
> node, as a workaround?

Send us all debugging logs and cluster.conf so we can actually fix the problem asap.

Fabio

From Chris.Jankowski at hp.com Wed Dec 8 04:55:21 2010
From: Chris.Jankowski at hp.com (Jankowski, Chris)
Date: Wed, 8 Dec 2010 04:55:21 +0000
Subject: [Linux-cluster] rgmanager gets stuck on shutdown, if no services are running on its node.
In-Reply-To: <4CFF02A8.1080407@redhat.com>
References: <036B68E61A28CA49AC2767596576CD596F5A0DEF29@GVW1113EXC.americas.hpqcorp.net> <4CFF02A8.1080407@redhat.com>
Message-ID: <036B68E61A28CA49AC2767596576CD596F5A0DEFC0@GVW1113EXC.americas.hpqcorp.net>

Fabio,

Thank you. I asked the customer to log a support call with HP, who are providing 1st and 2nd level of support for them.

In the meantime, I followed your advice and configured debug level of logging for all daemons. However, this did not produce any new information when I tested the scenario again.

Regards,

Chris Jankowski

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Fabio M. Di Nitto
Sent: Wednesday, 8 December 2010 15:00
To: linux-cluster at redhat.com
Subject: Re: [Linux-cluster] rgmanager gets stuck on shutdown, if no services are running on its node.

Hi,

On 12/08/2010 04:11 AM, Jankowski, Chris wrote:
> Hi,
>
> I configured a cluster of 2 RHEL6 nodes.
> The cluster has only one HA service defined.
>
> I have a problem with rgmanager getting stuck on shutdown when certain
> set of conditions are met. The details follow.
>
> 1.
> If I execute "shutdown -h now" on the node that is **not** running the > HA service then the shutdown process gets stuck with the last message in > the /var/log/messages being: > > 'date' my_node_name rgmanager[PID#]: Shutting down > > The shutdown never completes, until I send terminate signal to the two > instances of the rgmanager process. Then shutdown completes normally. > > 2. > By comparison, if I execute "shutdown -h now" on a node that **is** > running the HA service, then shutdown proceeds normally. > > 3. > The problem walks with the absence of the service i.e. each of the two > nodes has the problem when the service is **not** running on it and does > not have the problem when the service **is** running on it. > > 4. > I have set the following debug level in the cluster.conf: > > > > Try also: and collect logs from all daemons. The rgmanager being stuck could be only a consequence of something else being blocked and not necessarily the root cause of the problem. > > But I am not getting any additional messages when the rgmanager is stuck > during shutdown. > > Questions: > Is this a known problem? No, can you please follow the standard procedure and report the issue through support? or file at least a bugzilla? > How can I avoid it short of having some dummy service running on each > node, as a workaround? Send us all debugging logs and cluster.conf so we can actually fix the problem asap. Fabio -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From lhh at redhat.com Wed Dec 8 19:46:09 2010 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 08 Dec 2010 14:46:09 -0500 Subject: [Linux-cluster] rgmanager gets stuck on shutdown, if no services are running on its node. In-Reply-To: <036B68E61A28CA49AC2767596576CD596F5A0DEF29@GVW1113EXC.americas.hpqcorp.net> References: <036B68E61A28CA49AC2767596576CD596F5A0DEF29@GVW1113EXC.americas.hpqcorp.net> Message-ID: <1291837569.3865.3.camel@ayanami.boston.devel.redhat.com> On Wed, 2010-12-08 at 03:11 +0000, Jankowski, Chris wrote: > Hi, > > I configured a cluster of 2 RHEL6 nodes. > The cluster has only one HA service defined. > > I have a problem with rgmanager getting stuck on shutdown when certain > set of conditions are met. The details follow. > > 1. > If I execute ?shutdown ?h now? on the node that is *not* running the > HA service then the shutdown process gets stuck with the last message > in the /var/log/messages being: > Is this reproducible outside of 'shutdown -h now', ex: does 'service rgmanager stop' work in your configuration? If you can still reach the machine (ssh or whatever) after executing 'shutdown -h now': 1) Install 'rgmanager-debuginfo' and gdb. 2) When rgmanager hangs on shutdown, run: - gdb /usr/sbin/rgmanager `pidof -s rgmanager` 3) When inside gdb, run: - thr a a bt There's a related bug in RHEL5 related to releasing the lockspace if CMAN exits before rgmanager, but I was unable to reproduce it on the STABLE3/31 branches when I tested. 
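The same backtrace capture can also be scripted; a small sketch (not from the original mail) that follows the steps above non-interactively via gdb batch mode, assuming gdb and rgmanager-debuginfo are installed:

#!/bin/sh
# Sketch: dump all rgmanager thread backtraces to a file for a bug report.
# "thr a a bt" is the abbreviated form of the command passed to -ex below.
PID=$(pidof -s rgmanager)
[ -n "$PID" ] || { echo "rgmanager is not running" >&2; exit 1; }
gdb -batch -ex "thread apply all backtrace" \
    /usr/sbin/rgmanager "$PID" > /tmp/rgmanager-backtrace.txt 2>&1
echo "Backtrace saved in /tmp/rgmanager-backtrace.txt"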
-- Lon From lhh at redhat.com Wed Dec 8 19:49:27 2010 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 08 Dec 2010 14:49:27 -0500 Subject: [Linux-cluster] How do I implement an unmount only filesystem resource agent In-Reply-To: <036B68E61A28CA49AC2767596576CD596F5A04F2FF@GVW1113EXC.americas.hpqcorp.net> References: <036B68E61A28CA49AC2767596576CD596F5A04F2FF@GVW1113EXC.americas.hpqcorp.net> Message-ID: <1291837767.3865.7.camel@ayanami.boston.devel.redhat.com> On Mon, 2010-12-06 at 12:27 +0000, Jankowski, Chris wrote: > > To have a foolproof HA service I need to: > > * Check, if the snapshot filesystem is mounted > * If it is, all processes running in it need to be killed > * Then the snapshot filesystem needs to be unmounted. > > I could configure a script resource with a script that would do the 3 > steps listed above as part of its stop action. It would have > essentially null start and status actions. > > Is there a better, more elegant way of achieving the same result e.g. > using the filesystem resource? In theory you could delete the 'start' operation from the agent, but I think rgmanager will ignore that and try to start it anyway... You could edit the 'fs' agent and make the 'stop' and 'status' operations return 0 immediately, though. -- Lon From lhh at redhat.com Wed Dec 8 20:33:14 2010 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 08 Dec 2010 15:33:14 -0500 Subject: [Linux-cluster] Heuristics for quorum disk used as a tiebreaker in a two node cluster. In-Reply-To: <036B68E61A28CA49AC2767596576CD596F5A04ED98@GVW1113EXC.americas.hpqcorp.net> References: <036B68E61A28CA49AC2767596576CD596F5A04ED98@GVW1113EXC.americas.hpqcorp.net> Message-ID: <1291840394.3865.51.camel@ayanami.boston.devel.redhat.com> On Fri, 2010-12-03 at 10:10 +0000, Jankowski, Chris wrote: > This is exactly what I would like to achieve. I know which node > should stay alive - the one running my service, and it is trivial for > me to find this out directly, as I can query for its status locally on > a node. I do not have use the network. This can be used as a heuristic > for the quorum disc. > > What I am missing is how to make that into a workable whole. > Specifically the following aspects are of concern: > > 1. > I do not want the other node to be ejected from the cluster just > because it does not run the service. But the test is binary, so it > looks like it will be ejected. When a two node cluster partitions, someone has to die. > 2. > Startup time, before the service started. As no node has the service, > both will be candidates for ejection. One node will die and the other will start the service. > 3. > Service migration time. > During service migration from one node to another, there is a > transient period of time when the service is not active on either > node. If you partition during a 'relocation' operation, rgmanager will evaluate the service and start it after fencing completes. > 1. > How do I put all of this together to achieve the overall objective of > the node with the service surviving the partitioning event > uninterrupted? As it turns out, using qdiskd to do this is not the easiest thing in the world. This has to do with a variety of factors, but the biggest is that qdiskd has to make choices -before- CMAN/corosync do, so it's hard to ensure correct behavior in this particular case. The simplest thing I know of to do this is to selectively delay fencing. It's a bit of a hack (though less so than using qdiskd, as it turns out). 
NOTE: This agent _MUST_ be used in conjunction with a real fencing agent. Put the reference to the agent before the real fencing agent within the same method. It might look like this:

#!/bin/sh
me=$(hostname)
service=empty1

owner=$(clustat -lfs $service | grep '^ Owner' | cut -f2 -d: ; exit ${PIPESTATUS[0]})
state=$?

echo Eval $service state $state $owner

if [ $state -eq 0 ] && [ "$owner" != "$me" ]; then
    echo Not the owner - Delaying 30 seconds
    sleep 30
fi
exit 0

What it does is give preference to the node running the service by making the non-owner delay a bit before trying to perform real fencing operation. If the real owner is alive, it will fence first. If the service was not running before the partition, no node gets preference.

If the primary driving reason for using qdiskd was to solve this problem, then you can you can avoid using qdiskd.

> 2.
> What is the relationship between fencing and node suicide due to
> communication through quorum disk?

None. Both occur.

> 3.
> How does the master election relate to this?

It doesn't, really. To get a node to drop master, you have to turn 'reboot' off. After 'reboot' is off, a node will abdicate 'master' mode if its score drops.

-- Lon

From Chris.Jankowski at hp.com Thu Dec 9 03:57:33 2010
From: Chris.Jankowski at hp.com (Jankowski, Chris)
Date: Thu, 9 Dec 2010 03:57:33 +0000
Subject: [Linux-cluster] Heuristics for quorum disk used as a tiebreaker in a two node cluster.
In-Reply-To: <1291840394.3865.51.camel@ayanami.boston.devel.redhat.com>
References: <036B68E61A28CA49AC2767596576CD596F5A04ED98@GVW1113EXC.americas.hpqcorp.net> <1291840394.3865.51.camel@ayanami.boston.devel.redhat.com>
Message-ID: <036B68E61A28CA49AC2767596576CD596F5A0DF47C@GVW1113EXC.americas.hpqcorp.net>

Lon,

Thank you for your suggestions.

1.
I like very much your idea of having additional fencing agent (called as the first one in the chain) with delay dependent on the presence of the service on the node. I understand the code. What I do not know is what are the steps in adding my own fencing agents. They all live in /usr/sbin.

Is it as simple as placing the new fencing agent in /usr/bin? Is some kind of registration required e.g. so ccs_config_validate will recognise it?

2.
I'd guess that the extra fencing agent can also solve the problem of both nodes being fenced when the inter-node link goes down. This is a distinct from the scenario where the communication through quorum disk ceases. This will be a bonus.

3.
I am using quorum disk as a natural way to assure that the cluster of 2 nodes has quorum with just one node. I am aware of the option. What are the advantages or disadvantages of using quorum disk for two nodes compared with no quorum disk and the two_node="1" attribute set?

Thanks and regards,

Chris Jankowski

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Lon Hohberger
Sent: Thursday, 9 December 2010 07:33
To: linux clustering
Subject: Re: [Linux-cluster] Heuristics for quorum disk used as a tiebreaker in a two node cluster.

On Fri, 2010-12-03 at 10:10 +0000, Jankowski, Chris wrote:
> This is exactly what I would like to achieve. I know which node
> should stay alive - the one running my service, and it is trivial for
> me to find this out directly, as I can query for its status locally on
> a node. I do not have use the network. This can be used as a heuristic
> for the quorum disc.
>
> What I am missing is how to make that into a workable whole.
> Specifically the following aspects are of concern: > > 1. > I do not want the other node to be ejected from the cluster just > because it does not run the service. But the test is binary, so it > looks like it will be ejected. When a two node cluster partitions, someone has to die. > 2. > Startup time, before the service started. As no node has the service, > both will be candidates for ejection. One node will die and the other will start the service. > 3. > Service migration time. > During service migration from one node to another, there is a > transient period of time when the service is not active on either > node. If you partition during a 'relocation' operation, rgmanager will evaluate the service and start it after fencing completes. > 1. > How do I put all of this together to achieve the overall objective of > the node with the service surviving the partitioning event > uninterrupted? As it turns out, using qdiskd to do this is not the easiest thing in the world. This has to do with a variety of factors, but the biggest is that qdiskd has to make choices -before- CMAN/corosync do, so it's hard to ensure correct behavior in this particular case. The simplest thing I know of to do this is to selectively delay fencing. It's a bit of a hack (though less so than using qdiskd, as it turns out). NOTE: This agent _MUST_ be used in conjunction with a real fencing agent. Put the reference to the agent before the real fencing agent within the same method. It might look like this: #!/bin/sh me=$(hostname) service=empty1 owner=$(clustat -lfs $service | grep '^ Owner' | cut -f2 -d: ; exit ${PIPESTATUS[0]}) state=$? echo Eval $service state $state $owner if [ $state -eq 0 ] && [ "$owner" != "$me" ]; then echo Not the owner - Delaying 30 seconds sleep 30 fi exit 0 What it does is give preference to the node running the service by making the non-owner delay a bit before trying to perform real fencing operation. If the real owner is alive, it will fence first. If the service was not running before the partition, no node gets preference. If the primary driving reason for using qdiskd was to solve this problem, then you can you can avoid using qdiskd. > 2. > What is the relationship between fencing and node suicide due to > communication through quorum disk? None. Both occur. > 3. > How does the master election relate to this? It doesn't, really. To get a node to drop master, you have to turn 'reboot' off. After 'reboot' is off, a node will abdicate 'master' mode if its score drops. -- Lon -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From Chris.Jankowski at hp.com Thu Dec 9 05:07:50 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Thu, 9 Dec 2010 05:07:50 +0000 Subject: [Linux-cluster] rgmanager gets stuck on shutdown, if no services are running on its node. In-Reply-To: <1291837569.3865.3.camel@ayanami.boston.devel.redhat.com> References: <036B68E61A28CA49AC2767596576CD596F5A0DEF29@GVW1113EXC.americas.hpqcorp.net> <1291837569.3865.3.camel@ayanami.boston.devel.redhat.com> Message-ID: <036B68E61A28CA49AC2767596576CD596F5A0DF4B5@GVW1113EXC.americas.hpqcorp.net> Lon, The problem is reproducible at will. I do have access to the system after the "shutdown -h now" command is issued and rgmanager blocks. I have gdb installed, but I do not know how to obtain rgmanager-debuginfo. The system is on an isolated network and I pointed you to an on-disk repository that is a copy of the RHEL6 distribution DVD copied to local disk. 
Thanks and regards, Chris -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Lon Hohberger Sent: Thursday, 9 December 2010 06:46 To: linux clustering Subject: Re: [Linux-cluster] rgmanager gets stuck on shutdown, if no services are running on its node. On Wed, 2010-12-08 at 03:11 +0000, Jankowski, Chris wrote: > Hi, > > I configured a cluster of 2 RHEL6 nodes. > The cluster has only one HA service defined. > > I have a problem with rgmanager getting stuck on shutdown when certain > set of conditions are met. The details follow. > > 1. > If I execute ?shutdown ?h now? on the node that is *not* running the > HA service then the shutdown process gets stuck with the last message > in the /var/log/messages being: > Is this reproducible outside of 'shutdown -h now', ex: does 'service rgmanager stop' work in your configuration? If you can still reach the machine (ssh or whatever) after executing 'shutdown -h now': 1) Install 'rgmanager-debuginfo' and gdb. 2) When rgmanager hangs on shutdown, run: - gdb /usr/sbin/rgmanager `pidof -s rgmanager` 3) When inside gdb, run: - thr a a bt There's a related bug in RHEL5 related to releasing the lockspace if CMAN exits before rgmanager, but I was unable to reproduce it on the STABLE3/31 branches when I tested. -- Lon -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From Chris.Jankowski at hp.com Thu Dec 9 05:09:44 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Thu, 9 Dec 2010 05:09:44 +0000 Subject: [Linux-cluster] How do I implement an unmount only filesystem resource agent In-Reply-To: <1291837767.3865.7.camel@ayanami.boston.devel.redhat.com> References: <036B68E61A28CA49AC2767596576CD596F5A04F2FF@GVW1113EXC.americas.hpqcorp.net> <1291837767.3865.7.camel@ayanami.boston.devel.redhat.com> Message-ID: <036B68E61A28CA49AC2767596576CD596F5A0DF4B7@GVW1113EXC.americas.hpqcorp.net> Lon, Thank you for your suggestion. In the meantime, I developed a script to do the unmount of a snapshot on stop and configured it as an additional resource agent of the type script. This works very well. Regards, Chris Jankowski -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Lon Hohberger Sent: Thursday, 9 December 2010 06:49 To: linux clustering Subject: Re: [Linux-cluster] How do I implement an unmount only filesystem resource agent On Mon, 2010-12-06 at 12:27 +0000, Jankowski, Chris wrote: > > To have a foolproof HA service I need to: > > * Check, if the snapshot filesystem is mounted > * If it is, all processes running in it need to be killed > * Then the snapshot filesystem needs to be unmounted. > > I could configure a script resource with a script that would do the 3 > steps listed above as part of its stop action. It would have > essentially null start and status actions. > > Is there a better, more elegant way of achieving the same result e.g. > using the filesystem resource? In theory you could delete the 'start' operation from the agent, but I think rgmanager will ignore that and try to start it anyway... You could edit the 'fs' agent and make the 'stop' and 'status' operations return 0 immediately, though. 
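The unmount-on-stop script itself never made it into the thread; a hypothetical sketch along the lines of the three steps listed above (the mount point is an assumption, and rgmanager's script resource simply calls the script with start/stop/status) could look like this:

#!/bin/sh
# Hypothetical stop-only script resource: no-op on start and status,
# kill any users of the snapshot filesystem and unmount it on stop.
SNAP_MNT=/mnt/snapshot          # assumed mount point of the snapshot

case "$1" in
    start|status|monitor)
        exit 0
        ;;
    stop)
        if mountpoint -q "$SNAP_MNT"; then
            fuser -km "$SNAP_MNT"       # kill processes still using it
            sleep 2
            umount "$SNAP_MNT" || exit 1
        fi
        exit 0
        ;;
    *)
        echo "Usage: $0 {start|stop|status}"
        exit 2
        ;;
esac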
-- Lon -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From Chris.Jankowski at hp.com Thu Dec 9 06:58:41 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Thu, 9 Dec 2010 06:58:41 +0000 Subject: [Linux-cluster] rgmanager gets stuck on shutdown, if no services are running on its node. References: <036B68E61A28CA49AC2767596576CD596F5A0DEF29@GVW1113EXC.americas.hpqcorp.net> <1291837569.3865.3.camel@ayanami.boston.devel.redhat.com> Message-ID: <036B68E61A28CA49AC2767596576CD596F5A0DF556@GVW1113EXC.americas.hpqcorp.net> Lon, I think that I got to the bottom of the problem: If there are *no* services running on a node and you issue "shutdown -h now" on the node, then when it comes to shutting down rgmanger, it executes the following sequence: 1. Outputs "Shutting down" message to /var/adm/messages 2. Waits for the "status_poll_interval" value of seconds 3. Outputs the message: "Shutdown complete, exiting" and completes its own shutdown. In my case, I had , as my service scripts do not have a viable check of their status, and the status check messages were clogging up the /var/adm/messages file. So, rgmanager appeared to be stuck, whereas it was just really waiting. I think this is a bug in logic here. It should not be waiting in this situation. ------------ By comparison, if there is a service running on a node and you issue "shutdown -h now" on the node, then when it comes to shutting down rgmanger, it executes the following sequence: 1. Outputs "Shutting down" message to /var/adm/messages 2. Proceeds *immediately* (no wait) to shutting down the service 3. When the service is shutdown the rgmanager *immediately* outputs "Shutdown complete, exiting" and completes its own shutdown. ------------- As a workaround, I set status_poll_interval="10" for the time being, although I believe that I should be forced to rely on short polling interval. Regards, Chris Jankowski -----Original Message----- From: Jankowski, Chris Sent: Thursday, 9 December 2010 16:08 To: linux clustering Subject: RE: [Linux-cluster] rgmanager gets stuck on shutdown, if no services are running on its node. Lon, The problem is reproducible at will. I do have access to the system after the "shutdown -h now" command is issued and rgmanager blocks. I have gdb installed, but I do not know how to obtain rgmanager-debuginfo. The system is on an isolated network and I pointed you to an on-disk repository that is a copy of the RHEL6 distribution DVD copied to local disk. Thanks and regards, Chris -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Lon Hohberger Sent: Thursday, 9 December 2010 06:46 To: linux clustering Subject: Re: [Linux-cluster] rgmanager gets stuck on shutdown, if no services are running on its node. On Wed, 2010-12-08 at 03:11 +0000, Jankowski, Chris wrote: > Hi, > > I configured a cluster of 2 RHEL6 nodes. > The cluster has only one HA service defined. > > I have a problem with rgmanager getting stuck on shutdown when certain > set of conditions are met. The details follow. > > 1. > If I execute ?shutdown ?h now? on the node that is *not* running the > HA service then the shutdown process gets stuck with the last message > in the /var/log/messages being: > Is this reproducible outside of 'shutdown -h now', ex: does 'service rgmanager stop' work in your configuration? 
If you can still reach the machine (ssh or whatever) after executing 'shutdown -h now': 1) Install 'rgmanager-debuginfo' and gdb. 2) When rgmanager hangs on shutdown, run: - gdb /usr/sbin/rgmanager `pidof -s rgmanager` 3) When inside gdb, run: - thr a a bt There's a related bug in RHEL5 related to releasing the lockspace if CMAN exits before rgmanager, but I was unable to reproduce it on the STABLE3/31 branches when I tested. -- Lon -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From rossnick-lists at cybercat.ca Fri Dec 10 15:22:26 2010 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Fri, 10 Dec 2010 10:22:26 -0500 Subject: [Linux-cluster] To SELinux or not to SELinux ? Message-ID: <4F027F4093FA4317ACB29F6C64D6C867@versa> Over the CentOS-users list there is a long on-going thread about SELinux. Since it's introduction a while back, I alwasy disabled selinux because of the added complexity and never took the time to learn it. For our soon to be production cluster of 8 nodes, I will be attempting to at least set selinux at permissive to see how it works and learn it. Our services are mostly of 3 type. Database server, apache server, our own compile, and used in a non-standard locations and java servers, using the default java, application and data directory on the gfs shared storage. So, for a cluster, using fencing, gfs, and all the needed tools to run a cluster, is there any reason not to use selinux ? I am looking to see if cluster operator use or do not use selinux... Thanks, Nicolas From deJongm at TEOCO.com Fri Dec 10 16:29:37 2010 From: deJongm at TEOCO.com (de Jong, Mark-Jan) Date: Fri, 10 Dec 2010 11:29:37 -0500 Subject: [Linux-cluster] To SELinux or not to SELinux ? In-Reply-To: <4F027F4093FA4317ACB29F6C64D6C867@versa> References: <4F027F4093FA4317ACB29F6C64D6C867@versa> Message-ID: <5E3DCAE61C95FA4397679425D7275D264F66B3A2@HQ-MX03.us.teo.earth> Hello, I've had my share of conversations with the RH cluster folks regarding SELinux. They're answer at the time was (at least regarding RHEL5) that RH cluster suite was not certified to work with SELinux enabled. I HAVE made it work, but there were many instances where kernel or package updates ended up breaking it again. In the end I gave up due to time constraints and set SELinux to permissive in hopes to revisit it again sometime in the future. Hope that helps. -M -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Nicolas Ross Sent: Friday, December 10, 2010 10:22 AM To: linux clustering Subject: [Linux-cluster] To SELinux or not to SELinux ? Over the CentOS-users list there is a long on-going thread about SELinux. Since it's introduction a while back, I alwasy disabled selinux because of the added complexity and never took the time to learn it. For our soon to be production cluster of 8 nodes, I will be attempting to at least set selinux at permissive to see how it works and learn it. Our services are mostly of 3 type. Database server, apache server, our own compile, and used in a non-standard locations and java servers, using the default java, application and data directory on the gfs shared storage. So, for a cluster, using fencing, gfs, and all the needed tools to run a cluster, is there any reason not to use selinux ? I am looking to see if cluster operator use or do not use selinux... 
Thanks, Nicolas -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From emsearcy at gmail.com Fri Dec 10 16:37:41 2010 From: emsearcy at gmail.com (Eric Searcy) Date: Fri, 10 Dec 2010 08:37:41 -0800 Subject: [Linux-cluster] To SELinux or not to SELinux ? In-Reply-To: <4F027F4093FA4317ACB29F6C64D6C867@versa> References: <4F027F4093FA4317ACB29F6C64D6C867@versa> Message-ID: On Fri, Dec 10, 2010 at 7:22 AM, Nicolas Ross wrote: > Over the CentOS-users list there is a long on-going thread about SELinux. > Since it's introduction a while back, I alwasy disabled selinux because of > the added complexity and never took the time to learn it. > > For our soon to be production cluster of 8 nodes, I will be attempting to at > least set selinux at permissive to see how it works and learn it. Our > services are mostly of 3 type. Database server, apache server, our own > compile, and used in a non-standard locations and java servers, using the > default java, application and data directory on the gfs shared storage. > > So, for a cluster, using fencing, gfs, and all the needed tools to run a > cluster, is there any reason not to use selinux ? I am looking to see if > cluster operator use or do not use selinux... As far as RHCS (at least on 5) is concerned, there are notes that SELinux isn't supported. In other words those packages don't set labels properly or add policy modules that would be needed. Of course, that doesn't stop you from using audit2allow to "clean up" the denies you find while running in permissive (some denies will only show up during boot). I also locked myself out of the entire cluster once and had to use a kernel append option to disable selinux :-) I decided to run enforcing for greater defense in depth, but for the time being on everything except RHCS. For all my other boxes, I switch it to permissive before minor dist upgrades and then set each box back to enforcing after the next reboot without denies (I've been doing this since 5.3, when updates to the enforcing policy broke a bunch of labeling stuff and I was putting out fires since everything was in enforcing still). Eric From Colin.Simpson at iongeo.com Fri Dec 10 17:04:37 2010 From: Colin.Simpson at iongeo.com (Colin Simpson) Date: Fri, 10 Dec 2010 17:04:37 +0000 Subject: [Linux-cluster] To SELinux or not to SELinux ? In-Reply-To: References: <4F027F4093FA4317ACB29F6C64D6C867@versa> Message-ID: <1292000677.6237.71.camel@cowie> I seem to now be supported on RHEL 6 according to the Cluster Admin Guide. Colin On Fri, 2010-12-10 at 16:37 +0000, Eric Searcy wrote: > On Fri, Dec 10, 2010 at 7:22 AM, Nicolas Ross > wrote: > > Over the CentOS-users list there is a long on-going thread about > SELinux. > > Since it's introduction a while back, I alwasy disabled selinux > because of > > the added complexity and never took the time to learn it. > > > > For our soon to be production cluster of 8 nodes, I will be > attempting to at > > least set selinux at permissive to see how it works and learn it. > Our > > services are mostly of 3 type. Database server, apache server, our > own > > compile, and used in a non-standard locations and java servers, > using the > > default java, application and data directory on the gfs shared > storage. > > > > So, for a cluster, using fencing, gfs, and all the needed tools to > run a > > cluster, is there any reason not to use selinux ? I am looking to > see if > > cluster operator use or do not use selinux... 
> > As far as RHCS (at least on 5) is concerned, there are notes that > SELinux isn't supported. In other words those packages don't set > labels properly or add policy modules that would be needed. Of > course, that doesn't stop you from using audit2allow to "clean up" the > denies you find while running in permissive (some denies will only > show up during boot). I also locked myself out of the entire cluster > once and had to use a kernel append option to disable selinux :-) > > I decided to run enforcing for greater defense in depth, but for the > time being on everything except RHCS. For all my other boxes, I > switch it to permissive before minor dist upgrades and then set each > box back to enforcing after the next reboot without denies (I've been > doing this since 5.3, when updates to the enforcing policy broke a > bunch of labeling stuff and I was putting out fires since everything > was in enforcing still). > > Eric > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > This email and any files transmitted with it are confidential and are intended solely for the use of the individual or entity to whom they are addressed. If you are not the original recipient or the person responsible for delivering the email to the intended recipient, be advised that you have received this email in error, and that any use, dissemination, forwarding, printing, or copying of this email is strictly prohibited. If you received this email in error, please immediately notify the sender and delete the original. From jeff.sturm at eprize.com Fri Dec 10 18:03:43 2010 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Fri, 10 Dec 2010 13:03:43 -0500 Subject: [Linux-cluster] To SELinux or not to SELinux ? In-Reply-To: <4F027F4093FA4317ACB29F6C64D6C867@versa> References: <4F027F4093FA4317ACB29F6C64D6C867@versa> Message-ID: <64D0546C5EBBD147B75DE133D798665F06A128A8@hugo.eprize.local> > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] > On Behalf Of Nicolas Ross > Sent: Friday, December 10, 2010 10:22 AM > To: linux clustering > Subject: [Linux-cluster] To SELinux or not to SELinux ? > > So, for a cluster, using fencing, gfs, and all the needed tools to run a cluster, is there > any reason not to use selinux ? I am looking to see if cluster operator use or do not > use selinux... Beware that "permissive" mode, far from being benign, can be as expensive as having SELinux enabled. See http://www.mail-archive.com/linux-cluster at redhat.com/msg08317.html for some details on GFS and extended attributes. -Jeff From lhh at redhat.com Fri Dec 10 18:15:51 2010 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 10 Dec 2010 13:15:51 -0500 Subject: [Linux-cluster] Heuristics for quorum disk used as a tiebreaker in a two node cluster. In-Reply-To: <036B68E61A28CA49AC2767596576CD596F5A0DF47C@GVW1113EXC.americas.hpqcorp.net> References: <036B68E61A28CA49AC2767596576CD596F5A04ED98@GVW1113EXC.americas.hpqcorp.net> <1291840394.3865.51.camel@ayanami.boston.devel.redhat.com> <036B68E61A28CA49AC2767596576CD596F5A0DF47C@GVW1113EXC.americas.hpqcorp.net> Message-ID: <1292004951.7139.113.camel@localhost.localdomain> On Thu, 2010-12-09 at 03:57 +0000, Jankowski, Chris wrote: > Lon, > > Thank you for your suggestions. > > 1. > I like very much your idea of having additional fencing agent (called as the first one in the chain) with delay dependent on the presence of the service on the node. 
I understand the code. What I do not know is what are the steps in adding my own fencing agents. They all live in /usr/sbin. > > Is it as simple as placing the new fencing agent in /usr/bin? Is some kind of registration required e.g. so ccs_config_validate will recognise it? You can put the absolute path in the fencedevice tag: Your agent should not have extra parameters. Also, I think my first inclination was wrong; you shouldn't combine it with other devices in the same level. My apologies. Instead: * Your script should *always* exit 1 (failure). The only thing we want this script to do is sleep if the service is running on the other guy; we do not want it to feed fenced any sort of "success" value - ever. * If you leave it returning 0 and you delete your "real" fencedevice later, your data will be at risk. So, make the script return 1 (always) and the cluster.conf would look like this: ... ... ... ... > 2. > I'd guess that the extra fencing agent can also solve the problem of both nodes being fenced when the inter-node link goes down. This is a distinct from the scenario where the communication through quorum disk ceases. This will be a bonus. That's actually what it does... > 3. > I am using quorum disk as a natural way to assure that the cluster of 2 nodes has quorum with just one node. I am aware of the option. Ok. With the custom agent, you can use pretty much no heuristics. Qdiskd will auto-configure everything for you (see below though). However, when using qdiskd for your configuration (where you are using a custom, extra fencing agent to delay fencing based on service location), you should explicitly set master_wins to 0. > What are the advantages or disadvantages of using quorum disk for two nodes compared with no quorum disk and the two_node="1" attribute set? If you have a cluster where the fence devices are accessible only over the same network as the cluster communicates, there is no real advantage to using qdiskd. If you have the fencing devices on a separate network than the cluster uses for communication, then using qdiskd can prevent three fencing problems: a fence race, fence death, and a fencing loop. http://people.redhat.com/lhh/ClusterPitfalls.pdf * The delayservice hack eliminates the fencing race. * Qdiskd holds off fence-loops, but fence-death can still occur in rare cases when simultaneously starting both cluster nodes from a total outage. -- Lon From rossnick-lists at cybercat.ca Fri Dec 10 18:20:52 2010 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Fri, 10 Dec 2010 13:20:52 -0500 Subject: [Linux-cluster] To SELinux or not to SELinux ? References: <4F027F4093FA4317ACB29F6C64D6C867@versa> <64D0546C5EBBD147B75DE133D798665F06A128A8@hugo.eprize.local> Message-ID: >> So, for a cluster, using fencing, gfs, and all the needed tools to run > a cluster, is there >> any reason not to use selinux ? I am looking to see if cluster > operator use or do not >> use selinux... > > Beware that "permissive" mode, far from being benign, can be as > expensive as having SELinux enabled. See > http://www.mail-archive.com/linux-cluster at redhat.com/msg08317.html for > some details on GFS and extended attributes. Oh... I didn't tought of performance influence... That alone is enough to keep it off completly. We will be hosting a high-volume site where every millisecond counts. That site is composed of about a million files of different sorts. So, any added delay in accessing a file is not an option. 
From lhh at redhat.com Fri Dec 10 18:24:56 2010 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 10 Dec 2010 13:24:56 -0500 Subject: [Linux-cluster] rgmanager gets stuck on shutdown, if no services are running on its node. In-Reply-To: <036B68E61A28CA49AC2767596576CD596F5A0DF556@GVW1113EXC.americas.hpqcorp.net> References: <036B68E61A28CA49AC2767596576CD596F5A0DEF29@GVW1113EXC.americas.hpqcorp.net> <1291837569.3865.3.camel@ayanami.boston.devel.redhat.com> <036B68E61A28CA49AC2767596576CD596F5A0DF556@GVW1113EXC.americas.hpqcorp.net> Message-ID: <1292005496.7139.132.camel@localhost.localdomain> On Thu, 2010-12-09 at 06:58 +0000, Jankowski, Chris wrote: > Lon, > > I think that I got to the bottom of the problem: > > If there are *no* services running on a node and you issue "shutdown -h now" on the node, then when it comes to shutting down rgmanger, it executes the following sequence: > > 1. Outputs "Shutting down" message to /var/adm/messages > 2. Waits for the "status_poll_interval" value of seconds > 3. Outputs the message: "Shutdown complete, exiting" and completes its own shutdown. > > In my case, I had , as my service scripts do not have a viable check of their status, and the status check messages were clogging up the /var/adm/messages file. So, rgmanager appeared to be stuck, whereas it was just really waiting. You should just turn off status checks for your script: .. That should make things work. -- Lona From dxh at yahoo.com Fri Dec 10 19:27:49 2010 From: dxh at yahoo.com (Don Hoover) Date: Fri, 10 Dec 2010 14:27:49 -0500 Subject: [Linux-cluster] To SELinux or not to SELinux ? In-Reply-To: References: Message-ID: <118D4152-E0D7-4456-986E-25695579E436@yahoo.com> I have been working with RHEL6 and SElinux in targeted and enforcing mode works really well with everything I have tried it with including cluster and KVM. They have done a much better job with having policies that just work with most all of the software that comes on the distro. And the new 'managing secure services' manual on docs.redhat has lots of examples on what you need to do when you step outside of the defaults like how to add non-default directories(eg outside of var/www) for apache, mysql, KVM etc. I am shooting for rhel6 to be our first build that has SElinux on by default. For the first time I think SElinux might be low enough a hassle that it can be left on. Sent from my iPhone On Dec 10, 2010, at 12:00 PM, linux-cluster-request at redhat.com wrote: > Re: [Linux-cluster] To SELinux or not to SELinux ? From pmdyer at ctgcentral2.com Fri Dec 10 19:34:52 2010 From: pmdyer at ctgcentral2.com (Paul M. Dyer) Date: Fri, 10 Dec 2010 13:34:52 -0600 (CST) Subject: [Linux-cluster] To SELinux or not to SELinux ? In-Reply-To: Message-ID: <4690726.2.1292009692820.JavaMail.root@athena> Hi, I have used selinux enforcing since RHEL 5.4 on a 3-node RHCS cluster. I believe it has been supported since that release. I made some calls back in RHEL 5.3 regarding some issues, but all problems that I experienced have been resolved. I got plenty of support for my issues. According to Dan Walsh, performance was addressed early on. I have not had any performance issues using selinux in RHEL 5, RHCS included. Paul ----- Original Message ----- From: "Nicolas Ross" To: "linux clustering" Sent: Friday, December 10, 2010 12:20:52 PM Subject: Re: [Linux-cluster] To SELinux or not to SELinux ? >> So, for a cluster, using fencing, gfs, and all the needed tools to >> run > a cluster, is there >> any reason not to use selinux ? 
I am looking to see if cluster
> operator use or do not
>> use selinux...
>
> Beware that "permissive" mode, far from being benign, can be as
> expensive as having SELinux enabled. See
> http://www.mail-archive.com/linux-cluster at redhat.com/msg08317.html for
> some details on GFS and extended attributes.

Oh... I didn't tought of performance influence... That alone is enough to keep it off completly. We will be hosting a high-volume site where every millisecond counts. That site is composed of about a million files of different sorts. So, any added delay in accessing a file is not an option.

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

From jeff.sturm at eprize.com Fri Dec 10 20:30:57 2010
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Fri, 10 Dec 2010 15:30:57 -0500
Subject: [Linux-cluster] To SELinux or not to SELinux ?
In-Reply-To: <4690726.2.1292009692820.JavaMail.root@athena>
References: <4690726.2.1292009692820.JavaMail.root@athena>
Message-ID: <64D0546C5EBBD147B75DE133D798665F06A128B3@hugo.eprize.local>

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com]
> On Behalf Of Paul M. Dyer
> Sent: Friday, December 10, 2010 2:35 PM
> To: linux clustering
> Subject: Re: [Linux-cluster] To SELinux or not to SELinux ?
>
> According to Dan Walsh, performance was addressed early on. I have not had any
> performance issues using selinux in RHEL 5, RHCS included.

Results will probably vary depending on what components you need, and what versions you run. For us, SELinux incurred a 30% overhead with GFS file operations. That was on CentOS 5.2 or 5.3, can't remember which. (We're in the middle of an upgrade to 5.5, but haven't started migrating to GFS2.)

But don't take my word for it, or anyone else's... always benchmark your own application.

-Jeff

From linux-cluster at redhat.com Mon Dec 13 11:51:05 2010
From: linux-cluster at redhat.com (Mailbot for etexusa.com)
Date: Mon, 13 Dec 2010 03:51:05 -0800
Subject: [Linux-cluster] DSN: failed (Delivery reports about your e-mail)
Message-ID: 

This is a Delivery Status Notification (DSN). I was unable to deliver your message to pgmarshall at worldnet.att.net.

I said RCPT TO:

And they gave me the error;

551 not our customer

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: text/rfc822-headers
Size: 523 bytes
Desc: not available
URL: 

From paolo.smiraglia at gmail.com Mon Dec 13 18:17:36 2010
From: paolo.smiraglia at gmail.com (Paolo Smiraglia)
Date: Mon, 13 Dec 2010 19:17:36 +0100
Subject: [Linux-cluster] Clustered LVM locking issues
Message-ID: 

Hi to everyone....

We have configured a shared storage between some nodes with iSCSI, and we want to use LVM across them. Our access model is based on synchronized commands remotely executed by the master node with ssh (authenticated by public key without password). The master node is the one that can create/remove logical volumes which are a snapshot of a "base" logical volume. Other nodes exclusively access logical volumes created by the master node.

In order to do so, we have installed in all nodes

   * RedHat Enterprise 6 beta2
   * Cluster Suite
   * CLVMD

Then we have configured LVM with locking_type=3 (lvm.conf) and as a first attempt we marked the volume group as "clustered". Unfortunately we got an error message saying that snapshots are not supported for clustered volume groups.
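For reference, here is a sketch of the sequence described above, with invented volume group and logical volume names (the original commands are not in the post); cman and clvmd need to be running on every node before the clustered VG is touched:

# lvm.conf: locking_type = 3 (cluster-wide locking through clvmd)
lvmconf --enable-cluster        # helper script from the lvm2-cluster package
service cman start
service clvmd start

vgchange -cy vg_shared          # mark the volume group as clustered
lvcreate -L 10G -n base vg_shared

# This is the step reported above to fail with a "snapshot is not
# supported for clustered volume groups" error on the RHEL 6 beta.
# Some later lvm2 releases allow it when the origin is first activated
# exclusively on one node (lvchange -aey vg_shared/base), but whether
# that applies to this beta is not established in the thread.
lvcreate -s -L 1G -n base_snap vg_shared/base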
In order to overcome this issue we used a workaround that consists in clearing the clustered flag in the volume group. However, this caused some concurrency problems that prevent a logical volume from being removed even when it is no longer in use.

Do you have a solution for this problem?

Thanks in advance for replies...

--
PAOLO SMIRAGLIA
http://portale.isf.polito.it/paolo-smiraglia

From james.hofmeister at hp.com  Tue Dec 14 06:28:15 2010
From: james.hofmeister at hp.com (Hofmeister, James (WTEC Linux))
Date: Tue, 14 Dec 2010 06:28:15 +0000
Subject: [Linux-cluster] RHCS & snmpd[30622]: Received SNMP packet(s) from UDP
Message-ID: 

Hello folks,

RE: SNMP packets from local host.

6-10 per minute

Dec 10 07:18:57 host1 snmpd[30622]: Connection from UDP: [127.0.0.1]:49231
Dec 10 07:18:57 host1 snmpd[30622]: Received SNMP packet(s) from UDP: [127.0.0.1]:49231

I see this quite often in RHCS clusters and have not determined if the source is a function of RHCS or if this is one of the HP health agents.

I am aware that these messages can be turned off in the SNMP configuration, I am more interested in the source.

Any feedback would be appreciated.

Regards,
James Hofmeister
Hewlett Packard Linux Solutions Engineer

From raju.rajsand at gmail.com  Tue Dec 14 08:04:12 2010
From: raju.rajsand at gmail.com (Rajagopal Swaminathan)
Date: Tue, 14 Dec 2010 13:34:12 +0530
Subject: [Linux-cluster] RHCS & snmpd[30622]: Received SNMP packet(s) from UDP
In-Reply-To: 
References: 
Message-ID: 

Greetings,

On Tue, Dec 14, 2010 at 11:58 AM, Hofmeister, James (WTEC Linux) wrote:
> Hello folks,
>
> Dec 10 07:18:57 host1 snmpd[30622]: Connection from UDP: [127.0.0.1]:49231
> Dec 10 07:18:57 host1 snmpd[30622]: Received SNMP packet(s) from UDP: [127.0.0.1]:49231
>
> I see this quite often in RHCS clusters and have not determined if the source is a function of RHCS or if this is one of the HP health agents.
>
> I am aware that these messages can be turned off in the SNMP configuration, I am more interested in the source.
>
> Any feedback would be appreciated.
>

Looks to me like the HP agent. Only the agent usually shouts periodically in SNMP...

/scurries Hmmm.... where are my snmp notes?

Regards,

Rajagopal

From kitgerrits at gmail.com  Tue Dec 14 08:15:07 2010
From: kitgerrits at gmail.com (Kit Gerrits)
Date: Tue, 14 Dec 2010 09:15:07 +0100
Subject: [Linux-cluster] Clustered LVM locking issues
In-Reply-To: 
Message-ID: <4d07278e.1211cc0a.3573.37ab@mx.google.com>

Hello,

I might have misunderstood, but:
I am assuming one machine is exporting local storage via iSCSI.
Might it be easier to use LVM on the base storage device and take your snapshot there? (by exporting a LV as an iSCSI target)
(keep in mind, this machine would be a SPOF for the entire cluster)

If you are using a Shared Storage Device and exporting the iSCSI target using a cluster, this would be a problem, as this would also be a Clustered LV.
In this case, advanced Shared Storage Devices offer internal LVM snapshots, which do not interfere with the LUN. (HP MSA devices offer this)
It will allow you to create an LVM snapshot of your (iSCSI/SCSI/FC) LUN without interfering with the 'O/S LVM layer', therefore allowing you to continue exporting your device.

Regards,

Kit

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Paolo Smiraglia
Sent: Monday, 13 December 2010 19:18
To: linux-cluster at redhat.com
Subject: [Linux-cluster] Clustered LVM locking issues

Hi to every ones....
We have configured a shared storage between some nodes with iSCSI, and we want to use LVM across them. Our access model is based on syncronized commands remotely executed by the master node with ssh (authenticated by public key without password). The master node is the one that can create/remove logical volumes which are a snapshot of a "base" logical voume. Other nodes exclusively access logical volumes created by master node. In order to do so, we have installed in all nodes * RedHat Enterprise 6 beta2 * Cluster Suite * CLVMD Then we have configured LVM with locking_type=3 (lvm.conf) and fist attept we marked the volume group as "clustered". Unfortunately we got an error message that is saying snapshot is not supoprted for clustered volume groups. In order to overcome this issue we used a workaround that conists in clearing the clustered flag in the volume group. However, this caused some concurrency problems that prevent a logical volume to be removed even if not used. Do you have a solution for this problem? Thanks in advance for replies... -- PAOLO SMIRAGLIA http://portale.isf.polito.it/paolo-smiraglia -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From yamato at redhat.com Tue Dec 14 12:02:16 2010 From: yamato at redhat.com (Masatake YAMATO) Date: Tue, 14 Dec 2010 21:02:16 +0900 (JST) Subject: [Linux-cluster] [Openais] packet dissectors for totempg, cman, clvmd, rgmanager, cpg, In-Reply-To: <20100527.133950.593311767624382812.yamato@redhat.com> References: <20100527.132034.642044848160830535.yamato@redhat.com> <4BFDF51C.4000808@redhat.com> <20100527.133950.593311767624382812.yamato@redhat.com> Message-ID: <20101214.210216.512133496326900668.yamato@redhat.com> https://bugs.wireshark.org/bugzilla/show_bug.cgi?id=3232 The patch for wireshark is not merged yet. Reviewing is very slow or the patch may be rejected implicitly. So I decide to provide my dissector as a dyanamic loadable plugin. It can be built on Fedora-14 with wireshark-devel package. You, working on RHEL6, may not be interested in ccsd. https://github.com/masatake/wireshark-plugin-rhcs I will maintain this source tree. Forking are welcome. I'd like to make it a rpm package and be available as a part of Fedora. But I cannot find enough time to be a package maintainer now. Masatake YAMATO From ccaulfie at redhat.com Tue Dec 14 13:38:02 2010 From: ccaulfie at redhat.com (Christine Caulfield) Date: Tue, 14 Dec 2010 13:38:02 +0000 Subject: [Linux-cluster] [Openais] packet dissectors for totempg, cman, clvmd, rgmanager, cpg, In-Reply-To: <20101214.210216.512133496326900668.yamato@redhat.com> References: <20100527.132034.642044848160830535.yamato@redhat.com> <4BFDF51C.4000808@redhat.com> <20100527.133950.593311767624382812.yamato@redhat.com> <20101214.210216.512133496326900668.yamato@redhat.com> Message-ID: <4D07733A.9030300@redhat.com> Awesome! Thank you :-) Chrissie On 14/12/10 12:02, Masatake YAMATO wrote: > https://bugs.wireshark.org/bugzilla/show_bug.cgi?id=3232 > > The patch for wireshark is not merged yet. Reviewing is very slow or > the patch may be rejected implicitly. > > So I decide to provide my dissector as a dyanamic loadable plugin. It > can be built on Fedora-14 with wireshark-devel package. > > > You, working on RHEL6, may not be interested in ccsd. > > > https://github.com/masatake/wireshark-plugin-rhcs > > I will maintain this source tree. Forking are welcome. > > I'd like to make it a rpm package and be available as a part of > Fedora. 
But I cannot find enough time to be a package maintainer > now. > > Masatake YAMATO > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From jfriesse at redhat.com Tue Dec 14 13:53:32 2010 From: jfriesse at redhat.com (Jan Friesse) Date: Tue, 14 Dec 2010 14:53:32 +0100 Subject: [Linux-cluster] [Openais] packet dissectors for totempg, cman, clvmd, rgmanager, cpg, In-Reply-To: <20101214.210216.512133496326900668.yamato@redhat.com> References: <20100527.132034.642044848160830535.yamato@redhat.com> <4BFDF51C.4000808@redhat.com> <20100527.133950.593311767624382812.yamato@redhat.com> <20101214.210216.512133496326900668.yamato@redhat.com> Message-ID: <4D0776DC.9080003@redhat.com> Masatake, I'm pretty sure that biggest problem of your code was that it was licensed under BSD (three clause, same as Corosync has) license. Wireshark is licensed under GPL and even I like BSD licenses much more, I would recommend you to try to relicense code under GPL and send them this code. Regards, Honza Masatake YAMATO napsal(a): > https://bugs.wireshark.org/bugzilla/show_bug.cgi?id=3232 > > The patch for wireshark is not merged yet. Reviewing is very slow or > the patch may be rejected implicitly. > > So I decide to provide my dissector as a dyanamic loadable plugin. It > can be built on Fedora-14 with wireshark-devel package. > > > You, working on RHEL6, may not be interested in ccsd. > > > https://github.com/masatake/wireshark-plugin-rhcs > > I will maintain this source tree. Forking are welcome. > > I'd like to make it a rpm package and be available as a part of > Fedora. But I cannot find enough time to be a package maintainer > now. > > Masatake YAMATO > _______________________________________________ > Openais mailing list > Openais at lists.linux-foundation.org > https://lists.linux-foundation.org/mailman/listinfo/openais From yamato at redhat.com Tue Dec 14 14:15:25 2010 From: yamato at redhat.com (Masatake YAMATO) Date: Tue, 14 Dec 2010 23:15:25 +0900 (JST) Subject: [Linux-cluster] [Openais] packet dissectors for totempg, cman, clvmd, rgmanager, cpg, In-Reply-To: <4D0776DC.9080003@redhat.com> References: <20100527.133950.593311767624382812.yamato@redhat.com> <20101214.210216.512133496326900668.yamato@redhat.com> <4D0776DC.9080003@redhat.com> Message-ID: <20101214.231525.648039044490713397.yamato@redhat.com> I'd like to your advice more detail seriously. I've been developing this code for three years. I don't want to make this code garbage. > Masatake, > I'm pretty sure that biggest problem of your code was that it was > licensed under BSD (three clause, same as Corosync has) > license. Wireshark is licensed under GPL and even I like BSD licenses > much more, I would recommend you to try to relicense code under GPL > and send them this code. > > Regards, > Honza I got the similar comment from wireshark developer. Please, read the discussion: https://bugs.wireshark.org/bugzilla/show_bug.cgi?id=3232 In my understanding there is no legal problem in putting 3-clause BSD code into GPL code. Acutally wireshark includes some 3-clause BSD code: epan/dissectors/packet-radiotap-defs.h: /*- * Copyright (c) 2003, 2004 David Young. All rights reserved. * * $Id: packet-radiotap-defs.h 34554 2010-10-18 13:24:10Z morriss $ * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. 
Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. The name of David Young may not be used to endorse or promote * products derived from this software without specific prior * written permission. * * THIS SOFTWARE IS PROVIDED BY DAVID YOUNG ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A * PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL DAVID * YOUNG BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED * TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY * OF SUCH DAMAGE. */ I'd like to separate the legal issue and preference. I think I understand the importance of preference of upstream developers. However, I'd like to clear the legal issue first. I can image there are people who prefer to GPL as the license covering their software. But here I've taken some corosync code in my dissector. It is essential part of my dissector. And corosync is licensed in 3-clause BSD, as you know. I'd like to change the license to merge my code to upstream project. I cannot do it in this context. See https://bugs.wireshark.org/bugzilla/show_bug.cgi?id=3232#c13 Thank you. From jfriesse at redhat.com Tue Dec 14 14:51:17 2010 From: jfriesse at redhat.com (Jan Friesse) Date: Tue, 14 Dec 2010 15:51:17 +0100 Subject: [Linux-cluster] [Openais] packet dissectors for totempg, cman, clvmd, rgmanager, cpg, In-Reply-To: <20101214.231525.648039044490713397.yamato@redhat.com> References: <20100527.133950.593311767624382812.yamato@redhat.com> <20101214.210216.512133496326900668.yamato@redhat.com> <4D0776DC.9080003@redhat.com> <20101214.231525.648039044490713397.yamato@redhat.com> Message-ID: <4D078465.3020509@redhat.com> Masatake, Masatake YAMATO napsal(a): > I'd like to your advice more detail seriously. > I've been developing this code for three years. > I don't want to make this code garbage. > >> Masatake, >> I'm pretty sure that biggest problem of your code was that it was >> licensed under BSD (three clause, same as Corosync has) >> license. Wireshark is licensed under GPL and even I like BSD licenses >> much more, I would recommend you to try to relicense code under GPL >> and send them this code. >> >> Regards, >> Honza > > I got the similar comment from wireshark developer. > > Please, read the discussion: > https://bugs.wireshark.org/bugzilla/show_bug.cgi?id=3232 > > I've read that thread long time before I've sent previous mail, so thats reason why I think that Wireshark developers just feel MUCH more comfortable with GPL and thats reason why they just ignoring it. > In my understanding there is no legal problem in putting 3-clause BSD > code into GPL code. Acutally wireshark includes some 3-clause BSD > code: > Actually there is really not. BSD to GPL works without problem, but many people just don't know it... 
> epan/dissectors/packet-radiotap-defs.h: > /*- > * Copyright (c) 2003, 2004 David Young. All rights reserved. > * > * $Id: packet-radiotap-defs.h 34554 2010-10-18 13:24:10Z morriss $ > * > * Redistribution and use in source and binary forms, with or without > * modification, are permitted provided that the following conditions > * are met: > * 1. Redistributions of source code must retain the above copyright > * notice, this list of conditions and the following disclaimer. > * 2. Redistributions in binary form must reproduce the above copyright > * notice, this list of conditions and the following disclaimer in the > * documentation and/or other materials provided with the distribution. > * 3. The name of David Young may not be used to endorse or promote > * products derived from this software without specific prior > * written permission. > * > * THIS SOFTWARE IS PROVIDED BY DAVID YOUNG ``AS IS'' AND ANY > * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, > * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A > * PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL DAVID > * YOUNG BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, > * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED > * TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, > * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND > * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, > * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY > * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY > * OF SUCH DAMAGE. > */ > > I'd like to separate the legal issue and preference. > I think I understand the importance of preference of upstream > developers. However, I'd like to clear the legal > issue first. > Legally it's ok. But as you said, developers preference are different. And because you are trying to change THEIR code it's sometimes better to play they rules. > > I can image there are people who prefer to GPL as the license covering > their software. But here I've taken some corosync code in my > dissector. It is essential part of my dissector. And corosync is ^^^ This may be problem. Question is how big is that part and if it can be possible to make exception there. Can you point that code? Steve, we were able to relicense HUGE portion of code in case of libqb, are we able to make the same for Wireshark dissector? > licensed in 3-clause BSD, as you know. I'd like to change the license > to merge my code to upstream project. I cannot do it in this context. > > See https://bugs.wireshark.org/bugzilla/show_bug.cgi?id=3232#c13 > > Thank you. Regards, Honza From yamato at redhat.com Tue Dec 14 15:04:29 2010 From: yamato at redhat.com (Masatake YAMATO) Date: Wed, 15 Dec 2010 00:04:29 +0900 (JST) Subject: [Linux-cluster] [Openais] packet dissectors for totempg, cman, clvmd, rgmanager, cpg, In-Reply-To: <4D078465.3020509@redhat.com> References: <4D0776DC.9080003@redhat.com> <20101214.231525.648039044490713397.yamato@redhat.com> <4D078465.3020509@redhat.com> Message-ID: <20101215.000429.721897046580218183.yamato@redhat.com> Thank you for replying. > Masatake, > > Masatake YAMATO napsal(a): >> I'd like to your advice more detail seriously. >> I've been developing this code for three years. >> I don't want to make this code garbage. >> >>> Masatake, >>> I'm pretty sure that biggest problem of your code was that it was >>> licensed under BSD (three clause, same as Corosync has) >>> license. 
Wireshark is licensed under GPL and even I like BSD licenses >>> much more, I would recommend you to try to relicense code under GPL >>> and send them this code. >>> >>> Regards, >>> Honza >> I got the similar comment from wireshark developer. >> Please, read the discussion: >> https://bugs.wireshark.org/bugzilla/show_bug.cgi?id=3232 >> > > I've read that thread long time before I've sent previous mail, so > thats reason why I think that Wireshark developers just feel MUCH more > comfortable with GPL and thats reason why they just ignoring it. I see. >> In my understanding there is no legal problem in putting 3-clause BSD >> code into GPL code. Acutally wireshark includes some 3-clause BSD >> code: >> > > Actually there is really not. BSD to GPL works without problem, but > many people just don't know it... ...it is too bad. I strongly believe FOSS developers should know the intent behind of the both licenses. >> epan/dissectors/packet-radiotap-defs.h: >> /*- >> * Copyright (c) 2003, 2004 David Young. All rights reserved. >> * >> * $Id: packet-radiotap-defs.h 34554 2010-10-18 13:24:10Z morriss $ >> * >> * Redistribution and use in source and binary forms, with or without >> * modification, are permitted provided that the following conditions >> * are met: >> * 1. Redistributions of source code must retain the above copyright >> * notice, this list of conditions and the following disclaimer. >> * 2. Redistributions in binary form must reproduce the above copyright >> * notice, this list of conditions and the following disclaimer in the >> * documentation and/or other materials provided with the distribution. >> * 3. The name of David Young may not be used to endorse or promote >> * products derived from this software without specific prior >> * written permission. >> * >> * THIS SOFTWARE IS PROVIDED BY DAVID YOUNG ``AS IS'' AND ANY >> * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, >> * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A >> * PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL DAVID >> * YOUNG BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, >> * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED >> * TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, >> * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND >> * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, >> * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY >> * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY >> * OF SUCH DAMAGE. >> */ >> I'd like to separate the legal issue and preference. I think I >> understand the importance of preference of upstream >> developers. However, I'd like to clear the legal issue first. >> > > Legally it's ok. But as you said, developers preference are > different. And because you are trying to change THEIR code it's > sometimes better to play they rules. I see. >> I can image there are people who prefer to GPL as the license covering >> their software. But here I've taken some corosync code in my >> dissector. It is essential part of my dissector. And corosync is > > ^^^ This may be problem. Question is how big is that part and if it > can be possible to make exception there. Can you point that code? > > Steve, we were able to relicense HUGE portion of code in case of > libqb, are we able to make the same for Wireshark dissector? 
Could you see https://github.com/masatake/wireshark-plugin-rhcs/blob/master/src/packet-corosync-totemnet.c#L156 I refer totemnet.c to write dissect_corosynec_totemnet_with_decryption() function. >> licensed in 3-clause BSD, as you know. I'd like to change the license >> to merge my code to upstream project. I cannot do it in this context. >> See https://bugs.wireshark.org/bugzilla/show_bug.cgi?id=3232#c13 >> Thank you. > > Regards, > Honza Masatake YAMATO From sdake at redhat.com Tue Dec 14 16:28:57 2010 From: sdake at redhat.com (Steven Dake) Date: Tue, 14 Dec 2010 09:28:57 -0700 Subject: [Linux-cluster] [Openais] packet dissectors for totempg, cman, clvmd, rgmanager, cpg, In-Reply-To: <4D078465.3020509@redhat.com> References: <20100527.133950.593311767624382812.yamato@redhat.com> <20101214.210216.512133496326900668.yamato@redhat.com> <4D0776DC.9080003@redhat.com> <20101214.231525.648039044490713397.yamato@redhat.com> <4D078465.3020509@redhat.com> Message-ID: <4D079B49.8070009@redhat.com> On 12/14/2010 07:51 AM, Jan Friesse wrote: > Masatake, > > Masatake YAMATO napsal(a): >> I'd like to your advice more detail seriously. >> I've been developing this code for three years. >> I don't want to make this code garbage. >> >>> Masatake, >>> I'm pretty sure that biggest problem of your code was that it was >>> licensed under BSD (three clause, same as Corosync has) >>> license. Wireshark is licensed under GPL and even I like BSD licenses >>> much more, I would recommend you to try to relicense code under GPL >>> and send them this code. >>> >>> Regards, >>> Honza >> >> I got the similar comment from wireshark developer. >> >> Please, read the discussion: >> https://bugs.wireshark.org/bugzilla/show_bug.cgi?id=3232 >> >> > > I've read that thread long time before I've sent previous mail, so thats > reason why I think that Wireshark developers just feel MUCH more > comfortable with GPL and thats reason why they just ignoring it. > >> In my understanding there is no legal problem in putting 3-clause BSD >> code into GPL code. Acutally wireshark includes some 3-clause BSD >> code: >> > > Actually there is really not. BSD to GPL works without problem, but many > people just don't know it... > >> epan/dissectors/packet-radiotap-defs.h: >> /*- >> * Copyright (c) 2003, 2004 David Young. All rights reserved. >> * >> * $Id: packet-radiotap-defs.h 34554 2010-10-18 13:24:10Z morriss $ >> * >> * Redistribution and use in source and binary forms, with or without >> * modification, are permitted provided that the following conditions >> * are met: >> * 1. Redistributions of source code must retain the above copyright >> * notice, this list of conditions and the following disclaimer. >> * 2. Redistributions in binary form must reproduce the above copyright >> * notice, this list of conditions and the following disclaimer in the >> * documentation and/or other materials provided with the >> distribution. >> * 3. The name of David Young may not be used to endorse or promote >> * products derived from this software without specific prior >> * written permission. >> * >> * THIS SOFTWARE IS PROVIDED BY DAVID YOUNG ``AS IS'' AND ANY >> * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, >> * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A >> * PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL DAVID >> * YOUNG BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, >> * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED >> * TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, >> * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND >> * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, >> * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY >> * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY >> * OF SUCH DAMAGE. >> */ >> >> I'd like to separate the legal issue and preference. I think I >> understand the importance of preference of upstream developers. >> However, I'd like to clear the legal issue first. >> > > Legally it's ok. But as you said, developers preference are different. > And because you are trying to change THEIR code it's sometimes better to > play they rules. > >> >> I can image there are people who prefer to GPL as the license covering >> their software. But here I've taken some corosync code in my >> dissector. It is essential part of my dissector. And corosync is > > ^^^ This may be problem. Question is how big is that part and if it can > be possible to make exception there. Can you point that code? > > Steve, we were able to relicense HUGE portion of code in case of libqb, > are we able to make the same for Wireshark dissector? > >> licensed in 3-clause BSD, as you know. I'd like to change the license >> to merge my code to upstream project. I cannot do it in this context. >> >> See https://bugs.wireshark.org/bugzilla/show_bug.cgi?id=3232#c13 >> >> Thank you. > > Regards, > Honza I am not changing corosync license to GPL. I think the separate plugin works fine, and we can even take up packaging of it in fedora and Red Hat variants, if it is maintained in an upstream repo. Regards -steve From sdake at redhat.com Tue Dec 14 16:31:55 2010 From: sdake at redhat.com (Steven Dake) Date: Tue, 14 Dec 2010 09:31:55 -0700 Subject: [Linux-cluster] [Openais] packet dissectors for totempg, cman, clvmd, rgmanager, cpg, In-Reply-To: <20101214.210216.512133496326900668.yamato@redhat.com> References: <20100527.132034.642044848160830535.yamato@redhat.com> <4BFDF51C.4000808@redhat.com> <20100527.133950.593311767624382812.yamato@redhat.com> <20101214.210216.512133496326900668.yamato@redhat.com> Message-ID: <4D079BFB.9050906@redhat.com> On 12/14/2010 05:02 AM, Masatake YAMATO wrote: > https://bugs.wireshark.org/bugzilla/show_bug.cgi?id=3232 > > The patch for wireshark is not merged yet. Reviewing is very slow or > the patch may be rejected implicitly. > > So I decide to provide my dissector as a dyanamic loadable plugin. It > can be built on Fedora-14 with wireshark-devel package. > > > You, working on RHEL6, may not be interested in ccsd. > > > https://github.com/masatake/wireshark-plugin-rhcs > > I will maintain this source tree. Forking are welcome. > > I'd like to make it a rpm package and be available as a part of > Fedora. But I cannot find enough time to be a package maintainer > now. > > Masatake YAMATO > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Masatake, I'll hunt around for a package maintainer and get back to you. Generally this person's role is to package new upstream releases in Fedora and Red Hat derivatives. You would do the upstream releasing/maintenance part. 
Regards -steve From jfriesse at redhat.com Tue Dec 14 16:50:20 2010 From: jfriesse at redhat.com (Jan Friesse) Date: Tue, 14 Dec 2010 17:50:20 +0100 Subject: [Linux-cluster] [Openais] packet dissectors for totempg, cman, clvmd, rgmanager, cpg, In-Reply-To: <4D079B49.8070009@redhat.com> References: <20100527.133950.593311767624382812.yamato@redhat.com> <20101214.210216.512133496326900668.yamato@redhat.com> <4D0776DC.9080003@redhat.com> <20101214.231525.648039044490713397.yamato@redhat.com> <4D078465.3020509@redhat.com> <4D079B49.8070009@redhat.com> Message-ID: <4D07A04C.9020703@redhat.com> Steven Dake napsal(a): > On 12/14/2010 07:51 AM, Jan Friesse wrote: >> Masatake, >> .... >>> Thank you. >> Regards, >> Honza > > > I am not changing corosync license to GPL. I think the separate plugin > works fine, and we can even take up packaging of it in fedora and Red > Hat variants, if it is maintained in an upstream repo. > > Regards > -steve Steve, I'm not talking about relicensing corosync (it doesn't make any sense and I would be first against that), but give permissions to that portion of code (seems to be more or less header files) to use GPL (which also seems to me like old version without support for NSS). It's same as what we did for libqb. Separate plugin works fine for Fedora, but I'm not sure if it works also for other distributions. Regards, Honza From bernardchew at gmail.com Wed Dec 15 09:29:01 2010 From: bernardchew at gmail.com (Bernard Chew) Date: Wed, 15 Dec 2010 17:29:01 +0800 Subject: [Linux-cluster] RHCS & snmpd[30622]: Received SNMP packet(s) from UDP In-Reply-To: References: Message-ID: > On Tue, Dec 14, 2010 at 4:04 PM, Rajagopal Swaminathan wrote: > Greetings, > > On Tue, Dec 14, 2010 at 11:58 AM, Hofmeister, James (WTEC Linux) > wrote: >> Hello folks, >> >> Dec 10 07:18:57 host1 snmpd[30622]: Connection from UDP: [127.0.0.1]:49231 >> Dec 10 07:18:57 host1 snmpd[30622]: Received SNMP packet(s) from UDP: [127.0.0.1]:49231 >> >> I see this quite often in RHCS clusters and have not determined if the source is a function of RHCS or if this is one of the HP health agents. >> >> I am aware that these messages can be turned off in the SNMP configuration, I am more interested in the source. >> >> Any feedback would be appreciated. >> > > Looks to me like the HP agent. only the agent shouts peridically > usually in snmp... > > /scurries Hmmm.... where are my snmp notes? > > > Regards, > > Rajagopal > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > Hi, Try using tcpdump to check the source Regards, Bernard Chew From lhh at redhat.com Wed Dec 15 22:14:49 2010 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 15 Dec 2010 17:14:49 -0500 Subject: [Linux-cluster] RHCS & snmpd[30622]: Received SNMP packet(s) from UDP In-Reply-To: References: Message-ID: <1292451289.3118.2.camel@localhost.localdomain> On Tue, 2010-12-14 at 06:28 +0000, Hofmeister, James (WTEC Linux) wrote: > Hello folks, > > RE: SNMP packets from local host. > > 6-10 per minute > > Dec 10 07:18:57 host1 snmpd[30622]: Connection from UDP: [127.0.0.1]:49231 > Dec 10 07:18:57 host1 snmpd[30622]: Received SNMP packet(s) from UDP: [127.0.0.1]:49231 > > I see this quite often in RHCS clusters and have not determined if the source is a function of RHCS or if this is one of the HP health agents. > > I am aware that these messages can be turned off in the SNMP configuration, I am more interested in the source. 
> Linux-cluster doesn't generate traps/notifications at this point, so I'd guess the HP agent :) -- Lon From james.hofmeister at hp.com Wed Dec 15 22:41:52 2010 From: james.hofmeister at hp.com (Hofmeister, James (WTEC Linux)) Date: Wed, 15 Dec 2010 22:41:52 +0000 Subject: [Linux-cluster] RHCS & snmpd[30622]: Received SNMP packet(s) from UDP In-Reply-To: <1292451289.3118.2.camel@localhost.localdomain> References: <1292451289.3118.2.camel@localhost.localdomain> Message-ID: Hello Lon, all, |Linux-cluster doesn't generate traps/notifications at this point, so I'd |guess the HP agent :) |-- Lon Yep, we found the HP Health agent (cmahostd) that quit sending SNMP messages during the cluster hang: Dec 10 07:22:24 dm73sr02 kernel: INFO: task cmahostd:31542 blocked for more than 120 seconds. Dec 10 07:22:24 dm73sr02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Dec 10 07:22:24 dm73sr02 kernel: cmahostd D ffffffff801508e3 0 31542 1 31576 31540 (NOTLB) Dec 10 07:22:24 dm73sr02 kernel: ffff810c3b889cf8 0000000000000086 0000000000000018 ffffffff884414f8 Dec 10 07:22:24 dm73sr02 kernel: 0000000000000292 000000000000000a ffff810c3f54a820 ffff810c4e1b6040 Dec 10 07:22:24 dm73sr02 kernel: 00007122b167f658 0000000000bb9ecb ffff810c3f54aa08 0000000888442e5f Call Trace: [] :dlm:request_lock+0x93/0xa0 [] :gfs2:just_schedule+0x0/0xe [] :gfs2:just_schedule+0x9/0xe [] __wait_on_bit+0x40/0x6e [] :gfs2:just_schedule+0x0/0xe [] out_of_line_wait_on_bit+0x6c/0x78 [] wake_bit_function+0x0/0x23 [] :gfs2:gfs2_glock_wait+0x2b/0x30 [] :gfs2:gfs2_getattr+0x85/0xc4 [] :gfs2:gfs2_getattr+0x7d/0xc4 [] vfs_getattr+0x2d/0xa9 [] vfs_stat_fd+0x32/0x4a [] free_pages_and_swap_cache+0x67/0x7e [] sys32_stat64+0x11/0x29 [] sysenter_tracesys+0x48/0x83 [] sysenter_do_call+0x1e/0x76 Regards, ????? James Hofmeister? Hewlett Packard Linux Solutions Engineer From kmaguire at eso.org Wed Dec 15 23:47:23 2010 From: kmaguire at eso.org (Kevin Maguire) Date: Thu, 16 Dec 2010 00:47:23 +0100 (CET) Subject: [Linux-cluster] GFS tuning for combined batch / interactive use Message-ID: Hi We are running a 20 node cluster, using Scientific Linux 5.3, with a GFS shared filesystem hosted on our SAN. Cluster nodes are dual core units with 4 GB of RAM, and a standard Qlogic FC HBA. Most of the 20 nodes form a batch-processing cluster, and our users are happy enough with the performance they get, but some nodes are used interactively. When the filesystem is under stress due to large batch processing jobs running on other nodes, interactive use becomes very slow and painful. Is there any tuning I (the sysadmin) can do that might help in this situation? Would a migration to gfs2 make a difference? Are all nodes treated identically, or can hosts mounting the filesystem have any kind of priority/QoS? Which tools could I use to track down any bottlenecks? 
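For what it is worth, the observation points that come up in the replies below can be summarised as a short checklist; /mygfs stands in for the real mount point and "mygfs" for the filesystem's lock table / lockspace name:

mount -t debugfs none /sys/kernel/debug        # expose the DLM state
cat /sys/kernel/debug/dlm/mygfs_waiters        # lock requests currently blocked
gfs_tool counters /mygfs                       # running totals for this mount
gfs_tool lockdump /mygfs > /tmp/glocks.txt     # per-glock detail (holders/waiters)
gfs_tool gettune /mygfs                        # current tunable values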
In theory we could update kernel+gfs bits to a later release, though we saw the same issues when using the same cluster with a SL4.x stack, but for now it's kernel-2.6.18-128.1.1.el5.i686 kmod-gfs-0.1.31-3.el5.i686 gfs-utils-0.1.20-7.el5.i386 gfs2-utils-0.1.53-1.el5_3.1.i386 Thanks for any help/suggestions, Kevin From swhiteho at redhat.com Thu Dec 16 10:53:49 2010 From: swhiteho at redhat.com (Steven Whitehouse) Date: Thu, 16 Dec 2010 10:53:49 +0000 Subject: [Linux-cluster] GFS tuning for combined batch / interactive use In-Reply-To: References: Message-ID: <1292496829.2427.5.camel@dolmen> Hi, On Thu, 2010-12-16 at 00:47 +0100, Kevin Maguire wrote: > Hi > > We are running a 20 node cluster, using Scientific Linux 5.3, with a GFS > shared filesystem hosted on our SAN. Cluster nodes are dual core units > with 4 GB of RAM, and a standard Qlogic FC HBA. > > Most of the 20 nodes form a batch-processing cluster, and our users are > happy enough with the performance they get, but some nodes are used > interactively. When the filesystem is under stress due to large batch > processing jobs running on other nodes, interactive use becomes very slow > and painful. > > Is there any tuning I (the sysadmin) can do that might help in this > situation? Would a migration to gfs2 make a difference? Are all nodes > treated identically, or can hosts mounting the filesystem have any kind of > priority/QoS? Which tools could I use to track down any bottlenecks? > There are no priority/QoS controls currently available to the users, I'm afraid. All nodes are treated equally as you say. I suspect that the reason that interactive use becomes slow is just down to locality of accesses. The GFS locking is done on a per-inode basis, so where writes are going on to an inode, ensuring that reads to that same inode are also done on the same node as much as possible should improve performance. In other words, it would be better to divide up jobs in the cluster according to the data which they access rather than according to whether they are interactive or not. Are you using mmap() at all? If so then GFS2 should be significantly more scalable than GFS, Steve. > In theory we could update kernel+gfs bits to a later release, though we > saw the same issues when using the same cluster with a SL4.x stack, but > for now it's > > kernel-2.6.18-128.1.1.el5.i686 > kmod-gfs-0.1.31-3.el5.i686 > gfs-utils-0.1.20-7.el5.i386 > gfs2-utils-0.1.53-1.el5_3.1.i386 > > Thanks for any help/suggestions, > Kevin > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From rpeterso at redhat.com Thu Dec 16 15:09:39 2010 From: rpeterso at redhat.com (Bob Peterson) Date: Thu, 16 Dec 2010 10:09:39 -0500 (EST) Subject: [Linux-cluster] GFS tuning for combined batch / interactive use In-Reply-To: <1575839559.1257921292511766147.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: <576470091.1259051292512179726.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> ----- "Kevin Maguire" wrote: | Hi | | We are running a 20 node cluster, using Scientific Linux 5.3, with a | GFS | shared filesystem hosted on our SAN. Cluster nodes are dual core units | | with 4 GB of RAM, and a standard Qlogic FC HBA. | | Most of the 20 nodes form a batch-processing cluster, and our users | are | happy enough with the performance they get, but some nodes are used | interactively. 
When the filesystem is under stress due to large batch | | processing jobs running on other nodes, interactive use becomes very | slow | and painful. | | Is there any tuning I (the sysadmin) can do that might help in this | situation? Would a migration to gfs2 make a difference? Are all nodes | | treated identically, or can hosts mounting the filesystem have any | kind of | priority/QoS? Which tools could I use to track down any bottlenecks? | | In theory we could update kernel+gfs bits to a later release, though | we | saw the same issues when using the same cluster with a SL4.x stack, | but | for now it's | | kernel-2.6.18-128.1.1.el5.i686 | kmod-gfs-0.1.31-3.el5.i686 | gfs-utils-0.1.20-7.el5.i386 | gfs2-utils-0.1.53-1.el5_3.1.i386 | | Thanks for any help/suggestions, | Kevin Hi Kevin, We recently identified a slowdown in RHEL5.x that involves DLM traffic. There is a patch to speed dlm up, and it's being tested now. The patch is built into RHEL5 kernels starting with 2.6.18-232 and newer. That means it is currently scheduled to be released in RHEL5.6. It's also being z-streamed back to 5.5.z, but I don't know when that is scheduled to go out. Unfortunately, since the problem was opened by a customer, the bugzilla record is private to protect the customer's confidential information. The patch is public though. If you are a Red Hat customer, you can probably call Red Hat Support and ask to be put on the list for bugzilla bug 604139 and maybe find out when the fix will be available. There is no guarantee this is what your problem is, and there is no guarantee that the patch will speed you up. But it might be. Regards, Bob Peterson Red Hat File Systems From bturner at redhat.com Thu Dec 16 22:25:56 2010 From: bturner at redhat.com (Ben Turner) Date: Thu, 16 Dec 2010 17:25:56 -0500 (EST) Subject: [Linux-cluster] GFS tuning for combined batch / interactive use In-Reply-To: <997438983.986721292537865284.JavaMail.root@zmail07.collab.prod.int.phx2.redhat.com> Message-ID: <866555507.987141292538356551.JavaMail.root@zmail07.collab.prod.int.phx2.redhat.com> There is some helpful stuff here on the tuning side: http://sources.redhat.com/cluster/wiki/FAQ/GFS#gfs_tuning -b ----- "Bob Peterson" wrote: > ----- "Kevin Maguire" wrote: > | Hi > | > | We are running a 20 node cluster, using Scientific Linux 5.3, with > a > | GFS > | shared filesystem hosted on our SAN. Cluster nodes are dual core > units > | > | with 4 GB of RAM, and a standard Qlogic FC HBA. > | > | Most of the 20 nodes form a batch-processing cluster, and our users > | are > | happy enough with the performance they get, but some nodes are used > > | interactively. When the filesystem is under stress due to large > batch > | > | processing jobs running on other nodes, interactive use becomes > very > | slow > | and painful. > | > | Is there any tuning I (the sysadmin) can do that might help in this > > | situation? Would a migration to gfs2 make a difference? Are all > nodes > | > | treated identically, or can hosts mounting the filesystem have any > | kind of > | priority/QoS? Which tools could I use to track down any > bottlenecks? 
> | > | In theory we could update kernel+gfs bits to a later release, > though > | we > | saw the same issues when using the same cluster with a SL4.x stack, > | but > | for now it's > | > | kernel-2.6.18-128.1.1.el5.i686 > | kmod-gfs-0.1.31-3.el5.i686 > | gfs-utils-0.1.20-7.el5.i386 > | gfs2-utils-0.1.53-1.el5_3.1.i386 > | > | Thanks for any help/suggestions, > | Kevin > > Hi Kevin, > > We recently identified a slowdown in RHEL5.x that involves DLM > traffic. > There is a patch to speed dlm up, and it's being tested now. The > patch is built into RHEL5 kernels starting with 2.6.18-232 and newer. > That means it is currently scheduled to be released in RHEL5.6. > > It's also being z-streamed back to 5.5.z, but I don't know when that > is scheduled to go out. Unfortunately, since the problem was > opened by a customer, the bugzilla record is private to protect the > customer's confidential information. The patch is public though. > If you are a Red Hat customer, you can probably call Red Hat Support > and ask to be put on the list for bugzilla bug 604139 and > maybe find out when the fix will be available. > > There is no guarantee this is what your problem is, and there is > no guarantee that the patch will speed you up. But it might be. > > Regards, > > Bob Peterson > Red Hat File Systems > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From kmaguire at eso.org Fri Dec 17 16:35:16 2010 From: kmaguire at eso.org (Kevin Maguire) Date: Fri, 17 Dec 2010 17:35:16 +0100 (CET) Subject: [Linux-cluster] GFS tuning for combined batch / interactive use In-Reply-To: <866555507.987141292538356551.JavaMail.root@zmail07.collab.prod.int.phx2.redhat.com> References: <866555507.987141292538356551.JavaMail.root@zmail07.collab.prod.int.phx2.redhat.com> Message-ID: Hi Bob/Steven/Ben - many thanks for responding. > There is some helpful stuff here on the tuning side: > > http://sources.redhat.com/cluster/wiki/FAQ/GFS#gfs_tuning Indeed, we have implemented many these suggestions, "fast statfs" is on, -r 2048 was used, quotas off, the cluster interconnect is a dedicated gigabit LAN, hardware RAID (RAID10) on the SAN, and so on. Maybe we are just at the limit of the hardware. I have also asked and it seems the one issue that might cause slowdown, multiple nodes all trying to access the same inode (say all updating files in a common directory), should not happen with our application. I am told that essentially batch jobs will create their own working directory when executing, and work almost exclusively within that subtree. Interactive work is in another tree entirely. However I'd like to double check that - but how? When we looked at Lustre for a similar app there was a /proc interface that you could probe to see what files were being opened/read/written/closed by each connected node - does GFS offer something similar? Would mounting debugfs help me there? Kevin From swhiteho at redhat.com Fri Dec 17 16:50:59 2010 From: swhiteho at redhat.com (Steven Whitehouse) Date: Fri, 17 Dec 2010 16:50:59 +0000 Subject: [Linux-cluster] GFS tuning for combined batch / interactive use In-Reply-To: References: <866555507.987141292538356551.JavaMail.root@zmail07.collab.prod.int.phx2.redhat.com> Message-ID: <1292604659.2461.14.camel@dolmen> Hi, On Fri, 2010-12-17 at 17:35 +0100, Kevin Maguire wrote: > Hi > > Bob/Steven/Ben - many thanks for responding. 
> > > There is some helpful stuff here on the tuning side: > > > > http://sources.redhat.com/cluster/wiki/FAQ/GFS#gfs_tuning > > Indeed, we have implemented many these suggestions, "fast statfs" is on, > -r 2048 was used, quotas off, the cluster interconnect is a dedicated > gigabit LAN, hardware RAID (RAID10) on the SAN, and so on. Maybe we are > just at the limit of the hardware. > > I have also asked and it seems the one issue that might cause slowdown, > multiple nodes all trying to access the same inode (say all updating files > in a common directory), should not happen with our application. I am told > that essentially batch jobs will create their own working directory when > executing, and work almost exclusively within that subtree. Interactive > work is in another tree entirely. > > However I'd like to double check that - but how? When we looked at Lustre > for a similar app there was a /proc interface that you could probe to see > what files were being opened/read/written/closed by each connected node - > does GFS offer something similar? Would mounting debugfs help me there? > > Kevin > You can get a glock dump via debugfs which may show up contention, looks for type 2 glocks which have lots of lock requests queued but not granted. The lock requests (holders) are tagged with the relevant process. In rhel6/upstream there are gfs2 tracepoints which can be used to get information dynamically. These can also give some pointers to the processes involved, Steve. > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From sendmailrajiv at gmail.com Fri Dec 17 17:38:41 2010 From: sendmailrajiv at gmail.com (Rajiv Yadav) Date: Fri, 17 Dec 2010 23:08:41 +0530 Subject: [Linux-cluster] how can install GFS-Cluster on rhel5.2 Message-ID: Hi.. i want to install cluster GFS on two nodes. which package i need installation GFS on RHEL5.2... Please provide full installation and configuration server and client based.. how to use Luci and Ricci.. -- Rajiv Yadav CRIS (An Organization of the Ministry of Railways, Govt. of India) Chanakyapuri,New Delhi - 110021 website:- www.cris.org.in Cell #: +91-9711175683 -------------- next part -------------- An HTML attachment was scrubbed... URL: From bturner at redhat.com Fri Dec 17 17:53:50 2010 From: bturner at redhat.com (Ben Turner) Date: Fri, 17 Dec 2010 12:53:50 -0500 (EST) Subject: [Linux-cluster] how can install GFS-Cluster on rhel5.2 In-Reply-To: <492506111.1047221292608334777.JavaMail.root@zmail07.collab.prod.int.phx2.redhat.com> Message-ID: <1585989071.1047501292608430660.JavaMail.root@zmail07.collab.prod.int.phx2.redhat.com> Here is a link to the documentation, thats prolly the best place to start: http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Cluster_Administration/index.html If you have a support contract with Red Hat I suggest opening a case and one of the support techs can give you more detailed assistance. -b ----- "Rajiv Yadav" wrote: > Hi.. > > i want to install cluster GFS on two nodes. > which package i need installation GFS on RHEL5.2... > Please provide full installation and configuration server and client > based.. > how to use Luci and Ricci.. > -- > > > Rajiv Yadav > CRIS > (An Organization of the Ministry of Railways, Govt. 
of India) > Chanakyapuri,New Delhi - 110021 > website:- www.cris.org.in > Cell #: +91-9711175683 > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From kmaguire at eso.org Fri Dec 17 19:06:58 2010 From: kmaguire at eso.org (Kevin Maguire) Date: Fri, 17 Dec 2010 20:06:58 +0100 (CET) Subject: [Linux-cluster] GFS tuning for combined batch / interactive use In-Reply-To: <1292604659.2461.14.camel@dolmen> References: <866555507.987141292538356551.JavaMail.root@zmail07.collab.prod.int.phx2.redhat.com> <1292604659.2461.14.camel@dolmen> Message-ID: Hi > You can get a glock dump via debugfs which may show up contention, looks > for type 2 glocks which have lots of lock requests queued but not > granted. The lock requests (holders) are tagged with the relevant > process. Note I am currently using GFS, not GFS2. And before going further I ran the ping_pong test on my cluster and see only about 100 locks/second even on just 1 node. So maybe I should look at plock_rate_limit parameter, though not sure if that is our core problem. Anyways, As I write this my test cluster is being heavily used with batch jobs, and thus I have a window of opportunity to study it under load (but not change it). I have debugfs mounted. There are 10 nodes in this test cluster. My filesystem is called mygfs, and was created via mkfs.gfs -O -t dfoxen-cluster:mygfs -p lock_dlm -j 10 -r 2048 /dev/mapper/vggfs-lvgfs This is what I have in debugfs: # find /sys/kernel/debug/ -type f -exec wc -l {} \; 2309 /sys/kernel/debug/dlm/mygfs_locks 0 /sys/kernel/debug/dlm/mygfs_waiters 16258 /sys/kernel/debug/dlm/mygfs 2 /sys/kernel/debug/dlm/clvmd_locks 0 /sys/kernel/debug/dlm/clvmd_waiters 7 /sys/kernel/debug/dlm/clvmd The lock dump file has content like: # cat /sys/kernel/debug/dlm/mygfs_locks id nodeid remid pid xid exflags flags sts grmode rqmode time_ms r_nodeid r_len r_name 14f19eb 0 0 1038 0 0 0 2 3 -1 0 0 24 " 5 cec3e6d" 3da1a67 0 0 31861 0 0 0 2 3 -1 0 0 24 " 5 a0fafc2" 1120003 1 16f0019 3552 0 408 0 2 0 -1 0 1 24 " 3 2d8b9091" af0002 1 10024 3552 0 408 0 2 0 -1 0 1 24 " 3 2053fbf8" ... But I don't really see how to work our which type of lock is which from this file - sorry. Given $2 is the nodeid I can work our who has locks and that leads to a minor strangeness node1 # awk 'NR>1{print $2}' /sys/kernel/debug/dlm/mygfs_locks | sort | uniq -c | sort -k +2n 2142 0 1619 2 2001 3 1586 4 1566 5 1624 6 1610 7 1733 8 1592 9 1612 10 These numbers are much bigger than the counts on the 9 other nodes, e.g. node2 # awk 'NR>1{print $2}' /sys/kernel/debug/dlm/mygfs_locks | sort | uniq -c | sort -k +2n 441 0 1630 1 75 3 2 4 10 5 25 7 15 8 38 10 Is that normal ? Using gfs_tool's lockdump I see node1 # gfs_tool lockdump /newcache | egrep '^Glock' | sed 's?(\([0-9]*\).*)?\1?g' | sort | uniq -c 3 Glock 1 308 Glock 2 1538 Glock 3 2 Glock 4 233 Glock 5 2 Glock 8 Only type 2 and type 5 counts seem to change. Across the cluster there is one node with a lot more (10x more) Glock type 2 and Glock type 5 locks. 
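As an aside, that cross-node comparison can be scripted. The sketch below assumes passwordless ssh between the nodes and that each glock entry in the dump begins with "Glock (type, number)", which is what the sed expression above is stripping down; the node names are placeholders:

for n in node1 node2 node3; do
    echo -n "$n: "
    ssh $n "gfs_tool lockdump /newcache | grep -c '^Glock (2'"
done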
# gfs_tool counters /newcache locks 2313 locks held 781 freeze count 0 incore inodes 230 metadata buffers 1061 unlinked inodes 28 quota IDs 2 incore log buffers 28 log space used 1.46% meta header cache entries 1304 glock dependencies 185 glocks on reclaim list 0 log wraps 91 outstanding LM calls 0 outstanding BIO calls 0 fh2dentry misses 0 glocks reclaimed 2125924 glock nq calls 801437507 glock dq calls 796261692 glock prefetch calls 319835 lm_lock calls 6396763 lm_unlock calls 1031709 lm callbacks 7669741 address operations 1267096416 dentry operations 35815146 export operations 0 file operations 233333825 inode operations 61818196 super operations 148712313 vm operations 87114 block I/O reads 0 block I/O writes 0 Not sure if anyone can make anything from all these numbers ... Thanks, Kevin From swhiteho at redhat.com Fri Dec 17 19:43:53 2010 From: swhiteho at redhat.com (Steven Whitehouse) Date: Fri, 17 Dec 2010 19:43:53 +0000 Subject: [Linux-cluster] GFS tuning for combined batch / interactive use In-Reply-To: References: <866555507.987141292538356551.JavaMail.root@zmail07.collab.prod.int.phx2.redhat.com> <1292604659.2461.14.camel@dolmen> Message-ID: <1292615033.2461.23.camel@dolmen> Hi, On Fri, 2010-12-17 at 20:06 +0100, Kevin Maguire wrote: > Hi > > > You can get a glock dump via debugfs which may show up contention, looks > > for type 2 glocks which have lots of lock requests queued but not > > granted. The lock requests (holders) are tagged with the relevant > > process. > > Note I am currently using GFS, not GFS2. And before going further I ran > the ping_pong test on my cluster and see only about 100 locks/second even > on just 1 node. So maybe I should look at plock_rate_limit parameter, > though not sure if that is our core problem. > The same thing applies to GFS as GFS2, its just that the format of the debugfs file is different. GFS2 uses a rather smaller format which makes a big difference in the dump size for larger machines. The plock_rate_limit can usually be turned off quite safely, but unless your app uses plocks, it won't make any difference to the performance. > Anyways, As I write this my test cluster is being heavily used with batch > jobs, and thus I have a window of opportunity to study it under load (but > not change it). I have debugfs mounted. There are 10 nodes in this test > cluster. My filesystem is called mygfs, and was created via > > mkfs.gfs -O -t dfoxen-cluster:mygfs -p lock_dlm -j 10 -r 2048 /dev/mapper/vggfs-lvgfs > > This is what I have in debugfs: > > # find /sys/kernel/debug/ -type f -exec wc -l {} \; > 2309 /sys/kernel/debug/dlm/mygfs_locks > 0 /sys/kernel/debug/dlm/mygfs_waiters > 16258 /sys/kernel/debug/dlm/mygfs > 2 /sys/kernel/debug/dlm/clvmd_locks > 0 /sys/kernel/debug/dlm/clvmd_waiters > 7 /sys/kernel/debug/dlm/clvmd > > The lock dump file has content like: > > # cat /sys/kernel/debug/dlm/mygfs_locks > id nodeid remid pid xid exflags flags sts grmode rqmode time_ms r_nodeid r_len r_name > 14f19eb 0 0 1038 0 0 0 2 3 -1 0 0 24 " 5 cec3e6d" > 3da1a67 0 0 31861 0 0 0 2 3 -1 0 0 24 " 5 a0fafc2" > 1120003 1 16f0019 3552 0 408 0 2 0 -1 0 1 24 " 3 2d8b9091" > af0002 1 10024 3552 0 408 0 2 0 -1 0 1 24 " 3 2053fbf8" > ... > > But I don't really see how to work our which type of lock is which from > this file - sorry. 
Given $2 is the nodeid I can work our who has locks and > that leads to a minor strangeness > > node1 # awk 'NR>1{print $2}' /sys/kernel/debug/dlm/mygfs_locks | sort | uniq -c | sort -k +2n > 2142 0 > 1619 2 > 2001 3 > 1586 4 > 1566 5 > 1624 6 > 1610 7 > 1733 8 > 1592 9 > 1612 10 > > These numbers are much bigger than the counts on the 9 other nodes, e.g. > > node2 # awk 'NR>1{print $2}' /sys/kernel/debug/dlm/mygfs_locks | sort | uniq -c | sort -k +2n > 441 0 > 1630 1 > 75 3 > 2 4 > 10 5 > 25 7 > 15 8 > 38 10 > > Is that normal ? > > Using gfs_tool's lockdump I see > > node1 # gfs_tool lockdump /newcache | egrep '^Glock' | sed 's?(\([0-9]*\).*)?\1?g' | sort | uniq -c > 3 Glock 1 > 308 Glock 2 > 1538 Glock 3 > 2 Glock 4 > 233 Glock 5 > 2 Glock 8 > > Only type 2 and type 5 counts seem to change. Across the cluster there is > one node with a lot more (10x more) Glock type 2 and Glock type 5 locks. > This lock dump is what you want to look at first. The dlm dumps are really only for when something has gone wrong and you need to check whether dlm has a different idea of what is going on to gfs. The type 2 glocks relate to inodes (as do type 5, but they don't have any bearing on the performance in this case). Type 3 glocks relates to resource groups. It is this gfs_tool lockup output that contains the info that you need. The interesting locks are those which have a number of "Holders" attached to them which are on the "Waiters" queues (i.e. not granted) and the more of those holder there are on a lock, the more interesting it is from a performance point of view. In the case of type 2 glocks, the other part of the glock number, is also the inode number, so when the system it otherwise idle, a find -inum will tell you which inode was causing the problems, provided it wasn't a temporary file, of course :-) If you have access to the Red Hat kbase system, then this is all described in the docs on that site. > # gfs_tool counters /newcache > > locks 2313 > locks held 781 > freeze count 0 > incore inodes 230 > metadata buffers 1061 > unlinked inodes 28 > quota IDs 2 > incore log buffers 28 > log space used 1.46% > meta header cache entries 1304 > glock dependencies 185 > glocks on reclaim list 0 > log wraps 91 > outstanding LM calls 0 > outstanding BIO calls 0 > fh2dentry misses 0 > glocks reclaimed 2125924 > glock nq calls 801437507 > glock dq calls 796261692 > glock prefetch calls 319835 > lm_lock calls 6396763 > lm_unlock calls 1031709 > lm callbacks 7669741 > address operations 1267096416 > dentry operations 35815146 > export operations 0 > file operations 233333825 > inode operations 61818196 > super operations 148712313 > vm operations 87114 > block I/O reads 0 > block I/O writes 0 > > Not sure if anyone can make anything from all these numbers ... > There is nothing that stands out as being a problem there, but the counters are generally not very useful, Steve. > Thanks, > Kevin > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From jeff.sturm at eprize.com Fri Dec 17 22:53:54 2010 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Fri, 17 Dec 2010 17:53:54 -0500 Subject: [Linux-cluster] GFS block size Message-ID: <64D0546C5EBBD147B75DE133D798665F06A12904@hugo.eprize.local> One of our GFS filesystems tends to have a large number of very small files, on average about 1000 bytes each. I realized this week we'd created our filesystems with default options. 
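For context, mkfs.gfs uses a 4096-byte block size by default, and the block size is fixed at mkfs time, so trying a smaller one means re-creating the filesystem. A sketch of such a test filesystem, where the cluster name, lock table name, journal count and device are placeholders:

mkfs.gfs -p lock_dlm -t mycluster:smallfs -j 4 -b 1024 /dev/vg_test/lv_test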
As an experiment on a test system, I've recreated a GFS filesystem with "-b 1024" to reduce overall disk usage and disk bandwidth. Initially, tests look very good-single file creates are less than one millisecond on average (down from about 5ms each). Before I go very far with this, I wanted to ask: Has anyone else experimented with the block size option, and are there any tricks or gotchas to report? (This is with CentOS 5.5, GFS 1.) -Jeff -------------- next part -------------- An HTML attachment was scrubbed... URL: From kmaguire at eso.org Fri Dec 17 23:43:11 2010 From: kmaguire at eso.org (Kevin Maguire) Date: Sat, 18 Dec 2010 00:43:11 +0100 (CET) Subject: [Linux-cluster] GFS tuning for combined batch / interactive use In-Reply-To: <1292615033.2461.23.camel@dolmen> References: <866555507.987141292538356551.JavaMail.root@zmail07.collab.prod.int.phx2.redhat.com> <1292604659.2461.14.camel@dolmen> <1292615033.2461.23.camel@dolmen> Message-ID: Hi Steven: Thanks again. > If you have access to the Red Hat kbase system, then this is all > described in the docs on that site. I do as we have RedHat support for other platforms, just not this one. The docs I found that are worthy of a slow reading are probably: https://access.redhat.com/kb/docs/DOC-41624 https://access.redhat.com/kb/docs/DOC-41485 and maybe https://access.redhat.com/kb/docs/DOC-6533 https://access.redhat.com/kb/docs/DOC-41609 https://access.redhat.com/kb/docs/DOC-34460 https://access.redhat.com/kb/docs/DOC-34401 https://access.redhat.com/kb/docs/DOC-6479 If I missed one that is particularly helpful please let me know. I'll take this back to our software group with all that I have learned! The biggest TBC is whether we give GFS2 a try - the main reason we are not using it now is that we ported all of this from RHEL4 and did not want to change the filesystem at the same time. kevin From linux-cluster at redhat.com Sat Dec 18 08:25:03 2010 From: linux-cluster at redhat.com (Mailbot for etexusa.com) Date: Sat, 18 Dec 2010 00:25:03 -0800 Subject: [Linux-cluster] DSN: delayed () Message-ID: This is a Delivery Status Notification (DSN). After several attempts, I still haven't been able to deliver your message to christopher at aillon.com. I will keep trying for a few more days, but I thought you would want to know. The error was; Can't connect to domain "aillon.com" -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: text/rfc822-headers Size: 481 bytes Desc: not available URL: From linux-cluster at redhat.com Sun Dec 19 11:25:59 2010 From: linux-cluster at redhat.com (Mailbot for etexusa.com) Date: Sun, 19 Dec 2010 03:25:59 -0800 Subject: [Linux-cluster] DSN: failed () Message-ID: This is a Delivery Status Notification (DSN). I was unable to deliver your message to christopher at aillon.com. The error was; Can't connect to domain "aillon.com" -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: text/rfc822-headers Size: 481 bytes Desc: not available URL: From mika68vaan at gmail.com Tue Dec 21 07:55:25 2010 From: mika68vaan at gmail.com (Mika i) Date: Tue, 21 Dec 2010 09:55:25 +0200 Subject: [Linux-cluster] Cluster + NFS + GFS/GFS2 experiences Message-ID: Hi. I am planing to rhel-cluster with nfs service and that's why i am asking little bit experiences about what kind configuration falks has done this kind clusters. About 2-4 servers .. rhel6,cluster, nfs-service.. but how about gfs2, do recommend it or something else? 
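To make the question a bit more concrete, the sort of service I have been sketching is roughly the following - every name, device and address below is a placeholder, and I have not tested any of it yet:

  <service name="nfssvc" autostart="1">
    <clusterfs name="nfsdata" fstype="gfs2" device="/dev/vg_nfs/lv_data"
               mountpoint="/export/data" force_unmount="0">
      <nfsexport name="exports">
        <nfsclient name="lan" target="192.168.0.0/24" options="rw,sync"/>
      </nfsexport>
    </clusterfs>
    <ip address="192.168.0.200" monitor_link="1"/>
  </service>

i.e. a GFS2 mount plus an NFS export plus a floating IP per service. Is that a sane starting point, or would you do it differently?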
-M -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux-cluster at redhat.com Tue Dec 21 10:27:57 2010 From: linux-cluster at redhat.com (Mailbot for etexusa.com) Date: Tue, 21 Dec 2010 02:27:57 -0800 Subject: [Linux-cluster] DSN: failed (Message could not be delivered) Message-ID: This is a Delivery Status Notification (DSN). I was unable to deliver your message to irahotel at otenet.gr. I said (end of message) And they gave me the error; 550 5.7.1 Virus Infected W32.Sality.Q-1 -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: text/rfc822-headers Size: 503 bytes Desc: not available URL: From ram at netcore.co.in Tue Dec 21 15:08:28 2010 From: ram at netcore.co.in (Ram) Date: Tue, 21 Dec 2010 20:38:28 +0530 Subject: [Linux-cluster] How do I add a fence_vmware with system-config-cluster Message-ID: <4D10C2EC.9030800@netcore.co.in> Hello We are using RHEL 5.4 on VM nodes. When I configure the cluster using system-config-cluster I do not get an option of fence_vmware in the User Interface I am able to run fence_vmware successfully on commandline , but I am not sure how do I put in into the cluster.conf Can someone please help with a sample cluster.conf or option to add in system-config-cluster. Thanks Ram From yvette at dbtgroup.com Tue Dec 21 19:31:59 2010 From: yvette at dbtgroup.com (yvette hirth) Date: Tue, 21 Dec 2010 19:31:59 +0000 Subject: [Linux-cluster] question about network config for fencing Message-ID: <4D1100AF.8090302@dbtgroup.com> hi, i've config'ed my DL 380 G6 four ethernet ports so that they are bonded to "bond0", and my ILO devices are on the fence switch (separate switch). i'm 99.9% sure this is wrong. i'm thinking that eth0-2 and the ILO2 port should be on the same subnet, and eth3 on a separate subnet (for the multicast fencing). that way i have fencing on a non-ILO2 ethernet port, and the ILO2 is accessible from my main subnet. could someone please share with me their network config for an HP ILO-based server with four ports? i'd really appreciate it! thanks yvette From emilews2 at csc.com Tue Dec 21 21:03:21 2010 From: emilews2 at csc.com (Evan J Milewski) Date: Tue, 21 Dec 2010 16:03:21 -0500 Subject: [Linux-cluster] Linux-cluster Digest, Vol 80, Issue 20 In-Reply-To: References: Message-ID: I am currently doing this with RHEL5 cluster, using ext3 (or ext4 for you on RHEL6) and basically doing an active/passive config for each service group. My benchmarking of GFS2 performance was abysmal compared to straight ext3/ext4 on a busy NFS server. > Hi. > I am planing to rhel-cluster with nfs service and that's why i am asking > little bit experiences about what > kind configuration falks has done this kind clusters. > > About 2-4 servers .. rhel6,cluster, nfs-service.. but how about gfs2, do > recommend it or something else? > > > -M -------------- next part -------------- An HTML attachment was scrubbed... URL: From bturner at redhat.com Tue Dec 21 21:18:53 2010 From: bturner at redhat.com (Ben Turner) Date: Tue, 21 Dec 2010 16:18:53 -0500 (EST) Subject: [Linux-cluster] How do I add a fence_vmware with system-config-cluster In-Reply-To: <4D10C2EC.9030800@netcore.co.in> Message-ID: <452994304.27339.1292966333033.JavaMail.root@zmail07.collab.prod.int.phx2.redhat.com> Fence_vmware is not configurable with s-c-c, you will have to manually edit the cluster.conf. 
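Roughly speaking you add a fencedevice entry and reference it from each node. A minimal sketch - hostnames, credentials and VM names below are placeholders, and the exact attribute names depend on which fence_vmware version you have, so check them against the agent's man page and the doc below:

  <fencedevices>
    <fencedevice agent="fence_vmware" name="vmware_fence"
                 ipaddr="esx-or-vc.example.com" login="fenceuser" passwd="fencepass"/>
  </fencedevices>

and inside each <clusternode>:

  <fence>
    <method name="1">
      <device name="vmware_fence" port="vm-name-of-this-node"/>
    </method>
  </fence>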
This doc should work for you: http://sources.redhat.com/cluster/wiki/VMware_FencingConfig -Ben ----- Original Message ----- > Hello > We are using RHEL 5.4 on VM nodes. > When I configure the cluster using system-config-cluster I do not get > an > option of fence_vmware in the User Interface > I am able to run fence_vmware successfully on commandline , but I am > not > sure how do I put in into the cluster.conf > > Can someone please help with a sample cluster.conf or option to add in > system-config-cluster. > > > Thanks > Ram > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From bturner at redhat.com Tue Dec 21 21:35:02 2010 From: bturner at redhat.com (Ben Turner) Date: Tue, 21 Dec 2010 16:35:02 -0500 (EST) Subject: [Linux-cluster] question about network config for fencing In-Reply-To: <4D1100AF.8090302@dbtgroup.com> Message-ID: <1876027540.27523.1292967302370.JavaMail.root@zmail07.collab.prod.int.phx2.redhat.com> Some notes on management board style fence devices can be found here: http://sources.redhat.com/cluster/wiki/IPMI_FencingConfig Here is a DOC with an example iLO config: http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Configuration_Example_-_Fence_Devices/index.html It is also recommended that you have your fence devices on the same network as your cluster heartbeat. I have heard an explanation on this at some point as to why this is but I can't remember 100%. I'm sure that you could find it if you dug around on past posts to this list though. Hopes this helps. -b ----- Original Message ----- > hi, > > i've config'ed my DL 380 G6 four ethernet ports so that they are > bonded > to "bond0", and my ILO devices are on the fence switch (separate > switch). > > i'm 99.9% sure this is wrong. > > i'm thinking that eth0-2 and the ILO2 port should be on the same > subnet, > and eth3 on a separate subnet (for the multicast fencing). that way i > have fencing on a non-ILO2 ethernet port, and the ILO2 is accessible > from my main subnet. > > could someone please share with me their network config for an HP > ILO-based server with four ports? i'd really appreciate it! > > thanks > yvette > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From jayesh.shinde at netcore.co.in Wed Dec 22 04:53:17 2010 From: jayesh.shinde at netcore.co.in (jayesh.shinde) Date: Wed, 22 Dec 2010 10:23:17 +0530 Subject: [Linux-cluster] How do I get reiserfs and xfs filesystem options in system-config-cluster ? Message-ID: <4D11843D.4040104@netcore.co.in> Hello , I am configuring redhat cluster suite with RHEL 5.4 , 32 bit architecture I am using the *system-config-cluster* tool for configuring , I have my one SAN partition with *reiserfs* and *xfs* filesystem. While configuring the resources , In *file system *option I am only getting only *"ext2" and "ext3"* in drop down. How do I get reiserfs and xfs filesystem options in drop down ? Is there any updated package for this or do i need to edit the cluster.conf file manually ? Regards Jayesh Shinde -------------- next part -------------- An HTML attachment was scrubbed... URL: From raju.rajsand at gmail.com Wed Dec 22 06:36:48 2010 From: raju.rajsand at gmail.com (Rajagopal Swaminathan) Date: Wed, 22 Dec 2010 06:36:48 +0000 Subject: [Linux-cluster] How do I get reiserfs and xfs filesystem options in system-config-cluster ? 
In-Reply-To: <4D11843D.4040104@netcore.co.in> References: <4D11843D.4040104@netcore.co.in> Message-ID: Greetings, On Wed, Dec 22, 2010 at 4:53 AM, jayesh.shinde wrote: > Hello , > > I am configuring redhat cluster suite with RHEL 5.4 , 32 bit architecture > I have my one SAN partition with reiserfs and xfs? filesystem. To the best of my knowledge, XFS support has just started on RHEL6. I am not sure that ReiserFS was ever supported by Redhat. If you are trying to use those filesystems in the cluster, I don't think they are cluster aware. YMMV. Regards, Rajagopal From jayesh.shinde at netcore.co.in Wed Dec 22 07:27:26 2010 From: jayesh.shinde at netcore.co.in (jayesh.shinde) Date: Wed, 22 Dec 2010 12:57:26 +0530 Subject: [Linux-cluster] How do I get reiserfs and xfs filesystem options in system-config-cluster ? In-Reply-To: References: <4D11843D.4040104@netcore.co.in> Message-ID: <4D11A85E.104@netcore.co.in> Hi Rajagopal I am not clear fully. I will use RHEL 6 . I want some more clarification on below points 1) You mean to say I can't use XFS with cluster ? OR there is no option for XFS with system-config-cluster ? 2) If I edited the cluster.conf file manually for "xfs" will the cluster server work well ? 3) what is work around solution ? Regards Jayesh Shinde On 12/22/2010 12:06 PM, Rajagopal Swaminathan wrote: > Greetings, > > On Wed, Dec 22, 2010 at 4:53 AM, jayesh.shinde > wrote: >> Hello , >> >> I am configuring redhat cluster suite with RHEL 5.4 , 32 bit architecture >> I have my one SAN partition with reiserfs and xfs filesystem. > To the best of my knowledge, XFS support has just started on RHEL6. > > I am not sure that ReiserFS was ever supported by Redhat. > > If you are trying to use those filesystems in the cluster, I don't > think they are cluster aware. > > YMMV. > > Regards, > > Rajagopal > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From raju.rajsand at gmail.com Wed Dec 22 07:50:24 2010 From: raju.rajsand at gmail.com (Rajagopal Swaminathan) Date: Wed, 22 Dec 2010 07:50:24 +0000 Subject: [Linux-cluster] How do I get reiserfs and xfs filesystem options in system-config-cluster ? In-Reply-To: <4D11A85E.104@netcore.co.in> References: <4D11843D.4040104@netcore.co.in> <4D11A85E.104@netcore.co.in> Message-ID: Greetings, On Wed, Dec 22, 2010 at 7:27 AM, jayesh.shinde wrote: > Hi ?Rajagopal > > I am not clear fully. ?I will use RHEL 6 . I want some more clarification on > below points > > 1) You mean to say I can't use XFS with cluster ? XFS/reiserFS is not a cluster aware filesystem like GFS2 or OCFS or GPFS. AFAIK, You cannot use it for multiple hosts concurrantly accessing the filesystem > OR there is no option for XFS with system-config-cluster ? I do not have cluster in front of me to attempt answering that... It can be used in the same sense as ext3/4. > 2) If I edited the cluster.conf file manually for "xfs" will the cluster > server work well ? > > 3) what is work around solution ? > You haven't defined your problem clearly enough. [commercial-plug] available for a fee currently in Mumbai -- mail me in private :) Regards, Rajagopal From rafagriman at gmail.com Wed Dec 22 07:58:03 2010 From: rafagriman at gmail.com (Rafa Griman) Date: Wed, 22 Dec 2010 08:58:03 +0100 Subject: [Linux-cluster] How do I get reiserfs and xfs filesystem options in system-config-cluster ? 
In-Reply-To: <4D11A85E.104@netcore.co.in> References: <4D11843D.4040104@netcore.co.in> <4D11A85E.104@netcore.co.in> Message-ID: Hi :) On Wed, Dec 22, 2010 at 8:27 AM, jayesh.shinde wrote: > Hi ?Rajagopal > > I am not clear fully. ?I will use RHEL 6 . I want some more clarification on > below points > > 1) You mean to say I can't use XFS with cluster ? ?OR there is no option for > XFS with system-config-cluster ? Depends on the type of cluster: - HA cluster: no problem as long as it's active/passive. That is: one server mounts the FS and the other is on standby. If server 1 fails, it releases the FS and server2 mounts it. - shared/clustered filesystem: you'd have to go with CXFS (get in touch with SGI). That is: both servers mount the filesystem at the same time. > 2) If I edited the cluster.conf file manually for "xfs" will the cluster > server work well ? > > 3) what is work around solution ? > > Regards > Jayesh Shinde > > On 12/22/2010 12:06 PM, Rajagopal Swaminathan wrote: >> >> Greetings, >> >> On Wed, Dec 22, 2010 at 4:53 AM, jayesh.shinde >> ?wrote: >>> >>> Hello , >>> >>> I am configuring redhat cluster suite with RHEL 5.4 , 32 bit architecture >>> I have my one SAN partition with reiserfs and xfs ?filesystem. >> >> To the best of my knowledge, XFS support has just started on RHEL6. >> >> I am not sure that ReiserFS was ever supported by Redhat. >> >> If you are trying to use those filesystems in the cluster, I don't >> think they are cluster aware. >> >> YMMV. >> >> Regards, >> >> Rajagopal HTH Rafa From yvette at dbtgroup.com Wed Dec 22 17:27:35 2010 From: yvette at dbtgroup.com (yvette hirth) Date: Wed, 22 Dec 2010 17:27:35 +0000 Subject: [Linux-cluster] gfs2.fsck bug Message-ID: <4D123507.70006@dbtgroup.com> hi, our gfs2 datasets are down; when i try to do a mount i get: [root at DBT1 ~]# mount -a /sbin/mount.gfs2: node not a member of the default fence domain /sbin/mount.gfs2: error mounting lockproto lock_dlm /sbin/mount.gfs2: node not a member of the default fence domain /sbin/mount.gfs2: error mounting lockproto lock_dlm /sbin/mount.gfs2: node not a member of the default fence domain /sbin/mount.gfs2: error mounting lockproto lock_dlm /sbin/mount.gfs2: node not a member of the default fence domain /sbin/mount.gfs2: error mounting lockproto lock_dlm /sbin/mount.gfs2: node not a member of the default fence domain /sbin/mount.gfs2: error mounting lockproto lock_dlm /sbin/mount.gfs2: node not a member of the default fence domain /sbin/mount.gfs2: error mounting lockproto lock_dlm our cluster.conf is consistent across all devices (listed below). so i thought an fsck would fix this, then i get: [root at DBT1 ~]# fsck.gfs2 -fnp /dev/NEWvg/NEWlvTemp (snippage) RG #4909212 (0x4ae89c) free count inconsistent: is 16846 should be 17157 Resource group counts updated Unlinked block 8639983 (0x83d5ef) bitmap fixed. RG #8639976 (0x83d5e8) free count inconsistent: is 65411 should be 65412 Inode count inconsistent: is 20 should be 19 Resource group counts updated Pass5 complete The statfs file is wrong: Current statfs values: blocks: 43324224 (0x2951340) free: 38433917 (0x24a747d) dinodes: 21085 (0x525d) Calculated statfs values: blocks: 43324224 (0x2951340) free: 38466752 (0x24af4c0) dinodes: 21083 (0x525b) The statfs file was fixed. gfs2_fsck: bad write: Bad file descriptor on line 44 of file buf.c i read in https://bugzilla.redhat.com/show_bug.cgi?id=457557 that there is some way of fixing this with gfs2_edit - are there docs available? 
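(so far i've only used gfs2_edit read-only, e.g. printing the superblock with something like

  gfs2_edit -p sb /dev/NEWvg/NEWlvTemp

- i'm assuming -p only prints and doesn't modify anything - but i don't want to start patching blocks by hand without some docs.)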
as we've been having fencing issues, i removed two servers (DBT2/DBT3) from the cluster fencing, and they are not active at this time. would this cause the mount issues? tia for any advice / guidance. yvette our cluster.conf: From bturner at redhat.com Wed Dec 22 17:58:18 2010 From: bturner at redhat.com (Ben Turner) Date: Wed, 22 Dec 2010 12:58:18 -0500 (EST) Subject: [Linux-cluster] How do I get reiserfs and xfs filesystem options in system-config-cluster ? In-Reply-To: Message-ID: <2040486775.37101.1293040698392.JavaMail.root@zmail07.collab.prod.int.phx2.redhat.com> As far as I know RHEL 6 doesn't include system-config-cluster anymore. I suggest you use the luci interface to configure this. When creating a new service in luci you can choose the filesystem resource, this resource handles 9 different filesystems including reiser and XFS as well an an autodetect option. You can also manually edit the cluster.conf file to make these changes. You can use XFS with cluster, the point that others were trying to make is that XFS is not a shared filesystem like GFS and can only be mounted on one node at a time. -Ben ----- Original Message ----- > Hi :) > > On Wed, Dec 22, 2010 at 8:27 AM, jayesh.shinde > wrote: > > Hi Rajagopal > > > > I am not clear fully. I will use RHEL 6 . I want some more > > clarification on > > below points > > > > 1) You mean to say I can't use XFS with cluster ? OR there is no > > option for > > XFS with system-config-cluster ? > > > Depends on the type of cluster: > - HA cluster: no problem as long as it's active/passive. That is: > one server mounts the FS and the other is on standby. If server 1 > fails, it releases the FS and server2 mounts it. > - shared/clustered filesystem: you'd have to go with CXFS (get in > touch with SGI). That is: both servers mount the filesystem at the > same time. > > > > 2) If I edited the cluster.conf file manually for "xfs" will the > > cluster > > server work well ? > > > > 3) what is work around solution ? > > > > Regards > > Jayesh Shinde > > > > On 12/22/2010 12:06 PM, Rajagopal Swaminathan wrote: > >> > >> Greetings, > >> > >> On Wed, Dec 22, 2010 at 4:53 AM, jayesh.shinde > >> wrote: > >>> > >>> Hello , > >>> > >>> I am configuring redhat cluster suite with RHEL 5.4 , 32 bit > >>> architecture > >>> I have my one SAN partition with reiserfs and xfs filesystem. > >> > >> To the best of my knowledge, XFS support has just started on RHEL6. > >> > >> I am not sure that ReiserFS was ever supported by Redhat. > >> > >> If you are trying to use those filesystems in the cluster, I don't > >> think they are cluster aware. > >> > >> YMMV. 
> >> > >> Regards, > >> > >> Rajagopal > > > HTH > > Rafa > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From rpeterso at redhat.com Wed Dec 22 18:10:24 2010 From: rpeterso at redhat.com (Bob Peterson) Date: Wed, 22 Dec 2010 13:10:24 -0500 (EST) Subject: [Linux-cluster] gfs2.fsck bug In-Reply-To: <4D123507.70006@dbtgroup.com> Message-ID: <2139528362.53169.1293041424887.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> ----- Original Message ----- | hi, | | our gfs2 datasets are down; when i try to do a mount i get: | | [root at DBT1 ~]# mount -a | /sbin/mount.gfs2: node not a member of the default fence domain | /sbin/mount.gfs2: error mounting lockproto lock_dlm | /sbin/mount.gfs2: node not a member of the default fence domain | /sbin/mount.gfs2: error mounting lockproto lock_dlm | /sbin/mount.gfs2: node not a member of the default fence domain | /sbin/mount.gfs2: error mounting lockproto lock_dlm | /sbin/mount.gfs2: node not a member of the default fence domain | /sbin/mount.gfs2: error mounting lockproto lock_dlm | /sbin/mount.gfs2: node not a member of the default fence domain | /sbin/mount.gfs2: error mounting lockproto lock_dlm | /sbin/mount.gfs2: node not a member of the default fence domain | /sbin/mount.gfs2: error mounting lockproto lock_dlm | | our cluster.conf is consistent across all devices (listed below). | | so i thought an fsck would fix this, then i get: | | [root at DBT1 ~]# fsck.gfs2 -fnp /dev/NEWvg/NEWlvTemp | (snippage) | RG #4909212 (0x4ae89c) free count inconsistent: is 16846 should be | 17157 | Resource group counts updated | Unlinked block 8639983 (0x83d5ef) bitmap fixed. | RG #8639976 (0x83d5e8) free count inconsistent: is 65411 should be | 65412 | Inode count inconsistent: is 20 should be 19 | Resource group counts updated | Pass5 complete | The statfs file is wrong: | | Current statfs values: | blocks: 43324224 (0x2951340) | free: 38433917 (0x24a747d) | dinodes: 21085 (0x525d) | | Calculated statfs values: | blocks: 43324224 (0x2951340) | free: 38466752 (0x24af4c0) | dinodes: 21083 (0x525b) | The statfs file was fixed. | | gfs2_fsck: bad write: Bad file descriptor on line 44 of file buf.c | | i read in https://bugzilla.redhat.com/show_bug.cgi?id=457557 that | there | is some way of fixing this with gfs2_edit - are there docs available? Hi Yvette, There is not enough information to know whether or not this may be fixed easily with gfs2_edit since I don't know what block it's failing on when you run fsck.gfs2. What version of fsck.gfs2 are you running? Are you running the version from my people page? If not, you could try it. 
http://people.redhat.com/rpeterso/Experimental/RHEL5.x/gfs2/fsck.gfs2 Regards, Bob Peterson Red Hat File Systems From bturner at redhat.com Wed Dec 22 19:15:43 2010 From: bturner at redhat.com (Ben Turner) Date: Wed, 22 Dec 2010 14:15:43 -0500 (EST) Subject: [Linux-cluster] gfs2.fsck bug In-Reply-To: <4D123507.70006@dbtgroup.com> Message-ID: <1847865255.38090.1293045343379.JavaMail.root@zmail07.collab.prod.int.phx2.redhat.com> My responses inline: > hi, > > our gfs2 datasets are down; when i try to do a mount i get: > > [root at DBT1 ~]# mount -a > /sbin/mount.gfs2: node not a member of the default fence domain > /sbin/mount.gfs2: error mounting lockproto lock_dlm > /sbin/mount.gfs2: node not a member of the default fence domain > /sbin/mount.gfs2: error mounting lockproto lock_dlm > /sbin/mount.gfs2: node not a member of the default fence domain > /sbin/mount.gfs2: error mounting lockproto lock_dlm > /sbin/mount.gfs2: node not a member of the default fence domain > /sbin/mount.gfs2: error mounting lockproto lock_dlm > /sbin/mount.gfs2: node not a member of the default fence domain > /sbin/mount.gfs2: error mounting lockproto lock_dlm > /sbin/mount.gfs2: node not a member of the default fence domain > /sbin/mount.gfs2: error mounting lockproto lock_dlm This makes me think the node trying to mount your GFS FS is not currently a member of the cluster. Check cman_tool services on all nodes, everything should be in the state NONE. If it is not then there is prolly a membership issue. > > our cluster.conf is consistent across all devices (listed below). > > so i thought an fsck would fix this, then i get: > > [root at DBT1 ~]# fsck.gfs2 -fnp /dev/NEWvg/NEWlvTemp > (snippage) > RG #4909212 (0x4ae89c) free count inconsistent: is 16846 should be > 17157 > Resource group counts updated > Unlinked block 8639983 (0x83d5ef) bitmap fixed. > RG #8639976 (0x83d5e8) free count inconsistent: is 65411 should be > 65412 > Inode count inconsistent: is 20 should be 19 > Resource group counts updated > Pass5 complete > The statfs file is wrong: > > Current statfs values: > blocks: 43324224 (0x2951340) > free: 38433917 (0x24a747d) > dinodes: 21085 (0x525d) > > Calculated statfs values: > blocks: 43324224 (0x2951340) > free: 38466752 (0x24af4c0) > dinodes: 21083 (0x525b) > The statfs file was fixed. > > gfs2_fsck: bad write: Bad file descriptor on line 44 of file buf.c > > i read in https://bugzilla.redhat.com/show_bug.cgi?id=457557 that > there > is some way of fixing this with gfs2_edit - are there docs available? There is a development version of fsck that I have had success fixing several issue with. It can be found at: http://people.redhat.com/rpeterso/Experimental/RHEL5.x/gfs2/ I can't comment on the gfs2_edit procedure, maybe someone else on the list can comment here if that is a better idea than the experimental gfs2 fsck. > > as we've been having fencing issues, i removed two servers (DBT2/DBT3) > from the cluster fencing, and they are not active at this time. would > this cause the mount issues? I see you removed the fence devices from: If there was a fence event on this node I could see that as a cause for not being able to mount GFS. Any time there is lost heartbeat all cluster resources will remain frozen until there is a successful fence, without a fence device you should see failed fence messages all through the logs. > tia for any advice / guidance. 
> > yvette > > our cluster.conf: > > > > post_join_delay="1"/> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > name="DBT0_ILO2" passwd="foo"/> > name="DEV_ILO2" passwd="foo"/> > name="DBT1_ILO2" passwd="foo"/> > > > > > fsid="19150" fstype="gfs2" mountpoint="/foo0vol002" name="foo0vol002" > options="data=writeback" self_fence="0"/> > fsid="51633" fstype="gfs2" mountpoint="/foo0vol003" name="foo0vol003" > options="data=writeback" self_fence="0"/> > fsid="36294" fstype="gfs2" mountpoint="/foo0vol004" name="foo0vol004" > options="data=writeback" self_fence="0"/> > fsid="48920" fstype="gfs2" mountpoint="/foo0vol005" name="foo0vol005" > options="noatime,noquota,data=writeback" self_fence="0"/> > fsid="24235" fstype="gfs2" mountpoint="/foo0vol000" name="foo0vol000" > options="data=ordered" self_fence="0"/> > fsid="34088" fstype="gfs2" mountpoint="/foo0vol001" name="foo0vol001" > options="data=ordered" self_fence="0"/> > > > token_retransmits_before_loss_const="20"/> > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From rossnick-lists at cybercat.ca Wed Dec 22 19:38:08 2010 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Wed, 22 Dec 2010 14:38:08 -0500 Subject: [Linux-cluster] New cluster : installing... Message-ID: <88BD849351CE45BBA495A52B387E6B73@versa> Hi ! Over the last couple of weeks, I've been playing with the cluster suite and RHEL 6 beta 2, that was availaible. Now, I got a 30 day demo of RHEL 6 to begin the re-installation from scratch for ou soon to be production cluster. With the beta, I had a deamon running, that was clvmd for the cluster logical volume manager daemon. This package doesn't seem to exist anymore. The package lvm2-cluster is on the installation DVD, but I can't seem to install it via yum. I did enabled the High Availability channel to our servers, but it's not in there. I can't seem to find in wich software channel it's located. Can anyone tell me ? From bturner at redhat.com Wed Dec 22 21:14:23 2010 From: bturner at redhat.com (Ben Turner) Date: Wed, 22 Dec 2010 16:14:23 -0500 (EST) Subject: [Linux-cluster] New cluster : installing... In-Reply-To: <88BD849351CE45BBA495A52B387E6B73@versa> Message-ID: <1410843832.39645.1293052463218.JavaMail.root@zmail07.collab.prod.int.phx2.redhat.com> Did you enable resilient storage? On my system: [root at cs-rh6-3 gfs-test-scripts]# yum info lvm2-cluster Loaded plugins: refresh-packagekit, rhnplugin Installed Packages Name : lvm2-cluster Arch : x86_64 Version : 2.02.72 Release : 8.el6_0.3 Size : 581 k Repo : installed >From repo : rhel-x86_64-server-rs-6 Summary : Cluster extensions for userland logical volume management tools URL : http://sources.redhat.com/lvm2 License : GPLv2 Description: Extensions to LVM2 to support clusters. Available Packages Name : lvm2-cluster Arch : x86_64 Version : 2.02.72 Release : 8.el6_0.4 Size : 307 k Repo : rhel-x86_64-server-rs-6 Summary : Cluster extensions for userland logical volume management tools License : GPLv2 Description: Extensions to LVM2 to support clusters. -b ----- Original Message ----- > Hi ! > > Over the last couple of weeks, I've been playing with the cluster > suite and > RHEL 6 beta 2, that was availaible. > > Now, I got a 30 day demo of RHEL 6 to begin the re-installation from > scratch > for ou soon to be production cluster. With the beta, I had a deamon > running, > that was clvmd for the cluster logical volume manager daemon. 
This > package > doesn't seem to exist anymore. > > The package lvm2-cluster is on the installation DVD, but I can't seem > to > install it via yum. I did enabled the High Availability channel to our > servers, but it's not in there. I can't seem to find in wich software > channel it's located. > > Can anyone tell me ? > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From pradhanparas at gmail.com Wed Dec 22 21:21:46 2010 From: pradhanparas at gmail.com (Paras pradhan) Date: Wed, 22 Dec 2010 15:21:46 -0600 Subject: [Linux-cluster] GFS problem Message-ID: Hi, This morning when I rebooted one node out of the 3 nodes cluster, it came back normally but saw repeated INFO of GFS : -- INFO: task gfs2_quotad:7957 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. gfs2_quotad D ffff8800ea0cfd30 0 7957 67 7965 7956 (L-TLB) ffff8800ea0cfcd0 0000000000000246 0000000000000000 ffff8800f996d800 000000000000000a ffff8800fb3a70c0 ffff8800ffceb7e0 000000000000a429 ffff8800fb3a72a8 0000000000000000 Call Trace: [] :dlm:dlm_put_lockspace+0x10/0x1f [] :dlm:dlm_lock+0x117/0x129 [] :lock_dlm:gdlm_ast+0x0/0x311 [] :lock_dlm:gdlm_bast+0x0/0x8d [] :gfs2:just_schedule+0x0/0xe [] :gfs2:just_schedule+0x9/0xe [] __wait_on_bit+0x40/0x6e [] :gfs2:just_schedule+0x0/0xe [] out_of_line_wait_on_bit+0x6c/0x78 [] wake_bit_function+0x0/0x23 [] :gfs2:gfs2_glock_wait+0x2b/0x30 [] :gfs2:gfs2_statfs_sync+0x3f/0x165 [] :gfs2:gfs2_statfs_sync+0x37/0x165 [] del_timer_sync+0xc/0x16 [] :gfs2:quotad_check_timeo+0x20/0x60 [] :gfs2:gfs2_quotad+0xde/0x214 [] autoremove_wake_function+0x0/0x2e [] :gfs2:gfs2_quotad+0x0/0x214 [] keventd_create_kthread+0x0/0xc4 [] kthread+0xfe/0x132 [] child_rip+0xa/0x12 [] keventd_create_kthread+0x0/0xc4 [] kthread+0x0/0x132 [] child_rip+0x0/0x12 -- clustat was not listing the services too saying Service temoprarily unavailible. try again later... Then I ran gfs2_list df. It printed out few lines then it stopped. I could't do 'ls; on mounted GFS file-systems on all three nodes. Then I rebooted this node once again. After that everything is normal. Just wanted to know what might has caused the problem. messaged logs says: Dec 22 10:53:57 cvprd2 fenced[7379]: fence "xxxx.xxx.xxx" success Dec 22 10:53:57 cvprd2 kernel: GFS2: fsid=vprd:guest_vms1.1: jid=2: Trying to acquire journal lock... Dec 22 10:53:57 cvprd2 kernel: GFS2: fsid=vprd:guest_vms2.2: jid=1: Trying to acquire journal lock... Dec 22 10:53:57 cvprd2 kernel: GFS2: fsid=vprd:guest_vms3.1: jid=0: Trying to acquire journal lock... Dec 22 10:53:57 cvprd2 kernel: GFS2: fsid=vprd:guest_vms4.1: jid=0: Trying to acquire journal lock... Dec 22 10:53:57 cvprd2 kernel: GFS2: fsid=vprd:guest_vms5.1: jid=0: Trying to acquire journal lock... Dec 22 10:53:57 cvprd2 kernel: GFS2: fsid=vprd:guest_vms6.1: jid=0: Trying to acquire journal lock... Dec 22 10:53:57 cvprd2 kernel: GFS2: fsid=vprd:guest_vms1.1: jid=2: Looking at journal... Dec 22 10:53:57 cvprd2 kernel: GFS2: fsid=vprd:guest_vms2.2: jid=1: Looking at journal... Dec 22 10:53:57 cvprd2 kernel: GFS2: fsid=vprd:guest_vms3.1: jid=0: Looking at journal... Dec 22 10:53:57 cvprd2 kernel: GFS2: fsid=vprd:guest_vms5.1: jid=0: Looking at journal... Dec 22 10:53:57 cvprd2 kernel: GFS2: fsid=vprd:guest_vms6.1: jid=0: Looking at journal... Dec 22 10:53:57 cvprd2 kernel: GFS2: fsid=vprd:guest_vms4.1: jid=0: Looking at journal... 
Dec 22 10:53:57 cvprd2 kernel: GFS2: fsid=vprd:guest_vms1.1: jid=2: Acquiring the transaction lock... Dec 22 10:53:57 cvprd2 kernel: GFS2: fsid=vprd:guest_vms1.1: jid=2: Replaying journal... Dec 22 10:53:57 cvprd2 kernel: GFS2: fsid=vprd:guest_vms1.1: jid=2: Replayed 0 of 0 blocks Dec 22 10:53:57 cvprd2 kernel: GFS2: fsid=vprd:guest_vms1.1: jid=2: Found 0 revoke tags Dec 22 10:53:57 cvprd2 kernel: GFS2: fsid=vprd:guest_vms1.1: jid=2: Journal replayed in 0s Dec 22 10:53:57 cvprd2 kernel: GFS2: fsid=vprd:guest_vms1.1: jid=2: Done Dec 22 10:53:57 cvprd2 kernel: GFS2: fsid=vprd:guest_vms3.1: jid=0: Acquiring the transaction lock... Dec 22 10:53:57 cvprd2 kernel: GFS2: fsid=vprd:guest_vms3.1: jid=0: Replaying journal... Dec 22 10:53:57 cvprd2 kernel: GFS2: fsid=vprd:guest_vms3.1: jid=0: Replayed 1 of 1 blocks Dec 22 10:53:58 cvprd2 kernel: GFS2: fsid=vprd:guest_vms3.1: jid=0: Found 0 revoke tags Dec 22 10:53:58 cvprd2 kernel: GFS2: fsid=vprd:guest_vms3.1: jid=0: Journal replayed in 1s Dec 22 10:53:58 cvprd2 kernel: GFS2: fsid=vprd:guest_vms3.1: jid=0: Done Dec 22 10:53:58 cvprd2 kernel: GFS2: fsid=vprd:guest_vms2.2: jid=1: Acquiring the transaction lock... Dec 22 10:53:58 cvprd2 kernel: GFS2: fsid=vprd:guest_vms2.2: jid=1: Replaying journal... Dec 22 10:53:58 cvprd2 kernel: GFS2: fsid=vprd:guest_vms2.2: jid=1: Replayed 5 of 5 blocks Dec 22 10:53:58 cvprd2 kernel: GFS2: fsid=vprd:guest_vms2.2: jid=1: Found 0 revoke tags Dec 22 10:53:58 cvprd2 kernel: GFS2: fsid=vprd:guest_vms2.2: jid=1: Journal replayed in 0s Dec 22 10:53:58 cvprd2 kernel: GFS2: fsid=vprd:guest_vms2.2: jid=1: Done Dec 22 10:53:58 cvprd2 kernel: GFS2: fsid=vprd:guest_vms5.1: jid=0: Done Dec 22 10:53:58 cvprd2 kernel: GFS2: fsid=vprd:guest_vms6.1: jid=0: Done Dec 22 10:53:58 cvprd2 kernel: GFS2: fsid=vprd:guest_vms4.1: jid=0: Acquiring the transaction lock... Dec 22 10:53:58 cvprd2 kernel: GFS2: fsid=vprd:guest_vms4.1: jid=0: Replaying journal... Dec 22 10:53:58 cvprd2 kernel: GFS2: fsid=vprd:guest_vms4.1: jid=0: Replayed 0 of 0 blocks Dec 22 10:53:58 cvprd2 kernel: GFS2: fsid=vprd:guest_vms4.1: jid=0: Found 0 revoke tags Dec 22 10:53:58 cvprd2 kernel: GFS2: fsid=vprd:guest_vms4.1: jid=0: Journal replayed in 0s Dec 22 10:53:58 cvprd2 kernel: GFS2: fsid=vprd:guest_vms4.1: jid=0: Done OS: RHEL 5.5 64 bit (up to date) Thanks! Paras. From ricks at nerd.com Wed Dec 22 21:30:51 2010 From: ricks at nerd.com (Rick Stevens) Date: Wed, 22 Dec 2010 13:30:51 -0800 Subject: [Linux-cluster] New cluster : installing... In-Reply-To: <88BD849351CE45BBA495A52B387E6B73@versa> References: <88BD849351CE45BBA495A52B387E6B73@versa> Message-ID: <4D126E0B.8040703@nerd.com> On 12/22/2010 11:38 AM, Nicolas Ross wrote: > Hi ! > > Over the last couple of weeks, I've been playing with the cluster suite > and RHEL 6 beta 2, that was availaible. > > Now, I got a 30 day demo of RHEL 6 to begin the re-installation from > scratch for ou soon to be production cluster. With the beta, I had a > deamon running, that was clvmd for the cluster logical volume manager > daemon. This package doesn't seem to exist anymore. > > The package lvm2-cluster is on the installation DVD, but I can't seem to > install it via yum. Mount your DVD and you should be able to install from it by specifying the full path to the RPM file: # yum install /media/cdrom-mount-point/path/to/file.rpm # rpm -ivh /media/cdrom-mount-point/path/to/file.rpm Double clicking on the RPM in the desktop file manager should also offer you the ability to install it. 
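If you'd rather have yum resolve dependencies from the DVD as well, a throwaway repo file pointing at the media usually does the trick. Something like the following, where the mount point is just an example and the directory should be whichever one on your DVD actually holds the package (ResilientStorage, if I recall the RHEL6 media layout correctly):

  # /etc/yum.repos.d/rhel6-dvd.repo
  [rhel6-dvd-resilient-storage]
  name=RHEL 6 DVD - Resilient Storage
  baseurl=file:///media/rhel6dvd/ResilientStorage
  enabled=1
  gpgcheck=0

then a plain "yum install lvm2-cluster" should pick it up.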
> I did enabled the High Availability channel to our > servers, but it's not in there. I can't seem to find in wich software > channel it's located. > > Can anyone tell me ? > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- ---------------------------------------------------------------------- - Rick Stevens, Systems Engineer, C2 Hosting ricks at nerd.com - - AIM/Skype: therps2 ICQ: 22643734 Yahoo: origrps2 - - - - LOOK OUT!!! BEHIND YOU!!! - ---------------------------------------------------------------------- From yvette at dbtgroup.com Wed Dec 22 21:43:58 2010 From: yvette at dbtgroup.com (yvette hirth) Date: Wed, 22 Dec 2010 21:43:58 +0000 Subject: [Linux-cluster] fixed Message-ID: <4D12711E.9060407@dbtgroup.com> hi, first, a big thanks to Bob Peterson and Ben Turner: the "devel" edition of fsck.gfs2 ended normally where the 5.5 current version didn't; and after installing new WTI power switch and reconfiguring the cluster, all my gfs2 shares are now visible across the network. and to all who responded, thank you as well. seasons greetings! yvette hirth From rossnick-lists at cybercat.ca Wed Dec 22 22:58:48 2010 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Wed, 22 Dec 2010 17:58:48 -0500 Subject: [Linux-cluster] New cluster : installing... In-Reply-To: <4D126E0B.8040703@nerd.com> References: <88BD849351CE45BBA495A52B387E6B73@versa> <4D126E0B.8040703@nerd.com> Message-ID: <61598138B246424CBECF9AD400DDC2B7@Inspiron> >> Now, I got a 30 day demo of RHEL 6 to begin the re-installation from >> scratch for ou soon to be production cluster. With the beta, I had a >> deamon running, that was clvmd for the cluster logical volume manager >> daemon. This package doesn't seem to exist anymore. >> >> The package lvm2-cluster is on the installation DVD, but I can't seem to >> install it via yum. > > Mount your DVD and you should be able to install from it by specifying > the full path to the RPM file: > > # yum install /media/cdrom-mount-point/path/to/file.rpm > # rpm -ivh /media/cdrom-mount-point/path/to/file.rpm > That's what I did in the mean time. It appears that I didn't receive a resiliant storgae demi, but a cluster suite demo, which doesn't seem to include resiliant. I sent an email to my account manager on that maner. Regards, From jayesh.shinde at netcore.co.in Thu Dec 23 04:25:32 2010 From: jayesh.shinde at netcore.co.in (jayesh.shinde) Date: Thu, 23 Dec 2010 09:55:32 +0530 Subject: [Linux-cluster] How do I get reiserfs and xfs filesystem options in system-config-cluster ? In-Reply-To: <2040486775.37101.1293040698392.JavaMail.root@zmail07.collab.prod.int.phx2.redhat.com> References: <2040486775.37101.1293040698392.JavaMail.root@zmail07.collab.prod.int.phx2.redhat.com> Message-ID: <4D12CF3C.4000506@netcore.co.in> Thanks Ben, Rajagopal & Rafa for your guidance. Regards Jayesh Shinde On 12/22/2010 11:28 PM, Ben Turner wrote: > As far as I know RHEL 6 doesn't include system-config-cluster anymore. I suggest you use the luci interface to configure this. When creating a new service in luci you can choose the filesystem resource, this resource handles 9 different filesystems including reiser and XFS as well an an autodetect option. You can also manually edit the cluster.conf file to make these changes. > > You can use XFS with cluster, the point that others were trying to make is that XFS is not a shared filesystem like GFS and can only be mounted on one node at a time. 
> > -Ben > > ----- Original Message ----- >> Hi :) >> >> On Wed, Dec 22, 2010 at 8:27 AM, jayesh.shinde >> wrote: >>> Hi Rajagopal >>> >>> I am not clear fully. I will use RHEL 6 . I want some more >>> clarification on >>> below points >>> >>> 1) You mean to say I can't use XFS with cluster ? OR there is no >>> option for >>> XFS with system-config-cluster ? >> >> Depends on the type of cluster: >> - HA cluster: no problem as long as it's active/passive. That is: >> one server mounts the FS and the other is on standby. If server 1 >> fails, it releases the FS and server2 mounts it. >> - shared/clustered filesystem: you'd have to go with CXFS (get in >> touch with SGI). That is: both servers mount the filesystem at the >> same time. >> >> >>> 2) If I edited the cluster.conf file manually for "xfs" will the >>> cluster >>> server work well ? >>> >>> 3) what is work around solution ? >>> >>> Regards >>> Jayesh Shinde >>> >>> On 12/22/2010 12:06 PM, Rajagopal Swaminathan wrote: >>>> Greetings, >>>> >>>> On Wed, Dec 22, 2010 at 4:53 AM, jayesh.shinde >>>> wrote: >>>>> Hello , >>>>> >>>>> I am configuring redhat cluster suite with RHEL 5.4 , 32 bit >>>>> architecture >>>>> I have my one SAN partition with reiserfs and xfs filesystem. >>>> To the best of my knowledge, XFS support has just started on RHEL6. >>>> >>>> I am not sure that ReiserFS was ever supported by Redhat. >>>> >>>> If you are trying to use those filesystems in the cluster, I don't >>>> think they are cluster aware. >>>> >>>> YMMV. >>>> >>>> Regards, >>>> >>>> Rajagopal >> >> HTH >> >> Rafa >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From parvez.h.shaikh at gmail.com Fri Dec 24 05:33:12 2010 From: parvez.h.shaikh at gmail.com (Parvez Shaikh) Date: Fri, 24 Dec 2010 11:03:12 +0530 Subject: [Linux-cluster] IP Resource behavior with Red Hat Cluster Message-ID: Hi all, I am using Red Hat cluster 6.2.0 (version shown with cman_tool version) on Red Hat 5.5 I am on host that has multiple network interfaces and all(or some) of which may be active while I tried to bring up my IP resource up. My cluster is of simple configuration - It has only 2 nodes, and service basically consist of only IP resource, I had to chose random private IP address for test/debugging purpose (192.168....) When I tried to start service it failed with message - clurgmgrd: [31853]: 192.168.25.135 is not configured I manually made this virtual IP available on host and then started service it worked - clurgmgrd: [31853]: 192.168.25.135 already configured My question is - Is it prerequisite for IP resource to be manually added before it can be protected via cluster? Thanks Parvez From raju.rajsand at gmail.com Fri Dec 24 06:00:01 2010 From: raju.rajsand at gmail.com (Rajagopal Swaminathan) Date: Fri, 24 Dec 2010 06:00:01 +0000 Subject: [Linux-cluster] IP Resource behavior with Red Hat Cluster In-Reply-To: References: Message-ID: Greetings, On Fri, Dec 24, 2010 at 5:33 AM, Parvez Shaikh wrote: > Hi all, > > I manually made this virtual IP available on host and then started > service it worked - > Can you please elaborate? did you try to assign IP to the ethx devices and then ping? > clurgmgrd: [31853]: 192.168.25.135 already configured > > > My question is - Is it prerequisite for IP resource to be manually > added before it can be protected via cluster? 
> Every resource/service has to be added to the cluster. And they cannot be used by anything else. Regards, Rajagopal From parvez.h.shaikh at gmail.com Fri Dec 24 08:41:53 2010 From: parvez.h.shaikh at gmail.com (Parvez Shaikh) Date: Fri, 24 Dec 2010 14:11:53 +0530 Subject: [Linux-cluster] IP Resource behavior with Red Hat Cluster In-Reply-To: References: Message-ID: Hi Rajagopal, Thank you for your response I have created a cluster configuration by adding IP resource with value 192.168.25.153 (some value) and created a service which just has IP resource part of it. I have set all requisite configuration such two node, node names, failover,fencing etc. Upon trying to start then service(enable service),it failed - clurgmgrd: [31853]: 192.168.25.135 is not configured Then manually added this IP to host ifconfig eth0:1 192.168.25.135 Then service could start but it gave message - clurgmgrd: [31853]: 192.168.25.135 already configured So do I have to add virtual interface manually (as above or any other method?) before I could start service with IP resource under it? Thanks Parvez On Fri, Dec 24, 2010 at 11:30 AM, Rajagopal Swaminathan wrote: > Greetings, > > On Fri, Dec 24, 2010 at 5:33 AM, Parvez Shaikh > wrote: >> Hi all, >> >> I manually made this virtual IP available on host and then started >> service it worked - >> > > Can you please elaborate? did you try to assign IP to the ethx devices > and then ping? > >> clurgmgrd: [31853]: 192.168.25.135 already configured >> >> >> My question is - Is it prerequisite for IP resource to be manually >> added before it can be protected via cluster? >> > > Every resource/service has to be added to the cluster. > > And they cannot be used by anything else. > > Regards, > > Rajagopal > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From jakov.sosic at srce.hr Fri Dec 24 12:46:05 2010 From: jakov.sosic at srce.hr (Jakov Sosic) Date: Fri, 24 Dec 2010 13:46:05 +0100 Subject: [Linux-cluster] IP Resource behavior with Red Hat Cluster In-Reply-To: References: Message-ID: <4D14960D.50706@srce.hr> On 12/24/2010 09:41 AM, Parvez Shaikh wrote: > Hi Rajagopal, > > Thank you for your response > > I have created a cluster configuration by adding IP resource with > value 192.168.25.153 (some value) and created a service which just has > IP resource part of it. I have set all requisite configuration such > two node, node names, failover,fencing etc. > > Upon trying to start then service(enable service),it failed - > > clurgmgrd: [31853]: 192.168.25.135 is not configured > > Then manually added this IP to host > > ifconfig eth0:1 192.168.25.135 > > Then service could start but it gave message - > > clurgmgrd: [31853]: 192.168.25.135 already configured > > So do I have to add virtual interface manually (as above or any other > method?) before I could start service with IP resource under it? How is your network configured? For an IP address to work in a cluster, you have to have interfaces on both machines set up, which are in the same subnet. For example: node1 # ifconfig eth0 192.168.25.11 netmask 255.255.255.0 node2 # ifconfig eth0 192.168.25.12 netmask 255.255.255.0 Then and only then will cluster be able to bring up virtual ip address and bind it as secondary on this interface. You can then see it with: # ip addr show I guess you're trying to bring up IP address from network subnet that is not in any way set up on your host. And that is a prerequisite with classic IP resource. 
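A quick way to sanity-check this on each node, using your test address as an example:

  ip -4 addr show | grep "192.168.25."
  ip route get 192.168.25.135

If neither of those points at an interface that is already statically configured in that subnet, the ip.sh resource agent has nowhere to add the address. Keep in mind as well that the <ip> resource checks link state on the interface it picks (monitor_link is on by default, if I remember correctly), so the NIC also has to have carrier.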
-- Jakov Sosic www.srce.hr From parvez.h.shaikh at gmail.com Fri Dec 24 16:46:39 2010 From: parvez.h.shaikh at gmail.com (Parvez Shaikh) Date: Fri, 24 Dec 2010 22:16:39 +0530 Subject: [Linux-cluster] IP Resource behavior with Red Hat Cluster In-Reply-To: <4D14960D.50706@srce.hr> References: <4D14960D.50706@srce.hr> Message-ID: Hi Jakov Thank you for your response. My two hosts have multiple network interfaces or ethernet cards. I understood from your email, that the IP corresponding to "cluster node name" for both hosts, should be in the same subnet before a cluster could bring virtual IP up. I will reconfirm if these are in same subnets. Gratefully yours, Parvez On Fri, Dec 24, 2010 at 6:16 PM, Jakov Sosic wrote: > On 12/24/2010 09:41 AM, Parvez Shaikh wrote: >> Hi Rajagopal, >> >> Thank you for your response >> >> I have created a cluster configuration by adding IP resource with >> value 192.168.25.153 (some value) and created a service which just has >> IP resource part of it. I have set all requisite configuration such >> two node, node names, failover,fencing etc. >> >> Upon trying to start then service(enable service),it failed - >> >> clurgmgrd: [31853]: 192.168.25.135 is not configured >> >> Then manually added this IP to host >> >> ifconfig eth0:1 192.168.25.135 >> >> Then service could start but it gave message - >> >> clurgmgrd: [31853]: 192.168.25.135 already configured >> >> So do I have to add virtual interface manually (as above or any other >> method?) before I could start service with IP resource under it? > > How is your network configured? For an IP address to work in a cluster, > you have to have interfaces on both machines set up, which are in the > same subnet. For example: > > node1 # ifconfig eth0 192.168.25.11 netmask 255.255.255.0 > node2 # ifconfig eth0 192.168.25.12 netmask 255.255.255.0 > > Then and only then will cluster be able to bring up virtual ip address > and bind it as secondary on this interface. You can then see it with: > > # ip addr show > > > I guess you're trying to bring up IP address from network subnet that is > not in any way set up on your host. And that is a prerequisite with > classic IP resource. > > > > > -- > Jakov Sosic > www.srce.hr > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From jakov.sosic at srce.hr Sat Dec 25 01:04:16 2010 From: jakov.sosic at srce.hr (Jakov Sosic) Date: Sat, 25 Dec 2010 02:04:16 +0100 Subject: [Linux-cluster] IP Resource behavior with Red Hat Cluster In-Reply-To: References: <4D14960D.50706@srce.hr> Message-ID: <4D154310.8050804@srce.hr> On 12/24/2010 05:46 PM, Parvez Shaikh wrote: > Hi Jakov > > Thank you for your response. My two hosts have multiple network > interfaces or ethernet cards. I understood from your email, that the > IP corresponding to "cluster node name" for both hosts, should be in > the same subnet before a cluster could bring virtual IP up. No... you misunderstood me. I meant that if the virtual address is 192.168.25.X, than you have to have interface on each node that is set up with the ip address from the same subnet. That interface does not need to correspond to the cluster node name. For example: node1 - eth0 - 192.168.1.11 (netmask 255.255.255.0) node2 - eth0 - 192.168.1.12 (netmask 255.255.255.0) IP resource - 192.168.25.100 Now, how do you expect the cluster to know what to do with IP resource? On which interface can cluster glue 192.168.25.100? eth0? But why eth0? And what is the netmask? What about routes? 
So, you need to have for example eth1 on both machines set up in the same subnet, so that cluster can glue IP address from IP resource to that exact interface (which is set up statically). So you also have to have for example: node1 - eth1 - 192.168.25.47 (netmask 255.255.255.0) node2 - eth1 - 192.168.25.48 (netmask 255.255.255.0) Now, rgmanager will know where to activate IP resource, because 192.168.25.100 belongs to 192.168.25.0/24 subnet, which is active on node1/eth1 and node2/eth2. If you were to have another IP resource, for example 192.168.240.44, you would need another interface with another set of static ip addresses on every host you intend to run IP resource on... I hope you get it correctly now. -- Jakov Sosic www.srce.hr From parvez.h.shaikh at gmail.com Sat Dec 25 03:26:49 2010 From: parvez.h.shaikh at gmail.com (Parvez Shaikh) Date: Sat, 25 Dec 2010 08:56:49 +0530 Subject: [Linux-cluster] IP Resource behavior with Red Hat Cluster In-Reply-To: <4D154310.8050804@srce.hr> References: <4D14960D.50706@srce.hr> <4D154310.8050804@srce.hr> Message-ID: Thanks a ton Jakov. It has clarified my doubts. Yours gratefully, Parvez On Sat, Dec 25, 2010 at 6:34 AM, Jakov Sosic wrote: > On 12/24/2010 05:46 PM, Parvez Shaikh wrote: >> Hi Jakov >> >> Thank you for your response. My two hosts have multiple network >> interfaces or ethernet cards. I understood from your email, that the >> IP corresponding to "cluster node name" for both hosts, should be in >> the same subnet before a cluster could bring virtual IP up. > > No... you misunderstood me. I meant that if the virtual address is > 192.168.25.X, than you have to have interface on each node that is set > up with the ip address from the same subnet. That interface does not > need to correspond to the cluster node name. For example: > > node1 - eth0 - 192.168.1.11 (netmask 255.255.255.0) > node2 - eth0 - 192.168.1.12 (netmask 255.255.255.0) > > IP resource - 192.168.25.100 > > > Now, how do you expect the cluster to know what to do with IP resource? > On which interface can cluster glue 192.168.25.100? eth0? But why eth0? > And what is the netmask? What about routes? > > So, you need to have for example eth1 on both machines set up in the > same subnet, so that cluster can glue IP address from IP resource to > that exact interface (which is set up statically). So you also have to > have for example: > > node1 - eth1 - 192.168.25.47 (netmask 255.255.255.0) > node2 - eth1 - 192.168.25.48 (netmask 255.255.255.0) > > Now, rgmanager will know where to activate IP resource, because > 192.168.25.100 belongs to 192.168.25.0/24 subnet, which is active on > node1/eth1 and node2/eth2. > > If you were to have another IP resource, for example 192.168.240.44, you > would need another interface with another set of static ip addresses on > every host you intend to run IP resource on... > > > I hope you get it correctly now. 
> > > > > > -- > Jakov Sosic > www.srce.hr > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From parvez.h.shaikh at gmail.com Mon Dec 27 04:21:42 2010 From: parvez.h.shaikh at gmail.com (Parvez Shaikh) Date: Mon, 27 Dec 2010 09:51:42 +0530 Subject: [Linux-cluster] IP Resource behavior with Red Hat Cluster In-Reply-To: <4D154310.8050804@srce.hr> References: <4D14960D.50706@srce.hr> <4D154310.8050804@srce.hr> Message-ID: Hi I chose my IP resource as 192.168.13.15, I had eth3 configured on 192.168.13.1 but it still failed with error - Dec 27 17:35:32 datablade1 clurgmgrd[31853]: Error storing ip: Duplicate Dec 27 17:36:55 datablade1 clurgmgrd[31853]: Starting disabled service service:service1 Dec 27 17:36:55 datablade1 clurgmgrd[31853]: start on ip "192.168.13.15/24" returned 1 (generic error) Dec 27 17:36:55 datablade1 clurgmgrd[31853]: #68: Failed to start service:service1; return value: 1 Below is set of interfaces - eth0 Link encap:Ethernet HWaddr 00:10:18:66:15:70 inet addr:192.168.10.1 Bcast:192.168.10.255 Mask:255.255.255.0 inet6 addr: fe80::210:18ff:fe66:1570/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:125 errors:0 dropped:0 overruns:0 frame:0 TX packets:305 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:32679 (31.9 KiB) TX bytes:42477 (41.4 KiB) Interrupt:177 Memory:98000000-98012800 eth1 Link encap:Ethernet HWaddr 00:10:18:66:15:72 inet addr:192.168.11.1 Bcast:192.168.11.255 Mask:255.255.255.0 inet6 addr: fe80::210:18ff:fe66:1572/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:1237019 errors:0 dropped:0 overruns:0 frame:0 TX packets:1919245 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:183885611 (175.3 MiB) TX bytes:337885336 (322.2 MiB) Interrupt:154 Memory:9a000000-9a012800 eth2 Link encap:Ethernet HWaddr 00:10:18:66:15:74 inet addr:192.168.12.1 Bcast:192.168.12.255 Mask:255.255.255.0 inet6 addr: fe80::210:18ff:fe66:1574/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:419008 errors:0 dropped:0 overruns:0 frame:0 TX packets:29 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:26822898 (25.5 MiB) TX bytes:5992 (5.8 KiB) Interrupt:185 Memory:94000000-94012800 eth3 Link encap:Ethernet HWaddr 00:10:18:66:15:76 inet addr:192.168.13.1 Bcast:192.168.13.255 Mask:255.255.255.0 UP BROADCAST MULTICAST MTU:1500 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) Interrupt:162 Memory:96000000-96012800 On Sat, Dec 25, 2010 at 6:34 AM, Jakov Sosic wrote: > On 12/24/2010 05:46 PM, Parvez Shaikh wrote: >> Hi Jakov >> >> Thank you for your response. My two hosts have multiple network >> interfaces or ethernet cards. I understood from your email, that the >> IP corresponding to "cluster node name" for both hosts, should be in >> the same subnet before a cluster could bring virtual IP up. > > No... you misunderstood me. I meant that if the virtual address is > 192.168.25.X, than you have to have interface on each node that is set > up with the ip address from the same subnet. That interface does not > need to correspond to the cluster node name. 
For example: > > node1 - eth0 - 192.168.1.11 (netmask 255.255.255.0) > node2 - eth0 - 192.168.1.12 (netmask 255.255.255.0) > > IP resource - 192.168.25.100 > > > Now, how do you expect the cluster to know what to do with IP resource? > On which interface can cluster glue 192.168.25.100? eth0? But why eth0? > And what is the netmask? What about routes? > > So, you need to have for example eth1 on both machines set up in the > same subnet, so that cluster can glue IP address from IP resource to > that exact interface (which is set up statically). So you also have to > have for example: > > node1 - eth1 - 192.168.25.47 (netmask 255.255.255.0) > node2 - eth1 - 192.168.25.48 (netmask 255.255.255.0) > > Now, rgmanager will know where to activate IP resource, because > 192.168.25.100 belongs to 192.168.25.0/24 subnet, which is active on > node1/eth1 and node2/eth2. > > If you were to have another IP resource, for example 192.168.240.44, you > would need another interface with another set of static ip addresses on > every host you intend to run IP resource on... > > > I hope you get it correctly now. > > > > > > -- > Jakov Sosic > www.srce.hr > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From raju.rajsand at gmail.com Mon Dec 27 06:48:18 2010 From: raju.rajsand at gmail.com (Rajagopal Swaminathan) Date: Mon, 27 Dec 2010 12:18:18 +0530 Subject: [Linux-cluster] IP Resource behavior with Red Hat Cluster In-Reply-To: References: <4D14960D.50706@srce.hr> <4D154310.8050804@srce.hr> Message-ID: Greetinds, On Mon, Dec 27, 2010 at 9:51 AM, Parvez Shaikh wrote: > Hi > > > Dec 27 17:35:32 datablade1 clurgmgrd[31853]: Error storing ip: Duplicate > Dec 27 17:36:55 datablade1 clurgmgrd[31853]: Starting > disabled service service:service1 > Dec 27 17:36:55 datablade1 clurgmgrd[31853]: start on ip > "192.168.13.15/24" returned 1 (generic error) > Dec 27 17:36:55 datablade1 clurgmgrd[31853]: #68: Failed to > start service:service1; return value: 1 > > Below is set of interfaces - > What does the ip addr show command say? Regards, Rajagopal From parvez.h.shaikh at gmail.com Mon Dec 27 07:05:27 2010 From: parvez.h.shaikh at gmail.com (Parvez Shaikh) Date: Mon, 27 Dec 2010 12:35:27 +0530 Subject: [Linux-cluster] IP Resource behavior with Red Hat Cluster In-Reply-To: References: <4D14960D.50706@srce.hr> <4D154310.8050804@srce.hr> Message-ID: Hi all Issue has been resolved. After debugging a bit I found that link to eth was not detected - "ethtool ethX | grep "Link detected:" | awk '{print $3}'" Output - no After resolving around this, I could get my IP resource up. Thank you for your kind suggestions and interest in this problem Gratefully yours On Mon, Dec 27, 2010 at 12:18 PM, Rajagopal Swaminathan wrote: > Greetinds, > > On Mon, Dec 27, 2010 at 9:51 AM, Parvez Shaikh > wrote: >> Hi >> >> >> Dec 27 17:35:32 datablade1 clurgmgrd[31853]: Error storing ip: Duplicate >> Dec 27 17:36:55 datablade1 clurgmgrd[31853]: Starting >> disabled service service:service1 >> Dec 27 17:36:55 datablade1 clurgmgrd[31853]: start on ip >> "192.168.13.15/24" returned 1 (generic error) >> Dec 27 17:36:55 datablade1 clurgmgrd[31853]: #68: Failed to >> start service:service1; return value: 1 >> >> Below is set of interfaces - >> > > What does the ip addr show command say? 
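As an aside, a few commands are enough to verify that prerequisite before enabling the service. The interface and subnet (eth3, 192.168.13.0/24) are taken from the ifconfig output earlier in the thread; substitute your own.

ip -4 addr show eth3                  # is an address in the target subnet configured and UP?
ip route show | grep 192.168.13       # is the connected route for 192.168.13.0/24 present?
ethtool eth3 | grep "Link detected"   # essentially the link check the ip agent does when monitor_link is set

Note that eth3 in the ifconfig output above reports UP but not RUNNING, which usually means no carrier is detected on that port.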
> > Regards, > > Rajagopal > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From sandy.rhce at gmail.com Mon Dec 27 07:08:39 2010 From: sandy.rhce at gmail.com (sandeeep) Date: Mon, 27 Dec 2010 12:38:39 +0530 Subject: [Linux-cluster] Linux-cluster Digest, Vol 80, Issue 23 In-Reply-To: References: Message-ID: Hi, I am using RHEL5.4 and trying to make cluster using conga, i did every thing, like in main server i installed luci* cluster* related package using :"yum groupinstall luci* cluster*, and same time installed cman* and in other two nodes i have installed ricci* packege using yum. now every thing is done, but in server when i am running "service cman restart" its giving an error like " local node name is not found in main conficuration file and /usr/sbin/cman_tool: aisex daemon not started ." please have a look into my querry, as i am facing this problem since many days. THanks Sandeep On 12/24/10, linux-cluster-request at redhat.com wrote: > Send Linux-cluster mailing list submissions to > linux-cluster at redhat.com > > To subscribe or unsubscribe via the World Wide Web, visit > https://www.redhat.com/mailman/listinfo/linux-cluster > or, via email, send a message with subject or body 'help' to > linux-cluster-request at redhat.com > > You can reach the person managing the list at > linux-cluster-owner at redhat.com > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of Linux-cluster digest..." > > > Today's Topics: > > 1. IP Resource behavior with Red Hat Cluster (Parvez Shaikh) > 2. Re: IP Resource behavior with Red Hat Cluster > (Rajagopal Swaminathan) > 3. Re: IP Resource behavior with Red Hat Cluster (Parvez Shaikh) > 4. Re: IP Resource behavior with Red Hat Cluster (Jakov Sosic) > 5. Re: IP Resource behavior with Red Hat Cluster (Parvez Shaikh) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Fri, 24 Dec 2010 11:03:12 +0530 > From: Parvez Shaikh > To: linux-cluster at redhat.com > Subject: [Linux-cluster] IP Resource behavior with Red Hat Cluster > Message-ID: > > Content-Type: text/plain; charset=ISO-8859-1 > > Hi all, > > I am using Red Hat cluster 6.2.0 (version shown with cman_tool > version) on Red Hat 5.5 > > I am on host that has multiple network interfaces and all(or some) of > which may be active while I tried to bring up my IP resource up. > > My cluster is of simple configuration - > It has only 2 nodes, and service basically consist of only IP > resource, I had to chose random private IP address for test/debugging > purpose (192.168....) > > When I tried to start service it failed with message - > > clurgmgrd: [31853]: 192.168.25.135 is not configured > > I manually made this virtual IP available on host and then started > service it worked - > > clurgmgrd: [31853]: 192.168.25.135 already configured > > > My question is - Is it prerequisite for IP resource to be manually > added before it can be protected via cluster? 
> > Thanks > Parvez > > > > ------------------------------ > > Message: 2 > Date: Fri, 24 Dec 2010 06:00:01 +0000 > From: Rajagopal Swaminathan > To: linux clustering > Subject: Re: [Linux-cluster] IP Resource behavior with Red Hat Cluster > Message-ID: > > Content-Type: text/plain; charset=ISO-8859-1 > > Greetings, > > On Fri, Dec 24, 2010 at 5:33 AM, Parvez Shaikh > wrote: >> Hi all, >> >> I manually made this virtual IP available on host and then started >> service it worked - >> > > Can you please elaborate? did you try to assign IP to the ethx devices > and then ping? > >> clurgmgrd: [31853]: 192.168.25.135 already configured >> >> >> My question is - Is it prerequisite for IP resource to be manually >> added before it can be protected via cluster? >> > > Every resource/service has to be added to the cluster. > > And they cannot be used by anything else. > > Regards, > > Rajagopal > > > > ------------------------------ > > Message: 3 > Date: Fri, 24 Dec 2010 14:11:53 +0530 > From: Parvez Shaikh > To: linux clustering > Subject: Re: [Linux-cluster] IP Resource behavior with Red Hat Cluster > Message-ID: > > Content-Type: text/plain; charset=ISO-8859-1 > > Hi Rajagopal, > > Thank you for your response > > I have created a cluster configuration by adding IP resource with > value 192.168.25.153 (some value) and created a service which just has > IP resource part of it. I have set all requisite configuration such > two node, node names, failover,fencing etc. > > Upon trying to start then service(enable service),it failed - > > clurgmgrd: [31853]: 192.168.25.135 is not configured > > Then manually added this IP to host > > ifconfig eth0:1 192.168.25.135 > > Then service could start but it gave message - > > clurgmgrd: [31853]: 192.168.25.135 already configured > > So do I have to add virtual interface manually (as above or any other > method?) before I could start service with IP resource under it? > > Thanks > Parvez > > On Fri, Dec 24, 2010 at 11:30 AM, Rajagopal Swaminathan > wrote: >> Greetings, >> >> On Fri, Dec 24, 2010 at 5:33 AM, Parvez Shaikh >> wrote: >>> Hi all, >>> >>> I manually made this virtual IP available on host and then started >>> service it worked - >>> >> >> Can you please elaborate? did you try to assign IP to the ethx devices >> and then ping? >> >>> clurgmgrd: [31853]: 192.168.25.135 already configured >>> >>> >>> My question is - Is it prerequisite for IP resource to be manually >>> added before it can be protected via cluster? >>> >> >> Every resource/service has to be added to the cluster. >> >> And they cannot be used by anything else. >> >> Regards, >> >> Rajagopal >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > > ------------------------------ > > Message: 4 > Date: Fri, 24 Dec 2010 13:46:05 +0100 > From: Jakov Sosic > To: linux clustering > Subject: Re: [Linux-cluster] IP Resource behavior with Red Hat Cluster > Message-ID: <4D14960D.50706 at srce.hr> > Content-Type: text/plain; charset=ISO-8859-1 > > On 12/24/2010 09:41 AM, Parvez Shaikh wrote: >> Hi Rajagopal, >> >> Thank you for your response >> >> I have created a cluster configuration by adding IP resource with >> value 192.168.25.153 (some value) and created a service which just has >> IP resource part of it. I have set all requisite configuration such >> two node, node names, failover,fencing etc. 
>> >> Upon trying to start then service(enable service),it failed - >> >> clurgmgrd: [31853]: 192.168.25.135 is not configured >> >> Then manually added this IP to host >> >> ifconfig eth0:1 192.168.25.135 >> >> Then service could start but it gave message - >> >> clurgmgrd: [31853]: 192.168.25.135 already configured >> >> So do I have to add virtual interface manually (as above or any other >> method?) before I could start service with IP resource under it? > > How is your network configured? For an IP address to work in a cluster, > you have to have interfaces on both machines set up, which are in the > same subnet. For example: > > node1 # ifconfig eth0 192.168.25.11 netmask 255.255.255.0 > node2 # ifconfig eth0 192.168.25.12 netmask 255.255.255.0 > > Then and only then will cluster be able to bring up virtual ip address > and bind it as secondary on this interface. You can then see it with: > > # ip addr show > > > I guess you're trying to bring up IP address from network subnet that is > not in any way set up on your host. And that is a prerequisite with > classic IP resource. > > > > > -- > Jakov Sosic > www.srce.hr > > > > ------------------------------ > > Message: 5 > Date: Fri, 24 Dec 2010 22:16:39 +0530 > From: Parvez Shaikh > To: linux clustering > Subject: Re: [Linux-cluster] IP Resource behavior with Red Hat Cluster > Message-ID: > > Content-Type: text/plain; charset=ISO-8859-1 > > Hi Jakov > > Thank you for your response. My two hosts have multiple network > interfaces or ethernet cards. I understood from your email, that the > IP corresponding to "cluster node name" for both hosts, should be in > the same subnet before a cluster could bring virtual IP up. I will > reconfirm if these are in same subnets. > > Gratefully yours, > Parvez > > On Fri, Dec 24, 2010 at 6:16 PM, Jakov Sosic wrote: >> On 12/24/2010 09:41 AM, Parvez Shaikh wrote: >>> Hi Rajagopal, >>> >>> Thank you for your response >>> >>> I have created a cluster configuration by adding IP resource with >>> value 192.168.25.153 (some value) and created a service which just has >>> IP resource part of it. I have set all requisite configuration such >>> two node, node names, failover,fencing etc. >>> >>> Upon trying to start then service(enable service),it failed - >>> >>> clurgmgrd: [31853]: 192.168.25.135 is not configured >>> >>> Then manually added this IP to host >>> >>> ifconfig eth0:1 192.168.25.135 >>> >>> Then service could start but it gave message - >>> >>> clurgmgrd: [31853]: 192.168.25.135 already configured >>> >>> So do I have to add virtual interface manually (as above or any other >>> method?) before I could start service with IP resource under it? >> >> How is your network configured? For an IP address to work in a cluster, >> you have to have interfaces on both machines set up, which are in the >> same subnet. For example: >> >> node1 # ifconfig eth0 192.168.25.11 netmask 255.255.255.0 >> node2 # ifconfig eth0 192.168.25.12 netmask 255.255.255.0 >> >> Then and only then will cluster be able to bring up virtual ip address >> and bind it as secondary on this interface. You can then see it with: >> >> # ip addr show >> >> >> I guess you're trying to bring up IP address from network subnet that is >> not in any way set up on your host. And that is a prerequisite with >> classic IP resource. 
>> >> >> >> >> -- >> Jakov Sosic >> www.srce.hr >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > > ------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > End of Linux-cluster Digest, Vol 80, Issue 23 > ********************************************* > From susvirkar.3616 at gmail.com Mon Dec 27 09:02:45 2010 From: susvirkar.3616 at gmail.com (umesh susvirkar) Date: Mon, 27 Dec 2010 14:32:45 +0530 Subject: [Linux-cluster] Linux-cluster Digest, Vol 80, Issue 23 In-Reply-To: References: Message-ID: Hi Your server hostname & name you specify in cluster.conf file for cluster node name should be same.is this 2 value are different. if values are different make it similar and check. Regards Umesh Susvirkar On Mon, Dec 27, 2010 at 12:38 PM, sandeeep wrote: > Hi, > I am using RHEL5.4 and trying to make cluster using conga, i did > every thing, like in main server i installed luci* cluster* related > package using :"yum groupinstall luci* cluster*, > and same time installed cman* > and in other two nodes i have installed ricci* packege using yum. > now every thing is done, but in server when i am running "service cman > restart" its giving an error like " local node name is not found in > main conficuration file and /usr/sbin/cman_tool: aisex daemon not > started ." please have a look into my querry, as i am facing this > problem since many days. > > > THanks > Sandeep > > On 12/24/10, linux-cluster-request at redhat.com > wrote: > > Send Linux-cluster mailing list submissions to > > linux-cluster at redhat.com > > > > To subscribe or unsubscribe via the World Wide Web, visit > > https://www.redhat.com/mailman/listinfo/linux-cluster > > or, via email, send a message with subject or body 'help' to > > linux-cluster-request at redhat.com > > > > You can reach the person managing the list at > > linux-cluster-owner at redhat.com > > > > When replying, please edit your Subject line so it is more specific > > than "Re: Contents of Linux-cluster digest..." > > > > > > Today's Topics: > > > > 1. IP Resource behavior with Red Hat Cluster (Parvez Shaikh) > > 2. Re: IP Resource behavior with Red Hat Cluster > > (Rajagopal Swaminathan) > > 3. Re: IP Resource behavior with Red Hat Cluster (Parvez Shaikh) > > 4. Re: IP Resource behavior with Red Hat Cluster (Jakov Sosic) > > 5. Re: IP Resource behavior with Red Hat Cluster (Parvez Shaikh) > > > > > > ---------------------------------------------------------------------- > > > > Message: 1 > > Date: Fri, 24 Dec 2010 11:03:12 +0530 > > From: Parvez Shaikh > > To: linux-cluster at redhat.com > > Subject: [Linux-cluster] IP Resource behavior with Red Hat Cluster > > Message-ID: > > > > Content-Type: text/plain; charset=ISO-8859-1 > > > > Hi all, > > > > I am using Red Hat cluster 6.2.0 (version shown with cman_tool > > version) on Red Hat 5.5 > > > > I am on host that has multiple network interfaces and all(or some) of > > which may be active while I tried to bring up my IP resource up. > > > > My cluster is of simple configuration - > > It has only 2 nodes, and service basically consist of only IP > > resource, I had to chose random private IP address for test/debugging > > purpose (192.168....) 
[SNIP]
> >>> > >>> Upon trying to start then service(enable service),it failed - > >>> > >>> clurgmgrd: [31853]: 192.168.25.135 is not configured > >>> > >>> Then manually added this IP to host > >>> > >>> ifconfig eth0:1 192.168.25.135 > >>> > >>> Then service could start but it gave message - > >>> > >>> clurgmgrd: [31853]: 192.168.25.135 already configured > >>> > >>> So do I have to add virtual interface manually (as above or any other > >>> method?) before I could start service with IP resource under it? > >> > >> How is your network configured? For an IP address to work in a cluster, > >> you have to have interfaces on both machines set up, which are in the > >> same subnet. For example: > >> > >> node1 # ifconfig eth0 192.168.25.11 netmask 255.255.255.0 > >> node2 # ifconfig eth0 192.168.25.12 netmask 255.255.255.0 > >> > >> Then and only then will cluster be able to bring up virtual ip address > >> and bind it as secondary on this interface. You can then see it with: > >> > >> # ip addr show > >> > >> > >> I guess you're trying to bring up IP address from network subnet that is > >> not in any way set up on your host. And that is a prerequisite with > >> classic IP resource. > >> > >> > >> > >> > >> -- > >> Jakov Sosic > >> www.srce.hr > >> > >> -- > >> Linux-cluster mailing list > >> Linux-cluster at redhat.com > >> https://www.redhat.com/mailman/listinfo/linux-cluster > >> > > > > > > > > ------------------------------ > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > End of Linux-cluster Digest, Vol 80, Issue 23 > > ********************************************* > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux-cluster at redhat.com Tue Dec 28 06:36:47 2010 From: linux-cluster at redhat.com (Mailbot for etexusa.com) Date: Mon, 27 Dec 2010 22:36:47 -0800 Subject: [Linux-cluster] DSN: failed (Hi) Message-ID: This is a Delivery Status Notification (DSN). I was unable to deliver your message to hr at holista.in. I said (end of message) And they gave me the error; 552-5.7.0 Our system detected an illegal attachment on your message. Please 552-5.7.0 visit http://mail.google.com/support/bin/answer.py?answer=6590 to 552 5.7.0 review our attachment guidelines. w27si27262803wfh.2 -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: text/rfc822-headers Size: 465 bytes Desc: not available URL: From jayesh.shinde at netcore.co.in Wed Dec 29 05:17:04 2010 From: jayesh.shinde at netcore.co.in (jayesh.shinde) Date: Wed, 29 Dec 2010 10:47:04 +0530 Subject: [Linux-cluster] suggestion require about fence_vmware Message-ID: <4D1AC450.1080801@netcore.co.in> Hi all , I am testing RHCS "Active - Passive " on RHEL 5.5 64 bit OS along with VMWARE ESX 4.0 for *mailing* application . I want suggestion and guidance that , what should be the best architecture and best practice for the below mention setup. My architecture details are as follows :-- ------------------------------------ 1) I have one VM called "node1" under one physical VMWARE server. 2) second VM called "node2" under second physical VMWARE server. 3) Both the physical servers are connected to each other with switch along with NIC teaming and fail over switch is also available. 
4) For fencing I am using fencedevice agent as "*fence_vmware*" 5) one SAN partition with LVM configure. User's mailbox data is inside this SAN partition. 6) File system is "EXT3" My queries :--- ========= 1) Is above architecture's points from 1-6 are correct for Active-passive configuration with VMWARE. ? 2) While testing I observer that when I purposely stop the network service on active "node1" then "node2" fence the "node1" properly. But "node1" getting fence by "*poweroff*" . Since all service was running on "node1" and SAN partition was also mounted and suddenly "node1" get fence. *So will this immediate poweroff cause the corruption of SAN ext3 file system and local HDD too* ? 3) If yes how to avoid this ? 4) Is it a correct way of fencing ? 5) Is this correct setup for production environment ? I purposely stop network service on "node1" because I want to test and know that , what will happen when network goes down on "node1" . I also observe that the files which was open in VIM editor also got recover properly because of the Journaling feature of EXT3 FS. Happy christmas. Thanks & Regards Jayesh Shinde -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefan at lsd.co.za Wed Dec 29 19:02:53 2010 From: stefan at lsd.co.za (Stefan Lesicnik) Date: Wed, 29 Dec 2010 21:02:53 +0200 (SAST) Subject: [Linux-cluster] Multiple communication channels In-Reply-To: <2119444257.0.1293649365513.JavaMail.root@zimbra> Message-ID: <1218306684.1.1293649373746.JavaMail.root@zimbra> Hi all, I am running RHCS 5 and have a two node cluster with a shared qdisk. I have a bonded network bond0 and a back to back crossover eth1. Currently I have multicast cluster communication over the crossover, but was wondering if it was possible to use bond0 as an alternative / failover. So if eth1 was down, it could still communicate? I havent been able to find anything in the FAQ / documentation that would suggest this, so I thought I would ask. Thanks alot and I hope everyone has a great new year :) Stefan From linux at alteeve.com Wed Dec 29 19:33:45 2010 From: linux at alteeve.com (Digimer) Date: Wed, 29 Dec 2010 14:33:45 -0500 Subject: [Linux-cluster] Multiple communication channels In-Reply-To: <1218306684.1.1293649373746.JavaMail.root@zimbra> References: <1218306684.1.1293649373746.JavaMail.root@zimbra> Message-ID: <4D1B8D19.1040805@alteeve.com> On 12/29/2010 02:02 PM, Stefan Lesicnik wrote: > Hi all, > > I am running RHCS 5 and have a two node cluster with a shared qdisk. I have a bonded network bond0 and a back to back crossover eth1. > > Currently I have multicast cluster communication over the crossover, but was wondering if it was possible to use bond0 as an alternative / failover. So if eth1 was down, it could still communicate? > > I havent been able to find anything in the FAQ / documentation that would suggest this, so I thought I would ask. > > Thanks alot and I hope everyone has a great new year :) > > Stefan From: http://wiki.alteeve.com/index.php/Openais.conf ------------------------------------ ### Below here are the 'interface' directive(s). # At least one 'interface' directive is required within the 'totem' # directive. When two are specified, the one with 'ringnumber' of '0' # is the primary ring and the second with 'ringnumber' of '1' is the # backup ring. interface { # Increment the ring number for each 'interface' directive. ringnumber: 0 # This must match the subnet of this interface. The final octal # must be '0'. 
In this case, this directive will bind to the # interface on the 192.168.1.0/24 subnet, so this should be set # to '192.168.1.0'. This can be an IPv6 address, however, you # will be required to set the 'nodeid' in the 'totem' directive # above. Further, there will be no automatic interface # selection within a specified subnet as there is with IPv4. # In this case, the primary ring will be on the interface with # IPs on the 10.0.0.0/24 network (ie: eth1). bindnetaddr: 10.0.0.0 # This is the multicast address used by OpenAIS. Avoid the # '224.0.0.0/8' range as that is used for configuration. If you # use an IPv6 address, be sure to specify a 'nodeid' in the # 'totem' directive above. mcastaddr: 226.94.1.1 # This is the UDP port used with the multicast address above. mcastport: 5405 } # This is a second optional, redundant interface directive. If you use # two 'interface' directives, be sure to review the four 'rrp_*' # variables. # Note that two is the maximum number of interface directives. interface { # Increment the ring number for each 'interface' directive. ringnumber: 1 # In this case, the backup ring will be on the interface with # IPs on the 192.168.1.0/24 network (ie: eth0). bindnetaddr: 192.168.1.0 # MADI: Does this have to be different? How much different? # Can I just use a different port? mcastaddr: 227.94.1.1 # MADI: If this is different, can 'mcastaddr' be the same? mcastport: 5406 } ------------------------------------ Hope this helps. :) -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From linux at alteeve.com Wed Dec 29 19:46:52 2010 From: linux at alteeve.com (Digimer) Date: Wed, 29 Dec 2010 14:46:52 -0500 Subject: [Linux-cluster] Multiple communication channels In-Reply-To: <4D1B8D19.1040805@alteeve.com> References: <1218306684.1.1293649373746.JavaMail.root@zimbra> <4D1B8D19.1040805@alteeve.com> Message-ID: <4D1B902C.8080009@alteeve.com> On 12/29/2010 02:33 PM, Digimer wrote: > On 12/29/2010 02:02 PM, Stefan Lesicnik wrote: >> Hi all, >> >> I am running RHCS 5 and have a two node cluster with a shared qdisk. I have a bonded network bond0 and a back to back crossover eth1. >> >> Currently I have multicast cluster communication over the crossover, but was wondering if it was possible to use bond0 as an alternative / failover. So if eth1 was down, it could still communicate? >> >> I havent been able to find anything in the FAQ / documentation that would suggest this, so I thought I would ask. >> >> Thanks alot and I hope everyone has a great new year :) >> >> Stefan > > From: http://wiki.alteeve.com/index.php/Openais.conf > I forgot to mention that there are the redundant ring options as well. ------------------------------------ ### Redundant Ring Protocol options are below. These are ignored if ### only one 'interface' directive is defined. # This is used to control how the Redundant Ring Protocol is used. If # you only have one 'interface' directive, the default is 'none'. If # you have two, then please set 'active' or 'passive'. The trade off # is that, when the network is degraded, 'active' provides lower # latency from transmit to delivery and 'passive' may nearly double the # speed of the totem protocol when not CPU bound. # Valid options: none, active, passive. rrp_mode: passive # The next three variables are relevant depending on which mode # 'rrp_mode' is set to. Both modes use 'rrp_problem_count_threshold' # but only 'active' uses 'rrp_problem_count_timeout' and # 'rrp_token_expired_timeout'. 
# # - In 'active' mode: # If a token doesn't arrive in 'rrp_token_expired_timeout' milliseconds # an internal counter called 'problem_count' is incremented by 1. If a # token arrives within 'rrp_problem_count_timeout' however, the # internal decreases by '1'. If the internal counter equals or exceeds # the 'rrp_problem_count_threshold' at any time, the effected interface # will be flagged as faulty and it will no longer be used. # # - In 'passive' mode: # The two interfaces have internal counters called 'token_recv_count' # and 'mcast_recv_count' that are incremented by 1 each time a token # or multicast message is received, respectively. These counts for each # interface is counted and if the counts should differ by more than # 'rrp_problem_count_threshold', then the interface with the lower # count is flagged as faulty and it will no longer be used. # # If an interface is flagged as faulty, an administrator will need to # manually re-enable it. # The default problem count timeout is '1000' milliseconds. rrp_problem_count_timeout: 1000 # The default problem count threshold is '20'. rrp_problem_count_threshold: 20 # This is the time in milliseconds to wait before incrementing the # internal problem counter. Normally, this variable is automatically # calculated by openais and, thus, should not be defined here without # fully understanding the effects of doing so. # # In short; The should always be at least 'rrp_problem_count_timeout' # minus 50 milliseconds with the result being divided by # 'rrp_problem_count_threshold' or else a reconfiguration can occur. # Using the default values then, the default is (1000 - 50)/20=47.5, # rounded down to '47'. #rrp_token_expired_timeout: 47 ------------------------------------ Cheers -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From kitgerrits at gmail.com Wed Dec 29 19:49:26 2010 From: kitgerrits at gmail.com (Kit Gerrits) Date: Wed, 29 Dec 2010 20:49:26 +0100 Subject: [Linux-cluster] Multiple communication channels In-Reply-To: <1218306684.1.1293649373746.JavaMail.root@zimbra> Message-ID: <4d1b90ff.857a0e0a.45e5.ffff8b97@mx.google.com> Hello, AFAIK, Multi-interface heartbeat is something that was only recently added to RHCS (earlier this year, if I recall correctly). Until then, the failover part was usually achieved by using a bonded interface as heartbeat interface. If possible, I would suggest using 2 (connected) Multicast switches and running a bond from each server to each switch. Or 2 regular switches and broadcast heartbeat (switches only connected to eachother) Otherwise, using an active-active bond (channel?) with 2 crossover cables might also work, but offers less protection against interface failures. Regards, Kit _____ From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Stefan Lesicnik Sent: woensdag 29 december 2010 20:03 To: linux-cluster at redhat.com Subject: [Linux-cluster] Multiple communication channels Hi all, I am running RHCS 5 and have a two node cluster with a shared qdisk. I have a bonded network bond0 and a back to back crossover eth1. Currently I have multicast cluster communication over the crossover, but was wondering if it was possible to use bond0 as an alternative / failover. So if eth1 was down, it could still communicate? I havent been able to find anything in the FAQ / documentation that would suggest this, so I thought I would ask. 
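As a rough illustration of the bonded-heartbeat approach suggested above, an active-backup bond on RHEL 5 usually looks something like the following; the interface names, addresses and bonding mode are placeholders to adapt, not a tested recipe.

# /etc/modprobe.conf
alias bond0 bonding
options bond0 mode=1 miimon=100       # mode=1 is active-backup

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=10.0.0.1
NETMASK=255.255.255.0
BOOTPROTO=none
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-eth1 (and likewise for the second slave)
DEVICE=eth1
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes

The cluster then binds only to bond0, so a single failed NIC or switch port is handled by the bonding driver underneath the cluster stack rather than by a second totem ring.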
Thanks alot and I hope everyone has a great new year :) Stefan -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster _____ No virus found in this message. Checked by AVG - www.avg.com Version: 10.0.1191 / Virus Database: 1435/3346 - Release Date: 12/29/10 -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux at alteeve.com Wed Dec 29 19:57:53 2010 From: linux at alteeve.com (Digimer) Date: Wed, 29 Dec 2010 14:57:53 -0500 Subject: [Linux-cluster] Multiple communication channels In-Reply-To: <4d1b90ff.857a0e0a.45e5.ffff8b97@mx.google.com> References: <4d1b90ff.857a0e0a.45e5.ffff8b97@mx.google.com> Message-ID: <4D1B92C1.2030900@alteeve.com> On 12/29/2010 02:49 PM, Kit Gerrits wrote: > Hello, > > AFAIK, Multi-interface heartbeat is something that was only recently > added to RHCS (earlier this year, if I recall correctly). > > Until then, the failover part was usually achieved by using a bonded > interface as heartbeat interface. > If possible, I would suggest using 2 (connected) Multicast switches and > running a bond from each server to each switch. > Or 2 regular switches and broadcast heartbeat (switches only connected > to eachother) > Otherwise, using an active-active bond (channel?) with 2 crossover > cables might also work, but offers less protection against interface > failures. > > > Regards, > > Kit Hi, It was around in el5. Perhaps not in the early versions, I am not sure exactly when it was added, but certainly by 5.4. In the recent 3.x branch, openais was replaced by corosync (for core cluster communications), which is where rrp is controlled. Of course, I could always be wrong. :) Cheers. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From cos at aaaaa.org Wed Dec 29 20:28:36 2010 From: cos at aaaaa.org (Ofer Inbar) Date: Wed, 29 Dec 2010 15:28:36 -0500 Subject: [Linux-cluster] question ccsd for multiple clusters on same subnet Message-ID: <20101229202836.GX934@mip.aaaaa.org> CentOS 5.3, cman-2.0.115-34.el5_5.3. Working in a test environment with VMware, setting up test clusters, so we're not setting up separate VLANs for each cluster (though we will do that outside this initial test environment). I accidentally started cman on a new node for a new cluster before I copied the cluster.conf file I wanted to use into /etc/cluster, and to my surprise, it picked up a cluster.conf from an older cluster that's already up and running. In /var/log/messages, I see: ccsd[4475]: Unable to parse /etc/cluster/cluster.conf ccsd[4475]: Searching cluster for valid copy. ccsd[4475]: Remote copy of cluster.conf (version = 15) found. ccsd[4475]: Remote copy of cluster.conf is from quorate node. This let me to notice the documention of the -P option in ccsd's man page. It seems that each ccsd listens on three default ports, and uses broadcast for some things, such that ccsd's from separate clusters might potentially see and talk to each other if they don't have separate VLANs. I assume this is the reason my new host picked up a cluster.conf from an unrelated cluster. However, the documentation is vague and uninformative. I don't really know what ccsd uses each port for. It doesn't even say what the defaults are (though I can see from lsof that they're 50006,7,8, and I could experiment to figure out which port was which). 
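For what it's worth, the listeners in question can be enumerated per protocol with the usual tools (output omitted here since it varies per host); that at least shows which of the three ports are TCP and which are UDP, even if the man page doesn't say what each one is for.

lsof -nP -i | grep ccsd
netstat -tulpn | grep ccsd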
I also don't know whether there's any standard practice for running with non-default ports (it does look like /etc/init.d/cman looks for an environment variable named CCSD_OPTS).

If I properly initialize my clusters by copying the right cluster.conf into place before I first start cman, I won't encounter the specific problem I had here. However, does that make it okay, or will I run into other problems running multiple clusters whose ccsd daemons use the same ports on the same subnet? Where can I find documentation about this, aimed at sysadmins?
-- 
Cos
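For anyone wanting to experiment with moving the ports, here is a sketch of one approach. It assumes the -P [bcf]:port syntax from the ccsd(8) man page and that the cman init script picks CCSD_OPTS up from /etc/sysconfig/cman; both assumptions are worth verifying on your build, and the port numbers below are arbitrary examples.

# /etc/sysconfig/cman  (untested sketch; use the same values on every node of one cluster)
CCSD_OPTS="-P b:60008 -P c:60007 -P f:60006"

Separate VLANs per cluster, as already planned outside the test environment, remain the cleaner isolation.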