From jlauro at umflint.edu Tue May 1 01:44:49 2007 From: jlauro at umflint.edu (Lauro, John) Date: Mon, 30 Apr 2007 21:44:49 -0400 Subject: [Linux-cluster] Using ext3 on shared storge (GFS and GFS2 hangs) In-Reply-To: <46364398.7080401@gmail.com> References: <46364398.7080401@gmail.com> Message-ID: The main con, unlike gfs, ext3 can't be shared directly. It only works as failover. You could share it via NFS on wherever it is active. -----Original Message----- ... Now my question: which can be pros and cons about using ext3 filesystem under a shared storage for RHEL5 CS?? From garylua at singnet.com.sg Wed May 2 06:31:54 2007 From: garylua at singnet.com.sg (garylua at singnet.com.sg) Date: Wed, 02 May 2007 14:31:54 +0800 (SGT) Subject: [Linux-cluster] reducing heartbeat interval Message-ID: <1178087514.4638305a7bce8@discus.singnet.com.sg> Hi, hwo do i go about changing the heartbeat intervals for cluster suite for RHEL4? Is the minimum 5 seconds, or can i go even lower, like 1 sec? thanks From pcaulfie at redhat.com Wed May 2 07:29:23 2007 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Wed, 02 May 2007 08:29:23 +0100 Subject: [Linux-cluster] reducing heartbeat interval In-Reply-To: <1178087514.4638305a7bce8@discus.singnet.com.sg> References: <1178087514.4638305a7bce8@discus.singnet.com.sg> Message-ID: <46383DD3.4030807@redhat.com> garylua at singnet.com.sg wrote: > Hi, hwo do i go about changing the heartbeat intervals for cluster suite for RHEL4? Is the minimum 5 seconds, or can i go even lower, like 1 sec? thanks cluster.conf parameters (eg): The number are integer seconds so 1 is the smallest value you can have. Before you put that into production though, check it under load because high I/O or network loads can cause spurious node "failures". see /proc/cluster/config/cman for all the changeable values. -- Patrick Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SL4 ITE, UK. Registered in England and Wales under Company Registration No. 3798903 From carlopmart at gmail.com Wed May 2 08:01:29 2007 From: carlopmart at gmail.com (carlopmart) Date: Wed, 02 May 2007 10:01:29 +0200 Subject: [Linux-cluster] About recently rehl4.5 released Message-ID: <46384559.8010602@gmail.com> Hi all, Sombedoy knows if it is possible to use Cluster Suite and GFS packages under rhel4.5 with xenU kernels?? Many thanks. -- CL Martinez carlopmart {at} gmail {d0t} com From cnguyen.ext at orange-ftgroup.com Wed May 2 08:50:13 2007 From: cnguyen.ext at orange-ftgroup.com (NGUYEN Can Ext ROSI/DPS) Date: Wed, 02 May 2007 10:50:13 +0200 Subject: [Linux-cluster] service starts up problem In-Reply-To: <20070430192934.GF4012@redhat.com> References: <1177924289.2858.15.camel@ngua> <20070430192934.GF4012@redhat.com> Message-ID: <1178095813.2976.9.camel@ngua> Dear Lon and community, Juste say that was not bug from CLuster Manager, it was just from shared IP that was not in the same vlan (network). When i put my shared IP in the same VLAN than eth0 for example, the service was started fine and shared ip was present and ready for working thank you NNC Le lundi 30 avril 2007 ? 15:29 -0400, Lon Hohberger a ?crit : > On Mon, Apr 30, 2007 at 11:11:29AM +0200, NGUYEN Can Ext ROSI/DPS wrote: > > Could someone help me please ? for example : give me a starting script > > that works or show me where i can find patch > > for ... /etc/init.d/functions ?? 
> > http://sources.redhat.com/cluster/faq.html#rgm_wontrestart > > More information ^^^ > > Patch vvv > > https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=111998 > > -- Lon > -- Can NGUYEN ---------------- STP Unix - 96 13 ********************************* This message and any attachments (the "message") are confidential and intended solely for the addressees. Any unauthorised use or dissemination is prohibited. Messages are susceptible to alteration. France Telecom Group shall not be liable for the message if altered, changed or falsified. If you are not the intended addressee of this message, please cancel it immediately and inform the sender. ******************************** From sebastian.walter at fu-berlin.de Wed May 2 09:09:31 2007 From: sebastian.walter at fu-berlin.de (Sebastian Walter) Date: Wed, 02 May 2007 11:09:31 +0200 Subject: [Linux-cluster] clvmd hangs In-Reply-To: <20070430171711.GA7539@redhat.com> References: <4635E434.3090607@fu-berlin.de> <20070430171711.GA7539@redhat.com> Message-ID: <4638554B.2070604@fu-berlin.de> Thanks for your help. These are /proc/cluster/services: ###master Service Name GID LID State Code Fence Domain: "default" 6 2 run - [3 2 1] DLM Lock Space: "clvmd" 5 3 join S-6,20,3 [3 2 1] ### node1: Service Name GID LID State Code Fence Domain: "default" 6 2 run - [3 2 1] DLM Lock Space: "clvmd" 5 3 update U-4,1,1 [2 3 1] ### node2: Service Name GID LID State Code Fence Domain: "default" 6 3 run - [3 2 1] DLM Lock Space: "clvmd" 5 4 update U-4,1,1 [2 3 1] There is nothing in /var/log/messages on any host. Whe I shut down "master", the other nodes are starting up clvmd and vg*-programs work. David Teigland wrote: > On Mon, Apr 30, 2007 at 02:42:28PM +0200, Sebastian Walter wrote: > >> DLM Lock Space: "clvmd" 1 1 join >> S-6,20,2 >> [1 3] >> > > The dlm is stuck trying to form a lockspace. It would be useful to see > this info from each node, and any cman/dlm/fenced lines from > /var/log/messages. > > Dave > > From frederik.ferner at diamond.ac.uk Wed May 2 13:44:06 2007 From: frederik.ferner at diamond.ac.uk (Frederik Ferner) Date: Wed, 02 May 2007 14:44:06 +0100 Subject: [Linux-cluster] (new) problems with qdisk, running test rpms Message-ID: <463895A6.50100@diamond.ac.uk> Hi, finally I had a chance to experiment with the test rpms for cman[1] that should solve the problem with multiple master I had... For these tests I was using the following rpms on RHEL4U4: kernel-smp-2.6.9-42.0.3.EL cman-kernel-smp-2.6.9-45.8.1TEST cman-1.0.11-0.4.1qdisk rgmanager-1.9.54-1 To test this I have two server connected to one switch with nothing else connected and one uplink. As heuristics for qdiskd I'm pinging a few IP addresses outside of this switch. When I unplug the uplink with the old cman installed, qdiskd on both servers immediately notice this and lower the score accordingly. With the new version of qdiskd it seems the heuristics are not tested anymore after it reaches a sufficient score once. When the outside network is lost qdiskd on both server still claim the same score in the status file and both servers report the votes for the qdisk to cman. If qdiskd is started while the outside network is unreachable the scores start without the scores for the failing heuristics. Once network is restored the score jumps to at least the minimum required for operation and once again stays there. Is this a bug that will be fixed in the upcoming RHEL4U5 release or could there be something else wrong with my setup? 
Here's my quorumd section from cluster.conf ----- ----- If you need any more information, I happy to provide this. Kind regards, Frederik [1] http://people.redhat.com/lhh/packages.html -- Frederik Ferner Linux Systems Administrator phone: +44 1235 77 8624 Diamond Light Source Ltd. mob: +44 7917 08 5110 From jstoner at opsource.net Wed May 2 22:33:47 2007 From: jstoner at opsource.net (Jeff Stoner) Date: Wed, 2 May 2007 23:33:47 +0100 Subject: [Linux-cluster] Kernel I/O scheduler In-Reply-To: <46383DD3.4030807@redhat.com> Message-ID: <38A48FA2F0103444906AD22E14F1B5A305CB6E09@mailxchg01.corp.opsource.net> Just a thought...anyone do any load testing with GFS (and non-GFS clusters) using different I/O schedulers (CFQ v Deadline v NOOP v as?) --Jeff Service Engineer OpSource Inc. > -----Original Message----- > Subject: Re: [Linux-cluster] reducing heartbeat interval > > garylua at singnet.com.sg wrote: > > Hi, hwo do i go about changing the heartbeat intervals for cluster > > suite for RHEL4? Is the minimum 5 seconds, or can i go even lower, > > like 1 sec? thanks > > cluster.conf parameters (eg): > > > > The number are integer seconds so 1 is the smallest value you > can have. Before you put that into production though, check > it under load because high I/O or network loads can cause > spurious node "failures". > > see /proc/cluster/config/cman for all the changeable values. From tmornini at engineyard.com Wed May 2 23:16:16 2007 From: tmornini at engineyard.com (Tom Mornini) Date: Wed, 2 May 2007 16:16:16 -0700 Subject: [Linux-cluster] Kernel I/O scheduler In-Reply-To: <38A48FA2F0103444906AD22E14F1B5A305CB6E09@mailxchg01.corp.opsource.net> References: <38A48FA2F0103444906AD22E14F1B5A305CB6E09@mailxchg01.corp.opsource.net> Message-ID: This is a really terrific question. We use Coraid storage, and their driver operates below the I/O scheduler, which is a shame. -- -- Tom Mornini, CTO -- Engine Yard, Ruby on Rails Hosting -- Support, Scalability, Reliability -- (866) 518-YARD (9273) On May 2, 2007, at 3:33 PM, Jeff Stoner wrote: > Just a thought...anyone do any load testing with GFS (and non-GFS > clusters) using different I/O schedulers (CFQ v Deadline v NOOP v as?) > > --Jeff > Service Engineer > OpSource Inc. > > >> -----Original Message----- >> Subject: Re: [Linux-cluster] reducing heartbeat interval >> >> garylua at singnet.com.sg wrote: >>> Hi, hwo do i go about changing the heartbeat intervals for cluster >>> suite for RHEL4? Is the minimum 5 seconds, or can i go even lower, >>> like 1 sec? thanks >> >> cluster.conf parameters (eg): >> >> >> >> The number are integer seconds so 1 is the smallest value you >> can have. Before you put that into production though, check >> it under load because high I/O or network loads can cause >> spurious node "failures". >> >> see /proc/cluster/config/cman for all the changeable values. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From garylua at singnet.com.sg Thu May 3 01:45:49 2007 From: garylua at singnet.com.sg (garylua at singnet.com.sg) Date: Thu, 03 May 2007 09:45:49 +0800 (SGT) Subject: [Linux-cluster] reducing heartbeat interval Message-ID: <1178156749.46393ecd8eada@discus.singnet.com.sg> Hi Patrick, Thanks for the response. I understand that the hello_timer parameter is the heartbeat interval. However, when i look at my cluster.conf, there is no entry for the hello_timer or the deadnode_timer. Do i have to add the line into my cluster.conf file? 
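(If it does go in cluster.conf, I am guessing it would just be extra attributes on the cman line, something like the sketch below -- the values are placeholders and the attribute names are whatever shows up under /proc/cluster/config/cman, so please correct me if the syntax is wrong:)

    <?xml version="1.0"?>
    <cluster name="mycluster" config_version="2">
            <cman hello_timer="2" deadnode_timer="10"/>
            <!-- clusternodes, fencedevices etc. unchanged -->
    </cluster>
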
thanks From pcaulfie at redhat.com Thu May 3 07:50:30 2007 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Thu, 03 May 2007 08:50:30 +0100 Subject: [Linux-cluster] reducing heartbeat interval In-Reply-To: <1178156749.46393ecd8eada@discus.singnet.com.sg> References: <1178156749.46393ecd8eada@discus.singnet.com.sg> Message-ID: <46399446.1080300@redhat.com> garylua at singnet.com.sg wrote: > Hi Patrick, > > Thanks for the response. I understand that the hello_timer parameter is the heartbeat interval. However, when i look at my cluster.conf, there is no entry for the hello_timer or the deadnode_timer. Do i have to add the line into my cluster.conf file? thanks > Yes, you'll need to add it by hand. We don't normally fill in those defaults because we don't encourage people to play with them! -- Patrick Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SL4 ITE, UK. Registered in England and Wales under Company Registration No. 3798903 From sebastian.walter at fu-berlin.de Thu May 3 09:27:08 2007 From: sebastian.walter at fu-berlin.de (Sebastian Walter) Date: Thu, 03 May 2007 11:27:08 +0200 Subject: [Linux-cluster] clvmd hangs In-Reply-To: <4638554B.2070604@fu-berlin.de> References: <4635E434.3090607@fu-berlin.de> <20070430171711.GA7539@redhat.com> <4638554B.2070604@fu-berlin.de> Message-ID: <4639AAEC.30808@fu-berlin.de> Does anybody have a solution for this? Is there any documentation about the Code messages? Sebastian Walter wrote: > Thanks for your help. These are /proc/cluster/services: > > ###master > Service Name GID LID State Code > Fence Domain: "default" 6 2 run - > [3 2 1] > > DLM Lock Space: "clvmd" 5 3 join > S-6,20,3 > [3 2 1] > > ### node1: > Service Name GID LID State Code > Fence Domain: "default" 6 2 run - > [3 2 1] > > DLM Lock Space: "clvmd" 5 3 update > U-4,1,1 > [2 3 1] > > ### node2: > Service Name GID LID State Code > Fence Domain: "default" 6 3 run - > [3 2 1] > > DLM Lock Space: "clvmd" 5 4 update > U-4,1,1 > [2 3 1] > > > There is nothing in /var/log/messages on any host. Whe I shut down > "master", the other nodes are starting up clvmd and vg*-programs work. > > > > > David Teigland wrote: >> On Mon, Apr 30, 2007 at 02:42:28PM +0200, Sebastian Walter wrote: >> >>> DLM Lock Space: "clvmd" 1 1 join >>> S-6,20,2 >>> [1 3] >>> >> >> The dlm is stuck trying to form a lockspace. It would be useful to see >> this info from each node, and any cman/dlm/fenced lines from >> /var/log/messages. >> >> Dave >> >> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From rhurst at bidmc.harvard.edu Thu May 3 16:54:28 2007 From: rhurst at bidmc.harvard.edu (rhurst at bidmc.harvard.edu) Date: Thu, 3 May 2007 12:54:28 -0400 Subject: [Linux-cluster] clvmd and exclusive locking In-Reply-To: <4639AAEC.30808@fu-berlin.de> References: <4635E434.3090607@fu-berlin.de> <20070430171711.GA7539@redhat.com> <4638554B.2070604@fu-berlin.de> <4639AAEC.30808@fu-berlin.de> Message-ID: <1178211268.6418.8.camel@WSBID06223> I do not understand how the vgchange/lvchange -aey option gets "enforced" on other nodes that might want to compete for the same clustered VG/lvol resource. For example, node A exclusively locks and mounts -t ext3 /dev/VGBOOTA/lvol0, but on node B, I can can still lock & mount the same volume -- what gives? Shouldn't the clvmd disallow such attempts? Is there a guide that someone can refer me to for such answers? TIA. Robert Hurst, Sr. Cach? 
Administrator Beth Israel Deaconess Medical Center 1135 Tremont Street, REN-7 Boston, Massachusetts 02120-2140 617-754-8754 ? Fax: 617-754-8730 ? Cell: 401-787-3154 Any technology distinguishable from magic is insufficiently advanced. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 2178 bytes Desc: not available URL: From garylua at singnet.com.sg Fri May 4 01:55:42 2007 From: garylua at singnet.com.sg (garylua at singnet.com.sg) Date: Fri, 04 May 2007 09:55:42 +0800 (SGT) Subject: [Linux-cluster] reducing heartbeat interval In-Reply-To: <46399446.1080300@redhat.com> References: <1178156749.46393ecd8eada@discus.singnet.com.sg> <46399446.1080300@redhat.com> Message-ID: <1178243742.463a929ed4d05@discus.singnet.com.sg> Hi Patrick, Thanks for the help again. Can I know in which part of the cluster.conf file do I insert the heartbeat parameter line? And I've heard somewhere that the minimum time that I can set for the heartbeat interval is 5 seconds. Just wondering whether is that true. Thanks --- Patrick Caulfield wrote: > garylua at singnet.com.sg wrote: > > Hi Patrick, > > > > Thanks for the response. I understand that the hello_timer > parameter is the heartbeat interval. However, when i look at my > cluster.conf, there is no entry for the hello_timer or the > deadnode_timer. Do i have to add the line into my cluster.conf file? > thanks > > > > Yes, you'll need to add it by hand. We don't normally fill in those > defaults > because we don't encourage people to play with them! > > -- > Patrick > > Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod > Street, > Windsor, Berkshire, SL4 ITE, UK. > Registered in England and Wales under Company Registration No. > 3798903 > From bsd_daemon at msn.com Fri May 4 08:36:43 2007 From: bsd_daemon at msn.com (mehmet celik) Date: Fri, 04 May 2007 08:36:43 +0000 Subject: [Linux-cluster] reducing heartbeat interval In-Reply-To: <1178243742.463a929ed4d05@discus.singnet.com.sg> Message-ID: Hii all, patrick; if you want to set heartbeat time, you should use hello_timer file. For using this modprobe cman echo 5 >/proc/cluster/config/cman/hello_timer service cman start if you not loaded the "cman module", can't you touch file.. have a nice day. Mehmet CELIK Istanbul/TURKEY > > > >--- Patrick Caulfield wrote: > > > garylua at singnet.com.sg wrote: > > > Hi Patrick, > > > > > > Thanks for the response. I understand that the hello_timer > > parameter is the heartbeat interval. However, when i look at my > > cluster.conf, there is no entry for the hello_timer or the > > deadnode_timer. Do i have to add the line into my cluster.conf file? > > thanks > > > > > > > Yes, you'll need to add it by hand. We don't normally fill in those > > defaults > > because we don't encourage people to play with them! > > > > -- > > Patrick > > > > Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod > > Street, > > Windsor, Berkshire, SL4 ITE, UK. > > Registered in England and Wales under Company Registration No. > > 3798903 > > > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster _________________________________________________________________ Need a break? Find your escape route with Live Search Maps. 
http://maps.live.com/default.aspx?ss=Restaurants~Hotels~Amusement%20Park&cp=33.832922~-117.915659&style=r&lvl=13&tilt=-90&dir=0&alt=-1000&scene=1118863&encType=1&FORM=MGAC01 From Alain.Moulle at bull.net Fri May 4 08:47:35 2007 From: Alain.Moulle at bull.net (Alain Moulle) Date: Fri, 04 May 2007 10:47:35 +0200 Subject: [Linux-cluster] Back to the problem of unicity of cluster_id Message-ID: <463AF327.8060108@bull.net> Hi About the problem of unicity of cluster_id generated by cnxman (see a copy of my old email on the ML just below), is there something fixed in new versions (U5?) ? or is there a possibility to set a cluster_id directly in cluster.conf instead of the cluster_name ? Thanks Alain --------------------------------------------------------------------- There is a problem of unicity of cluster_id : I have a big configuration with lots of CS4 clusters in pairs. The cluster_names are such as : iocell1 for first pair iocell2 for 2nd one ... Naming is automatic as there are lots of HA node pairs. ok, now let's take the cluster iocell13 and iocell21 and the algorythm in cnxman.c : static uint16_t generate_cluster_id(char *name) { int i; int value = 0; for (i=0; i Hello, I the Release Note of the latest RHEL 4 update, I read about cluster dm ioctl. For my clusters I'm using dm-multipath. So is this feature will influence my cluster ? What is this cluster dm ioctl ? Tx! From teigland at redhat.com Fri May 4 13:38:54 2007 From: teigland at redhat.com (David Teigland) Date: Fri, 4 May 2007 08:38:54 -0500 Subject: [Linux-cluster] clvmd hangs In-Reply-To: <4639AAEC.30808@fu-berlin.de> References: <4635E434.3090607@fu-berlin.de> <20070430171711.GA7539@redhat.com> <4638554B.2070604@fu-berlin.de> <4639AAEC.30808@fu-berlin.de> Message-ID: <20070504133854.GA4659@redhat.com> On Thu, May 03, 2007 at 11:27:08AM +0200, Sebastian Walter wrote: > Does anybody have a solution for this? Is there any documentation about > the Code messages? > > > Sebastian Walter wrote: > >Thanks for your help. These are /proc/cluster/services: > > > >###master > >Service Name GID LID State Code > >Fence Domain: "default" 6 2 run - > >[3 2 1] > > > >DLM Lock Space: "clvmd" 5 3 join > >S-6,20,3 > >[3 2 1] > > > >### node1: > >Service Name GID LID State Code > >Fence Domain: "default" 6 2 run - > >[3 2 1] > > > >DLM Lock Space: "clvmd" 5 3 update > >U-4,1,1 > >[2 3 1] > > > >### node2: > >Service Name GID LID State Code > >Fence Domain: "default" 6 3 run - > >[3 2 1] > > > >DLM Lock Space: "clvmd" 5 4 update > >U-4,1,1 > >[2 3 1] This says that the dlm is stuck in recovery on all the nodes. Which version of the code are you using? Has this happened more than once? Does the cluster have quorum? (cman_tool status) What does /proc/cluster/dlm_debug show from all nodes? What are the dlm threads waiting on? 
(ps ax -o pid,stat,wchan,cmd | grep dlm) Dave From wferi at niif.hu Fri May 4 15:43:10 2007 From: wferi at niif.hu (Ferenc Wagner) Date: Fri, 04 May 2007 17:43:10 +0200 Subject: [Linux-cluster] Slowness above 500 RRDs In-Reply-To: <20070426170844.GA14224@redhat.com> (David Teigland's message of "Thu, 26 Apr 2007 12:08:44 -0500") References: <87648r6hdi.fsf@tac.ki.iif.hu> <87ps6tl685.fsf@szonett.ki.iif.hu> <20070328162726.GF22230@redhat.com> <20070328163850.GG22230@redhat.com> <87wt06djk7.fsf@tac.ki.iif.hu> <20070423211717.GA22147@redhat.com> <20070424193600.GB11156@redhat.com> <87irbktpzq.fsf@tac.ki.iif.hu> <20070425165553.GC9891@redhat.com> <87bqhbdrf1.fsf@tac.ki.iif.hu> <20070426170844.GA14224@redhat.com> Message-ID: <87abwk388h.fsf@tac.ki.iif.hu> David Teigland writes: > On Thu, Apr 26, 2007 at 06:36:02PM +0200, Ferenc Wagner wrote: > >> I'm working with three nodes: 1, 2 and 3. Looks like the mount by 3 >> makes a big difference. When the filesystem is mounted by 1 and 2 >> only, my test runs much faster. Filesystem mounted by 3 alone is also >> fast. But 3 doesn't seem to cooperate with anyone else with >> reasonable performance. >> >> If I mount the filesystem on all three nodes and run the test on 1, >> the network traffic of 2 and 3 is rather unbalanced: tcpdump receives >> 19566 packets on 2 and 29181 on 3. It's all 21064/tcp traffic, I can >> provide detailed data if that seems useful. > > It sounds like your tests are mixing the effects of the flocks/plocks with > the effects of gfs's own internal file locking. If you want to test and > compare flock/plock performance you need to make sure that gfs's internal > dlm locks are always mastered on the same node (either locally, in which > case it'll be fast, or remotely in which case it'll be slow). The first > node to use a lock will be the master of it. I do the following: 1. reboot all three nodes 2. mount GFS on node 1, 2 and 3 3. run the test on node 1 -> it's slow 4. umount GFS on node 3 5. run the test on node 1 -> it's fast 6. reboot all three nodes 7. mount GFS on node 1, 2 and 3 8. run the test on node 1 -> it's slow 9. umount GFS on node 2 10. run the test on node 1 -> it's slow again I hope the above ensures that node 1 is always the master of all locks. So where could this discrepancy stem from? I'll check whether the boot order influences this, but really running out of ideas... Btw, it there no way for a node to take the lock master role from another (short of unmount the GFS volume on the original master)? -- Thanks, Feri. From teigland at redhat.com Fri May 4 18:17:23 2007 From: teigland at redhat.com (David Teigland) Date: Fri, 4 May 2007 13:17:23 -0500 Subject: [Linux-cluster] Re: [PATCH] DLM: fix a couple of races In-Reply-To: References: Message-ID: <20070504181723.GA13775@redhat.com> On Fri, May 04, 2007 at 09:49:45PM +0530, Satyam Sharma wrote: > Hi, > > There are the following two trivially-fixed races in fs/dlm/config.c: > > 1. The configfs subsystem semaphore must be held by the caller when > calling config_group_find_obj(). It's needed to walk the subsystem > hierarchy without racing with a simultaneous mkdir(2) or rmdir(2). I > looked around to see if there was some other way we were avoiding this > race, but couldn't find any. > > 2. get_comm() does hold the subsystem semaphore but lets go too soon -- > before grabbing a reference on the found config_item. A concurrent > rmdir(2) could come and release the comm after the up() but before the > config_item_get(). > > Patch that fixes both these bugs below. 
Thanks, Steve should be able to throw this into one of his git trees. Dave From rhurst at bidmc.harvard.edu Sat May 5 12:47:37 2007 From: rhurst at bidmc.harvard.edu (rhurst at bidmc.harvard.edu) Date: Sat, 5 May 2007 08:47:37 -0400 Subject: [Linux-cluster] GFS/CS compatibility with kernel 2.6.9-55 ? References: <20070504181723.GA13775@redhat.com> Message-ID: Our RHEL4 AS up2date is reporting that there is a new kernel release from 2.6.9-42.0.10.ELsmp to: kernel-smp 2.6.9 55.EL x86_64 ...but no mention of dependencies from the GFS/CS channel. My understanding is that the GFS/CS kernel modules need to match the kernel release. So does this mean it is still safe to update without breaking GFS/CS? Current packages are: # rpm -q ccs cman dlm fence GFS rgmanager ccs-1.0.7-0 cman-1.0.11-0 dlm-1.0.1-1 fence-1.32.25-1 GFS-6.1.6-1 rgmanager-1.9.54-1 Any advice is appreciated, thanks. Robert Hurst, Sr. Cach? Administrator Beth Israel Deaconess Medical Center 1135 Tremont Street, REN-7 Boston, Massachusetts 02120-2140 617-754-8754 ? Fax: 617-754-8730 ? Cell: 401-787-3154 Any technology distinguishable from magic is insufficiently advanced. -------------- next part -------------- A non-text attachment was scrubbed... Name: winmail.dat Type: application/ms-tnef Size: 3140 bytes Desc: not available URL: From jos at xos.nl Sat May 5 13:01:29 2007 From: jos at xos.nl (Jos Vos) Date: Sat, 5 May 2007 15:01:29 +0200 Subject: [Linux-cluster] GFS/CS compatibility with kernel 2.6.9-55 ? In-Reply-To: ; from rhurst@bidmc.harvard.edu on Sat, May 05, 2007 at 08:47:37AM -0400 References: <20070504181723.GA13775@redhat.com> Message-ID: <20070505150129.B11732@xos037.xos.nl> On Sat, May 05, 2007 at 08:47:37AM -0400, rhurst at bidmc.harvard.edu wrote: > Our RHEL4 AS up2date is reporting that there is a new kernel release > from 2.6.9-42.0.10.ELsmp to: > ...but no mention of dependencies from the GFS/CS channel. > My understanding is that the GFS/CS kernel modules need to match the > kernel release. So does this mean it is still safe to update without > breaking GFS/CS? I think the CS/GFS packages (including the related kernel modules) are not yet released for RHEL4 U5, so you can only update if you don't use the new kernel as the default kernel. -- -- Jos Vos -- X/OS Experts in Open Systems BV | Phone: +31 20 6938364 -- Amsterdam, The Netherlands | Fax: +31 20 6948204 From rhurst at bidmc.harvard.edu Sat May 5 13:07:20 2007 From: rhurst at bidmc.harvard.edu (rhurst at bidmc.harvard.edu) Date: Sat, 5 May 2007 09:07:20 -0400 Subject: [Linux-cluster] GFS/CS compatibility with kernel 2.6.9-55 ? References: <20070504181723.GA13775@redhat.com> <20070505150129.B11732@xos037.xos.nl> Message-ID: Sound advice... I'll defer the kernel update. -----Original Message----- From: linux-cluster-bounces at redhat.com on behalf of Jos Vos Sent: Sat 5/5/2007 9:01 AM To: linux clustering Subject: Re: [Linux-cluster] GFS/CS compatibility with kernel 2.6.9-55 ? On Sat, May 05, 2007 at 08:47:37AM -0400, rhurst at bidmc.harvard.edu wrote: > Our RHEL4 AS up2date is reporting that there is a new kernel release > from 2.6.9-42.0.10.ELsmp to: > ...but no mention of dependencies from the GFS/CS channel. > My understanding is that the GFS/CS kernel modules need to match the > kernel release. So does this mean it is still safe to update without > breaking GFS/CS? I think the CS/GFS packages (including the related kernel modules) are not yet released for RHEL4 U5, so you can only update if you don't use the new kernel as the default kernel. 
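For example, assuming a standard GRUB setup where the newly installed kernel becomes entry 0 in /boot/grub/grub.conf, you can let the kernel packages install but keep booting the old one by pointing the default back at it:

    # /boot/grub/grub.conf
    # entry 0 = new 2.6.9-55.ELsmp, entry 1 = current 2.6.9-42.0.10.ELsmp
    default=1

and switch it back to 0 once the matching GFS/CS kernel modules are released.
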
-- -- Jos Vos -- X/OS Experts in Open Systems BV | Phone: +31 20 6938364 -- Amsterdam, The Netherlands | Fax: +31 20 6948204 -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: winmail.dat Type: application/ms-tnef Size: 3392 bytes Desc: not available URL: From isplist at logicore.net Sun May 6 02:16:27 2007 From: isplist at logicore.net (isplist at logicore.net) Date: Sat, 5 May 2007 21:16:27 -0500 Subject: [Linux-cluster] Log Rotations over GFS won't work Message-ID: <200755211627.188277@leena> Is there something I need to understand about GFS shares for web server log rotations? Each web server has it's own log directory on the GFS share. Each web server has it's path changed to point to the new location via a simple simlink of the /etc/httpd/logs directory. (I also tried simply changing the path in the conf file which didn't work) I run the log rotation, I see no errors and I see no logs rotated. I cannot find a reason for this. Does it have anything to do with the fact that they are on another drive? Mike From garylua at singnet.com.sg Mon May 7 05:55:47 2007 From: garylua at singnet.com.sg (garylua at singnet.com.sg) Date: Mon, 07 May 2007 13:55:47 +0800 (SGT) Subject: [Linux-cluster] reducing heartbeat interval In-Reply-To: <46399446.1080300@redhat.com> References: <1178156749.46393ecd8eada@discus.singnet.com.sg> <46399446.1080300@redhat.com> Message-ID: <1178517347.463ebf6335b93@arrowana.singnet.com.sg> Hi, I've added the line into my cluster.conf and it does not seem to work. I've changed my hello_timer to 2 seconds and when i start the cluster, I checked the /proc/cluster/status/config/hello_timer and it says 2. But when i did a > tail -f /var/log/messages |grep cluster, I still see the the status of each service is being checked every 10 seconds. Can somebody tell me where did i go wrong? thanks --- Patrick Caulfield wrote: > garylua at singnet.com.sg wrote: > > Hi Patrick, > > > > Thanks for the response. I understand that the hello_timer > parameter is the heartbeat interval. However, when i look at my > cluster.conf, there is no entry for the hello_timer or the > deadnode_timer. Do i have to add the line into my cluster.conf file? > thanks > > > > Yes, you'll need to add it by hand. We don't normally fill in those > defaults > because we don't encourage people to play with them! > > -- > Patrick > > Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod > Street, > Windsor, Berkshire, SL4 ITE, UK. > Registered in England and Wales under Company Registration No. > 3798903 > From bsd_daemon at msn.com Mon May 7 09:15:04 2007 From: bsd_daemon at msn.com (mehmet celik) Date: Mon, 07 May 2007 09:15:04 +0000 Subject: [Linux-cluster] reducing heartbeat interval In-Reply-To: <1178517347.463ebf6335b93@arrowana.singnet.com.sg> Message-ID: what is problem ? >From: garylua at singnet.com.sg >Reply-To: garylua at singnet.com.sg,linux clustering > >To: Patrick Caulfield >CC: linux clustering >Subject: Re: [Linux-cluster] reducing heartbeat interval >Date: Mon, 07 May 2007 13:55:47 +0800 (SGT) > >Hi, > >I've added the line into my cluster.conf and it does not seem to work. I've >changed my hello_timer to 2 seconds and when i start the cluster, I checked >the /proc/cluster/status/config/hello_timer and it says 2. 
But when i did >a > tail -f /var/log/messages |grep cluster, I still see the the status of >each service is being checked every 10 seconds. Can somebody tell me where >did i go wrong? thanks > > > >--- Patrick Caulfield wrote: > > > garylua at singnet.com.sg wrote: > > > Hi Patrick, > > > > > > Thanks for the response. I understand that the hello_timer > > parameter is the heartbeat interval. However, when i look at my > > cluster.conf, there is no entry for the hello_timer or the > > deadnode_timer. Do i have to add the line into my cluster.conf file? > > thanks > > > > > > > Yes, you'll need to add it by hand. We don't normally fill in those > > defaults > > because we don't encourage people to play with them! > > > > -- > > Patrick > > > > Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod > > Street, > > Windsor, Berkshire, SL4 ITE, UK. > > Registered in England and Wales under Company Registration No. > > 3798903 > > > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster _________________________________________________________________ Need a break? Find your escape route with Live Search Maps. http://maps.live.com/default.aspx?ss=Restaurants~Hotels~Amusement%20Park&cp=33.832922~-117.915659&style=r&lvl=13&tilt=-90&dir=0&alt=-1000&scene=1118863&encType=1&FORM=MGAC01 From rhurst at bidmc.harvard.edu Mon May 7 12:07:50 2007 From: rhurst at bidmc.harvard.edu (rhurst at bidmc.harvard.edu) Date: Mon, 7 May 2007 08:07:50 -0400 Subject: [Linux-cluster] Broken package management Message-ID: <1178539671.3324.15.camel@WSBID06223> RHEL software subscriptions to RHN and its channels are excellent tools. However, there appears to be no package compatibility tests/dependencies between channels, i.e., RHEL 4 AS updates recently supply kernel and many lvm2 bug fixes, but breaks their own layered software in Cluster Suite and GFS! I can forgive the kernel, because it is generally skipped until all other 3rd party software (even if their all Red Hat partners: EMC, Emulex, InterSystems, Oracle) is checked/certified against the update. But I checked out the updated lvm2, and it consequently breaks lvm2-cluster-2.02.06-7.0.RHEL4. Can the oversight management of GFS/CS packages be explained please? Subscribed Channels (Alter Channel Subscriptions) * Red Hat Enterprise Linux AS (v. 4 for 64-bit AMD64/Intel EM64T) * Red Hat Developer Suite v. 3 (AS v. 4 for x86_64) * Red Hat Application Server v. 2 (AS v. 4 for x86_64) * Red Hat Cluster Suite (for AS v. 4 for AMD64/EM64T) * Red Hat Global File System 6.1 (for AS v4 AMD64/EM64T) * RHEL AS (v. 4 for AMD64/EM64T) Extras Robert Hurst, Sr. Cach? Administrator Beth Israel Deaconess Medical Center 1135 Tremont Street, REN-7 Boston, Massachusetts 02120-2140 617-754-8754 ? Fax: 617-754-8730 ? Cell: 401-787-3154 Any technology distinguishable from magic is insufficiently advanced. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: smime.p7s Type: application/pkcs7-signature Size: 2178 bytes Desc: not available URL: From lhh at redhat.com Mon May 7 14:49:41 2007 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 7 May 2007 10:49:41 -0400 Subject: [Linux-cluster] (new) problems with qdisk, running test rpms In-Reply-To: <463895A6.50100@diamond.ac.uk> References: <463895A6.50100@diamond.ac.uk> Message-ID: <20070507144940.GC29015@redhat.com> On Wed, May 02, 2007 at 02:44:06PM +0100, Frederik Ferner wrote: > Hi, > > finally I had a chance to experiment with the test rpms for cman[1] that > should solve the problem with multiple master I had... > > For these tests I was using the following rpms on RHEL4U4: > > kernel-smp-2.6.9-42.0.3.EL > cman-kernel-smp-2.6.9-45.8.1TEST > cman-1.0.11-0.4.1qdisk > rgmanager-1.9.54-1 > > To test this I have two server connected to one switch with nothing else > connected and one uplink. As heuristics for qdiskd I'm pinging a few IP > addresses outside of this switch. When I unplug the uplink with the old > cman installed, qdiskd on both servers immediately notice this and lower > the score accordingly. > With the new version of qdiskd it seems the heuristics are not tested > anymore after it reaches a sufficient score once. When the outside > network is lost qdiskd on both server still claim the same score in the > status file and both servers report the votes for the qdisk to cman. Hmm, could you add 'tko="1"' to your cluster.conf for the heuristics? I wonder if it's an initialization problem. > If qdiskd is started while the outside network is unreachable the scores > start without the scores for the failing heuristics. Once network is > restored the score jumps to at least the minimum required for operation > and once again stays there. > > Is this a bug that will be fixed in the upcoming RHEL4U5 release or > could there be something else wrong with my setup? This seems to work for me: [10538] debug: Heuristic: 'ping 192.168.79.254 -c1 -t3' missed (1/3) [10538] debug: Heuristic: 'ping 192.168.79.254 -c1 -t3' missed (2/3) [10538] info: Heuristic: 'ping 192.168.79.254 -c1 -t3' DOWN (3/3) [10537] notice: Score insufficient for master operation (0/11; required=6); downgrading Message from syslogd at green at Mon May 7 10:36:43 2007 ... green clurgmgrd[7305]: #1: Quorum Dissolved (machine rebooted) > Here's my quorumd section from cluster.conf > > ----- > log_facility="local4" status_file="/tmp/qdisk_status" > device="/dev/emcpowerq1"> > interval="2"/> > interval="2"/> > interval="2"/> > interval="2"/> > interval="2"/> > interval="2"/> > interval="2"/> > interval="2"/> > > ----- > If you need any more information, I happy to provide this. Hmm, try adding tko="3" to each of your ping heuristics, like this: -- Lon -- Lon Hohberger - Software Engineer - Red Hat, Inc. From wferi at niif.hu Mon May 7 16:48:19 2007 From: wferi at niif.hu (Ferenc Wagner) Date: Mon, 07 May 2007 18:48:19 +0200 Subject: [Linux-cluster] updatedb lockup Message-ID: <87abwg8trg.fsf@tac.ki.iif.hu> Hi, on my test cluster (running cluster suite 1.03 as packaged in Debian Etch), two of the three nodes are locked up in the updatedb daily cronjob. I noticed this when ls commands on the GFS blocked. 
The blocked (D state) processes thus are the following: CMD WCHAN [gfs_glockd] gfs_glockd /usr/bin/find / -ignore_rea glock_wait_internal ls --color=auto glock_wait_internal CMD WCHAN /usr/bin/find / -ignore_rea glock_wait_internal ls --color=auto /mnt glock_wait_internal $ /sbin/cman_tool status Protocol version: 5.0.1 Config version: 1 Cluster name: pilot Cluster ID: 3402 Cluster Member: Yes Membership state: Cluster-Member Nodes: 3 Expected_votes: 3 Total_votes: 3 Quorum: 2 Active subsystems: 6 Node name: YYY Node ID: 1 Node addresses: XXX.XXX.XXX.XXX $ /sbin/cman_tool services Service Name GID LID State Code Fence Domain: "default" 2 3 run - [3 1 2] DLM Lock Space: "clvmd" 1 1 run - [3 1 2] DLM Lock Space: "test" 13 6 run - [2 3 1] GFS Mount Group: "test" 14 7 run - [2 3 1] Apart from the node data, the "test" LIDs and the "default" and "clvmd" permutations are different on the two nodes. I didn't try to lock up the third node by touching the GFS there. Can it be some misconfiguration on my side? If it's a bug, and this situation is usable for the developers (for debugging), I can leave it alone and provide any requested data. Otherwise I reset the machines and go on with my experiments. -- Regards, Feri. From wcheng at redhat.com Mon May 7 16:59:20 2007 From: wcheng at redhat.com (Wendy Cheng) Date: Mon, 07 May 2007 12:59:20 -0400 Subject: [Linux-cluster] updatedb lockup In-Reply-To: <87abwg8trg.fsf@tac.ki.iif.hu> References: <87abwg8trg.fsf@tac.ki.iif.hu> Message-ID: <463F5AE8.4020803@redhat.com> Ferenc Wagner wrote: > Hi, > > on my test cluster (running cluster suite 1.03 as packaged in Debian > Etch), two of the three nodes are locked up in the updatedb daily > cronjob. I noticed this when ls commands on the GFS blocked. The > blocked (D state) processes thus are the following: > > > Can it be some misconfiguration on my side? If it's a bug, and this > situation is usable for the developers (for debugging), I can leave it > alone and provide any requested data. Otherwise I reset the machines > and go on with my experiments. > Don't do anything yet. I'll pass some commands so we can collect the debug info.. Give me 30 minutes ... -- Wendy From rhurst at bidmc.harvard.edu Mon May 7 17:54:56 2007 From: rhurst at bidmc.harvard.edu (rhurst at bidmc.harvard.edu) Date: Mon, 7 May 2007 13:54:56 -0400 Subject: [Linux-cluster] clurgmgrd - #48: Unable to obtain cluster lock: Connectiontimed out Message-ID: <1178560496.7699.38.camel@WSBID06223> What could cause clurgmgrd fail like this? If clurgmgrd has a hiccup like this, is it supposed to shutdown its services? Is there something in our implementation that could have prevented this from shutting down? For unexplained reasons, we just had our CS service (WATSON) go down on its own, and the syslog entry details the event as: May 7 13:18:39 db1 clurgmgrd[17888]: #48: Unable to obtain cluster lock: Connection timed out May 7 13:18:41 db1 kernel: dlm: Magma: reply from 2 no lock May 7 13:18:41 db1 kernel: dlm: reply May 7 13:18:41 db1 kernel: rh_cmd 5 May 7 13:18:41 db1 kernel: rh_lkid 200242 May 7 13:18:41 db1 kernel: lockstate 2 May 7 13:18:41 db1 kernel: nodeid 0 May 7 13:18:41 db1 kernel: status 0 May 7 13:18:41 db1 kernel: lkid ee0388 May 7 13:18:41 db1 clurgmgrd[17888]: Stopping service WATSON ... 
and its service entry looks like this: Eric -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Steven Dake Sent: Monday, May 07, 2007 3:22 PM To: linux clustering Subject: Re: [Linux-cluster] RHEL 5/CentOS 5 cluster? On Mon, 2007-05-07 at 15:08 -0600, Eric Schneider wrote: > I have tried CentOS 5 (i386 and x64) and SL 5 (i386) and I cannot get a 2 > node cluster to startup. I ask questions in the CentOS IRC channel and on > their forums, but no one has a solution. I can get a RHEL 4 cluster working > without issues. Is there something broken in RHEL 5 and clones? > > Eric > > Output from /var/log/messages would be helpful. If you configure a firewall by default, you must add a firewall rule for port 5405 UDP to allow connections from other cluster nodes. This is most likely the problem you are having. Regards -steve > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From Robert.Gil at americanhm.com Tue May 8 00:33:19 2007 From: Robert.Gil at americanhm.com (Robert Gil) Date: Mon, 7 May 2007 20:33:19 -0400 Subject: [Linux-cluster] pvcreate: symbol lookup error: Message-ID: Has anyone come across this error? # pvcreate /dev/emcpowera1 pvcreate: symbol lookup error: /usr/lib/liblvm2clusterlock.so: undefined symbol: lvm_snprintf Thanks, Robert Gil Linux Systems Administrator American Home Mortgage Phone: 631-622-8410 Cell: 631-827-5775 Fax: 516-495-5861 -------------- next part -------------- An HTML attachment was scrubbed... URL: From garylua at singnet.com.sg Tue May 8 01:06:25 2007 From: garylua at singnet.com.sg (garylua at singnet.com.sg) Date: Tue, 08 May 2007 09:06:25 +0800 (SGT) Subject: [Linux-cluster] reducing heartbeat interval In-Reply-To: References: Message-ID: <1178586385.463fcd114cf17@discus.singnet.com.sg> i'm trying to reduce the heartbeat interval of the cluster to as low as possible, but it seems that the the status of my servives are checked every 10 seconds, indicating that my heartbeat interval seems to be 10 seconds. Is there a minimum value of the hellp_timer which i can set? Also, I inserted the line with the hello_timer into my cluster.conf file as Patrick recommended, but the heartbeat interval still remains at 10 sec even though I've changed the value to 2. Thx --- mehmet celik wrote: > > what is problem ? > > >From: garylua at singnet.com.sg > >Reply-To: garylua at singnet.com.sg,linux clustering > > > >To: Patrick Caulfield > >CC: linux clustering > >Subject: Re: [Linux-cluster] reducing heartbeat interval > >Date: Mon, 07 May 2007 13:55:47 +0800 (SGT) > > > >Hi, > > > >I've added the line into my cluster.conf and it does not seem to > work. I've > >changed my hello_timer to 2 seconds and when i start the cluster, I > checked > >the /proc/cluster/status/config/hello_timer and it says 2. But when > i did > >a > tail -f /var/log/messages |grep cluster, I still see the the > status of > >each service is being checked every 10 seconds. Can somebody tell > me where > >did i go wrong? thanks > > > > > > > >--- Patrick Caulfield wrote: > > > > > garylua at singnet.com.sg wrote: > > > > Hi Patrick, > > > > > > > > Thanks for the response. I understand that the hello_timer > > > parameter is the heartbeat interval. 
However, when i look at my > > > cluster.conf, there is no entry for the hello_timer or the > > > deadnode_timer. Do i have to add the line into my cluster.conf > file? > > > thanks > > > > > > > > > > Yes, you'll need to add it by hand. We don't normally fill in > those > > > defaults > > > because we don't encourage people to play with them! > > > > > > -- > > > Patrick > > > > > > Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 > Peascod > > > Street, > > > Windsor, Berkshire, SL4 ITE, UK. > > > Registered in England and Wales under Company Registration No. > > > 3798903 > > > > > > >-- > >Linux-cluster mailing list > >Linux-cluster at redhat.com > >https://www.redhat.com/mailman/listinfo/linux-cluster > > _________________________________________________________________ > Need a break? Find your escape route with Live Search Maps. > http://maps.live.com/default.aspx?ss=Restaurants~Hotels~Amusement%20Park&cp=33.832922~-117.915659&style=r&lvl=13&tilt=-90&dir=0&alt=-1000&scene=1118863&encType=1&FORM=MGAC01 > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From pcaulfie at redhat.com Tue May 8 08:03:03 2007 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 08 May 2007 09:03:03 +0100 Subject: [Linux-cluster] Back to the problem of unicity of cluster_id In-Reply-To: <463AF327.8060108@bull.net> References: <463AF327.8060108@bull.net> Message-ID: <46402EB7.3080901@redhat.com> Alain Moulle wrote: > Hi > About the problem of unicity of cluster_id generated by cnxman (see a > copy of my old email on the ML just below), is there something fixed > in new versions (U5?) ? or is there a possibility to set a cluster_id > directly in cluster.conf instead of the cluster_name ? Yes U5 has a attribute. I'm not changing the generation algorithm as it would break upgrades, and anyway it's impossible to guarantee uniqueness of a 16 bit number generated from a <=15 character string - someone will always hit a duplicate! -- Patrick Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SL4 ITE, UK. Registered in England and Wales under Company Registration No. 3798903 From pcaulfie at redhat.com Tue May 8 08:05:35 2007 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 08 May 2007 09:05:35 +0100 Subject: [Linux-cluster] reducing heartbeat interval In-Reply-To: <1178586385.463fcd114cf17@discus.singnet.com.sg> References: <1178586385.463fcd114cf17@discus.singnet.com.sg> Message-ID: <46402F4F.1090109@redhat.com> garylua at singnet.com.sg wrote: > i'm trying to reduce the heartbeat interval of the cluster to as low as possible, but it seems that the the status of my servives are checked every 10 seconds, indicating that my heartbeat interval seems to be 10 seconds. Is there a minimum value of the hellp_timer which i can set? > > Also, I inserted the line with the hello_timer into my cluster.conf file as Patrick recommended, but the heartbeat interval still remains at 10 sec even though I've changed the value to 2. Thx > You're confusing two separate things. The kernel hello timer affects the polling for nodes. Application poll times are set somewhere else. Sorry I don't know where offhand - probably in the cluster.conf for the service. -- Patrick Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SL4 ITE, UK. Registered in England and Wales under Company Registration No. 
3798903 From swhiteho at redhat.com Tue May 8 08:10:19 2007 From: swhiteho at redhat.com (Steven Whitehouse) Date: Tue, 08 May 2007 09:10:19 +0100 Subject: [Linux-cluster] Re: [PATCH] DLM: fix a couple of races In-Reply-To: <1178611258.7476.5.camel@quoit> References: <1178611258.7476.5.camel@quoit> Message-ID: <1178611819.7476.7.camel@quoit> Hi, On Tue, 2007-05-08 at 09:00 +0100, Steven Whitehouse wrote: > Hi, > > Added to the GFS2 -nmw git tree, thanks. Please remember to add a > Signed-off-by line for future patches - I've added it for you this time, > > Steve. > Sorry - I just spotted that you did add a signed-off-by but git ate it for some reason. I've fixed it up anyway, sorry about that, Steve. From tomas.hoger at gmail.com Tue May 8 08:19:53 2007 From: tomas.hoger at gmail.com (Tomas Hoger) Date: Tue, 8 May 2007 10:19:53 +0200 Subject: [Linux-cluster] reducing heartbeat interval In-Reply-To: <46402F4F.1090109@redhat.com> References: <1178586385.463fcd114cf17@discus.singnet.com.sg> <46402F4F.1090109@redhat.com> Message-ID: <6cfbd1b40705080119i72402e7bi19b9c2e8d9f44dd2@mail.gmail.com> On 5/8/07, Patrick Caulfield wrote: [ ... ] > You're confusing two separate things. The kernel hello timer affects the polling > for nodes. Application poll times are set somewhere else. Sorry I don't know > where offhand - probably in the cluster.conf for the service. Check this FAQ entry: http://sources.redhat.com/cluster/faq.html#rgm_interval th. From rhurst at bidmc.harvard.edu Tue May 8 11:32:33 2007 From: rhurst at bidmc.harvard.edu (rhurst at bidmc.harvard.edu) Date: Tue, 8 May 2007 07:32:33 -0400 Subject: [Linux-cluster] pvcreate: symbol lookup error: In-Reply-To: References: Message-ID: <1178623953.3429.8.camel@WSBID06223> Yes, I have... see earlier rant about 'Broken package management' in regards to U5 release. I tested the update to the latest lvm2, and it broke existing lvm2-cluster. I managed to copy the /usr/sbin/lvm from a production server (pre-U5) to repair the problem. On Mon, 2007-05-07 at 20:33 -0400, Robert Gil wrote: > Has anyone come across this error? > > # pvcreate /dev/emcpowera1 > pvcreate: symbol lookup error: /usr/lib/liblvm2clusterlock.so: > undefined symbol: lvm_snprintf > > Thanks, > > Robert Gil > Linux Systems Administrator > American Home Mortgage > Phone: 631-622-8410 > Cell: 631-827-5775 > Fax: 516-495-5861 > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Robert Hurst, Sr. Cach? Administrator Beth Israel Deaconess Medical Center 1135 Tremont Street, REN-7 Boston, Massachusetts 02120-2140 617-754-8754 ? Fax: 617-754-8730 ? Cell: 401-787-3154 Any technology distinguishable from magic is insufficiently advanced. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 2178 bytes Desc: not available URL: From rhurst at bidmc.harvard.edu Tue May 8 11:35:57 2007 From: rhurst at bidmc.harvard.edu (rhurst at bidmc.harvard.edu) Date: Tue, 8 May 2007 07:35:57 -0400 Subject: [Linux-cluster] Back to the problem of unicity of cluster_id In-Reply-To: <46402EB7.3080901@redhat.com> References: <463AF327.8060108@bull.net> <46402EB7.3080901@redhat.com> Message-ID: <1178624157.3429.10.camel@WSBID06223> Do you know when U5 be available from RHN? 
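(In the meantime, one way to spot a clash before it bites, using nothing more than what cman_tool already reports, is to run

    cman_tool status | grep 'Cluster ID'

on a node of each cluster pair and compare the numbers.)
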
On Tue, 2007-05-08 at 09:03 +0100, Patrick Caulfield wrote: > Alain Moulle wrote: > > Hi > > About the problem of unicity of cluster_id generated by cnxman (see a > > copy of my old email on the ML just below), is there something fixed > > in new versions (U5?) ? or is there a possibility to set a cluster_id > > directly in cluster.conf instead of the cluster_name ? > > > Yes U5 has a > > > > attribute. > > I'm not changing the generation algorithm as it would break upgrades, and anyway > it's impossible to guarantee uniqueness of a 16 bit number generated from a <=15 > character string - someone will always hit a duplicate! > > Robert Hurst, Sr. Cach? Administrator Beth Israel Deaconess Medical Center 1135 Tremont Street, REN-7 Boston, Massachusetts 02120-2140 617-754-8754 ? Fax: 617-754-8730 ? Cell: 401-787-3154 Any technology distinguishable from magic is insufficiently advanced. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 2178 bytes Desc: not available URL: From dgoldsmith at sans.org Tue May 8 14:15:36 2007 From: dgoldsmith at sans.org (David Goldsmith) Date: Tue, 08 May 2007 10:15:36 -0400 Subject: [Linux-cluster] Combining LVS and RHCS Message-ID: <46408608.3020907@sans.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Assume I have set up an LVS 'blob' that has two LVS routers and some number of real servers used to load balance services. Assume I have also set up a third tier using 2 servers and RHCS to cluster shared storage arrays. - From the reading I have been doing, a cluster requires quorum to continue to function so a two-node cluster providing the shared storage would seem to fail to meet quorum if one of the two servers dies. Can the servers in the second-tier of the LVS blob that are providing load balanced services also have RHCS components installed on them and just not be part of any service specific failover domains? This way the total number of member servers in the cluster would still be greater than half, even if one of the two storage server nodes failed. Thanks - -- David Goldsmith -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFGQIYH417vU8/9QfkRAp0BAJ9CghQ6vIyaeXCYW3+qd+2I2Yv0DgCgkGs9 w1FIMAKAScwd9vWHmvQN7EE= =j2QI -----END PGP SIGNATURE----- From dgoldsmith at sans.org Tue May 8 14:15:44 2007 From: dgoldsmith at sans.org (David Goldsmith) Date: Tue, 08 May 2007 10:15:44 -0400 Subject: [Linux-cluster] Communication between LVS nodes Message-ID: <46408610.2080605@sans.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Assume I have an LVS cluster setup with two LVS routers and 4 LVS member nodes. Two of the nodes provide load balanced web servers. Two of the nodes provide load-balanced proxy servers. External customers connecting to the site can be passed to either of the two web server nodes. Internal folks doing Internet browsing can be passed to either of the two proxy server nodes (assuming their web browser is configured to use the proxy). Can the web servers in the LVS cluster use the LVS interface to the proxy servers rather than communicating directly to one of the two proxy server nodes? If not, and the web server nodes are configured to connect to one specific proxy node, that would seem to create a possible failure point. 
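For concreteness, here is the sort of thing I am hoping is possible (addresses and port are made up): the LVS routers would carry a virtual service for the proxy tier,

    ipvsadm -A -t 192.168.1.50:3128 -s rr
    ipvsadm -a -t 192.168.1.50:3128 -r 192.168.1.61:3128 -m
    ipvsadm -a -t 192.168.1.50:3128 -r 192.168.1.62:3128 -m

and the web server nodes would simply point their proxy setting at 192.168.1.50:3128 rather than at either proxy box directly.
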
Thanks - -- David Goldsmith -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFGQIYP417vU8/9QfkRAke3AJ4jFyOLK0pyVTANchvvg55x34G4hACdF/+0 9X/VFGiVhGQGiec295aMxFs= =eyGf -----END PGP SIGNATURE----- From santiago.del.castillo at fnbox.com Tue May 8 14:47:39 2007 From: santiago.del.castillo at fnbox.com (Santiago Del Castillo) Date: Tue, 8 May 2007 11:47:39 -0300 Subject: [Linux-cluster] Probelm compiling Cluster package. In-Reply-To: <20070507212445.GC31492@redhat.com> References: <33d607980705071418o3a623600qb2f93eaf9cdd0647@mail.gmail.com> <20070507212445.GC31492@redhat.com> Message-ID: <33d607980705080747n2e45c6b4p73b29f984b29ee01@mail.gmail.com> Tank you David. It worked. But now i'm getting this: gcc -Wall -I/home/sdcastillo/sources/cluster/config -DHELPER_PROGRAM -D_FILE_OFFSET_BITS=64 -DGFS2_RELEASE_NAME=\"DEVEL.1178569990\" -ggdb -I/usr/include -I../include -I../libgfs2 -c -o gfs2hex.o gfs2hex.c In file included from hexedit.h:21, from gfs2hex.c:26: /usr/include/linux/gfs2_ondisk.h:53: error: expected specifier-qualifier-list before '__be64' /usr/include/linux/gfs2_ondisk.h:90: error: expected specifier-qualifier-list before '__be32' /usr/include/linux/gfs2_ondisk.h:125: error: expected specifier-qualifier-list before '__be32' /usr/include/linux/gfs2_ondisk.h:164: error: expected specifier-qualifier-list before '__be64' /usr/include/linux/gfs2_ondisk.h:206: error: expected specifier-qualifier-list before '__be32' /usr/include/linux/gfs2_ondisk.h:227: error: expected specifier-qualifier-list before '__be64' /usr/include/linux/gfs2_ondisk.h:285: error: expected specifier-qualifier-list before '__be32' /usr/include/linux/gfs2_ondisk.h:352: error: expected specifier-qualifier-list before '__be32' /usr/include/linux/gfs2_ondisk.h:366: error: expected specifier-qualifier-list before '__be16' /usr/include/linux/gfs2_ondisk.h:392: error: expected specifier-qualifier-list before '__be32' /usr/include/linux/gfs2_ondisk.h:410: error: expected specifier-qualifier-list before '__be64' /usr/include/linux/gfs2_ondisk.h:446: error: expected specifier-qualifier-list before '__be32' /usr/include/linux/gfs2_ondisk.h:463: error: expected specifier-qualifier-list before '__be64' /usr/include/linux/gfs2_ondisk.h:479: error: expected specifier-qualifier-list before '__be64' /usr/include/linux/gfs2_ondisk.h:499: error: expected specifier-qualifier-list before '__be64' gfs2hex.c: In function 'indirect_dirent': gfs2hex.c:187: error: 'struct gfs2_dirent' has no member named 'de_rec_len' gfs2hex.c:188: error: 'struct gfs2_dirent' has no member named 'de_rec_len' gfs2hex.c:190: error: 'struct gfs2_inum' has no member named 'no_addr' gfs2hex.c:191: error: 'struct gfs2_inum' has no member named 'no_addr' gfs2hex.c:194: error: 'struct gfs2_dirent' has no member named 'de_name_len' gfs2hex.c:195: error: 'struct gfs2_inum' has no member named 'no_addr' gfs2hex.c:199: error: 'struct gfs2_dirent' has no member named 'de_rec_len' gfs2hex.c:200: warning: control reaches end of non-void function gfs2hex.c: In function 'do_dinode_extended': gfs2hex.c:221: error: 'struct gfs2_dinode' has no member named 'di_mode' gfs2hex.c:222: error: 'struct gfs2_dinode' has no member named '__pad1' gfs2hex.c:226: error: 'struct gfs2_dinode' has no member named 'di_height' gfs2hex.c:240: error: 'struct gfs2_dinode' has no member named 'di_flags' gfs2hex.c:253: error: 'struct gfs2_dinode' has no member named 'di_flags' gfs2hex.c:254: error: 'struct gfs2_dinode' 
has no member named 'di_height' gfs2hex.c:260: error: 'struct gfs2_dinode' has no member named 'di_depth' gfs2hex.c:264: error: 'struct gfs2_dinode' has no member named 'di_size' gfs2hex.c: In function 'do_leaf_extended': gfs2hex.c:350: error: 'struct gfs2_dirent' has no member named 'de_rec_len' gfs2hex.c:353: error: 'struct gfs2_inum' has no member named 'no_addr' gfs2hex.c: In function 'do_eattr_extended': gfs2hex.c:384: error: 'struct gfs2_ea_header' has no member named 'ea_rec_len' gfs2hex.c: In function 'gfs2_inum_print2': gfs2hex.c:402: error: 'struct gfs2_inum' has no member named 'no_formal_ino' gfs2hex.c:405: error: 'struct gfs2_inum' has no member named 'no_addr' gfs2hex.c: In function 'gfs2_sb_print2': gfs2hex.c:416: error: 'struct gfs2_sb' has no member named 'sb_fs_format' gfs2hex.c:417: error: 'struct gfs2_sb' has no member named 'sb_multihost_format' gfs2hex.c:419: error: 'struct gfs2_sb' has no member named 'sb_bsize' gfs2hex.c:420: error: 'struct gfs2_sb' has no member named 'sb_bsize_shift' gfs2hex.c:426: error: 'struct gfs2_sb' has no member named 'sb_master_dir' gfs2hex.c:427: error: 'struct gfs2_sb' has no member named 'sb_root_dir' gfs2hex.c:429: error: 'struct gfs2_sb' has no member named 'sb_lockproto' gfs2hex.c:430: error: 'struct gfs2_sb' has no member named 'sb_locktable' gfs2hex.c: In function 'display_gfs2': gfs2hex.c:470: error: 'struct gfs2_meta_header' has no member named 'mh_type' make[2]: *** [gfs2hex.o] Error 1 make[2]: Leaving directory `/home/sdcastillo/sources/cluster/gfs2/edit' make[1]: *** [all] Error 2 make[1]: Leaving directory `/home/sdcastillo/sources/cluster/gfs2' make: *** [gfs2] Error 2 Cheers! Santiago On 5/7/07, David Teigland wrote: > On Mon, May 07, 2007 at 06:18:51PM -0300, Santiago Del Castillo wrote: > > Hi! > > > > While trying to compile CVS cluster sources against gfs2-2.6-fixes > > kernel sources (retreived from git > > linux/kernel/git/steve/gfs2-2.6-fixes.git) i get this error: > > > > make[2]: Entering directory > > `/home/sdcastillo/sources/GFS/cluster-2.00.00/dlm/lib' > > gcc -Wall -g -I. -O2 -D_REENTRANT -c -o libdlm.o libdlm.c > > In file included from libdlm.c:48: > > /usr/include/linux/dlm_device.h:32: error: expected ':', ',', ';', '}' > > or '__attribute__' before '*' token > > /usr/include/linux/dlm_device.h:63: error: expected ':', ',', ';', '}' > > or '__attribute__' before '*' token > > The problem is the __user tags in /usr/include/linux/dlm_device.h, if you > edit dlm_device.h and remove them it should work. > > Dave > > From graeme.crawford at gmail.com Tue May 8 15:27:39 2007 From: graeme.crawford at gmail.com (Graeme Crawford) Date: Tue, 8 May 2007 17:27:39 +0200 Subject: [Linux-cluster] kernel: kernel BUG at include/asm/spinlock.h:109! any ideas. Message-ID: <326f0a380705080827x591dba48qd983dbea558ba60@mail.gmail.com> Hi, We are running a server as an nfs head connect to a clariion, every couple of days we have our box fall over with the following msg in syslog. Has anyone had this happen to there boxen? rhel4 u4 May 8 12:23:52 ruchba kernel: Assertion failure in log_do_checkpoint() at fs/jbd/checkpoint.c:363: "drop_count != 0 || cleanup_ret != 0" May 8 12:23:52 ruchba kernel: ------------[ cut here ]------------ May 8 12:23:52 ruchba kernel: ------------[ cut here ]------------ May 8 12:23:52 ruchba kernel: kernel BUG at include/asm/spinlock.h:109! 
May 8 12:23:52 ruchba kernel: invalid operand: 0000 [#1] May 8 12:23:52 ruchba kernel: SMP May 8 12:23:52 ruchba kernel: Modules linked in: nfsd exportfs lockd nfs_acl sunrpc emcpdm(U) emcpgpx(U) emcpmpx(U) emcp(U) emcplib(U) ide_dump cciss_dump scsi_dump diskdump zlib_deflate i2c_dev i2c_core sg md5 ipv6 iptable_filter ip_tables dm_mirror button battery ac uhci_hcd ehci_hcd hw_random tg3 8021q bonding(U) floppy ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc cciss sd_mod scsi_mod May 8 12:23:52 ruchba kernel: CPU: 0 May 8 12:23:52 ruchba kernel: EIP: 0060:[] Tainted: P VLI May 8 12:23:52 ruchba kernel: EFLAGS: 00010002 (2.6.9-42.0.10.ELsmp) Regards, Graeme. From wcheng at redhat.com Tue May 8 15:10:01 2007 From: wcheng at redhat.com (Wendy Cheng) Date: Tue, 08 May 2007 11:10:01 -0400 Subject: [Linux-cluster] kernel: kernel BUG at include/asm/spinlock.h:109! any ideas. In-Reply-To: <326f0a380705080827x591dba48qd983dbea558ba60@mail.gmail.com> References: <326f0a380705080827x591dba48qd983dbea558ba60@mail.gmail.com> Message-ID: <464092C9.4030606@redhat.com> Graeme Crawford wrote: > Hi, > > We are running a server as an nfs head connect to a clariion, every > couple of days we have our box fall over with the following msg in > syslog. > > Has anyone had this happen to there boxen? Doesn't look like a cluster issue - have you reported this to either EXT3 group or Red Hat support ? -- Wendy > > rhel4 u4 > > May 8 12:23:52 ruchba kernel: Assertion failure in > log_do_checkpoint() at fs/jbd/checkpoint.c:363: "drop_count != 0 || > cleanup_ret != 0" > May 8 12:23:52 ruchba kernel: ------------[ cut here ]------------ > May 8 12:23:52 ruchba kernel: ------------[ cut here ]------------ > May 8 12:23:52 ruchba kernel: kernel BUG at include/asm/spinlock.h:109! > May 8 12:23:52 ruchba kernel: invalid operand: 0000 [#1] > May 8 12:23:52 ruchba kernel: SMP > May 8 12:23:52 ruchba kernel: Modules linked in: nfsd exportfs lockd > nfs_acl sunrpc emcpdm(U) emcpgpx(U) emcpmpx(U) emcp(U) emcplib(U) > ide_dump cciss_dump scsi_dump diskdump zlib_deflate i2c_dev i2c_core > sg md5 ipv6 iptable_filter ip_tables dm_mirror button battery ac > uhci_hcd ehci_hcd hw_random tg3 8021q bonding(U) floppy ext3 jbd > dm_mod qla2300 qla2xxx scsi_transport_fc cciss sd_mod scsi_mod > May 8 12:23:52 ruchba kernel: CPU: 0 > May 8 12:23:52 ruchba kernel: EIP: 0060:[] Tainted: > P VLI > May 8 12:23:52 ruchba kernel: EFLAGS: 00010002 (2.6.9-42.0.10.ELsmp) > > Regards, > > Graeme. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From maciej.bogucki at artegence.com Tue May 8 16:00:59 2007 From: maciej.bogucki at artegence.com (Maciej Bogucki) Date: Tue, 08 May 2007 18:00:59 +0200 Subject: [Linux-cluster] using GNBD versus iSCSI In-Reply-To: <20070330230706.GA5158@ether.msp.redhat.com> References: <9ea95a710703271322m34d0bacctf388a4cd7d7bd840@mail.gmail.com> <20070330230706.GA5158@ether.msp.redhat.com> Message-ID: <46409EBB.6050003@artegence.com> Benjamin Marzinski napisa?(a): > On Tue, Mar 27, 2007 at 10:22:01PM +0200, David Shwatrz wrote: >> Hello, >> In short: what are the advantages/disadvantages when using GNBD versus >> iSCSI for exporting storage in a cluster? > > The only real advantage of using GNBD is that it has built in fencing. With > iSCSI, you still need some somthing to fence all the machines (unless your scsi > target supports SCSI-3 persistent resrvations). 
Theoretically, GNBD could > run faster, since it doesn't need to do the work to imitate a SCSI device, but > but there's a lot of work that needs to be done for GNBD to reach it's speed > potential. Since there isn't much active development of GNBD, if iSCSI > isn't already faster than it, it will eventually be. I am pretty much the only > GNBD developer, and aside from bug fixing, I do very little work on it. iSCSI > has an active community of developers. Using iSCSI also allows a much more > seemless transition to a hardware shared storage solution later on. > > If you don't have any fencing hardware, and your iSCSI target doesn't support > SCSI-3 persistent reservations, then you should probably go with GNBD. Otherwise > it's up to you. Hello, Disadventage of using GNBD is that it doesn't support authorization/authentication, so it could be a problem in some configurations. Here is comparison test with or without GNBD hardware: - kernel: 2.6.18-8.1.1 - Procesor: 2x Intel(R) Xeon(TM) CPU 2.80GHz - Memory: 2GB - SCSI DISC: MAXTOR Model: ATLAS10K5_146SCA Rev: JNZ6 - Filesystem: ext3 1. local test(/dev/sdb1) [root at blade03-1 ~]# bonnie++ -d /mnt -s 4g -u root Version 1.03 ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP blade03-1 4G 35829 86 93034 43 35334 13 46666 93 84543 11 484.2 1 ------Sequential Create------ --------Random Create-------- -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ blade03-1,4G,35829,86,93034,43,35334,13,46666,93,84543,11,484.2,1,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++ 2. GNBD test(/dev/gnbd/sdb1) - 1Gbit network(Nortel Layer 2/3 Copper GbE Switch Module for BladeCenter) [root at blade02-1 ~]# bonnie++ -d /mnt -s 4g -u root Version 1.03 ------Sequential Output------ --Sequential Input---Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP blade02-1 4G 19724 50 20362 11 13391 7 31688 70 79717 16 360.0 0 ------Sequential Create------ --------Random Create-------- -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ blade02-1,4G,19724,50,20362,11,13391,7,31688,70,79717,16,360.0,0,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++ [root at blade02-1 ~]# Best Regards Maciej Bogucki From shirai at sc-i.co.jp Wed May 9 13:27:52 2007 From: shirai at sc-i.co.jp (Shirai@SystemCreateINC) Date: Wed, 9 May 2007 22:27:52 +0900 Subject: [Linux-cluster] lock_gulmd:failed to statr ltpx References: <9ea95a710703271322m34d0bacctf388a4cd7d7bd840@mail.gmail.com><20070330230706.GA5158@ether.msp.redhat.com> <46409EBB.6050003@artegence.com> Message-ID: <006a01c7923d$da2a63e0$a65fcc3d@tostar> Hi! I am constructing GFS with RHEL4U5(Kernel 2.6.9-42ELsmp). However, lock_gulmd doesn't start even one node correctly. It is cluster.conf as follows. And, it seems to have succeeded in the start of lock_gulmd in/var/log/messages. May 9 21:11:54 localhost ccsd: succeeded May 9 21:12:23 localhost ccsd[4474]: Unable to connect to cluster infrastructur e after 30 seconds. May 9 21:12:24 localhost ccsd[4474]: cluster.conf (cluster name = alpha_cluster , version = 2) found. 
May 9 21:12:35 localhost lock_gulmd_main[4551]: Forked lock_gulmd_core. May 9 21:12:36 localhost lock_gulmd_main[4551]: Forked lock_gulmd_LT. May 9 21:12:37 localhost lock_gulmd_main[4551]: Forked lock_gulmd_LTPX. May 9 21:12:45 localhost lock_gulmd_core[4598]: Starting lock_gulmd_core 1.0.8. (built Sep 20 2006 10:51:58) Copyright (C) 2004 Red Hat, Inc. All rights reser ved. May 9 21:12:45 localhost lock_gulmd_core[4598]: I am running in Standard mode. May 9 21:12:45 localhost lock_gulmd_core[4598]: I am (lock02-e1) with ip (::fff f:192.168.102.15) May 9 21:12:45 localhost lock_gulmd_core[4598]: This is cluster alpha_cluster May 9 21:12:45 localhost lock_gulmd_core[4598]: I see no Masters, So I am becom ing the Master. May 9 21:12:45 localhost lock_gulmd_core[4598]: Could not send quorum update to slave lock02-e1 May 9 21:12:45 localhost lock_gulmd_core[4598]: New generation of server state. (1178712765578307) May 9 21:12:45 localhost lock_gulmd_core[4598]: EOF on xdr (Magma::4475 ::1 idx :1 fd:6) May 9 21:12:46 localhost lock_gulmd_LT[4602]: Starting lock_gulmd_LT 1.0.8. (bu ilt Sep 20 2006 10:51:58) Copyright (C) 2004 Red Hat, Inc. All rights reserved. May 9 21:12:46 localhost lock_gulmd_LT[4602]: I am running in Standard mode. May 9 21:12:46 localhost lock_gulmd_LT[4602]: I am (lock02-e1) with ip (::ffff: 192.168.102.15) May 9 21:12:46 localhost lock_gulmd_LT[4602]: This is cluster alpha_cluster May 9 21:12:46 localhost lock_gulmd_core[4598]: EOF on xdr (Magma::4475 ::1 idx :2 fd:7) May 9 21:12:47 localhost lock_gulmd_LTPX[4606]: Starting lock_gulmd_LTPX 1.0.8. (built Sep 20 2006 10:51:58) Copyright (C) 2004 Red Hat, Inc. All rights reser ved. May 9 21:12:47 localhost lock_gulmd_LTPX[4606]: I am running in Standard mode. May 9 21:12:47 localhost lock_gulmd_LTPX[4606]: I am (lock02-e1) with ip (::fff f:192.168.102.15) May 9 21:12:47 localhost lock_gulmd_LTPX[4606]: This is cluster alpha_cluster May 9 21:12:47 localhost lock_gulmd_LTPX[4606]: New Master at lock02-e1 ::ffff: 192.168.102.15 May 9 21:12:47 localhost lock_gulmd_LT000[4602]: New Client: idx 2 fd 7 from lo ck02-e1 ::ffff:192.168.102.15 May 9 21:12:47 localhost lock_gulmd_LTPX[4606]: Logged into LT000 at lock02-e1 ::ffff:192.168.102.15 May 9 21:12:47 localhost lock_gulmd_LTPX[4606]: Finished resending to LT000 May 9 21:12:47 localhost ccsd[4474]: Connected to cluster infrastruture via: Gu LM Plugin v1.0.5 May 9 21:12:47 localhost ccsd[4474]: Initial status:: Quorate May 9 21:14:37 localhost lock_gulmd: startup failed However, it will fail before long. What should I do? Regards ------------------------------------------------------ Shirai Noriyuki Chief Engineer Technical Div. System Create Inc Kanda Toyo Bldg, 3-4-2 Kandakajicho Chiyodaku Tokyo 101-0045 Japan Tel81-3-5296-3775 Fax81-3-5296-3777 e-mail:shirai at sc-i.co.jp web:http://www.sc-i.co.jp ------------------------------------------------------ From Alain.Moulle at bull.net Wed May 9 15:18:32 2007 From: Alain.Moulle at bull.net (Alain Moulle) Date: Wed, 09 May 2007 17:18:32 +0200 Subject: [Linux-cluster] CS4 U4 / Agents fence_ipmilan / Auth value ? Message-ID: <4641E648.4090604@bull.net> Hi I can't find the Auth value set at the call of a fence agent , for example fence_ipmilan ? Is it called systematically with "none" , "md5", "passwd" ? Or a default value , and if so , which one ? Thanks Alain Moull? 
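As a rough illustration of where that setting comes into play (the address, login, password and auth type below are invented, not taken from any poster's setup), fence_ipmilan can be exercised by hand from a node before it is wired into cluster.conf; note that the default action really power-cycles the target, so point it at a test machine:

    fence_ipmilan -a 10.0.0.42 -l admin -p secret -A md5 -o reboot -v
    # -A accepts none, password or md5; -v prints verbose output from the attempt

The reply that follows covers the cluster.conf side: the same value is carried by the auth attribute on the fence device entry and defaults to "none" when the attribute is omitted.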
From jparsons at redhat.com Wed May 9 15:32:01 2007 From: jparsons at redhat.com (jim parsons) Date: Wed, 09 May 2007 11:32:01 -0400 Subject: [Linux-cluster] CS4 U4 / Agents fence_ipmilan / Auth value ? In-Reply-To: <4641E648.4090604@bull.net> References: <4641E648.4090604@bull.net> Message-ID: <4641E971.4030900@redhat.com> Alain Moulle wrote: > Hi > > I can't find the Auth value set at the call of a fence agent , for > example fence_ipmilan ? > Is it called systematically with "none" , "md5", "passwd" ? > Or a default value , and if so , which one ? > This is a configurable param in cluster.conf for the ipmi agent. The default, if not included, is auth="none" Of course, this param is also configurable as a cmdline arg when running the agent outside of a cluster. Run /sbin/fence_ipmilan -help or check the man page for fence_ipmilan. I hope this helps, -J From kadlec at sunserv.kfki.hu Wed May 9 18:27:53 2007 From: kadlec at sunserv.kfki.hu (Kadlecsik Jozsi) Date: Wed, 9 May 2007 20:27:53 +0200 (MEST) Subject: [Linux-cluster] Slow tar Message-ID: Hi, We have been testing GFS over AoE (over Gb Ethernet) and in our last test what we tried is to untar a 38GB file with 150000 files. The process was started two days ago and it still hasn't completed. We see that it proceeds in creating the new files and directories, but this seems to be way too slow. This is a non-tuned 5-node GFS cluster with four nodes alive. On each node: # uname -r 2.6.20.4-grsec, # gfs_tool version | head -1 gfs_tool DEVEL.1177069643 (built Apr 20 2007 13:48:40) # cman_tool status | head -1 Protocol version: 5.0.1 AoE and GFS use a dedicated LAN segment with MTU set to 9216. What makes it more strange is that previously we ran bonnie++ which produced the following results: Local ext3: bonnie++ -d /usr/src/test -u nobody Version 1.03 ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP atlas 12G 39019 96 50354 22 23123 6 34311 69 49428 5 207.6 1 ------Sequential Create------ --------Random Create-------- -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP 16 2747 94 +++++ +++ +++++ +++ 2910 99 +++++ +++ 9811 100 GFS: bonnie++ -d /export/home/test -u nobody Version 1.03 ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP atlas 12G 29911 86 57432 33 16625 13 39512 96 17334 4 310.4 0 ------Sequential Create------ --------Random Create-------- -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP 16 1455 37 +++++ +++ 2958 31 1549 43 +++++ +++ 4496 48 and that suggested better "real life" performance compared to what we got at running a huge tar. What can be wrong here? How could we improve the performance? Best regards, Jozsef -- E-mail : kadlec at sunserv.kfki.hu, kadlec at blackhole.kfki.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 
49, Hungary From bsd_daemon at msn.com Thu May 10 08:10:42 2007 From: bsd_daemon at msn.com (mehmet celik) Date: Thu, 10 May 2007 08:10:42 +0000 Subject: [Linux-cluster] clurgmgrd - #48: Unable to obtain cluster lock: Connec In-Reply-To: <1178560496.7699.38.camel@WSBID06223> Message-ID: Hi, to run a service with rgmanager, a lock mechanism must be running; that lock mechanism is lock_dlm, not lock_nolock. First load the lock_dlm module, then start clurgmgrd. >From: rhurst at bidmc.harvard.edu >Reply-To: linux clustering >To: linux-cluster at redhat.com >Subject: [Linux-cluster] clurgmgrd - #48: Unable to obtain cluster >lock: Connectiontimed out >Date: Mon, 7 May 2007 13:54:56 -0400 > >What could cause clurgmgrd fail like this? If clurgmgrd has a hiccup >like this, is it supposed to shutdown its services? Is there something >in our implementation that could have prevented this from shutting down? > >For unexplained reasons, we just had our CS service (WATSON) go down on >its own, and the syslog entry details the event as: > >May 7 13:18:39 db1 clurgmgrd[17888]: #48: Unable to obtain >cluster lock: Connection timed out >May 7 13:18:41 db1 kernel: dlm: Magma: reply from 2 no lock >May 7 13:18:41 db1 kernel: dlm: reply >May 7 13:18:41 db1 kernel: rh_cmd 5 >May 7 13:18:41 db1 kernel: rh_lkid 200242 >May 7 13:18:41 db1 kernel: lockstate 2 >May 7 13:18:41 db1 kernel: nodeid 0 >May 7 13:18:41 db1 kernel: status 0 >May 7 13:18:41 db1 kernel: lkid ee0388 >May 7 13:18:41 db1 clurgmgrd[17888]: Stopping service WATSON > >... and its service entry looks like this: > >recovery="disable"> > > fsid="53188" fstype="ext3" mountpoint="/watson-data" >name="WATSON-lvoldata" options="" self_fence="0"> > force_unmount="1" fsid="29524" fstype="ext3" >mountpoint="/watson-data/sys/db1" name="WATSON-lvoldb1" options="" >self_fence="0"/> > Eric Schneider Tech Services UCCS, IT Department 262-3453 eschneid at uccs.edu -------------- next part -------------- An HTML attachment was scrubbed... URL:

From eschneid at uccs.edu Fri May 11 15:35:40 2007 From: eschneid at uccs.edu (Eric Schneider) Date: Fri, 11 May 2007 09:35:40 -0600 Subject: [Linux-cluster] fence_apc 7930s Message-ID: <000501c793e2$07de4e50$1b03c680@uccs.edu> I am having issues getting fence_apc to work in my environment. I get the following error when I try to fence: May 11 09:22:22 sleepy fence_node[4002]: agent "fence_apc" reports: failed: unrecognised menu response I remember dealing with this before (on RHEL 4) and I had to use a 3rd party fence_apc to get this working. Has anyone else had this problem?? If so, how did you fix it? OS - CentOS 5 2 node cluster Fence Devices - 2 x apc 7930s Cluster config - [cluster.conf XML scrubbed by the list archiver] Eric Schneider Tech Services UCCS, IT Department 262-3453 eschneid at uccs.edu

From carlopmart at gmail.com Fri May 11 16:49:57 2007 From: carlopmart at gmail.com (carlopmart) Date: Fri, 11 May 2007 18:49:57 +0200 Subject: [Linux-cluster] how to use fence_xvmd fence option on Rhel5 Message-ID: <46449EB5.8090005@gmail.com> Hi all, Red Hat's documentation about using fence_xvmd and fence_xvm isn't very clear, at least to me. I need to use a fence device other than manual fencing on my current two-node Cluster Suite setup under Xen. To do this I plan to use the fence_xvmd and fence_xvm options. But how? What are the minimum requirements to use them? My only option is: - Dom0 runs fence_xvmd and the two domU nodes run fence_xvm clients; but will dom0 need to act as a part of the cluster or not? Thanks. -- CL Martinez carlopmart {at} gmail {d0t} com

From lhh at redhat.com Fri May 11 20:19:03 2007 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 11 May 2007 16:19:03 -0400 Subject: [Linux-cluster] clurgmgrd - #48: Unable to obtain cluster lock: Connectiontimed out In-Reply-To: <1178560496.7699.38.camel@WSBID06223> References: <1178560496.7699.38.camel@WSBID06223> Message-ID: <20070511201903.GF15766@redhat.com> On Mon, May 07, 2007 at 01:54:56PM -0400, rhurst at bidmc.harvard.edu wrote: > What could cause clurgmgrd fail like this? If clurgmgrd has a hiccup > like this, is it supposed to shutdown its services? Is there something > in our implementation that could have prevented this from shutting down? > > For unexplained reasons, we just had our CS service (WATSON) go down on > its own, and the syslog entry details the event as: > > May 7 13:18:39 db1 clurgmgrd[17888]: #48: Unable to obtain > cluster lock: Connection timed out > May 7 13:18:41 db1 kernel: dlm: Magma: reply from 2 no lock > May 7 13:18:41 db1 kernel: dlm: reply > May 7 13:18:41 db1 kernel: rh_cmd 5 > May 7 13:18:41 db1 kernel: rh_lkid 200242 > May 7 13:18:41 db1 kernel: lockstate 2 > May 7 13:18:41 db1 kernel: nodeid 0 > May 7 13:18:41 db1 kernel: status 0 > May 7 13:18:41 db1 kernel: lkid ee0388 > May 7 13:18:41 db1 clurgmgrd[17888]: Stopping service WATSON This usually is a dlm bug. Once the DLM gets in to this state, rgmanager blows up. What rgmanager are you using? (There's only one lock per service; the complexity of the service doesn't matter...) -- Lon Hohberger - Software Engineer - Red Hat, Inc.

From lhh at redhat.com Fri May 11 20:21:47 2007 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 11 May 2007 16:21:47 -0400 Subject: [Linux-cluster] Combining LVS and RHCS In-Reply-To: <46408608.3020907@sans.org> References: <46408608.3020907@sans.org> Message-ID: <20070511202147.GG15766@redhat.com> On Tue, May 08, 2007 at 10:15:36AM -0400, David Goldsmith wrote: > Can the servers in the second-tier of the LVS blob that are providing > load balanced services also have RHCS components installed on them and > just not be part of any service specific failover domains? This way the > total number of member servers in the cluster would still be greater > than half, even if one of the two storage server nodes failed. Yes, of course. You don't need to run rgmanager on them. Failover domains are an rgmanager construct and unrelated to LVS. -- Lon -- Lon Hohberger - Software Engineer - Red Hat, Inc.
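Picking the fence_apc thread back up (Eric's cluster.conf above was lost to the archiver, and the thread continues below), a purely hypothetical sketch of the relevant cluster.conf pieces for a two-node CentOS 5 cluster with one AP7930 per node would be a fencedevice entry per switch plus a per-node device reference; every name, address, login and outlet number here is invented:

        <fencedevices>
                <fencedevice agent="fence_apc" name="apc1" ipaddr="10.0.0.61" login="apc" passwd="apc"/>
                <fencedevice agent="fence_apc" name="apc2" ipaddr="10.0.0.62" login="apc" passwd="apc"/>
        </fencedevices>
        <clusternode name="node1.example.com" nodeid="1" votes="1">
                <fence>
                        <method name="1">
                                <device name="apc1" port="21"/>
                        </method>
                </fence>
        </clusternode>

If fencing still fails with "unrecognised menu response" against a fragment like this, the XML is probably not the culprit; the later replies in this thread suggest the stock agent's parsing of the 7930 menus is where the problem lies.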
From lhh at redhat.com Fri May 11 20:25:08 2007 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 11 May 2007 16:25:08 -0400 Subject: [Linux-cluster] clustat -s service ....."Timed out waiting for a response from Resource Group Manager" In-Reply-To: <46448203.5030006@noaa.gov> References: <46448203.5030006@noaa.gov> Message-ID: <20070511202507.GH15766@redhat.com> On Fri, May 11, 2007 at 10:47:31AM -0400, Daniel Ojo wrote: > Hi Folks, > > I have have a 12 nodes Linux Cluster with GFS on RHE4. I had just added > 12 new resources and services to my with system-config-cluster. All > seems to start up fine but I am getting errors in my message logs..... > > clurgmgrd[8579]: Node ID:000000000000000b stuck with lock usrm::vf > clurgmgrd[8579]: Node ID:000000000000000b stuck with lock usrm::vf What release of rgmanager, magma, magma-plugins do you have? > 2. It is also taking long or hanging when I try to disable and enable a > service via clusvcadm. > 3. Clustat takes longer to yield outputs and when it does it only yields > the Member Name and not the Service Names. Right - something's stuck; the above problem is the cause of these other two. -- Lon Hohberger - Software Engineer - Red Hat, Inc. From lhh at redhat.com Fri May 11 20:26:33 2007 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 11 May 2007 16:26:33 -0400 Subject: [Linux-cluster] fence_apc 7930s In-Reply-To: <000001c793e1$d7767da0$1b03c680@uccs.edu> References: <000001c793e1$d7767da0$1b03c680@uccs.edu> Message-ID: <20070511202633.GI15766@redhat.com> On Fri, May 11, 2007 at 09:34:19AM -0600, Eric Schneider wrote: > I am having issues getting fence_apc to work in my environment. > > > > I get the following error when I try to fence: > > > > May 11 09:22:22 sleepy fence_node[4002]: agent "fence_apc" reports: failed: > unrecognised menu response > > > > I remember dealing with this before (on RHEL 4) and I had to use a 3rd party > fence_apc to get this working. Where'd you get the third party one? It sounds like whoever is maintaining it fixed a bug... -- Lon Hohberger - Software Engineer - Red Hat, Inc. From lhh at redhat.com Fri May 11 20:33:55 2007 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 11 May 2007 16:33:55 -0400 Subject: [Linux-cluster] how to use fence_xvmd fence option on Rhel5 In-Reply-To: <46449EB5.8090005@gmail.com> References: <46449EB5.8090005@gmail.com> Message-ID: <20070511203355.GJ15766@redhat.com> On Fri, May 11, 2007 at 06:49:57PM +0200, carlopmart wrote: > Hi all, > > RedHat's documentation about using fence_xvmd and fence_xvm it isn't very > clear, almost for me. > > I need to use another fence device than manual on my actual two-node > cluster suite under xen. To do this I think to use fence_xvmd and fence_xvm > options. But how? What are the minimum requeriments to use it?. My only > option is: > - Dom0 runs fence_xvmd and domU two-nodes runs fence_xvm clients, but, > will dom0 be need act as a part of the cluster?? or not? Dom0 should be a part of *its own* cluster - which is not part of the VM cluster. It can be a 1-node cluster. (1) So, configure dom0 like a 1-node cluster with no fencing. (2) Add "" to cluster.conf in dom0 as a child of the "" tag. 
(3) dd if=/dev/urandom of=/etc/cluster/fence_xvm.key (4) scp /etc/cluster/fence_xvm.key root at virtual_node_1:/etc/cluster (5) scp /etc/cluster/fence_xvm.key root at virtual_node_2:/etc/cluster (6) Start cman on dom0 - this should start fence_xvmd for you For testing correct fence_xvm.key distribution, try: (1) Log in to domain-0 in one window (2) killall fence_xvmd (3) fence_xvmd -fddddddddd (4) Log in to a guest domain from another window (5) fence_xvm -H -o null If everything works, you should see all kinds of useless garbage on the screen in the fence_xvmd window. If that works, add fence_xvm to the guest-domain cluster using system-config-cluster or Conga. (All you need is the domain name which corresponds to the cluster name for each guest domain node). -- Lon Hohberger - Software Engineer - Red Hat, Inc. From eschneid at uccs.edu Fri May 11 21:26:12 2007 From: eschneid at uccs.edu (Eric Schneider) Date: Fri, 11 May 2007 15:26:12 -0600 Subject: [Linux-cluster] fence_apc 7930s In-Reply-To: <20070511202633.GI15766@redhat.com> References: <000001c793e1$d7767da0$1b03c680@uccs.edu> <20070511202633.GI15766@redhat.com> Message-ID: <003101c79412$ffb48010$1b03c680@uccs.edu> It was a python one I found on the list. I can get it to work from the command line: This doesn't work. The default fence_apc never works for me. fence_apc.orig -a IP -l apc -n 21 -p "password" -v This works. I had this working on a RHEL 4 clone on a single apc 7930. I can get it to work in C5, but not with multiple apc 7930s. fence_apc.3rd -a IP -l apc -n 21 -p "password" -v #BEGIN_VERSION_GENERATION FENCE_RELEASE_NAME="New APC Agent - test release" REDHAT_COPYRIGHT="" BUILD_DATE="September 21, 2006" #END_VERSION_GENERATION >From what I can tell the default fence_apc just doesn't work with my 7930s. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Lon Hohberger Sent: Friday, May 11, 2007 2:27 PM To: linux clustering Subject: Re: [Linux-cluster] fence_apc 7930s On Fri, May 11, 2007 at 09:34:19AM -0600, Eric Schneider wrote: > I am having issues getting fence_apc to work in my environment. > > > > I get the following error when I try to fence: > > > > May 11 09:22:22 sleepy fence_node[4002]: agent "fence_apc" reports: failed: > unrecognised menu response > > > > I remember dealing with this before (on RHEL 4) and I had to use a 3rd party > fence_apc to get this working. Where'd you get the third party one? It sounds like whoever is maintaining it fixed a bug... -- Lon Hohberger - Software Engineer - Red Hat, Inc. -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From rhurst at bidmc.harvard.edu Fri May 11 23:54:08 2007 From: rhurst at bidmc.harvard.edu (rhurst at bidmc.harvard.edu) Date: Fri, 11 May 2007 19:54:08 -0400 Subject: [Linux-cluster] clurgmgrd - #48: Unable to obtain cluster lock: Connectiontimed out References: <1178560496.7699.38.camel@WSBID06223> <20070511201903.GF15766@redhat.com> Message-ID: We are using RHEL 4 U4 with the GFS/CS that works for that release: $ rpm -q rgmanager dlm dlm-kernel magma magma-plugins rgmanager-1.9.54-1 dlm-1.0.1-1 dlm-kernel-2.6.9-44.9 magma-1.0.6-0 magma-plugins-1.0.9-0 Would the just-announced GFS/CS for U5 help any? Looks like a lof issues were addressed. Robert Hurst, Sr. Cach? Administrator Beth Israel Deaconess Medical Center 1135 Tremont Street, REN-7 Boston, Massachusetts 02120-2140 617-754-8754 ? Fax: 617-754-8730 ? 
Cell: 401-787-3154 Any technology distinguishable from magic is insufficiently advanced. ________________________________ From: linux-cluster-bounces at redhat.com on behalf of Lon Hohberger Sent: Fri 5/11/2007 4:19 PM To: linux clustering Subject: Re: [Linux-cluster] clurgmgrd - #48: Unable to obtain cluster lock: Connectiontimed out On Mon, May 07, 2007 at 01:54:56PM -0400, rhurst at bidmc.harvard.edu wrote: > What could cause clurgmgrd fail like this? If clurgmgrd has a hiccup > like this, is it supposed to shutdown its services? Is there something > in our implementation that could have prevented this from shutting down? > > For unexplained reasons, we just had our CS service (WATSON) go down on > its own, and the syslog entry details the event as: > > May 7 13:18:39 db1 clurgmgrd[17888]: #48: Unable to obtain > cluster lock: Connection timed out > May 7 13:18:41 db1 kernel: dlm: Magma: reply from 2 no lock > May 7 13:18:41 db1 kernel: dlm: reply > May 7 13:18:41 db1 kernel: rh_cmd 5 > May 7 13:18:41 db1 kernel: rh_lkid 200242 > May 7 13:18:41 db1 kernel: lockstate 2 > May 7 13:18:41 db1 kernel: nodeid 0 > May 7 13:18:41 db1 kernel: status 0 > May 7 13:18:41 db1 kernel: lkid ee0388 > May 7 13:18:41 db1 clurgmgrd[17888]: Stopping service WATSON This usually is a dlm bug. Once the DLM gets in to this state, rgmanager blows up. What rgmanager are you using? (There's only one lock per service; the complexity of the service doesn't matter...) -- Lon Hohberger - Software Engineer - Red Hat, Inc. -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: winmail.dat Type: application/ms-tnef Size: 5838 bytes Desc: not available URL: From nattaponv at hotmail.com Sun May 13 06:30:52 2007 From: nattaponv at hotmail.com (nattapon viroonsri) Date: Sun, 13 May 2007 06:30:52 +0000 Subject: [Linux-cluster] Gfs not return available space after delete file Message-ID: Red Hat Enterprise Linux ES release 4 (Nahant Update 4) rhel-4-u4-rhcs-i386 GFS-kernel-smp-2.6.9-58.0 GFS-6.1.6 I have dell 2 nodes connected to emc storage mount gfs partition my problem is after i create file and delete , and check with df , gfs not return available space but if i check with du it show result correctly I have use below command to create files on gfs for i in $(seq 1 2700); do dd if=/dev/zero of=h$i bs=1M count=1 ; done after file created i have delete all file use "rm *" but when i check available space with df it not return the same size as before i created files If i use gfs with powerpath ( EMCpower.LINUX-4.5.1-022.rhel.i386 ) , even gfs_tool reclaim not return any available space so i recreated gfs file system again without powerpath the same problem still remain, But when use gfs_tool reclaim , this time it return more available space but still not the same as before i created files 1. EMCpowerpath not compatible with gfs ? even gfs_tool reclaim not return space 2. If i dont use EMCpowerpath, Do i have to setup schedule to run "gfs_tool reclaim" in crontab to correct this ? How can i config gfs to automatic reclaim without remount or run "gfs_tool reclaim" ? Regards, Nattapon _________________________________________________________________ Express yourself instantly with MSN Messenger! Download today it's FREE! 
http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/ From angle1321 at hotmail.com Sun May 13 14:17:38 2007 From: angle1321 at hotmail.com (angle angle) Date: Sun, 13 May 2007 21:17:38 +0700 Subject: [Linux-cluster] cluster ( OpenMP ) Message-ID: I begin to learn about cluster techonlogy. I would like to compare the efficiency of shared memory and distributed memory so I must use the similar tools in my cluster project.I think I will use the tools as below. 1. fedora core 5 2. oscar 5.0 3. pbs In the shared memory I will use OpenMP library that is in fedora core5. In the distributed memory I will use MPI that I can choose between lam/mpi and mpich in the step of oscar installation. What do you think about the tool that I will use? Can I use it? *** I 'm not sure that I can use OpenMP because in oscar installation it has lam/mpi or mpich to choose but OpenMP doesn't. Otherwise, I'm not sure that in OpenMP,Can PBS use for assign the jobs for compute nodes? _________________________________________________________________ Express yourself instantly with MSN Messenger! Download today it's FREE! http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/ From katriel at penguin-it.co.il Mon May 14 13:40:54 2007 From: katriel at penguin-it.co.il (Katriel Traum) Date: Mon, 14 May 2007 16:40:54 +0300 Subject: [Linux-cluster] cman_too join fails Message-ID: <464866E6.3090102@penguin-it.co.il> Hello. I have a cluster that won't start. RHEL5 as a Xen VM, running 2.6.18-8.el5xen startup fails with cman_tool join with the error: "Cannot start, ais may already be running" "Error reading config from CCS" from an strace it seems that cman_tool forks to run aisexec which fails on the "Error reading config from CCS". An strace dump promised for all first comers. 
-- Katriel Traum From pcaulfie at redhat.com Mon May 14 14:01:58 2007 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Mon, 14 May 2007 15:01:58 +0100 Subject: [Linux-cluster] cman_too join fails In-Reply-To: <464866E6.3090102@penguin-it.co.il> References: <464866E6.3090102@penguin-it.co.il> Message-ID: <46486BD6.7080506@redhat.com> Katriel Traum wrote: > Hello. > I have a cluster that won't start. RHEL5 as a Xen VM, running > 2.6.18-8.el5xen > startup fails with cman_tool join with the error: > "Cannot start, ais may already be running" > "Error reading config from CCS" > > from an strace it seems that cman_tool forks to run aisexec which fails > on the "Error reading config from CCS". See if "cman_tool join -d" gives any more information, also check syslog for other messages. It sounds like the information in cluster.conf is incorrect somehow. -- Patrick Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SL4 ITE, UK. Registered in England and Wales under Company Registration No. 3798903 From teigland at redhat.com Mon May 14 14:05:13 2007 From: teigland at redhat.com (David Teigland) Date: Mon, 14 May 2007 09:05:13 -0500 Subject: [Linux-cluster] Gfs not return available space after delete file In-Reply-To: References: Message-ID: <20070514140513.GA30043@redhat.com> On Sun, May 13, 2007 at 06:21:50AM +0000, nattapon viroonsri wrote: > > Red Hat Enterprise Linux ES release 4 (Nahant Update 4) > rhel-4-u4-rhcs-i386 > GFS-kernel-smp-2.6.9-58.0 > GFS-6.1.6 > > I have dell 2 nodes connected to emc storage mount gfs partition > my problem is after i create file and delete , and check with df , gfs not > return available space > but if i check with du it show result correctly 'gfs_tool df' will probably help in accounting for all the space. Your storage/powerpath has nothing to do with it. reclaim turns FREEMETA blocks (free blocks that must be reused for metadata) back into FREE blocks (free blocks that can be reused for anything). You shouldn't need to run 'gfs_tool reclaim' unless you have a pathological use case. reclaim isn't nice to use because it has to block all fs access from all nodes while it runs. Dave From nattaponv at hotmail.com Mon May 14 15:35:15 2007 From: nattaponv at hotmail.com (nattapon viroonsri) Date: Mon, 14 May 2007 15:35:15 +0000 Subject: [Linux-cluster] Gfs not return available space after delete file In-Reply-To: <20070514140513.GA30043@redhat.com> Message-ID: Below is result from gfs_tool df , df , du , ls # gfs_tool df /home /home: SB lock proto = "lock_dlm" SB lock table = "testgfs:gfs" SB ondisk format = 1309 SB multihost format = 1401 Block size = 4096 Journals = 4 Resource Groups = 68 Mounted lock proto = "lock_dlm" Mounted lock table = "testgfs:gfs" Mounted host data = "" Journal number = 0 Lock module flags = Local flocks = FALSE Local caching = FALSE Oopses OK = FALSE Type Total Used Free use% ------------------------------------------------------------------------ inodes 825 825 0 100% metadata 1766 1766 0 100% data 34897229 419840 34477389 1% # df -h Filesystem Size Used Avail Use% Mounted on /dev/mapper/vg0-lv0 134G 1.7G 132G 2% /home result from df and gfs_tool df look same but if use du will show correct result du -sh /home 500K /home I do listing file in directory and see nothing left # ls -la /home total 508 drwxr-xr-x 2 root root 2048 May 14 22:21 . drwxr-xr-x 23 root root 4096 May 14 20:53 .. Do i miss something ? 
Regards, Nattapon > > > > Red Hat Enterprise Linux ES release 4 (Nahant Update 4) > > rhel-4-u4-rhcs-i386 > > GFS-kernel-smp-2.6.9-58.0 > > GFS-6.1.6 > > > > I have dell 2 nodes connected to emc storage mount gfs partition > > my problem is after i create file and delete , and check with df , gfs >not > > return available space > > but if i check with du it show result correctly > >'gfs_tool df' will probably help in accounting for all the space. > >Your storage/powerpath has nothing to do with it. > >reclaim turns FREEMETA blocks (free blocks that must be reused for >metadata) back into FREE blocks (free blocks that can be reused for >anything). You shouldn't need to run 'gfs_tool reclaim' unless you have a >pathological use case. reclaim isn't nice to use because it has to block >all fs access from all nodes while it runs. > >Dave > _________________________________________________________________ Don't just search. Find. Check out the new MSN Search! http://search.msn.com/ From teigland at redhat.com Mon May 14 16:52:51 2007 From: teigland at redhat.com (David Teigland) Date: Mon, 14 May 2007 11:52:51 -0500 Subject: [Linux-cluster] Gfs not return available space after delete file In-Reply-To: References: <20070514140513.GA30043@redhat.com> Message-ID: <20070514165251.GE30043@redhat.com> On Mon, May 14, 2007 at 03:35:15PM +0000, nattapon viroonsri wrote: > Type Total Used Free use% > ------------------------------------------------------------------------ > inodes 825 825 0 100% > metadata 1766 1766 0 100% > data 34897229 419840 34477389 1% I'm guessing that you've run this just after unlinking everything. GFS deallocates stuff asynchronously, so getting all the space back will take some time. Dave From nattaponv at hotmail.com Mon May 14 17:24:37 2007 From: nattaponv at hotmail.com (nattapon viroonsri) Date: Mon, 14 May 2007 17:24:37 +0000 Subject: [Linux-cluster] Gfs not return available space after delete file In-Reply-To: <20070514165251.GE30043@redhat.com> Message-ID: Do have anyway or tuning parameter to get shorter time wait to get space back after delete files ? Does this effect in production environment that heavy i/o( create/delete all time) and result in run out of disk space quickly than expect ? 
Thankyou Regards, Nattapon >From: David Teigland >To: nattapon viroonsri >CC: linux-cluster at redhat.com >Subject: Re: [Linux-cluster] Gfs not return available space after delete >file >Date: Mon, 14 May 2007 11:52:51 -0500 >MIME-Version: 1.0 >Received: from mx1.redhat.com ([66.187.233.31]) by >bay0-mc6-f11.bay0.hotmail.com with Microsoft SMTPSVC(6.0.3790.2668); Mon, >14 May 2007 09:52:34 -0700 >Received: from int-mx1.corp.redhat.com (int-mx1.corp.redhat.com >[172.16.52.254])by mx1.redhat.com (8.13.1/8.13.1) with ESMTP id >l4EGqXHW006692for ; Mon, 14 May 2007 12:52:33 -0400 >Received: from null.msp.redhat.com (null.msp.redhat.com [10.15.80.136])by >int-mx1.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l4EGqWrw007250;Mon, >14 May 2007 12:52:32 -0400 >Received: by null.msp.redhat.com (Postfix, from userid 3890)id 0CCBA46BCBF; >Mon, 14 May 2007 11:52:51 -0500 (CDT) >X-Message-Info: >LsUYwwHHNt3660MmjhEvYg2f34OAemlK+ZzoV09lDsZmbz8QigGIQtU5Yvr3lK0P >References: <20070514140513.GA30043 at redhat.com> > >User-Agent: Mutt/1.4.2.2i >Return-Path: teigland at redhat.com >X-OriginalArrivalTime: 14 May 2007 16:52:35.0452 (UTC) >FILETIME=[45B303C0:01C79648] > >On Mon, May 14, 2007 at 03:35:15PM +0000, nattapon viroonsri wrote: > > Type Total Used Free use% > > >------------------------------------------------------------------------ > > inodes 825 825 0 100% > > metadata 1766 1766 0 100% > > data 34897229 419840 34477389 1% > >I'm guessing that you've run this just after unlinking everything. GFS >deallocates stuff asynchronously, so getting all the space back will take >some time. > >Dave > _________________________________________________________________ Don't just search. Find. Check out the new MSN Search! http://search.msn.click-url.com/go/onm00200636ave/direct/01/ From rpeterso at redhat.com Mon May 14 18:17:40 2007 From: rpeterso at redhat.com (Robert Peterson) Date: Mon, 14 May 2007 13:17:40 -0500 Subject: [Linux-cluster] Probelm compiling Cluster package. In-Reply-To: <33d607980705080747n2e45c6b4p73b29f984b29ee01@mail.gmail.com> References: <33d607980705071418o3a623600qb2f93eaf9cdd0647@mail.gmail.com> <20070507212445.GC31492@redhat.com> <33d607980705080747n2e45c6b4p73b29f984b29ee01@mail.gmail.com> Message-ID: <4648A7C4.4050200@redhat.com> Santiago Del Castillo wrote: > Tank you David. It worked. But now i'm getting this: > > gcc -Wall -I/home/sdcastillo/sources/cluster/config -DHELPER_PROGRAM > -D_FILE_OFFSET_BITS=64 -DGFS2_RELEASE_NAME=\"DEVEL.1178569990\" -ggdb > -I/usr/include -I../include -I../libgfs2 -c -o gfs2hex.o gfs2hex.c > In file included from hexedit.h:21, > from gfs2hex.c:26: > /usr/include/linux/gfs2_ondisk.h:53: error: expected > specifier-qualifier-list before '__be64' > > Cheers! > Santiago Hi Santiago, The gfs2_edit tool pulls in the gfs2 kernel data structures from the kernel source. I don't get these errors. On my system, the declarations for __be64 are pulled in by this statement (which is already in gfs2hex.c): #include I'm not sure what's going on here, but perhaps you're running an older kernel and are missing some stuff from the newer kernel trees. Ordinarily, you would first do something like: ./configure --kernel_src=/home/sdcastillo/sources/cluster/gfs2-2.6-nmw.git The difference between gfs2-2.6-fixes.git and gfs2-2.6-nmw.git is simple: "nmw" stands for "Next Merge Window" which means it is the latest and greatest code for gfs2. It contains all the fixes that are scheduled to be merged into the latest upstream kernel during the next merge window. 
The "fixes" tree lags a bit behind and doesn't have all the latest fixes. The "nmw" is more bleeding-edge, and therefore more potential exposure to bugs, but it also usually has very important fixes you might need. Regards, Bob Peterson Red Hat Cluster Suite From teigland at redhat.com Mon May 14 18:33:59 2007 From: teigland at redhat.com (David Teigland) Date: Mon, 14 May 2007 13:33:59 -0500 Subject: [Linux-cluster] Gfs not return available space after delete file In-Reply-To: References: <20070514165251.GE30043@redhat.com> Message-ID: <20070514183359.GF30043@redhat.com> On Mon, May 14, 2007 at 05:24:37PM +0000, nattapon viroonsri wrote: > > Do have anyway or tuning parameter to get shorter time wait to get space > back after delete files ? try: gfs_tool settune inoded_secs nsecs is 15 by default, try something like 5 > Does this effect in production environment that heavy i/o( create/delete > all time) and result in run out of disk space quickly than expect ? You should run some more experiments to be sure, but there should be some limit to the number of unlinked by not deallocated inodes (look at how the ilimit tunables work). Once you reach these limits, new operations will pitch in and help do the deallocations. Dave From teigland at redhat.com Mon May 14 18:50:27 2007 From: teigland at redhat.com (David Teigland) Date: Mon, 14 May 2007 13:50:27 -0500 Subject: [Linux-cluster] Probelm compiling Cluster package. In-Reply-To: <4648A7C4.4050200@redhat.com> References: <33d607980705071418o3a623600qb2f93eaf9cdd0647@mail.gmail.com> <20070507212445.GC31492@redhat.com> <33d607980705080747n2e45c6b4p73b29f984b29ee01@mail.gmail.com> <4648A7C4.4050200@redhat.com> Message-ID: <20070514185027.GG30043@redhat.com> On Mon, May 14, 2007 at 01:17:40PM -0500, Robert Peterson wrote: > Santiago Del Castillo wrote: > >Tank you David. It worked. But now i'm getting this: > > > >gcc -Wall -I/home/sdcastillo/sources/cluster/config -DHELPER_PROGRAM > >-D_FILE_OFFSET_BITS=64 -DGFS2_RELEASE_NAME=\"DEVEL.1178569990\" -ggdb > >-I/usr/include -I../include -I../libgfs2 -c -o gfs2hex.o gfs2hex.c > >In file included from hexedit.h:21, > > from gfs2hex.c:26: > >/usr/include/linux/gfs2_ondisk.h:53: error: expected > >specifier-qualifier-list before '__be64' > > > >Cheers! > >Santiago > > Hi Santiago, > > The gfs2_edit tool pulls in the gfs2 kernel data structures from > the kernel source. I don't get these errors. On my system, the > declarations for __be64 are pulled in by this statement (which is > already in gfs2hex.c): > > #include > > I'm not sure what's going on here, but perhaps you're running an older > kernel and are missing some stuff from the newer kernel trees. > Ordinarily, you would first do something like: > > ./configure --kernel_src=/home/sdcastillo/sources/cluster/gfs2-2.6-nmw.git /usr/include/linux/types.h /usr/src/linux/include/linux/types.h aren't typically the same file, so the kernel source you use for building probably won't matter. The first comes with the distribution, something like a kernel-headers package. You might get by with cheating and replacing the first with the second... 
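A concrete sketch of that workaround, with every path assumed rather than taken from anyone's system, and keeping a copy of the packaged header so it can be restored afterwards:

    # see how far the packaged header lags the kernel tree being built against
    diff -u /usr/include/linux/types.h /usr/src/linux/include/linux/types.h
    # the swap described above: back up the distro copy, then borrow the kernel tree's version
    cp /usr/include/linux/types.h /usr/include/linux/types.h.dist
    cp /usr/src/linux/include/linux/types.h /usr/include/linux/types.h
    # the earlier dlm_device.h failure can be handled the same way, e.g. by stripping the __user tags
    sed -i 's/__user //g' /usr/include/linux/dlm_device.h

Rebuilding after that should get past the __be64/__be32 errors if missing typedefs were the only gap.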
Dave From lhh at redhat.com Mon May 14 19:18:44 2007 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 14 May 2007 15:18:44 -0400 Subject: [Linux-cluster] fence_apc 7930s In-Reply-To: <003101c79412$ffb48010$1b03c680@uccs.edu> References: <000001c793e1$d7767da0$1b03c680@uccs.edu> <20070511202633.GI15766@redhat.com> <003101c79412$ffb48010$1b03c680@uccs.edu> Message-ID: <20070514191844.GA28891@redhat.com> On Fri, May 11, 2007 at 03:26:12PM -0600, Eric Schneider wrote: > It was a python one I found on the list. I can get it to work from the > command line: > > This doesn't work. The default fence_apc never works for me. > fence_apc.orig -a IP -l apc -n 21 -p "password" -v > > This works. I had this working on a RHEL 4 clone on a single apc 7930. I > can get it to work in C5, but not with multiple apc 7930s. > fence_apc.3rd -a IP -l apc -n 21 -p "password" -v > > #BEGIN_VERSION_GENERATION > FENCE_RELEASE_NAME="New APC Agent - test release" > REDHAT_COPYRIGHT="" > BUILD_DATE="September 21, 2006" > #END_VERSION_GENERATION > Looks like the python-rewritten fence_apc. jparsons would know more. -- Lon From rhurst at bidmc.harvard.edu Mon May 14 19:46:50 2007 From: rhurst at bidmc.harvard.edu (Robert Hurst) Date: Mon, 14 May 2007 15:46:50 -0400 Subject: [Linux-cluster] clurgmgrd - #48: Unable to obtain cluster lock: Connectiontimed out In-Reply-To: References: <1178560496.7699.38.camel@WSBID06223> <20070511201903.GF15766@redhat.com> Message-ID: <1179172010.4511.13.camel@WSBID06223> Any new thoughts on this, is it a new bug, is it fixed with U5? I have a ticket open, but your insights on how probable this is a recurring bug would be helpful. Thanks. On Fri, 2007-05-11 at 19:54 -0400, rhurst at bidmc.harvard.edu wrote: > We are using RHEL 4 U4 with the GFS/CS that works for that release: > > $ rpm -q rgmanager dlm dlm-kernel magma magma-plugins > > rgmanager-1.9.54-1 > dlm-1.0.1-1 > dlm-kernel-2.6.9-44.9 > magma-1.0.6-0 > magma-plugins-1.0.9-0 > > Would the just-announced GFS/CS for U5 help any? Looks like a lof > issues were addressed. > > Robert Hurst, Sr. Cach? Administrator > Beth Israel Deaconess Medical Center > 1135 Tremont Street, REN-7 > Boston, Massachusetts 02120-2140 > 617-754-8754 ? Fax: 617-754-8730 ? Cell: 401-787-3154 > Any technology distinguishable from magic is insufficiently advanced. > > > ______________________________________________________________________ > From: linux-cluster-bounces at redhat.com on behalf of Lon Hohberger > Sent: Fri 5/11/2007 4:19 PM > To: linux clustering > Subject: Re: [Linux-cluster] clurgmgrd - #48: Unable to obtain > cluster lock: Connectiontimed out > > > On Mon, May 07, 2007 at 01:54:56PM -0400, rhurst at bidmc.harvard.edu > wrote: > > What could cause clurgmgrd fail like this? If clurgmgrd has a > hiccup > > like this, is it supposed to shutdown its services? Is there > something > > in our implementation that could have prevented this from shutting > down? 
> > > > For unexplained reasons, we just had our CS service (WATSON) go down > on > > its own, and the syslog entry details the event as: > > > > May 7 13:18:39 db1 clurgmgrd[17888]: #48: Unable to obtain > > cluster lock: Connection timed out > > May 7 13:18:41 db1 kernel: dlm: Magma: reply from 2 no lock > > May 7 13:18:41 db1 kernel: dlm: reply > > May 7 13:18:41 db1 kernel: rh_cmd 5 > > May 7 13:18:41 db1 kernel: rh_lkid 200242 > > May 7 13:18:41 db1 kernel: lockstate 2 > > May 7 13:18:41 db1 kernel: nodeid 0 > > May 7 13:18:41 db1 kernel: status 0 > > May 7 13:18:41 db1 kernel: lkid ee0388 > > May 7 13:18:41 db1 clurgmgrd[17888]: Stopping service > WATSON > > This usually is a dlm bug. Once the DLM gets in to this state, > rgmanager blows up. What rgmanager are you using? > > (There's only one lock per service; the complexity of the service > doesn't matter...) > > -- > Lon Hohberger - Software Engineer - Red Hat, Inc. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eschneid at uccs.edu Mon May 14 20:33:54 2007 From: eschneid at uccs.edu (Eric Schneider) Date: Mon, 14 May 2007 14:33:54 -0600 Subject: [Linux-cluster] fence_apc 7930s In-Reply-To: <20070514191844.GA28891@redhat.com> References: <000001c793e1$d7767da0$1b03c680@uccs.edu><20070511202633.GI15766@redhat.com><003101c79412$ffb48010$1b03c680@uccs.edu> <20070514191844.GA28891@redhat.com> Message-ID: <007001c79667$30d2aea0$1b03c680@uccs.edu> I can get the python version to work with a single apc device. However, I cannot get it, or any other fence_apc, to work with multiple apc 7930s. Is this even possible??? Thanks, Eric -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Lon Hohberger Sent: Monday, May 14, 2007 1:19 PM To: linux clustering Subject: Re: [Linux-cluster] fence_apc 7930s On Fri, May 11, 2007 at 03:26:12PM -0600, Eric Schneider wrote: > It was a python one I found on the list. I can get it to work from the > command line: > > This doesn't work. The default fence_apc never works for me. > fence_apc.orig -a IP -l apc -n 21 -p "password" -v > > This works. I had this working on a RHEL 4 clone on a single apc 7930. I > can get it to work in C5, but not with multiple apc 7930s. > fence_apc.3rd -a IP -l apc -n 21 -p "password" -v > > #BEGIN_VERSION_GENERATION > FENCE_RELEASE_NAME="New APC Agent - test release" > REDHAT_COPYRIGHT="" > BUILD_DATE="September 21, 2006" > #END_VERSION_GENERATION > Looks like the python-rewritten fence_apc. jparsons would know more. -- Lon -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From nattaponv at hotmail.com Mon May 14 20:41:35 2007 From: nattaponv at hotmail.com (nattapon viroonsri) Date: Mon, 14 May 2007 20:41:35 +0000 Subject: [Linux-cluster] Gfs not return available space after delete file In-Reply-To: <20070514183359.GF30043@redhat.com> Message-ID: >You should run some more experiments to be sure, but there should be some >limit to the number of unlinked by not deallocated inodes (look at how the >ilimit tunables work). Once you reach these limits, new operations will >pitch in and help do the deallocations. 
> I tried this and it looks better when checked with gfs_tool and df: gfs_tool settune /home inoded_secs 3 gfs_tool settune /home reclaim_limit 500 gfs_tool settune /home demote_secs 10 But I'm not sure whether this will have other side effects. Nattapon _________________________________________________________________ Express yourself instantly with MSN Messenger! Download today it's FREE! http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/

From jparsons at redhat.com Mon May 14 21:02:26 2007 From: jparsons at redhat.com (James Parsons) Date: Mon, 14 May 2007 17:02:26 -0400 Subject: [Linux-cluster] fence_apc 7930s In-Reply-To: <007001c79667$30d2aea0$1b03c680@uccs.edu> References: <000001c793e1$d7767da0$1b03c680@uccs.edu><20070511202633.GI15766@redhat.com><003101c79412$ffb48010$1b03c680@uccs.edu> <20070514191844.GA28891@redhat.com> <007001c79667$30d2aea0$1b03c680@uccs.edu> Message-ID: <4648CE62.6070600@redhat.com> Eric Schneider wrote: >I can get the python version to work with a single apc device. However, I >cannot get it, or any other fence_apc, to work with multiple apc 7930s. Is >this even possible??? > > What do you mean by multiple 7930s...do you mean daisy-chained together? Like a Master Switch Plus APC switch? I can find no info on the site for ganging these switches. There are 24 outlets in this beast, though. If you have multiple apc 7930s, you could just set up each one with its own address, and make each one a fencedevice. Granted, you can't really shut down all (or a combo of) outlets on all switches with one command - but is that really what you need to do? Are you trying to, say, turn off outlet 2 and outlet 4 on one switch at the same time? This can be done either by outlet grouping using the apc interface, or by using two entries within one method block in the conf file...in fact, s-c-cluster will detect this condition and make certain that they are both off before turning either one back on. Are you trying to turn off outlet 2 on ap 7931 - switch1, at the same time as outlet 2 on ap 7930 - switch2? The only way to do this is to set them up in the same method block as described above. Is this helping? I am sorry if I am missing your point...:/ -J

From katriel at penguin-it.co.il Tue May 15 12:18:45 2007 From: katriel at penguin-it.co.il (Katriel Traum) Date: Tue, 15 May 2007 15:18:45 +0300 Subject: [Linux-cluster] cman_too join fails In-Reply-To: <46486BD6.7080506@redhat.com> References: <464866E6.3090102@penguin-it.co.il> <46486BD6.7080506@redhat.com> Message-ID: <4649A525.9010201@penguin-it.co.il> cluster.conf was created by system-config-cluster and passed an xmllint check. cluster.conf: [cluster.conf XML scrubbed by the list archiver]
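Since the actual file did not survive the archiver, rather than guess at its contents, a generic first pass at this failure (paths are the stock RHEL5 ones, and cman_tool join -d is the debugging step Patrick suggests above) looks roughly like:

    xmllint --noout /etc/cluster/cluster.conf   # catches XML syntax problems only, not wrong values
    ps ax | grep [a]isexec                      # "ais may already be running" - look for a leftover daemon
    service cman stop                           # clear out any half-started stack before retrying
    cman_tool join -d                           # the -d output should show why reading the config from CCS fails
    tail -f /var/log/messages                   # run in another window; ccsd and aisexec log their complaints here

If the join still dies inside aisexec, a cluster or node name in the file that does not match what the node calls itself is a common cause of "Error reading config from CCS", and it is the kind of mismatch the -d output and syslog tend to expose.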