From haprapp at gmail.com Fri Jun 1 02:48:36 2007 From: haprapp at gmail.com (hari hari) Date: Fri, 1 Jun 2007 08:18:36 +0530 Subject: [Linux-cluster] linix cluster Message-ID: I want how configure linux cluster details -- Thanks and Regards A.HARI PRAKASH -------------- next part -------------- An HTML attachment was scrubbed... URL: From mij at irwan.name Fri Jun 1 02:54:07 2007 From: mij at irwan.name (Mohd Irwan Jamaluddin) Date: Fri, 1 Jun 2007 10:54:07 +0800 Subject: [Linux-cluster] linix cluster In-Reply-To: References: Message-ID: On 6/1/07, hari hari wrote: > I want how configure linux cluster details > Simple question with simple answer :) http://www.redhat.com/docs/manuals/enterprise/ Ok seriously, what kind of cluster you want to configure? Is it high-availability, network load balancing, HPC etc? Which RHEL version do you use? -- Regards, Mohd Irwan Jamaluddin Web: http://www.irwan.name/ Blog: http://blog.irwan.name/ From grimme at atix.de Fri Jun 1 05:38:31 2007 From: grimme at atix.de (Marc Grimme) Date: Fri, 1 Jun 2007 07:38:31 +0200 Subject: [Linux-cluster] Active/passive binary files In-Reply-To: <20070531195655.GO4041@redhat.com> References: <20070531195655.GO4041@redhat.com> Message-ID: <200706010738.32962.grimme@atix.de> On Thursday 31 May 2007 21:56:55 Lon Hohberger wrote: > On Thu, May 24, 2007 at 08:22:33AM -0600, Rodolfo Estrada wrote: > > Hi! > > > > I am transferring a TRU64 cluster with oracle 10gR2 as an active/passive > > (no RAC) service to a RHEL5 cluster using GFS. The oracle binaries and > > data files are shared by the nodes in the TRU64 cluster. Can I use the > > same approach in the Linux cluster using GFS? or the binaries are require > > to be installed separately on each node? > You might also want to take a look at www.open-sharedroot.org. There you can build a diskless sharedroot cluster much like the TRUCluster and as a consequence have a sharedroot like with TRU64. Regards Marc. > You can do either. You can share the binaries on GFS or install them > per-node. > > -- Lon > > -- > Lon Hohberger - Software Engineer - Red Hat, Inc. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Gruss / Regards, Marc Grimme Phone: +49-89 452 3538-14 http://www.atix.de/ http://www.open-sharedroot.org/ ** ATIX - Ges. fuer Informationstechnologie und Consulting mbH Einsteinstr. 10 - 85716 Unterschleissheim - Germany Registergericht: Amtsgericht M?nchen Registernummer: HRB 131682 USt.-Id.: DE209485962 Gesch?ftsf?hrung: Marc Grimme, Mark Hlawatschek, Thomas Merz From rhurst at bidmc.harvard.edu Fri Jun 1 11:39:51 2007 From: rhurst at bidmc.harvard.edu (rhurst at bidmc.harvard.edu) Date: Fri, 1 Jun 2007 07:39:51 -0400 Subject: [Linux-cluster] Cluster services stopping In-Reply-To: <20070531200339.GQ4041@redhat.com> References: <1180615948.5475.71.camel@jarjar.trnswrks.com> <20070531200339.GQ4041@redhat.com> Message-ID: <1180697991.3326.9.camel@WSBID06223> Any chance that extended attributes, like log_level, be configurable within luci? I can not find any references to such things. On Thu, 2007-05-31 at 16:03 -0400, Lon Hohberger wrote: > On Thu, May 31, 2007 at 08:52:28AM -0400, Scott McClanahan wrote: > > Any help is appreciated. I can provide more information if you think it > > is helpful. Also, is there some sort of debugging within rgmanager I > > can enable to see what is truly failing or timing out and requiring a > > restart of these services? 
> > (1) Upgrade at least rgmanager, ccsd, magma, and magma-plugins > to 4.5. > > (2) Configure rgmanager to use a different log thing, like local4: > > > ... > > > (don't forget to use ccs_tool update) > > (3) Configure syslog to redirect local4 to something besides > /var/log/messages: > > local4.* /var/log/rgmanager > > (4) Restart syslog > > ... and you'll have awesome logging in /var/log/rgmanager. Probably > more than you need ;) > > -- Lon -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 2178 bytes Desc: not available URL: From james.lapthorn at lapthornconsulting.com Fri Jun 1 11:51:49 2007 From: james.lapthorn at lapthornconsulting.com (James Lapthorn) Date: Fri, 1 Jun 2007 12:51:49 +0100 Subject: [Linux-cluster] CMAN: quorum lost, blocking activity Message-ID: Can any dody help shed some light on the following log entries. This is on a 4 node cluster suite 4 cluster. Looks like I had problems with my quorum disk that shutdown my services on node 1. This was not shown when doing a 'clustat' everything appeared to be fine. It wasn't until checking the mounts that I realised the DB disk was not mounted. Jun 1 10:44:11 leoukldb1 qdiskd[6386]: Score insufficient for master operation (0/1; max=1); downgrading Jun 1 10:44:11 leoukldb1 kernel: CMAN: quorum lost, blocking activity Jun 1 10:44:11 leoukldb1 clurgmgrd[6436]: #1: Quorum Dissolved Jun 1 10:44:11 leoukldb1 ccsd[6264]: Cluster is not quorate. Refusing connection. Jun 1 10:44:11 leoukldb1 ccsd[6264]: Error while processing connect: Connection refused Jun 1 10:44:11 leoukldb1 ccsd[6264]: Invalid descriptor specified (-111). Jun 1 10:44:11 leoukldb1 ccsd[6264]: Someone may be attempting something evil. Jun 1 10:44:11 leoukldb1 ccsd[6264]: Error while processing get: Invalid request descriptor Jun 1 10:44:11 leoukldb1 ccsd[6264]: Invalid descriptor specified (-111). Jun 1 10:44:11 leoukldb1 ccsd[6264]: Someone may be attempting something evil. Jun 1 10:44:11 leoukldb1 ccsd[6264]: Error while processing get: Invalid request descriptor Jun 1 10:44:11 leoukldb1 ccsd[6264]: Invalid descriptor specified (-21). Jun 1 10:44:11 leoukldb1 ccsd[6264]: Someone may be attempting something evil. I had to do a fored reboot in order to get the services to fail over?? Any help would be appreciated! James -------------- next part -------------- An HTML attachment was scrubbed... URL: From lhh at redhat.com Fri Jun 1 13:06:09 2007 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 1 Jun 2007 09:06:09 -0400 Subject: [Linux-cluster] Cluster services stopping In-Reply-To: <1180697991.3326.9.camel@WSBID06223> References: <1180615948.5475.71.camel@jarjar.trnswrks.com> <20070531200339.GQ4041@redhat.com> <1180697991.3326.9.camel@WSBID06223> Message-ID: <20070601130609.GS4041@redhat.com> On Fri, Jun 01, 2007 at 07:39:51AM -0400, rhurst at bidmc.harvard.edu wrote: > Any chance that extended attributes, like log_level, be configurable > within luci? I can not find any references to such things. Sure. Though, errors like the ones causing a service restart should appear in /var/log/messages without reconfiguration. :) -- Lon -- Lon Hohberger - Software Engineer - Red Hat, Inc. 
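
Pulling Lon's four steps together, a rough sketch of what they look like on one
node follows. The log_facility and log_level attribute names on the <rm> tag and
the level of 7 are assumptions based on the rgmanager documentation of that era,
so verify them against the installed version before relying on them.

  # (2) in /etc/cluster/cluster.conf, bump config_version and add the logging
  #     attributes to the resource manager tag, for example:
  #         <rm log_facility="local4" log_level="7"> ... </rm>
  #     then push the updated configuration out to the other nodes:
  ccs_tool update /etc/cluster/cluster.conf
  # (3) send that facility somewhere other than /var/log/messages:
  echo 'local4.*    /var/log/rgmanager' >> /etc/syslog.conf
  # (4) restart syslog so the new rule takes effect:
  service syslog restart
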
From scott.mcclanahan at trnswrks.com Fri Jun 1 13:10:20 2007 From: scott.mcclanahan at trnswrks.com (Scott McClanahan) Date: Fri, 01 Jun 2007 09:10:20 -0400 Subject: [Linux-cluster] Cluster services stopping In-Reply-To: <20070601130609.GS4041@redhat.com> References: <1180615948.5475.71.camel@jarjar.trnswrks.com> <20070531200339.GQ4041@redhat.com><1180697991.3326.9.camel@WSBID06223> <20070601130609.GS4041@redhat.com> Message-ID: <1180703420.5475.81.camel@jarjar.trnswrks.com> On Fri, 2007-06-01 at 09:06 -0400, Lon Hohberger wrote: > On Fri, Jun 01, 2007 at 07:39:51AM -0400, rhurst at bidmc.harvard.edu > wrote: > > Any chance that extended attributes, like log_level, be configurable > > within luci? I can not find any references to such things. > > Sure. Though, errors like the ones causing a service restart should > appear in /var/log/messages without reconfiguration. :) > > -- Lon > > -- > Lon Hohberger - Software Engineer - Red Hat, Inc. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > I've never simulated a file system failure to the point that a file system check would fail and cause a service restart but that should be logged right? Even with default log levels enabled? Is the same true with IP address check failures? It's shocking that nothing is being logged except that the service is being stopped. From Robert.Gil at americanhm.com Fri Jun 1 14:18:49 2007 From: Robert.Gil at americanhm.com (Robert Gil) Date: Fri, 1 Jun 2007 10:18:49 -0400 Subject: [Linux-cluster] MySQL Failover / Failback Message-ID: I am curious if anyone knows the best practices for this? Several use cases include Note: We are choosing to use a vip for the two nodes to make the failover change transparent to the application side. 1) Node 1 (master) dies -How do we enable "sticky" failover so that it does not then fail back to Node 1 -Is Node 2 active all the time or is the service completely shut off? And if its off, how would replication happen? -How do failover domains work in this case? 2) Node 2 (Master) Node 1 recovered -How does replication continue again? -How does the master slave relationship change? Is this automated, or does it require manual intervention? Should we be using DRDB? 3) Node 1 (master) Node 2 (slave) - network connectivity dies on node 1 -There is an IP resource available, but how does this monitor and handle failover? -How can I move the vip in the event of a failure? Do I need to manually script this? With the vip failover, do I attach the vip resource to the mysql resource in the failover domain for those two nodes? What happens if I do this? Thanks, Robert Gil Linux Systems Administrator American Home Mortgage Phone: 631-622-8410 Cell: 631-827-5775 Fax: 516-495-5861 -------------- next part -------------- An HTML attachment was scrubbed... URL: From srigler at marathonoil.com Fri Jun 1 14:44:28 2007 From: srigler at marathonoil.com (Steve Rigler) Date: Fri, 01 Jun 2007 09:44:28 -0500 Subject: [Linux-cluster] gfs_quota returns negative value for usage In-Reply-To: <1180011906.25316.11.camel@houuc8> References: <1180011906.25316.11.camel@houuc8> Message-ID: <1180709068.30624.3.camel@houuc8> On Thu, 2007-05-24 at 08:05 -0500, Steve Rigler wrote: > Greetings, > > We are in the process of implementing quotas on home directories that > reside on a GFS filesystem. 
All seems well with the exception of one > user who's usage is returned as a negative number from gfs_quota: > > user : limit: 102400.0 warn: 0.0 value: -81.4 > > The user actually has about 100MB in their home directory. > > This is on RHEL 4 update 3 with "GFS-6.1.5-0". Any ideas how we can get > the actual usage to be returned from gfs_quota? > > Thanks, > Steve > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Is this something that "gfs_quota check" might fix? Thanks, Steve From lgodoy at atichile.com Fri Jun 1 16:08:38 2007 From: lgodoy at atichile.com (Luis Godoy Gonzalez) Date: Fri, 01 Jun 2007 12:08:38 -0400 Subject: [Linux-cluster] Install Cluster in RHE4 U5 In-Reply-To: References: Message-ID: <46604486.1040401@atichile.com> Hi I'm trying to install a new machine with RHE4 U5 and Cluster Suite, but I have several troubles. Could any indicate the steps to make this ? Someone indicated to use up2date, but how ? ( up2date ???? ) I don't whish to update the whole system, just only install cluster suite. I download latest sources from "ftp://ftp.redhat.com/pub/redhat/linux/updates/enterprise/4ES/en/RHCS/SRPMS" , but I can't compile this source on any of my machines (RHE4, U4, U5 or RHE5 ). I got this error when I tried to use rpmbuild ============= [root at ele install]# rpmbuild --rebuild ccs-1.0.7-0.src.rpm Installing ccs-1.0.7-0.src.rpm error: Architecture is not included: i386 [root at ele install]# uname -a Linux ele.ati-labs.cl 2.6.9-55.EL #1 Fri Apr 20 16:35:59 EDT 2007 i686 athlon i386 GNU/Linux ============= We tested E5 but is not possible install it yet, for oracle certification issues. :( Thanks in advance for any help. Luis G. From Robert.Gil at americanhm.com Fri Jun 1 16:30:17 2007 From: Robert.Gil at americanhm.com (Robert Gil) Date: Fri, 1 Jun 2007 12:30:17 -0400 Subject: [Linux-cluster] IP Relocate Error Message-ID: I have an IP address as a resource. I have the ip address in a 2 node failover domain (total 4 nodes). When i run ifconfig eth0:1 down The service shows as stopped in clustat and the following errors show in the logs Jun 1 12:25:36 clurgmgrd[5346]: #71: Relocating failed service mastervip Jun 1 12:25:36 clurgmgrd[5346]: #70: Attempting to restart service mastervip locally. Jun 1 12:25:37 clurgmgrd[5346]: Recovering failed service mastervip Jun 1 12:25:37 clurgmgrd[5346]: start on ip:192.168.2.100 returned 1 (generic error) Jun 1 12:25:37 clurgmgrd[5346]: #68: Failed to start mastervip; return value: 1 Jun 1 12:25:37 clurgmgrd[5346]: Stopping service mastervip Jun 1 12:25:37 clurgmgrd[5346]: Service mastervip is stopped The following is the resources in /etc/cluster.conf The service in /etc/cluster.conf Any ideas? Thanks, Robert Gil Linux Systems Administrator American Home Mortgage Phone: 631-622-8410 Cell: 631-827-5775 Fax: 516-495-5861 -------------- next part -------------- An HTML attachment was scrubbed... URL: From jparsons at redhat.com Fri Jun 1 16:43:35 2007 From: jparsons at redhat.com (James Parsons) Date: Fri, 01 Jun 2007 12:43:35 -0400 Subject: [Linux-cluster] Install Cluster in RHE4 U5 In-Reply-To: <46604486.1040401@atichile.com> References: <46604486.1040401@atichile.com> Message-ID: <46604CB7.1040601@redhat.com> Luis Godoy Gonzalez wrote: > Hi > > I'm trying to install a new machine with RHE4 U5 and Cluster Suite, > but I have several troubles. > Could any indicate the steps to make this ? 
A lot of folks on this list are hands on, command-line types who sneer at GUI's :) HOWEVER: If you use the conga bits to create a cluster, you just have to enter the FQDN's for your cluster nodes and their passwords in a secure form, and click create. Then if you want, you never, never, ever have to use the GUI again...you can tinker to your hearts content with configuration files and prompt commands with six switches in them. Or, you could 1) download/up2date all the necessary RPMs and kernel pieces yourself to all of the nodes 2) create a skel cluster config file 3) copy it around to all of the nodes via scp 4) start the cluster services by hand on each node (careful, order matters!) 5) ssh around and check that every node has joined...if not, run the join command... Know what I mean? ;) -J DISCLAIMER: This post was generated by a known GUI DEVELOPER and could very likely be prejudiced towards GUIs and the lazy, carefree lifestyle they engender. From weikuan.yu at gmail.com Fri Jun 1 16:51:02 2007 From: weikuan.yu at gmail.com (Weikuan Yu) Date: Fri, 01 Jun 2007 12:51:02 -0400 Subject: [Linux-cluster] GNBD configuration Message-ID: <46604E76.3040708@gmail.com> Hi, I recently want to update from GFS6.0 to GFS6.1. I have not yet been successful in creating a GNBD-based small configuration. I used the following cluster.conf on node15,node16,node22. # cat /etc/cluster/cluster.conf After loading all the modules, I try to run these commands: ccsd cman_tool join fence_tool join clvmd I have the following problems. -- not able to have node22 join the cluster manager. -- Not able to complete this command for node16 # fence_tool join [root at node22 cluster-RHEL4]# cman_tool join cman_tool: local node name "node22" not found in cluster.conf Can anybody share some insights here? I have been trying different steps for a while to no successes. Let me know if I need to be more clear on some specifics. Many thanks in advance, Weikuan From jleafey at utmem.edu Fri Jun 1 16:45:14 2007 From: jleafey at utmem.edu (Jay Leafey) Date: Fri, 01 Jun 2007 11:45:14 -0500 Subject: [Linux-cluster] Install Cluster in RHE4 U5 In-Reply-To: <46604486.1040401@atichile.com> References: <46604486.1040401@atichile.com> Message-ID: <46604D1A.2050408@utmem.edu> Luis Godoy Gonzalez wrote: > Hi > > I'm trying to install a new machine with RHE4 U5 and Cluster Suite, but > I have several troubles. > Could any indicate the steps to make this ? > > Someone indicated to use up2date, but how ? ( up2date ???? ) > I don't whish to update the whole system, just only install cluster suite. > > I download latest sources from > "ftp://ftp.redhat.com/pub/redhat/linux/updates/enterprise/4ES/en/RHCS/SRPMS" > , but I can't compile this source on any of my machines (RHE4, U4, U5 or > RHE5 ). > > I got this error when I tried to use rpmbuild > ============= > [root at ele install]# rpmbuild --rebuild ccs-1.0.7-0.src.rpm > Installing ccs-1.0.7-0.src.rpm > error: Architecture is not included: i386 > [root at ele install]# uname -a > Linux ele.ati-labs.cl 2.6.9-55.EL #1 Fri Apr 20 16:35:59 EDT 2007 i686 > athlon i386 GNU/Linux > ============= > > > We tested E5 but is not possible install it yet, for oracle > certification issues. :( > > > Thanks in advance for any help. > Luis G. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Try this: rpmbuild --rebuild --target i686 ccs-1.0.7-0.src.rpm or rpmbuild --rebuild --target x86_64 ccs-1.0.7-0.src.rpm as appropriate. 
The 'arch' command should give you the appropriate value. The spec file for ccs only lists i686, ia64 (Itanium), and x86_64 (AMD64/EM64T) architectures. The default for rpmbuild on an ix86 box is 'i386' unless you specify a target with the '--target' option. -- Jay Leafey - University of Tennessee E-Mail: jleafey at utmem.edu Phone: 901-448-6534 FAX: 901-448-8199 -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 5153 bytes Desc: S/MIME Cryptographic Signature URL: From jstoner at opsource.net Fri Jun 1 17:21:20 2007 From: jstoner at opsource.net (Jeff Stoner) Date: Fri, 1 Jun 2007 18:21:20 +0100 Subject: [Linux-cluster] MySQL Failover / Failback In-Reply-To: Message-ID: <38A48FA2F0103444906AD22E14F1B5A305DE343C@mailxchg01.corp.opsource.net> Sounds like you've got several things happening all at once. If you are not using MySQL Cluster, then you will probably have an active/passive setup, in which MySQL will be running on only one node. If you are using MySQL Cluster, why are you using Redhat Cluster? Replication? Are you referring to MySQL Replication? What is replicating where? Are the slaves a part of the Redhat Cluster? If you simply mean will replication "break" if MySQL fails over then no. Replication on the slave will retry connecting to the master (according to the connection retry settings in MySQL.) Also, you must use the Redhat Cluster-controlled IP when establishing replication and not the IP of any particular node (for obvious reasons.) For my MySQL databases built on Redhat Cluster, I specify my service as follows: If you have RHEL4.5, you can also put all the scripts at the top level to ensure the same ordering: There's partial (read: demo) code in head CVS which implements higher level dependencies, but it is integrated with the rest of rgmanager yet. -- Lon -- Lon Hohberger - Software Engineer - Red Hat, Inc. From lhh at redhat.com Wed Jun 13 14:53:55 2007 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 13 Jun 2007 10:53:55 -0400 Subject: [Linux-cluster] dlm service is not stopping In-Reply-To: References: Message-ID: <20070613145355.GM7203@redhat.com> On Tue, Jun 12, 2007 at 03:09:45PM +0530, Panigrahi, Santosh Kumar wrote: > Hi Cluster Team, > > I have configured a 2 node cluster (RHEL5). When I am shutting down the > cluster, I am stopping "rgmanager' service first and then "cman' > service. Could you file a bugzilla about this? I am guessing it is an rgmanager bug, but it seems to work for me. -- Lon -- Lon Hohberger - Software Engineer - Red Hat, Inc. From teigland at redhat.com Wed Jun 13 15:01:21 2007 From: teigland at redhat.com (David Teigland) Date: Wed, 13 Jun 2007 10:01:21 -0500 Subject: [Linux-cluster] Slowness above 500 RRDs In-Reply-To: <87odjjkjgv.fsf@tac.ki.iif.hu> References: <87wt06djk7.fsf@tac.ki.iif.hu> <20070423211717.GA22147@redhat.com> <20070424193600.GB11156@redhat.com> <87abvya1ke.fsf@tac.ki.iif.hu> <20070521145058.GA2590@redhat.com> <87ejkhmctr.fsf@tac.ki.iif.hu> <20070612161148.GC16723@redhat.com> <87zm35ktyx.fsf@tac.ki.iif.hu> <20070612164418.GF16723@redhat.com> <87odjjkjgv.fsf@tac.ki.iif.hu> Message-ID: <20070613150121.GD16134@redhat.com> On Wed, Jun 13, 2007 at 04:38:40PM +0200, Ferenc Wagner wrote: > David Teigland writes: > > >>>> But looks like nodeA feels obliged to communicate its locking > >>>> process around the cluster. > >>> > >>> I'm not sure what you mean here. To see the amount of dlm locking traffic > >>> on the network, look at port 21064. 
There should be very little in the > >>> test above... and the dlm locking that you do see should mostly be related > >>> to file i/o, not flocks. > >> > >> There was much traffic on port 21064. Possibly related to file I/O > >> and not flocks, I can't tell. But that's agrees with my speculation, > >> that it's not the explicit [pf]locks that take much time, but > >> something else. > > > > Could you comment the fcntl/flock calls out of the application entirely > > and try it? > > Let's see. A typical test run looks like this (first with fcntl > locking; tcpdump slows down the first iteration from about 6 s): > > filecount=500 > iteration=0 elapsed time=20.196318 s > iteration=1 elapsed time=0.323969 s > iteration=2 elapsed time=0.319929 s > iteration=3 elapsed time=0.361738 s > iteration=4 elapsed time=0.399365 s > total elapsed time=21.601319 s > > During the first (slow) iteration, there's much traffic on port 21064. > During the next (fast) iterations there's no traffic at all on that port. > If I rerun the test immediately, there's still no traffic. > 5 minutes later, without any action on my part, there's a couple of > packets again, then 20 s later a bigger bunch (around 30). > After this, the first iteration generates much traffic again, GOTO 10. > > If I use flock instead, the beginning is similar, but after about 10 s > from the finish of the test, some small traffic appears by itself, and > if I rerun the test after this, it generates traffic again, although > much less than after 5 minutes. The traffic generated 5 minutes after > the test run consists of a couple of packets followed by a much bigger > bunch 5 s later. > > If I don't use any locking at all, then the situation is the same as > with fcntl locking, but the "automatic" traffic consist of a small > burst (couple of packets) 4 min 51 s after the finish, then about 30 > packets 25 s later. > > Does it tell you anything? The timings are perhaps somewhat off > because of the 20 s runtime. If you can make some sense out of this, It sounds pretty normal, I'd need to repeat the test myself to figure out exactly what's happening. The 10 sec is probably toss_secs from the dlm; you can increase with: echo 20 >> /sys/kernel/config/dlm/cluster/toss_secs > I'd be glad to hear it. Also, I'd like to tweak the 5 minutes > timeout, where does it come from? Is it settable by sysfs or > gfs_tool? gfs_tool gettune | grep demote_secs should show 300, to increase: gfs_tool settune demote_secs Dave From lgodoy at atichile.com Wed Jun 13 15:27:32 2007 From: lgodoy at atichile.com (Luis Godoy Gonzalez) Date: Wed, 13 Jun 2007 11:27:32 -0400 Subject: [Linux-cluster] Problems with Cluster In-Reply-To: <1df4abe60706111829n210de3b9m33ffa10dc4fe3afe@mail.gmail.com> References: <1df4abe60706110945w658766cas3e4b9280a6a4a34c@mail.gmail.com> <1df4abe60706111829n210de3b9m33ffa10dc4fe3afe@mail.gmail.com> Message-ID: <46700CE4.30307@atichile.com> I have similar situation, in our case whe have redundant power supply for ensure the server and iLo always is power up. Other way to fix this issue is using some quorum device. I' haven't worked whith them, but I think may work it. Bye Manish Kathuria escribi?: > On 6/11/07, Robert Gil wrote: >> If ilo itself is off, fencing doesn't work. > > Isn't there any timeout setting such that if the ILO doesn't respond > for a certain amount of time, it is treated as fenced and the node is > considered to be dead and the failover takes place? > >> >> Did you add ilo as a fence device? And create a user? 
You create a >> user in the ilo for that blade, not on the chassis. You have to >> reboot the blade to get to the ilo manager. > > Yes, had added respective ILOs as fence devices for both the servers > and created users also. > > > I just want to make sure that automatic fencing happens and failover > takes place even when there is a complete power failure for one node > >> >> -----Original Message----- >> From: linux-cluster-bounces at redhat.com >> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Manish Kathuria >> Sent: Monday, June 11, 2007 12:45 PM >> To: linux clustering >> Subject: Re: [Linux-cluster] Problems with Cluster >> >> On 6/11/07, Maciej Bogucki wrote: >> > Manish Kathuria napisa?(a): >> >> > > We want the failover to happen when the power supply fails to either >> > > of the nodes. In order to test the scenario, we removed the power >> > > cables from one of the nodes. However the failover did not happen >> > > and upon observing the logs we found that the alive node could not >> > > connect to the fence device (ILO in this case) of the dead node >> > > since it was powered off and the fencing could not take place. Does >> > > this mean that we would not be able to have a failover in case of >> > > power failure for one of the nodes. Is there a way we can do it ? >> > > How is the cluster supposed to react when the ILO itself is >> powered off ? >> > >> > You need to perform manual fencing(administrator reaction) when it >> happend. >> > >> >> Isn't there any way which is automated and does not require manual >> intervention ? Otherwise, the whole purpose gets defeated. >> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From mkathuria at tuxtechnologies.co.in Wed Jun 13 16:15:03 2007 From: mkathuria at tuxtechnologies.co.in (Manish Kathuria) Date: Wed, 13 Jun 2007 21:45:03 +0530 Subject: [Linux-cluster] Problems with Cluster In-Reply-To: <200706120816.32533.grimme@atix.de> References: <1df4abe60706110945w658766cas3e4b9280a6a4a34c@mail.gmail.com> <1df4abe60706111829n210de3b9m33ffa10dc4fe3afe@mail.gmail.com> <200706120816.32533.grimme@atix.de> Message-ID: <1df4abe60706130915j379e1932ob4976044104dffe6@mail.gmail.com> On 6/12/07, Marc Grimme wrote: > On Tuesday 12 June 2007 03:29:00 Manish Kathuria wrote: > > On 6/11/07, Robert Gil wrote: > > > If ilo itself is off, fencing doesn't work. > > > > Isn't there any timeout setting such that if the ILO doesn't respond > > for a certain amount of time, it is treated as fenced and the node is > > considered to be dead and the failover takes place? > As far as I remember there is only a tcp-timeout when establishing the > connection to the ilo-card that takes a very long time to occure (that's a > default setting and takes minutes). I'm not sure how and where to set it. We did wait for quite some time and followed the messages appearing in /var/log/messages. It kept on trying to contact the ILO of the node which was powered off. > > But we've had this discussion (especially with ILO-Cards) nearly every time > when using them and therefore and also out of other reasons we had to build > our own fence_ilo agent. I'm quite sure that we solved the timeout problem in > the end. It is set to 10sec per default (Config.timeout). 
> You can find it at > http://download.atix.de/yum/comoonics/productive/noarch/RPMS/comoonics-bootimage-fenceclient-ilo-0.1-16.noarch.rpm > or directly use the yum/up2date-channel as described here: > http://www.open-sharedroot.org/faq/can-i-use-yum-or-up2date-to-install-the-software/ > then install "comoonics-bootimage-fenceclient-ilo" and there you go. Thanks, I will try and see if they agree to use this version. > > > > > Did you add ilo as a fence device? And create a user? You create a user > > > in the ilo for that blade, not on the chassis. You have to reboot the > > > blade to get to the ilo manager. > > > > Yes, had added respective ILOs as fence devices for both the servers > > and created users also. > We are doing so as well. Always a power user for ilo devices. > We are also automating this with the ilo client. > There is a undocumented switch -x in the fence_ilo client referenced above > where you reference a file that might look as follows and you'll have your > user. > > I just want to make sure that automatic fencing happens and failover > > takes place even when there is a complete power failure for one node > If the timeout thing works you'll also need a second fence mechanism. > You might think about using fence_manual as last resort, to bring that cluster > back online after power failure and then after manual intervention. > > Regards Marc. Just wondering if there is any undocumented option / switch which will force an automatic failover to one node if the ILO on the other one fails to respond within certain time period (maybe few minutes). Regards, -- Manish From mkathuria at tuxtechnologies.co.in Wed Jun 13 16:15:29 2007 From: mkathuria at tuxtechnologies.co.in (Manish Kathuria) Date: Wed, 13 Jun 2007 21:45:29 +0530 Subject: [Linux-cluster] Problems with Cluster In-Reply-To: <46700CE4.30307@atichile.com> References: <1df4abe60706110945w658766cas3e4b9280a6a4a34c@mail.gmail.com> <1df4abe60706111829n210de3b9m33ffa10dc4fe3afe@mail.gmail.com> <46700CE4.30307@atichile.com> Message-ID: <1df4abe60706130915s3ae98404v2a7a543c4ab88d3d@mail.gmail.com> On 6/13/07, Luis Godoy Gonzalez wrote: > I have similar situation, in our case whe have redundant power supply > for ensure the server and iLo always is power up. > Other way to fix this issue is using some quorum device. I' haven't > worked whith them, but I think may work it. > > Bye In this scenario also, both the nodes have redundant power supply and the iLo will always be powered but the users want the failover to happen when a node (and therefore the associated iLo) doesn't receive power supply at all. 
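
Before depending on iLO fencing for failover, it can save time to exercise the
iLO path by hand from the node that would do the fencing. A minimal sketch
follows; the host name and credentials are placeholders, and the option letters
may differ between fence_ilo versions, so check fence_ilo -h first.

  # does the iLO answer at all, and how quickly?
  fence_ilo -a node2-ilo.example.com -l fenceuser -p secret -o status
  # run the configured agents against the peer exactly as fenced would:
  fence_node node2

If the iLO really can lose power together with its node, adding a second fence
method behind it (fence_manual as Marc suggests, or a switched PDU) is the usual
way to let the cluster recover once an operator confirms the node is truly down.
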
From wferi at niif.hu Wed Jun 13 16:53:10 2007 From: wferi at niif.hu (Ferenc Wagner) Date: Wed, 13 Jun 2007 18:53:10 +0200 Subject: [Linux-cluster] Slowness above 500 RRDs In-Reply-To: <20070613150121.GD16134@redhat.com> (David Teigland's message of "Wed, 13 Jun 2007 10:01:21 -0500") References: <87wt06djk7.fsf@tac.ki.iif.hu> <20070423211717.GA22147@redhat.com> <20070424193600.GB11156@redhat.com> <87abvya1ke.fsf@tac.ki.iif.hu> <20070521145058.GA2590@redhat.com> <87ejkhmctr.fsf@tac.ki.iif.hu> <20070612161148.GC16723@redhat.com> <87zm35ktyx.fsf@tac.ki.iif.hu> <20070612164418.GF16723@redhat.com> <87odjjkjgv.fsf@tac.ki.iif.hu> <20070613150121.GD16134@redhat.com> Message-ID: <878xankd8p.fsf@tac.ki.iif.hu> David Teigland writes: > The 10 sec is probably toss_secs from the dlm; you can increase > with: echo 20 >> /sys/kernel/config/dlm/cluster/toss_secs > > gfs_tool gettune | grep demote_secs > > should show 300, to increase: > > gfs_tool settune demote_secs Great. No problem to do that after each mount. -- Thanks, Feri. From lhh at redhat.com Wed Jun 13 19:04:13 2007 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 13 Jun 2007 15:04:13 -0400 Subject: [Linux-cluster] fence device: network card? In-Reply-To: <466F2C7F.9080906@klxsystems.net> References: <466F2C7F.9080906@klxsystems.net> Message-ID: <20070613190413.GN7203@redhat.com> On Tue, Jun 12, 2007 at 04:30:07PM -0700, Karl R. Balsmeier wrote: > Basically I read in the docs you can use a NIC card as a fence device, > is this true? I don't think there's NIC-based fencing at this point; I could be mistaken; I haven't looked at the fence tree for some time. Could you point me at where it noted this so I can read it and send in corrections? > Right now each of the 3 servers have 3 NICs, so I have a total of 9 to > play with. Right now I am bonding the two GB NIC's together no > problem. That leaves each server a 100mbps NIC. > My ultimate goal is to use these 3 machines to make a Vsftpd GFS cluster > that I can run Iscsi over. GNBD is not iSCSI. It is similar in that it implements a block device over the network, but to use fence_gnbd, you need to be using GNBD (not iSCSI). -- Lon Hohberger - Software Engineer - Red Hat, Inc. From lhh at redhat.com Wed Jun 13 20:09:52 2007 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 13 Jun 2007 16:09:52 -0400 Subject: [Linux-cluster] dlm service is not stopping In-Reply-To: <20070613145355.GM7203@redhat.com> References: <20070613145355.GM7203@redhat.com> Message-ID: <20070613200952.GO7203@redhat.com> On Wed, Jun 13, 2007 at 10:53:55AM -0400, Lon Hohberger wrote: > On Tue, Jun 12, 2007 at 03:09:45PM +0530, Panigrahi, Santosh Kumar wrote: > > Hi Cluster Team, > > > > I have configured a 2 node cluster (RHEL5). When I am shutting down the > > cluster, I am stopping "rgmanager' service first and then "cman' > > service. > > Could you file a bugzilla about this? I am guessing it is an rgmanager > bug, but it seems to work for me. > We hit this while testing the fix for another bugzilla: If most nodes of a cluster go offline, the last node(s) lose quorum. This causes rgmanager to exit uncleanly, failing to clean up the lockspace. This prevents cman from stopping. Is this what happened for you? -- Lon Hohberger - Software Engineer - Red Hat, Inc. 
From Santosh.Panigrahi at in.unisys.com Thu Jun 14 03:54:41 2007 From: Santosh.Panigrahi at in.unisys.com (Panigrahi, Santosh Kumar) Date: Thu, 14 Jun 2007 09:24:41 +0530 Subject: [Linux-cluster] dlm service is not stopping In-Reply-To: <20070613200952.GO7203@redhat.com> References: <20070613145355.GM7203@redhat.com> <20070613200952.GO7203@redhat.com> Message-ID: You are absolutely right. There are around 5 process starting name with dlm_* are running. I am also not able to kill these processes by (kill -9 pid). So each time, I am rebooting this node on facing this dlm problem. Please suggest me some other way to kill these processes. I will file a bugzilla for the same. Thanks Santosh -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Lon Hohberger Sent: Thursday, June 14, 2007 1:40 AM To: linux clustering Subject: Re: [Linux-cluster] dlm service is not stopping On Wed, Jun 13, 2007 at 10:53:55AM -0400, Lon Hohberger wrote: > On Tue, Jun 12, 2007 at 03:09:45PM +0530, Panigrahi, Santosh Kumar wrote: > > Hi Cluster Team, > > > > I have configured a 2 node cluster (RHEL5). When I am shutting down the > > cluster, I am stopping "rgmanager' service first and then "cman' > > service. > > Could you file a bugzilla about this? I am guessing it is an rgmanager > bug, but it seems to work for me. > We hit this while testing the fix for another bugzilla: If most nodes of a cluster go offline, the last node(s) lose quorum. This causes rgmanager to exit uncleanly, failing to clean up the lockspace. This prevents cman from stopping. Is this what happened for you? -- Lon Hohberger - Software Engineer - Red Hat, Inc. -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From janne.peltonen at helsinki.fi Thu Jun 14 05:38:25 2007 From: janne.peltonen at helsinki.fi (Janne Peltonen) Date: Thu, 14 Jun 2007 08:38:25 +0300 Subject: [Linux-cluster] ip.sh In-Reply-To: <20070613144038.GI7203@redhat.com> References: <20070611180422.GF25899@helsinki.fi> <20070613144038.GI7203@redhat.com> Message-ID: <20070614053825.GK25899@helsinki.fi> On Wed, Jun 13, 2007 at 10:40:38AM -0400, Lon Hohberger wrote: > > I've got a rhel 5 based system with 25 > > services, 24 of which use ext3 fs's on clvm logical volumes in a SAN. > > Could you tell us what version of rgmanager you have installed? Seems to be 2.0.23. And, to be exact, my system is a CentOS 5. The release of rgmanager is 1.el5.centos. --Janne -- Janne Peltonen From pcaulfie at redhat.com Thu Jun 14 07:56:05 2007 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Thu, 14 Jun 2007 08:56:05 +0100 Subject: [Linux-cluster] dlm service is not stopping In-Reply-To: References: <20070613145355.GM7203@redhat.com> <20070613200952.GO7203@redhat.com> Message-ID: <4670F495.8060808@redhat.com> Panigrahi, Santosh Kumar wrote: > You are absolutely right. > > There are around 5 process starting name with dlm_* are running. I am > also not able to kill these processes by (kill -9 pid). So each time, I > am rebooting this node on facing this dlm problem. Please suggest me > some other way to kill these processes. > You can't kill those processes, they are DLM kernel threads. You need to shut down any 'real' processes that are using the DLM, then they will go away. -- Patrick Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SL4 ITE, UK. Registered in England and Wales under Company Registration No. 
3798903 From cluster at defuturo.co.uk Wed Jun 13 16:31:10 2007 From: cluster at defuturo.co.uk (Robert Clark) Date: Wed, 13 Jun 2007 17:31:10 +0100 Subject: [Linux-cluster] cmirror leg failure: dmeventd dies Message-ID: <1181752271.10296.10.camel@rutabaga.defuturo.co.uk> I've recently upgraded from RHEL4U4 to RHEL4U5 in order to use the new cmirror package instead of one I was building myself from CVS. Now, when I fail one of the PVs, the mirrored LV isn't converted to linear. Instead, dmeventd dies like this: 3377 send(6, "<15>Jun 12 16:50:34 lvm[3371]: Loaded external locking library liblvm2clusterlock.so", 84, MSG_NOSIGNAL) = 84 3377 socket(PF_FILE, SOCK_STREAM, 0) = 7 3377 connect(7, {sa_family=AF_FILE, path=@clvmd}, 110) = 0 3377 time(NULL) = 1181663434 3377 stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=1323, ...}) = 0 3377 stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=1323, ...}) = 0 3377 stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=1323, ...}) = 0 3377 send(6, "<15>Jun 12 16:50:34 lvm[3371]: Finding volume group \"lvm_test1\"", 63, MSG_NOSIGNAL) = 63 3377 stat64("/proc/lvm/VGs/lvm_test1", 0xf6fa6340) = -1 ENOENT (No such file or directory) 3377 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], ~[HUP INT QUIT KILL TERM STOP RTMIN RT_1], 8) = 0 3377 writev(2, [{"", 22}, {": ", 2}, {"symbol lookup error", 19}, {": ", 2}, {"/usr/lib/liblvm2clusterlock.so", 30}, {": ", 2}, {"undefined symbol: print_log", 27}, {"", 0}, {"", 0}, {"\n", 1}], 10) = 105 3377 exit_group(127) = ? and logs this: Jun 12 13:43:57 kiwano dmeventd[3371]: dmeventd ready for processing. Jun 12 13:43:57 kiwano dmeventd[3371]: Monitoring mirror device lvm_test1-var for events Jun 12 13:44:08 kiwano lvm[3371]: lvm_test1-var is now in-sync Jun 12 16:50:33 kiwano lvm[3371]: Mirror device, 253:5, has failed. Jun 12 16:50:33 kiwano lvm[3371]: Device failure in lvm_test1-var Jun 12 16:50:33 kiwano lvm[3371]: WARNING: dev_open(/etc/lvm/lvm.conf) called while suspended The "undefined symbol: print_log" error looks pretty fatal and "ldd -r /usr/lib/liblvm2clusterlock.so" reports it too, though it does the same on working clusters and a FC6 box as well. Can anyone suggest how to debug this? Thanks, Robert From mbrookov at mines.edu Thu Jun 14 14:37:14 2007 From: mbrookov at mines.edu (Matthew B. Brookover) Date: Thu, 14 Jun 2007 08:37:14 -0600 Subject: [Linux-cluster] fence device: network card? In-Reply-To: <466F2C7F.9080906@klxsystems.net> References: <466F2C7F.9080906@klxsystems.net> Message-ID: <1181831834.26590.35.camel@merlin.Mines.EDU> Hi Karl, GNDB is a block server, like ISCSI. Unfortunately there does not appear to be any standard fencing mechanism for ISCSI. I hacked up one of the existing fence agents to use SNMP to turn off the network ports in a Cisco 3750 switch. My test systems are using an old HP switch that the network team was not using -- it works also. System-config-cluster would not allow me to specify my own fence agent. Well, I looked quickly, did not see any thing obvious, gave up and edited cluster.conf with vi. I have attached fence_cisco to this message, and here are some notes to use it. You may need to get the Net-SNMP perl module from CPAN.org. The config file for fence_cisco looks like this: community: switch:10.1.4.254 oneoften:A1:C1 twooften:A5:C2 threeoften:A2:C3 The first line is your community string, the second line is the IP address of the network switch, the rest are the hosts. 
The first column is the host name followed by a colon separated list of the ports that the host is attached to on the Ethernet switch. In the cluster.conf file, the port parameter must match an entry in the host name column. The fence_cisco agent will 'fence a nic' in the network switch. I have attached fence_cisco to this message. I would suggest that you test every thing carefully. Matt On Tue, 2007-06-12 at 16:30 -0700, Karl R. Balsmeier wrote: > Hi, > > I have three (3) servers built and entered into the > system-config-cluster tool as nodes. Basically the first node has node > 2 and node 3 as members of the cluster. > > For a fence device, I do not have any of the SAN or network/switch > devices listed in the dropdown menu, and where I have read in the > documentation that says "gnbd" Generic Network Block Device seems to be > what i'm looking for. > > Basically I read in the docs you can use a NIC card as a fence device, > is this true? > > Right now each of the 3 servers have 3 NICs, so I have a total of 9 to > play with. Right now I am bonding the two GB NIC's together no > problem. That leaves each server a 100mbps NIC. > > My ultimate goal is to use these 3 machines to make a Vsftpd GFS cluster > that I can run Iscsi over. > > Being new to this though, i'll stick to the primary questions: How does > one configure a fence device in the form of a NIC card? Is the gnbd > item relevant to this? > > -karl > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: fence_cisco Type: application/x-perl Size: 10617 bytes Desc: not available URL: From janne.peltonen at helsinki.fi Thu Jun 14 14:38:27 2007 From: janne.peltonen at helsinki.fi (Janne Peltonen) Date: Thu, 14 Jun 2007 17:38:27 +0300 Subject: er, fs.sh, not ip (Re: [Linux-cluster] ip.sh) In-Reply-To: <20070614053825.GK25899@helsinki.fi> References: <20070611180422.GF25899@helsinki.fi> <20070613144038.GI7203@redhat.com> <20070614053825.GK25899@helsinki.fi> Message-ID: <20070614143827.GB15269@helsinki.fi> On Thu, Jun 14, 2007 at 08:38:25AM +0300, Janne Peltonen wrote: > On Wed, Jun 13, 2007 at 10:40:38AM -0400, Lon Hohberger wrote: > > > I've got a rhel 5 based system with 25 > > > services, 24 of which use ext3 fs's on clvm logical volumes in a SAN. > > > > Could you tell us what version of rgmanager you have installed? > > Seems to be 2.0.23. And, to be exact, my system is a CentOS 5. The > release of rgmanager is 1.el5.centos. I noticed a bug in this message's subject. It was about fs.sh, not ip.sh... --Janne -- Janne Peltonen From jbrassow at redhat.com Thu Jun 14 15:10:14 2007 From: jbrassow at redhat.com (Jonathan Brassow) Date: Thu, 14 Jun 2007 10:10:14 -0500 Subject: [Linux-cluster] cmirror leg failure: dmeventd dies In-Reply-To: <1181752271.10296.10.camel@rutabaga.defuturo.co.uk> References: <1181752271.10296.10.camel@rutabaga.defuturo.co.uk> Message-ID: <8F041834-5149-4375-86C6-290B4CDC1D50@redhat.com> On Jun 13, 2007, at 11:31 AM, Robert Clark wrote: > I've recently upgraded from RHEL4U4 to RHEL4U5 in order to use > the new > cmirror package instead of one I was building myself from CVS. > > Now, when I fail one of the PVs, the mirrored LV isn't converted to > linear. 
Instead, dmeventd dies like this: > > 3377 send(6, "<15>Jun 12 16:50:34 lvm[3371]: Loaded external > locking library liblvm2clusterlock.so", 84, MSG_NOSIGNAL) = 84 > 3377 socket(PF_FILE, SOCK_STREAM, 0) = 7 > 3377 connect(7, {sa_family=AF_FILE, path=@clvmd}, 110) = 0 > 3377 time(NULL) = 1181663434 > 3377 stat64("/etc/localtime", {st_mode=S_IFREG|0644, > st_size=1323, ...}) = 0 > 3377 stat64("/etc/localtime", {st_mode=S_IFREG|0644, > st_size=1323, ...}) = 0 > 3377 stat64("/etc/localtime", {st_mode=S_IFREG|0644, > st_size=1323, ...}) = 0 > 3377 send(6, "<15>Jun 12 16:50:34 lvm[3371]: Finding volume group > \"lvm_test1\"", 63, MSG_NOSIGNAL) = 63 > 3377 stat64("/proc/lvm/VGs/lvm_test1", 0xf6fa6340) = -1 ENOENT (No > such file or directory) > 3377 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], ~[HUP INT QUIT > KILL TERM STOP RTMIN RT_1], 8) = 0 > 3377 writev(2, [{"", 22}, {": ", 2}, > {"symbol lookup error", 19}, {": ", 2}, {"/usr/lib/ > liblvm2clusterlock.so", 30}, {": ", 2}, {"undefined symbol: > print_log", 27}, {"", 0}, {"", 0}, {"\n", 1}], 10) = 105 > 3377 exit_group(127) = ? > > and logs this: > > Jun 12 13:43:57 kiwano dmeventd[3371]: dmeventd ready for processing. > Jun 12 13:43:57 kiwano dmeventd[3371]: Monitoring mirror device > lvm_test1-var for events > Jun 12 13:44:08 kiwano lvm[3371]: lvm_test1-var is now in-sync > Jun 12 16:50:33 kiwano lvm[3371]: Mirror device, 253:5, has failed. > Jun 12 16:50:33 kiwano lvm[3371]: Device failure in lvm_test1-var > Jun 12 16:50:33 kiwano lvm[3371]: WARNING: dev_open(/etc/lvm/ > lvm.conf) called while suspended > > The "undefined symbol: print_log" error looks pretty fatal and "ldd > -r /usr/lib/liblvm2clusterlock.so" reports it too, though it does the > same on working clusters and a FC6 box as well. > > Can anyone suggest how to debug this? /etc/lvm/lvm.conf: locking_type = 3 ? brassow From lhh at redhat.com Thu Jun 14 18:14:32 2007 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 14 Jun 2007 14:14:32 -0400 Subject: er, fs.sh, not ip (Re: [Linux-cluster] ip.sh) In-Reply-To: <20070614143827.GB15269@helsinki.fi> References: <20070611180422.GF25899@helsinki.fi> <20070613144038.GI7203@redhat.com> <20070614053825.GK25899@helsinki.fi> <20070614143827.GB15269@helsinki.fi> Message-ID: <20070614181432.GQ7203@redhat.com> On Thu, Jun 14, 2007 at 05:38:27PM +0300, Janne Peltonen wrote: > On Thu, Jun 14, 2007 at 08:38:25AM +0300, Janne Peltonen wrote: > > On Wed, Jun 13, 2007 at 10:40:38AM -0400, Lon Hohberger wrote: > > > > I've got a rhel 5 based system with 25 > > > > services, 24 of which use ext3 fs's on clvm logical volumes in a SAN. > > > > > > Could you tell us what version of rgmanager you have installed? > > > > Seems to be 2.0.23. And, to be exact, my system is a CentOS 5. The > > release of rgmanager is 1.el5.centos. > > I noticed a bug in this message's subject. It was about fs.sh, not > ip.sh... It doesn't matter; it shouldn't do that :) I think it's related to something that may be fixed in my sandbox; I'll check on it some more. -- Lon -- Lon Hohberger - Software Engineer - Red Hat, Inc. 
From cluster at defuturo.co.uk Thu Jun 14 20:17:08 2007 From: cluster at defuturo.co.uk (Robert Clark) Date: Thu, 14 Jun 2007 21:17:08 +0100 Subject: [Linux-cluster] cmirror leg failure: dmeventd dies In-Reply-To: <8F041834-5149-4375-86C6-290B4CDC1D50@redhat.com> References: <1181752271.10296.10.camel@rutabaga.defuturo.co.uk> <8F041834-5149-4375-86C6-290B4CDC1D50@redhat.com> Message-ID: <1181852229.17288.19.camel@localhost.localdomain> On Thu, 2007-06-14 at 10:10 -0500, Jonathan Brassow wrote: > On Jun 13, 2007, at 11:31 AM, Robert Clark wrote: > > > I've recently upgraded from RHEL4U4 to RHEL4U5 in order to use > > the new > > cmirror package instead of one I was building myself from CVS. > > > > Now, when I fail one of the PVs, the mirrored LV isn't converted to > > linear. Instead, dmeventd dies > /etc/lvm/lvm.conf: > locking_type = 3 Thanks - that got it. Mine was still set to 2 from lvmconf run from the RHEL4U4 lvm2-cluster package. Robert From andremachado at techforce.com.br Thu Jun 14 20:22:14 2007 From: andremachado at techforce.com.br (andremachado) Date: Thu, 14 Jun 2007 13:22:14 -0700 Subject: [Linux-cluster] GFS over GNBD freezes Message-ID: <23e2820769f1b99a2a52b11409ba6e73@localhost> Hello, When executing a copy operation over the same gfs over gnbd that is being written, it freezes (randomly). the problem does not happen when gfs is locally mounted or when writing to 2 different gfs+gnbd devices. Please, what configuration is missing? Regards. Andre Felipe From maciej.bogucki at artegence.com Fri Jun 15 08:34:28 2007 From: maciej.bogucki at artegence.com (Maciej Bogucki) Date: Fri, 15 Jun 2007 10:34:28 +0200 Subject: [Linux-cluster] GFS over GNBD freezes In-Reply-To: <23e2820769f1b99a2a52b11409ba6e73@localhost> References: <23e2820769f1b99a2a52b11409ba6e73@localhost> Message-ID: <46724F14.5080505@artegence.com> andremachado napisa?(a): > Hello, > When executing a copy operation over the same gfs over gnbd that is being written, it freezes (randomly). > the problem does not happen when gfs is locally mounted or when writing to 2 different gfs+gnbd devices. > Please, what configuration is missing? > Regards. > Andre Felipe > Maybe this URL would help you: http://www.techforce.com.br/index.php/news/linux_blog/red_hat_cluster_suite_debian_etch Best Regards Maciej Bogucki From mathieu.avila at seanodes.com Fri Jun 15 08:52:08 2007 From: mathieu.avila at seanodes.com (Mathieu Avila) Date: Fri, 15 Jun 2007 10:52:08 +0200 Subject: [Linux-cluster] Error when starting ccsd and proposed patch Message-ID: <20070615105208.0fc6fd76@mathieu.toulouse> Hello all, I'm sometimes having trouble when starting ccsd and then gulm under heavy CPU load. Ccsd's init script tells it is running but it's not fully initialized. The problem comes from the fact that ccsd's main process returns before the daemonized process of ccsd has finished initializing its sockets. The "cluster_communicator" thread sends a SIGTERM message to the parent process before the main thread has finished its initialization work. With the patch proposed in attachement, the cluster_communicator is started after the main thread has finished initializing. It works well under any load. Any daemon that needs to connect ccsd will then succceed. It was tested with cluster-1.03, but it should work with older versions, the ccsd files didn't seem to have changed much. -- Mathieu Avila -------------- next part -------------- A non-text attachment was scrubbed... 
Name: ccsd-init.patch Type: text/x-patch Size: 704 bytes Desc: not available URL: From jad at midentity.com Fri Jun 15 12:04:49 2007 From: jad at midentity.com (James Dyer) Date: Fri, 15 Jun 2007 13:04:49 +0100 (BST) Subject: [Linux-cluster] Need some advice, setting up first clustered FS Message-ID: I'm trying to set up my first clustered FS, but before I waste time trying things, only to find they don't work, I thought it would be a good idea to ask the esteemed members of this list for some opinions. At the moment, I have three webservers, which share storage via an NFS mount to a server with 1TB space on it, The file server exports a 800GB partition to these servers. The 800GB partition is a stripe over 2 500MB SATA disks. This 800GB partition is syncronised to another server using Unison every 30 mins. NFS is really not working for us; hitting all sorts of problems with it. Additionally, the above solution is obviously not at all fault tolerant, nor expandable, so it's time to look at other options. Budget limited at the moment, so really need to stick with the hardware I've currently got. The solution I'm thinking of is as follows; I'd like some opinions on whether or not this is a good idea, or if it's stupid, or impossible etc. 1- On each of the file servers, keep the existing 800GB raid0 stripe. 2- Using vblade, present these stripes to both file servers over AoE 3- On each file server, create a raid1 volume of both raid0 stripes 4- Put a gfs filesystem on the raid1 volume, mount on webservers using gfs etc. Some questions: 1- I'm not sure if stage 3 is do-able or not. I'm not sure if I can create a raid1 volume from two AoE volumes. Some things I've read say no, some say perhaps. 2- Can I actually present a device over AoE to the same physical server it's installed in, or would the volume need to be made from the AoE device from the other server, and the physical device on this server? (think that question kinda makes sense...) Really keen to make this very expandible in the future, and fault tolerant, so would expect to move to a raid5/10 system at some point. This would be accomplished by having more file servers exporting a stripe over AoE. At this point, I would imagine I'd have a couple of servers in front of the disk farm servers to actually create the gfs partition, and it is these servers that the webservers would communicate directly with. Hope what I've written makes some semblance of sense... Thanks in advance for any advice/pointers James -- July 27th, 2007 - System Administrator Appreciation Day - http://www.sysadminday.com/ From Robert.Gil at americanhm.com Fri Jun 15 12:31:59 2007 From: Robert.Gil at americanhm.com (Robert Gil) Date: Fri, 15 Jun 2007 08:31:59 -0400 Subject: [Linux-cluster] Need some advice, setting up first clustered FS In-Reply-To: Message-ID: James, I have been looking into similar implementations for our testing environment. We use san everywhere, and since san is so expensive I was considering using AoE to imitate it and a lower cost. As you said, you have 3 webservers and 1 fileserver. If you use AoE each of the 3 servers can mount that device and you can use GFS for the file locking. Each server will see the SAME disk. If you use LVM on the file server, you can expand the fileserver as much as you want. Since AoE is block level storage, you can add additional fileservers and use LVM on the webservers to expand the AoE disks. 
If this is to be production I would add some fault tolerance, with at least channel bonding, if not two switches for redundancy on the AoE side. When you do this however, your system will see two sets of disks, and you will need to use multipathing to handle the multiple paths and create a pseudo device so in the event of a failure, it is relatively transparent to the OS. If you do bonded GigE, your doing pretty well as far as throughput in comparison to FC. I assume the latencies differ significantly between GigE and FC, but I don't know what the percent is. Hope that helps. Robert Gil Linux Systems Administrator American Home Mortgage -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of James Dyer Sent: Friday, June 15, 2007 8:05 AM To: linux-cluster at redhat.com Subject: [Linux-cluster] Need some advice, setting up first clustered FS I'm trying to set up my first clustered FS, but before I waste time trying things, only to find they don't work, I thought it would be a good idea to ask the esteemed members of this list for some opinions. At the moment, I have three webservers, which share storage via an NFS mount to a server with 1TB space on it, The file server exports a 800GB partition to these servers. The 800GB partition is a stripe over 2 500MB SATA disks. This 800GB partition is syncronised to another server using Unison every 30 mins. NFS is really not working for us; hitting all sorts of problems with it. Additionally, the above solution is obviously not at all fault tolerant, nor expandable, so it's time to look at other options. Budget limited at the moment, so really need to stick with the hardware I've currently got. The solution I'm thinking of is as follows; I'd like some opinions on whether or not this is a good idea, or if it's stupid, or impossible etc. 1- On each of the file servers, keep the existing 800GB raid0 stripe. 2- Using vblade, present these stripes to both file servers over AoE 3- On each file server, create a raid1 volume of both raid0 stripes 4- Put a gfs filesystem on the raid1 volume, mount on webservers using gfs etc. Some questions: 1- I'm not sure if stage 3 is do-able or not. I'm not sure if I can create a raid1 volume from two AoE volumes. Some things I've read say no, some say perhaps. 2- Can I actually present a device over AoE to the same physical server it's installed in, or would the volume need to be made from the AoE device from the other server, and the physical device on this server? (think that question kinda makes sense...) Really keen to make this very expandible in the future, and fault tolerant, so would expect to move to a raid5/10 system at some point. This would be accomplished by having more file servers exporting a stripe over AoE. At this point, I would imagine I'd have a couple of servers in front of the disk farm servers to actually create the gfs partition, and it is these servers that the webservers would communicate directly with. Hope what I've written makes some semblance of sense... 
Thanks in advance for any advice/pointers James -- July 27th, 2007 - System Administrator Appreciation Day - http://www.sysadminday.com/ -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From andremachado at techforce.com.br Fri Jun 15 14:05:11 2007 From: andremachado at techforce.com.br (andremachado) Date: Fri, 15 Jun 2007 7:05:11 -0700 Subject: [Linux-cluster] GFS over GNBD freezes In-Reply-To: <46724F14.5080505@artegence.com> References: <46724F14.5080505@artegence.com> Message-ID: <495714f1a428bff7858e54afc745bab7@localhost> Hello, Many thanks for your message. Well, actually the article is from my blog... So, I already executed all steps described there, and much more ideas also. The synchronization solved the problem of single directly writing, but not **concurrent** access over gnbd. oopses_ok is a frigthning proposition... Read across RH docs, tried many cluster.conf configurations, enabled debug and collected some data. Unfortunately, I still have only some "suspects". It ***"seems"*** that the cluster locking (suite 1.03.02) is not so robust, and heavily depends on fast FC private networks, not implementing suitable semaphores and handshaking. I am trying to implement iscsi now (reading docs phase). But, maybe, the problems arise again if the real cause is at GFS / clvm locking coordination and not at gnbd itself. As RH docs suggests at fig. 1.13 [0] that the intended idea is feasible, I am trying it. do you have any additional ideas? How to spot the real cause? Regards. Andre Felipe Machado [0] http://elibrary.fultus.com/technical/topic/com.fultus.redhat.elinux5/manuals/Cluster_Suite_Overview/s2-ov-economy-CSO.html > > Maybe this URL would help you: > > http://www.techforce.com.br/index.php/news/linux_blog/red_hat_cluster_suite_debian_etch > > Best Regards > Maciej Bogucki From eschneid at uccs.edu Fri Jun 15 17:29:20 2007 From: eschneid at uccs.edu (Eric Schneider) Date: Fri, 15 Jun 2007 11:29:20 -0600 Subject: [Linux-cluster] CommuniGate as service? Message-ID: <002701c7af72$b532b400$1b03c680@uccs.edu> I am trying to setup CommuniGate (http://www.stalker.com/content/default.html) as a cluster service. I have everything working except for one thing. If I stop/kill the process manually and wait for the status check to move the service nothing happens. I get "clurgmgrd: [3087]: Executing /etc/init.d/CommuniGate status" rather than the "clurgmgrd: [3087]: script:httpd-webcal: status of /etc/init.d/httpd-webcal failed (returned 3)" I get with an apache process. I had to add the "status" part to /etc/init.d/CommuniGate file myself, but there is obviously a problem. I have mucked around with it for a while but I am just putting it out there if my mistakes are obvious. # If you have placed the application folder in a different directory, # change the APPLICATION variable # # The default location for the CommuniGate Pro "base directory" (a folder # containing mail accounts, settings, logs, etc.) is /var/CommuniGate # If you want to use a different location, change the BASEFOLDER variable # # APPLICATION="/opt" BASEFOLDER="/var/CommuniGate" SUPPLPARAMS= PROG="/opt/CommuniGate/CGServer" #ADDED pidfile=${PIDFILE-/var/run/CommuniGate.pid} lockfile=${LOCKFILE-/var/lock/subsys/CommuniGate} RETVAL=0 #/ADDED [ -f ${APPLICATION}/CommuniGate/CGServer ] || exit 0 # Some Linux distributions come with the "NPTL" threads library # that crashes quite often. 
The following lines are believed to force # Linux to use the old working threads library. # #LD_ASSUME_KERNEL=2.4.1 #export LD_ASSUME_KERNEL # Source function library. if [ -f /etc/rc.d/init.d/functions ]; then . /etc/rc.d/init.d/functions elif [ -f /etc/init.d/functions ]; then . /etc/init.d/functions fi ulimit -u 2000 ulimit -c 2097151 umask 0 # Custom startup parameters if [ -f ${BASEFOLDER}/Startup.sh ]; then . ${BASEFOLDER}/Startup.sh fi case "$1" in start) if [ -d ${BASEFOLDER} ] ; then echo else echo "Creating the CommuniGate Base Folder..." mkdir ${BASEFOLDER} chgrp mail ${BASEFOLDER} chmod 2770 ${BASEFOLDER} fi echo -n "Starting CommuniGate Pro" ${APPLICATION}/CommuniGate/CGServer \ --Base ${BASEFOLDER} --Daemon ${SUPPLPARAMS} \ # --ClusterBackend # --ClusterFrontend #Comment out #touch /var/lock/subsys/CommuniGate #ADD touch ${pidfile} RETVAL=$? echo [ "$RETVAL = 0" ] && touch ${lockfile} #return $RETVAL #/ADDED ;; controller) echo "Starting CommuniGate Pro Cluster Controller" ${APPLICATION}/CommuniGate/CGServer \ --Base ${BASEFOLDER} --Daemon ${SUPPLPARAMS} \ --ClusterController touch /var/lock/subsys/CommuniGate ;; stop) if [ -f ${BASEFOLDER}/ProcessID ]; then echo "Shutting down the CommuniGate Pro Server" kill `cat ${BASEFOLDER}/ProcessID` sleep 5 else echo "It looks like the CommuniGate Pro Server is not running" fi #eric rm -f ${pidfile} ##Orig #rm -f /var/lock/subsys/CommuniGate #ADDED RETVAL=$? echo [ "$RETVAL = 3" ] && rm -f ${lockfile} #${pidfile} #/ADDED ;; #ADDED status) status $PROG RETVAL=$? ;; #/ADDED *) echo "Usage: $0 [ start | stop | status ]" exit 1 esac exit 0 From jonyahoo at directfreight.com Sat Jun 16 03:41:01 2007 From: jonyahoo at directfreight.com (Jon Gabrielson) Date: Fri, 15 Jun 2007 22:41:01 -0500 (CDT) Subject: [Linux-cluster] gfs and software raid across 4 systems. Message-ID: <5474.12.227.156.125.1181965261.squirrel@www.directfreight.com> I have 4 servers each with their own harddrive. I would like to setup gfs so that each can read/write to their local harddrive and the data synced to the other 3 harddrives. My original idea was to export all 4 harddrives with a network block device like iscsi/gnbd/nbd and then use md software raid1 on top of those but I've heard that this is a bad idea as md is not cluster aware. I also found drbd but it only supports 2 harddrives. Are there any available solutions for doing a 4 way mirror like I am looking for? Basically a network raid 1 across 4 harddrives on 4 separate systems. Thanks, Jon. From Robert.Gil at americanhm.com Sun Jun 17 05:34:24 2007 From: Robert.Gil at americanhm.com (Robert Gil) Date: Sun, 17 Jun 2007 01:34:24 -0400 Subject: [Linux-cluster] clusvcadm hangs starting service Message-ID: I have a service for an ip which does a mysql check as a dependancy for failover. For some reason clusvcadm hangs when trying to either enable, disable, or restart that service. I get no errors in the error log and it just hangs. This is luckily in our test environment, but this will not be good in production. We use this floating ip for a couple of mysql servers doing replication so we always want to have the ip pointing to the master (rw) server. Has anyone seen the services hanging? How can this be resolved? Should rgmanager be restarted? wont this fence the server? Thanks, Robert Gil Linux Systems Administrator American Home Mortgage -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From faizn2000 at yahoo.com Mon Jun 18 07:20:54 2007
From: faizn2000 at yahoo.com (faiz n)
Date: Mon, 18 Jun 2007 00:20:54 -0700 (PDT)
Subject: [Linux-cluster] Re: Linux-cluster Digest, Vol 38, Issue 25
In-Reply-To: <20070616160007.3871E73B03@hormel.redhat.com>
Message-ID: <594781.25813.qm@web36604.mail.mud.yahoo.com>

Hi All, I have installed Globus on my PC. I have done everything, but when I run the command root at host:# etc/init.d/globus-4.0.4 start it gives me the error libglobus_common_gcc32.so.0: No such file or directory. When I downloaded this file and ran the command again, the problem became: ltdl_common_..... No such file or directory. When I downloaded that one as well and ran the command again, this error came up: /usr/local/sbin/globus-start-container-detached undefined symbol lookup error: globus_callback_space_reference. Can anyone help me with why this error occurs and how I can fix it? Take care, thanks, Faiz.
-------------- next part -------------- An HTML attachment was scrubbed... URL:

From samin at isaaviation.ae Mon Jun 18 07:40:27 2007
From: samin at isaaviation.ae (Sushanth Amin)
Date: Mon, 18 Jun 2007 11:40:27 +0400
Subject: [Linux-cluster] Re: Linux-cluster Digest, Vol 38, Issue 25
In-Reply-To: <594781.25813.qm@web36604.mail.mud.yahoo.com>
Message-ID: <200706181147.l5IBlA7L006366@isaaviation.ae>

Hello Faiz, Follow the steps mentioned in the link given below http://vdt.cs.wisc.edu/releases/1.2.4/installing-rpms.html Thanks & Regards Sushanth Amin
-------------- next part -------------- An HTML attachment was scrubbed... URL:

From stefan.hirsch at nsn.com Mon Jun 18 08:01:56 2007
From: stefan.hirsch at nsn.com (Stefan Hirsch)
Date: Mon, 18 Jun 2007 10:01:56 +0200
Subject: [Linux-cluster] CommuniGate as service?
In-Reply-To: <002701c7af72$b532b400$1b03c680@uccs.edu>
References: <002701c7af72$b532b400$1b03c680@uccs.edu>
Message-ID: <46763BF4.3040400@nsn.com>

ext Eric Schneider wrote:
> #ADDED
> status)
> status $PROG
> RETVAL=$?
> ;;
> #/ADDED
> *)
> echo "Usage: $0 [ start | stop | status ]"
> exit 1
> esac
>
> exit 0
  ^^^^^^
The script always exits with zero (except in the wildcard branch), so $RETVAL is ignored. /Stefan

From urandomdev at gmail.com Mon Jun 18 12:23:13 2007
From: urandomdev at gmail.com (Simon Jolle)
Date: Mon, 18 Jun 2007 14:23:13 +0200
Subject: [Linux-cluster] stuck with lock usrm::rg="db"
Message-ID: <648d054e0706180523y23bab83dq690d6a7fe5e6c23@mail.gmail.com>

Hi list when doing a clustat, the rgmanager doesn't respond and you cant see the cluster resource group (after long timeout). A reboot solved the problem. The system even couldn't restart because of error messages of rgmanager on screen -> only reset helped.
Sorry don't have collected any more useful informations. Please request additional output. Attached the cluster.conf. How to diagnose/understand the message "Node ID:0000000000000002 stuck with lock usrm::rg="db"" on the secondary node: Jun 15 18:19:24 oracle09 clurgmgrd[4432]: Node ID:0000000000000002 stuck with lock usrm::rg="db" Jun 15 18:19:54 oracle09 clurgmgrd[4432]: Node ID:0000000000000002 stuck with lock usrm::rg="db" Jun 15 18:20:26 oracle09 clurgmgrd[4432]: Node ID:0000000000000002 stuck with lock usrm::rg="db" on the primary node: Jun 15 10:17:25 oracle08 kernel: rh_lkid 11300b8 Jun 15 10:17:25 oracle08 kernel: lockstate 0 Jun 15 10:17:25 oracle08 kernel: nodeid 2 Jun 15 10:17:25 oracle08 kernel: status 4294967279 Jun 15 10:17:25 oracle08 kernel: lkid f569ff84 Jun 15 10:17:25 oracle08 kernel: dlm: Magma: reply from 1 no lock Jun 15 10:17:25 oracle08 kernel: dlm: reply Jun 15 10:17:25 oracle08 kernel: rh_cmd 5 Jun 15 10:17:25 oracle08 kernel: rh_lkid eb01c5 Jun 15 10:17:25 oracle08 kernel: lockstate 0 Jun 15 10:17:25 oracle08 kernel: nodeid 2 Jun 15 10:17:25 oracle08 kernel: status 4294967279 Jun 15 10:17:25 oracle08 kernel: lkid f569ff84 Jun 15 10:17:25 oracle08 kernel: dlm: Magma: reply from 1 no lock Jun 15 10:17:25 oracle08 kernel: dlm: reply Jun 15 10:17:25 oracle08 kernel: rh_cmd 5 Jun 15 10:17:25 oracle08 kernel: rh_lkid 11e027c Jun 15 10:17:25 oracle08 kernel: lockstate 0 Jun 15 10:17:26 oracle08 kernel: nodeid 2 Jun 15 10:17:26 oracle08 kernel: status 4294967279 Jun 15 10:17:26 oracle08 kernel: lkid f569ff84 Jun 15 10:17:26 oracle08 kernel: dlm: Magma: reply from 1 no lock Jun 15 10:17:26 oracle08 kernel: dlm: reply Jun 15 10:17:26 oracle08 kernel: rh_cmd 5 Jun 15 10:17:26 oracle08 kernel: rh_lkid 122025f Jun 15 10:17:26 oracle08 kernel: lockstate 0 Jun 15 10:17:26 oracle08 kernel: nodeid 2 Jun 15 10:17:26 oracle08 kernel: status 4294967279 Jun 15 10:17:26 oracle08 kernel: lkid f569ff84 Jun 15 10:17:26 oracle08 kernel: dlm: Magma: reply from 1 no lock Jun 15 10:17:26 oracle08 kernel: dlm: reply Jun 15 10:17:26 oracle08 kernel: rh_cmd 5 Jun 15 10:17:26 oracle08 kernel: rh_lkid 12e0185 Jun 15 10:17:26 oracle08 kernel: lockstate 0 Jun 15 10:17:26 oracle08 kernel: nodeid 2 Jun 15 10:17:26 oracle08 kernel: status 4294967279 Jun 15 10:17:26 oracle08 kernel: lkid f569ff84 after initiating shutdown: Jun 15 18:20:54 oracle08 fenced: Stopping fence domain: Jun 15 18:20:54 oracle08 fenced: shutdown succeeded Jun 15 18:20:54 oracle08 fenced: ESC[60G Jun 15 18:20:54 oracle08 fenced: Jun 15 18:20:54 oracle08 rc: Stopping fenced: succeeded Jun 15 18:20:54 oracle08 lock_gulmd: Stopping lock_gulmd: Jun 15 18:20:54 oracle08 lock_gulmd: shutdown succeeded Jun 15 18:20:54 oracle08 lock_gulmd: ESC[60G Jun 15 18:20:54 oracle08 lock_gulmd: Jun 15 18:20:54 oracle08 rc: Stopping lock_gulmd: succeeded Jun 15 18:20:54 oracle08 cman: Stopping cman: Jun 15 18:20:58 oracle08 cman: failed to stop cman failed Jun 15 18:20:58 oracle08 cman: ESC[60G Jun 15 18:20:58 oracle08 cman: Jun 15 18:20:58 oracle08 rc: Stopping cman: failed Jun 15 18:20:58 oracle08 ccsd: Stopping ccsd: Jun 15 18:20:58 oracle08 ccsd[2276]: Stopping ccsd, SIGTERM received. Jun 15 18:20:59 oracle08 ccsd: shutdown succeeded Jun 15 18:20:59 oracle08 ccsd: ESC[60G[ Jun 15 18:20:59 oracle08 ccsd: Jun 15 18:20:59 oracle08 rc: Stopping ccsd: succeeded -- XMPP: sjolle at swissjabber.org -------------- next part -------------- A non-text attachment was scrubbed... 
Name: cluster.conf.xml Type: text/xml Size: 2254 bytes Desc: not available URL: From christian.brandes at forschungsgruppe.de Mon Jun 18 14:57:10 2007 From: christian.brandes at forschungsgruppe.de (Christian Brandes) Date: Mon, 18 Jun 2007 16:57:10 +0200 Subject: [Linux-cluster] cluster.conf documentation? Message-ID: <46769D46.8020506@forschungsgruppe.de> Is there a more comprehensive guide to /etc/cluster.conf than the man page, with a description of all available options? Best regards Christian -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 4348 bytes Desc: S/MIME Cryptographic Signature URL: From jwilson at transolutions.net Mon Jun 18 17:13:34 2007 From: jwilson at transolutions.net (James Wilson) Date: Mon, 18 Jun 2007 12:13:34 -0500 Subject: [Linux-cluster] gnbd_serv and gnbd export on bootup Message-ID: <4676BD3E.7000206@transolutions.net> Hey All, Just wondering if someone could help me out with getting gnbd_serv to start on boot up and also have gnbd_export to export the storage on bootup? Thanks for any help. From rpeterso at redhat.com Mon Jun 18 19:55:07 2007 From: rpeterso at redhat.com (Robert Peterson) Date: Mon, 18 Jun 2007 14:55:07 -0500 Subject: [Linux-cluster] cluster.conf documentation? In-Reply-To: <46769D46.8020506@forschungsgruppe.de> References: <46769D46.8020506@forschungsgruppe.de> Message-ID: <4676E31B.4040500@redhat.com> Christian Brandes wrote: > Is there a more comprehensive guide to /etc/cluster.conf than the man > page, with a description of all available options? > > Best regards > Christian Hi Christian, http://sources.redhat.com/cluster/faq.html#clusterconf Regards, Bob Peterson Red Hat Cluster Suite From andremachado at techforce.com.br Mon Jun 18 20:23:47 2007 From: andremachado at techforce.com.br (andremachado) Date: Mon, 18 Jun 2007 13:23:47 -0700 Subject: [Linux-cluster] GFS over GNBD freezes -> confirmed gnbd flaw? In-Reply-To: <46764DDA.1090600@artegence.com> References: <46764DDA.1090600@artegence.com> Message-ID: <9501e0cf83a2246371838c9926ce15cb@localhost> Hello, I just updated my blog page [0] with preliminary tests and conclusions about GFS, CLVM, GNBD, iSCSI. It seems that GNBD of redhat cluster suite 1.03.02 has problems.... How spot the problem code? Regards. Andre Felipe Machado [0] http://www.techforce.com.br/index.php/news/linux_blog/red_hat_cluster_suite_debian_etch On Mon, 18 Jun 2007 11:18:18 +0200, Maciej Bogucki wrote: >> I am trying to implement iscsi now (reading docs phase). But, maybe, the > problems arise again if the real cause is at GFS / clvm locking > coordination and not at gnbd itself. > It is good idea to implement iSCSI. Then You will know if is it GNDB or > GFS problem. > > Best Regards > Maciej Bogucki From Alain.Moulle at bull.net Tue Jun 19 14:49:18 2007 From: Alain.Moulle at bull.net (Alain Moulle) Date: Tue, 19 Jun 2007 16:49:18 +0200 Subject: [Linux-cluster] CS4 U4/U5 / is it possible to disable the status ? Message-ID: <4677ECEE.3040005@bull.net> Hi Is there a configuration possibility in the GUI or directly in the cluster.conf to disable the periodic monitoring of services ? Thanks Alain Moull? 
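There does not appear to be a single on/off switch for this in the GUI; the interval for the periodic status checks comes from the resource agent metadata under /usr/share/cluster/, and lengthening it there is an unsupported way to quiet the checks. A hedged illustration only -- the file names, action names and default values vary by release, so check the agent's own metadata first:

# Where the polling interval is declared (illustrative output):
grep 'action name="status"' /usr/share/cluster/script.sh
#   <action name="status" interval="30s" timeout="0"/>
# Raising that interval (or making it very large) reduces or effectively
# disables the periodic check; the change presumably only takes effect once
# rgmanager re-reads the agent, e.g. after a service restart.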
From Michael.Hagmann at hilti.com Tue Jun 19 15:25:57 2007 From: Michael.Hagmann at hilti.com (Hagmann, Michael) Date: Tue, 19 Jun 2007 17:25:57 +0200 Subject: [Linux-cluster] clurgmgrd - #48: Unable to obtain clusterlock: Connectiontimed out In-Reply-To: <20070511201903.GF15766@redhat.com> References: <1178560496.7699.38.camel@WSBID06223> <20070511201903.GF15766@redhat.com> Message-ID: <9C203D6FD2BF9D49BFF3450201DEDA5301B18B2D@LI-OWL.hag.hilti.com> Hi all we just hit this Problem again: Jun 18 08:03:08 lilr623a clurgmgrd[22152]: #48: Unable to obtain cluster lock: Connection timed out Jun 18 08:03:35 lilr623f clurgmgrd: [21651]: Executing /usr/local/swadmin/caa/SAP/P06WD002 status Jun 18 08:05:29 lilr623f clurgmgrd[21651]: #49: Failed getting status for RG P06WD002 is there any open Bugzilla about this Problem? what we also see that the Crash maybe is realated to the cron.daily entries. Maybe some crontab entry trigger this dlmbug? Here you can see the crontab, the cron.daily start at 08:02 the Cluster stuck ag 08:03 ! Also the last time it was also the same time. root at lilr623a:/tmp# cat /etc/crontab SHELL=/bin/bash PATH=/sbin:/bin:/usr/sbin:/usr/bin MAILTO=root HOME=/ # run-parts 01 * * * * root run-parts /etc/cron.hourly 02 8 * * * root run-parts /etc/cron.daily 22 4 * * 0 root run-parts /etc/cron.weekly 42 4 1 * * root run-parts /etc/cron.monthly root at lilr623a:/tmp# ls -l /etc/cron.daily total 28 lrwxrwxrwx 1 root root 28 Oct 5 2006 00-logwatch -> ../log.d/scripts/logwatch.pl -rwxr-xr-x 1 root root 418 Apr 14 2006 00-makewhatis.cron -rwxr-xr-x 1 root root 276 Sep 28 2004 0anacron -rwxr-xr-x 1 root root 180 Jul 13 2005 logrotate -rwxr-xr-x 1 root root 48 Apr 9 2006 mcelog.cron -rwxr-xr-x 1 root root 2133 Dec 1 2004 prelink -rwxr-xr-x 1 root root 121 Aug 8 2005 slocate.cron Thanks for your help Mike -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Lon Hohberger Sent: Freitag, 11. Mai 2007 22:19 To: linux clustering Subject: Re: [Linux-cluster] clurgmgrd - #48: Unable to obtain clusterlock: Connectiontimed out On Mon, May 07, 2007 at 01:54:56PM -0400, rhurst at bidmc.harvard.edu wrote: > What could cause clurgmgrd fail like this? If clurgmgrd has a hiccup > like this, is it supposed to shutdown its services? Is there > something in our implementation that could have prevented this from shutting down? > > For unexplained reasons, we just had our CS service (WATSON) go down > on its own, and the syslog entry details the event as: > > May 7 13:18:39 db1 clurgmgrd[17888]: #48: Unable to obtain > cluster lock: Connection timed out May 7 13:18:41 db1 kernel: dlm: > Magma: reply from 2 no lock May 7 13:18:41 db1 kernel: dlm: reply May > 7 13:18:41 db1 kernel: rh_cmd 5 May 7 13:18:41 db1 kernel: rh_lkid > 200242 May 7 13:18:41 db1 kernel: lockstate 2 May 7 13:18:41 db1 > kernel: nodeid 0 May 7 13:18:41 db1 kernel: status 0 May 7 13:18:41 > db1 kernel: lkid ee0388 May 7 13:18:41 db1 clurgmgrd[17888]: > Stopping service WATSON This usually is a dlm bug. Once the DLM gets in to this state, rgmanager blows up. What rgmanager are you using? (There's only one lock per service; the complexity of the service doesn't matter...) -- Lon Hohberger - Software Engineer - Red Hat, Inc. 
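If it happens again, a state snapshot taken on each node while the hang is in progress may help narrow down whether it is the dlm or rgmanager that is stuck; something along these lines (RHEL4-era /proc/cluster paths, adjust to your release):

clustat                                   # rgmanager's view of members and services
cman_tool nodes                           # membership as cman sees it
cman_tool services                        # service groups and their states
cat /proc/cluster/services                # the same information from the kernel side
cat /proc/cluster/dlm_debug               # recent DLM debug messages
ps -eo pid,stat,wchan:25,args | grep -E 'clurgmgrd|dlm' | grep -v grep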
-- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From alkol6 at gmail.com Tue Jun 19 19:23:11 2007 From: alkol6 at gmail.com (Senol Erdogan) Date: Tue, 19 Jun 2007 22:23:11 +0300 Subject: [Linux-cluster] cluster.conf documentation? In-Reply-To: <46769D46.8020506@forschungsgruppe.de> References: <46769D46.8020506@forschungsgruppe.de> Message-ID: <93bf230a0706191223x5774d7d1m37bb6d15b2387342@mail.gmail.com> and, # man ccs_tool 2007/6/18, Christian Brandes : > > Is there a more comprehensive guide to /etc/cluster.conf than the man > page, with a description of all available options? > > Best regards > Christian > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jparsons at redhat.com Tue Jun 19 20:27:40 2007 From: jparsons at redhat.com (James Parsons) Date: Tue, 19 Jun 2007 16:27:40 -0400 Subject: [Linux-cluster] cluster.conf documentation? In-Reply-To: <93bf230a0706191223x5774d7d1m37bb6d15b2387342@mail.gmail.com> References: <46769D46.8020506@forschungsgruppe.de> <93bf230a0706191223x5774d7d1m37bb6d15b2387342@mail.gmail.com> Message-ID: <46783C3C.9000903@redhat.com> I spent a couple of nights working on the schema description recently. It is split into rhel4 and rhel5 versions now, and includes some of the new openais/cman params. It occurs to me, however, that the 8 new resource types are not represented yet. I will add a description for those over the next few nights. Once again, the URL is: http://sources.redhat.com/cluster/doc/cluster_schema.html Senol Erdogan wrote: > and, > > # man ccs_tool > > 2007/6/18, Christian Brandes < christian.brandes at forschungsgruppe.de > >: > > Is there a more comprehensive guide to /etc/cluster.conf than the man > page, with a description of all available options? > > Best regards > Christian > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > >------------------------------------------------------------------------ > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster > From orkcu at yahoo.com Tue Jun 19 21:19:24 2007 From: orkcu at yahoo.com (=?iso-8859-1?Q?Roger_Pe=F1a?=) Date: Tue, 19 Jun 2007 14:19:24 -0700 (PDT) Subject: [Linux-cluster] RH Cluster Suit can be used to create a qmail cluster? Message-ID: <687378.49599.qm@web50611.mail.re2.yahoo.com> Hi I am looking for ideas about to create a Qmail HA cluster with 2 nodes and the storage in a SAN (FC access) right now I am in the design stage, mainly finding potencial problems so .... do anybody has anything to recommend ? (except not use qmail ;-) I would like to use postfix or exim but my client disagree :-( no choice here) my first problem looks like qmail is started, monitored and managed by daemontools (sv* programs) and svscan itseft is started through inittab or rc.local so my first approach is to create an sysV init script for svscanboot (whitch is used to start svc and svscan) and that script is the one that will be controlled by RHCS as a script resource (alonside with the GFS or plain FS resource, and maybe the IP resource) so, my idea is to "clusterizate" (that word exist ? ;-) ) the daemontool and not the qmail process, do you agree? 
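Since rgmanager's script resource expects init-script semantics (in particular a status verb that exits non-zero when the service is down), one possible shape for such a wrapper is sketched below; the /command/svscanboot and /service paths and the process names are assumptions about the local daemontools install, not a tested script:

#!/bin/sh
# Sketch of a SysV-style wrapper around daemontools' svscanboot so that
# rgmanager can drive it as a <script> resource.
SVSCANBOOT=/command/svscanboot
SVCDIR=/service
case "$1" in
  start)
    # svscanboot normally runs from inittab; here the cluster starts it instead.
    $SVSCANBOOT &
    ;;
  stop)
    # Ask each supervise to take its service down and exit, then stop svscan.
    # Lingering daemons may still need an explicit kill afterwards.
    svc -dx "$SVCDIR"/* "$SVCDIR"/*/log 2>/dev/null
    pkill -x svscan 2>/dev/null
    pkill -f svscanboot 2>/dev/null
    ;;
  status)
    # rgmanager treats a non-zero exit here as a failed status check.
    pgrep -x svscan >/dev/null || exit 1
    svstat "$SVCDIR"/* 2>/dev/null | grep -q ': down' && exit 1
    exit 0
    ;;
  *)
    echo "Usage: $0 {start|stop|status}"
    exit 1
    ;;
esac
exit 0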
thanks in advance for any tip :-) roger __________________________________________ RedHat Certified ( RHCE ) Cisco Certified ( CCNA & CCDA ) ____________________________________________________________________________________ Luggage? GPS? Comic books? Check out fitting gifts for grads at Yahoo! Search http://search.yahoo.com/search?fr=oni_on_mail&p=graduation+gifts&cs=bz From chris at cmiware.com Tue Jun 19 22:23:45 2007 From: chris at cmiware.com (Chris Harms) Date: Tue, 19 Jun 2007 17:23:45 -0500 Subject: [Linux-cluster] avoiding 2 Node Fencing shootout Message-ID: <46785771.1000403@cmiware.com> Hi all, Setup: Cluster Suite 5 2 Nodes each fenced by DRAC card over network interface. Flagged as two node in cluster.conf As a test, I unplugged one node (Node A) from the network. The remaining node (Node B) attempted to fence it, but failed (no network access) and never assumed the services. Plugging in Node A a gain induced each to fence the other. Something similar happens when a node is rebooted manually (shutdown -r now). How does one best combat this? I've seen reference to adding a ping node but no actual documentation on how to do it. Thanks, Chris From rainer at ultra-secure.de Tue Jun 19 22:33:59 2007 From: rainer at ultra-secure.de (Rainer Duffner) Date: Wed, 20 Jun 2007 00:33:59 +0200 Subject: [Linux-cluster] RH Cluster Suit can be used to create a qmail cluster? In-Reply-To: <687378.49599.qm@web50611.mail.re2.yahoo.com> References: <687378.49599.qm@web50611.mail.re2.yahoo.com> Message-ID: <48D17A8F-34A5-420D-B856-BC936FFCC797@ultra-secure.de> Am 19.06.2007 um 23:19 schrieb Roger Pe?a: > Hi > > I am looking for ideas about to create a Qmail HA > cluster with 2 nodes and the storage in a SAN (FC > access) > Only two nodes? What backend do you want to use? (In case you want to use vpopmail) > right now I am in the design stage, mainly finding > potencial problems so .... > do anybody has anything to recommend ? Qmail is IMO not suited for a GFS cluster. GFS tries its best to keep write operations on the cluster-FS synchronized. This is useless in the case of Qmail, because Qmail is designed to function even on NFS-filesystems without any kind of useful locking. In GFS-land, Qmail just generates lots of useless I/O. > (except not use qmail ;-) I would like to use postfix > or exim but my client disagree :-( no choice here) > It's understandable. Qmail still offers a lot of value when it comes to virtual email-domain hosting - though the original DJB-Qmail is barely usable today. But people like Matt Simerson and Bill Shupp have done tremendous integration-work, and helped to keep the platform on par (or in some cases beyond) with other systems, even commercial ones. > my first problem looks like qmail is started, > monitored and managed by daemontools (sv* programs) > and svscan itseft is started through inittab or > rc.local > so my first approach is to create an sysV init script > for svscanboot (whitch is used to start svc and > svscan) and that script is the one that will be > controlled by RHCS as a script resource (alonside with > the GFS or plain FS resource, and maybe the IP > resource) > Sometimes, it's not enough to stop the svscan-startscript. Daemons linger around, prevent new ones from starting. After killing the start-scripts, it might be necessary to kill (or kill -9) any remaining processes. > so, my idea is to "clusterizate" (that word exist ? > ;-) ) the daemontool and not the qmail process, do you > agree? 
> > thanks in advance for any tip :-) > You could try to run a sharedroot-cluster on RHEL4 and see how it performs for your workload - there are some succesful reports here on this list (though the one I remember uses a tremendous amount of disk- spindles). This should solve your problems with the script (just fence the whole node - finished). If you don't want to go that route, I'd say forget about GFS and go back to NFS (with a serious NFS server-platform like Solaris and clients like Solaris or FreeBSD) - see the picture on Bill Shupp's homepage for a design. Matt Simerson's formerly FreeBSD-only (now also Solaris, Linux, Darwin) Mail-Toaster framework already contains most of the integration-work necessary (distribute configfiles etc. - take a look at the source, it's amazing). Above a certain amount of users (500k, probably varies), shared- storage may be the wrong answer anyway. Then, a distributed setup might be better suited. How many users will you have to support? cheers, Rainer -- Rainer Duffner CISSP, LPI, MCSE rainer at ultra-secure.de From rpeterso at redhat.com Tue Jun 19 22:52:45 2007 From: rpeterso at redhat.com (Robert Peterson) Date: Tue, 19 Jun 2007 17:52:45 -0500 Subject: [Linux-cluster] avoiding 2 Node Fencing shootout In-Reply-To: <46785771.1000403@cmiware.com> References: <46785771.1000403@cmiware.com> Message-ID: <46785E3D.2070200@redhat.com> Chris Harms wrote: > Hi all, > > Setup: > > Cluster Suite 5 > 2 Nodes each fenced by DRAC card over network interface. Flagged as two > node in cluster.conf > > > > As a test, I unplugged one node (Node A) from the network. The > remaining node (Node B) attempted to fence it, but failed (no network > access) and never assumed the services. Plugging in Node A a gain > induced each to fence the other. Something similar happens when a node > is rebooted manually (shutdown -r now). > > How does one best combat this? I've seen reference to adding a ping > node but no actual documentation on how to do it. > > Thanks, > Chris > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > Hi Chris, Here's a good place to start: http://sources.redhat.com/cluster/faq.html#quorum There's a couple FAQ entries after that one too pertaining to tie-breaking. Regards, Bob Peterson Red Hat Cluster Suite From orkcu at yahoo.com Wed Jun 20 00:42:11 2007 From: orkcu at yahoo.com (=?iso-8859-1?Q?Roger_Pe=F1a?=) Date: Tue, 19 Jun 2007 17:42:11 -0700 (PDT) Subject: [Linux-cluster] RH Cluster Suit can be used to create a qmail cluster? In-Reply-To: <48D17A8F-34A5-420D-B856-BC936FFCC797@ultra-secure.de> Message-ID: <451387.21922.qm@web50606.mail.re2.yahoo.com> --- Rainer Duffner wrote: > > Am 19.06.2007 um 23:19 schrieb Roger Pe?a: > > > Hi > > > > I am looking for ideas about to create a Qmail HA > > cluster with 2 nodes and the storage in a SAN (FC > > access) > > > > > Only two nodes? > What backend do you want to use? > (In case you want to use vpopmail) backend for what? for user data? we plan to use ldap, another two more server to the cluster, but I was talking about just mail related (smtp, pop-imap) nodes > > > > right now I am in the design stage, mainly finding > > potencial problems so .... > > do anybody has anything to recommend ? > > > Qmail is IMO not suited for a GFS cluster. > GFS tries its best to keep write operations on the > cluster-FS > synchronized. 
> This is useless in the case of Qmail, because Qmail > is designed to > function even on NFS-filesystems without any kind of > useful locking. > In GFS-land, Qmail just generates lots of useless > I/O. you are thinking about maildir advantages? yeap, I know that with maildir you will not have locking problems (practical meaning although there is a teorical chance :-) ) but I was thinking in FS cache, default config for ext3 are not suitable, maybe I should go into the tunning ext3 area but .... I thought GFS will take care of FS sincronization more easy than tunning ext3 to not make cache or just do it for little time > > > > (except not use qmail ;-) I would like to use > postfix > > or exim but my client disagree :-( no choice here) > > > > > It's understandable. Qmail still offers a lot of > value when it comes > to virtual email-domain hosting - though the > original DJB-Qmail is > barely usable today. > But people like Matt Simerson and Bill Shupp have > done tremendous > integration-work, and helped to keep the platform on > par (or in some > cases beyond) with other systems, even commercial > ones. > > > > my first problem looks like qmail is started, > > monitored and managed by daemontools (sv* > programs) > > and svscan itseft is started through inittab or > > rc.local > > so my first approach is to create an sysV init > script > > for svscanboot (whitch is used to start svc and > > svscan) and that script is the one that will be > > controlled by RHCS as a script resource (alonside > with > > the GFS or plain FS resource, and maybe the IP > > resource) > > > > > Sometimes, it's not enough to stop the > svscan-startscript. > Daemons linger around, prevent new ones from > starting. After killing > the start-scripts, it might be necessary to kill (or > kill -9) any > remaining processes. good to know it :-) I will be looking for this problem :-) > > > > so, my idea is to "clusterizate" (that word exist > ? > > ;-) ) the daemontool and not the qmail process, do > you > > agree? > > > > thanks in advance for any tip :-) > > > > > You could try to run a sharedroot-cluster on RHEL4 > and see how it > performs for your workload - there are some > succesful reports here on > this list (though the one I remember uses a > tremendous amount of disk- > spindles). > This should solve your problems with the script > (just fence the whole > node - finished). > > If you don't want to go that route, I'd say forget > about GFS and go > back to NFS (with a serious NFS server-platform like > Solaris and > clients like Solaris or FreeBSD) - see the picture another requisite for the solution: the OS has to be RHEL, RHEL5 as the preferred > on Bill Shupp's > homepage for a design. > Matt Simerson's formerly FreeBSD-only (now also > Solaris, Linux, > Darwin) Mail-Toaster framework already contains most > of the > integration-work necessary (distribute configfiles > etc. - take a look > at the source, it's amazing). I will do > > Above a certain amount of users (500k, probably > varies), shared- > storage may be the wrong answer anyway. > Then, a distributed setup might be better suited. > How many users will you have to support? 
I guess few hundreds of thousands but I hope not 500k, maybe 200k or 300k I know this is an important data to be uncertain but as I said I am in the process of finding potentials problems yet :-) in the next few days-weeks I will have more deep understand of the environment > Rainer thanks a lot Rainier cu roger __________________________________________ RedHat Certified ( RHCE ) Cisco Certified ( CCNA & CCDA ) ____________________________________________________________________________________ Be a better Globetrotter. Get better travel answers from someone who knows. Yahoo! Answers - Check it out. http://answers.yahoo.com/dir/?link=list&sid=396545469 From anujhere at gmail.com Tue Jun 19 14:25:16 2007 From: anujhere at gmail.com (anugunj anuj singh) Date: Tue, 19 Jun 2007 19:55:16 +0530 Subject: [Linux-cluster] drbd+GFS HA cluster Message-ID: <1182263116.7430.16.camel@anugunj.sytes.net> Hi, I installed drbd-8.0.3 on RHEL4, my drbd.conf global { usage-count yes; } common { syncer { rate 10M; } } resource r0 { protocol C; net { cram-hmac-alg sha1; shared-secret "anugunj"; } on node0005.anugunj.com { device /dev/drbd1; disk /dev/hda6; address 10.1.1.3:7789; meta-disk internal; } on node0021.anugunj.com { device /dev/drbd1; disk /dev/sdb1; address 10.1.1.11:7789; meta-disk internal; } } created drbd-meta data drbdadm create-md r0 , and then gfs file system on it. I am trying to do mirroring of cluster, using gfs file system , currently I have to make one drbd hard disk primary, other secondary, only then I am able to mount and use, on other system it is not mounting, i have to make it primary with. drbdadm primary all. has anyone tried drbd on gfs, filesystem. Version i am using does support gfs (according to changelog) http://svn.drbd.org/drbd/trunk/ChangeLog . Target is for HA of cluster. while mounting drbd hard disks on both systems simultaneous. thanks and regards anugunj "anuj singh" -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From mathieu.avila at seanodes.com Wed Jun 20 08:09:54 2007 From: mathieu.avila at seanodes.com (Mathieu Avila) Date: Wed, 20 Jun 2007 10:09:54 +0200 Subject: [Linux-cluster] Error when starting ccsd and proposed patch In-Reply-To: <20070615105208.0fc6fd76@mathieu.toulouse> References: <20070615105208.0fc6fd76@mathieu.toulouse> Message-ID: <20070620100954.1e8b2697@mathieu.toulouse> Sorry to bother you with this ; am i the only one that spotted this issue ? I did review the code from cluster-2 and cluster-1.04 and the patch is also relevant there. A easy way of running into this problem is to generate CPU load on a node, and then do loops of ccsd and gulm start/stop. Sometimes, gulm will get out with an error complaining that it was unable to contact ccsd. Le Fri, 15 Jun 2007 10:52:08 +0200, Mathieu Avila a ?crit : > Hello all, > > I'm sometimes having trouble when starting ccsd and then gulm under > heavy CPU load. Ccsd's init script tells it is running but it's not > fully initialized. > The problem comes from the fact that ccsd's main process returns > before the daemonized process of ccsd has finished initializing its > sockets. The "cluster_communicator" thread sends a SIGTERM message to > the parent process before the main thread has finished its > initialization work. 
> > With the patch proposed in attachement, the cluster_communicator is > started after the main thread has finished initializing. It works > well under any load. Any daemon that needs to connect ccsd will > then succceed. > It was tested with cluster-1.03, but it should work with older > versions, the ccsd files didn't seem to have changed much. > > -- > Mathieu Avila From dan.deshayes at algitech.com Wed Jun 20 08:18:22 2007 From: dan.deshayes at algitech.com (Dan Deshayes) Date: Wed, 20 Jun 2007 10:18:22 +0200 Subject: [Linux-cluster] problem starting clvmd on second node. Message-ID: <4678E2CE.4050302@algitech.com> Hello, I'm having problem starting the clvmd on the second node. I'm running Centos 5 resently updated. Its going to be a 3node HA cluster. What I've done is the following. creating the filesystems on the fiberdiscs thats devided with lvm. mkfs.gfs -p lock_dlm -t acl002:project_logs -j 3 /dev/projectVG/logs mkfs.gfs -p lock_dlm -t acl002:project_web -j 3 /dev/projectVG/web mkfs.gfs -p lock_dlm -t acl002:project_db -j 3 /dev/projectVG/db Then I start the cluster namned acl002 on all nodes and clvmd on the first node, it starts and i can mount/unmount and write to the volumes, but i get a clvmd -T20 process When I go to the second node and starts clvmd it hangs in the vgscan. I'm using locking_type = 3 in the lvm.conf file, before I used 2 with the liblvm2clusterlock.so libary but it doesn't seems to be availible anymore and this does not seem to be related to my problem(?). It works if I start the clvmd on the second node first but then the first node gives the same error. Maybe someone can give me a hint in the right direction. Thanks in advance. Dan From rainer at ultra-secure.de Wed Jun 20 08:19:00 2007 From: rainer at ultra-secure.de (Rainer Duffner) Date: Wed, 20 Jun 2007 10:19:00 +0200 Subject: [Linux-cluster] RH Cluster Suit can be used to create a qmail cluster? In-Reply-To: <451387.21922.qm@web50606.mail.re2.yahoo.com> References: <451387.21922.qm@web50606.mail.re2.yahoo.com> Message-ID: <4678E2F4.7000201@ultra-secure.de> Roger Pe?a wrote: > > backend for what? for user data? Yes. It's a rather important question, IMO. > you are thinking about maildir advantages? yeap, I > know that with maildir you will not have locking > problems (practical meaning although there is a > teorical chance :-) ) > but I was thinking in FS cache, default config for > ext3 are not suitable, maybe I should go into the > tunning ext3 area but .... I thought GFS will take > care of FS sincronization more easy than tunning ext3 > to not make cache or just do it for little time > > I'm not sure what you mean here. > good to know it :-) > I will be looking for this problem :-) > > Rather, problems tend to look for you ;-) > another requisite for the solution: > the OS has to be RHEL, RHEL5 as the preferred > > It's just that RHEL doesn't offer most of the needed software out-of-the-box (apart from the ldap-client). And even if it does, you need to recompile it yourself, because it needs other compilation-options. > I guess few hundreds of thousands but I hope not 500k, > maybe 200k or 300k > I know this is an important data to be uncertain but > as I said I am in the process of finding potentials > problems yet :-) in the next few days-weeks I will > have more deep understand of the environment > > 300k would be still OK for a shared storage. What kind of SAN do you have? 
cheers, Rainer From pcaulfie at redhat.com Wed Jun 20 10:48:35 2007 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Wed, 20 Jun 2007 11:48:35 +0100 Subject: [Linux-cluster] Error when starting ccsd and proposed patch In-Reply-To: <20070620100954.1e8b2697@mathieu.toulouse> References: <20070615105208.0fc6fd76@mathieu.toulouse> <20070620100954.1e8b2697@mathieu.toulouse> Message-ID: <46790603.1080204@redhat.com> Mathieu Avila wrote: > Sorry to bother you with this ; am i the only one that spotted this > issue ? I think you might be ;-) > I did review the code from cluster-2 and cluster-1.04 and the patch is > also relevant there. > A easy way of running into this problem is to generate CPU load on a > node, and then do loops of ccsd and gulm start/stop. Sometimes, gulm > will get out with an error complaining that it was unable to contact > ccsd. Yes, I can believe the problem is still in ccsd2 - it's really the same thing, ccsd hasnt been touched for a while Your patch is appreciated, really, it's just that things are a bit hectic at the moment, we'll get around to testing/integrating it soon I hope. Thanks, -- Patrick Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SL4 ITE, UK. Registered in England and Wales under Company Registration No. 3798903 From Santosh.Panigrahi at in.unisys.com Wed Jun 20 11:12:35 2007 From: Santosh.Panigrahi at in.unisys.com (Panigrahi, Santosh Kumar) Date: Wed, 20 Jun 2007 16:42:35 +0530 Subject: [Linux-cluster] failover domain conf. with conga Message-ID: Hi, I am not able to configure fail over domain in RHEL5 through conga utility. On trying to do so, I am getting an error page from luci service. But I am able to configure it through system-config-cluster utility. I got following information from RHEl5 release notes. [At present, conga and luci do not allow users to create and configure failover domains. To create failover domains, use system-config-cluster. You need to manually edit /etc/cluster/cluster.conf to configure failover domains created this way.] https://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/release-not es/RELEASE-NOTES-x86-en.html I want to know when will be the next release of conga utility with the failover domain feature ? Thanks and Regards, Santosh -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Gil at americanhm.com Wed Jun 20 12:18:02 2007 From: Robert.Gil at americanhm.com (Robert Gil) Date: Wed, 20 Jun 2007 08:18:02 -0400 Subject: [Linux-cluster] problem starting clvmd on second node. In-Reply-To: <4678E2CE.4050302@algitech.com> Message-ID: What exactly is the error? If its permission denied it may most likely have to do with fenced not running. If lvm skips the clustered filesystems, then look at the lvm.conf to make sure its right. Robert Gil Linux Systems Administrator American Home Mortgage -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Dan Deshayes Sent: Wednesday, June 20, 2007 4:18 AM To: linux-cluster at redhat.com Subject: [Linux-cluster] problem starting clvmd on second node. Hello, I'm having problem starting the clvmd on the second node. I'm running Centos 5 resently updated. Its going to be a 3node HA cluster. What I've done is the following. creating the filesystems on the fiberdiscs thats devided with lvm. 
mkfs.gfs -p lock_dlm -t acl002:project_logs -j 3 /dev/projectVG/logs mkfs.gfs -p lock_dlm -t acl002:project_web -j 3 /dev/projectVG/web mkfs.gfs -p lock_dlm -t acl002:project_db -j 3 /dev/projectVG/db Then I start the cluster namned acl002 on all nodes and clvmd on the first node, it starts and i can mount/unmount and write to the volumes, but i get a clvmd -T20 process When I go to the second node and starts clvmd it hangs in the vgscan. I'm using locking_type = 3 in the lvm.conf file, before I used 2 with the liblvm2clusterlock.so libary but it doesn't seems to be availible anymore and this does not seem to be related to my problem(?). It works if I start the clvmd on the second node first but then the first node gives the same error. Maybe someone can give me a hint in the right direction. Thanks in advance. Dan -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From dan.deshayes at algitech.com Wed Jun 20 13:22:08 2007 From: dan.deshayes at algitech.com (Dan Deshayes) Date: Wed, 20 Jun 2007 15:22:08 +0200 Subject: [Linux-cluster] problem starting clvmd on second node. In-Reply-To: References: Message-ID: <46792A00.8080707@algitech.com> when i start the clvmd on the second node in debugmode it seems to start fine. [root at asl012 conf.d]# clvmd -d CLVMD[aaabc2c0]: Jun 20 15:06:13 CLVMD started CLVMD[aaabc2c0]: Jun 20 15:06:22 Cluster ready, doing some more initialisation CLVMD[aaabc2c0]: Jun 20 15:06:22 starting LVM thread CLVMD[aaabc2c0]: Jun 20 15:06:22 clvmd ready for work CLVMD[aaabc2c0]: Jun 20 15:06:22 Using timeout of 60 seconds CLVMD[41401940]: Jun 20 15:06:22 LVM thread function started File descriptor 5 left open CLVMD[41401940]: Jun 20 15:06:23 LVM thread waiting for work but when i try to mount it the proccess just freezes. /var/logs/messages: Jun 20 15:06:22 asl012 clvmd: Cluster LVM daemon started - connected to CMAN Jun 20 15:07:19 asl012 kernel: Trying to join cluster "lock_dlm", "acl002:project_db" Jun 20 15:07:19 asl012 kernel: dlm: connecting to 2 Jun 20 15:07:19 asl012 kernel: Joined cluster. Now mounting FS... and then nothing happens. root 19031 0.0 0.0 3628 332 pts/0 D 15:07 0:00 /sbin/mount.gfs /dev/projectVG/db /project/db/ -o rw after trying to do this the filesystem locks up on the first node also. the fenced is running and starts fine when starting the cman-service. I'm not tryin to make the filesystem failover but to be mounted by all nodes always. /Dan Robert Gil wrote: >What exactly is the error? If its permission denied it may most likely >have to do with fenced not running. If lvm skips the clustered >filesystems, then look at the lvm.conf to make sure its right. > > >Robert Gil >Linux Systems Administrator >American Home Mortgage > > >-----Original Message----- >From: linux-cluster-bounces at redhat.com >[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Dan Deshayes >Sent: Wednesday, June 20, 2007 4:18 AM >To: linux-cluster at redhat.com >Subject: [Linux-cluster] problem starting clvmd on second node. > >Hello, >I'm having problem starting the clvmd on the second node. >I'm running Centos 5 resently updated. Its going to be a 3node HA >cluster. >What I've done is the following. > >creating the filesystems on the fiberdiscs thats devided with lvm. 
>mkfs.gfs -p lock_dlm -t acl002:project_logs -j 3 /dev/projectVG/logs >mkfs.gfs -p lock_dlm -t acl002:project_web -j 3 /dev/projectVG/web >mkfs.gfs -p lock_dlm -t acl002:project_db -j 3 /dev/projectVG/db > >Then I start the cluster namned acl002 on all nodes and clvmd on the >first node, it starts and i can mount/unmount and write to the volumes, >but i get a clvmd -T20 process When I go to the second node and starts >clvmd it hangs in the vgscan. > >I'm using locking_type = 3 in the lvm.conf file, before I used 2 with >the liblvm2clusterlock.so libary but it doesn't seems to be availible >anymore and this does not seem to be related to my problem(?). > >It works if I start the clvmd on the second node first but then the >first node gives the same error. > >Maybe someone can give me a hint in the right direction. >Thanks in advance. >Dan > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster > > From rpeterso at redhat.com Wed Jun 20 13:59:20 2007 From: rpeterso at redhat.com (Robert Peterson) Date: Wed, 20 Jun 2007 08:59:20 -0500 Subject: [Linux-cluster] problem starting clvmd on second node. In-Reply-To: <46792A00.8080707@algitech.com> References: <46792A00.8080707@algitech.com> Message-ID: <467932B8.4060506@redhat.com> Dan Deshayes wrote: > but when i try to mount it the proccess just freezes. /var/logs/messages: > Jun 20 15:06:22 asl012 clvmd: Cluster LVM daemon started - connected to > CMAN > Jun 20 15:07:19 asl012 kernel: Trying to join cluster "lock_dlm", > "acl002:project_db" > Jun 20 15:07:19 asl012 kernel: dlm: connecting to 2 > Jun 20 15:07:19 asl012 kernel: Joined cluster. Now mounting FS... > > and then nothing happens. > root 19031 0.0 0.0 3628 332 pts/0 D 15:07 0:00 > /sbin/mount.gfs /dev/projectVG/db /project/db/ -o rw > after trying to do this the filesystem locks up on the first node also. > > /Dan Hi Dan, So apparently it's the mount that's hanging, not clvmd. By any chance, are you using manual fencing? Because I think this behavior can be caused when a node is manually fenced, but the fence_ack_manual script was never run. If that's the problem, try: fence_ack_manual -n Regards, Bob Peterson Red Hat Cluster Suite From kristoffer.lippert at jppol.dk Wed Jun 20 14:29:02 2007 From: kristoffer.lippert at jppol.dk (Kristoffer Lippert) Date: Wed, 20 Jun 2007 16:29:02 +0200 Subject: [Linux-cluster] Performance on GFS, OCFS2 or GFS2? In-Reply-To: References: <4678E2CE.4050302@algitech.com> Message-ID: <00B9BFA1C44A674794C9A1A4F5A22CA51A7941@exchsrv07.rootdom.dk> Hi, I'm considering different options for a SAN. Are there anyone who knows of a comparison of the GFS, GFS2 and OCFS2 filesystems? Mostly i'm concerned about performance and stability (maturity). It'll be running on a RHEL 5. Kind regards Kristoffer From dan.deshayes at algitech.com Wed Jun 20 14:34:01 2007 From: dan.deshayes at algitech.com (Dan Deshayes) Date: Wed, 20 Jun 2007 16:34:01 +0200 Subject: [Linux-cluster] problem starting clvmd on second node. In-Reply-To: <467932B8.4060506@redhat.com> References: <46792A00.8080707@algitech.com> <467932B8.4060506@redhat.com> Message-ID: <46793AD9.4020308@algitech.com> Robert Peterson wrote: > Dan Deshayes wrote: > >> but when i try to mount it the proccess just freezes. 
>> /var/logs/messages: >> Jun 20 15:06:22 asl012 clvmd: Cluster LVM daemon started - connected >> to CMAN >> Jun 20 15:07:19 asl012 kernel: Trying to join cluster "lock_dlm", >> "acl002:project_db" >> Jun 20 15:07:19 asl012 kernel: dlm: connecting to 2 >> Jun 20 15:07:19 asl012 kernel: Joined cluster. Now mounting FS... >> >> and then nothing happens. >> root 19031 0.0 0.0 3628 332 pts/0 D 15:07 0:00 >> /sbin/mount.gfs /dev/projectVG/db /project/db/ -o rw >> after trying to do this the filesystem locks up on the first node also. >> >> /Dan > > > Hi Dan, > > So apparently it's the mount that's hanging, not clvmd. > By any chance, are you using manual fencing? Because I think this > behavior > can be caused when a node is manually fenced, but the > fence_ack_manual script was never run. If that's the problem, try: > > fence_ack_manual -n > > Regards, > > Bob Peterson > Red Hat Cluster Suite > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Hey Robert, thats right, its not really the clvmd but when starting as a service it performs a vgscan wich also freezes. Correct, I'm using manual fenceing for the moment, but I don't want to fence the other node since it hasn't faild, but to have the filesystem mounted by all the nodes at the same time. Just fence when it fails. Maybe I've missunderstood the possibility of this? Though I've used it on previous versions of centos. Regards, Dan From jparsons at redhat.com Wed Jun 20 14:34:37 2007 From: jparsons at redhat.com (jim parsons) Date: Wed, 20 Jun 2007 10:34:37 -0400 Subject: [Linux-cluster] failover domain conf. with conga In-Reply-To: References: Message-ID: <1182350078.3302.9.camel@localhost.localdomain> On Wed, 2007-06-20 at 16:42 +0530, Panigrahi, Santosh Kumar wrote: > Hi, > > > > I am not able to configure fail over domain in RHEL5 through conga > utility. On trying to do so, I am getting an error page from luci > service. But I am able to configure it through system-config-cluster > utility. > > I got following information from RHEl5 release notes. > > [At present, conga and luci do not allow users to create and configure > failover domains. > > To create failover domains, use system-config-cluster. You need to > manually edit /etc/cluster/cluster.conf to configure failover domains > created this way.] > > https://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/release-notes/RELEASE-NOTES-x86-en.html > > > > I want to know when will be the next release of conga utility with the > failover domain feature ? There was a z-stream (asynchronous update) release of Conga that addresses some VM issues and also offers creation and config of Fdoms. You want the 0.9.2-6.el5 builds for ricci, luci, and modcluster. If you are using these versions, then you have encountered something I would be interested in knowing more about :) I would also be interested in any usability comments regarding the conga UI that you care to offer. BTW, here is a page that may be helpful to you: http://sourceware.org/cluster/conga/ I will file a bug to update the documentation. Thanks for trying out conga. 
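One more thing for anyone hand-editing cluster.conf in the meantime: a failover domain definition lives inside the <rm> section of /etc/cluster/cluster.conf and looks roughly like the sketch below (the domain and node names here are only placeholders):

  <rm>
    <failoverdomains>
      <failoverdomain name="example-fd" ordered="1" restricted="1">
        <failoverdomainnode name="node1.example.com" priority="1"/>
        <failoverdomainnode name="node2.example.com" priority="2"/>
      </failoverdomain>
    </failoverdomains>
  </rm>

A service then points at it with domain="example-fd" in its <service> tag. After editing, bump the config_version attribute at the top of the file and run ccs_tool update /etc/cluster/cluster.conf so the change propagates to all nodes.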
-J From Michael.Hagmann at hilti.com Wed Jun 20 14:55:47 2007 From: Michael.Hagmann at hilti.com (Hagmann, Michael) Date: Wed, 20 Jun 2007 16:55:47 +0200 Subject: [Linux-cluster] Red Hat Cluster for Symantec Veritas NetBackup 6 In-Reply-To: <1182350078.3302.9.camel@localhost.localdomain> References: <1182350078.3302.9.camel@localhost.localdomain> Message-ID: <9C203D6FD2BF9D49BFF3450201DEDA5301B18E29@LI-OWL.hag.hilti.com> Hi all we have a lot of Red Hat 4 Clusters for Oracle / SAP operational. Now we also have to move our Tru64 / TruCluster NetBackup Infrastructure to Linux. Now we are thinking to move the NetBackup to a Red Hat Cluster. Does someone have this running? When yes are there any important points to know? thanks for your comments Mike Michael Hagmann UNIX Systems Engineering Enterprise Systems Technology Hilti Corporation 9494 Schaan Liechtenstein Department FIBS Feldkircherstrasse 100 P.O.Box 333 P +423-234 2467 F +423-234 6467 E michael.hagmann at hilti.com www.hilti.com From rainer at ultra-secure.de Wed Jun 20 15:04:30 2007 From: rainer at ultra-secure.de (Rainer Duffner) Date: Wed, 20 Jun 2007 17:04:30 +0200 Subject: [Linux-cluster] Performance on GFS, OCFS2 or GFS2? In-Reply-To: <00B9BFA1C44A674794C9A1A4F5A22CA51A7941@exchsrv07.rootdom.dk> References: <4678E2CE.4050302@algitech.com> <00B9BFA1C44A674794C9A1A4F5A22CA51A7941@exchsrv07.rootdom.dk> Message-ID: <467941FE.8000209@ultra-secure.de> Kristoffer Lippert wrote: > Hi, > > I'm considering different options for a SAN. Are there anyone who knows > of a comparison of the GFS, GFS2 and OCFS2 filesystems? > Mostly i'm concerned about performance and stability (maturity). > > It'll be running on a RHEL 5. > > Kind regards > Kristoffer > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > There was a comparison in some back-issue of iX magazine (German monthly IT-magazine - http://www.heise.de/ix). It was some time ago (a year, or more), so they didn't include GFS2. But as GFS2 isn't ready for prime-time anyway, it should not be a big problem. I don't remember in which issue it was - search their archive. cheers, Rainer From rpeterso at redhat.com Wed Jun 20 16:20:25 2007 From: rpeterso at redhat.com (Robert Peterson) Date: Wed, 20 Jun 2007 11:20:25 -0500 Subject: [Linux-cluster] problem starting clvmd on second node. In-Reply-To: <46793AD9.4020308@algitech.com> References: <46792A00.8080707@algitech.com> <467932B8.4060506@redhat.com> <46793AD9.4020308@algitech.com> Message-ID: <467953C9.5080504@redhat.com> Dan Deshayes wrote: > Hey Robert, > thats right, its not really the clvmd but when starting as a service it > performs a vgscan wich also freezes. > Correct, I'm using manual fenceing for the moment, but I don't want to > fence the other node since it hasn't faild, > but to have the filesystem mounted by all the nodes at the same time. > Just fence when it fails. > Maybe I've missunderstood the possibility of this? Though I've used it > on previous versions of centos. > > Regards, Dan Hi Dan, Is this the old cluster infrastructure (e.g. rhel4/centos 4/stable or equiv) or the new cluster infrastructure (e.g. rhel5/HEAD or equiv)? Since you're using manual fencing, perhaps you should start from the beginning in case a node thinks it needs a fence ack: 1. power off all nodes 2. power on all nodes 3. start clustering on all nodes 4. do group_tool -v on all nodes to make sure there are no error conditions 5. start clvmd on all nodes 6. 
do your mounts Let me know what happens. Regards, Bob Peterson From chris at cmiware.com Wed Jun 20 16:55:52 2007 From: chris at cmiware.com (Chris Harms) Date: Wed, 20 Jun 2007 11:55:52 -0500 Subject: [Linux-cluster] Diskless Quorum Disk Message-ID: <46795C18.5040408@cmiware.com> I'm interested in using qdisk heuristics to circumvent a fencing duel in my two node cluster, however I have no shared storage so I'm mostly interested in network tests. The FAQ indicates "You don't have to use a disk or partition to get this functionality." however Conga complains about not setting a device or label, and errors when I enter a dummy label. To what should I set the device / label if I just want to ping the gateway for example? Cheers, Chris From lhh at redhat.com Wed Jun 20 21:44:08 2007 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 20 Jun 2007 17:44:08 -0400 Subject: [Linux-cluster] Diskless Quorum Disk In-Reply-To: <46795C18.5040408@cmiware.com> References: <46795C18.5040408@cmiware.com> Message-ID: <20070620214408.GL4687@redhat.com> On Wed, Jun 20, 2007 at 11:55:52AM -0500, Chris Harms wrote: > I'm interested in using qdisk heuristics to circumvent a fencing duel in > my two node cluster, however I have no shared storage so I'm mostly > interested in network tests. The FAQ indicates "You don't have to use a > disk or partition to get this functionality." however Conga complains > about not setting a device or label, and errors when I enter a dummy > label. The FAQ is incorrect probably because of context; I'll clarify it. You don't need to use *qdiskd* to prevent a 'fence duel' in the case that the cluster is configured in the following way: http://sources.redhat.com/cluster/faq.html#two_node_correct ... however, support for "diskless mode" is not implemented. File a bugzilla / feature request so we can track it? > To what should I set the device / label if I just want to ping the > gateway for example? You can't right now; it uses the quorum partition to converge on things and do voting. -- Lon Hohberger - Software Engineer - Red Hat, Inc. From chris at cmiware.com Wed Jun 20 22:57:05 2007 From: chris at cmiware.com (Chris Harms) Date: Wed, 20 Jun 2007 17:57:05 -0500 Subject: [Linux-cluster] Diskless Quorum Disk In-Reply-To: <20070620214408.GL4687@redhat.com> References: <46795C18.5040408@cmiware.com> <20070620214408.GL4687@redhat.com> Message-ID: <4679B0C1.9090005@cmiware.com> My nodes were set to "quorum=1 two_node=1" and fenced by DRAC cards using telnet over their NICs. The same NICs used in my bonded config on the OS so I assumed it was on the same network path. Perhaps I assume incorrectly. Desired effect would be survivor claims service(s) running on unreachable node and attempts to fence unreachable node or bring it back online without fencing should it establish contact. Actual result was survivor spun its wheels trying to fence unreachable node and did not assume services. Restoring network connectivity induced the previously unreachable node to reboot and the surviving node experienced some kind of weird power off and then powered back on (???). Ergo I figured I must need quorum disk so I can use something like a ping node. My present plan is to use a loop device for the quorum disk device and then setup ping heuristics. Will this even work, i.e. do the nodes both need to see the same qdisk or can I fool the service with a loop device? I am not deploying GFS or GNDB and I have no SAN. 
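The heuristic side of what I'm after, assuming qdiskd supports it at all, would presumably be something like this in cluster.conf (the label, score and gateway address are made up for illustration):

  <quorumd interval="1" tko="10" votes="1" label="myqdisk">
    <heuristic program="ping -c1 -w1 192.168.1.254" score="1" interval="2"/>
  </quorumd>

i.e. a node only keeps its qdisk vote while it can still reach the gateway.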
My only option would be to add another DRBD partition for this purpose which may or may not work. What is the proper setup option, two_node=1 or qdisk? Chris Lon Hohberger wrote: > On Wed, Jun 20, 2007 at 11:55:52AM -0500, Chris Harms wrote: > >> I'm interested in using qdisk heuristics to circumvent a fencing duel in >> my two node cluster, however I have no shared storage so I'm mostly >> interested in network tests. The FAQ indicates "You don't have to use a >> disk or partition to get this functionality." however Conga complains >> about not setting a device or label, and errors when I enter a dummy >> label. >> > > The FAQ is incorrect probably because of context; I'll clarify it. You > don't need to use *qdiskd* to prevent a 'fence duel' in the case that the > cluster is configured in the following way: > > http://sources.redhat.com/cluster/faq.html#two_node_correct > > ... however, support for "diskless mode" is not implemented. File a > bugzilla / feature request so we can track it? > > >> To what should I set the device / label if I just want to ping the >> gateway for example? >> > > You can't right now; it uses the quorum partition to converge on things > and do voting. > > From David.Schroeder at flinders.edu.au Wed Jun 20 23:58:48 2007 From: David.Schroeder at flinders.edu.au (David Schroeder) Date: Thu, 21 Jun 2007 09:28:48 +0930 Subject: [Linux-cluster] Cluster service restarting Message-ID: <4679BF38.6080509@flinders.edu.au> Hi, We have been running web and database clusters successfully for several years on RHEL 3 and 4 and we now have one of each on RHEL 5. The setup is very straight forward, 2 nodes active/active with one running the webserver the other the databases. We have found the services restart in place regularly, up to 2 or 3 times a day sometimes. The cause is the Failure to ping one or another of the clustered service IP addresses and is evident from the log entries. This happens less frequently on the database server with one clustered interface than it does with the webserver that has 5. The failure to ping that is reported in the logs for the webserver is not always on the same IP address and it seems quite random in time and which in which IP address it reports is at fault. There are no load related issues as this is still in the testing stage. I have turned the "Monitor Link" setting off and it still happens. Are there any settings that will increase the timeout as I'm sure the interface does not go down. Any other pointers or suggestions? Thanks -- David Schroeder Server Support Information Services Division Flinders University Adelaide, Australia Ph: +61 8 8201 2689 From Robert.Hell at fabasoft.com Thu Jun 21 04:04:30 2007 From: Robert.Hell at fabasoft.com (Hell, Robert) Date: Thu, 21 Jun 2007 06:04:30 +0200 Subject: [Linux-cluster] Getting cman to use a different NIC for Heartbeat Message-ID: Hi! We got a 2-node cluster with RHEL5 Cluster Suite. We want to use a dedicated network for heartbeat communication. So we have 2 interfaces - one for "data" and one for heartbeat communication. I tried the way explained in http://sources.redhat.com/cluster/faq.html#cman_heartbeat_nic but when I start cman: Starting cluster: Loading modules... done Mounting configfs... done Starting ccsd... done Starting cman... failed cman not started: Overridden node name is not in CCS /usr/sbin/cman_tool: aisexec daemon didn't start [FAILED] I used the name pg-hba-001 (heartbeat) in /etc/init.d/cman - the node is configured with the name pg-ba-001 in cluster.conf. 
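To make the setup clearer, the name/address layout is roughly this (the addresses are examples, not the real ones):

  # data LAN
  172.16.1.1   pg-ba-001
  172.16.1.2   pg-ba-002
  # dedicated heartbeat link
  10.0.0.1     pg-hba-001
  10.0.0.2     pg-hba-002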
When I use the heartbeat-name in cluster.conf there's no problem. But that's not what I want. It would be nice if there is any way to tell cman that it should use the heartbeat connection for heartbeat communication - but the node name should be the original pg-ba-001. The perfect solution in my opinion would be that cman uses both available paths - first the heartbeat connection (which is in fact a direct node to node connection) and if this one fails for any reason the connection over regular LAN. Is there any way to achieve that? Thanks in advance, Robert Fabasoft R&D Software GmbH & Co KG Honauerstra?e 4 4020 Linz Austria Tel: [43] (732) 60 61 62 Fax: [43] (732) 60 61 62-609 E-Mail: Robert.Hell at fabasoft.com www.fabasoft.com Fabasoft R&D Software GmbH & Co KG: Handelsgericht Linz, FN 190334d Komplement?r: Fabasoft R&D Software GmbH, Handelsgericht Linz, FN 190091x -------------- next part -------------- An HTML attachment was scrubbed... URL: From Alain.Moulle at bull.net Thu Jun 21 05:43:53 2007 From: Alain.Moulle at bull.net (Alain Moulle) Date: Thu, 21 Jun 2007 07:43:53 +0200 Subject: [Linux-cluster] CS4 U4/U5 a way to disable service monitoring ? Message-ID: <467A1019.7050408@bull.net> Hi Is there a way to disable the periodic status of service either with GUI, either directly in cluster.conf ? Thanks a lot. Alain Moull? From manami_mukherjee at yahoo.com Thu Jun 21 09:13:23 2007 From: manami_mukherjee at yahoo.com (manami mukherjee) Date: Thu, 21 Jun 2007 02:13:23 -0700 (PDT) Subject: [Linux-cluster] Setup of Linux cluster in VMware Message-ID: <947291.17990.qm@web62510.mail.re1.yahoo.com> Hi All, I have just registered to this group. I have one query : i was trying to set up Linux cluster using rhel 4.0 in Vmware. The problem i am facing with shared disk , and i dont have any proper documenation for this. Please let me know if anyone of you have worked on this . Thanks, Manami ____________________________________________________________________________________ Bored stiff? Loosen up... Download and play hundreds of games for free on Yahoo! Games. http://games.yahoo.com/games/front From haller at atix.de Thu Jun 21 09:54:27 2007 From: haller at atix.de (Dirk Haller) Date: Thu, 21 Jun 2007 11:54:27 +0200 Subject: [Linux-cluster] Setup of Linux cluster in VMware In-Reply-To: <947291.17990.qm@web62510.mail.re1.yahoo.com> References: <947291.17990.qm@web62510.mail.re1.yahoo.com> Message-ID: <200706211154.28175.haller@atix.de> Hello Manami, what is your exact problem? Setting up shared disks in VMware? If yes, have a look on this doc... http://www.vmware.com/support/gsx3/doc/ha_configs_gsx.html This also works with the free VMware server version. Regards, Dirk On Thursday 21 June 2007 11:13:23 manami mukherjee wrote: > Hi All, > I have just registered to this group. > > I have one query : > i was trying to set up Linux cluster using rhel 4.0 in > Vmware. > > The problem i am facing with shared disk , and i dont > have any proper documenation for this. > > Please let me know if anyone of you have worked on > this . > > Thanks, > Manami > > > > ___________________________________________________________________________ >_________ Bored stiff? Loosen up... > Download and play hundreds of games for free on Yahoo! Games. 
> http://games.yahoo.com/games/front > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Gruss / Regards Dirk Haller Tel.: +49-89 452 3538 -13 **** ATIX - Gesellschaft f?r Informationstechnologie und Consulting mbH Einsteinstrasse 10 D-85716 Unterschleissheim Tel.: +49-89 452 3538 -0 http://www.atix.de/ !!! ATIX auf dem Linux Tag in Berlin: !!! LinuxTag, 30.05. - 02.06.2007, Berlin Linux-Verband Stand: Halle 12, Stand 56 Registergericht: Amtsgericht M?nchen Registernummer: HRB 131682 USt.-Id.: DE209485962 Gesch?ftsf?hrung: Marc Grimme, Mark Hlawatschek, Thomas Merz From jprats at cesca.es Thu Jun 21 10:57:03 2007 From: jprats at cesca.es (Jordi Prats) Date: Thu, 21 Jun 2007 12:57:03 +0200 Subject: [Linux-cluster] Very poor performance of GFS2 Message-ID: <467A597F.7080604@cesca.es> Hi all, I'm getting a very poor performance using GFS2. Here some numbers: A disk usage on a GFS2 filesystem: [root at urani CLUSTER]# time du -hs PostgreSQL814/postgresql-8.1.4/ 121M PostgreSQL814/postgresql-8.1.4/ real 5m49.597s user 0m0.000s sys 0m0.004s On the local disk: [root at urani CLUSTER]# time du -hs postgresql-8.1.4/ 116M postgresql-8.1.4/ real 0m0.015s user 0m0.000s sys 0m0.016s The GFS2 filesystem was created using this: mkfs.gfs2 -t dades_test:postgres814 -p lock_dlm -j 2 /dev/data/postgres814 It is mounted on two machines (urani and plutoni): /dev/data/mysql5020 on /CLUSTER/MySQL5020 type gfs2 (rw,hostdata=jid=1:id=196610:first=0) /dev/data/postgres814 on /CLUSTER/PostgreSQL814 type gfs2 (rw,hostdata=jid=1:id=65538:first=0) I supose it's caused by some problem on my configuration. What I am missing? Thank you! Jordi -- ...................................................................... __ / / Jordi Prats C E / S / C A Dept. de Sistemes /_/ Centre de Supercomputaci? de Catalunya Gran Capit?, 2-4 (Edifici Nexus) ? 08034 Barcelona T. 93 205 6464 ? F. 93 205 6979 ? jprats at cesca.es ...................................................................... From swplotner at amherst.edu Thu Jun 21 14:04:28 2007 From: swplotner at amherst.edu (Steffen Plotner) Date: Thu, 21 Jun 2007 10:04:28 -0400 Subject: [Linux-cluster] Setup of Linux cluster in VMware References: <947291.17990.qm@web62510.mail.re1.yahoo.com> Message-ID: <0456A130F613AD459887FF012652963F0125EE7A@mail7.amherst.edu> Hi, I have a cluster that resides partially in vmware virtual machines and physical machines. To get at shared disks I use the iscsi initiator from within the VM. Steffen ________________________________ Steffen Plotner Systems Administrator/Programmer Systems & Networking Amherst College PO BOX 5000 Amherst, MA 01002-5000 Tel (413) 542-2348 Fax (413) 542-2626 Email: swplotner at amherst.edu ________________________________ ________________________________ From: linux-cluster-bounces at redhat.com on behalf of manami mukherjee Sent: Thu 6/21/2007 5:13 AM To: linux-cluster at redhat.com Subject: [Linux-cluster] Setup of Linux cluster in VMware Hi All, I have just registered to this group. I have one query : i was trying to set up Linux cluster using rhel 4.0 in Vmware. The problem i am facing with shared disk , and i dont have any proper documenation for this. Please let me know if anyone of you have worked on this . Thanks, Manami ____________________________________________________________________________________ Bored stiff? Loosen up... Download and play hundreds of games for free on Yahoo! Games. 
http://games.yahoo.com/games/front -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From janne.peltonen at helsinki.fi Thu Jun 21 15:37:33 2007 From: janne.peltonen at helsinki.fi (Janne Peltonen) Date: Thu, 21 Jun 2007 18:37:33 +0300 Subject: [Linux-cluster] rgmanager-2.0.24 doesn't execute script status Message-ID: <20070621153732.GD15269@helsinki.fi> Hi. It seems to me that there is a bug in rgmanager 2.0.24 (at least in the centos build). It doesn't execute status for service scripts, even if there is this line in /usr/share/cluster/script.sh: Downgrading back to version 2.0.23 seemed to help. It works with the same configuration that version 2.0.24 doesn't. (Perhaps I'd better learn to use Bugzilla...) --Janne Peltonen Univ. of Helsinki P.S. Any news on the fs.sh front? -- Janne Peltonen From mvz+rhcluster at nimium.hr Thu Jun 21 16:42:47 2007 From: mvz+rhcluster at nimium.hr (Miroslav Zubcic) Date: Thu, 21 Jun 2007 18:42:47 +0200 Subject: [Linux-cluster] Status script action timeout Message-ID: <467AAA87.10502@nimium.hr> Hello, Today early in the morning I have suffered from oracle (yuck!) listener bug. Routine in status () in service monitoring script in my RHCS never returned, just hanged forever. I have received error only after oracle listener was restarted and oracle has been restarted in working hours because listener hanged in 04:03 AM. Is there some timeout parameter for rgmanager? Something to put in cluster.conf? Man page doesn't mention anything, rh-cs-en.pdf neither, so RTFM dosen't help. There must be some way to control timeout from status routine in cluster control script right? Thanks ... -- Miroslav Zubcic, Nimium d.o.o., email: Tel: +385 01 6390 782, Fax: +385 01 4852 640, Mobile: +385 098 942 8672 Gredicka 3, 10000 Zagreb, Hrvatska From wkenji at labs.fujitsu.com Fri Jun 22 08:53:41 2007 From: wkenji at labs.fujitsu.com (Kenji Wakamiya) Date: Fri, 22 Jun 2007 17:53:41 +0900 Subject: [Linux-cluster] Can't mount snapshot LUN with new infrastructure Message-ID: <467B8E15.4080209@labs.fujitsu.com> Hello, I have a three node GFS1 cluster that is based on OpenAIS and CentOS 5, and am using NetApp's iSCSI LUN as a block device. Now I want to mount that LUN's snapshot LUN with lock_nolock on any member of the three nodes, with having mounted original LUN. With old version of CMAN (not OpenAIS), the same thing worked well. But I've got the following error with new infrastructure: # mount -t gfs /dev/isda on /web type gfs (rw,hostdata=jid=2:id=131074:first=0,acl) # mount -t gfs -o lockproto=lock_nolock /dev/isdb /testsnap /sbin/mount.gfs: error 17 mounting /dev/isdb on /testsnap /var/log/mssages: Jun 22 17:10:46 node17 kernel: kobject_add failed for cluster3:gfstest with -EEXIST, don't try to register things with the same name in the same directory. 
Jun 22 17:10:46 node17 kernel: [] kobject_add+0x147/0x16d Jun 22 17:10:46 node17 kernel: [] kobject_register+0x19/0x30 Jun 22 17:10:46 node17 kernel: [] fill_super+0x3df/0x5c9 [gfs] Jun 22 17:10:46 node17 kernel: [] get_sb_bdev+0xc6/0x110 Jun 22 17:10:46 node17 kernel: [] __alloc_pages+0x57/0x27e Jun 22 17:10:46 node17 kernel: [] gfs_get_sb+0x12/0x16 [gfs] Jun 22 17:10:46 node17 kernel: [] fill_super+0x0/0x5c9 [gfs] Jun 22 17:10:46 node17 kernel: [] vfs_kern_mount+0x7d/0xf2 Jun 22 17:10:46 node17 kernel: [] do_kern_mount+0x25/0x36 Jun 22 17:10:46 node17 kernel: [] do_mount+0x5d6/0x646 Jun 22 17:10:46 node17 kernel: [] find_get_pages_tag+0x30/0x6e Jun 22 17:10:46 node17 kernel: [] pagevec_lookup_tag+0x1b/0x22 Jun 22 17:10:46 node17 kernel: [] get_page_from_freelist+0x96/0x310 Jun 22 17:10:46 node17 kernel: [] get_page_from_freelist+0x2a6/0x310 Jun 22 17:10:46 node17 kernel: [] get_page_from_freelist+0x96/0x310 Jun 22 17:10:46 node17 kernel: [] copy_mount_options+0x26/0x109 Jun 22 17:10:46 node17 kernel: [] sys_mount+0x6d/0xa5 Jun 22 17:10:46 node17 kernel: [] syscall_call+0x7/0xb Jun 22 17:10:46 node17 kernel: ======================= Now I have /sys/fs/gfs/cluster3:gfstest/. So I suspect that a conflict between two LUN's lock table names is occurring in sysfs. Is there any good solution? Thanks, Kenji From GavinF at itdynamics.co.za Fri Jun 22 10:14:35 2007 From: GavinF at itdynamics.co.za (Gavin Fietze) Date: Fri, 22 Jun 2007 12:14:35 +0200 Subject: [Linux-cluster] QDISK problem Message-ID: <467BBD2B.A5F6.00ED.0@itdynamics.co.za> I am trying to get qdiskd to work in a 3 node cluster using RHEL 5 AP . The 3 nodes are virtual machines running XEN. Domain0 is also RHEL 5 AP I have upgraded cman and rgmanager to cman-2.0.60-1.el5 and rgmanager-2.0.23-1 respectively, everything else stock standard. When I run clustat and "cman_tool nodes" I get strange output for the qdisk object : [root at node1 ~]# [root at node1 ~]# clustat /dev/sdc1?? U?? U?? not found realloc 1232 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ node1.sv.dynamics.co.za 1 Online, Local, rgmanager node2.sv.dynamics.co.za 2 Online, rgmanager node3.sv.dynamics.co.za 3 Online, rgmanager /dev/sdc1?? U?? U?? 0 Online, Estranged, rgmanager Service Name Owner (Last) State ------- ---- ----- ------ ----- service:it node2.sv.dynamics.co.za started service:tru node3.sv.dynamics.co.za started service:aql (node1.sv.dynamics.co.za) failed service:if node3.sv.dynamics.co.za started service:fc (node1.sv.dynamics.co.za) failed service:com (node1.sv.dynamics.co.za) failed service:xprint node3.sv.dynamics.co.za started [root at node1 ~]# [root at node1 ~]# [root at node1 ~]# [root at node1 ~]# cman_tool nodes Node Sts Inc Joined Name 0 M 0 2007-06-22 10:49:54 /dev/sdc1?? U?? U?? 1 M 4 2007-06-22 10:47:14 node1.sv.dynamics.co.za 2 M 52 2007-06-22 10:47:14 node2.sv.dynamics.co.za 3 M 52 2007-06-22 10:47:14 node3.sv.dynamics.co.za mkqisk does not report any funnies: [root at node1 ~]# mkqdisk -L mkqdisk v0.5.1 /dev/sdc1: Magic: eb7a62c2 Label: epoc Created: Thu Jun 14 16:04:49 2007 Host: node3.svdynamics.co.za Is this normal, and will it effect the operation of qdiskd? Can someone tell me what Inc represents in the cman_tool nodes output? Thanks Gavin Fietze IT Dynamics (Pty) Ltd Direct +27 (0)31 7130826 Fax +27 (0)31 7020613 Mobile +27 (0)83 5012516 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From teigland at redhat.com Fri Jun 22 13:33:08 2007 From: teigland at redhat.com (David Teigland) Date: Fri, 22 Jun 2007 08:33:08 -0500 Subject: [Linux-cluster] Very poor performance of GFS2 In-Reply-To: <467A597F.7080604@cesca.es> References: <467A597F.7080604@cesca.es> Message-ID: <20070622133308.GA6381@redhat.com> On Thu, Jun 21, 2007 at 12:57:03PM +0200, Jordi Prats wrote: > Hi all, > I'm getting a very poor performance using GFS2. Here some numbers: > > A disk usage on a GFS2 filesystem: > > [root at urani CLUSTER]# time du -hs PostgreSQL814/postgresql-8.1.4/ > 121M PostgreSQL814/postgresql-8.1.4/ > > real 5m49.597s > user 0m0.000s > sys 0m0.004s > > > On the local disk: > > [root at urani CLUSTER]# time du -hs postgresql-8.1.4/ > 116M postgresql-8.1.4/ > > real 0m0.015s > user 0m0.000s > sys 0m0.016s You should compare with gfs1, that will tell you if the gfs2 numbers are in the right ballpark. Dave From teigland at redhat.com Fri Jun 22 13:40:54 2007 From: teigland at redhat.com (David Teigland) Date: Fri, 22 Jun 2007 08:40:54 -0500 Subject: [Linux-cluster] Can't mount snapshot LUN with new infrastructure In-Reply-To: <467B8E15.4080209@labs.fujitsu.com> References: <467B8E15.4080209@labs.fujitsu.com> Message-ID: <20070622134054.GB6381@redhat.com> On Fri, Jun 22, 2007 at 05:53:41PM +0900, Kenji Wakamiya wrote: > Hello, > > I have a three node GFS1 cluster that is based on OpenAIS and CentOS 5, > and am using NetApp's iSCSI LUN as a block device. > > Now I want to mount that LUN's snapshot LUN with lock_nolock on any > member of the three nodes, with having mounted original LUN. > > With old version of CMAN (not OpenAIS), the same thing worked well. > But I've got the following error with new infrastructure: > > # mount -t gfs > /dev/isda on /web type gfs (rw,hostdata=jid=2:id=131074:first=0,acl) > # mount -t gfs -o lockproto=lock_nolock /dev/isdb /testsnap > /sbin/mount.gfs: error 17 mounting /dev/isdb on /testsnap > Now I have /sys/fs/gfs/cluster3:gfstest/. So I suspect that a > conflict between two LUN's lock table names is occurring in sysfs. Yes, you're exactly right. You can also override the locktable name with a mount option: mount -t gfs -o lockproto=lock_nolock,locktable=foo /dev/isdb /testsnap Dave From simanhew at gmail.com Fri Jun 22 16:57:35 2007 From: simanhew at gmail.com (siman hew) Date: Fri, 22 Jun 2007 12:57:35 -0400 Subject: [Linux-cluster] What is "Password Script" field for in Fence Device Configuration Message-ID: <6596a7c70706220957s5dfad6cel69337069ace15be6@mail.gmail.com> Hi all, I tried to use old GUI (and Conga) to define a fence device on RHEL4U5, I found there is new field "Password Script" under Password field. What is thid field for? I can not find it on RHEL5 (old GUI & Conga) . Is this particular for 4U5 ? Any explaination is very apprecaited. Thanks, Siman -------------- next part -------------- An HTML attachment was scrubbed... URL: From oliver.olsen at advance.as Fri Jun 22 17:51:05 2007 From: oliver.olsen at advance.as (Oliver Olsen) Date: Fri, 22 Jun 2007 19:51:05 +0200 Subject: [Linux-cluster] Snapshots with GFS1/RHEL 4 (U4) Message-ID: <20070622195105.r0bk5hg4kk0ws4oo@home.advance.as> Hi, I'm currently in the process of upgrading a RHEL 3 cluster to RHEL 4, and I haven't been able to figure out how to use snapshots in a proper way. We have other RHEL 4 systems using LVM2, and I use snapshots with ext3 for backup with great success on those. 
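(For context, the ext3 routine I mean is essentially the following, with volume group, names and sizes just as examples:

  lvcreate -L1G -s -n backupsnap /dev/vg00/data
  mount -o ro /dev/vg00/backupsnap /mnt/snap
  tar czf /backup/data-snap.tar.gz -C /mnt/snap .
  umount /mnt/snap
  lvremove -f /dev/vg00/backupsnap

so the snapshot only exists for the duration of the backup.)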
As the demand for uptime on this particular cluster is cruicial (and regular filebackup using tar is way too slow!) I was hoping to accomplish the same kind of snapshots with GFS1 and RHEL 4 as with ext3. My test-environment consists of 3 VM's in a VMware Server 1.0.3 environment, using DLM and one GFS filesystem on the VM's. Fencing and connectivity is working just fine, but when I try to use snapshots things go wrong. I try to initiate the snapshot via [root at gfs1 Scripts]# lvcreate -L500M -s -n snap /dev/GFS/LV1 Logical volume "snap" created When I try to mount the snapshot (in the same way as I do with an ext3 snapshot) it does not seem to work [root at gfs1 Scripts]# mount -t gfs /dev/mapper/GFS-LV1 on /mnt/GFS type gfs (rw) [root at gfs1 Scripts]# mount -t gfs /dev/GFS/snap /mnt/GFS-snap mount: File exists Also - when I try to do changes in the /mnt/GFS filesystem after creating the snapshot, there does not seem to be any changes in the snapshot when I check the graphical interface of LVM (it says Snapshot usage: 0%). Attributes are "swi-a-" and GFS (clustered) according to LVM. I assume I am missing something vital here, but I haven't been able to find the documentation which explains this in an easy manner. Any inputs will be highly appreciated! Best regards, Oliver Olsen From Greg.Caetano at hp.com Fri Jun 22 18:08:35 2007 From: Greg.Caetano at hp.com (Caetano, Greg) Date: Fri, 22 Jun 2007 14:08:35 -0400 Subject: [Linux-cluster] Snapshots with GFS1/RHEL 4 (U4) In-Reply-To: <20070622195105.r0bk5hg4kk0ws4oo@home.advance.as> References: <20070622195105.r0bk5hg4kk0ws4oo@home.advance.as> Message-ID: Oliver What does the following command show for the status of your snapshot volume and/or device # lvdisplay /dev/GFS/LV1 Greg Caetano HP TSG Linux Solutions Alliances Engineering Chicago, IL greg.caetano at hp.com Red Hat Certified Engineer RHCE#803004972711193 -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Oliver Olsen Sent: Friday, June 22, 2007 12:51 PM To: linux clustering Subject: [Linux-cluster] Snapshots with GFS1/RHEL 4 (U4) Hi, I'm currently in the process of upgrading a RHEL 3 cluster to RHEL 4, and I haven't been able to figure out how to use snapshots in a proper way. We have other RHEL 4 systems using LVM2, and I use snapshots with ext3 for backup with great success on those. As the demand for uptime on this particular cluster is cruicial (and regular filebackup using tar is way too slow!) I was hoping to accomplish the same kind of snapshots with GFS1 and RHEL 4 as with ext3. My test-environment consists of 3 VM's in a VMware Server 1.0.3 environment, using DLM and one GFS filesystem on the VM's. Fencing and connectivity is working just fine, but when I try to use snapshots things go wrong. I try to initiate the snapshot via [root at gfs1 Scripts]# lvcreate -L500M -s -n snap /dev/GFS/LV1 Logical volume "snap" created When I try to mount the snapshot (in the same way as I do with an ext3 snapshot) it does not seem to work [root at gfs1 Scripts]# mount -t gfs /dev/mapper/GFS-LV1 on /mnt/GFS type gfs (rw) [root at gfs1 Scripts]# mount -t gfs /dev/GFS/snap /mnt/GFS-snap mount: File exists Also - when I try to do changes in the /mnt/GFS filesystem after creating the snapshot, there does not seem to be any changes in the snapshot when I check the graphical interface of LVM (it says Snapshot usage: 0%). Attributes are "swi-a-" and GFS (clustered) according to LVM. 
I assume I am missing something vital here, but I haven't been able to find the documentation which explains this in an easy manner. Any inputs will be highly appreciated! Best regards, Oliver Olsen -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From oliver.olsen at advance.as Fri Jun 22 18:17:17 2007 From: oliver.olsen at advance.as (Oliver Olsen) Date: Fri, 22 Jun 2007 20:17:17 +0200 Subject: [Linux-cluster] Snapshots with GFS1/RHEL 4 (U4) In-Reply-To: References: <20070622195105.r0bk5hg4kk0ws4oo@home.advance.as> Message-ID: <20070622201717.airzxizzko84kc0o@home.advance.as> Quoting "Caetano, Greg" : > Oliver > > What does the following command show for the status of your snapshot > volume and/or device > > # lvdisplay /dev/GFS/LV1 > Greg, The output is as follows [root at gfs1 Scripts]# lvdisplay /dev/GFS/LV1 --- Logical volume --- LV Name /dev/GFS/LV1 VG Name GFS LV UUID zCkOfT-2Nua-Z3V2-GEy7-vrrs-p0IX-qu01kr LV Write Access read/write LV snapshot status source of /dev/GFS/snap [active] LV Status available # open 1 LV Size 1.56 GB Current LE 400 Segments 1 Allocation inherit Read ahead sectors 0 Block device 253:3 [root at gfs1 Scripts]# lvdisplay /dev/GFS/snap --- Logical volume --- LV Name /dev/GFS/snap VG Name GFS LV UUID KSPZxh-y9j4-h1c3-JaYo-O8FZ-2dPr-Nbv09S LV Write Access read/write LV snapshot status active destination for /dev/GFS/LV1 LV Status available # open 0 LV Size 1.56 GB Current LE 400 COW-table size 100.00 MB COW-table LE 25 Allocated to snapshot 0.02% Snapshot chunk size 8.00 KB Segments 1 Allocation inherit Read ahead sectors 0 Block device 253:5 (I used "lvcreate -L100M -s -n snap /dev/GFS/LV1" for this particular snapshot) Best regards, Oliver Olsen From teigland at redhat.com Fri Jun 22 18:45:42 2007 From: teigland at redhat.com (David Teigland) Date: Fri, 22 Jun 2007 13:45:42 -0500 Subject: [Linux-cluster] Snapshots with GFS1/RHEL 4 (U4) In-Reply-To: <20070622195105.r0bk5hg4kk0ws4oo@home.advance.as> References: <20070622195105.r0bk5hg4kk0ws4oo@home.advance.as> Message-ID: <20070622184542.GE6381@redhat.com> On Fri, Jun 22, 2007 at 07:51:05PM +0200, Oliver Olsen wrote: > I assume I am missing something vital here, but I haven't been able to > find the documentation which explains this in an easy manner. You need clustered snapshots in lvm2 which don't exist. Clustered mirroring in lvm2 was recently introduced, though, which you can use with gfs. Dave From lhh at redhat.com Fri Jun 22 20:36:04 2007 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 22 Jun 2007 16:36:04 -0400 Subject: [Linux-cluster] Diskless Quorum Disk In-Reply-To: <4679B0C1.9090005@cmiware.com> References: <46795C18.5040408@cmiware.com> <20070620214408.GL4687@redhat.com> <4679B0C1.9090005@cmiware.com> Message-ID: <20070622203604.GO4687@redhat.com> On Wed, Jun 20, 2007 at 05:57:05PM -0500, Chris Harms wrote: > My nodes were set to "quorum=1 two_node=1" and fenced by DRAC cards > using telnet over their NICs. The same NICs used in my bonded config on > the OS so I assumed it was on the same network path. Perhaps I assume > incorrectly. That sounds mostly right. The point is that a node disconnected from the cluster must not be able to fence a node which is supposedly still connected. That is: 'A' must not be able to fence 'B' if 'A' becomes disconnected from the cluster. However, 'A' must be able to be fenced if 'A' becomes disconnected. Why was DRAC unreachable; was it unplugged too? 
(Is DRAC like IPMI - in that it shares a NIC with the host machine?) > Desired effect would be survivor claims service(s) running on > unreachable node and attempts to fence unreachable node or bring it back > online without fencing should it establish contact. Actual result was > survivor spun its wheels trying to fence unreachable node and did not > assume services. Yes, this is an unfortunate limitation of using (most) integrated power management systems. Basically, some BMCs share a NIC with the host (IPMI), and some run off of the machine's power supply (IPMI, iLO, DRAC). When the fence device becomes unreachable, we don't know whether it's a total network outage or a "power disconnected" state. * If the power to a node has been disconnected, it's safe to recover. * If the node just lost all of its network connectivity, it's *NOT* safe to recover. * In both cases, we can not confirm the node is dead... which is why we don't recover. > Restoring network connectivity induced the previously > unreachable node to reboot and the surviving node experienced some kind > of weird power off and then powered back on (???). That doesn't sound right; the surviving node should have stayed put (not rebooted). > Ergo I figured I must need quorum disk so I can use something like a > ping node. My present plan is to use a loop device for the quorum disk > device and then setup ping heuristics. Will this even work, i.e. do the > nodes both need to see the same qdisk or can I fool the service with a > loop device? I don't believe the effect of tricking qdiskd in this way have been explored; I don't see why it wouldn't work in theory, but... qdiskd with or without a disk won't fix the behavior you experienced (uncertain state due to failure to fence -> retry / wait for node to come back). > I am not deploying GFS or GNDB and I have no SAN. My only > option would be to add another DRBD partition for this purpose which may > or may not work. > What is the proper setup option, two_node=1 or qdisk? In your case, I'd say two_node="1". -- Lon Hohberger - Software Engineer - Red Hat, Inc. From wkenji at labs.fujitsu.com Fri Jun 22 20:39:33 2007 From: wkenji at labs.fujitsu.com (Kenji Wakamiya) Date: Sat, 23 Jun 2007 05:39:33 +0900 Subject: [Linux-cluster] Can't mount snapshot LUN with new infrastructure In-Reply-To: <20070622134054.GB6381@redhat.com> References: <467B8E15.4080209@labs.fujitsu.com> <20070622134054.GB6381@redhat.com> Message-ID: <467C3385.2080802@labs.fujitsu.com> David Teigland wrote: >> Now I have /sys/fs/gfs/cluster3:gfstest/. So I suspect that a >> conflict between two LUN's lock table names is occurring in sysfs. > > Yes, you're exactly right. You can also override the locktable name with > a mount option: > > mount -t gfs -o lockproto=lock_nolock,locktable=foo /dev/isdb /testsnap I didn't know that option existed. It worked! Thank you! Kenji From lhh at redhat.com Fri Jun 22 20:40:43 2007 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 22 Jun 2007 16:40:43 -0400 Subject: [Linux-cluster] Cluster service restarting In-Reply-To: <4679BF38.6080509@flinders.edu.au> References: <4679BF38.6080509@flinders.edu.au> Message-ID: <20070622204042.GP4687@redhat.com> On Thu, Jun 21, 2007 at 09:28:48AM +0930, David Schroeder wrote: > Hi, > > We have been running web and database clusters successfully for several > years on RHEL 3 and 4 and we now have one of each on RHEL 5. > > The setup is very straight forward, 2 nodes active/active with one > running the webserver the other the databases. 
> > We have found the services restart in place regularly, up to 2 or 3 > times a day sometimes. The cause is the Failure to ping one or another > of the clustered service IP addresses and is evident from the log > entries. This happens less frequently on the database server with one > clustered interface than it does with the webserver that has 5. The > failure to ping that is reported in the logs for the webserver is not > always on the same IP address and it seems quite random in time and > which in which IP address it reports is at fault. There are no load > related issues as this is still in the testing stage. > > I have turned the "Monitor Link" setting off and it still happens. > > Are there any settings that will increase the timeout as I'm sure the > interface does not go down. > > Any other pointers or suggestions? You can disable the check; remove these from /usr/share/cluster/ip.sh: Update your /etc/cluster/cluster.conf's config_version and redistribute the configuration file using ccs_tool update. This will cause rgmanager to stop doing the 'ping' checks. -- Lon -- Lon Hohberger - Software Engineer - Red Hat, Inc. From chris at cmiware.com Fri Jun 22 23:43:53 2007 From: chris at cmiware.com (Chris Harms) Date: Fri, 22 Jun 2007 18:43:53 -0500 Subject: [Linux-cluster] Diskless Quorum Disk In-Reply-To: <20070622203604.GO4687@redhat.com> References: <46795C18.5040408@cmiware.com> <20070620214408.GL4687@redhat.com> <4679B0C1.9090005@cmiware.com> <20070622203604.GO4687@redhat.com> Message-ID: <467C5EB9.20904@cmiware.com> Lon, thank you for the response. It appears that what I thought was a fence duel, was actually the cluster fencing the proper node and DRBD halting the surviving node after a split brain scenario. (Have some work to do on my drbd.conf obviously.) After the fenced node revived, it saw that the other was unresponsive (it had been halted) and then fenced it; in this case inducing it to power on. Our DRAC shares the NICs with the host. We will probably hack on the DRAC fence script a little to take advantage of some other features available besides doing a poweroff poweron. Using two_node=1 may be an option again, but then the FAQ indicates the quorum disk might still be beneficial. Using a loop device didn't seem to go so well, but that could be due to configuration error. Having one node not see the qdisk is probably an automatic test failure. Thanks again, Chris Lon Hohberger wrote: > On Wed, Jun 20, 2007 at 05:57:05PM -0500, Chris Harms wrote: > >> My nodes were set to "quorum=1 two_node=1" and fenced by DRAC cards >> using telnet over their NICs. The same NICs used in my bonded config on >> the OS so I assumed it was on the same network path. Perhaps I assume >> incorrectly. >> > > That sounds mostly right. The point is that a node disconnected from > the cluster must not be able to fence a node which is supposedly still > connected. > > That is: 'A' must not be able to fence 'B' if 'A' becomes disconnected > from the cluster. However, 'A' must be able to be fenced if 'A' becomes > disconnected. > > Why was DRAC unreachable; was it unplugged too? (Is DRAC like IPMI - in > that it shares a NIC with the host machine?) > > >> Desired effect would be survivor claims service(s) running on >> unreachable node and attempts to fence unreachable node or bring it back >> online without fencing should it establish contact. Actual result was >> survivor spun its wheels trying to fence unreachable node and did not >> assume services. 
>> > > Yes, this is an unfortunate limitation of using (most) integrated power > management systems. Basically, some BMCs share a NIC with the host > (IPMI), and some run off of the machine's power supply (IPMI, iLO, > DRAC). When the fence device becomes unreachable, we don't know whether > it's a total network outage or a "power disconnected" state. > > * If the power to a node has been disconnected, it's safe to recover. > > * If the node just lost all of its network connectivity, it's *NOT* safe > to recover. > > * In both cases, we can not confirm the node is dead... which is why we > don't recover. > > >> Restoring network connectivity induced the previously >> unreachable node to reboot and the surviving node experienced some kind >> of weird power off and then powered back on (???). >> > > That doesn't sound right; the surviving node should have stayed put (not > rebooted). > > >> Ergo I figured I must need quorum disk so I can use something like a >> ping node. My present plan is to use a loop device for the quorum disk >> device and then setup ping heuristics. Will this even work, i.e. do the >> nodes both need to see the same qdisk or can I fool the service with a >> loop device? >> > > I don't believe the effect of tricking qdiskd in this way have been > explored; I don't see why it wouldn't work in theory, but... qdiskd with > or without a disk won't fix the behavior you experienced (uncertain > state due to failure to fence -> retry / wait for node to come back). > > >> I am not deploying GFS or GNDB and I have no SAN. My only >> option would be to add another DRBD partition for this purpose which may >> or may not work. >> > > >> What is the proper setup option, two_node=1 or qdisk? >> > > In your case, I'd say two_node="1". > > From bsachnoff at incisent.com Sat Jun 23 22:23:56 2007 From: bsachnoff at incisent.com (Brent Sachnoff) Date: Sat, 23 Jun 2007 17:23:56 -0500 Subject: [Linux-cluster] a couple of questions regarding clusters Message-ID: <91E302EA72562F43AEDC06DD8FAE7D3602855F82@ord1mail01.firstlook.biz> I have a 3 node cluster running redhat 4 with gfs. What is the proper way to have a node leave the cluster for maintenance and then rejoin after maintenance is completed? From the docs, I have read that I need to unmount gfs and then stop all the services in the following order: rgmanager, gfs, clvmd, fenced. I can then issue a cman_tool leave (remove) request. I have also noticed that if I lose ip connectivity to a certain node I lose gfs connectivity with the other two nodes. I thought that I would only need 2 votes to continue connectivity. Thanks for the help! -------------- next part -------------- An HTML attachment was scrubbed... URL: From manjusc13 at rediffmail.com Mon Jun 25 03:57:19 2007 From: manjusc13 at rediffmail.com (manjunath c shanubog) Date: 25 Jun 2007 03:57:19 -0000 Subject: [Linux-cluster] Cluster configuration on redhat AS 4 Message-ID: <20070625035719.12708.qmail@webmail6.rediffmail.com> Hi,           I have to setup two node cluster with redhat AS 4 and cluster suite with GFS. The application which is to be installed is MySql database. I would like to have a solution for the below queries          1. Detailed installation guide for cluster suite installation and is it possible to load balance on redhat 4/5 linux.          2. Do i need to have a separate cluster suite for MySql, if so which one is Good.          3. Guide or document for Installation of MySQL on cluster.          4. 
In windows clustering there is no need of fencing device, why is it necessary in linux. if so which is good fencing device and its configuration details.Thanking YouManjunath -------------- next part -------------- An HTML attachment was scrubbed... URL: From jprats at cesca.es Mon Jun 25 06:48:05 2007 From: jprats at cesca.es (Jordi Prats) Date: Mon, 25 Jun 2007 08:48:05 +0200 Subject: [Linux-cluster] Very poor performance of GFS2 In-Reply-To: <20070622133308.GA6381@redhat.com> References: <467A597F.7080604@cesca.es> <20070622133308.GA6381@redhat.com> Message-ID: <467F6525.3040506@cesca.es> I supose you say so because the disk could be slow, but it should not be this slow because they virtual machines and they are accesing the same way as is the local disk. (Both are LVM volumes) I found almost no documentation about how to install GFS2, so I'm assuming I did something wrong. I supose GFS2 do not add about 5 minutes of delay because of it's operations! Jordi David Teigland wrote: > On Thu, Jun 21, 2007 at 12:57:03PM +0200, Jordi Prats wrote: > >> Hi all, >> I'm getting a very poor performance using GFS2. Here some numbers: >> >> A disk usage on a GFS2 filesystem: >> >> [root at urani CLUSTER]# time du -hs PostgreSQL814/postgresql-8.1.4/ >> 121M PostgreSQL814/postgresql-8.1.4/ >> >> real 5m49.597s >> user 0m0.000s >> sys 0m0.004s >> >> >> On the local disk: >> >> [root at urani CLUSTER]# time du -hs postgresql-8.1.4/ >> 116M postgresql-8.1.4/ >> >> real 0m0.015s >> user 0m0.000s >> sys 0m0.016s >> > > You should compare with gfs1, that will tell you if the gfs2 numbers are > in the right ballpark. > > Dave > > > > -- ...................................................................... __ / / Jordi Prats C E / S / C A Dept. de Sistemes /_/ Centre de Supercomputaci? de Catalunya Gran Capit?, 2-4 (Edifici Nexus) ? 08034 Barcelona T. 93 205 6464 ? F. 93 205 6979 ? jprats at cesca.es ...................................................................... From tmornini at engineyard.com Mon Jun 25 07:19:01 2007 From: tmornini at engineyard.com (Tom Mornini) Date: Mon, 25 Jun 2007 00:19:01 -0700 Subject: [Linux-cluster] Very poor performance of GFS2 In-Reply-To: <467F6525.3040506@cesca.es> References: <467A597F.7080604@cesca.es> <20070622133308.GA6381@redhat.com> <467F6525.3040506@cesca.es> Message-ID: On Jun 24, 2007, at 11:48 PM, Jordi Prats wrote: > I supose you say so because the disk could be slow, but it should > not be this slow because they virtual machines and they are > accesing the same way as is the local disk. (Both are LVM volumes) > > I found almost no documentation about how to install GFS2, so I'm > assuming I did something wrong. I supose GFS2 do not add about 5 > minutes of delay because of it's operations! I think you're right. GFS2 is *supposed* to be faster than the original GFS, and here's what I get. 
ey00-s00001 data # pwd /data ey00-s00001 data # time du -hs postgresql-8.2.4/ 76M postgresql-8.2.4/ real 0m0.061s user 0m0.000s sys 0m0.050s ey00-s00001 data # df -Th Filesystem Type Size Used Avail Use% Mounted on /dev/sda1 reiserfs 2.0G 783M 1.3G 39% / udev tmpfs 512M 120K 512M 1% /dev shm tmpfs 512M 0 512M 0% /dev/shm /dev/sdb1 gfs 227G 128G 99G 57% /data -- -- Tom Mornini, CTO -- Engine Yard, Ruby on Rails Hosting -- Support, Scalability, Reliability -- (866) 518-YARD (9273) From admin.cluster at gmail.com Mon Jun 25 09:50:59 2007 From: admin.cluster at gmail.com (Anthony) Date: Mon, 25 Jun 2007 11:50:59 +0200 Subject: [Linux-cluster] eclipse for RHEL4 Message-ID: <467F9003.7000603@gmail.com> Hello, i have problems running eclipse on my RHEL box, when i use the gz version from the eclipse website. so i am looking for a Eclipse RPM for RHEL 4, couldn't find it on the RHN network. Does anyone has a link for the Eclipse RPM for the RHEL 4 update 2 Thnks, Anthony. From davea at support.kcm.org Mon Jun 25 15:44:36 2007 From: davea at support.kcm.org (Dave Augustus) Date: Mon, 25 Jun 2007 10:44:36 -0500 Subject: [Linux-cluster] No failover occurs! Message-ID: <1182786276.12133.2.camel@kcm40202.kcmhq.org> I am familiar with Heartbeat and new to RHCS. Anyhow: I created a 2 node cluster with no quorum drive. added an ip address on the public eth added an ip address on the private eth added the script apache, with proper configs on both hosts The only way I can get all 3 to run is to reboot the nodes. Shouldn't it failover if a service fails to start? Thanks, Dave From andrewxwang at yahoo.com.tw Mon Jun 25 17:02:27 2007 From: andrewxwang at yahoo.com.tw (Andrew Wang) Date: Tue, 26 Jun 2007 01:02:27 +0800 (CST) Subject: [Linux-cluster] Fwd: Sun N1 Grid Engine Software and the Tokyo Institute of Technology Super Computer Grid (Sun BluePrint) Message-ID: <966992.58014.qm@web73503.mail.tp2.yahoo.com> The blueprint talks about how SGE 6 is used in the TSUBAME supercomputer (no. 7 on the June 2006 TOP500 list). It talks about the tight SGE-SSH integration in: Chapter 3: SSH for Sun N1 Grid Engine Software You can download the PDF at: http://www.sun.com/blueprints/0607/820-1695.html Andrew. ____________________________________________________________________________________ ?????????????????????? Yahoo!?????????? http://tw.mobile.yahoo.com/texts/mail.php From Christopher.Barry at qlogic.com Mon Jun 25 17:59:14 2007 From: Christopher.Barry at qlogic.com (Christopher Barry) Date: Mon, 25 Jun 2007 13:59:14 -0400 Subject: [Linux-cluster] Fwd: Sun N1 Grid Engine Software and the Tokyo Institute of Technology Super Computer Grid (Sun BluePrint) In-Reply-To: <966992.58014.qm@web73503.mail.tp2.yahoo.com> References: <966992.58014.qm@web73503.mail.tp2.yahoo.com> Message-ID: <1182794354.5240.16.camel@localhost> On Tue, 2007-06-26 at 01:02 +0800, Andrew Wang wrote: > The blueprint talks about how SGE 6 is used in the > TSUBAME supercomputer (no. 7 on the June 2006 TOP500 > list). > > It talks about the tight SGE-SSH integration in: > > Chapter 3: SSH for Sun N1 Grid Engine Software > > You can download the PDF at: > http://www.sun.com/blueprints/0607/820-1695.html > > Andrew. > > > First, do not cross-post. Second, SGE is not RHCS. Third, and the most interesting, the blueprint design on the cover of this PDF appears to be of a bathroom, with the text "To Drain -->" being the most prominent aspect of the blueprint! Not the best connotation... Doh! 
-C From andremachado at techforce.com.br Mon Jun 25 19:34:06 2007 From: andremachado at techforce.com.br (andremachado) Date: Mon, 25 Jun 2007 12:34:06 -0700 Subject: [Linux-cluster] how to GFS+iSCSI failover? Message-ID: Hello, I have 2 nodes (node_1, node_2) with 2 respective GFS (gfs_1, gfs_2) lvm exported trough iSCSI Enterprise Target to a third node node_3 with Open-iSCSI. What is the correct way to implement a cluster failover service to verify if node_1 becomes unavailable, umount/logout gfs_1 and then login/mount the gfs_2 from node_2 to the same mount point at node_3? Should I configure a failover domain restricted, prioritized node_1 and node_2, each with a private resource gfs, with THEIR local mount points or node_3 mount point? Should I configure a resource custom script? (it seems likely). How then monitor the node availability? could you explain the gfs, ip and script resources behaviour further than the RH docs? Regards. Andre Felipe Machado updated [0] in June 21 2007 [0] http://www.techforce.com.br/index.php/news/linux_blog/red_hat_cluster_suite_debian_etch From chris at cmiware.com Mon Jun 25 20:08:41 2007 From: chris at cmiware.com (Chris Harms) Date: Mon, 25 Jun 2007 15:08:41 -0500 Subject: [Linux-cluster] trouble reinstalling cluster suite 5 Message-ID: <468020C9.8080208@cmiware.com> After some trouble with my first go, I ended up uninstalling the cluster packages and attempting to reinstall. However, one of the nodes apparently can't forget some info about the previous installation. After reinstalling ricci and luci, and doing create cluster the nodes reboot as normal. Upon reboot, the offending node attempts to fence the other (I haven't even gotten to the point of setting that up yet) while the other reports ccsd[3414]: Unable to connect to cluster infrastructure after XYZ seconds. attempts to stop cman via service cman stop summarily fail and fenced and ccsd have to be killed before it will succeed. Any ideas on what or where it would be storing information about the old install, or how I should properly uninstall the software before starting over? Thanks in advance, Chris From chris at cmiware.com Mon Jun 25 22:13:10 2007 From: chris at cmiware.com (Chris Harms) Date: Mon, 25 Jun 2007 17:13:10 -0500 Subject: [Linux-cluster] trouble reinstalling cluster suite 5 (correction) In-Reply-To: <468020C9.8080208@cmiware.com> References: <468020C9.8080208@cmiware.com> Message-ID: <46803DF6.6040002@cmiware.com> something is apparently wrong with openais on a node. repeated removal and reinstall of this package yields no change and I get the following errors when trying to install this node in my cluster through Conga: openais[3488]: [MAIN ] Error reading CCS info, cannot start Jun 25 17:03:40 openais[3488]: [MAIN ] ?h?? Jun 25 17:03:40 openais[3488]: [MAIN ] AIS Executive exiting (-9). Jun 25 17:04:06 ccsd[3447]: Unable to connect to cluster infrastructure I'm sure the garbled log entry is not normal. Any ideas? Chris Harms wrote: > After some trouble with my first go, I ended up uninstalling the > cluster packages and attempting to reinstall. However, one of the > nodes apparently can't forget some info about the previous > installation. After reinstalling ricci and luci, and doing create > cluster the nodes reboot as normal. Upon reboot, the offending node > attempts to fence the other (I haven't even gotten to the point of > setting that up yet) while the other reports > > ccsd[3414]: Unable to connect to cluster infrastructure after XYZ > seconds. 
> > attempts to stop cman via service cman stop summarily fail and fenced > and ccsd have to be killed before it will succeed. > > Any ideas on what or where it would be storing information about the > old install, or how I should properly uninstall the software before > starting over? > > Thanks in advance, > Chris > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From chris at cmiware.com Mon Jun 25 22:59:53 2007 From: chris at cmiware.com (Chris Harms) Date: Mon, 25 Jun 2007 17:59:53 -0500 Subject: [Linux-cluster] trouble reinstalling cluster suite 5 (solved) In-Reply-To: <46803DF6.6040002@cmiware.com> References: <468020C9.8080208@cmiware.com> <46803DF6.6040002@cmiware.com> Message-ID: <468048E9.3080208@cmiware.com> For the sake of completeness it was a bad entry in /etc/hosts file Chris Harms wrote: > something is apparently wrong with openais on a node. repeated > removal and reinstall of this package yields no change and I get the > following errors when trying to install this node in my cluster > through Conga: > > openais[3488]: [MAIN ] Error reading CCS info, cannot start > Jun 25 17:03:40 openais[3488]: [MAIN ] ?h?? > Jun 25 17:03:40 openais[3488]: [MAIN ] AIS Executive exiting (-9). > Jun 25 17:04:06 ccsd[3447]: Unable to connect to cluster infrastructure > > I'm sure the garbled log entry is not normal. Any ideas? > > > > Chris Harms wrote: >> After some trouble with my first go, I ended up uninstalling the >> cluster packages and attempting to reinstall. However, one of the >> nodes apparently can't forget some info about the previous >> installation. After reinstalling ricci and luci, and doing create >> cluster the nodes reboot as normal. Upon reboot, the offending node >> attempts to fence the other (I haven't even gotten to the point of >> setting that up yet) while the other reports >> >> ccsd[3414]: Unable to connect to cluster infrastructure after XYZ >> seconds. >> >> attempts to stop cman via service cman stop summarily fail and fenced >> and ccsd have to be killed before it will succeed. >> >> Any ideas on what or where it would be storing information about the >> old install, or how I should properly uninstall the software before >> starting over? >> >> Thanks in advance, >> Chris >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From pcaulfie at redhat.com Tue Jun 26 07:32:35 2007 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 26 Jun 2007 08:32:35 +0100 Subject: [Linux-cluster] trouble reinstalling cluster suite 5 (solved) In-Reply-To: <468048E9.3080208@cmiware.com> References: <468020C9.8080208@cmiware.com> <46803DF6.6040002@cmiware.com> <468048E9.3080208@cmiware.com> Message-ID: <4680C113.1000207@redhat.com> Chris Harms wrote: > For the sake of completeness it was a bad entry in /etc/hosts file > Do you still have the /etc/hosts file that caused this crash? We have a bugzilla entry open for this bug, but no-one has manage to provide the offending file for me to fix it. -- Patrick Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SL4 ITE, UK. Registered in England and Wales under Company Registration No. 
3798903 From chris at cmiware.com Tue Jun 26 15:09:39 2007 From: chris at cmiware.com (Chris Harms) Date: Tue, 26 Jun 2007 10:09:39 -0500 Subject: [Linux-cluster] trouble reinstalling cluster suite 5 (solved) In-Reply-To: <4680C113.1000207@redhat.com> References: <468020C9.8080208@cmiware.com> <46803DF6.6040002@cmiware.com> <468048E9.3080208@cmiware.com> <4680C113.1000207@redhat.com> Message-ID: <46812C33.3070007@cmiware.com> I believe it contained the host.domain.tld of the machine in the 127.0.0.1 entry. Patrick Caulfield wrote: > Chris Harms wrote: > >> For the sake of completeness it was a bad entry in /etc/hosts file >> >> > > Do you still have the /etc/hosts file that caused this crash? > > We have a bugzilla entry open for this bug, but no-one has manage to provide the > offending file for me to fix it. > > From johnson.eric at gmail.com Tue Jun 26 18:47:46 2007 From: johnson.eric at gmail.com (eric johnson) Date: Tue, 26 Jun 2007 14:47:46 -0400 Subject: [Linux-cluster] GFS2 - Simple script that seems to make GFS2 sad Message-ID: Hi - I had experimented with GFS a few months back. I'm interested in it, but know that it isn't quite production worthy yet - at least not quite for my needs. Now that GFS2 is emerging, I thought I'd give it a quick try again just to see how things were shaping up. I've got a script that seems to make our installation sad... Take this script > cat foo.pl my $i=0; my $max=shift(@ARGV); my $d=shift(@ARGV); if (not defined $d) { $d=""; } foreach(my $i=0;$i<$max;$i++) { my $filename=sprintf("%s-%d%s",rand()*100000,$i,$d); open FOO, ">$filename"; for (my $j=0;$j<1500;++$j) { print FOO "This is fun!!\n"; } close FOO; } Assuming a mount at /gfs Queue up a good chunk of these - each working their own directory... cd /gfs mkdir foo1 cd foo1 perl -w ~/foo.pl 10000000 A & cd .. mkdir foo2 cd foo2 perl -w ~/foo.pl 10000000 A & cd .. mkdir foo3 cd foo3 perl -w ~/foo.pl 10000000 A & cd .. mkdir foo4 cd foo4 perl -w ~/foo.pl 10000000 A & cd .. mkdir foo5 cd foo5 perl -w ~/foo.pl 10000000 A & After a few minutes, the mount seems to disappear. > cd /gfs -bash: cd: /gfs: Input/output error It seems likely that I have something misconfigured... -Eric From johnson.eric at gmail.com Tue Jun 26 19:24:03 2007 From: johnson.eric at gmail.com (eric johnson) Date: Tue, 26 Jun 2007 15:24:03 -0400 Subject: [Linux-cluster] Re: GFS2 - Simple script that seems to make GFS2 sad In-Reply-To: References: Message-ID: A dmesg reveals this... GFS2: fsid=gfs:gfs.0: gfs2_delete_inode: 13 GFS2: fsid=gfs:gfs.0: fatal: assertion "gfs2_glock_is_held_excl(gl)" failed GFS2: fsid=gfs:gfs.0: function = glock_lo_after_commit, file = fs/gfs2/lops.c, line = 61 I'm a bit out of my element here so I may be highlighting a red herring. -Eric From chris at cmiware.com Tue Jun 26 23:16:44 2007 From: chris at cmiware.com (Chris Harms) Date: Tue, 26 Jun 2007 18:16:44 -0500 Subject: [Linux-cluster] manual fencing problem Message-ID: <46819E5C.1090802@cmiware.com> Trying to setup manual fencing for testing purposes in Conga gave me the following errors: agent "fence_manual" reports: failed: fence_manual no node name It appears this came up before: http://www.redhat.com/archives/linux-cluster/2006-May/msg00088.html but is still unresolved. 
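If I'm reading the fence_manual man page right, the agent has to be told which node it is fencing, so my guess is that the per-node device entry in cluster.conf needs a nodename attribute that Conga is not filling in for me. Something like the hand-edited sketch below (names are made up, and I have not verified this is exactly what the agent expects):

  <fencedevices>
    <fencedevice agent="fence_manual" name="human"/>
  </fencedevices>
  ...
  <clusternode name="node1.example.com" votes="1">
    <fence>
      <method name="1">
        <device name="human" nodename="node1.example.com"/>
      </method>
    </fence>
  </clusternode>

and then, after a node gets fenced, acknowledging it by hand on a surviving node with something like:

  fence_ack_manual -n node1.example.com

Is that the right direction, or is there a supported way to set this up from Conga?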
Cheers, Chris From rgrover1 at gmail.com Wed Jun 27 03:54:07 2007 From: rgrover1 at gmail.com (Rohit Grover) Date: Wed, 27 Jun 2007 15:54:07 +1200 Subject: [Linux-cluster] wishing to run GFS on iSCSI with redundancy Message-ID: Hello, We'd like to run GFS in a cluster serviced by a pool of iSCSI disks. We would like to use RAID to add redundancy to the storage, but there's literature on the net saying that linux's MD driver is not cluster safe. Since CLVM doesn't support RAID, what options do we have other than pairing the iSCSI disks with DRBD? thanks, Rohit Grover. From jprats at cesca.es Wed Jun 27 11:16:51 2007 From: jprats at cesca.es (Jordi Prats) Date: Wed, 27 Jun 2007 13:16:51 +0200 Subject: [Linux-cluster] crash GFS2 Message-ID: <46824723.8080704@cesca.es> Hi, I've got this crash using GFS2 and exporting it with NFS. Jordi Message from syslogd at urani at Wed Jun 27 06:57:50 2007 ... urani kernel: ------------[ cut here ]------------ Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... urani kernel: invalid opcode: 0000 [#1] Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... urani kernel: SMP Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... urani kernel: CPU: 0 Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... urani kernel: EIP: 0061:[] Not tainted VLI Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... urani kernel: EFLAGS: 00010282 (2.6.20-1.2952.fc6xen #1) Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... urani kernel: EIP is at gfs2_glock_nq+0xff/0x19d [gfs2] Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... urani kernel: eax: 00000020 ebx: eac37e88 ecx: ffffffff edx: f5416000 Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... urani kernel: esi: eac37d40 edi: dffc4f40 ebp: dffc4f40 esp: eac37d0c Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... urani kernel: ds: 007b es: 007b ss: 0069 Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... urani kernel: Process nfsd (pid: 22673, ti=eac37000 task=d2f371b0 task.ti=eac37000) Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... urani kernel: Stack: ee1f2324 00000002 00000001 e6b79000 00000000 eac37d40 dcb05b0c dcb05d04 Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... urani kernel: e6b79000 ee1e8c88 eac37d40 ee1dc4bd ffffffe4 eac37d40 eac37d40 dffc4f40 Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... urani kernel: 00005891 00000001 00000002 00000000 00000402 ee1e8c81 dcb05b0c ee1e8c3d Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... urani kernel: Call Trace: Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... urani kernel: [] gfs2_delete_inode+0x4b/0x14f [gfs2] Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... urani kernel: [] gfs2_holder_uninit+0xb/0x1b [gfs2] Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ... urani kernel: [] gfs2_delete_inode+0x44/0x14f [gfs2] Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ... urani kernel: [] gfs2_delete_inode+0x0/0x14f [gfs2] Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ... urani kernel: [] generic_delete_inode+0xa3/0x10b Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ... urani kernel: [] iput+0x60/0x62 Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ... urani kernel: [] gfs2_createi+0xcb8/0xcf2 [gfs2] Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ... urani kernel: [] __might_sleep+0x21/0xc1 Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ... 
urani kernel: [] gfs2_create+0x5d/0x101 [gfs2] Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ... urani kernel: [] gfs2_createi+0x5b/0xcf2 [gfs2] Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ... urani kernel: [] gfs2_glock_nq_num+0x3f/0x64 [gfs2] Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... urani kernel: [] vfs_create+0xca/0x134 Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... urani kernel: [] nfsd_create_v3+0x27f/0x468 [nfsd] Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... urani kernel: [] _spin_unlock_irqrestore+0x8/0x16 Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... urani kernel: [] nfsd3_proc_create+0x15e/0x16c [nfsd] Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... urani kernel: [] nfsd_dispatch+0xc5/0x180 [nfsd] Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... urani kernel: [] svcauth_unix_set_client+0x165/0x19a [sunrpc] Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... urani kernel: [] svc_process+0x355/0x610 [sunrpc] Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... urani kernel: [] nfsd+0x173/0x278 [nfsd] Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... urani kernel: [] nfsd+0x0/0x278 [nfsd] Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... urani kernel: [] kernel_thread_helper+0x7/0x10 Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... urani kernel: ======================= Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... urani kernel: Code: 0c c7 04 24 17 23 1f ee 89 44 24 04 e8 28 1b 24 d2 8b 47 2c 8b 57 14 89 44 24 08 89 54 24 04 c7 04 24 24 23 1f ee e8 0e 1b 24 d2 <0f> 0b eb fe 39 58 0c 74 0e 89 d0 8b 10 0f 18 02 90 39 c8 75 ef Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... urani kernel: EIP: [] gfs2_glock_nq+0xff/0x19d [gfs2] SS:ESP 0069:eac37d0c Message from syslogd at urani at Wed Jun 27 07:00:50 2007 ... urani kernel: Oops: 0000 [#2] Message from syslogd at urani at Wed Jun 27 07:00:50 2007 ... urani kernel: SMP Message from syslogd at urani at Wed Jun 27 07:00:50 2007 ... urani kernel: CPU: 0 Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... urani kernel: EIP: 0061:[] Not tainted VLI Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... urani kernel: EFLAGS: 00010297 (2.6.20-1.2952.fc6xen #1) Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... urani kernel: EIP is at gfs2_permission+0x36/0xc2 [gfs2] Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... urani kernel: eax: 00000000 ebx: 00000000 ecx: 00000000 edx: ea533730 Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... urani kernel: esi: dfe10078 edi: dfe10094 ebp: e8c74b8c esp: e6f4ae68 Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... urani kernel: ds: 007b es: 007b ss: 0069 Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... urani kernel: Process nfsd (pid: 22678, ti=e6f4a000 task=ea533730 task.ti=e6f4a000) Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... urani kernel: Stack: c0462732 00000003 0000000a 00000000 c042827c e6cea810 eb345400 e6f4aebc Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... urani kernel: 11270000 ee1e49fa ee2cc954 e8c74b8c ee1e8686 00000000 00000003 c046cb7d Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... urani kernel: 00000003 e8c74b8c c678d140 11270000 ee2cd34b d49de1a8 e6cea804 c678d140 Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... 
urani kernel: Call Trace: Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... urani kernel: [] kfree+0xe/0x6f Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... urani kernel: [] set_current_groups+0x154/0x160 Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... urani kernel: [] gfs2_decode_fh+0xe2/0xe9 [gfs2] Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... urani kernel: [] nfsd_acceptable+0x0/0xbf [nfsd] Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... urani kernel: [] gfs2_permission+0x0/0xc2 [gfs2] Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... urani kernel: [] permission+0x9e/0xdb Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... urani kernel: [] nfsd_permission+0x87/0xd5 [nfsd] Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... urani kernel: [] fh_verify+0x434/0x519 [nfsd] Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... urani kernel: [] nfsd_acceptable+0x0/0xbf [nfsd] Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... urani kernel: [] _spin_unlock_irqrestore+0x8/0x16 Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... urani kernel: [] nfsd3_proc_create+0xdc/0x16c [nfsd] Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... urani kernel: [] nfsd_cache_lookup+0x1c7/0x2ab [nfsd] Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... urani kernel: [] groups_alloc+0x42/0xae Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... urani kernel: [] nfsd_dispatch+0xc5/0x180 [nfsd] Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... urani kernel: [] svcauth_unix_set_client+0x165/0x19a [sunrpc] Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... urani kernel: [] svc_process+0x355/0x610 [sunrpc] Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... urani kernel: [] hypervisor_callback+0x46/0x50 Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... urani kernel: [] nfsd+0x173/0x278 [nfsd] Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... urani kernel: [] nfsd+0x0/0x278 [nfsd] Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... urani kernel: [] kernel_thread_helper+0x7/0x10 Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... urani kernel: ======================= Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... urani kernel: Code: 24 04 8b b0 f4 01 00 00 8d 7e 1c 89 f8 e8 69 96 42 d2 8b 4e 44 eb 14 8b 41 0c 65 8b 15 08 00 00 00 3b 82 a8 00 00 00 74 11 89 d9 <8b> 19 0f 18 03 90 8d 46 44 39 c1 75 df eb 41 89 f8 e8 20 96 42 Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... urani kernel: EIP: [] gfs2_permission+0x36/0xc2 [gfs2] SS:ESP 0069:e6f4ae68 -- ...................................................................... __ / / Jordi Prats C E / S / C A Dept. de Sistemes /_/ Centre de Supercomputaci? de Catalunya Gran Capit?, 2-4 (Edifici Nexus) ? 08034 Barcelona T. 93 205 6464 ? F. 93 205 6979 ? jprats at cesca.es ...................................................................... From jprats at cesca.es Wed Jun 27 11:19:57 2007 From: jprats at cesca.es (Jordi Prats) Date: Wed, 27 Jun 2007 13:19:57 +0200 Subject: [Linux-cluster] [Fwd: crash GFS2] Message-ID: <468247DD.1080802@cesca.es> Hi, On the console I've found a lot of this messages: Hope this helps! Jordi ======================= BUG: soft lockup detected on CPU#0! 
[] softlockup_tick+0xaa/0xc1 [] timer_interrupt+0x552/0x59f [] handle_level_irq+0xd1/0xdf [] _spin_lock_irqsave+0x12/0x17 [] __add_entropy_words+0x56/0x18b [] _spin_unlock_irqrestore+0x8/0x16 [] handle_IRQ_event+0x1e/0x47 [] handle_level_irq+0x93/0xdf [] handle_level_irq+0x0/0xdf [] do_IRQ+0xb5/0xdb [] evtchn_do_upcall+0x5f/0x97 [] hypervisor_callback+0x46/0x50 [] simple_strtoul+0xab/0xc5 [] _raw_spin_lock+0x67/0xd9 [] gfs2_permission+0x1d/0xc2 [gfs2] [] kfree+0xe/0x6f [] set_current_groups+0x154/0x160 [] gfs2_decode_fh+0xe2/0xe9 [gfs2] [] nfsd_acceptable+0x0/0xbf [nfsd] [] gfs2_permission+0x0/0xc2 [gfs2] [] permission+0x9e/0xdb [] nfsd_permission+0x87/0xd5 [nfsd] [] fh_verify+0x434/0x519 [nfsd] [] nfsd_acceptable+0x0/0xbf [nfsd] [] _spin_unlock_irqrestore+0x8/0x16 [] nfsd3_proc_create+0xdc/0x16c [nfsd] [] nfsd_cache_lookup+0x1c7/0x2ab [nfsd] [] groups_alloc+0x42/0xae [] nfsd_dispatch+0xc5/0x180 [nfsd] [] svcauth_unix_set_client+0x165/0x19a [sunrpc] [] svc_process+0x355/0x610 [sunrpc] [] hypervisor_callback+0x46/0x50 [] nfsd+0x173/0x278 [nfsd] [] nfsd+0x0/0x278 [nfsd] [] kernel_thread_helper+0x7/0x10 ======================= -- ...................................................................... __ / / Jordi Prats C E / S / C A Dept. de Sistemes /_/ Centre de Supercomputaci? de Catalunya Gran Capit?, 2-4 (Edifici Nexus) ? 08034 Barcelona T. 93 205 6464 ? F. 93 205 6979 ? jprats at cesca.es ...................................................................... -------------- next part -------------- An embedded message was scrubbed... From: Jordi Prats Subject: crash GFS2 Date: Wed, 27 Jun 2007 13:16:51 +0200 Size: 10208 URL: From brian at bcons.com Wed Jun 27 13:11:25 2007 From: brian at bcons.com (Brian C. O'Berry) Date: Wed, 27 Jun 2007 09:11:25 -0400 Subject: [Linux-cluster] Possible to Share Berkeley DB Environment via GFS? Message-ID: <468261FD.1090906@bcons.com> Is it possible to share a Berkeley DB environment, via a common GFS filesystem, between Concurrent Data Store applications running on different systems? In reference to *remote* filesystems, the Berkeley DB Reference Guide states that "Remote filesystems rarely support mapping files into process memory, and even more rarely support correct semantics for mutexes after the attempt succeeds. For this reason, we strongly recommend that the database environment directory reside in a local filesystem... For remote filesystems that do allow system files to be mapped into process memory, home directories accessed via remote filesystems cannot be used simultaneously from multiple clients. None of the commercial remote filesystems available today implement coherent, distributed shared memory for remote-mounted files. As a result, different machines will see different versions of these shared regions, and the system behavior is undefined." Based on that, I'd expect sharing the environment (home) directory between systems to be infeasible, but I know little about GFS. Can someone verify one way or the other whether such sharing is possible? Thanks, Brian From wcheng at redhat.com Wed Jun 27 14:33:02 2007 From: wcheng at redhat.com (Wendy Cheng) Date: Wed, 27 Jun 2007 10:33:02 -0400 Subject: [Linux-cluster] [Fwd: crash GFS2] In-Reply-To: <468247DD.1080802@cesca.es> References: <468247DD.1080802@cesca.es> Message-ID: <4682751E.8050907@redhat.com> Jordi Prats wrote: > Hi, > On the console I've found a lot of this messages: We'll have few critical GFS2-NFS fixes ready late today via Red Hat bugzilla 243136. 
If you like to help testing this out, let us know your kernel version... -- Wendy > > ======================= > BUG: soft lockup detected on CPU#0! > [] softlockup_tick+0xaa/0xc1 > [] timer_interrupt+0x552/0x59f > [] handle_level_irq+0xd1/0xdf > [] _spin_lock_irqsave+0x12/0x17 > [] __add_entropy_words+0x56/0x18b > [] _spin_unlock_irqrestore+0x8/0x16 > [] handle_IRQ_event+0x1e/0x47 > [] handle_level_irq+0x93/0xdf > [] handle_level_irq+0x0/0xdf > [] do_IRQ+0xb5/0xdb > [] evtchn_do_upcall+0x5f/0x97 > [] hypervisor_callback+0x46/0x50 > [] simple_strtoul+0xab/0xc5 > [] _raw_spin_lock+0x67/0xd9 > [] gfs2_permission+0x1d/0xc2 [gfs2] > [] kfree+0xe/0x6f > [] set_current_groups+0x154/0x160 > [] gfs2_decode_fh+0xe2/0xe9 [gfs2] > [] nfsd_acceptable+0x0/0xbf [nfsd] > [] gfs2_permission+0x0/0xc2 [gfs2] > [] permission+0x9e/0xdb > [] nfsd_permission+0x87/0xd5 [nfsd] > [] fh_verify+0x434/0x519 [nfsd] > [] nfsd_acceptable+0x0/0xbf [nfsd] > [] _spin_unlock_irqrestore+0x8/0x16 > [] nfsd3_proc_create+0xdc/0x16c [nfsd] > [] nfsd_cache_lookup+0x1c7/0x2ab [nfsd] > [] groups_alloc+0x42/0xae > [] nfsd_dispatch+0xc5/0x180 [nfsd] > [] svcauth_unix_set_client+0x165/0x19a [sunrpc] > [] svc_process+0x355/0x610 [sunrpc] > [] hypervisor_callback+0x46/0x50 > [] nfsd+0x173/0x278 [nfsd] > [] nfsd+0x0/0x278 [nfsd] > [] kernel_thread_helper+0x7/0x10 > ======================= > > > ------------------------------------------------------------------------ > > Subject: > crash GFS2 > From: > Jordi Prats > Date: > Wed, 27 Jun 2007 13:16:51 +0200 > To: > linux-cluster at redhat.com > > To: > linux-cluster at redhat.com > > > Hi, > I've got this crash using GFS2 and exporting it with NFS. > > Jordi > > Message from syslogd at urani at Wed Jun 27 06:57:50 2007 ... > urani kernel: ------------[ cut here ]------------ > > Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... > urani kernel: invalid opcode: 0000 [#1] > > Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... > urani kernel: SMP > > Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... > urani kernel: CPU: 0 > > Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... > urani kernel: EIP: 0061:[] Not tainted VLI > > Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... > urani kernel: EFLAGS: 00010282 (2.6.20-1.2952.fc6xen #1) > > Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... > urani kernel: EIP is at gfs2_glock_nq+0xff/0x19d [gfs2] > > Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... > urani kernel: eax: 00000020 ebx: eac37e88 ecx: ffffffff edx: > f5416000 > > Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... > urani kernel: esi: eac37d40 edi: dffc4f40 ebp: dffc4f40 esp: > eac37d0c > > Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... > urani kernel: ds: 007b es: 007b ss: 0069 > > Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... > urani kernel: Process nfsd (pid: 22673, ti=eac37000 task=d2f371b0 > task.ti=eac37000) > > Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... > urani kernel: Stack: ee1f2324 00000002 00000001 e6b79000 00000000 > eac37d40 dcb05b0c dcb05d04 > > Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... > urani kernel: e6b79000 ee1e8c88 eac37d40 ee1dc4bd ffffffe4 > eac37d40 eac37d40 dffc4f40 > > Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... > urani kernel: 00005891 00000001 00000002 00000000 00000402 > ee1e8c81 dcb05b0c ee1e8c3d > > Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... 
> urani kernel: Call Trace: > > Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... > urani kernel: [] gfs2_delete_inode+0x4b/0x14f [gfs2] > > Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... > urani kernel: [] gfs2_holder_uninit+0xb/0x1b [gfs2] > > Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ... > urani kernel: [] gfs2_delete_inode+0x44/0x14f [gfs2] > > Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ... > urani kernel: [] gfs2_delete_inode+0x0/0x14f [gfs2] > > Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ... > urani kernel: [] generic_delete_inode+0xa3/0x10b > > Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ... > urani kernel: [] iput+0x60/0x62 > > Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ... > urani kernel: [] gfs2_createi+0xcb8/0xcf2 [gfs2] > > Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ... > urani kernel: [] __might_sleep+0x21/0xc1 > > Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ... > urani kernel: [] gfs2_create+0x5d/0x101 [gfs2] > > Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ... > urani kernel: [] gfs2_createi+0x5b/0xcf2 [gfs2] > > Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ... > urani kernel: [] gfs2_glock_nq_num+0x3f/0x64 [gfs2] > > Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... > urani kernel: [] vfs_create+0xca/0x134 > > Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... > urani kernel: [] nfsd_create_v3+0x27f/0x468 [nfsd] > > Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... > urani kernel: [] _spin_unlock_irqrestore+0x8/0x16 > > Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... > urani kernel: [] nfsd3_proc_create+0x15e/0x16c [nfsd] > > Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... > urani kernel: [] nfsd_dispatch+0xc5/0x180 [nfsd] > > Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... > urani kernel: [] svcauth_unix_set_client+0x165/0x19a [sunrpc] > > Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... > urani kernel: [] svc_process+0x355/0x610 [sunrpc] > > Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... > urani kernel: [] nfsd+0x173/0x278 [nfsd] > > Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... > urani kernel: [] nfsd+0x0/0x278 [nfsd] > > Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... > urani kernel: [] kernel_thread_helper+0x7/0x10 > > Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... > urani kernel: ======================= > > Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... > urani kernel: Code: 0c c7 04 24 17 23 1f ee 89 44 24 04 e8 28 1b 24 d2 > 8b 47 2c 8b 57 14 89 44 24 08 89 54 24 04 c7 04 24 24 23 1f ee e8 0e > 1b 24 d2 <0f> 0b eb fe 39 58 0c 74 0e 89 d0 8b 10 0f 18 02 90 39 c8 75 ef > > Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... > urani kernel: EIP: [] gfs2_glock_nq+0xff/0x19d [gfs2] SS:ESP > 0069:eac37d0c > > Message from syslogd at urani at Wed Jun 27 07:00:50 2007 ... > urani kernel: Oops: 0000 [#2] > > Message from syslogd at urani at Wed Jun 27 07:00:50 2007 ... > urani kernel: SMP > > Message from syslogd at urani at Wed Jun 27 07:00:50 2007 ... > urani kernel: CPU: 0 > > Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... > urani kernel: EIP: 0061:[] Not tainted VLI > > Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... 
> urani kernel: EFLAGS: 00010297 (2.6.20-1.2952.fc6xen #1) > > Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... > urani kernel: EIP is at gfs2_permission+0x36/0xc2 [gfs2] > > Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... > urani kernel: eax: 00000000 ebx: 00000000 ecx: 00000000 edx: > ea533730 > > Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... > urani kernel: esi: dfe10078 edi: dfe10094 ebp: e8c74b8c esp: > e6f4ae68 > > Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... > urani kernel: ds: 007b es: 007b ss: 0069 > > Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... > urani kernel: Process nfsd (pid: 22678, ti=e6f4a000 task=ea533730 > task.ti=e6f4a000) > > Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... > urani kernel: Stack: c0462732 00000003 0000000a 00000000 c042827c > e6cea810 eb345400 e6f4aebc > > Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... > urani kernel: 11270000 ee1e49fa ee2cc954 e8c74b8c ee1e8686 > 00000000 00000003 c046cb7d > > Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... > urani kernel: 00000003 e8c74b8c c678d140 11270000 ee2cd34b > d49de1a8 e6cea804 c678d140 > > Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... > urani kernel: Call Trace: > > Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... > urani kernel: [] kfree+0xe/0x6f > > Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... > urani kernel: [] set_current_groups+0x154/0x160 > > Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... > urani kernel: [] gfs2_decode_fh+0xe2/0xe9 [gfs2] > > Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... > urani kernel: [] nfsd_acceptable+0x0/0xbf [nfsd] > > Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... > urani kernel: [] gfs2_permission+0x0/0xc2 [gfs2] > > Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... > urani kernel: [] permission+0x9e/0xdb > > Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... > urani kernel: [] nfsd_permission+0x87/0xd5 [nfsd] > > Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... > urani kernel: [] fh_verify+0x434/0x519 [nfsd] > > Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... > urani kernel: [] nfsd_acceptable+0x0/0xbf [nfsd] > > Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... > urani kernel: [] _spin_unlock_irqrestore+0x8/0x16 > > Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... > urani kernel: [] nfsd3_proc_create+0xdc/0x16c [nfsd] > > Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... > urani kernel: [] nfsd_cache_lookup+0x1c7/0x2ab [nfsd] > > Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... > urani kernel: [] groups_alloc+0x42/0xae > > Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... > urani kernel: [] nfsd_dispatch+0xc5/0x180 [nfsd] > > Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... > urani kernel: [] svcauth_unix_set_client+0x165/0x19a [sunrpc] > > Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... > urani kernel: [] svc_process+0x355/0x610 [sunrpc] > > Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... > urani kernel: [] hypervisor_callback+0x46/0x50 > > Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... > urani kernel: [] nfsd+0x173/0x278 [nfsd] > > Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... > urani kernel: [] nfsd+0x0/0x278 [nfsd] > > Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... 
> urani kernel: [] kernel_thread_helper+0x7/0x10 > > Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... > urani kernel: ======================= > > Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... > urani kernel: Code: 24 04 8b b0 f4 01 00 00 8d 7e 1c 89 f8 e8 69 96 42 > d2 8b 4e 44 eb 14 8b 41 0c 65 8b 15 08 00 00 00 3b 82 a8 00 00 00 74 > 11 89 d9 <8b> 19 0f 18 03 90 8d 46 44 39 c1 75 df eb 41 89 f8 e8 20 96 42 > > Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... > urani kernel: EIP: [] gfs2_permission+0x36/0xc2 [gfs2] > SS:ESP 0069:e6f4ae68 > > ------------------------------------------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From laule75 at yahoo.fr Wed Jun 27 15:42:17 2007 From: laule75 at yahoo.fr (laurent) Date: Wed, 27 Jun 2007 15:42:17 +0000 (GMT) Subject: [Linux-cluster] Redhat cluster limits Message-ID: <200548.29516.qm@web26510.mail.ukl.yahoo.com> Hello, I'm trying to test extensively redhat Cluster Suite and GFS in order to replace eventually Veritas cluster products. Unfortunately, we've experienced problems when we try to play with a big number of filesystems and services (around 100) ==> high CPU consumption to monitor resources, memory allocation problems when creating lot of logical volumes, etc. ; Have you ever experienced such kind of issues ? I would be glad to ear any feedback from people who have carried out some stress tests on the redhat cluster suites (maximum number of services/resources created, max number of volume group and logical volumes, etc etc...) Kind regards Laurent _____________________________________________________________________________ Ne gardez plus qu'une seule adresse mail ! Copiez vos mails vers Yahoo! Mail -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpeterso at redhat.com Wed Jun 27 16:29:20 2007 From: rpeterso at redhat.com (Bob Peterson) Date: Wed, 27 Jun 2007 11:29:20 -0500 Subject: [Linux-cluster] Redhat cluster limits In-Reply-To: <200548.29516.qm@web26510.mail.ukl.yahoo.com> References: <200548.29516.qm@web26510.mail.ukl.yahoo.com> Message-ID: <1182961760.11507.27.camel@technetium.msp.redhat.com> On Wed, 2007-06-27 at 15:42 +0000, laurent wrote: > Hello, > > I'm trying to test extensively redhat Cluster Suite and GFS in order > to replace eventually Veritas cluster products. > > Unfortunately, we've experienced problems when we try to play with a > big number of filesystems and services (around 100) ==> high CPU > consumption to monitor resources, memory allocation problems when > creating lot of logical volumes, etc. ; > > Have you ever experienced such kind of issues ? > > I would be glad to ear any feedback from people who have carried out > some stress tests on the redhat cluster suites (maximum number of > services/resources created, max number of volume group and logical > volumes, etc etc...) > > Kind regards > > Laurent Hi Laurent, What version of cluster suite/RHEL/Centos/etc., are you using? I suggest opening up bugzillas against each of the problems you've seen. I can't speak for Red Hat or any other developer, but my take is this: Rather than ask if cluster suite can do the job, let's get the problems solved so it can do the job. If there's a problem, we want to hear about them. If we need to fix the code, that makes it better for everyone to use. 
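When you file them, it helps a lot to include the exact package versions and a rough idea of the scale you are testing at -- something along these lines (adjust the package names to whatever is actually installed on your nodes):

  cat /etc/redhat-release
  rpm -q ccs cman rgmanager lvm2-cluster GFS GFS-kernel
  # rough idea of scale
  vgs --noheadings | wc -l
  lvs --noheadings | wc -l
  clustat

plus the relevant chunks of /var/log/messages from around the time the problems show up.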
Regards, Bob Peterson Red Hat Cluster Suite From jprats at cesca.es Wed Jun 27 18:38:21 2007 From: jprats at cesca.es (Jordi Prats) Date: Wed, 27 Jun 2007 20:38:21 +0200 Subject: [Linux-cluster] [Fwd: crash GFS2] In-Reply-To: <4682751E.8050907@redhat.com> References: <468247DD.1080802@cesca.es> <4682751E.8050907@redhat.com> Message-ID: <4682AE9D.1070103@cesca.es> Hi, I'm using a xen guest kernel 2.6.20 on a Fedora Core 6 My uname is: Linux urani 2.6.20-1.2952.fc6xen #1 SMP Wed May 16 19:19:04 EDT 2007 i686 i686 i386 GNU/Linux Jordi Wendy Cheng wrote: > Jordi Prats wrote: >> Hi, >> On the console I've found a lot of this messages: > > We'll have few critical GFS2-NFS fixes ready late today via Red Hat > bugzilla 243136. If you like to help testing this out, let us know your > kernel version... > > -- Wendy >> >> ======================= >> BUG: soft lockup detected on CPU#0! >> [] softlockup_tick+0xaa/0xc1 >> [] timer_interrupt+0x552/0x59f >> [] handle_level_irq+0xd1/0xdf >> [] _spin_lock_irqsave+0x12/0x17 >> [] __add_entropy_words+0x56/0x18b >> [] _spin_unlock_irqrestore+0x8/0x16 >> [] handle_IRQ_event+0x1e/0x47 >> [] handle_level_irq+0x93/0xdf >> [] handle_level_irq+0x0/0xdf >> [] do_IRQ+0xb5/0xdb >> [] evtchn_do_upcall+0x5f/0x97 >> [] hypervisor_callback+0x46/0x50 >> [] simple_strtoul+0xab/0xc5 >> [] _raw_spin_lock+0x67/0xd9 >> [] gfs2_permission+0x1d/0xc2 [gfs2] >> [] kfree+0xe/0x6f >> [] set_current_groups+0x154/0x160 >> [] gfs2_decode_fh+0xe2/0xe9 [gfs2] >> [] nfsd_acceptable+0x0/0xbf [nfsd] >> [] gfs2_permission+0x0/0xc2 [gfs2] >> [] permission+0x9e/0xdb >> [] nfsd_permission+0x87/0xd5 [nfsd] >> [] fh_verify+0x434/0x519 [nfsd] >> [] nfsd_acceptable+0x0/0xbf [nfsd] >> [] _spin_unlock_irqrestore+0x8/0x16 >> [] nfsd3_proc_create+0xdc/0x16c [nfsd] >> [] nfsd_cache_lookup+0x1c7/0x2ab [nfsd] >> [] groups_alloc+0x42/0xae >> [] nfsd_dispatch+0xc5/0x180 [nfsd] >> [] svcauth_unix_set_client+0x165/0x19a [sunrpc] >> [] svc_process+0x355/0x610 [sunrpc] >> [] hypervisor_callback+0x46/0x50 >> [] nfsd+0x173/0x278 [nfsd] >> [] nfsd+0x0/0x278 [nfsd] >> [] kernel_thread_helper+0x7/0x10 >> ======================= >> >> >> ------------------------------------------------------------------------ >> >> Subject: >> crash GFS2 >> From: >> Jordi Prats >> Date: >> Wed, 27 Jun 2007 13:16:51 +0200 >> To: >> linux-cluster at redhat.com >> >> To: >> linux-cluster at redhat.com >> >> >> Hi, >> I've got this crash using GFS2 and exporting it with NFS. >> >> Jordi >> >> Message from syslogd at urani at Wed Jun 27 06:57:50 2007 ... >> urani kernel: ------------[ cut here ]------------ >> >> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... >> urani kernel: invalid opcode: 0000 [#1] >> >> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... >> urani kernel: SMP >> >> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... >> urani kernel: CPU: 0 >> >> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... >> urani kernel: EIP: 0061:[] Not tainted VLI >> >> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... >> urani kernel: EFLAGS: 00010282 (2.6.20-1.2952.fc6xen #1) >> >> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... >> urani kernel: EIP is at gfs2_glock_nq+0xff/0x19d [gfs2] >> >> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... >> urani kernel: eax: 00000020 ebx: eac37e88 ecx: ffffffff edx: >> f5416000 >> >> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... 
>> urani kernel: esi: eac37d40 edi: dffc4f40 ebp: dffc4f40 esp: >> eac37d0c >> >> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... >> urani kernel: ds: 007b es: 007b ss: 0069 >> >> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... >> urani kernel: Process nfsd (pid: 22673, ti=eac37000 task=d2f371b0 >> task.ti=eac37000) >> >> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... >> urani kernel: Stack: ee1f2324 00000002 00000001 e6b79000 00000000 >> eac37d40 dcb05b0c dcb05d04 >> >> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... >> urani kernel: e6b79000 ee1e8c88 eac37d40 ee1dc4bd ffffffe4 >> eac37d40 eac37d40 dffc4f40 >> >> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... >> urani kernel: 00005891 00000001 00000002 00000000 00000402 >> ee1e8c81 dcb05b0c ee1e8c3d >> >> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... >> urani kernel: Call Trace: >> >> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... >> urani kernel: [] gfs2_delete_inode+0x4b/0x14f [gfs2] >> >> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ... >> urani kernel: [] gfs2_holder_uninit+0xb/0x1b [gfs2] >> >> Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ... >> urani kernel: [] gfs2_delete_inode+0x44/0x14f [gfs2] >> >> Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ... >> urani kernel: [] gfs2_delete_inode+0x0/0x14f [gfs2] >> >> Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ... >> urani kernel: [] generic_delete_inode+0xa3/0x10b >> >> Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ... >> urani kernel: [] iput+0x60/0x62 >> >> Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ... >> urani kernel: [] gfs2_createi+0xcb8/0xcf2 [gfs2] >> >> Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ... >> urani kernel: [] __might_sleep+0x21/0xc1 >> >> Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ... >> urani kernel: [] gfs2_create+0x5d/0x101 [gfs2] >> >> Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ... >> urani kernel: [] gfs2_createi+0x5b/0xcf2 [gfs2] >> >> Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ... >> urani kernel: [] gfs2_glock_nq_num+0x3f/0x64 [gfs2] >> >> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... >> urani kernel: [] vfs_create+0xca/0x134 >> >> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... >> urani kernel: [] nfsd_create_v3+0x27f/0x468 [nfsd] >> >> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... >> urani kernel: [] _spin_unlock_irqrestore+0x8/0x16 >> >> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... >> urani kernel: [] nfsd3_proc_create+0x15e/0x16c [nfsd] >> >> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... >> urani kernel: [] nfsd_dispatch+0xc5/0x180 [nfsd] >> >> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... >> urani kernel: [] svcauth_unix_set_client+0x165/0x19a [sunrpc] >> >> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... >> urani kernel: [] svc_process+0x355/0x610 [sunrpc] >> >> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... >> urani kernel: [] nfsd+0x173/0x278 [nfsd] >> >> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... >> urani kernel: [] nfsd+0x0/0x278 [nfsd] >> >> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... >> urani kernel: [] kernel_thread_helper+0x7/0x10 >> >> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... 
>> urani kernel: ======================= >> >> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... >> urani kernel: Code: 0c c7 04 24 17 23 1f ee 89 44 24 04 e8 28 1b 24 d2 >> 8b 47 2c 8b 57 14 89 44 24 08 89 54 24 04 c7 04 24 24 23 1f ee e8 0e >> 1b 24 d2 <0f> 0b eb fe 39 58 0c 74 0e 89 d0 8b 10 0f 18 02 90 39 c8 75 ef >> >> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ... >> urani kernel: EIP: [] gfs2_glock_nq+0xff/0x19d [gfs2] SS:ESP >> 0069:eac37d0c >> >> Message from syslogd at urani at Wed Jun 27 07:00:50 2007 ... >> urani kernel: Oops: 0000 [#2] >> >> Message from syslogd at urani at Wed Jun 27 07:00:50 2007 ... >> urani kernel: SMP >> >> Message from syslogd at urani at Wed Jun 27 07:00:50 2007 ... >> urani kernel: CPU: 0 >> >> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... >> urani kernel: EIP: 0061:[] Not tainted VLI >> >> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... >> urani kernel: EFLAGS: 00010297 (2.6.20-1.2952.fc6xen #1) >> >> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... >> urani kernel: EIP is at gfs2_permission+0x36/0xc2 [gfs2] >> >> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... >> urani kernel: eax: 00000000 ebx: 00000000 ecx: 00000000 edx: >> ea533730 >> >> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... >> urani kernel: esi: dfe10078 edi: dfe10094 ebp: e8c74b8c esp: >> e6f4ae68 >> >> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... >> urani kernel: ds: 007b es: 007b ss: 0069 >> >> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... >> urani kernel: Process nfsd (pid: 22678, ti=e6f4a000 task=ea533730 >> task.ti=e6f4a000) >> >> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... >> urani kernel: Stack: c0462732 00000003 0000000a 00000000 c042827c >> e6cea810 eb345400 e6f4aebc >> >> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... >> urani kernel: 11270000 ee1e49fa ee2cc954 e8c74b8c ee1e8686 >> 00000000 00000003 c046cb7d >> >> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... >> urani kernel: 00000003 e8c74b8c c678d140 11270000 ee2cd34b >> d49de1a8 e6cea804 c678d140 >> >> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... >> urani kernel: Call Trace: >> >> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... >> urani kernel: [] kfree+0xe/0x6f >> >> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... >> urani kernel: [] set_current_groups+0x154/0x160 >> >> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... >> urani kernel: [] gfs2_decode_fh+0xe2/0xe9 [gfs2] >> >> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... >> urani kernel: [] nfsd_acceptable+0x0/0xbf [nfsd] >> >> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... >> urani kernel: [] gfs2_permission+0x0/0xc2 [gfs2] >> >> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... >> urani kernel: [] permission+0x9e/0xdb >> >> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... >> urani kernel: [] nfsd_permission+0x87/0xd5 [nfsd] >> >> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ... >> urani kernel: [] fh_verify+0x434/0x519 [nfsd] >> >> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... >> urani kernel: [] nfsd_acceptable+0x0/0xbf [nfsd] >> >> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... >> urani kernel: [] _spin_unlock_irqrestore+0x8/0x16 >> >> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... 
>> urani kernel: [] nfsd3_proc_create+0xdc/0x16c [nfsd] >> >> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... >> urani kernel: [] nfsd_cache_lookup+0x1c7/0x2ab [nfsd] >> >> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... >> urani kernel: [] groups_alloc+0x42/0xae >> >> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... >> urani kernel: [] nfsd_dispatch+0xc5/0x180 [nfsd] >> >> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... >> urani kernel: [] svcauth_unix_set_client+0x165/0x19a [sunrpc] >> >> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... >> urani kernel: [] svc_process+0x355/0x610 [sunrpc] >> >> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... >> urani kernel: [] hypervisor_callback+0x46/0x50 >> >> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... >> urani kernel: [] nfsd+0x173/0x278 [nfsd] >> >> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... >> urani kernel: [] nfsd+0x0/0x278 [nfsd] >> >> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... >> urani kernel: [] kernel_thread_helper+0x7/0x10 >> >> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... >> urani kernel: ======================= >> >> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... >> urani kernel: Code: 24 04 8b b0 f4 01 00 00 8d 7e 1c 89 f8 e8 69 96 42 >> d2 8b 4e 44 eb 14 8b 41 0c 65 8b 15 08 00 00 00 3b 82 a8 00 00 00 74 >> 11 89 d9 <8b> 19 0f 18 03 90 8d 46 44 39 c1 75 df eb 41 89 f8 e8 20 96 42 >> >> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ... >> urani kernel: EIP: [] gfs2_permission+0x36/0xc2 [gfs2] >> SS:ESP 0069:e6f4ae68 >> >> ------------------------------------------------------------------------ >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- ...................................................................... __ / / Jordi Prats Catal? C E / S / C A Departament de Sistemes /_/ Centre de Supercomputaci? de Catalunya Gran Capit?, 2-4 (Edifici Nexus) ? 08034 Barcelona T. 93 205 6464 ? F. 93 205 6979 ? jprats at cesca.es ...................................................................... pgp:0x5D0D1321 ...................................................................... From nrbwpi at gmail.com Wed Jun 27 22:35:57 2007 From: nrbwpi at gmail.com (nrbwpi at gmail.com) Date: Wed, 27 Jun 2007 18:35:57 -0400 Subject: [Linux-cluster] RHEL5 GFS2 - 2 node - node fenced when writing In-Reply-To: <1181201333.25918.229.camel@quoit> References: <6eee34430706061627p69a8080lf39168322926a769@mail.gmail.com> <1181201333.25918.229.camel@quoit> Message-ID: <6eee34430706271535jd4b13c0hf5b99219db4db6ec@mail.gmail.com> Thanks for your reply I switched the hardware over to Fedora core 6, brought the system up2date, and configured it the same as before with GFS2. Uname returns the following kernel string: "Linux fu2 2.6.20-1.2952.fc6 #1 SMP Wed May 16 18:18:22 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux". The same fencing occurred after several hours of writing zeros to the volume with dd in 250MB files. This time, however, I noticed a kernel panic on the fenced node. The kernel output in /var/log/messages is below. Could this be a hardware configuration issue, or a bug in the kernel? 
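For reference, the load was roughly the following, run on both nodes at the same time against different LUNs (paths and loop count here are approximate, not the exact script):

  # write 250MB files of zeros in a loop onto the GFS2 mount
  i=0
  while [ $i -lt 100000 ]; do
    dd if=/dev/zero of=/gfs2/lun1/zero.$i bs=1M count=250
    i=$((i+1))
  done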
##################################### Kernel panic ##################################### Jun 26 10:00:41 fu2 kernel: ------------[ cut here ]------------ Jun 26 10:00:41 fu2 kernel: kernel BUG at lib/list_debug.c:67! Jun 26 10:00:41 fu2 kernel: invalid opcode: 0000 [1] SMP Jun 26 10:00:41 fu2 kernel: last sysfs file: /devices/pci0000:00/0000:00: 02.0/0000:06:00.0/0000:07:00.0/0000:08:00.0/0000:09:00.0/irq Jun 26 10:00:41 fu2 kernel: CPU 7Jun 26 10:00:41 fu2 kernel: Modules linked in: lock_dlm gfs2 dlm configfs ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 xt_state nf_conntrack nfnetlink ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge autofs4 hidp xfs rfcomm l2cap bluetooth sunrpc ipv6 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi dm_multipath video sbs i2c_ec i2c_core dock button battery asus_acpi backlight ac parport_pc lp parport sg ata_piix libata pcspkr bnx2 ide_cd cdrom serio_raw dm_snapshot dm_zero dm_mirror dm_mod lpfc scsi_transport_fc shpchp megaraid_sas sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd Jun 26 10:00:41 fu2 kernel: Pid: 4142, comm: gfs2_logd Not tainted 2.6.20-1.2952.fc6 #1 Jun 26 10:00:41 fu2 kernel: RIP: 0010:[] [] list_del+0x21/0x5b Jun 26 10:00:41 fu2 kernel: RSP: 0018:ffff81011e247d00 EFLAGS: 00010082 Jun 26 10:00:41 fu2 kernel: RAX: 0000000000000058 RBX: ffff81011aa40000 RCX: ffffffff8057fc58 Jun 26 10:00:41 fu2 kernel: RDX: ffffffff8057fc58 RSI: 0000000000000000 RDI: ffffffff8057fc40 Jun 26 10:00:41 fu2 kernel: RBP: ffff81012da3f7c0 R08: ffffffff8057fc58 R09: 0000000000000001 Jun 26 10:00:41 fu2 kernel: R10: 0000000000000000 R11: ffff81012fd9d0c0 R12: ffff81011aa40f70 Jun 26 10:00:41 fu2 kernel: R13: ffff810123fb1a00 R14: ffff810123fb05d8 R15: 0000000000000036 Jun 26 10:00:41 fu2 kernel: FS: 0000000000000000(0000) GS:ffff81012fdb47c0(0000) knlGS:0000000000000000 Jun 26 10:00:41 fu2 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b Jun 26 10:00:41 fu2 kernel: CR2: 00002aaaadfbe008 CR3: 0000000042c20000 CR4: 00000000000006e0 Jun 26 10:00:41 fu2 kernel: Process gfs2_logd (pid: 4142, threadinfo ffff81011e246000, task ffff810121d35800) Jun 26 10:00:41 fu2 kernel: Stack: ffff810123fb1a00 ffffffff802cc6e7 0000003c00000000 ffff81012da3f7c0 Jun 26 10:00:41 fu2 kernel: 000000000000003c ffff810123fb0400 0000000000000000 ffff810123fb1a00 Jun 26 10:00:41 fu2 kernel: ffff81012da3f800 ffffffff802cc8be ffff810123fb07e8 ffff810123fb0400 Jun 26 10:00:41 fu2 kernel: Call Trace: Jun 26 10:00:41 fu2 kernel: [] free_block+0xb1/0x142 Jun 26 10:00:41 fu2 kernel: [] cache_flusharray+0x7d/0xb1 Jun 26 10:00:41 fu2 kernel: [] kmem_cache_free+0x1ef/0x20c Jun 26 10:00:41 fu2 kernel: [] :gfs2:databuf_lo_before_commit+0x576/0x5c6 Jun 26 10:00:41 fu2 kernel: [] :gfs2:gfs2_log_flush+0x11e/0x2d3 Jun 26 10:00:41 fu2 kernel: [] :gfs2:gfs2_logd+0xab/0x15b Jun 26 10:00:41 fu2 kernel: [] :gfs2:gfs2_logd+0x0/0x15b Jun 26 10:00:41 fu2 kernel: [] keventd_create_kthread+0x0/0x6a Jun 26 10:00:41 fu2 kernel: [] kthread+0xd0/0xff Jun 26 10:00:41 fu2 kernel: [] child_rip+0xa/0x12 Jun 26 10:00:41 fu2 kernel: [] keventd_create_kthread+0x0/0x6a Jun 26 10:00:41 fu2 kernel: [] kthread+0x0/0xff Jun 26 10:00:41 fu2 kernel: [] child_rip+0x0/0x12 Jun 26 10:00:41 fu2 kernel: Jun 26 10:00:41 fu2 kernel: Jun 26 10:00:41 fu2 kernel: Code: 0f 0b eb fe 48 8b 07 48 8b 50 08 48 39 fa 74 12 48 c7 c7 97 Jun 26 10:00:41 fu2 kernel: RIP [] list_del+0x21/0x5b Jun 26 10:00:41 fu2 kernel: RSP On 6/7/07, Steven Whitehouse wrote: > > Hi, > > The 
version of GFS2 in RHEL5 is rather old. Please use Fedora, the > upstream kernel or wait until RHEL 5.1 is out. This should solve the > problem that you are seeing, > > Steve. > > On Wed, 2007-06-06 at 19:27 -0400, nrbwpi at gmail.com wrote: > > Hello, > > > > Installed RHEL5 on a new two node cluster with Shared FC storage. The > > two shared storage boxes are each split into 6.9TB LUNs for a total of > > 4 - 6.9TB LUNS. Each machine is connected via a single 100Mb > > connection to a switch and a single FC connection to a FC switch. > > > > The 4 LUNs have LVM on them with GFS2. The file systems are mountable > > from each box. When performing a script dd write of zeros in 250MB > > file sizes to the file system from each box to different LUNS, one of > > the nodes in the cluster is fenced by the other one. File size does > > not seem to matter. > > > > My first guess at the problem was the heartbeat timeout in openais. > > In the cluster.conf below I added the totem line to hopefully raise > > the timeout to 10 seconds. This however did not resolve the problem. > > Both boxes are running the latest updates as of 2 days ago from > > up2date. > > > > Below is the cluster.conf and what is seen in the logs. Any > > suggestions would be greatly appreciated. > > > > Thanks! > > > > Neal > > > > > > > > ########################################## > > > > Cluster.conf > > > > ########################################## > > > > > > > > > > > > > > > > > > > > > switch="1"/> > > > > > > > interface="eth0"/> > > > > > > > > > > > switch="1"/> > > > > > > > interface="eth0"/> > > > > > > > > > > > > > > > > > login="apc" name="apc4" passwd="apc"/> > > > > > > > > > > > > > > > > > > ##################################################### > > > > /var/log/messages > > > > ##################################################### > > > > Jun 5 20:19:30 fu1 openais[5351]: [TOTEM] The token was lost in the > > OPERATIONAL state. > > Jun 5 20:19:30 fu1 openais[5351]: [TOTEM] Receive multicast socket > > recv buffer size (262142 bytes). > > Jun 5 20:19:30 fu1 openais[5351]: [TOTEM] Transmit multicast socket > > send buffer size (262142 bytes). > > Jun 5 20:19:30 fu1 openais[5351]: [TOTEM] entering GATHER state from > > 2. > > Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] entering GATHER state from > > 0. > > Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] Creating commit token > > because I am the rep. > > Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] Saving state aru 6e high > > seq received 6e > > Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] entering COMMIT state. > > Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] entering RECOVERY state. > > Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] position [0] member > > 192.168.14.195: > > Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] previous ring seq 16 rep > > 192.168.14.195 > > Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] aru 6e high delivered 6e > > received flag 0 > > Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] Did not need to originate > > any messages in recovery. 
> > Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] Storing new sequence id for > > ring 14 > > Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] Sending initial ORF token > > Jun 5 20:19:34 fu1 openais[5351]: [CLM ] CLM CONFIGURATION CHANGE > > Jun 5 20:19:34 fu1 openais[5351]: [CLM ] New Configuration: > > Jun 5 20:19:34 fu1 kernel: dlm: closing connection to node 2 > > Jun 5 20:19:34 fu1 fenced[5367]: fu2 not a cluster member after 0 sec > > post_fail_delay > > Jun 5 20:19:34 fu1 openais[5351]: [CLM ] r(0) > > ip(192.168.14.195) > > Jun 5 20:19:34 fu1 openais[5351]: [CLM ] Members Left: > > Jun 5 20:19:34 fu1 fenced[5367]: fencing node "fu2" > > Jun 5 20:19:34 fu1 openais[5351]: [CLM ] r(0) > > ip(192.168.14.197) > > Jun 5 20:19:34 fu1 openais[5351]: [CLM ] Members Joined: > > Jun 5 20:19:34 fu1 openais[5351]: [SYNC ] This node is within the > > primary component and will provide service. > > Jun 5 20:19:34 fu1 openais[5351]: [CLM ] CLM CONFIGURATION CHANGE > > Jun 5 20:19:34 fu1 openais[5351]: [CLM ] New Configuration: > > Jun 5 20:19:34 fu1 openais[5351]: [CLM ] r(0) > > ip(192.168.14.195) > > Jun 5 20:19:34 fu1 openais[5351]: [CLM ] Members Left: > > Jun 5 20:19:34 fu1 openais[5351]: [CLM ] Members Joined: > > Jun 5 20:19:34 fu1 openais[5351]: [SYNC ] This node is within the > > primary component and will provide service. > > Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] entering OPERATIONAL state. > > Jun 5 20:19:34 fu1 openais[5351]: [CLM ] got nodejoin message > > 192.168.14.195 > > Jun 5 20:19:34 fu1 openais[5351]: [CPG ] got joinlist message from > > node 1 > > Jun 5 20:19:36 fu1 fenced[5367]: fence "fu2" success > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: > > Trying to acquire journal lock... > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: > > Trying to acquire journal lock... > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: > > Looking at journal... > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: > > Trying to acquire journal lock... > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: > > Trying to acquire journal lock... > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: > > Looking at journal... > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: > > Looking at journal... > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: > > Looking at journal... > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: > > Acquiring the transaction lock... > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: > > Replaying journal... > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: > > Replayed 0 of 0 blocks > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: > > Found 0 revoke tags > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: > > Journal replayed in 1s > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: > > Done > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: > > Acquiring the transaction lock... > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: > > Replaying journal... 
> > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: > > Replayed 0 of 0 blocks > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: > > Found 0 revoke tags > > Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: > > Journal replayed in 1s > > Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: > > Done > > Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: > > Acquiring the transaction lock... > > Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: > > Acquiring the transaction lock... > > Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: > > Replaying journal... > > Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: > > Replayed 222 of 223 blocks > > Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: > > Found 1 revoke tags > > Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: > > Journal replayed in 1s > > Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: > > Done > > Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: > > Replaying journal... > > Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: > > Replayed 438 of 439 blocks > > Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: > > Found 1 revoke tags > > Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: > > Journal replayed in 1s > > Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: > > Done > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From swhiteho at redhat.com Wed Jun 27 23:40:13 2007 From: swhiteho at redhat.com (Steven Whitehouse) Date: Thu, 28 Jun 2007 00:40:13 +0100 Subject: [Linux-cluster] RHEL5 GFS2 - 2 node - node fenced when writing In-Reply-To: <6eee34430706271535jd4b13c0hf5b99219db4db6ec@mail.gmail.com> References: <6eee34430706061627p69a8080lf39168322926a769@mail.gmail.com> <1181201333.25918.229.camel@quoit> <6eee34430706271535jd4b13c0hf5b99219db4db6ec@mail.gmail.com> Message-ID: <1182987613.3386.36.camel@localhost.localdomain> Hi, On Wed, 2007-06-27 at 18:35 -0400, nrbwpi at gmail.com wrote: > Thanks for your reply > > I switched the hardware over to Fedora core 6, brought the system > up2date, and configured it the same as before with GFS2. Uname returns > the following kernel string: "Linux fu2 2.6.20-1.2952.fc6 #1 SMP Wed > May 16 18:18:22 EDT 2007 86_64 x86_64 x86_64 GNU/Linux". > > The same fencing occurred after several hours of writing zeros to the > volume with dd in 250MB files. This time, however, I noticed a kernel > panic on the fenced node. The kernel output in /var/log/messages is > below. Could this be a hardware configuration issue, or a bug in the > kernel? > Its a kernel bug. We are currently working on fixing something in the same area, so it might be that you've tripped over the same thing, or something related anyway. There are also a few patches (quite recent, again) which are in the git tree, but haven't made it into FC-6 yet, so it might also be one of those that will fix the problem. I'll try and get another set of update patches done shortly - I'm out of the office at that moment which makes such things a bit slower than usual I'm afraid. 
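The workload that triggers the fence is described above only in prose ("writing zeros to the volume with dd in 250MB files"); a minimal sketch of such a loop, with the mount point, file count and file naming invented for illustration:

    #!/bin/bash
    # Repeatedly write 250MB files of zeros to a mounted GFS2 filesystem
    # (hypothetical mount point /mnt/001vg_gfs -- adjust to the volume
    # under test). A copy of this ran on each node against different LUNs.
    for i in $(seq 1 1000); do
        dd if=/dev/zero of=/mnt/001vg_gfs/zeros.$i bs=1M count=250
    done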
If you are able to test the current GFS2 git tree kernel and you are still having the problem, then please report it through the Red Hat bugzilla, Steve. > > > ##################################### > > > > Kernel panic > > > > ##################################### > > > > Jun 26 10:00:41 fu2 kernel: ------------[ cut here ]------------ > > Jun 26 10:00:41 fu2 kernel: kernel BUG at lib/list_debug.c:67! > > Jun 26 10:00:41 fu2 kernel: invalid opcode: 0000 [1] SMP > > Jun 26 10:00:41 fu2 kernel: last sysfs > file: /devices/pci0000:00/0000:00:02.0/0000:06:00.0/0000:07:00.0/0000:08:00.0/0000:09:00.0/irq > > Jun 26 10:00:41 fu2 kernel: CPU 7Jun 26 10:00:41 fu2 kernel: Modules > linked in: lock_dlm gfs2 dlm configfs ipt_MASQUERADE iptable_nat > nf_nat nf_conntrack_ipv4 xt_state nf_conntrack nfnetlink ipt_REJECT > xt_tcpudp iptable_filter ip_tables x_tables bridge autofs4 hidp xfs > rfcomm l2cap bluetooth sunrpc ipv6 ib_iser rdma_cm ib_cm iw_cm ib_sa > ib_mad ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi > dm_multipath video sbs i2c_ec i2c_core dock button battery asus_acpi > backlight ac parport_pc lp parport sg ata_piix libata pcspkr bnx2 > ide_cd cdrom serio_raw dm_snapshot dm_zero dm_mirror dm_mod lpfc > scsi_transport_fc shpchp megaraid_sas sd_mod scsi_mod ext3 jbd > ehci_hcd ohci_hcd uhci_hcd > > Jun 26 10:00:41 fu2 kernel: Pid: 4142, comm: gfs2_logd Not tainted > 2.6.20-1.2952.fc6 #1 > > Jun 26 10:00:41 fu2 kernel: RIP: 0010:[] > [] list_del+0x21/0x5b > > Jun 26 10:00:41 fu2 kernel: RSP: 0018:ffff81011e247d00 EFLAGS: > 00010082 > > Jun 26 10:00:41 fu2 kernel: RAX: 0000000000000058 RBX: > ffff81011aa40000 RCX: ffffffff8057fc58 > > Jun 26 10:00:41 fu2 kernel: RDX: ffffffff8057fc58 RSI: > 0000000000000000 RDI: ffffffff8057fc40 > > Jun 26 10:00:41 fu2 kernel: RBP: ffff81012da3f7c0 R08: > ffffffff8057fc58 R09: 0000000000000001 > > Jun 26 10:00:41 fu2 kernel: R10: 0000000000000000 R11: > ffff81012fd9d0c0 R12: ffff81011aa40f70 > > Jun 26 10:00:41 fu2 kernel: R13: ffff810123fb1a00 R14: > ffff810123fb05d8 R15: 0000000000000036 > > Jun 26 10:00:41 fu2 kernel: FS: 0000000000000000(0000) > GS:ffff81012fdb47c0(0000) knlGS:0000000000000000 > > Jun 26 10:00:41 fu2 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: > 000000008005003b > > Jun 26 10:00:41 fu2 kernel: CR2: 00002aaaadfbe008 CR3: > 0000000042c20000 CR4: 00000000000006e0 > > Jun 26 10:00:41 fu2 kernel: Process gfs2_logd (pid: 4142, threadinfo > ffff81011e246000, task ffff810121d35800) > > Jun 26 10:00:41 fu2 kernel: Stack: ffff810123fb1a00 ffffffff802cc6e7 > 0000003c00000000 ffff81012da3f7c0 > > Jun 26 10:00:41 fu2 kernel: 000000000000003c ffff810123fb0400 > 0000000000000000 ffff810123fb1a00 > > Jun 26 10:00:41 fu2 kernel: ffff81012da3f800 ffffffff802cc8be > ffff810123fb07e8 ffff810123fb0400 > > Jun 26 10:00:41 fu2 kernel: Call Trace: > > Jun 26 10:00:41 fu2 kernel: [] free_block > +0xb1/0x142 > > Jun 26 10:00:41 fu2 kernel: [] cache_flusharray > +0x7d/0xb1 > > Jun 26 10:00:41 fu2 kernel: [] kmem_cache_free > +0x1ef/0x20c > > Jun 26 10:00:41 fu2 kernel: > [] :gfs2:databuf_lo_before_commit+0x576/0x5c6 > > Jun 26 10:00:41 fu2 kernel: [] :gfs2:gfs2_log_flush > +0x11e/0x2d3 > > Jun 26 10:00:41 fu2 kernel: [] :gfs2:gfs2_logd > +0xab/0x15b > > Jun 26 10:00:41 fu2 kernel: [] :gfs2:gfs2_logd > +0x0/0x15b > > Jun 26 10:00:41 fu2 kernel: [] > keventd_create_kthread+0x0/0x6a > > Jun 26 10:00:41 fu2 kernel: [] kthread+0xd0/0xff > > Jun 26 10:00:41 fu2 kernel: [] child_rip+0xa/0x12 > > Jun 26 10:00:41 fu2 kernel: [] > keventd_create_kthread+0x0/0x6a > > Jun 
26 10:00:41 fu2 kernel: [] kthread+0x0/0xff > > Jun 26 10:00:41 fu2 kernel: [] child_rip+0x0/0x12 > > Jun 26 10:00:41 fu2 kernel: > > Jun 26 10:00:41 fu2 kernel: > > Jun 26 10:00:41 fu2 kernel: Code: 0f 0b eb fe 48 8b 07 48 8b 50 08 48 > 39 fa 74 12 48 c7 c7 97 > > Jun 26 10:00:41 fu2 kernel: RIP [] list_del > +0x21/0x5b > > Jun 26 10:00:41 fu2 kernel: RSP > > > > On 6/7/07, Steven Whitehouse wrote: > Hi, > > The version of GFS2 in RHEL5 is rather old. Please use Fedora, > the > upstream kernel or wait until RHEL 5.1 is out. This should > solve the > problem that you are seeing, > > Steve. > > On Wed, 2007-06-06 at 19:27 -0400, nrbwpi at gmail.com wrote: > > Hello, > > > > Installed RHEL5 on a new two node cluster with Shared FC > storage. The > > two shared storage boxes are each split into 6.9TB LUNs for > a total of > > 4 - 6.9TB LUNS. Each machine is connected via a single > 100Mb > > connection to a switch and a single FC connection to a FC > switch. > > > > The 4 LUNs have LVM on them with GFS2. The file systems are > mountable > > from each box. When performing a script dd write of zeros > in 250MB > > file sizes to the file system from each box to different > LUNS, one of > > the nodes in the cluster is fenced by the other one. File > size does > > not seem to matter. > > > > My first guess at the problem was the heartbeat timeout in > openais. > > In the cluster.conf below I added the totem line to > hopefully raise > > the timeout to 10 seconds. This however did not resolve the > problem. > > Both boxes are running the latest updates as of 2 days ago > from > > up2date. > > > > Below is the cluster.conf and what is seen in the logs. Any > > suggestions would be greatly appreciated. > > > > Thanks! > > > > Neal > > > > > > > > ########################################## > > > > Cluster.conf > > > > ########################################## > > > > > > > > name="storage1"> > > post_join_delay="3"/> > > > > votes="1"> > > > > > > port="1" > > switch="1"/> > > > > > > > interface="eth0"/> > > > > votes="1"> > > > > > > port="2" > > switch="1"/> > > > > > > > interface="eth0"/> > > > > > > > > > > > > > > > > ipaddr="192.168.14.193" > > login="apc" name="apc4" passwd="apc"/> > > > > > > > > > > > > > > > > > > ##################################################### > > > > /var/log/messages > > > > ##################################################### > > > > Jun 5 20:19:30 fu1 openais[5351]: [TOTEM] The token was > lost in the > > OPERATIONAL state. > > Jun 5 20:19:30 fu1 openais[5351]: [TOTEM] Receive multicast > socket > > recv buffer size (262142 bytes). > > Jun 5 20:19:30 fu1 openais[5351]: [TOTEM] Transmit > multicast socket > > send buffer size (262142 bytes). > > Jun 5 20:19:30 fu1 openais[5351]: [TOTEM] entering GATHER > state from > > 2. > > Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] entering GATHER > state from > > 0. > > Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] Creating commit > token > > because I am the rep. > > Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] Saving state aru > 6e high > > seq received 6e > > Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] entering COMMIT > state. > > Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] entering RECOVERY > state. 
> > Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] position [0] > member > > 192.168.14.195: > > Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] previous ring seq > 16 rep > > 192.168.14.195 > > Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] aru 6e high > delivered 6e > > received flag 0 > > Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] Did not need to > originate > > any messages in recovery. > > Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] Storing new > sequence id for > > ring 14 > > Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] Sending initial > ORF token > > Jun 5 20:19:34 fu1 openais[5351]: [CLM ] CLM CONFIGURATION > CHANGE > > Jun 5 20:19:34 fu1 openais[5351]: [CLM ] New > Configuration: > > Jun 5 20:19:34 fu1 kernel: dlm: closing connection to node > 2 > > Jun 5 20:19:34 fu1 fenced[5367]: fu2 not a cluster member > after 0 sec > > post_fail_delay > > Jun 5 20:19:34 fu1 openais[5351]: [CLM ] r(0) > > ip(192.168.14.195) > > Jun 5 20:19:34 fu1 openais[5351]: [CLM ] Members Left: > > Jun 5 20:19:34 fu1 fenced[5367]: fencing node "fu2" > > Jun 5 20:19:34 fu1 openais[5351]: [CLM ] r(0) > > ip(192.168.14.197) > > Jun 5 20:19:34 fu1 openais[5351]: [CLM ] Members Joined: > > Jun 5 20:19:34 fu1 openais[5351]: [SYNC ] This node is > within the > > primary component and will provide service. > > Jun 5 20:19:34 fu1 openais[5351]: [CLM ] CLM CONFIGURATION > CHANGE > > Jun 5 20:19:34 fu1 openais[5351]: [CLM ] New > Configuration: > > Jun 5 20:19:34 fu1 openais[5351]: [CLM ] r(0) > > ip(192.168.14.195) > > Jun 5 20:19:34 fu1 openais[5351]: [CLM ] Members Left: > > Jun 5 20:19:34 fu1 openais[5351]: [CLM ] Members Joined: > > Jun 5 20:19:34 fu1 openais[5351]: [SYNC ] This node is > within the > > primary component and will provide service. > > Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] entering > OPERATIONAL state. > > Jun 5 20:19:34 fu1 openais[5351]: [CLM ] got nodejoin > message > > 192.168.14.195 > > Jun 5 20:19:34 fu1 openais[5351]: [CPG ] got joinlist > message from > > node 1 > > Jun 5 20:19:36 fu1 fenced[5367]: fence "fu2" success > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: > jid=1: > > Trying to acquire journal lock... > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: > jid=1: > > Trying to acquire journal lock... > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: > jid=1: > > Looking at journal... > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: > jid=1: > > Trying to acquire journal lock... > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: > jid=1: > > Trying to acquire journal lock... > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: > jid=1: > > Looking at journal... > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: > jid=1: > > Looking at journal... > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: > jid=1: > > Looking at journal... > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: > jid=1: > > Acquiring the transaction lock... > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: > jid=1: > > Replaying journal... 
> > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: > jid=1: > > Replayed 0 of 0 blocks > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: > jid=1: > > Found 0 revoke tags > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: > jid=1: > > Journal replayed in 1s > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: > jid=1: > > Done > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: > jid=1: > > Acquiring the transaction lock... > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: > jid=1: > > Replaying journal... > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: > jid=1: > > Replayed 0 of 0 blocks > > Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: > jid=1: > > Found 0 revoke tags > > Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: > jid=1: > > Journal replayed in 1s > > Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: > jid=1: > > Done > > Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: > jid=1: > > Acquiring the transaction lock... > > Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: > jid=1: > > Acquiring the transaction lock... > > Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: > jid=1: > > Replaying journal... > > Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: > jid=1: > > Replayed 222 of 223 blocks > > Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: > jid=1: > > Found 1 revoke tags > > Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: > jid=1: > > Journal replayed in 1s > > Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: > jid=1: > > Done > > Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: > jid=1: > > Replaying journal... > > Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: > jid=1: > > Replayed 438 of 439 blocks > > Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: > jid=1: > > Found 1 revoke tags > > Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: > jid=1: > > Journal replayed in 1s > > Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: > jid=1: > > Done > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From janne.peltonen at helsinki.fi Thu Jun 28 15:45:37 2007 From: janne.peltonen at helsinki.fi (Janne Peltonen) Date: Thu, 28 Jun 2007 18:45:37 +0300 Subject: [Linux-cluster] Cluster node without access to all resources - trouble Message-ID: <20070628154537.GA3108@helsinki.fi> Hi. I'm running a five node cluster. Four of the nodes run services that need access to a SAN, but the fifth doesn't. (The fifth node belongs to the cluster to avoid a cluster with an even number of nodes. Additionally, the fifth node is a stand-alone rack server, while the four other nodes are blade server, two of the in two different blade racks - this way, even if either of the blade racks goes down, I won't lose the cluster.) This seems to create all sorts of trouble. For example, if I try to manipulate clvm'd filesystems on the other four nodes, they refuse to commit changes if the fifth node is up. 
And even if I've restricted the SAN-access-needing services to run only on the four nodes that have the access, the cluster system tries to shut the services down in the fifth node also (when quorum is lost, for example) - and complains about being unable to stop them and, on the nodes that should run the services, refuses to restart them until I've removed the fifth node from the cluster and fenced it. (Or, rather, I've removed the fifth node from the cluster and one of the other nodes has successfully fenced it.) So. Is it really necessary that all the members in a cluster have access to all the resources that any of the members have, even if the services in the cluster are partitioned to run in only a part of the cluster? Or is there a way to tell the cluster that it shouldn't care about the fifth members opinion about certain services; that is, it doesn't need to check if the services are running on it, because they never do. Or should I just make sure that the fifth member always comes up last (that is, won't be running while the others are coming up)? Or should I aceept that I'm going to create more harm than avoiding by letting the fifth node belong to the cluster, and just run it outside the cluster? Sorry if this was incoherent. I'm a bit tired; this system should be in production in two weeks, and unexpected problems (that didn't come up during testing) keep coming up... Any suggestions would be greatly appreciated. --Janne -- Janne Peltonen From Robert.Gil at americanhm.com Thu Jun 28 15:50:07 2007 From: Robert.Gil at americanhm.com (Robert Gil) Date: Thu, 28 Jun 2007 11:50:07 -0400 Subject: [Linux-cluster] Cluster node without access to all resources -trouble In-Reply-To: <20070628154537.GA3108@helsinki.fi> Message-ID: What version of cluster are you running? Robert Gil Linux Systems Administrator American Home Mortgage -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Janne Peltonen Sent: Thursday, June 28, 2007 11:46 AM To: linux-cluster at redhat.com Subject: [Linux-cluster] Cluster node without access to all resources -trouble Hi. I'm running a five node cluster. Four of the nodes run services that need access to a SAN, but the fifth doesn't. (The fifth node belongs to the cluster to avoid a cluster with an even number of nodes. Additionally, the fifth node is a stand-alone rack server, while the four other nodes are blade server, two of the in two different blade racks - this way, even if either of the blade racks goes down, I won't lose the cluster.) This seems to create all sorts of trouble. For example, if I try to manipulate clvm'd filesystems on the other four nodes, they refuse to commit changes if the fifth node is up. And even if I've restricted the SAN-access-needing services to run only on the four nodes that have the access, the cluster system tries to shut the services down in the fifth node also (when quorum is lost, for example) - and complains about being unable to stop them and, on the nodes that should run the services, refuses to restart them until I've removed the fifth node from the cluster and fenced it. (Or, rather, I've removed the fifth node from the cluster and one of the other nodes has successfully fenced it.) So. Is it really necessary that all the members in a cluster have access to all the resources that any of the members have, even if the services in the cluster are partitioned to run in only a part of the cluster? 
Or is there a way to tell the cluster that it shouldn't care about the fifth members opinion about certain services; that is, it doesn't need to check if the services are running on it, because they never do. Or should I just make sure that the fifth member always comes up last (that is, won't be running while the others are coming up)? Or should I aceept that I'm going to create more harm than avoiding by letting the fifth node belong to the cluster, and just run it outside the cluster? Sorry if this was incoherent. I'm a bit tired; this system should be in production in two weeks, and unexpected problems (that didn't come up during testing) keep coming up... Any suggestions would be greatly appreciated. --Janne -- Janne Peltonen -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From janne.peltonen at helsinki.fi Thu Jun 28 16:23:36 2007 From: janne.peltonen at helsinki.fi (Janne Peltonen) Date: Thu, 28 Jun 2007 19:23:36 +0300 Subject: [Linux-cluster] Cluster node without access to all resources -trouble In-Reply-To: References: <20070628154537.GA3108@helsinki.fi> Message-ID: <20070628162336.GA3221@helsinki.fi> On Thu, Jun 28, 2007 at 11:50:07AM -0400, Robert Gil wrote: > What version of cluster are you running? [jmmpelto at pcn3 ~]$ sudo rpm -qa 'lvm*cluster|cman|rgmanager' cman-2.0.60-1.el5 lvm2-cluster-2.02.16-3.el5 rgmanager-2.0.23-1.el5.centos I'm not running rgmanager-2.0.24 because it didn't seem to run the script status checks (!). --Janne > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Janne Peltonen > Sent: Thursday, June 28, 2007 11:46 AM > To: linux-cluster at redhat.com > Subject: [Linux-cluster] Cluster node without access to all resources > -trouble > > Hi. > > I'm running a five node cluster. Four of the nodes run services that > need access to a SAN, but the fifth doesn't. (The fifth node belongs to > the cluster to avoid a cluster with an even number of nodes. > Additionally, the fifth node is a stand-alone rack server, while the > four other nodes are blade server, two of the in two different blade > racks - this way, even if either of the blade racks goes down, I won't > lose the cluster.) This seems to create all sorts of trouble. For > example, if I try to manipulate clvm'd filesystems on the other four > nodes, they refuse to commit changes if the fifth node is up. And even > if I've restricted the SAN-access-needing services to run only on the > four nodes that have the access, the cluster system tries to shut the > services down in the fifth node also (when quorum is lost, for example) > - and complains about being unable to stop them and, on the nodes that > should run the services, refuses to restart them until I've removed the > fifth node from the cluster and fenced it. (Or, rather, I've removed the > fifth node from the cluster and one of the other nodes has successfully > fenced it.) > > So. > > Is it really necessary that all the members in a cluster have access to > all the resources that any of the members have, even if the services in > the cluster are partitioned to run in only a part of the cluster? Or is > there a way to tell the cluster that it shouldn't care about the fifth > members opinion about certain services; that is, it doesn't need to > check if the services are running on it, because they never do. 
Or > should I just make sure that the fifth member always comes up last (that > is, won't be running while the others are coming up)? Or should I aceept > that I'm going to create more harm than avoiding by letting the fifth > node belong to the cluster, and just run it outside the cluster? > > Sorry if this was incoherent. I'm a bit tired; this system should be in > production in two weeks, and unexpected problems (that didn't come up > during testing) keep coming up... Any suggestions would be greatly > appreciated. > > > --Janne > -- > Janne Peltonen > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Janne Peltonen From Robert.Gil at americanhm.com Thu Jun 28 16:29:04 2007 From: Robert.Gil at americanhm.com (Robert Gil) Date: Thu, 28 Jun 2007 12:29:04 -0400 Subject: [Linux-cluster] Cluster node without access to all resources-trouble In-Reply-To: <20070628162336.GA3221@helsinki.fi> Message-ID: I cant really help you there. In EL4 each of the services are separate. So a node can be part of the cluster but doesn't need to share the resources such as a shared san disk. If you have the resources set up so that it requires that resource, then it should be fenced. Robert Gil Linux Systems Administrator American Home Mortgage Phone: 631-622-8410 Cell: 631-827-5775 Fax: 516-495-5861 -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Janne Peltonen Sent: Thursday, June 28, 2007 12:24 PM To: linux clustering Subject: Re: [Linux-cluster] Cluster node without access to all resources-trouble On Thu, Jun 28, 2007 at 11:50:07AM -0400, Robert Gil wrote: > What version of cluster are you running? [jmmpelto at pcn3 ~]$ sudo rpm -qa 'lvm*cluster|cman|rgmanager' cman-2.0.60-1.el5 lvm2-cluster-2.02.16-3.el5 rgmanager-2.0.23-1.el5.centos I'm not running rgmanager-2.0.24 because it didn't seem to run the script status checks (!). --Janne > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Janne Peltonen > Sent: Thursday, June 28, 2007 11:46 AM > To: linux-cluster at redhat.com > Subject: [Linux-cluster] Cluster node without access to all resources > -trouble > > Hi. > > I'm running a five node cluster. Four of the nodes run services that > need access to a SAN, but the fifth doesn't. (The fifth node belongs > to the cluster to avoid a cluster with an even number of nodes. > Additionally, the fifth node is a stand-alone rack server, while the > four other nodes are blade server, two of the in two different blade > racks - this way, even if either of the blade racks goes down, I won't > lose the cluster.) This seems to create all sorts of trouble. For > example, if I try to manipulate clvm'd filesystems on the other four > nodes, they refuse to commit changes if the fifth node is up. And even > if I've restricted the SAN-access-needing services to run only on the > four nodes that have the access, the cluster system tries to shut the > services down in the fifth node also (when quorum is lost, for > example) > - and complains about being unable to stop them and, on the nodes that > should run the services, refuses to restart them until I've removed > the fifth node from the cluster and fenced it. 
(Or, rather, I've > removed the fifth node from the cluster and one of the other nodes has > successfully fenced it.) > > So. > > Is it really necessary that all the members in a cluster have access > to all the resources that any of the members have, even if the > services in the cluster are partitioned to run in only a part of the > cluster? Or is there a way to tell the cluster that it shouldn't care > about the fifth members opinion about certain services; that is, it > doesn't need to check if the services are running on it, because they > never do. Or should I just make sure that the fifth member always > comes up last (that is, won't be running while the others are coming > up)? Or should I aceept that I'm going to create more harm than > avoiding by letting the fifth node belong to the cluster, and just run it outside the cluster? > > Sorry if this was incoherent. I'm a bit tired; this system should be > in production in two weeks, and unexpected problems (that didn't come > up during testing) keep coming up... Any suggestions would be greatly > appreciated. > > > --Janne > -- > Janne Peltonen > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Janne Peltonen -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From janne.peltonen at helsinki.fi Thu Jun 28 16:54:05 2007 From: janne.peltonen at helsinki.fi (Janne Peltonen) Date: Thu, 28 Jun 2007 19:54:05 +0300 Subject: [Linux-cluster] Cluster node without access to all resources-trouble In-Reply-To: References: <20070628162336.GA3221@helsinki.fi> Message-ID: <20070628165405.GB3221@helsinki.fi> On Thu, Jun 28, 2007 at 12:29:04PM -0400, Robert Gil wrote: > I cant really help you there. In EL4 each of the services are separate. > So a node can be part of the cluster but doesn't need to share the > resources such as a shared san disk. If you have the resources set up so > that it requires that resource, then it should be fenced. Yep. The situation seems to be this (someone who really knows abt the inner workings of the resource group manager, correct me): *when a clurgmgrd starts, it wants to know the status of all the services, and to make thing sure, it stops all services locally (unmounts the filesystems, runs the scripts with "stop") - and asks the already-running cluster members their idea of the status *when the clurgmgrd on the fifth node starts, it tries to stop locally the SAN requiring services - and cannot match the /dev// paths with real nodes, so it ends up with incoherent information about their status *if all the nodes with SAN access are restarted (while the fifth node is up), the nodes with SAN access first stop the services locally - and then, apparently, ask the fifth node about the service status. Result: a line like the following, for each service: --cut-- Jun 28 17:56:20 pcn2.mappi.helsinki.fi clurgmgrd[5895]: #34: Cannot get status for service service:im --cut-- (what is weird, though, is that the fifth node knows perfectly well the status of this particular service, since it's running the service (service:im doesn't need the SAN access) - perhaps there is some other reason not to believe the fifth node at this point. can't imagine what it'd be, though.) 
*after that, the nodes with SAN access do nothing about any services until after the fifth node has left the cluster and has been fenced. So, apparently the other nodes conclude that the fifth node is 'bad' and could be interfering with their SAN access requiring services. When the fifth node has been fenced, the other nodes start the services. And the fifth node can join the cluster and start the services that should be running there... > > -----Original Message----- > > From: linux-cluster-bounces at redhat.com > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Janne Peltonen > > Sent: Thursday, June 28, 2007 11:46 AM > > To: linux-cluster at redhat.com > > Subject: [Linux-cluster] Cluster node without access to all resources > > -trouble > > > > Hi. > > > > I'm running a five node cluster. Four of the nodes run services that > > need access to a SAN, but the fifth doesn't. (The fifth node belongs > > to the cluster to avoid a cluster with an even number of nodes. > > Additionally, the fifth node is a stand-alone rack server, while the > > four other nodes are blade server, two of the in two different blade > > racks - this way, even if either of the blade racks goes down, I won't > > > lose the cluster.) This seems to create all sorts of trouble. For > > example, if I try to manipulate clvm'd filesystems on the other four > > nodes, they refuse to commit changes if the fifth node is up. And even > > > if I've restricted the SAN-access-needing services to run only on the > > four nodes that have the access, the cluster system tries to shut the > > services down in the fifth node also (when quorum is lost, for > > example) > > - and complains about being unable to stop them and, on the nodes that > > > should run the services, refuses to restart them until I've removed > > the fifth node from the cluster and fenced it. (Or, rather, I've > > removed the fifth node from the cluster and one of the other nodes has > > > successfully fenced it.) > > > > So. > > > > Is it really necessary that all the members in a cluster have access > > to all the resources that any of the members have, even if the > > services in the cluster are partitioned to run in only a part of the > > cluster? Or is there a way to tell the cluster that it shouldn't care > > about the fifth members opinion about certain services; that is, it > > doesn't need to check if the services are running on it, because they > > never do. Or should I just make sure that the fifth member always > > comes up last (that is, won't be running while the others are coming > > up)? Or should I aceept that I'm going to create more harm than > > avoiding by letting the fifth node belong to the cluster, and just run > it outside the cluster? > > > > Sorry if this was incoherent. I'm a bit tired; this system should be > > in production in two weeks, and unexpected problems (that didn't come > > up during testing) keep coming up... Any suggestions would be greatly > > appreciated. 
> > > > > > --Janne > > -- > > Janne Peltonen > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Janne Peltonen > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Janne Peltonen From mike at duncancg.com Thu Jun 28 17:43:41 2007 From: mike at duncancg.com (Mike Duncan) Date: Thu, 28 Jun 2007 12:43:41 -0500 Subject: [Linux-cluster] Newbie Question Message-ID: <1183052621.3641.5.camel@Dilbert> I hope this is a good place for beginner questions. If not, please let me know and I'll go away.... I am trying to construct my first cluster. I have 6 PCs (P3s) and am trying to learn the basics. I have MPI up and running so that the nodes will answer and they run a simple "hello world" program. Here's my main question: I'm running Fedora Core 6 on this cluster, and would like to implement GFS as each node has a huge data HDD--160G per node. However I cannot find any information on the Net about implementing GFS on FC6. I found some old info about FC4, but it was useless. I see that RHEL has the functionality included. Do I need to install RHEL? I have a legal copy of RHEL ES 4. Thanks for any assistance! Mike From lhh at redhat.com Thu Jun 28 18:18:13 2007 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 28 Jun 2007 14:18:13 -0400 Subject: [Linux-cluster] QDISK problem In-Reply-To: <467BBD2B.A5F6.00ED.0@itdynamics.co.za> References: <467BBD2B.A5F6.00ED.0@itdynamics.co.za> Message-ID: <20070628181813.GO8818@redhat.com> On Fri, Jun 22, 2007 at 12:14:35PM +0200, Gavin Fietze wrote: > I am trying to get qdiskd to work in a 3 node cluster using RHEL 5 AP . The 3 nodes are virtual machines running XEN. Domain0 is also RHEL 5 AP > I have upgraded cman and rgmanager to cman-2.0.60-1.el5 and rgmanager-2.0.23-1 respectively, everything else stock standard. > > When I run clustat and "cman_tool nodes" I get strange output for the qdisk object : This is a bug in rgmanager; it will be fixed in 5.1. For now, since you have 3 nodes, just go without qdisk. -- Lon -- Lon Hohberger - Software Engineer - Red Hat, Inc. From lhh at redhat.com Thu Jun 28 18:19:58 2007 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 28 Jun 2007 14:19:58 -0400 Subject: [Linux-cluster] No failover occurs! In-Reply-To: <1182786276.12133.2.camel@kcm40202.kcmhq.org> References: <1182786276.12133.2.camel@kcm40202.kcmhq.org> Message-ID: <20070628181958.GP8818@redhat.com> On Mon, Jun 25, 2007 at 10:44:36AM -0500, Dave Augustus wrote: > I am familiar with Heartbeat and new to RHCS. > > Anyhow: > > I created a 2 node cluster with no quorum drive. > added an ip address on the public eth > added an ip address on the private eth > added the script apache, with proper configs on both hosts > > The only way I can get all 3 to run is to reboot the nodes. > > Shouldn't it failover if a service fails to start? It's a configuration option. Restart = default = restart on the same node. Relocate = "failover" Disable = don't bother trying; disable service immediately. Could you post your cluster.conf? -- Lon Hohberger - Software Engineer - Red Hat, Inc. 
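The three policies Lon lists map onto the recovery attribute of a <service> block in cluster.conf; a minimal sketch of a relocating service (the service, address and script names below are invented for illustration, not taken from the poster's configuration):

    <rm>
      <service name="apache_svc" autostart="1" recovery="relocate">
        <ip address="192.168.1.100" monitor_link="1"/>
        <script name="httpd" file="/etc/init.d/httpd"/>
      </service>
    </rm>

With recovery="relocate" a failed service is started on another eligible node; "restart" (the default) retries on the same node first, and "disable" simply stops the service and leaves it disabled.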
From lhh at redhat.com Thu Jun 28 18:22:19 2007 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 28 Jun 2007 14:22:19 -0400 Subject: [Linux-cluster] a couple of questions regarding clusters In-Reply-To: <91E302EA72562F43AEDC06DD8FAE7D3602855F82@ord1mail01.firstlook.biz> References: <91E302EA72562F43AEDC06DD8FAE7D3602855F82@ord1mail01.firstlook.biz> Message-ID: <20070628182219.GQ8818@redhat.com> On Sat, Jun 23, 2007 at 05:23:56PM -0500, Brent Sachnoff wrote: > I have a 3 node cluster running redhat 4 with gfs. What is the proper > way to have a node leave the cluster for maintenance and then rejoin > after maintenance is completed? From the docs, I have read that I need > to unmount gfs and then stop all the services in the following order: > rgmanager, gfs, clvmd, fenced. I can then issue a cman_tool leave > (remove) request. That should be correct. 'cman_tool leave remove' will decrement the quorum vote count. > I have also noticed that if I lose ip connectivity to a certain node I > lose gfs connectivity with the other two nodes. I thought that I would > only need 2 votes to continue connectivity. I assume that in this case, you are disconnecting the cable (not doing a clean shutdown as above). That's correct, except fencing must complete in order for you to maintain full access to the GFS volume. If you don't have fencing configured, you will lose connectivity to GFS volume. -- Lon Hohberger - Software Engineer - Red Hat, Inc. From lhh at redhat.com Thu Jun 28 18:24:15 2007 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 28 Jun 2007 14:24:15 -0400 Subject: [Linux-cluster] Cluster configuration on redhat AS 4 In-Reply-To: <20070625035719.12708.qmail@webmail6.rediffmail.com> References: <20070625035719.12708.qmail@webmail6.rediffmail.com> Message-ID: <20070628182415.GR8818@redhat.com> On Mon, Jun 25, 2007 at 03:57:19AM -0000, manjunath c shanubog wrote: > Hi,           I have to setup two node cluster with redhat AS 4 and cluster suite with GFS. The application which is to be installed is MySql database. I would like to have a solution for the below queries          1. Detailed installation guide for cluster suite installation and is it possible to load balance on redhat 4/5 linux.          2. Do i need to have a separate cluster suite for MySql, if so which one is Good.          3. Guide or document for Installation of MySQL on cluster.          4. In windows clustering there is no need of fencing device, why is it necessary in linux. if so which is good fencing device and its configuration details.Thanking YouManjunath I don't think active/active mysql clustering currently works; someone else can correct me if I'm wrong on this. -- Lon Hohberger - Software Engineer - Red Hat, Inc. From lhh at redhat.com Thu Jun 28 18:25:55 2007 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 28 Jun 2007 14:25:55 -0400 Subject: [Linux-cluster] manual fencing problem In-Reply-To: <46819E5C.1090802@cmiware.com> References: <46819E5C.1090802@cmiware.com> Message-ID: <20070628182553.GS8818@redhat.com> On Tue, Jun 26, 2007 at 06:16:44PM -0500, Chris Harms wrote: > Trying to setup manual fencing for testing purposes in Conga gave me the > following errors: > > agent "fence_manual" reports: failed: fence_manual no node name > > It appears this came up before: > http://www.redhat.com/archives/linux-cluster/2006-May/msg00088.html > > but is still unresolved. Ew.. What releases of conga (luci/ricci) system-config-cluster and fence do you have installed? -- Lon Hohberger - Software Engineer - Red Hat, Inc. 
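Returning to the maintenance question answered a little further up ("rgmanager, gfs, clvmd, fenced, then cman_tool leave remove"): a rough sketch of taking a node out and bringing it back, using the init script names shipped with the RHEL4 cluster suite (adjust to the local init layout):

    # On the node going down for maintenance:
    service rgmanager stop
    service gfs stop          # unmounts GFS filesystems from /etc/fstab
    service clvmd stop
    service fenced stop
    cman_tool leave remove    # lets the rest of the cluster recalculate quorum

    # After maintenance, rejoin in roughly the reverse order:
    service ccsd start        # if ccsd is not already running
    service cman start
    service fenced start
    service clvmd start
    service gfs start
    service rgmanager start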
From lhh at redhat.com Thu Jun 28 18:30:32 2007 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 28 Jun 2007 14:30:32 -0400 Subject: [Linux-cluster] wishing to run GFS on iSCSI with redundancy In-Reply-To: References: Message-ID: <20070628183032.GT8818@redhat.com> On Wed, Jun 27, 2007 at 03:54:07PM +1200, Rohit Grover wrote: > Hello, > > We'd like to run GFS in a cluster serviced by a pool of iSCSI disks. > We would like to use RAID to add redundancy to the storage, but > there's literature on the net saying that linux's MD driver is not > cluster safe. Since CLVM doesn't support RAID, what options do we > have other than pairing the iSCSI disks with DRBD? Ok, couple of notes: * MD is only unsafe in a cluster if it's used on multiple cluster nodes. That is, it should be fairly easy to implement a resource agent which assembles MD devices from network block devices - on one node at a time. * DRBD only will work with two writers (if 0.8.x+). I'm not sure how many mirror targets you can maintain. * Aren't most iSCSI targets RAID arrays already (?) -- Lon Hohberger - Software Engineer - Red Hat, Inc. From lhh at redhat.com Thu Jun 28 18:33:42 2007 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 28 Jun 2007 14:33:42 -0400 Subject: [Linux-cluster] Cluster node without access to all resources - trouble In-Reply-To: <20070628154537.GA3108@helsinki.fi> References: <20070628154537.GA3108@helsinki.fi> Message-ID: <20070628183341.GU8818@redhat.com> On Thu, Jun 28, 2007 at 06:45:37PM +0300, Janne Peltonen wrote: > > So. > > Is it really necessary that all the members in a cluster have access to > all the resources that any of the members have, even if the services in > the cluster are partitioned to run in only a part of the cluster? No. > Or is > there a way to tell the cluster that it shouldn't care about the fifth > members opinion about certain services; that is, it doesn't need to > check if the services are running on it, because they never do. (1) don't start rgmanager on the fifth node :), or (2) if you do start rgmanager on the fifth node, make all services be part of a "restricted" failover domain comprised of the other four nodes. > Or > should I just make sure that the fifth member always comes up last (that > is, won't be running while the others are coming up)? Or should I aceept > that I'm going to create more harm than avoiding by letting the fifth > node belong to the cluster, and just run it outside the cluster? If the above two don't work, it's a bug. (Oh! and those status-script checks will be fixed in 5.1 ;) ). -- Lon Hohberger - Software Engineer - Red Hat, Inc. From lhh at redhat.com Thu Jun 28 18:39:44 2007 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 28 Jun 2007 14:39:44 -0400 Subject: [Linux-cluster] Cluster node without access to all resources-trouble In-Reply-To: <20070628165405.GB3221@helsinki.fi> References: <20070628162336.GA3221@helsinki.fi> <20070628165405.GB3221@helsinki.fi> Message-ID: <20070628183944.GV8818@redhat.com> On Thu, Jun 28, 2007 at 07:54:05PM +0300, Janne Peltonen wrote: > On Thu, Jun 28, 2007 at 12:29:04PM -0400, Robert Gil wrote: > > I cant really help you there. In EL4 each of the services are separate. > > So a node can be part of the cluster but doesn't need to share the > > resources such as a shared san disk. If you have the resources set up so > > that it requires that resource, then it should be fenced. RHEL5 is the same FWIW, or should be. 
> *when a clurgmgrd starts, it wants to know the status of all the > services, and to make thing sure, it stops all services locally > (unmounts the filesystems, runs the scripts with "stop") - and asks the > already-running cluster members their idea of the status Right. > *when the clurgmgrd on the fifth node starts, it tries to stop locally > the SAN requiring services - and cannot match the /dev// paths > with real nodes, so it ends up with incoherent information about their > status This should not cause a problem. > *if all the nodes with SAN access are restarted (while the fifth node is > up), the nodes with SAN access first stop the services locally - and > then, apparently, ask the fifth node about the service status. Result: > a line like the following, for each service: > > --cut-- > Jun 28 17:56:20 pcn2.mappi.helsinki.fi clurgmgrd[5895]: #34: Cannot get status for service service:im > --cut-- What do you mean here, (sorry, being daft) Restart all nodes = "just rgmanager on all nodes", or "reboot all nodes"? > (what is weird, though, is that the fifth node knows perfectly well the > status of this particular service, since it's running the service > (service:im doesn't need the SAN access) - perhaps there is some other > reason not to believe the fifth node at this point. can't imagine what > it'd be, though.) cman_tool services from each node could help here. > *after that, the nodes with SAN access do nothing about any services > until after the fifth node has left the cluster and has been fenced. If you're rebooting the other 4 nodes, it sounds like the 5th is holding some sort of a lock which it shouldn't be across quorum transitions (which would be a bug). If this is the case, could you: * install rgmanager-debuginfo * get me a backtrace: gdb clurgmgrd `pidof clurgmgrd` thr a a bt -- Lon -- Lon Hohberger - Software Engineer - Red Hat, Inc. From lhh at redhat.com Thu Jun 28 18:41:05 2007 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 28 Jun 2007 14:41:05 -0400 Subject: [Linux-cluster] Newbie Question In-Reply-To: <1183052621.3641.5.camel@Dilbert> References: <1183052621.3641.5.camel@Dilbert> Message-ID: <20070628184105.GW8818@redhat.com> On Thu, Jun 28, 2007 at 12:43:41PM -0500, Mike Duncan wrote: > I hope this is a good place for beginner questions. If not, please let > me know and I'll go away.... > > I am trying to construct my first cluster. I have 6 PCs (P3s) and am > trying to learn the basics. I have MPI up and running so that the nodes > will answer and they run a simple "hello world" program. > > Here's my main question: I'm running Fedora Core 6 on this cluster, and > would like to implement GFS as each node has a huge data HDD--160G per > node. However I cannot find any information on the Net about > implementing GFS on FC6. I found some old info about FC4, but it was > useless. I see that RHEL has the functionality included. Do I need to > install RHEL? No, you don't; the RHEL5 documentation for installing/configuring GFS should apply to Fedora Core 6. -- Lon Hohberger - Software Engineer - Red Hat, Inc. 
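For a rough idea of what the GFS setup itself looks like once the cluster infrastructure (cman, fencing, clvmd) is up, a hedged sketch using GFS2 from gfs2-utils; the cluster name, filesystem name, device and mount point below are invented, and the GFS1 equivalent is gfs_mkfs with the same -p/-t/-j options:

    # One journal per node that will mount the filesystem; the part of -t
    # before the colon must match the cluster name in cluster.conf.
    mkfs.gfs2 -p lock_dlm -t mycluster:shared01 -j 6 /dev/vg_shared/lv_shared01
    mkdir -p /mnt/shared01
    mount -t gfs2 /dev/vg_shared/lv_shared01 /mnt/shared01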
From janne.peltonen at helsinki.fi Thu Jun 28 18:40:54 2007 From: janne.peltonen at helsinki.fi (Janne Peltonen) Date: Thu, 28 Jun 2007 21:40:54 +0300 Subject: [Linux-cluster] Cluster node without access to all resources - trouble In-Reply-To: <20070628183341.GU8818@redhat.com> References: <20070628154537.GA3108@helsinki.fi> <20070628183341.GU8818@redhat.com> Message-ID: <20070628184053.GD3221@helsinki.fi> On Thu, Jun 28, 2007 at 02:33:42PM -0400, Lon Hohberger wrote: > (1) don't start rgmanager on the fifth node :), or ...now there's an idea :) > (2) if you do start rgmanager on the fifth node, make all services be > part of a "restricted" failover domain comprised of the other four > nodes. > > > Or > > should I just make sure that the fifth member always comes up last (that > > is, won't be running while the others are coming up)? Or should I aceept > > that I'm going to create more harm than avoiding by letting the fifth > > node belong to the cluster, and just run it outside the cluster? > > If the above two don't work, it's a bug. If (2) means /all/ the services, even the one that should be running on the fifth node, it's more or less equal to (1), isn't it? That is, the service that I want to be running on node five can't be a clustered service (which is, come to think of it, exactly what I want...) > (Oh! and those status-script checks will be fixed in 5.1 ;) ). (Thanks.) --Janne -- Janne Peltonen From chris at cmiware.com Thu Jun 28 19:00:05 2007 From: chris at cmiware.com (Chris Harms) Date: Thu, 28 Jun 2007 14:00:05 -0500 Subject: [Linux-cluster] manual fencing problem In-Reply-To: <20070628182553.GS8818@redhat.com> References: <46819E5C.1090802@cmiware.com> <20070628182553.GS8818@redhat.com> Message-ID: <46840535.9080608@cmiware.com> luci-0.9.2-6.el5 ricci-0.9.2-6.el5 cman-2.0.64-1.el5 fence_tool 2.0.64 (built May 10 2007 17:58:41) Lon Hohberger wrote: > On Tue, Jun 26, 2007 at 06:16:44PM -0500, Chris Harms wrote: > >> Trying to setup manual fencing for testing purposes in Conga gave me the >> following errors: >> >> agent "fence_manual" reports: failed: fence_manual no node name >> >> It appears this came up before: >> http://www.redhat.com/archives/linux-cluster/2006-May/msg00088.html >> >> but is still unresolved. >> > > Ew.. What releases of conga (luci/ricci) system-config-cluster and fence > do you have installed? > > From Robert.Hell at fabasoft.com Thu Jun 28 19:12:09 2007 From: Robert.Hell at fabasoft.com (Hell, Robert) Date: Thu, 28 Jun 2007 21:12:09 +0200 Subject: AW: [Linux-cluster] QDISK problem In-Reply-To: <20070628181813.GO8818@redhat.com> References: <467BBD2B.A5F6.00ED.0@itdynamics.co.za> <20070628181813.GO8818@redhat.com> Message-ID: By the way - is there a release date for 5.1? I couldn't find one online ... Regards, Robert -----Urspr?ngliche Nachricht----- Von: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] Im Auftrag von Lon Hohberger Gesendet: Donnerstag, 28. Juni 2007 20:18 An: linux clustering Betreff: Re: [Linux-cluster] QDISK problem On Fri, Jun 22, 2007 at 12:14:35PM +0200, Gavin Fietze wrote: > I am trying to get qdiskd to work in a 3 node cluster using RHEL 5 AP . The 3 nodes are virtual machines running XEN. Domain0 is also RHEL 5 AP > I have upgraded cman and rgmanager to cman-2.0.60-1.el5 and rgmanager-2.0.23-1 respectively, everything else stock standard. > > When I run clustat and "cman_tool nodes" I get strange output for the qdisk object : This is a bug in rgmanager; it will be fixed in 5.1. 
For now, since you have 3 nodes, just go without qdisk. -- Lon -- Lon Hohberger - Software Engineer - Red Hat, Inc. -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From jparsons at redhat.com Thu Jun 28 19:39:47 2007 From: jparsons at redhat.com (jim parsons) Date: Thu, 28 Jun 2007 15:39:47 -0400 Subject: [Linux-cluster] manual fencing problem In-Reply-To: <20070628182553.GS8818@redhat.com> References: <46819E5C.1090802@cmiware.com> <20070628182553.GS8818@redhat.com> Message-ID: <1183059587.3017.4.camel@localhost.localdomain> On Thu, 2007-06-28 at 14:25 -0400, Lon Hohberger wrote: > On Tue, Jun 26, 2007 at 06:16:44PM -0500, Chris Harms wrote: > > Trying to setup manual fencing for testing purposes in Conga gave me the > > following errors: > > > > agent "fence_manual" reports: failed: fence_manual no node name > > > > It appears this came up before: > > http://www.redhat.com/archives/linux-cluster/2006-May/msg00088.html > > > > but is still unresolved. > > Ew.. What releases of conga (luci/ricci) system-config-cluster and fence > do you have installed? > This is a known bug and was fixed in the current 5.1 beta...we can provide you a patch whether you are using 4 or 5. What version are you running? -J From jparsons at redhat.com Thu Jun 28 19:41:43 2007 From: jparsons at redhat.com (jim parsons) Date: Thu, 28 Jun 2007 15:41:43 -0400 Subject: AW: [Linux-cluster] QDISK problem In-Reply-To: References: <467BBD2B.A5F6.00ED.0@itdynamics.co.za> <20070628181813.GO8818@redhat.com> Message-ID: <1183059703.3017.6.camel@localhost.localdomain> On Thu, 2007-06-28 at 21:12 +0200, Hell, Robert wrote: > By the way - is there a release date for 5.1? > > I couldn't find one online ... Beta froze yesterday. Prolly a week? -j > -----Urspr?ngliche Nachricht----- > Von: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] Im Auftrag von Lon Hohberger > Gesendet: Donnerstag, 28. Juni 2007 20:18 > An: linux clustering > Betreff: Re: [Linux-cluster] QDISK problem > > On Fri, Jun 22, 2007 at 12:14:35PM +0200, Gavin Fietze wrote: > > I am trying to get qdiskd to work in a 3 node cluster using RHEL 5 AP . The 3 nodes are virtual machines running XEN. Domain0 is also RHEL 5 AP > > I have upgraded cman and rgmanager to cman-2.0.60-1.el5 and rgmanager-2.0.23-1 respectively, everything else stock standard. > > > > When I run clustat and "cman_tool nodes" I get strange output for the qdisk object : > > This is a bug in rgmanager; it will be fixed in 5.1. For now, since you > have 3 nodes, just go without qdisk. > > -- Lon > From chris at cmiware.com Thu Jun 28 19:56:29 2007 From: chris at cmiware.com (Chris Harms) Date: Thu, 28 Jun 2007 14:56:29 -0500 Subject: [Linux-cluster] manual fencing problem In-Reply-To: <1183059587.3017.4.camel@localhost.localdomain> References: <46819E5C.1090802@cmiware.com> <20070628182553.GS8818@redhat.com> <1183059587.3017.4.camel@localhost.localdomain> Message-ID: <4684126D.3070302@cmiware.com> Using CS 5 on RHEL 5 via RedHat Network. Thanks, Chris jim parsons wrote: > On Thu, 2007-06-28 at 14:25 -0400, Lon Hohberger wrote: > >> On Tue, Jun 26, 2007 at 06:16:44PM -0500, Chris Harms wrote: >> >>> Trying to setup manual fencing for testing purposes in Conga gave me the >>> following errors: >>> >>> agent "fence_manual" reports: failed: fence_manual no node name >>> >>> It appears this came up before: >>> http://www.redhat.com/archives/linux-cluster/2006-May/msg00088.html >>> >>> but is still unresolved. 
>>> >> Ew.. What releases of conga (luci/ricci) system-config-cluster and fence >> do you have installed? >> >> > This is a known bug and was fixed in the current 5.1 beta...we can > provide you a patch whether you are using 4 or 5. What version are you > running? > > -J > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From mike at duncancg.com Thu Jun 28 20:04:49 2007 From: mike at duncancg.com (Mike Duncan) Date: Thu, 28 Jun 2007 15:04:49 -0500 Subject: [Linux-cluster] Newbie Question In-Reply-To: <20070628184105.GW8818@redhat.com> References: <1183052621.3641.5.camel@Dilbert> <20070628184105.GW8818@redhat.com> Message-ID: <1183061089.3641.12.camel@Dilbert> Hmm. OK. Does that mean I still need to acquire RH's Cluster Suite? Unfortunately, I'm doing this as a personal project, and do not have a company's resources behind me. Mike > useless. I see that RHEL has the functionality included. Do I need to > > install RHEL? > > No, you don't; the RHEL5 documentation for installing/configuring GFS > should apply to Fedora Core 6. > From janne.peltonen at helsinki.fi Thu Jun 28 20:49:11 2007 From: janne.peltonen at helsinki.fi (Janne Peltonen) Date: Thu, 28 Jun 2007 23:49:11 +0300 Subject: [Linux-cluster] Clustered VGs with many PVs slow to activate Message-ID: <20070628204911.GE3221@helsinki.fi> Hi. There is a clustered vg in my five-node cluster (four of which have access to the SAN where the physical devices reside). It seems to me that the more LV's and PV's I've got, the more time it takes to get the VG activated. When I had 8 PV's and 79 LV's, it took some ten minutes to activate the VG on each node. Now, I extended the VG by 39 PV's (so that my sometimes-disk-intensive services wouldn't interfere with each other, each has its own 'disk' from the SAN) and I haven't succeeded in activating the VG anymore. (I did the pvcreates on one node, and they were fast. Thereafter I run the vgextend on the same node, and it took a couple of minutes. Then I tried to do vgdisplay on one of the other nodes, and got an error (the lvm couldn't find the new PV's and then, the VG itself). I rebooted all the nodes but they haven't booted yet. The cluster seems to be up, and the nodes have all successfully started CLVMD (I'm seeing this from our remote syslog host, I don't have access to the node consoles right now (stupid java remote consoles and stupid JVMs that don't handle slow X11 connections well)) - and that's it. They are all probably trying to activate the VG with 47 PV's, and it seems to take ages. It started an three-quarters of an hour ago, and now I'm going to sleep (it's midnight here) and see if it'll be up by tomorrow morning... --Janne -- Janne Peltonen From janne.peltonen at helsinki.fi Thu Jun 28 20:51:19 2007 From: janne.peltonen at helsinki.fi (Janne Peltonen) Date: Thu, 28 Jun 2007 23:51:19 +0300 Subject: [Linux-cluster] Cluster node without access to all resources-trouble In-Reply-To: <20070628183944.GV8818@redhat.com> References: <20070628162336.GA3221@helsinki.fi> <20070628165405.GB3221@helsinki.fi> <20070628183944.GV8818@redhat.com> Message-ID: <20070628205119.GF3221@helsinki.fi> On Thu, Jun 28, 2007 at 02:39:44PM -0400, Lon Hohberger wrote: > > > *if all the nodes with SAN access are restarted (while the fifth node is > > up), the nodes with SAN access first stop the services locally - and > > then, apparently, ask the fifth node about the service status. 
Result: > > a line like the following, for each service: > > > > --cut-- > > Jun 28 17:56:20 pcn2.mappi.helsinki.fi clurgmgrd[5895]: #34: Cannot get status for service service:im > > --cut-- > > What do you mean here, (sorry, being daft) > > Restart all nodes = "just rgmanager on all nodes", or "reboot all > nodes"? Reboot all nodes. > > *after that, the nodes with SAN access do nothing about any services > > until after the fifth node has left the cluster and has been fenced. > If you're rebooting the other 4 nodes, it sounds like the 5th is holding > some sort of a lock which it shouldn't be across quorum transitions > (which would be a bug). > > If this is the case, could you: > > * install rgmanager-debuginfo > * get me a backtrace: > > gdb clurgmgrd `pidof clurgmgrd` > thr a a bt I'll try to find the time for this tomorrow or something. (This behaviour doesn't really make the cluster un-production-useable, so I'm trying to solve the other problems first ;) --Janne -- Janne Peltonen From janne.peltonen at helsinki.fi Thu Jun 28 21:49:07 2007 From: janne.peltonen at helsinki.fi (Janne Peltonen) Date: Fri, 29 Jun 2007 00:49:07 +0300 Subject: [Linux-cluster] Clustered VGs with many PVs slow to activate In-Reply-To: <20070628204911.GE3221@helsinki.fi> References: <20070628204911.GE3221@helsinki.fi> Message-ID: <20070628214907.GA3768@helsinki.fi> On Thu, Jun 28, 2007 at 11:49:11PM +0300, Janne Peltonen wrote: > that's it. They are all probably trying to activate the VG with 47 PV's, > and it seems to take ages. It started an three-quarters of an hour ago, > and now I'm going to sleep (it's midnight here) and see if it'll be up > by tomorrow morning... So I didn't go to sleep. The VG took /one and a half hours/ to activate. And operations such as pvdisplay /dev/sdau1 take ages (minutes and minutes), and the pvdisplay appears to hog cpu. Meanwhile: [jmmpelto at pcn3 ~]$ sudo dd if=/dev/sdau1 of=/tmp/huu bs=1k count=10000 10000+0 records in 10000+0 records out 10240000 bytes (10 MB) copied, 0.30667 seconds, 33.4 MB/s Er. --Janne -- Janne Peltonen From rgrover1 at gmail.com Thu Jun 28 22:49:35 2007 From: rgrover1 at gmail.com (Rohit Grover) Date: Fri, 29 Jun 2007 10:49:35 +1200 Subject: [Linux-cluster] wishing to run GFS on iSCSI with redundancy In-Reply-To: <20070628183032.GT8818@redhat.com> References: <20070628183032.GT8818@redhat.com> Message-ID: <426bed110706281549n70b37bfaq178c04530540c568@mail.gmail.com> Hello, Thanks a lot for responding. Ok, couple of notes: > > * MD is only unsafe in a cluster if it's used on multiple cluster nodes. > That is, it should be fairly easy to implement a resource agent which > assembles MD devices from network block devices - on one node at a time. True. I would like to have MD assembling iSCSI initiators (the same set, of course) at multiple nodes. This will facilitate load distribution. Isn't it true that if MD is made to not cache any data flowing through it (and leave GFS to do caching and coherency control across the cluster), then MD should be a viable solution to putting together iSCSI initiators with RAID? * DRBD only will work with two writers (if 0.8.x+). I'm not sure how many > mirror targets you can maintain. Could you please elaborate on this? I don't understand what is meant by 'DRBD will only work with two writers'. Thanks. * Aren't most iSCSI targets RAID arrays already (?) Yes, they are in our case. But we also want to survive software/firmware failures of the iSCSI targets. Thanks, Rohit Grover. 
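For reference, the one-node-at-a-time arrangement Lon describes might be sketched as below; it gives failover rather than the simultaneous multi-node access asked about here, and the device names /dev/sdb and /dev/sdc (standing in for the two iSCSI-backed disks) are assumptions, not taken from this thread.

    # one-time creation of the mirror, on a single node only
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc

    # on whichever single node currently owns the storage service
    mdadm --assemble /dev/md0 /dev/sdb /dev/sdc

    # before relocating that service to another node
    mdadm --stop /dev/md0

A resource agent wrapping the assemble/stop pair is what would keep the array active on only one node at a time, which is the condition under which MD stays safe in a cluster.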
-------------- next part -------------- An HTML attachment was scrubbed... URL: From janne.peltonen at helsinki.fi Fri Jun 29 07:21:42 2007 From: janne.peltonen at helsinki.fi (Janne Peltonen) Date: Fri, 29 Jun 2007 10:21:42 +0300 Subject: [Linux-cluster] Clustered VGs with many PVs slow to activate In-Reply-To: <20070628214907.GA3768@helsinki.fi> References: <20070628204911.GE3221@helsinki.fi> <20070628214907.GA3768@helsinki.fi> Message-ID: <20070629072141.GE29854@helsinki.fi> On Fri, Jun 29, 2007 at 12:49:07AM +0300, Janne Peltonen wrote: > On Thu, Jun 28, 2007 at 11:49:11PM +0300, Janne Peltonen wrote: > > that's it. They are all probably trying to activate the VG with 47 PV's, > > and it seems to take ages. It started an three-quarters of an hour ago, > > and now I'm going to sleep (it's midnight here) and see if it'll be up > > by tomorrow morning... > > So I didn't go to sleep. The VG took /one and a half hours/ to activate. > And operations such as pvdisplay /dev/sdau1 take ages (minutes and > minutes), and the pvdisplay appears to hog cpu. Meanwhile: Some data to give you the feel of things: [jmmpelto at pcn1 ~]$ time sudo service clvmd restart Deactivating VG mappi-primary: 0 logical volume(s) in volume group "mappi-primary" now active [ OK ] Stopping clvm: [ OK ] Starting clvmd: [ OK ] Activating VGs: 2 logical volume(s) in volume group "main" now active 78 logical volume(s) in volume group "mappi-primary" now active [ OK ] real 4m40.448s user 0m0.662s sys 0m0.299s [jmmpelto at pcn1 ~]$ time sudo vgextend mappi-primary $(for LETTER in {i..r}; do echo /dev/sd${LETTER}1; done) Password: Volume group "mappi-primary" successfully extended real 0m7.534s user 0m0.197s sys 0m0.112s [jmmpelto at pcn1 ~]$ time sudo service clvmd restart Deactivating VG mappi-primary: 0 logical volume(s) in volume group "mappi-primary" now active [ OK ] Stopping clvm: [ OK ] Starting clvmd: [ OK ] Activating VGs: 2 logical volume(s) in volume group "main" now active 78 logical volume(s) in volume group "mappi-primary" now active [ OK ] real 43m17.340s user 0m2.473s sys 0m0.528s Adding ten PV's increased the deactivation-activation-cycle time tenfold. Whatever might be the reason for this. 
--Janne -- Janne Peltonen From janne.peltonen at helsinki.fi Fri Jun 29 07:35:55 2007 From: janne.peltonen at helsinki.fi (Janne Peltonen) Date: Fri, 29 Jun 2007 10:35:55 +0300 Subject: [Linux-cluster] Clustered VGs with many PVs slow to activate In-Reply-To: <20070629072141.GE29854@helsinki.fi> References: <20070628204911.GE3221@helsinki.fi> <20070628214907.GA3768@helsinki.fi> <20070629072141.GE29854@helsinki.fi> Message-ID: <20070629073554.GF29854@helsinki.fi> There seems to be great variation in the cycle time in different SAN load conditions: On Fri, Jun 29, 2007 at 10:21:42AM +0300, Janne Peltonen wrote: > [jmmpelto at pcn1 ~]$ time sudo service clvmd restart > Deactivating VG mappi-primary: 0 logical volume(s) in volume group "mappi-primary" now active > [ OK ] > Stopping clvm: [ OK ] > Starting clvmd: [ OK ] > Activating VGs: 2 logical volume(s) in volume group "main" now active > 78 logical volume(s) in volume group "mappi-primary" now active > [ OK ] > > real 4m40.448s > user 0m0.662s > sys 0m0.299s (added and reduced 10 PV's) (and the activity on the SAN on other nodes decreased) [jmmpelto at pcn1 ~]$ time sudo service clvmd restart Deactivating VG mappi-primary: 0 logical volume(s) in volume group "mappi-primary" now active [ OK ] Stopping clvm: [ OK ] Starting clvmd: [ OK ] Activating VGs: 2 logical volume(s) in volume group "main" now active 78 logical volume(s) in volume group "mappi-primary" now active [ OK ] real 1m54.891s user 0m0.672s sys 0m0.324s [jmmpelto at pcn1 ~]$ time sudo service clvmd restart Password: Deactivating VG mappi-primary: 0 logical volume(s) in volume group "mappi-primary" now active [ OK ] Stopping clvm: [ OK ] Starting clvmd: [ OK ] Activating VGs: 2 logical volume(s) in volume group "main" now active 78 logical volume(s) in volume group "mappi-primary" now active [ OK ] real 2m3.736s user 0m0.660s sys 0m0.321s --Janne -- Janne Peltonen From breeves at redhat.com Fri Jun 29 08:54:06 2007 From: breeves at redhat.com (Bryn M. Reeves) Date: Fri, 29 Jun 2007 09:54:06 +0100 Subject: [Linux-cluster] Clustered VGs with many PVs slow to activate In-Reply-To: <20070628204911.GE3221@helsinki.fi> References: <20070628204911.GE3221@helsinki.fi> Message-ID: <4684C8AE.5030706@redhat.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Janne Peltonen wrote: > Hi. > > There is a clustered vg in my five-node cluster (four of which have > access to the SAN where the physical devices reside). It seems to me > that the more LV's and PV's I've got, the more time it takes to get the > VG activated. When I had 8 PV's and 79 LV's, it took some ten minutes to > activate the VG on each node. Now, I extended the VG by 39 PV's (so that my Did you put a metadata copy on each PV? Check with vgdisplay: --- Volume group --- VG Name t0 System ID Format lvm2 Metadata Areas 4 <--------- Metadata Sequence No 3 VG Access read/write VG Status resizable MAX LV 0 Cur LV 0 Open LV 0 Max PV 0 Cur PV 64 Act PV 64 VG Size 7.75 GB PE Size 4.00 MB Total PE 1984 Alloc PE / Size 0 / 0 Free PE / Size 1984 / 7.75 GB VG UUID PcERts-A1MU-KoT6-gu6R-Acuw-MYuf-08Ximv If that's the case, you probably don't want one for each PV here. It's unnecessary and will slow the tools down a lot when there are a large number of PVs. Check out the --metadatacopies option to pvcreate and re-create the volume group with a much smaller number of MDAs if this is the problem you're seeing. If you're careful you can do this in-place via the - --restorefile option to pvcreate and vgcfgbackup/vgcfgrestore. Regards, Bryn. 
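A rough, untested sketch of that in-place procedure, using the VG name from this thread and /dev/sdi1 as an example PV; the UUID shown is a placeholder for the PV's existing UUID as reported by pvdisplay:

    # back up the current VG metadata
    vgcfgbackup mappi-primary            # writes /etc/lvm/backup/mappi-primary

    # rewrite the PV label so this PV no longer carries a metadata copy
    pvcreate --restorefile /etc/lvm/backup/mappi-primary \
             --uuid <existing-PV-UUID> --metadatacopies 0 /dev/sdi1

    # push the VG metadata back out
    vgcfgrestore mappi-primary

Doing this with the VG deactivated on all nodes, and keeping --metadatacopies 1 on a few of the PVs so the VG still has redundant metadata, would be the cautious route; pvcreate may also insist on a force flag since the device already carries a PV label.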
-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (GNU/Linux) Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org iD8DBQFGhMiu6YSQoMYUY94RApX4AKCEW1/2ybBF7dmGnIHVTN1kKUpYUACfbxt4 kBTSA2222ahlwxasXKp3ffg= =Gohu -----END PGP SIGNATURE----- From janne.peltonen at helsinki.fi Fri Jun 29 09:21:32 2007 From: janne.peltonen at helsinki.fi (Janne Peltonen) Date: Fri, 29 Jun 2007 12:21:32 +0300 Subject: [Linux-cluster] Clustered VGs with many PVs slow to activate In-Reply-To: <4684C8AE.5030706@redhat.com> References: <20070628204911.GE3221@helsinki.fi> <4684C8AE.5030706@redhat.com> Message-ID: <20070629092132.GG29854@helsinki.fi> On Fri, Jun 29, 2007 at 09:54:06AM +0100, Bryn M. Reeves wrote: > > There is a clustered vg in my five-node cluster (four of which have > > access to the SAN where the physical devices reside). It seems to me > > that the more LV's and PV's I've got, the more time it takes to get the > > VG activated. When I had 8 PV's and 79 LV's, it took some ten minutes to > > activate the VG on each node. Now, I extended the VG by 39 PV's (so that my > > Did you put a metadata copy on each PV? Check with vgdisplay: [..] > If that's the case, you probably don't want one for each PV here. It's > unnecessary and will slow the tools down a lot when there are a large > number of PVs. > > Check out the --metadatacopies option to pvcreate and re-create the > volume group with a much smaller number of MDAs if this is the problem > you're seeing. If you're careful you can do this in-place via the > - --restorefile option to pvcreate and vgcfgbackup/vgcfgrestore. THANK you. And here I was thinking I knew the lvm tools completely... Apparantly one'd have to read the man pages of even everyday tools now and then. ;) --Janne -- Janne Peltonen From janne.peltonen at helsinki.fi Fri Jun 29 13:03:19 2007 From: janne.peltonen at helsinki.fi (Janne Peltonen) Date: Fri, 29 Jun 2007 16:03:19 +0300 Subject: [Linux-cluster] How to avoid all services starting on the first node that boots? Message-ID: <20070629130318.GJ29854@helsinki.fi> Hi! In my cluster, there are 39 services spread over 4 nodes. Any service can run on any node, so I've set the failover domain priorities up so that when any node goes down the services are spread more or less evenly on the remaining nodes. Even if there is but one node remaining, it can run the services. But there seems to be a catch. It seems to me that the first node that starts the rgmanager starts up all the services - and, since starting up the services takes up a lot of resources, it takes a long time (well, abt five minutes) until the services are relocated where they belong. Is there a way to increase the time that a node waits for a prior member in the failover domain to come up before it tries to start the service in the group? I couldn't find any, but perhaps I didn't search well enough. Thanks for any advice. --Janne -- Janne Peltonen From rpeterso at redhat.com Fri Jun 29 13:10:04 2007 From: rpeterso at redhat.com (Bob Peterson) Date: Fri, 29 Jun 2007 08:10:04 -0500 Subject: [Linux-cluster] Newbie Question In-Reply-To: <1183061089.3641.12.camel@Dilbert> References: <1183052621.3641.5.camel@Dilbert> <20070628184105.GW8818@redhat.com> <1183061089.3641.12.camel@Dilbert> Message-ID: <1183122604.11507.35.camel@technetium.msp.redhat.com> On Thu, 2007-06-28 at 15:04 -0500, Mike Duncan wrote: > Hmm. OK. Does that mean I still need to acquire RH's Cluster Suite? 
> Unfortunately, I'm doing this as a personal project, and do not have a > company's resources behind me. > > Mike Hi Mike, This is all open-source software, so you don't need to buy anything. You can download the source code for the entire cluster suite, compile it, install it, and use it. You can fetch the entire source code tree from CVS with this command: cvs -d :ext:sources.redhat.com:/cvs/cluster co cluster There are install packages for various platforms too, if you don't want to compile it yourself. Regards, Bob Peterson Red Hat Cluster Suite From janne.peltonen at helsinki.fi Fri Jun 29 13:25:56 2007 From: janne.peltonen at helsinki.fi (Janne Peltonen) Date: Fri, 29 Jun 2007 16:25:56 +0300 Subject: [Linux-cluster] fs.sh? Message-ID: <20070629132556.GK29854@helsinki.fi> Hi! I had the trouble with fs.sh a while ago, and, well, it hasn't gone anywhere. Are there any news? Thanks. --Janne -- Janne Peltonen From sholmes at surf7.com Fri Jun 29 13:58:54 2007 From: sholmes at surf7.com (steven holmes) Date: Fri, 29 Jun 2007 06:58:54 -0700 (PDT) Subject: [Linux-cluster] a cluster in a cluster Message-ID: <802137.86357.qm@web403.biz.mail.mud.yahoo.com> has any one tried to build a storage cluster and then build vmware on those hosts and make vm windows cluster accros the 2 hosts. -------------- next part -------------- An HTML attachment was scrubbed... URL: From carlopmart at gmail.com Fri Jun 29 15:19:07 2007 From: carlopmart at gmail.com (carlopmart) Date: Fri, 29 Jun 2007 17:19:07 +0200 Subject: [Linux-cluster] a cluster in a cluster In-Reply-To: <802137.86357.qm@web403.biz.mail.mud.yahoo.com> References: <802137.86357.qm@web403.biz.mail.mud.yahoo.com> Message-ID: <468522EB.9070404@gmail.com> steven holmes wrote: > has any one tried to build a storage cluster and then build vmware on > those hosts and make vm windows cluster accros the 2 hosts. > > > ------------------------------------------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Yes, using RHEL4.4 cluster suite, vmware server 1.0.3, and rhel4.4 guests with rhcs and vmware-tools installed (very important). Works very very well ... -- CL Martinez carlopmart {at} gmail {d0t} com From kristoffer.lippert at jppol.dk Fri Jun 29 15:23:20 2007 From: kristoffer.lippert at jppol.dk (Kristoffer Lippert) Date: Fri, 29 Jun 2007 17:23:20 +0200 Subject: [linux-cluster] multipath issue... Smells of hardware issue. Message-ID: <00B9BFA1C44A674794C9A1A4F5A22CA51A79EB@exchsrv07.rootdom.dk> Hi, I have a setup with two identical RX200s3 FuSi servers talking to a SAN (SX60 + extra controller), and that works fine with gfs1. I do however see some errors on one of the servers. It's in my message log and only now and then now and then (though always under load, but i cant load it and thereby force it to give the error). The error says: Jun 28 15:44:17 app02 multipathd: 8:16: mark as failed Jun 28 15:44:17 app02 multipathd: main_disk_volume1: remaining active paths: 1 Jun 28 15:44:17 app02 kernel: sd 2:0:0:0: SCSI error: return code = 0x00070000 Jun 28 15:44:17 app02 kernel: end_request: I/O error, dev sdb, sector 705160231 Jun 28 15:44:17 app02 kernel: device-mapper: multipath: Failing path 8:16. 
Jun 28 15:44:22 app02 multipathd: sdb: readsector0 checker reports path is up Jun 28 15:44:22 app02 multipathd: 8:16: reinstated Jun 28 15:44:22 app02 multipathd: main_disk_volume1: remaining active paths: 2 Jun 28 15:46:02 app02 multipathd: 8:32: mark as failed Jun 28 15:46:02 app02 multipathd: main_disk_volume1: remaining active paths: 1 Jun 28 15:46:02 app02 kernel: sd 3:0:0:0: SCSI error: return code = 0x00070000 Jun 28 15:46:02 app02 kernel: end_request: I/O error, dev sdc, sector 739870727 Jun 28 15:46:02 app02 kernel: device-mapper: multipath: Failing path 8:32. Jun 28 15:46:06 app02 multipathd: sdc: readsector0 checker reports path is up Jun 28 15:46:06 app02 multipathd: 8:32: reinstated Jun 28 15:46:06 app02 multipathd: main_disk_volume1: remaining active paths: 2 To me i looks like a fiber that bounces up and down. (There is no switch involved). Sometimes i only get a slightly shorter version: Jun 29 09:04:32 app02 kernel: sd 2:0:0:0: SCSI error: return code = 0x00070000 Jun 29 09:04:32 app02 kernel: end_request: I/O error, dev sdb, sector 2782490295 Jun 29 09:04:32 app02 kernel: device-mapper: multipath: Failing path 8:16. Jun 29 09:04:32 app02 multipathd: 8:16: mark as failed Jun 29 09:04:32 app02 multipathd: main_disk_volume1: remaining active paths: 1 Jun 29 09:04:37 app02 multipathd: sdb: readsector0 checker reports path is up Jun 29 09:04:37 app02 multipathd: 8:16: reinstated Jun 29 09:04:37 app02 multipathd: main_disk_volume1: remaining active paths: 2 Any sugestions, but start swapping hardware? Mvh / Kind regards Kristoffer Lippert Systemansvarlig JP/Politiken A/S Online Magasiner Tlf. +45 8738 3032 Cell. +45 6062 8703 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Christopher.Barry at qlogic.com Fri Jun 29 16:26:14 2007 From: Christopher.Barry at qlogic.com (Christopher Barry) Date: Fri, 29 Jun 2007 12:26:14 -0400 Subject: [Linux-cluster] Clustered VGs with many PVs slow to activate In-Reply-To: <20070629073554.GF29854@helsinki.fi> References: <20070628204911.GE3221@helsinki.fi> <20070628214907.GA3768@helsinki.fi> <20070629072141.GE29854@helsinki.fi> <20070629073554.GF29854@helsinki.fi> Message-ID: <1183134374.10344.51.camel@localhost> On Fri, 2007-06-29 at 10:35 +0300, Janne Peltonen wrote: > There seems to be great variation in the cycle time in different SAN > load conditions: > > On Fri, Jun 29, 2007 at 10:21:42AM +0300, Janne Peltonen wrote: > > [jmmpelto at pcn1 ~]$ time sudo service clvmd restart > > Deactivating VG mappi-primary: 0 logical volume(s) in volume group "mappi-primary" now active > > [ OK ] > > Stopping clvm: [ OK ] > > Starting clvmd: [ OK ] > > Activating VGs: 2 logical volume(s) in volume group "main" now active > > 78 logical volume(s) in volume group "mappi-primary" now active > > [ OK ] > > > > real 4m40.448s > > user 0m0.662s > > sys 0m0.299s > > (added and reduced 10 PV's) (and the activity on the SAN on other nodes > decreased) > > [jmmpelto at pcn1 ~]$ time sudo service clvmd restart > Deactivating VG mappi-primary: 0 logical volume(s) in volume group "mappi-primary" now active > [ OK ] > Stopping clvm: [ OK ] > Starting clvmd: [ OK ] > Activating VGs: 2 logical volume(s) in volume group "main" now active > 78 logical volume(s) in volume group "mappi-primary" now active > [ OK ] > > real 1m54.891s > user 0m0.672s > sys 0m0.324s > [jmmpelto at pcn1 ~]$ time sudo service clvmd restart > Password: > Deactivating VG mappi-primary: 0 logical volume(s) in volume group "mappi-primary" now active > [ OK ] > 
Stopping clvm: [ OK ] > Starting clvmd: [ OK ] > Activating VGs: 2 logical volume(s) in volume group "main" now active > 78 logical volume(s) in volume group "mappi-primary" now active > [ OK ] > > real 2m3.736s > user 0m0.660s > sys 0m0.321s > > > --Janne What's interesting to me here is the huge difference in real vs. user or sys time. It appears to spend most of the time waiting around. Can you trace the process to see what it's doing and where it sits and waits? -- Regards, -C From jparsons at redhat.com Fri Jun 29 16:34:18 2007 From: jparsons at redhat.com (jim parsons) Date: Fri, 29 Jun 2007 12:34:18 -0400 Subject: [Linux-cluster] a cluster in a cluster In-Reply-To: <468522EB.9070404@gmail.com> References: <802137.86357.qm@web403.biz.mail.mud.yahoo.com> <468522EB.9070404@gmail.com> Message-ID: <1183134858.3313.2.camel@localhost.localdomain> On Fri, 2007-06-29 at 17:19 +0200, carlopmart wrote: > > steven holmes wrote: > > has any one tried to build a storage cluster and then build vmware on > > those hosts and make vm windows cluster accros the 2 hosts. > > > > > > ------------------------------------------------------------------------ > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > Yes, using RHEL4.4 cluster suite, vmware server 1.0.3, and rhel4.4 > guests with rhcs and vmware-tools installed (very important). Works very > very well ... > This works OK in RHEL5 using xen as well. Here is a recipe for how to do it in the interface: http://sourceware.org/cluster/conga/cookbook/VMs_as_clusters/ Even if you don't use the interface for deployment, the pre-requisite list shown in these slides is valuable. -J From breeves at redhat.com Fri Jun 29 16:37:08 2007 From: breeves at redhat.com (Bryn M. Reeves) Date: Fri, 29 Jun 2007 17:37:08 +0100 Subject: [Linux-cluster] Clustered VGs with many PVs slow to activate In-Reply-To: <1183134374.10344.51.camel@localhost> References: <20070628204911.GE3221@helsinki.fi> <20070628214907.GA3768@helsinki.fi> <20070629072141.GE29854@helsinki.fi> <20070629073554.GF29854@helsinki.fi> <1183134374.10344.51.camel@localhost> Message-ID: <46853534.4030601@redhat.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Christopher Barry wrote: > What's interesting to me here is the huge difference in real vs. user or > sys time. It appears to spend most of the time waiting around. > > Can you trace the process to see what it's doing and where it sits and > waits? > It'll be waiting for I/O to all the metadata areas on all the different PVs. Regards, Bryn. 
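On the trace question: since the VG is clustered, most of that metadata I/O is done by clvmd rather than by the vgchange client itself, so attaching strace to the daemon while an activation runs would be one way to watch it wait. A sketch, with the output file name chosen arbitrarily:

    # on the node doing the activation
    strace -f -tt -T -o /tmp/clvmd.trace -p `pidof clvmd` &
    vgchange -ay mappi-primary
    kill %1      # stop tracing afterwards

Long gaps between timestamps, or large per-call times in the <...> column, around reads of the PV devices would line up with time spent scanning one metadata area per PV.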
-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (GNU/Linux) Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org iD8DBQFGhTU06YSQoMYUY94RArmcAKDVubXRwakPZM24kyDBmdk/v8V9IQCgohPF edZ1PUYQStsZheHmnzO74bI= =bu6q -----END PGP SIGNATURE----- From Christopher.Barry at qlogic.com Fri Jun 29 16:47:02 2007 From: Christopher.Barry at qlogic.com (Christopher Barry) Date: Fri, 29 Jun 2007 12:47:02 -0400 Subject: [Linux-cluster] a cluster in a cluster In-Reply-To: <1183134858.3313.2.camel@localhost.localdomain> References: <802137.86357.qm@web403.biz.mail.mud.yahoo.com> <468522EB.9070404@gmail.com> <1183134858.3313.2.camel@localhost.localdomain> Message-ID: <1183135623.10344.54.camel@localhost> On Fri, 2007-06-29 at 12:34 -0400, jim parsons wrote: > On Fri, 2007-06-29 at 17:19 +0200, carlopmart wrote: > > > > steven holmes wrote: > > > has any one tried to build a storage cluster and then build vmware on > > > those hosts and make vm windows cluster accros the 2 hosts. > > > > > > > > > ------------------------------------------------------------------------ > > > > > > -- > > > Linux-cluster mailing list > > > Linux-cluster at redhat.com > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > Yes, using RHEL4.4 cluster suite, vmware server 1.0.3, and rhel4.4 > > guests with rhcs and vmware-tools installed (very important). Works very > > very well ... > > > This works OK in RHEL5 using xen as well. Here is a recipe for how to do > it in the interface: > http://sourceware.org/cluster/conga/cookbook/VMs_as_clusters/ > > Even if you don't use the interface for deployment, the pre-requisite > list shown in these slides is valuable. > > -J > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Has anyone created a cluster on VMware ESX 3.0.x with FC SAN yet? -- Regards, -C From Christopher.Barry at qlogic.com Fri Jun 29 16:48:24 2007 From: Christopher.Barry at qlogic.com (Christopher Barry) Date: Fri, 29 Jun 2007 12:48:24 -0400 Subject: [Linux-cluster] Clustered VGs with many PVs slow to activate In-Reply-To: <46853534.4030601@redhat.com> References: <20070628204911.GE3221@helsinki.fi> <20070628214907.GA3768@helsinki.fi> <20070629072141.GE29854@helsinki.fi> <20070629073554.GF29854@helsinki.fi> <1183134374.10344.51.camel@localhost> <46853534.4030601@redhat.com> Message-ID: <1183135705.10344.57.camel@localhost> On Fri, 2007-06-29 at 17:37 +0100, Bryn M. Reeves wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Christopher Barry wrote: > > What's interesting to me here is the huge difference in real vs. user or > > sys time. It appears to spend most of the time waiting around. > > > > Can you trace the process to see what it's doing and where it sits and > > waits? > > > > It'll be waiting for I/O to all the metadata areas on all the different PVs. > > Regards, > Bryn. > > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.7 (GNU/Linux) > Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org > > iD8DBQFGhTU06YSQoMYUY94RArmcAKDVubXRwakPZM24kyDBmdk/v8V9IQCgohPF > edZ1PUYQStsZheHmnzO74bI= > =bu6q > -----END PGP SIGNATURE----- > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster And is this currently a serial event? 
-- Regards, -C From sholmes at surf7.com Fri Jun 29 17:03:53 2007 From: sholmes at surf7.com (steven holmes) Date: Fri, 29 Jun 2007 10:03:53 -0700 (PDT) Subject: [Linux-cluster] a cluster in a cluster In-Reply-To: <468522EB.9070404@gmail.com> Message-ID: <33489.66750.qm@web406.biz.mail.mud.yahoo.com> i am running into a small problem i get the windows cluster built but when i do the msdtc the server i do it from the other server can not write to it. carlopmart wrote: steven holmes wrote: > has any one tried to build a storage cluster and then build vmware on > those hosts and make vm windows cluster accros the 2 hosts. > > > ------------------------------------------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Yes, using RHEL4.4 cluster suite, vmware server 1.0.3, and rhel4.4 guests with rhcs and vmware-tools installed (very important). Works very very well ... -- CL Martinez carlopmart {at} gmail {d0t} com -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From sholmes at surf7.com Fri Jun 29 17:07:44 2007 From: sholmes at surf7.com (steven holmes) Date: Fri, 29 Jun 2007 10:07:44 -0700 (PDT) Subject: [Linux-cluster] a cluster in a cluster In-Reply-To: <1183135623.10344.54.camel@localhost> Message-ID: <580106.36531.qm@web404.biz.mail.mud.yahoo.com> yes but you can not use vmotion when you do this. Christopher Barry wrote: On Fri, 2007-06-29 at 12:34 -0400, jim parsons wrote: > On Fri, 2007-06-29 at 17:19 +0200, carlopmart wrote: > > > > steven holmes wrote: > > > has any one tried to build a storage cluster and then build vmware on > > > those hosts and make vm windows cluster accros the 2 hosts. > > > > > > > > > ------------------------------------------------------------------------ > > > > > > -- > > > Linux-cluster mailing list > > > Linux-cluster at redhat.com > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > Yes, using RHEL4.4 cluster suite, vmware server 1.0.3, and rhel4.4 > > guests with rhcs and vmware-tools installed (very important). Works very > > very well ... > > > This works OK in RHEL5 using xen as well. Here is a recipe for how to do > it in the interface: > http://sourceware.org/cluster/conga/cookbook/VMs_as_clusters/ > > Even if you don't use the interface for deployment, the pre-requisite > list shown in these slides is valuable. > > -J > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Has anyone created a cluster on VMware ESX 3.0.x with FC SAN yet? -- Regards, -C -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From sholmes at surf7.com Fri Jun 29 17:12:27 2007 From: sholmes at surf7.com (steven holmes) Date: Fri, 29 Jun 2007 10:12:27 -0700 (PDT) Subject: [Linux-cluster] a cluster in a cluster In-Reply-To: <468522EB.9070404@gmail.com> Message-ID: <714947.4942.qm@web414.biz.mail.mud.yahoo.com> this is what i used do you have the recipe you used carlopmart wrote: steven holmes wrote: > has any one tried to build a storage cluster and then build vmware on > those hosts and make vm windows cluster accros the 2 hosts. 
> > > ------------------------------------------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Yes, using RHEL4.4 cluster suite, vmware server 1.0.3, and rhel4.4 guests with rhcs and vmware-tools installed (very important). Works very very well ... -- CL Martinez carlopmart {at} gmail {d0t} com -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From carlopmart at gmail.com Fri Jun 29 19:17:52 2007 From: carlopmart at gmail.com (carlopmart) Date: Fri, 29 Jun 2007 21:17:52 +0200 Subject: [Linux-cluster] a cluster in a cluster In-Reply-To: <714947.4942.qm@web414.biz.mail.mud.yahoo.com> References: <714947.4942.qm@web414.biz.mail.mud.yahoo.com> Message-ID: <46855AE0.8080307@gmail.com> steven holmes wrote: > this is what i used do you have the recipe you used > > */carlopmart /* wrote: > > > > steven holmes wrote: > > has any one tried to build a storage cluster and then build > vmware on > > those hosts and make vm windows cluster accros the 2 hosts. > > > > > > > ------------------------------------------------------------------------ > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > Yes, using RHEL4.4 cluster suite, vmware server 1.0.3, and rhel4.4 > guests with rhcs and vmware-tools installed (very important). Works > very > very well ... > > -- > CL Martinez > carlopmart {at} gmail {d0t} com > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > ------------------------------------------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Yes, here it is: a) Install two host rhel4 nodes b) apply all patches to rhel4 c) Install development packages (kernel-devel, gcc, etc) on both hosts d) prepare some type of shared storage like iscsi, FC san, drbd, etc ... e) Install rhcs for rhel4 and do initial configuration, fence device included. f) configure shared storage on both nodes and format with gfs filesystem. g) Install vmware binaries on both servers (also you can install vmware binaries on shared storage, but I haven't test it yet) h) Install first rhel4 guest, applying all patches and vmware-tools (very very very important) on one node i) Install second rhel4 guest and do the same like you do it with the first guest, and on the same node j) Next you need to copy /etc/vmware/vm-* on the second host node k) Modify .vmx files to use same uuid on both nodes (you can do this starting rhel4 guest on the second node and accept warning about uuid) l) Create start and stop scripts for the guests using vmware-cmd command m) Do some tests using this scripts. n) Configure rhcs on hosts and test guest machines. o) Stop virtual machines and configure virtual shared storage. p) Install rhcs and gfs on both guest machines q) configure cluster services and test it. r) finish. I think that this should work with rhel5, but I haven't test it .... 
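Step l) of this recipe could look roughly like the wrapper below, usable as the start/stop script that rgmanager calls for each guest; the .vmx path is a placeholder and the exact vmware-cmd stop mode (hard/soft/trysoft) is worth checking against your VMware Server version:

    #!/bin/bash
    # /usr/local/sbin/vm-guest1 -- start/stop/status wrapper for one guest
    VMX=/vmstore/guest1/guest1.vmx

    case "$1" in
      start)  vmware-cmd "$VMX" start ;;
      stop)   vmware-cmd "$VMX" stop trysoft ;;
      status) vmware-cmd "$VMX" getstate | grep -q "= on" ;;
      *)      echo "usage: $0 {start|stop|status}"; exit 1 ;;
    esac

Pointing one script resource at one wrapper per guest keeps the cluster side simple and lets you test start/stop/status by hand before handing the guest over to rgmanager.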
-- CL Martinez carlopmart {at} gmail {d0t} com From sholmes at surf7.com Sat Jun 30 00:57:14 2007 From: sholmes at surf7.com (Steven C Holmes) Date: Fri, 29 Jun 2007 19:57:14 -0500 Subject: [Linux-cluster] a cluster in a cluster In-Reply-To: <46855AE0.8080307@gmail.com> References: <714947.4942.qm@web414.biz.mail.mud.yahoo.com> <46855AE0.8080307@gmail.com> Message-ID: <002801c7bab1$99b6a4d0$cd23ee70$@com> You did the same thing I did except for the uuid and I bet that is they key tank you very much I bet this fixes the problem. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of carlopmart Sent: Friday, June 29, 2007 2:18 PM To: linux clustering Subject: Re: [Linux-cluster] a cluster in a cluster steven holmes wrote: > this is what i used do you have the recipe you used > > */carlopmart /* wrote: > > > > steven holmes wrote: > > has any one tried to build a storage cluster and then build > vmware on > > those hosts and make vm windows cluster accros the 2 hosts. > > > > > > > ------------------------------------------------------------------------ > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > Yes, using RHEL4.4 cluster suite, vmware server 1.0.3, and rhel4.4 > guests with rhcs and vmware-tools installed (very important). Works > very > very well ... > > -- > CL Martinez > carlopmart {at} gmail {d0t} com > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > ------------------------------------------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Yes, here it is: a) Install two host rhel4 nodes b) apply all patches to rhel4 c) Install development packages (kernel-devel, gcc, etc) on both hosts d) prepare some type of shared storage like iscsi, FC san, drbd, etc ... e) Install rhcs for rhel4 and do initial configuration, fence device included. f) configure shared storage on both nodes and format with gfs filesystem. g) Install vmware binaries on both servers (also you can install vmware binaries on shared storage, but I haven't test it yet) h) Install first rhel4 guest, applying all patches and vmware-tools (very very very important) on one node i) Install second rhel4 guest and do the same like you do it with the first guest, and on the same node j) Next you need to copy /etc/vmware/vm-* on the second host node k) Modify .vmx files to use same uuid on both nodes (you can do this starting rhel4 guest on the second node and accept warning about uuid) l) Create start and stop scripts for the guests using vmware-cmd command m) Do some tests using this scripts. n) Configure rhcs on hosts and test guest machines. o) Stop virtual machines and configure virtual shared storage. p) Install rhcs and gfs on both guest machines q) configure cluster services and test it. r) finish. I think that this should work with rhel5, but I haven't test it .... 
-- CL Martinez carlopmart {at} gmail {d0t} com

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

From chris at cmiware.com  Sat Jun 30 18:41:03 2007
From: chris at cmiware.com (Chris Harms)
Date: Sat, 30 Jun 2007 13:41:03 -0500
Subject: [Linux-cluster] IP monitor failing periodically
Message-ID: <4686A3BF.9070609@cmiware.com>

I am experiencing periodic failovers due to a floating IP address not passing the status check:

clurgmgrd: [9975]: Failed to ping 192.168.13.204
Jun 30 11:41:47 nodeA clurgmgrd[9975]: status on ip "192.168.13.204" returned 1 (generic error)

Both nodes have bonded NICs with gigabit connections to redundant switches, so it is unlikely the links are going down; there is nothing in the logs about Linux losing them.

I parked all the cluster services - 2 Postgres services and 1 Apache - on one node and allowed it to run overnight. There would be no client activity during this time. One Postgres service failed this way twice and the other failed once. The Apache service did not fail.

What can I do to resolve this, or at least get more information out of the system?

Thanks in advance,
Chris
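One low-tech way to narrow this down would be to run the same check the agent is failing, continuously, on the node that owns the address, and see whether it ever fails outside of rgmanager. A sketch, with the interval and log path chosen arbitrarily:

    while :; do
      ping -c 1 -w 2 192.168.13.204 >/dev/null 2>&1 \
        || date '+%F %T ping to 192.168.13.204 failed'
      sleep 10
    done >> /tmp/vip-ping.log

If the manual pings never fail overnight while rgmanager's status checks still do, that points more toward load or scheduling on the node at check time than toward the address actually disappearing from the interface.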