From yazan at ccs.com.jo Sun Jan 2 18:02:58 2005 From: yazan at ccs.com.jo (Yazan Bakheit) Date: Sun, 2 Jan 2005 10:02:58 -0800 Subject: [Linux-cluster] quorum problem Message-ID: hi, i installed the cluster suite and then i found a problem in the shared as i cant see it,and then i used the gfs and configured it and solve the problem and after that i want to use the cluster suite gui but here in the gui there is a check box called (Has Quorum) but i cant checked it it seemes to be hidden, how can it be activated. i want to tell you that i used the documentation for the gfs which is (rh-gfsico-en-6.0) and i perform every thing but as two nodes, every thing is OK, but the original cluster suit seems to be not working, what should i do now? , i will send or write you the whole configuration that i used or creatre from the beginning if you want. Please Execuse me if im nagging you with these cases or these questions, i know that my questions may appeare as a stupid question but i am new in the field and i really need a help. Tahnk You Regards Yazan. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: tech.gif Type: image/gif Size: 862 bytes Desc: not available URL: From rstevens at vitalstream.com Mon Jan 3 17:21:42 2005 From: rstevens at vitalstream.com (Rick Stevens) Date: Mon, 03 Jan 2005 09:21:42 -0800 Subject: [Linux-cluster] GFS and Storage greater than 2 TB In-Reply-To: <41D4F81F.6000907@andrew.cmu.edu> References: <75E9203E0F0DD511B37E00D0B789D45007E64F44@fcv-stgo.cverde.cl> <41D4F81F.6000907@andrew.cmu.edu> Message-ID: <41D97F26.6050205@vitalstream.com> Jacob Joseph wrote: > Does this limit still exist with the cvs GFS on a 2.6 kernel? I believe the limit under a 2.6 kernel is 16TB, but I've not checked it. > > -Jacob > > Markus Miller wrote: > >> Thank you for the answer. That is all I needed to know. >> >> -----Mensaje original----- >> De: Rick Stevens [mailto:rstevens at vitalstream.com] >> Enviado el: Thursday, December 30, 2004 5:03 PM >> Para: linux clistering >> Asunto: Re: [Linux-cluster] GFS and Storage greater than 2 TB >> >> >> Markus Miller wrote: >> >>> Hi, >>> >>> researching I found a posting to this list made by Kevin Anderson >>> (Date: Tue, 19 Oct 2004 17:56:24 -0500) where he states the following: >>> >>> ---snip--- >>> Maximum size of each GFS filesystem for RHEL3 (2.4.x kernel) is 2 TB, >>> you can have multiple filesystems of that level. So, to get access to >>> 10TB of data requires a minimum of 5 separate filesystems/storage >>> combinations. >>> ---snip--- >>> >>> What do I have to do to achive this? Do I have to configure several >>> GFS clusters in the cluster.ccs file (each of a m?ximum size of 2 >>> TB)? Or do I have to configure one GFS cluster with serveral >>> filesystems each with a maximum size of 2 TB? The GFS Admin Guide is >>> not very precise, but what's really confusing me is the statement on >>> page 12: "2 TB maximum, for total of all storage connected to a GFS >>> cluster." >>> >>> At the moment we are evaluating to buy servers and storage, therefore >>> I do not have any equipment to do the testing myself. >>> Any coment is highly apreciated. >> >> >> >> It's the GFS filesystem that has the limit (actually, it's the 2.4 >> kernel). Essentially, "gfs_mkfs" can only handle a maximum of 2TB. >> >> What he means above is that you have to have five separate partitions >> of 2TB each and each with a GFS filesystem on them. You have to mount >> those five filesystems separately. If you're using VG/LVM, with a VG >> as "vggroup" and LVs in that group as "test1" through "test5": >> >> mount -t gfs /dev/mapper/vggroup-test1 /mnt/gfs1 >> mount -t gfs /dev/mapper/vggroup-test2 /mnt/gfs2 >> mount -t gfs /dev/mapper/vggroup-test3 /mnt/gfs3 >> mount -t gfs /dev/mapper/vggroup-test4 /mnt/gfs4 >> mount -t gfs /dev/mapper/vggroup-test5 /mnt/gfs5 >> >> How you use them after that is up to you. Just remember that a given >> GFS filesystem under kernel 2.4 is limited to 2TB maximum >> ---------------------------------------------------------------------- >> - Rick Stevens, Senior Systems Engineer rstevens at vitalstream.com - >> - VitalStream, Inc. http://www.vitalstream.com - >> - - >> - Brain: The organ with which we think that we think. - >> ---------------------------------------------------------------------- >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> http://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> http://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > -- ---------------------------------------------------------------------- - Rick Stevens, Senior Systems Engineer rstevens at vitalstream.com - - VitalStream, Inc. http://www.vitalstream.com - - - - Admitting you have a problem is the first step toward getting - - medicated for it. -- Jim Evarts (http://www.TopFive.com) - ---------------------------------------------------------------------- From yazan at ccs.com.jo Tue Jan 4 17:54:13 2005 From: yazan at ccs.com.jo (Yazan Bakheit) Date: Tue, 4 Jan 2005 09:54:13 -0800 Subject: [Linux-cluster] quorum problem Message-ID: hi, how can i make the checkbox for the quorum in the gui utility active, i mean that when i request the gui for the cluster suit there is a check box called (Has Quorum) and it is look to be hidden and i can't checked it even i have made two partitions for the quorom and i add them to the cluster and to the /etc/sysconfig/rawdevices but i cant checked it . how can i solve this? Thanks Yazan. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: tech.gif Type: image/gif Size: 862 bytes Desc: not available URL: From pcaulfie at redhat.com Tue Jan 4 09:16:06 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 4 Jan 2005 09:16:06 +0000 Subject: [Linux-cluster] sock_alloc 2.6.10 && gfs In-Reply-To: <22514.1104496178@www4.gmx.net> References: <22514.1104496178@www4.gmx.net> Message-ID: <20050104091605.GA23831@tykepenguin.com> On Fri, Dec 31, 2004 at 01:29:38PM +0100, Svetoslav Slavtchev wrote: > Hi guys, > it seems sock_alloc became static in 2.6.10 > and i'm not sure how exactly to fix gfs > ( > cluster/dlm/lowcomms.c:454 > memset(&peeraddr, 0, sizeof(peeraddr)); > newsock = sock_alloc(); > if (!newsock) > return -ENOMEM; > > ) > > do you think it'll be enough just to revert the change ? > (see attached diff ) As a quick hack that should work. What /should/ be done is to change lowcomms to use sock_create_kern() -- patrick From pcaulfie at redhat.com Tue Jan 4 11:29:24 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 4 Jan 2005 11:29:24 +0000 Subject: [Linux-cluster] cman bad generation number In-Reply-To: <1103736819.30947.20.camel@ibm-c.pdx.osdl.net> References: <1103654081.29749.17.camel@ibm-c.pdx.osdl.net> <20041222090832.GB1260@tykepenguin.com> <1103736819.30947.20.camel@ibm-c.pdx.osdl.net> Message-ID: <20050104112924.GB23831@tykepenguin.com> On Wed, Dec 22, 2004 at 09:33:39AM -0800, Daniel McNeil wrote: > > > > > > How does one know what the current "generation" number is? > > > > You don't, cman does. it's the current "generation" of the cluster which is > > incremented for each state transition. Are you taking nodes up and down during > > these tests?? > > The nodes are staying up. I am mounting and umounting a lot. > Any reason to not add generation /proc/cluster/status? (it would help > debugging at least). No reason at all not to, apart from I really don't think it will tell anyone anything useful. The cause of the problem is that the CMAN heartbeat messages are being lost on the network flooded by lock traffic. generation mismatches are just a symptom of that. > > I currently have it set up for manual fencing and I have yet to see that > work correctly. This was a 3 node cluster. cl032 got the bad > generation number and cman was "killed by STARTTRANS or NOMINATE" > cl030 got a bad generation number (but stayed up) and cl031 leaves > the cluster because it says cl030 told it to. So that leaves me > with 1 node up without quorum. I did not see any fencing messages. > > Should the surviving node (cl030) have attempted fencing or does > it only do that if it has quorum? ah no, fencing will only happen if the cluster has quorum. > I do not seem to be able to keep cman up for much past 2 days if > I have my tests running. (it stays up with no load, of course). > My tests are not the complicated currently either. Just tar, du > and rm in separate directories from 1, 2 and then 3 nodes > simultaneously. Who knows what will happen if I add tests > to cause lots of dlm lock conflict. > How long does cman stay up in your testing? I've never had iSCSI stay up long enough to find out :( -- patrick From mbrookov at mines.edu Tue Jan 4 14:44:59 2005 From: mbrookov at mines.edu (Matthew B. Brookover) Date: Tue, 04 Jan 2005 07:44:59 -0700 Subject: [Linux-cluster] ISCSI? (was cman bad generation number) In-Reply-To: <20050104112924.GB23831@tykepenguin.com> References: <1103654081.29749.17.camel@ibm-c.pdx.osdl.net> <20041222090832.GB1260@tykepenguin.com> <1103736819.30947.20.camel@ibm-c.pdx.osdl.net> <20050104112924.GB23831@tykepenguin.com> Message-ID: <1104849899.4815.13.camel@merlin.Mines.EDU> On Tue, 2005-01-04 at 04:29, Patrick Caulfield wrote: > I've never had iSCSI stay up long enough to find out :( Which iSCSI are you using? We are considering buying iSCSI based hardware from Left Hand Networks. I have not done any heavy testing, but I have used UNH-ISCSI for both the target and initiator with GFS and did not have any problems. Should I re-think this plan? Matt -------------- next part -------------- An HTML attachment was scrubbed... URL: From pcaulfie at redhat.com Tue Jan 4 16:14:27 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 4 Jan 2005 16:14:27 +0000 Subject: [Linux-cluster] ISCSI? (was cman bad generation number) In-Reply-To: <1104849899.4815.13.camel@merlin.Mines.EDU> References: <1103654081.29749.17.camel@ibm-c.pdx.osdl.net> <20041222090832.GB1260@tykepenguin.com> <1103736819.30947.20.camel@ibm-c.pdx.osdl.net> <20050104112924.GB23831@tykepenguin.com> <1104849899.4815.13.camel@merlin.Mines.EDU> Message-ID: <20050104161427.GC7994@tykepenguin.com> On Tue, Jan 04, 2005 at 07:44:59AM -0700, Matthew B. Brookover wrote: > On Tue, 2005-01-04 at 04:29, Patrick Caulfield wrote: > > I've never had iSCSI stay up long enough to find out :( > > Which iSCSI are you using? > > We are considering buying iSCSI based hardware from Left Hand Networks. I > have not done any heavy testing, but I have used UNH-ISCSI for both the > target and initiator with GFS and did not have any problems. Should I > re-think this plan? It might be my environment, others haven't reported problems. But my linux-iscsi-4.0.1.10 on kernel 2.6.9 just locks up on a regular basis - regardless of whether there is I/O to the device or not. -- patrick From pcaulfie at redhat.com Tue Jan 4 16:42:17 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 4 Jan 2005 16:42:17 +0000 Subject: [Linux-cluster] cman bad generation number In-Reply-To: <1103736819.30947.20.camel@ibm-c.pdx.osdl.net> References: <1103654081.29749.17.camel@ibm-c.pdx.osdl.net> <20041222090832.GB1260@tykepenguin.com> <1103736819.30947.20.camel@ibm-c.pdx.osdl.net> Message-ID: <20050104164217.GE7994@tykepenguin.com> Just to check that you are seeing what I think you're seeing, can you set some cman variables to increase the heartbeat frequency: echo "9" > /proc/cluster/config/cman/max_retries echo "1" > /proc/cluster/config/cman/hello_timer You'll need to do this before "cman_tool join". Thanks, patrick From crh at ubiqx.mn.org Tue Jan 4 21:19:58 2005 From: crh at ubiqx.mn.org (Christopher R. Hertel) Date: Tue, 4 Jan 2005 15:19:58 -0600 Subject: [Linux-cluster] ISCSI? In-Reply-To: <20050104161427.GC7994@tykepenguin.com> References: <1103654081.29749.17.camel@ibm-c.pdx.osdl.net> <20041222090832.GB1260@tykepenguin.com> <1103736819.30947.20.camel@ibm-c.pdx.osdl.net> <20050104112924.GB23831@tykepenguin.com> <1104849899.4815.13.camel@merlin.Mines.EDU> <20050104161427.GC7994@tykepenguin.com> Message-ID: <20050104211958.GP31004@Favog.ubiqx.mn.org> On Tue, Jan 04, 2005 at 04:14:27PM +0000, Patrick Caulfield wrote: > On Tue, Jan 04, 2005 at 07:44:59AM -0700, Matthew B. Brookover wrote: > > On Tue, 2005-01-04 at 04:29, Patrick Caulfield wrote: > > > > I've never had iSCSI stay up long enough to find out :( This is off-topic, but... Is the Ardis target the only iSCSI target source available for Linux? Chris -)----- -- "Implementing CIFS - the Common Internet FileSystem" ISBN: 013047116X Samba Team -- http://www.samba.org/ -)----- Christopher R. Hertel jCIFS Team -- http://jcifs.samba.org/ -)----- ubiqx development, uninq. ubiqx Team -- http://www.ubiqx.org/ -)----- crh at ubiqx.mn.org OnLineBook -- http://ubiqx.org/cifs/ -)----- crh at ubiqx.org From daniel at osdl.org Tue Jan 4 22:22:26 2005 From: daniel at osdl.org (Daniel McNeil) Date: Tue, 04 Jan 2005 14:22:26 -0800 Subject: [Linux-cluster] cman bad generation number In-Reply-To: <20050104164217.GE7994@tykepenguin.com> References: <1103654081.29749.17.camel@ibm-c.pdx.osdl.net> <20041222090832.GB1260@tykepenguin.com> <1103736819.30947.20.camel@ibm-c.pdx.osdl.net> <20050104164217.GE7994@tykepenguin.com> Message-ID: <1104877346.2838.17.camel@ibm-c.pdx.osdl.net> On Tue, 2005-01-04 at 08:42, Patrick Caulfield wrote: > Just to check that you are seeing what I think you're seeing, can you set some > cman variables to increase the heartbeat frequency: > > echo "9" > /proc/cluster/config/cman/max_retries > echo "1" > /proc/cluster/config/cman/hello_timer > > You'll need to do this before "cman_tool join". > > Thanks, > > patrick > I'll give this a try and let it run overnight. Daniel From mbrookov at mines.edu Tue Jan 4 22:23:10 2005 From: mbrookov at mines.edu (Matthew B. Brookover) Date: Tue, 04 Jan 2005 15:23:10 -0700 Subject: [Linux-cluster] ISCSI? In-Reply-To: <20050104211958.GP31004@Favog.ubiqx.mn.org> References: <1103654081.29749.17.camel@ibm-c.pdx.osdl.net> <20041222090832.GB1260@tykepenguin.com> <1103736819.30947.20.camel@ibm-c.pdx.osdl.net> <20050104112924.GB23831@tykepenguin.com> <1104849899.4815.13.camel@merlin.Mines.EDU> <20050104161427.GC7994@tykepenguin.com> <20050104211958.GP31004@Favog.ubiqx.mn.org> Message-ID: <1104877390.4815.35.camel@merlin.Mines.EDU> The unh-iscsi implements both target and initiator, see http://unh-iscsi.sourceforge.net/ for more information. I have used it with GFS and Linux 2.6.8.1. I could not get unh-iscsi to compile with linux 2.6.9 and switched to hardware scsi for more recent testing. I believe there is a fix for 2.6.9. Matt On Tue, 2005-01-04 at 14:19, Christopher R. Hertel wrote: > On Tue, Jan 04, 2005 at 04:14:27PM +0000, Patrick Caulfield wrote: > > On Tue, Jan 04, 2005 at 07:44:59AM -0700, Matthew B. Brookover wrote: > > > On Tue, 2005-01-04 at 04:29, Patrick Caulfield wrote: > > > > > > I've never had iSCSI stay up long enough to find out :( > > This is off-topic, but... > > Is the Ardis target the only iSCSI target source available for Linux? > > Chris -)----- -------------- next part -------------- An HTML attachment was scrubbed... URL: From crh at ubiqx.mn.org Tue Jan 4 22:44:32 2005 From: crh at ubiqx.mn.org (Christopher R. Hertel) Date: Tue, 4 Jan 2005 16:44:32 -0600 Subject: [Linux-cluster] ISCSI? In-Reply-To: <1104877390.4815.35.camel@merlin.Mines.EDU> References: <1103654081.29749.17.camel@ibm-c.pdx.osdl.net> <20041222090832.GB1260@tykepenguin.com> <1103736819.30947.20.camel@ibm-c.pdx.osdl.net> <20050104112924.GB23831@tykepenguin.com> <1104849899.4815.13.camel@merlin.Mines.EDU> <20050104161427.GC7994@tykepenguin.com> <20050104211958.GP31004@Favog.ubiqx.mn.org> <1104877390.4815.35.camel@merlin.Mines.EDU> Message-ID: <20050104224432.GS31004@Favog.ubiqx.mn.org> On Tue, Jan 04, 2005 at 03:23:10PM -0700, Matthew B. Brookover wrote: > The unh-iscsi implements both target and initiator, see > http://unh-iscsi.sourceforge.net/ for more information. I have used it > with GFS and Linux 2.6.8.1. I could not get unh-iscsi to compile with > linux 2.6.9 and switched to hardware scsi for more recent testing. I > believe there is a fix for 2.6.9. > > Matt Thanks! I should have remembered that UNH was working on iSCSI. Chris -)----- -- "Implementing CIFS - the Common Internet FileSystem" ISBN: 013047116X Samba Team -- http://www.samba.org/ -)----- Christopher R. Hertel jCIFS Team -- http://jcifs.samba.org/ -)----- ubiqx development, uninq. ubiqx Team -- http://www.ubiqx.org/ -)----- crh at ubiqx.mn.org OnLineBook -- http://ubiqx.org/cifs/ -)----- crh at ubiqx.org From daniel at osdl.org Tue Jan 4 22:46:17 2005 From: daniel at osdl.org (Daniel McNeil) Date: Tue, 04 Jan 2005 14:46:17 -0800 Subject: [Linux-cluster] cman bad generation number In-Reply-To: <20050104112924.GB23831@tykepenguin.com> References: <1103654081.29749.17.camel@ibm-c.pdx.osdl.net> <20041222090832.GB1260@tykepenguin.com> <1103736819.30947.20.camel@ibm-c.pdx.osdl.net> <20050104112924.GB23831@tykepenguin.com> Message-ID: <1104878776.2838.42.camel@ibm-c.pdx.osdl.net> On Tue, 2005-01-04 at 03:29, Patrick Caulfield wrote: > On Wed, Dec 22, 2004 at 09:33:39AM -0800, Daniel McNeil wrote: > > > > > > > > How does one know what the current "generation" number is? > > > > > > You don't, cman does. it's the current "generation" of the cluster which is > > > incremented for each state transition. Are you taking nodes up and down during > > > these tests?? > > > > The nodes are staying up. I am mounting and umounting a lot. > > Any reason to not add generation /proc/cluster/status? (it would help > > debugging at least). > > No reason at all not to, apart from I really don't think it will tell anyone > anything useful. The cause of the problem is that the CMAN heartbeat messages > are being lost on the network flooded by lock traffic. generation mismatches are > just a symptom of that. > One thing I do not understand is that I am leaving the nodes in the cluster and just doing mounting and umounting, so the generation number should not be changing. I think you are saying the the lock traffic is so high that the heart are lost so the node being kicked out is seeing the new heart beat from the other nodes and doesn't know they are not receiving his heartbeat messages. This node must be seeing the other nodes heartbeat messages or it would have started a membership transition without the other nodes. Do I have this right? Shouldn't the heartbeat messages have higher priority over the lock traffic messages? Shouldn't there be a way of throttling back the lock traffic and seeing if heartbeat connection can be re-established before starting a membership transition? Daniel From daniel at osdl.org Wed Jan 5 01:13:02 2005 From: daniel at osdl.org (Daniel McNeil) Date: Tue, 04 Jan 2005 17:13:02 -0800 Subject: [Linux-cluster] dlm patch to fix referencing free memory Message-ID: <1104887581.7044.4.camel@ibm-c.pdx.osdl.net> I checked out the latest cvs and noticed my patch to fix the referencing of freed memory is not included. Here is the patch again. Please let me know how to get this patch into the cvs tree. Thanks, Daniel Looking through the code, I found when that a call to queue_ast(lkb, AST_COMP | AST_DEL, 0); will lead to process_asts() which will free the dlm_rsb. So there is a race where the rsb can be freed BEFORE we do the up_write(rsb->res_lock); The fix is simple, do the up_write() before the queue_ast(). --- cluster.orig/dlm-kernel/src/locking.c 2004-12-09 15:23:13.789834384 -0800 +++ cluster/dlm-kernel/src/locking.c 2004-12-09 15:24:51.809742940 -0800 @@ -687,8 +687,13 @@ void dlm_lock_stage3(struct dlm_lkb *lkb lkb->lkb_retstatus = -EAGAIN; if (lkb->lkb_lockqueue_flags & DLM_LKF_NOQUEUEBAST) send_blocking_asts_all(rsb, lkb); + /* + * up the res_lock before queueing ast, since the AST_DEL will + * cause the rsb to be released and that can happen anytime. + */ + up_write(&rsb->res_lock); queue_ast(lkb, AST_COMP | AST_DEL, 0); - goto out; + return; } /* @@ -888,7 +893,13 @@ int dlm_unlock_stage2(struct dlm_lkb *lk lkb->lkb_retstatus = flags & DLM_LKF_CANCEL ? -DLM_ECANCEL:-DLM_EUNLOCK; if (!remote) { + /* + * up the res_lock before queueing ast, since the AST_DEL will + * cause the rsb to be released and that can happen anytime. + */ + up_write(&rsb->res_lock); queue_ast(lkb, AST_COMP | AST_DEL, 0); + goto out2; } else { up_write(&rsb->res_lock); release_lkb(rsb->res_ls, lkb); From teigland at redhat.com Wed Jan 5 03:08:34 2005 From: teigland at redhat.com (David Teigland) Date: Wed, 5 Jan 2005 11:08:34 +0800 Subject: [Linux-cluster] dlm patch to fix referencing free memory In-Reply-To: <1104887581.7044.4.camel@ibm-c.pdx.osdl.net> References: <1104887581.7044.4.camel@ibm-c.pdx.osdl.net> Message-ID: <20050105030834.GB5770@redhat.com> On Tue, Jan 04, 2005 at 05:13:02PM -0800, Daniel McNeil wrote: > I checked out the latest cvs and noticed my patch to fix > the referencing of freed memory is not included. > > Here is the patch again. Please let me know how to get this > patch into the cvs tree. Sorry, got it now. Thanks for the fix. -- Dave Teigland From pcaulfie at redhat.com Wed Jan 5 09:00:44 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Wed, 5 Jan 2005 09:00:44 +0000 Subject: [Linux-cluster] cman bad generation number In-Reply-To: <1104878776.2838.42.camel@ibm-c.pdx.osdl.net> References: <1103654081.29749.17.camel@ibm-c.pdx.osdl.net> <20041222090832.GB1260@tykepenguin.com> <1103736819.30947.20.camel@ibm-c.pdx.osdl.net> <20050104112924.GB23831@tykepenguin.com> <1104878776.2838.42.camel@ibm-c.pdx.osdl.net> Message-ID: <20050105090043.GA3866@tykepenguin.com> On Tue, Jan 04, 2005 at 02:46:17PM -0800, Daniel McNeil wrote: > > One thing I do not understand is that I am leaving the nodes in the > cluster and just doing mounting and umounting, so the generation number > should not be changing. > > I think you are saying the the lock traffic is so high that the heart > are lost so the node being kicked out is seeing the new heart beat > from the other nodes and doesn't know they are not receiving his > heartbeat messages. This node must be seeing the other nodes > heartbeat messages or it would have started a membership transition > without the other nodes. Do I have this right? Yes, I think. It's all a bit vague. If it wasn't I might have an answer by now :-( > Shouldn't the heartbeat messages have higher priority > over the lock traffic messages? They do. That's why I am puzzled. I'm currently investigating if the heartbeat thread is being starved of CPU time by either the DLM or GFS. > Shouldn't there be a way of throttling back the lock traffic and seeing > if heartbeat connection can be re-established before starting a > membership transition? DLM & CMAN are not that tightly coupled. -- patrick From hkubota at gmx.net Wed Jan 5 13:49:38 2005 From: hkubota at gmx.net (Harald Kubota) Date: Wed, 05 Jan 2005 22:49:38 +0900 Subject: [Linux-cluster] ISCSI? In-Reply-To: <20050104224432.GS31004@Favog.ubiqx.mn.org> References: <1103654081.29749.17.camel@ibm-c.pdx.osdl.net> <20041222090832.GB1260@tykepenguin.com> <1103736819.30947.20.camel@ibm-c.pdx.osdl.net> <20050104112924.GB23831@tykepenguin.com> <1104849899.4815.13.camel@merlin.Mines.EDU> <20050104161427.GC7994@tykepenguin.com> <20050104211958.GP31004@Favog.ubiqx.mn.org> <1104877390.4815.35.camel@merlin.Mines.EDU> <20050104224432.GS31004@Favog.ubiqx.mn.org> Message-ID: <41DBF072.5060407@gmx.net> There is one more iscsi target available for Linux: http://sourceforge.net/projects/iscsitarget/ I use it for testing (not clustered yet) and it works surprisingly well (if the network is stable). Harald From crh at ubiqx.mn.org Wed Jan 5 18:57:15 2005 From: crh at ubiqx.mn.org (Christopher R. Hertel) Date: Wed, 5 Jan 2005 12:57:15 -0600 Subject: [Linux-cluster] ISCSI? In-Reply-To: <41DBF072.5060407@gmx.net> References: <1103654081.29749.17.camel@ibm-c.pdx.osdl.net> <20041222090832.GB1260@tykepenguin.com> <1103736819.30947.20.camel@ibm-c.pdx.osdl.net> <20050104112924.GB23831@tykepenguin.com> <1104849899.4815.13.camel@merlin.Mines.EDU> <20050104161427.GC7994@tykepenguin.com> <20050104211958.GP31004@Favog.ubiqx.mn.org> <1104877390.4815.35.camel@merlin.Mines.EDU> <20050104224432.GS31004@Favog.ubiqx.mn.org> <41DBF072.5060407@gmx.net> Message-ID: <20050105185715.GD8351@Favog.ubiqx.mn.org> Looks like this one has had more recent development as well. Thanks! Chris -)----- On Wed, Jan 05, 2005 at 10:49:38PM +0900, Harald Kubota wrote: > There is one more iscsi target available for Linux: > http://sourceforge.net/projects/iscsitarget/ > I use it for testing (not clustered yet) and it works surprisingly well > (if the network is stable). > > Harald > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- "Implementing CIFS - the Common Internet FileSystem" ISBN: 013047116X Samba Team -- http://www.samba.org/ -)----- Christopher R. Hertel jCIFS Team -- http://jcifs.samba.org/ -)----- ubiqx development, uninq. ubiqx Team -- http://www.ubiqx.org/ -)----- crh at ubiqx.mn.org OnLineBook -- http://ubiqx.org/cifs/ -)----- crh at ubiqx.org From bastian at waldi.eu.org Wed Jan 5 22:18:55 2005 From: bastian at waldi.eu.org (Bastian Blank) Date: Wed, 5 Jan 2005 23:18:55 +0100 Subject: [Linux-cluster] ISCSI? In-Reply-To: <1104877390.4815.35.camel@merlin.Mines.EDU> References: <1103654081.29749.17.camel@ibm-c.pdx.osdl.net> <20041222090832.GB1260@tykepenguin.com> <1103736819.30947.20.camel@ibm-c.pdx.osdl.net> <20050104112924.GB23831@tykepenguin.com> <1104849899.4815.13.camel@merlin.Mines.EDU> <20050104161427.GC7994@tykepenguin.com> <20050104211958.GP31004@Favog.ubiqx.mn.org> <1104877390.4815.35.camel@merlin.Mines.EDU> Message-ID: <20050105221855.GA27974@wavehammer.waldi.eu.org> On Tue, Jan 04, 2005 at 03:23:10PM -0700, Matthew B. Brookover wrote: > The unh-iscsi implements both target and initiator, see > http://unh-iscsi.sourceforge.net/ for more information. I have used it > with GFS and Linux 2.6.8.1. I could not get unh-iscsi to compile with > linux 2.6.9 and switched to hardware scsi for more recent testing. I The initiator locks itself to death if used on smp systems. Bastian -- Ahead warp factor one, Mr. Sulu. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 197 bytes Desc: Digital signature URL: From daniel at osdl.org Wed Jan 5 22:19:01 2005 From: daniel at osdl.org (Daniel McNeil) Date: Wed, 05 Jan 2005 14:19:01 -0800 Subject: [Linux-cluster] cman bad generation number In-Reply-To: <20050105090043.GA3866@tykepenguin.com> References: <1103654081.29749.17.camel@ibm-c.pdx.osdl.net> <20041222090832.GB1260@tykepenguin.com> <1103736819.30947.20.camel@ibm-c.pdx.osdl.net> <20050104112924.GB23831@tykepenguin.com> <1104878776.2838.42.camel@ibm-c.pdx.osdl.net> <20050105090043.GA3866@tykepenguin.com> Message-ID: <1104963541.14834.10.camel@ibm-c.pdx.osdl.net> On Wed, 2005-01-05 at 01:00, Patrick Caulfield wrote: > On Tue, Jan 04, 2005 at 02:46:17PM -0800, Daniel McNeil wrote: > > > > One thing I do not understand is that I am leaving the nodes in the > > cluster and just doing mounting and umounting, so the generation number > > should not be changing. > > > > I think you are saying the the lock traffic is so high that the heart > > are lost so the node being kicked out is seeing the new heart beat > > from the other nodes and doesn't know they are not receiving his > > heartbeat messages. This node must be seeing the other nodes > > heartbeat messages or it would have started a membership transition > > without the other nodes. Do I have this right? > > Yes, I think. It's all a bit vague. If it wasn't I might have an answer by now > :-( > > > Shouldn't the heartbeat messages have higher priority > > over the lock traffic messages? > > They do. That's why I am puzzled. I'm currently investigating if the heartbeat > thread is being starved of CPU time by either the DLM or GFS. > > > Shouldn't there be a way of throttling back the lock traffic and seeing > > if heartbeat connection can be re-established before starting a > > membership transition? > > DLM & CMAN are not that tightly coupled. Do DLM and CMAN use a common communication layer? I was expecting that they would since having multiple interfaces for redundancy would be something they would both want. DLM should just want to be able to send messages to other nodes and shouldn't care how it gets there. I was expecting this to be part of CMAN since it should know which interfaces are connected to which nodes and their state. It could also load balance on multiple networks. Is there a description of how multiple interfaces are handle today? Thanks, Daniel From pcaulfie at redhat.com Thu Jan 6 08:47:19 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Thu, 6 Jan 2005 08:47:19 +0000 Subject: [Linux-cluster] cman bad generation number In-Reply-To: <1104963541.14834.10.camel@ibm-c.pdx.osdl.net> References: <1103654081.29749.17.camel@ibm-c.pdx.osdl.net> <20041222090832.GB1260@tykepenguin.com> <1103736819.30947.20.camel@ibm-c.pdx.osdl.net> <20050104112924.GB23831@tykepenguin.com> <1104878776.2838.42.camel@ibm-c.pdx.osdl.net> <20050105090043.GA3866@tykepenguin.com> <1104963541.14834.10.camel@ibm-c.pdx.osdl.net> Message-ID: <20050106084719.GA4923@tykepenguin.com> On Wed, Jan 05, 2005 at 02:19:01PM -0800, Daniel McNeil wrote: > > Do DLM and CMAN use a common communication layer? No. They should, but the communications in CMAN is primitive and not up to supporting the high levels of traffic that the DLM can generate. CMAN uses its own "reliable multicast" system whereas the DLM uses TCP. The disparity is really only because CMAN needs to do cluster-wide broadcasts but the DLM only ever needs to talk to single nodes at a time (per message). > I was expecting that they would since having multiple > interfaces for redundancy would be something they > would both want. DLM should just want to be able > to send messages to other nodes and shouldn't care > how it gets there. I was expecting this to be > part of CMAN since it should know which interfaces are > connected to which nodes and their state. It could > also load balance on multiple networks. Is there a > description of how multiple interfaces are handle today? The short answer is "rather badly". CMAN handles dual interfaces by a simple failover if messages go missing. DLM gets the interface information from CMAN but because of the nature of TCP the failover isn't nearly as clean -- patrick From erwan at seanodes.com Thu Jan 6 10:03:30 2005 From: erwan at seanodes.com (Velu Erwan) Date: Thu, 06 Jan 2005 11:03:30 +0100 Subject: [Linux-cluster] Test suite & benchmark Message-ID: <1105005810.5455.71.camel@R1.seanodes.com> Hi all, I must assume this is a common request but I didn't find any clue about it. I'd like to know which test suite and/or benchmark tools you are using to test and validate your gfs installation. I mean validating the installation using a tool for stressing the lock manager, testing concurrency open and/or write etc... I saw some of you using bonnie which doesn't seems to do that, do you know any other tool I could use ? Thanks, -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: Ceci est une partie de message num?riquement sign?e URL: From Hansjoerg.Maurer at dlr.de Thu Jan 6 12:16:19 2005 From: Hansjoerg.Maurer at dlr.de (Hansjoerg.Maurer at dlr.de) Date: Thu, 6 Jan 2005 13:16:19 +0100 Subject: [Linux-cluster] Some experiences and questions concerning GFS vs. GPFS Message-ID: <4CE5177FBED2784FAC715DB5553BD8970A3F28@exbe04.intra.dlr.de> Hi we are planing to implement a Linux Cluster Solution with shared SAN storage in Q2/2005. We already tried RedHat GFS in a old test SAN environment, and it works great. As an alternative solution we found a product called GPFS from IBM (gerneral parallel file system) It seems to have some features, GFS does not have now, but according to the documentation, it seems to bee very complex and it seems to support only IBM Storage devices (FastT....). The advantages seem to be - filesystems up to 100 TB on IA32 (Blocksize up to 1MByte) - syncronous replication of pools - better scaling If you have one RAID5 array in the SAN (lets call it RAID-A) and you add another RAID5 array (RAID-B) you can but them together, exceed the filesystem and reallocate the filesystem while it is online to a stripeset over RAID-A AND RAID-B in order to get optimal performance. The disadvantages seem to bee - the dependency on IBM Storage devices (especially for fencing) - the complexity (fileaccess takes place over a userspace daemon, which caches data and stat information) This seems to be the reason they can achieve the file system size - the integration of GFS into RHEL seems to be better of course... :-) ok, now my questions What is the status of GFS for RHEL4 concerning the advantages of GPFS from above - is it correct, that GFS filesystems in RHEL4 even on x86_64 can be very big (PByte) to? - will there be something similar like the reallocation of a stripe set over a newly created array in RHEL4 GFS? - the possibility of syncronous mirroring is not so important in our special case... - will the next stable version of GFS with the above features be available with initial Release of RHEL4 or is there an other planed release date. We want to implement the SAN in Q2/2005, so that we can wait if some of the limitations of GFS will be negotiated until than. And a final questions: - has anybody experience with both products, so that he can tell me about advantages and disadvantages (especially concerning performance). We will recieve a GPFS evaluation licence next week, but our old SAN storage hardware is not apropriate for performance tests, because it will be the bottleneck :-) Sorry if this E-Mail is a bit off topic. Sales persons are often showing you only the advantages of their product, and I hope that someone can help me with practical experiences. If you think, that this is off topic, please answer directly. Thank you very much Greetings Hansj?rg We will _________________________________________________________________ Dr. Hansjoerg Maurer | LAN- & System-Manager | Deutsches Zentrum | DLR Oberpfaffenhofen f. Luft- und Raumfahrt e.V. | Institut f. Robotik | Postfach 1116 | Muenchner Strasse 20 82230 Wessling | 82234 Wessling Germany | | Tel: 08153/28-2431 | E-mail: Hansjoerg.Maurer at dlr.de Fax: 08153/28-1134 | WWW: http://www.robotic.dlr.de/ __________________________________________________________________ There are 10 types of people in this world, those who understand binary and those who don't. From tom at nethinks.com Thu Jan 6 11:33:56 2005 From: tom at nethinks.com (tom at nethinks.com) Date: Thu, 6 Jan 2005 12:33:56 +0100 Subject: [Linux-cluster] Again Problems with newest CVS Message-ID: Hi all, getting the following errors any ideas? make[3]: Entering directory `/usr/src/linux-2.6.9' CC [M] /tmp/gfs/cluster/gfs-kernel/src/gfs/acl.o CC [M] /tmp/gfs/cluster/gfs-kernel/src/gfs/bits.o CC [M] /tmp/gfs/cluster/gfs-kernel/src/gfs/bmap.o CC [M] /tmp/gfs/cluster/gfs-kernel/src/gfs/daemon.o CC [M] /tmp/gfs/cluster/gfs-kernel/src/gfs/diaper.o CC [M] /tmp/gfs/cluster/gfs-kernel/src/gfs/dio.o CC [M] /tmp/gfs/cluster/gfs-kernel/src/gfs/dir.o CC [M] /tmp/gfs/cluster/gfs-kernel/src/gfs/eaops.o CC [M] /tmp/gfs/cluster/gfs-kernel/src/gfs/eattr.o CC [M] /tmp/gfs/cluster/gfs-kernel/src/gfs/file.o CC [M] /tmp/gfs/cluster/gfs-kernel/src/gfs/glock.o CC [M] /tmp/gfs/cluster/gfs-kernel/src/gfs/glops.o CC [M] /tmp/gfs/cluster/gfs-kernel/src/gfs/inode.o CC [M] /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.o In file included from /tmp/gfs/cluster/gfs-kernel/src/gfs/gfs.h:24, from /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:24: /tmp/gfs/cluster/gfs-kernel/src/gfs/incore.h:817: error: redefinition of `struct gfs_args' /tmp/gfs/cluster/gfs-kernel/src/gfs/incore.h:844: error: redefinition of `struct gfs_tune' In file included from /tmp/gfs/cluster/gfs-kernel/src/gfs/gfs.h:25, from /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:24: /tmp/gfs/cluster/gfs-kernel/src/gfs/util.h:321: error: redefinition of `struct gfs_user_buffer' /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:44: warning: `struct gfs_ioctl' declared inside parameter list /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:44: warning: its scope is only this definition or declaration, which is probably not what you want /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:59: warning: `struct gfs_ioctl' declared inside parameter list /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_skeleton': /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:66: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:67: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:73: warning: passing arg 2 of pointer to function from incompatible pointer type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:77: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:105: warning: `struct gfs_ioctl' declared inside parameter list /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_get_cookie': /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:109: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:130: warning: `struct gfs_ioctl' declared inside parameter list /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_get_super': /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:137: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:139: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:163: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:191: warning: `struct gfs_ioctl' declared inside parameter list /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_get_args': /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:196: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:234: warning: `struct gfs_ioctl' declared inside parameter list /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_get_lockstruct': /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:239: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:270: warning: `struct gfs_ioctl' declared inside parameter list /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_get_stat_gfs': /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:275: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:335: warning: `struct gfs_ioctl' declared inside parameter list /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_get_counters': /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:340: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:431: warning: `struct gfs_ioctl' declared inside parameter list /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_get_tune': /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:436: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:505: warning: `struct gfs_ioctl' declared inside parameter list /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_set_tune': /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:513: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:516: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:520: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:761: warning: `struct gfs_ioctl' declared inside parameter list /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_do_reclaim': /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:768: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:798: warning: `struct gfs_ioctl' declared inside parameter list /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_do_shrink': /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:802: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:817: warning: `struct gfs_ioctl' declared inside parameter list /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_get_file_stat': /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:823: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:825: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:838: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:859: warning: `struct gfs_ioctl' declared inside parameter list /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_set_file_flag': /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:868: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:871: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:882: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:967: warning: `struct gfs_ioctl' declared inside parameter list /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_get_file_meta': /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:973: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:976: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:977: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1023: warning: `struct gfs_ioctl' declared inside parameter list /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_do_file_flush': /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1025: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1040: warning: `struct gfs_ioctl' declared inside parameter list /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi2hip': /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1044: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1047: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1072: warning: `struct gfs_ioctl' declared inside parameter list /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_get_hfile_stat': /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1079: warning: passing arg 2 of `gi2hip' from incompatible pointer type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1083: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1096: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1117: warning: `struct gfs_ioctl' declared inside parameter list /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_do_hfile_read': /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1126: warning: passing arg 2 of `gi2hip' from incompatible pointer type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1130: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1130: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1137: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1137: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1137: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1154: warning: `struct gfs_ioctl' declared inside parameter list /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_do_hfile_write': /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1166: warning: passing arg 2 of `gi2hip' from incompatible pointer type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1170: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1170: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1173: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1186: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1186: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1223: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1223: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1223: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1258: warning: `struct gfs_ioctl' declared inside parameter list /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_do_hfile_trunc': /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1267: warning: passing arg 2 of `gi2hip' from incompatible pointer type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1275: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1292: warning: `struct gfs_ioctl' declared inside parameter list /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_do_quota_sync': /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1296: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1310: warning: `struct gfs_ioctl' declared inside parameter list /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_do_quota_refresh': /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1318: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1321: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1354: warning: `struct gfs_ioctl' declared inside parameter list /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_do_quota_read': /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1362: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1364: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1367: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1392: error: dereferencing pointer to incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gfs_ioctl_i': /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1410: error: storage size of `gi' isn't known /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1415: error: invalid application of `sizeof' to an incomplete type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1432: warning: passing arg 3 of `gi_skeleton' from incompatible pointer type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1436: warning: passing arg 3 of `gi_skeleton' from incompatible pointer type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1438: warning: passing arg 3 of `gi_skeleton' from incompatible pointer type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1440: warning: passing arg 3 of `gi_skeleton' from incompatible pointer type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1442: warning: passing arg 3 of `gi_skeleton' from incompatible pointer type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1444: warning: passing arg 3 of `gi_skeleton' from incompatible pointer type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1448: warning: passing arg 3 of `gi_skeleton' from incompatible pointer type /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1410: warning: unused variable `gi' make[4]: *** [/tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.o] Error 1 make[3]: *** [_module_/tmp/gfs/cluster/gfs-kernel/src/gfs] Error 2 -tom From lhh at redhat.com Thu Jan 6 21:39:59 2005 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 06 Jan 2005 16:39:59 -0500 Subject: [Linux-cluster] quorum problem In-Reply-To: References: Message-ID: <1105047599.12800.1.camel@ayanami.boston.redhat.com> On Tue, 2005-01-04 at 09:54 -0800, Yazan Bakheit wrote: > i mean that when i request the gui for the cluster suit there > is a check box called > (Has Quorum) and it is look to be hidden and i can't checked > it even i have made It's not an option; it's an indicator. You can't change it. The cluster will change it when it has a majority of members online. Try starting up both nodes. -- Lon From kpreslan at redhat.com Fri Jan 7 06:44:39 2005 From: kpreslan at redhat.com (Ken Preslan) Date: Fri, 7 Jan 2005 00:44:39 -0600 Subject: [Linux-cluster] Again Problems with newest CVS In-Reply-To: References: Message-ID: <20050107064439.GA21295@potassium.msp.redhat.com> You have an old version of gfs_ioctl.h somewhere. Find it, replace it with the new one, and try again. On Thu, Jan 06, 2005 at 12:33:56PM +0100, tom at nethinks.com wrote: > > > > > Hi all, > > getting the following errors any ideas? > > > make[3]: Entering directory `/usr/src/linux-2.6.9' > CC [M] /tmp/gfs/cluster/gfs-kernel/src/gfs/acl.o > CC [M] /tmp/gfs/cluster/gfs-kernel/src/gfs/bits.o > CC [M] /tmp/gfs/cluster/gfs-kernel/src/gfs/bmap.o > CC [M] /tmp/gfs/cluster/gfs-kernel/src/gfs/daemon.o > CC [M] /tmp/gfs/cluster/gfs-kernel/src/gfs/diaper.o > CC [M] /tmp/gfs/cluster/gfs-kernel/src/gfs/dio.o > CC [M] /tmp/gfs/cluster/gfs-kernel/src/gfs/dir.o > CC [M] /tmp/gfs/cluster/gfs-kernel/src/gfs/eaops.o > CC [M] /tmp/gfs/cluster/gfs-kernel/src/gfs/eattr.o > CC [M] /tmp/gfs/cluster/gfs-kernel/src/gfs/file.o > CC [M] /tmp/gfs/cluster/gfs-kernel/src/gfs/glock.o > CC [M] /tmp/gfs/cluster/gfs-kernel/src/gfs/glops.o > CC [M] /tmp/gfs/cluster/gfs-kernel/src/gfs/inode.o > CC [M] /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.o > In file included from /tmp/gfs/cluster/gfs-kernel/src/gfs/gfs.h:24, > from /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:24: > /tmp/gfs/cluster/gfs-kernel/src/gfs/incore.h:817: error: redefinition of > `struct gfs_args' > /tmp/gfs/cluster/gfs-kernel/src/gfs/incore.h:844: error: redefinition of > `struct gfs_tune' > In file included from /tmp/gfs/cluster/gfs-kernel/src/gfs/gfs.h:25, > from /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:24: > /tmp/gfs/cluster/gfs-kernel/src/gfs/util.h:321: error: redefinition of > `struct gfs_user_buffer' > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:44: warning: `struct gfs_ioctl' > declared inside parameter list > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:44: warning: its scope is only > this definition or declaration, which is probably not what you want > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:59: warning: `struct gfs_ioctl' > declared inside parameter list > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_skeleton': > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:66: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:67: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:73: warning: passing arg 2 of > pointer to function from incompatible pointer type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:77: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:105: warning: `struct > gfs_ioctl' declared inside parameter list > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_get_cookie': > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:109: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:130: warning: `struct > gfs_ioctl' declared inside parameter list > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_get_super': > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:137: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:139: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:163: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:191: warning: `struct > gfs_ioctl' declared inside parameter list > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_get_args': > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:196: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:234: warning: `struct > gfs_ioctl' declared inside parameter list > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function > `gi_get_lockstruct': > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:239: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:270: warning: `struct > gfs_ioctl' declared inside parameter list > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_get_stat_gfs': > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:275: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:335: warning: `struct > gfs_ioctl' declared inside parameter list > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_get_counters': > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:340: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:431: warning: `struct > gfs_ioctl' declared inside parameter list > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_get_tune': > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:436: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:505: warning: `struct > gfs_ioctl' declared inside parameter list > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_set_tune': > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:513: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:516: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:520: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:761: warning: `struct > gfs_ioctl' declared inside parameter list > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_do_reclaim': > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:768: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:798: warning: `struct > gfs_ioctl' declared inside parameter list > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi_do_shrink': > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:802: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:817: warning: `struct > gfs_ioctl' declared inside parameter list > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function > `gi_get_file_stat': > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:823: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:825: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:838: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:859: warning: `struct > gfs_ioctl' declared inside parameter list > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function > `gi_set_file_flag': > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:868: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:871: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:882: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:967: warning: `struct > gfs_ioctl' declared inside parameter list > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function > `gi_get_file_meta': > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:973: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:976: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:977: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1023: warning: `struct > gfs_ioctl' declared inside parameter list > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function > `gi_do_file_flush': > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1025: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1040: warning: `struct > gfs_ioctl' declared inside parameter list > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gi2hip': > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1044: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1047: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1072: warning: `struct > gfs_ioctl' declared inside parameter list > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function > `gi_get_hfile_stat': > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1079: warning: passing arg 2 of > `gi2hip' from incompatible pointer type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1083: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1096: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1117: warning: `struct > gfs_ioctl' declared inside parameter list > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function > `gi_do_hfile_read': > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1126: warning: passing arg 2 of > `gi2hip' from incompatible pointer type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1130: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1130: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1137: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1137: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1137: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1154: warning: `struct > gfs_ioctl' declared inside parameter list > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function > `gi_do_hfile_write': > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1166: warning: passing arg 2 of > `gi2hip' from incompatible pointer type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1170: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1170: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1173: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1186: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1186: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1223: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1223: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1223: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1258: warning: `struct > gfs_ioctl' declared inside parameter list > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function > `gi_do_hfile_trunc': > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1267: warning: passing arg 2 of > `gi2hip' from incompatible pointer type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1275: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1292: warning: `struct > gfs_ioctl' declared inside parameter list > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function > `gi_do_quota_sync': > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1296: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1310: warning: `struct > gfs_ioctl' declared inside parameter list > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function > `gi_do_quota_refresh': > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1318: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1321: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: At top level: > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1354: warning: `struct > gfs_ioctl' declared inside parameter list > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function > `gi_do_quota_read': > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1362: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1364: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1367: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1392: error: dereferencing > pointer to incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c: In function `gfs_ioctl_i': > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1410: error: storage size of > `gi' isn't known > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1415: error: invalid > application of `sizeof' to an incomplete type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1432: warning: passing arg 3 of > `gi_skeleton' from incompatible pointer type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1436: warning: passing arg 3 of > `gi_skeleton' from incompatible pointer type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1438: warning: passing arg 3 of > `gi_skeleton' from incompatible pointer type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1440: warning: passing arg 3 of > `gi_skeleton' from incompatible pointer type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1442: warning: passing arg 3 of > `gi_skeleton' from incompatible pointer type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1444: warning: passing arg 3 of > `gi_skeleton' from incompatible pointer type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1448: warning: passing arg 3 of > `gi_skeleton' from incompatible pointer type > /tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.c:1410: warning: unused variable > `gi' > make[4]: *** [/tmp/gfs/cluster/gfs-kernel/src/gfs/ioctl.o] Error 1 > make[3]: *** [_module_/tmp/gfs/cluster/gfs-kernel/src/gfs] Error 2 > > -tom > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- Ken Preslan From pcaulfie at redhat.com Fri Jan 7 10:46:31 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Fri, 7 Jan 2005 10:46:31 +0000 Subject: [Linux-cluster] cman bad generation number In-Reply-To: <20050105090043.GA3866@tykepenguin.com> References: <1103654081.29749.17.camel@ibm-c.pdx.osdl.net> <20041222090832.GB1260@tykepenguin.com> <1103736819.30947.20.camel@ibm-c.pdx.osdl.net> <20050104112924.GB23831@tykepenguin.com> <1104878776.2838.42.camel@ibm-c.pdx.osdl.net> <20050105090043.GA3866@tykepenguin.com> Message-ID: <20050107104630.GB3614@tykepenguin.com> OK, some more investigation seems to be pointing to the heartbeat thread being not woken up when it's timer tells it to. This might be simply that there are other higher-priority tasks happening on the system because of the IO load. Now I've upgraded my cluster to 2.6.10, iSCSI seems to be more stable (same iSCSI software interestingly) so I've set the heartbeat nice level to -20 (same as the iSCSI process) and I'll see if it survives the weekend. It's done overnight so far which is better than I've had yet 8) -- patrick From ptr at poczta.fm Fri Jan 7 13:45:10 2005 From: ptr at poczta.fm (ptr at poczta.fm) Date: 07 Jan 2005 14:45:10 +0100 Subject: [Linux-cluster] Current CVS and 2.6.9 Message-ID: <20050107134510.D49002599CB@poczta.interia.pl> Hello. I decided recently to upgrade GFS version due to mysterious hard nodes lockups. Unfortunatelly attempts to build GFS userland and kernel modules failed, with following errors: /install/GFS/cluster/gfs-kernel/src/gfs/ops_file.c: In function `gfs_lock': /install/GFS/cluster/gfs-kernel/src/gfs/ops_file.c:1448: warning: implicit declaration of function `posix_lock_file_wait' /install/GFS/cluster/gfs-kernel/src/gfs/ops_file.c: In function `do_flock': /install/GFS/cluster/gfs-kernel/src/gfs/ops_file.c:1529: warning: implicit declaration of function `flock_lock_file_wait' /install/GFS/cluster/gfs-kernel/src/gfs/ops_file.c: At top level: /install/GFS/cluster/gfs-kernel/src/gfs/ops_file.c:1622: error: unknown field `flock' specified in initializer /install/GFS/cluster/gfs-kernel/src/gfs/ops_file.c:1622: warning: initialization from incompatible pointer type /install/GFS/cluster/gfs-kernel/src/gfs/ops_file.c:1632: error: unknown field `flock' specified in initializer /install/GFS/cluster/gfs-kernel/src/gfs/ops_file.c:1632: warning: initialization from incompatible pointer type make[5]: *** [/install/GFS/cluster/gfs-kernel/src/gfs/ops_file.o] Error 1 make[4]: *** [_module_/install/GFS/cluster/gfs-kernel/src/gfs] Error 2 make[4]: Leaving directory `/usr/src/linux-2.6.8.1' make[3]: *** [all] Error 2 make[3]: Leaving directory `/install/GFS/cluster/gfs-kernel/src/gfs' make[2]: *** [install] Error 2 make[2]: Leaving directory `/install/GFS/cluster/gfs-kernel/src' make[1]: *** [install] Error 2 make[1]: Leaving directory `/install/GFS/cluster/gfs-kernel' make: *** [all] Error 2 I also received a bunch of warnings, like: *** Warning: "kcl_addref_cluster" [/install/GFS/cluster/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_get_node_by_addr" [/install/GFS/cluster/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_get_node_addresses" [/install/GFS/cluster/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_releaseref_cluster" [/install/GFS/cluster/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_get_current_interface" [/install/GFS/cluster/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_get_node_by_nodeid" [/install/GFS/cluster/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_leave_service" [/install/GFS/cluster/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_remove_callback" [/install/GFS/cluster/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_global_service_id" [/install/GFS/cluster/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_unregister_service" [/install/GFS/cluster/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_join_service" [/install/GFS/cluster/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_start_done" [/install/GFS/cluster/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_add_callback" [/install/GFS/cluster/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_register_service" [/install/GFS/cluster/dlm-kernel/src/dlm.ko] undefined! Both attempts failed - I tried to build new GFS from sources using freshly build (but not installed!) kerlne 2.6.8.9, the same concerns 2.6.8.1 unpacked form vanilla sources. No such think occured when I was building GFS the "old way" (patching kernel sources separately). I'm running currently CVS version 2.0.1 and sometimes without _any_ suspisious reason both nodes in cluster freeze - no chance to reboot them other way than by hard reset. No error logs to debug. TIA for your help, regards. Piotr ---------------------------------------------------------------------- Startuj z INTERIA.PL!!! >>> http://link.interia.pl/f1837 From daniel at osdl.org Tue Jan 11 00:50:20 2005 From: daniel at osdl.org (Daniel McNeil) Date: Mon, 10 Jan 2005 16:50:20 -0800 Subject: [Linux-cluster] mount hang during test runs Message-ID: <1105404619.30484.7.camel@ibm-c.pdx.osdl.net> I started another test run on last week and let it run over the week end. a 3 node test was running when it hung. I set /proc/cluster/config/cman/max_retries to 9 and /proc/cluster/config/cman/hello_timer to 1 This time I hit a mount hang. The mount is hung on cl032: mount D C170F414 0 18375 18369 (NOTLB) e2dbbc20 00000082 e1dbda10 c170f414 0003e36e 00000000 00000008 c011bb10 d5ea8d58 57435700 0003e36e c18880ac e2dbbc00 e1dbda10 00000000 c170f8c0 c170ef60 00000000 000038d3 57435987 0003e36e e1dbcf50 e1dbd0b8 00000000 Call Trace: [] wait_for_completion+0xa4/0xe0 [] kcl_join_service+0x162/0x1a0 [cman] [] init_mountgroup+0x6f/0xc0 [lock_dlm] [] lm_dlm_mount+0xa1/0xf0 [lock_dlm] [] lm_mount+0x155/0x250 [lock_harness] [] gfs_lm_mount+0x1fd/0x390 [gfs] [] fill_super+0x513/0x1330 [gfs] [] gfs_get_sb+0x199/0x210 [gfs] [] do_kern_mount+0x5c/0x110 [] do_new_mount+0x98/0xe0 [] do_mount+0x165/0x1b0 [] sys_mount+0xb5/0x140 [] sysenter_past_esp+0x52/0x71 Looks like a problem join the mount group. /proc/cluster/services shows: [root at cl030 cman]# cat /proc/cluster/services Service Name GID LID State Code Fence Domain: "default" 1 2 run - [1 2 3] DLM Lock Space: "stripefs" 324 693 run - [1 2 3] GFS Mount Group: "stripefs" 325 694 update U-4,1,3 [1 2 3] [root at cl031 cluster]# cat /proc/cluster/services Service Name GID LID State Code Fence Domain: "default" 1 2 run - [1 2 3] DLM Lock Space: "stripefs" 324 457 run - [1 2 3] GFS Mount Group: "stripefs" 325 458 update U-4,1,3 [1 2 3] [root at cl032 cluster]# cat /proc/cluster/services Service Name GID LID State Code Fence Domain: "default" 1 2 run - [1 2 3] DLM Lock Space: "stripefs" 324 225 run - [1 2 3] GFS Mount Group: "stripefs" 325 226 join S-6,20,3 [1 2 3] I collected stack traces and a bunch of other info. It is available here: http://developer.osdl.org/daniel/GFS/mount.hang.05jan2005/ Any ideas on debugging this one? Daniel From pcaulfie at redhat.com Tue Jan 11 08:56:08 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 11 Jan 2005 08:56:08 +0000 Subject: [Linux-cluster] cman bad generation number In-Reply-To: <1103736819.30947.20.camel@ibm-c.pdx.osdl.net> References: <1103654081.29749.17.camel@ibm-c.pdx.osdl.net> <20041222090832.GB1260@tykepenguin.com> <1103736819.30947.20.camel@ibm-c.pdx.osdl.net> Message-ID: <20050111085608.GA6645@tykepenguin.com> On Wed, Dec 22, 2004 at 09:33:39AM -0800, Daniel McNeil wrote: > How long does cman stay up in your testing? With the higher pririty on the heartbeat thread I got 5 days before iSCSI died on me again... This isn't quite the same load as yours but it is on 8 busy nodes. -- patrick From serge at triumvirat.ru Tue Jan 11 08:34:24 2005 From: serge at triumvirat.ru (Sergey) Date: Tue, 11 Jan 2005 11:34:24 +0300 Subject: [Linux-cluster] some questions about setting up GFS Message-ID: <1125914338.20050111113424@triumvirat.ru> Hello! We bought HP ProLiant DL380 G4 Packaged Cluster-MSA500 G2 server and after installation of RHEL3 and GFS-6.0.0-15 I have some questions. Because I have no expirience in setting up such systems, please, tell me, which mistakes in configuration I made. Now system is configured this way: /dev/cciss/c0d1 - External Logical Volume, 293.6 Gbytes (RAID 5) =================== [root at hp1 root]# fdisk /dev/cciss/c0d1 Command (m for help): p Disk /dev/cciss/c0d1: 293.6 GB, 293626045440 bytes 255 heads, 63 sectors/track, 35698 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes Device Boot Start End Blocks Id System /dev/cciss/c0d1p1 1 9 72261 fd Linux raid autodetect /dev/cciss/c0d1p2 10 35698 286671892+ fd Linux raid autodetect I've found nothing in GFS documentation about partitioning hard drive during setting up GFS. So, I used to experiment. First partition is allocated for <- CCA device ->. I'd like to know if there is enough space and right type of partition, and, at all, is it right way to allocate <- CCA device ->. Second partition was formatted as GFS, so the question is: is selected type of partition right or not? =================== [root at hp1 root]# cat pool0.cfg poolname pool0 minor 0 subpools 2 subpool 0 0 1 gfs_journal pooldevice 0 0 /dev/cciss/c0d1p1 subpool 1 0 1 gfs_data pooldevice 1 0 /dev/cciss/c0d1p2 Actually, I don't know why, but first subpool I've made as gfs_journal :-) Basically, the system works, but something may be wrong. =================== During setting up I've made this command: [root at hp1 root]# ccs_tool create /root/cluster/ /dev/cciss/c0d1p1 [root at hp1 root]# pool_tool -s Device Pool Label ====== ========== /dev/cciss/c0d0 <- partition information -> /dev/cciss/c0d0p1 <- EXT2/3 filesystem -> /dev/cciss/c0d0p2 <- swap device -> /dev/cciss/c0d0p3 <- EXT2/3 filesystem -> /dev/cciss/c0d1 <- partition information -> /dev/cciss/c0d1p1 <- CCA device -> /dev/cciss/c0d1p2 <- GFS filesystem -> I'd like to hear some comments on it. =================== Thanks. -- Sergey Mikhnevich From Vincent.Aniello at PipelineTrading.com Tue Jan 11 13:45:52 2005 From: Vincent.Aniello at PipelineTrading.com (Vincent Aniello) Date: Tue, 11 Jan 2005 08:45:52 -0500 Subject: [Linux-cluster] Multipath I/O Message-ID: <834F55E6F1BE3B488AD3AFC927A0970018B873@EMAILSRV1.exad.net> Do I need to use the QLogic failover driver with GFS for multipath I/O or does GFS handle multipath I/O on its own? Thanks for your input. --Vincent -------------- next part -------------- An HTML attachment was scrubbed... URL: From serge at triumvirat.ru Tue Jan 11 08:34:24 2005 From: serge at triumvirat.ru (Sergey) Date: Tue, 11 Jan 2005 11:34:24 +0300 Subject: [Linux-cluster] some questions about setting up GFS Message-ID: <1125914338.20050111113424@triumvirat.ru> Hello! We bought HP ProLiant DL380 G4 Packaged Cluster-MSA500 G2 server and after installation of RHEL3 and GFS-6.0.0-15 I have some questions. Because I have no expirience in setting up such systems, please, tell me, which mistakes in configuration I made. Now system is configured this way: /dev/cciss/c0d1 - External Logical Volume, 293.6 Gbytes (RAID 5) =================== [root at hp1 root]# fdisk /dev/cciss/c0d1 Command (m for help): p Disk /dev/cciss/c0d1: 293.6 GB, 293626045440 bytes 255 heads, 63 sectors/track, 35698 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes Device Boot Start End Blocks Id System /dev/cciss/c0d1p1 1 9 72261 fd Linux raid autodetect /dev/cciss/c0d1p2 10 35698 286671892+ fd Linux raid autodetect I've found nothing in GFS documentation about partitioning hard drive during setting up GFS. So, I used to experiment. First partition is allocated for <- CCA device ->. I'd like to know if there is enough space and right type of partition, and, at all, is it right way to allocate <- CCA device ->. Second partition was formatted as GFS, so the question is: is selected type of partition right or not? =================== [root at hp1 root]# cat pool0.cfg poolname pool0 minor 0 subpools 2 subpool 0 0 1 gfs_journal pooldevice 0 0 /dev/cciss/c0d1p1 subpool 1 0 1 gfs_data pooldevice 1 0 /dev/cciss/c0d1p2 Actually, I don't know why, but first subpool I've made as gfs_journal :-) Basically, the system works, but something may be wrong. =================== During setting up I've made this command: [root at hp1 root]# ccs_tool create /root/cluster/ /dev/cciss/c0d1p1 [root at hp1 root]# pool_tool -s Device Pool Label ====== ========== /dev/cciss/c0d0 <- partition information -> /dev/cciss/c0d0p1 <- EXT2/3 filesystem -> /dev/cciss/c0d0p2 <- swap device -> /dev/cciss/c0d0p3 <- EXT2/3 filesystem -> /dev/cciss/c0d1 <- partition information -> /dev/cciss/c0d1p1 <- CCA device -> /dev/cciss/c0d1p2 <- GFS filesystem -> I'd like to hear some comments on it. =================== Thanks. -- Sergey Mikhnevich From jbrassow at redhat.com Tue Jan 11 17:30:26 2005 From: jbrassow at redhat.com (Jonathan E Brassow) Date: Tue, 11 Jan 2005 11:30:26 -0600 Subject: [Linux-cluster] some questions about setting up GFS In-Reply-To: <1125914338.20050111113424@triumvirat.ru> References: <1125914338.20050111113424@triumvirat.ru> Message-ID: <7B452B98-63F6-11D9-85A8-000A957BB1F6@redhat.com> It looks like you are not using pool. You seem to have divided up the storage sanely. However, rather than forming a pool logical volume on the partitions you've created and then putting ccs and gfs on the pool volumes, you have simply put ccs and gfs directly on the underlying device. This is not terrible if you can ensure that your devices will _always_ have the same name, regardless of the machine you are viewing them from. (Pool's main function is to write labels to the underlying devices so that they can be uniquely identified on every machine in the cluster.) So, at this point, you can choose to forget about using pool and proceed as you have started; or you can set up your pools first and put ccs and gfs on them (this is the method normally used). If you choose to set up pools, you would do something like: # create config files for two different pools (one for ccs and one for gfs) prompt> cat > cca_pool.cfg poolname cca subpools 1 subpool 0 0 1 gfs_data pooldevice 0 0 /dev/cciss/c0d1p1 prompt> cat > gfs1_pool.cfg poolname gfs1 subpools 1 subpool 0 0 1 gfs_data pooldevice 0 0 /dev/cciss/c0d1p2 #Write the labels to disk - remember this only needs to be done once prompt> pool_tool cca_pool.cfg prompt> pool_tool gfs1_pool.cfg #Instantiate the pool logical volumes prompt> pool_assemble #Now you have block devices called /dev/pool/cca and /dev/pool/gfs1 # create your CCS archive and gfs file system on these devices prompt> ccs_tool create /root/cluster /dev/pool/cca prompt> mkfs.gfs ... /dev/pool/gfs1 brassow On Jan 11, 2005, at 2:34 AM, Sergey wrote: > Hello! > > We bought HP ProLiant DL380 G4 Packaged Cluster-MSA500 G2 server and > after installation of RHEL3 and GFS-6.0.0-15 I have some questions. > > Because I have no expirience in setting up such systems, please, tell > me, which mistakes in configuration I made. > > Now system is configured this way: > > /dev/cciss/c0d1 - External Logical Volume, 293.6 Gbytes (RAID 5) > > =================== > [root at hp1 root]# fdisk /dev/cciss/c0d1 > Command (m for help): p > > Disk /dev/cciss/c0d1: 293.6 GB, 293626045440 bytes > 255 heads, 63 sectors/track, 35698 cylinders > Units = cylinders of 16065 * 512 = 8225280 bytes > > Device Boot Start End Blocks Id System > /dev/cciss/c0d1p1 1 9 72261 fd Linux raid > autodetect > /dev/cciss/c0d1p2 10 35698 286671892+ fd Linux raid > autodetect > > I've found nothing in GFS documentation about partitioning hard drive > during setting up GFS. So, I used to experiment. > > First partition is allocated for <- CCA device ->. I'd like to know if > there is enough space and right type of partition, and, at all, is it > right way to allocate <- CCA device ->. > > Second partition was formatted as GFS, so the question is: is selected > type of partition right or not? > > =================== > > [root at hp1 root]# cat pool0.cfg > poolname pool0 > minor 0 subpools 2 > subpool 0 0 1 gfs_journal > pooldevice 0 0 /dev/cciss/c0d1p1 > subpool 1 0 1 gfs_data > pooldevice 1 0 /dev/cciss/c0d1p2 > > > Actually, I don't know why, but first subpool I've made as gfs_journal > :-) > > Basically, the system works, but something may be wrong. > > =================== > During setting up I've made this command: > > [root at hp1 root]# ccs_tool create /root/cluster/ /dev/cciss/c0d1p1 > > [root at hp1 root]# pool_tool -s > Device Pool Label > ====== ========== > /dev/cciss/c0d0 <- partition information -> > /dev/cciss/c0d0p1 <- EXT2/3 filesystem -> > /dev/cciss/c0d0p2 <- swap device -> > /dev/cciss/c0d0p3 <- EXT2/3 filesystem -> > /dev/cciss/c0d1 <- partition information -> > /dev/cciss/c0d1p1 <- CCA device -> > /dev/cciss/c0d1p2 <- GFS filesystem -> > > I'd like to hear some comments on it. > > =================== > > Thanks. > > -- > Sergey Mikhnevich > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > From yazan at ccs.com.jo Tue Jan 11 15:10:57 2005 From: yazan at ccs.com.jo (Yazan Al-Sheyyab) Date: Tue, 11 Jan 2005 17:10:57 +0200 Subject: [Linux-cluster] GFS probelm Message-ID: <001901c4f7ef$c0a304d0$69050364@yazanz> hi, i have RedHat enterprise linux ES v 3.0 Update 4 (the latest) and i have GFS version as : GFS-modules-smp-6.0.0-1.2 GFS-6.0.0-0.6.i686 when i installed the first rpm the system says that i must have the ( kernel-smp-2.4.21-15.EL.rpm )to run this. but i have a newer kernel by Update4. so ,can i found a solution for that problem ? and can i download a newer version of GFS to run with my updated kernel. cause i now have a newer updates that cannot work with the old gfs i am using . ???????????. Thank you yazan. From danderso at redhat.com Tue Jan 11 19:05:17 2005 From: danderso at redhat.com (Derek Anderson) Date: Tue, 11 Jan 2005 13:05:17 -0600 Subject: [Linux-cluster] GFS probelm In-Reply-To: <001901c4f7ef$c0a304d0$69050364@yazanz> References: <001901c4f7ef$c0a304d0$69050364@yazanz> Message-ID: <200501111305.18337.danderso@redhat.com> On Tuesday 11 January 2005 09:10, Yazan Al-Sheyyab wrote: > hi, > > i have RedHat enterprise linux ES v 3.0 Update 4 (the latest) > > and i have GFS version as : > > GFS-modules-smp-6.0.0-1.2 > GFS-6.0.0-0.6.i686 > > when i installed the first rpm the system says that i must have the > ( kernel-smp-2.4.21-15.EL.rpm )to run this. > > but i have a newer kernel by Update4. > > so ,can i found a solution for that problem ? > and can i download a newer version of GFS to run with my updated kernel. kernel-2.4.21-27 -- GFS-6.0.2-17 kernel-2.4.21-27.0.1 -- GFS-6.0.2-24 The latest GFS available on RHN matches the latest kernel available on RHN. > > cause i now have a newer updates that cannot work with the old gfs i am > using . > > ???????????. > > > Thank you > yazan. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From bujan at isqsolutions.com Tue Jan 11 19:58:09 2005 From: bujan at isqsolutions.com (Manuel Bujan) Date: Tue, 11 Jan 2005 14:58:09 -0500 Subject: [Linux-cluster] GFS hang when one node fail Message-ID: <009801c4f817$dfbe3f60$0c9ce142@pcbujan> Hi, Is there any possibility for a two-node GFS installation to continue working when one of the nodes fail abruptly? Do I have to wait for the fence to be done and the failed system become operational again, to resume activity ? We are using the latest CVS code in a 2.6.9 linux kernel. Regards Bujan PD: We are already using CMAN with two_node="1" expected_votes="1" -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbrassow at redhat.com Wed Jan 12 00:47:44 2005 From: jbrassow at redhat.com (Jonathan E Brassow) Date: Tue, 11 Jan 2005 18:47:44 -0600 Subject: [Linux-cluster] Multipath I/O In-Reply-To: <834F55E6F1BE3B488AD3AFC927A0970018B873@EMAILSRV1.exad.net> References: <834F55E6F1BE3B488AD3AFC927A0970018B873@EMAILSRV1.exad.net> Message-ID: <922AA8AE-6433-11D9-85A8-000A957BB1F6@redhat.com> GFS <= 6.0 will handle multipath automatically. brassow On Jan 11, 2005, at 7:45 AM, Vincent Aniello wrote: > Do I need to use the QLogic failover driver with GFS for multipath I/O > or does GFS handle multipath I/O on its own? > ? > Thanks for your input. > ? > --Vincent > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: text/enriched Size: 595 bytes Desc: not available URL: From daniel at osdl.org Wed Jan 12 01:00:46 2005 From: daniel at osdl.org (Daniel McNeil) Date: Tue, 11 Jan 2005 17:00:46 -0800 Subject: [Linux-cluster] cman bad generation number In-Reply-To: <20050111085608.GA6645@tykepenguin.com> References: <1103654081.29749.17.camel@ibm-c.pdx.osdl.net> <20041222090832.GB1260@tykepenguin.com> <1103736819.30947.20.camel@ibm-c.pdx.osdl.net> <20050111085608.GA6645@tykepenguin.com> Message-ID: <1105491645.30484.23.camel@ibm-c.pdx.osdl.net> On Tue, 2005-01-11 at 00:56, Patrick Caulfield wrote: > On Wed, Dec 22, 2004 at 09:33:39AM -0800, Daniel McNeil wrote: > > How long does cman stay up in your testing? > > With the higher pririty on the heartbeat thread I got 5 days before iSCSI died > on me again... This isn't quite the same load as yours but it is on 8 busy nodes. I have not seen 5 days yet on my set. See my email from yesterday. Is the code to have higher priority for the heartbeat thread already checked in? I restarted my test yesterday and it is still going, but it usually has trouble after 50 hours or so. Daniel From Vincent.Aniello at PipelineTrading.com Wed Jan 12 03:00:54 2005 From: Vincent.Aniello at PipelineTrading.com (Vincent Aniello) Date: Tue, 11 Jan 2005 22:00:54 -0500 Subject: [Linux-cluster] Multipath I/O Message-ID: <834F55E6F1BE3B488AD3AFC927A0970018B8D9@EMAILSRV1.exad.net> So versions after 6.0 no longer handle multipath I/O automatically? --Vincent ________________________________ From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jonathan E Brassow Sent: Tuesday, January 11, 2005 7:48 PM To: linux clistering Subject: Re: [Linux-cluster] Multipath I/O GFS <= 6.0 will handle multipath automatically. brassow On Jan 11, 2005, at 7:45 AM, Vincent Aniello wrote: Do I need to use the QLogic failover driver with GFS for multipath I/O or does GFS handle multipath I/O on its own? Thanks for your input. --Vincent -- Linux-cluster mailing list Linux-cluster at redhat.com http://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From teigland at redhat.com Wed Jan 12 03:47:50 2005 From: teigland at redhat.com (David Teigland) Date: Wed, 12 Jan 2005 11:47:50 +0800 Subject: [Linux-cluster] GFS hang when one node fail In-Reply-To: <009801c4f817$dfbe3f60$0c9ce142@pcbujan> References: <009801c4f817$dfbe3f60$0c9ce142@pcbujan> Message-ID: <20050112034750.GA6184@redhat.com> On Tue, Jan 11, 2005 at 02:58:09PM -0500, Manuel Bujan wrote: > Hi, > > Is there any possibility for a two-node GFS installation to continue > working when one of the nodes fail abruptly? yes > Do I have to wait for the fence to be done yes, the remaining node will fence the failed node > and the failed system become operational again, to resume activity ? no, the remaining node will run fine on its own > PD: We are already using CMAN with two_node="1" expected_votes="1" that's correct -- Dave Teigland From teigland at redhat.com Wed Jan 12 06:45:23 2005 From: teigland at redhat.com (David Teigland) Date: Wed, 12 Jan 2005 14:45:23 +0800 Subject: [Linux-cluster] mount hang during test runs In-Reply-To: <1105404619.30484.7.camel@ibm-c.pdx.osdl.net> References: <1105404619.30484.7.camel@ibm-c.pdx.osdl.net> Message-ID: <20050112064523.GB7571@redhat.com> On Mon, Jan 10, 2005 at 04:50:20PM -0800, Daniel McNeil wrote: > I collected stack traces and a bunch of other info. It is > available here: > http://developer.osdl.org/daniel/GFS/mount.hang.05jan2005/ > > Any ideas on debugging this one? - Processes on cl032 and cl030 are blocked waiting for dlm responses from cl031. - Processes on cl031 are blocked waiting for dlm responses to resource directory lookups (looking up unknown resource masters for 10,0 and 3,11). - It looks like dlm_recvd may be stuck on cl031 preventing it from receiving the requests from the other two nodes and preventing it from receiving the responses to its own lookup requests. This is probably the crux of the problem. Unfortunately, all we see for dlm_recvd on cl031 (from stack.cl031) is: dlm_recvd R running 0 29053 6 29054 29052 (L-TLB) cl032 - requesting PR on 10,1 (mounting) ---------------------------------------- lock_dlm2 D C170F414 0 18399 4 18398 (L-TLB) e6a1fe04 00000046 e7639930 c170f414 0003e36e 00000018 00000008 00000000 d5ea8d58 7505db9d 0003e36e db8ff348 e6a1fdf8 e7639930 00000000 c170f8c0 c170ef60 00000000 000138a5 7505df29 0003e36e f4377170 f43772d8 00000000 Call Trace: [] wait_for_completion+0xa4/0xe0 [] lm_dlm_lock_sync+0x59/0x70 [lock_dlm] [] id_test_and_set+0xa3/0x260 [lock_dlm] [] claim_jid+0x47/0x120 [lock_dlm] [] process_start+0x46d/0x610 [lock_dlm] [] dlm_async+0x274/0x3c0 [lock_dlm] [] kthread+0xba/0xc0 [] kernel_thread_helper+0x5/0x10 cl031 - requesting PR on 10,0 ----------------------------- lock_dlm1 D C170EF9C 0 29065 6 29066 29054 (L-TLB) d2e0ede8 00000046 f76d3850 c170ef9c 0003e354 00000018 00000008 00000000 f6750838 30672ddf 0003e354 dbf900dc d2e0eddc f76d3850 00000000 c170f8c0 c170ef60 00000000 0002088a 306734a4 0003e354 f64d8710 f64d8878 00000000 Call Trace: [] wait_for_completion+0xa4/0xe0 [] lm_dlm_lock_sync+0x59/0x70 [lock_dlm] [] id_value+0x93/0x130 [lock_dlm] [] id_find+0x2f/0x70 [lock_dlm] [] discover_jids+0x6a/0xa0 [lock_dlm] [] process_start+0x2e8/0x610 [lock_dlm] [] dlm_async+0x274/0x3c0 [lock_dlm] [] kthread+0xba/0xc0 [] kernel_thread_helper+0x5/0x10 cl031 - requesting NL on 3,11 ----------------------------- df D 00000008 0 29088 29086 (NOTLB) dd0e5c14 00000082 dd0e5c04 00000008 00000001 f8b3b571 00000008 dd0e5c0c ecb0a568 dbf9002c d6e5415c 00000008 dd0e5c44 00000018 00000000 00000000 c170ef60 00000000 00000fec 4d5f5234 0003e3a1 f6789190 f67892f8 dd0e5c44 Call Trace: [] wait_for_completion+0xa4/0xe0 [] do_dlm_lock_sync+0x4b/0x60 [lock_dlm] [] hold_null_lock+0xb4/0xd0 [lock_dlm] [] lm_dlm_hold_lvb+0x40/0x50 [lock_dlm] [] gfs_lm_hold_lvb+0x3c/0x50 [gfs] [] gfs_lvb_hold+0x41/0xe0 [gfs] [] gfs_ri_update+0x1d3/0x250 [gfs] [] gfs_rindex_hold+0xe8/0x100 [gfs] [] gfs_stat_gfs+0x21/0x80 [gfs] [] gfs_statfs+0x30/0xd0 [gfs] [] vfs_statfs+0x4c/0x70 [] vfs_statfs64+0x1b/0x50 [] sys_statfs64+0x67/0xa0 [] sysenter_past_esp+0x52/0x71 cl030 - requesting PR on 10,1 ----------------------------- lock_dlm2 D 00000008 0 14338 6 14337 (L-TLB) cf1b4de8 00000046 cf1b4dd8 00000008 00000001 00000018 00000008 00000000 f600ec98 00000000 00000000 cbe5ed24 cf1b4ddc 00000000 f7b82054 cf1b4df8 c170ef60 00000000 00014966 b62fc6b6 00009f97 f6610730 f6610898 00000009 Call Trace: [] wait_for_completion+0xa4/0xe0 [] lm_dlm_lock_sync+0x59/0x70 [lock_dlm] [] id_value+0x93/0x130 [lock_dlm] [] id_find+0x2f/0x70 [lock_dlm] [] discover_jids+0x6a/0xa0 [lock_dlm] [] process_start+0x2e8/0x610 [lock_dlm] [] dlm_async+0x274/0x3c0 [lock_dlm] [] kthread+0xba/0xc0 [] kernel_thread_helper+0x5/0x10 cl030 - requesting NL on 3,11 ----------------------------- df D 00000008 0 14362 14360 (NOTLB) d10a3c14 00000086 d10a3c04 00000008 00000001 f8b3b571 00000008 d10a3c0c f6b89818 cbe5ec74 c2015b28 00000008 d10a3c44 00000018 00000000 00000000 c170ef60 00000000 000305ef f0cf7f52 00009fe4 da6f0f10 da6f1078 d10a3c44 Call Trace: [] wait_for_completion+0xa4/0xe0 [] do_dlm_lock_sync+0x4b/0x60 [lock_dlm] [] hold_null_lock+0xb4/0xd0 [lock_dlm] [] lm_dlm_hold_lvb+0x40/0x50 [lock_dlm] [] gfs_lm_hold_lvb+0x3c/0x50 [gfs] [] gfs_lvb_hold+0x41/0xe0 [gfs] [] gfs_ri_update+0x1d3/0x250 [gfs] [] gfs_rindex_hold+0xe8/0x100 [gfs] [] gfs_stat_gfs+0x21/0x80 [gfs] [] gfs_statfs+0x30/0xd0 [gfs] [] vfs_statfs+0x4c/0x70 [] vfs_statfs64+0x1b/0x50 [] sys_statfs64+0x67/0xa0 [] sysenter_past_esp+0x52/0x71 cl032 (nodeid 3, mounting and looking for free jid) --------------------------------------------------- Resource dfdbf26c (parent 00000000). Name (len=24) " 10 1" Local Copy, Master is node 2 Granted Queue Conversion Queue Waiting Queue 000102aa -- (PR) Master: 00000000 LQ: 3,0x9 (pid 18399) cl031 (nodeid 2, jid 1) ----------------------- Resource cc0100a4 (parent 00000000). Name (len=24) " 10 1" Master Copy LVB: 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Granted Queue 000100d5 PR (pid 29066) Conversion Queue Waiting Queue Resource e16fe26c (parent 00000000). Name (len=24) " 10 0" Local Copy, Master is node -1 Granted Queue Conversion Queue Waiting Queue Resource e4b5573c (parent 00000000). Name (len=24) " 3 11" Local Copy, Master is node -1 Granted Queue Conversion Queue Waiting Queue cl030 (nodeid 1, jid 0) ----------------------- Resource cfb9054c (parent 00000000). Name (len=24) " 10 0" Master Copy LVB: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Granted Queue 000102c3 PR (pid 14338) Conversion Queue Waiting Queue Resource d798911c (parent 00000000). Name (len=24) " 10 1" Local Copy, Master is node 2 Granted Queue Conversion Queue Waiting Queue 000103b7 -- (PR) Master: 00000000 LQ: 3,0x9 (pid 14338) Resource d38d7b2c (parent 00000000). Name (len=24) " 3 11" Local Copy, Master is node 2 Granted Queue Conversion Queue Waiting Queue 0002022e -- (NL) Master: 00000000 LQ: 3,0x8 (pid 14362) -- Dave Teigland From pcaulfie at redhat.com Wed Jan 12 08:58:12 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Wed, 12 Jan 2005 08:58:12 +0000 Subject: [Linux-cluster] cman bad generation number In-Reply-To: <1105491645.30484.23.camel@ibm-c.pdx.osdl.net> References: <1103654081.29749.17.camel@ibm-c.pdx.osdl.net> <20041222090832.GB1260@tykepenguin.com> <1103736819.30947.20.camel@ibm-c.pdx.osdl.net> <20050111085608.GA6645@tykepenguin.com> <1105491645.30484.23.camel@ibm-c.pdx.osdl.net> Message-ID: <20050112085812.GI6645@tykepenguin.com> On Tue, Jan 11, 2005 at 05:00:46PM -0800, Daniel McNeil wrote: > On Tue, 2005-01-11 at 00:56, Patrick Caulfield wrote: > > On Wed, Dec 22, 2004 at 09:33:39AM -0800, Daniel McNeil wrote: > > > How long does cman stay up in your testing? > > > > With the higher pririty on the heartbeat thread I got 5 days before iSCSI died > > on me again... This isn't quite the same load as yours but it is on 8 busy nodes. > > I have not seen 5 days yet on my set. See my email from yesterday. > Is the code to have higher priority for the heartbeat thread > already checked in? I restarted my test yesterday and it is > still going, but it usually has trouble after 50 hours or so. > It's rev 1.45 of membership.c checked in on the 7th Jan. If that hasn't fixed it I'll have to dabble with realtime things as it does seem now that the threads are not being woken up, even though the timer is firing. -- patrick From ptr at poczta.fm Wed Jan 12 10:38:49 2005 From: ptr at poczta.fm (ptr at poczta.fm) Date: 12 Jan 2005 11:38:49 +0100 Subject: [Linux-cluster] Log entry Message-ID: <20050112103849.3DD8B3031E3@poczta.interia.pl> Hello. I'm receiving entries like the one below in my system logs. It's 2-nodes cluster built form CVS. -node1: dlm: lkb id 52cd01b3 remid 4c730361 flags 0 status 3 rqmode 5 grmode 3 nodeid 1 lqstate 2 lqflags 44 dlm: request rh_cmd 6 rh_lkid 4c730361 remlkid 52cd01b3 flags 0 status 0 rqmode 3 dlm: eva: process_lockqueue_reply id 52cd01b3 state 0 -node2: dlm: lkb id 43010219 remid 48330092 flags 0 status 3 rqmode 5 grmode 3 nodeid 2 lqstate 2 lqflags 44 dlm: request rh_cmd 6 rh_lkid 48330092 remlkid 43010219 flags 0 status 0 rqmode 3 dlm: eva: process_lockqueue_reply id 43010219 state 0 Can someone explain what kind of faults are they? Regards, Piotr ---------------------------------------------------------------------- Najlepsze auto, najlepsze moto... >>> http://link.interia.pl/f1841 From teigland at redhat.com Wed Jan 12 11:56:06 2005 From: teigland at redhat.com (David Teigland) Date: Wed, 12 Jan 2005 19:56:06 +0800 Subject: [Linux-cluster] Log entry In-Reply-To: <20050112103849.3DD8B3031E3@poczta.interia.pl> References: <20050112103849.3DD8B3031E3@poczta.interia.pl> Message-ID: <20050112115606.GA12401@redhat.com> On Wed, Jan 12, 2005 at 11:38:49AM +0100, ptr at poczta.fm wrote: > > Hello. > I'm receiving entries like the one below in my system logs. > It's 2-nodes cluster built form CVS. > > -node1: > > dlm: lkb > id 52cd01b3 > remid 4c730361 > flags 0 > status 3 > rqmode 5 > grmode 3 > nodeid 1 > lqstate 2 > lqflags 44 > dlm: request > rh_cmd 6 > rh_lkid 4c730361 > remlkid 52cd01b3 > flags 0 > status 0 > rqmode 3 > dlm: eva: process_lockqueue_reply id 52cd01b3 state 0 > Can someone explain what kind of faults are they? They are a notice that an unexplained message reordering has taken place and been corrected. The log entries can be ignored. -- Dave Teigland From mshk_00 at hotmail.com Wed Jan 12 11:59:56 2005 From: mshk_00 at hotmail.com (maria perez) Date: Wed, 12 Jan 2005 12:59:56 +0100 Subject: [Linux-cluster] mount file system GFS Message-ID: I follow the basic example C.3 of the administration Guide of Red Hat GFS 6.0. :" LOCK_GULM SLM Embedded" with only two nodes to access a one file system shared that resides in a SAN (MJ) via iscsi. I have installed red hat enterprise 3.0, kernel 2.4.21-15.0.4.EL, with the modules for GFS-6.0.0-7. All go right (pools created and activated, cca created, ccsd launched, file system created with gfs_mkfs) except I can not mount the file system, the command mount not recognize the file system type gfs, the message appears on the console is: mount : type file system incorrect, option incorrect, superblock incorrect in /dev/pool/pool_gfs01 or number of file systems mounted excessive /dev/pool/pool_gfs01 is pool created for file system, the device assigned is /dev/sdd2 Why can not mount the file system? what is wrong? _________________________________________________________________ Acepta el reto MSN Premium: Protecci?n para tus hijos en internet. Desc?rgalo y pru?balo 2 meses gratis. http://join.msn.com?XAPID=1697&DI=1055&HL=Footer_mailsenviados_proteccioninfantil From mtilstra at redhat.com Wed Jan 12 14:01:27 2005 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Wed, 12 Jan 2005 08:01:27 -0600 Subject: [Linux-cluster] mount file system GFS In-Reply-To: References: Message-ID: <20050112140127.GA2807@redhat.com> On Wed, Jan 12, 2005 at 12:59:56PM +0100, maria perez wrote: > I follow the basic example C.3 of the administration Guide of Red Hat GFS > 6.0. :" LOCK_GULM SLM Embedded" with only two nodes to access a one file > system shared that resides in a SAN (MJ) via iscsi. > I have installed red hat enterprise 3.0, kernel 2.4.21-15.0.4.EL, with the > modules for GFS-6.0.0-7. > All go right (pools created and activated, cca created, ccsd launched, file > system created with gfs_mkfs) except I can not mount the file system, the > command mount not recognize the file system type gfs, the message appears > on the console is: > mount : type file system incorrect, option incorrect, superblock > incorrect in /dev/pool/pool_gfs01 or number of file systems mounted > excessive > /dev/pool/pool_gfs01 is pool created for file system, the device assigned > is /dev/sdd2 > Why can not mount the file system? what is wrong? run dmesg to get more info about why it cannot mount. Did you remember to start lock_gulmd? -- Michael Conrad Tadpol Tilstra The bug starts here. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From serge at triumvirat.ru Wed Jan 12 14:10:40 2005 From: serge at triumvirat.ru (Sergey) Date: Wed, 12 Jan 2005 17:10:40 +0300 Subject: [Linux-cluster] some questions about setting up GFS In-Reply-To: <7B452B98-63F6-11D9-85A8-000A957BB1F6@redhat.com> References: <1125914338.20050111113424@triumvirat.ru> <7B452B98-63F6-11D9-85A8-000A957BB1F6@redhat.com> Message-ID: <1814439859.20050112171040@triumvirat.ru> Hello! > It looks like you are not using pool. Thanks, I've guided by your examples, so raid can be mounted. Now I have some questions about Cluster Configuration System Files. I have 2 nodes - hp1 and hp2. Any of nodes have Integrated Lights-Out with ROM Version: 1.55 - 04/16/2004. Since I have only 2 nodes one of them has to be master, but if first of them (master) is correctly shut down, slave experiencing serious problems which can be solved by resetting. Is it all right? How to make it right? I tried to make servers = ["hp1","hp2","hp3"] (hp3 is really absent), then if master is shut down second node became master. So, if nodes are alternately correctly shut down and boot up master is switching from one to another and everything seems ok, but if one of the nodes is shut down incorrectly (e.g. power cord is pulled out of socket), this have written in systemlog: Jan 12 14:44:33 hp1 lock_gulmd_core[6500]: hp2 missed a heartbeat (time:1105530273952756 mb:1) Jan 12 14:44:48 hp1 lock_gulmd_core[6500]: hp2 missed a heartbeat (time:1105530288972780 mb:2) Jan 12 14:45:03 hp1 lock_gulmd_core[6500]: hp2 missed a heartbeat (time:1105530303992751 mb:3) Jan 12 14:45:03 hp1 lock_gulmd_core[6500]: Client (hp2) expired Jan 12 14:45:03 hp1 lock_gulmd_core[6500]: Core lost slave quorum. Have 1, need 2. Switching to Arbitrating. Jan 12 14:45:03 hp1 lock_gulmd_core[6614]: Gonna exec fence_node hp2 Jan 12 14:45:03 hp1 lock_gulmd_core[6500]: Forked [6614] fence_node hp2 with a 0 pause. Jan 12 14:45:03 hp1 fence_node[6614]: Performing fence method, riloe, on hp2. Jan 12 14:45:04 hp1 fence_node[6614]: The agent (fence_rib) reports: Jan 12 14:45:04 hp1 fence_node[6614]: WARNING! fence_rib is deprecated. use fence_ilo instead parse error: unknown option "ipaddr=10.10.0.112" If start again service lock_gulm on the second node, then on first node this have written in systemlog: Jan 12 14:50:14 hp1 lock_gulmd_core[7148]: Gonna exec fence_node hp2 Jan 12 14:50:14 hp1 fence_node[7148]: Performing fence method, riloe, on hp2. Jan 12 14:50:14 hp1 fence_node[7148]: The agent (fence_rib) reports: Jan 12 14:50:14 hp1 fence_node[7148]: WARNING! fence_rib is deprecated. use fence_ilo instead parse error: unknown option "ipaddr=10.10.0.112" Jan 12 14:50:14 hp1 fence_node[7148]: Jan 12 14:50:14 hp1 fence_node[7148]: All fencing methods FAILED! Jan 12 14:50:14 hp1 fence_node[7148]: Fence of "hp2" was unsuccessful. Jan 12 14:50:14 hp1 lock_gulmd_core[6500]: Fence failed. [7148] Exit code:1 Running it again. Jan 12 14:50:14 hp1 lock_gulmd_core[6500]: Forked [7157] fence_node hp2 with a 5 pause. Jan 12 14:50:15 hp1 lock_gulmd_core[6500]: (10.10.0.201:hp2) Cannot login if you are expired. And I can't umount GFS file system and can't reboot systems because GFS is mounted, only reset both nodes. I think I have mistakes in my configuration, may be it is because incorrect agent = "fence_rib" or something else. Please help :-) Cluster Configuration: cluster.ccs: cluster { name = "cluster" lock_gulm { servers = ["hp1"] (or servers = ["hp1,"hp2","hp3"]) } } fence.ccs: fence_devices { ILO-HP1 { agent = "fence_rib" ipaddr = "10.10.0.111" login = "xx" passwd = "xx" } ILO-HP2 { agent = "fence_rib" ipaddr = "10.10.0.112" login = "xx" passwd = "xx" } } nodes.ccs: nodes { hp1 { ip_interfaces { eth0 = "10.10.0.200" } fence { riloe { ILO-HP1 { localport = 17988 } } } } hp2 { ip_interfaces { eth0 = "10.10.0.201" } fence { riloe { ILO-HP2 { localport = 17988 } } } } # if 3 nodes in cluster.ccs # hp3 { # ip_interfaces { eth0 = "10.10.0.201" } # fence { riloe { ILO-HP2 { localport = 17988 } } } # } Thanks a lot anyway! -- Sergey From mtilstra at redhat.com Wed Jan 12 14:49:05 2005 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Wed, 12 Jan 2005 08:49:05 -0600 Subject: [Linux-cluster] some questions about setting up GFS In-Reply-To: <1814439859.20050112171040@triumvirat.ru> References: <1125914338.20050111113424@triumvirat.ru> <7B452B98-63F6-11D9-85A8-000A957BB1F6@redhat.com> <1814439859.20050112171040@triumvirat.ru> Message-ID: <20050112144905.GA3029@redhat.com> On Wed, Jan 12, 2005 at 05:10:40PM +0300, Sergey wrote: > Hello! > > > It looks like you are not using pool. > > Thanks, I've guided by your examples, so raid can be mounted. > > Now I have some questions about Cluster Configuration System Files. > > I have 2 nodes - hp1 and hp2. Any of nodes have Integrated Lights-Out > with ROM Version: 1.55 - 04/16/2004. > > Since I have only 2 nodes one of them has to be master, but if first > of them (master) is correctly shut down, slave experiencing > serious problems which can be solved by resetting. Is it all right? > How to make it right? > > I tried to make servers = ["hp1","hp2","hp3"] (hp3 is really absent), > then if master is shut down second node became master. So, if The nodes in the servers config line for gulm form a mini-cluster of sorts. There must be quorum (51%) of nodes present in this mini-cluster for things to continue. You must have two of the three servers up and running so that the mini-cluster has quorum, which then will alow the other nodes to connect. > nodes are alternately correctly shut down and boot up master is > switching from one to another and everything seems ok, but if one of > the nodes is shut down incorrectly (e.g. power cord is pulled out of > socket), this have written in systemlog: > > Jan 12 14:44:33 hp1 lock_gulmd_core[6500]: hp2 missed a heartbeat (time:1105530273952756 mb:1) > Jan 12 14:44:48 hp1 lock_gulmd_core[6500]: hp2 missed a heartbeat (time:1105530288972780 mb:2) > Jan 12 14:45:03 hp1 lock_gulmd_core[6500]: hp2 missed a heartbeat (time:1105530303992751 mb:3) > Jan 12 14:45:03 hp1 lock_gulmd_core[6500]: Client (hp2) expired > Jan 12 14:45:03 hp1 lock_gulmd_core[6500]: Core lost slave quorum. Have 1, need 2. Switching to Arbitrating. > Jan 12 14:45:03 hp1 lock_gulmd_core[6614]: Gonna exec fence_node hp2 > Jan 12 14:45:03 hp1 lock_gulmd_core[6500]: Forked [6614] fence_node hp2 with a 0 pause. > Jan 12 14:45:03 hp1 fence_node[6614]: Performing fence method, riloe, on hp2. > Jan 12 14:45:04 hp1 fence_node[6614]: The agent (fence_rib) reports: > Jan 12 14:45:04 hp1 fence_node[6614]: WARNING! fence_rib is deprecated. use fence_ilo instead parse error: unknown > option "ipaddr=10.10.0.112" > > If start again service lock_gulm on the second node, then on first > node this have written in systemlog: > > Jan 12 14:50:14 hp1 lock_gulmd_core[7148]: Gonna exec fence_node hp2 > Jan 12 14:50:14 hp1 fence_node[7148]: Performing fence method, riloe, on hp2. > Jan 12 14:50:14 hp1 fence_node[7148]: The agent (fence_rib) reports: > Jan 12 14:50:14 hp1 fence_node[7148]: WARNING! fence_rib is deprecated. use fence_ilo instead parse error: unknown > option "ipaddr=10.10.0.112" > Jan 12 14:50:14 hp1 fence_node[7148]: > Jan 12 14:50:14 hp1 fence_node[7148]: All fencing methods FAILED! > Jan 12 14:50:14 hp1 fence_node[7148]: Fence of "hp2" was unsuccessful. > Jan 12 14:50:14 hp1 lock_gulmd_core[6500]: Fence failed. [7148] Exit code:1 Running it again. > Jan 12 14:50:14 hp1 lock_gulmd_core[6500]: Forked [7157] fence_node hp2 with a 5 pause. > Jan 12 14:50:15 hp1 lock_gulmd_core[6500]: (10.10.0.201:hp2) Cannot login if you are expired. The node hp2 has to be successfully fenced before it is allowed to re-join the cluster. If your fencing is misconfigured or not working, a fenced node will never get to rejoin. You really should test that fencing works by running fence_node for each node in your cluster before running lock_gulmd. This makes sure that fencing is setup and working correctly. Do that, and once you've verified that fencing is correct (without lock_gulmd running) try things again with lock_gulmd. > And I can't umount GFS file system and can't reboot systems > because GFS is mounted, only reset both nodes. > > I think I have mistakes in my configuration, may be it is because > incorrect agent = "fence_rib" or something else. > > Please help :-) > > > Cluster Configuration: > > cluster.ccs: > cluster { > name = "cluster" > lock_gulm { > servers = ["hp1"] (or servers = ["hp1,"hp2","hp3"]) > } > } > > fence.ccs: > fence_devices { > ILO-HP1 { > agent = "fence_rib" > ipaddr = "10.10.0.111" > login = "xx" > passwd = "xx" > } > ILO-HP2 { > agent = "fence_rib" > ipaddr = "10.10.0.112" > login = "xx" > passwd = "xx" > } > } > > nodes.ccs: > nodes { > hp1 { > ip_interfaces { eth0 = "10.10.0.200" } > fence { riloe { ILO-HP1 { localport = 17988 } } } > } > hp2 { > ip_interfaces { eth0 = "10.10.0.201" } > fence { riloe { ILO-HP2 { localport = 17988 } } } > } > # if 3 nodes in cluster.ccs > # hp3 { > # ip_interfaces { eth0 = "10.10.0.201" } > # fence { riloe { ILO-HP2 { localport = 17988 } } } > # } -- Michael Conrad Tadpol Tilstra Hi, I'm an evil mutated signature virus, put me in your .sig or I will bite your kneecaps! -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From mshk_00 at hotmail.com Wed Jan 12 11:12:33 2005 From: mshk_00 at hotmail.com (maria perez) Date: Wed, 12 Jan 2005 12:12:33 +0100 Subject: [Linux-cluster] Mount GFS Message-ID: I follow Basic example GFS C.3 (administration guide Red hat GFS 6.0) LOCK_GULM SLM Embedded with only two nodes and only a file system, one of them running as server LOCK_GULM. My shared storage is a SAN, MJ, I access it via iscsi. All go right, except I can not mount the file system gfs, mount not recognize the type of fyle system gfs, the message appear on the console is: mount : file system type incorrect, option incorrect, superblock incorrect in /dev/pool/pool_gfs01 or number of file systems mounted excessive Why?? _________________________________________________________________ Moda para esta temporada. Ponte al d?a de todas las tendencias. http://www.msn.es/Mujer/moda/default.asp From daniel at osdl.org Wed Jan 12 17:44:22 2005 From: daniel at osdl.org (Daniel McNeil) Date: Wed, 12 Jan 2005 09:44:22 -0800 Subject: [Linux-cluster] cman bad generation number In-Reply-To: <20050112085812.GI6645@tykepenguin.com> References: <1103654081.29749.17.camel@ibm-c.pdx.osdl.net> <20041222090832.GB1260@tykepenguin.com> <1103736819.30947.20.camel@ibm-c.pdx.osdl.net> <20050111085608.GA6645@tykepenguin.com> <1105491645.30484.23.camel@ibm-c.pdx.osdl.net> <20050112085812.GI6645@tykepenguin.com> Message-ID: <1105551862.30484.31.camel@ibm-c.pdx.osdl.net> On Wed, 2005-01-12 at 00:58, Patrick Caulfield wrote: > On Tue, Jan 11, 2005 at 05:00:46PM -0800, Daniel McNeil wrote: > > On Tue, 2005-01-11 at 00:56, Patrick Caulfield wrote: > > > On Wed, Dec 22, 2004 at 09:33:39AM -0800, Daniel McNeil wrote: > > > > How long does cman stay up in your testing? > > > > > > With the higher pririty on the heartbeat thread I got 5 days before iSCSI died > > > on me again... This isn't quite the same load as yours but it is on 8 busy nodes. > > > > I have not seen 5 days yet on my set. See my email from yesterday. > > Is the code to have higher priority for the heartbeat thread > > already checked in? I restarted my test yesterday and it is > > still going, but it usually has trouble after 50 hours or so. > > > > It's rev 1.45 of membership.c checked in on the 7th Jan. If that hasn't fixed it > I'll have to dabble with realtime things as it does seem now that the threads > are not being woken up, even though the timer is firing. I'm running from code as of Jan 4th, so I do not have that change. I'll updated my code. 2 nodes died last night running my tests with echo "9" > /proc/cluster/config/cman/max_retries echo "1" > /proc/cluster/config/cman/hello_timer here's the output on the console from the 3 nodes: cl030: CMAN: no HELLO from cl031a, removing from the cluster CMAN: node cl032a is not responding - removing from the cluster CMAN: quorum lost, blocking activity cl031: CMAN: node cl030a is not responding - removing from the cluster CMAN: node cl032a is not responding - removing from the cluster SM: Assertion failed on line 67 of file /Views/redhat-cluster/cluster/cman-kernel/src/sm_membership.c SM: assertion: "node" SM: time = 115176056 Kernel panic - not syncing: SM: Record message above and reboot. Message from syslogd at cl031 at Wed Jan 12 01:17:57 2005 ... Record message above and reboot. syncing: SM: cl032: CMAN: too many transition restarts - will die Daniel From yazan at ccs.com.jo Wed Jan 12 18:03:48 2005 From: yazan at ccs.com.jo (Yazan Al-Sheyyab) Date: Wed, 12 Jan 2005 20:03:48 +0200 Subject: [Linux-cluster] 3 questions Message-ID: <000b01c4f8d1$10c00710$69050364@yazanz> hi, I have 3 question : 1- should i setup the temporary directory for GFS configuration files on the two nodes or only on one node ? _______________________________________________________ 2- and if on the two nodes, should i run the : ( ccs_tool create.... ) command on the two nodes or only from one ? _______________________________________________________ 3- I have two members , and have build the cluster.ccs file as follow: cluster { name = "oracluster" lock_gulm { servers = [ "orat1"] } } should i put the two members in the servers line or only the first ( because the document example is about 4 nodes and it had put only the first 3 nodes name) is that true or what ???????????// Thanks Yazan. From amanthei at redhat.com Wed Jan 12 18:04:16 2005 From: amanthei at redhat.com (Adam Manthei) Date: Wed, 12 Jan 2005 12:04:16 -0600 Subject: [Linux-cluster] some questions about setting up GFS In-Reply-To: <1814439859.20050112171040@triumvirat.ru> References: <1125914338.20050111113424@triumvirat.ru> <7B452B98-63F6-11D9-85A8-000A957BB1F6@redhat.com> <1814439859.20050112171040@triumvirat.ru> Message-ID: <20050112180416.GC32421@redhat.com> On Wed, Jan 12, 2005 at 05:10:40PM +0300, Sergey wrote: > I have 2 nodes - hp1 and hp2. Any of nodes have Integrated Lights-Out > with ROM Version: 1.55 - 04/16/2004. > > Jan 12 14:45:04 hp1 fence_node[6614]: The agent (fence_rib) reports: > Jan 12 14:45:04 hp1 fence_node[6614]: WARNING! fence_rib is deprecated. use fence_ilo instead parse error: unknown > option "ipaddr=10.10.0.112" Two things: 1. This is telling you to use an updated version of the agent, fence_ilo. replace fence_rib w/ fence_ilo in your ccs files 2. "ipaddr" is not a parameter for either fence_ilo or fence_rib. The correct parameter is "hostname" (as described in the man page). Hint: You will also need perl-Crypt-SSLeay package from RHN or Net::SSLeay from CPAN. > Cluster Configuration: > > cluster.ccs: > cluster { > name = "cluster" > lock_gulm { > servers = ["hp1"] (or servers = ["hp1,"hp2","hp3"]) > } > } > > fence.ccs: > fence_devices { > ILO-HP1 { > agent = "fence_rib" > ipaddr = "10.10.0.111" > login = "xx" > passwd = "xx" > } > ILO-HP2 { > agent = "fence_rib" > ipaddr = "10.10.0.112" > login = "xx" > passwd = "xx" > } > } > > nodes.ccs: > nodes { > hp1 { > ip_interfaces { eth0 = "10.10.0.200" } > fence { riloe { ILO-HP1 { localport = 17988 } } } > } > hp2 { > ip_interfaces { eth0 = "10.10.0.201" } > fence { riloe { ILO-HP2 { localport = 17988 } } } > } -- Adam Manthei From sbasto at fe.up.pt Wed Jan 12 19:35:24 2005 From: sbasto at fe.up.pt (=?ISO-8859-1?Q?S=E9rgio?= M. Basto) Date: Wed, 12 Jan 2005 19:35:24 +0000 Subject: [Linux-cluster] other log entry Message-ID: <1105558524.8704.8.camel@rh10.fe.up.pt> Hi, with redhat AS 3 update3 I got on /var/log/messages 04:02:48 samba-gfs lock_gulmd_core[10523]: "STONITH<->GuLM Bridge" is logged out. fd:9 Jan 9 04:04:53 samba-gfs last message repeated 4 times Jan 9 04:06:58 samba-gfs last message repeated 4 times Jan 9 04:09:03 samba-gfs last message repeated 4 times Jan 9 04:11:08 samba-gfs last message repeated 4 times Jan 9 04:13:13 samba-gfs last message repeated 4 times Jan 9 04:15:18 samba-gfs last message repeated 4 times are this normal ? what this means ? google don't find any ! thanks, -- S?rgio M. B. From mmiller at cruzverde.cl Wed Jan 12 20:28:09 2005 From: mmiller at cruzverde.cl (Markus Miller) Date: Wed, 12 Jan 2005 17:28:09 -0300 Subject: [Linux-cluster] RAW device limits Message-ID: <75E9203E0F0DD511B37E00D0B789D45007E835AA@fcv-stgo.cverde.cl> Hi, I found a document on the oracle web site ... http://oss.oracle.com/projects/ocfs/dist/documentation/RHAS_best_practices.html .. that says that the maximum number of RAW devices supported by Red Hat AS 2.1 is 255. Does anybody know, if this limit still exists in Red Hat Enterprise 3? I searched the Internet and found all kinds of limits (file size, filesystem size ...) but nothing about the amount of RAW devices soported in Red Hat Enterprise 3. Regards, Markus Markus Miller Ingeniero de Sistemas, RHCE DIFARMA Lord Cochrane 326, Santiago, Chile Tel. +56 2 6944076 mmiller at cruzverde.cl From rajkum2002 at rediffmail.com Wed Jan 12 23:49:10 2005 From: rajkum2002 at rediffmail.com (Raj Kumar) Date: 12 Jan 2005 23:49:10 -0000 Subject: [Linux-cluster] 3 questions Message-ID: <20050112234910.19523.qmail@webmail28.rediffmail.com> >1- should i setup the temporary directory for GFS configuration files > on the two nodes or only on one node ? ONE> >_______________________________________________________ >2- and if on the two nodes, should i run the : > ( ccs_tool create.... ) command on the two nodes or only from one ? > >_______________________________________________________ > ONLY FROM ONE NODE. THIS IS KIND OF GFS SETUP WHICH HAS TO BE DONE ONCE AND SO YOU RUN IT FROM ONLY ONE NODE. >3- I have two members , and have build the cluster.ccs file as follow: > > cluster { > name = "oracluster" > lock_gulm { > servers = [ "orat1"] > } >} THIS ENTRY INDICATES THE NODES RUNNING LOCK SERVERS (LOCK_GULMD). SINCE YOU HAVE ONLY TWO NODES YOU CAN RUN ONLY ONE LOCK SERVER AND YOU WILL PUT THE NAME OF THE NODE RUNNING LOCK SERVER HERE. From GFS manual: Because of quorum requirements, the number of lock servers allowed in a GFS cluster can be 1, 3, 4, or 5. Any other number of lock servers that is, 0, 2, or more than 5 is not supported. Hope this helps! Raj -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel at osdl.org Thu Jan 13 00:47:59 2005 From: daniel at osdl.org (Daniel McNeil) Date: Wed, 12 Jan 2005 16:47:59 -0800 Subject: [Linux-cluster] Clusters special interest group (SIG) Message-ID: <1105577279.5655.3.camel@ibm-c.pdx.osdl.net> The purpose of the Clusters SIG is to provide a general Linux clusters forum which is not specific to any one cluster project. Most of the discussion takes place on the clusters_sig at osdl.org mailing list. For information, the web page is here: (http://developer.osdl.org/dev/clusters/) To sign up for mailing list: http://lists.osdl.org/mailman/listinfo/clusters_sig Initial topics will most likely be (still up for discussion): - Common kernel components * Code review for kernel hooks needed for in-kernel cluster services. * Sharing of common features between cluster implementations * Fencing mechanisms * Resource management * Other cluster components (DLM, Membership, communication, etc) * SA Forum interfaces * OSDL working group capabilities/requirements * Customer and developer feedback on how open source clustering is being used and features that are needed or lacking. Daniel From mshk_00 at hotmail.com Thu Jan 13 11:28:15 2005 From: mshk_00 at hotmail.com (maria perez) Date: Thu, 13 Jan 2005 12:28:15 +0100 Subject: Re Re: [Linux-cluster] mount file system GFS Message-ID: >Michael Conrad Tadpol Tilstra >run dmesg to get more info about why it cannot mount. >Did you remember to start lock_gulmd? Certainly, thank you very much! But now I have another problem.:Only the node stablished like server lock_gulm can mount the file system, the second node hang. Why?? The nodes' names are different and each node the file /etc/hosts contains: 127.0.0.1 localhost.localdomain localhost 127.0.0.1 machinename.domain machinename (ip machine) machinename.domain machinename I have another question: I have read the number of lock_gulm servers only can be 1, 3 or 5, not 2. Is right? I am thinking established the two nodes like servers lock_gulm, will be this correct? _________________________________________________________________ Acepta el reto MSN Premium: Correos m?s divertidos con fotos y textos incre?bles en MSN Premium. Desc?rgalo y pru?balo 2 meses gratis. http://join.msn.com?XAPID=1697&DI=1055&HL=Footer_mailsenviados_correosmasdivertidos From yazan at ccs.com.jo Thu Jan 13 12:31:45 2005 From: yazan at ccs.com.jo (Yazan Al-Sheyyab) Date: Thu, 13 Jan 2005 14:31:45 +0200 Subject: [Linux-cluster] cluster suite question? Message-ID: <000d01c4f96b$d8096dd0$69050364@yazanz> hi, i am working from the beggining again. should i install the cluster suite on the two nodes or only on one node?? Thanks. From yazan at ccs.com.jo Thu Jan 13 13:15:26 2005 From: yazan at ccs.com.jo (Yazan Al-Sheyyab) Date: Thu, 13 Jan 2005 15:15:26 +0200 Subject: [Linux-cluster] gfs probelm Message-ID: <001501c4f971$f294df80$69050364@yazanz> HI, I configure the gfs , and i mount the partitioned as gfs as mentioned in the document, but when make a reboot, the system halted and stay ask continuously : lock_glumd is it running. i didnt put the mounted gfs partitions in the /etc/fstab. ( is that true ?) i made a shell and i put in it the following : service ccsd stop service lock_gulmd stop and i execut it before i make a reboot, and when i loged again to the system the two services are running by the system, that is ok , i know it is not a solution , BUT in the second reboot i found that the system gives the same continuous error question ( lock_gulmd is it running? ). how can i solve this ? can i put the partitions in the /etc/fstab ? OR WHAT ?????. Thanks. From serge at triumvirat.ru Thu Jan 13 13:40:23 2005 From: serge at triumvirat.ru (Sergey) Date: Thu, 13 Jan 2005 16:40:23 +0300 Subject: [Linux-cluster] some questions about setting up GFS In-Reply-To: <20050112144905.GA3029@redhat.com> References: <1125914338.20050111113424@triumvirat.ru> <7B452B98-63F6-11D9-85A8-000A957BB1F6@redhat.com> <1814439859.20050112171040@triumvirat.ru> <20050112144905.GA3029@redhat.com> Message-ID: <2310360088.20050113164023@triumvirat.ru> >> I have 2 nodes - hp1 and hp2. Any of nodes have Integrated Lights-Out >> with ROM Version: 1.55 - 04/16/2004. >> > The nodes in the servers config line for gulm form a mini-cluster of > sorts. There must be quorum (51%) of nodes present in this mini-cluster > for things to continue. > You must have two of the three servers up and running so that the > mini-cluster has quorum, which then will alow the other nodes to > connect. I have only 2 nodes and I can't get quorum. Should I use Single Lock Manager (SLM), when one node is master and another is slave? But in this case if master goes down slave loses access to common file system, and it systemlog looks like this: Jan 13 15:56:59 hp2 kernel: lock_gulm: Checking for journals for node "hp1" Jan 13 15:56:59 hp2 lock_gulmd_core[2935]: Master Node has logged out. Jan 13 15:56:59 hp2 kernel: lock_gulm: Checking for journals for node "hp1" Jan 13 15:56:59 hp2 lock_gulmd_core[2935]: In core_io.c:410 (v6.0.0) death by: Lost connection to SLM Master (hp1), stopping. node reset required to re-activate cluster operations. Jan 13 15:56:59 hp2 kernel: lock_gulm: ERROR Got an error in gulm_res_recvd err: -71 Jan 13 15:56:59 hp2 lock_gulmd_LTPX[2941]: EOF on xdr (_ core _:0.0.0.0 idx:1 fd:5) Jan 13 15:56:59 hp2 lock_gulmd_LTPX[2941]: In ltpx_io.c:335 (v6.0.0) death by: Lost connection to core, cannot continue. node reset required to re-activate cluster operations. Jan 13 15:56:59 hp2 kernel: lock_gulm: ERROR gulm_LT_recver err -71 Jan 13 15:57:02 hp2 kernel: lock_gulm: ERROR Got a -111 trying to login to lock_gulmd. Is it running? status of lock_gulmd: [root at hp2 root]# /etc/init.d/lock_gulmd status lock_gulmd dead but subsys locked If master boots up after some time happens nothing - slave does not try to connect. What should happens further and in what order? > You really should test that fencing works by running > fence_node for each node in your cluster before running > lock_gulmd. This makes sure that fencing is setup and working > correctly. > Do that, and once you've verified that fencing is correct (without > lock_gulmd running) try things again with lock_gulmd. Result of command fence_node NODENAME is reboot of NODENAME. Is it right? -- Sergey From mtilstra at redhat.com Thu Jan 13 14:10:26 2005 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Thu, 13 Jan 2005 08:10:26 -0600 Subject: Re Re: [Linux-cluster] mount file system GFS In-Reply-To: References: Message-ID: <20050113141026.GA19979@redhat.com> On Thu, Jan 13, 2005 at 12:28:15PM +0100, maria perez wrote: > > > >Michael Conrad Tadpol Tilstra > > >run dmesg to get more info about why it cannot mount. > > >Did you remember to start lock_gulmd? > > Certainly, thank you very much! > But now I have another problem.:Only the node stablished like server > lock_gulm can mount the file system, the second node hang. Why?? > The nodes' names are different and each node the file /etc/hosts contains: > > 127.0.0.1 localhost.localdomain localhost > 127.0.0.1 machinename.domain machinename If your /etc/hosts file is actually setting the ip of nodes in your cluster to 127.0.0.1, lock_gulmd will not work. > I have another question: I have read the number of lock_gulm servers only > can be 1, 3 or 5, not 2. Is right? yes. 3,4, and 5 servers run in a mini cluster to avoid a single point of failure. 1 server runs as a single point of failure, but is useful for testing. > I am thinking established the two nodes like servers lock_gulm, will be > this correct? I am sorry, but I don't quite understand this question. You can setup two nodes, both as servers, (by putting three nodes in the servers list and not using one of the entries.) But if one node dies the other will hang. It can be done, with gulm it is not advisable. -- Michael Conrad Tadpol Tilstra I am having an out of money experience. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From mtilstra at redhat.com Thu Jan 13 14:18:24 2005 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Thu, 13 Jan 2005 08:18:24 -0600 Subject: [Linux-cluster] some questions about setting up GFS In-Reply-To: <2310360088.20050113164023@triumvirat.ru> References: <1125914338.20050111113424@triumvirat.ru> <7B452B98-63F6-11D9-85A8-000A957BB1F6@redhat.com> <1814439859.20050112171040@triumvirat.ru> <20050112144905.GA3029@redhat.com> <2310360088.20050113164023@triumvirat.ru> Message-ID: <20050113141824.GB19979@redhat.com> On Thu, Jan 13, 2005 at 04:40:23PM +0300, Sergey wrote: > > >> I have 2 nodes - hp1 and hp2. Any of nodes have Integrated Lights-Out > >> with ROM Version: 1.55 - 04/16/2004. > >> > > > The nodes in the servers config line for gulm form a mini-cluster of > > sorts. There must be quorum (51%) of nodes present in this mini-cluster > > for things to continue. > > > You must have two of the three servers up and running so that the > > mini-cluster has quorum, which then will alow the other nodes to > > connect. > > I have only 2 nodes and I can't get quorum. Should I use Single Lock > Manager (SLM), when one node is master and another is slave? > > But in this case if master goes down slave loses access to common file > system, and it systemlog looks like this: Correct. That is the behavor of gulm in SLM mode. [snip] > If master boots up after some time happens nothing - slave does not > try to connect. Again correct, in SLM mode, the lock state was lost, so there is nothing for the slave to reconnect to. For gulm, you need atleast three nodes to get RLM mode. The third gulm node does not need to run anything but gulm, and can be configured from a file using an option to ccsd. You just need to make sure the configs are the same on all three nodes. > What should happens further and in what order? > > > > You really should test that fencing works by running > > fence_node for each node in your cluster before running > > lock_gulmd. This makes sure that fencing is setup and working > > correctly. > > > Do that, and once you've verified that fencing is correct (without > > lock_gulmd running) try things again with lock_gulmd. > > Result of command > fence_node NODENAME > is reboot of NODENAME. Is it right? If you are using a fencing agent that power cycles the node. (so, sometimes yes. fence_ilo will reboot the node.) -- Michael Conrad Tadpol Tilstra IIss llooccaall eecchhoo oonn?? -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From tboucher at ca.ibm.com Thu Jan 13 14:25:53 2005 From: tboucher at ca.ibm.com (Tony Boucher) Date: Thu, 13 Jan 2005 09:25:53 -0500 Subject: [Linux-cluster] Log entry In-Reply-To: <20050112103849.3DD8B3031E3@poczta.interia.pl> Message-ID: There not errors. It looks like you have verbose logging set. Tony Boucher I/T Specialist (HACMP/GPFS/WLM) "Experience is a hard teacher because she gives the test first, the lesson afterwards." -- Unknown ptr at poczta.fm Sent by: linux-cluster-bounces at redhat.com 01/12/2005 05:38 AM Please respond to linux clistering To linux-cluster at redhat.com cc Subject [Linux-cluster] Log entry Hello. I'm receiving entries like the one below in my system logs. It's 2-nodes cluster built form CVS. -node1: dlm: lkb id 52cd01b3 remid 4c730361 flags 0 status 3 rqmode 5 grmode 3 nodeid 1 lqstate 2 lqflags 44 dlm: request rh_cmd 6 rh_lkid 4c730361 remlkid 52cd01b3 flags 0 status 0 rqmode 3 dlm: eva: process_lockqueue_reply id 52cd01b3 state 0 -node2: dlm: lkb id 43010219 remid 48330092 flags 0 status 3 rqmode 5 grmode 3 nodeid 2 lqstate 2 lqflags 44 dlm: request rh_cmd 6 rh_lkid 48330092 remlkid 43010219 flags 0 status 0 rqmode 3 dlm: eva: process_lockqueue_reply id 43010219 state 0 Can someone explain what kind of faults are they? Regards, Piotr ---------------------------------------------------------------------- Najlepsze auto, najlepsze moto... >>> http://link.interia.pl/f1841 -- Linux-cluster mailing list Linux-cluster at redhat.com http://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From pcaulfie at redhat.com Thu Jan 13 15:41:20 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Thu, 13 Jan 2005 15:41:20 +0000 Subject: [Linux-cluster] Simple wrap for SAF AIS Lock API In-Reply-To: <1102610496.4843.16.camel@manticore.sh.intel.com> References: <1102610496.4843.16.camel@manticore.sh.intel.com> Message-ID: <20050113154120.GE2346@tykepenguin.com> On Thu, Dec 09, 2004 at 04:41:36PM +0000, Stanley Wang wrote: > Hi all, > > The attached patch provides SAF AIS lock APIs support based on current > GDLM. It's just simplest wrap of GDLM's user mode api and didn't touch > GDLM's codes. I think it can be a good complementarity to GDLM. > > The patch is against lastest CVS codes. > > Any interests or comments? Now committed to CVS head. Sorry for the rather long delay. -- patrick From mshk_00 at hotmail.com Fri Jan 14 09:22:37 2005 From: mshk_00 at hotmail.com (maria perez) Date: Fri, 14 Jan 2005 10:22:37 +0100 Subject: : Re: Re Re: [Linux-cluster] mount file system GFS Message-ID: Thank you very much for your help, Michael. Excuse me, but my english is not enough good. I try write correctly in an understable way,but not always I achieve it. Finally, with your help, I achieved mount a file system shared by two nodes, only one of this is running like lock_gulmd server. I had to eliminate the lines in /etc/hosts file of each node, that contain 127.0.0.1. Someone said to me in a occasion that never eliminate the line: '127.0.0.1 localhost.localdomain localhost' What can it happen?? what problems can appear?? My system now had a single point of failure, I would like, if it is possible that the two nodes were servers lock_gulm . I understand in your message I can run the two nodes like servers lock_gulm having only two nodes but declaring three nodes in the file cluster.ccs in the sentence servers=" " and using only two of the three (really the third node not exits). In the file nodes.ccs : I had to declare the three nodes too??Nor?? Do I undersand you well?? _________________________________________________________________ Acepta el reto MSN Premium: Protecci?n para tus hijos en internet. Desc?rgalo y pru?balo 2 meses gratis. http://join.msn.com?XAPID=1697&DI=1055&HL=Footer_mailsenviados_proteccioninfantil From Axel.Thimm at ATrpms.net Fri Jan 14 10:57:24 2005 From: Axel.Thimm at ATrpms.net (Axel Thimm) Date: Fri, 14 Jan 2005 11:57:24 +0100 Subject: [Linux-cluster] Re: CVS compile with 2.6.10-rc3 In-Reply-To: <20041210221412.GA26453@potassium.msp.redhat.com> References: <20041210215924.GA11520@iwork57.lis.uiuc.edu> <20041210221412.GA26453@potassium.msp.redhat.com> Message-ID: <20050114105724.GB5419@neu.nirvana> The fix is in CVS now, thanks! Now that FC2/FC3 have gone 2.6.10, the rawhide GFS-kernel packages break (they also break against rawhide's 2.6.10). Could a new gfs CVS checkout be committed into rawhide? I'm preparing packages for FC3 (perhaps even FC2) and want to be source-wise as close to rawhide/rhel4 as possible. Thanks! On Fri, Dec 10, 2004 at 04:14:12PM -0600, Ken Preslan wrote: > It looks like every other driver in the rc3 patch just drops the "0" > argument to that function. Go ahead and try it and see what you get. > > > On Fri, Dec 10, 2004 at 03:59:25PM -0600, Brynnen R Owen wrote: > > Hi all, > > > > This may be off your radar still, but it appears that the CVS source > > fails to compile with vanilla 2.6.10-rc3. The smoking source file is > > cluster/gfs-kernel/src/gfs/quota.c: > > > > CC [M] /mnt/install/src-2.6.10-rc3-gfs32-1/cluster/gfs-kernel/src/gfs/quota.o > > /mnt/install/src-2.6.10-rc3-gfs32-1/cluster/gfs-kernel/src/gfs/quota.c: > > In function `print_quota_message': > > /mnt/install/src-2.6.10-rc3-gfs32-1/cluster/gfs-kernel/src/gfs/quota.c:956: > > warning: passing arg 3 of pointer to function makes integer from > > pointer without a cast > > /mnt/install/src-2.6.10-rc3-gfs32-1/cluster/gfs-kernel/src/gfs/quota.c:956: > > too many arguments to function > > > > Did the kernel API for tty access change? -- Axel.Thimm at ATrpms.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From cjkovacs at verizon.net Fri Jan 14 11:29:38 2005 From: cjkovacs at verizon.net (Corey Kovacs) Date: Fri, 14 Jan 2005 06:29:38 -0500 Subject: : Re: Re Re: [Linux-cluster] mount file system GFS In-Reply-To: References: Message-ID: <200501140629.39107.cjkovacs@verizon.net> You don't have to remove the loopback line, only the reference to the machines host name in that line... so instead of having.... 127.0.0.1 mymachinename localhost.localdomain localhost You only need/want.... 127.0.0.1 localhost.localdomain localhost 192.168.1.1 mymachinename.mydomain.com mymachinename of course you'll use your correct ip address, etc.... having the host name in the loopback line causes all sorts of problems with other things as well and I am not sure why it's put there in the first place. Cheers. Corey On Friday 14 January 2005 04:22, maria perez wrote: > Thank you very much for your help, Michael. > Excuse me, but my english is not enough good. I try write correctly in an > understable way,but not always I achieve it. > > Finally, with your help, I achieved mount a file system shared by two nodes, > only one of this is running like lock_gulmd server. I had to eliminate the > lines in /etc/hosts file of each node, that contain 127.0.0.1. Someone said > to me in a occasion that never eliminate the line: > '127.0.0.1 localhost.localdomain localhost' > What can it happen?? what problems can appear?? > > My system now had a single point of failure, I would like, if it is possible > that the two nodes were servers lock_gulm . I understand in your message I > can run the two nodes like servers lock_gulm having only two nodes but > declaring three nodes in the file cluster.ccs in the sentence servers=" " > and using only two of the three (really the third node not exits). In the > file nodes.ccs : I had to declare the three nodes too??Nor?? Do I undersand > you well?? > > _________________________________________________________________ > Acepta el reto MSN Premium: Protecci?n para tus hijos en internet. > Desc?rgalo y pru?balo 2 meses gratis. > http://join.msn.com?XAPID=1697&DI=1055&HL=Footer_mailsenviados_proteccioninfantil > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > From mshk_00 at hotmail.com Fri Jan 14 12:31:54 2005 From: mshk_00 at hotmail.com (maria perez) Date: Fri, 14 Jan 2005 13:31:54 +0100 Subject: : Re: Re Re: [Linux-cluster] mount file system GFS Message-ID: Thanks, It is certain. At last I achieved mount the file system from two nodes and the file /etc/hosts maintains 127.0.0.1 localhost................... I included the line 127.0.0.1 mymachinename ...... because I understood wrong a message that I read and I decided to probe with it. Now I know the problem was in the file /etc/hosts of each machine does not appear a line including the ip and name of the other machine. Regards >You don't have to remove the loopback line, only the reference to the >machines host name >in that line... >so instead of having.... > >127.0.0.1 mymachinename localhost.localdomain localhost > >You only need/want.... >127.0.0.1 localhost.localdomain localhost >192.168.1.1 mymachinename.mydomain.com mymachinename >of course you'll use your correct ip address, etc.... >having the host name in the loopback line causes all sorts of problems with >other things as well and I am not sure why it's put there in the first >place. >Cheers. >Corey _________________________________________________________________ Descarga gratis la Barra de Herramientas de MSN http://www.msn.es/usuario/busqueda/barra?XAPID=2031&DI=1055&SU=http%3A//www.hotmail.com&HL=LINKTAG1OPENINGTEXT_MSNBH From mshk_00 at hotmail.com Fri Jan 14 12:47:00 2005 From: mshk_00 at hotmail.com (maria perez) Date: Fri, 14 Jan 2005 13:47:00 +0100 Subject: [Linux-cluster] gfs probelm Message-ID: >HI, > > I configure the gfs , and i mount the partitioned as gfs as mentioned in >the document, but when make a reboot, the system halted and stay ask >continuously : lock_glumd is it running. >i didnt put the mounted gfs partitions in the /etc/fstab. ( is that true >?) >i made a shell and i put in it the following : > service ccsd stop > service lock_gulmd stop > and i execut it before i make a reboot, and when i loged again to the >system the two services are running by the system, that is ok , i know it >is >not a solution , BUT in the second reboot i found that the system gives the >same continuous error question ( lock_gulmd is it running? ). >how can i solve this ? >can i put the partitions in the /etc/fstab ? >OR WHAT ?????. I don't know many nodes you have running, and if all nodes are lock_gulm servers or only one of them. Maybe if you reboot a node that is a server lock_gulm without stop this services (gfs, lock_gulmd, ccsd) and other nodes that are running depens of that node. have you created the archive /etc/sysconfig/gfs ??? !!The order to stop the modules is: gfs, lock_gulmd, ccsd, pool. Sorry I can not help you more. Good luck! _________________________________________________________________ Acepta el reto MSN Premium: Protecci?n para tus hijos en internet. Desc?rgalo y pru?balo 2 meses gratis. http://join.msn.com?XAPID=1697&DI=1055&HL=Footer_mailsenviados_proteccioninfantil From mtilstra at redhat.com Fri Jan 14 15:43:41 2005 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Fri, 14 Jan 2005 09:43:41 -0600 Subject: : Re: Re Re: [Linux-cluster] mount file system GFS In-Reply-To: References: Message-ID: <20050114154341.GC24596@redhat.com> On Fri, Jan 14, 2005 at 10:22:37AM +0100, maria perez wrote: > Thank you very much for your help, Michael. > Excuse me, but my english is not enough good. I try write correctly in an > understable way,but not always I achieve it. Yeah, no worries. I've been speaking english all my life, and I still screw it up regularly. ^_^ [snipped what got answered by others] > My system now had a single point of failure, I would like, if it is > possible that the two nodes were servers lock_gulm . I understand in your > message I can run the two nodes like servers lock_gulm having only two > nodes but declaring three nodes in the file cluster.ccs in the sentence > servers=" " and using only two of the three (really the third node not > exits). In the file nodes.ccs : I had to declare the three nodes too??Nor?? > Do I undersand you well?? Yes, that right. With this setup, one node will stop when the other dies. But you will not need to reboot both, just the one that died. Not an ideal situation, but a little better. All this comes from the fact that gulm was not designed with small in mind. -- Michael Conrad Tadpol Tilstra Chemicals, n.: Noxious substances from which modern foods are made. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From tom at nethinks.com Fri Jan 14 10:41:34 2005 From: tom at nethinks.com (tom at nethinks.com) Date: Fri, 14 Jan 2005 11:41:34 +0100 Subject: [Linux-cluster] mail-cluster + gfs setup? Message-ID: Hi all, does somebody here on the list have successfully setup a cyrus pop3/imap mail cluster? We are running for 2 month a test setup and we are disappointed of the performance. We already updatet gfs to the last cvs version but the performance of reading the berkley db are more then poor it takes more then 5 min to initalize the db with two servers connectet to the gfs filesystem. Many thx. -tom From bujan at isqsolutions.com Fri Jan 14 17:58:07 2005 From: bujan at isqsolutions.com (Manuel Bujan) Date: Fri, 14 Jan 2005 12:58:07 -0500 Subject: [Linux-cluster] Which APC fence device ? Message-ID: <005c01c4fa62$9b438910$7801a8c0@pcbujan> Hi, Could any of you guys can recommend me a working APC Masterswitch model to use as a fencing device for our two node GFS cluster ? We are planning to go in production by the next month and we were using until now the fencing manual mechanism. I looked inside the APC site and I found different models, but I am not certainly sure which one to select to be compatible with the fence_apc program. Does fence_apc work with ethernet power switches from APC like the model AP7900 ? http://www.apc.com/products/family/index.cfm?id=70 Any suggestions Regards Bujan From bujan at isqsolutions.com Fri Jan 14 18:04:20 2005 From: bujan at isqsolutions.com (Manuel Bujan) Date: Fri, 14 Jan 2005 13:04:20 -0500 Subject: [Linux-cluster] mail-cluster + gfs setup? References: Message-ID: <005f01c4fa63$7ad0ff40$7801a8c0@pcbujan> yes, We are now testing a two-node installation using Postfix + Cyrus Imap/Pop3 + MySQL + Apache without major problems. I recomend you to use a MailDir style mailbox and disable the sorting and threading features that were enabled in the Cyrus IMAP installation by default. Regards Bujan ----- Original Message ----- From: To: Sent: Friday, January 14, 2005 5:41 AM Subject: [Linux-cluster] mail-cluster + gfs setup? > > > > > Hi all, > > does somebody here on the list have successfully setup a cyrus pop3/imap > mail cluster? > > We are running for 2 month a test setup and we are disappointed of the > performance. > > We already updatet gfs to the last cvs version but the performance of > reading the berkley db are more then poor > it takes more then 5 min to initalize the db with two servers connectet to > the gfs filesystem. > > Many thx. > > -tom > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > From lhh at redhat.com Fri Jan 14 21:26:50 2005 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 14 Jan 2005 16:26:50 -0500 Subject: [Linux-cluster] Which APC fence device ? In-Reply-To: <005c01c4fa62$9b438910$7801a8c0@pcbujan> References: <005c01c4fa62$9b438910$7801a8c0@pcbujan> Message-ID: <1105738010.9279.203.camel@ayanami.boston.redhat.com> On Fri, 2005-01-14 at 12:58 -0500, Manuel Bujan wrote: > Does fence_apc work with ethernet power switches from APC like the model > AP7900 ? > http://www.apc.com/products/family/index.cfm?id=70 I think it works with the 7900 and 7921; not 100% on that. -- Lon From yazan at ccs.com.jo Sat Jan 15 06:15:25 2005 From: yazan at ccs.com.jo (Yazan Al-Sheyyab) Date: Sat, 15 Jan 2005 08:15:25 +0200 Subject: [Linux-cluster] gfs probelm References: Message-ID: <00bc01c4fac9$9a225bd0$69050364@yazanz> i have two nodes , orat1 , orat2 . and i make the lock_gulmd on the tow nodes because the document i have says that i should run the the ccsd then the lock_gulmd on the two nodes. and i put the first node as lock_gulm in the cluster.ccs file. Thanks. ----- Original Message ----- From: "maria perez" To: Sent: Friday, January 14, 2005 2:47 PM Subject: Re:[Linux-cluster] gfs probelm > > > > >HI, > > > > I configure the gfs , and i mount the partitioned as gfs as mentioned in > >the document, but when make a reboot, the system halted and stay ask > >continuously : lock_glumd is it running. > > >i didnt put the mounted gfs partitions in the /etc/fstab. ( is that true > >?) > > >i made a shell and i put in it the following : > > service ccsd stop > > service lock_gulmd stop > > > and i execut it before i make a reboot, and when i loged again to the > >system the two services are running by the system, that is ok , i know it > >is > >not a solution , BUT in the second reboot i found that the system gives the > >same continuous error question ( lock_gulmd is it running? ). > > >how can i solve this ? > >can i put the partitions in the /etc/fstab ? > > >OR WHAT ?????. > > I don't know many nodes you have running, and if all nodes are lock_gulm > servers or only one of them. Maybe if you reboot a node that is a server > lock_gulm without stop this services (gfs, lock_gulmd, ccsd) and other nodes > that are running depens of that node. > > have you created the archive /etc/sysconfig/gfs ??? > > !!The order to stop the modules is: gfs, lock_gulmd, ccsd, pool. > Sorry I can not help you more. > Good luck! > > _________________________________________________________________ > Acepta el reto MSN Premium: Protecci?n para tus hijos en internet. > Desc?rgalo y pru?balo 2 meses gratis. > http://join.msn.com?XAPID=1697&DI=1055&HL=Footer_mailsenviados_proteccioninfantil > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From yazan at ccs.com.jo Sat Jan 15 07:13:44 2005 From: yazan at ccs.com.jo (Yazan Al-Sheyyab) Date: Sat, 15 Jan 2005 09:13:44 +0200 Subject: [Linux-cluster] gfs probelm References: Message-ID: <00c101c4fad1$bff3c7b0$69050364@yazanz> no i didnt create the archive /etc/sysconfig/gfs ??? yet. actually i dont know how will i create it, im using a GFS document and that is not mention in the doc. how can i build it and what can i put in it ? Thanks. ----- Original Message ----- From: "maria perez" To: Sent: Friday, January 14, 2005 2:47 PM Subject: Re:[Linux-cluster] gfs probelm > > > > >HI, > > > > I configure the gfs , and i mount the partitioned as gfs as mentioned in > >the document, but when make a reboot, the system halted and stay ask > >continuously : lock_glumd is it running. > > >i didnt put the mounted gfs partitions in the /etc/fstab. ( is that true > >?) > > >i made a shell and i put in it the following : > > service ccsd stop > > service lock_gulmd stop > > > and i execut it before i make a reboot, and when i loged again to the > >system the two services are running by the system, that is ok , i know it > >is > >not a solution , BUT in the second reboot i found that the system gives the > >same continuous error question ( lock_gulmd is it running? ). > > >how can i solve this ? > >can i put the partitions in the /etc/fstab ? > > >OR WHAT ?????. > > I don't know many nodes you have running, and if all nodes are lock_gulm > servers or only one of them. Maybe if you reboot a node that is a server > lock_gulm without stop this services (gfs, lock_gulmd, ccsd) and other nodes > that are running depens of that node. > > have you created the archive /etc/sysconfig/gfs ??? > > !!The order to stop the modules is: gfs, lock_gulmd, ccsd, pool. > Sorry I can not help you more. > Good luck! > > _________________________________________________________________ > Acepta el reto MSN Premium: Protecci?n para tus hijos en internet. > Desc?rgalo y pru?balo 2 meses gratis. > http://join.msn.com?XAPID=1697&DI=1055&HL=Footer_mailsenviados_proteccioninfantil > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From yazan at ccs.com.jo Sat Jan 15 07:27:42 2005 From: yazan at ccs.com.jo (Yazan Al-Sheyyab) Date: Sat, 15 Jan 2005 09:27:42 +0200 Subject: [Linux-cluster] cluster suite question? Message-ID: <01db01c4fad3$b366d1c0$69050364@yazanz> hi, should i install the cluster suite on the two nodes or only on one node?? my problem is that when i installed the cluster suit on the two nodes then i found that the gui of the cluster suit didnt have the quorum check box checked even i configure the /etc/sysconfig/rawdevices , and i put in it two partition each of 100MB as raw1 and raw2 and i used them but with the same problem . firstly i configured the GFS and then i installed ther cluster suit an im haveing this problem , and because of that i asked if can i installed it on the two nodes or not ?????????. Thanks. From mshk_00 at hotmail.com Mon Jan 17 08:48:04 2005 From: mshk_00 at hotmail.com (maria perez) Date: Mon, 17 Jan 2005 09:48:04 +0100 Subject: [Linux-cluster] gfs probelm Message-ID: >no i didnt create the archive /etc/sysconfig/gfs ??? yet. >actually i dont know how will i create it, im using a GFS document and that >is not mention in the doc. >how can i build it and what can i put in it ? >Thanks. The only you have to do is: since a console of the system write: [root at machinename]# nano /etc/sysconfig/gfs then appear in the console this archive and write into it something like this; POOLS="name_pool1 name_pool2 ...name_pooln" CCS_ARCHIVE="/dev/pool/name_pool_cluster" (name_pool_cluster: pool created for the archive cca of the cluster that you are usign, in the administration guide of GFS6.0 appear like alpha_cca ) (You can create the file in other way of course) With this archive the system on the boot or reboot try to stop ther services: gfs, lock_gulmd, ccsd and pool (I believe it!!!!) maria _________________________________________________________________ Hor?scopo, tarot, numerolog?a... Escucha lo que te dicen los astros. http://astrocentro.msn.es/ From yazan at ccs.com.jo Mon Jan 17 09:20:17 2005 From: yazan at ccs.com.jo (Yazan Al-Sheyyab) Date: Mon, 17 Jan 2005 11:20:17 +0200 Subject: [Linux-cluster] lvm with gfs Message-ID: <000b01c4fc75$c25378c0$69050364@yazanz> hi, The shared that i have is raid5, and i have completedthe partitions i want and i formated them as gfs and mount them and everything is OK. but now , i want to setup rawdevices, and i want to put each rawdevice in one partion , so i have taken the free space from the shared and made it an lvm partition, and then i partioned the new lvm partition into 19 partitions : /dev/vg0/r1........ /dev/vg0/r18 and i want to format them as gfs filesystem i used the same procedure used from the beggining so i made another partition named as /dev/vg0/newgfs (to use as CCS file) but when i make : pool_tool -c to any of the new partitions it says that ( Unable to open device "/dev/lvm(a-s)" then it writes pool label written successfully from the file. when i run pool_tool -s it writes an with the new partitions. This is the problem. ????? Thanks From yazan at ccs.com.jo Mon Jan 17 09:57:15 2005 From: yazan at ccs.com.jo (Yazan Al-Sheyyab) Date: Mon, 17 Jan 2005 11:57:15 +0200 Subject: [Linux-cluster] pool Message-ID: <000701c4fc7a$ec88bb50$69050364@yazanz> hi, can i put partitions created fom LVM as /dev/vg0/r1...../dev/vg0/r18 into pools ? and how? is it the same procedure ? how to put them in pool so that when i run pool_tool -s it doesnot give to them. Thanks From daniel at osdl.org Tue Jan 18 01:31:33 2005 From: daniel at osdl.org (Daniel McNeil) Date: Mon, 17 Jan 2005 17:31:33 -0800 Subject: [Linux-cluster] cluster failed after 53 hours Message-ID: <1106011893.15101.6.camel@ibm-c.pdx.osdl.net> My 3 node cluster ran tests for 53 hours before hitting a problem. Node cl031 hit the 1st problem CMAN: killed by STARTTRANS or NOMINATE. There is a DLM assert on cl031 also, but that is after a whole bunch of debug output. The full logs are here (http://developer.osdl.org/daniel/GFS/test.12jan2005/) Any ideas on what is going on? Here is simplified output (in the README file): test started Jan Wed 12 17:18 hung after Fri Jan 14 22:00 cl031 got an error in just under 53 hours. ========================================== Jan 14 22:00:38 cl031 kernel: CMAN: node cl031a has been removed from the cluster : No response to messages Jan 14 22:00:38 cl031 kernel: CMAN: killed by STARTTRANS or NOMINATE Jan 14 22:00:38 cl031 kernel: CMAN: we are leaving the cluster. Jan 14 22:00:38 cl031 kernel: name " 2 54aef1" flags 2 nodeid 0 ref 1 Jan 14 22:00:38 cl031 kernel: G 0029017f gr 5 rq -1 flg 0 sts 2 node 0 remid 0 lq 0,5 [snip 34980 lines] Jan 14 22:10:07 cl031 kernel: G 00010165 gr 5 rq -1 flg 0 sts 2 node 0 remid 0 lq 0,5 Jan 14 22:10:07 cl031 kernel: 3 to 3 id 432 Jan 14 22:10:07 cl031 kernel: stripefs updated 350 resources Jan 14 22:10:07 cl031 kernel: stripefs rebuild locks Jan 14 22:10:07 cl031 kernel: stripefs rebuilt 0 locks Jan 14 22:10:07 cl031 kernel: stripefs recover event 6122 done Jan 14 22:10:07 cl031 kernel: stripefs rcom status f to 3 Jan 14 22:10:07 cl031 kernel: stripefs rcom send 6 to 3 id 433 Jan 14 22:10:07 cl031 kernel: stripefs rcom send 6 to 3 id 434 Jan 14 22:10:07 cl031 kernel: stripefs rcom send 6 to 3 id 435 Jan 14 22:10:07 cl031 kernel: stripefs rcom send 6 to 3 id 436 Jan 14 22:10:07 cl031 kernel: stripefs rcom send 6 to 3 id 437 Jan 14 22:10:07 cl031 kernel: stripefs rcom send 6 to 3 id 438 Jan 14 22:10:07 cl031 kernel: stripefs rcom send 6 to 3 id 439 Jan 14 22:10:07 cl031 kernel: stripefs rcom send 6 to 3 id 440 Jan 14 22:10:07 cl031 kernel: stripefs rcom send 6 to 3 id 441 Jan 14 22:10:07 cl031 kernel: stripefs rcom send 6 to 3 id 442 Jan 14 22:10:07 cl031 kernel: stripefs move flags 0,0,1 ids 6119,6122,6122 Jan 14 22:10:07 cl031 kernel: stripefs process held requests Jan 14 22:10:07 cl031 kernel: stripefs processed 0 requests Jan 14 22:10:07 cl031 kernel: stripefs resend marked requests Jan 14 22:10:07 cl031 kernel: stripefs resent 0 requests Jan 14 22:10:07 cl031 kernel: stripefs recover event 6122 finished Jan 14 22:10:07 cl031 kernel: stripefs move flags 1,0,0 ids 6122,6122,6122 Jan 14 22:10:07 cl031 kernel: stripefs add_to_requestq cmd 1 fr 3 Jan 14 22:10:08 cl031 kernel: stripefs move flags 0,0,0 ids 6122,6122,6122 Jan 14 22:10:08 cl031 kernel: stripefs rcom status 0 to 1 Jan 14 22:10:08 cl031 kernel: stripefs move flags 0,1,0 ids 6122,6123,6122 Jan 14 22:10:08 cl031 kernel: stripefs move use event 6123 Jan 14 22:10:08 cl031 kernel: stripefs recover event 6123 Jan 14 22:10:08 cl031 kernel: stripefs add node 1 Jan 14 22:10:08 cl031 kernel: stripefs rcom send 1 to 1 id 443 Jan 14 22:10:08 cl031 kernel: stripefs rcom status 4 to 1 Jan 14 22:10:08 cl031 kernel: jan 14 22:10:08 cl031 kernel: DLM: Assertion failed on line 128 of file /Views/redhat-cluster/cluster/dlm-kernel/src/reccomms.c Jan 14 22:10:08 cl031 kernel: DLM: assertion: "error >= 0" Jan 14 22:10:08 cl031 kernel: DLM: time = 201619244 Jan 14 22:10:08 cl031 kernel: error = -105 Jan 14 22:10:08 cl031 kernel: >From reccoms.c: error = midcomms_send_message(nodeid, (struct dlm_header *) rc, GFP_KERNEL); DLM_ASSERT(error >= 0, printk("error = %d\n", error);); cl030 ===== Jan 14 22:00:38 cl030 kernel: CMAN: removing node cl031a from the cluster : No rresponse to messages Jan 14 22:00:39 cl030 kernel: dlm: stripefs: nodes_init failed -1 Jan 14 22:00:39 cl030 fence_manual: Node cl031a needs to be reset before recoverry can procede. Waiting for cl031a to rejoin the cluster or for manual acknowleddgement that it has been reset (i.e. fence_ack_manual -s cl031a) (2 hours and 45 minutes later Sat Jan 15 00:45:00) Jan 15 00:50:12 cl030 kernel: CMAN: nmembers in HELLO message from 3 does not maatch our view (got 1, exp 2) Jan 15 00:52:57 cl030 kernel: CMAN: too many transition restarts - will die Jan 15 00:52:57 cl030 kernel: CMAN: we are leaving the cluster. Inconsistent cluuster view cl032 ===== Jan 14 22:00:38 cl032 kernel: CMAN: node cl031a has been removed from the cluster : No response to messages Jan 14 22:00:39 cl032 kernel: dlm: stripefs: nodes_reconfig failed 1 Jan 14 22:00:39 cl032 fenced[8983]: fencing deferred to 1 Jan 15 00:50:08 cl032 kernel: CMAN: removing node cl030a from the cluster : No response to messages Jan 15 00:50:08 cl032 kernel: CMAN: quorum lost, blocking activity Jan 15 00:53:02 cl032 kernel: SM: 00000001 process_recovery_barrier status=-104 Daniel From Axel.Thimm at ATrpms.net Tue Jan 18 02:21:05 2005 From: Axel.Thimm at ATrpms.net (Axel Thimm) Date: Tue, 18 Jan 2005 03:21:05 +0100 Subject: [Linux-cluster] FC3 and FC2 package backports of GFS Message-ID: <20050118022105.GU5849@neu.nirvana> Hi, I'm starting to push out packages for GFS on FC3 and FC2 under http://atrpms.net/name/cluster/ The userland packages are basically rebuilds of what exists in FC rawhide. The kernel module packages are still the same cut from CVS like the rawhide packages (with minor compile fixes for 2.6.10), but the packages have been completely restructured to allow for each installed kernel to have its own non-conflicting copy of the required kernel modules. I.e. the kernel modules for GFS are in packages called GFS-kmdl- etc. There are also packages for qla2xxx, device-mapper with multipath support as well as multipath-tools for setting up GFS over FC. Please note that most packages are placed in the "bleeding" repo which is only for early and experimental packages. Nevertheless feel free to fry your SANs with them. :) Thanks! -- Axel.Thimm at ATrpms.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From pcaulfie at redhat.com Tue Jan 18 08:48:30 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 18 Jan 2005 08:48:30 +0000 Subject: [Linux-cluster] cluster failed after 53 hours In-Reply-To: <1106011893.15101.6.camel@ibm-c.pdx.osdl.net> References: <1106011893.15101.6.camel@ibm-c.pdx.osdl.net> Message-ID: <20050118084830.GC12101@tykepenguin.com> On Mon, Jan 17, 2005 at 05:31:33PM -0800, Daniel McNeil wrote: > My 3 node cluster ran tests for 53 hours before hitting a problem. > > > Node cl031 hit the 1st problem CMAN: killed by STARTTRANS or > NOMINATE. There is a DLM assert on cl031 also, but that is > after a whole bunch of debug output. The full logs are > here (http://developer.osdl.org/daniel/GFS/test.12jan2005/) > > Any ideas on what is going on? > > Here is simplified output (in the README file): > test started Jan Wed 12 17:18 > hung after Fri Jan 14 22:00 > > cl031 got an error in just under 53 hours. > ========================================== > Jan 14 22:00:38 cl031 kernel: CMAN: node cl031a has been removed from the cluster : No response to messages It's the usual thing. missing messages. patrick From pcaulfie at redhat.com Tue Jan 18 14:01:58 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 18 Jan 2005 14:01:58 +0000 Subject: [Linux-cluster] cluster failed after 53 hours In-Reply-To: <1106011893.15101.6.camel@ibm-c.pdx.osdl.net> References: <1106011893.15101.6.camel@ibm-c.pdx.osdl.net> Message-ID: <20050118140158.GH12101@tykepenguin.com> On Mon, Jan 17, 2005 at 05:31:33PM -0800, Daniel McNeil wrote: > My 3 node cluster ran tests for 53 hours before hitting a problem. Attached is a patch to set the CMAN process to run at realtime priority, I'm not sure if that's the right thing to do or not to be honest. Neither am I sure whether your 48-53 hours is significant - it's possible that memory may be an issue (only guessing but GFS caches locks like crazy, it may be worth cutting this down a bit by tweaking /proc/cluster/lock_dlm/drop_count and/or /proc/cluster/lock_dlm/drop_period otherwise, the only way were gpoing to get to the bottom of this is to enable "DEBUG_MEMB" in cman and see what it thinks is going on when the node is kicked out of the cluster. patrick -------------- next part -------------- Index: cnxman.c =================================================================== RCS file: /cvs/cluster/cluster/cman-kernel/src/cnxman.c,v retrieving revision 1.45 diff -u -p -r1.45 cnxman.c --- cnxman.c 17 Jan 2005 14:42:36 -0000 1.45 +++ cnxman.c 18 Jan 2005 10:49:50 -0000 @@ -63,6 +63,7 @@ static int is_valid_temp_nodeid(int node extern int start_membership_services(pid_t); extern int kcl_leave_cluster(int remove); extern int send_kill(int nodeid, int needack); +extern void cman_set_realtime(struct task_struct *tsk, int prio); static struct proto_ops cl_proto_ops; static struct sock *master_sock; @@ -308,7 +309,7 @@ static int cluster_kthread(void *unused) init_waitqueue_entry(&cnxman_waitq_head, current); add_wait_queue(&cnxman_waitq, &cnxman_waitq_head); - set_user_nice(current, -6); + cman_set_realtime(current, 1); /* Allow the sockets to start receiving */ list_for_each(socklist, &socket_list) { Index: membership.c =================================================================== RCS file: /cvs/cluster/cluster/cman-kernel/src/membership.c,v retrieving revision 1.47 diff -u -p -r1.47 membership.c --- membership.c 13 Jan 2005 14:12:59 -0000 1.47 +++ membership.c 18 Jan 2005 10:49:50 -0000 @@ -201,6 +202,13 @@ static uint8_t *node_opinion = NULL; #define OPINION_AGREE 1 #define OPINION_DISAGREE 2 + +void cman_set_realtime(struct task_struct *tsk, int prio) +{ + tsk->policy = SCHED_FIFO; + tsk->rt_priority = prio; +} + /* Set node id of a node, also add it to the members array and expand the array * if necessary */ static inline void set_nodeid(struct cluster_node *node, int nodeid) @@ -281,7 +289,7 @@ static int hello_kthread(void *unused) hello_task = tsk; up(&hello_task_lock); - set_user_nice(current, -20); + cman_set_realtime(current, 1); while (node_state != REJECTED && node_state != LEFT_CLUSTER) { @@ -317,7 +325,7 @@ static int membership_kthread(void *unus sigprocmask(SIG_BLOCK, &tmpsig, NULL); membership_task = tsk; - set_user_nice(current, -5); + cman_set_realtime(current, 1); /* Open the socket */ if (init_membership_services()) From chekov at ucla.edu Tue Jan 18 22:04:13 2005 From: chekov at ucla.edu (Alan Wood) Date: Tue, 18 Jan 2005 14:04:13 -0800 (PST) Subject: [Linux-cluster] Re: Linux-cluster Digest, Vol 9, Issue 12 In-Reply-To: <20050115170059.3B0E67387F@hormel.redhat.com> References: <20050115170059.3B0E67387F@hormel.redhat.com> Message-ID: Bujan, we purchased the AP7900 units and they work fine with the fence_apc module. the only issue I had was that fence_apc did not support SSH, though the AP7900 does, which isn't an issue if you put your PDUs on their own secluded network. the telnet interface works perfectly. -alan On Sat, 15 Jan 2005 linux-cluster-request at redhat.com wrote: > Date: Fri, 14 Jan 2005 12:58:07 -0500 > From: "Manuel Bujan" > Subject: [Linux-cluster] Which APC fence device ? > To: "linux clustering" > Message-ID: <005c01c4fa62$9b438910$7801a8c0 at pcbujan> > Content-Type: text/plain; format=flowed; charset="iso-8859-1"; > reply-type=original > > Hi, > > Could any of you guys can recommend me a working APC Masterswitch model to > use as a fencing device for our two node GFS cluster ? > > We are planning to go in production by the next month and we were using > until now the fencing manual mechanism. > > I looked inside the APC site and I found different models, but I am not > certainly sure which one to select to be compatible with the fence_apc > program. > > Does fence_apc work with ethernet power switches from APC like the model > AP7900 ? > http://www.apc.com/products/family/index.cfm?id=70 > > Any suggestions > > Regards > Bujan > >> >> From amanthei at redhat.com Tue Jan 18 22:36:58 2005 From: amanthei at redhat.com (Adam Manthei) Date: Tue, 18 Jan 2005 16:36:58 -0600 Subject: [Linux-cluster] gfs probelm In-Reply-To: <001501c4f971$f294df80$69050364@yazanz> References: <001501c4f971$f294df80$69050364@yazanz> Message-ID: <20050118223658.GP1885@redhat.com> Hi... On Thu, Jan 13, 2005 at 03:15:26PM +0200, Yazan Al-Sheyyab wrote: > > > HI, > > I configure the gfs , and i mount the partitioned as gfs as mentioned in > the document, but when make a reboot, the system halted and stay ask > continuously : lock_glumd is it running. > > i didnt put the mounted gfs partitions in the /etc/fstab. ( is that true > ?) > > i made a shell and i put in it the following : > service ccsd stop > service lock_gulmd stop Why make a shell script if the initscripts are installed on the system? The easiest way to get GFS start on boot is to make sure that all 4 subsystems for GFS are started. They also need to be started in the correct order: 1. service pool start 2. service ccsd start 3. service lock_gulmd start 4. service gfs start To enable them automatically on the system, use chkconfig to turn them on: chkconfig pool --add chkconfig ccsd --add chkconfig lock_gulmd --add chkconfig gfs --add > and i execut it before i make a reboot, and when i loged again to the > system the two services are running by the system, that is ok , i know it is > not a solution , BUT in the second reboot i found that the system gives the > same continuous error question ( lock_gulmd is it running? ). you probably aren't running the lock_gulmd server. > how can i solve this ? > can i put the partitions in the /etc/fstab ? You can put GFS in /etc/fstab provided that lock_gulmd is running. If you don't want the system to automatically start them, simply add "noauto" to the parameters list in /etc/fstab. You might also run into problems with /etc/rc.d/rc.sysinit and /etc/rc.d/init.d/netfs trying to mount gfs. If so, add gfs to the exclusion list so that is looks like the following: [root at node root]# grep gfs /etc/rc.d/init.d/netfs action $"Mounting other filesystems: " mount -a -t nonfs,smbfs,ncpfs,gfs [root at node root]# grep gfs /etc/rc.d/rc.sysinit action $"Mounting local filesystems: " mount -a -t nonfs,smbfs,ncpfs,gfs -O no_netdev > OR WHAT ?????. I notice other on the list commenting on /etc/sysconfig/gfs. This file is not typically needed, but can be used to help limit what is autodetected on your system on startup. POOLS specifies the pools to try to load. If this parameter is blank, it the system will try to load all the pools that it can find CCS_ARCHIVE specifies the ccs archive to use on the system. If left blank, the system will try to load ccs for an archive it find on a pool. If it doesn't find one, or finds more than one, it will error out if this value is not set. -- Adam Manthei From amanthei at redhat.com Tue Jan 18 22:39:46 2005 From: amanthei at redhat.com (Adam Manthei) Date: Tue, 18 Jan 2005 16:39:46 -0600 Subject: [Linux-cluster] pool In-Reply-To: <000701c4fc7a$ec88bb50$69050364@yazanz> References: <000701c4fc7a$ec88bb50$69050364@yazanz> Message-ID: <20050118223946.GQ1885@redhat.com> On Mon, Jan 17, 2005 at 11:57:15AM +0200, Yazan Al-Sheyyab wrote: > hi, > > can i put partitions created fom LVM as > /dev/vg0/r1...../dev/vg0/r18 > > into pools ? and how? > is it the same procedure ? It is not recommended that you do this. However, if the device appears in /proc/partitions, you can put a pool label on it and assemble it. > how to put them in pool so that when i run > pool_tool -s it doesnot give to them. > > Thanks > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- Adam Manthei From daniel at osdl.org Tue Jan 18 23:10:20 2005 From: daniel at osdl.org (Daniel McNeil) Date: Tue, 18 Jan 2005 15:10:20 -0800 Subject: [Linux-cluster] cluster failed after 53 hours In-Reply-To: <20050118084830.GC12101@tykepenguin.com> References: <1106011893.15101.6.camel@ibm-c.pdx.osdl.net> <20050118084830.GC12101@tykepenguin.com> Message-ID: <1106089819.15101.10.camel@ibm-c.pdx.osdl.net> On Tue, 2005-01-18 at 00:48, Patrick Caulfield wrote: > On Mon, Jan 17, 2005 at 05:31:33PM -0800, Daniel McNeil wrote: > > My 3 node cluster ran tests for 53 hours before hitting a problem. > > > > > > Node cl031 hit the 1st problem CMAN: killed by STARTTRANS or > > NOMINATE. There is a DLM assert on cl031 also, but that is > > after a whole bunch of debug output. The full logs are > > here (http://developer.osdl.org/daniel/GFS/test.12jan2005/) > > > > Any ideas on what is going on? > > > > Here is simplified output (in the README file): > > test started Jan Wed 12 17:18 > > hung after Fri Jan 14 22:00 > > > > cl031 got an error in just under 53 hours. > > ========================================== > > Jan 14 22:00:38 cl031 kernel: CMAN: node cl031a has been removed from the cluster : No response to messages > > It's the usual thing. missing messages. > > patrick There is an DLM ASSERT farther down in log that show error = -105 which is ENOBUFS. Is this happening after the node has decided to leave the cluster? I just want to make sure a out of memory problem isn't causing the problem. Daniel From pcaulfie at redhat.com Wed Jan 19 08:50:08 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Wed, 19 Jan 2005 08:50:08 +0000 Subject: [Linux-cluster] cluster failed after 53 hours In-Reply-To: <1106089819.15101.10.camel@ibm-c.pdx.osdl.net> References: <1106011893.15101.6.camel@ibm-c.pdx.osdl.net> <20050118084830.GC12101@tykepenguin.com> <1106089819.15101.10.camel@ibm-c.pdx.osdl.net> Message-ID: <20050119085008.GD11569@tykepenguin.com> On Tue, Jan 18, 2005 at 03:10:20PM -0800, Daniel McNeil wrote: > > There is an DLM ASSERT farther down in log that show error = -105 > which is ENOBUFS. Is this happening after the node has decided > to leave the cluster? I just want to make sure a out of memory > problem isn't causing the problem. > Unfortunately it could be, or it may not be. :( lowcomms_get_buffer() can return NULL if either a) there is no memory to allocate a page, or b) the DLM has been shut down. If that happens, -ENOBUFS is the result. On balance I would suspect that b) is more likely in this situation. One oddity in that log is that the DLM took 10 minutes to shutdown after CMAN decided it had to leave the cluster - or did those 34980 lines have to go down a serial console? -- patrick From tboucher at ca.ibm.com Wed Jan 19 14:44:17 2005 From: tboucher at ca.ibm.com (Tony Boucher) Date: Wed, 19 Jan 2005 09:44:17 -0500 Subject: [Linux-cluster] IBM Blade center In-Reply-To: Message-ID: Does anyone know of a fencing module that works with IBM Blade center ? We want to be able to fence by rebooting the blade. There are some issues with fencing through the McData switch. The servers boot from SAN, so when the McData fence module logs in and disables the FC port. The whole node gets hosed. (The OS drive gets fenced off too) Thanks, Tony Boucher I/T Specialist (HACMP/GPFS/WLM) 2200 Walkley Rd Ottawa,ON K1G 5L2 tboucher at ca.ibm.com Cell 613-295-1674 Voice mail 613-247-5289 "Experience is a hard teacher because she gives the test first, the lesson afterwards." -- Unknown -------------- next part -------------- An HTML attachment was scrubbed... URL: From amanthei at redhat.com Wed Jan 19 15:04:47 2005 From: amanthei at redhat.com (Adam Manthei) Date: Wed, 19 Jan 2005 09:04:47 -0600 Subject: [Linux-cluster] IBM Blade center In-Reply-To: References: Message-ID: <20050119150447.GD27578@redhat.com> On Wed, Jan 19, 2005 at 09:44:17AM -0500, Tony Boucher wrote: > Does anyone know of a fencing module that works with IBM Blade center ? We > want to be able to fence by rebooting the blade. Use fence_bladecenter. This will require that you have telnet enabled on your management module (may require a firmware update) > There are some issues with fencing through the McData switch. The servers > boot from SAN, so when the McData fence module logs in and disables the FC > port. The whole node gets hosed. (The OS drive gets fenced off too) -- Adam Manthei From dmorgan at gmi-mr.com Wed Jan 19 18:00:24 2005 From: dmorgan at gmi-mr.com (Duncan Morgan) Date: Wed, 19 Jan 2005 10:00:24 -0800 Subject: [Linux-cluster] IBM Blade center In-Reply-To: <20050119150447.GD27578@redhat.com> Message-ID: <003e01c4fe50$c03a9500$6204570a@DMorganMobile> We use the Intel version of the Blade Center and wrote a custom fence script. It is quite easy to do. Duncan -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Adam Manthei Sent: Wednesday, January 19, 2005 7:05 AM To: linux clistering Subject: Re: [Linux-cluster] IBM Blade center On Wed, Jan 19, 2005 at 09:44:17AM -0500, Tony Boucher wrote: > Does anyone know of a fencing module that works with IBM Blade center ? We > want to be able to fence by rebooting the blade. Use fence_bladecenter. This will require that you have telnet enabled on your management module (may require a firmware update) > There are some issues with fencing through the McData switch. The servers > boot from SAN, so when the McData fence module logs in and disables the FC > port. The whole node gets hosed. (The OS drive gets fenced off too) -- Adam Manthei -- Linux-cluster mailing list Linux-cluster at redhat.com http://www.redhat.com/mailman/listinfo/linux-cluster !DSPAM:41ee771331271363136074! From daniel at osdl.org Wed Jan 19 18:47:57 2005 From: daniel at osdl.org (Daniel McNeil) Date: Wed, 19 Jan 2005 10:47:57 -0800 Subject: [Linux-cluster] cluster failed after 53 hours In-Reply-To: <20050119085008.GD11569@tykepenguin.com> References: <1106011893.15101.6.camel@ibm-c.pdx.osdl.net> <20050118084830.GC12101@tykepenguin.com> <1106089819.15101.10.camel@ibm-c.pdx.osdl.net> <20050119085008.GD11569@tykepenguin.com> Message-ID: <1106160476.3041.44.camel@ibm-c.pdx.osdl.net> On Wed, 2005-01-19 at 00:50, Patrick Caulfield wrote: > On Tue, Jan 18, 2005 at 03:10:20PM -0800, Daniel McNeil wrote: > > > > There is an DLM ASSERT farther down in log that show error = -105 > > which is ENOBUFS. Is this happening after the node has decided > > to leave the cluster? I just want to make sure a out of memory > > problem isn't causing the problem. > > > > Unfortunately it could be, or it may not be. :( > lowcomms_get_buffer() can return NULL if either a) there is no memory to > allocate a page, or b) the DLM has been shut down. If that happens, -ENOBUFS is > the result. On balance I would suspect that b) is more likely in this situation. > > One oddity in that log is that the DLM took 10 minutes to shutdown after CMAN > decided it had to leave the cluster - or did those 34980 lines have to go down a > serial console? Yup. Serial console. Daniel From mshk_00 at hotmail.com Thu Jan 20 10:44:03 2005 From: mshk_00 at hotmail.com (maria perez) Date: Thu, 20 Jan 2005 11:44:03 +0100 Subject: [Linux-cluster] How install gfs with dm and lvm2????? Message-ID: Hi, I am here again. I am trying install gfs from cvs with opendlm (something like that) on a system with red hat enterprise 3.0, maintaining the kernel 2.4.21.15.EL. I found in the page 'http://sources.redhat.com/cluster/gfs/' some instructions for it,. I started installing the device-mapper and applying the patch for this module to my kernel (following the instructions in the correspondig file INSTALL). But I have found some problems, when I try apply the patchs contained in the package device-mapper-... for the device-mapper and the VFS: the system said already exits the mayority of the archives and when I try build the kernel (once selected the option device mapper support ) gives me some errors: error: symbol '_kstrtab_vcalloc' is already defined' eroor: symbol '_ksymtab_vcalloc' is already defined' What happen?? Is this way the most correct or handy?? Someone could to guide me ? Does it exits any manual or recipe that can to help me?? Thanks, i am a bit lost... maria _________________________________________________________________ Moda para esta temporada. Ponte al d?a de todas las tendencias. http://www.msn.es/Mujer/moda/default.asp From teigland at redhat.com Thu Jan 20 10:57:58 2005 From: teigland at redhat.com (David Teigland) Date: Thu, 20 Jan 2005 18:57:58 +0800 Subject: [Linux-cluster] How install gfs with dm and lvm2????? In-Reply-To: References: Message-ID: <20050120105758.GG23386@redhat.com> On Thu, Jan 20, 2005 at 11:44:03AM +0100, maria perez wrote: > Hi, I am here again. > > I am trying install gfs from cvs with opendlm (something like that) on a > system with red hat enterprise 3.0, maintaining the kernel 2.4.21.15.EL. The code on the cvs head requires a 2.6.10 kernel. > I found in the page 'http://sources.redhat.com/cluster/gfs/' some > instructions for it,. > I started installing the device-mapper and applying the patch for this > module to my kernel (following the instructions in the correspondig file > INSTALL). Use these instructions: http://sources.redhat.com/cluster/doc/usage.txt (no device-mapper kernel patches are used with 2.6 kernels) -- Dave Teigland From info at einetmailer.com Wed Jan 19 09:41:46 2005 From: info at einetmailer.com (Financial Assistance) Date: Wed, 19 Jan 2005 03:41:46 -0600 Subject: [Linux-cluster] Buried Under Bills? Message-ID: <200501190938.j0J9cXw8024252@mx1.redhat.com> An HTML attachment was scrubbed... URL: From daniel at osdl.org Fri Jan 21 23:46:48 2005 From: daniel at osdl.org (Daniel McNeil) Date: Fri, 21 Jan 2005 15:46:48 -0800 Subject: [Linux-cluster] ccs_tool ld error on latest cvs Message-ID: <1106351208.14739.8.camel@ibm-c.pdx.osdl.net> I tried compile the latest cvs tree against 2.6.10 and hit this loader error. I'm compiling on redhat 9. Any ideas? make[2]: Entering directory `/Views/redhat-cluster/cluster/ccs/ccs_tool' gcc -Wall -I. -I../config -I../include -I../lib -Wall -O2 -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE `xml2-config --cflags` -DCCS_RELEASE_NAME=\"DEVEL.1106267706\" -I. -I../config -I../include -I../lib -o ccs_tool ccs_tool.c update.c upgrade.c old_parser.c -L../lib `xml2-config --libs` -L/Views/redhat-cluster/cluster/build/lib -lccs -lmagma -lmagmamsg -ldl /usr/lib/libmagma.so: undefined reference to `pthread_rwlock_rdlock' /usr/lib/libmagma.so: undefined reference to `pthread_rwlock_unlock' /usr/lib/libmagma.so: undefined reference to `pthread_rwlock_wrlock' collect2: ld returned 1 exit status make[2]: *** [ccs_tool] Error 1 Daniel From mmatus at dinha.acms.arizona.edu Fri Jan 21 23:54:42 2005 From: mmatus at dinha.acms.arizona.edu (Marcelo Matus) Date: Fri, 21 Jan 2005 16:54:42 -0700 Subject: [Linux-cluster] cluster failed after 53 hours In-Reply-To: <20050118140158.GH12101@tykepenguin.com> References: <1106011893.15101.6.camel@ibm-c.pdx.osdl.net> <20050118140158.GH12101@tykepenguin.com> Message-ID: <41F19642.1080907@acms.arizona.edu> We also have some crashes when writting very large files, 5GB or so, and it seems the problem occurs when we hit the GFS cache limit, where the machine memory is 4GB (Dual Opteron). Is there a way to tune the GFS cache to use less memory, let say a maximum 512MB, so we can debug the problem better? And it is either the remote GFS cache or GNBD, since we can write 8GB or larger files when GFS is mounted locally, ie, when we do the tests in the same machine that exports the GFS device, via GNBD, to the rest of the nodes. Marcelo Patrick Caulfield wrote: >On Mon, Jan 17, 2005 at 05:31:33PM -0800, Daniel McNeil wrote: > > >>My 3 node cluster ran tests for 53 hours before hitting a problem. >> >> > >Attached is a patch to set the CMAN process to run at realtime priority, I'm not >sure if that's the right thing to do or not to be honest. > >Neither am I sure whether your 48-53 hours is significant - it's possible that >memory may be an issue (only guessing but GFS caches locks like crazy, it may be >worth cutting this down a bit by tweaking > >/proc/cluster/lock_dlm/drop_count and/or >/proc/cluster/lock_dlm/drop_period > >otherwise, the only way were gpoing to get to the bottom of this is to enable >"DEBUG_MEMB" in cman and see what it thinks is going on when the node is kicked >out of the cluster. > > >patrick > > >------------------------------------------------------------------------ > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >http://www.redhat.com/mailman/listinfo/linux-cluster > From woytek+ at cmu.edu Sat Jan 22 02:57:19 2005 From: woytek+ at cmu.edu (Jonathan Woytek) Date: Fri, 21 Jan 2005 21:57:19 -0500 Subject: [Linux-cluster] OOM failures with GFS, NFS and Samba on a cluster with RHEL3-AS Message-ID: <41F1C10F.2080309@cmu.edu> I have been experiencing OOM failures (followed by reboots) on a cluster running Dell PowerEdge 1860's (dual-proc, 4GB RAM) with RHEL3-AS with all current updates. The system is configured as a two-member cluster, running GFS 6.0.2-25 (RH SRPM) and cluster services 1.2.16-1 (also RH SRPM). My original testing went fine with the cluster, including service fail-over and all that stuff (only one lock_gulmd, so if the master goes down, the world explodes--but I expected that). Use seemed to be okay, but there weren't a whole lot of users. Recently, a project wanted to serve some data from their space in GFS via their own machine. We mounted their space via NFS from the cluster, and they serve their data via samba from their machine. Shortly thereafter, two things happened: more people started to access the data, and the cluster machines started to crash. The symptoms are that free memory drops extremely quickly (sometimes more than 3GB disappears in less than two minutes). Load average usually goes up quickly (when I can see it). NFS processes are normally at the top of top, along with kswapd. At some point, around this time, the kernel starts to spit out OOM messages and it starts to kill bunches of processes. The machine eventually reboots itself and comes back up cleanly. Space of outages seems to be dependent on how many people are using the system, but I've also seen the machine go down when the backup system runs a few backups on the machine. One of the things I've noticed, though, is that the backup system doesn't actually cause the machine to crash if the system has been recently rebooted, and memory usage returns to normal after the backup is finished. Memory usage usually does NOT return to completely normal after the gigabytes of memory become used (when that happens, the machine will sit there and keep running for a while with only 20MB or less free, until something presumably tries to use that memory and the machine flips out). That is the only time I've seen the backup system cause the system to crash--after it has endured significant usage during the day and there are 20MB or less free. I'll usually get a call from the culprits telling me that they were copying either a) lots of files or b) large files to the cluster. Any ideas here? Anything I can look at to tune? jonathan From woytek+ at cmu.edu Sun Jan 23 18:45:28 2005 From: woytek+ at cmu.edu (Jonathan Woytek) Date: Sun, 23 Jan 2005 13:45:28 -0500 Subject: [Linux-cluster] OOM failures with GFS, NFS and Samba on a cluster with RHEL3-AS In-Reply-To: <41F1C10F.2080309@cmu.edu> References: <41F1C10F.2080309@cmu.edu> Message-ID: <41F3F0C8.7020906@cmu.edu> Additional information: I enabled full output on lock_gulmd, since my dead top sessions would often show that process near the top of the list around the time of crashes. The machine was rebooted around 10:50AM, and was down again at 12:44. In the span of less than a minute, the machine plowed through over 3GB of memory and crashed. The extra debugging information from lock_gulmd said nothing, except that there was a successful heartbeat. The OOM messages began at 12:44:01, and the machine was dead somewhere around 12:44:40. Nobody should be using the machine during this time. A cron job that was scheduled to fire off at 12:44 (it runs every two minutes to check memory usage, specifically to try to track this problem) did not run (or at least was not logged if it did). I took that job out of cron just to make sure that it isn't part of the problem. The low-memory-check that ran at 12:42 reported nothing, and my threshold for that is set at 512MB. The span between crashes this weekend has been between three and eight hours. Yesterday, the machine rebooted (looking at lastlog, not last message before restart in /var/log/messages, but I'll be looking at that in a bit) at 15:20 (after being up since 23:50 on Friday), 18:27, 21:43, onto sunday at 01:14, 04:33, and finally 12:48. Something seems quite wrong with this. jonathan Jonathan Woytek wrote: > I have been experiencing OOM failures (followed by reboots) on a cluster > running Dell PowerEdge 1860's (dual-proc, 4GB RAM) with RHEL3-AS with > all current updates. > > The system is configured as a two-member cluster, running GFS 6.0.2-25 > (RH SRPM) and cluster services 1.2.16-1 (also RH SRPM). My original > testing went fine with the cluster, including service fail-over and all > that stuff (only one lock_gulmd, so if the master goes down, the world > explodes--but I expected that). > > Use seemed to be okay, but there weren't a whole lot of users. Recently, > a project wanted to serve some data from their space in GFS via their > own machine. We mounted their space via NFS from the cluster, and they > serve their data via samba from their machine. Shortly thereafter, two > things happened: more people started to access the data, and the > cluster machines started to crash. The symptoms are that free memory > drops extremely quickly (sometimes more than 3GB disappears in less than > two minutes). Load average usually goes up quickly (when I can see > it). NFS processes are normally at the top of top, along with kswapd. > At some point, around this time, the kernel starts to spit out OOM > messages and it starts to kill bunches of processes. The machine > eventually reboots itself and comes back up cleanly. > > Space of outages seems to be dependent on how many people are using the > system, but I've also seen the machine go down when the backup system > runs a few backups on the machine. One of the things I've noticed, > though, is that the backup system doesn't actually cause the machine to > crash if the system has been recently rebooted, and memory usage returns > to normal after the backup is finished. Memory usage usually does NOT > return to completely normal after the gigabytes of memory become used > (when that happens, the machine will sit there and keep running for a > while with only 20MB or less free, until something presumably tries to > use that memory and the machine flips out). That is the only time I've > seen the backup system cause the system to crash--after it has endured > significant usage during the day and there are 20MB or less free. > > I'll usually get a call from the culprits telling me that they were > copying either a) lots of files or b) large files to the cluster. > > Any ideas here? Anything I can look at to tune? > > jonathan > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From woytek+ at cmu.edu Thu Jan 20 21:56:03 2005 From: woytek+ at cmu.edu (Jonathan Woytek) Date: Thu, 20 Jan 2005 16:56:03 -0500 Subject: [Linux-cluster] OOM issues with GFS, NFS, Samba on RHEL3-AS cluster Message-ID: <41F028F3.7020506@cmu.edu> Hello. I've tried to read-up on the lists here to see what I can find about these sorts of issues, but the information appears to be somewhat sparse. Here's my situation: I have a two-member cluster built on RHEL 3 AS (with all current updates installed). That means kernel 2.4.21-27.0.2.EL with GFS (6.0.2-25) and cluster services (1.2.16-1) built from SRPMS distributed by RedHat. My storage is iSCSI-based over gigabit ethernet. Hardware are Dell PowerEdge 1860's with 4GB of RAM and dual 2.4GHz processors. My problem is that the node serving disk via NFS and Samba gets into a strange mode where it starts to get kernel-based out-of-memory errors, which start to kill things off. The machine reboots itself and comes back up with no issues. In the process, of course, it wreaks havoc with lock_gulmd and a host of other things, and makes a bunch of users upset (it probably didn't help that we've been dealing with unstable storage here for a while, and I put this system together with the idea that it would be more reliable). I plan on trying to add a third node, which would fix the lock_gulmd craziness. That's not my big problem, though. I NEED to figure out why this is happening. My analysis so far seems to indicate that the crashes are caused mostly when there are a lot of files open (or at least a lot of disk activity). The failures seem to occur most often when people are accessing data (on GFS) from the server over an NFS mount to another machine, but they also seem to occur if the machine has seen a day's worth of that sort of usage and the backup system tries to get its nightly backup between 11PM and 2AM. When memory starts to get low, kswapd shows up and starts eating serious cycles, along with the nfsd's. I've tried increasing the number of nfsd's, but that didn't seem to have an effect. Any ideas on things I should be checking? Interestingly enough, no swap seems to be used when this happens. The load average normally creeps up right before death, and the machine gets down to less than 18MB free (though a lot the 4GB is tied up in cache). jonathan -- Jonathan Woytek w: 412-681-3463 woytek+ at cmu.edu NREC Computing Manager c: 412-401-1627 KB3HOZ PGP Key available upon request From pierre.filippone at retail-sc.com Fri Jan 21 11:21:59 2005 From: pierre.filippone at retail-sc.com (Pierre Filippone) Date: Fri, 21 Jan 2005 12:21:59 +0100 Subject: [Linux-cluster] Cluster aware software raid Message-ID: Hi, we are trying to use GFS in a FC environment on a two node cluster. Additionally to multipathing (performed by IBM's SDD) we want to mirror the data on two SAN storages via any kind of raid software. As far as I understood, there is currently no software available for linux (except Veritas VM) which is able to support this scenario. I read that CLVM will probably support cluster aware mirroring in the future. Are there any estimations, when it will be production ready ? Will it be released with RH ES 4 ? In some news groups I saw older postings discussing how to make md cluster aware. But, afaik, this also did not happen yet. Do you know any other project, which is near to finishing this feature ? Thanks for your help, Pierre Filippone RSC Commercial Services OHG Bleichstr. 8 40211 D?sseldorf From woytek+ at cmu.edu Sun Jan 23 23:12:18 2005 From: woytek+ at cmu.edu (Jonathan Woytek) Date: Sun, 23 Jan 2005 18:12:18 -0500 Subject: [Linux-cluster] OOM issues with GFS, NFS, Samba on RHEL3-AS cluster In-Reply-To: <41F028F3.7020506@cmu.edu> References: <41F028F3.7020506@cmu.edu> Message-ID: <41F42F52.3060502@cmu.edu> Sorry about the duplicate message--I had sent this when I had a mistake in my email address. When I fixed it, this message apparently went through to the list. jonathan Jonathan Woytek wrote: > Hello. I've tried to read-up on the lists here to see what I can find > about these sorts of issues, but the information appears to be somewhat > sparse. > > Here's my situation: I have a two-member cluster built on RHEL 3 AS > (with all current updates installed). That means kernel > 2.4.21-27.0.2.EL with GFS (6.0.2-25) and cluster services (1.2.16-1) > built from SRPMS distributed by RedHat. My storage is iSCSI-based over > gigabit ethernet. Hardware are Dell PowerEdge 1860's with 4GB of RAM > and dual 2.4GHz processors. > > My problem is that the node serving disk via NFS and Samba gets into a > strange mode where it starts to get kernel-based out-of-memory errors, > which start to kill things off. The machine reboots itself and comes > back up with no issues. In the process, of course, it wreaks havoc with > lock_gulmd and a host of other things, and makes a bunch of users upset > (it probably didn't help that we've been dealing with unstable storage > here for a while, and I put this system together with the idea that it > would be more reliable). > > I plan on trying to add a third node, which would fix the lock_gulmd > craziness. That's not my big problem, though. I NEED to figure out why > this is happening. My analysis so far seems to indicate that the > crashes are caused mostly when there are a lot of files open (or at > least a lot of disk activity). The failures seem to occur most often > when people are accessing data (on GFS) from the server over an NFS > mount to another machine, but they also seem to occur if the machine has > seen a day's worth of that sort of usage and the backup system tries to > get its nightly backup between 11PM and 2AM. When memory starts to get > low, kswapd shows up and starts eating serious cycles, along with the > nfsd's. I've tried increasing the number of nfsd's, but that didn't > seem to have an effect. > > Any ideas on things I should be checking? Interestingly enough, no swap > seems to be used when this happens. The load average normally creeps up > right before death, and the machine gets down to less than 18MB free > (though a lot the 4GB is tied up in cache). > > jonathan > -- > Jonathan Woytek w: 412-681-3463 woytek+ at cmu.edu > NREC Computing Manager c: 412-401-1627 KB3HOZ > PGP Key available upon request > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- Jonathan Woytek w: 412-681-3463 woytek+ at cmu.edu NREC Computing Manager c: 412-401-1627 KB3HOZ PGP Key available upon request From woytek+ at cmu.edu Mon Jan 24 04:27:52 2005 From: woytek+ at cmu.edu (Jonathan Woytek) Date: Sun, 23 Jan 2005 23:27:52 -0500 Subject: [Linux-cluster] OOM failures with GFS, NFS and Samba on a cluster with RHEL3-AS In-Reply-To: <41F3F0C8.7020906@cmu.edu> References: <41F1C10F.2080309@cmu.edu> <41F3F0C8.7020906@cmu.edu> Message-ID: <41F47948.6050501@cmu.edu> Even more additional information: I've been monitoring the system through a few crashes now, and it looks like what is actually running out of memory is "lowmem". The system seems to eat about 130-140kB every two seconds. It seems that the system is NOT actually plowing through 3GB+ of memory--highmem does not seem to drop. Whee fun. jonathan Jonathan Woytek wrote: > Additional information: > > I enabled full output on lock_gulmd, since my dead top sessions would > often show that process near the top of the list around the time of > crashes. The machine was rebooted around 10:50AM, and was down again at > 12:44. In the span of less than a minute, the machine plowed through > over 3GB of memory and crashed. The extra debugging information from > lock_gulmd said nothing, except that there was a successful heartbeat. > The OOM messages began at 12:44:01, and the machine was dead somewhere > around 12:44:40. Nobody should be using the machine during this time. A > cron job that was scheduled to fire off at 12:44 (it runs every two > minutes to check memory usage, specifically to try to track this > problem) did not run (or at least was not logged if it did). I took > that job out of cron just to make sure that it isn't part of the > problem. The low-memory-check that ran at 12:42 reported nothing, and > my threshold for that is set at 512MB. > > The span between crashes this weekend has been between three and eight > hours. Yesterday, the machine rebooted (looking at lastlog, not last > message before restart in /var/log/messages, but I'll be looking at that > in a bit) at 15:20 (after being up since 23:50 on Friday), 18:27, 21:43, > onto sunday at 01:14, 04:33, and finally 12:48. Something seems quite > wrong with this. > > jonathan > > > Jonathan Woytek wrote: > >> I have been experiencing OOM failures (followed by reboots) on a >> cluster running Dell PowerEdge 1860's (dual-proc, 4GB RAM) with >> RHEL3-AS with all current updates. >> >> The system is configured as a two-member cluster, running GFS 6.0.2-25 >> (RH SRPM) and cluster services 1.2.16-1 (also RH SRPM). My original >> testing went fine with the cluster, including service fail-over and >> all that stuff (only one lock_gulmd, so if the master goes down, the >> world explodes--but I expected that). >> >> Use seemed to be okay, but there weren't a whole lot of users. >> Recently, a project wanted to serve some data from their space in GFS >> via their own machine. We mounted their space via NFS from the >> cluster, and they serve their data via samba from their machine. >> Shortly thereafter, two things happened: more people started to >> access the data, and the cluster machines started to crash. The >> symptoms are that free memory drops extremely quickly (sometimes more >> than 3GB disappears in less than two minutes). Load average usually >> goes up quickly (when I can see it). NFS processes are normally at >> the top of top, along with kswapd. At some point, around this time, >> the kernel starts to spit out OOM messages and it starts to kill >> bunches of processes. The machine eventually reboots itself and comes >> back up cleanly. >> >> Space of outages seems to be dependent on how many people are using >> the system, but I've also seen the machine go down when the backup >> system runs a few backups on the machine. One of the things I've >> noticed, though, is that the backup system doesn't actually cause the >> machine to crash if the system has been recently rebooted, and memory >> usage returns to normal after the backup is finished. Memory usage >> usually does NOT return to completely normal after the gigabytes of >> memory become used (when that happens, the machine will sit there and >> keep running for a while with only 20MB or less free, until something >> presumably tries to use that memory and the machine flips out). That >> is the only time I've seen the backup system cause the system to >> crash--after it has endured significant usage during the day and there >> are 20MB or less free. >> >> I'll usually get a call from the culprits telling me that they were >> copying either a) lots of files or b) large files to the cluster. >> >> Any ideas here? Anything I can look at to tune? >> >> jonathan >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> http://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From jaime at iaa.es Mon Jan 24 09:18:49 2005 From: jaime at iaa.es (Jaime Perea) Date: Mon, 24 Jan 2005 10:18:49 +0100 Subject: [Linux-cluster] ccs_tool ld error on latest cvs In-Reply-To: <1106351208.14739.8.camel@ibm-c.pdx.osdl.net> References: <1106351208.14739.8.camel@ibm-c.pdx.osdl.net> Message-ID: <200501241018.49756.jaime@iaa.es> Hi everybody, My first posting! Perhaps doing LDFLAGS="-lpthread" make could work. -- Jaime D. Perea Duarte. Linux registered user #10472 Dep. Astrofisica Extragalactica. Instituto de Astrofisica de Andalucia (CSIC) Apdo. 3004, 18080 Granada, Spain. El S?bado, 22 de Enero de 2005 00:46, Daniel McNeil escribi?: > I tried compile the latest cvs tree against 2.6.10 and hit this > loader error. I'm compiling on redhat 9. > > Any ideas? > > make[2]: Entering directory `/Views/redhat-cluster/cluster/ccs/ccs_tool' > gcc -Wall -I. -I../config -I../include -I../lib -Wall -O2 > -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE `xml2-config --cflags` > -DCCS_RELEASE_NAME=\"DEVEL.1106267706\" -I. -I../config -I../include > -I../lib -o ccs_tool ccs_tool.c update.c upgrade.c old_parser.c -L../lib > `xml2-config --libs` -L/Views/redhat-cluster/cluster/build/lib -lccs > -lmagma -lmagmamsg -ldl /usr/lib/libmagma.so: undefined reference to > `pthread_rwlock_rdlock' /usr/lib/libmagma.so: undefined reference to > `pthread_rwlock_unlock' /usr/lib/libmagma.so: undefined reference to > `pthread_rwlock_wrlock' collect2: ld returned 1 exit status > make[2]: *** [ccs_tool] Error 1 > > Daniel > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From mtilstra at redhat.com Mon Jan 24 14:38:33 2005 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Mon, 24 Jan 2005 08:38:33 -0600 Subject: [Linux-cluster] OOM failures with GFS, NFS and Samba on a cluster with RHEL3-AS In-Reply-To: <41F3F0C8.7020906@cmu.edu> References: <41F1C10F.2080309@cmu.edu> <41F3F0C8.7020906@cmu.edu> Message-ID: <20050124143833.GA30145@redhat.com> On Sun, Jan 23, 2005 at 01:45:28PM -0500, Jonathan Woytek wrote: > Additional information: > > I enabled full output on lock_gulmd, since my dead top sessions would > often show that process near the top of the list around the time of > crashes. The machine was rebooted around 10:50AM, and was down again at Not suprising that lock_gulmd is working hard when gfs is under heavy use. Its it busy processing all those lock requests. What would be more useful from gulm for this than the logging messages, is to query the locktable every so often for its stats. `gulm_tool getstats :lt000` The 'locks = ###' line is how many lock structures are current held. gulm is very greedy about memory, and you are running the lock servers on the same nodes you're mounting from. also, just to see if I read the first post right, you have samba->nfs->gfs? -- Michael Conrad Tadpol Tilstra i'm trying to think, but nothing's happening... -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From woytek+ at cmu.edu Mon Jan 24 18:36:47 2005 From: woytek+ at cmu.edu (Jonathan Woytek) Date: Mon, 24 Jan 2005 13:36:47 -0500 Subject: [Linux-cluster] OOM failures with GFS, NFS and Samba on a cluster with RHEL3-AS In-Reply-To: <20050124143833.GA30145@redhat.com> References: <41F1C10F.2080309@cmu.edu> <41F3F0C8.7020906@cmu.edu> <20050124143833.GA30145@redhat.com> Message-ID: <41F5403F.4070207@cmu.edu> Michael Conrad Tadpol Tilstra wrote: > On Sun, Jan 23, 2005 at 01:45:28PM -0500, Jonathan Woytek wrote: > >>Additional information: >> >>I enabled full output on lock_gulmd, since my dead top sessions would >>often show that process near the top of the list around the time of >>crashes. The machine was rebooted around 10:50AM, and was down again at > > > Not suprising that lock_gulmd is working hard when gfs is under heavy > use. Its it busy processing all those lock requests. What would be > more useful from gulm for this than the logging messages, is to query > the locktable every so often for its stats. > `gulm_tool getstats :lt000` > The 'locks = ###' line is how many lock structures are current held. > gulm is very greedy about memory, and you are running the lock servers > on the same nodes you're mounting from. Here are the stats from the master lock_gulmd lt000: I_am = Master run time = 9436 pid = 2205 verbosity = Default id = 0 partitions = 1 out_queue = 0 drpb_queue = 0 locks = 20356 unlocked = 17651 exclusive = 15 shared = 2690 deferred = 0 lvbs = 17661 expired = 0 lock ops = 107354 conflicts = 0 incomming_queue = 0 conflict_queue = 0 reply_queue = 0 free_locks = 69644 free_lkrqs = 60 used_lkrqs = 0 free_holders = 109634 used_holders = 20366 highwater = 1048576 Something keeps eating away at lowmem, though, and I still can't figure out what exactly it is. > also, just to see if I read the first post right, you have > samba->nfs->gfs? If I understand your arrows correctly, I have a filesystem mounted with GFS that I'm sharing via NFS to another machine that is sharing it via Samba. I've closed that link, though, to try to eliminate that as a problem. So now I'm serving the GFS filesystem directly through Samba. jonathan -- Jonathan Woytek w: 412-681-3463 woytek+ at cmu.edu NREC Computing Manager c: 412-401-1627 KB3HOZ PGP Key available upon request From woytek+ at cmu.edu Mon Jan 24 18:43:29 2005 From: woytek+ at cmu.edu (Jonathan Woytek) Date: Mon, 24 Jan 2005 13:43:29 -0500 Subject: [Linux-cluster] OOM failures with GFS, NFS and Samba on a cluster with RHEL3-AS In-Reply-To: <41F5403F.4070207@cmu.edu> References: <41F1C10F.2080309@cmu.edu> <41F3F0C8.7020906@cmu.edu> <20050124143833.GA30145@redhat.com> <41F5403F.4070207@cmu.edu> Message-ID: <41F541D1.5050305@cmu.edu> /proc/meminfo: total: used: free: shared: buffers: cached: Mem: 4189741056 925650944 3264090112 0 18685952 76009472 Swap: 2146787328 0 2146787328 MemTotal: 4091544 kB MemFree: 3187588 kB MemShared: 0 kB Buffers: 18248 kB Cached: 74228 kB SwapCached: 0 kB Active: 107232 kB ActiveAnon: 50084 kB ActiveCache: 57148 kB Inact_dirty: 1892 kB Inact_laundry: 16276 kB Inact_clean: 16616 kB Inact_target: 28400 kB HighTotal: 3276544 kB HighFree: 3164096 kB LowTotal: 815000 kB LowFree: 23492 kB SwapTotal: 2096472 kB SwapFree: 2096472 kB Committed_AS: 72244 kB HugePages_Total: 0 HugePages_Free: 0 Hugepagesize: 2048 kB When a bunch of locks become free, lowmem seems to recover somewhat. However, shutting down lock_gulmd entirely does NOT return lowmem to what it probably should be (though I'm not sure if the system is just keeping all of that memory cached until something else needs it or not). jonathan Jonathan Woytek wrote: > Michael Conrad Tadpol Tilstra wrote: > >> On Sun, Jan 23, 2005 at 01:45:28PM -0500, Jonathan Woytek wrote: >> >>> Additional information: >>> >>> I enabled full output on lock_gulmd, since my dead top sessions would >>> often show that process near the top of the list around the time of >>> crashes. The machine was rebooted around 10:50AM, and was down again at >> >> >> >> Not suprising that lock_gulmd is working hard when gfs is under heavy >> use. Its it busy processing all those lock requests. What would be >> more useful from gulm for this than the logging messages, is to query >> the locktable every so often for its stats. >> `gulm_tool getstats :lt000` >> The 'locks = ###' line is how many lock structures are current held. >> gulm is very greedy about memory, and you are running the lock servers >> on the same nodes you're mounting from. > > > Here are the stats from the master lock_gulmd lt000: > > I_am = Master > run time = 9436 > pid = 2205 > verbosity = Default > id = 0 > partitions = 1 > out_queue = 0 > drpb_queue = 0 > locks = 20356 > unlocked = 17651 > exclusive = 15 > shared = 2690 > deferred = 0 > lvbs = 17661 > expired = 0 > lock ops = 107354 > conflicts = 0 > incomming_queue = 0 > conflict_queue = 0 > reply_queue = 0 > free_locks = 69644 > free_lkrqs = 60 > used_lkrqs = 0 > free_holders = 109634 > used_holders = 20366 > highwater = 1048576 > > > Something keeps eating away at lowmem, though, and I still can't figure > out what exactly it is. > > >> also, just to see if I read the first post right, you have >> samba->nfs->gfs? > > > If I understand your arrows correctly, I have a filesystem mounted with > GFS that I'm sharing via NFS to another machine that is sharing it via > Samba. I've closed that link, though, to try to eliminate that as a > problem. So now I'm serving the GFS filesystem directly through Samba. > > jonathan > -- Jonathan Woytek w: 412-681-3463 woytek+ at cmu.edu NREC Computing Manager c: 412-401-1627 KB3HOZ PGP Key available upon request From laza at yu.net Mon Jan 24 19:57:28 2005 From: laza at yu.net (Lazar Obradovic) Date: Mon, 24 Jan 2005 20:57:28 +0100 Subject: [Linux-cluster] multipath/gfs lockout under heavy write Message-ID: <1106596648.13534.79.camel@laza.eunet.yu> Hello, I'm not quite sure if the problem I'm experiencing is GFS or dm-multi/multipath issue, so I'm writing to both lists... sorry for that and please trim as soon as you realise who is it for. This is the scenario: I've created two-node cluster and mounted two LVs on each of them: /dev/vg/data on /mnt/data type gfs (rw) /dev/vg/syslog on /var/log/ng type gfs (rw) Each node is running 2.6.10 with udm2 patch set, GFS and LVM2 fetched from CVS on Jan, 19th and multipath-tools-0.4.1. Storage controller is HSV110, and has two paths from each server to it: # multipath -v2 create: 3600508b400013a6c00006000009c0000 [size=500 GB][features="0"][hwhandler="0"] \_ round-robin 0 [first] \_ 0:0:0:1 sda 8:0 [faulty] \_ 0:0:1:1 sdb 8:16 [ready ] \_ 0:0:2:1 sdc 8:32 [faulty] \_ 0:0:3:1 sdd 8:48 [ready ] I tried to copy 100Gb of large files (each of them is about 15Gb) to a /mnt/data through SSH connection from the third server to one of the clustered. Looking at switch statistics, I saw that traffic was indeed balanced over both FC links, but after copying almost 80Gb, without any reason or unusual event on SAN/storage side, /dev/vg/data reported: SCSI error : <0 0 1 1> return code = 0x20000 end_request: I/O error, dev sdb, sector 401320376 end_request: I/O error, dev sdb, sector 401320384 Device sda not ready. SCSI error : <0 0 3 1> return code = 0x20000 end_request: I/O error, dev sdd, sector 401321168 end_request: I/O error, dev sdd, sector 401321176 Buffer I/O error on device diapered_dm-2, logical block 37057899 lost page write due to I/O error on diapered_dm-2 Buffer I/O error on device diapered_dm-2, logical block 37057900 lost page write due to I/O error on diapered_dm-2 Buffer I/O error on device diapered_dm-2, logical block 37057901 lost page write due to I/O error on diapered_dm-2 Buffer I/O error on device diapered_dm-2, logical block 37057902 lost page write due to I/O error on diapered_dm-2 Buffer I/O error on device diapered_dm-2, logical block 37057903 lost page write due to I/O error on diapered_dm-2 Buffer I/O error on device diapered_dm-2, logical block 37057904 lost page write due to I/O error on diapered_dm-2 Buffer I/O error on device diapered_dm-2, logical block 37057905 lost page write due to I/O error on diapered_dm-2 Buffer I/O error on device diapered_dm-2, logical block 37057906 lost page write due to I/O error on diapered_dm-2 Buffer I/O error on device diapered_dm-2, logical block 37057907 lost page write due to I/O error on diapered_dm-2 Buffer I/O error on device diapered_dm-2, logical block 37057908 lost page write due to I/O error on diapered_dm-2 GFS: fsid=admin:data.0: fatal: I/O error GFS: fsid=admin:data.0: block = 37057898 GFS: fsid=admin:data.0: function = gfs_dwrite GFS: fsid=admin:data.0: file = /usr/src/cluster/gfs-kernel/src/gfs/dio.c, line = 651 GFS: fsid=admin:data.0: time = 1106582338 GFS: fsid=admin:data.0: about to withdraw from the cluster GFS: fsid=admin:data.0: waiting for outstanding I/O SCSI error : <0 0 1 1> return code = 0x20000 Device sdc not ready. GFS: fsid=admin:data.0: warning: assertion "!buffer_busy(bh)" failed GFS: fsid=admin:data.0: function = gfs_logbh_uninit GFS: fsid=admin:data.0: file = /usr/src/cluster/gfs-kernel/src/gfs/dio.c, line = 930 GFS: fsid=admin:data.0: time = 1106582351 printk: 54 messages suppressed. Buffer I/O error on device diapered_dm-2, logical block 36272387 lost page write due to I/O error on diapered_dm-2 Buffer I/O error on device diapered_dm-2, logical block 37024703 lost page write due to I/O error on diapered_dm-2 GFS: fsid=admin:data.0: telling LM to withdraw lock_dlm: withdraw abandoned memory GFS: fsid=admin:data.0: withdrawn printk: 12 messages suppressed. Buffer I/O error on device diapered_dm-2, logical block 37005453 lost page write due to I/O error on diapered_dm-2 printk: 1036 messages suppressed. Buffer I/O error on device diapered_dm-2, logical block 37006489 lost page write due to I/O error on diapered_dm-2 printk: 1035 messages suppressed. Buffer I/O error on device diapered_dm-2, logical block 37007525 lost page write due to I/O error on diapered_dm-2 while /dev/vg/syslog continued to work as usual (dd-ing /dev/zero to some file worked like a charm). After that error, SCP died, and I couldn't umount nor remount that filesystem. Fenced didn't triggered so I had to reboot the machine in order to make it work again (and I'm using fence_ibmblade which works on another cluster I have). Since both LVs are a part of same VG (and, thus, are using the same physical device seen over multipath), I'd guess the problem is somewhere inside GFS, but the things that keep confusing me are: - those SCSI errors that look like multipath errors - name 'diapered_dm-2' which I never saw before - fenced not fencing obviously faulty node What else do you need to debug this issue? Once again, sorry for the cross-post... -- Lazar Obradovic YUnet International, NOC From woytek+ at cmu.edu Mon Jan 24 21:37:54 2005 From: woytek+ at cmu.edu (Jonathan Woytek) Date: Mon, 24 Jan 2005 16:37:54 -0500 Subject: [Linux-cluster] OOM failures with GFS, NFS and Samba on a cluster with RHEL3-AS In-Reply-To: <41F541D1.5050305@cmu.edu> References: <41F1C10F.2080309@cmu.edu> <41F3F0C8.7020906@cmu.edu> <20050124143833.GA30145@redhat.com> <41F5403F.4070207@cmu.edu> <41F541D1.5050305@cmu.edu> Message-ID: <41F56AB2.90906@cmu.edu> Yet more and more info: Jan 24 16:17:00 quicksilver kernel: Mem-info: Jan 24 16:17:00 quicksilver kernel: Zone:DMA freepages: 2835 min: 0 low: 0 high: 0 Jan 24 16:17:00 quicksilver kernel: Zone:Normal freepages: 1034 min: 1279 low: 4544 high: 6304 Jan 24 16:17:00 quicksilver kernel: Zone:HighMem freepages:759901 min: 255 low: 15872 high: 23808 Jan 24 16:17:00 quicksilver kernel: Free pages: 763768 (759901 HighMem) Jan 24 16:17:00 quicksilver kernel: ( Active: 22610/25584, inactive_laundry: 3922, inactive_clean: 3890, free: 763768 ) Jan 24 16:17:00 quicksilver kernel: aa:0 ac:0 id:0 il:0 ic:0 fr:2835 Jan 24 16:17:00 quicksilver kernel: aa:0 ac:27 id:0 il:115 ic:0 fr:1026 Jan 24 16:17:00 quicksilver kernel: aa:12742 ac:9847 id:25584 il:3807 ic:3890 fr:759901 Jan 24 16:17:00 quicksilver kernel: 1*4kB 1*8kB 0*16kB 0*32kB 1*64kB 0*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11340kB) Jan 24 16:17:00 quicksilver kernel: 272*4kB 19*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 3784kB) Jan 24 16:17:01 quicksilver kernel: 43*4kB 17*8kB 2*16kB 7*32kB 1*64kB 78*128kB 138*256kB 89*512kB 83*1024kB 32*2048kB 683*4096kB = 3039604kB) Jan 24 16:17:01 quicksilver kernel: Swap cache: add 0, delete 0, find 0/0, race 0+0 Jan 24 16:17:01 quicksilver kernel: 197629 pages of slabcache Jan 24 16:17:01 quicksilver kernel: 328 pages of kernel stacks Jan 24 16:17:01 quicksilver kernel: 0 lowmem pagetables, 529 highmem pagetables Jan 24 16:17:01 quicksilver kernel: Free swap: 2096472kB Jan 24 16:17:01 quicksilver kernel: 1245184 pages of RAM Jan 24 16:17:01 quicksilver kernel: 819136 pages of HIGHMEM Jan 24 16:17:01 quicksilver kernel: 222298 reserved pages Jan 24 16:17:01 quicksilver kernel: 38487 pages shared Jan 24 16:17:01 quicksilver kernel: 0 pages swap cached Jan 24 16:17:01 quicksilver kernel: Out of Memory: Killed process 2441 (sendmail). Jan 24 16:17:01 quicksilver kernel: Out of Memory: Killed process 2441 (sendmail). Jan 24 16:17:01 quicksilver kernel: Fixed up OOM kill of mm-less task The machine reports OOM kills for about 15-30 seconds before clumembd gets killed and the machine reboots. The OOM kills usually begin at the top of the minute, though that probably doesn't have anything to do with anything except coincidence. jonathan Jonathan Woytek wrote: > /proc/meminfo: > total: used: free: shared: buffers: cached: > Mem: 4189741056 925650944 3264090112 0 18685952 76009472 > Swap: 2146787328 0 2146787328 > MemTotal: 4091544 kB > MemFree: 3187588 kB > MemShared: 0 kB > Buffers: 18248 kB > Cached: 74228 kB > SwapCached: 0 kB > Active: 107232 kB > ActiveAnon: 50084 kB > ActiveCache: 57148 kB > Inact_dirty: 1892 kB > Inact_laundry: 16276 kB > Inact_clean: 16616 kB > Inact_target: 28400 kB > HighTotal: 3276544 kB > HighFree: 3164096 kB > LowTotal: 815000 kB > LowFree: 23492 kB > SwapTotal: 2096472 kB > SwapFree: 2096472 kB > Committed_AS: 72244 kB > HugePages_Total: 0 > HugePages_Free: 0 > Hugepagesize: 2048 kB > > When a bunch of locks become free, lowmem seems to recover somewhat. > However, shutting down lock_gulmd entirely does NOT return lowmem to > what it probably should be (though I'm not sure if the system is just > keeping all of that memory cached until something else needs it or not). > > jonathan > > Jonathan Woytek wrote: > >> Michael Conrad Tadpol Tilstra wrote: >> >>> On Sun, Jan 23, 2005 at 01:45:28PM -0500, Jonathan Woytek wrote: >>> >>>> Additional information: >>>> >>>> I enabled full output on lock_gulmd, since my dead top sessions >>>> would often show that process near the top of the list around the >>>> time of crashes. The machine was rebooted around 10:50AM, and was >>>> down again at >>> >>> >>> >>> >>> Not suprising that lock_gulmd is working hard when gfs is under heavy >>> use. Its it busy processing all those lock requests. What would be >>> more useful from gulm for this than the logging messages, is to query >>> the locktable every so often for its stats. >>> `gulm_tool getstats :lt000` >>> The 'locks = ###' line is how many lock structures are current held. >>> gulm is very greedy about memory, and you are running the lock servers >>> on the same nodes you're mounting from. >> >> >> >> Here are the stats from the master lock_gulmd lt000: >> >> I_am = Master >> run time = 9436 >> pid = 2205 >> verbosity = Default >> id = 0 >> partitions = 1 >> out_queue = 0 >> drpb_queue = 0 >> locks = 20356 >> unlocked = 17651 >> exclusive = 15 >> shared = 2690 >> deferred = 0 >> lvbs = 17661 >> expired = 0 >> lock ops = 107354 >> conflicts = 0 >> incomming_queue = 0 >> conflict_queue = 0 >> reply_queue = 0 >> free_locks = 69644 >> free_lkrqs = 60 >> used_lkrqs = 0 >> free_holders = 109634 >> used_holders = 20366 >> highwater = 1048576 >> >> >> Something keeps eating away at lowmem, though, and I still can't >> figure out what exactly it is. >> >> >>> also, just to see if I read the first post right, you have >>> samba->nfs->gfs? >> >> >> >> If I understand your arrows correctly, I have a filesystem mounted >> with GFS that I'm sharing via NFS to another machine that is sharing >> it via Samba. I've closed that link, though, to try to eliminate that >> as a problem. So now I'm serving the GFS filesystem directly through >> Samba. >> >> jonathan >> > -- Jonathan Woytek w: 412-681-3463 woytek+ at cmu.edu NREC Computing Manager c: 412-401-1627 KB3HOZ PGP Key available upon request From teigland at redhat.com Tue Jan 25 04:06:24 2005 From: teigland at redhat.com (David Teigland) Date: Tue, 25 Jan 2005 12:06:24 +0800 Subject: [Linux-cluster] multipath/gfs lockout under heavy write In-Reply-To: <1106596648.13534.79.camel@laza.eunet.yu> References: <1106596648.13534.79.camel@laza.eunet.yu> Message-ID: <20050125040624.GB5786@redhat.com> On Mon, Jan 24, 2005 at 08:57:28PM +0100, Lazar Obradovic wrote: > Since both LVs are a part of same VG (and, thus, are using the same > physical device seen over multipath), I'd guess the problem is somewhere > inside GFS, but the things that keep confusing me are: > > - those SCSI errors that look like multipath errors The SCSI errors appear to be the root problem, not GFS. I don't know what multipath might have to do with it. > - name 'diapered_dm-2' which I never saw before In the past, GFS would immediately panic the machine when it saw i/o errors. Now it tries to shut down the bad fs instead. After this happens you should be able to unmount the offending fs, leave the cluster and reboot the machine cleanly. > - fenced not fencing obviously faulty node In your situation, the node is running fine wrt the cluster so there's no need to fence it. GFS is just shutting down a faulty fs (doing this is not always very "clean" and can produce a lot of errors/warnings on the console.) Perhaps we could reinstate an option to have gfs panic immediately when it sees i/o errors instead of trying to shut down the problem fs. In this case, the panicked node would be "dead" and it would be fenced. -- Dave Teigland From mmatus at dinha.acms.arizona.edu Tue Jan 25 08:41:54 2005 From: mmatus at dinha.acms.arizona.edu (Marcelo Matus) Date: Tue, 25 Jan 2005 01:41:54 -0700 Subject: [Linux-cluster] multipath/gfs lockout under heavy write In-Reply-To: <20050125040624.GB5786@redhat.com> References: <1106596648.13534.79.camel@laza.eunet.yu> <20050125040624.GB5786@redhat.com> Message-ID: <41F60652.7000908@acms.arizona.edu> David Teigland wrote: >On Mon, Jan 24, 2005 at 08:57:28PM +0100, Lazar Obradovic wrote: > > > >>Since both LVs are a part of same VG (and, thus, are using the same >>physical device seen over multipath), I'd guess the problem is somewhere >>inside GFS, but the things that keep confusing me are: >> >>- those SCSI errors that look like multipath errors >> >> > >The SCSI errors appear to be the root problem, not GFS. I don't know what >multipath might have to do with it. > > > >>- name 'diapered_dm-2' which I never saw before >> >> > >In the past, GFS would immediately panic the machine when it saw i/o >errors. Now it tries to shut down the bad fs instead. After this happens >you should be able to unmount the offending fs, leave the cluster and >reboot the machine cleanly. > > I have a question about your last comment. We did the following experiment with GFS 6.0.2: 1.- Setup a cluster using a unique GFS server and gnbd device (lock_gulm master and gnbd_export in the same node). 2.- Fence out a node manually using fence_gnbd. then we observed two cases: 1.- If the fenced machine is not mounting the GFS/gnbd fs, but only importing it, then we can properly either reboot or restart the GFS services with no problem. 2.- If the fenced machine is mounting the GFS/gnbd fs, but with no process using it, almost everything produces a kernel panic, even just unmounting the unused fs. In fact the only thing that works, besides pushing the reset button, is 'reboot -f', which is almost the same. So, when you say "In the past", do you refer to GFS 6.0.2 ? > > >>- fenced not fencing obviously faulty node >> >> > >In your situation, the node is running fine wrt the cluster so there's no >need to fence it. GFS is just shutting down a faulty fs (doing this is >not always very "clean" and can produce a lot of errors/warnings on the >console.) > >Perhaps we could reinstate an option to have gfs panic immediately when it >sees i/o errors instead of trying to shut down the problem fs. In this >case, the panicked node would be "dead" and it would be fenced. > > > From teigland at redhat.com Tue Jan 25 09:12:06 2005 From: teigland at redhat.com (David Teigland) Date: Tue, 25 Jan 2005 17:12:06 +0800 Subject: [Linux-cluster] multipath/gfs lockout under heavy write In-Reply-To: <41F60652.7000908@acms.arizona.edu> References: <1106596648.13534.79.camel@laza.eunet.yu> <20050125040624.GB5786@redhat.com> <41F60652.7000908@acms.arizona.edu> Message-ID: <20050125091206.GE5786@redhat.com> On Tue, Jan 25, 2005 at 01:41:54AM -0700, Marcelo Matus wrote: > >In the past, GFS would immediately panic the machine when it saw i/o > >errors. Now it tries to shut down the bad fs instead. After this happens > >you should be able to unmount the offending fs, leave the cluster and > >reboot the machine cleanly. > > I have a question about your last comment. We did the following > experiment with GFS 6.0.2: > > 1.- Setup a cluster using a unique GFS server and gnbd device (lock_gulm > master and gnbd_export in the same node). > > 2.- Fence out a node manually using fence_gnbd. > > then we observed two cases: > > 1.- If the fenced machine is not mounting the GFS/gnbd fs, but only > importing it, then we can properly either reboot or restart the GFS > services with no problem. > > 2.- If the fenced machine is mounting the GFS/gnbd fs, but with no > process using it, almost everything produces a kernel panic, even just > unmounting the unused fs. In fact the only thing that works, besides > pushing the reset button, is 'reboot -f', which is almost the same. > > So, when you say "In the past", do you refer to GFS 6.0.2 ? I was actually referring to the code Lazar is using which is the next, as yet unreleased, version of GFS from the public cvs. Your situation could be explained similarly, like this: - running fence_gnbd causes the node to get i/o errors if it tries to use gnbd - if the node has GFS mounted, GFS will try to use gnbd - when GFS 6.0.2 sees i/o errors it will panic If you don't have GFS mounted, the last two steps don't exist and there's no panic. -- Dave Teigland From mmatus at dinha.acms.arizona.edu Tue Jan 25 09:20:50 2005 From: mmatus at dinha.acms.arizona.edu (Marcelo Matus) Date: Tue, 25 Jan 2005 02:20:50 -0700 Subject: [Linux-cluster] multipath/gfs lockout under heavy write In-Reply-To: <20050125091206.GE5786@redhat.com> References: <1106596648.13534.79.camel@laza.eunet.yu> <20050125040624.GB5786@redhat.com> <41F60652.7000908@acms.arizona.edu> <20050125091206.GE5786@redhat.com> Message-ID: <41F60F72.6080900@acms.arizona.edu> David Teigland wrote: >On Tue, Jan 25, 2005 at 01:41:54AM -0700, Marcelo Matus wrote: > > > >>>In the past, GFS would immediately panic the machine when it saw i/o >>>errors. Now it tries to shut down the bad fs instead. After this happens >>>you should be able to unmount the offending fs, leave the cluster and >>>reboot the machine cleanly. >>> >>> >>I have a question about your last comment. We did the following >>experiment with GFS 6.0.2: >> >>1.- Setup a cluster using a unique GFS server and gnbd device (lock_gulm >>master and gnbd_export in the same node). >> >>2.- Fence out a node manually using fence_gnbd. >> >>then we observed two cases: >> >>1.- If the fenced machine is not mounting the GFS/gnbd fs, but only >>importing it, then we can properly either reboot or restart the GFS >>services with no problem. >> >>2.- If the fenced machine is mounting the GFS/gnbd fs, but with no >>process using it, almost everything produces a kernel panic, even just >>unmounting the unused fs. In fact the only thing that works, besides >>pushing the reset button, is 'reboot -f', which is almost the same. >> >>So, when you say "In the past", do you refer to GFS 6.0.2 ? >> >> > >I was actually referring to the code Lazar is using which is the next, as >yet unreleased, version of GFS from the public cvs. Your situation could >be explained similarly, like this: > >- running fence_gnbd causes the node to get i/o errors if it tries to use > gnbd > >- if the node has GFS mounted, GFS will try to use gnbd > >- when GFS 6.0.2 sees i/o errors it will panic > >If you don't have GFS mounted, the last two steps don't exist and there's >no panic. > > > Thanks, that clarify to us that we don't have any error in our configuration :). Then, the question is: the new behaviour as you described, will be only present in the CVS version (kernel 2.6) or it will be also back ported to the current GFS 6.0.2 version (kernel 2.4) ? Marcelo From mshk_00 at hotmail.com Tue Jan 25 10:18:37 2005 From: mshk_00 at hotmail.com (maria perez) Date: Tue, 25 Jan 2005 11:18:37 +0100 Subject: [Linux-cluster] GFS and HEARTBEAT Message-ID: Hi, I have a doubt (among many). How can I stablish heartbear with GFS??? I have two nodes connected through ethernet, both nodes are servers lock_gulmd (I have installed GFS 6.0.0-7.1 over my kernel 2.4. 21-15.0.4.EL- I am using Red Hat Enterprise v.3). In the file CLUSTER.CCS I have defined three nodes (the third never take part in the cluster), the three nodes too are defined in the file NODES.CCS, the method of fencing that I have defined is MANUAL. I would like to know how I can install heartbeat in my system: Has GFS any mechanism that permit run heartbeat?? Why GFS can add parameters for heartbeat in the file CLUSTER.CCS?? What relation have this parameter with the process heartbeat?? I have to download and install heartbeat from .. http://linux-ha.org/download ??????????' or GFS incorporates any ??? Thanks for all, maria. _________________________________________________________________ Descarga gratis la Barra de Herramientas de MSN http://www.msn.es/usuario/busqueda/barra?XAPID=2031&DI=1055&SU=http%3A//www.hotmail.com&HL=LINKTAG1OPENINGTEXT_MSNBH From nigel.jewell at pixexcel.co.uk Tue Jan 25 10:43:36 2005 From: nigel.jewell at pixexcel.co.uk (Nigel Jewell) Date: Tue, 25 Jan 2005 10:43:36 +0000 Subject: [Linux-cluster] GNBD & Network Outage Message-ID: <41F622D8.1060102@pixexcel.co.uk> Dear all, We've been looking at the issues of using GNBD to provide access to a block device on a secondary installation and we've hit a brick wall. I was wondering if anyone had seen the same behaviour On host "A" we do: gnbd_export -d /dev/sda2 -e foo -c On host "B" we do: gnbd_import -i A ... and as you would expect /dev/gnbd/foo appears on B and is usable. We have no other aspects of GFS in use. Now - in order for this to be useful, we've been testing the effects of using GNBD if there is a LAN outage. If we write a big file to a mounted file system on B:/dev/gnbd/foo and pull out the LAN cable halfway through the data being synced to A, host B never gives up trying to contact A. In fact, if you plug in the cable 10 minutes later the sync recovers. Now - on the surface - this doesn't seem like a big problem, but it is when you try and use the imported device alongside software RAID or when you want to do something "normal" like reboot the box. Rebooting just stops when it trys to unmount the file systems. We want to use B:/dev/gndb/foo alongside a local partition on B and create a RAID-1 using mdadm. In the same scenario (where the LAN cable is pulled), the md device on B completely stops all of the IO on the machine because (presumably) the md software is trying to write to the gnbd device ... which is forever trying to contact host A ... and of course never gives up. It would be nice if it did give up and the md software continued the md device in degraded mode. So the question is this (got there in the end). Can anyone suggest a solution and/or alternative/workaround? Is it possible to specify a time-out for the GNBD import/export for when the LAN does die? Any ideas? Regards, -- Nige. PixExcel Limited URL: http://www.pixexcel.co.uk MSN: nigel.jewell at pixexcel.co.uk From mmonge at gmail.com Tue Jan 25 10:56:07 2005 From: mmonge at gmail.com (Marcos Monge) Date: Tue, 25 Jan 2005 11:56:07 +0100 Subject: [Linux-cluster] RH Cluster Suite without raw shared disk Message-ID: <48e9aa4e05012502564c869da2@mail.gmail.com> Hi There is anyway to create a cluster without the two shared partitions in shared disk? It's possbile, for example, use a NFS Filer (netapp) as the shared disk in some way? In my case, I want to do a cluster of 2 nodes, with a nfs filer sharing the aplication data, and also, if possible, the cluster status information, without using a SCSI/HBA shared system. Thanks in advance Marcos From fajar at telkom.co.id Tue Jan 25 11:35:54 2005 From: fajar at telkom.co.id (Fajar A. Nugraha) Date: Tue, 25 Jan 2005 18:35:54 +0700 Subject: [Linux-cluster] RH Cluster Suite without raw shared disk In-Reply-To: <48e9aa4e05012502564c869da2@mail.gmail.com> References: <48e9aa4e05012502564c869da2@mail.gmail.com> Message-ID: <41F62F1A.8090606@telkom.co.id> Marcos Monge wrote: >Hi > >There is anyway to create a cluster without the two shared partitions >in shared disk? > > > If you use http://sources.redhat.com/cluster/, then the answer is yes. I have tested it. The shared device is located on gnbd device exported by another machine. I'm using it as an alternative for NFS. I can even run Xen (http://www.cl.cam.ac.uk/Research/SRG/netos/xen/) domains on it, and have live migration feature. >It's possbile, for example, use a NFS Filer (netapp) as the shared >disk in some way? > > > Sorry, haven't tried it yet. Regards, Fajar From mtilstra at redhat.com Tue Jan 25 14:36:39 2005 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Tue, 25 Jan 2005 08:36:39 -0600 Subject: [Linux-cluster] RH Cluster Suite without raw shared disk In-Reply-To: <48e9aa4e05012502564c869da2@mail.gmail.com> References: <48e9aa4e05012502564c869da2@mail.gmail.com> Message-ID: <20050125143639.GA16148@redhat.com> On Tue, Jan 25, 2005 at 11:56:07AM +0100, Marcos Monge wrote: > It's possbile, for example, use a NFS Filer (netapp) as the shared > disk in some way? If you set it up to export iscsi devices, you can put gfs onto that. -- Michael Conrad Tadpol Tilstra I know that I know just enough to know how much more there is to know. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From bmarzins at redhat.com Tue Jan 25 20:55:38 2005 From: bmarzins at redhat.com (Benjamin Marzinski) Date: Tue, 25 Jan 2005 14:55:38 -0600 Subject: [Linux-cluster] GNBD & Network Outage In-Reply-To: <41F622D8.1060102@pixexcel.co.uk> References: <41F622D8.1060102@pixexcel.co.uk> Message-ID: <20050125205538.GD13289@phlogiston.msp.redhat.com> On Tue, Jan 25, 2005 at 10:43:36AM +0000, Nigel Jewell wrote: > Dear all, > > We've been looking at the issues of using GNBD to provide access to a > block device on a secondary installation and we've hit a brick wall. I > was wondering if anyone had seen the same behaviour > > On host "A" we do: > > gnbd_export -d /dev/sda2 -e foo -c > > On host "B" we do: > > gnbd_import -i A > > ... and as you would expect /dev/gnbd/foo appears on B and is usable. > > We have no other aspects of GFS in use. > > Now - in order for this to be useful, we've been testing the effects of > using GNBD if there is a LAN outage. If we write a big file to a > mounted file system on B:/dev/gnbd/foo and pull out the LAN cable > halfway through the data being synced to A, host B never gives up trying > to contact A. In fact, if you plug in the cable 10 minutes later the > sync recovers. > > Now - on the surface - this doesn't seem like a big problem, but it is > when you try and use the imported device alongside software RAID or when > you want to do something "normal" like reboot the box. Rebooting just > stops when it trys to unmount the file systems. > > We want to use B:/dev/gndb/foo alongside a local partition on B and > create a RAID-1 using mdadm. In the same scenario (where the LAN cable > is pulled), the md device on B completely stops all of the IO on the > machine because (presumably) the md software is trying to write to the > gnbd device ... which is forever trying to contact host A ... and of > course never gives up. It would be nice if it did give up and the md > software continued the md device in degraded mode. > > So the question is this (got there in the end). Can anyone suggest a > solution and/or alternative/workaround? Is it possible to specify a > time-out for the GNBD import/export for when the LAN does die? Sure. You see the -c in you export line. Don't put it there. That puts the device in (the very poorly named) uncached mode. This does two things. One: It causes the server to use direct IO to write to the exported device, so your read performance will take a hit. Two: It will time out after a period (default to 10 sec). After gnbd times out, it must be able to fence the server before it will let the requests fail. This is so that you know that the server isn't simply stalled and might write out the requests later (if gnbd failed out, and the requests were rerouted to the backend storage over another gnbd server, if the first server wrote it's requests out later, it could cause data corruption). This means that to run in uncached mode, you need to have a cluster manager and fencing devices, which I'm not certain that you have. I've got some questions about your setup. Will this be part of a clustered filesystem setup? If it will, I see some problems with your mirror. When other nodes (including the gnbd server node A) write to the exported device, these writes will not appear on the local partion of B. So won't your mirror get out of sync? If only B will write to the exported device, (and that's the only way I see this working) you can probably get by with nbd, which simply fails out if it loses connection. There is a cluster mirror project in the works. When that is done, you would be able to have node B gnbd export it's local partition, and then run a mirror on top of the device exported from A and the device exported from B, which all nodes could access and would stay in sync. But this project isn't finished yet. -Ben > Any ideas? > > Regards, > > -- > Nige. > > PixExcel Limited > URL: http://www.pixexcel.co.uk > MSN: nigel.jewell at pixexcel.co.uk > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From rryll at yahoo.com Tue Jan 25 23:33:52 2005 From: rryll at yahoo.com (Darryll Napolis) Date: Tue, 25 Jan 2005 15:33:52 -0800 (PST) Subject: [Linux-cluster] ga1 clusvcmgrd[27878]: readServiceBlock: Service number mismatch 4, 6. Message-ID: <20050125233353.83434.qmail@web51906.mail.yahoo.com> Using RHEL3 cluster suite, i've been getting the messages listed below in my /var/log/messages: ga1 clusvcmgrd[27878]: readServiceBlock: Service number mismatch 4, 6. anybody knows what does that mean? Any inputs are greatly appreciated. Thanks __________________________________ Do you Yahoo!? Yahoo! Mail - You care about security. So do we. http://promotions.yahoo.com/new_mail From jaime at iaa.es Wed Jan 26 16:27:01 2005 From: jaime at iaa.es (Jaime Perea) Date: Wed, 26 Jan 2005 17:27:01 +0100 Subject: [Linux-cluster] kernel versions and gfs In-Reply-To: <20050125233353.83434.qmail@web51906.mail.yahoo.com> References: <20050125233353.83434.qmail@web51906.mail.yahoo.com> Message-ID: <200501261727.01544.jaime@iaa.es> Hi everybody I have a strange problem. I need to work with some software that refuses to work under a version of the kernel > 2.6.8. On the other hand I would like to install gfs and all the other related stuff. I used the last version from the cvs and it compiles well under kernel 2.6.10 but I cannot compile it under the 2.6.8 kernel. Something like /home/jaime/clu/cluster-2.6.8.1/gfs-kernel/src/gfs/ops_file.c:1670: error: unknown field `flock' specified in initializer /home/jaime/clu/cluster-2.6.8.1/gfs-kernel/src/gfs/ops_file.c:1670: warning: initialization from incompatible pointer type make[5]: *** [/home/jaime/clu/cluster-2.6.8.1/gfs-kernel/src/gfs/ops_file.o] Error 1 make[4]: *** [_module_/home/jaime/clu/cluster-2.6.8.1/gfs-kernel/src/gfs] Error 2 Do I need a specific version of gfs for the 2.6.8 version of the kernel? Thanks -- Jaime D. Perea Duarte. Linux registered user #10472 Dep. Astrofisica Extragalactica. Instituto de Astrofisica de Andalucia (CSIC) Apdo. 3004, 18080 Granada, Spain. From lhh at redhat.com Wed Jan 26 19:49:59 2005 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 26 Jan 2005 14:49:59 -0500 Subject: [Linux-cluster] ga1 clusvcmgrd[27878]: readServiceBlock: Service number mismatch 4, 6. In-Reply-To: <20050125233353.83434.qmail@web51906.mail.yahoo.com> References: <20050125233353.83434.qmail@web51906.mail.yahoo.com> Message-ID: <1106768999.16910.112.camel@ayanami.boston.redhat.com> On Tue, 2005-01-25 at 15:33 -0800, Darryll Napolis wrote: > Using RHEL3 cluster suite, i've been getting the > messages listed below in my /var/log/messages: > > ga1 clusvcmgrd[27878]: readServiceBlock: > Service number mismatch 4, 6. > > anybody knows what does that mean? Any inputs are > greatly appreciated. Thanks It's an anomaly, but is mostly noise. See: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=120934 -- Lon From lhh at redhat.com Thu Jan 27 18:36:38 2005 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 27 Jan 2005 13:36:38 -0500 Subject: [Linux-cluster] GFS and HEARTBEAT In-Reply-To: References: Message-ID: <1106850998.16910.168.camel@ayanami.boston.redhat.com> On Tue, 2005-01-25 at 11:18 +0100, maria perez wrote: > What relation have this parameter with the process heartbeat?? > I have to download and install heartbeat from .. > http://linux-ha.org/download ??????????' > or GFS incorporates any ??? Hi Maria, No work has been done to integrate GFS with Heartbeat. There is a lot of common work being done between the linux-cluster project and the linux-ha project, so this may change in the future. However, it should be possible to run GFS as the backend store for services/resource groups managed by heartbeat, but be aware that heartbeat's notion of membership and GFS's may not always coincide (e.g. one might think node X is offline, while the other believes it to be online). -- Lon From lhh at redhat.com Thu Jan 27 18:49:12 2005 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 27 Jan 2005 13:49:12 -0500 Subject: [Linux-cluster] GFS and HEARTBEAT In-Reply-To: <1106850998.16910.168.camel@ayanami.boston.redhat.com> References: <1106850998.16910.168.camel@ayanami.boston.redhat.com> Message-ID: <1106851752.16910.181.camel@ayanami.boston.redhat.com> On Thu, 2005-01-27 at 13:36 -0500, Lon Hohberger wrote: > However, it should be possible to run GFS as the backend store for > services/resource groups managed by heartbeat, but be aware that > heartbeat's notion of membership and GFS's may not always coincide (e.g. > one might think node X is offline, while the other believes it to be > online). Sorry, forgot to explain the practical implication. As an example: 1 - Heartbeat detects that node A is offline. 2 - Heartbeat STONITHs node A and takes over services. 3 - (30 seconds pass) 4 - GFS detects that node A is offline. 5 - GFS fences node A and recovers node A's journal. 6 - Cluster in sane state. In the above case, it's mostly just annoying for the system administrator and causes node A to be unavailable for longer than necessary. (If GFS locks are held when node A dies, then it becomes a bit more complicated.) -- Lon From daniel at osdl.org Thu Jan 27 19:21:15 2005 From: daniel at osdl.org (Daniel McNeil) Date: Thu, 27 Jan 2005 11:21:15 -0800 Subject: [Linux-cluster] [PATCH] to fix ccs_tool ld error on latest cvs In-Reply-To: <200501241018.49756.jaime@iaa.es> References: <1106351208.14739.8.camel@ibm-c.pdx.osdl.net> <200501241018.49756.jaime@iaa.es> Message-ID: <1106853675.9346.42.camel@ibm-c.pdx.osdl.net> On Mon, 2005-01-24 at 01:18, Jaime Perea wrote: > Hi everybody, > > My first posting! > > Perhaps doing > LDFLAGS="-lpthread" make > > could work. Thanks. Adding -lpthread in the Makefile fixed the build problem. Here's the patch that fixed the problem. --- cluster.orig/ccs/ccs_tool/Makefile 2005-01-27 11:15:49.385135157 -0800 +++ cluster/ccs/ccs_tool/Makefile 2005-01-24 15:29:38.000000000 -0800 @@ -25,7 +25,7 @@ endif LDFLAGS+= -L${ccs_libdir} `xml2-config --libs` -L${libdir} -LOADLIBES+= -lccs -lmagma -lmagmamsg -ldl +LOADLIBES+= -lccs -lmagma -lmagmamsg -ldl -lpthread all: ccs_tool From marco.yanez at hp.com Thu Jan 27 23:31:56 2005 From: marco.yanez at hp.com (Yanez, Marco Antonio) Date: Thu, 27 Jan 2005 17:31:56 -0600 Subject: [Linux-cluster] Question about GFS with 2 nodes Message-ID: Hi, I have a 2 nodes (one Master and other Secondary) in a GFS configuration. We investigate that only 3 nodes GFS can give a continuos operation in automatic form. But my question is: In a 2 nodes GFS configuration: if my Master node doesn't work for any situation, How I can configure the secondary node (manually) to continue with my normal operation while fix the Master node? Is is possible? I appreciate all yoour help on this. Best Regards. Marco From daniel at osdl.org Fri Jan 28 01:41:10 2005 From: daniel at osdl.org (Daniel McNeil) Date: Thu, 27 Jan 2005 17:41:10 -0800 Subject: [Linux-cluster] umount hang on 2.6.10 and latest GFS Message-ID: <1106876470.20799.13.camel@ibm-c.pdx.osdl.net> I hit a umount hang running my tests. It was running with 2 nodes mounted cl030 and cl031. It has finished a test and is unmounting cl030 when it hung. cl031 seems fine with the gfs file system still mounted. The gfs file system is unmounted (not in /proc/mounts), but the umount is hung trying to stop dlm_astd Here's the stack trace: umount D 00000008 0 10453 10447 (NOTLB) cdaa8de4 00000082 cdaa8dd4 00000008 00000001 000c0000 00000008 00000002 c1bce798 00000286 e8f782e0 cdaa8dc4 c0116871 e9db55e0 960546f9 c170ef60 00000000 0001fba3 0167aae6 00005e6b f74f8080 f74f81ec c170ef60 00000000 Call Trace: [] wait_for_completion+0xa4/0xe0 [] kthread_stop+0x85/0xae [] astd_stop+0x13/0x32 [dlm] [] dlm_release+0x91/0xa0 [dlm] [] release_lockspace+0x222/0x2f0 [dlm] [] release_gdlm+0x1c/0x30 [lock_dlm] [] lm_dlm_unmount+0x4f/0x70 [lock_dlm] [] lm_unmount+0x3c/0xa0 [lock_harness] [] gfs_lm_unmount+0x2f/0x40 [gfs] [] gfs_put_super+0x2fb/0x3a0 [gfs] [] generic_shutdown_super+0x127/0x140 [] gfs_kill_sb+0x2e/0x69 [gfs] [] deactivate_super+0x81/0xa0 [] sys_umount+0x3c/0xa0 [] sys_oldumount+0x19/0x20 [] sysenter_past_esp+0x52/0x75 dlm_astd D 00000008 0 10264 6 3235 (L-TLB) dc9c3ee8 00000046 dc9c3ed8 00000008 00000002 00000800 00000008 c8cc35e0 f7bc0568 5f8a4c1c 0179a889 e4676c5a 00004b2d dc9c3f14 c051c000 c1716f60 00000001 000001b0 0167c50b 00005e6b e9db55e0 e9db574c c1714060 00000000 Call Trace: [] rwsem_down_read_failed+0x9c/0x190 [] .text.lock.ast+0xc7/0x1de [dlm] [] dlm_astd+0x1e5/0x210 [dlm] [] kthread+0xba/0xc0 [] kernel_thread_helper+0x5/0x10 So, it looks like dlm_astd is stuck on a down_read(). The only down_read I see is in process_asts(). down_read(&ls->ls_in_recovery); So, it looks block on recovery of the lockspace, but the DLM is not listed in /proc/cluster/services and /proc/cluster/dlm_locks shows no locks. Full info available here: http://developer.osdl.org/daniel/GFS/test.25jan2005/ Ideas? Daniel From teigland at redhat.com Fri Jan 28 02:34:08 2005 From: teigland at redhat.com (David Teigland) Date: Fri, 28 Jan 2005 10:34:08 +0800 Subject: [Linux-cluster] umount hang on 2.6.10 and latest GFS In-Reply-To: <1106876470.20799.13.camel@ibm-c.pdx.osdl.net> References: <1106876470.20799.13.camel@ibm-c.pdx.osdl.net> Message-ID: <20050128023408.GB5298@redhat.com> On Thu, Jan 27, 2005 at 05:41:10PM -0800, Daniel McNeil wrote: > dlm_astd D 00000008 0 10264 6 3235 (L-TLB) > dc9c3ee8 00000046 dc9c3ed8 00000008 00000002 00000800 00000008 c8cc35e0 > f7bc0568 5f8a4c1c 0179a889 e4676c5a 00004b2d dc9c3f14 c051c000 c1716f60 > 00000001 000001b0 0167c50b 00005e6b e9db55e0 e9db574c c1714060 00000000 > Call Trace: > [] rwsem_down_read_failed+0x9c/0x190 > [] .text.lock.ast+0xc7/0x1de [dlm] > [] dlm_astd+0x1e5/0x210 [dlm] > [] kthread+0xba/0xc0 > [] kernel_thread_helper+0x5/0x10 > > So, it looks like dlm_astd is stuck on a down_read(). > > The only down_read I see is in process_asts(). > > down_read(&ls->ls_in_recovery); Yep, that's it. The ls struct is freed while dlm_astd is blocked there. I checked in a fix for this a few days ago. -- Dave Teigland From lhh at redhat.com Fri Jan 28 16:24:34 2005 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 28 Jan 2005 11:24:34 -0500 Subject: [Linux-cluster] Question about GFS with 2 nodes In-Reply-To: References: Message-ID: <1106929474.16910.241.camel@ayanami.boston.redhat.com> On Thu, 2005-01-27 at 17:31 -0600, Yanez, Marco Antonio wrote: > We investigate that only 3 nodes GFS can give a continuos operation in > automatic form. But my question is: > > In a 2 nodes GFS configuration: if my Master node doesn't work for any > situation, How I can configure the secondary node (manually) to continue > with my normal operation while fix the Master node? > Is is possible? Not with gulm. Try running CMAN in 2-node mode instead. -- Lon From daniel at osdl.org Sat Jan 29 00:51:46 2005 From: daniel at osdl.org (Daniel McNeil) Date: Fri, 28 Jan 2005 16:51:46 -0800 Subject: [Linux-cluster] build errors on the latest cvs Message-ID: <1106959906.20799.36.camel@ibm-c.pdx.osdl.net> I trying to re-build after update from cvs . Using 'make' gets: ln -snf libmagmamsg.so.DEVEL.1106957305 libmagmamsg.so.DEVEL ln -snf libmagma.so.DEVEL.1106957305 libmagma.so ln -snf libmagma_nt.so.DEVEL.1106957305 libmagma_nt.so ln -snf libmagmamsg.so.DEVEL.1106957305 libmagmamsg.so install -d /Views/redhat-cluster/cluster/build/lib install -d /usr/lib install: cannot change permissions of `/usr/lib': Operation not permitted make[2]: *** [install] Error 1 make[2]: Leaving directory `/Views/redhat-cluster/cluster/magma/lib' make[1]: *** [install] Error 2 make[1]: Leaving directory `/Views/redhat-cluster/cluster/magma' so it looks like slibdir is not being set right. I tried building from a clean view and got the same error. :( Running 'make install' as root does work with (with my patch to add -pthread to the ccs_tool Makefile), but I like building it all first before installing it. Daniel From rstevens at vitalstream.com Sat Jan 29 01:08:51 2005 From: rstevens at vitalstream.com (Rick Stevens) Date: Fri, 28 Jan 2005 17:08:51 -0800 Subject: [Linux-cluster] build errors on the latest cvs In-Reply-To: <1106959906.20799.36.camel@ibm-c.pdx.osdl.net> References: <1106959906.20799.36.camel@ibm-c.pdx.osdl.net> Message-ID: <41FAE223.3020108@vitalstream.com> Daniel McNeil wrote: > I trying to re-build after update from cvs . > > Using 'make' gets: > > ln -snf libmagmamsg.so.DEVEL.1106957305 libmagmamsg.so.DEVEL > ln -snf libmagma.so.DEVEL.1106957305 libmagma.so > ln -snf libmagma_nt.so.DEVEL.1106957305 libmagma_nt.so > ln -snf libmagmamsg.so.DEVEL.1106957305 libmagmamsg.so > install -d /Views/redhat-cluster/cluster/build/lib > install -d /usr/lib > install: cannot change permissions of `/usr/lib': Operation not permitted > make[2]: *** [install] Error 1 > make[2]: Leaving directory `/Views/redhat-cluster/cluster/magma/lib' > make[1]: *** [install] Error 2 > make[1]: Leaving directory `/Views/redhat-cluster/cluster/magma' > > so it looks like slibdir is not being set right. > > I tried building from a clean view and got the same error. :( > > Running 'make install' as root does work with (with my patch > to add -pthread to the ccs_tool Makefile), but I like > building it all first before installing it. Of COURSE you can't change the permissions of /usr/lib as a normal, mortal user.../usr/lib is owner: root, group: root. That's why installs HAVE to be done by root. Mere mortals aren't allowed to screw with important things like /usr/lib or /lib. ---------------------------------------------------------------------- - Rick Stevens, Senior Systems Engineer rstevens at vitalstream.com - - VitalStream, Inc. http://www.vitalstream.com - - - - I'm afraid my karma just ran over your dogma - ---------------------------------------------------------------------- From daniel at osdl.org Sat Jan 29 01:35:11 2005 From: daniel at osdl.org (Daniel McNeil) Date: Fri, 28 Jan 2005 17:35:11 -0800 Subject: [Linux-cluster] build errors on the latest cvs In-Reply-To: <41FAE223.3020108@vitalstream.com> References: <1106959906.20799.36.camel@ibm-c.pdx.osdl.net> <41FAE223.3020108@vitalstream.com> Message-ID: <1106962511.20799.51.camel@ibm-c.pdx.osdl.net> On Fri, 2005-01-28 at 17:08, Rick Stevens wrote: > Daniel McNeil wrote: > > I trying to re-build after update from cvs . > > > > Using 'make' gets: > > > > ln -snf libmagmamsg.so.DEVEL.1106957305 libmagmamsg.so.DEVEL > > ln -snf libmagma.so.DEVEL.1106957305 libmagma.so > > ln -snf libmagma_nt.so.DEVEL.1106957305 libmagma_nt.so > > ln -snf libmagmamsg.so.DEVEL.1106957305 libmagmamsg.so > > install -d /Views/redhat-cluster/cluster/build/lib > > install -d /usr/lib > > install: cannot change permissions of `/usr/lib': Operation not permitted > > make[2]: *** [install] Error 1 > > make[2]: Leaving directory `/Views/redhat-cluster/cluster/magma/lib' > > make[1]: *** [install] Error 2 > > make[1]: Leaving directory `/Views/redhat-cluster/cluster/magma' > > > > so it looks like slibdir is not being set right. > > > > I tried building from a clean view and got the same error. :( > > > > Running 'make install' as root does work with (with my patch > > to add -pthread to the ccs_tool Makefile), but I like > > building it all first before installing it. > > Of COURSE you can't change the permissions of /usr/lib as a normal, > mortal user.../usr/lib is owner: root, group: root. That's why installs > HAVE to be done by root. Mere mortals aren't allowed to screw with > important things like /usr/lib or /lib. I should have been more clear: when doing a "make" it should not be touching anything like /usr/lib/ or /lib. It is ok if 'make install' puts stuff in /usr/lib or /lib and other places (and that part must be done as root). It looks like the Makefile uses a prefix to a local directory. 'slibdir' is not using "/Views/redhat-cluster/cluster/build" prefix that 'libdir' above did (see the install -d line above the one that failed), so it does not build without being root. That is the build problem. Mere mortals should be able to build :) Daniel From daniel at osdl.org Sat Jan 29 01:44:54 2005 From: daniel at osdl.org (Daniel McNeil) Date: Fri, 28 Jan 2005 17:44:54 -0800 Subject: [Linux-cluster] build errors on the latest cvs AND ccsd doesn't run In-Reply-To: <1106962511.20799.51.camel@ibm-c.pdx.osdl.net> References: <1106959906.20799.36.camel@ibm-c.pdx.osdl.net> <41FAE223.3020108@vitalstream.com> <1106962511.20799.51.camel@ibm-c.pdx.osdl.net> Message-ID: <1106963093.20799.59.camel@ibm-c.pdx.osdl.net> I cannot get the latest cvs to run because ccsd complains: [root at cl030 cluster]# ccsd Failed to connect to cluster manager. Hint: Magma plugins are not in the right spot. Either it did not install stuff in the right spot or it is looking in the wrong spot. Any ideas? Daniel PS BTW, I did at first try running the updated kernel modules with the old user-level tools and some other nodes running code from a few days ago. Cman gave me a verson mis-match. So I updated all the nodes (kernel modules and user-level). If there are version changes like this, it would be nice to email a note to the gfs mailing list. From sunjw at onewaveinc.com Sat Jan 29 16:30:47 2005 From: sunjw at onewaveinc.com (=?gb2312?B?y++/oc6w?=) Date: Sun, 30 Jan 2005 00:30:47 +0800 Subject: [Linux-cluster] fence problem Message-ID: Hello all, I have brocade FC switch whose model is silkworm 3850, which fence method can I use on GFS for kernel 2.6.9, and how to configure the file cluster.conf,how to configure the FC switch? Thanks for any reply! Best regards! Luckey From sunjw at onewaveinc.com Sat Jan 29 16:14:48 2005 From: sunjw at onewaveinc.com (=?gb2312?B?y++/oc6w?=) Date: Sun, 30 Jan 2005 00:14:48 +0800 Subject: [Linux-cluster] NFS over GFS problem Message-ID: Hello,all I got the gfs' code from cvs on 2004-12-12, which was for kernel 2.6.9 , and I opened the NFS server on one GFS node, then I mounted the nfs filesystem on some machines which did not in the cluster.The NFS server's OS is FC3, the NFS clients' OS is RH9. There were problems: 1. When I stoped the NFS server and umount the GFS with NFS client mounted, no error happened. but after I remounted the GFS , restarted the NFS server, remounted the NFS,and umounted the NFS, stoped NFS server, then umount of GFS hang, system reboot failed too.the dmesg is : GFS: fsid=xxx:xxx.3: Unmount seems to be stalled. Dumping lock state... Glock (2, 26) gl_flags = gl_count = 2 gl_state = 0 lvb_count = 0 object = yes aspace = 0 reclaim = no Inode: num = 26/26 type = 2 i_count = 1 i_flags = vnode = yes Glock (5, 26) gl_flags = gl_count = 2 gl_state = 3 lvb_count = 0 object = yes aspace = no reclaim = no Holder owner = -1 gh_state = 3 gh_flags = 5 7 error = 0 gh_iflags = 1 6 7 what's the messages' meaning? It seams as if the mount/umount sequence is critical. 2. When two GFS nodes read the same file(s)(as 1~50 files) on the storage at the same time, total disk IO performance is worse than one node's? When one GFS node and one NFS node(client) read the same file(s) ,the NFS node's performance is nearly zero? What's worse, the command "ls" looks like blocked on the GFS directory on both GFS nodes and NFS nodes on above cases. So how can I speed up the command "ls", or may be the system call "stat"? The GFS filesystem use the "lock_dlm" lock protocol. Will "lock_gulm" protocol improve the status, or any other? Are there any gfs tune options or mount options to resolve the problem? Thanks for any reply! Best regards! Luckey From lhh at redhat.com Mon Jan 31 15:00:41 2005 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 31 Jan 2005 10:00:41 -0500 Subject: [Linux-cluster] build errors on the latest cvs In-Reply-To: <1106962511.20799.51.camel@ibm-c.pdx.osdl.net> References: <1106959906.20799.36.camel@ibm-c.pdx.osdl.net> <41FAE223.3020108@vitalstream.com> <1106962511.20799.51.camel@ibm-c.pdx.osdl.net> Message-ID: <1107183641.22835.27.camel@ayanami.boston.redhat.com> On Fri, 2005-01-28 at 17:35 -0800, Daniel McNeil wrote: > It is ok if 'make install' puts stuff in /usr/lib or /lib > and other places (and that part must be done as root). > > It looks like the Makefile uses a prefix to a local directory. > 'slibdir' is not using "/Views/redhat-cluster/cluster/build" > prefix that 'libdir' above did (see the install -d line above > the one that failed), so it does not build without being root. > That is the build problem. Mere mortals should be able to build :) You're correct. Fixing. -- Lon From lhh at redhat.com Mon Jan 31 15:08:18 2005 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 31 Jan 2005 10:08:18 -0500 Subject: [Linux-cluster] build errors on the latest cvs In-Reply-To: <1106959906.20799.36.camel@ibm-c.pdx.osdl.net> References: <1106959906.20799.36.camel@ibm-c.pdx.osdl.net> Message-ID: <1107184098.22835.31.camel@ayanami.boston.redhat.com> On Fri, 2005-01-28 at 16:51 -0800, Daniel McNeil wrote: > install -d /Views/redhat-cluster/cluster/build/lib > install -d /usr/lib That shouldn't happen during 'make'. Did you pass any flags to configure? It's working for me. -- Lon From lhh at redhat.com Mon Jan 31 15:13:51 2005 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 31 Jan 2005 10:13:51 -0500 Subject: [Linux-cluster] build errors on the latest cvs In-Reply-To: <1106962511.20799.51.camel@ibm-c.pdx.osdl.net> References: <1106959906.20799.36.camel@ibm-c.pdx.osdl.net> <41FAE223.3020108@vitalstream.com> <1106962511.20799.51.camel@ibm-c.pdx.osdl.net> Message-ID: <1107184431.22835.35.camel@ayanami.boston.redhat.com> On Fri, 2005-01-28 at 17:35 -0800, Daniel McNeil wrote: > It looks like the Makefile uses a prefix to a local directory. > 'slibdir' is not using "/Views/redhat-cluster/cluster/build" > prefix that 'libdir' above did (see the install -d line above > the one that failed), so it does not build without being root. > That is the build problem. Mere mortals should be able to build :) Oh. I know what happened... The top-level configure doesn't know about slibdir. The magma and magma-plugins builds are fine, but the top level causes the others to bomb. -- Lon From lhh at redhat.com Mon Jan 31 15:19:19 2005 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 31 Jan 2005 10:19:19 -0500 Subject: [Linux-cluster] build errors on the latest cvs In-Reply-To: <1107184431.22835.35.camel@ayanami.boston.redhat.com> References: <1106959906.20799.36.camel@ibm-c.pdx.osdl.net> <41FAE223.3020108@vitalstream.com> <1106962511.20799.51.camel@ibm-c.pdx.osdl.net> <1107184431.22835.35.camel@ayanami.boston.redhat.com> Message-ID: <1107184759.22835.40.camel@ayanami.boston.redhat.com> On Mon, 2005-01-31 at 10:13 -0500, Lon Hohberger wrote: > I know what happened... The top-level configure doesn't know about > slibdir. > > The magma and magma-plugins builds are fine, but the top level causes > the others to bomb. Erm, no, that's not it either. It works fine for me: ln -snf libmagma_nt.so.DEVEL.1107184454 libmagma_nt.so.DEVEL ld -shared -soname libmagmamsg.so.DEVEL -o libmagmamsg.so.DEVEL.1107184454 message.o fdops.o -lc ln -snf libmagmamsg.so.DEVEL.1107184454 libmagmamsg.so.DEVEL ln -snf libmagma.so.DEVEL.1107184454 libmagma.so ln -snf libmagma_nt.so.DEVEL.1107184454 libmagma_nt.so ln -snf libmagmamsg.so.DEVEL.1107184454 libmagmamsg.so install -d /tmp/lon/usr/lib install -d /tmp/lon/usr/lib In the configure script, it substitutes them the same way; this is rather strange... -- Lon From lhh at redhat.com Mon Jan 31 15:26:23 2005 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 31 Jan 2005 10:26:23 -0500 Subject: [Linux-cluster] build errors on the latest cvs In-Reply-To: <1107184759.22835.40.camel@ayanami.boston.redhat.com> References: <1106959906.20799.36.camel@ibm-c.pdx.osdl.net> <41FAE223.3020108@vitalstream.com> <1106962511.20799.51.camel@ibm-c.pdx.osdl.net> <1107184431.22835.35.camel@ayanami.boston.redhat.com> <1107184759.22835.40.camel@ayanami.boston.redhat.com> Message-ID: <1107185183.22835.43.camel@ayanami.boston.redhat.com> On Mon, 2005-01-31 at 10:19 -0500, Lon Hohberger wrote: > In the configure script, it substitutes them the same way; this is > rather strange... Ok, the coffee has hit. Fix in pool. -- Lon From amanthei at redhat.com Mon Jan 31 16:06:47 2005 From: amanthei at redhat.com (Adam Manthei) Date: Mon, 31 Jan 2005 10:06:47 -0600 Subject: [Linux-cluster] fence problem In-Reply-To: References: Message-ID: <20050131160647.GJ10537@redhat.com> On Sun, Jan 30, 2005 at 12:30:47AM +0800, ?????? wrote: > Hello all, > > I have brocade FC switch whose model is silkworm 3850, > which fence method can I use on GFS for kernel 2.6.9, > and how to configure the file cluster.conf,how to configure the FC switch? fence_brocade will probably work. The parameters are documented in the man page. -- Adam Manthei