From andriy at druzhba.lviv.ua Wed Sep 1 10:22:09 2004 From: andriy at druzhba.lviv.ua (Andriy Galetski) Date: Wed, 1 Sep 2004 13:22:09 +0300 Subject: [Linux-cluster] SM log messages References: <010001c48c14$baae0da0$f13cc90a@druzhba.com> Message-ID: <00bc01c4900d$8b974de0$f13cc90a@druzhba.com> Hi ! Can anyone tell me what is mean next "SM:" messages in log. Sep 1 13:13:51 cl10 kernel: dlm: gfs1: recover event 2 done Sep 1 13:13:51 cl10 kernel: dlm: gfs1: recover event 2 finished Sep 1 13:13:51 cl10 kernel: GFS: fsid=alpha:gfs1.0: Joined cluster. Now mounting FS... Sep 1 13:13:51 cl10 kernel: GFS: fsid=alpha:gfs1.0: jid=0: Trying to acquire journal lock... Sep 1 13:13:51 cl10 kernel: GFS: fsid=alpha:gfs1.0: jid=0: Looking at journal... Sep 1 13:13:51 cl10 kernel: GFS: fsid=alpha:gfs1.0: jid=0: Done Sep 1 13:13:51 cl10 kernel: GFS: fsid=alpha:gfs1.0: jid=1: Trying to acquire journal lock... Sep 1 13:13:51 cl10 kernel: GFS: fsid=alpha:gfs1.0: jid=1: Looking at journal... Sep 1 13:13:51 cl10 kernel: GFS: fsid=alpha:gfs1.0: jid=1: Done Sep 1 13:13:51 cl10 kernel: SM: process_reply invalid id=2 nodeid=4294967294 Sep 1 13:13:52 cl10 kernel: SM: process_reply invalid id=2 nodeid=4294967294 Sep 1 13:13:53 cl10 kernel: SM: process_reply invalid id=3 nodeid=4294967294 Thanks for help. From teigland at redhat.com Wed Sep 1 10:33:10 2004 From: teigland at redhat.com (David Teigland) Date: Wed, 1 Sep 2004 18:33:10 +0800 Subject: [Linux-cluster] SM log messages In-Reply-To: <00bc01c4900d$8b974de0$f13cc90a@druzhba.com> References: <010001c48c14$baae0da0$f13cc90a@druzhba.com> <00bc01c4900d$8b974de0$f13cc90a@druzhba.com> Message-ID: <20040901103310.GB19621@redhat.com> On Wed, Sep 01, 2004 at 01:22:09PM +0300, Andriy Galetski wrote: > Hi ! > > Can anyone tell me what is mean next "SM:" > messages in log. If everything continued to run fine, then they can be ignored. Otherwise, it's probably this bug (or related to it): https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=128420 -- Dave Teigland From stephen.willey at framestore-cfc.com Wed Sep 1 11:31:41 2004 From: stephen.willey at framestore-cfc.com (Stephen Willey) Date: Wed, 01 Sep 2004 12:31:41 +0100 Subject: [Linux-cluster] GFS 2Tb limit Message-ID: <4135B31D.7050508@framestore-cfc.com> There was a post a while back asking about 2Tb limits and the consensus was that with 2.6 you should be able to exceed the 2Tb limit with GFS. I've been trying several ways to get GFS working including using software raidtabs and LVM (seperately :) ) and everytime I try to use mkfs.gfs on a block device larger than 2Tb I get the following: Command: mkfs.gfs -p lock_dlm -t cluster1:gfs1 -j 8 /dev/md0 Result: mkfs.gfs: can't determine size of /dev/md0: File too large (/dev/md0 is obviously something different when using LVM or direct block device access) Does anyone have a working GFS filesystem larger than 2Tb (or know how to make one)? Without being able to scale past 2Tb, GFS becomes pretty useless for us... Thanks for any help, Stephen From laza at yu.net Wed Sep 1 12:04:37 2004 From: laza at yu.net (Lazar Obradovic) Date: Wed, 01 Sep 2004 14:04:37 +0200 Subject: [Linux-cluster] Re: cluster depends on tcp_wrappers? In-Reply-To: <87r7pnjh8o.fsf@coraid.com> References: <87u0ujjlz2.fsf@coraid.com> <20040831165253.GA14574@redhat.com> <87r7pnjh8o.fsf@coraid.com> Message-ID: <1094040277.21327.321.camel@laza.eunet.yu> You should add libxml2 there too... ccsd obviously needs it for config parsing... 
On Tue, 2004-08-31 at 20:27, Ed L Cashin wrote: > Michael Conrad Tadpol Tilstra writes: > > > On Tue, Aug 31, 2004 at 12:44:49PM -0400, Ed L Cashin wrote: > >> Hi. Does cluster, and gulm/src/utils_ip.c from today's CVS > >> specifically, depend on tcp_wrappers? > > > > gulm does use tcpwrappers, it always has. > > OK, here's a patch. Without tcp wrappers already installed, following > the directions in usage.txt results in a cryptic message about tcpd.h > being missing, so either a check in the configure script or some > documentation is necessary. > > --- cluster-cvs/doc/usage.txt.20040831 Tue Aug 31 14:21:57 2004 > +++ cluster-cvs/doc/usage.txt Tue Aug 31 14:22:39 2004 > @@ -25,6 +25,10 @@ > cvs -d :pserver:cvs at sources.redhat.com:/cvs/lvm2 checkout LVM2 > cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout cluster > > +- satisfy dependencies > + > + gulm requires tcp_wrappers > + > > Build and install > ----------------- -- Lazar Obradovic, System Engineer ----- laza at YU.net YUnet International http://www.EUnet.yu Dubrovacka 35/III, 11000 Belgrade Tel: +381 11 3119901; Fax: +381 11 3119901 ----- This e-mail is confidential and intended only for the recipient. Unauthorized distribution, modification or disclosure of its contents is prohibited. If you have received this e-mail in error, please notify the sender by telephone +381 11 3119901. ----- From chris at math.uu.se Wed Sep 1 13:05:49 2004 From: chris at math.uu.se (Christian Nygaard) Date: Wed, 1 Sep 2004 15:05:49 +0200 (CEST) Subject: [Linux-cluster] GFS cluster components? In-Reply-To: <4135B31D.7050508@framestore-cfc.com> References: <4135B31D.7050508@framestore-cfc.com> Message-ID: Is there a way to build a cheap and stable GFS system? What components would you recommend? Thanks for your input, Chris From ecashin at coraid.com Wed Sep 1 14:25:13 2004 From: ecashin at coraid.com (Ed L Cashin) Date: Wed, 01 Sep 2004 10:25:13 -0400 Subject: [Linux-cluster] Re: cluster depends on tcp_wrappers? References: <87u0ujjlz2.fsf@coraid.com> <20040831165253.GA14574@redhat.com> <87r7pnjh8o.fsf@coraid.com> <1094040277.21327.321.camel@laza.eunet.yu> Message-ID: <87eklmjcc6.fsf@coraid.com> Lazar Obradovic writes: > You should add libxml2 there too... > > ccsd obviously needs it for config parsing... That's true. I ran into that but forgot. I'm probably forgetting others, but it's good to get a list started. --- cluster-cvs/doc/usage.txt.20040831 Tue Aug 31 14:21:57 2004 +++ cluster-cvs/doc/usage.txt Wed Sep 1 10:19:35 2004 @@ -25,6 +25,11 @@ cvs -d :pserver:cvs at sources.redhat.com:/cvs/lvm2 checkout LVM2 cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout cluster +- satisfy dependencies + + gulm requires tcp_wrappers + ccsd requires libxml2 and its headers + Build and install ----------------- -- Ed L Cashin From ecashin at coraid.com Wed Sep 1 14:31:09 2004 From: ecashin at coraid.com (Ed L Cashin) Date: Wed, 01 Sep 2004 10:31:09 -0400 Subject: [Linux-cluster] Re: errors on inode.c References: <775913971.20040804063716@intersystems.com> Message-ID: <87brgqjc2a.fsf@coraid.com> Jeff writes: ... > I ran into this when I moved from one of the snapshots to > the cvs-latest. Issue "updatedb" and then "locate gfs_ioctl.h". > Remove the copies outside of the source tree. The make script > looks for header files in various places other than the source > tree and if it finds them, it uses them in preference to the > source tree. That seems like a problem. 
New code should build cleanly in isolation, otherwise it can be polluted by leftovers sprinkled around the system. It isn't easy to install the cluster software into a separate directory like /opt/cluster-20040901/, so it is likely that there will be old files in places like /usr/include. > There may be similar problems with header files for > cman-kernel and gfs-kernel. Yes, but they aren't consistent enough to script away easily. > Also, the libraries moved between the snapshots and latest > so if you did install the snapshot you need to execute: > rm -rf /lib/libmagma* /lib/magma /lib/libgulm* > rm -rf /lib/libccs* /lib/libdlm* > before you build from cvs. -- Ed L Cashin From phillips at redhat.com Wed Sep 1 15:06:00 2004 From: phillips at redhat.com (Daniel Phillips) Date: Wed, 1 Sep 2004 11:06:00 -0400 Subject: [Linux-cluster] New virtual synchrony API for the kernel: was Re: [Openais] New API in openais In-Reply-To: <1093981842.3613.42.camel@persist.az.mvista.com> References: <1093941076.3613.14.camel@persist.az.mvista.com> <1093973757.5933.56.camel@cherrybomb.pdx.osdl.net> <1093981842.3613.42.camel@persist.az.mvista.com> Message-ID: <200409011106.00541.phillips@redhat.com> Hi Steven, On Tuesday 31 August 2004 15:50, Steven Dake wrote: > It would be useful for linux cluster developers for a common low > level group communication API to be agreed upon by relevant clusters > projects. Without this approach, we may end up with several systems > all using different cluster communication & membership mechanisms > that are incompatible. To be honest, this does look interesting, however could you help me on a few points: - Is there any evil IP we have to worry about with this? - Can I get a formal interface spec from AIS for this, without signing a license? - Have you got benchmarks available for control and normal messaging From phillips at redhat.com Wed Sep 1 15:15:45 2004 From: phillips at redhat.com (Daniel Phillips) Date: Wed, 1 Sep 2004 11:15:45 -0400 Subject: [Linux-cluster] New virtual synchrony API for the kernel: was Re: [Openais] New API in openais In-Reply-To: <1093981842.3613.42.camel@persist.az.mvista.com> References: <1093941076.3613.14.camel@persist.az.mvista.com> <1093973757.5933.56.camel@cherrybomb.pdx.osdl.net> <1093981842.3613.42.camel@persist.az.mvista.com> Message-ID: <200409011115.45780.phillips@redhat.com> Hi Steven, (here's the rest of that message) On Tuesday 31 August 2004 15:50, Steven Dake wrote: > It would be useful for linux cluster developers for a common low > level group communication API to be agreed upon by relevant clusters > projects. Without this approach, we may end up with several systems > all using different cluster communication & membership mechanisms > that are incompatible. To be honest, this does look interesting, however could you help me on a few points: - Is there any evil IP we have to worry about with this? - Can I get a formal interface spec from AIS for this, without signing a license? - Have you got benchmarks available for control and normal messaging? - Have you looked at the barrier subsystem in sources.redhat.com/dlm? Could this be used as a primitive in implementing Virtual Synchrony? - Why would we need to worry about the AIS spec, in-kernel? What would stop you from providing an interface that presented some kernel functionality to userspace, with the interface of your choice, presumably AIS? - Why isn't Virtual Synchrony overkill, since we don't attempt to deal with netsplits by allowing subclusters to continue to operate? 
- In what way would GFS benefit from using Virtual Synchrony in place of its current messaging algorithms? Regards, Daniel From ben.m.cahill at intel.com Wed Sep 1 15:19:36 2004 From: ben.m.cahill at intel.com (Cahill, Ben M) Date: Wed, 1 Sep 2004 08:19:36 -0700 Subject: [Linux-cluster] What is purpose of GFS' "LIVE" lock? TIA. EOM. Message-ID: <0604335B7764D141945E202153105960033E2542@orsmsx404.amr.corp.intel.com> From wli at holomorphy.com Wed Sep 1 15:24:39 2004 From: wli at holomorphy.com (William Lee Irwin III) Date: Wed, 1 Sep 2004 08:24:39 -0700 Subject: [Linux-cluster] GFS 2Tb limit In-Reply-To: <4135B31D.7050508@framestore-cfc.com> References: <4135B31D.7050508@framestore-cfc.com> Message-ID: <20040901152439.GZ5492@holomorphy.com> On Wed, Sep 01, 2004 at 12:31:41PM +0100, Stephen Willey wrote: > There was a post a while back asking about 2Tb limits and the consensus > was that with 2.6 you should be able to exceed the 2Tb limit with GFS. > I've been trying several ways to get GFS working including using > software raidtabs and LVM (seperately :) ) and everytime I try to use > mkfs.gfs on a block device larger than 2Tb I get the following: > Command: mkfs.gfs -p lock_dlm -t cluster1:gfs1 -j 8 /dev/md0 > Result: mkfs.gfs: can't determine size of /dev/md0: File too large > (/dev/md0 is obviously something different when using LVM or direct > block device access) > Does anyone have a working GFS filesystem larger than 2Tb (or know how > to make one)? > Without being able to scale past 2Tb, GFS becomes pretty useless for us... > Thanks for any help, Either your utility is not opening the file with O_LARGEFILE or an O_LARGEFILE check has been incorrectly processed by the kernel. Please strace the utility and include the compressed results as a MIME attachment. Remember to compress the results, as most MTA's will reject messages of excessive size, in particular, mine. -- wli From bbaptist at iexposure.com Wed Sep 1 16:37:52 2004 From: bbaptist at iexposure.com (Bret Baptist) Date: Wed, 1 Sep 2004 11:37:52 -0500 Subject: [Linux-cluster] cman_tool error Message-ID: <200409011137.52410.bbaptist@iexposure.com> I am trying to join a simple cluster with cman_tool join but I constantly get the error: cman_tool join cman_tool: node name mankey is ambiguous Here is my cluster.conf: What does that error mean anyway? -- Bret Baptist Systems and Technical Support Specialist bbaptist at iexposure.com Internet Exposure, Inc. http://www.iexposure.com (612)676-1946 x17 Web Development-Web Marketing-ISP Services ------------------------------------------ Today is the tomorrow you worried about yesterday. From bbaptist at iexposure.com Wed Sep 1 18:08:39 2004 From: bbaptist at iexposure.com (Bret Baptist) Date: Wed, 1 Sep 2004 13:08:39 -0500 Subject: [Linux-cluster] cman_tool error In-Reply-To: <200409011137.52410.bbaptist@iexposure.com> References: <200409011137.52410.bbaptist@iexposure.com> Message-ID: <200409011308.39392.bbaptist@iexposure.com> Replying to myself... On Wednesday 01 September 2004 11:37 am, Bret Baptist wrote: > I am trying to join a simple cluster with cman_tool join but I constantly > get the error: > cman_tool join > cman_tool: node name mankey is ambiguous > > Here is my cluster.conf: < snip config> > > What does that error mean anyway? Grrr I found it, there were multiple entries in my /etc/hosts file for the host mankey. I removed them all except the correct one and it works correctly now. 
-- Bret Baptist Systems and Technical Support Specialist bbaptist at iexposure.com Internet Exposure, Inc. http://www.iexposure.com (612)676-1946 x17 Web Development-Web Marketing-ISP Services ------------------------------------------ Today is the tomorrow you worried about yesterday. From phillips at redhat.com Wed Sep 1 18:37:47 2004 From: phillips at redhat.com (Daniel Phillips) Date: Wed, 1 Sep 2004 14:37:47 -0400 Subject: [Linux-cluster] [ANNOUNCE] Linux Cluster Infrastructure BOF at Linux Kongress Message-ID: <200409011437.48032.phillips@redhat.com> There will be a Linux Cluster Infrastructure BOF at Linux Kongress in Erlangen, Germany, thursday 2004-09-09 or friday 2004-09-10. The exact day, time and room number to be posted here: http://www.linux-kongress.org/2004/program.html This will be round three of the Linux cluster infrastructure community effort. Rounds one and two were at OLS and Minneapolis, respectively. A summary of the latter is available here: http://sources.redhat.com/cluster/events/summit2004/presentations.html The story so far: We all agree that the time has come to establish a kernel infrastructure for cluster filesystems, which will also be useable by user space applications. Or at least, most of us agree about that. At Minneapolis we parted on the understanding that we would all read code and find out why (or why not) the GFS kernel support infrastructure can serve the needs of cluster systems beyond GFS, including other cluster filesystems, user space cluster applications, and the Single System Image project. http://sources.redhat.com/cluster/ Last time, Red Hat engineers outnumbered Suse engineers by roughly ten to one. The Linux Kongress BOF therefore presents an opportunity to redress that imbalance. Regards, Daniel From kpreslan at redhat.com Wed Sep 1 18:42:19 2004 From: kpreslan at redhat.com (Ken Preslan) Date: Wed, 1 Sep 2004 13:42:19 -0500 Subject: [Linux-cluster] GFS 2Tb limit In-Reply-To: <4135B31D.7050508@framestore-cfc.com> References: <4135B31D.7050508@framestore-cfc.com> Message-ID: <20040901184219.GA2379@potassium.msp.redhat.com> On Wed, Sep 01, 2004 at 12:31:41PM +0100, Stephen Willey wrote: > There was a post a while back asking about 2Tb limits and the consensus > was that with 2.6 you should be able to exceed the 2Tb limit with GFS. > I've been trying several ways to get GFS working including using > software raidtabs and LVM (seperately :) ) and everytime I try to use > mkfs.gfs on a block device larger than 2Tb I get the following: > > Command: mkfs.gfs -p lock_dlm -t cluster1:gfs1 -j 8 /dev/md0 > Result: mkfs.gfs: can't determine size of /dev/md0: File too large I just found and removed a safety check that made sure gfs_mkfs didn't try to make a filesystem bigger than 2TB (which was useful on 2.4). The check is after the place generating your error, though. WLI's suggestion of getting a strace of the process is a good one. -- Ken Preslan From kpreslan at redhat.com Wed Sep 1 19:12:41 2004 From: kpreslan at redhat.com (Ken Preslan) Date: Wed, 1 Sep 2004 14:12:41 -0500 Subject: [Linux-cluster] What is purpose of GFS' "LIVE" lock? TIA. EOM. In-Reply-To: <0604335B7764D141945E202153105960033E2542@orsmsx404.amr.corp.intel.com> References: <0604335B7764D141945E202153105960033E2542@orsmsx404.amr.corp.intel.com> Message-ID: <20040901191241.GA2411@potassium.msp.redhat.com> On Wed, Sep 01, 2004 at 08:19:36AM -0700, Cahill, Ben M wrote: > What is purpose of GFS' "LIVE" lock? TIA. EOM. It doesn't do anything anymore. 
At one point it was useful to see if there were other machines using the filesystem during mount. It doesn't really hurt anything and it may be useful in the future, so there hasn't been a huge rush to get rid of it. -- Ken Preslan From mfedyk at matchmail.com Wed Sep 1 20:45:05 2004 From: mfedyk at matchmail.com (Mike Fedyk) Date: Wed, 01 Sep 2004 13:45:05 -0700 Subject: [Linux-cluster] GFS 2Tb limit In-Reply-To: <4135B31D.7050508@framestore-cfc.com> References: <4135B31D.7050508@framestore-cfc.com> Message-ID: <413634D1.5020706@matchmail.com> Stephen Willey wrote: > There was a post a while back asking about 2Tb limits and the > consensus was that with 2.6 you should be able to exceed the 2Tb limit > with GFS. I've been trying several ways to get GFS working including > using software raidtabs and LVM (seperately :) ) and everytime I try > to use mkfs.gfs on a block device larger than 2Tb I get the following: > > Command: mkfs.gfs -p lock_dlm -t cluster1:gfs1 -j 8 /dev/md0 > Result: mkfs.gfs: can't determine size of /dev/md0: File too large Doesn't MD have trouble with SW RAID arrays larger than 2TB? From wli at holomorphy.com Wed Sep 1 20:47:24 2004 From: wli at holomorphy.com (William Lee Irwin III) Date: Wed, 1 Sep 2004 13:47:24 -0700 Subject: [Linux-cluster] GFS 2Tb limit In-Reply-To: <413634D1.5020706@matchmail.com> References: <4135B31D.7050508@framestore-cfc.com> <413634D1.5020706@matchmail.com> Message-ID: <20040901204724.GN5492@holomorphy.com> Stephen Willey wrote: >> There was a post a while back asking about 2Tb limits and the >> consensus was that with 2.6 you should be able to exceed the 2Tb limit >> with GFS. I've been trying several ways to get GFS working including >> using software raidtabs and LVM (seperately :) ) and everytime I try >> to use mkfs.gfs on a block device larger than 2Tb I get the following: >> Command: mkfs.gfs -p lock_dlm -t cluster1:gfs1 -j 8 /dev/md0 >> Result: mkfs.gfs: can't determine size of /dev/md0: File too large On Wed, Sep 01, 2004 at 01:45:05PM -0700, Mike Fedyk wrote: > Doesn't MD have trouble with SW RAID arrays larger than 2TB? If it does, then it needs to be made 64-bit sector_t clean. -- wli From cherry at osdl.org Wed Sep 1 20:57:15 2004 From: cherry at osdl.org (John Cherry) Date: Wed, 01 Sep 2004 13:57:15 -0700 Subject: [Linux-cluster] New virtual synchrony API for the kernel: was Re: [Openais] New API in openais In-Reply-To: <200409011115.45780.phillips@redhat.com> References: <1093941076.3613.14.camel@persist.az.mvista.com> <1093973757.5933.56.camel@cherrybomb.pdx.osdl.net> <1093981842.3613.42.camel@persist.az.mvista.com> <200409011115.45780.phillips@redhat.com> Message-ID: <1094072235.10369.102.camel@cherrybomb.pdx.osdl.net> Daniel, Steve is out having a baby (or at least his wife)...so I'll take a crack at some of your questions. On Wed, 2004-09-01 at 08:15, Daniel Phillips wrote: > Hi Steven, > > (here's the rest of that message) > > On Tuesday 31 August 2004 15:50, Steven Dake wrote: > > It would be useful for linux cluster developers for a common low > > level group communication API to be agreed upon by relevant clusters > > projects. Without this approach, we may end up with several systems > > all using different cluster communication & membership mechanisms > > that are incompatible. > Agreed. The low level cluster communication mechanisms mentioned at the cluster summit included the communication layer used by CMAN, TIPC, and SSI/CI internode communication. 
The openais project has a GMI which is a virtual synchrony based communication mechanism. Because of it's potential usefulness for applications, Steve is proposing an EVS-style API which would support agreed/safe ordering. What we should avoid is 4 different cluster communication mechanisms! > To be honest, this does look interesting, however could you help me on a > few points: > > - Is there any evil IP we have to worry about with this? I will let Steve answer definitively on the evil IP question, but the answer is that there are no IP issues with the OpenAIS project or the EVS API proposal. OpenAIS is being developed with a BSD-style license. > > - Can I get a formal interface spec from AIS for this, without > signing a license? The EVS proposal is not an SAForum AIS interface. The SAForum may want to adopt it at some point, but SAF-AIS focuses on a group messaging service which could be built on top of a low level cluster communication mechanism. BTW, anyone can download a copy of the SA Forum specifications. No formal license signing is required. > > - Have you got benchmarks available for control and normal messaging? There are some test programs, but I'll let Steve answer this one. > > - Have you looked at the barrier subsystem in sources.redhat.com/dlm? > Could this be used as a primitive in implementing Virtual Synchrony? I'll let Steve answer this one as well. > > - Why would we need to worry about the AIS spec, in-kernel? What > would stop you from providing an interface that presented some > kernel functionality to userspace, with the interface of your > choice, presumably AIS? Again, the EVS proposal is not an AIS interface. However, there is a membership API and a lock manager API specified in the SA Forum AIS. In order to present a consistent API to user space and to allow for a modular AIS service design, it would be good for the kernel services (membership and DLM) to present standard interfaces (such as SAF-AIS). This could be done as a "layer" to the existing kernel services. > > - Why isn't Virtual Synchrony overkill, since we don't attempt to > deal with netsplits by allowing subclusters to continue to operate? Virtual synchrony is the communication layer. The membership service (in your case, CMAN) determines active partitions and deals with netsplits, etc. In openais however, membership and virtual synchrony communication is pretty intertwined. > > - In what way would GFS benefit from using Virtual Synchrony in place > of its current messaging algorithms? What messaging algorithms are being used for GFS? I assumed that the DLM would be used for lock traffic and the barrier subsystem would be used for recovery. Regards, John From mfedyk at matchmail.com Wed Sep 1 21:23:20 2004 From: mfedyk at matchmail.com (Mike Fedyk) Date: Wed, 01 Sep 2004 14:23:20 -0700 Subject: [Linux-cluster] GFS 2Tb limit In-Reply-To: <20040901204724.GN5492@holomorphy.com> References: <4135B31D.7050508@framestore-cfc.com> <413634D1.5020706@matchmail.com> <20040901204724.GN5492@holomorphy.com> Message-ID: <41363DC8.20407@matchmail.com> William Lee Irwin III wrote: >Stephen Willey wrote: > > >>>There was a post a while back asking about 2Tb limits and the >>>consensus was that with 2.6 you should be able to exceed the 2Tb limit >>>with GFS. 
I've been trying several ways to get GFS working including >>>using software raidtabs and LVM (seperately :) ) and everytime I try >>>to use mkfs.gfs on a block device larger than 2Tb I get the following: >>>Command: mkfs.gfs -p lock_dlm -t cluster1:gfs1 -j 8 /dev/md0 >>>Result: mkfs.gfs: can't determine size of /dev/md0: File too large >>> >>> > >On Wed, Sep 01, 2004 at 01:45:05PM -0700, Mike Fedyk wrote: > > >>Doesn't MD have trouble with SW RAID arrays larger than 2TB? >> >> > >If it does, then it needs to be made 64-bit sector_t clean. > ISTR a thread from a few months back saying that both MD and DM have some code that isn't. From wli at holomorphy.com Wed Sep 1 21:24:42 2004 From: wli at holomorphy.com (William Lee Irwin III) Date: Wed, 1 Sep 2004 14:24:42 -0700 Subject: [Linux-cluster] GFS 2Tb limit In-Reply-To: <41363DC8.20407@matchmail.com> References: <4135B31D.7050508@framestore-cfc.com> <413634D1.5020706@matchmail.com> <20040901204724.GN5492@holomorphy.com> <41363DC8.20407@matchmail.com> Message-ID: <20040901212442.GO5492@holomorphy.com> William Lee Irwin III wrote: >> If it does, then it needs to be made 64-bit sector_t clean. On Wed, Sep 01, 2004 at 02:23:20PM -0700, Mike Fedyk wrote: > ISTR a thread from a few months back saying that both MD and DM have > some code that isn't. Sounds like we'd better audit the things, no? -- wli From mfedyk at matchmail.com Wed Sep 1 21:49:59 2004 From: mfedyk at matchmail.com (Mike Fedyk) Date: Wed, 01 Sep 2004 14:49:59 -0700 Subject: [Linux-cluster] GFS 2Tb limit In-Reply-To: <20040901212442.GO5492@holomorphy.com> References: <4135B31D.7050508@framestore-cfc.com> <413634D1.5020706@matchmail.com> <20040901204724.GN5492@holomorphy.com> <41363DC8.20407@matchmail.com> <20040901212442.GO5492@holomorphy.com> Message-ID: <41364407.8080000@matchmail.com> William Lee Irwin III wrote: >William Lee Irwin III wrote: > > >>>If it does, then it needs to be made 64-bit sector_t clean. >>> >>> > >On Wed, Sep 01, 2004 at 02:23:20PM -0700, Mike Fedyk wrote: > > >>ISTR a thread from a few months back saying that both MD and DM have >>some code that isn't. >> >> > >Sounds like we'd better audit the things, no? > I'd help if I could... :-/ From laza at yu.net Wed Sep 1 21:52:27 2004 From: laza at yu.net (Lazar Obradovic) Date: Wed, 01 Sep 2004 23:52:27 +0200 Subject: [Linux-cluster] Multicast ccsd In-Reply-To: <068D22A6-EBB0-11D8-9B62-000A957BB1F6@redhat.com> References: <1091553471.16747.165.camel@laza.eunet.yu> <1091556279.30938.179.camel@laza.eunet.yu> <1091736172.19762.336.camel@laza.eunet.yu> <20040809075039.GA9240@tykepenguin.com> <1092051143.1114.130.camel@laza.eunet.yu> <20040809133438.GI11723@tykepenguin.com> <1092067121.23273.235.camel@laza.eunet.yu> <20040810092900.GB13291@tykepenguin.com> <1092139910.32187.1098.camel@laza.eunet.yu> <20040810122043.GE13291@tykepenguin.com> <068D22A6-EBB0-11D8-9B62-000A957BB1F6@redhat.com> Message-ID: <1094075547.21327.617.camel@laza.eunet.yu> Jonathan, thanks for mcast support in ccsd. However, it seems that you have forgoten to set mcast_ttl for a socket, so mcast wouldn't have any sence, since mcast with ttl == 1 is same as broadcast for many networks. All you have to do is just add something like: ---8<--- char ttl = 10; if (setsockopt(sfd, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, sizeof (ttl) < 0 )) { log_err("Unable to set mcast ttl.\n"); error = -errno; goto fail; } --->8--- ttl could be defined somewhere outside join_group(), or ever fetched from argv, which also goes for cman_tool. 
Also, try not to use 224.0.0.0/23, as it is reserved, so we might get into trouble with default values. 224.0.0.1 is reserved for all mcast capable hosts, so even non-cluster members (potential or not) would get ccsd announcements. We might even request one mcast address to be assigned for linux-cluster project, so we can officialy use it. btw, to all developers: it was so uncool to remove 2.6.7 kernel patch :( 2.6.8.1 has a problem with tg3 driver (some autonegotiation issues), so it's completly unusable. -- Lazar Obradovic, System Engineer ----- laza at YU.net YUnet International http://www.EUnet.yu Dubrovacka 35/III, 11000 Belgrade Tel: +381 11 3119901; Fax: +381 11 3119901 ----- This e-mail is confidential and intended only for the recipient. Unauthorized distribution, modification or disclosure of its contents is prohibited. If you have received this e-mail in error, please notify the sender by telephone +381 11 3119901. ----- From jbrassow at redhat.com Wed Sep 1 22:53:29 2004 From: jbrassow at redhat.com (Jonathan E Brassow) Date: Wed, 1 Sep 2004 17:53:29 -0500 Subject: [Linux-cluster] Multicast ccsd In-Reply-To: <1094075547.21327.617.camel@laza.eunet.yu> References: <1091553471.16747.165.camel@laza.eunet.yu> <1091556279.30938.179.camel@laza.eunet.yu> <1091736172.19762.336.camel@laza.eunet.yu> <20040809075039.GA9240@tykepenguin.com> <1092051143.1114.130.camel@laza.eunet.yu> <20040809133438.GI11723@tykepenguin.com> <1092067121.23273.235.camel@laza.eunet.yu> <20040810092900.GB13291@tykepenguin.com> <1092139910.32187.1098.camel@laza.eunet.yu> <20040810122043.GE13291@tykepenguin.com> <068D22A6-EBB0-11D8-9B62-000A957BB1F6@redhat.com> <1094075547.21327.617.camel@laza.eunet.yu> Message-ID: > However, it seems that you have forgoten to set mcast_ttl for a socket, > so mcast wouldn't have any sence, since mcast with ttl == 1 is same as > broadcast for many networks. > > All you have to do is just add something like: > ---8<--- > char ttl = 10; > > if (setsockopt(sfd, IPPROTO_IP, IP_MULTICAST_TTL, > &ttl, sizeof (ttl) < 0 )) { > log_err("Unable to set mcast ttl.\n"); > error = -errno; > goto fail; > } > --->8--- Why did you choose 10? I know that ttl == 1 is subnet... but 3-31 is site local. Is 10 a better choice than 3? If so, why? I can certainly make something higher the default. Sorry about the ttl not being an option, I must have forgotten about it when doing the IPv6 stuff. There (looking for confirmation), the ttl is part of the address. ttl could be a new option to ccsd - '-t ' for threshold or ttl. Or, it could be part of the '-m' option, where the address and the ttl would be separated by a ','. > ttl could be defined somewhere outside join_group(), or ever fetched > from argv, which also goes for cman_tool. > > Also, try not to use 224.0.0.0/23, as it is reserved, so we might get > into trouble with default values. 224.0.0.1 is reserved for all mcast > capable hosts, so even non-cluster members (potential or not) would get > ccsd announcements. How does 224.3.0.65 sound for a default? This begs the question, should I be using ff02::3:1 rather than ff02::1 for IPv6? > btw, to all developers: it was so uncool to remove 2.6.7 kernel patch > :( > 2.6.8.1 has a problem with tg3 driver (some autonegotiation issues), so > it's completly unusable. I doubt that those developers in charge of *-kernel subdirectories will want to maintain separate patches for various kernel other than where the head is at. So, current code will likely follow the head. 
The old kernel patches are removed because they become out of sync (no longer worked on). The only way to get them is to cvs co -D, which automatically lets the user know that they are not acquiring current code. On the other hand, have you tried compiling the modules outside the kernel? I'd be surprised if that didn't work yet... brassow From wli at holomorphy.com Thu Sep 2 09:03:18 2004 From: wli at holomorphy.com (William Lee Irwin III) Date: Thu, 2 Sep 2004 02:03:18 -0700 Subject: [Linux-cluster] GFS 2Tb limit In-Reply-To: <4136DF8E.8050209@framestore-cfc.com> References: <4135B31D.7050508@framestore-cfc.com> <20040901152439.GZ5492@holomorphy.com> <4136DF8E.8050209@framestore-cfc.com> Message-ID: <20040902090318.GC5492@holomorphy.com> On Thu, Sep 02, 2004 at 09:53:34AM +0100, Stephen Willey wrote: > > > > > > > > William Lee Irwin III wrote: >
> On Wed, Sep 01, 2004 at 12:31:41PM +0100, Stephen Willey wrote:
> > There was a post a while back asking about 2Tb limits and the consensus
> > was that with 2.6 you should be able to exceed the 2Tb limit with GFS.
> > I've been trying several ways to get GFS working including using
> > software raidtabs and LVM (seperately :) ) and everytime I try to use
> > mkfs.gfs on a block device larger than 2Tb I get the following:

You seem to need help controlling your MUA. Anyway, AFAIK it's not
really an issue of scalability, just plain old bugs.

-- wli



From sdake at mvista.com  Thu Sep  2 06:03:12 2004
From: sdake at mvista.com (Steven Dake)
Date: Wed, 01 Sep 2004 23:03:12 -0700
Subject: [Linux-cluster] New virtual synchrony API for the kernel: was
	Re: [Openais] New API in openais
In-Reply-To: <200409011115.45780.phillips@redhat.com>
References: <1093941076.3613.14.camel@persist.az.mvista.com>
	<1093973757.5933.56.camel@cherrybomb.pdx.osdl.net>
	<1093981842.3613.42.camel@persist.az.mvista.com>
	<200409011115.45780.phillips@redhat.com>
Message-ID: <1094104992.5515.47.camel@persist.az.mvista.com>

On Wed, 2004-09-01 at 08:15, Daniel Phillips wrote:
> Hi Steven,
> 
> (here's the rest of that message)
> 
> On Tuesday 31 August 2004 15:50, Steven Dake wrote:
> > It would be useful for linux cluster developers for a common low
> > level group communication API to be agreed upon by relevant clusters
> > projects.  Without this approach, we may end up with several systems
> > all using different cluster communication & membership mechanisms
> > that are incompatible.
> 
> To be honest, this does look interesting, however could you help me on a 
> few points:
> 
>   - Is there any evil IP we have to worry about with this?
> 

I have not done any patent search, however, I am not aware of any
patents that apply.

The EVS API is not an SA Forum API, but rather an API that projects
could use to implement cluster services or applications (one of those
being AIS).

The EVS API is developed by the openais project, which licenses all code
under the Revised BSD license.  I would also be happy to license the API
header files, code, etc. under a dual license, where the two licenses are
Revised BSD and GPL.

openais group messaging uses crypto code provided by the libtomcrypt
project under a fully public domain license.  These crypto libraries
provide encryption and authentication, but the code could work without
them (insecurely).

>   - Can I get a formal interface spec from AIS for this, without
>     signing a license?
> 

The EVS interface, like all code in openais, is available under the Revised
BSD license and hence does require living up to the requirements of that
license.  But this is a commonly accepted open source license, so it
shouldn't be too much of a problem.

The EVS API has little to do with SA Forum itself, other than that it is
implemented in a project which also aims to implement the SA Forum
APIs.  The copyright and license requirements of the SA Forum do not
apply to the EVS API.

I think we still need some work to hammer out the last details of the
EVS API, but if we work together we can probably come to some agreement
about what else is needed by the API.  The current api is very simple. 
I am working on man pages now and should have them posted in a few days.

I'm happy to change the API if we can still come to some agreement that
virtual synchrony is a requirement of the API.

>   - Have you got benchmarks available for control and normal messaging?
> 

There is a tool called evsbench in the openais distribution which can be
used to print out various benchmarks for various loads.  I modified some
of the parameters of the benchmark to start at 100 bytes and increase
writes by 100 bytes per run.

In a two-processor cluster, made of 1.6GHz Xeons with 1 GB RAM using a
Netgear 100 mbit switch, I get the following performance at 70% CPU
usage as measured with top (80% of this is encryption/authentication):

100000 Writes   100 bytes per write  12.788 Seconds runtime  7820.022 TP/s   0.782 MB/s.
90000 Writes   200 bytes per write  11.012 Seconds runtime  8172.742 TP/s   1.635 MB/s.
81000 Writes   300 bytes per write  10.139 Seconds runtime  7989.066 TP/s   2.397 MB/s.
72900 Writes   400 bytes per write   9.685 Seconds runtime  7527.315 TP/s   3.011 MB/s.
65610 Writes   500 bytes per write  10.583 Seconds runtime  6199.683 TP/s   3.100 MB/s.
59049 Writes   600 bytes per write   9.309 Seconds runtime  6343.239 TP/s   3.806 MB/s.
53144 Writes   700 bytes per write   7.333 Seconds runtime  7247.023 TP/s   5.073 MB/s.
47829 Writes   800 bytes per write   6.743 Seconds runtime  7092.640 TP/s   5.674 MB/s.
43046 Writes   900 bytes per write   5.713 Seconds runtime  7534.503 TP/s   6.781 MB/s.
38741 Writes  1000 bytes per write   5.253 Seconds runtime  7374.890 TP/s   7.375 MB/s.
34866 Writes  1100 bytes per write   4.731 Seconds runtime  7369.611 TP/s   8.107 MB/s.
31379 Writes  1200 bytes per write   4.471 Seconds runtime  7018.992 TP/s   8.423 MB/s.
28241 Writes  1300 bytes per write   4.236 Seconds runtime  6667.422 TP/s   8.668 MB/s.

Your results may be different depending on the quality of your network. 
The EVS api is designed to work in networks that are extremely lossy
(99.9+% packet loss), but optimizes for networks that lose very few
packets (1 in 10^10 packets loss rate expected).

Without encryption or authentication, I've measured 10 MB/sec for
maximum packet size which is about 1306 bytes in the current
implementation.

Performance in clusters with more processors is not affected too negatively,
perhaps less than .1% in throughput.  I have measured 12-node clusters
of various speeds at 8.4mb/sec total available throughput.  The maximum
throughput of one node does decrease, however, as nodes are added.  A very
long time ago I measured something like 5-6mb/sec for one node, but
it's been a long time, so I suggest testing this yourself if you're
interested in that number.

>   - Have you looked at the barrier subsystem in sources.redhat.com/dlm?
>     Could this be used as a primitive in implementing Virtual Synchrony?

Virtual synchrony can be implemented in at least 4 ways that I am aware
of.  The method used in openais is called the ring protocol.  It may be
possible to implement VS/EVS in a different fashion; however, the ring
protocol has the best performance and reliability in the research.

>   - Why would we need to worry about the AIS spec, in-kernel?  What
>     would stop you from providing an interface that presented some
>     kernel functionality to userspace, with the interface of your
>     choice, presumably AIS?
> 
Yes, this is the proposal on the table: implement the EVS API in the kernel,
and then AIS could be implemented on top of this EVS API in userland.
Also, this would allow other applications such as Red Hat's GFS to use
the EVS API in the kernel.  This way everyone wins with a common messaging
API.

I also believe it would be possible to support multiple communication
mechanisms with a protocol driver per protocol.  Of these, TIPC and
openais's gmi would be prime candidates if someone does the work.

>   - Why isn't Virtual Synchrony overkill, since we don't attempt to
>     deal with netsplits by allowing subclusters to continue to operate?
> 
Any distributed system must absolutely deal with partitions and merges. 
Think of the most common partition, where 1 processor dies.  This is a
very common case that must be handled correctly.  But EVS provides many
other benefits beyond partitions and merges (although this is the main
benefit).

>   - In what way would GFS benefit from using Virtual Synchrony in place
>     of its current messaging algorithms?
> 

Performance, security, and most importantly reliability.  Even though it's
a little long, I'll cut and paste from the openais
(developer.osdl.org/dev/openais) README.devmap.  There is an interesting
piece that describes how easily a lock service could be implemented in a
virtual synchrony system because of the agreed ordering property.

processor: a system responsible for executing the virtual synchrony
model
configuration: the list of processors under which messages are delivered
partition: one or more processors leave the configuration
merge: one or more processors join the configuration
group messaging: sending a message from one sender to many receivers

Virtual synchrony is a model for group messaging.  This is often
confused with particular implementations of virtual synchrony.  Try to
focus on what virtual synchrony provides, not how it provides it, unless
interested in working on the group messaging interface of openais.

Virtual synchrony provides several advantages:

 * integrated membership
 * strong membership guarantees
 * agreed ordering of delivered messages
 * same delivery of configuration changes and messages on every node
 * self-delivery
 * reliable communication in the face of unreliable networks
 * recovery of messages sent within a configuration where possible
 * use of network multicast using standard UDP/IP

Integrated membership allows the group messaging interface to give
configuration change events to the API services.  This is obviously
beneficial to the cluster membership service (and its respective API),
but is helpful to other services as described later.

Strong membership guarantees allow a distributed application to make
decisions based upon the configuration (membership).  Every service in
openais registers a configuration change function.  This function is
called whenever a configuration change occurs.  The information passed
is the current processors, the processors that have left the
configuration, and the processors that have joined the configuration. 
This information is then used to make decisions within a distributed
state machine.  One example usage is that an AMF component
running on a specific processor has left the configuration, so failover
actions must now be taken with the new configuration (and known
components).  
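
As a purely illustrative sketch of that shape (the type and function names
below are hypothetical, not the actual openais interface), a service's
configuration change hook might look like this:

#include <stdio.h>

struct conf_change {
	const unsigned int *members;	/* processors in the new configuration */
	unsigned int n_members;
	const unsigned int *left;	/* processors that left */
	unsigned int n_left;
	const unsigned int *joined;	/* processors that joined */
	unsigned int n_joined;
};

/* Hypothetical per-service hook: delivered on every processor at the same
 * point in the message stream whenever the configuration changes. */
static void example_confchg(const struct conf_change *cc)
{
	unsigned int i;

	for (i = 0; i < cc->n_left; i++)
		/* every remaining processor sees this departure at the same
		 * point in the ordered stream, so all of them can begin
		 * failover for the departed processor consistently */
		printf("processor %u left, starting failover\n", cc->left[i]);
}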

Virtual synchrony requires that messages may be delivered in agreed
order.  FIFO order indicates that one sender and one receiver agree on
the order of messages sent.  Agreed ordering takes this requirement to
groups, requiring that one sender and all receivers agree on the order
of messages sent.

Consider a lock service.  The service is responsible for arbitrating
locks between multiple processors in the system.  With FIFO ordering,
this is very difficult because requests for a lock sent at about the same
time from two separate processors may arrive at different receivers in
different orders.  Agreed ordering ensures that all the processors are
delivered the message in the same order.  

In this case the first lock message will always be from processor X,
while the second lock message will always be from processor Y.   Hence
the first request is always honored by all processors, and the second
request is rejected (since the lock is taken).  This is how race
conditions are avoided in distributed systems.
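
A minimal sketch of that decision logic (the names are hypothetical, not
GFS or openais code): because every processor, including the sender, is
delivered the same requests in the same agreed order, each one runs this
handler locally and all copies reach the same answer.

struct lock_request {
	unsigned int requester;		/* node id of the sender */
};

struct lock_state {
	int held;			/* currently granted? */
	unsigned int owner;		/* valid only when held */
};

/* Delivery handler: runs on every processor, in the agreed order, for
 * every request (including self-delivered ones).  Returns 1 if this
 * request won the lock, 0 if it was rejected. */
static int on_lock_request(struct lock_state *s, const struct lock_request *r)
{
	if (!s->held) {
		s->held = 1;
		s->owner = r->requester;  /* first delivered request wins everywhere */
		return 1;
	}
	return 0;  /* lock already taken: rejected identically on all processors */
}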

Every processor is delivered a configuration change and messages within
a configuration in the same order.  This ensures that any distributed
state machine will make the same decisions on every processor within the
configuration.  This also allows the configuration and the messages to
be considered when making decisions.

Virtual synchrony requires that every node is delivered messages that it
sends.  This enables the logic to be placed in one location (the handler
for the delivery of the group message) instead of two separate places.
This also allows messages that are sent to be ordered in the stream of
other messages within the configuration.

Certain guarantees are required of virtually synchronous systems.  If
a message is sent, it must be delivered by every processor unless that
processor fails.  If a particular processor fails, a configuration
change occurs creating a new configuration under which a new set of
decisions may be made.  This implies that even unreliable networks must
reliably deliver messages.   The implementation in openais works on
unreliable as well as reliable networks.

Every message sent must be delivered, unless a configuration change
occurs.  In the case of a configuration change, every message that can
be recovered must be recovered before the new configuration is
installed.  Some systems during partition won't continue to recover
messages within the old configuration even though those messages can be
recovered.  Virtual synchrony makes that impossible, except for those
members that are no longer part of a configuration.

Finally, virtual synchrony takes advantage of hardware multicast to avoid
duplicated packets and scale to large transmit rates.  On a 100mbit
network, openais can approach wire speeds depending on the number of
messages queued for a particular processor.

What does all of this mean for the developer?

 * messages are delivered reliably
 * messages and configuration changes are delivered in the same order to
all processors
 * configuration and messages can both be used to make decisions


Thanks
-steve

> Regards,
> 
> Daniel



From stephen.willey at framestore-cfc.com  Thu Sep  2 08:53:34 2004
From: stephen.willey at framestore-cfc.com (Stephen Willey)
Date: Thu, 02 Sep 2004 09:53:34 +0100
Subject: [Linux-cluster] GFS 2Tb limit
In-Reply-To: <20040901152439.GZ5492@holomorphy.com>
References: <4135B31D.7050508@framestore-cfc.com>
	<20040901152439.GZ5492@holomorphy.com>
Message-ID: <4136DF8E.8050209@framestore-cfc.com>

An HTML attachment was scrubbed...
URL: 

From laza at yu.net  Thu Sep  2 13:58:01 2004
From: laza at yu.net (Lazar Obradovic)
Date: Thu, 02 Sep 2004 15:58:01 +0200
Subject: [Linux-cluster] Multicast ccsd
In-Reply-To: 
References: <1091553471.16747.165.camel@laza.eunet.yu>
	<1091556279.30938.179.camel@laza.eunet.yu>
	<1091736172.19762.336.camel@laza.eunet.yu>
	<20040809075039.GA9240@tykepenguin.com>
	<1092051143.1114.130.camel@laza.eunet.yu>
	<20040809133438.GI11723@tykepenguin.com>
	<1092067121.23273.235.camel@laza.eunet.yu>
	<20040810092900.GB13291@tykepenguin.com>
	<1092139910.32187.1098.camel@laza.eunet.yu>
	<20040810122043.GE13291@tykepenguin.com>
	<068D22A6-EBB0-11D8-9B62-000A957BB1F6@redhat.com>
	<1094075547.21327.617.camel@laza.eunet.yu>
	
Message-ID: <1094133481.21333.743.camel@laza.eunet.yu>

On Thu, 2004-09-02 at 00:53, Jonathan E Brassow wrote:
> Why did you choose 10?  I know that ttl == 1 is subnet... but 3-31 is 
> site local.  Is 10 a better choice than 3?  If so, why?  I can 
> certainly make something higher the default.

It depends on the site, mainly on network topology. Increment the TTL for
every L3 device you have in the path between two cluster members.
A value of 10 is something that "works for me" (c), since 10 is the
maximum path length my packets might have, even when flowing through
backup links. More standard setups (with servers in different VLANs,
connected directly to an L3 switch) might use ttl = 3.
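
For reference, a minimal sketch of raising the TTL on the socket (variable
names are illustrative; note the return-value check belongs outside the
setsockopt() argument list):

#include <netinet/in.h>
#include <sys/socket.h>

/* Illustrative only: sfd is an already-created UDP socket; pick ttl to
 * cover the longest routed path between cluster members. */
static int set_mcast_ttl(int sfd, unsigned char ttl)
{
	if (setsockopt(sfd, IPPROTO_IP, IP_MULTICAST_TTL,
		       &ttl, sizeof(ttl)) < 0)
		return -1;	/* the < 0 test applies to the return value */
	return 0;
}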

One very important thing to do is to define access lists on
routers/switches, to allow only valid nodes to be senders and receivers.
Otherwise, someone else within reach of the mcast packets might listen to
cluster announcements, which might present a security risk.

> ttl could be a new option to ccsd - '-t ' for threshold or ttl.  
> Or, it could be part of the '-m' option, where the address and the ttl 
> would be separated by a ','.

It's better to leave it separate, that is, use "-t". 

> How does 224.3.0.65 sound for a default?  This begs the question, 
> should I be using ff02::3:1 rather than ff02::1 for IPv6?

By looking at http://www.iana.org/assignments/multicast-addresses, you
can note that 224.3.0.64 - 224.251.255.255 is reserved by IANA.
On the other hand, 224.0.2.3-224.0.2.063 seems to be unassigned.

I'm using 224.0.2.10 for both ccsd and cman, so that might be a good
default :))

> On the other hand, have you tried compiling the modules outside the 
> kernel?  I'd be surprised if that didn't work yet...

This works more-or-less OK, but I have a problem with
gfs-kernel/.../000001.patch, which *has* to be applied directly to the
kernel, for it contains flock extensions.

Nevermind, I'm patching by hand :)

-- 
Lazar Obradovic, System Engineer
----- 
laza at YU.net
YUnet International http://www.EUnet.yu
Dubrovacka 35/III, 11000 Belgrade
Tel: +381 11 3119901; Fax: +381 11 3119901
-----
This e-mail is confidential and intended only for the recipient.
Unauthorized distribution, modification or disclosure of its
contents is prohibited. If you have received this e-mail in error,
please notify the sender by telephone +381 11 3119901.
-----




From stephen.willey at framestore-cfc.com  Thu Sep  2 15:09:38 2004
From: stephen.willey at framestore-cfc.com (Stephen Willey)
Date: Thu, 02 Sep 2004 16:09:38 +0100
Subject: [Linux-cluster] GFS 2Tb limit
In-Reply-To: <20040901152439.GZ5492@holomorphy.com>
References: <4135B31D.7050508@framestore-cfc.com>
	<20040901152439.GZ5492@holomorphy.com>
Message-ID: <413737B2.6030709@framestore-cfc.com>

William Lee Irwin III wrote:
> On Wed, Sep 01, 2004 at 12:31:41PM +0100, Stephen Willey wrote:
> 
>>There was a post a while back asking about 2Tb limits and the consensus 
>>was that with 2.6 you should be able to exceed the 2Tb limit with GFS.  
>>I've been trying several ways to get GFS working including using 
>>software raidtabs and LVM (seperately :) ) and everytime I try to use 
>>mkfs.gfs on a block device larger than 2Tb I get the following:
>>Command: mkfs.gfs -p lock_dlm -t cluster1:gfs1 -j 8 /dev/md0
>>Result: mkfs.gfs: can't determine size of /dev/md0: File too large
>>(/dev/md0 is obviously something different when using LVM or direct 
>>block device access)
>>Does anyone have a working GFS filesystem larger than 2Tb (or know how 
>>to make one)?
>>Without being able to scale past 2Tb, GFS becomes pretty useless for us...
>>Thanks for any help,
> 
> 
> Either your utility is not opening the file with O_LARGEFILE or an
> O_LARGEFILE check has been incorrectly processed by the kernel. Please
> strace the utility and include the compressed results as a MIME
> attachment. Remember to compress the results, as most MTA's will reject
> messages of excessive size, in particular, mine.
> 
> 
> -- wli


MD'd two 1.8Tb RAIDs together to form a 3.8Tb /dev/md0
mkfs.jfs /dev/md0 was successful
mkfs.gfs -p lock_dlm -t cluster1:gfs1 -j 8 /dev/md0 failed as expected 
with a File too large error

The strace output should be attached...

Thanks,

Stephen
-------------- next part --------------
A non-text attachment was scrubbed...
Name: straceresults.gz
Type: application/x-gzip
Size: 833 bytes
Desc: not available
URL: 

From notiggy at gmail.com  Thu Sep  2 17:28:35 2004
From: notiggy at gmail.com (Brian Jackson)
Date: Thu, 2 Sep 2004 12:28:35 -0500
Subject: [Linux-cluster] GFS cluster components?
In-Reply-To: 
References: <4135B31D.7050508@framestore-cfc.com>
	
Message-ID: 

On Wed, 1 Sep 2004 15:05:49 +0200 (CEST), Christian Nygaard
 wrote:
> 
> Is there a way to build a cheap and stable GFS system? What components
> would you recommend?

It depends on your definition of cheap and stable. No, GFS isn't
magically going to give you 5 9's of uptime with $100 worth of
hardware. At this point, GFS's SAN roots are still there, so I'd
suggest a SAN if you really want reliability (although it is possible to
build a GFS setup without a SAN). You can also use a regular network
(although I'd suggest at least gigabit ethernet) and something like
GNBD, iSCSI, etc. to build it. You can also use a firewire drive
connected between 2 computers for a really cheap setup (although the
reliability is pretty much gone at this point). One thing to note is
that currently the kernel's software raid layer isn't cluster
friendly, so you won't have a way to do data redundancy unless your
storage array/etc. is doing it.

--Brian Jackson

> 
> Thanks for your input,
> Chris
>



From wli at holomorphy.com  Thu Sep  2 17:59:11 2004
From: wli at holomorphy.com (William Lee Irwin III)
Date: Thu, 2 Sep 2004 10:59:11 -0700
Subject: [Linux-cluster] GFS 2Tb limit
In-Reply-To: <413737B2.6030709@framestore-cfc.com>
References: <4135B31D.7050508@framestore-cfc.com>
	<20040901152439.GZ5492@holomorphy.com>
	<413737B2.6030709@framestore-cfc.com>
Message-ID: <20040902175911.GD5492@holomorphy.com>

William Lee Irwin III wrote:
>> Either your utility is not opening the file with O_LARGEFILE or an
>> O_LARGEFILE check has been incorrectly processed by the kernel. Please
>> strace the utility and include the compressed results as a MIME
>> attachment. Remember to compress the results, as most MTA's will reject
>> messages of excessive size, in particular, mine.

On Thu, Sep 02, 2004 at 04:09:38PM +0100, Stephen Willey wrote:
> MD'd two 1.8Tb RAIDs together to form a 3.8Tb /dev/md0
> mkfs.jfs /dev/md0 was successful
> mkfs.gfs -p lock_dlm -t cluster1:gfs1 -j 8 /dev/md0 failed as expected 
> with a File too large error
> The strace output should be attached...
> Thanks,
> Stephen

So it's the latter. Could you give the precise kernel version in which
you encountered this bug? Writing a patch for 2.6.9-rc1-mm2 and
expecting it to be used may be too much to ask for...


-- wli



From kpreslan at redhat.com  Thu Sep  2 18:49:37 2004
From: kpreslan at redhat.com (Ken Preslan)
Date: Thu, 2 Sep 2004 13:49:37 -0500
Subject: [Linux-cluster] GFS 2Tb limit
In-Reply-To: <413737B2.6030709@framestore-cfc.com>
References: <4135B31D.7050508@framestore-cfc.com>
	<20040901152439.GZ5492@holomorphy.com>
	<413737B2.6030709@framestore-cfc.com>
Message-ID: <20040902184937.GA6528@potassium.msp.redhat.com>

On Thu, Sep 02, 2004 at 04:09:38PM +0100, Stephen Willey wrote:
> The strace output should be attached...

This is another problem with ioctls from the kernel not getting exported
to userspace correctly.  The definition we were using was correct for
Linux 2.4, but it's incorrect for 2.6.

The first ioctl() below is supposed to be BLKGETSIZE64, but the ioctl
number is wrong and the ioctl fails.  So, iddev tries BLKGETSIZE, which
can't encode the device size in a long and returns EFBIG.

open("/dev/md0", O_RDONLY|O_LARGEFILE)  = 3
fstat64(3, {st_mode=S_IFBLK|0660, st_rdev=makedev(9, 0), ...}) = 0
ioctl(3, 0x80081272, 0xbfffe180)        = -1 EINVAL (Invalid argument)
ioctl(3, BLKGETSIZE, 0xbfffe0dc)        = -1 EFBIG (File too large)
write(2, "mkfs.gfs: ", 10mkfs.gfs: )              = 10
write(2, "can\'t determine size of /dev/md0"..., 49can't determine size of /dev/md0: File too large


Apply the below patch, recompile iddev and then mkfs, and see if that
fixes it.


diff -urN crap1/iddev/lib/size.c crap2/iddev/lib/size.c
--- crap1/iddev/lib/size.c	24 Jun 2004 08:53:40 -0000	1.1
+++ crap2/iddev/lib/size.c	2 Sep 2004 18:36:37 -0000
@@ -40,7 +40,7 @@
 #include 
 
 #ifndef BLKGETSIZE64
-#define BLKGETSIZE64 _IOR(0x12, 114, uint64)
+#define BLKGETSIZE64 _IOR(0x12, 114, size_t)
 #endif
 
 static int do_device_size(int fd, uint64 *bytes)
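
For anyone chasing the same thing in another tool, a minimal sketch of the
probe iddev is effectively doing, assuming 2.6 semantics (BLKGETSIZE64
fills in a 64-bit byte count, BLKGETSIZE a long count of 512-byte sectors);
header availability varies, which is why iddev defines the ioctl number
itself when the header lacks it:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* BLKGETSIZE64 / BLKGETSIZE on 2.6 */

/* Illustrative sketch: size of a block device in bytes, preferring the
 * 64-bit ioctl and falling back to the 32-bit one, which cannot
 * represent devices past 2TB. */
static int device_size(int fd, uint64_t *bytes)
{
	unsigned long sectors;

	if (ioctl(fd, BLKGETSIZE64, bytes) == 0)
		return 0;			/* bytes, 64-bit clean */

	if (ioctl(fd, BLKGETSIZE, &sectors) == 0) {
		*bytes = (uint64_t)sectors << 9;	/* 512-byte sectors */
		return 0;
	}
	return -1;
}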

-- 
Ken Preslan 



From Joel.Becker at oracle.com  Thu Sep  2 20:05:50 2004
From: Joel.Becker at oracle.com (Joel Becker)
Date: Thu, 2 Sep 2004 13:05:50 -0700
Subject: [Linux-cluster] GFS 2Tb limit
In-Reply-To: <20040902184937.GA6528@potassium.msp.redhat.com>
References: <4135B31D.7050508@framestore-cfc.com>
	<20040901152439.GZ5492@holomorphy.com>
	<413737B2.6030709@framestore-cfc.com>
	<20040902184937.GA6528@potassium.msp.redhat.com>
Message-ID: <20040902200550.GT25438@ca-server1.us.oracle.com>

On Thu, Sep 02, 2004 at 01:49:37PM -0500, Ken Preslan wrote:
> The first ioctl() below is supposed to be BLKGETSIZE64, but the ioctl
> number is wrong and the ioctl fails.  So, iddev tries BLKGETSIZE, which
> can't encode the device size in a long and returns EFBIG.

	Why on earth isn't it using lseek64() for this?

	uint64_t size = lseek64(disk_fd, 0ULL, SEEK_END);
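
A tiny sketch of that approach, assuming _LARGEFILE64_SOURCE (names are
illustrative):

#define _LARGEFILE64_SOURCE
#include <stdint.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

/* Illustrative: seek to the end of the block device and read back the
 * offset; no block-device ioctls needed, and it works past 2TB. */
static int device_size_lseek(const char *path, uint64_t *bytes)
{
	int fd = open(path, O_RDONLY | O_LARGEFILE);
	off64_t end;

	if (fd < 0)
		return -1;
	end = lseek64(fd, 0, SEEK_END);
	close(fd);
	if (end < 0)
		return -1;
	*bytes = (uint64_t)end;
	return 0;
}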

Joel
-- 

Life's Little Instruction Book #232

	"Keep your promises."

Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127



From kpreslan at redhat.com  Thu Sep  2 20:55:14 2004
From: kpreslan at redhat.com (Ken Preslan)
Date: Thu, 2 Sep 2004 15:55:14 -0500
Subject: [Linux-cluster] GFS 2Tb limit
In-Reply-To: <20040902200550.GT25438@ca-server1.us.oracle.com>
References: <4135B31D.7050508@framestore-cfc.com>
	<20040901152439.GZ5492@holomorphy.com>
	<413737B2.6030709@framestore-cfc.com>
	<20040902184937.GA6528@potassium.msp.redhat.com>
	<20040902200550.GT25438@ca-server1.us.oracle.com>
Message-ID: <20040902205514.GA9307@potassium.msp.redhat.com>

On Thu, Sep 02, 2004 at 01:05:50PM -0700, Joel Becker wrote:
> On Thu, Sep 02, 2004 at 01:49:37PM -0500, Ken Preslan wrote:
> > The first ioctl() below is supposed to be BLKGETSIZE64, but the ioctl
> > number is wrong and the ioctl fails.  So, iddev tries BLKGETSIZE, which
> > can't encode the device size in a long and returns EFBIG.
> 
> 	Why on earth isn't it using lseek64() for this?
> 
> 	uint64_t size = lseek64(disk_fd, 0ULL, SEEK_END);

Hehe.  Thanks for the pointer.

-- 
Ken Preslan 



From stanley.wang at intel.com  Fri Sep  3 06:02:27 2004
From: stanley.wang at intel.com (Wang, Stanley)
Date: Fri, 3 Sep 2004 14:02:27 +0800
Subject: [Linux-cluster] Persistent lock question for GDLM
Message-ID: 

If DLM_LKF_PERSISTENT is specified, the lock will not be purged when the
holder (only applied to process) exits. My question is how can I purge
this persistent lock after the holder exits?

 
Opinions expressed are those of the author and do not represent Intel
Corporation
 
"gpg --recv-keys --keyserver wwwkeys.pgp.net E1390A7F"
{E1390A7F:3AD1 1B0C 2019 E183 0CFF  55E8 369A 8B75 E139 0A7F}



From stephen.willey at framestore-cfc.com  Fri Sep  3 09:35:13 2004
From: stephen.willey at framestore-cfc.com (Stephen Willey)
Date: Fri, 03 Sep 2004 10:35:13 +0100
Subject: [Linux-cluster] GFS 2Tb limit
In-Reply-To: <20040902175911.GD5492@holomorphy.com>
References: <4135B31D.7050508@framestore-cfc.com>
	<20040901152439.GZ5492@holomorphy.com>
	<413737B2.6030709@framestore-cfc.com>
	<20040902175911.GD5492@holomorphy.com>
Message-ID: <41383AD1.7040003@framestore-cfc.com>

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
> 
>>>Either your utility is not opening the file with O_LARGEFILE or an
>>>O_LARGEFILE check has been incorrectly processed by the kernel. Please
>>>strace the utility and include the compressed results as a MIME
>>>attachment. Remember to compress the results, as most MTA's will reject
>>>messages of excessive size, in particular, mine.
> 
> 
> On Thu, Sep 02, 2004 at 04:09:38PM +0100, Stephen Willey wrote:
> 
>>MD'd two 1.8Tb RAIDs together to form a 3.8Tb /dev/md0
>>mkfs.jfs /dev/md0 was successful
>>mkfs.gfs -p lock_dlm -t cluster1:gfs1 -j 8 /dev/md0 failed as expected 
>>with a File too large error
>>The strace output should be attached...
>>Thanks,
>>Stephen
> 
> 
> So it's the latter. Could you give the precise kernel version in which
> you encountered this bug? Writing a patch for 2.6.9-rc1-mm2 and
> expecting it to be used may be too much to ask for...
> 
> 
> -- wli

I was using 2.6.7 patched with everything under the cluster trunk as per 
the instructions here:

http://gfs.wikidev.net/Kernel_configuration

I'll give the 2.6.9-rc1-mm2 kernel a try and let you know...

Thanks,

Stephen



From wli at holomorphy.com  Fri Sep  3 09:38:44 2004
From: wli at holomorphy.com (William Lee Irwin III)
Date: Fri, 3 Sep 2004 02:38:44 -0700
Subject: [Linux-cluster] GFS 2Tb limit
In-Reply-To: <41383AD1.7040003@framestore-cfc.com>
References: <4135B31D.7050508@framestore-cfc.com>
	<20040901152439.GZ5492@holomorphy.com>
	<413737B2.6030709@framestore-cfc.com>
	<20040902175911.GD5492@holomorphy.com>
	<41383AD1.7040003@framestore-cfc.com>
Message-ID: <20040903093844.GO3106@holomorphy.com>

William Lee Irwin III wrote:
>> So it's the latter. Could you give the precise kernel version in which
>> you encountered this bug? Writing a patch for 2.6.9-rc1-mm2 and
>> expecting it to be used may be too much to ask for...

On Fri, Sep 03, 2004 at 10:35:13AM +0100, Stephen Willey wrote:
> I was using 2.6.7 patched with everything under the cluster trunk as per 
> the instructions here:
> http://gfs.wikidev.net/Kernel_configuration
> I'll give the 2.6.9-rc1-mm2 kernel a try and let you know...

It's likely it will either lack the relevant clustering code or will
also lack the fix; please try the other fix posted. I meant only to ask
which kernel version I should write a patch against.


-- wli



From rmayhew at mweb.com  Fri Sep  3 10:40:49 2004
From: rmayhew at mweb.com (Richard Mayhew)
Date: Fri, 3 Sep 2004 12:40:49 +0200
Subject: [Linux-cluster] FS Block Size Limit
Message-ID: <91C4F1A7C418014D9F88E938C13554589D019E@mwjdc2.mweb.com>

Hi,

I tried to format our 4 GFS partitions with a block size of 8192 (twice
the default) to correspond with the default raw block size on the SAN's
LUNs, to do some performance comparisons, with no success. Any block size
1024*X < 4096 worked with no problems; anything larger than 4096 came
back with an error when trying to mount the file system.

Any idea as to why there is a limit set to 4096?

Thanks

--

Regards

Richard Mayhew
Unix Specialist





From stephen.willey at framestore-cfc.com  Fri Sep  3 14:22:35 2004
From: stephen.willey at framestore-cfc.com (Stephen Willey)
Date: Fri, 03 Sep 2004 15:22:35 +0100
Subject: [Linux-cluster] GFS 2Tb limit
In-Reply-To: <20040902205514.GA9307@potassium.msp.redhat.com>
References: <4135B31D.7050508@framestore-cfc.com>	<20040901152439.GZ5492@holomorphy.com>	<413737B2.6030709@framestore-cfc.com>	<20040902184937.GA6528@potassium.msp.redhat.com>	<20040902200550.GT25438@ca-server1.us.oracle.com>
	<20040902205514.GA9307@potassium.msp.redhat.com>
Message-ID: <41387E2B.2000202@framestore-cfc.com>

Ken Preslan wrote:
> On Thu, Sep 02, 2004 at 01:05:50PM -0700, Joel Becker wrote:
> 
>>On Thu, Sep 02, 2004 at 01:49:37PM -0500, Ken Preslan wrote:
>>
>>>The first ioctl() below is supposed to be BLKGETSIZE64, but the ioctl
>>>number is wrong and the ioctl fails.  So, iddev tries BLKGETSIZE, which
>>>can't encode the device size in a long and returns EFBIG.
>>
>>	Why on earth isn't it using lseek64() for this?
>>
>>	uint64_t size = lseek64(disk_fd, 0ULL, SEEK_END);
> 
> 
> Hehe.  Thanks for the pointer.
> 

I've rebuild iddev and then gfs_mkfs and that's worked beautifully. 
Thanks a lot.

I'm having some problems actually mounting the thing now but I'll have 
to look at those on Monday.

Thanks again for all the help guys...

Stephen



From kpreslan at redhat.com  Fri Sep  3 16:49:20 2004
From: kpreslan at redhat.com (Ken Preslan)
Date: Fri, 3 Sep 2004 11:49:20 -0500
Subject: [Linux-cluster] FS Block Size Limit
In-Reply-To: <91C4F1A7C418014D9F88E938C13554589D019E@mwjdc2.mweb.com>
References: <91C4F1A7C418014D9F88E938C13554589D019E@mwjdc2.mweb.com>
Message-ID: <20040903164920.GA14996@potassium.msp.redhat.com>

On Fri, Sep 03, 2004 at 12:40:49PM +0200, Richard Mayhew wrote:
> I tried to format our 4 GFS partitions with a block size of 8192 (twice
> the default) to correspond with the default raw block size on the SAN's
> LUN's to do some performace comparisons with no success. Any block size
> 1024*X < 4096 worked with no problems, anything larger than 4096 came
> back with an error when trying to mount the file system.
> 
> Any idea as to why there is a limit set to 4096?

Linux filesystems can't have block sizes greater than the machine's
page size.  For x86 and x86_64 that's 4096 bytes.  I've seen 16k pages
on an IA64 box, so you can go larger there.
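
The bound is easy to check on a given box; a trivial sketch (plain libc,
nothing GFS-specific assumed):

#include <stdio.h>
#include <unistd.h>

/* The page size reported here is the ceiling described above for a
   filesystem block size on this machine (4096 on x86/x86_64). */
int main(void)
{
        long pagesize = sysconf(_SC_PAGESIZE);

        printf("page size / max fs block size: %ld bytes\n", pagesize);
        return 0;
}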

-- 
Ken Preslan 



From mfedyk at matchmail.com  Fri Sep  3 19:15:30 2004
From: mfedyk at matchmail.com (Mike Fedyk)
Date: Fri, 03 Sep 2004 12:15:30 -0700
Subject: [Linux-cluster] FS Block Size Limit
In-Reply-To: <20040903164920.GA14996@potassium.msp.redhat.com>
References: <91C4F1A7C418014D9F88E938C13554589D019E@mwjdc2.mweb.com>
	<20040903164920.GA14996@potassium.msp.redhat.com>
Message-ID: <4138C2D2.3060707@matchmail.com>

Ken Preslan wrote:

>On Fri, Sep 03, 2004 at 12:40:49PM +0200, Richard Mayhew wrote:
>  
>
>>I tried to format our 4 GFS partitions with a block size of 8192 (twice
>>the default) to correspond with the default raw block size on the SAN's
>>LUN's to do some performace comparisons with no success. Any block size
>>1024*X < 4096 worked with no problems, anything larger than 4096 came
>>back with an error when trying to mount the file system.
>>
>>Any idea as to why there is a limit set to 4096?
>>    
>>
>
>Linux filesystems can't have block sizes greater than the machine's
>page size.  For x86 and x86_64 that's 4096 bytes.  I've seen 16k pages
>on a IA64 box, so you can go larger there.
>
I remember some threads on LKML about larger page sizes on x86.

Not sure on the current status, but you'll want to look into "page cluster".

Mike



From kpreslan at redhat.com  Fri Sep  3 19:36:09 2004
From: kpreslan at redhat.com (Ken Preslan)
Date: Fri, 3 Sep 2004 14:36:09 -0500
Subject: [Linux-cluster] FS Block Size Limit
In-Reply-To: <4138C2D2.3060707@matchmail.com>
References: <91C4F1A7C418014D9F88E938C13554589D019E@mwjdc2.mweb.com>
	<20040903164920.GA14996@potassium.msp.redhat.com>
	<4138C2D2.3060707@matchmail.com>
Message-ID: <20040903193609.GA6077@potassium.msp.redhat.com>

On Fri, Sep 03, 2004 at 12:15:30PM -0700, Mike Fedyk wrote:
> >Linux filesystems can't have block sizes greater than the machine's
> >page size.  For x86 and x86_64 that's 4096 bytes.  I've seen 16k pages
> >on a IA64 box, so you can go larger there.
> >
> I remember some threads on LKML about larger page sizes on x86.
> 
> Not sure on the current status, but you'll want to look into "page cluster".

There are huge pages, meaning you can configure some subset of the pages
in a system to be a lot bigger than default (4MB on x86 boxes).  I don't
think that helps filesystem block sizes any.  Or were you thinking of
something else?

-- 
Ken Preslan 



From mfedyk at matchmail.com  Fri Sep  3 19:43:57 2004
From: mfedyk at matchmail.com (Mike Fedyk)
Date: Fri, 03 Sep 2004 12:43:57 -0700
Subject: [Linux-cluster] FS Block Size Limit
In-Reply-To: <20040903193609.GA6077@potassium.msp.redhat.com>
References: <91C4F1A7C418014D9F88E938C13554589D019E@mwjdc2.mweb.com>
	<20040903164920.GA14996@potassium.msp.redhat.com>
	<4138C2D2.3060707@matchmail.com>
	<20040903193609.GA6077@potassium.msp.redhat.com>
Message-ID: <4138C97D.5040909@matchmail.com>

Ken Preslan wrote:

>On Fri, Sep 03, 2004 at 12:15:30PM -0700, Mike Fedyk wrote:
>  
>
>>>Linux filesystems can't have block sizes greater than the machine's
>>>page size.  For x86 and x86_64 that's 4096 bytes.  I've seen 16k pages
>>>on a IA64 box, so you can go larger there.
>>>
>>>      
>>>
>>I remember some threads on LKML about larger page sizes on x86.
>>
>>Not sure on the current status, but you'll want to look into "page cluster".
>>    
>>
>
>There are huge pages, meaning you can configure some subset of the pages
>in a system to be a lot bigger than default (4MB on x86 boxes).  I don't
>think that helps filesystem block sizes any.  Or were you thinking of
>something else?
>
Yes, I'm thinking of something else.



From adam.cassar at netregistry.com.au  Mon Sep  6 01:27:31 2004
From: adam.cassar at netregistry.com.au (Adam Cassar)
Date: Mon, 06 Sep 2004 11:27:31 +1000
Subject: [Linux-cluster] repeatable assertion failure, what does it mean?
Message-ID: <1094434050.18292.10.camel@akira2.nro.au.com>

occurs on mount on a new fs

kernel BUG at /usr/src/GFS/cluster/gfs-kernel/src/dlm/lock.c:397!
invalid operand: 0000 [#1]
SMP 
Modules linked in: lock_dlm dlm cman gfs lock_harness 8250 serial_core
dm_mod
CPU:    1
EIP:    0060:[]    Not tainted
EFLAGS: 00010282   (2.6.8.1) 
EIP is at do_dlm_lock+0x1d6/0x1ea [lock_dlm]
eax: 00000001   ebx: ffffffea   ecx: c03478b4   edx: 000059e8
esi: c1af2600   edi: f7c7ac80   ebp: 00000000   esp: f6c23e2c
ds: 007b   es: 007b   ss: 0068
Process lock_dlm (pid: 23452, threadinfo=f6c22000 task=f6f8b290)
Stack: f898da49 f4882e60 00000010 00000000 00000000 ffffffea 00000003
00000005 
       0000005d 00000000 00000000 20202020 30312020 20202020 20202020
20202020 
       30202020 f6c20018 00000001 c1af2600 00000005 c1af2600 00000001
f8989c5e 
Call Trace:
 [] lm_dlm_lock+0x6e/0x7f [lock_dlm]
 [] lm_dlm_lock_sync+0x4c/0x62 [lock_dlm]
 [] id_test_and_set+0xf8/0x21f [lock_dlm]
 [] claim_jid+0x3b/0x110 [lock_dlm]
 [] process_start+0x372/0x4e1 [lock_dlm]
 [] dlm_async+0x1fc/0x315 [lock_dlm]
 [] default_wake_function+0x0/0x12
 [] default_wake_function+0x0/0x12
 [] dlm_async+0x0/0x315 [lock_dlm]
 [] kernel_thread_helper+0x5/0xb
Code: 0f 0b 8d 01 80 df 98 f8 c7 04 24 40 e0 98 f8 e8 5a f2 78 c7kernel
BUG at /usr/src/GFS/cluster/gfs-kernel/src/dlm/lock.c:397!
invalid operand: 0000 [#1]
CPU:    1
EIP:    0060:[]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010282   (2.6.8.1) 
eax: 00000001   ebx: ffffffea   ecx: c03478b4   edx: 000059e8
esi: c1af2600   edi: f7c7ac80   ebp: 00000000   esp: f6c23e2c
ds: 007b   es: 007b   ss: 0068
Stack: f898da49 f4882e60 00000010 00000000 00000000 ffffffea 00000003
00000005 
       0000005d 00000000 00000000 20202020 30312020 20202020 20202020
20202020 
       30202020 f6c20018 00000001 c1af2600 00000005 c1af2600 00000001
f8989c5e 
 [] lm_dlm_lock+0x6e/0x7f [lock_dlm]
 [] lm_dlm_lock_sync+0x4c/0x62 [lock_dlm]
 [] id_test_and_set+0xf8/0x21f [lock_dlm]
 [] claim_jid+0x3b/0x110 [lock_dlm]
 [] process_start+0x372/0x4e1 [lock_dlm]
 [] dlm_async+0x1fc/0x315 [lock_dlm]
 [] default_wake_function+0x0/0x12
 [] default_wake_function+0x0/0x12
 [] dlm_async+0x0/0x315 [lock_dlm]
 [] kernel_thread_helper+0x5/0xb

Code: 0f 0b 8d 01 80 df 98 f8 c7 04 24 40 e0 98 f8 e8 5a f2 78 c7


>>EIP; f8989b8a    <=====

>>ebx; ffffffea <__kernel_rt_sigreturn+1baa/????>
>>ecx; c03478b4 
>>edx; 000059e8 Before first symbol
>>esi; c1af2600 
>>edi; f7c7ac80 
>>esp; f6c23e2c 

Code;  f8989b8a 
00000000 <_EIP>:
Code;  f8989b8a    <=====
   0:   0f 0b                     ud2a      <=====
Code;  f8989b8c 
   2:   8d 01                     lea    (%ecx),%eax
Code;  f8989b8e 
   4:   80 df 98                  sbb    $0x98,%bh
Code;  f8989b91 
   7:   f8                        clc    
Code;  f8989b92 
   8:   c7 04 24 40 e0 98 f8      movl   $0xf898e040,(%esp,1)
Code;  f8989b99 
   f:   e8 5a f2 78 c7            call   c778f26e <_EIP+0xc778f26e>
c0118df8 




From sankar at redhat.com  Mon Sep  6 10:59:05 2004
From: sankar at redhat.com (Sankarshan Mukhopadhay)
Date: Mon, 06 Sep 2004 16:29:05 +0530
Subject: [Linux-cluster] GFS cluster components?
In-Reply-To: 
References: <4135B31D.7050508@framestore-cfc.com>	
	
Message-ID: <413C42F9.8040008@redhat.com>

Brian Jackson wrote:

[snipped]

> hardware. At this point, GFS's SAN roots are still there, so I'd
> suggest a SAN if you really want reliable (although it is possible to
> build a GFS setup without a SAN). You can also use a regular network
> (although I'd suggest at least gigabit ethernet) and something like
> GNBD, iSCSI, etc. to build it. You can also use a firewire drive
> connected between 2 computers for the really cheap (although the
> reliability is pretty much gone at this point). 

Are these alternative setups certified or are they workable 
implementations ?


>One thing to note is
> that currently the kernel's software raid layer isn't cluster
> friendly, so you won't have a way to do data redundancy unless your
> storage array/etc. is doing it.

Hey, thanks for this bit of information.

Regards
Sankarshan

-- 
Sankarshan Mukhopadhyay

Red Hat India Pvt Ltd
517, World Trade Centre
5th Floor, B-Wing
Barakhamba Lane
Connaught Place
New Delhi 110 001
T: +91-011-51550569/3181
F: +91-011-51553180
M: +91-989980 1676
e-mail: sankar at redhat.com



From pcaulfie at redhat.com  Mon Sep  6 07:42:35 2004
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Mon, 6 Sep 2004 08:42:35 +0100
Subject: [Linux-cluster] cluster send request failed: Bad address
In-Reply-To: 
References: 
Message-ID: <20040906074234.GA28386@tykepenguin.com>

On Tue, Aug 31, 2004 at 02:49:13PM -0400, Fredric Isaman wrote:
> I am trying to set up a simple 3-node cluster (containing iota6-8). I get
> up to running clvmd on each node. At this point, iota8 works fine, all lvm
> commands work (although with some error messages about lock failures on
> the other nodes). However, any attempt to use a lvm command on the other
> nodes gives some sort of locking error.  For example:
> 
> [root at iota6g LVM2]# pvremove /baddev
>   cluster send request failed: Bad address
>   Can't get lock for orphan PVs

It looks like the DLM userspace is out of date with the kernel. Make sure
everything is up-to-date and that there are no old shared libraries (e.g. libdlm.so)
hanging around.
-- 

patrick



From pcaulfie at redhat.com  Mon Sep  6 07:46:32 2004
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Mon, 6 Sep 2004 08:46:32 +0100
Subject: [Linux-cluster] Persistent lock question for GDLM
In-Reply-To: 
References: 
Message-ID: <20040906074632.GB28386@tykepenguin.com>

On Fri, Sep 03, 2004 at 02:02:27PM +0800, Wang, Stanley wrote:
> If DLM_LKF_PERSISTENT is specified, the lock will not be purged when the
> holder (only applied to process) exits. My question is how can I purge
> this persistent lock after the holder exits?
> 

If you know the lock name, then you can do a query to get all the lock IDs and
then simply call dlm_unlock() on them.

If you don't know the lock names then it's much harder, as there is no wildcard
lock query.
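
For anyone who wants a starting point, a rough sketch of the unlock half is
below.  It assumes the lock ID has already been obtained somehow (the query
step is deliberately left out, since there is no wildcard query), and it
assumes libdlm's synchronous dlm_unlock_wait() wrapper with the signature from
the libdlm.h of this vintage -- check your header before relying on it:

#include <stdio.h>
#include <stdint.h>
#include <libdlm.h>

/* Drop an orphaned persistent lock given its lock ID.  Getting the
   lock ID in the first place (via a query on the known lock name)
   is not shown here. */
static int purge_persistent_lock(uint32_t lkid)
{
        struct dlm_lksb lksb;
        int ret;

        /* dlm_unlock_wait() blocks until the unlock completes; the
           plain dlm_unlock() variant delivers its result through an
           AST callback instead. */
        ret = dlm_unlock_wait(lkid, 0, &lksb);
        if (ret)
                fprintf(stderr, "unlock of lkid 0x%x failed\n", lkid);

        return ret;
}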

-- 

patrick



From teigland at redhat.com  Mon Sep  6 08:00:56 2004
From: teigland at redhat.com (David Teigland)
Date: Mon, 6 Sep 2004 16:00:56 +0800
Subject: [Linux-cluster] repeatable assertion failure, what does it mean?
In-Reply-To: <1094434050.18292.10.camel@akira2.nro.au.com>
References: <1094434050.18292.10.camel@akira2.nro.au.com>
Message-ID: <20040906080056.GD30477@redhat.com>


On Mon, Sep 06, 2004 at 11:27:31AM +1000, Adam Cassar wrote:
> occurs on mount on a new fs
> 
> kernel BUG at /usr/src/GFS/cluster/gfs-kernel/src/dlm/lock.c:397!

This is fixed by updating and rebuilding both dlm and lock_dlm.

-- 
Dave Teigland  



From pcaulfie at redhat.com  Mon Sep  6 08:46:28 2004
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Mon, 6 Sep 2004 09:46:28 +0100
Subject: [Linux-cluster] [PATCH]: avoid local_nodeid conflict with
	ia64/numa define
In-Reply-To: <200408310135.32411.arekm@pld-linux.org>
References: <200408310135.32411.arekm@pld-linux.org>
Message-ID: <20040906084628.GA31214@tykepenguin.com>

On Tue, Aug 31, 2004 at 01:35:32AM +0200, Arkadiusz Miskiewicz wrote:
> 
> Little patch by qboosh at pld-linux.org:
> 
> - avoid local_nodeid conflict with ia64/numa define
> 
> http://cvs.pld-linux.org/cgi-bin/cvsweb/SOURCES/linux-cluster-dlm.patch?r1=1.1.2.3&r2=1.1.2.4
> 

Thanks, I'll apply that.

-- 

patrick



From stanley.wang at intel.com  Mon Sep  6 08:52:39 2004
From: stanley.wang at intel.com (Wang, Stanley)
Date: Mon, 6 Sep 2004 16:52:39 +0800
Subject: [Linux-cluster] Persistent lock question for GDLM
Message-ID: 

Thanks a lot!

Best Regards,
Stan

Opinions expressed are those of the author and do not represent Intel Corporation
 
"gpg --recv-keys --keyserver wwwkeys.pgp.net E1390A7F"
{E1390A7F:3AD1 1B0C 2019 E183 0CFF  55E8 369A 8B75 E139 0A7F} 

>-----Original Message-----
>From: linux-cluster-bounces at redhat.com 
>[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Patrick 
>Caulfield
>Sent: 2004?9?6? 15:47
>To: Discussion of clustering software components including GFS
>Subject: Re: [Linux-cluster] Persistent lock question for GDLM
>
>On Fri, Sep 03, 2004 at 02:02:27PM +0800, Wang, Stanley wrote:
>> If DLM_LKF_PERSISTENT is specified, the lock will not be 
>purged when the
>> holder (only applied to process) exits. My question is how 
>can I purge
>> this persistent lock after the holder exits?
>> 
>
>If you know the lock name, then you can do a query to get all 
>the lock IDs and
>then simply call dlm_unlock() on them.
>
>If you don't know the lock names then it's much harder, as 
>there is no qildcard
>lock query.
>
>-- 
>
>patrick
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>http://www.redhat.com/mailman/listinfo/linux-cluster
>



From mauelshagen at redhat.com  Tue Sep  7 14:15:31 2004
From: mauelshagen at redhat.com (Heinz Mauelshagen)
Date: Tue, 7 Sep 2004 16:15:31 +0200
Subject: [Linux-cluster] *** Announcement: dmraid 1.0.0-rc4 ***
Message-ID: <20040907141531.GA13850@redhat.com>


               *** Announcement: dmraid 1.0.0-rc4 ***

dmraid 1.0.0-rc4 is available at
http://people.redhat.com:/~heinzm/sw/dmraid/ in source, source rpm and i386 rpm.

dmraid (Device-Mapper Raid tool) discovers, [de]activates and displays
properties of software RAID sets (ie. ATARAID) and contained DOS
partitions using the device-mapper runtime of the 2.6 kernel.

The following ATARAID types are supported on Linux 2.6:

Highpoint HPT37X
Highpoint HPT45X
Intel Software RAID
Promise FastTrack
Silicon Image Medley

This ATARAID type has only basic support in this version (I need
better metadata format specs; please help):
LSI Logic MegaRAID

Please provide insight to support those metadata formats completely.

Thanks.

See files README and CHANGELOG, which come with the source tarball for
prerequisites to run this software, further instructions on installing
and using dmraid!

CHANGELOG is contained below for your convenience as well.


Call for testers:
-----------------

I need testers with the above ATARAID types, to check that the mapping
created by this tool is correct (see options "-t -ay") and that access to the
ATARAID data works properly.

In case you have a different ATARAID solution from those listed above,
please feel free to contact me about supporting it in dmraid.

You can activate your ATARAID sets without danger of overwriting
your metadata, because dmraid accesses it read-only unless you use
option -E with -r in order to erase ATARAID metadata (see 'man dmraid')!

This is a release candidate, so you want to have backups of your valuable
data *and* you want to test read-only access to your data first, in order to
make sure that the mapping is correct before you go for read-write access.


The author is reachable at .

For test results, mapping information, discussions, questions, patches,
enhancement requests and the like, please subscribe and mail
to .

--

Regards,
Heinz    -- The LVM Guy --


CHANGELOG:
---------


Changelog from dmraid 1.0.0-rc3 to 1.0.0-rc4		2004.09.07

FIXES:
------
o get_dm_serial fix for trailing blanks
o infinite loop bug in makefile
o unified RAID #defines
o RAID disk erase size
o avoided unnecessary read in isw_read()
o segfault in build_set() on RAID set group failure
o activation of partitions on Intel Software RAID
o allow display of tables for active RAID sets (-t -ay)
o discovering no RAID disks shouldn't return an error
o free_set would have segfaulted on virgin RAID set structures
o deep DOS partition chains (Paul Moore)
o "dmraid -sa" displayed group RAID set with Intel Software RAID
  when it shouldn't
o return RAID super set pointer from hpt45x_group() and sil_group()
  rather than the sub set pointer


FEATURES:
---------

o added offset output to all native metadata logs
o started defining metadata format handler event method needed for
  write updates to native metadata (eg, for mirror failure)
o [de]activation of a single RAID set below a group one (isw)
o support for multiple -c options (see "man dmraid"):
  "dmraid -b -c{0,2}"
  "dmraid -r -c{0,2}"
  "dmraid -s -c{0,3}"

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Heinz Mauelshagen                                 Red Hat GmbH
Consulting Development Engineer                   Am Sonnenhang 11
                                                  56242 Marienrachdorf
                                                  Germany
Mauelshagen at RedHat.com                            +49 2626 141200
                                                       FAX 924446
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-



From jbrassow at redhat.com  Tue Sep  7 14:58:34 2004
From: jbrassow at redhat.com (Jonathan E Brassow)
Date: Tue, 7 Sep 2004 09:58:34 -0500
Subject: [Linux-cluster] cluster send request failed: Bad address
In-Reply-To: <20040906074234.GA28386@tykepenguin.com>
References: 
	<20040906074234.GA28386@tykepenguin.com>
Message-ID: <641D1D78-00DE-11D9-B793-000A957BB1F6@redhat.com>

Specifically, if there is /lib/libdlm.so*, remove it.

  brassow

On Sep 6, 2004, at 2:42 AM, Patrick Caulfield wrote:

> On Tue, Aug 31, 2004 at 02:49:13PM -0400, Fredric Isaman wrote:
>> I am trying to set up a simple 3-node cluster (containing iota6-8). I 
>> get
>> up to running clvmd on each node. At this point, iota8 works fine, 
>> all lvm
>> commands work (although with some error messages about lock failures 
>> on
>> the other nodes). However, any attempt to use a lvm command on the 
>> other
>> nodes gives some sort of locking error.  For example:
>>
>> [root at iota6g LVM2]# pvremove /baddev
>>   cluster send request failed: Bad address
>>   Can't get lock for orphan PVs
>
> It looks like the the DLM userspace is out of date with the kernel. 
> Make sure
> everying is up-to-date and that there are no old shared libraries (eg 
> libdlm.so)
> hanging around.
> -- 
>
> patrick
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
>



From brian.marsden at sgc.ox.ac.uk  Tue Sep  7 18:39:29 2004
From: brian.marsden at sgc.ox.ac.uk (Brian Marsden)
Date: Tue, 07 Sep 2004 19:39:29 +0100
Subject: [Linux-cluster] GAS locking up
Message-ID: 

Hi,

 I have two machines, hestia and hroth1 which are running Red Hat
Enterprise Linux 3.0 AS. The two machines are connected via fibrechannel
to the same storage group on a EMC CX300 array. I have compiled GAS
using the latest src.rpm file that is available and the 2.4.21-15 kernel
patches. All works fine on both nodes for a while (locking is fine, no
corruption, manual fencing works if a machine dies) but then I
experience lockups for processes that access any of the mounted GAS
filesystems. It is hard to reproduce reliably and may occur at any time.
Classic examples are ls /scratch (where /scratch is a GAS filesystem) or
even mount or unmount. Once one process has locked up, no other GAS
filesystems or any commands associated with them work. Only a reboot
will solve the problem - restarting lock_gulm does not help (and has
actually given me a kernel panic on one occasion).

 At first I thought that this was a fencing issue, but looking at both
machine's /var/log/messages shows no GAS messages at all (when a machine
crashes and the manual fence is activated, I always see messages telling
me to acknowledge the fence). In addition, gulm_tool shows both nodes to
logged in and show the heartbeat working fine.

 Another interesting behaviour is the delay required to stat the
filesystems for the first time after they are mounted - e.g. running df
-l can take up to 5 seconds per GAS filesystem.

 Has anyone heard of these problems before? As it stands, my current
setup is somewhat unusable(!).

 For reference, my ccsd configuration looks like this:

nodes.ccs:

 nodes {
        hestia {
                ip_interfaces {
                        eth1 = "192.168.1.253"
                }
                fence {
                        human {
                                admin {
                                        ipaddr = "192.168.1.253"
                                }
                        }
                }
        }
        hroth1 {
                ip_interfaces {
                        eth1 = "192.168.1.1"
                }
                fence {
                        human {
                                admin {
                                        ipaddr = "192.168.1.1"
                                }
                        }
                }
        }
}

cluster.ccs:
cluster {
        name = "SAN1"
        lock_gulm {
                servers = ["hestia", "hroth1"]
        }
}

fence.ccs
fence_devices {
        admin {
                agent = "fence_manual"
        }
}

Any advice would be very very gratefully received.

Regards,

Brian Marsden

--
Dr. Brian Marsden                Email: brian.marsden at sgc.ox.ac.uk
Head of Research Informatics
Structural Genomics Consortium
University of Oxford
Botnar Research Centre           Phone: +44 (0)1865 227723 
OX3 7LD, Oxford, UK              Fax:   +44 (0)1865 737231




From brian.marsden at sgc.ox.ac.uk  Tue Sep  7 18:45:47 2004
From: brian.marsden at sgc.ox.ac.uk (Brian Marsden)
Date: Tue, 07 Sep 2004 19:45:47 +0100
Subject: [Linux-cluster] gfs locking up
Message-ID: 

[apologies - it appears that my spell checker got the better of me in my
previous message, let's try again:]

Hi,

 I have two machines, hestia and hroth1 which are running Red Hat
Enterprise Linux 3.0 AS. The two machines are connected via fibrechannel
to the same storage group on a EMC CX300 array. I have compiled gfs
using the latest src.rpm file that is available and the 2.4.21-15 kernel
patches. All works fine on both nodes for a while (locking is fine, no
corruption, manual fencing works if a machine dies) but then I
experience lockups for processes that access any of the mounted gfs
filesystems. It is hard to reproduce reliably and may occur at any time.
Classic examples are ls /scratch (where /scratch is a gfs filesystem) or
even mount or unmount. Once one process has locked up, no other gfs
filesystems or any commands associated with them work. Only a reboot
will solve the problem - restarting lock_gulm does not help (and has
actually given me a kernel panic on one occasion).

 At first I thought that this was a fencing issue, but looking at both
machine's /var/log/messages shows no gfs messages at all (when a machine
crashes and the manual fence is activated, I always see messages telling
me to acknowledge the fence). In addition, gulm_tool shows both nodes to
logged in and show the heartbeat working fine.

 Another interesting behaviour is the delay required to stat the
filesystems for the first time after they are mounted - e.g. running df
-l can take up to 5 seconds per gfs filesystem.

 Has anyone heard of these problems before? As it stands, my current
setup is somewhat unusable(!).

 For reference, my ccsd configuration looks like this:

nodes.ccs:

 nodes {
        hestia {
                ip_interfaces {
                        eth1 = "192.168.1.253"
                }
                fence {
                        human {
                                admin {
                                        ipaddr = "192.168.1.253"
                                }
                        }
                }
        }
        hroth1 {
                ip_interfaces {
                        eth1 = "192.168.1.1"
                }
                fence {
                        human {
                                admin {
                                        ipaddr = "192.168.1.1"
                                }
                        }
                }
        }
}

cluster.ccs:
cluster {
        name = "SAN1"
        lock_gulm {
                servers = ["hestia", "hroth1"]
        }
}

fence.ccs
fence_devices {
        admin {
                agent = "fence_manual"
        }
}

Any advice would be very very gratefully received.

Regards,

Brian Marsden

--
Dr. Brian Marsden                Email: brian.marsden at sgc.ox.ac.uk 
Head of Research Informatics
Structural Genomics Consortium
University of Oxford
Botnar Research Centre           Phone: +44 (0)1865 227723 
OX3 7LD, Oxford, UK              Fax:   +44 (0)1865 737231







From kpreslan at redhat.com  Tue Sep  7 19:10:51 2004
From: kpreslan at redhat.com (Ken Preslan)
Date: Tue, 7 Sep 2004 14:10:51 -0500
Subject: [Linux-cluster] gfs locking up
In-Reply-To: 
References: 
Message-ID: <20040907191051.GA31375@potassium.msp.redhat.com>

On Tue, Sep 07, 2004 at 07:45:47PM +0100, Brian Marsden wrote:
>  Another interesting behaviour is the delay required to stat the
> filesystems for the first time after they are mounted - e.g. running df
> -l can take up to 5 seconds per gfs filesystem.

This part is normal.  GFS tries to speed up df operations by caching
data in the lock LVBs.  The first df after mount is what builds this
cache by reading all the information off of disk.  So, that one will
always be slow.

-- 
Ken Preslan 



From kpreslan at redhat.com  Tue Sep  7 19:43:15 2004
From: kpreslan at redhat.com (Ken Preslan)
Date: Tue, 7 Sep 2004 14:43:15 -0500
Subject: [Linux-cluster] gfs locking up
In-Reply-To: 
References: 
Message-ID: <20040907194315.GB31375@potassium.msp.redhat.com>

On Tue, Sep 07, 2004 at 07:45:47PM +0100, Brian Marsden wrote:
>  I have two machines, hestia and hroth1 which are running Red Hat
> Enterprise Linux 3.0 AS. The two machines are connected via fibrechannel
> to the same storage group on a EMC CX300 array. I have compiled gfs
> using the latest src.rpm file that is available and the 2.4.21-15 kernel
> patches. All works fine on both nodes for a while (locking is fine, no
> corruption, manual fencing works if a machine dies) but then I
> experience lockups for processes that access any of the mounted gfs
> filesystems. It is hard to reproduce reliably and may occur at any time.
> Classic examples are ls /scratch (where /scratch is a gfs filesystem) or
> even mount or unmount. Once one process has locked up, no other gfs
> filesystems or any commands associated with them work. Only a reboot
> will solve the problem - restarting lock_gulm does not help (and has
> actually given me a kernel panic on one occasion).

When a lockup happens, a few things that might be useful in
figuring it out are:  A "ps aux" on both nodes and the output of
"gfs_tool lockdump /mountpoint" on both nodes.

-- 
Ken Preslan 



From ben.m.cahill at intel.com  Wed Sep  8 02:02:33 2004
From: ben.m.cahill at intel.com (Cahill, Ben M)
Date: Tue, 7 Sep 2004 19:02:33 -0700
Subject: [Linux-cluster] [PATCH]Comments in gfs_ondisk.h
Message-ID: <0604335B7764D141945E202153105960033E2552@orsmsx404.amr.corp.intel.com>

Hi all,

Attached please find patch for cluster/gfs-kernel/src/gfs/gfs_ondisk.h.

I've added a lot of comments, and edited some pre-existing comments, to
help briefly document the on-disk layout.  No other changes were made.
I hope this is helpful.

Please let me know if you see anything wrong.

-- Ben --

Opinions are mine, not Intel's
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gfs_ondisk.h.patch
Type: application/octet-stream
Size: 22916 bytes
Desc: gfs_ondisk.h.patch
URL: 

From notiggy at gmail.com  Mon Sep  6 17:30:32 2004
From: notiggy at gmail.com (Brian Jackson)
Date: Mon, 6 Sep 2004 12:30:32 -0500
Subject: [Linux-cluster] GFS cluster components?
In-Reply-To: <413C42F9.8040008@redhat.com>
References: <4135B31D.7050508@framestore-cfc.com>
	
	
	<413C42F9.8040008@redhat.com>
Message-ID: 

On Mon, 06 Sep 2004 16:29:05 +0530, Sankarshan Mukhopadhay
 wrote:
> Brian Jackson wrote:
> 
> [snipped]
> 
> > hardware. At this point, GFS's SAN roots are still there, so I'd
> > suggest a SAN if you really want reliable (although it is possible to
> > build a GFS setup without a SAN). You can also use a regular network
> > (although I'd suggest at least gigabit ethernet) and something like
> > GNBD, iSCSI, etc. to build it. You can also use a firewire drive
> > connected between 2 computers for the really cheap (although the
> > reliability is pretty much gone at this point).
> 
> Are these alternative setups certified or are they workable
> implementations ?

You're the one that works at redhat, you tell me. :)
Seriously though, I'm sure Red Hat probably only supports the SAN
setup, the iscsi setup should be workable (there are companies using
it in production environments), the firewire, as I said, is only good
for testing. But from what I understand all the linux-cluster code
isn't really supported (as in approved for use on RHEL) by Red Hat at
this point anyways.


> 
> 
> >One thing to note is
> > that currently the kernel's software raid layer isn't cluster
> > friendly, so you won't have a way to do data redundancy unless your
> > storage array/etc. is doing it.
> 
> Hey, thanks for this bit of information.
> 
> Regards
> Sankarshan



From mtilstra at redhat.com  Wed Sep  8 15:03:06 2004
From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra)
Date: Wed, 8 Sep 2004 10:03:06 -0500
Subject: [Linux-cluster] Re: cluster depends on tcp_wrappers?
In-Reply-To: <87eklmjcc6.fsf@coraid.com>
References: <87u0ujjlz2.fsf@coraid.com> <20040831165253.GA14574@redhat.com>
	<87r7pnjh8o.fsf@coraid.com>
	<1094040277.21327.321.camel@laza.eunet.yu>
	<87eklmjcc6.fsf@coraid.com>
Message-ID: <20040908150306.GA15566@redhat.com>

On Wed, Sep 01, 2004 at 10:25:13AM -0400, Ed L Cashin wrote:
> +  gulm requires tcp_wrappers
> +  ccsd requires libxml2 and its headers

Sorry for being so slow on this; so many things to do.
I've added it to the gulmusage.txt file; Dave will need to edit the
usage.txt file.

-- 
Michael Conrad Tadpol Tilstra
Wasting time is an important part of life.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: 

From kpreslan at redhat.com  Wed Sep  8 20:39:10 2004
From: kpreslan at redhat.com (Ken Preslan)
Date: Wed, 8 Sep 2004 15:39:10 -0500
Subject: [Linux-cluster] [PATCH]Comments in gfs_ondisk.h
In-Reply-To: <0604335B7764D141945E202153105960033E2552@orsmsx404.amr.corp.intel.com>
References: <0604335B7764D141945E202153105960033E2552@orsmsx404.amr.corp.intel.com>
Message-ID: <20040908203910.GA6453@potassium.msp.redhat.com>

On Tue, Sep 07, 2004 at 07:02:33PM -0700, Cahill, Ben M wrote:
> Hi all,
> 
> Attached please find patch for cluster/gfs-kernel/src/gfs/gfs_ondisk.h.
> 
> I've added a lot of comments, and edited some pre-existing comments, to
> help briefly document the on-disk layout.  No other changes were made.
> I hope this is helpful.

Great, this is really helpful!  Thanks.

> Please let me know if you see anything wrong.

A few things:

o  Dinode stands for "disk inode".  Other UNIXes make the distinction
   between different versions of an inode:
     "dinode" -> ondisk representation of the inode
     "inode" -> the FS-specific representation of the inode
     "vnode" -> the VFS' representation of the inode
   Linux tends to call everything "inode", which can get confusing
   sometimes.

o  The locations of the special inodes are protected by GFS_SB_LOCK (as
   are all the fields in the superblock), but the regular inode locks on
   those inodes protect their contents.  The original comments in the
   code were really wrong.  (I don't think GFS_ROOT_LOCK ever existed.)

o  di_incarn is now unused, the functionality got moved to mh_incarn.

I checked in the fixed version.  Thanks, again.

-- 
Ken Preslan 



From iisaman at citi.umich.edu  Wed Sep  8 21:01:12 2004
From: iisaman at citi.umich.edu (Fredric Isaman)
Date: Wed, 8 Sep 2004 17:01:12 -0400 (EDT)
Subject: [Linux-cluster] Unclean shutdown/restart procedure
In-Reply-To: <20040908160122.2480C758A3@hormel.redhat.com>
References: <20040908160122.2480C758A3@hormel.redhat.com>
Message-ID: 

I believe this is a bug, but it may be I misunderstand how to cleanly
bring up/down a whole cluster.

I have a 3-node cluster, using today's CVS.  I can bring it up fine, mount
a gfs filesystem over iscsi on each node, then shut it down fine, using
the procedures from usage.txt.  However, if I then start up again without
rebooting, when I get to clvmd -d, it increases the active subsystem
count and fails with:

  Unable to create lockspace for CLVM
  Can't initialise cluster interface

and the kernel log shows:

  dlm: Can't bind to port 21064
  dlm: cannot start lowcomms -98

At this point the node that ran clvmd has an active subsystem count of
one.  Shutting down the cluster (using cman_tool leave force) and
restarting it does not change this. Trying to run clvmd in this state
causes the machine to immediately hang with no messages to the log or
console.

	Fred





From bstevens at redhat.com  Thu Sep  9 02:21:23 2004
From: bstevens at redhat.com (Brian Stevens)
Date: Wed, 08 Sep 2004 22:21:23 -0400
Subject: [Linux-cluster] GFS cluster components?
In-Reply-To: 
References: <4135B31D.7050508@framestore-cfc.com>
	
	
	<413C42F9.8040008@redhat.com>
	
Message-ID: <1094696482.4027.32.camel@localhost>

On Mon, 2004-09-06 at 13:30, Brian Jackson wrote:
> On Mon, 06 Sep 2004 16:29:05 +0530, Sankarshan Mukhopadhay
>  wrote:
> > Brian Jackson wrote:
> > 
> > [snipped]
> > 
> > > hardware. At this point, GFS's SAN roots are still there, so I'd
> > > suggest a SAN if you really want reliable (although it is possible to
> > > build a GFS setup without a SAN). You can also use a regular network
> > > (although I'd suggest at least gigabit ethernet) and something like
> > > GNBD, iSCSI, etc. to build it. You can also use a firewire drive
> > > connected between 2 computers for the really cheap (although the
> > > reliability is pretty much gone at this point).
> > 
> > Are these alternative setups certified or are they workable
> > implementations ?
> 
> You're the one that works at redhat, you tell me. :)
> Seriously though, I'm sure Red Hat probably only supports the SAN
> setup, the iscsi setup should be workable (there are companies using
> it in production environments), the firewire, as I said, is only good
> for testing. But from what I understand all the linux-cluster code
> isn't really supported (as in approved for use on RHEL) by Red Hat at
> this point anyways.

This list is an open forum on the linux cluster code base ...
bugs, where it is going, new ideas, technical contributions, etc.
For supported, certified cluster products, go to
redhat.com.

brian



From andriy at druzhba.lviv.ua  Thu Sep  9 07:50:44 2004
From: andriy at druzhba.lviv.ua (Andriy Galetski)
Date: Thu, 9 Sep 2004 10:50:44 +0300
Subject: [Linux-cluster] Cluster node hung ( SM: Assertion failed )
Message-ID: <001301c49641$b6783b70$f13cc90a@druzhba.com>

Hi!
I have a two-node cluster (CL10 and CL20).
Generally it works well.
But sometimes, after rebooting one of the nodes,
I get an unstable situation.
For example, after an unclean shutdown of node CL10,
the remaining node CL20 regains quorum and stays operational until
the other node, CL10, comes back up and starts joining the cluster.
At that moment the CL10 console shows the message ...

kernel: CMAN: no HELLO from cl20, removing from the cluster

On the CL20 console I see ...

SM: Assertion Failed on line 52 of file
/usr/local/src/cluster/cman-kernel/src/sm_misc.c

SM: assertion: "!error"

SM: time 1729980

Kernel panic: SM:
    Records message above and reboot.
...
and CL20 hangs after that.

My config is:

  (cluster.conf XML stripped by the list archive)

__
Thanks for any information.



From laza at yu.net  Thu Sep  9 17:34:16 2004
From: laza at yu.net (Lazar Obradovic)
Date: Thu, 09 Sep 2004 19:34:16 +0200
Subject: [Linux-cluster] fence/agent/ibmblade udp port patch
Message-ID: <1094751255.569.182.camel@laza.eunet.yu>

Hi, 

this is a patch to allow fence_ibmblade to use a UDP port other than
standard SNMP (udp/161).

The main reason for this is that the IBM BladeCenter MM supports only 3 hosts
(not host groups) per community, and only 3 communities, which puts a
limit of a maximum of 9 nodes in a cluster.

This is somewhat inconvenient, so, as a workaround, one can install a
UDP forwarder on some node(s) (preferably outside the cluster) and use only
its address in the IBM Blade MM configuration. The port forwarder will probably
have to use some port other than standard SNMP, so as not to block SNMP access
to the "relay" node, and that's what this patch is all about.

-- 
Lazar Obradovic, System Engineer
----- 
laza at YU.net
YUnet International http://www.EUnet.yu
Dubrovacka 35/III, 11000 Belgrade
Tel: +381 11 3119901; Fax: +381 11 3119901
-----
This e-mail is confidential and intended only for the recipient.
Unauthorized distribution, modification or disclosure of its
contents is prohibited. If you have received this e-mail in error,
please notify the sender by telephone +381 11 3119901.
-----
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fence_ibmblade-udpport.diff
Type: text/x-patch
Size: 1610 bytes
Desc: not available
URL: 

From andriy at druzhba.lviv.ua  Fri Sep 10 09:27:22 2004
From: andriy at druzhba.lviv.ua (Andriy Galetski)
Date: Fri, 10 Sep 2004 12:27:22 +0300
Subject: [Linux-cluster] dlm_resdir(s)': Can't free all objects
References: <001301c49641$b6783b70$f13cc90a@druzhba.com>
Message-ID: <005f01c49718$60920c90$f13cc90a@druzhba.com>

Hi!
Can anyone explain this?
I have a two-node cluster running.
When I remove the GFS modules from the kernel on one node, I get the following messages:

Sep 10 11:53:26 cl20 gfs: Umount GFS filesystems:  succeeded
Sep 10 11:53:27 cl20 kernel: CMAN: we are leaving the cluster
Sep 10 11:53:28 cl20 gfs: ccsd shutdown succeeded
Sep 10 11:53:28 cl20 kernel: slab error in kmem_cache_destroy(): cache
`dlm_resdir(s)': Can't free all objects
Sep 10 11:53:28 cl20 kernel:  [] kmem_cache_destroy+0xd0/0x10e
Sep 10 11:53:28 cl20 kernel:  [] dlm_memory_exit+0x37/0x55 [dlm]
Sep 10 11:53:28 cl20 kernel:  [] cleanup_module+0x19/0x26 [dlm]
Sep 10 11:53:28 cl20 kernel:  [] cman_callback+0x0/0x38 [dlm]
Sep 10 11:53:28 cl20 kernel:  [] sys_delete_module+0x15a/0x193
Sep 10 11:53:28 cl20 kernel:  [] do_munmap+0x134/0x184
Sep 10 11:53:28 cl20 kernel:  [] sys_munmap+0x45/0x66
Sep 10 11:53:28 cl20 kernel:  [] sysenter_past_esp+0x52/0x71

What is wrong?

Thanks.



From pcaulfie at redhat.com  Fri Sep 10 09:40:39 2004
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Fri, 10 Sep 2004 10:40:39 +0100
Subject: [Linux-cluster] Unclean shutdown/restart procedure
In-Reply-To: 
References: <20040908160122.2480C758A3@hormel.redhat.com>
	
Message-ID: <20040910094038.GA7345@tykepenguin.com>

On Wed, Sep 08, 2004 at 05:01:12PM -0400, Fredric Isaman wrote:
> I believe this is a bug, but it may be I misunderstand how to cleanly
> bring up/down a whole cluster.
> 
> I have a 3-node cluster, using today's CVS.  I can bring it up fine, mount
> a gfs filesystem over iscsi on each node, then shut it down fine, using
> the procedures from useage.txt.  However, if I then startup again without
> rebooting, when I get to clvmd -d , it will increase the active subsystem
> count, and fail with:
> 
>   Unable to create lockspace for CLVM
>   Can't initialise cluster interface
> 
> and the kernel log shows:
> 
>   dlm: Can't bind to port 21064
>   dlm: cannot start lowcomms -98
> 
> At this point the node that ran clvmd has an active subsystem count of
> one.  Shutting down the cluster (using cman_tool leave force) and
> restarting it does not change this. Trying to run clvmd in this state
> causes the machine to immediately hang with no messages to the log or
> console.

I think I've fixed the refcounting bug now. The "Can't bind" error is a
nuisance. If you shut the DLM down, depending on the state of the sockets, you
have to wait until all the other nodes have also shut down their connections to
this node before bringing it back up again.

patrick



From ecashin at coraid.com  Fri Sep 10 13:52:15 2004
From: ecashin at coraid.com (Ed L Cashin)
Date: Fri, 10 Sep 2004 09:52:15 -0400
Subject: [Linux-cluster] GFS limits: fs size, etc.
Message-ID: <87fz5qw7sg.fsf@coraid.com>

Hi.  I'm trying to list all of the GFS limits that I can find.  The
one I'm most interested in is max filesystem size.  Here are the ones
I've found so far.

Is the theoretical limit on filesystem size 2^64 bytes or is there a
practical constraint?  I couldn't find that.

filesystem size:

  ?

file size: gfs_dinode has u64 di_size member, so

  16777216 TB is theoretical max file size

block size:

  gfs_mkfs manpage says min block size 512 bytes, max block size is
  the page size for the host

  Also, according to a comment at the top of gfs_ondisk.h, the
  de_rec_len member of gfs_dirent limits the FS block size to 64KB.

journal size:

  gfs_mkfs manpage says min is 32MB  

extended attributes:

  max name length: 255 bytes
  max data length: 65535 bytes



-- 
  Ed L Cashin 



From amanthei at redhat.com  Fri Sep 10 15:51:37 2004
From: amanthei at redhat.com (Adam Manthei)
Date: Fri, 10 Sep 2004 10:51:37 -0500
Subject: [Linux-cluster] fence/agent/ibmblade udp port patch
In-Reply-To: <1094751255.569.182.camel@laza.eunet.yu>
References: <1094751255.569.182.camel@laza.eunet.yu>
Message-ID: <20040910155137.GA21526@redhat.com>

Patch applied.  It would be helpful if you could update the man page
as well, including an example of how you would set up the port
forwarder.

On Thu, Sep 09, 2004 at 07:34:16PM +0200, Lazar Obradovic wrote:
> Hi, 
> 
> this is a patch to allow fence_ibmblade to use udp port other than
> standard snmp (udp/161). 
> 
> Main reason for this is that IBM BladeCenter MM supports only 3 hosts
> and not hostgroups per community, and only 3 communities, which puts a
> limit to maximum of 9 nodes in a cluster. 
> 
> This is somewhat inconvinient, so, as a workaround, one can install a
> udp forwarder on some node(s) (preferably outside cluster) and use only
> its address in IBM BladeMM configuration. Port Forwarder will probably
> have to use some other port that standard snmp, not to block snmp access
> of "relay" node, so that's what this patch is all about. 
> 
> -- 
> Lazar Obradovic, System Engineer
> ----- 
> laza at YU.net
> YUnet International http://www.EUnet.yu
> Dubrovacka 35/III, 11000 Belgrade
> Tel: +381 11 3119901; Fax: +381 11 3119901
> -----
> This e-mail is confidential and intended only for the recipient.
> Unauthorized distribution, modification or disclosure of its
> contents is prohibited. If you have received this e-mail in error,
> please notify the sender by telephone +381 11 3119901.
> -----


> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster


-- 
Adam Manthei  



From kpreslan at redhat.com  Fri Sep 10 18:05:03 2004
From: kpreslan at redhat.com (Ken Preslan)
Date: Fri, 10 Sep 2004 13:05:03 -0500
Subject: [Linux-cluster] GFS limits: fs size, etc.
In-Reply-To: <87fz5qw7sg.fsf@coraid.com>
References: <87fz5qw7sg.fsf@coraid.com>
Message-ID: <20040910180503.GA19202@potassium.msp.redhat.com>

On Fri, Sep 10, 2004 at 09:52:15AM -0400, Ed L Cashin wrote:
> Hi.  I'm trying to list all of the GFS limits that I can find.  The
> one I'm most interested in is max filesystem size.  Here are the ones
> I've found so far.
> 
> Is the theoretical limit on filesystem size 2^64 bytes or is there a
> practical constraint?  I couldn't find that.
> 
> filesystem size:

For Linux 2.4, the max filesystem size is 2TB if you trust the sign bit,
1TB if you don't.  This limit comes from a 32-bit sector address in the
"struct buffer_head":  2^32 * 512 bytes/sector = 2TB.

For Linux 2.6 on a 32-bit platform, the max filesystem size is 16TB if
you trust the sign bit, 8TB if you don't.  This limit comes from the
32-bit page index in the "struct page": 2^32 * 4096 bytes/page = 16TB.

For Linux 2.6 on a 64-bit platform, the max filesystem size is *big*.
Something around 2^64 bytes.

> file size: gfs_dinode has u64 di_size member, so
> 
>   16777216 TB is theoretical max file size

That's correct on 64-bit filesystems.

On a 32-bit system, you're limited by the 32-bit page index again.  The
biggest regular file you can have is 16TB/8TB.  Journaled files don't
go through the generic Linux file code, so they can go up to 2^64 bytes.
(Assuming the file is hole-y -- a 32-bit platform can't support a
filesystem big enough to hold all the data.)

> block size:
> 
>   gfs_mkfs manpage says min block size 512 bytes, max block size is
>   the page size for the host
> 
>   Also, according to a comment at the top of gfs_ondisk.h, the
>   de_rec_len member of gfs_dirent limits the FS block size to 64KB.
>
> jounal size:
> 
>   gfs_mkfs manpage says min is 32MB  
> 
> extended attributes:
> 
>   max name length: 255 bytes
>   max data length: 65535 bytes
> 
> 
> 
> -- 
>   Ed L Cashin 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster

-- 
Ken Preslan 



From lmb at suse.de  Fri Sep 10 17:51:29 2004
From: lmb at suse.de (Lars Marowsky-Bree)
Date: Fri, 10 Sep 2004 19:51:29 +0200
Subject: [Linux-cluster] New virtual synchrony API for the kernel: was Re:
	[Openais] New API in openais
In-Reply-To: <1094104992.5515.47.camel@persist.az.mvista.com>
References: <1093941076.3613.14.camel@persist.az.mvista.com>
	<1093973757.5933.56.camel@cherrybomb.pdx.osdl.net>
	<1093981842.3613.42.camel@persist.az.mvista.com>
	<200409011115.45780.phillips@redhat.com>
	<1094104992.5515.47.camel@persist.az.mvista.com>
Message-ID: <20040910175129.GT7359@marowsky-bree.de>

On 2004-09-01T23:03:12,
   Steven Dake  said:

I've been pretty busy the last couple of days, so please bear with me
for my late reply.

A virtual synchrony group messaging component would certainly be
immensely helpful. As it pretty strongly ties to membership events (as
you very correctly point out), I do think we need to review the APIs
here.

Could you post some sample code and how / where you'd propose to merge
it in?

Also, again, I'm not sure this needs to be in the kernel. Do you have
upper bounds of the memory consumption? Would the speed really benefit
from being in the kernel?

OTOH, all other networking protocols such as TCP, SCTP or even IP/Sec
live in kernel space, so clearly there's prior evidence of this being a
reasonable idea.


Sincerely,
    Lars Marowsky-Brée 

-- 
High Availability & Clustering	   \\\  /// 
SUSE Labs, Research and Development \honk/ 
SUSE LINUX AG - A Novell company     \\// 



From sdake at mvista.com  Fri Sep 10 20:53:15 2004
From: sdake at mvista.com (Steven Dake)
Date: Fri, 10 Sep 2004 13:53:15 -0700
Subject: [Linux-cluster] New virtual synchrony API for the kernel: was
	Re: [Openais] New API in openais
In-Reply-To: <20040910175129.GT7359@marowsky-bree.de>
References: <1093941076.3613.14.camel@persist.az.mvista.com>
	<1093973757.5933.56.camel@cherrybomb.pdx.osdl.net>
	<1093981842.3613.42.camel@persist.az.mvista.com>
	<200409011115.45780.phillips@redhat.com>
	<1094104992.5515.47.camel@persist.az.mvista.com>
	<20040910175129.GT7359@marowsky-bree.de>
Message-ID: <1094849594.23862.184.camel@persist.az.mvista.com>

On Fri, 2004-09-10 at 10:51, Lars Marowsky-Bree wrote:
> On 2004-09-01T23:03:12,
>    Steven Dake  said:
> 
> I've been pretty busy the last couple of days, so please bear with me
> for my late reply.
> 
> A virtual synchrony group messaging component would certainly be
> immensely helpful. As it pretty strongly ties to membership events (as
> you very correctly point out), I do think we need to review the APIs
> here.
> 
> Could you post some sample code and how / where you'd propose to merge
> it in?
> 

The virtual synchrony APIs I propose we start with can be reviewed at:

http://developer.osdl.org/dev/openais/htmldocs/index.html

There is a sample program using these interfaces at:

http://developer.osdl.org/dev/openais/src/test/testevs.c

> Also, again, I'm not sure this needs to be in the kernel. Do you have
> upper bounds of the memory consumption? Would the speed really benefit
> from being in the kernel?
> 
Right now, the entire openais project with all the services it provides
consumes 2MB of RAM at idle.  I'd expect under load it's about 3MB.  The
group messaging protocol portion of that uses perhaps 1MB of RAM.  It
will reject new messages if its buffers are full, so it cannot grow
wildly.

You're definitely correct; a virtual synchrony protocol doesn't absolutely
have to be in the kernel.  In fact, it is implemented today completely
in userland with 8.5MB/sec throughput to large groups with encryption
and authentication.

You could gain some network performance by getting rid of the UDP header
and going straight over IP, but that is only 8 bytes.

There is little performance gain in using the kernel (as basically, a
kernel thread/process would have to be created to operate the protocol).

The only point of a kernel virtual synchrony API is to standardize on
one set of messaging APIs for the kernel projects that require messaging
(and at a higher level, distributed locks, fencing, etc).

What we want to avoid is two separate messaging protocols operating at the
same time (a performance drain).

We also want to choose wisely the messaging model and protocol we use. 
If we don't, we could have problems later.

> OTOH, all other networking protocols such as TCP, SCTP or even IP/Sec
> live in kernel space, so clearly there's prior evidence of this being a
> reasonable idea.
> 
> 
> Sincerely,
>     Lars Marowsky-Brée 



From ecashin at coraid.com  Mon Sep 13 15:46:55 2004
From: ecashin at coraid.com (Ed L Cashin)
Date: Mon, 13 Sep 2004 11:46:55 -0400
Subject: [Linux-cluster] Re: GFS limits: fs size, etc.
References: <87fz5qw7sg.fsf@coraid.com>
	<20040910180503.GA19202@potassium.msp.redhat.com>
Message-ID: <87wtyytbm8.fsf@coraid.com>

Ken Preslan  writes:

...
> For Linux 2.6 on a 32-bit platform, the max filesystem size is 16TB if
> you trust the sign bit, 8TB if you don't.  This limit comes from the
> 32-bit page index in the "struct page": 2^32 * 4096 bytes/page = 16TB.
>
> For Linux 2.6 on a 64-bit platform, the max filesystem size is *big*.
> Something around 2^64 bytes.

I think I'm going to get an AMD64 machine and try a 20TB GFS fs.  

If anybody's already doing this, I'd like to hear about it, esp. any
gotchas.

-- 
  Ed L Cashin 



From Axel.Thimm at ATrpms.net  Mon Sep 13 16:15:08 2004
From: Axel.Thimm at ATrpms.net (Axel Thimm)
Date: Mon, 13 Sep 2004 18:15:08 +0200
Subject: [Linux-cluster] 32bits vs 64bits (was: GFS limits: fs size, etc.)
In-Reply-To: <20040910180503.GA19202@potassium.msp.redhat.com>
References: <87fz5qw7sg.fsf@coraid.com>
	<20040910180503.GA19202@potassium.msp.redhat.com>
Message-ID: <20040913161508.GB13209@neu.physik.fu-berlin.de>

On Fri, Sep 10, 2004 at 01:05:03PM -0500, Ken Preslan wrote:
> For Linux 2.6 on a 32-bit platform, the max filesystem size is 16TB if
> you trust the sign bit, 8TB if you don't.  This limit comes from the
> 32-bit page index in the "struct page": 2^32 * 4096 bytes/page = 16TB.
> 
> For Linux 2.6 on a 64-bit platform, the max filesystem size is *big*.
> Something around 2^64 bytes.

Does that mean you need a cluster whose members all have the same word
size?

One certainly cannot add a 32-bit member to a cluster built around a
64-bit-only, >16TB filesystem. What about smaller filesystems? Would
32-bit and 64-bit members work nicely together, or are there more
barriers?
-- 
Axel.Thimm at ATrpms.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: 

From kpreslan at redhat.com  Mon Sep 13 16:24:32 2004
From: kpreslan at redhat.com (Ken Preslan)
Date: Mon, 13 Sep 2004 11:24:32 -0500
Subject: [Linux-cluster] 32bits vs 64bits (was: GFS limits: fs size, etc.)
In-Reply-To: <20040913161508.GB13209@neu.physik.fu-berlin.de>
References: <87fz5qw7sg.fsf@coraid.com>
	<20040910180503.GA19202@potassium.msp.redhat.com>
	<20040913161508.GB13209@neu.physik.fu-berlin.de>
Message-ID: <20040913162431.GA13609@potassium.msp.redhat.com>

On Mon, Sep 13, 2004 at 06:15:08PM +0200, Axel Thimm wrote:
> On Fri, Sep 10, 2004 at 01:05:03PM -0500, Ken Preslan wrote:
> > For Linux 2.6 on a 32-bit platform, the max filesystem size is 16TB if
> > you trust the sign bit, 8TB if you don't.  This limit comes from the
> > 32-bit page index in the "struct page": 2^32 * 4096 bytes/page = 16TB.
> > 
> > For Linux 2.6 on a 64-bit platform, the max filesystem size is *big*.
> > Something around 2^64 bytes.
> 
> That means that you need to have a cluster of equal-bit-arch members?
>
> One can certainly not add a 32-bit cluster member to a 64-bit > 16TB
> crafted cluster. What about smaller sized filesystems? Would 32bits
> and 64bits work nicely together, or are there more barriers?

You can happily mix 32-bit and 64-bit machines.  As you said, 32-bit
machines shouldn't access bigger filesystems.  But you can have a mixed
cluster with the 32-bit machines mounting only the smaller filesystems
and the 64-bit machines mounting anything they want.
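
To spell out the arithmetic behind the limit quoted above, a couple of
lines of C reproduce the 16TB and 8TB figures (this is just the
page-index math, nothing GFS-specific):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t page_size  = 4096;                   /* i386 page size */
	uint64_t full_index = (uint64_t)1 << 32;      /* 32-bit page index */
	uint64_t safe_index = (uint64_t)1 << 31;      /* without the sign bit */

	printf("trusting the sign bit:     %llu TB\n",
	       (unsigned long long)((full_index * page_size) >> 40));
	printf("not trusting the sign bit: %llu TB\n",
	       (unsigned long long)((safe_index * page_size) >> 40));
	return 0;
}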

-- 
Ken Preslan 



From Axel.Thimm at ATrpms.net  Mon Sep 13 16:55:59 2004
From: Axel.Thimm at ATrpms.net (Axel Thimm)
Date: Mon, 13 Sep 2004 18:55:59 +0200
Subject: [Linux-cluster] Re: 32bits vs 64bits (was: GFS limits: fs size,
	etc.)
In-Reply-To: <20040913162431.GA13609@potassium.msp.redhat.com>
References: <87fz5qw7sg.fsf@coraid.com>
	<20040910180503.GA19202@potassium.msp.redhat.com>
	<20040913161508.GB13209@neu.physik.fu-berlin.de>
	<20040913162431.GA13609@potassium.msp.redhat.com>
Message-ID: <20040913165559.GD13209@neu.physik.fu-berlin.de>

On Mon, Sep 13, 2004 at 11:24:32AM -0500, Ken Preslan wrote:
> On Mon, Sep 13, 2004 at 06:15:08PM +0200, Axel Thimm wrote:
> > On Fri, Sep 10, 2004 at 01:05:03PM -0500, Ken Preslan wrote:
> > > For Linux 2.6 on a 32-bit platform, the max filesystem size is 16TB if
> > > you trust the sign bit, 8TB if you don't.  This limit comes from the
> > > 32-bit page index in the "struct page": 2^32 * 4096 bytes/page = 16TB.
> > > 
> > > For Linux 2.6 on a 64-bit platform, the max filesystem size is *big*.
> > > Something around 2^64 bytes.
> > 
> > That means that you need to have a cluster of equal-bit-arch members?
> >
> > One can certainly not add a 32-bit cluster member to a 64-bit > 16TB
> > crafted cluster. What about smaller sized filesystems? Would 32bits
> > and 64bits work nicely together, or are there more barriers?
> 
> You can happily mix 32-bit and 64-bit machines.  As you said, 32-bit
> machines shouldn't access bigger filesystems.  But, you can have a mixed
> cluster with the 32-bit machines mounting only smaller the filesystems
> and the 64-bit machines mounting anything they want.

That's good news, thank you. :)

Does this mean the on-disk-format is independent of the machine word
size?

Just out of interest, what will happen, if a 32bit cluster member
tries to join/mount a too-large fs? Will the operation fail or will
there be silent data corruption?

(My background is that I am testing GFS/cvs under x86_64, but later
most cluster members will be ia32.)

Thanks!
-- 
Axel.Thimm at ATrpms.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: 

From anton at hq.310.ru  Tue Sep 14 08:00:16 2004
From: anton at hq.310.ru (=?Windows-1251?B?wO3y7u0gzeX17vDu+Oj1?=)
Date: Tue, 14 Sep 2004 12:00:16 +0400
Subject: [Linux-cluster] Assertion failed on line 1281 of file
	/usr/src/cluster/gfs-kernel/src/gfs/glock.c
Message-ID: <318981499.20040914120016@hq.310.ru>

Hi all,

I updated the sources from CVS today.
After compile, install, and reboot I see in the messages:

Sep 14 11:04:29 c5 kernel: GFS: fsid=310farm:gfs01.2: Joined cluster. Now mounting FS...
Sep 14 11:04:29 c5 kernel:
Sep 14 11:04:29 c5 kernel: GFS: Assertion failed on line 1281 of file /usr/src/cluster/gfs-kernel/src/gfs/glock.c
Sep 14 11:04:29 c5 kernel: GFS: assertion: "(tmp_gh->gh_flags & GL_LOCAL_EXCL) || !(gh->gh_flags & GL_LOCAL_EXCL)"
Sep 14 11:04:29 c5 kernel: GFS: time = 1095145469
Sep 14 11:04:29 c5 kernel: GFS: fsid=310farm:gfs01.2: glock = (4, 0)
Sep 14 11:04:29 c5 kernel:
Sep 14 11:04:29 c5 kernel: Kernel panic: GFS: Record message above and reboot.
Sep 14 11:04:29 c5 kernel:

What could the problem be?

-- 
e-mail: anton at hq.310.ru




From stephane.messerli at urbanet.ch  Tue Sep 14 11:16:59 2004
From: stephane.messerli at urbanet.ch (=?iso-8859-1?Q?St=E9phane_Messerli?=)
Date: Tue, 14 Sep 2004 13:16:59 +0200
Subject: [Linux-cluster] lock_dlm - unable to handle kernel NULL pointer
	dereference
Message-ID: <200409141115.i8EBF7ne015989@smtp.hispeed.ch>

Hi,

We have a cluster of two rh 2.6.7 smp machines using gfs and we experience
random stability issues.
Every 2 days or so, a lock_dlm error message is dumped to the log (see
below), and either both machines are unable to access the gfs file system
(hanging on ls, df, ...), or a random process that was accessing a file is
hanging on one of the machines (always a different process, can be tar, gzip,
mv, ...) and cannot be terminated.
At this point the only thing we can do is reboot both nodes.

We haven't found a way to reproduce this problem, it seems to happen
randomly.

We have done the following to try to eliminate the problem (without success
or improvement):

- Shutdown machine A and run all services on machine B
- Shutdown machine B and run all services on machine A
- Disable heavy I/O on both machines (mainly full daily backups)

The error message is the following:

------

Sep 13 15:05:43 L1_OAS56_B kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000005
Sep 13 15:05:43 L1_OAS56_B kernel:  printing eip:
Sep 13 15:05:43 L1_OAS56_B kernel: c013a1f6
Sep 13 15:05:43 L1_OAS56_B kernel: *pde = 17aea001
Sep 13 15:05:43 L1_OAS56_B kernel: Oops: 0002 [#1]
Sep 13 15:05:43 L1_OAS56_B kernel: SMP
Sep 13 15:05:43 L1_OAS56_B kernel: Modules linked in: nfsd exportfs ipv6 autofs e1000 af_packet parport_pc parport ohci_hcd ehci_hcd lock_dlm dlm cman gfs lock_harness dm_mod floppy uhci_hcd usbcore thermal processor fan button battery asus_acpi ac ext3 jbd loop ide_cd cdrom qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod i2o_block i2o_core
Sep 13 15:05:43 L1_OAS56_B kernel: CPU: 2
Sep 13 15:05:43 L1_OAS56_B kernel: EIP: 0060:[] Not tainted
Sep 13 15:05:43 L1_OAS56_B kernel: EFLAGS: 00010083 (2.6.7)
Sep 13 15:05:43 L1_OAS56_B kernel: EIP is at find_get_pages+0x41/0x5a
Sep 13 15:05:43 L1_OAS56_B kernel: eax: 00000001 ebx: d6d2de4c ecx: 00000010 edx: 00000004
Sep 13 15:05:43 L1_OAS56_B kernel: esi: f274a724 edi: e00f2240 ebp: d6d2ddfc esp: d6d2dde4
Sep 13 15:05:43 L1_OAS56_B kernel: ds: 007b es: 007b ss: 0068
Sep 13 15:05:43 L1_OAS56_B kernel: Process lock_dlm (pid: 1575, threadinfo=d6d2c000 task=f7b945c0)
Sep 13 15:05:43 L1_OAS56_B kernel: Stack: f274a728 d6d2de4c 00000000 00000010 d6d2de44 f274a724 d6d2de18 c01441ed
Sep 13 15:05:43 L1_OAS56_B kernel:        f274a724 00000000 00000010 d6d2de4c 00000000 d6d2dea0 c01444d0 d6d2de44
Sep 13 15:05:43 L1_OAS56_B kernel:        f274a724 00000000 00000010 c3207870 00000000 d6d2c000 00000000 00000000
Sep 13 15:05:43 L1_OAS56_B kernel: Call Trace:
Sep 13 15:05:43 L1_OAS56_B kernel:  [] show_stack+0x80/0x96
Sep 13 15:05:43 L1_OAS56_B kernel:  [] show_registers+0x15f/0x1ae
Sep 13 15:05:43 L1_OAS56_B kernel:  [] die+0x8d/0xfb
Sep 13 15:05:43 L1_OAS56_B kernel:  [] do_page_fault+0x270/0x579
Sep 13 15:05:43 L1_OAS56_B kernel:  [] error_code+0x2d/0x38
Sep 13 15:05:43 L1_OAS56_B kernel:  [] pagevec_lookup+0x2c/0x35
Sep 13 15:05:43 L1_OAS56_B kernel:  [] truncate_inode_pages+0x71/0x29f
Sep 13 15:05:43 L1_OAS56_B kernel:  [] gfs_inval_buf+0x45/0x88 [gfs]
Sep 13 15:05:43 L1_OAS56_B kernel:  [] inode_go_inval+0x45/0x4f [gfs]
Sep 13 15:05:43 L1_OAS56_B kernel:  [] drop_bh+0x15f/0x1d6 [gfs]
Sep 13 15:05:43 L1_OAS56_B kernel:  [] gfs_glock_cb+0x167/0x1f4 [gfs]
Sep 13 15:05:43 L1_OAS56_B kernel:  [] process_complete+0x103/0x34c [lock_dlm]
Sep 13 15:05:43 L1_OAS56_B kernel:  [] dlm_async+0x1cb/0x290 [lock_dlm]
Sep 13 15:05:43 L1_OAS56_B kernel:  [] kernel_thread_helper+0x5/0xb
Sep 13 15:05:43 L1_OAS56_B kernel:
Sep 13 15:05:43 L1_OAS56_B kernel: Code: f0 ff 40 04 83 c2 01 39 ca 72 f2 c6 46 10 01 fb 83 c4 10 5b

------

Any idea of what's wrong or what we should check next?
Is it possible to "unlock" the machines after such an error without reboot?

The release version is DEVEL.1090589850.

Thanks for your help,

Stéphane Messerli

stephane.messerli at urbanet.ch
Senior Support & Project Engineer, Technology Europe
24/7 Real Media (NASDAQ: TFSM)
Route de la Pierre
1024 Ecublens
Switzerland
tel. +41 21 695 97 46
fax +41 21 695 97 01





From smesserli at realmedia.com  Tue Sep 14 09:22:55 2004
From: smesserli at realmedia.com (=?iso-8859-1?Q?St=E9phane_Messerli?=)
Date: Tue, 14 Sep 2004 11:22:55 +0200
Subject: [Linux-cluster] (no subject)
Message-ID: <058iiNJwL0171M36@cmsapps01.cms.usa.net>

Hi,

We have a cluster of two rh 2.6.7 smp machines using gfs and we experience
random stability issues.

Every 2 days or so, a lock_dlm error message is dumped to the log (see
below).
At this point, either both machines are unable to access the gfs file system
(hanging on ls, df, ...), or a random process that was accessing a file is
hanging on one of the machines (always a different process, can be tar, gzip,
mv, ...) and cannot be terminated.
At this point the only thing we can do is reboot both nodes.

We haven't found a way to reproduce this problem, it seems to happen
randomly.
We have done the following to try to eliminate the problem (without success
or improvement):

 - Shutdown machine A and run all services on machine B
 - Shutdown machine B and run all services on machine A
 - Disable heavy I/O on both machines (mainly full daily backups)

The error message is the following:

------

Sep 13 15:05:43 L1_OAS56_B kernel: Unable to handle kernel NULL pointer
dereference at virtual address 00000005
Sep 13 15:05:43 L1_OAS56_B kernel:  printing eip:
Sep 13 15:05:43 L1_OAS56_B kernel: c013a1f6
Sep 13 15:05:43 L1_OAS56_B kernel: *pde = 17aea001
Sep 13 15:05:43 L1_OAS56_B kernel: Oops: 0002 [#1]
Sep 13 15:05:43 L1_OAS56_B kernel: SMP
Sep 13 15:05:43 L1_OAS56_B kernel: Modules linked in: nfsd exportfs ipv6
autofs e1000 af_packet parport_pc parport ohci_hcd ehci_hcd lock_dlm dlm
cman gfs lock_harness dm_mod floppy uhci_hcd usbcore thermal processor fan
button battery asus_acpi ac ext3 jbd loop ide_cd cdrom qla2300 qla2xxx
scsi_transport_fc sd_mod scsi_mod i2o_block i2o_core
Sep 13 15:05:43 L1_OAS56_B kernel: CPU:    2
Sep 13 15:05:43 L1_OAS56_B kernel: EIP:    0060:[]    Not tainted
Sep 13 15:05:43 L1_OAS56_B kernel: EFLAGS: 00010083   (2.6.7)
Sep 13 15:05:43 L1_OAS56_B kernel: EIP is at find_get_pages+0x41/0x5a
Sep 13 15:05:43 L1_OAS56_B kernel: eax: 00000001   ebx: d6d2de4c   ecx:
00000010   edx: 00000004
Sep 13 15:05:43 L1_OAS56_B kernel: esi: f274a724   edi: e00f2240   ebp:
d6d2ddfc   esp: d6d2dde4
Sep 13 15:05:43 L1_OAS56_B kernel: ds: 007b   es: 007b   ss: 0068
Sep 13 15:05:43 L1_OAS56_B kernel: Process lock_dlm (pid: 1575,
threadinfo=d6d2c000 task=f7b945c0)
Sep 13 15:05:43 L1_OAS56_B kernel: Stack: f274a728 d6d2de4c 00000000
00000010 d6d2de44 f274a724 d6d2de18 c01441ed
Sep 13 15:05:43 L1_OAS56_B kernel:        f274a724 00000000 00000010
d6d2de4c 00000000 d6d2dea0 c01444d0 d6d2de44
Sep 13 15:05:43 L1_OAS56_B kernel:        f274a724 00000000 00000010
c3207870 00000000 d6d2c000 00000000 00000000
Sep 13 15:05:43 L1_OAS56_B kernel: Call Trace:
Sep 13 15:05:43 L1_OAS56_B kernel:  [] show_stack+0x80/0x96
Sep 13 15:05:43 L1_OAS56_B kernel:  [] show_registers+0x15f/0x1ae
Sep 13 15:05:43 L1_OAS56_B kernel:  [] die+0x8d/0xfb
Sep 13 15:05:43 L1_OAS56_B kernel:  [] do_page_fault+0x270/0x579
Sep 13 15:05:43 L1_OAS56_B kernel:  [] error_code+0x2d/0x38
Sep 13 15:05:43 L1_OAS56_B kernel:  [] pagevec_lookup+0x2c/0x35
Sep 13 15:05:43 L1_OAS56_B kernel:  []
truncate_inode_pages+0x71/0x29f
Sep 13 15:05:43 L1_OAS56_B kernel:  [] gfs_inval_buf+0x45/0x88
[gfs]
Sep 13 15:05:43 L1_OAS56_B kernel:  [] inode_go_inval+0x45/0x4f
[gfs]
Sep 13 15:05:43 L1_OAS56_B kernel:  [] drop_bh+0x15f/0x1d6 [gfs]
Sep 13 15:05:43 L1_OAS56_B kernel:  [] gfs_glock_cb+0x167/0x1f4
[gfs]
Sep 13 15:05:43 L1_OAS56_B kernel:  []
process_complete+0x103/0x34c [lock_dlm]
Sep 13 15:05:43 L1_OAS56_B kernel:  [] dlm_async+0x1cb/0x290
[lock_dlm]
Sep 13 15:05:43 L1_OAS56_B kernel:  []
kernel_thread_helper+0x5/0xb
Sep 13 15:05:43 L1_OAS56_B kernel:
Sep 13 15:05:43 L1_OAS56_B kernel: Code: f0 ff 40 04 83 c2 01 39 ca 72 f2 c6
46 10 01 fb 83 c4 10 5b

------

Any idea of what's wrong or what we should check next?
Is it possible to "unlock" the machines after such an error without reboot?

The release version is DEVEL.1090589850.

Thanks for your help,

Stéphane Messerli
Senior Support & Project Engineer, Technology Europe
smesserli at realmedia.com

24/7 Real Media (NASDAQ: TFSM)
Route de la Pierre
1024 Ecublens
Switzerland

tel. +41 21 695 97 46
fax +41 21 695 97 01
	





From laza at yu.net  Wed Sep 15 11:06:32 2004
From: laza at yu.net (Lazar Obradovic)
Date: Wed, 15 Sep 2004 13:06:32 +0200
Subject: [Linux-cluster] Directory lockups?
Message-ID: <1095242945.12259.130.camel@laza.eunet.yu>

Hello, 

I have been receiving these messages lately: 

Sep 15 08:16:02 test01 kernel: dlm: mailbox: dir entry exists 6bd5037b fr 5 r 0        7         60300b6
Sep 15 08:53:29 test01 kernel: dlm: mailbox: dir entry exists abef0134 fr 3 r 0        7         7382d53
Sep 15 10:42:18 test01 kernel: dlm: mailbox: dir entry exists d00d7 fr 4 r 0        7         52565b8
Sep 15 10:42:18 test01 kernel: dlm: mailbox: dir entry exists f0012 fr 4 r 0        7         52565b8
Sep 15 10:46:00 test01 kernel: dlm: mailbox: dir entry exists f0302 fr 6 r 0        7         5356fea
Sep 15 11:10:32 test01 kernel: dlm: mailbox: dir entry exists 420026 fr 4 r 0        7         2ef1081
Sep 15 11:48:01 test01 kernel: dlm: locks dir entry exists 30282 fr 5 r 0        7         56a3afd
-- 
Lazar Obradovic, System Engineer
----- 
laza at YU.net
YUnet International http://www.EUnet.yu
Dubrovacka 35/III, 11000 Belgrade
Tel: +381 11 3119901; Fax: +381 11 3119901
-----
This e-mail is confidential and intended only for the recipient.
Unauthorized distribution, modification or disclosure of its
contents is prohibited. If you have received this e-mail in error,
please notify the sender by telephone +381 11 3119901.
-----



From laza at yu.net  Wed Sep 15 11:16:41 2004
From: laza at yu.net (Lazar Obradovic)
Date: Wed, 15 Sep 2004 13:16:41 +0200
Subject: [Linux-cluster] Directory lockups?
Message-ID: <1095247001.8184.21.camel@laza.eunet.yu>

(ignore my last post... sorry for the broken mail, seems that my mouse is quite dirty :( 

Hello, 

I have been receiving these messages lately: 

Sep 15 08:16:02 test01 kernel: dlm: locks: dir entry exists 6bd5037b fr 5 r 0        7         60300b6
Sep 15 08:53:29 test01 kernel: dlm: locks: dir entry exists abef0134 fr 3 r 0        7         7382d53
Sep 15 10:42:18 test01 kernel: dlm: locks: dir entry exists d00d7 fr 4 r 0        7         52565b8
Sep 15 10:42:18 test01 kernel: dlm: locks: dir entry exists f0012 fr 4 r 0        7         52565b8
Sep 15 10:46:00 test01 kernel: dlm: locks: dir entry exists f0302 fr 6 r 0        7         5356fea
Sep 15 11:10:32 test01 kernel: dlm: locks: dir entry exists 420026 fr 4 r 0        7         2ef1081
Sep 15 11:48:01 test01 kernel: dlm: locks: dir entry exists 30282 fr 5 r 0        7         56a3afd

It seems that it has to do with a particular directory. Every read or
write attempt inside that directory gets blocked and, since processes
queue up, the load rises until it kills the node. 

Now, I've been searching through the logs and haven't found anything useful,
except those few lines up there.

I've run gfs_fsck and got: 

[... useless things omitted ... ]

Dinodes with more than one dirent:
       inode = 30926818, dirents = 2
Dinodes with link count > 1:
       inode = 30926818, nlink = 2
Pass 6:  done  (0:00:00)

[... useless things omitted ... ]

What shall I do to debug this further? Can anyone explain why this is
happening? 

-- 
Lazar Obradovic, System Engineer
----- 
laza at YU.net
YUnet International http://www.EUnet.yu
Dubrovacka 35/III, 11000 Belgrade
Tel: +381 11 3119901; Fax: +381 11 3119901
-----
This e-mail is confidential and intended only for the recipient.
Unauthorized distribution, modification or disclosure of its
contents is prohibited. If you have received this e-mail in error,
please notify the sender by telephone +381 11 3119901.
-----



From teigland at redhat.com  Wed Sep 15 11:17:46 2004
From: teigland at redhat.com (David Teigland)
Date: Wed, 15 Sep 2004 19:17:46 +0800
Subject: [Linux-cluster] Directory lockups?
In-Reply-To: <1095242945.12259.130.camel@laza.eunet.yu>
References: <1095242945.12259.130.camel@laza.eunet.yu>
Message-ID: <20040915111746.GE17196@redhat.com>


On Wed, Sep 15, 2004 at 01:06:32PM +0200, Lazar Obradovic wrote:
> Hello, 
> 
> I have been receieving this messages lately: 
> 
> Sep 15 08:16:02 test01 kernel: dlm: mailbox: dir entry exists 6bd5037b fr 5 r 0        7         60300b6
> Sep 15 08:53:29 test01 kernel: dlm: mailbox: dir entry exists abef0134 fr 3 r 0        7         7382d53
> Sep 15 10:42:18 test01 kernel: dlm: mailbox: dir entry exists d00d7 fr 4 r 0        7         52565b8
> Sep 15 10:42:18 test01 kernel: dlm: mailbox: dir entry exists f0012 fr 4 r 0        7         52565b8
> Sep 15 10:46:00 test01 kernel: dlm: mailbox: dir entry exists f0302 fr 6 r 0        7         5356fea
> Sep 15 11:10:32 test01 kernel: dlm: mailbox: dir entry exists 420026 fr 4 r 0        7         2ef1081
> Sep 15 11:48:01 test01 kernel: dlm: locks dir entry exists 30282 fr 5 r 0        7         56a3afd

Those messages can be safely ignored; I'll remove them sometime soon.
They result from multiple processes requesting fcntl/posix locks on the
same files.
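
For what it's worth, the workload that produces them is nothing more
exotic than ordinary POSIX locking from several processes at once; a few
concurrent copies of something like the sketch below, pointed at the same
file on a GFS mount (the path here is only an example), is the kind of
thing that generates those plock requests:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	struct flock fl;
	int fd = open("/gfs/shared/lockfile", O_RDWR | O_CREAT, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	memset(&fl, 0, sizeof(fl));
	fl.l_type = F_WRLCK;            /* exclusive write lock */
	fl.l_whence = SEEK_SET;
	fl.l_start = 0;
	fl.l_len = 0;                   /* lock the whole file */

	if (fcntl(fd, F_SETLKW, &fl) < 0) {   /* block until granted */
		perror("fcntl");
		return 1;
	}

	sleep(5);                       /* hold the lock for a while */

	fl.l_type = F_UNLCK;
	fcntl(fd, F_SETLK, &fl);
	close(fd);
	return 0;
}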

-- 
Dave Teigland  



From lars.larsen at edb.com  Wed Sep 15 11:43:57 2004
From: lars.larsen at edb.com (=?iso-8859-1?Q?Larsen_Lars_Asbj=F8rn?=)
Date: Wed, 15 Sep 2004 13:43:57 +0200
Subject: [Linux-cluster] GFS and Oracle RAC
Message-ID: <4FD673DA70A6C44AA180BE5C971B1A19012170A3@OSLATCEX0002.edb.com>

Anybody with experience with this combination when it comes to performance,
functionality and stability?

Red Hat Enterprise Linux AS on Itanium CPUs will be the OS/hardware
combination, if usable.

Vennlig hilsen/Best regards
Lars Larsen
Seniorkonsulent/Senior Consultant EDB IT Drift
Tfl: +47 22 77 27 45, mobil: +47 900 19 569
lars.larsen at edb.com
www.edb.com



From teigland at redhat.com  Wed Sep 15 03:40:46 2004
From: teigland at redhat.com (David Teigland)
Date: Wed, 15 Sep 2004 11:40:46 +0800
Subject: [Linux-cluster] (no subject)
In-Reply-To: <058iiNJwL0171M36@cmsapps01.cms.usa.net>
References: <058iiNJwL0171M36@cmsapps01.cms.usa.net>
Message-ID: <20040915034046.GA17196@redhat.com>

On Tue, Sep 14, 2004 at 11:22:55AM +0200, Stéphane Messerli wrote:

> Any idea of what's wrong or what we should we check next?

I think this is a new one.

> Is it possible to "unlock" the machines after such an error without
> reboot?

No, the machine needs to be reset when you get this.

> The release version is DEVEL.1090589850.

That looks like July 23; you should probably update to what's in cvs and
let us know if anything changes.

-- 
Dave Teigland  



From bastian at waldi.eu.org  Wed Sep 15 12:42:22 2004
From: bastian at waldi.eu.org (Bastian Blank)
Date: Wed, 15 Sep 2004 14:42:22 +0200
Subject: [Linux-cluster] [PATCH] clean fence/bin and gfs/bin
Message-ID: <20040915124222.GA21743@wavehammer.waldi.eu.org>

The Makefiles in the fence and gfs directories don't clean the bin
directories. The attached patch fixes this.

Bastian

-- 
I'm a soldier, not a diplomat.  I can only tell the truth.
		-- Kirk, "Errand of Mercy", stardate 3198.9
-------------- next part --------------
diff -urN strace-4.5.7/debian/rules strace-4.5.7.new/debian/rules
--- strace-4.5.7/debian/rules	2004-08-31 10:53:52.000000000 +0200
+++ strace-4.5.7.new/debian/rules	2004-09-13 15:14:07.000000000 +0200
@@ -3,6 +3,7 @@
 # Debian package information
 package		= strace
 
+DEB_BUILD_GNU_TYPE := $(shell dpkg-architecture -qDEB_BUILD_GNU_TYPE)
 DEB_HOST_GNU_TYPE := $(shell dpkg-architecture -qDEB_HOST_GNU_TYPE)
 
 ifeq ($(DEB_HOST_GNU_TYPE),sparc-linux)
@@ -11,6 +12,12 @@
   CC64 = gcc -m64
 endif
 
+ifeq ($(DEB_HOST_GNU_TYPE),s390-linux)
+  build64 = yes
+  HOST64 = s390x-linux
+  CC64 = gcc -m64
+endif
+
 ifeq ($(build64),yes)
    extra_build_targets += stamp-build64
 endif
@@ -23,11 +30,11 @@
 
 build/Makefile:
 	mkdir -p $(@D)
-	cd $(@D); sh ../configure --prefix=/usr
+	cd $(@D); sh ../configure --prefix=/usr --build=$(DEB_BUILD_GNU_TYPE) --host=$(DEB_HOST_GNU_TYPE)
 
 build64/Makefile:
 	mkdir -p $(@D)
-	cd $(@D); CC="$(CC64)" sh ../configure --prefix=/usr --build=$(HOST64)
+	cd $(@D); CC="$(CC64)" sh ../configure --prefix=/usr --build=$(DEB_BUILD_GNU_TYPE) --host=$(HOST64)
 
 clean:
 	rm -rf debian/tmp debian/substvars debian/files
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 197 bytes
Desc: Digital signature
URL: 

From bastian at waldi.eu.org  Wed Sep 15 13:28:33 2004
From: bastian at waldi.eu.org (Bastian Blank)
Date: Wed, 15 Sep 2004 15:28:33 +0200
Subject: [Linux-cluster] [PATCH] clean fence/bin and gfs/bin
In-Reply-To: <20040915124222.GA21743@wavehammer.waldi.eu.org>
References: <20040915124222.GA21743@wavehammer.waldi.eu.org>
Message-ID: <20040915132833.GB11269@wavehammer.waldi.eu.org>

On Wed, Sep 15, 2004 at 02:42:22PM +0200, Bastian Blank wrote:
> The Makefiles in the fence and gfs directory doesn't clean the bin
> directories. The attached patch fixes this.

I attached the wrong patch by accident.

Bastian

-- 
A woman should have compassion.
		-- Kirk, "Catspaw", stardate 3018.2
-------------- next part --------------
Index: fence/Makefile
===================================================================
RCS file: /cvs/cluster/cluster/fence/Makefile,v
retrieving revision 1.1
diff -u -r1.1 Makefile
--- fence/Makefile	24 Jun 2004 08:53:10 -0000	1.1
+++ fence/Makefile	15 Sep 2004 12:40:32 -0000
@@ -20,6 +20,7 @@
 
 clean:
 	cd agents && ${MAKE} clean
+	cd bin && ${MAKE} clean
 	cd fence_node && ${MAKE} clean
 	cd fence_tool && ${MAKE} clean
 	cd fenced && ${MAKE} clean
Index: gfs/Makefile
===================================================================
RCS file: /cvs/cluster/cluster/gfs/Makefile,v
retrieving revision 1.1
diff -u -r1.1 Makefile
--- gfs/Makefile	24 Jun 2004 08:53:20 -0000	1.1
+++ gfs/Makefile	15 Sep 2004 12:40:32 -0000
@@ -30,6 +30,7 @@
 	cd gfs_tool && ${MAKE} copytobin
 
 clean:
+	cd bin && ${MAKE} clean
 	cd gfs_edit && ${MAKE} clean
 	cd gfs_fsck && ${MAKE} clean
 	cd gfs_grow && ${MAKE} clean
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 197 bytes
Desc: Digital signature
URL: 

From lhh at redhat.com  Wed Sep 15 14:59:15 2004
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 15 Sep 2004 10:59:15 -0400
Subject: [Linux-cluster] [PATCH] clean fence/bin and gfs/bin
In-Reply-To: <20040915124222.GA21743@wavehammer.waldi.eu.org>
References: <20040915124222.GA21743@wavehammer.waldi.eu.org>
Message-ID: <1095260355.2364.67.camel@atlantis.boston.redhat.com>

On Wed, 2004-09-15 at 14:42 +0200, Bastian Blank wrote:
> The Makefiles in the fence and gfs directory doesn't clean the bin
> directories. The attached patch fixes this.

You sent a patch to strace?

-- Lon




From lhh at redhat.com  Wed Sep 15 14:59:42 2004
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 15 Sep 2004 10:59:42 -0400
Subject: [Linux-cluster] [PATCH] clean fence/bin and gfs/bin
In-Reply-To: <1095260355.2364.67.camel@atlantis.boston.redhat.com>
References: <20040915124222.GA21743@wavehammer.waldi.eu.org>
	<1095260355.2364.67.camel@atlantis.boston.redhat.com>
Message-ID: <1095260382.2364.69.camel@atlantis.boston.redhat.com>

On Wed, 2004-09-15 at 10:59 -0400, Lon Hohberger wrote:
> On Wed, 2004-09-15 at 14:42 +0200, Bastian Blank wrote:
> > The Makefiles in the fence and gfs directory doesn't clean the bin
> > directories. The attached patch fixes this.
> 
> You sent a patch to strace?

N/M.  Saw correct patch.

-- Lon



From lhh at redhat.com  Wed Sep 15 15:19:55 2004
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 15 Sep 2004 11:19:55 -0400
Subject: [Linux-cluster] [PATCH] clean fence/bin and gfs/bin
In-Reply-To: <20040915132833.GB11269@wavehammer.waldi.eu.org>
References: <20040915124222.GA21743@wavehammer.waldi.eu.org>
	<20040915132833.GB11269@wavehammer.waldi.eu.org>
Message-ID: <1095261595.2364.76.camel@atlantis.boston.redhat.com>

On Wed, 2004-09-15 at 15:28 +0200, Bastian Blank wrote:
> On Wed, Sep 15, 2004 at 02:42:22PM +0200, Bastian Blank wrote:
> > The Makefiles in the fence and gfs directory doesn't clean the bin
> > directories. The attached patch fixes this.
> 
> I attached the wrong patch by accident.

Hmm, neither do cman nor ccs.

Actually GNBD seems to be the only one which currently does...

Patch for the four of them.

-- Lon
-------------- next part --------------
A non-text attachment was scrubbed...
Name: clean.patch
Type: text/x-patch
Size: 1785 bytes
Desc: not available
URL: 

From john.l.villalovos at intel.com  Wed Sep 15 18:50:33 2004
From: john.l.villalovos at intel.com (Villalovos, John L)
Date: Wed, 15 Sep 2004 11:50:33 -0700
Subject: [Linux-cluster] Will GFS run on the 2.6.5 kernel?
Message-ID: <60C14C611F1DDD4198D53F2F43D8CA3B01F9C0CE@orsmsx410>

I was wondering if GFS will run on the 2.6.5 kernel?

Currently it won't work because the function
unmap_shared_mapping_range() only came into being in the 2.6.6 kernel.  It
appears that Daniel Phillips created it.

Should I be able to take the BitKeeper patch changeset that added this
to 2.6.6 and apply it to 2.6.5?

Thanks for any info.

John



From kpreslan at redhat.com  Wed Sep 15 20:10:33 2004
From: kpreslan at redhat.com (Ken Preslan)
Date: Wed, 15 Sep 2004 15:10:33 -0500
Subject: [Linux-cluster] Assertion failed on line 1281 of file
	/usr/src/cluster/gfs-kernel/src/gfs/glock.c
In-Reply-To: <318981499.20040914120016@hq.310.ru>
References: <318981499.20040914120016@hq.310.ru>
Message-ID: <20040915201033.GA31145@potassium.msp.redhat.com>

On Tue, Sep 14, 2004 at 12:00:16PM +0400, Антон Нехороших wrote:
> i update srources today from cvs
> after compile, install, reboot i see in messagess:
> 
> Sep 14 11:04:29 c5 kernel: GFS: fsid=310farm:gfs01.2: Joined cluster. Now mounting FS...
> Sep 14 11:04:29 c5 kernel:
> Sep 14 11:04:29 c5 kernel: GFS: Assertion failed on line 1281 of file /usr/src/cluster/gfs-kernel/src/gfs/glock.c
> Sep 14 11:04:29 c5 kernel: GFS: assertion: "(tmp_gh->gh_flags & GL_LOCAL_EXCL) || !(gh->gh_flags & GL_LOCAL_EXCL)"
> Sep 14 11:04:29 c5 kernel: GFS: time = 1095145469
> Sep 14 11:04:29 c5 kernel: GFS: fsid=310farm:gfs01.2: glock = (4, 0)
> Sep 14 11:04:29 c5 kernel:
> Sep 14 11:04:29 c5 kernel: Kernel panic: GFS: Record message above and reboot.
> Sep 14 11:04:29 c5 kernel:
> 
> In what there can be a problem?

I guess I'm at a loss as to how this could be happening.  You're sure you
did a complete update and haven't changed any of the filesystem source
files, right?

-- 
Ken Preslan 



From kpreslan at redhat.com  Wed Sep 15 20:14:10 2004
From: kpreslan at redhat.com (Ken Preslan)
Date: Wed, 15 Sep 2004 15:14:10 -0500
Subject: [Linux-cluster] Re: 32bits vs 64bits (was: GFS limits: fs size,
	etc.)
In-Reply-To: <20040913165559.GD13209@neu.physik.fu-berlin.de>
References: <87fz5qw7sg.fsf@coraid.com>
	<20040910180503.GA19202@potassium.msp.redhat.com>
	<20040913161508.GB13209@neu.physik.fu-berlin.de>
	<20040913162431.GA13609@potassium.msp.redhat.com>
	<20040913165559.GD13209@neu.physik.fu-berlin.de>
Message-ID: <20040915201410.GB31145@potassium.msp.redhat.com>

On Mon, Sep 13, 2004 at 06:55:59PM +0200, Axel Thimm wrote:
> Does this mean the on-disk-format is independent of the machine word
> size?

Yep.

> Just out of interest, what will happen, if a 32bit cluster member
> tries to join/mount a too-large fs? Will the operation fail or will
> there be silent data corruption?

Yeah, right now, mounting a too-large filesystem will corrupt the
filesystem when you write past the 16TB boundary.  I'll work on getting
mkfs to write a flag into the superblock that the filesystem will check
on mount.
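
Roughly, the check would look something like the userspace sketch below.
The flag and superblock fields here are made up purely for illustration;
nothing like this exists in the tree yet:

#include <stdio.h>
#include <stdint.h>

/* Hypothetical flag that mkfs would set when the fs may exceed 16TB. */
#define GFS_SB_F_16TB	0x00000001

struct gfs_sb_info {
	uint32_t sb_flags;
	uint64_t sb_total_blocks;
	uint32_t sb_bsize;
};

/* Refuse the mount if this machine's page index can't address the fs. */
static int check_fs_size(const struct gfs_sb_info *sb)
{
	uint64_t fs_bytes = sb->sb_total_blocks * (uint64_t)sb->sb_bsize;
	uint64_t limit_32bit = ((uint64_t)1 << 32) * 4096;   /* 16TB */

	if (sizeof(unsigned long) == 4 &&
	    (sb->sb_flags & GFS_SB_F_16TB) &&
	    fs_bytes > limit_32bit) {
		fprintf(stderr, "GFS: filesystem too large for a 32-bit machine\n");
		return -1;
	}
	return 0;
}

int main(void)
{
	struct gfs_sb_info sb = { GFS_SB_F_16TB, 20ULL << 28, 4096 };  /* ~20TB */

	return check_fs_size(&sb) ? 1 : 0;
}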

-- 
Ken Preslan 



From kpreslan at redhat.com  Wed Sep 15 20:36:35 2004
From: kpreslan at redhat.com (Ken Preslan)
Date: Wed, 15 Sep 2004 15:36:35 -0500
Subject: [Linux-cluster] Will GFS run on the 2.6.5 kernel?
In-Reply-To: <60C14C611F1DDD4198D53F2F43D8CA3B01F9C0CE@orsmsx410>
References: <60C14C611F1DDD4198D53F2F43D8CA3B01F9C0CE@orsmsx410>
Message-ID: <20040915203635.GC31145@potassium.msp.redhat.com>

On Wed, Sep 15, 2004 at 11:50:33AM -0700, Villalovos, John L wrote:
> I was wondering if GFS will run on the 2.6.5 kernel?
> 
> Currently it won't work because the function:
> unmap_shared_mapping_range() came into being in the 2.6.6 kernel.  It
> appears that Daniel Phillips created this.
> 
> Should I be able to take the BitKeeper patch changeset that added this
> to 2.6.6 and apply it to 2.6.5?

I'm sure if you work at it, you can get it to run on 2.6.5.  But as you've
found out, it won't compile right out of the box. Adding the
unmap_shared_mapping_range() patches is a start.  You'll have to see what
else you need to fix (if anything).

-- 
Ken Preslan 



From kpreslan at redhat.com  Wed Sep 15 20:42:38 2004
From: kpreslan at redhat.com (Ken Preslan)
Date: Wed, 15 Sep 2004 15:42:38 -0500
Subject: [Linux-cluster] Directory lockups?
In-Reply-To: <1095247001.8184.21.camel@laza.eunet.yu>
References: <1095247001.8184.21.camel@laza.eunet.yu>
Message-ID: <20040915204238.GD31145@potassium.msp.redhat.com>

On Wed, Sep 15, 2004 at 01:16:41PM +0200, Lazar Obradovic wrote:
> What shall I do to debug this further? Can  anyone explain why is this
> happening? 

The outputs (from every node) of "ps aux" and "gfs_tool lockdump
/mountpoint" would be useful.

-- 
Ken Preslan 



From laza at yu.net  Wed Sep 15 20:55:02 2004
From: laza at yu.net (Lazar Obradovic)
Date: Wed, 15 Sep 2004 22:55:02 +0200
Subject: [Linux-cluster] Directory lockups?
In-Reply-To: <1095247001.8184.21.camel@laza.eunet.yu>
References: <1095247001.8184.21.camel@laza.eunet.yu>
Message-ID: <1095281702.8286.183.camel@laza.eunet.yu>

It happened again today, and I got around 80 queued processes waiting to
write to the same file. All the processes were in "D" state according to
'ps' and they all blocked the whole directory where the file is (ls into
that dir would block too). 

Now, the node just recovered itself, but that directory was unavailable for
almost an hour and a half!

Do deadlocktimeout and lock_timeout (in /proc/cluster/config/dlm) have
anything to do with this and are they configurable? 

Can someone shed some light on the /proc interface, just to know what's
where? This could also go into usage.txt or even a separate file... 

> Hello, 
> 
> I have been receieving this messages lately: 
> 
> Sep 15 08:16:02 test01 kernel: dlm: locks: dir entry exists 6bd5037b fr 5 r 0        7         60300b6
> Sep 15 08:53:29 test01 kernel: dlm: locks: dir entry exists abef0134 fr 3 r 0        7         7382d53
> Sep 15 10:42:18 test01 kernel: dlm: locks: dir entry exists d00d7 fr 4 r 0        7         52565b8
> Sep 15 10:42:18 test01 kernel: dlm: locks: dir entry exists f0012 fr 4 r 0        7         52565b8
> Sep 15 10:46:00 test01 kernel: dlm: locks: dir entry exists f0302 fr 6 r 0        7         5356fea
> Sep 15 11:10:32 test01 kernel: dlm: locks: dir entry exists 420026 fr 4 r 0        7         2ef1081
> Sep 15 11:48:01 test01 kernel: dlm: locks: dir entry exists 30282 fr 5 r 0        7         56a3afd
> 
> it seems that it has to do with a particular directory. Every read or
> write attempt inside that directory gets blocked and, since processess
> queue up, load rises 'till it kills the node. 
> 
> Now, I've been searching through logs and havent found anything useful,
> except those few lines up there.
> 
> I'v run gfs_fsck and got: 
> 
> [... useless things omitted ... ]
> 
> Dinodes with more than one dirent:
>        inode = 30926818, dirents = 2
> Dinodes with link count > 1:
>        inode = 30926818, nlink = 2
> Pass 6:  done  (0:00:00)
> 
> [... useless things omitted ... ]
> 
> What shall I do to debug this further? Can  anyone explain why is this
> happening? 
-- 
Lazar Obradovic, System Engineer
----- 
laza at YU.net
YUnet International http://www.EUnet.yu
Dubrovacka 35/III, 11000 Belgrade
Tel: +381 11 3119901; Fax: +381 11 3119901
-----
This e-mail is confidential and intended only for the recipient.
Unauthorized distribution, modification or disclosure of its
contents is prohibited. If you have received this e-mail in error,
please notify the sender by telephone +381 11 3119901.
-----



From teigland at redhat.com  Thu Sep 16 05:46:39 2004
From: teigland at redhat.com (David Teigland)
Date: Thu, 16 Sep 2004 13:46:39 +0800
Subject: [Linux-cluster] Directory lockups?
In-Reply-To: <1095281702.8286.183.camel@laza.eunet.yu>
References: <1095247001.8184.21.camel@laza.eunet.yu>
	<1095281702.8286.183.camel@laza.eunet.yu>
Message-ID: <20040916054639.GA17802@redhat.com>

On Wed, Sep 15, 2004 at 10:55:02PM +0200, Lazar Obradovic wrote:
> It happened again today, and I got around 80 queud processes waiting to
> write into same file. All processes were in "D" state when looked from
> 'ps' and they all blocked whole directory where file is (ls into that
> dir would block too). 

Is there a test or application you're running that we could try ourselves?

> Now, node just recovered itself, but that directory was unavailable for
> almost an hour and a half!

In addition to Ken's suggestion ("ps aux" and "gfs_tool lockdump
/mountpoint" from each node), you could provide "cat
/proc/cluster/lock_dlm_debug" from each node.

> Do deadlocktimeout and lock_timeout (in /proc/cluster/config/dlm) have
> anything to do with this and are they configurable? 

They are unrelated to gfs.

> Can someone shed a light on /proc interface, just to know what's where?
> This could also go into usage.txt or even separate file... 

I don't think any of them would be useful.  It's simply our habit to
define any "constant" this way.

buffer_size - network message size used by the dlm
dirtbl_size, lkbtbl_size, rsbtbl_size - hash table sizes
lock_timeout - max time we'll wait for a reply for a remote request
  (not used for gfs locks)
deadlocktime - max time a request will wait to be granted
  (not used for gfs locks)
recover_timer - while waiting for certain conditions during recovery,
  this is the interval between checks
tcp_port - used for dlm communication
max_connections - max number of network connections the dlm will make
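
If you want to eyeball them, they're plain text files under
/proc/cluster/config/dlm.  A trivial reader (using the names listed above;
adjust the names and path if your checkout differs) looks like this:

#include <stdio.h>

int main(void)
{
	static const char *names[] = {
		"buffer_size", "dirtbl_size", "lkbtbl_size", "rsbtbl_size",
		"lock_timeout", "deadlocktime", "recover_timer",
		"tcp_port", "max_connections",
	};
	char path[128], line[64];
	unsigned int i;

	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
		FILE *f;

		snprintf(path, sizeof(path),
			 "/proc/cluster/config/dlm/%s", names[i]);
		f = fopen(path, "r");
		if (!f) {
			printf("%-16s (not present)\n", names[i]);
			continue;
		}
		if (fgets(line, sizeof(line), f))
			printf("%-16s %s", names[i], line);
		fclose(f);
	}
	return 0;
}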

-- 
Dave Teigland  



From psousa at ptinovacao.pt  Thu Sep 16 10:05:02 2004
From: psousa at ptinovacao.pt (Paulo Sousa)
Date: Thu, 16 Sep 2004 11:05:02 +0100
Subject: [Linux-cluster] Problem with RHEL3 and GFS-6.0.0.10, Kernel Panic
Message-ID: 

Hello,

            I'm testing GFS on RHEL3, but I have some problems.

 

            I have 2 servers connected to shared SCSI storage, and one of the
servers is the lock server (I don't have redundancy for the lock server at
this moment; it is just for testing).

 

            Server1 (mount gfs filesystem + lock_server)

            Server2 (mount gfs filesystem)

            

            This is the test I made on server 1:

 

            /etc/init.d/lock_gulmd stop

            /etc/init.d/lock_gulmd start

 

            After 2..3  seconds

            

            gfs1  login: lock_gulm: Checking for journals for node "gfs1"

            lock_gulm: ERROR Got an error in gulm_res_recvd err: -71

            lock_gulm: ERROR gulm_LT recver err -71

            lock_gulm: ERROR Got a -1111 trying to login to lock_gulm. It it
running?

 

            Lock_gulm: Assertion failed on line 50 of file gulm_core.c

            Lock_gulm: assertion: "gulm_cm_GenerationI == gen"

            Kernel panic:

            Lock_gulm: Record message above and reboot

 

                        

 

 

            

Regards,

       Paulo Sousa

                                 

___________________________________________

Paulo Sousa 

Plataformas de Messaging e Serviços

Serviços e Redes Móveis 

PT Inovação, SA

Rua Eng. José Ferreira Pinto Basto

3810 - 106 Aveiro - Portugal

  http://www.ptinovacao.pt

Tel. +351 234403607

Fax. +351 234424160

mailto:psousa at ptinovacao.pt  

__________________________________________

 

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From lars.larsen at edb.com  Thu Sep 16 11:30:09 2004
From: lars.larsen at edb.com (=?iso-8859-1?Q?Larsen_Lars_Asbj=F8rn?=)
Date: Thu, 16 Sep 2004 13:30:09 +0200
Subject: [Linux-cluster] Problem with RHEL3 and GFS-6.0.0.10, Kernel P anic
Message-ID: <4FD673DA70A6C44AA180BE5C971B1A19012170AF@OSLATCEX0002.edb.com>

 
 

Vennlig hilsen/Best regards 
Lars Larsen 
Seniorkonsulent/Senior Consultant EDB IT Drift 
Tfl: +47 22 77 27 45, mobil: +47 900 19 569 
lars.larsen at edb.com 
www.edb.com 

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com]On Behalf Of Paulo Sousa
Sent: 16. september 2004 12:05
To: linux-cluster at redhat.com
Subject: [Linux-cluster] Problem with RHEL3 and GFS-6.0.0.10, Kernel Panic



Hello,

            I' testing the GFS in RHEL3 but I have some problems.

 

            I have 2 servers connect to a shared SCSI storage and one of the
serves is the lock server (I don't have redundancy at this moment to
lock_server, it is just for testing)

 

            Server1 (mount gfs filesystem + lock_server)

            Server2 (mount gfs filessystem)

            

            This is the test I have made in the server 1

 

            /etc/init.d/lock_gulmd stop

            /etc/init.d/lock_gulmd start

 

            After 2..3  seconds

            

            gfs1  login: lock_gulm: Checking for journals for node "gfs1"

            lock_gulm: ERROR Got an error in gulm_res_recvd err: -71

            lock_gulm: ERROR gulm_LT recver err -71

            lock_gulm: ERROR Got a -1111 trying to login to lock_gulm. It it
running?

 

            Lock_gulm: Assertion failed on line 50 of file gulm_core.c

            Lock_gulm: assertion: "gulm_cm_GenerationI == gen"

            Kernel panic:

            Lock_gulm: Record message above and reboot

 

                        

 

 

            

Cumprimentos,

       Paulo Sousa

                                 

___________________________________________

Paulo Sousa 

Plataformas de Messaging e Serviços

Serviços e Redes Móveis 

PT Inovação, SA

Rua Eng. José Ferreira Pinto Basto

3810 - 106 Aveiro - Portugal

  http://www.ptinovacao.pt

Tel. +351 234403607

Fax. +351 234424160

mailto:psousa at ptinovacao.pt  

__________________________________________

 

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From stephane.messerli at urbanet.ch  Wed Sep 15 21:37:52 2004
From: stephane.messerli at urbanet.ch (=?iso-8859-1?Q?St=E9phane_Messerli?=)
Date: Wed, 15 Sep 2004 23:37:52 +0200
Subject: **SPAM** Re: [Linux-cluster] (no subject)
In-Reply-To: <20040915034046.GA17196@redhat.com>
Message-ID: <200409152137.i8FLb6sw006443@smtp.hispeed.ch>

Thanks for your answer.
We will definitely try the latest cvs code very soon. 

In the meantime, we've been able to identify a potential cause of our
problems.
Most (80%) of the lock_dlm errors (and a few other gfs-related bugs) happen
while a tar is reading files from the gfs file system and writing the output
to an nfs mount point on another machine.

Does gfs have known issues with nfs mount points?

Thanks,
- Stéphane Messerli (stephane.messerli at urbanet.ch)




-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of David Teigland
Sent: Wednesday, 15 September 2004 05:41
To: Stéphane Messerli
Cc: linux-cluster at redhat.com
Subject: **SPAM** Re: [Linux-cluster] (no subject)

On Tue, Sep 14, 2004 at 11:22:55AM +0200, Stéphane Messerli wrote:

> Any idea of what's wrong or what we should we check next?

I think this is a new one.

> Is it possible to "unlock" the machines after such an error without 
> reboot?

No, the machine needs to be reset when you get this.

> The release version is DEVEL.1090589850.

That looks like July 23; you should probably update to what's in cvs and let
us know if anything changes.

--
Dave Teigland  

--
Linux-cluster mailing list
Linux-cluster at redhat.com
http://www.redhat.com/mailman/listinfo/linux-cluster




From alanr at unix.sh  Thu Sep 16 14:25:44 2004
From: alanr at unix.sh (Alan Robertson)
Date: Thu, 16 Sep 2004 08:25:44 -0600
Subject: [Linux-cluster] [Fwd: FW: Call for papers deadline extension:
	HAPCW2004]
Message-ID: <4149A268.6090902@unix.sh>

FYI...

-- 
     Alan Robertson 

"Openness is the foundation and preservative of friendship...  Let me claim 
from you at all times your undisguised opinions." - William Wilberforce
-------------- next part --------------
An embedded message was scrubbed...
From: "Box" 
Subject: FW: Call for papers deadline extension: HAPCW2004 
Date: Thu, 16 Sep 2004 09:06:48 -0500
Size: 5499
URL: 

From mtilstra at redhat.com  Thu Sep 16 15:32:19 2004
From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra)
Date: Thu, 16 Sep 2004 10:32:19 -0500
Subject: [Linux-cluster] Problem with RHEL3 and GFS-6.0.0.10, Kernel Panic
In-Reply-To: 
References: 
Message-ID: <20040916153219.GA20169@redhat.com>

On Thu, Sep 16, 2004 at 11:05:02AM +0100, Paulo Sousa wrote:
>                I' testing the GFS in RHEL3 but I have some problems.
> 
>                 I have 2 servers connect to a shared SCSI storage and one
>    of  the  serves  is  the  lock server (I don't have redundancy at this
>    moment to lock_server, it is just for testing)
> 
>                Server1 (mount gfs filesystem + lock_server)
>                Server2 (mount gfs filessystem)
> 
>                This is the test I have made in the server 1
> 
>                /etc/init.d/lock_gulmd stop

You have a single lock server.  This is where all of the lock state is
stored.  The lock state is what keeps the different nodes mounting gfs
from corrupting data.  You have no redundancy in the lock state.  You
stopped the lock server.  The lock state was lost.  The cluster cannot
continue.  The nodes killed themselves rather than let the filesystem
meta data get corrupted.

If you want to be able to stop lock servers, you MUST have redundancy in
the lock servers.  For gulm this means you need three nodes.


-- 
Michael Conrad Tadpol Tilstra
Gravity is a myth, the Earth sucks.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: 

From psousa at ptinovacao.pt  Fri Sep 17 09:06:42 2004
From: psousa at ptinovacao.pt (Paulo Sousa)
Date: Fri, 17 Sep 2004 10:06:42 +0100
Subject: [Linux-cluster] Problem with RHEL3 and GFS-6.0.0.10, Kernel P anic
Message-ID: 

>>On Thu, Sep 16, 2004 at 11:05:02AM +0100, Paulo Sousa wrote:
>>                I' testing the GFS in RHEL3 but I have some problems.
>> 
>>                 I have 2 servers connect to a shared SCSI storage and one
>>    of  the  serves  is  the  lock server (I don't have redundancy at this
>>    moment to lock_server, it is just for testing)
>> 
>>                Server1 (mount gfs filesystem + lock_server)
>>                Server2 (mount gfs filessystem)
>> 
>>                This is the test I have made in the server 1
>> 
>>                /etc/init.d/lock_gulmd stop

>You have a single lock server.  This is where all of the lock state is
>stored.  The lock state is what keeps the different nodes mounting gfs
>from corrupting data.  You have no redundancy in the lock state.  You
>stopped the lock server.  The lock state was lost.  The cluster cannot
>continue.  The nodes killed themselves rather than let the filesystem
>meta data get corrupted.
>
>If you want to be able to stop lock servers, you MUST have redundancy in
>the lock servers.  For gulm this means you need three nodes.

Thank you, but my problem is that the system gets a kernel panic and stops;
I then have to reboot it manually.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From pcaulfie at redhat.com  Mon Sep 20 12:25:51 2004
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Mon, 20 Sep 2004 13:25:51 +0100
Subject: [Linux-cluster] [RFC] Generic Kernel API
Message-ID: <20040920122551.GC32420@tykepenguin.com>

At the cluster summit most people seemed to agree that we needed a generic,
pluggable kernel API for cluster functions. Well, I've finally got round to
doing something.

The attached spec allows for plug-in cluster modules with the possibility of
a node being a member of multiple clusters if the cluster managers allow it.
I've separated out the functions of "cluster manager" so they can be provided by
different components if necessary.

Two things that are not complete (or even started) in here are a communications
API and a locking API. 

For the first, I'd like to leave that to those more qualified than me to do and
for the second I'd like to (less modestly) propose our existing DLM API with the
argument that it is a full-featured API that others can implement parts of if
necessary.

Comments please.
-- 

patrick
-------------- next part --------------
CONCEPTS
--------

The kernel holds a list of named cluster management modules which register themselves at
insmod time. Each of these may provide one or more groups of services: "comms", "membership" and "quorum".

In theory a node may be a member of many clusters, though some cluster managers may prevent this.

The kernel APIs presented here are meant to be simple enough to be tidy, but featureful enough
to implement SAF on top in userspace. I don't think it is appropriate to implement the full
SAF specification in kernel space.


Membership ops
--------------

struct membership_node_address
{
	int32_t mna_len;
	char    mna_address[MAX_ADDR_LEN];
};

struct membership_node
{
	int32_t				mn_nodeid;
	struct membership_node_address  mn_address;
	char				mn_name[MAX_NAME_LEN];
	uint32_t			mn_member;
	struct timeval			mn_boottime;
};

struct membership_notify_info
{
	void *		mni_context;
	uint32_t	mni_viewnumber;
	uint32_t	mni_numitems;
	uint32_t	mni_nummembers;
	char *		mni_buffer;
};

struct membership_ops
{
	int (*start_notify) (void *cmprivate,
			    void *context, uint32_t flags, membership_callback_routine *callback, char *buffer, int max_items);
#define	MEMBERSHIP_FLAGS_NOTIFY_CHANGES  1 /* Notify of membership changes */
#define	MEMBERSHIP_FLAGS_NOTIFY_NODES    2 /* Send me a full node list now */

	int (*notify_stop)  (void *cmprivate);
	int (*get_name)     (void *cmprivate, char *name, int maxlen);
	int (*get_node)     (void *cmprivate, int32_t nodeid, struct membership_node *node);
#define MEMBERSHIP_NODE_THISNODE        -1 /* Get info about local node */

};

/* This is what is called by membership services as a callback */
typedef int (membership_callback_routine) (void *context, uint32_t reason);

I've made node IDs a signed int32, this allows for a negative pseudo ID for "this node". 
cman uses 0 for "this node" but other membership APIs may allow a real node to have an ID of zero.
SAF uses a "this node" pseudo ID.


Quorum ops
----------

/* These might be a bit too specific... */

struct quorum_info
{
	uint32_t qi_total_votes;
	uint32_t qi_expected_votes;
	uint32_t qi_quorum;
};

struct quorum_ops
{

	int (*get_quorate) (void *cmprivate);
	int (*get_votes)   (void *cmprivate, int32_t nodeid);
	int (*get_info)    (void *cmprivate, struct quorum_info *info);
};

Bottom interface. 
-----------------

/* When a CM module is loaded it calls cm_register()
 * which adds its proto_name/ops pair to a global list. */

int cm_register(struct cm_ops *proto);
void cm_unregister(struct cm_ops *proto);


/* A CM sets up one of these structs with the functions it can provide and
 * registers it, along with its name (type) using cm_register() */

struct cm_ops {
        char co_proto_name[256];

        /* These are required */

        int (*co_attach) (struct cm_info *info);
        int (*co_detach) (void *cmprivate);

        /* These are optional, a CM may provide some or all */

        struct cm_comm_ops   *co_cops;
        struct cm_member_ops *co_mops;
        struct cm_quorum_ops *co_qops;
};

Others
------

I've omitted the comms interface because I'm not really sure how featureful
it really ought to be.

We may want to add a locking interface in here too?


Top interface  
-------------

/* When cm_attach() is called, the "harness" searches the
 * global list of registered CM's, looking for one with the given
 * proto_name.  If one is found, its co_attach() function is called, being
 * passed the cm_attach() parameters. */

int cm_attach(char *proto_name, char *cluster_name, struct cm_info *info);
void cm_detach(void *cmprivate);


/* When a CM's attach function is called, it fills in the cm_info struct
 * provided by the caller with its own ops functions and values.  This
 * includes its private data pointer to be used with its ops functions. */

struct cm_info {
        struct cm_ops *ops;
        void *cmprivate;
};

eg
--

Say "foo" is a "low level" system and provides select comms and member
functions.

1. it sets foo_ops
        co_proto_name = "foo";
        co_attach = foo_attach;
        co_detach = foo_detach;
        co_cops = foo_cops;
        co_mops = foo_mops;
        co_qops = NULL;
2. and calls cm_register(&foo_ops);


Say "bar" is a higher level system and provides select member and quorum
functions.

1. it sets bar_ops
        co_proto_name = "bar";
        co_attach = bar_attach;
        co_detach = bar_detach;
        co_cops = NULL;
        co_mops = bar_mops;
        co_qops = bar_qops;

2. and calls cm_register(&bar_ops);


Internally, bar could attach to foo and use the functions foo provides.
Bar may provide some member_ops functions that foo doesn't, in addition to
some quorum services, none of which foo provides.  Applications may attach
to just bar, just foo, or in some cases both foo and bar.

bar could be programmed to use foo statically (like lock_dlm is
programmed to use dlm and cman, but gfs can use either lock_dlm or
lock_gulm).  bar could also take the lower level type (foo) as an input
parameter in some way, making it dynamic.

From andrew.bonaffini at lmco.com  Mon Sep 20 14:09:33 2004
From: andrew.bonaffini at lmco.com (Bonaffini, Andrew)
Date: Mon, 20 Sep 2004 10:09:33 -0400
Subject: [Linux-cluster] Configuration Question
Message-ID: 

I am running cluster suite with RH ES 3.0 and 7 server nodes, and have a
configuration question.  I would like to have a backup server in my
configuration which is normally not running any services - failed
services in my cluster are to always failover to the backup server.  I
think I can do this by having restricted failover domains consisting of
2 members - the active server and the backup; i.e. there would be 6 such
restricted failover domains defined, one for each active server (or
maybe there is an easier way - any suggestions?), all pointing to the
same backup server.
 
The problem is that my cluster configuration currently obtains the IP
address of the backup server through DNS.  After a failure occurs we are
changing the DNS so that a different server assumes the backup role - in
other words, after the services are switched to the backup and the
failed server is repaired, the repaired server then assumes the backup
server role.  Since the IP addresses of the servers in the cluster suite
are obtained through DNS, this effectively changes the IP addresses out
from under the cluster suite, and I'm not sure what effect this may have
on the system.  Is there a better way to achieve what we are trying to
do?
 
Thanks,
 
Andy Bonaffini
Lockheed Martin Maritime Systems and Sensors
Manassas, VA
andrew.bonaffini at lmco.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From laza at yu.net  Tue Sep 21 18:39:21 2004
From: laza at yu.net (Lazar Obradovic)
Date: Tue, 21 Sep 2004 20:39:21 +0200
Subject: [Linux-cluster] DLM/CLVM problem
Message-ID: <1095791960.22069.456.camel@laza.eunet.yu>

Hi, 

More often than not I have a problem when starting clvmd. It starts
normally, but /proc/cluster/services says: 

# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2   2 run       -
[5 7 6 4 2 3 1]

DLM Lock Space:  "clvmd"                             0   3 join      S-1,80,7
[]


while other nodes report: 

# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2   2 run       -
[4 2 5 3 6 7]

DLM Lock Space:  "clvmd"                             1   3 update    U-4,1,7
[4 2 5 3 6 7]

vgchange will hang afterwards and only a reboot will (eventually) fix the
problem. The other nodes keep working just fine in the meantime... 

What *exactly* do the "code" flags mean?

Anyone with a suggestion about what I should do here? 

-- 
Lazar Obradovic, System Engineer
----- 
laza at YU.net
YUnet International http://www.EUnet.yu
Dubrovacka 35/III, 11000 Belgrade
Tel: +381 11 3119901; Fax: +381 11 3119901
-----
This e-mail is confidential and intended only for the recipient.
Unauthorized distribution, modification or disclosure of its
contents is prohibited. If you have received this e-mail in error,
please notify the sender by telephone +381 11 3119901.
-----



From laza at yu.net  Tue Sep 21 19:17:25 2004
From: laza at yu.net (Lazar Obradovic)
Date: Tue, 21 Sep 2004 21:17:25 +0200
Subject: [Linux-cluster] DLM/CLVM problem
In-Reply-To: <1095791960.22069.456.camel@laza.eunet.yu>
References: <1095791960.22069.456.camel@laza.eunet.yu>
Message-ID: <1095794245.22075.468.camel@laza.eunet.yu>

And it seems that it happens only when the node with id=1 drops out... after
that, no node (not even the one that used to be node #1) can get in... 

On Tue, 2004-09-21 at 20:39, Lazar Obradovic wrote:
> Hi, 
> 
> I more often then not have a problem when starting clvmd. It starts
> normaly, but /proc/cluster/services, says: 
> 
> # cat /proc/cluster/services
> Service          Name                              GID LID State     Code
> Fence Domain:    "default"                           2   2 run       -
> [5 7 6 4 2 3 1]
> 
> DLM Lock Space:  "clvmd"                             0   3 join      S-1,80,7
> []
> 
> 
> while other nodes report: 
> 
> # cat /proc/cluster/services
> Service          Name                              GID LID State     Code
> Fence Domain:    "default"                           2   2 run       -
> [4 2 5 3 6 7]
> 
> DLM Lock Space:  "clvmd"                             1   3 update    U-4,1,7
> [4 2 5 3 6 7]
> 
> vgchage will hung afterwards and only reboot would (eventualy) fix the
> problem. Other nodes are working just fine in the meantime... 
> 
> What do "code" flags *exactly* mean?
> 
> Anyone with a suggestion about what should I do here? 
-- 
Lazar Obradovic, System Engineer
----- 
laza at YU.net
YUnet International http://www.EUnet.yu
Dubrovacka 35/III, 11000 Belgrade
Tel: +381 11 3119901; Fax: +381 11 3119901
-----
This e-mail is confidential and intended only for the recipient.
Unauthorized distribution, modification or disclosure of its
contents is prohibited. If you have received this e-mail in error,
please notify the sender by telephone +381 11 3119901.
-----



From teigland at redhat.com  Wed Sep 22 03:47:42 2004
From: teigland at redhat.com (David Teigland)
Date: Wed, 22 Sep 2004 11:47:42 +0800
Subject: [Linux-cluster] DLM/CLVM problem
In-Reply-To: <1095791960.22069.456.camel@laza.eunet.yu>
References: <1095791960.22069.456.camel@laza.eunet.yu>
Message-ID: <20040922034742.GA11961@redhat.com>


On Tue, Sep 21, 2004 at 08:39:21PM +0200, Lazar Obradovic wrote:
> Hi, 
> 
> I more often then not have a problem when starting clvmd. It starts
> normaly, but /proc/cluster/services, says: 
> 
> # cat /proc/cluster/services
> Service          Name                              GID LID State     Code
> Fence Domain:    "default"                           2   2 run       -
> [5 7 6 4 2 3 1]
> 
> DLM Lock Space:  "clvmd"                             0   3 join      S-1,80,7
> []
> 
> 
> while other nodes report: 
> 
> # cat /proc/cluster/services
> Service          Name                              GID LID State     Code
> Fence Domain:    "default"                           2   2 run       -
> [4 2 5 3 6 7]
> 
> DLM Lock Space:  "clvmd"                             1   3 update    U-4,1,7
> [4 2 5 3 6 7]
> 
> vgchage will hung afterwards and only reboot would (eventualy) fix the
> problem. Other nodes are working just fine in the meantime... 

> What do "code" flags *exactly* mean?

for update events beginning with "U-"
4 = ue_state = UEST_JSTART_SERVICEWAIT
1 = ue_flags = UEFL_ALLOW_STARTDONE
7 = ue_nodeid = nodeid of node joining or leaving the sg

SM is waiting for the dlm service to complete recovery.  The dlm on nodes
[4 2 5 3 6 7] is still in the process of recovery due to node 7 joining
the lockspace.  If it stays this way for long, it probably means that dlm
recovery is hung for some reason.  dmesg or /proc/cluster/dlm_debug should
show roughly how far the dlm recovery got.

for service events beginning with "S-"
1 = se_state = SEST_JOIN_BEGIN
80 = se_flags = SEFL_DELAY
7 = se_reply_count = number of replies received

SM will not permit this node to join the lockspace because the lockspace
in question is still doing recovery.  Once recovery completes, this node
will go ahead and join.
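
As a small convenience, here is a stand-alone C sketch that splits a Code
string such as "U-4,1,7" or "S-1,80,7" into the three fields described above.
Only the field names and the two example values come from this explanation;
whether each field is printed in hex or decimal is an assumption (the 80 for
SEFL_DELAY suggests hex for the flags), and the full state/flag tables live
in the SM code in the cman kernel tree.

/* minimal decoder sketch for /proc/cluster/services "Code" strings;
 * field names and example values taken from the explanation above,
 * everything else is illustrative only */
#include <stdio.h>

static void decode(const char *code)
{
	char kind;
	unsigned int a, b, c;

	if (sscanf(code, "%c-%u,%x,%u", &kind, &a, &b, &c) != 4) {
		printf("%s: not a 3-field code\n", code);
		return;
	}
	if (kind == 'U')
		printf("%s: ue_state=%u ue_flags=0x%x ue_nodeid=%u\n",
		       code, a, b, c);
	else if (kind == 'S')
		printf("%s: se_state=%u se_flags=0x%x se_reply_count=%u\n",
		       code, a, b, c);
	else
		printf("%s: unknown event type '%c'\n", code, kind);
}

int main(void)
{
	decode("U-4,1,7");   /* JSTART_SERVICEWAIT, ALLOW_STARTDONE, node 7 */
	decode("S-1,80,7");  /* JOIN_BEGIN, DELAY, 7 replies received */
	return 0;
}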

-- 
Dave Teigland  



From bernd.quiskamp at hp.com  Wed Sep 22 08:18:47 2004
From: bernd.quiskamp at hp.com (Quiskamp, Bernd)
Date: Wed, 22 Sep 2004 10:18:47 +0200
Subject: [Linux-cluster] Performance metrics on using gfs pool manager
Message-ID: <564DAFB99CD0BD488B43E47D057CC54201AE3020@bbnexc02.emea.cpqcorp.net>

Are there any performance metrics available concerning different gfs
pool manager configurations? We intend to use gfs in an EMC SAN storage
environment. Our preferred storage configuration is to bind only one big
storage device (up to 2 TByte) with the gfs pool volume manager, leaving all
configuration and administration effort on the storage side.
An alternative configuration would be to use a large number (up to 50) of
small devices, integrated and managed by the pool volume manager.
What is the best strategy to get the best performance with minimal
admin effort? Does anyone have experience with different gfs
configurations?
 
Thank you in advance
Bernd Quiskamp (bernd.quiskamp at hp.com)
 
 
 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From sdake at mvista.com  Tue Sep 21 22:32:30 2004
From: sdake at mvista.com (Steven Dake)
Date: Tue, 21 Sep 2004 15:32:30 -0700
Subject: [Linux-cluster] RE: [RFC] Generic Kernel API
Message-ID: <1095805950.6881.50.camel@persist.az.mvista.com>

Patrick,

I have read your RFC for an API and find it interesting.  But there is
one aspect that is somewhat disturbing to me.

In the model you propose, messaging and membership are separated (or at
least not completely integrated).  How can you communicate to a group,
though, if you are not aware of the current configuration (membership)?
Think of talking on the phone to a group of friends.  When you say
something, you may want to know which of your friends is in the room,
because it may alter your speech to them.  Maybe not.  But distributed
applications do need to know the relationship between configurations and
messages.  I have implemented distributed projects in the past that do
not provide strong membership guarantees, and they are unreliable.

As a result, I propose we use virtual synchrony as the basis of kernel
communication.  To that end, I have developed a small API (which is
implemented in userland in about 5000 lines of code).  This may be the
basis, with whatever changes are required for kernel inclusion, for
communication.

The basic concept of the API is that there are groups with 32-byte
keys.  It is possible to publish from one processor to all other group
members.  It is possible to join or leave a group.  When a configuration
change occurs or a message needs to be delivered, a callback is executed
(which handles the configuration change or message).  The API also
allows multiple instances of communication paths, priorities, large
messages (currently 256k, compile-time configurable), etc.
I'd ask that you at least consider reviewing the man pages that have been
produced for this API.  The link is:
http://developer.osdl.org/dev/openais/htmldocs/index.html

To understand how the protocol works and find out more information about
openais, check out:
http://developer.osdl.org/dev/openais/pres/aispres040914.pdf

To find out more about the openais project, check out:
http://developer.osdl.org/dev/openais





From opi at le-bit.de  Fri Sep 24 00:00:55 2004
From: opi at le-bit.de (Alexander Opitz)
Date: Fri, 24 Sep 2004 02:00:55 +0200
Subject: [Linux-cluster] choosing the right setup
Message-ID: <200409240200.55244.opi@le-bit.de>

Hi,

I have a problem and I don't know how to set this up; I hope someone can 
help me.

I have 2 standard PCs with IDE disks that should work together in a node 
group (later a 3rd PC should be added).

MaxDB should run on both, on one as primary and on the other as hot standby.

For this I need a shared partition (the primary writes to a file and the hot 
standby reads it and updates its own database).

I have no SAN or other shared storage device, only the IDE hard drives in 
both PCs.

How can I have this shared partition on both computers?

DRBD can do this, but the second computer can't access the partition while 
it is secondary, so the database can only be updated after failover, and 
that takes too much time.

GNBD only exports a block device from the other computer, but if the 
secondary fails while it is in standby, the first computer can't write the 
file.

ODR has been stuck at draft 2 for some years and I haven't found anything 
better.

GFS mirroring seems to be a work in progress ... I found no documentation 
for it.

My solution would be to use a local partition and a GNBD-exported one, put 
software RAID (an md device) over them and GFS on top ... but I think that 
could get me into trouble during a rebuild.

Any ideas?

Thanks Alexander Opitz//



From kpfleming at backtobasicsmgmt.com  Fri Sep 24 04:37:26 2004
From: kpfleming at backtobasicsmgmt.com (Kevin P. Fleming)
Date: Thu, 23 Sep 2004 21:37:26 -0700
Subject: [Linux-cluster] choosing the right setup
In-Reply-To: <200409240200.55244.opi@le-bit.de>
References: <200409240200.55244.opi@le-bit.de>
Message-ID: <4153A486.206@backtobasicsmgmt.com>

Alexander Opitz wrote:

> For this I need a shared partition (the primary writes in a file and the hot 
> standby reads it and updates his own database).

People have successfully used IEEE-1394 (Firewire) drives in this way; 
connected to both machines at once, and accessed from both machines at once.

It's not the fastest solution available, and it's certainly not 
enterprise-grade, but it's a workable solution until you can come up 
with something better :-)



From teigland at redhat.com  Fri Sep 24 04:43:30 2004
From: teigland at redhat.com (David Teigland)
Date: Fri, 24 Sep 2004 12:43:30 +0800
Subject: [Linux-cluster] RE: [RFC] Generic Kernel API
In-Reply-To: <1095805950.6881.50.camel@persist.az.mvista.com>
References: <1095805950.6881.50.camel@persist.az.mvista.com>
Message-ID: <20040924044330.GB16563@redhat.com>


On Tue, Sep 21, 2004 at 03:32:30PM -0700, Steven Dake wrote:
> Patrick,
> 
> I hvae read your RFC for an API and find it interesting.  But there is
> one aspect that is somewhat disturbing to me.
> 
> In the model you propose messaging and membership are seperated (or
> atleast not completed).

What's to prevent a single integrated messaging/membership system (like
you describe below) from providing both messaging and membership ops?

I /don't/ think the separation of the ops into different structs was meant
to imply that different systems would provide them.  I think the intention
was that a clustering module would be free to provide whichever methods it
wanted, e.g. a clustering module that didn't have a quorum system would
just leave those functions null, or provide only selected functions within
a given struct.
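
Something along these lines, as a minimal sketch with made-up names (not
the actual structs from the RFC): the registration struct carries optional
function pointers, and a NULL entry simply means the module does not
provide that operation.

/* illustrative sketch only -- not the structs from Patrick's RFC */
#include <stdio.h>

struct quorum_ops {
	int (*is_quorate)(void);          /* may be NULL */
	int (*expected_votes)(void);      /* may be NULL */
};

struct cluster_module {
	const char *name;
	struct quorum_ops *quorum;        /* NULL if no quorum system */
};

static int check_quorum(struct cluster_module *m)
{
	/* a module without a quorum system simply leaves this NULL;
	 * the caller falls back to "always quorate" (or whatever
	 * policy makes sense for it) */
	if (!m->quorum || !m->quorum->is_quorate)
		return 1;
	return m->quorum->is_quorate();
}

int main(void)
{
	struct cluster_module simple = { "no-quorum-module", NULL };
	printf("%s quorate: %d\n", simple.name, check_quorum(&simple));
	return 0;
}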


> As a result, I propose we use virtual synchrony as the basis of kernel
> communication.  To that end, I have developed a small API (which is
> implemented in userland in about 5000 lines of code).  This may be the
> basis, with whatever changes are required for kernel inclusion, for
> communication.

Sounds good.  It would actually be an integrated communication and basic
membership system, right?  As you mentioned above, the two are
interdependent.  By "basic membership" I'm implying that more exotic
membership systems could be implemented above this lowest layer.

I think the question here is whether your messaging/membership system
(currently in user space) would fit behind the API Patrick sent once
ported to the kernel.  If not, then what needs to be changed so it would?
The idea is for the API to be general enough to support a variety of
clustering modules, including yours.

-- 
Dave Teigland  



From opi at le-bit.de  Fri Sep 24 09:40:29 2004
From: opi at le-bit.de (Alexander Opitz)
Date: Fri, 24 Sep 2004 11:40:29 +0200
Subject: [Linux-cluster] choosing the right setup
In-Reply-To: <4153A486.206@backtobasicsmgmt.com>
References: <200409240200.55244.opi@le-bit.de>
	<4153A486.206@backtobasicsmgmt.com>
Message-ID: <200409241140.29493.opi@le-bit.de>

On Friday, 24 September 2004 06:37, Kevin P. Fleming wrote:
> People have successfully used IEEE-1394 (Firewire) drives in this way;
> connected to both machines at once, and accessed from both machines at
> once.

I have nothing more than what I've described ... so I also have no FireWire or 
any other SCSI-like things :-/

Greetings Alex//

PS: Am I the first person who has such problems?



From ben.m.cahill at intel.com  Fri Sep 24 13:13:39 2004
From: ben.m.cahill at intel.com (Cahill, Ben M)
Date: Fri, 24 Sep 2004 06:13:39 -0700
Subject: [Linux-cluster] Size limit on mail list?
Message-ID: <0604335B7764D141945E202153105960033E25B0@orsmsx404.amr.corp.intel.com>

Hello,

I sent a patch out yesterday to the list, and haven't seen it show up in
my inbox ... it was about 60K altogether ... is there a size limit on
list mail?

Thanks.

-- Ben --



From anton at hq.310.ru  Fri Sep 24 14:24:02 2004
From: anton at hq.310.ru (Anton Nekhoroshikh)
Date: Fri, 24 Sep 2004 18:24:02 +0400
Subject: [Linux-cluster] immutable flag on gfs
Message-ID: <1364245794.20040924182402@hq.310.ru>

Hi all !

I have written a patch that handles the immutable flag for files and
directories.

I hope to see it in cvs.

-- 
e-mail: anton at hq.310.ru
http://www.310.ru
-------------- next part --------------
A non-text attachment was scrubbed...
Name: immutable.patch
Type: application/octet-stream
Size: 4585 bytes
Desc: not available
URL: 

From john.l.villalovos at intel.com  Sat Sep 25 00:28:27 2004
From: john.l.villalovos at intel.com (Villalovos, John L)
Date: Fri, 24 Sep 2004 17:28:27 -0700
Subject: [Linux-cluster] Subversion?
Message-ID: <60C14C611F1DDD4198D53F2F43D8CA3B0216E094@orsmsx410>

Slashdot linked to an interview with Tom Lord, the creator of Arch,
where he talks about how much Subversion sucks :)

http://developers.slashdot.org/article.pl?sid=04/09/24/1926208&tid=156&tid=8

I read it and he didn't really change my mind.  I still like Subversion
:)

John



From bastian at waldi.eu.org  Sat Sep 25 08:27:28 2004
From: bastian at waldi.eu.org (Bastian Blank)
Date: Sat, 25 Sep 2004 10:27:28 +0200
Subject: [Linux-cluster] Subversion?
In-Reply-To: <412A1206.5040103@backtobasicsmgmt.com>
References: <200408231143.11372.phillips@redhat.com>
	<412A1206.5040103@backtobasicsmgmt.com>
Message-ID: <20040925082728.GA26258@wavehammer.waldi.eu.org>

On Mon, Aug 23, 2004 at 08:49:26AM -0700, Kevin P. Fleming wrote:
> I also like BK quite a bit, and it has one major advantage over 
> CVS/Subversion: you can have local trees and actually _commit_ to them, 
> including changeset comments and everything else. This is very nice when 
> you are working on multiple bits of a project and are not ready to 
> commit them to the "real" repositories.

svk does the same for subversion.

-- 
You!  What PLANET is this!
		-- McCoy, "The City on the Edge of Forever", stardate 3134.0
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 197 bytes
Desc: Digital signature
URL: 

From bastian at waldi.eu.org  Sat Sep 25 08:30:08 2004
From: bastian at waldi.eu.org (Bastian Blank)
Date: Sat, 25 Sep 2004 10:30:08 +0200
Subject: [Linux-cluster] Subversion?
In-Reply-To: <1093280841.12873.43.camel@cherrybomb.pdx.osdl.net>
References: <200408231143.11372.phillips@redhat.com>
	<1093280841.12873.43.camel@cherrybomb.pdx.osdl.net>
Message-ID: <20040925083008.GB26258@wavehammer.waldi.eu.org>

On Mon, Aug 23, 2004 at 10:07:22AM -0700, John Cherry wrote:
> I understand that subversion is quite nice, but kernel developers have
> adopted bitkeeper (at least Linus and several of his maintainers). 
> While you may not need all the distributed capabilities of bitkeeper
> now, it is sure nice to have a tool that allows for non-local
> repositories and change set tracking outside of the main repository (as
> Kevin so clearly stated).

Do you think Red Hat will provide BK licenses for people who don't get
a free one? I'm a Subversion and svk developer and will not get one
because of this.

Bastian

-- 
Without freedom of choice there is no creativity.
		-- Kirk, "The return of the Archons", stardate 3157.4
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 197 bytes
Desc: Digital signature
URL: 

From bastian at waldi.eu.org  Sat Sep 25 08:31:46 2004
From: bastian at waldi.eu.org (Bastian Blank)
Date: Sat, 25 Sep 2004 10:31:46 +0200
Subject: [Linux-cluster] Subversion?
In-Reply-To: <20040823190656.GG22622@Favog.ubiqx.mn.org>
References: <200408231143.11372.phillips@redhat.com>
	<20040823174837.GC22622@Favog.ubiqx.mn.org>
	<1093286433.11335.0.camel@localhost.localdomain>
	<20040823190656.GG22622@Favog.ubiqx.mn.org>
Message-ID: <20040925083146.GC26258@wavehammer.waldi.eu.org>

On Mon, Aug 23, 2004 at 02:06:56PM -0500, Christopher R. Hertel wrote:
> It's my understanding that Samba source web access will be moving (has
> already been moved) to viewcvs.  I've passed along the caution, given
> earlier, about database locking.  There's some talk of sharing the single
> database amonst several mirrors using Samba and CIFS-VFS.  :)

Use the fsfs backend. It works quite well over NFS.

Bastian

-- 
Earth -- mother of the most beautiful women in the universe.
		-- Apollo, "Who Mourns for Adonais?" stardate 3468.1
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 197 bytes
Desc: Digital signature
URL: 

From phillips at redhat.com  Sat Sep 25 14:44:27 2004
From: phillips at redhat.com (Daniel Phillips)
Date: Sat, 25 Sep 2004 10:44:27 -0400
Subject: [Linux-cluster] Subversion?
In-Reply-To: <20040925083008.GB26258@wavehammer.waldi.eu.org>
References: <200408231143.11372.phillips@redhat.com>
	<1093280841.12873.43.camel@cherrybomb.pdx.osdl.net>
	<20040925083008.GB26258@wavehammer.waldi.eu.org>
Message-ID: <200409251044.28064.phillips@redhat.com>

On Saturday 25 September 2004 04:30, Bastian Blank wrote:
> On Mon, Aug 23, 2004 at 10:07:22AM -0700, John Cherry wrote:
> > I understand that subversion is quite nice, but kernel developers
> > have adopted bitkeeper (at least Linus and several of his
> > maintainers). While you may not need all the distributed
> > capabilities of bitkeeper now, it is sure nice to have a tool that
> > allows for non-local repositories and change set tracking outside
> > of the main repository (as Kevin so clearly stated).
>
> Do you think, redhat will provide bk licenses for people which don't
> get a free one? I'm a subversion and svk developer and will not get
> one because of this.

Speaking with my red hat on, I do not think Red Hat will provide a 
Bitkeeper license for anyone who is not a Red Hat employee.  Speaking 
personally, I oppose the use of any proprietary version control tool in 
an open source project.

Arch and Subversion are both good enough to do the job.

Regards,

Daniel



From agk at redhat.com  Sat Sep 25 15:26:25 2004
From: agk at redhat.com (Alasdair G Kergon)
Date: Sat, 25 Sep 2004 16:26:25 +0100
Subject: [Linux-cluster] Size limit on mail list?
In-Reply-To: <0604335B7764D141945E202153105960033E25B0@orsmsx404.amr.corp.intel.com>
References: <0604335B7764D141945E202153105960033E25B0@orsmsx404.amr.corp.intel.com>
Message-ID: <20040925152625.GL11810@agk.surrey.redhat.com>

On Fri, Sep 24, 2004 at 06:13:39AM -0700, Cahill, Ben M wrote:
> my inbox ... it was about 60K altogether ... is there a size limit on
> list mail?
 
There are lots of filters on these lists; your 80K message was trapped
awaiting moderator approval, and the MIME you used has come out garbled,
so there seems no point letting it through.

...
> Opinions are mine, not Intel's
> ZGlmZiAtcnUgY3ZzL2NsdXN0ZXIvZ2ZzLWtlcm5lbC9zcmMvZ2ZzL2Rpby5jIGJ1aWxkXzA5MjMw
> NC9jbHVzdGVyL2dmcy1rZXJuZWwvc3JjL2dmcy9kaW8uYwotLS0gY3ZzL2NsdXN0ZXIvZ2ZzLWtl
...

Please re-send with patches inline, rather than as attachments.

Alasdair
-- 
agk at redhat.com



From andriy at druzhba.lviv.ua  Mon Sep 27 08:23:15 2004
From: andriy at druzhba.lviv.ua (Andriy Galetski)
Date: Mon, 27 Sep 2004 11:23:15 +0300
Subject: [Linux-cluster] Multihome network configuration not working .!
Message-ID: <003501c4a46b$3cf79150$f13cc90a@druzhba.com>

Hi !
I want to use a multihome network configuration with broadcast,
but it is not working. ...

When the main connection is disconnected
and I try:
cman_tool join -n cl020

on the other node I receive:
Sep 27 10:36:13 cl10 kernel: CMAN: node cl20 rejoining
Sep 27 10:36:17 cl10 kernel: CMAN: node cl20 is not responding - removing
from the cluster

My cluster.conf:

<?xml version="1.0"?>
(the rest of the cluster.conf XML was stripped by the list archiver)




Thanks for any information.



From pcaulfie at redhat.com  Mon Sep 27 13:26:19 2004
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Mon, 27 Sep 2004 14:26:19 +0100
Subject: [Linux-cluster] Multihome network configuration not working .!
In-Reply-To: <003501c4a46b$3cf79150$f13cc90a@druzhba.com>
References: <003501c4a46b$3cf79150$f13cc90a@druzhba.com>
Message-ID: <20040927132619.GC30900@tykepenguin.com>

On Mon, Sep 27, 2004 at 11:23:15AM +0300, Andriy Galetski wrote:
> Hi !
> I want to use Multihome network configuration with broadcast
> but it is not working. ...
> 
> When disconnected main connection
> and trying:
> cman_tool join -n cl020
> 
> On other node receive:
> Sep 27 10:36:13 cl10 kernel: CMAN: node cl20 rejoining
> Sep 27 10:36:17 cl10 kernel: CMAN: node cl20 is not responding - removing
> from the cluster


You shouldn't specify the alternative node name in this case. Just a normal
cman_tool join should work; cman will try both interfaces during the join
anyway.

If it doesn't, can you enable COMMS & MEMB debugging in cnxman-private.h and
send me the output please? (Yes, I know it means a recompile... sorry.)

patrick



From bmarzins at redhat.com  Mon Sep 27 16:13:50 2004
From: bmarzins at redhat.com (Benjamin Marzinski)
Date: Mon, 27 Sep 2004 11:13:50 -0500
Subject: [Linux-cluster] choosing the right setup
In-Reply-To: <200409241140.29493.opi@le-bit.de>
References: <200409240200.55244.opi@le-bit.de>
	<4153A486.206@backtobasicsmgmt.com>
	<200409241140.29493.opi@le-bit.de>
Message-ID: <20040927161350.GL5074@phlogiston.msp.redhat.com>

On Fri, Sep 24, 2004 at 11:40:29AM +0200, Alexander Opitz wrote:
> Am Freitag, 24. September 2004 06:37 schrieb Kevin P. Fleming:
> > People have successfully used IEEE-1394 (Firewire) drives in this way;
> > connected to both machines at once, and accessed from both machines at
> > once.
> 
> I've nothing more then what I've told ... so I've also no FireWire or any 
> other SCSI like thinks :-/
> 
> Greetings Alex//
> 
> PS: Am I the first person who have such problems?

No.  Unfortunately, until the cluster mirror target arrives, there really isn't
a good solution for this problem that doesn't require some sort of shared
storage.

-Ben
 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster



From ecashin at coraid.com  Mon Sep 27 18:33:20 2004
From: ecashin at coraid.com (Ed L Cashin)
Date: Mon, 27 Sep 2004 14:33:20 -0400
Subject: [Linux-cluster] "wire protocol" documentation?
Message-ID: <87r7ontvdb.fsf@coraid.com>

Hi.  If someone wanted to implement GFS support for a non-Linux system
like Oberon or Plan 9, where would they go to find out what they
should be doing on the network?

-- 
  Ed L Cashin 



From mtilstra at redhat.com  Mon Sep 27 18:41:26 2004
From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra)
Date: Mon, 27 Sep 2004 13:41:26 -0500
Subject: [Linux-cluster] "wire protocol" documentation?
In-Reply-To: <87r7ontvdb.fsf@coraid.com>
References: <87r7ontvdb.fsf@coraid.com>
Message-ID: <20040927184126.GA25310@redhat.com>

On Mon, Sep 27, 2004 at 02:33:20PM -0400, Ed L Cashin wrote:
> Hi.  If someone wanted to implement GFS support for a non-Linux system
> like Oberon or Plan 9, where would they go to find out what they
> should be doing on the network?

gfs doesn't actually use any networking.  The cluster and lock managers
do.  For gulm, most everything is in gio_wiretypes.h and xdr_base.c.
The cman and dlm developers will have to tell you where to look for those.

-- 
Michael Conrad Tadpol Tilstra
My shoe is on fire.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: 

From qliu at ncsa.uiuc.edu  Mon Sep 27 20:30:59 2004
From: qliu at ncsa.uiuc.edu (Qian Liu)
Date: Mon, 27 Sep 2004 15:30:59 -0500
Subject: [Linux-cluster] Question: gnbd_import error
Message-ID: <5.1.0.14.2.20040927152753.00bba5f8@pop.ncsa.uiuc.edu>

Hi, all
I got the following error message from my client node:
gnbd_import: ERROR cannot parse /sys/class/gnbd/gnbd0/server

Has anyone seen this before, or does anyone have an idea how to fix it? Thanks in advance!

-Qian



From bmarzins at redhat.com  Mon Sep 27 20:48:53 2004
From: bmarzins at redhat.com (Benjamin Marzinski)
Date: Mon, 27 Sep 2004 15:48:53 -0500
Subject: [Linux-cluster] Question: gnbd_import error
In-Reply-To: <5.1.0.14.2.20040927152753.00bba5f8@pop.ncsa.uiuc.edu>
References: <5.1.0.14.2.20040927152753.00bba5f8@pop.ncsa.uiuc.edu>
Message-ID: <20040927204853.GO5074@phlogiston.msp.redhat.com>

On Mon, Sep 27, 2004 at 03:30:59PM -0500, Qian Liu wrote:
> Hi, all
> I got the following error message from my client node:
> gnbd_import: ERROR cannot parse /sys/class/gnbd/gnbd0/server

Do you have sysfs mounted on /sys?

If not, you need sysfs.

If so, can you please reply with the contents of /sys/class/gnbd/gnbd0/server?

Thanks.

-Ben
 
> Anyone who got this before or has idea to fix this? thanks in advance!
> 
> -Qian
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster



From tony at sybaspace.com  Mon Sep 27 23:50:23 2004
From: tony at sybaspace.com (Tony Fraser)
Date: Mon, 27 Sep 2004 16:50:23 -0700
Subject: [Linux-cluster] choosing the right setup
In-Reply-To: <200409241140.29493.opi@le-bit.de>
References: <200409240200.55244.opi@le-bit.de>
	<4153A486.206@backtobasicsmgmt.com> <200409241140.29493.opi@le-bit.de>
Message-ID: <1096329022.1861.21.camel@sybaws1.office.sybaspace.com>

On Fri, 2004-09-24 at 02:40, Alexander Opitz wrote:
> Am Freitag, 24. September 2004 06:37 schrieb Kevin P. Fleming:
> > People have successfully used IEEE-1394 (Firewire) drives in this way;
> > connected to both machines at once, and accessed from both machines at
> > once.
> 
> I've nothing more then what I've told ... so I've also no FireWire or any 
> other SCSI like thinks :-/
> 
> Greetings Alex//
> 
> PS: Am I the first person who have such problems?

I've been lurking on the list for a while now. I'm hoping to use GFS in
a similar situation. 

I have yet to actually set up GFS at all, but what I was hoping to do is
use DRBD and Heartbeat to have an active/passive mirror of the data, then
use gnbd on an aliased/takeover IP address to export the GFS file system
to the rest of the cluster.

I have no idea if it will work or not.

I also recall that Philipp has said it wouldn't be that big of a deal to
make DRBD active/active specifically for this situation. Now that GFS is
once again open source there may be more interest in doing the work to
make DRBD active/active. It's been quite a while since I messed with
DRBD but you may want to look into it.


-- 
Tony Fraser
tony at sybaspace.com
Sybaspace Internet Solutions                        System Administrator
phone: (250) 246-5368                                fax: (250) 246-5398



From andriy at druzhba.lviv.ua  Tue Sep 28 08:00:03 2004
From: andriy at druzhba.lviv.ua (Andriy Galetski)
Date: Tue, 28 Sep 2004 11:00:03 +0300
Subject: [Linux-cluster] Multihome network configuration not working .!
References: <003501c4a46b$3cf79150$f13cc90a@druzhba.com>
	<20040927132619.GC30900@tykepenguin.com>
Message-ID: <00b201c4a531$2909dfb0$f13cc90a@druzhba.com>


> On Mon, Sep 27, 2004 at 11:23:15AM +0300, Andriy Galetski wrote:
> > Hi !
> > I want to use Multihome network configuration with broadcast
> > but it is not working. ...
> >
> > When disconnected main connection
> > and trying:
> > cman_tool join -n cl020
> >
> > On other node receive:
> > Sep 27 10:36:13 cl10 kernel: CMAN: node cl20 rejoining
> > Sep 27 10:36:17 cl10 kernel: CMAN: node cl20 is not responding -
removing
> > from the cluster
>
>
> You shouldn't specify the alternative node name in this case. Just a
normal
> cman_tool join should work, cman will try both interfaces during the join
> anyway.
>
> If it doesn't, can you enable COMMS & MEMB debugging in cnxman-private.h
and
> send me the output please? (yes,I know it means a recompile...sorry)
>
> patrick
>

Ok

Two node cluster with configuration:

(cluster.conf XML content stripped by the list archiver)

......
On node cl20 I disconnect eth0, which belongs to node name="cl20";
eth1, which belongs to altname name="cl020", is left working.
Then on cl10 I run:

cman_tool join -d
alternative node name cl010
setup up interface for address: cl10
Broadcast address for c3cc90a is ff3cc90a
setup up interface for address: cl010
Broadcast address for a00a8c0 is ff00a8c0

The messages received:
Sep 28 10:41:57 cl10 kernel: CMAN: Waiting to join or form a Linux-cluster
Sep 28 10:41:57 cl10 ccsd[30333]: Connected to cluster infrastruture via:
CMAN/SM Plugin v1.0
Sep 28 10:41:57 cl10 ccsd[30333]: Initial status:: Inquorate
Sep 28 10:42:13 cl10 kernel: CMAN: forming a new cluster
Sep 28 10:42:13 cl10 kernel: CMAN: quorum regained, resuming activity

After that, on cl20 I run:
[root at cl20 root]# cman_tool join -d
alternative node name cl020
setup up interface for address: cl20
Broadcast address for 143cc90a is ff3cc90a
setup up interface for address: cl020
Broadcast address for 1400a8c0 is ff00a8c0

CL20 messages:
Sep 28 10:48:22 cl20 kernel: CMAN: Waiting to join or form a Linux-cluster
Sep 28 10:48:22 cl20 ccsd[27495]: Connected to cluster infrastruture via:
CMAN/SM Plugin v1.0
Sep 28 10:48:22 cl20 ccsd[27495]: Initial status:: Inquorate
Sep 28 10:48:23 cl20 kernel: : 02 00 1a 99 c0 a8 00 0a 00 00 00 00 00 00 00
00
Sep 28 10:48:23 cl20 kernel: CMAN: sending membership request
Sep 28 10:48:23 cl20 kernel: : 02 00 1a 99 c0 a8 00 0a 00 00 00 00 00 00 00
00
Sep 28 10:48:23 cl20 last message repeated 7 times
Sep 28 10:48:23 cl20 kernel: CMAN: got node cl10
Sep 28 10:48:23 cl20 kernel: : 02 00 1a 99 c0 a8 00 0a 00 00 00 00 00 00 00
00
Sep 28 10:48:23 cl20 last message repeated 3 times
Sep 28 10:49:08 cl20 kernel: CMAN: Being told to leave the cluster by node 1
Sep 28 10:49:08 cl20 kernel: CMAN: we are leaving the cluster

CL10 messages:
Sep 28 10:48:23 cl10 kernel: : 02 00 1a 99 c0 a8 00 14 00 00 00 00 00 00 00
00
Sep 28 10:48:23 cl10 last message repeated 2 times
Sep 28 10:48:23 cl10 kernel: CMAN: got node cl20
Sep 28 10:48:23 cl10 kernel: : 02 00 1a 99 c0 a8 00 14 00 00 00 00 00 00 00
00
Sep 28 10:48:23 cl10 last message repeated 6 times
Sep 28 10:48:27 cl10 kernel: CMAN: node cl20 is not responding - removing
from the cluster
Sep 28 10:48:31 cl10 kernel: CMAN: node cl20 is not responding - removing
from the cluster
Sep 28 10:48:42 cl10 kernel: : 02 00 1a 99 c0 a8 00 14 00 00 00 00 00 00 00
00

[root at cl10 root]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    1   M   cl10
   2    1    1   X   cl20

[root at cl10 root]# ping cl020
PING cl020.druzhba.lviv.ua (192.168.0.20) 56(84) bytes of data.
64 bytes from cl020.druzhba.lviv.ua (192.168.0.20): icmp_seq=0 ttl=64
time=0.155 ms
64 bytes from cl020.druzhba.lviv.ua (192.168.0.20): icmp_seq=1 ttl=64
time=0.078 ms


As a result, CMAN doesn't use the altname interfaces :(

Any ideas???

Thanks.



From pcaulfie at redhat.com  Tue Sep 28 10:26:40 2004
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Tue, 28 Sep 2004 11:26:40 +0100
Subject: [Linux-cluster] Multihome network configuration not working .!
In-Reply-To: <00b201c4a531$2909dfb0$f13cc90a@druzhba.com>
References: <003501c4a46b$3cf79150$f13cc90a@druzhba.com>
	<20040927132619.GC30900@tykepenguin.com>
	<00b201c4a531$2909dfb0$f13cc90a@druzhba.com>
Message-ID: <20040928102640.GA6381@tykepenguin.com>

Thanks. 

here's a patch that fixes it (yes, it is in cman_tool, not the kernel), or you
can get it from CVS.


diff -u -r1.9 join.c
--- cman/cman_tool/join.c       21 Sep 2004 08:04:43 -0000      1.9
+++ cman/cman_tool/join.c       28 Sep 2004 10:23:30 -0000
@@ -152,7 +152,7 @@
     if (bind(local_sock, (struct sockaddr *)&local_sin, sizeof(local_sin)))
        die("Cannot bind local address: %s", strerror(errno));
 
-    sock_info.number = num;
+    sock_info.number = num + 1;
     sock_info.multicast = 1;
 
     /* Pass the multicast socket to kernel space */
@@ -233,7 +233,7 @@
     if (bind(local_sock, (struct sockaddr *)&local_sin, sizeof(local_sin)))
        die("Cannot bind local address: %s", strerror(errno));
 
-    sock_info.number = num;
+    sock_info.number = num + 1;
 
     /* Pass the multicast socket to kernel space */
     sock_info.fd = mcast_sock;

-- 

patrick



From anton at hq.310.ru  Tue Sep 28 12:44:40 2004
From: anton at hq.310.ru (Anton Nekhoroshikh)
Date: Tue, 28 Sep 2004 16:44:40 +0400
Subject: [Linux-cluster] immutable flag on gfs
Message-ID: <538414694.20040928164440@hq.310.ru>

Hi all!

I sent a patch adding support for the immutable flag; is it not acceptable?


e-mail: anton at hq.310.ru
http://www.310.ru



From andriy at druzhba.lviv.ua  Tue Sep 28 13:34:45 2004
From: andriy at druzhba.lviv.ua (Andriy Galetski)
Date: Tue, 28 Sep 2004 16:34:45 +0300
Subject: [Linux-cluster] Compile from CVS Error
Message-ID: <003501c4a55f$eb28eae0$f13cc90a@druzhba.com>

Hi !
I have a problem compiling today's CVS sources.

  1. apply kernel patch (to enable flock(2) to work among gfs nodes)
   cd /path/to/kernel
   patch -p1 < cluster/gfs-kernel/patches//00001.patch

2. build kernel
   cd /path/to/kernel
   make; make modules_install

3. build cluster kernel modules and user space
   cd cluster
   ./configure --kernel_src=/path/to/kernel
   make install

After that:
....
.....
  CC [M]  /usr/local/src/cluster/gfs-kernel/src/gfs/ops_export.o
  CC [M]  /usr/local/src/cluster/gfs-kernel/src/gfs/ops_file.o
/usr/local/src/cluster/gfs-kernel/src/gfs/ops_file.c: In function
`do_flock':
/usr/local/src/cluster/gfs-kernel/src/gfs/ops_file.c:1559: warning: implicit
declaration of function `flock_lock_file_wait'
/usr/local/src/cluster/gfs-kernel/src/gfs/ops_file.c: At top level:
/usr/local/src/cluster/gfs-kernel/src/gfs/ops_file.c:1656: unknown field
`flock' specified in initializer
/usr/local/src/cluster/gfs-kernel/src/gfs/ops_file.c:1656: warning:
initialization from incompatible pointer type
/usr/local/src/cluster/gfs-kernel/src/gfs/ops_file.c:1666: unknown field
`flock' specified in initializer
/usr/local/src/cluster/gfs-kernel/src/gfs/ops_file.c:1666: warning:
initialization from incompatible pointer type
make[5]: *** [/usr/local/src/cluster/gfs-kernel/src/gfs/ops_file.o] Error 1
make[4]: *** [_module_/usr/local/src/cluster/gfs-kernel/src/gfs] Error 2
make[4]: Leaving directory `/usr/src/linux-2.6.8.1'
make[3]: *** [all] Error 2
make[3]: Leaving directory `/usr/local/src/cluster/gfs-kernel/src/gfs'
make[2]: *** [install] Error 2
make[2]: Leaving directory `/usr/local/src/cluster/gfs-kernel/src'
make[1]: *** [install] Error 2
make[1]: Leaving directory `/usr/local/src/cluster/gfs-kernel'
make: *** [install] Error 2

Thanks.



From adrian.immler at magix.net  Tue Sep 28 14:55:33 2004
From: adrian.immler at magix.net (Adrian Immler)
Date: Tue, 28 Sep 2004 16:55:33 +0200
Subject: [Linux-cluster] Compile from CVS Error
In-Reply-To: <003501c4a55f$eb28eae0$f13cc90a@druzhba.com>
References: <003501c4a55f$eb28eae0$f13cc90a@druzhba.com>
Message-ID: <20040928165533.6b4f3172.adrian.immler@magix.net>

apply all kernel patches ...


On Tue, 28 Sep 2004 16:34:45 +0300
"Andriy Galetski"  wrote:

> Hi !
> I have problem when compile today CVS sources
> 
>   1. apply kernel patch (to enable flock(2) to work among gfs nodes)
>    cd /path/to/kernel
>    patch -p1 < cluster/gfs-kernel/patches//00001.patch
> 
> 2. build kernel
>    cd /path/to/kernel
>    make; make modules_install
> 
> 3. build cluster kernel modules and user space
>    cd cluster
>    ./configure --kernel_src=/path/to/kernel
>    make install
> 
> After that:
> ....
> .....
>   CC [M]  /usr/local/src/cluster/gfs-kernel/src/gfs/ops_export.o
>   CC [M]  /usr/local/src/cluster/gfs-kernel/src/gfs/ops_file.o
> /usr/local/src/cluster/gfs-kernel/src/gfs/ops_file.c: In function
> `do_flock':
> /usr/local/src/cluster/gfs-kernel/src/gfs/ops_file.c:1559: warning: implicit
> declaration of function `flock_lock_file_wait'
> /usr/local/src/cluster/gfs-kernel/src/gfs/ops_file.c: At top level:
> /usr/local/src/cluster/gfs-kernel/src/gfs/ops_file.c:1656: unknown field
> `flock' specified in initializer
> /usr/local/src/cluster/gfs-kernel/src/gfs/ops_file.c:1656: warning:
> initialization from incompatible pointer type
> /usr/local/src/cluster/gfs-kernel/src/gfs/ops_file.c:1666: unknown field
> `flock' specified in initializer
> /usr/local/src/cluster/gfs-kernel/src/gfs/ops_file.c:1666: warning:
> initialization from incompatible pointer type
> make[5]: *** [/usr/local/src/cluster/gfs-kernel/src/gfs/ops_file.o] Error 1
> make[4]: *** [_module_/usr/local/src/cluster/gfs-kernel/src/gfs] Error 2
> make[4]: Leaving directory `/usr/src/linux-2.6.8.1'
> make[3]: *** [all] Error 2
> make[3]: Leaving directory `/usr/local/src/cluster/gfs-kernel/src/gfs'
> make[2]: *** [install] Error 2
> make[2]: Leaving directory `/usr/local/src/cluster/gfs-kernel/src'
> make[1]: *** [install] Error 2
> make[1]: Leaving directory `/usr/local/src/cluster/gfs-kernel'
> make: *** [install] Error 2
> 
> Thanks.
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster


-- 

MAGIX AG
Adrian Immler
Internet Department
Rotherstr. 19
10245 Berlin
GERMANY

Tel.:   +49 (0)30- 29 39 2- 347
Fax:    +49 (0)30- 29 39 2- 400
Email:  mailto:aimmler at magix.net
Web:    www.magix.com

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
The information in this email is intended only for the addressee named
above. Access to this email by anyone else is unauthorized. If you are
not the intended recipient of this message any disclosure, copying,
distribution or any action taken in reliance on it is prohibited and
may be unlawful.

MAGIX does not warrant that any attachments are free from viruses or
other defects and accepts no liability for any losses resulting from
infected email transmissions. Please note that any views expressed in
this email may be those of the originator and do not necessarily
represent the agenda of the company.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 



From ben.m.cahill at intel.com  Mon Sep 27 21:42:36 2004
From: ben.m.cahill at intel.com (Cahill, Ben M)
Date: Mon, 27 Sep 2004 14:42:36 -0700
Subject: [Linux-cluster] FW: [PATCH] More comments for GFS files
Message-ID: <0604335B7764D141945E202153105960033E25BD@orsmsx404.amr.corp.intel.com>

 

-----Original Message-----
From: Cahill, Ben M 
Sent: Thursday, September 23, 2004 4:12 PM
To: RedHat Cluster (linux-cluster at redhat.com)
Subject: [PATCH] More comments for GFS files

Hi all,

Below please find a patch for more comments in some files in
gfs-kernel/src/gfs:

dio.c
file.c
gfs_ioctl.c
incore.h
log.c
lops.c
lvb.h
rgrp.c

The focus was on incore.h.

These were diffed against Thursday's CVS, and I've built and run GFS
after applying the patches, so things should hopefully apply cleanly.

-- Ben --

Opinions are mine, not Intel's




diff -ru cvs/cluster/gfs-kernel/src/gfs/dio.c
build_092304/cluster/gfs-kernel/src/gfs/dio.c
--- cvs/cluster/gfs-kernel/src/gfs/dio.c	2004-06-24
04:53:27.000000000 -0400
+++ build_092304/cluster/gfs-kernel/src/gfs/dio.c	2004-09-23
14:18:00.229937128 -0400
@@ -1078,6 +1078,9 @@
  * gfs_sync_meta - sync all the buffers in a filesystem
  * @sdp: the filesystem
  *
+ * Flush metadata blocks to on-disk journal, then
+ * Flush metadata blocks (now in AIL) to on-disk in-place locations
+ * Periodically keep checking until done (AIL empty)
  */
 
 void
diff -ru cvs/cluster/gfs-kernel/src/gfs/file.c
build_092304/cluster/gfs-kernel/src/gfs/file.c
--- cvs/cluster/gfs-kernel/src/gfs/file.c	2004-06-24
04:53:27.000000000 -0400
+++ build_092304/cluster/gfs-kernel/src/gfs/file.c	2004-09-23
14:18:09.964457256 -0400
@@ -199,15 +199,18 @@
 	char **p = (char **)buf;
 	int error = 0;
 
+	/* the dinode block always gets journaled */
 	if (bh->b_blocknr == ip->i_num.no_addr) {
 		GFS_ASSERT_INODE(!new, ip,);
 		gfs_trans_add_bh(ip->i_gl, bh);
 		memcpy(bh->b_data + offset, *p, size);
+	/* data blocks get journaled only for special files */
 	} else if (gfs_is_jdata(ip)) {
 		gfs_trans_add_bh(ip->i_gl, bh);
 		memcpy(bh->b_data + offset, *p, size);
 		if (new)
 			gfs_buffer_clear_ends(bh, offset, size, TRUE);
+	/* non-journaled data blocks get written to in-place disk blocks
*/
 	} else {
 		memcpy(bh->b_data + offset, *p, size);
 		if (new)
@@ -240,11 +243,13 @@
 	char **p = (char **)buf;
 	int error = 0;
 
+	/* the dinode block always gets journaled */
 	if (bh->b_blocknr == ip->i_num.no_addr) {
 		GFS_ASSERT_INODE(!new, ip,);
 		gfs_trans_add_bh(ip->i_gl, bh);
 		if (copy_from_user(bh->b_data + offset, *p, size))
 			error = -EFAULT;
+	/* data blocks get journaled only for special files */
 	} else if (gfs_is_jdata(ip)) {
 		gfs_trans_add_bh(ip->i_gl, bh);
 		if (copy_from_user(bh->b_data + offset, *p, size))
@@ -254,6 +259,7 @@
 			if (error)
 				memset(bh->b_data + offset, 0, size);
 		}
+	/* non-journaled data blocks get written to in-place disk blocks
*/
 	} else {
 		if (copy_from_user(bh->b_data + offset, *p, size))
 			error = -EFAULT;
diff -ru cvs/cluster/gfs-kernel/src/gfs/gfs_ioctl.h
build_092304/cluster/gfs-kernel/src/gfs/gfs_ioctl.h
--- cvs/cluster/gfs-kernel/src/gfs/gfs_ioctl.h	2004-09-13
18:48:45.000000000 -0400
+++ build_092304/cluster/gfs-kernel/src/gfs/gfs_ioctl.h	2004-09-23
13:32:21.518284584 -0400
@@ -131,18 +131,21 @@
 	unsigned int gt_demote_secs;
 	unsigned int gt_incore_log_blocks;
 	unsigned int gt_jindex_refresh_secs;
+
+	/* how often various daemons run (seconds) */
 	unsigned int gt_depend_secs;
-	unsigned int gt_scand_secs;
-	unsigned int gt_recoverd_secs;
-	unsigned int gt_logd_secs;
-	unsigned int gt_quotad_secs;
-	unsigned int gt_inoded_secs;
-	unsigned int gt_quota_simul_sync;
-	unsigned int gt_quota_warn_period;
+	unsigned int gt_scand_secs;       /* find unused glocks and
inodes */
+	unsigned int gt_recoverd_secs;    /* recover journal of crashed
node */
+	unsigned int gt_logd_secs;        /* update log tail as AIL
flushes */
+	unsigned int gt_quotad_secs;      /* sync changes to quota file,
clean*/
+	unsigned int gt_inoded_secs;      /* toss unused inodes */
+
+	unsigned int gt_quota_simul_sync; /* max # quotavals to sync at
once */
+	unsigned int gt_quota_warn_period; /* secs between quota warn
msgs */
 	unsigned int gt_atime_quantum;
-	unsigned int gt_quota_quantum;
-	unsigned int gt_quota_scale_num;
-	unsigned int gt_quota_scale_den;
+	unsigned int gt_quota_quantum;    /* secs between syncs to quota
file */
+	unsigned int gt_quota_scale_num;  /* numerator */
+	unsigned int gt_quota_scale_den;  /* denominator */
 	unsigned int gt_quota_enforce;
 	unsigned int gt_quota_account;
 	unsigned int gt_new_files_jdata;
diff -ru cvs/cluster/gfs-kernel/src/gfs/incore.h
build_092304/cluster/gfs-kernel/src/gfs/incore.h
--- cvs/cluster/gfs-kernel/src/gfs/incore.h	2004-09-13
18:48:45.000000000 -0400
+++ build_092304/cluster/gfs-kernel/src/gfs/incore.h	2004-09-23
14:58:06.330154296 -0400
@@ -11,20 +11,28 @@
 
************************************************************************
*******
 
************************************************************************
******/
 
+/*
+ *  In-core (memory/RAM) structures.
+ *  These do not appear on-disk.  See gfs_ondisk.h for on-disk
structures.
+ */
+
 #ifndef __INCORE_DOT_H__
 #define __INCORE_DOT_H__
 
+/*  flags used in function call parameters  */
+
 #define DIO_NEW           (0x00000001)
-#define DIO_FORCE         (0x00000002)
-#define DIO_CLEAN         (0x00000004)
-#define DIO_DIRTY         (0x00000008)
-#define DIO_START         (0x00000010)
-#define DIO_WAIT          (0x00000020)
-#define DIO_METADATA      (0x00000040)
-#define DIO_DATA          (0x00000080)
+#define DIO_FORCE         (0x00000002)  /* force read of block from
disk */
+#define DIO_CLEAN         (0x00000004)  /* don't write to disk */
+#define DIO_DIRTY         (0x00000008)  /* data changed, must write to
disk */
+#define DIO_START         (0x00000010)  /* start disk read or write */
+#define DIO_WAIT          (0x00000020)  /* wait for disk r/w to
complete */
+
+#define DIO_METADATA      (0x00000040)  /* process glock's protected
metadata */
+#define DIO_DATA          (0x00000080)  /* process glock's protected
filedata */
 #define DIO_INVISIBLE     (0x00000100)
-#define DIO_CHECK         (0x00000200)
-#define DIO_ALL           (0x00000400)
+#define DIO_CHECK         (0x00000200)  /* make sure glock's AIL is
empty */
+#define DIO_ALL           (0x00000400)  /* flush all AIL transactions
to disk */
 
 /*  Structure prototypes  */
 
@@ -98,6 +106,7 @@
 	void (*lo_after_scan) (struct gfs_sbd * sdp, unsigned int jid,
 			       unsigned int pass);
 
+	/* type of element (glock/buf/unlinked/quota) */
 	char *lo_name;
 };
 
@@ -107,227 +116,351 @@
  */
 
 struct gfs_log_element {
-	struct gfs_log_operations *le_ops;
+	struct gfs_log_operations *le_ops; /* vector of functions */
 
-	struct gfs_trans *le_trans;
-	struct list_head le_list;
+	struct gfs_trans *le_trans;     /* we're part of this
transaction */
+	struct list_head le_list;       /* link to transaction's element
list */
 };
 
+/*
+ * Meta-header cache structure.
+ * One for each metadata block that we've read from disk, and are still
using.
+ * In-core superblock structure hosts the actual cache.
+ * Also, each resource group keeps a list of cached blocks within its
scope.
+ */
 struct gfs_meta_header_cache {
-	struct list_head mc_list_hash;
-	struct list_head mc_list_single;
-	struct list_head mc_list_rgd;
+	/* Links to various lists */
+	struct list_head mc_list_hash;   /* superblock's hashed list */
+	struct list_head mc_list_single; /* superblock's single list */
+	struct list_head mc_list_rgd;    /* resource group's list */
 
-	uint64_t mc_block;
-	struct gfs_meta_header mc_mh;
+	uint64_t mc_block;               /* block # (in-place address)
*/
+	struct gfs_meta_header mc_mh;    /* payload: the block's
meta-header */
 };
 
+/*
+ * Dependency cache structure.
+ * In-core superblock structure hosts the actual cache.
+ * Also, each resource group keeps a list of dependency blocks within
its scope.
+ */
 struct gfs_depend {
-	struct list_head gd_list_hash;
-	struct list_head gd_list_rgd;
+	/* Links to various lists */
+	struct list_head gd_list_hash;  /* superblock's hashed list */
+	struct list_head gd_list_rgd;   /* resource group's list */
 
-	struct gfs_rgrpd *gd_rgd;
-	uint64_t gd_formal_ino;
-	unsigned long gd_time;
+	struct gfs_rgrpd *gd_rgd;       /* resource group descriptor */
+	uint64_t gd_formal_ino;         /* inode ID */
+	unsigned long gd_time;          /* time (jiffies) when put on
list */
 };
 
 /*
- *  Structure containing information about the allocation bitmaps.
- *  There are one of these for each fs block that the bitmap for
- *  the resource group header covers.
+ *  Block allocation bitmap descriptor structure.
+ *  One of these for each fs block that contains bitmap data
+ *    (i.e. the resource group header blocks and their following bitmap
blocks).
+ *  Each allocatable fs data block is represented by 2 bits (4 alloc
states).
  */
 
 struct gfs_bitmap {
-	uint32_t bi_offset;	/* The offset in the buffer of the first
byte */
-	uint32_t bi_start;	/* The position of the first byte in
this block */
-	uint32_t bi_len;	/* The number of bytes in this block */
+	uint32_t bi_offset;  /* Byte offset of bitmap within this bit
block
+	                        (non-zero only for an rgrp header block)
*/
+	uint32_t bi_start;   /* Data block (rgrp scope, 32-bit)
represented
+	                        by the first bit-pair in this bit block
*/
+	uint32_t bi_len;     /* The number of bitmap bytes in this bit
block */
 };
 
 /*
- *  Structure containing information Resource Groups
+ *  Resource Group (Rgrp) descriptor structure.
+ *  There is one of these for each resource (block) group in the fs.
+ *  The filesystem is divided into a number of resource groups to allow
+ *    simultaneous block alloc operations by a number of nodes.
  */
 
 struct gfs_rgrpd {
-	struct list_head rd_list;	/* Link with superblock */
-	struct list_head rd_list_mru;
-	struct list_head rd_recent;	/* Recently used rgrps */
+	/* Links to superblock lists */
+	struct list_head rd_list;       /* on-disk-order list of all
rgrps */
+	struct list_head rd_list_mru;   /* Most Recently Used list of
all rgs */
+	struct list_head rd_recent;     /* recently used rgrps */
 
-	struct gfs_glock *rd_gl;	/* Glock for rgrp */
+	struct gfs_glock *rd_gl;        /* Glock for this rgrp */
 
-	unsigned long rd_flags;
+	unsigned long rd_flags;         /* ?? */
 
-	struct gfs_rindex rd_ri;	/* Resource Index structure */
-	struct gfs_rgrp rd_rg;	        /* Resource Group structure */
-	uint64_t rd_rg_vn;
+	struct gfs_rindex rd_ri;        /* Resource Index (on-disk)
structure */
+	struct gfs_rgrp rd_rg;          /* Resource Group (on-disk)
structure */
+	uint64_t rd_rg_vn;              /* version #: if != glock's
gl_vn,
+	                                   we need to read rgrp fm disk
*/
 
-	struct gfs_bitmap *rd_bits;
-	struct buffer_head **rd_bh;
+	/* Block alloc bitmap cache */
+	struct gfs_bitmap *rd_bits;     /* Array of block bitmap
descriptors */
+	struct buffer_head **rd_bh;     /* Array of ptrs to block bitmap
bh's */
 
-	uint32_t rd_last_alloc_data;
-	uint32_t rd_last_alloc_meta;
+	/* Block allocation strategy, rgrp scope. Start at these blocks
when
+	 * searching for next data/meta block to alloc */
+	uint32_t rd_last_alloc_data;    /* most recent data block
allocated */
+	uint32_t rd_last_alloc_meta;    /* most recent meta block
allocated */
 
-	struct list_head rd_mhc;
-	struct list_head rd_depend;
+	struct list_head rd_mhc;        /* cached meta-headers for this
rgrp */
+	struct list_head rd_depend;     /* dependency elements */
 
-	struct gfs_sbd *rd_sbd;
+	struct gfs_sbd *rd_sbd;		/* fs incore superblock (fs
instance) */
 };
 
 /*
  *  Per-buffer data
+ *  One of these is attached as GFS private data to each fs block's
buffer_head.
+ *  These also link into the Active Items Lists (AIL) (buffers flushed
to
+ *    on-disk log, but not yet flushed to on-disk in-place locations)
attached
+ *    to transactions and glocks.
  */
 
 struct gfs_bufdata {
-	struct buffer_head *bd_bh;	/* struct buffer_head which this
struct belongs to */
-	struct gfs_glock *bd_gl;	/* Pointer to Glock struct for
this bh */
+	struct buffer_head *bd_bh;  /* we belong to this Linux
buffer_head */
+	struct gfs_glock *bd_gl;    /* this glock protects buffer's
payload */
 
 	struct gfs_log_element bd_new_le;
 	struct gfs_log_element bd_incore_le;
 
-	char *bd_frozen;
-	struct semaphore bd_lock;
+	char *bd_frozen;            /* "frozen" copy of buffer's data */
+	struct semaphore bd_lock;   /* protects access to this structure
*/
 
-	unsigned int bd_pinned;	                /* Pin count */
-	struct list_head bd_ail_tr_list;	/* List of buffers
hanging off tr_ail_bufs */
-	struct list_head bd_ail_gl_list;	/* List of buffers
hanging off gl_ail_bufs */
+	/* "pin" means keep buffer in RAM, don't write to disk (yet) */
+	unsigned int bd_pinned;	         /* recursive pin count */
+	struct list_head bd_ail_tr_list; /* link to transaction's AIL
list */
+	struct list_head bd_ail_gl_list; /* link to glock's AIL list */
 };
 
 /*
  *  Glock operations
+ *  One set of operations for each glock, the set selected by type of
glock.
+ *  These functions get called at various points in a glock's lifetime.
+ *  "xmote" = promote (lock) a glock at inter-node level.
+ *  "th" = top half, "bh" = bottom half
  */
 
 struct gfs_glock_operations {
+
+	/* before acquiring a lock at inter-node level */
 	void (*go_xmote_th) (struct gfs_glock * gl, unsigned int state,
 			     int flags);
+
+	/* after acquiring a lock at inter-node level */
 	void (*go_xmote_bh) (struct gfs_glock * gl);
+
+	/* before releasing a lock at inter-node level, calls go_sync
*/
 	void (*go_drop_th) (struct gfs_glock * gl);
+
+	/* after releasing a lock at inter-node level, calls go_inval
*/
 	void (*go_drop_bh) (struct gfs_glock * gl);
+
+	/* sync dirty data to disk before releasing an inter-node lock
+	 * (another node needs to read the updated data from disk) */
 	void (*go_sync) (struct gfs_glock * gl, int flags);
+
+	/* invalidate local data just after releasing an inter-node lock
+	 * (another node may change the on-disk data, so it's no good to
us) */
 	void (*go_inval) (struct gfs_glock * gl, int flags);
+
+	/* lock-type-specific check to see if it's okay to unlock a
glock */
 	int (*go_demote_ok) (struct gfs_glock * gl);
+
+	/* after locking at local process level */
 	int (*go_lock) (struct gfs_glock * gl, int flags);
+
+	/* before unlocking at local process level */
 	void (*go_unlock) (struct gfs_glock * gl, int flags);
+
+	/* after receiving a callback: another node needs the lock */
 	void (*go_callback) (struct gfs_glock * gl, unsigned int state);
+
 	void (*go_greedy) (struct gfs_glock * gl);
-	int go_type;
+
+	/* lock type: locks with same lock # (usually an fs block #),
+	 *   but different types, are different locks */
+	int go_type;    /* glock type */
 };
 
-/*  Actions  */
-#define HIF_MUTEX               (0)
-#define HIF_PROMOTE             (1)
-#define HIF_DEMOTE              (2)
-#define HIF_GREEDY              (3)
+/*
+ *  Glock holder structure
+ *  These coordinate the use, within this node, of an acquired
inter-node lock.
+ *  One for each holder of a glock.  A glock may be shared within a
node by
+ *    several processes, or even by several recursive requests from the
same
+ *    process.  Each is a separate "holder".  To be shared locally, the
glock
+ *    must be in "SHARED" or "DEFERRED" state at inter-node level,
which means
+ *    that processes on other nodes might also read the protected
entity.
+ *  When a process needs to manipulate a lock, it requests it via one
of
+ *    these holder structures.  If the request cannot be satisfied
immediately,
+ *    the holder structure gets queued on one of these glock lists:
+ *    1) waiters1, for gaining exclusive access to the glock structure.
+ *    2) waiters2, for locking (promoting) or unlocking (demoting) a
lock.
+ *       This may require changing lock state at inter-node level.
+ *  When holding a lock, gfs_holder struct stays on glock's holder
list.
+ *  See gfs-kernel/src/harness/lm_interface.h for gh_state (LM_ST_...)
+ *    and gh_flags (LM_FLAG...) fields.
+ *  Also see glock.h for gh_flags field (GL_...) flags.
+ */
+/*  Action requests  */
+#define HIF_MUTEX       (0)  /* exclusive access to glock struct */
+#define HIF_PROMOTE     (1)  /* change lock to more restrictive state
*/
+#define HIF_DEMOTE      (2)  /* change lock to less restrictive state
*/
+#define HIF_GREEDY      (3)
 
 /*  States  */
-#define HIF_ALLOCED             (4)
-#define HIF_DEALLOC             (5)
-#define HIF_HOLDER              (6)
-#define HIF_FIRST               (7)
-#define HIF_WAKEUP              (8)
-#define HIF_RECURSE             (9)
+#define HIF_ALLOCED     (4)  /* holder structure is or was in use */
+#define HIF_DEALLOC     (5)  /* holder structure no longer in use */
+#define HIF_HOLDER      (6)  /* we have been granted a hold on the lock
*/
+#define HIF_FIRST       (7)  /* we are first on glock's holder list */
+#define HIF_WAKEUP      (8)  /* wake us up when request is satisfied */
+#define HIF_RECURSE     (9)  /* recursive locks on same glock by same
process */
 
 struct gfs_holder {
-	struct list_head gh_list;
+	struct list_head gh_list;      /* link to one of glock's holder lists */
 
-	struct gfs_glock *gh_gl;
-	struct task_struct *gh_owner;
-	unsigned int gh_state;
-	int gh_flags;
-
-	int gh_error;
-	unsigned long gh_iflags;
-	struct completion gh_wait;
+	struct gfs_glock *gh_gl;       /* glock that we're holding */
+	struct task_struct *gh_owner;  /* Linux process that is the holder */
+
+	/* request to change lock state */
+	unsigned int gh_state;         /* LM_ST_... requested lock state */
+	int gh_flags;                  /* GL_... or LM_FLAG_... req modifiers */
+
+	int gh_error;                  /* GLR_... CANCELLED or TRYFAILED */
+	unsigned long gh_iflags;       /* HIF_... see above */
+	struct completion gh_wait;     /* wait for completion of ... */
 };
 
 /*
  *  Glock Structure
- */
-
-#define GLF_PLUG                (0)
-#define GLF_LOCK                (1)
-#define GLF_STICKY              (2)
+ *  One for each inter-node lock held by this node.
+ *  A glock is a local representation/abstraction of an inter-node lock.
+ *    Inter-node locks are managed by a "lock module" which plugs in to the
+ *    lock harness / glock interface (see gfs-kernel/harness).  Different
+ *    lock modules support different lock protocols (e.g. GULM, GDLM, no_lock).
+ *  A glock may have one or more holders within a node.  See gfs_holder above.
+ *  Glocks are managed within a hash table hosted by the in-core superblock.
+ *  After all holders have released a glock, it will stay in the hash table
+ *    cache for a certain time (gt_prefetch_secs), during which the inter-node
+ *    lock will not be released unless another node needs the lock.  This
+ *    provides better performance in case this node needs the glock again soon.
+ *  Each glock has an associated vector of lock-type-specific "glops" functions
+ *    which are called at important times during the life of a glock, and
+ *    which define the type of lock (e.g. dinode, rgrp, non-disk, etc).
+ *    See gfs_glock_operations above.
+ *  A glock, at inter-node scope, is identified by the following dimensions:
+ *    1)  lock number (usually a block # for on-disk protected entities,
+ *           or a fixed assigned number for non-disk locks, e.g. MOUNT).
+ *    2)  lock type (actually, the type of entity protected by the lock).
+ *    3)  lock namespace, to support multiple GFS filesystems simultaneously.
+ *           Namespace (usually cluster:filesystem) is specified when mounting.
+ *           See man page for gfs_mount.
+ *  Glocks require support of Lock Value Blocks (LVBs) by the inter-node lock
+ *    manager.  LVBs are small (32-byte) chunks of data associated with a given
+ *    lock, that can be quickly shared between cluster nodes.  Used for certain
+ *    purposes such as sharing an rgroup's block usage statistics without
+ *    requiring the overhead of:
+ *      -- sync-to-disk by one node, then a
+ *      -- read from disk by another node.
+ *  
+ */
+
+#define GLF_PLUG                (0)  /* dummy */
+#define GLF_LOCK                (1)  /* exclusive access to glock structure */
+#define GLF_STICKY              (2)  /* permanent lock, used sparingly */
 #define GLF_PREFETCH            (3)
 #define GLF_SYNC                (4)
 #define GLF_DIRTY               (5)
-#define GLF_LVB_INVALID         (6)
+#define GLF_LVB_INVALID         (6)  /* LVB does not contain valid data */
 #define GLF_SKIP_WAITERS2       (7)
 #define GLF_GREEDY              (8)
 
 struct gfs_glock {
-	struct list_head gl_list;
-	unsigned long gl_flags;
-	struct lm_lockname gl_name;
-	atomic_t gl_count;
-
-	spinlock_t gl_spin;
-
-	unsigned int gl_state;
-	struct list_head gl_holders;
-	struct list_head gl_waiters1;	/*  HIF_MUTEX  */
-	struct list_head gl_waiters2;	/*  HIF_DEMOTE, HIF_GREEDY  */
-	struct list_head gl_waiters3;	/*  HIF_PROMOTE  */
+	struct list_head gl_list;    /* link to superblock's hash table */
+	unsigned long gl_flags;      /* GLF_... see above */
+	struct lm_lockname gl_name;  /* lock number and lock type */
+	atomic_t gl_count;           /* recursive access/usage count */
+
+	spinlock_t gl_spin;          /* protects some members of this struct */
+
+	/* lock state reflects inter-node manager's lock state */
+	unsigned int gl_state;       /* LM_ST_... see harness/lm_interface.h */
+
+	/* lists of gfs_holders */
+	struct list_head gl_holders;  /* all current holders of the glock */
+	struct list_head gl_waiters1; /* wait for excl. access to glock struct*/
+	struct list_head gl_waiters2; /* HIF_DEMOTE, HIF_GREEDY */
+	struct list_head gl_waiters3; /* HIF_PROMOTE */
 
-	struct gfs_glock_operations *gl_ops;
+	struct gfs_glock_operations *gl_ops; /* function vector, defines type */
 
 	struct gfs_holder *gl_req_gh;
 	gfs_glop_bh_t gl_req_bh;
 
-	lm_lock_t *gl_lock;
-	char *gl_lvb;
-	atomic_t gl_lvb_count;
-
-	uint64_t gl_vn;
-	unsigned long gl_stamp;
-	void *gl_object;
+	lm_lock_t *gl_lock;       /* lock module's private lock data */
+	char *gl_lvb;             /* Lock Value Block */
+	atomic_t gl_lvb_count;    /* LVB recursive usage (hold/unhold) count */
+
+	uint64_t gl_vn;           /* incremented when protected data changes */
+	unsigned long gl_stamp;   /* glock cache retention timer */
+	void *gl_object;          /* the protected entity (e.g. a dinode) */
 
 	struct gfs_log_element gl_new_le;
 	struct gfs_log_element gl_incore_le;
 
-	struct gfs_gl_hash_bucket *gl_bucket;
-	struct list_head gl_reclaim;
+	struct gfs_gl_hash_bucket *gl_bucket; /* our bucket in hash table */
+	struct list_head gl_reclaim;          /* link to "reclaim" list */
 
-	struct gfs_sbd *gl_sbd;
+	struct gfs_sbd *gl_sbd;               /* superblock (fs instance) */
 
-	struct inode *gl_aspace;
-	struct list_head gl_dirty_buffers;
-	struct list_head gl_ail_bufs;
+	struct inode *gl_aspace;              /* Linux VFS inode */
+	struct list_head gl_dirty_buffers;    /* ?? */
+	struct list_head gl_ail_bufs;         /* AIL buffers protected by us */
 };
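
To make the holder/waiter relationship described in the comment blocks above a
bit more concrete: a hold request either joins the glock's holder list right
away or queues as a waiter and sleeps until the inter-node lock is promoted.
The sketch below is illustrative only -- the ex_* types and ex_glock_acquire()
are made up for this aside and are not GFS functions; only the list/completion
pattern is meant to match the description above.

/* Illustrative only: made-up ex_* types, not the real GFS interfaces. */
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/completion.h>

struct ex_glock {
	spinlock_t       gl_spin;
	unsigned int     gl_state;     /* current inter-node state */
	struct list_head gl_holders;   /* granted holders */
	struct list_head gl_waiters3;  /* holders waiting for a promote */
};

struct ex_holder {
	struct list_head  gh_list;     /* lives on one of the lists above */
	unsigned int      gh_state;    /* requested state */
	int               gh_error;
	struct completion gh_wait;     /* completed when the hold is granted */
};

/* Assumption for the sketch: only an exact state match can share. */
static int ex_state_compatible(struct ex_glock *gl, unsigned int state)
{
	return gl->gl_state == state;
}

/* A process-level request: either join gl_holders now, or queue as a
 * waiter and sleep until a promote at inter-node level wakes us. */
static int ex_glock_acquire(struct ex_glock *gl, unsigned int state,
			    struct ex_holder *gh)
{
	gh->gh_state = state;
	gh->gh_error = 0;
	init_completion(&gh->gh_wait);

	spin_lock(&gl->gl_spin);
	if (ex_state_compatible(gl, state)) {
		list_add_tail(&gh->gh_list, &gl->gl_holders);
		complete(&gh->gh_wait);            /* granted immediately */
	} else {
		list_add_tail(&gh->gh_list, &gl->gl_waiters3);
	}
	spin_unlock(&gl->gl_spin);

	wait_for_completion(&gh->gh_wait);         /* woken once on gl_holders */
	return gh->gh_error;
}
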
 
 /*
  *  In-Place Reservation structure
+ *  Coordinates allocation of "in-place" (as opposed to journal) fs blocks,
+ *     which contain persistent inode/file/directory data and metadata.
+ *     These blocks are the allocatable blocks within resource groups (i.e.
+ *     not including rgrp header and block alloc bitmap blocks).
+ *  gfs_inplace_reserve() calculates a fulfillment plan for allocating blocks,
+ *     based on block statistics in the resource group headers.
+ *  Then, gfs_blkalloc() or gfs_metaalloc() walks the block alloc bitmaps
+ *     to do the actual allocation.
  */
 
 struct gfs_alloc {
-	/*  Quota stuff  */
-
-	unsigned int al_qd_num;
-	struct gfs_quota_data *al_qd[4];
-	struct gfs_holder al_qd_ghs[4];
-
-	/* Filled in by the caller to gfs_inplace_reserve() */
-
-	uint32_t al_requested_di;
-	uint32_t al_requested_meta;
-	uint32_t al_requested_data;
-
-	/* Filled in by gfs_inplace_reserve() */
-
-	char *al_file;
-	unsigned int al_line;
-	struct gfs_holder al_ri_gh;
-	struct gfs_holder al_rgd_gh;
-	struct gfs_rgrpd *al_rgd;
-	uint32_t al_reserved_meta;
-	uint32_t al_reserved_data;
-
-	/* Filled in by gfs_blkalloc() */
-
-	uint32_t al_alloced_di;
-	uint32_t al_alloced_meta;
-	uint32_t al_alloced_data;
+	/*
+	 *  Up to 4 quotas (including an inode's user and group quotas)
+	 *  can track changes in block allocation
+	 */
+
+	unsigned int al_qd_num;          /* # of quotas tracking changes */
+	struct gfs_quota_data *al_qd[4]; /* ptrs to quota structures */
+	struct gfs_holder al_qd_ghs[4];  /* holders for quota glocks */
+
+	/* Request, filled in by the caller to gfs_inplace_reserve() */
+
+	uint32_t al_requested_di;     /* number of dinodes to reserve */
+	uint32_t al_requested_meta;   /* number of metadata blocks to reserve */
+	uint32_t al_requested_data;   /* number of data blocks to reserve */
+
+	/* Fulfillment plan, filled in by gfs_inplace_reserve() */
+
+	char *al_file;                /* debug info, .c file making request */
+	unsigned int al_line;         /* debug info, line of code making req */
+	struct gfs_holder al_ri_gh;   /* glock holder for resource grp index */
+	struct gfs_holder al_rgd_gh;  /* glock holder for al_rgd rgrp */
+	struct gfs_rgrpd *al_rgd;     /* resource group from which to alloc */
+	uint32_t al_reserved_meta;    /* alloc this # meta blocks from al_rgd */
+	uint32_t al_reserved_data;    /* alloc this # data blocks from al_rgd */
+
+	/* Actual alloc, filled in by gfs_blkalloc()/gfs_metaalloc(), etc. */
+
+	uint32_t al_alloced_di;       /* # dinode blocks allocated */
+	uint32_t al_alloced_meta;     /* # meta blocks allocated */
+	uint32_t al_alloced_data;     /* # data blocks allocated */
 
 	/* Dinode allocation crap */
 
-	struct gfs_unlinked *al_ul;
+	struct gfs_unlinked *al_ul;   /* unlinked dinode log entry */
 };
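
The three field groups above (requested / reserved / alloced) suggest a simple
request-plan-actual flow.  The sketch below shows that flow with made-up ex_*
names; ex_inplace_reserve() and ex_blkalloc() only stand in for the
gfs_inplace_reserve() and gfs_blkalloc() named in the comment, and their real
signatures differ.

#include <stdint.h>

/* Mirrors the three stages of struct gfs_alloc above (names shortened). */
struct ex_alloc {
	uint32_t requested_meta, requested_data;  /* what the caller asked for */
	uint32_t reserved_meta,  reserved_data;   /* plan made by the reserve step */
	uint32_t alloced_meta,   alloced_data;    /* what the bitmap walk handed out */
};

/* Trivial stubs so the sketch compiles; the real functions read rgrp
 * header statistics and walk the block-allocation bitmaps. */
static int ex_inplace_reserve(struct ex_alloc *al)
{
	al->reserved_meta = al->requested_meta;   /* pretend an rgrp had room */
	al->reserved_data = al->requested_data;
	return 0;
}

static int ex_blkalloc(struct ex_alloc *al, uint64_t *block)
{
	al->alloced_data += 1;
	*block = 0;                               /* a real bitmap walk picks this */
	return 0;
}

int ex_write_needs_one_block(struct ex_alloc *al, uint64_t *block)
{
	/* 1) Request: one data block, plus one metadata block in case the
	 *    write has to grow the indirect-address tree. */
	al->requested_meta = 1;
	al->requested_data = 1;

	/* 2) Plan: pick a resource group with room, record the reservation. */
	if (ex_inplace_reserve(al))
		return -1;

	/* 3) Actual: take the blocks out of that rgrp's bitmaps. */
	return ex_blkalloc(al, block);
}
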
 
 /*
@@ -339,27 +472,32 @@
 #define GIF_SW_PAGED            (2)
 
 struct gfs_inode {
-	struct gfs_inum i_num;
+	struct gfs_inum i_num;   /* formal inode # and block address */
 
-	atomic_t i_count;
-	unsigned long i_flags;
+	atomic_t i_count;        /* recursive usage (get/put) count */
+	unsigned long i_flags;   /* GIF_...  see above */
 
-	uint64_t i_vn;
-	struct gfs_dinode i_di;
+	uint64_t i_vn;           /* version #: if different from glock's vn,
+	                            we need to read inode from disk */
+	struct gfs_dinode i_di;  /* dinode (on-disk) structure */
 
-	struct gfs_glock *i_gl;
-	struct gfs_sbd *i_sbd;
-	struct inode *i_vnode;
+	struct gfs_glock *i_gl;  /* this glock protects this inode */
+	struct gfs_sbd *i_sbd;   /* superblock (fs instance structure) */
+	struct inode *i_vnode;   /* Linux VFS inode structure */
 
-	struct gfs_holder i_iopen_gh;
+	struct gfs_holder i_iopen_gh;  /* glock holder for # inode opens lock */
 
-	struct gfs_alloc *i_alloc;
-	uint64_t i_last_rg_alloc;
+	/* block allocation strategy, inode scope */
+	struct gfs_alloc *i_alloc; /* in-place block reservation structure */
+	uint64_t i_last_rg_alloc;  /* most recnt block alloc was fm this rgrp */
 
-	struct task_struct *i_creat_task;
-	pid_t i_creat_pid;
+	/* Linux process that originally created this inode */
+	struct task_struct *i_creat_task; /* Linux "current" task struct */
+	pid_t i_creat_pid;                /* Linux process ID current->pid */
 
-	spinlock_t i_lock;
+	spinlock_t i_lock;                /* protects this structure */
+
+	/* cache of most-recently used buffers in indirect addressing chain */
 	struct buffer_head *i_cache[GFS_MAX_META_HEIGHT];
 
 	unsigned int i_greedy;
@@ -378,8 +516,8 @@
 	struct semaphore f_fl_lock;
 	struct gfs_holder f_fl_gh;
 
-	struct gfs_inode *f_inode;
-	struct file *f_vfile;
+	struct gfs_inode *f_inode;        /* incore GFS inode */
+	struct file *f_vfile;             /* Linux file struct */
 };
 
 /*
@@ -393,112 +531,143 @@
 #define ULF_LOCK                (4)
 
 struct gfs_unlinked {
-	struct list_head ul_list;
-	unsigned int ul_count;
+	struct list_head ul_list;    /* link to superblock's sd_unlinked_list */
+	unsigned int ul_count;       /* usage count */
 
-	struct gfs_inum ul_inum;
-	unsigned long ul_flags;
+	struct gfs_inum ul_inum;     /* formal inode #, block addr */
+	unsigned long ul_flags;      /* ULF_... */
 
-	struct gfs_log_element ul_new_le;
-	struct gfs_log_element ul_incore_le;
-	struct gfs_log_element ul_ondisk_le;
+	struct gfs_log_element ul_new_le;    /* new, not yet committed */
+	struct gfs_log_element ul_incore_le; /* committed to incore log */
+	struct gfs_log_element ul_ondisk_le; /* committed to ondisk log */
 };
 
 /*
  *  Quota log element
+ *  One for each logged change in a block alloc value affecting a given quota.
+ *  Only one of these for a given quota within a given transaction;
+ *    multiple changes, within one transaction, for a given quota will be
+ *    combined into one log element.
  */
 
 struct gfs_quota_le {
-	struct gfs_log_element ql_le;
+	/* Log element maps us to a particular set of log operations functions,
+	 *    and to a particular transaction */
+	struct gfs_log_element ql_le;    /* generic log element structure */
 
-	struct gfs_quota_data *ql_data;
-	struct list_head ql_data_list;
+	struct gfs_quota_data *ql_data;  /* the quota we're changing */
+	struct list_head ql_data_list;   /* link to quota's log element list */
 
-	int64_t ql_change;
+	int64_t ql_change;           /* # of blocks alloc'd (+) or freed (-) */
 };
 
-#define QDF_USER                (0)
-#define QDF_OD_LIST             (1)
-#define QDF_LOCK                (2)
+/*
+ *  Quota structure
+ *  One for each user or group quota.
+ *  Summarizes all block allocation activity for a given quota, and supports
+ *    recording updates of current block alloc values in GFS' special quota
+ *    file, including the journaling of these updates, encompassing
+ *    multiple transactions and log dumps.
+ */
+
+#define QDF_USER                (0)   /* user (1) vs. group (0) quota */
+#define QDF_OD_LIST             (1)   /* waiting for sync to quota file */
+#define QDF_LOCK                (2)   /* protects access to this structure */
 
 struct gfs_quota_data {
-	struct list_head qd_list;
-	unsigned int qd_count;
+	struct list_head qd_list;     /* Link to superblock's sd_quota_list */
+	unsigned int qd_count;        /* usage/reference count */
 
-	uint32_t qd_id;
-	unsigned long qd_flags;
+	uint32_t qd_id;               /* user or group ID number */
+	unsigned long qd_flags;       /* QDF_... */
 
-	struct list_head qd_le_list;
+	/* this list is for non-log-dump transactions */
+	struct list_head qd_le_list;  /* List of gfs_quota_le log elements */
 
-	int64_t qd_change_new;
-	int64_t qd_change_ic;
-	int64_t qd_change_od;
-	int64_t qd_change_sync;
+	/* summary of block alloc changes affecting this quota, in various
+	 * stages of logging & syncing changes to the special quota file */
+	int64_t qd_change_new;  /* new, not yet committed to in-core log*/
+	int64_t qd_change_ic;   /* committed to in-core log */
+	int64_t qd_change_od;   /* committed to on-disk log */
+	int64_t qd_change_sync; /* being synced to the in-place quota file */
 
-	struct gfs_quota_le qd_ondisk_ql;
-	uint64_t qd_sync_gen;
+	struct gfs_quota_le qd_ondisk_ql; /* log element for log dump */
+	uint64_t qd_sync_gen;         /* sync-to-quota-file generation # */
 
-	struct gfs_glock *qd_gl;
-	struct gfs_quota_lvb qd_qb;
+	/* glock provides protection for quota, *and* provides
+	 * lock value block (LVB) communication, between nodes, of current
+	 * quota values.  Shared lock -> LVB read.  EX lock -> LVB write. */
+	struct gfs_glock *qd_gl;      /* glock for this quota */
+	struct gfs_quota_lvb qd_qb;   /* LVB (limit/warn/value) */
 
-	unsigned long qd_last_warn;
+	unsigned long qd_last_warn;   /* jiffies of last warning to user */
 };
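
One plausible reading of the four qd_change_* counters above is a pipeline
that a block-allocation delta moves through as it is logged and finally
written into the quota file.  The sketch below shows that reading with made-up
ex_* names; the real GFS bookkeeping may carry the values differently.

#include <stdint.h>

/* The four staging counters from struct gfs_quota_data above. */
struct ex_quota {
	int64_t change_new;   /* logged in a transaction, not yet committed */
	int64_t change_ic;    /* committed to the in-core log */
	int64_t change_od;    /* committed to the on-disk log */
	int64_t change_sync;  /* being written into the quota file itself */
};

/* A +n or -n block delta enters at the "new" stage... */
void ex_quota_record(struct ex_quota *qd, int64_t nblocks)
{
	qd->change_new += nblocks;
}

/* ...and each logging step moves the pending amount one stage along. */
void ex_quota_commit_incore(struct ex_quota *qd)
{
	qd->change_ic += qd->change_new;
	qd->change_new = 0;
}

void ex_quota_commit_ondisk(struct ex_quota *qd)
{
	qd->change_od += qd->change_ic;
	qd->change_ic = 0;
}

void ex_quota_start_sync(struct ex_quota *qd)
{
	qd->change_sync = qd->change_od;   /* value now headed for the quota file */
	qd->change_od = 0;
}
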
 
+/*
+ * Log Buffer descriptor structure
+ * One for each fs block buffer recorded in the log
+ */
 struct gfs_log_buf {
-	struct list_head lb_list;
+	/* link to one of the transaction structure's lists */
+	struct list_head lb_list;      /* link to tr_free_bufs or tr_list */
 
 	struct buffer_head lb_bh;
 	struct buffer_head *lb_unlock;
 };
 
 /*
- *  Transaction structures
+ *  Transaction structure
+ *  One for each transaction
+ *  This coordinates the logging and flushing of written metadata.
  */
 
 #define TRF_LOG_DUMP            (0x00000001)
 
 struct gfs_trans {
-	struct list_head tr_list;
+
+	/* link to various lists */
+	struct list_head tr_list;      /* superblk's incore trans or AIL list*/
 
 	/* Initial creation stuff */
 
-	char *tr_file;
-	unsigned int tr_line;
+	char *tr_file;                 /* debug info: .c file creating trans */
+	unsigned int tr_line;          /* debug info: codeline creating trans */
 
-	unsigned int tr_mblks_asked;	/* Number of log blocks asked to be reserved */
-	unsigned int tr_eblks_asked;
-	unsigned int tr_seg_reserved;	/* Number of segments reserved */
+	/* reservations for on-disk space in journal */
+	unsigned int tr_mblks_asked;   /* # of meta log blocks requested */
+	unsigned int tr_eblks_asked;   /* # of extra log blocks requested */
+	unsigned int tr_seg_reserved;  /* # of segments actually reserved */
 
-	struct gfs_holder *tr_t_gh;
+	struct gfs_holder *tr_t_gh;    /* glock holder for this transaction */
 
 	/* Stuff filled in during creation */
 
-	unsigned int tr_flags;
-	struct list_head tr_elements;
+	unsigned int tr_flags;         /* TRF_... */
+	struct list_head tr_elements;  /* List of this trans' log elements */
 
 	/* Stuff modified during the commit */
 
-	unsigned int tr_num_free_bufs;
+	unsigned int tr_num_free_bufs; /* List of free gfs_log_buf structs */
 	struct list_head tr_free_bufs;
-	unsigned int tr_num_free_bmem;
+	unsigned int tr_num_free_bmem; /* List of free fs-block-size buffers */
 	struct list_head tr_free_bmem;
 
-	uint64_t tr_log_head;	        /* The current log head */
-	uint64_t tr_first_head;	        /* First header block */
+	uint64_t tr_log_head;          /* The current log head */
+	uint64_t tr_first_head;	       /* First header block */
 
-	struct list_head tr_bufs;	/* List of buffers going to the log */
+	struct list_head tr_bufs;      /* List of buffers going to the log */
 
-	/* Stuff that's part of the AIL */
+	/* Stuff that's part of the Active Items List (AIL) */
 
-	struct list_head tr_ail_bufs;
+	struct list_head tr_ail_bufs;  /* List of buffers on AIL list */
 
-	/* Private data for different log element types */
+	/* # log elements of various types on tr_elements list */
 
-	unsigned int tr_num_gl;
-	unsigned int tr_num_buf;
-	unsigned int tr_num_iul;
-	unsigned int tr_num_ida;
-	unsigned int tr_num_q;
+	unsigned int tr_num_gl;        /* glocks */
+	unsigned int tr_num_buf;       /* buffers */
+	unsigned int tr_num_iul;       /* unlinked inodes */
+	unsigned int tr_num_ida;       /* de-allocated inodes */
+	unsigned int tr_num_q;         /* quotas */
 };
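
Read together, the reservation and element-count fields above imply a
begin/modify/commit shape for a transaction.  The outline below is
hypothetical -- ex_trans_begin(), ex_trans_add_buf() and ex_trans_end() are
placeholders, not the in-tree entry points -- but the fields mirror
tr_mblks_asked, tr_eblks_asked and tr_seg_reserved.

/* Hypothetical outline only; ex_* names are made up for illustration. */
struct ex_trans {
	unsigned int mblks_asked;    /* metadata log blocks requested */
	unsigned int eblks_asked;    /* extra log blocks requested */
	unsigned int seg_reserved;   /* journal segments actually reserved */
	unsigned int num_buf;        /* buffer log elements added so far */
};

static int ex_trans_begin(struct ex_trans *tr)
{
	/* pretend the journal had room; real code may sleep for segments */
	tr->seg_reserved = tr->mblks_asked + tr->eblks_asked;
	return 0;
}

static void ex_trans_add_buf(struct ex_trans *tr)
{
	tr->num_buf++;               /* one more dirtied block to be logged */
}

static int ex_trans_end(struct ex_trans *tr)
{
	(void)tr;                    /* commit point in the real code */
	return 0;
}

int ex_do_small_update(void)
{
	struct ex_trans tr = { 0 };

	/* Reserve journal space up front: blocks we know we will dirty,
	 * plus a little extra for headers/overflow. */
	tr.mblks_asked = 4;
	tr.eblks_asked = 1;
	if (ex_trans_begin(&tr))
		return -1;

	/* Dirty some metadata; each dirtied block becomes a log element. */
	ex_trans_add_buf(&tr);
	ex_trans_add_buf(&tr);

	/* Commit: elements move to the incore log, later to the on-disk log,
	 * and finally (via the Active Items List) to their in-place blocks. */
	return ex_trans_end(&tr);
}
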
 
 /*
@@ -511,153 +680,201 @@
 } __attribute__ ((__aligned__(SMP_CACHE_BYTES)));
 
 /*
- *  Super Block Data Structure  (One per filesystem)
- */
+ *  "Super Block" Data Structure
+ *  One per mounted filesystem.
+ *  This is the big instance structure that ties everything together for
+ *    a given mounted filesystem.  Each GFS mount has its own, supporting
+ *    mounts of multiple GFS filesystems on each node.
+ *  Pointer to this is usually seen as "sdp" throughout code.
+ *  This is a very large structure, as structures go, in part because it
+ *    contains arrays of hash buckets for various in-core caches.
+ */
+
+/* sd_flags */
+
+#define SDF_JOURNAL_LIVE        (0)  /* journaling is active (fs is writeable)*/
+
+/* daemon run (1) / stop (0) flags */
+#define SDF_SCAND_RUN           (1)  /* put unused glocks on reclaim queue */
+#define SDF_GLOCKD_RUN          (2)  /* reclaim (dealloc) unused glocks */
+#define SDF_RECOVERD_RUN        (3)  /* recover journal of a crashed node */
+#define SDF_LOGD_RUN            (4)  /* update log tail after AIL flushed */
+#define SDF_QUOTAD_RUN          (5)  /* sync quota changes to file, cleanup */
+#define SDF_INODED_RUN          (6)  /* deallocate unlinked inodes */
+
+/* (re)mount options from Linux VFS */
+#define SDF_NOATIME             (7)  /* don't change access time */
+#define SDF_ROFS                (8)  /* read-only mode (no journal) */
 
-#define SDF_JOURNAL_LIVE        (0)
-#define SDF_SCAND_RUN           (1)
-#define SDF_GLOCKD_RUN          (2)
-#define SDF_RECOVERD_RUN        (3)
-#define SDF_LOGD_RUN            (4)
-#define SDF_QUOTAD_RUN          (5)
-#define SDF_INODED_RUN          (6)
-#define SDF_NOATIME             (7)
-#define SDF_ROFS                (8)
+/* journal log dump support */
 #define SDF_NEED_LOG_DUMP       (9)
 #define SDF_FOUND_UL_DUMP       (10)
 #define SDF_FOUND_Q_DUMP        (11)
-#define SDF_IN_LOG_DUMP         (12)
+#define SDF_IN_LOG_DUMP         (12) /* serializes log dumps */
+
 
-#define GFS_GL_HASH_SHIFT       (13)
+/* constants for various in-core caches */
+
+/* glock cache */
+#define GFS_GL_HASH_SHIFT       (13)    /* # hash buckets = 8K */
 #define GFS_GL_HASH_SIZE        (1 << GFS_GL_HASH_SHIFT)
 #define GFS_GL_HASH_MASK        (GFS_GL_HASH_SIZE - 1)
 
-#define GFS_MHC_HASH_SHIFT      (10)
+/* meta header cache */
+#define GFS_MHC_HASH_SHIFT      (10)    /* # hash buckets = 1K */
 #define GFS_MHC_HASH_SIZE       (1 << GFS_MHC_HASH_SHIFT)
 #define GFS_MHC_HASH_MASK       (GFS_MHC_HASH_SIZE - 1)
 
-#define GFS_DEPEND_HASH_SHIFT   (10)
+/* dependency cache */
+#define GFS_DEPEND_HASH_SHIFT   (10)    /* # hash buckets = 1K */
 #define GFS_DEPEND_HASH_SIZE    (1 << GFS_DEPEND_HASH_SHIFT)
 #define GFS_DEPEND_HASH_MASK    (GFS_DEPEND_HASH_SIZE - 1)
 
 struct gfs_sbd {
-	struct gfs_sb sd_sb;	        /* Super Block */
+	struct gfs_sb sd_sb;            /* GFS on-disk Super Block image
*/
 
-	struct super_block *sd_vfs;	/* FS's device independent sb */
+	struct super_block *sd_vfs;     /* Linux VFS device independent
sb */
 
-	struct gfs_args sd_args;
-	unsigned long sd_flags;
+	struct gfs_args sd_args;        /* Mount arguments */
+	unsigned long sd_flags;         /* SDF_... see above */
 
-	struct gfs_tune sd_tune;	/* FS tuning structure */
+	struct gfs_tune sd_tune;	/* Filesystem tuning structure
*/
 
 	/* Resource group stuff */
 
-	struct gfs_inode *sd_riinode;	/* rindex inode */
-	uint64_t sd_riinode_vn;	/* Version number of the resource index
inode */
-
-	struct list_head sd_rglist;	/* List of resource groups */
-	struct semaphore sd_rindex_lock;
-
-	struct list_head sd_rg_mru_list;	/* List of resource
groups in MRU order */
-	spinlock_t sd_rg_mru_lock;	/* Lock for MRU list */
-	struct list_head sd_rg_recent;	/* Recently used rgrps */
-	spinlock_t sd_rg_recent_lock;
-	struct gfs_rgrpd *sd_rg_forward;	/* Next new rgrp to try
for allocation */
-	spinlock_t sd_rg_forward_lock;
+	struct gfs_inode *sd_riinode;	/* Resource Index (rindex) inode
*/
+	uint64_t sd_riinode_vn;	        /* Resource Index version #
(detects
+	                                   whether new rgrps have been
added) */
+
+	struct list_head sd_rglist;	/* List of all resource groups,
*/
+	struct semaphore sd_rindex_lock;/*     on-disk order */
+	struct list_head sd_rg_mru_list;/* List of resource groups, */
+	spinlock_t sd_rg_mru_lock;      /*     most-recently-used (MRU)
order */
+	struct list_head sd_rg_recent;	/* List of rgrps from which
blocks */
+	spinlock_t sd_rg_recent_lock;   /*     were recently allocated
*/
+	struct gfs_rgrpd *sd_rg_forward;/* Next rgrp from which to
attempt */
+	spinlock_t sd_rg_forward_lock;  /*     a block alloc */
 
-	unsigned int sd_rgcount;	/* Count of resource groups */
+	unsigned int sd_rgcount;	/* Total # of resource groups */
 
 	/*  Constants computed on mount  */
 
-	uint32_t sd_fsb2bb;
-	uint32_t sd_fsb2bb_shift;	/* Shift FS Block numbers to the
left by
-					   this to get buffer cache
blocks  */
-	uint32_t sd_diptrs;	/* Number of pointers in a dinode */
-	uint32_t sd_inptrs;	/* Number of pointers in a indirect
block */
-	uint32_t sd_jbsize;	/* Size of a journaled data block */
-	uint32_t sd_hash_bsize;	/* sizeof(exhash block) */
+	/* "bb" == "basic block" == 512Byte sector */
+	uint32_t sd_fsb2bb;             /* # 512B basic blocks in a FS
block */
+	uint32_t sd_fsb2bb_shift;       /* Shift sector # to the right
by 
+	                                   this to get FileSystem block
addr */
+	uint32_t sd_diptrs;     /* Max # of block pointers in a dinode
*/
+	uint32_t sd_inptrs;     /* Max # of block pointers in an
indirect blk */
+	uint32_t sd_jbsize;     /* Payload size (bytes) of a journaled
metadata
+	                               block (GFS journals all meta
blocks) */
+	uint32_t sd_hash_bsize; /* sizeof(exhash block) */
 	uint32_t sd_hash_bsize_shift;
-	uint32_t sd_hash_ptrs;	/* Number of points in a hash block */
-	uint32_t sd_max_dirres;	/* Maximum space needed to add a
directory entry */
-	uint32_t sd_max_height;	/* Maximum height of a file's metadata
tree */
+	uint32_t sd_hash_ptrs;  /* Number of points in a hash block */
+	uint32_t sd_max_dirres; /* Max blocks needed to add a directory
entry */
+	uint32_t sd_max_height;	/* Max height of a file's indir addr
tree */
 	uint64_t sd_heightsize[GFS_MAX_META_HEIGHT];
-	uint32_t sd_max_jheight;	/* Maximum height of a journaled
file's metadata tree */
+	uint32_t sd_max_jheight; /* Max hgt, journaled file's indir addr
tree */
 	uint64_t sd_jheightsize[GFS_MAX_META_HEIGHT];
 
 	/*  Lock Stuff  */
 
+	/* glock cache (all glocks currently held by this node for this
fs) */
 	struct gfs_gl_hash_bucket sd_gl_hash[GFS_GL_HASH_SIZE];
 
-	struct list_head sd_reclaim_list;
+	/* glock reclaim support for scand and glockd */
+	struct list_head sd_reclaim_list;   /* list of glocks to reclaim
*/
 	spinlock_t sd_reclaim_lock;
 	wait_queue_head_t sd_reclaim_wchan;
-	atomic_t sd_reclaim_count;
+	atomic_t sd_reclaim_count;          /* # glocks on reclaim list
*/
 
-	struct lm_lockstruct sd_lockstruct;
+	/* lock module tells us if we're first-to-mount, 
+	 *    which journal to use, etc. */
+	struct lm_lockstruct sd_lockstruct; /* info provided by lock
module */
 
-	struct list_head sd_mhc[GFS_MHC_HASH_SIZE];
-	struct list_head sd_mhc_single;
+	/*  Other caches */
+
+	/* meta-header cache (incore copies of on-disk meta headers)*/
+	struct list_head sd_mhc[GFS_MHC_HASH_SIZE]; /* hash buckets */
+	struct list_head sd_mhc_single;     /* non-hashed list of all
MHCs */
 	spinlock_t sd_mhc_lock;
-	atomic_t sd_mhc_count;
+	atomic_t sd_mhc_count;              /* # MHCs in cache */
 
-	struct list_head sd_depend[GFS_DEPEND_HASH_SIZE];
+	/* dependency cache */
+	struct list_head sd_depend[GFS_DEPEND_HASH_SIZE];  /* hash
buckets */
 	spinlock_t sd_depend_lock;
-	atomic_t sd_depend_count;
+	atomic_t sd_depend_count;           /* # dependencies in cache
*/
 
-	struct gfs_holder sd_live_gh;
+	/* LIVE inter-node lock indicates that fs is mounted on at least
+	 * one node */
+	struct gfs_holder sd_live_gh;       /* glock holder for LIVE
lock */
 
+	/* for quiescing the filesystem */
 	struct gfs_holder sd_freeze_gh;
 	struct semaphore sd_freeze_lock;
 	unsigned int sd_freeze_count;
 
 	/*  Inode Stuff  */
 
-	struct gfs_inode *sd_rooti;	/* FS's root inode */
+	struct gfs_inode *sd_rooti;         /* FS's root inode */
 
-	struct gfs_glock *sd_rename_gl;	/* rename glock */
+	/* only 1 node at a time may rename (e.g. mv) a file or dir */
+	struct gfs_glock *sd_rename_gl;     /* rename glock */
 
 	/*  Daemon stuff  */
 
-	struct task_struct *sd_scand_process;
-	unsigned int sd_glockd_num;
+	/* scan for glocks and inodes to toss from memory */
+	struct task_struct *sd_scand_process; /* scand places on reclaim
list*/
+	unsigned int sd_glockd_num;    /* # of glockd procs to do
reclaiming*/
+
+	/* recover journal of a crashed node */
 	struct task_struct *sd_recoverd_process;
+
+	/* update log tail as AIL gets flushed to in-place on-disk
blocks */
 	struct task_struct *sd_logd_process;
+
+	/* sync quota updates to disk, and clean up unused quota structs
*/
 	struct task_struct *sd_quotad_process;
+
+	/* clean up unused inode structures */
 	struct task_struct *sd_inoded_process;
 
+	/* support for starting/stopping daemons */
 	struct semaphore sd_thread_lock;
 	struct completion sd_thread_completion;
 
 	/*  Log stuff  */
 
-	struct gfs_glock *sd_trans_gl;	/* transaction glock */
+	/* transaction lock protects journal replay (recovery) */
+	struct gfs_glock *sd_trans_gl;	/* transaction glock structure
*/
 
-	struct gfs_inode *sd_jiinode;	/* jindex inode */
-	uint64_t sd_jiinode_vn;	/* Version number of the journal index
inode */
+	struct gfs_inode *sd_jiinode;	/* journal index inode */
+	uint64_t sd_jiinode_vn;         /* journal index version #
(detects
+	                                   if new journals have been
added) */
 
 	unsigned int sd_journals;	/* Number of journals in the FS
*/
-	struct gfs_jindex *sd_jindex;	/* Array of Jindex structures
describing this FS's journals */
+	struct gfs_jindex *sd_jindex;	/* Array of journal descriptors
*/
 	struct semaphore sd_jindex_lock;
-	unsigned long sd_jindex_refresh_time;
+	unsigned long sd_jindex_refresh_time; /* poll for new journals
(secs) */
 
-	struct gfs_jindex sd_jdesc;	/* Jindex structure describing
this machine's journal */
-	struct gfs_holder sd_journal_gh;	/* the glock for this
machine's journal */
+	struct gfs_jindex sd_jdesc;	 /* this machine's journal
descriptor */
+	struct gfs_holder sd_journal_gh; /* this machine's journal glock
*/
 
 	uint64_t sd_sequence;	/* Assigned to xactions in order they
commit */
 	uint64_t sd_log_head;	/* Block number of next journal write */
 	uint64_t sd_log_wrap;
 
 	spinlock_t sd_log_seg_lock;
-	unsigned int sd_log_seg_free;	/* Free segments in the log */
+	unsigned int sd_log_seg_free;	/* # of free segments in the log
*/
 	struct list_head sd_log_seg_list;
 	wait_queue_head_t sd_log_seg_wait;
 
-	struct list_head sd_log_ail;	/* struct gfs_trans structures
that form the Active Items List 
-					   "next" is the head, "prev" is
the tail  */
-
-	struct list_head sd_log_incore;	/* transactions that have been
commited incore (but not ondisk)
-					   "next" is the newest, "prev"
is the oldest  */
-	unsigned int sd_log_buffers;	/* Number of buffers in the
incore log */
+	/* "Active Items List" of transactions that have been flushed to
+	 * on-disk log, and are waiting for flush to in-place on-disk
blocks */
+	struct list_head sd_log_ail;	/* "next" is head, "prev" is
tail */
+
+	/* Transactions committed incore, but not yet flushed to on-disk
log */
+	struct list_head sd_log_incore;	/* "next" is newest, "prev" is
oldest */
+	unsigned int sd_log_buffers;	/* # of buffers in the incore
log */
 
 	struct semaphore sd_log_lock;	/* Lock for access to log values
*/
 
@@ -674,16 +891,17 @@
 
 	/*  quota crap  */
 
-	struct list_head sd_quota_list;
+	struct list_head sd_quota_list; /* list of all gfs_quota_data
structs */
 	spinlock_t sd_quota_lock;
 
-	atomic_t sd_quota_count;
-	atomic_t sd_quota_od_count;
+	atomic_t sd_quota_count;        /* # quotas on sd_quota_list */
+	atomic_t sd_quota_od_count;     /* # quotas waiting for sync to
+	                                   special on-disk quota file */
 
-	struct gfs_inode *sd_qinode;
+	struct gfs_inode *sd_qinode;    /* special on-disk quota file */
 
-	uint64_t sd_quota_sync_gen;
-	unsigned long sd_quota_sync_time;
+	uint64_t sd_quota_sync_gen;     /* generation, incr when sync to
file */
+	unsigned long sd_quota_sync_time; /* jiffies, last sync to quota
file */
 
 	/*  license crap  */
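
The SDF_SCAND_RUN / SDF_GLOCKD_RUN flags and the sd_reclaim_* fields above
describe a producer/consumer handoff between the scand and glockd daemons.
Below is a rough sketch of that handoff; the ex_* types and helpers are made
up, and only the kernel list/waitqueue calls are real.

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/wait.h>
#include <asm/atomic.h>

struct ex_glock {
	struct list_head gl_reclaim;
};

struct ex_sbd {
	struct list_head  sd_reclaim_list;
	spinlock_t        sd_reclaim_lock;
	wait_queue_head_t sd_reclaim_wchan;
	atomic_t          sd_reclaim_count;
};

/* Placeholders so the sketch stands alone. */
static int ex_glock_is_unused(struct ex_glock *gl) { (void)gl; return 1; }
static void ex_reclaim_one_glock(struct ex_sbd *sdp) { (void)sdp; }

/* scand's side: every gt_scand_secs, unused glocks go on the reclaim list. */
static void ex_scand_queue(struct ex_sbd *sdp, struct ex_glock *gl)
{
	if (!ex_glock_is_unused(gl))
		return;
	spin_lock(&sdp->sd_reclaim_lock);
	list_add_tail(&gl->gl_reclaim, &sdp->sd_reclaim_list);
	spin_unlock(&sdp->sd_reclaim_lock);
	atomic_inc(&sdp->sd_reclaim_count);
	wake_up(&sdp->sd_reclaim_wchan);
}

/* glockd's side: sleep until there is work, then release glocks (and with
 * them the inter-node locks) one at a time. */
static int ex_glockd(void *data)
{
	struct ex_sbd *sdp = data;

	for (;;) {
		wait_event_interruptible(sdp->sd_reclaim_wchan,
					 atomic_read(&sdp->sd_reclaim_count) > 0);
		atomic_dec(&sdp->sd_reclaim_count);
		ex_reclaim_one_glock(sdp);
	}
	return 0;
}
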
 
diff -ru cvs/cluster/gfs-kernel/src/gfs/log.c build_092304/cluster/gfs-kernel/src/gfs/log.c
--- cvs/cluster/gfs-kernel/src/gfs/log.c	2004-07-12 15:22:44.000000000 -0400
+++ build_092304/cluster/gfs-kernel/src/gfs/log.c	2004-09-23 14:18:29.406501616 -0400
@@ -134,7 +134,8 @@
 /**
  * gfs_ail_start - Start I/O on the AIL
  * @sdp: the filesystem
- * @flags:
+ * @flags:  DIO_ALL -- flush *all* AIL transactions to disk
+ *          default -- flush first-on-list AIL transaction to disk
  *
  */
 
@@ -1207,7 +1208,7 @@
 		LO_CLEAN_DUMP(sdp, le);
 	}
 
-	/* If there isn't anything the AIL, we won't get back the log
+	/* If there isn't anything in the AIL, we won't get back the log
 	   space we reserved unless we do it ourselves. */
 
 	if (list_empty(&sdp->sd_log_ail)) {
diff -ru cvs/cluster/gfs-kernel/src/gfs/lops.c build_092304/cluster/gfs-kernel/src/gfs/lops.c
--- cvs/cluster/gfs-kernel/src/gfs/lops.c	2004-06-24 04:53:28.000000000 -0400
+++ build_092304/cluster/gfs-kernel/src/gfs/lops.c	2004-09-23 14:18:41.725628824 -0400
@@ -442,6 +442,13 @@
  * @blkno: the location of the log's copy of the block
  *
  * Returns: 0 on success, -EXXX on failure
+ *
+ * Read in-place block from disk
+ * Read log (journal) block from disk
+ * Compare generation numbers
+ * Copy log block to in-place block on-disk if:
+ *   log generation # > in-place generation #
+ *   OR generation #s are ==, but data contained in block is different (corrupt)
  */
 
 static int
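
The replay rule spelled out in the comment above reduces to a small predicate.
Here it is restated on its own, assuming only that the caller has already read
both generation numbers and both block images:

#include <stdint.h>
#include <string.h>

/* Copy the journal's copy over the in-place block if the journal copy is
 * newer, or if the generations tie but the contents disagree (corruption). */
static int ex_should_replay(uint64_t log_gen, const void *log_blk,
			    uint64_t inplace_gen, const void *inplace_blk,
			    size_t blksize)
{
	if (log_gen > inplace_gen)
		return 1;
	if (log_gen == inplace_gen &&
	    memcmp(log_blk, inplace_blk, blksize) != 0)
		return 1;
	return 0;
}
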
diff -ru cvs/cluster/gfs-kernel/src/gfs/lvb.h build_092304/cluster/gfs-kernel/src/gfs/lvb.h
--- cvs/cluster/gfs-kernel/src/gfs/lvb.h	2004-06-24 04:53:28.000000000 -0400
+++ build_092304/cluster/gfs-kernel/src/gfs/lvb.h	2004-09-23 14:19:09.962336192 -0400
@@ -11,26 +11,44 @@
 
************************************************************************
*******
 
************************************************************************
******/
 
+/*
+ * Formats of Lock Value Blocks (LVBs) for various types of locks.
+ * These 32-bit data chunks can be shared quickly between nodes
+ *   via the inter-node lock manager (via LAN instead of on-disk).
+ */
+
 #ifndef __LVB_DOT_H__
 #define __LVB_DOT_H__
 
 #define GFS_MIN_LVB_SIZE (32)
 
+/*
+ * Resource Group block allocation statistics
+ * Each resource group lock contains one of these in its LVB.
+ * Used for sharing approximate current statistics for statfs.
+ * Not used for actual block allocation.
+ */
 struct gfs_rgrp_lvb {
-	uint32_t rb_magic;
-	uint32_t rb_free;
-	uint32_t rb_useddi;
-	uint32_t rb_freedi;
-	uint32_t rb_usedmeta;
-	uint32_t rb_freemeta;
+	uint32_t rb_magic;      /* GFS_MAGIC sanity check value */
+	uint32_t rb_free;       /* # free data blocks */
+	uint32_t rb_useddi;     /* # used dinode blocks */
+	uint32_t rb_freedi;     /* # free dinode blocks */
+	uint32_t rb_usedmeta;   /* # used metadata blocks */
+	uint32_t rb_freemeta;   /* # free metadata blocks */
 };
 
+/*
+ * Quota
+ * Each quota lock contains one of these in its LVB.
+ * Keeps track of block allocation limits and current block allocation
+ *   for either a cluster-wide user or a cluster-wide group.
+ */
 struct gfs_quota_lvb {
-	uint32_t qb_magic;
+	uint32_t qb_magic;      /* GFS_MAGIC sanity check value */
 	uint32_t qb_pad;
-	uint64_t qb_limit;
-	uint64_t qb_warn;
-	int64_t qb_value;
+	uint64_t qb_limit;      /* hard limit of # blocks to alloc */
+	uint64_t qb_warn;       /* warn user when alloc is above this # */
+	int64_t qb_value;       /* current # blocks allocated */
 };
 
 /*  Translation functions  */
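
As the gfs_rgrp_lvb comment above says, these per-rgrp counters exist so that
statfs-style totals can be approximated without disk reads.  The sketch below
shows the idea with a local ex_rgrp_lvb mirror of the struct; how GFS itself
combines and weights the fields may differ.

#include <stdint.h>

struct ex_rgrp_lvb {
	uint32_t rb_magic;
	uint32_t rb_free;                  /* free data blocks */
	uint32_t rb_useddi, rb_freedi;     /* used/free dinode blocks */
	uint32_t rb_usedmeta, rb_freemeta; /* used/free metadata blocks */
};

/* Approximate cluster-wide free space from LVBs alone (shared via the
 * lock manager over the LAN, so no node has to hit the disk). */
uint64_t ex_approx_free_blocks(const struct ex_rgrp_lvb *lvbs, unsigned int n)
{
	uint64_t free = 0;
	unsigned int i;

	for (i = 0; i < n; i++)
		free += (uint64_t)lvbs[i].rb_free +
			lvbs[i].rb_freedi +
			lvbs[i].rb_freemeta;
	return free;
}
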
diff -ru cvs/cluster/gfs-kernel/src/gfs/rgrp.c build_092304/cluster/gfs-kernel/src/gfs/rgrp.c
--- cvs/cluster/gfs-kernel/src/gfs/rgrp.c	2004-06-24 04:53:28.000000000 -0400
+++ build_092304/cluster/gfs-kernel/src/gfs/rgrp.c	2004-09-23 14:18:56.703351864 -0400
@@ -372,6 +372,7 @@
 
 	memset(count, 0, 4 * sizeof(uint32_t));
 
+	/* count # blocks in each of 4 possible allocation states */
 	for (buf = 0; buf < length; buf++) {
 		bits = &rgd->rd_bits[buf];
 		for (x = 0; x < 4; x++)
@@ -531,6 +532,7 @@
  * gfs_compute_bitstructs - Compute the bitmap sizes
  * @rgd: The resource group descriptor
  *
+ * Calculates bitmap descriptors, one for each block that contains bitmap data
  */
 
 static void
@@ -538,7 +540,7 @@
 {
 	struct gfs_sbd *sdp = rgd->rd_sbd;
 	struct gfs_bitmap *bits;
-	uint32_t length = rgd->rd_ri.ri_length;
+	uint32_t length = rgd->rd_ri.ri_length; /* # blocks in hdr & bitmap */
 	uint32_t bytes_left, bytes;
 	int x;
 
@@ -550,21 +552,25 @@
 	for (x = 0; x < length; x++) {
 		bits = &rgd->rd_bits[x];
 
+		/* small rgrp; bitmap stored completely in header block */
 		if (length == 1) {
 			bytes = bytes_left;
 			bits->bi_offset = sizeof(struct gfs_rgrp);
 			bits->bi_start = 0;
 			bits->bi_len = bytes;
+		/* header block */
 		} else if (x == 0) {
 			bytes = sdp->sd_sb.sb_bsize - sizeof(struct gfs_rgrp);
 			bits->bi_offset = sizeof(struct gfs_rgrp);
 			bits->bi_start = 0;
 			bits->bi_len = bytes;
+		/* last block */
 		} else if (x + 1 == length) {
 			bytes = bytes_left;
 			bits->bi_offset = sizeof(struct gfs_meta_header);
 			bits->bi_start = rgd->rd_ri.ri_bitbytes - bytes_left;
 			bits->bi_len = bytes;
+		/* other blocks */
 		} else {
 			bytes = sdp->sd_sb.sb_bsize - sizeof(struct gfs_meta_header);
 			bits->bi_offset = sizeof(struct gfs_meta_header);
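
The four cases annotated in the hunk above can be pulled out into a standalone
sketch.  EX_RGRP_HDR and EX_META_HDR below are made-up stand-ins for
sizeof(struct gfs_rgrp) and sizeof(struct gfs_meta_header), and the field
handling is a simplification of the real function.

#include <stdint.h>

#define EX_RGRP_HDR  128   /* placeholder for sizeof(struct gfs_rgrp) */
#define EX_META_HDR  24    /* placeholder for sizeof(struct gfs_meta_header) */

struct ex_bitmap { uint32_t offset, start, len; };

/* One small rgrp, or: header block, fully-packed middle blocks, and a
 * last block holding whatever bitmap bytes remain. */
void ex_compute_bitstructs(struct ex_bitmap *bits, uint32_t length,
			   uint32_t bsize, uint32_t bitbytes)
{
	uint32_t bytes_left = bitbytes, bytes, x;

	for (x = 0; x < length; x++) {
		if (length == 1) {                /* whole bitmap fits */
			bytes = bytes_left;
			bits[x].offset = EX_RGRP_HDR;
			bits[x].start = 0;
		} else if (x == 0) {              /* rgrp header block */
			bytes = bsize - EX_RGRP_HDR;
			bits[x].offset = EX_RGRP_HDR;
			bits[x].start = 0;
		} else if (x + 1 == length) {     /* last block: the remainder */
			bytes = bytes_left;
			bits[x].offset = EX_META_HDR;
			bits[x].start = bitbytes - bytes_left;
		} else {                          /* middle blocks: full */
			bytes = bsize - EX_META_HDR;
			bits[x].offset = EX_META_HDR;
			bits[x].start = bitbytes - bytes_left;
		}
		bits[x].len = bytes;
		bytes_left -= bytes;
	}
}
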
@@ -855,10 +861,12 @@
  * @rgd: the RG data
  * @al: the struct gfs_alloc structure describing the reservation
  *
- * Sets the $ir_datares field in @res.
- * Sets the $ir_metares field in @res.
+ * If there's room for the requested blocks to be allocated from the RG:
+ *   Sets the $al_reserved_data field in @al.
+ *   Sets the $al_reserved_meta field in @al.
+ *   Sets the $al_rgd field in @al.
  *
- * Returns: 1 on success, 0 on failure
+ * Returns: 1 on success (it fits), 0 on failure (it doesn't fit)
  */
 
 static int
@@ -900,7 +908,7 @@
 }
 
 /**
- * recent_rgrp_first - get first RG from recent list
+ * recent_rgrp_first - get first RG from "recent" list
  * @sdp: The GFS superblock
  * @rglast: address of the rgrp used last
  *
@@ -939,7 +947,7 @@
 }
 
 /**
- * recent_rgrp_next - get next RG from recent list
+ * recent_rgrp_next - get next RG from "recent" list
  * @cur_rgd: current rgrp
  *
  * Returns: The next rgrp in the recent list
@@ -978,7 +986,7 @@
 }
 
 /**
- * recent_rgrp_remove - remove an RG from recent list
+ * recent_rgrp_remove - remove an RG from "recent" list
  * @rgd: The rgrp to remove
  *
  */
@@ -992,9 +1000,14 @@
 }
 
 /**
- * recent_rgrp_add - add an RG to recent list
+ * recent_rgrp_add - add an RG to tail of "recent" list
  * @new_rgd: The rgrp to add
  *
+ * Before adding, make sure that:
+ *   1) it's not already on the list
+ *   2) there's still room for more entries
+ * The capacity limit imposed on the "recent" list is basically a node's "share"
+ *   of rgrps within a cluster, i.e. (total # rgrps) / (# nodes (journals))
  */
 
 static void
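
The capacity rule described in the recent_rgrp_add() comment above is a simple
division; as a sketch (with a guard for a zero journal count, which the real
code may handle elsewhere):

static unsigned int ex_recent_list_capacity(unsigned int total_rgrps,
					    unsigned int journals)
{
	/* a node's "share" of the filesystem's resource groups */
	return journals ? total_rgrps / journals : total_rgrps;
}
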






From phillips at redhat.com  Wed Sep 29 15:01:39 2004
From: phillips at redhat.com (Daniel Phillips)
Date: Wed, 29 Sep 2004 11:01:39 -0400
Subject: [Linux-cluster] FW: [PATCH] More comments for GFS files
In-Reply-To: <0604335B7764D141945E202153105960033E25BD@orsmsx404.amr.corp.intel.com>
References: <0604335B7764D141945E202153105960033E25BD@orsmsx404.amr.corp.intel.com>
Message-ID: <200409291101.39343.phillips@redhat.com>

On Monday 27 September 2004 17:42, Cahill, Ben M wrote:
> Hi all,
>
> Below please find a patch for more comments in some files in
> gfs-kernel/src/gfs:

Hi Ben,

It looks lovely.  Minor nit: how about caps on the sentences?  It makes 
it look more "Linus" and less "Bill Joy".

Regards,

Daniel



From ben.m.cahill at intel.com  Wed Sep 29 17:26:09 2004
From: ben.m.cahill at intel.com (Cahill, Ben M)
Date: Wed, 29 Sep 2004 10:26:09 -0700
Subject: [Linux-cluster] FW: [PATCH] More comments for GFS files
Message-ID: <0604335B7764D141945E202153105960033E25D2@orsmsx404.amr.corp.intel.com>

> -----Original Message-----
> From: Daniel Phillips [mailto:phillips at redhat.com] 
> Sent: Wednesday, September 29, 2004 11:02 AM
> To: linux-cluster at redhat.com
> Cc: Cahill, Ben M
> Subject: Re: [Linux-cluster] FW: [PATCH] More comments for GFS files
> 
> On Monday 27 September 2004 17:42, Cahill, Ben M wrote:
> > Hi all,
> >
> > Below please find a patch for more comments in some files in
> > gfs-kernel/src/gfs:
> 
> Hi Ben,
> 
> It looks lovely.  Minor nit: how about caps on the sentences? 
>  It makes 
> it look more "Linus" and less "Bill Joy".

Thanks for the tip ... I'll do that in the future.  I think I was trying
to follow precedent in the pre-existing code ... it might vary from file
to file, or even within a file.  I hope you guys will go ahead and check
in what I sent already, if that's okay (at least for now).  If not, let
me know, and I'll prepare a new patch.

-- Ben --

> 
> Regards,
> 
> Daniel
> 



From daniel at osdl.org  Wed Sep 29 23:55:41 2004
From: daniel at osdl.org (Daniel McNeil)
Date: Wed, 29 Sep 2004 16:55:41 -0700
Subject: [Linux-cluster] [RFC] Generic Kernel API
In-Reply-To: <20040920122551.GC32420@tykepenguin.com>
References: <20040920122551.GC32420@tykepenguin.com>
Message-ID: <1096502141.3334.67.camel@ibm-c.pdx.osdl.net>

On Mon, 2004-09-20 at 05:25, Patrick Caulfield wrote:
> At the cluster summit most people seedm to agree that we needed a generic,
> pluggable kernel API for cluster functions. Well, I've finally got round to
> doing something.
> 
> The attached spec allows for plug-in cluster modules with the possibility of
> a node being a member of multiple clusters if the cluster managers allow it.
> I've seperated out the functions of "cluster manager" so they can be provided by
> different components if necessary.
> 
> Two things that are not complete (or even started) in here are a communications
> API and a locking API. 
> 
> For the first, I'd like to leave that to those more qualified than me to do and
> for the second I'd like to (less modestly) propose our existing DLM API with the
> argument that it is a full-featured API that others can implement parts of if
> necessary.
> 
> Comments please.

Patrick,

I read over your api and have a few comments.

Simple stuff first.  The membership_node looks very similar to the SAF
interfaces, so I assume the fields mean the same.  mn_member is 32 bits,
but it just specifies whether this node is a member (1) or not (0), right?

The mni_viewnumber is 32 bits; in SAF it is 64 bits.  You might want it to
be 64 bits.  (I think nodeid should be 64 bits, but SAF has it as 32 bits,
so I guess it is OK.)

What is mni_context?

A bit more description of these fields would be nice -- it doesn't have to
be as verbose as SAF :)

In membership_ops, you have start_notify and notify_stop -- might want
to be consistent with the naming (either notify_start or stop_notify).

Now the more complicated stuff:

I think we need more information on how this api works and a description
of how the calls are used.

cm_attach() is used to attach to a particular cluster provider that
has been registered.  Who calls cm_attach()?

I assume whoever calls cm_attach() will then be calling the ops
functions.

What is cmprivate in start_notify?

Once start_notify is called the CM module will call the callback
function whenever there is a change until notify_stop is called?

The membership_callback_routine only has "context" and "reason".
Again, what is context?  What is reason?
How is the data returned?  I'm guessing a struct membership_notify_info
is filled in from the buffer passed in to start_notify.  Is that
right?  A bit more description here would be good.

What is the difference between get_quorate() and get_info() which
returns a struct quorum_info with qi_quorum?

Should get_quorate() and get_info() take a viewnumber so we can match
up the list of members and whether it had quorum?  (It could have changed
after the membership callback but before we call get_quorate.)

Thanks,

Daniel 
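
For readers trying to follow the discussion, here is a header-style guess at
how the pieces named in this thread could hang together.  It is assembled only
from the identifiers mentioned above (cm_attach, start_notify/notify_stop,
get_quorate, get_info, get_votes, mn_member, mni_viewnumber, mni_context,
qi_quorum); every type, field width and signature below is an assumption, not
Patrick's actual proposal.

/* Guesswork sketch -- NOT the real RFC. */
#include <stdint.h>

struct membership_node {
	uint32_t mn_nodeid;
	uint32_t mn_member;         /* 1 = currently a member, 0 = not */
};

struct membership_notify_info {
	uint32_t mni_viewnumber;    /* bumped on every membership change */
	void    *mni_context;       /* caller's cookie, handed back in callbacks */
	/* ...presumably followed by the member list... */
};

struct quorum_info {
	int qi_quorum;              /* plus whatever vote counts feed the decision */
};

typedef void (*membership_callback_routine)(void *context, int reason);

struct membership_ops {
	int (*start_notify)(void *cmprivate, void *buf,
			    membership_callback_routine cb, void *context);
	int (*notify_stop)(void *cmprivate);
	int (*get_quorate)(void *cmprivate);                       /* boolean */
	int (*get_info)(void *cmprivate, struct quorum_info *qi);  /* the details */
	int (*get_votes)(void *cmprivate, uint32_t nodeid);
};

/* cm_attach() would hand back a registered provider's operations. */
struct membership_ops *cm_attach(const char *cluster_name);
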





From pcaulfie at redhat.com  Thu Sep 30 07:10:24 2004
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Thu, 30 Sep 2004 08:10:24 +0100
Subject: [Linux-cluster] [RFC] Generic Kernel API
In-Reply-To: <1096502141.3334.67.camel@ibm-c.pdx.osdl.net>
References: <20040920122551.GC32420@tykepenguin.com>
	<1096502141.3334.67.camel@ibm-c.pdx.osdl.net>
Message-ID: <20040930071023.GA12471@tykepenguin.com>

On Wed, Sep 29, 2004 at 04:55:41PM -0700, Daniel McNeil wrote:
> 
> Patrick,
> 
> I read over your api and have a few comments.
> 
> Simple stuff first.  The membership_node looks very similar to the SAF
> interfaces, so I assume they fields mean the same.  mn_member is 32bits
> but it just specifies if this node is a member (1) or not (0), right?

Yes.
 
> The mni_viewnumber is 32 bits, in SAF it is 64bits.  Might want it to
> be 64bits.  (I think nodeid should be 64bits, but SAF has it as 32bits,
> so I guess it is ok).
> 
> What is mni_context?

It's an opaque structure passed in from the caller that gets passed back via the
callback so that the caller can identify the request (or attach private
information).
 
> I bit more description of these fields would be nice -- don't have to
> be as verbose as SAF :)
> 
> In membership_ops, you have start_notify and notify_stop -- might want
> to be consistent with the naming (either notify_start or stop_notify).

Yes, I fixed that!
 
> Now the more complicated stuff:
> 
> I think we need more information on how this api works and a description
> of how the calls are used.
> 
> cm_attach() is used to attach to a particular cluster provider that
> has been registered.  Who calls cm_attach()?
> 
> I assume whoever calls cm_attach() will then be calling the ops
> functions.
> 
> What is cmprivate in start_notify?
> 
> Once start_notify is called the CM module will call the callback
> function whenever there is a change until notify_stop is called?
> 
> The membership_callback_routine only has "context" and "reason".
> Again, what is context?  What is reason?
> How is the data returned?  I'm guessing a struct membership_notify_info
> is filled in at from the buffer passed in from start_notify,  Is that
> right?  A bit more description here would be good.
> 
> What is the difference between get_quorate() and get_info() which
> returns a struct quorum_info with qi_quorum?

get_quorate returns a boolean value that just says whether the cluster has
quorum or not. get_info returns a struct showing the elements that went into
making that decision. I'm not really sure how much use it is to applications but
I don't like hiding information!
 
> Should get_quorate() and get_info() take a viewnumber so we can match
> up the list of member and whether it had quorum?  (it could have changed
> after the callback with membership before we call get_quorum.)

The problem there is keeping a list of members for each view, which seems like
rather a waste of memory in kernel space.


I'm in the middle of writing an implementation of this (with a cman plugin) that
I'll post shortly. That should clear up any other points that I may seem to have
ignored above!  Some of the things have been fixed in the meantime. I should get
it posted this week.
-- 

patrick
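
In other words (building on the guesswork sketch a few messages up, and still
not the real signatures), get_quorate() is the one-bit summary and get_info()
hands back the inputs to it:

/* Illustrative only; reuses the hypothetical membership_ops/quorum_info
 * declarations from the earlier sketch. */
static void ex_show_quorum(struct membership_ops *ops, void *cm)
{
	struct quorum_info qi;

	if (ops->get_quorate(cm)) {
		/* boolean answer: safe to proceed with cluster-wide work */
	}

	if (ops->get_info(cm, &qi) == 0) {
		/* qi holds the elements behind that same decision, e.g.
		 * qi.qi_quorum plus whatever vote counts the provider exposes */
	}
}
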



From tony at rootseven.com  Thu Sep 30 10:07:52 2004
From: tony at rootseven.com (Anthony Caravello)
Date: Thu, 30 Sep 2004 06:07:52 -0400
Subject: [Linux-cluster] Dell PowerVault 210 Cluster
Message-ID: <415BDAF8.6090304@rootseven.com>

Hello,
I've configured two PowerEdge 2450's each with a Perc 2/DC connected to 
a PowerVault 210s via a pair of SEMM's.  Updated PERC firmware to 
provide Cluster support and enabled this on both, adjusting the 
initiator id on one so they are not the same.  Configured array and 
installed Red Hat Fedora Core 2 on both machines, and mounted the 
external array as an ext3 filesystem successfully on each.

Both machines can mount and write to the external array, but neither can 
see the other's files.  I guess this makes sense, as each machine caches 
its own view of the filesystem's metadata.  It's probably safe to assume 
writes will soon lead to corruption of each other's files.

Is such a cluster possible under Linux?  Do I need to use GFS or another 
filesystem to handle this configuration, or is there another way to 
synchronize these systems?

Thank You



From ecashin at coraid.com  Thu Sep 30 14:25:54 2004
From: ecashin at coraid.com (Ed L Cashin)
Date: Thu, 30 Sep 2004 10:25:54 -0400
Subject: [Linux-cluster] Re: FW: [PATCH] More comments for GFS files
References: <0604335B7764D141945E202153105960033E25BD@orsmsx404.amr.corp.intel.com>
Message-ID: <87vfdvn899.fsf@coraid.com>

Good to see short, apropos comments.  :)

"Cahill, Ben M"  writes:

...
> +	unsigned int gt_scand_secs;       /* find unused glocks and
> inodes */
> +	unsigned int gt_recoverd_secs;    /* recover journal of crashed
> node */

Is your mail software wrapping lines in the patch?  

-- 
  Ed L Cashin 



From nygaard at redhat.com  Thu Sep 30 14:38:43 2004
From: nygaard at redhat.com (Erling Nygaard)
Date: Thu, 30 Sep 2004 09:38:43 -0500
Subject: [Linux-cluster] Dell PowerVault 210 Cluster
In-Reply-To: <415BDAF8.6090304@rootseven.com>;
	from tony@rootseven.com on Thu, Sep 30, 2004 at 06:07:52AM -0400
References: <415BDAF8.6090304@rootseven.com>
Message-ID: <20040930093843.M2126@homer.msp.redhat.com>

Yes, this is what GFS does: it allows you to mount the same shared filesystem 
on multiple nodes and keeps the filesystem synchronized.
Your ext3 filesystem will be completely hosed fairly soon, since there is 
no synchronization between the nodes....


On Thu, Sep 30, 2004 at 06:07:52AM -0400, Anthony Caravello wrote:
> Hello,
> I've configured two PowerEdge 2450's each with a Perc 2/DC connected to 
> a PowerVault 210s via a pair of SEMM's.  Updated PERC firmware to 
> provide Cluster support and enabled this on both, adjusting the 
> initiator id on one so they are not the same.  Configured array and 
> installed Red Hat Fedora Core 2 on both machines, and mounted the 
> external array as an ext3 filesystem successfully on each.
> 
> Both machines can mount and write to the external array, but neither can 
> see the others files.  I guess this makes sense as each machine has it's 
> own inode table for the filesystem.  Probably safe to assume writes will 
> soon lead to corruption of each others files.
> 
> Is such a cluster possible under linux?  Do I need to use GFS or another 
> filesystem to handle this configuration, or is there another way to 
> syncronize these systems?
> 
> Thank You
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster

-- 
Erling Nygaard
nygaard at redhat.com

Red Hat Inc



From ben.m.cahill at intel.com  Thu Sep 30 15:20:25 2004
From: ben.m.cahill at intel.com (Cahill, Ben M)
Date: Thu, 30 Sep 2004 08:20:25 -0700
Subject: [Linux-cluster] Re: FW: [PATCH] More comments for GFS files
Message-ID: <0604335B7764D141945E202153105960033E25DA@orsmsx404.amr.corp.intel.com>


> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ed L Cashin
> Sent: Thursday, September 30, 2004 10:26 AM
> To: linux-cluster at redhat.com
> Subject: [Linux-cluster] Re: FW: [PATCH] More comments for GFS files
> 
> Good to see short, apropos comments.  :)

Thanks.  Let me know if you see anything wrong, BTW.

> 
> "Cahill, Ben M"  writes:
> 
> ...
> > +	unsigned int gt_scand_secs;       /* find unused glocks and
> > inodes */
> > +	unsigned int gt_recoverd_secs;    /* recover journal of crashed
> > node */
> 
> Is your mail software wrapping lines in the patch? 

Arggh, yes ... thanks for catching that.

I just sent a copy to Ken Preslan with the patch as an attachment, so he
can work with it.

I won't clutter the list with another attempt, unless someone wants me
to.  I'll do better next time.

-- Ben --

Opinions are mine, not Intel's

 
> 
> -- 
>   Ed L Cashin 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
> 



From daniel at osdl.org  Thu Sep 30 17:52:15 2004
From: daniel at osdl.org (Daniel McNeil)
Date: Thu, 30 Sep 2004 10:52:15 -0700
Subject: [Linux-cluster] [RFC] Generic Kernel API
In-Reply-To: <20040930071023.GA12471@tykepenguin.com>
References: <20040920122551.GC32420@tykepenguin.com>
	<1096502141.3334.67.camel@ibm-c.pdx.osdl.net>
	<20040930071023.GA12471@tykepenguin.com>
Message-ID: <1096566735.3334.76.camel@ibm-c.pdx.osdl.net>

On Thu, 2004-09-30 at 00:10, Patrick Caulfield wrote:
> On Wed, Sep 29, 2004 at 04:55:41PM -0700, Daniel McNeil wrote:
> > 
> > Patrick,
> > 
> > I read over your api and have a few comments.
> > 
> > Simple stuff first.  The membership_node looks very similar to the SAF
> > interfaces, so I assume they fields mean the same.  mn_member is 32bits
> > but it just specifies if this node is a member (1) or not (0), right?
> 
> Yes.
>  
> > The mni_viewnumber is 32 bits, in SAF it is 64bits.  Might want it to
> > be 64bits.  (I think nodeid should be 64bits, but SAF has it as 32bits,
> > so I guess it is ok).
> > 
> > What is mni_context?
> 
> It's an opaque structure passed in from the caller that gets passed back via the
> callback so that the caller can identify the request (or attach private
> information).
>  
> > I bit more description of these fields would be nice -- don't have to
> > be as verbose as SAF :)
> > 
> > In membership_ops, you have start_notify and notify_stop -- might want
> > to be consistent with the naming (either notify_start or stop_notify).
> 
> Yes, I fixed that!
>  
> > Now the more complicated stuff:
> > 
> > I think we need more information on how this api works and a description
> > of how the calls are used.
> > 
> > cm_attach() is used to attach to a particular cluster provider that
> > has been registered.  Who calls cm_attach()?
> > 
> > I assume whoever calls cm_attach() will then be calling the ops
> > functions.
> > 
> > What is cmprivate in start_notify?
> > 
> > Once start_notify is called the CM module will call the callback
> > function whenever there is a change until notify_stop is called?
> > 
> > The membership_callback_routine only has "context" and "reason".
> > Again, what is context?  What is reason?
> > How is the data returned?  I'm guessing a struct membership_notify_info
> > is filled in at from the buffer passed in from start_notify,  Is that
> > right?  A bit more description here would be good.
> > 
> > What is the difference between get_quorate() and get_info() which
> > returns a struct quorum_info with qi_quorum?
> 
> get_quorate returns a boolean value that just says whether the cluster has
> quorum or not. get_info returns a struct showing the elements that went up to
> making that decision. I'm not really sure how much use it is to applications but
> I don't like hiding information!
>  
> > Should get_quorate() and get_info() take a viewnumber so we can match
> > up the list of member and whether it had quorum?  (it could have changed
> > after the callback with membership before we call get_quorum.)
> 
> The problem there is keeping a list of members for each view, which seems like
> rather a waste of memory in kernel space.

Would it be OK to just keep the info for the last viewnumber only?
If the viewnumber did not match then an error could be returned
for get_quorate.  get_info could return the viewnumber as part
of quorum info.

What does get_votes do if nodeid is NOT currently in the cluster
membership?

> 
> 
> I'm in the middle of writing an implementation of this (with a cman plugin) that
> I'll post shortly. That should clear up any other points that I may seem to have
> ignored above! some of the things have been fixed in the meantime. I should get
> it posted this week.

Good.  Code should clear things up!

Thanks,

Daniel



From ben.m.cahill at intel.com  Thu Sep 30 19:34:42 2004
From: ben.m.cahill at intel.com (Cahill, Ben M)
Date: Thu, 30 Sep 2004 12:34:42 -0700
Subject: [Linux-cluster] [PATCH] install and reference gfs_mount.8 man page
Message-ID: <0604335B7764D141945E202153105960033E25E5@orsmsx404.amr.corp.intel.com>

Hi all,

I hope I've cleared up the line-wrap problem.  Let's find out.

Here's a small patch (inline and attached) to install
the man page that I wrote a short time ago.

-- Ben --

Opinions are mine, not Intel's



diff -Naur cvs/cluster/gfs/man/gfs.8 build_092304/cluster/gfs/man/gfs.8
--- cvs/cluster/gfs/man/gfs.8	2004-06-24 04:53:25.000000000 -0400
+++ build_092304/cluster/gfs/man/gfs.8	2004-09-30 14:38:12.841438968 -0400
@@ -20,12 +20,13 @@
 command/feature you are looking for.
 .sp
  gfs                 GFS overview (this man page)
+ gfs_mount           Mounting a GFS file system
  gfs_fsck            The GFS file system checker
  gfs_grow            Growing a GFS file system
  gfs_jadd            Adding a journal to a GFS file system
  gfs_mkfs            Make a GFS file system
  gfs_quota           Manipulate GFS disk quotas 
- gfs_tool            Tool to manipulate a GFS 
+ gfs_tool            Tool to manipulate a GFS file system
 .sp
 .in
 
diff -Naur cvs/cluster/gfs/man/Makefile build_092304/cluster/gfs/man/Makefile
--- cvs/cluster/gfs/man/Makefile	2004-08-19 12:29:35.000000000 -0400
+++ build_092304/cluster/gfs/man/Makefile	2004-09-30 14:44:39.174707384 -0400
@@ -13,6 +13,7 @@
 
 TARGET8= \
 	gfs.8 \
+	gfs_mount.8 \
 	gfs_fsck.8 \
 	gfs_grow.8 \
 	gfs_jadd.8 \
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gfs_man.patch
Type: application/octet-stream
Size: 1076 bytes
Desc: gfs_man.patch
URL: 

From mnerren at paracel.com  Thu Sep 30 23:01:44 2004
From: mnerren at paracel.com (micah nerren)
Date: Thu, 30 Sep 2004 16:01:44 -0700
Subject: [Linux-cluster] mount is hanging
Message-ID: <1096585303.6682.280.camel@angmar>

Hi,

I have a SAN with 4 file systems on it, each GFS. These are mounted
across various servers running GFS, 3 of which are lock_gulm servers.
This is on RHEL WS 3 with GFS-6.0.0-7.1 on x86_64.

One of the file systems simply will not mount now. The other 3 mount and
unmount fine. They are all part of the same cca. I have my master lock
server running in heavy debug mode but none of the output from
lock_gulmd tells me anything about this one bad pool. How can I figure
out what is going on, any good debug or troubleshooting steps I should
do? I think if I just reboot everything it will settle down, but we
can't do that just yet, as the master lock server happens to be on a
production box right now.

Also, is there a way to migrate a master lock server to a slave lock
server? In other words, can I force the master to become a slave and a
slave to become the new master?

Thanks!