From dist-list at LEXUM.UMontreal.CA Sat Oct 1 21:58:25 2005 From: dist-list at LEXUM.UMontreal.CA (FM) Date: Sat, 01 Oct 2005 17:58:25 -0400 Subject: [Linux-cluster] RHEL4 active/active cluster without GFS Message-ID: <433F0681.4020700@lexum.umontreal.ca> Hello everybody, First post here ... for my first cluster attempt. I do not need GFS I'm trying to install linux-ha (RPMS from http://www.ultramonkey.org/download/heartbeat/). But the installation always fails because of ipvsadmin missing. I read that ipvs is in the kernel, so I check in the default .conf and ipvs is enable. Ho can I install ipvsadmin ? Is there a RedHat way to create cluster without GFS ? What are your advices ? Thanks ! From carlopmart at gmail.com Sun Oct 2 08:35:25 2005 From: carlopmart at gmail.com (carlopmart at gmail.com) Date: Sun, 02 Oct 2005 10:35:25 +0200 Subject: [Linux-cluster] Export directory via gnbd Message-ID: <433F9BCD.2020303@gmail.com> Hi all, Is it possible to export directorys via gnbd?? GNBD docs only descrives how to export files or partitons ... -- CL Martinez carlopmart {at} gmail {d0t} com From Axel.Thimm at ATrpms.net Sun Oct 2 10:23:05 2005 From: Axel.Thimm at ATrpms.net (Axel Thimm) Date: Sun, 2 Oct 2005 12:23:05 +0200 Subject: [Linux-cluster] Re: SMP and GFS In-Reply-To: <04f401c588b6$b31db0a0$5001a8c0@spa.isqsolutions.com> References: <04f401c588b6$b31db0a0$5001a8c0@spa.isqsolutions.com> Message-ID: <20051002102305.GD13944@neu.nirvana> On Thu, Jul 14, 2005 at 04:57:51PM -0400, Manuel Bujan wrote: > Is there any issue I should be aware of if SMP is enabled in > my kernel ? What if I compile my kernel to be pre-emptible ? Any problem with that and GFS ? > > I am running GFS in a dual Xeon server from DELL. > After a lot of time running my GFS setup I got the following error > in one of our cluster servers, and I had to reboot it in order to > restablish the service: > > ################################################################################# > Jul 14 14:19:35 atmail-2 kernel: 2 > Jul 14 14:19:35 atmail-2 kernel: gfs001 (18044) req reply einval ae2c0092 fr 1 r 1 2 > Jul 14 14:19:35 atmail-2 kernel: gfs001 (31381) req reply einval bf9901e7 fr 1 r 1 2 > Jul 14 14:19:35 atmail-2 kernel: gfs001 (2023) req reply einval d6c30333 fr 1 r 1 2 > Jul 14 14:19:35 atmail-2 kernel: gfs001 send einval to 1 > Jul 14 14:19:35 atmail-2 last message repeated 2 times I found similar log sniplets on a RHEL4U1 machine with dual Xeons (HP Proliant). The machine crashed with a kernel panic shortly after telling the other nodes to leave the cluster (sorry the staff was under pressure and noone wrote down the panic's output): Sep 30 05:08:11 zs01 kernel: nval to 1 (P:kernel) Sep 30 05:08:11 zs01 kernel: data send einval to 1 (P:kernel) Sep 30 05:08:11 zs01 kernel: Magma send einval to 1 (P:kernel) Sep 30 05:08:11 zs01 kernel: data send einval to 1 (P:kernel) Sep 30 05:08:11 zs01 kernel: Magma send einval to 1 (P:kernel) Sep 30 05:08:33 zs03 kernel: CMAN: removing node zs02 from the cluster : Missed too many heartbeats (P:kernel) Sep 30 05:08:39 zs03 kernel: CMAN: removing node zs01 from the cluster : No response to messages (P:kernel) Sep 30 05:08:45 zs03 kernel: CMAN: quorum lost, blocking activity (P:kernel) Seeking for the einval messages I found only this post here. So it doesn't seem to happen that often. OTOH it's the same hardware, perhaps dual Xeons are not good for GFS and/or the cluster infrastructure? 
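For anyone chasing the same symptoms on a RHEL4-era cman/dlm stack, it helps to correlate the einval noise with what the membership layer thinks is going on while the load is running. A rough shell sketch follows; only /proc/cluster/dlm_debug and /proc/cluster/config/cman/max_retries are mentioned elsewhere in this thread, so treat the other /proc/cluster paths as assumptions that may differ between releases:

  #!/bin/sh
  # Rough cman/dlm health snapshot for a RHEL4-era node (assumed paths).
  while true; do
      date
      cat /proc/cluster/status   2>/dev/null    # quorum and vote counts (assumed path)
      cat /proc/cluster/nodes    2>/dev/null    # membership as cman sees it (assumed path)
      cat /proc/cluster/services 2>/dev/null    # fence/dlm/gfs service states (assumed path)
      tail -n 20 /proc/cluster/dlm_debug 2>/dev/null   # where the einval messages accumulate
      sleep 10
  done

Logging that alongside the workload would at least show whether the einval bursts line up with heartbeat trouble rather than with the DLM itself.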
In my case kernel and GFS bits are all from Red Hat, no self built components other than a qla2xxx driver, but the issue is on the cluster communication side. -- Axel.Thimm at ATrpms.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From sequel at neofreak.org Sun Oct 2 15:06:07 2005 From: sequel at neofreak.org (DeadManMoving) Date: Sun, 02 Oct 2005 11:06:07 -0400 Subject: [Linux-cluster] Re: SMP and GFS In-Reply-To: <20051002102305.GD13944@neu.nirvana> References: <04f401c588b6$b31db0a0$5001a8c0@spa.isqsolutions.com> <20051002102305.GD13944@neu.nirvana> Message-ID: <1128265567.23136.21.camel@saloon.neofreak.org> I'm running a cluster on two node without GFS (only using clurgmgrd to export nfs share) on IBM x346 servers (Pentium 4 Xeon (Foster) with the smp kernel; 2.6.9-11smp) and, while i do not see those errors in my logs, i do see them in /proc/cluster/dlm_debug : Magma send einval to 1 Magma send einval to 1 Magma send einval to 1 Magma send einval to 1 Magma send einval to 1 Magma (3055) req reply einval 440255 fr 1 r 1 usrm::rg="home_ma Magma (3055) req reply einval 4b0262 fr 1 r 1 usrm::rg="home_ma Magma send einval to 1 Magma (11923) req reply einval 5300f1 fr 1 r 1 usrm::vf Magma send einval to 1 Magma (3055) req reply einval 530338 fr 1 r 1 usrm::vf My cluster is highly instable, just this morning i've realized that the clurgmgrd deamon was dead... Can someone at Red Hat shed some light on this? Thanks, Tony Lapointe. On Sun, 2005-10-02 at 12:23 +0200, Axel Thimm wrote: > On Thu, Jul 14, 2005 at 04:57:51PM -0400, Manuel Bujan wrote: > > Is there any issue I should be aware of if SMP is enabled in > > my kernel ? What if I compile my kernel to be pre-emptible ? Any problem with that and GFS ? > > > > I am running GFS in a dual Xeon server from DELL. > > > After a lot of time running my GFS setup I got the following error > > in one of our cluster servers, and I had to reboot it in order to > > restablish the service: > > > > > ################################################################################# > > Jul 14 14:19:35 atmail-2 kernel: 2 > > Jul 14 14:19:35 atmail-2 kernel: gfs001 (18044) req reply einval ae2c0092 fr 1 r 1 2 > > Jul 14 14:19:35 atmail-2 kernel: gfs001 (31381) req reply einval bf9901e7 fr 1 r 1 2 > > Jul 14 14:19:35 atmail-2 kernel: gfs001 (2023) req reply einval d6c30333 fr 1 r 1 2 > > Jul 14 14:19:35 atmail-2 kernel: gfs001 send einval to 1 > > Jul 14 14:19:35 atmail-2 last message repeated 2 times > > I found similar log sniplets on a RHEL4U1 machine with dual Xeons (HP > Proliant). The machine crashed with a kernel panic shortly after > telling the other nodes to leave the cluster (sorry the staff was > under pressure and noone wrote down the panic's output): > > Sep 30 05:08:11 zs01 kernel: nval to 1 (P:kernel) > Sep 30 05:08:11 zs01 kernel: data send einval to 1 (P:kernel) > Sep 30 05:08:11 zs01 kernel: Magma send einval to 1 (P:kernel) > Sep 30 05:08:11 zs01 kernel: data send einval to 1 (P:kernel) > Sep 30 05:08:11 zs01 kernel: Magma send einval to 1 (P:kernel) > Sep 30 05:08:33 zs03 kernel: CMAN: removing node zs02 from the cluster : Missed too many heartbeats (P:kernel) > Sep 30 05:08:39 zs03 kernel: CMAN: removing node zs01 from the cluster : No response to messages (P:kernel) > Sep 30 05:08:45 zs03 kernel: CMAN: quorum lost, blocking activity (P:kernel) > > Seeking for the einval messages I found only this post here. 
So it > doesn't seem to happen that often. OTOH it's the same hardware, > perhaps dual Xeons are not good for GFS and/or the cluster > infrastructure? > > In my case kernel and GFS bits are all from Red Hat, no self built > components other than a qla2xxx driver, but the issue is on the > cluster communication side. > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From pcaulfie at redhat.com Mon Oct 3 06:59:22 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Mon, 03 Oct 2005 07:59:22 +0100 Subject: [Linux-cluster] Re: SMP and GFS In-Reply-To: <20051002102305.GD13944@neu.nirvana> References: <04f401c588b6$b31db0a0$5001a8c0@spa.isqsolutions.com> <20051002102305.GD13944@neu.nirvana> Message-ID: <4340D6CA.7070105@redhat.com> Axel Thimm wrote: > On Thu, Jul 14, 2005 at 04:57:51PM -0400, Manuel Bujan wrote: > >>Is there any issue I should be aware of if SMP is enabled in >>my kernel ? What if I compile my kernel to be pre-emptible ? Any problem with that and GFS ? >> Pre-emptible kernels will not work with GFS, that's certain. -- patrick From pcaulfie at redhat.com Mon Oct 3 07:10:59 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Mon, 03 Oct 2005 08:10:59 +0100 Subject: [Openais] Re: [Linux-cluster] new userland cman In-Reply-To: <1128109200.8440.14.camel@unnamed.az.mvista.com> References: <433D4134.6080608@redhat.com> <1128109200.8440.14.camel@unnamed.az.mvista.com> Message-ID: <4340D983.7080106@redhat.com> Steven Dake wrote: > Patrick > > Thanks for the work > > I have a few comments inline > > On Fri, 2005-09-30 at 14:44 +0100, Patrick Caulfield wrote: >>- Hard limit to size of cluster (set at compile time to 32 currently)*** >> > > > I hope to have multiring in 2006; then we should scale to hundreds of > processors... Nice :) I have some ideas for shrinking the size of the current packets which will help the current system and lower the ethernet load. I'll start on those shortly. > >>neutral >>------- >>- Always uses multicast (no broadcast). A default multicast address is supplied >>if none is given > > > If broadcast is important, which I guess it may be, we can pretty easily > add this support... > I was going to look into this but I doubt its really worth it. It's just any extra complication and will only apply to IPv4 anyway. >>- libcman is the only API ( a compatible libcman is available for the kernel >>version) >>- Simplified CCS schema, but will read old one if it has nodeids in it.**** >> >>internal >>-------- >>- Usable messaging API >>- Robust membership algorithm >>- Community involvement, multiple developers. >> >> >>* I very much doubt that anyone will notice apart from maybe Dave & me >> >>** Could fix this in AIS, but I'm not sure the patch would be popular upstream. >>It's much more efficient to run them on different ports or multicast addresses >>anyway. Incidentally: DON'T run an encrypted and a non-encrypted cluster on the >>same port & multicast address (not that you would!) - the non-encrypted ones >>will crash. >> > > > On this point, you mention you could fix "this", do you mean having two > clusters use the same port and ips? I have also considered and do want > this by having each "cluster" join a specific group at startup to serve > as the cluster membership view. Unfortunately this would require > process group membership, and the process groups interface is unfinished > (totempg.c) so this isn't possible today. 
Note I'd take a patch from > someone that finished the job on this interface :) I for example, would > like communication for a specific checkpoint to go over a specific named > group, instead of to everyone connected to totem. Then the clm could > join a group and get membership events, the checkpoint service for a > specific checkpoint could join a group, and communicate on that group, > and get membership events for that group etc. > > What did you have in mind here? Actually something /very/ simple. the old cman just had a uint16 in every packet which was a cluster_id. If the cluster_id in an incoming packet didn't match the one read from the config file then the packet was dropped. It's really just a way of simplifying configuration for those using broadcast or a default multicast address. In my more evil moments thought it might be worth hijacking the commented out "filler" in struct message_header :) -- patrick From Axel.Thimm at ATrpms.net Mon Oct 3 08:33:36 2005 From: Axel.Thimm at ATrpms.net (Axel Thimm) Date: Mon, 3 Oct 2005 10:33:36 +0200 Subject: [Linux-cluster] Re: SMP and GFS In-Reply-To: <4340D6CA.7070105@redhat.com> References: <04f401c588b6$b31db0a0$5001a8c0@spa.isqsolutions.com> <20051002102305.GD13944@neu.nirvana> <4340D6CA.7070105@redhat.com> Message-ID: <20051003083336.GC10393@neu.nirvana> On Mon, Oct 03, 2005 at 07:59:22AM +0100, Patrick Caulfield wrote: > Axel Thimm wrote: > > On Thu, Jul 14, 2005 at 04:57:51PM -0400, Manuel Bujan wrote: > > > >>Is there any issue I should be aware of if SMP is enabled in > >>my kernel ? What if I compile my kernel to be pre-emptible ? Any problem with that and GFS ? > >> > > Pre-emptible kernels will not work with GFS, that's certain. My report was on a RHEL4 kernel. -- Axel.Thimm at ATrpms.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From pcaulfie at redhat.com Mon Oct 3 09:31:02 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Mon, 03 Oct 2005 10:31:02 +0100 Subject: [Linux-cluster] Re: SMP and GFS In-Reply-To: <20051003083336.GC10393@neu.nirvana> References: <04f401c588b6$b31db0a0$5001a8c0@spa.isqsolutions.com> <20051002102305.GD13944@neu.nirvana> <4340D6CA.7070105@redhat.com> <20051003083336.GC10393@neu.nirvana> Message-ID: <4340FA56.6090708@redhat.com> Axel Thimm wrote: > On Mon, Oct 03, 2005 at 07:59:22AM +0100, Patrick Caulfield wrote: > >>Axel Thimm wrote: >> >>>On Thu, Jul 14, 2005 at 04:57:51PM -0400, Manuel Bujan wrote: >>> >>> >>>>Is there any issue I should be aware of if SMP is enabled in >>>>my kernel ? What if I compile my kernel to be pre-emptible ? Any problem with that and GFS ? >>>> >> >>Pre-emptible kernels will not work with GFS, that's certain. > > > My report was on a RHEL4 kernel. ...but you did ask about pre-emtible kernels :) The important messages here are these : > Sep 30 05:08:33 zs03 kernel: CMAN: removing node zs02 from the cluster : Missed too many heartbeats (P:kernel) > Sep 30 05:08:39 zs03 kernel: CMAN: removing node zs01 from the cluster : No response to messages (P:kernel) showing that a node has been kicked out of the cluster for not responding quickly enough to messages. 
You could try increasing the value in /proc/cluster/config/cman/max_retries -- patrick From Axel.Thimm at ATrpms.net Mon Oct 3 10:52:06 2005 From: Axel.Thimm at ATrpms.net (Axel Thimm) Date: Mon, 3 Oct 2005 12:52:06 +0200 Subject: [Linux-cluster] Re: SMP and GFS In-Reply-To: <4340FA56.6090708@redhat.com> References: <04f401c588b6$b31db0a0$5001a8c0@spa.isqsolutions.com> <20051002102305.GD13944@neu.nirvana> <4340D6CA.7070105@redhat.com> <20051003083336.GC10393@neu.nirvana> <4340FA56.6090708@redhat.com> Message-ID: <20051003105206.GG10393@neu.nirvana> On Mon, Oct 03, 2005 at 10:31:02AM +0100, Patrick Caulfield wrote: > Axel Thimm wrote: > > On Mon, Oct 03, 2005 at 07:59:22AM +0100, Patrick Caulfield wrote: > > > >>Axel Thimm wrote: > >> > >>>On Thu, Jul 14, 2005 at 04:57:51PM -0400, Manuel Bujan wrote: > >>> > >>> > >>>>Is there any issue I should be aware of if SMP is enabled in > >>>>my kernel ? What if I compile my kernel to be pre-emptible ? Any problem with that and GFS ? > >>>> > >> > >>Pre-emptible kernels will not work with GFS, that's certain. > > > > > > My report was on a RHEL4 kernel. > > > ...but you did ask about pre-emtible kernels :) No, I didn't, that was Manuel Bujan 6 weeks ago. ;) I replied that I saw the same einval messages on a RHEL4 kernel. > The important messages here are these : > > > Sep 30 05:08:33 zs03 kernel: CMAN: removing node zs02 from the cluster : > Missed too many heartbeats (P:kernel) > > Sep 30 05:08:39 zs03 kernel: CMAN: removing node zs01 from the cluster : No > response to messages (P:kernel) > > > showing that a node has been kicked out of the cluster for not responding > quickly enough to messages. You could try increasing the value in > > /proc/cluster/config/cman/max_retries I know, but that doesn't explain the einval messages, or does it? Or formulated differently: the einval messages show that the dual Xeon box had some issues with sockets and its being kicked out could be just a symptom of that. Also the RHEL4 box should not kernel panic (all involved parties have the same config, but only the panicing node has dual Xeons on EM64T, the other two are dual opterons, all run the same smp RHEL4 kernel). At that time the dual xeon was doing a backup on this interface with 25-30 MB/sec. That could explain the delayed/dropped UDP heartbeat packages. Can it explain the "send einval to 1" messages and the kernel panic? -- Axel.Thimm at ATrpms.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From pcaulfie at redhat.com Mon Oct 3 11:02:40 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Mon, 03 Oct 2005 12:02:40 +0100 Subject: [Linux-cluster] Re: SMP and GFS In-Reply-To: <20051003105206.GG10393@neu.nirvana> References: <04f401c588b6$b31db0a0$5001a8c0@spa.isqsolutions.com> <20051002102305.GD13944@neu.nirvana> <4340D6CA.7070105@redhat.com> <20051003083336.GC10393@neu.nirvana> <4340FA56.6090708@redhat.com> <20051003105206.GG10393@neu.nirvana> Message-ID: <43410FD0.5020403@redhat.com> Axel Thimm wrote: > On Mon, Oct 03, 2005 at 10:31:02AM +0100, Patrick Caulfield wrote: > >>Axel Thimm wrote: >> >>>On Mon, Oct 03, 2005 at 07:59:22AM +0100, Patrick Caulfield wrote: >>> >>> >>>>Axel Thimm wrote: >>>> >>>> >>>>>On Thu, Jul 14, 2005 at 04:57:51PM -0400, Manuel Bujan wrote: >>>>> >>>>> >>>>> >>>>>>Is there any issue I should be aware of if SMP is enabled in >>>>>>my kernel ? What if I compile my kernel to be pre-emptible ? 
Any problem with that and GFS ? >>>>>> >>>> >>>>Pre-emptible kernels will not work with GFS, that's certain. >>> >>> >>>My report was on a RHEL4 kernel. >> >> >>...but you did ask about pre-emtible kernels :) > > > No, I didn't, that was Manuel Bujan 6 weeks ago. ;) > > I replied that I saw the same einval messages on a RHEL4 kernel. > > >>The important messages here are these : >> >> >>>Sep 30 05:08:33 zs03 kernel: CMAN: removing node zs02 from the cluster : >> >>Missed too many heartbeats (P:kernel) >> >>>Sep 30 05:08:39 zs03 kernel: CMAN: removing node zs01 from the cluster : No >> >>response to messages (P:kernel) >> >> >>showing that a node has been kicked out of the cluster for not responding >>quickly enough to messages. You could try increasing the value in >> >>/proc/cluster/config/cman/max_retries > > > I know, but that doesn't explain the einval messages, or does it? Or > formulated differently: the einval messages show that the dual Xeon > box had some issues with sockets and its being kicked out could be > just a symptom of that. it probably does explain them. If the node is kicked out of the cluster, the DLM starts return -EINVAL from lock ops (because the lockspace no longer exists). This very often causes the GFS lock_dlm module to oops. The bugzillas are confused about this but it sort-of exists as https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=165160 -- patrick From Axel.Thimm at ATrpms.net Mon Oct 3 11:40:17 2005 From: Axel.Thimm at ATrpms.net (Axel Thimm) Date: Mon, 3 Oct 2005 13:40:17 +0200 Subject: [Linux-cluster] Re: SMP and GFS In-Reply-To: <43410FD0.5020403@redhat.com> References: <04f401c588b6$b31db0a0$5001a8c0@spa.isqsolutions.com> <20051002102305.GD13944@neu.nirvana> <4340D6CA.7070105@redhat.com> <20051003083336.GC10393@neu.nirvana> <4340FA56.6090708@redhat.com> <20051003105206.GG10393@neu.nirvana> <43410FD0.5020403@redhat.com> Message-ID: <20051003114017.GH10393@neu.nirvana> On Mon, Oct 03, 2005 at 12:02:40PM +0100, Patrick Caulfield wrote: > Axel Thimm wrote: > >>showing that a node has been kicked out of the cluster for not responding > >>quickly enough to messages. You could try increasing the value in > >> > >>/proc/cluster/config/cman/max_retries > > > > I know, but that doesn't explain the einval messages, or does it? Or > > formulated differently: the einval messages show that the dual Xeon > > box had some issues with sockets and its being kicked out could be > > just a symptom of that. > > it probably does explain them. If the node is kicked out of the cluster, the DLM > starts return -EINVAL from lock ops (because the lockspace no longer exists). > This very often causes the GFS lock_dlm module to oops. > > > The bugzillas are confused about this but it sort-of exists as > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=165160 Thanks, that bugzilla explains a lot. It's the same situation like Corey's, two nodes were shut down, quorum was lost, and one of the two nodes removed was using the filesystem and was having lock_dlm on it. So it paniced. It all very much makes sense now. The two remaining issues are o why did the network interface blow up twice, and killed the communication between the nodes (and it looks like it once killed all UDP communications permanently including syslog)? We replaced all cabling and switches, next thing is to use a dedicated GBit network only for cman/dlm. That's of course something we need to investigate and should not be an issue with GFS. o why did the filesystem desync across members? 
That may or may not be a consequence of the previous cman/dlm failures and kernel panics, or may be a consequence of the broken networking between the nodes. In both cases while the triggering problem seems to be in the networking between the nodes, filesystem inconsitency should not happen, and reflects some bug in GFS. See also https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=169693 BTW what is "revolver"? Is that a stress test used at RH for GFS? Would it be possible to share this tool? Thanks! -- Axel.Thimm at ATrpms.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From andrewxwang at yahoo.com.tw Mon Oct 3 13:06:14 2005 From: andrewxwang at yahoo.com.tw (Andrew Wang) Date: Mon, 3 Oct 2005 21:06:14 +0800 (CST) Subject: [Linux-cluster] FYI: www.gridengine.info (new site) Message-ID: <20051003130614.16205.qmail@web18004.mail.tpe.yahoo.com> Besides the SGE homepage (http://gridengine.sunsource.net) for HOWTOs, docs, and news, a new site just released: http://www.gridengine.info/ It's written by an SGE user outside of Sun. Andrew. ___________________________________________________ ?????? Yahoo!???????r???? 7.0 beta?????M?W?????????????? http://messenger.yahoo.com.tw/beta.html From eric at bootseg.com Mon Oct 3 15:23:17 2005 From: eric at bootseg.com (Eric Kerin) Date: Mon, 03 Oct 2005 11:23:17 -0400 Subject: [Linux-cluster] Re: rgmanager dieing with no messages [was: Re: SMP and GFS] In-Reply-To: <1128265567.23136.21.camel@saloon.neofreak.org> References: <04f401c588b6$b31db0a0$5001a8c0@spa.isqsolutions.com> <20051002102305.GD13944@neu.nirvana> <1128265567.23136.21.camel@saloon.neofreak.org> Message-ID: <1128352997.3504.9.camel@auh5-0479.corp.jabil.org> On Sun, 2005-10-02 at 11:06 -0400, DeadManMoving wrote: > My cluster is highly instable, just this morning i've realized that > the clurgmgrd deamon was dead... I'm having this same problem on my cluster, I've been planning on enabling core dumps for rgmanager once I find a few minutes to restart the cluster services. With any luck, that will be today. Eric Kerin eric at bootseg.com From mwill at penguincomputing.com Mon Oct 3 15:11:40 2005 From: mwill at penguincomputing.com (Michael Will) Date: Mon, 03 Oct 2005 08:11:40 -0700 Subject: [Linux-cluster] Export directory via gnbd In-Reply-To: <433F9BCD.2020303@gmail.com> References: <433F9BCD.2020303@gmail.com> Message-ID: <43414A2C.2020906@penguincomputing.com> carlopmart at gmail.com wrote: > Hi all, > > Is it possible to export directorys via gnbd?? GNBD docs only > descrives how to export files or partitons ... > There is a difference between block-level and file-level export of storage. block level: Just like iSCSI would, GNBD does export one chunk of data without knowing anything about the structure of what you write to it. The client can use it as if it was a local disk (with some limitations), which means it can read and write blocks of data. It could be mysql writing database data, or it could be the OS writing a filesystem on it. The GNBD server does not know or care. NBD=network block device. Two clients could access the same blockdevice read-write, but you would need a network protocol that negotiates locking and caching so that the two separate clients don't step over each others data. GFS is one for filesystems, mysql-cluster implements the same for a relational database. 
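To make the block-level point concrete, a GNBD export always names a device (or a file used as a disk image), never a directory. A minimal sketch of the two sides, with the flag names recalled from the GFS 6.1 GNBD documentation and therefore to be checked against gnbd_export(8) and gnbd_import(8):

  # On the storage server: run the server daemon and export a partition
  # under a chosen name (flags assumed from the GNBD docs).
  gnbd_serv
  gnbd_export -d /dev/sdb1 -e shared1

  # On each client: load the module and import the server's exports.
  modprobe gnbd
  gnbd_import -i gnbd-server.example.com

  # What arrives is a block device, not a directory tree:
  ls -l /dev/gnbd/shared1

A directory only appears once a filesystem (GFS if several nodes must write, anything at all if only one does) is made on that device and mounted, which is exactly the file-level layer described next.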
file-level: Just like a NAS or an NFS-server, the server exports a filesystem to clients. So instead of requesting read/write of blocks of data, the clients requesting listing directories, locking, reading and writing to files. So to answer your question 'how do I export a directory via gnbd' you might have to reword it with the above clarification. Instead of NBD you can use NFS to export a directory to multiple clients, or GFS if you plan to export from multiple machines to multiple clients. Michael -- Michael Will Penguin Computing Corp. Sales Engineer 415-954-2822 415-954-2899 fx mwill at penguincomputing.com From lhh at redhat.com Mon Oct 3 16:19:05 2005 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 03 Oct 2005 12:19:05 -0400 Subject: [Linux-cluster] RHEL4 active/active cluster without GFS In-Reply-To: <433F0681.4020700@lexum.umontreal.ca> References: <433F0681.4020700@lexum.umontreal.ca> Message-ID: <1128356346.27430.82.camel@ayanami.boston.redhat.com> On Sat, 2005-10-01 at 17:58 -0400, FM wrote: > Hello everybody, > First post here ... for my first cluster attempt. I do not need GFS > > I'm trying to install linux-ha (RPMS from > http://www.ultramonkey.org/download/heartbeat/). But the installation > always fails because of ipvsadmin missing. > > I read that ipvs is in the kernel, so I check in the default .conf and > ipvs is enable. Ho can I install ipvsadmin ? IPVS is indeed in the kernel. ipvsadm is a user-land package which is used to administer the kernel parts of IPVS - you can grab the source RPM here: ftp://ftp.redhat.com/pub/redhat/linux/enterprise/4/en/RHCS/i386/SRPMS/ipvsadm-1.24-6.src.rpm It's needed to control the IPVS director. > Is there a Red Hat way to create cluster without GFS ? Yes, Red Hat Cluster Suite. -- Lon From teigland at redhat.com Mon Oct 3 16:51:51 2005 From: teigland at redhat.com (David Teigland) Date: Mon, 3 Oct 2005 11:51:51 -0500 Subject: [Linux-cluster] Re: SMP and GFS In-Reply-To: <20051002102305.GD13944@neu.nirvana> References: <04f401c588b6$b31db0a0$5001a8c0@spa.isqsolutions.com> <20051002102305.GD13944@neu.nirvana> Message-ID: <20051003165151.GB16574@redhat.com> On Sun, Oct 02, 2005 at 12:23:05PM +0200, Axel Thimm wrote: > On Thu, Jul 14, 2005 at 04:57:51PM -0400, Manuel Bujan wrote: > > Jul 14 14:19:35 atmail-2 kernel: gfs001 (2023) req reply einval > > d6c30333 fr 1 r 1 2 > > Jul 14 14:19:35 atmail-2 kernel: gfs001 send einval to 1 > > Jul 14 14:19:35 atmail-2 last message repeated 2 times > I found similar log sniplets on a RHEL4U1 machine with dual Xeons (HP > Proliant). The machine crashed with a kernel panic shortly after > telling the other nodes to leave the cluster (sorry the staff was > under pressure and noone wrote down the panic's output): > > Sep 30 05:08:11 zs01 kernel: nval to 1 (P:kernel) > Sep 30 05:08:11 zs01 kernel: data send einval to 1 (P:kernel) > Sep 30 05:08:11 zs01 kernel: Magma send einval to 1 (P:kernel) > Sep 30 05:08:11 zs01 kernel: data send einval to 1 (P:kernel) > Sep 30 05:08:11 zs01 kernel: Magma send einval to 1 (P:kernel) These "einval" messages from the dlm are not necessarily bad and are not directly related to the "removing from cluster" messages below. The einval conditions above can legitimately occur during normal operation and the dlm should be able to deal with them. Specifically they mean that: 1. node A is told that the lock master for resource R is node B 2. the last lock is removed from R on B 3. B gives up mastery of R 4. A sends lock request to B 5. 
B doesn't recognize R and returns einval to A 6. A starts over The message "send einval to..." is printed on B in step 5. The message "req reply einval..." is printed on A in step 6. This is an unfortunate situation, but not lethal. That said, a spike in these messages may indicate that something is amiss (and that a "removing from cluster" may be on the way). Or, maybe the gfs load has struck the dlm in a particularly sore way. > Sep 30 05:08:33 zs03 kernel: CMAN: removing node zs02 from the cluster : > Missed too many heartbeats (P:kernel) > Sep 30 05:08:39 zs03 kernel: CMAN: removing node zs01 from the cluster : > No response to messages (P:kernel) After this happens, the dlm will often return an error (like -EINVAL) to lock_dlm. It's not the same thing as above. Lock_dlm will always panic at that point since it can no longer acquire locks for gfs. Dave From lhh at redhat.com Mon Oct 3 17:20:58 2005 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 03 Oct 2005 13:20:58 -0400 Subject: [Openais] Re: [Linux-cluster] new userland cman In-Reply-To: <4340D983.7080106@redhat.com> References: <433D4134.6080608@redhat.com> <1128109200.8440.14.camel@unnamed.az.mvista.com> <4340D983.7080106@redhat.com> Message-ID: <1128360058.27430.99.camel@ayanami.boston.redhat.com> On Mon, 2005-10-03 at 08:10 +0100, Patrick Caulfield wrote: > >>neutral > >>------- > >>- Always uses multicast (no broadcast). A default multicast address is supplied > >>if none is given > > > > > > If broadcast is important, which I guess it may be, we can pretty easily > > add this support... > > > > I was going to look into this but I doubt its really worth it. It's just any > extra complication and will only apply to IPv4 anyway. I think broadcast is quite important, actually - although I also think that it should *not* be the default. Multicast doesn't always work very well (in practice) on existing networks, and works poorly (if at all) over things like crossover ethernet cables and hub-based private networks. You know, the cheap stuff hackers use in their houses to play with cluster software ;) Broadcast is far more likely to work out of the box in the above cases, and isn't hard to implement (... actually, it's easier than multicast). Also, IPv6 isn't what I'd call "mainstream" just yet, so supporting all the hacks we can with IPv4 isn't necessarily a bad thing ;) -- Lon From Axel.Thimm at ATrpms.net Mon Oct 3 17:35:46 2005 From: Axel.Thimm at ATrpms.net (Axel Thimm) Date: Mon, 3 Oct 2005 19:35:46 +0200 Subject: [Linux-cluster] Re: SMP and GFS In-Reply-To: <20051003165151.GB16574@redhat.com> References: <04f401c588b6$b31db0a0$5001a8c0@spa.isqsolutions.com> <20051002102305.GD13944@neu.nirvana> <20051003165151.GB16574@redhat.com> Message-ID: <20051003173546.GT10393@neu.nirvana> On Mon, Oct 03, 2005 at 11:51:51AM -0500, David Teigland wrote: > On Sun, Oct 02, 2005 at 12:23:05PM +0200, Axel Thimm wrote: > > On Thu, Jul 14, 2005 at 04:57:51PM -0400, Manuel Bujan wrote: > > > Jul 14 14:19:35 atmail-2 kernel: gfs001 (2023) req reply einval > > > d6c30333 fr 1 r 1 2 > > > Jul 14 14:19:35 atmail-2 kernel: gfs001 send einval to 1 > > > Jul 14 14:19:35 atmail-2 last message repeated 2 times > > > I found similar log sniplets on a RHEL4U1 machine with dual Xeons (HP > > Proliant). 
The machine crashed with a kernel panic shortly after > > telling the other nodes to leave the cluster (sorry the staff was > > under pressure and noone wrote down the panic's output): > > > > Sep 30 05:08:11 zs01 kernel: nval to 1 (P:kernel) > > Sep 30 05:08:11 zs01 kernel: data send einval to 1 (P:kernel) > > Sep 30 05:08:11 zs01 kernel: Magma send einval to 1 (P:kernel) > > Sep 30 05:08:11 zs01 kernel: data send einval to 1 (P:kernel) > > Sep 30 05:08:11 zs01 kernel: Magma send einval to 1 (P:kernel) > > These "einval" messages from the dlm are not necessarily bad and are not > directly related to the "removing from cluster" messages below. The > einval conditions above can legitimately occur during normal operation and > the dlm should be able to deal with them. Specifically they mean that: > > 1. node A is told that the lock master for resource R is node B > 2. the last lock is removed from R on B > 3. B gives up mastery of R > 4. A sends lock request to B > 5. B doesn't recognize R and returns einval to A > 6. A starts over > > The message "send einval to..." is printed on B in step 5. > The message "req reply einval..." is printed on A in step 6. > > This is an unfortunate situation, but not lethal. That said, a spike in > these messages may indicate that something is amiss (and that a "removing > from cluster" may be on the way). Or, maybe the gfs load has struck the > dlm in a particularly sore way. > > > Sep 30 05:08:33 zs03 kernel: CMAN: removing node zs02 from the cluster : > > Missed too many heartbeats (P:kernel) > > Sep 30 05:08:39 zs03 kernel: CMAN: removing node zs01 from the cluster : > > No response to messages (P:kernel) > > After this happens, the dlm will often return an error (like -EINVAL) to > lock_dlm. It's not the same thing as above. Lock_dlm will always panic > at that point since it can no longer acquire locks for gfs. At the time all of this happened, the three node cluster zs01 to zs03 had only zs01 active with nfs and samba exports (both with neglidgible activity at that time of the day) and a proprietary backup solution (vertias' netbackup). The latter created a network traffic of 25-30 MB/sec of the interface the cluster heartbeat was also running on. The backup was running for a couple of hours already. Can that we the root of evil? Delayed or dropped UDP cman packages? Can the same scanario explain the (silent!) desyncing of GFS later on, after all nodes were rebooted? -- Axel.Thimm at ATrpms.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From joe.fernandez at hp.com Mon Oct 3 20:17:49 2005 From: joe.fernandez at hp.com (Fernandez, Joe (HP Systems)) Date: Tue, 4 Oct 2005 06:17:49 +1000 Subject: [Linux-cluster] Please remove Message-ID: Hi, Could you please remove me off the list, thank you. Regards, Joe Fernandez HP Systems Hewlett-Packard Australia Ph. 61.3 8804 7308 Mob. 61.412 830 066 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From lhh at redhat.com Mon Oct 3 20:21:46 2005 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 03 Oct 2005 16:21:46 -0400 Subject: [Openais] Re: [Linux-cluster] new userland cman In-Reply-To: <1128370379.30850.3.camel@unnamed.az.mvista.com> References: <433D4134.6080608@redhat.com> <1128109200.8440.14.camel@unnamed.az.mvista.com> <4340D983.7080106@redhat.com> <1128360058.27430.99.camel@ayanami.boston.redhat.com> <1128370379.30850.3.camel@unnamed.az.mvista.com> Message-ID: <1128370906.27430.130.camel@ayanami.boston.redhat.com> On Mon, 2005-10-03 at 13:12 -0700, Steven Dake wrote: > > Broadcast is far more likely to work out of the box in the above cases, > > and isn't hard to implement (... actually, it's easier than multicast). > > > > Adding this should just be a few lines of code. I'll see if I can work > out a patch today. Nice. -- Lon From linux4dave at gmail.com Mon Oct 3 21:49:10 2005 From: linux4dave at gmail.com (dave first) Date: Mon, 3 Oct 2005 14:49:10 -0700 Subject: [Linux-cluster] PVFS going Wild Message-ID: <207649d0510031449t52eee8c6je2949869bc4f552@mail.gmail.com> Hey Guys, I just took over a couple of clusters for a sysadmin that left the company. Unfortunately, the hand-off was less than informative. So, I've got an old linux cluster, still well-used, with a PVFS filesystem mounted at /work. I'm new to clustering, and I sure as hell don't know much about it, but I've got a sick puppy here. All points to the PVFS filesystem. lsof: WARNING: can't stat() pvfs file system /work Output information may be incomplete. In /var/log/messages: Oct 3 13:51:34 elvis PAM_pwdb[24431]: (su) session opened for user deb_r by deb(uid=2626) Oct 3 13:51:49 elvis kernel: (./ll_pvfs.c, 361): ll_pvfs_getmeta failed on downcall for 192.168.1.102:300 0/pvfs-meta Oct 3 13:51:49 elvis kernel: (./ll_pvfs.c, 361): ll_pvfs_getmeta failed on downcall for 192.168.1.102:300 0/pvfs-meta/manaa/DFTBNEW Oct 3 14:16:48 elvis kernel: (./ll_pvfs.c, 409): ll_pvfs_statfs failed on downcall for 192.168.1.102:3000 /pvfs-meta Oct 3 14:16:elvis kernel: (./inode.c, 321): pvfs_statfs failed So the Linux elvis 2.2.19-13.beosmp #1 SMP Tue Aug 21 20:04:44 EDT 2001 i686 unknown Red Hat Linux release 6.2 (Zoot) Can't access /work from the master or any nodes, elvis [49#] ls /work ls: /work: Too many open files I ran a script in /usr/bin called pvfs_client_stop.sh - which killed all the pvfs daemons, etc #!/bin/tcsh # Phil Carns # pcarns at hubcap.clemson.edu # # This is an example script for how to get Scyld Beowulf cluster nodes # to mount a PVFS file system. set PVFSD = "/usr/sbin/pvfsd" set PVFSMOD = "pvfs" set PVFS_CLIENT_MOUNT_DIR = "/work" set MOUNT_PVFS = "/sbin/mount.pvfs" # unmount the file system locally and on all of the slave nodes /bin/umount $PVFS_CLIENT_MOUNT_DIR bpsh -pad /bin/umount $PVFS_CLIENT_MOUNT_DIR # kill all of the pvfsd client daemons /usr/bin/killall pvfsd # remove the pvfs module on the local and the slave nodes /sbin/rmmod $PVFSMOD bpsh -pad /sbin/rmmod $PVFSMOD Then I ran pvfs_client_start.sh /work, which seemed to work, except it never exited... #!/bin/tcsh # Phil Carns # pcarns at hubcap.clemson.edu # # This is an example script for how to get Scyld Beowulf cluster nodes # to mount a PVFS file system. 
set PVFSD = "/usr/sbin/pvfsd" set PVFSMOD = "pvfs" set PVFS_CLIENT_MOUNT_DIR = "/work" set MOUNT_PVFS = "/sbin/mount.pvfs" set PVFS_META_DIR = `bpctl -M -a`:$1 if $1 == "" then echo "usage: pvfs_client_start.sh " echo "(Causes every machine in the cluster to mount the PVFS file system)" exit -1 endif # insert the pvfs module on the local and slave nodes /sbin/modprobe $PVFSMOD bpsh -pad /sbin/modprobe $PVFSMOD # start the pvfsd client daemon on the local and slave nodes $PVFSD bpsh -pad $PVFSD # actually mount the file system locally and on all of the slave nodes $MOUNT_PVFS $PVFS_META_DIR $PVFS_CLIENT_MOUNT_DIR bpsh -pad $MOUNT_PVFS $PVFS_META_DIR $PVFS_CLIENT_MOUNT_DIR This seemed to work (well, it restarted daemons and such, but I still can't get into /work and getting resource busy and: mount.pvfs: Device or resource busy mount.pvfs: server 192.168.1.102 alive, but mount failed (invalid metadata directory name?) Comments? Useful ideas? A good joke??? dave -------------- next part -------------- An HTML attachment was scrubbed... URL: From cboudjnah at squiz.net Mon Oct 3 23:27:58 2005 From: cboudjnah at squiz.net (Chmouel Boudjnah) Date: Tue, 04 Oct 2005 09:27:58 +1000 Subject: [Linux-cluster] GFS crash Message-ID: <1128382078.9653.8.camel@paris.squiz.net> Hello, I had a crash on a server using GFS-6.1 with kernel 2.6.9-11.ELsmp, i am using GFS with an AOE SAN drive. I am not sure if the problem is with AOE SAN or with GFS would be great to tell me so i can redirect the bug report to the CORAID people. So i have first in the logs some weird stuff about sataide (i am not sure if the SAN is using that) : Sep 30 17:43:20 srv kernel: e send einval to 2 Sep 30 17:43:20 srv kernel: sataide send einval to 2 Sep 30 17:43:20 srv last message repeated 38 times Sep 30 17:43:20 srv kernel: sataide unlock ff050383 no id Sep 30 17:43:20 srv kernel: 231834 id 0 -1,3 1 Sep 30 17:43:20 srv kernel: 7814 qc 2,59f30e -1,5 id ffbe0378 sts 0 0 Sep 30 17:43:20 srv kernel: 19531 lk 5,59f30e id 0 -1,3 0 Sep 30 17:43:20 srv kernel: 4189 lk 2,2ed6bc id 0 -1,3 10001 Sep 30 17:43:20 srv kernel: 7814 qc 5,231834 -1,3 id 5dc0124 sts 0 0 Sep 30 17:43:20 srv kernel: 7814 qc 5,59f30e -1,3 id 27b00cf sts 0 0 Sep 30 17:43:20 srv kernel: 4189 lk 5,2ed6bc id 0 -1,3 1 Sep 30 17:43:20 srv kernel: 7814 qc 2,2ed6bc -1,3 id 1c0202 sts 0 0 Sep 30 17:43:20 srv kernel: 4189 lk 2,2903b3 id 0 -1,3 10001 Sep 30 17:43:20 srv kernel: 7814 qc 5,2ed6bc -1,3 id 227032a sts 0 0 Sep 30 17:43:20 srv kernel: 4189 lk 5,2903b3 id 0 -1,3 1 Sep 30 17:43:20 srv kernel: 7814 qc 2,2903b3 -1,3 id 23c036d sts 0 0 Sep 30 17:43:20 srv kernel: 4189 lk 2,2ba987 id 0 -1,3 10001 Sep 30 17:43:20 srv kernel: 4189 lk 5,2ba987 id 0 -1,3 1 Sep 30 17:43:20 srv kernel: 7814 qc 2,2ba987 -1,3 id 3ab033c sts 0 0 Sep 30 17:43:20 srv kernel: 7814 qc 5,2903b3 -1,3 id 1c80004 sts 0 0 Sep 30 17:43:20 srv kernel: 4189 lk 2,2ce731 id 0 -1,3 10001 Sep 30 17:43:20 srv kernel: 10052 lk 2,500e75 id 0 -1,5 0 Sep 30 17:43:20 srv kernel: 4189 lk 5,2ce731 id 0 -1,3 1 Sep 30 17:43:20 srv kernel: 7814 qc 5,2ba987 -1,3 id 1f003a sts 0 0 Sep 30 17:43:20 srv kernel: 7814 qc 2,2ce731 -1,3 id ff74033d sts 0 0 Sep 30 17:43:20 srv kernel: 19531 lk 5,500e74 id ffd101bd 3,5 805 Sep 30 17:43:20 srv kernel: 7814 qc 5,500e74 3,5 id ffd101bd sts 0 0 Sep 30 17:43:20 srv kernel: 7814 qc 2,500e75 -1,5 id 1660224 sts 0 0 Sep 30 17:43:20 srv kernel: 10052 lk 5,500e75 id 0 -1,3 0 Sep 30 17:43:20 srv kernel: 7814 qc 5,500e75 -1,3 id 3210323 sts 0 0 Sep 30 17:43:20 srv kernel: 29523 lk 2,217df id 0 -1,3 
10000 Sep 30 17:43:20 srv kernel: 7814 qc 2,217df -1,3 id 5019b sts 0 0 Sep 30 17:43:20 srv kernel: 29523 lk 5,217df id 0 -1,3 0 Sep 30 17:43:21 srv kernel: 7814 qc 5,217df -1,3 id 2ae0267 sts 0 0 Sep 30 17:43:21 srv kernel: 7814 qc 5,2ce731 -1,3 id 7d0232 sts 0 0 Sep 30 17:43:21 srv kernel: 4189 lk 2,263a00 id 0 -1,3 10001 Sep 30 17:43:21 srv kernel: 7814 qc 2,263a00 -1,3 id 12700c3 sts 0 0 Sep 30 17:43:21 srv kernel: 4189 lk 5,263a00 id 0 -1,3 1 Sep 30 17:43:21 srv kernel: 4189 lk 2,2c446d id 0 -1,3 10001 Sep 30 17:43:21 srv kernel: 7814 qc 5,263a00 -1,3 id ffc00230 sts 0 0 Sep 30 17:43:21 srv kernel: 4189 lk 5,2c446d id 0 -1,3 1 Sep 30 17:43:21 srv kernel: 7814 qc 2,2c446d -1,3 id 34903b4 sts 0 0 Sep 30 17:43:21 srv kernel: 4189 lk 2,1e7a15 id 0 -1,3 10001 Sep 30 17:43:21 srv kernel: 7814 qc 5,2c446d -1,3 id fea901a1 sts 0 0 Sep 30 17:43:21 srv kernel: 4189 lk 5,1e7a15 id 0 -1,3 1 and the crash of GFS just after : Sep 30 17:43:22 srv kernel: lock_dlm: Assertion failed on line 353 of file /usr/src/build/574067-i686/BUILD/smp/src/dlm/lock.c Sep 30 17:43:22 srv kernel: lock_dlm: assertion: "!error" Sep 30 17:43:22 srv kernel: lock_dlm: time = 2509316164 Sep 30 17:43:22 srv kernel: sataide: error=-22 num=5,5bf2f1 lkf=801 flags=84 Sep 30 17:43:22 srv kernel: Sep 30 17:43:22 srv kernel: ------------[ cut here ]------------ Sep 30 17:43:22 srv kernel: kernel BUG at /usr/src/build/574067-i686/BUILD/smp/src/dlm/lock.c:353! Sep 30 17:43:22 srv kernel: invalid operand: 0000 [#1] Sep 30 17:43:22 srv kernel: SMP Sep 30 17:43:22 srv kernel: Modules linked in: lock_dlm(U) aoe(U) gfs(U) lock_harness(U) dlm(U) cman(U) md5 ipv6 joydev button battery ac uhci_hcd ehci_hcd e1000 floppy sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod mptscsih mptbase sd_mod scsi_mod Sep 30 17:43:22 srv kernel: CPU: 0 Sep 30 17:43:22 srv kernel: EIP: 0060:[] Not tainted VLI Sep 30 17:43:22 srv kernel: EFLAGS: 00010246 (2.6.9-11.ELsmp) Sep 30 17:43:22 srv kernel: EIP is at do_dlm_unlock+0xaa/0xbf [lock_dlm] Sep 30 17:43:22 srv kernel: eax: 00000001 ebx: ffffffea ecx: f63f5f04 edx: f8b5809e Sep 30 17:43:22 srv kernel: esi: cb3ac080 edi: cb3ac080 ebp: f8b1d000 esp: f63f5f00 Sep 30 17:43:23 srv kernel: ds: 007b es: 007b ss: 0068 Sep 30 17:43:23 srv kernel: Process lock_dlm1 (pid: 7818, threadinfo=f63f5000 task=f75bb0b0) Sep 30 17:43:23 srv kernel: Stack: f8b5809e f8b1d000 00000003 f8b538c0 f8ab24f2 00000001 dcbdb3c0 dcbdb3a4 Sep 30 17:43:23 srv kernel: f8aa8852 f8add0c0 d73b9e80 dcbdb3a4 f8add0c0 cb3ac080 f8aa7d4b dcbdb3a4 Sep 30 17:43:23 srv kernel: 00000001 00000001 f8aa7e02 dcbdb3c0 dcbdb3a4 f8aa99af cb3ac080 f7d50e00 Sep 30 17:43:23 srv kernel: Call Trace: Sep 30 17:43:23 srv kernel: [] lm_dlm_unlock+0x14/0x1c [lock_dlm] Sep 30 17:43:23 srv kernel: [] gfs_lm_unlock+0x2c/0x42 [gfs] Sep 30 17:43:23 srv kernel: [] gfs_glock_drop_th+0xf3/0x12d [gfs] Sep 30 17:43:23 srv kernel: [] rq_demote+0x7f/0x98 [gfs] Sep 30 17:43:23 srv kernel: [] run_queue+0x5a/0xc1 [gfs] Sep 30 17:43:23 srv kernel: [] blocking_cb+0x39/0x7a [gfs] Sep 30 17:43:23 srv kernel: [] process_blocking+0x90/0x93 [lock_dlm] Sep 30 17:43:23 srv kernel: [] dlm_async+0x28b/0x2ff [lock_dlm] Sep 30 17:43:23 srv kernel: [] default_wake_function+0x0/0xc Sep 30 17:43:23 srv kernel: [] default_wake_function+0x0/0xc Sep 30 17:43:23 srv kernel: [] dlm_async+0x0/0x2ff [lock_dlm] Sep 30 17:43:23 srv kernel: [] kthread+0x73/0x9b Sep 30 17:43:23 srv kernel: [] kthread+0x0/0x9b Sep 30 17:43:23 srv kernel: [] kernel_thread_helper+0x5/0xb Sep 30 17:43:23 srv kernel: Code: 76 34 8b 06 
ff 76 2c ff 76 08 ff 76 04 ff 76 0c 53 ff 70 18 68 a9 81 b5 f8 e8 d6 e3 5c c7 83 c4 34 68 9e 80 b5 f8 e8 c9 e3 5c c7 <0f> 0b 61 01 ef 7f b5 f8 68 a0 80 b5 f8 e8 84 db 5c c7 5b 5e c3 Sep 30 17:43:23 srv kernel: <0>Fatal exception: panic in 5 seconds Cheers, Chmouel. -- Chmouel Boudjnah - Squiz.net - http://www.squiz.net From jnewbigin at ict.swin.edu.au Tue Oct 4 04:15:51 2005 From: jnewbigin at ict.swin.edu.au (John Newbigin) Date: Tue, 04 Oct 2005 14:15:51 +1000 Subject: [Linux-cluster] GFS-6.0.2.27 issues Message-ID: <434201F7.3050107@ict.swin.edu.au> Is anyone seeing this with GFS-6.0.2.27 (EL3): ldconfig: /usr/lib/libgulm.so.6 is not a symbolic link The cause is indeed as the message says. /usr/lib/libgulm.so /usr/lib/libgulm.so.6 /usr/lib/libgulm.so.6.0.2 all appear the be the same file, rather than symlinks to the 6.0.2 version file. It all works OK, just makes installing updates print out the error message as each packages runs ldconfig. I also have files in /usr/lib/debug which is not a problem but i wonder if they need to be there. John. -- John Newbigin Computer Systems Officer Faculty of Information and Communication Technologies Swinburne University of Technology Melbourne, Australia http://www.ict.swin.edu.au/staff/jnewbigin From jnewbigin at ict.swin.edu.au Tue Oct 4 04:29:38 2005 From: jnewbigin at ict.swin.edu.au (John Newbigin) Date: Tue, 04 Oct 2005 14:29:38 +1000 Subject: [Linux-cluster] Errata page duplicates Message-ID: <43420532.9060709@ict.swin.edu.au> On the page http://rhn.redhat.com/errata/RHBA-2005-723.html all the files seem to be listed in triplicate. There does not seem to be a report a bug in this errata email address so I figure someone on this list will know what to do. John. -- John Newbigin Computer Systems Officer Faculty of Information and Communication Technologies Swinburne University of Technology Melbourne, Australia http://www.ict.swin.edu.au/staff/jnewbigin From tom-fedora at kofler.eu.org Tue Oct 4 11:52:28 2005 From: tom-fedora at kofler.eu.org (Thomas Kofler) Date: Tue, 4 Oct 2005 13:52:28 +0200 Subject: [Linux-cluster] FC4 kernel-2.6.13-1.1526_FC4 & GFS-kernel package mismatch ? Message-ID: <1128426748.43426cfc13ec5@mail.devcon.cc> Hi, we noticed, that after running "yum update" yesterday on our system, that cman didn't start up any longer. We investigated the problem and it depends on the kernel version, if we boot with the old 1447 - kernel, cman and the related services startup fine. The newest GFS-kernel places the modules under /lib/modules/2.6.12-1.1447_FC4 But the kernel itself kernel-2.6.13-1.1526_FC4 has of course /lib/modules/2.6.13-1.1526_FC4 as its module path. Is it a bug or is there to do something by hand? If not, I would open a bug on bugzilla, but under which section: kernel or GFS-kernel - which package team forget the dependency ? 
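A quick way to confirm this kind of mismatch on a running box, using only the two package names already shown in this report (anything beyond that would be guesswork):

  # Does the kernel we booted actually have GFS modules built for it?
  uname -r
  rpm -q kernel GFS-kernel
  ls /lib/modules/$(uname -r)/kernel/fs/gfs/ 2>/dev/null \
      || echo "no gfs.ko present for the running kernel"
  modinfo gfs 2>/dev/null | grep vermagic    # empty if depmod can't find it

If the listing is empty for the new kernel, the module packages simply have not been rebuilt for it yet, which matches the startup failure described above.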
Thanks for feedback, Thomas kernel-2.6.13-1.1526_FC4 modules: /lib/modules/2.6.13-1.1526_FC4 GFS-kernel-2.6.11.8-20050601.152643.FC4.14 /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs/gfs.ko /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_dlm /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_dlm/lock_dlm.ko /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_gulm /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_gulm/lock_gulm.ko /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_harness /lib/modules/2.6.12- 1.1447_FC4/kernel/fs/gfs_locking/lock_harness/lock_harness.ko /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_nolock /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_nolock/lock_nolock.ko From sgray at bluestarinc.com Tue Oct 4 13:16:43 2005 From: sgray at bluestarinc.com (Sean Gray) Date: Tue, 04 Oct 2005 09:16:43 -0400 Subject: [Linux-cluster] FC4 kernel-2.6.13-1.1526_FC4 & GFS-kernel package mismatch ? In-Reply-To: <1128426748.43426cfc13ec5@mail.devcon.cc> References: <1128426748.43426cfc13ec5@mail.devcon.cc> Message-ID: <1128431803.31466.4286.camel@localhost.localdomain> Thomas, Why not grab the srpms and recompile the rpms? I have some notes on my experience compiling srpms for RHEL4 x86_64 2.6.9-11, they me be of assistance as it was easier said than done. Sean On Tue, 2005-10-04 at 13:52 +0200, Thomas Kofler wrote: > Hi, > > we noticed, that after running "yum update" yesterday on our system, that cman > didn't start up any longer. > > We investigated the problem and it depends on the kernel version, if we boot > with the old 1447 - kernel, cman and the related services startup fine. > > The newest GFS-kernel places the modules under /lib/modules/2.6.12-1.1447_FC4 > > But the kernel itself kernel-2.6.13-1.1526_FC4 has of > course /lib/modules/2.6.13-1.1526_FC4 as its module path. > > Is it a bug or is there to do something by hand? If not, I would open a bug on > bugzilla, but under which section: kernel or GFS-kernel - which package team > forget the dependency ? > > Thanks for feedback, > Thomas > > > kernel-2.6.13-1.1526_FC4 > modules: /lib/modules/2.6.13-1.1526_FC4 > > GFS-kernel-2.6.11.8-20050601.152643.FC4.14 > /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs > /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs/gfs.ko > /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking > /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_dlm > /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_dlm/lock_dlm.ko > /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_gulm > /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_gulm/lock_gulm.ko > /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_harness > /lib/modules/2.6.12- > 1.1447_FC4/kernel/fs/gfs_locking/lock_harness/lock_harness.ko > /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_nolock > /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_nolock/lock_nolock.ko > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > Sean N. Gray Director of Information Technology United Radio Incorporated, DBA BlueStar 24 Spiral Drive Florence, Kentucky 41042 office: 859.371.4423 x263 toll free: 800.371.4423 x263 fax: 859.371.4425 mobile: 513.616.3379 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From tom-fedora at kofler.eu.org Tue Oct 4 13:33:10 2005 From: tom-fedora at kofler.eu.org (Thomas Kofler) Date: Tue, 4 Oct 2005 15:33:10 +0200 Subject: [Linux-cluster] FC4 kernel-2.6.13-1.1526_FC4 & GFS-kernel package mismatch ? In-Reply-To: <1128431803.31466.4286.camel@localhost.localdomain> References: <1128426748.43426cfc13ec5@mail.devcon.cc> <1128431803.31466.4286.camel@localhost.localdomain> Message-ID: <1128432789.43428496039e6@mail.devcon.cc> > Why not grab the srpms and recompile the rpms? In theory no problem, but imagine the default behaviour of a user. yum update the system and yum install the packages. And the cluster/GFS system will fail to start up, so its an annoying bug in my opinion. Regards, Thomas From mailinglists at marvin-lists.freaks.de Tue Oct 4 15:50:29 2005 From: mailinglists at marvin-lists.freaks.de (Christian Niessner) Date: Tue, 04 Oct 2005 17:50:29 +0200 Subject: [Linux-cluster] cluster service development - question Message-ID: <1128441029.25994.125.camel@phanara.ai.arno.vpn> hi, i'm currently developing a cluster service for the cluster release 1.00.00. It's using libmagma for messaging and node membership maintainance, and these parts work really well.. But i also have to maintain a list of all configured nodes. What is the 'best practice' to get this list? (Node id and name would be fine...) It doesn't seem do be possible with libmagma... Thanks, chris From cfeist at redhat.com Tue Oct 4 16:30:17 2005 From: cfeist at redhat.com (Chris Feist) Date: Tue, 04 Oct 2005 11:30:17 -0500 Subject: [Linux-cluster] FC4 kernel-2.6.13-1.1526_FC4 & GFS-kernel package mismatch ? In-Reply-To: <1128426748.43426cfc13ec5@mail.devcon.cc> References: <1128426748.43426cfc13ec5@mail.devcon.cc> Message-ID: <4342AE19.1070306@redhat.com> We should have updated rpms in the -test tree shortly, and then if no problems are reported they'll be moved to the standard tree. Thanks, Chris Thomas Kofler wrote: > Hi, > > we noticed, that after running "yum update" yesterday on our system, that cman > didn't start up any longer. > > We investigated the problem and it depends on the kernel version, if we boot > with the old 1447 - kernel, cman and the related services startup fine. > > The newest GFS-kernel places the modules under /lib/modules/2.6.12-1.1447_FC4 > > But the kernel itself kernel-2.6.13-1.1526_FC4 has of > course /lib/modules/2.6.13-1.1526_FC4 as its module path. > > Is it a bug or is there to do something by hand? If not, I would open a bug on > bugzilla, but under which section: kernel or GFS-kernel - which package team > forget the dependency ? 
> > Thanks for feedback, > Thomas > > > kernel-2.6.13-1.1526_FC4 > modules: /lib/modules/2.6.13-1.1526_FC4 > > GFS-kernel-2.6.11.8-20050601.152643.FC4.14 > /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs > /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs/gfs.ko > /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking > /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_dlm > /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_dlm/lock_dlm.ko > /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_gulm > /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_gulm/lock_gulm.ko > /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_harness > /lib/modules/2.6.12- > 1.1447_FC4/kernel/fs/gfs_locking/lock_harness/lock_harness.ko > /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_nolock > /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_nolock/lock_nolock.ko > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From lhh at redhat.com Tue Oct 4 16:40:54 2005 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 04 Oct 2005 12:40:54 -0400 Subject: [Linux-cluster] cluster service development - question In-Reply-To: <1128441029.25994.125.camel@phanara.ai.arno.vpn> References: <1128441029.25994.125.camel@phanara.ai.arno.vpn> Message-ID: <1128444054.27430.169.camel@ayanami.boston.redhat.com> On Tue, 2005-10-04 at 17:50 +0200, Christian Niessner wrote: > hi, > > i'm currently developing a cluster service for the cluster release > 1.00.00. It's using libmagma for messaging and node membership > maintainance, and these parts work really well.. But i also have to > maintain a list of all configured nodes. > > What is the 'best practice' to get this list? (Node id and name would be > fine...) It doesn't seem do be possible with libmagma... Sure it is. cluster_member_list_t *mlist; mlist = clu_member_list(); -- Lon From lhh at redhat.com Tue Oct 4 16:44:18 2005 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 04 Oct 2005 12:44:18 -0400 Subject: [Linux-cluster] cluster service development - question In-Reply-To: <1128441029.25994.125.camel@phanara.ai.arno.vpn> References: <1128441029.25994.125.camel@phanara.ai.arno.vpn> Message-ID: <1128444258.27430.173.camel@ayanami.boston.redhat.com> On Tue, 2005-10-04 at 17:50 +0200, Christian Niessner wrote: > hi, > > i'm currently developing a cluster service for the cluster release > 1.00.00. It's using libmagma for messaging and node membership > maintainance, and these parts work really well.. But i also have to > maintain a list of all configured nodes. > > What is the 'best practice' to get this list? (Node id and name would be > fine...) It doesn't seem do be possible with libmagma... > > Thanks, > chris Oh - note - the node ID is 64 bits because gulm doesn't really have a notion of "node ID" internally, so we use the local ipv6 network address (lower 8 octets) as the node ID. Just something to be aware of; i.e. don't cast it to an int. 
:) -- Lon From lhh at redhat.com Tue Oct 4 16:50:18 2005 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 04 Oct 2005 12:50:18 -0400 Subject: [Linux-cluster] cluster service development - question In-Reply-To: <1128444054.27430.169.camel@ayanami.boston.redhat.com> References: <1128441029.25994.125.camel@phanara.ai.arno.vpn> <1128444054.27430.169.camel@ayanami.boston.redhat.com> Message-ID: <1128444618.27430.177.camel@ayanami.boston.redhat.com> On Tue, 2005-10-04 at 12:40 -0400, Lon Hohberger wrote: > On Tue, 2005-10-04 at 17:50 +0200, Christian Niessner wrote: > > hi, > > > > i'm currently developing a cluster service for the cluster release > > 1.00.00. It's using libmagma for messaging and node membership > > maintainance, and these parts work really well.. But i also have to > > maintain a list of all configured nodes. > > > > What is the 'best practice' to get this list? (Node id and name would be > > fine...) It doesn't seem do be possible with libmagma... Ugh, I read this wrong. Here you go. It's part of a rewrite of clustat which has "how to get this stuff from ccsd" built right in. I haven't committed this yet, but will soon as part of a larger clustat rewrite. ccs_member_list() returns cluster_member_list_t *. build_member_list does that, but compares it against who is in the configuration (and in this case, who is running rgmanager). It's kind of hackish the way I'm overloading the cm_state field with a bunch of bitflags, but it works. -- Lon -------------- next part -------------- A non-text attachment was scrubbed... Name: ccs-clustat-merge.c Type: text/x-csrc Size: 3379 bytes Desc: not available URL: From mailinglists at marvin-lists.freaks.de Tue Oct 4 16:55:43 2005 From: mailinglists at marvin-lists.freaks.de (Christian Niessner) Date: Tue, 04 Oct 2005 18:55:43 +0200 Subject: [Linux-cluster] cluster service development - question In-Reply-To: <1128444054.27430.169.camel@ayanami.boston.redhat.com> References: <1128441029.25994.125.camel@phanara.ai.arno.vpn> <1128444054.27430.169.camel@ayanami.boston.redhat.com> Message-ID: <1128444943.25994.145.camel@phanara.ai.arno.vpn> Hi Lon, On Tue, 2005-10-04 at 12:40 -0400, Lon Hohberger wrote: > Sure it is. > > cluster_member_list_t *mlist; > > mlist = clu_member_list(); It seems clu_member_list() only returns the nodes that have joined the cluster, not all nodes configured in /etc/cluster/cluster.conf. In my case, i need all nodes. I had a quick look into the haeder files. I only found a clu_member_list(char *group). But even calling with NULL it only reports joined nodes... Or did I do something wrong? ciao, chris From liangs at cse.ohio-state.edu Tue Oct 4 16:58:08 2005 From: liangs at cse.ohio-state.edu (Shuang Liang) Date: Tue, 04 Oct 2005 12:58:08 -0400 Subject: [Linux-cluster] Gnbd with LVM Message-ID: <4342B4A0.5040504@cse.ohio-state.edu> Hi all, Does Gnbd work with logic volume manager in Linux, so that data can stripe across multiple gnbd device on a single GFS filesytem? I am also curious that if it is possible for the 6.1 version of GFS to work without cluster tools? 
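The layout being asked about would look roughly like the sketch below. It is only an illustration: the /dev/gnbd/ device names, the volume group name, and the cluster name are made up, and as the follow-up replies note, the gnbd devices may also need to be added to the filter in /etc/lvm/lvm.conf before LVM will scan them.

  # Stripe one logical volume across two imported GNBD devices
  # (hypothetical device and VG names; standard LVM2 commands).
  pvcreate /dev/gnbd/shared1 /dev/gnbd/shared2
  vgcreate gnbdvg /dev/gnbd/shared1 /dev/gnbd/shared2
  lvcreate -i 2 -I 64 -L 100G -n gfslv gnbdvg     # 2 stripes, 64KB stripe size

  # Clustered GFS on top needs the cluster stack and lock_dlm ...
  gfs_mkfs -p lock_dlm -t mycluster:gfslv -j 3 /dev/gnbdvg/gfslv
  # ... while a single-node filesystem can skip the cluster tools entirely:
  gfs_mkfs -p lock_nolock -j 1 /dev/gnbdvg/gfslv

The cluster name and journal counts are placeholders; the lock_nolock variant is the "without cluster tools" case confirmed further down the thread.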
Thanks Shuang, From teigland at redhat.com Tue Oct 4 17:06:05 2005 From: teigland at redhat.com (David Teigland) Date: Tue, 4 Oct 2005 12:06:05 -0500 Subject: [Linux-cluster] GFS crash In-Reply-To: <1128382078.9653.8.camel@paris.squiz.net> References: <1128382078.9653.8.camel@paris.squiz.net> Message-ID: <20051004170605.GC10135@redhat.com> On Tue, Oct 04, 2005 at 09:27:58AM +1000, Chmouel Boudjnah wrote: > Hello, > > I had a crash on a server using GFS-6.1 with kernel 2.6.9-11.ELsmp, i am > using GFS with an AOE SAN drive. > > I am not sure if the problem is with AOE SAN or with GFS would be great > to tell me so i can redirect the bug report to the CORAID people. > > So i have first in the logs some weird stuff about sataide (i am not > sure if the SAN is using that) : > > Sep 30 17:43:20 srv kernel: e send einval to 2 > Sep 30 17:43:20 srv kernel: sataide send einval to 2 > Sep 30 17:43:20 srv last message repeated 38 times > Sep 30 17:43:20 srv kernel: sataide unlock ff050383 no id The dlm is returning errors for both remote and local lock requests, indicating that it doesn't know about any of the locks being requested. That's often because the dlm was "shut down" by cman when cman lost its connection to the cluster. There are usually log messages from cman, too, saying what has happened. Is AOE using the same network as cman? If so, you might try putting them on two different networks. > Sep 30 17:43:22 srv kernel: lock_dlm: Assertion failed on line 353 of > file /usr/src/build/574067-i686/BUILD/smp/src/dlm/lock.c > Sep 30 17:43:22 srv kernel: lock_dlm: assertion: "!error" > Sep 30 17:43:22 srv kernel: lock_dlm: time = 2509316164 > Sep 30 17:43:22 srv kernel: sataide: error=-22 num=5,5bf2f1 lkf=801 > flags=84 This is the typical assertion failure you get when gfs can't acquire any locks. Dave From jbrassow at redhat.com Tue Oct 4 18:20:42 2005 From: jbrassow at redhat.com (Jonathan E Brassow) Date: Tue, 4 Oct 2005 13:20:42 -0500 Subject: [Linux-cluster] Gnbd with LVM In-Reply-To: <4342B4A0.5040504@cse.ohio-state.edu> References: <4342B4A0.5040504@cse.ohio-state.edu> Message-ID: <42045367e86f9fe3320a7837428e8d80@redhat.com> On Oct 4, 2005, at 11:58 AM, Shuang Liang wrote: > Hi all, > Does Gnbd work with logic volume manager in Linux, so that data can > stripe across multiple gnbd device on a single GFS filesytem? I have gotten it to work. You may need to add the gnbd devices to your filter (in /etc/lvm/lvm.conf). > I am also curious that if it is possible for the 6.1 version of GFS > to work without cluster tools? > GFS will work as a local file system if you mkfs with the '-p lock_nolock' option. brassow From pcaulfie at redhat.com Wed Oct 5 07:00:09 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Wed, 05 Oct 2005 08:00:09 +0100 Subject: [Linux-cluster] cluster service development - question In-Reply-To: <1128444618.27430.177.camel@ayanami.boston.redhat.com> References: <1128441029.25994.125.camel@phanara.ai.arno.vpn> <1128444054.27430.169.camel@ayanami.boston.redhat.com> <1128444618.27430.177.camel@ayanami.boston.redhat.com> Message-ID: <434379F9.2080209@redhat.com> Lon Hohberger wrote: > On Tue, 2005-10-04 at 12:40 -0400, Lon Hohberger wrote: > >>On Tue, 2005-10-04 at 17:50 +0200, Christian Niessner wrote: >> >>>hi, >>> >>>i'm currently developing a cluster service for the cluster release >>>1.00.00. It's using libmagma for messaging and node membership >>>maintainance, and these parts work really well.. But i also have to >>>maintain a list of all configured nodes. 
>>> >>>What is the 'best practice' to get this list? (Node id and name would be >>>fine...) It doesn't seem do be possible with libmagma... > > > Ugh, I read this wrong. > > Here you go. It's part of a rewrite of clustat which has "how to get > this stuff from ccsd" built right in. I haven't committed this yet, but > will soon as part of a larger clustat rewrite. > > ccs_member_list() returns cluster_member_list_t *. > > build_member_list does that, but compares it against who is in the > configuration (and in this case, who is running rgmanager). It's kind > of hackish the way I'm overloading the cm_state field with a bunch of > bitflags, but it works. > Just to point out that the next cman version (the userland daemon on head of CVS) will behave as you want - ie requesting the members list will retrieve all the nodes known to CCS. -- patrick From cfeist at redhat.com Wed Oct 5 20:54:29 2005 From: cfeist at redhat.com (Chris Feist) Date: Wed, 05 Oct 2005 15:54:29 -0500 Subject: [Linux-cluster] FC4 kernel-2.6.13-1.1526_FC4 & GFS-kernel package mismatch ? In-Reply-To: <1128426748.43426cfc13ec5@mail.devcon.cc> References: <1128426748.43426cfc13ec5@mail.devcon.cc> Message-ID: <43443D85.8050703@redhat.com> GFS/CS updated kernel rpms are available from fedora-test. ftp://download.fedora.redhat.com/pub/fedora/linux/core/updates/testing/4/i386/ Thanks, Chris Thomas Kofler wrote: > Hi, > > we noticed, that after running "yum update" yesterday on our system, that cman > didn't start up any longer. > > We investigated the problem and it depends on the kernel version, if we boot > with the old 1447 - kernel, cman and the related services startup fine. > > The newest GFS-kernel places the modules under /lib/modules/2.6.12-1.1447_FC4 > > But the kernel itself kernel-2.6.13-1.1526_FC4 has of > course /lib/modules/2.6.13-1.1526_FC4 as its module path. > > Is it a bug or is there to do something by hand? If not, I would open a bug on > bugzilla, but under which section: kernel or GFS-kernel - which package team > forget the dependency ? 
> > Thanks for feedback, > Thomas > > > kernel-2.6.13-1.1526_FC4 > modules: /lib/modules/2.6.13-1.1526_FC4 > > GFS-kernel-2.6.11.8-20050601.152643.FC4.14 > /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs > /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs/gfs.ko > /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking > /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_dlm > /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_dlm/lock_dlm.ko > /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_gulm > /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_gulm/lock_gulm.ko > /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_harness > /lib/modules/2.6.12- > 1.1447_FC4/kernel/fs/gfs_locking/lock_harness/lock_harness.ko > /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_nolock > /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_nolock/lock_nolock.ko > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From lhh at redhat.com Wed Oct 5 20:59:16 2005 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 05 Oct 2005 16:59:16 -0400 Subject: [Linux-cluster] cluster service development - question In-Reply-To: <434379F9.2080209@redhat.com> References: <1128441029.25994.125.camel@phanara.ai.arno.vpn> <1128444054.27430.169.camel@ayanami.boston.redhat.com> <1128444618.27430.177.camel@ayanami.boston.redhat.com> <434379F9.2080209@redhat.com> Message-ID: <1128545956.27430.249.camel@ayanami.boston.redhat.com> On Wed, 2005-10-05 at 08:00 +0100, Patrick Caulfield wrote: > > Here you go. It's part of a rewrite of clustat which has "how to get > > this stuff from ccsd" built right in. I haven't committed this yet, but > > will soon as part of a larger clustat rewrite. > > > > ccs_member_list() returns cluster_member_list_t *. > > > > build_member_list does that, but compares it against who is in the > > configuration (and in this case, who is running rgmanager). It's kind > > of hackish the way I'm overloading the cm_state field with a bunch of > > bitflags, but it works. > > > > Just to point out that the next cman version (the userland daemon on head of > CVS) will behave as you want - ie requesting the members list will retrieve all > the nodes known to CCS. > Nice =) -- Lon From lhh at redhat.com Wed Oct 5 21:08:22 2005 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 05 Oct 2005 17:08:22 -0400 Subject: [Linux-cluster] Re: rgmanager dieing with no messages [was: Re: SMP and GFS] In-Reply-To: <1128352997.3504.9.camel@auh5-0479.corp.jabil.org> References: <04f401c588b6$b31db0a0$5001a8c0@spa.isqsolutions.com> <20051002102305.GD13944@neu.nirvana> <1128265567.23136.21.camel@saloon.neofreak.org> <1128352997.3504.9.camel@auh5-0479.corp.jabil.org> Message-ID: <1128546502.27430.252.camel@ayanami.boston.redhat.com> On Mon, 2005-10-03 at 11:23 -0400, Eric Kerin wrote: > On Sun, 2005-10-02 at 11:06 -0400, DeadManMoving wrote: > > My cluster is highly instable, just this morning i've realized that > > the clurgmgrd deamon was dead... > > I'm having this same problem on my cluster, I've been planning on > enabling core dumps for rgmanager once I find a few minutes to restart > the cluster services. With any luck, that will be today. If you see anything, let me know. There's a segfault I'm trying to track down which this is... 
I haven't been able to reproduce it internally :( From jnewbigin at ict.swin.edu.au Thu Oct 6 02:01:45 2005 From: jnewbigin at ict.swin.edu.au (John Newbigin) Date: Thu, 06 Oct 2005 12:01:45 +1000 Subject: [Linux-cluster] GFS-6.0.2.27 issues In-Reply-To: <434201F7.3050107@ict.swin.edu.au> References: <434201F7.3050107@ict.swin.edu.au> Message-ID: <43448589.6050608@ict.swin.edu.au> FYI Bugzilla #169967 John Newbigin wrote: > Is anyone seeing this with GFS-6.0.2.27 (EL3): > ldconfig: /usr/lib/libgulm.so.6 is not a symbolic link > > The cause is indeed as the message says. > /usr/lib/libgulm.so /usr/lib/libgulm.so.6 /usr/lib/libgulm.so.6.0.2 > all appear the be the same file, rather than symlinks to the 6.0.2 > version file. > > It all works OK, just makes installing updates print out the error > message as each packages runs ldconfig. > > I also have files in /usr/lib/debug which is not a problem but i wonder > if they need to be there. > > John. > -- John Newbigin Computer Systems Officer Faculty of Information and Communication Technologies Swinburne University of Technology Melbourne, Australia http://www.ict.swin.edu.au/staff/jnewbigin From phung at cs.columbia.edu Thu Oct 6 20:37:21 2005 From: phung at cs.columbia.edu (Dan B. Phung) Date: Thu, 6 Oct 2005 16:37:21 -0400 (EDT) Subject: [Linux-cluster] new cluster created when new node joins Message-ID: I have an existing cluster: blade04: # cman_tool nodes Node Votes Exp Sts Name 1 1 1 X blade01 4 1 1 M blade04 11 1 1 M blade11 then blade06 joins the cluster, but instead of joining the existing cluster, it creates a new one: blade06: # cman_tool nodes Node Votes Exp Sts Name 6 1 1 M blade06 Both machines are using Protocol version: 5.0.1 How can I further debug why this is happening? thanks, dan From jbrassow at redhat.com Thu Oct 6 21:59:31 2005 From: jbrassow at redhat.com (Jonathan E Brassow) Date: Thu, 6 Oct 2005 16:59:31 -0500 Subject: [Linux-cluster] new cluster created when new node joins In-Reply-To: References: Message-ID: <98b567529b272baeb6be7f90371aa324@redhat.com> do they have multiple clusters set up in their environment? Does the /etc/cluster/cluster.xml file match the others? brassow On Oct 6, 2005, at 3:37 PM, Dan B. Phung wrote: > I have an existing cluster: > > blade04: # cman_tool nodes > Node Votes Exp Sts Name > 1 1 1 X blade01 > 4 1 1 M blade04 > 11 1 1 M blade11 > > then blade06 joins the cluster, but instead of joining the existing > cluster, it creates a new one: > > blade06: # cman_tool nodes > Node Votes Exp Sts Name > 6 1 1 M blade06 > > Both machines are using Protocol version: 5.0.1 > > How can I further debug why this is happening? > > thanks, > dan > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From phung at cs.columbia.edu Thu Oct 6 22:06:48 2005 From: phung at cs.columbia.edu (Dan B. Phung) Date: Thu, 6 Oct 2005 18:06:48 -0400 (EDT) Subject: [Linux-cluster] new cluster created when new node joins In-Reply-To: <98b567529b272baeb6be7f90371aa324@redhat.com> Message-ID: There is another cluster that is running orthogonal of this cluster, but that's not defined in this cluster.xml. The cluster.xml is the same for both these machines. On 6, Oct, 2005, Jonathan E Brassow declared: > do they have multiple clusters set up in their environment? Does the > /etc/cluster/cluster.xml file match the others? > > brassow > > On Oct 6, 2005, at 3:37 PM, Dan B. 
Phung wrote: > > > I have an existing cluster: > > > > blade04: # cman_tool nodes > > Node Votes Exp Sts Name > > 1 1 1 X blade01 > > 4 1 1 M blade04 > > 11 1 1 M blade11 > > > > then blade06 joins the cluster, but instead of joining the existing > > cluster, it creates a new one: > > > > blade06: # cman_tool nodes > > Node Votes Exp Sts Name > > 6 1 1 M blade06 > > > > Both machines are using Protocol version: 5.0.1 > > > > How can I further debug why this is happening? > > > > thanks, > > dan > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- email: phung at cs.columbia.edu www: http://www.cs.columbia.edu/~phung phone: 646-775-6090 fax: 212-666-0140 office: CS Dept. 520, 1214 Amsterdam Ave., MC 0401, New York, NY 10027 From jbrassow at redhat.com Thu Oct 6 22:11:41 2005 From: jbrassow at redhat.com (Jonathan E Brassow) Date: Thu, 6 Oct 2005 17:11:41 -0500 Subject: [Linux-cluster] new cluster created when new node joins In-Reply-To: References: Message-ID: <9eebfcf8d2c537ddab93d7d0c4f7c7c8@redhat.com> Anything in /var/log/messages? On Oct 6, 2005, at 5:06 PM, Dan B. Phung wrote: > There is another cluster that is running orthogonal of this cluster, > but > that's not defined in this cluster.xml. The cluster.xml is the same > for both these machines. > > > On 6, Oct, 2005, Jonathan E Brassow declared: > >> do they have multiple clusters set up in their environment? Does the >> /etc/cluster/cluster.xml file match the others? >> >> brassow >> >> On Oct 6, 2005, at 3:37 PM, Dan B. Phung wrote: >> >>> I have an existing cluster: >>> >>> blade04: # cman_tool nodes >>> Node Votes Exp Sts Name >>> 1 1 1 X blade01 >>> 4 1 1 M blade04 >>> 11 1 1 M blade11 >>> >>> then blade06 joins the cluster, but instead of joining the existing >>> cluster, it creates a new one: >>> >>> blade06: # cman_tool nodes >>> Node Votes Exp Sts Name >>> 6 1 1 M blade06 >>> >>> Both machines are using Protocol version: 5.0.1 >>> >>> How can I further debug why this is happening? >>> >>> thanks, >>> dan >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -- > email: phung at cs.columbia.edu > www: http://www.cs.columbia.edu/~phung > phone: 646-775-6090 > fax: 212-666-0140 > office: CS Dept. 520, 1214 Amsterdam Ave., MC 0401, New York, NY 10027 > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From phung at cs.columbia.edu Thu Oct 6 22:17:13 2005 From: phung at cs.columbia.edu (Dan B. Phung) Date: Thu, 6 Oct 2005 18:17:13 -0400 (EDT) Subject: [Linux-cluster] new cluster created when new node joins In-Reply-To: <9eebfcf8d2c537ddab93d7d0c4f7c7c8@redhat.com> Message-ID: on the existing blades in the cluster there's nothing in the logs. on the entering blade, we get the "normal" Oct 6 16:28:29 blade06 kernel: CMAN: Waiting to join or form a Linux-cluster Oct 6 16:29:01 blade06 kernel: CMAN: forming a new cluster Oct 6 16:29:01 blade06 kernel: CMAN: quorum regained, resuming activity Oct 6 16:31:20 blade06 kernel: CMAN: we are leaving the cluster. 
Oct 6 16:31:42 blade06 kernel: CMAN: Waiting to join or form a Linux-cluster Oct 6 16:32:14 blade06 kernel: CMAN: forming a new cluster Oct 6 16:32:14 blade06 kernel: CMAN: quorum regained, resuming activity ...so it seems like the messages aren't getting sent/received on the mutlicast network. I guess I'll try sniffing the network to see if the messages are out there. -dan On 6, Oct, 2005, Jonathan E Brassow declared: > Anything in /var/log/messages? > > On Oct 6, 2005, at 5:06 PM, Dan B. Phung wrote: > > > There is another cluster that is running orthogonal of this cluster, > > but > > that's not defined in this cluster.xml. The cluster.xml is the same > > for both these machines. > > > > > > On 6, Oct, 2005, Jonathan E Brassow declared: > > > >> do they have multiple clusters set up in their environment? Does the > >> /etc/cluster/cluster.xml file match the others? > >> > >> brassow > >> > >> On Oct 6, 2005, at 3:37 PM, Dan B. Phung wrote: > >> > >>> I have an existing cluster: > >>> > >>> blade04: # cman_tool nodes > >>> Node Votes Exp Sts Name > >>> 1 1 1 X blade01 > >>> 4 1 1 M blade04 > >>> 11 1 1 M blade11 > >>> > >>> then blade06 joins the cluster, but instead of joining the existing > >>> cluster, it creates a new one: > >>> > >>> blade06: # cman_tool nodes > >>> Node Votes Exp Sts Name > >>> 6 1 1 M blade06 > >>> > >>> Both machines are using Protocol version: 5.0.1 > >>> > >>> How can I further debug why this is happening? > >>> > >>> thanks, > >>> dan > >>> > >>> -- > >>> Linux-cluster mailing list > >>> Linux-cluster at redhat.com > >>> https://www.redhat.com/mailman/listinfo/linux-cluster > >>> > >> > >> -- > >> Linux-cluster mailing list > >> Linux-cluster at redhat.com > >> https://www.redhat.com/mailman/listinfo/linux-cluster > >> > > > > -- > > email: phung at cs.columbia.edu > > www: http://www.cs.columbia.edu/~phung > > phone: 646-775-6090 > > fax: 212-666-0140 > > office: CS Dept. 520, 1214 Amsterdam Ave., MC 0401, New York, NY 10027 > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- email: phung at cs.columbia.edu www: http://www.cs.columbia.edu/~phung phone: 646-775-6090 fax: 212-666-0140 office: CS Dept. 520, 1214 Amsterdam Ave., MC 0401, New York, NY 10027 From hernando.garcia at gmail.com Fri Oct 7 10:09:02 2005 From: hernando.garcia at gmail.com (Hernando Garcia) Date: Fri, 07 Oct 2005 11:09:02 +0100 Subject: [Linux-cluster] Please remove In-Reply-To: References: Message-ID: <1128679742.4350.0.camel@hgarcia.surrey.redhat.com> You can DIY from here ;) https://www.redhat.com/mailman/listinfo/linux-cluster On Tue, 2005-10-04 at 06:17 +1000, Fernandez, Joe (HP Systems) wrote: > Hi, > > Could you please remove me off the list, thank you. > > > Regards, > > Joe Fernandez > > HP Systems > Hewlett-Packard Australia > Ph. 61.3 8804 7308 > Mob. 61.412 830 066 > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From Axel.Thimm at ATrpms.net Fri Oct 7 10:51:04 2005 From: Axel.Thimm at ATrpms.net (Axel Thimm) Date: Fri, 7 Oct 2005 12:51:04 +0200 Subject: [Linux-cluster] RHCS/GFS for RHELU2 (was: FC4 kernel-2.6.13-1.1526_FC4 & GFS-kernel package mismatch ?) 
In-Reply-To: <43443D85.8050703@redhat.com> References: <1128426748.43426cfc13ec5@mail.devcon.cc> <43443D85.8050703@redhat.com> Message-ID: <20051007105104.GA14283@neu.nirvana> The CS/GFS isos under RHELU2 are still for RHELU1, and the rhn channels also only have kernel modules for RHELU1's kernel. Should I bugzilla this? Thanks! On Wed, Oct 05, 2005 at 03:54:29PM -0500, Chris Feist wrote: > GFS/CS updated kernel rpms are available from fedora-test. > > ftp://download.fedora.redhat.com/pub/fedora/linux/core/updates/testing/4/i386/ > > Thanks, > Chris > > Thomas Kofler wrote: > >Hi, > > > >we noticed, that after running "yum update" yesterday on our system, that > >cman didn't start up any longer. > > > >We investigated the problem and it depends on the kernel version, if we > >boot with the old 1447 - kernel, cman and the related services startup > >fine. > > > >The newest GFS-kernel places the modules under > >/lib/modules/2.6.12-1.1447_FC4 > > > >But the kernel itself kernel-2.6.13-1.1526_FC4 has of > >course /lib/modules/2.6.13-1.1526_FC4 as its module path. > > > >Is it a bug or is there to do something by hand? If not, I would open a > >bug on bugzilla, but under which section: kernel or GFS-kernel - which > >package team forget the dependency ? > > > >Thanks for feedback, > >Thomas > > > > > >kernel-2.6.13-1.1526_FC4 > >modules: /lib/modules/2.6.13-1.1526_FC4 > > > >GFS-kernel-2.6.11.8-20050601.152643.FC4.14 > >/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs > >/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs/gfs.ko > >/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking > >/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_dlm > >/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_dlm/lock_dlm.ko > >/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_gulm > >/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_gulm/lock_gulm.ko > >/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_harness > >/lib/modules/2.6.12- > >1.1447_FC4/kernel/fs/gfs_locking/lock_harness/lock_harness.ko > >/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_nolock > >/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_nolock/lock_nolock.ko > > > > > > > -- Axel.Thimm at ATrpms.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From Axel.Thimm at ATrpms.net Fri Oct 7 12:02:04 2005 From: Axel.Thimm at ATrpms.net (Axel Thimm) Date: Fri, 7 Oct 2005 14:02:04 +0200 Subject: [Linux-cluster] Re: RHCS/GFS for RHELU2 (was: FC4 kernel-2.6.13-1.1526_FC4 & GFS-kernel package mismatch ?) In-Reply-To: <20051007105104.GA14283@neu.nirvana> References: <1128426748.43426cfc13ec5@mail.devcon.cc> <43443D85.8050703@redhat.com> <20051007105104.GA14283@neu.nirvana> Message-ID: <20051007120204.GA20566@neu.nirvana> On Fri, Oct 07, 2005 at 12:51:04PM +0200, Axel Thimm wrote: > The CS/GFS isos under RHELU2 are still for RHELU1, and the rhn > channels also only have kernel modules for RHELU1's kernel. OK, looks like they are still in the beta channel. -- Axel.Thimm at ATrpms.net -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From cfeist at redhat.com Fri Oct 7 14:17:39 2005 From: cfeist at redhat.com (Chris Feist) Date: Fri, 07 Oct 2005 09:17:39 -0500 Subject: [Linux-cluster] Re: RHCS/GFS for RHELU2 In-Reply-To: <20051007105104.GA14283@neu.nirvana> References: <1128426748.43426cfc13ec5@mail.devcon.cc> <43443D85.8050703@redhat.com> <20051007105104.GA14283@neu.nirvana> Message-ID: <43468383.3050407@redhat.com> Axel, Normally it takes a day or two for CS/GFS isos to be released on RHN after RHEL is released. The rpms have been updated and the isos should be appearing shortly. Thanks, Chris Axel Thimm wrote: > The CS/GFS isos under RHELU2 are still for RHELU1, and the rhn > channels also only have kernel modules for RHELU1's kernel. > > Should I bugzilla this? > > Thanks! > > On Wed, Oct 05, 2005 at 03:54:29PM -0500, Chris Feist wrote: > >>GFS/CS updated kernel rpms are available from fedora-test. >> >>ftp://download.fedora.redhat.com/pub/fedora/linux/core/updates/testing/4/i386/ >> >>Thanks, >>Chris >> >>Thomas Kofler wrote: >> >>>Hi, >>> >>>we noticed, that after running "yum update" yesterday on our system, that >>>cman didn't start up any longer. >>> >>>We investigated the problem and it depends on the kernel version, if we >>>boot with the old 1447 - kernel, cman and the related services startup >>>fine. >>> >>>The newest GFS-kernel places the modules under >>>/lib/modules/2.6.12-1.1447_FC4 >>> >>>But the kernel itself kernel-2.6.13-1.1526_FC4 has of >>>course /lib/modules/2.6.13-1.1526_FC4 as its module path. >>> >>>Is it a bug or is there to do something by hand? If not, I would open a >>>bug on bugzilla, but under which section: kernel or GFS-kernel - which >>>package team forget the dependency ? >>> >>>Thanks for feedback, >>>Thomas >>> >>> >>>kernel-2.6.13-1.1526_FC4 >>>modules: /lib/modules/2.6.13-1.1526_FC4 >>> >>>GFS-kernel-2.6.11.8-20050601.152643.FC4.14 >>>/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs >>>/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs/gfs.ko >>>/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking >>>/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_dlm >>>/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_dlm/lock_dlm.ko >>>/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_gulm >>>/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_gulm/lock_gulm.ko >>>/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_harness >>>/lib/modules/2.6.12- >>>1.1447_FC4/kernel/fs/gfs_locking/lock_harness/lock_harness.ko >>>/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_nolock >>>/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_nolock/lock_nolock.ko >>> >>> >>> >> > From colman at codagenomics.com Fri Oct 7 14:35:03 2005 From: colman at codagenomics.com (Richard Colman) Date: Fri, 7 Oct 2005 07:35:03 -0700 Subject: [Linux-cluster] Job Posting - Southern CA Message-ID: <200510071436.j97Ea4EN021329@mx3.redhat.com> CODA Genomics in Irvine, CA is expanding and now would like to hire an absolutely top-notch systems analyst/administrator/programmer for design, development and administration of systems and software for real-time, distributed parallel processing on Linux clusters for both genomics research and commercial production of synthetic genes. We primarily use RED HAT and Debian software. Please respond by email to jobs at codagenomics.com to obtain a detailed job description . No telephone calls please. Thank You. 
Richard Colman From baesso at ksolutions.it Mon Oct 10 07:22:07 2005 From: baesso at ksolutions.it (Baesso Mirko) Date: Mon, 10 Oct 2005 09:22:07 +0200 Subject: [Linux-cluster] Setup Fence_wti on cluster RHES4 U1 Message-ID: Hi i've setup a Redhat cluster with two node and i try to test failover using power switch (WTI-NPS230) but seem doesn't work If I look for error messages I see that fence_wti waits for a command to execute. I setup cluster.conf using system-config-cluster and there is no section regarding command to execute Could you please let me now how to setup correctly Thanks Baesso Mirko - System Engineer KSolutions.S.p.A. Via Lenin 132/26 56017 S.Martino Ulmiano (PI) - Italy tel.+ 39 0 50 898369 fax. + 39 0 50 861200 baesso at ksolutions.it http//www.ksolutions.it -------------- next part -------------- An HTML attachment was scrubbed... URL: From herta.vandeneynde at cc.kuleuven.be Mon Oct 10 14:29:50 2005 From: herta.vandeneynde at cc.kuleuven.be (Herta Van den Eynde) Date: Mon, 10 Oct 2005 16:29:50 +0200 Subject: [Linux-cluster] umount failed - device is busy Message-ID: <434A7ADE.108@cc.kuleuven.be> environment: - Red Hat AS 3 (kernel-smp-2.4.21-37.EL - custom built to probe all LUNs on each SCSI device) - clumanager 1.2.28 The cluster consists of 2 members running three services which simply nfs export a number of directories to five other systems. The cluster has been operational since February. Following the latest upgrade (from kernel-smp-2.4.21-32.0.1.EL custom built and clumanager-1.2.26.1-1), all services are running on one member. When I try to locate the services, the operation fails, and the following message pops up: A Problem has occurred while changing ownership of this service. Please check logs for details. The cluster log reports the following: ==== begin log extract Member arnebd trying to relocate lepustl to nihald...Oct 10 16:08:06 arnebd clusvcmgrd: [13627]: service notice: Stopping service lepustl ... Oct 10 16:08:06 arnebd clurmtabd[26429]: Signal 15 received; exiting Oct 10 16:08:12 arnebd clusvcmgrd: [13627]: service error: 'umount /dev/sdb2' failed (/usr/local/lepus-tl), error=1 Oct 10 16:08:12 arnebd clusvcmgrd: [13627]: service error: umount: /usr/local/lepus-tl: device is busy Oct 10 16:08:12 arnebd clusvcmgrd: [13627]: service error: umount: /usr/local/lepus-tl: device is busy Oct 10 16:08:12 arnebd clusvcmgrd: [13627]: service error: Cannot stop filesystems for lepustl Oct 10 16:08:12 arnebd clusvcmgrd[13626]: Starting stopped service lepustl Oct 10 16:08:12 arnebd clusvcmgrd: [14083]: service notice: Starting service lepustl ... Oct 10 16:08:12 arnebd clurmtabd[14194]: Log level is now 7 Oct 10 16:08:12 arnebd clurmtabd[14194]: Polling interval is now 4 seconds failed Oct 10 16:08:12 arnebd clusvcmgrd: [14083]: service notice: Started service lepustl ... Oct 10 16:08:14 arnebd clurmtabd[6533]: Detected modified /var/lib/nfs/rmtab Oct 10 16:08:14 arnebd clurmtabd[9655]: Detected modified /var/lib/nfs/rmtab ==== end log extract FWIIW, no one was logged in but me, and my current directory was not on this filesystem. Neither fuser nor lsof returned any process using the filesystem. I figured the clurmtabd process may be locking it, so I did verify that there is only one clurmtab process for that filesystem. Any ideas/suggestions? Kind regards, Herta -- Herta Van den Eynde -=- Toledo system management K.U. Leuven - Ludit -=- phone: +32 (0)16 322 166 -=- 50?51'27" N 004?40'39" E "I wish I were two little cats. Then I could play together." 
Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm From herta.vandeneynde at cc.kuleuven.be Mon Oct 10 15:59:34 2005 From: herta.vandeneynde at cc.kuleuven.be (Herta Van den Eynde) Date: Mon, 10 Oct 2005 17:59:34 +0200 Subject: [Linux-cluster] umount failed - device is busy In-Reply-To: <434A7ADE.108@cc.kuleuven.be> References: <434A7ADE.108@cc.kuleuven.be> Message-ID: <434A8FE6.40508@cc.kuleuven.be> Further investigation suggests that locking may have something to do with this. On the system that currently runs the services, I find these lock files in four -rwx------ 1 root root 0 Oct 8 03:30 lock.0 -rwx------ 1 root root 0 Oct 8 03:30 lock.1 -rwx------ 1 root root 0 Oct 8 03:30 lock.116 -rwx------ 1 root root 0 Oct 8 03:30 lock.2 -rw-r--r-- 1 root root 0 Oct 8 03:31 service.0 -rw-r--r-- 1 root root 0 Oct 10 16:08 service.1 -rw-r--r-- 1 root root 0 Oct 8 03:30 service.2 On the now idel cluster member, I have these lock files: -rwx------ 1 root root 0 Oct 8 03:30 lock.0 -rwx------ 1 root root 0 Oct 8 03:30 lock.1 -rwx------ 1 root root 0 Oct 8 03:30 lock.116 -rwx------ 1 root root 0 Oct 8 03:30 lock.2 The four lock.n files strike me as odd since I only have three services. Also, should the lock files even be there on the idle cluster member? Could anyone running a similar cluster please post the content of the /var/lock/clumanager/ of the different members along with the the number of services currently running on that member? Kind regards, Herta Herta Van den Eynde wrote: > environment: > - Red Hat AS 3 (kernel-smp-2.4.21-37.EL - custom built to probe all LUNs > on each SCSI device) > - clumanager 1.2.28 > > The cluster consists of 2 members running three services which simply > nfs export a number of directories to five other systems. > The cluster has been operational since February. > > Following the latest upgrade (from kernel-smp-2.4.21-32.0.1.EL custom > built and clumanager-1.2.26.1-1), all services are running on one > member. When I try to locate the services, the operation fails, and the > following message pops up: > > A Problem has occurred while changing ownership > of this service. Please check logs for details. > > The cluster log reports the following: > > ==== begin log extract > Member arnebd trying to relocate lepustl to nihald...Oct 10 16:08:06 > arnebd clusvcmgrd: [13627]: service notice: Stopping service > lepustl ... > Oct 10 16:08:06 arnebd clurmtabd[26429]: Signal 15 received; > exiting > Oct 10 16:08:12 arnebd clusvcmgrd: [13627]: service error: 'umount > /dev/sdb2' failed (/usr/local/lepus-tl), error=1 > Oct 10 16:08:12 arnebd clusvcmgrd: [13627]: service error: umount: > /usr/local/lepus-tl: device is busy > Oct 10 16:08:12 arnebd clusvcmgrd: [13627]: service error: umount: > /usr/local/lepus-tl: device is busy > Oct 10 16:08:12 arnebd clusvcmgrd: [13627]: service error: Cannot > stop filesystems for lepustl > Oct 10 16:08:12 arnebd clusvcmgrd[13626]: Starting stopped > service lepustl > Oct 10 16:08:12 arnebd clusvcmgrd: [14083]: service notice: > Starting service lepustl ... > Oct 10 16:08:12 arnebd clurmtabd[14194]: Log level is now 7 > Oct 10 16:08:12 arnebd clurmtabd[14194]: Polling interval is now > 4 seconds > failed > Oct 10 16:08:12 arnebd clusvcmgrd: [14083]: service notice: > Started service lepustl ... 
> Oct 10 16:08:14 arnebd clurmtabd[6533]: Detected modified > /var/lib/nfs/rmtab > Oct 10 16:08:14 arnebd clurmtabd[9655]: Detected modified > /var/lib/nfs/rmtab > ==== end log extract > > FWIIW, no one was logged in but me, and my current directory was not on > this filesystem. > Neither fuser nor lsof returned any process using the filesystem. > I figured the clurmtabd process may be locking it, so I did verify that > there is only one clurmtab process for that filesystem. > > Any ideas/suggestions? > > Kind regards, > > Herta > -- Herta Van den Eynde -=- Toledo system management K.U. Leuven - Ludit -=- phone: +32 (0)16 322 166 -=- 50?51'27" N 004?40'39" E "I wish I were two little cats. Then I could play together." Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm From lhh at redhat.com Mon Oct 10 17:02:02 2005 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 10 Oct 2005 13:02:02 -0400 Subject: [Linux-cluster] umount failed - device is busy In-Reply-To: <434A8FE6.40508@cc.kuleuven.be> References: <434A7ADE.108@cc.kuleuven.be> <434A8FE6.40508@cc.kuleuven.be> Message-ID: <1128963722.4680.21.camel@ayanami.boston.redhat.com> On Mon, 2005-10-10 at 17:59 +0200, Herta Van den Eynde wrote: > Further investigation suggests that locking may have something to do > with this. > On the system that currently runs the services, I find these lock files > in four > -rwx------ 1 root root 0 Oct 8 03:30 lock.0 > -rwx------ 1 root root 0 Oct 8 03:30 lock.1 > -rwx------ 1 root root 0 Oct 8 03:30 lock.116 > -rwx------ 1 root root 0 Oct 8 03:30 lock.2 > -rw-r--r-- 1 root root 0 Oct 8 03:31 service.0 > -rw-r--r-- 1 root root 0 Oct 10 16:08 service.1 > -rw-r--r-- 1 root root 0 Oct 8 03:30 service.2 > On the now idel cluster member, I have these lock files: > -rwx------ 1 root root 0 Oct 8 03:30 lock.0 > -rwx------ 1 root root 0 Oct 8 03:30 lock.1 > -rwx------ 1 root root 0 Oct 8 03:30 lock.116 > -rwx------ 1 root root 0 Oct 8 03:30 lock.2 Lock files aren't removed. > The four lock.n files strike me as odd since I only have three services. One is the configuration lock. > Also, should the lock files even be there on the idle cluster member? Yes. Did you try enabling force unmount in the device/file system configuration? -- Lon From jbrassow at redhat.com Mon Oct 10 20:03:40 2005 From: jbrassow at redhat.com (Jonathan E Brassow) Date: Mon, 10 Oct 2005 15:03:40 -0500 Subject: [Linux-cluster] Setup Fence_wti on cluster RHES4 U1 In-Reply-To: References: Message-ID: I don't have the gui in front of me, but there should be a manage fencing button or something... If you've never specified which port that a machine is connected to on the WTI, then you haven't gotten far enough in the set-up. brassow On Oct 10, 2005, at 2:22 AM, Baesso Mirko wrote: > Hi > > i?ve setup a Redhat cluster with two node and i try to test failover > using power switch (WTI-NPS230) but seem doesn?t work > > If I look for error messages I see that fence_wti waits for a command > to execute. > > I setup cluster.conf using system-config-cluster and there is no > section regarding command to execute > > Could you please let me now how to setup correctly > > Thanks > > Baesso Mirko - System Engineer > > KSolutions.S.p.A. > > Via Lenin 132/26 > > 56017? S.Martino Ulmiano (PI) - Italy > > tel.+ 39 0 50 898369 fax. + 39 0 50 861200 > > baesso at ksolutions.it?? 
http//www.ksolutions.it > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: text/enriched Size: 2095 bytes Desc: not available URL: From herta.vandeneynde at cc.kuleuven.be Mon Oct 10 20:06:54 2005 From: herta.vandeneynde at cc.kuleuven.be (Herta Van den Eynde) Date: Mon, 10 Oct 2005 22:06:54 +0200 Subject: [Linux-cluster] umount failed - device is busy In-Reply-To: <1128963722.4680.21.camel@ayanami.boston.redhat.com> References: <434A7ADE.108@cc.kuleuven.be> <434A8FE6.40508@cc.kuleuven.be> <1128963722.4680.21.camel@ayanami.boston.redhat.com> Message-ID: <434AC9DE.50606@cc.kuleuven.be> Lon Hohberger wrote: > On Mon, 2005-10-10 at 17:59 +0200, Herta Van den Eynde wrote: > >>Further investigation suggests that locking may have something to do >>with this. >>On the system that currently runs the services, I find these lock files >>in four >>-rwx------ 1 root root 0 Oct 8 03:30 lock.0 >>-rwx------ 1 root root 0 Oct 8 03:30 lock.1 >>-rwx------ 1 root root 0 Oct 8 03:30 lock.116 >>-rwx------ 1 root root 0 Oct 8 03:30 lock.2 >>-rw-r--r-- 1 root root 0 Oct 8 03:31 service.0 >>-rw-r--r-- 1 root root 0 Oct 10 16:08 service.1 >>-rw-r--r-- 1 root root 0 Oct 8 03:30 service.2 > > >>On the now idel cluster member, I have these lock files: >>-rwx------ 1 root root 0 Oct 8 03:30 lock.0 >>-rwx------ 1 root root 0 Oct 8 03:30 lock.1 >>-rwx------ 1 root root 0 Oct 8 03:30 lock.116 >>-rwx------ 1 root root 0 Oct 8 03:30 lock.2 > > > Lock files aren't removed. > > >>The four lock.n files strike me as odd since I only have three services. > > > One is the configuration lock. > > >> Also, should the lock files even be there on the idle cluster member? > > > Yes. > > Did you try enabling force unmount in the device/file system > configuration? > > -- Lon Thanks for the explanation, Lon. Yes, the devices are configured for "Force Unmount". With the device unmounted on all of the nfs clients I even tried to 'umount -f' manually, but I got the same result. Kind regards, Herta Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm From lhh at redhat.com Mon Oct 10 21:02:26 2005 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 10 Oct 2005 17:02:26 -0400 Subject: [Linux-cluster] umount failed - device is busy In-Reply-To: <434AC9DE.50606@cc.kuleuven.be> References: <434A7ADE.108@cc.kuleuven.be> <434A8FE6.40508@cc.kuleuven.be> <1128963722.4680.21.camel@ayanami.boston.redhat.com> <434AC9DE.50606@cc.kuleuven.be> Message-ID: <1128978146.4680.37.camel@ayanami.boston.redhat.com> On Mon, 2005-10-10 at 22:06 +0200, Herta Van den Eynde wrote: > > Did you try enabling force unmount in the device/file system > > configuration? > > > > -- Lon > > Thanks for the explanation, Lon. Yes, the devices are configured for > "Force Unmount". > With the device unmounted on all of the nfs clients I even tried to > 'umount -f' manually, but I got the same result. Odd. Well, "umount -f" actually doesn't do what most people think it does. The "force unmount" option looks for and kills any user-land process holding a reference on the file system using "kill -9". So, if you're getting EBUSY on unmount even though force-unmount is working (confirmed by you looking at lsof/fuser), chances are good that there's a kernel reference on the file system. It could be something NFS related - try "service nfs stop" and see if you can umount the file system. 
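For reference, a rough way to run through those checks by hand on the node that owns the service (mount point taken from the earlier log extract; adjust to taste):

    # userland references - this is all that "force unmount" can clean up
    fuser -vm /usr/local/lepus-tl
    lsof +D /usr/local/lepus-tl

    # if nothing shows up, the reference is likely held in the kernel (nfsd);
    # stopping nfs briefly should release it
    service nfs stop
    umount /usr/local/lepus-tl
    service nfs start

This is only a diagnostic sketch, not something clumanager does for you, and on a live cluster it obviously interrupts NFS service while nfs is stopped.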
-- Lon From herta.vandeneynde at cc.kuleuven.be Mon Oct 10 21:22:20 2005 From: herta.vandeneynde at cc.kuleuven.be (Herta Van den Eynde) Date: Mon, 10 Oct 2005 23:22:20 +0200 Subject: [Linux-cluster] umount failed - device is busy In-Reply-To: <1128978146.4680.37.camel@ayanami.boston.redhat.com> References: <434A7ADE.108@cc.kuleuven.be> <434A8FE6.40508@cc.kuleuven.be> <1128963722.4680.21.camel@ayanami.boston.redhat.com> <434AC9DE.50606@cc.kuleuven.be> <1128978146.4680.37.camel@ayanami.boston.redhat.com> Message-ID: <434ADB8C.9010508@cc.kuleuven.be> Lon Hohberger wrote: > On Mon, 2005-10-10 at 22:06 +0200, Herta Van den Eynde wrote: > > >>>Did you try enabling force unmount in the device/file system >>>configuration? >>> >>>-- Lon >> >>Thanks for the explanation, Lon. Yes, the devices are configured for >>"Force Unmount". >>With the device unmounted on all of the nfs clients I even tried to >>'umount -f' manually, but I got the same result. > > > Odd. Well, "umount -f" actually doesn't do what most people think it > does. > > The "force unmount" option looks for and kills any user-land process > holding a reference on the file system using "kill -9". > > So, if you're getting EBUSY on unmount even though force-unmount is > working (confirmed by you looking at lsof/fuser), chances are good that > there's a kernel reference on the file system. > > It could be something NFS related - try "service nfs stop" and see if > you can umount the file system. > > -- Lon > Unfortunately, this is a production cluster which serves well over 100,000 users (e-learning environment for our university, a dozen associated colleges, and a few hundred K-12 institutions) and I only have 4 hour maintenance windows on the 7th of each month, so stopping all of nfs is not an option today. :-( One of the cluster services is used for admin purposes, and that's the only one I can currently use (within limits) to test suggestions. FWIIW, I don't think the force unmount works. True, lsof/fuser don't report processes against the filesystem, but "df" and "mount" show that it's still there, and I can write to and read from it after I try a "umount -f". Kind regards, Herta Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm From phung at cs.columbia.edu Mon Oct 10 22:43:21 2005 From: phung at cs.columbia.edu (Dan B. Phung) Date: Mon, 10 Oct 2005 18:43:21 -0400 (EDT) Subject: [Linux-cluster] which CVS version for GFS with 2.6.11? Message-ID: Can someone advise me as to which tag I should use to checkout the latest stable snapshot for GFS with 2.6.11? e.g. cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout -rYOUR_TAG? cluster thanks, dan From herta.vandeneynde at cc.kuleuven.be Tue Oct 11 10:01:22 2005 From: herta.vandeneynde at cc.kuleuven.be (Herta Van den Eynde) Date: Tue, 11 Oct 2005 12:01:22 +0200 Subject: [Linux-cluster] umount failed - device is busy In-Reply-To: <434ADB8C.9010508@cc.kuleuven.be> References: <434A7ADE.108@cc.kuleuven.be> <434A8FE6.40508@cc.kuleuven.be> <1128963722.4680.21.camel@ayanami.boston.redhat.com> <434AC9DE.50606@cc.kuleuven.be> <1128978146.4680.37.camel@ayanami.boston.redhat.com> <434ADB8C.9010508@cc.kuleuven.be> Message-ID: <434B8D72.3080006@cc.kuleuven.be> Next attempt at understanding what is going on. According to the documentation, "the clurmtabd daemon synchronizes NFS mount entries in /var/lib/nfs/rmtab with a private copy on a service's mount point." Assuming the private copy is the one in .clumanager/rmtab, shouldn't that file contain data? 
(They are empty for all three filesystems.) Kind regards, Herta Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm From lhh at redhat.com Tue Oct 11 14:40:30 2005 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 11 Oct 2005 10:40:30 -0400 Subject: [Linux-cluster] which CVS version for GFS with 2.6.11? In-Reply-To: References: Message-ID: <1129041630.4680.58.camel@ayanami.boston.redhat.com> On Mon, 2005-10-10 at 18:43 -0400, Dan B. Phung wrote: > Can someone advise me as to which tag I should use to checkout > the latest stable snapshot for GFS with 2.6.11? > > e.g. > cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout -rYOUR_TAG? cluster Possibly FC4, but it's antiquated. Note that it was for the FC4 2.6.11 kernel; other kernels may or may not work. FWIW, the -STABLE branch tracks the latest upstream kernel. -- Lon From lhh at redhat.com Tue Oct 11 15:06:37 2005 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 11 Oct 2005 11:06:37 -0400 Subject: [Linux-cluster] umount failed - device is busy In-Reply-To: <434ADB8C.9010508@cc.kuleuven.be> References: <434A7ADE.108@cc.kuleuven.be> <434A8FE6.40508@cc.kuleuven.be> <1128963722.4680.21.camel@ayanami.boston.redhat.com> <434AC9DE.50606@cc.kuleuven.be> <1128978146.4680.37.camel@ayanami.boston.redhat.com> <434ADB8C.9010508@cc.kuleuven.be> Message-ID: <1129043197.4680.85.camel@ayanami.boston.redhat.com> On Mon, 2005-10-10 at 23:22 +0200, Herta Van den Eynde wrote: > > Odd. Well, "umount -f" actually doesn't do what most people think it > > does. > FWIIW, I don't think the force unmount works. True, lsof/fuser don't > report processes against the filesystem, but "df" and "mount" show that > it's still there, and I can write to and read from it after I try a > "umount -f". (/me waves his hand and says, "This is not the force unmount you are looking for...") * "umount -f" only works for NFS-mounted file systems, and only then in certain cases. If there is pending I/O (e.g. processes in disk wait), it will not work (I may be wrong about this one, but I think this is the case). In any case, it does not currently do anything for local file systems like ext3, jfs, reiserfs, xfs, etc... If there are open references on any local file system, the umount fails with -EBUSY, regardless of whether or not "umount -f" was used. This means, if there is a process, say a bash shell, running with CWD in a local file system's mount point, running "umount -f" on that mount point will fail the same as running "umount" without the "-f" flag. * The "force unmount" option in Cluster Manager, by contrast, attempts to clear references on a locally mounted file systems by killing processes using those file systems. Put more clearly: it attempts to do what most people think "umount -f" does (or should do) in a general way. So, back to our example: That bash shell is sitting in the mountpoint. We look for all processes using the mount point, see that bash (pid 11034) is using it, and kill pid 11034 with signal 9 (SIGKILL). Bash certainly no longer has a reference on our file system ;) Now, there are no more processes using the mount point, so we issue "umount"... which should work. However, this is not working in your case. There are a couple of things which come to mind which might cause this: * nfsd holding a reference (which is why I asked you to stop nfs; "exportfs -ua" should work too). * another mounted file system below the mount point (e.g. trying to umount /a while another file system is mounted on /a/b). 
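Both of those candidates are easy to rule out from a shell. A minimal check, again using the mount point from the earlier logs purely as an example:

    # anything mounted below the mount point?
    awk '$2 ~ "^/usr/local/lepus-tl/" { print }' /proc/mounts

    # drop the NFS exports (releases nfsd's reference), retry, then re-export
    exportfs -ua
    umount /usr/local/lepus-tl
    exportfs -a

If the umount still returns EBUSY with no exports active and no userland processes on the file system, that points at a stale kernel reference and is worth a support request or Bugzilla.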
-- Lon From lhh at redhat.com Tue Oct 11 15:09:31 2005 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 11 Oct 2005 11:09:31 -0400 Subject: [Linux-cluster] umount failed - device is busy In-Reply-To: <434B8D72.3080006@cc.kuleuven.be> References: <434A7ADE.108@cc.kuleuven.be> <434A8FE6.40508@cc.kuleuven.be> <1128963722.4680.21.camel@ayanami.boston.redhat.com> <434AC9DE.50606@cc.kuleuven.be> <1128978146.4680.37.camel@ayanami.boston.redhat.com> <434ADB8C.9010508@cc.kuleuven.be> <434B8D72.3080006@cc.kuleuven.be> Message-ID: <1129043371.4680.89.camel@ayanami.boston.redhat.com> On Tue, 2005-10-11 at 12:01 +0200, Herta Van den Eynde wrote: > Next attempt at understanding what is going on. > > According to the documentation, "the clurmtabd daemon synchronizes NFS > mount entries in /var/lib/nfs/rmtab with a private copy on a service's > mount point." > > Assuming the private copy is the one in .clumanager/rmtab, shouldn't > that file contain data? (They are empty for all three filesystems.) It synchronizes based on exports found in /etc/cluster.xml ... What kernel and nfs-utils versions are you running? (Note that this is a separate problem from the previous one.) -- Lon From herta.vandeneynde at cc.kuleuven.be Tue Oct 11 15:48:29 2005 From: herta.vandeneynde at cc.kuleuven.be (Herta Van den Eynde) Date: Tue, 11 Oct 2005 17:48:29 +0200 Subject: [Linux-cluster] umount failed - device is busy In-Reply-To: <1129043197.4680.85.camel@ayanami.boston.redhat.com> References: <434A7ADE.108@cc.kuleuven.be> <434A8FE6.40508@cc.kuleuven.be> <1128963722.4680.21.camel@ayanami.boston.redhat.com> <434AC9DE.50606@cc.kuleuven.be> <1128978146.4680.37.camel@ayanami.boston.redhat.com> <434ADB8C.9010508@cc.kuleuven.be> <1129043197.4680.85.camel@ayanami.boston.redhat.com> Message-ID: <434BDECD.2060303@cc.kuleuven.be> Lon Hohberger wrote: > On Mon, 2005-10-10 at 23:22 +0200, Herta Van den Eynde wrote: > > >>>Odd. Well, "umount -f" actually doesn't do what most people think it >>>does. > > >>FWIIW, I don't think the force unmount works. True, lsof/fuser don't >>report processes against the filesystem, but "df" and "mount" show that >> it's still there, and I can write to and read from it after I try a >>"umount -f". > > > (/me waves his hand and says, "This is not the force unmount you are > looking for...") > > * "umount -f" only works for NFS-mounted file systems, and only then in > certain cases. If there is pending I/O (e.g. processes in disk wait), > it will not work (I may be wrong about this one, but I think this is the > case). In any case, it does not currently do anything for local file > systems like ext3, jfs, reiserfs, xfs, etc... If there are open > references on any local file system, the umount fails with -EBUSY, > regardless of whether or not "umount -f" was used. > > This means, if there is a process, say a bash shell, running with CWD in > a local file system's mount point, running "umount -f" on that mount > point will fail the same as running "umount" without the "-f" flag. > > > * The "force unmount" option in Cluster Manager, by contrast, attempts > to clear references on a locally mounted file systems by killing > processes using those file systems. Put more clearly: it attempts to do > what most people think "umount -f" does (or should do) in a general way. > So, back to our example: > > That bash shell is sitting in the mountpoint. We look for all processes > using the mount point, see that bash (pid 11034) is using it, and kill > pid 11034 with signal 9 (SIGKILL). 
Bash certainly no longer has a > reference on our file system ;) > > Now, there are no more processes using the mount point, so we issue > "umount"... which should work. However, this is not working in your > case. There are a couple of things which come to mind which might cause > this: > > * nfsd holding a reference (which is why I asked you to stop nfs; > "exportfs -ua" should work too). > > * another mounted file system below the mount point (e.g. trying to > umount /a while another file system is mounted on /a/b). > > -- Lon Thanks for all this info, Lon. I really appreciate it. Bit of extra information: the system that was running the services got STONITHed by the other cluster member shortly before midnight. The services all failed over nicely, but the situation remains: if I try to stop or relocate a service, I get a "device is busy". I suppose that rules out an intermittent issue. There's no mounts below mounts. Kind regards, Herta Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm From herta.vandeneynde at cc.kuleuven.be Tue Oct 11 15:52:29 2005 From: herta.vandeneynde at cc.kuleuven.be (Herta Van den Eynde) Date: Tue, 11 Oct 2005 17:52:29 +0200 Subject: [Linux-cluster] clurmtabd question (was: umount failed - device is busy) In-Reply-To: <1129043371.4680.89.camel@ayanami.boston.redhat.com> References: <434A7ADE.108@cc.kuleuven.be> <434A8FE6.40508@cc.kuleuven.be> <1128963722.4680.21.camel@ayanami.boston.redhat.com> <434AC9DE.50606@cc.kuleuven.be> <1128978146.4680.37.camel@ayanami.boston.redhat.com> <434ADB8C.9010508@cc.kuleuven.be> <434B8D72.3080006@cc.kuleuven.be> <1129043371.4680.89.camel@ayanami.boston.redhat.com> Message-ID: <434BDFBD.5040900@cc.kuleuven.be> Lon Hohberger wrote: > On Tue, 2005-10-11 at 12:01 +0200, Herta Van den Eynde wrote: > >>Next attempt at understanding what is going on. >> >>According to the documentation, "the clurmtabd daemon synchronizes NFS >>mount entries in /var/lib/nfs/rmtab with a private copy on a service's >>mount point." >> >>Assuming the private copy is the one in .clumanager/rmtab, shouldn't >>that file contain data? (They are empty for all three filesystems.) > > > It synchronizes based on exports found in /etc/cluster.xml ... > > What kernel and nfs-utils versions are you running? > > (Note that this is a separate problem from the previous one.) > > -- Lon Glad to hear it's a separate problem. :-( I changed the subject accordingly. kernel-smp-2.4.21-37.EL - custom built to probe all LUNs on each SCSI device nfs-util is at 1.0.6-42EL Kind regards, Herta Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm From lhh at redhat.com Tue Oct 11 18:18:31 2005 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 11 Oct 2005 14:18:31 -0400 Subject: [Linux-cluster] umount failed - device is busy In-Reply-To: <434BDECD.2060303@cc.kuleuven.be> References: <434A7ADE.108@cc.kuleuven.be> <434A8FE6.40508@cc.kuleuven.be> <1128963722.4680.21.camel@ayanami.boston.redhat.com> <434AC9DE.50606@cc.kuleuven.be> <1128978146.4680.37.camel@ayanami.boston.redhat.com> <434ADB8C.9010508@cc.kuleuven.be> <1129043197.4680.85.camel@ayanami.boston.redhat.com> <434BDECD.2060303@cc.kuleuven.be> Message-ID: <1129054711.4680.119.camel@ayanami.boston.redhat.com> On Tue, 2005-10-11 at 17:48 +0200, Herta Van den Eynde wrote: > Bit of extra information: the system that was running the services got > STONITHed by the other cluster member shortly before midnight. 
> The services all failed over nicely, but the situation remains: if I > try to stop or relocate a service, I get a "device is busy". > I suppose that rules out an intermittent issue. > > There's no mounts below mounts. Drat. Nfsd is the most likely candidate for holding the reference. Unfortunately, this is not something I can track down; you will have to either file a support request and/or a Bugzilla. When you get a chance, you should definitely try stopping nfsd and seeing if that clears the mystery references (allowing you to unmount). If the problem comes from nfsd, it should not be terribly difficult to track down. Also, you should not need to recompile your kernel to probe all the LUNs per device; just edit /etc/modules.conf: options scsi_mod max_scsi_luns=128 ... then run mkinitrd to rebuild the initrd image. -- Lon From herta.vandeneynde at cc.kuleuven.be Tue Oct 11 19:16:55 2005 From: herta.vandeneynde at cc.kuleuven.be (Herta Van den Eynde) Date: Tue, 11 Oct 2005 21:16:55 +0200 Subject: [Linux-cluster] umount failed - device is busy In-Reply-To: <1129054711.4680.119.camel@ayanami.boston.redhat.com> References: <434A7ADE.108@cc.kuleuven.be> <434A8FE6.40508@cc.kuleuven.be> <1128963722.4680.21.camel@ayanami.boston.redhat.com> <434AC9DE.50606@cc.kuleuven.be> <1128978146.4680.37.camel@ayanami.boston.redhat.com> <434ADB8C.9010508@cc.kuleuven.be> <1129043197.4680.85.camel@ayanami.boston.redhat.com> <434BDECD.2060303@cc.kuleuven.be> <1129054711.4680.119.camel@ayanami.boston.redhat.com> Message-ID: <434C0FA7.9000803@cc.kuleuven.be> Lon Hohberger wrote: > On Tue, 2005-10-11 at 17:48 +0200, Herta Van den Eynde wrote: > > >>Bit of extra information: the system that was running the services got >>STONITHed by the other cluster member shortly before midnight. >>The services all failed over nicely, but the situation remains: if I >>try to stop or relocate a service, I get a "device is busy". >>I suppose that rules out an intermittent issue. >> >>There's no mounts below mounts. > > > Drat. > > Nfsd is the most likely candidate for holding the reference. > > Unfortunately, this is not something I can track down; you will have to > either file a support request and/or a Bugzilla. When you get a chance, > you should definitely try stopping nfsd and seeing if that clears the > mystery references (allowing you to unmount). If the problem comes from > nfsd, it should not be terribly difficult to track down. > > Also, you should not need to recompile your kernel to probe all the LUNs > per device; just edit /etc/modules.conf: > > options scsi_mod max_scsi_luns=128 > > ... then run mkinitrd to rebuild the initrd image. > > -- Lon Next maintenance window is 4 weeks away, so I won't be able to test the nfsd hypothesis anytime soon. In the meantime, I'll file a support request. I'll keep you posted. At least the unexpected STONITH confirms that the failover still works. The /etc/modules.conf tip is a big time saver. Rebuilding the modules takes forever. Thanks, Lon. Herta Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm From bojan at rexursive.com Tue Oct 11 20:07:44 2005 From: bojan at rexursive.com (Bojan Smojver) Date: Wed, 12 Oct 2005 06:07:44 +1000 Subject: [Linux-cluster] GFS 6.1.2 and RHEL4 U2 Message-ID: <1129061264.2348.1.camel@coyote.rexursive.com> I have a 5 node experimental cluster running RHEL4 U1 and GFS 6.1.0. I upgraded one box to RHEL U2 (kernel 2.6.9-22.ELsmp) and to GFS 6.1.2. 
When the box boots up with the new kernel and GFS, it joins the cluster OK (I can see that on other members), but clvmd and fenced won't start, so the system hangs. Did anyone else experience similar stuff? Or is this intentional (i.e. is the new version of GFS/cluster binary incompatible with U1 version)? -- Bojan From bojan at rexursive.com Wed Oct 12 06:28:25 2005 From: bojan at rexursive.com (Bojan Smojver) Date: Wed, 12 Oct 2005 16:28:25 +1000 Subject: [Linux-cluster] GFS 6.1.2 and RHEL4 U2 In-Reply-To: <1129061264.2348.1.camel@coyote.rexursive.com> References: <1129061264.2348.1.camel@coyote.rexursive.com> Message-ID: <20051012162825.23z5latb4ksk8k8c@imp.rexursive.com> Quoting Bojan Smojver : > I have a 5 node experimental cluster running RHEL4 U1 and GFS 6.1.0. I > upgraded one box to RHEL U2 (kernel 2.6.9-22.ELsmp) and to GFS 6.1.2. > When the box boots up with the new kernel and GFS, it joins the cluster > OK (I can see that on other members), but clvmd and fenced won't start, > so the system hangs. > > Did anyone else experience similar stuff? Or is this intentional (i.e. > is the new version of GFS/cluster binary incompatible with U1 version)? BTW, this is what I get on the upgraded machine when I attempt to start fenced: Oct 12 16:24:42 matrix1-5 kernel: SM: process_reply invalid id=0 nodeid=3 Oct 12 16:24:42 matrix1-5 kernel: SM: process_reply invalid id=0 nodeid=2 Oct 12 16:24:42 matrix1-5 kernel: SM: process_reply invalid id=0 nodeid=4 Oct 12 16:24:42 matrix1-5 kernel: SM: process_reply invalid id=0 nodeid=1 Fenced never starts... -- Bojan From pcaulfie at redhat.com Wed Oct 12 06:47:45 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Wed, 12 Oct 2005 07:47:45 +0100 Subject: [Linux-cluster] Re: setting the heartbeat interval In-Reply-To: <1a81f4f4b1d7c534a47c31bd918bea98@redhat.com> References: <1a81f4f4b1d7c534a47c31bd918bea98@redhat.com> Message-ID: <434CB191.1090803@redhat.com> Jonathan E Brassow wrote: > Looking at the cman_tool man page, I don't see a way to change the > heartbeat interval. Dave, is there a way to change this while cman is > part of a cluster? > > To change the in memory version number or expected votes for cman, you > would: > > 1) change cluster.xml file > 2) ccs_tool update > 3) cman_tool version -r ; cman_tool expected -e > > If cman_tool can change the heartbeat interval without restarting the > cluster (or cman on each machine), it would look very much like step > #3. This can not be done through the GUI, because the GUI only changes > the in memory version number. A reply for the list: The heartbeat & detection intervals for cman can be set by writing values into /proc/cluster/conf/cman/hello_timer & /proc/cluster/conf/cman/deadnode_timer These values are in seconds and take effect immediately (-ish, ie the new hello timer will take effect after the last hello timer has expired). Because these really need to be the same on all nodes I don't recommend changing them on-the-fly though - they should be set between loading the module and running cman_tool join. There is also /proc/cluster/conf/cman/max_retries which some may like to increase if they are seeing "No response to messages" reasons for a node being kicked out of the cluster - you can change this any time you like with no ill effects. The version of cman_tool on the STABLE tag of CVS has code that will read these values from CCS when "cman_tool join" is run. I think this should be in a future RHEL4 Update. 
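As a concrete sketch of the above (the timer values are only examples and must be identical on every node; the files are the ones named above, written after the cman module is loaded and before "cman_tool join"):

# per node, before joining the cluster
echo 5  > /proc/cluster/conf/cman/hello_timer      # heartbeat interval, in seconds
echo 21 > /proc/cluster/conf/cman/deadnode_timer   # seconds of silence before a node is declared dead
cman_tool join

# max_retries can be raised at any time on a running node
echo 5 > /proc/cluster/conf/cman/max_retries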
-- patrick From baesso at ksolutions.it Wed Oct 12 08:05:25 2005 From: baesso at ksolutions.it (Baesso Mirko) Date: Wed, 12 Oct 2005 10:05:25 +0200 Subject: R: [Linux-cluster] Setup Fence_wti on cluster RHES4 U1 Message-ID: Hi I check fence device section and I setup all, wti ports also But when i try to unplug one node the other cannot power it off This is my cluster.conf ... Baesso Mirko - System Engineer KSolutions.S.p.A. Via Lenin 132/26 56017 S.Martino Ulmiano (PI) - Italy tel.+ 39 0 50 898369 fax. + 39 0 50 861200 baesso at ksolutions.it http//www.ksolutions.it _____ Da: Jonathan E Brassow [mailto:jbrassow at redhat.com] Inviato: luned? 10 ottobre 2005 22.04 A: linux clustering Oggetto: Re: [Linux-cluster] Setup Fence_wti on cluster RHES4 U1 I don't have the gui in front of me, but there should be a manage fencing button or something... If you've never specified which port that a machine is connected to on the WTI, then you haven't gotten far enough in the set-up. brassow On Oct 10, 2005, at 2:22 AM, Baesso Mirko wrote: Hi i've setup a Redhat cluster with two node and i try to test failover using power switch (WTI-NPS230) but seem doesn't work If I look for error messages I see that fence_wti waits for a command to execute. I setup cluster.conf using system-config-cluster and there is no section regarding command to execute Could you please let me now how to setup correctly Thanks Baesso Mirko - System Engineer KSolutions.S.p.A. Via Lenin 132/26 56017 S.Martino Ulmiano (PI) - Italy tel.+ 39 0 50 898369 fax. + 39 0 50 861200 baesso at ksolutions.it http//www.ksolutions.it -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From teigland at redhat.com Wed Oct 12 14:29:26 2005 From: teigland at redhat.com (David Teigland) Date: Wed, 12 Oct 2005 09:29:26 -0500 Subject: [Linux-cluster] GFS 6.1.2 and RHEL4 U2 In-Reply-To: <20051012162825.23z5latb4ksk8k8c@imp.rexursive.com> References: <1129061264.2348.1.camel@coyote.rexursive.com> <20051012162825.23z5latb4ksk8k8c@imp.rexursive.com> Message-ID: <20051012142926.GB7876@redhat.com> On Wed, Oct 12, 2005 at 04:28:25PM +1000, Bojan Smojver wrote: > Quoting Bojan Smojver : > > >I have a 5 node experimental cluster running RHEL4 U1 and GFS 6.1.0. I > >upgraded one box to RHEL U2 (kernel 2.6.9-22.ELsmp) and to GFS 6.1.2. > >When the box boots up with the new kernel and GFS, it joins the cluster > >OK (I can see that on other members), but clvmd and fenced won't start, > >so the system hangs. > > > >Did anyone else experience similar stuff? Or is this intentional (i.e. > >is the new version of GFS/cluster binary incompatible with U1 version)? > > BTW, this is what I get on the upgraded machine when I attempt to start > fenced: > > Oct 12 16:24:42 matrix1-5 kernel: SM: process_reply invalid id=0 > nodeid=3 > Oct 12 16:24:42 matrix1-5 kernel: SM: process_reply invalid id=0 > nodeid=2 > Oct 12 16:24:42 matrix1-5 kernel: SM: process_reply invalid id=0 > nodeid=4 > Oct 12 16:24:42 matrix1-5 kernel: SM: process_reply invalid id=0 > nodeid=1 > > Fenced never starts... A bug fix required a minor change to the cman/sm message formats between U1 and U2 that make the two versions incompatible, so all nodes need to be running the U2 version. 
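Before letting an upgraded box rejoin, it is worth confirming that every member is on the same bits. A rough sketch, assuming passwordless ssh and using made-up node names as placeholders:

for n in matrix1-1 matrix1-2 matrix1-3 matrix1-4 matrix1-5; do
    echo "== $n =="
    ssh $n 'uname -r; rpm -q cman cman-kernel-smp dlm dlm-kernel-smp GFS GFS-kernel-smp'
done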
Dave From bojan at rexursive.com Wed Oct 12 20:06:18 2005 From: bojan at rexursive.com (Bojan Smojver) Date: Thu, 13 Oct 2005 06:06:18 +1000 Subject: [Linux-cluster] GFS 6.1.2 and RHEL4 U2 In-Reply-To: <20051012142926.GB7876@redhat.com> References: <1129061264.2348.1.camel@coyote.rexursive.com> <20051012162825.23z5latb4ksk8k8c@imp.rexursive.com> <20051012142926.GB7876@redhat.com> Message-ID: <1129147578.31843.5.camel@coyote.rexursive.com> On Wed, 2005-10-12 at 09:29 -0500, David Teigland wrote: > A bug fix required a minor change to the cman/sm message formats between > U1 and U2 that make the two versions incompatible, so all nodes need to be > running the U2 version. Thanks. I'll bounce all nodes today and report back if they don't form the cluster (I'm sure they will :-). I have to admit that I didn't look that hard into RPM release notes, but I never noticed any warnings about this on RHN... -- Bojan From Bowie_Bailey at BUC.com Wed Oct 12 20:18:41 2005 From: Bowie_Bailey at BUC.com (Bowie Bailey) Date: Wed, 12 Oct 2005 16:18:41 -0400 Subject: [Linux-cluster] GFS + DLM howto? Message-ID: <4766EEE585A6D311ADF500E018C154E302133245@bnifex.cis.buc.com> I'm trying to configure three servers to share a GFS 6.1 filesystem. I am completely new to all of this and the instructions in the manuals I've found on the RH website are running me around in circles. Can anyone point me to a good how-to that will walk me through a simple configuration of GFS and DLM? Thanks, Bowie From teigland at redhat.com Wed Oct 12 20:28:17 2005 From: teigland at redhat.com (David Teigland) Date: Wed, 12 Oct 2005 15:28:17 -0500 Subject: [Linux-cluster] GFS + DLM howto? In-Reply-To: <4766EEE585A6D311ADF500E018C154E302133245@bnifex.cis.buc.com> References: <4766EEE585A6D311ADF500E018C154E302133245@bnifex.cis.buc.com> Message-ID: <20051012202817.GD10593@redhat.com> On Wed, Oct 12, 2005 at 04:18:41PM -0400, Bowie Bailey wrote: > I'm trying to configure three servers to share a GFS 6.1 filesystem. > > I am completely new to all of this and the instructions in the manuals I've > found on the RH website are running me around in circles. > > Can anyone point me to a good how-to that will walk me through a simple > configuration of GFS and DLM? I think the official documentation assumes you're doing everything through the gui, at least with respect to the clustering components. If you're not, then this is probably the best we have along with the man pages: http://sources.redhat.com/cluster/doc/usage.txt Dave From sgray at bluestarinc.com Wed Oct 12 21:21:10 2005 From: sgray at bluestarinc.com (Sean Gray) Date: Wed, 12 Oct 2005 17:21:10 -0400 Subject: [Linux-cluster] GFS + DLM howto? In-Reply-To: <4766EEE585A6D311ADF500E018C154E302133245@bnifex.cis.buc.com> References: <4766EEE585A6D311ADF500E018C154E302133245@bnifex.cis.buc.com> Message-ID: <1129152070.4166.401.camel@libra.bluestar.cvg0> Following are my notes. Keep in mind that I did a lot of installing from SRPMs and you may not need to go through all that. Hope it helps... 
- Sean # RHGFS and RHCS on RHEL4 x86_64 2.6.9-11 # v 0.1 # By Sean Gray copyright 2005 # Published under the GNU Free Documentation License http://www.gnu.org/licenses/fdl.txt # Sources: http://www.google.com # http://www.redhat.com/docs/manuals/csgfs/browse/rh-gfs-en/ch-install.html # http://www.redhat.com/docs/manuals/csgfs/browse/rh-cs-en/ch-software.html # http://sources.redhat.com/cluster/doc/usage.txt # http://karan.org/ # https://www.redhat.com/archives/linux-cluster/index.html # http://lists.centos.org/pipermail/centos-devel/2005-August/thread.html#861 # http://www.hughesjr.com/ # This document claims to have no value whatsoever to anyone, except maybe the author. # # # # enable ntp system-config-time # install SRPMS from: # ftp://ftp.redhat.com/pub/redhat/linux/enterprise/4/en/RHCS/x86_64/SRPMS/* # ftp://ftp.redhat.com/pub/redhat/linux/enterprise/4/en/RHGFS/x86_64/SRPMS/* # install kernel source rpm -ivh ftp://ftp.redhat.com/pub/redhat/linux/updates/enterprise/4ES/en/os/SRPMS/kernel-2.6.9-11.EL.src.rpm # build install perl-Net-Telnet-3.03-3 rpmbuild --rebuild /usr/src/redhat/SRPMS/perl-Net-Telnet-3.03-3.src.rpm rpm -ivh /usr/src/redhat/RPMS/noarch/perl-Net-Telnet-3.03-3.noarch.rpm # build install gulm rpmbuild -bb /usr/src/redhat/SPECS/gulm.spec rpm -ivh /usr/src/redhat/RPMS/x86_64/gulm-1.0.0-0.x86_64.rpm \ /usr/src/redhat/RPMS/x86_64/gulm-devel-1.0.0-0.x86_64.rpm \ /usr/src/redhat/RPMS/x86_64/gulm-debuginfo-1.0.0-0.x86_64.rpm # build install magma rpmbuild -bb /usr/src/redhat/SPECS/magma.spec rpm -ivh /usr/src/redhat/RPMS/x86_64/magma-1.0.0-0.x86_64.rpm rpm -ivh /usr/src/redhat/RPMS/x86_64/magma-debuginfo-1.0.0-0.x86_64.rpm rpm -ivh /usr/src/redhat/RPMS/x86_64/magma-devel-1.0.0-0.x86_64.rpm # build install magma-plugins rpmbuild -bb /usr/src/redhat/SPECS/magma-plugins.spec rpm -ivh /usr/src/redhat/RPMS/x86_64/magma-plugins-1.0.0-0.x86_64.rpm \ /usr/src/redhat/RPMS/x86_64/magma-plugins-debuginfo-1.0.0-0.x86_64.rpm # Download and install from RHN rpm -ivh kernel-2.6.9-11.EL.x86_64.rpm \ kernel-devel-2.6.9-11.EL.x86_64.rpm \ kernel-doc-2.6.9-11.EL.noarch.rpm \ kernel-hugemem-2.6.9-11.EL.i686.rpm \ kernel-hugemem-devel-2.6.9-11.EL.i686 \ kernel-smp-2.6.9-11.EL.x86_64.rpm \ kernel-smp-devel-2.6.9-11.EL.x86_64.rpm # reboot with new kernel init 6 # Edit all the hugemem kernel stuff out of the spec files # I don't need hugmem and don't have time to troubleshoot # uneasily build install 3rd-party fake-build-requires rpm -ivh http://rpm.karan.org/el4/csgfs/SRPMS/fake-build-provides-1.0-20.src.rpm rpm -ivh # build install ccs rpmbuild -bb /usr/src/redhat/SPECS/ccs.spec rpm -ivh /usr/src/redhat/RPMS/x86_64/ccs-1.0.0-0.x86_64.rpm \ /usr/src/redhat/RPMS/x86_64/ccs-debuginfo-1.0.0-0.x86_64.rpm \ /usr/src/redhat/RPMS/x86_64/ccs-devel-1.0.0-0.x86_64.rpm # build install cman-kernel rpmbuild -bb /usr/src/redhat/SPECS/cman-kernel.spec rpm -ivh /usr/src/redhat/RPMS/x86_64/cman-kernel-2.6.9-36.0.x86_64.rpm \ /usr/src/redhat/RPMS/x86_64/cman-kernel-debuginfo-2.6.9-36.0.x86_64.rpm \ /usr/src/redhat/RPMS/x86_64/cman-kernel-smp-2.6.9-36.0.x86_64.rpm \ /usr/src/redhat/RPMS/x86_64/cman-kernheaders-2.6.9-36.0.x86_64.rpm # build install cman rpmbuild -bb /usr/src/redhat/SPECS/cman.spec rpm -ivh /usr/src/redhat/RPMS/x86_64/cman-1.0.0-0.x86_64.rpm \ /usr/src/redhat/RPMS/x86_64/cman-debuginfo-1.0.0-0.x86_64.rpm # build install dlm-kernel # edit dlm-kernel.spec # remove --> $kernel_src/scripts/mod/modpost -m -i /lib/modules/%{kernel_version}$flavor/kernel/cluster/cman.symvers src/dlm.o -o dlm.symvers # add --> 
$kernel_src/scripts/mod/modpost -m -i /lib/modules/%{kernel_version}$flavor/kernel/cluster/cman.symvers /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/dlm.o -o dlm.symvers rpmbuild -bb /usr/src/redhat/SPECS/dlm-kernel.spec rpm -ivh /usr/src/redhat/RPMS/x86_64/dlm-kernel-2.6.9-34.0.x86_64.rpm \ /usr/src/redhat/RPMS/x86_64/dlm-kernel-debuginfo-2.6.9-34.0.x86_64.rpm \ /usr/src/redhat/RPMS/x86_64/dlm-kernheaders-2.6.9-34.0.x86_64.rpm \ /usr/src/redhat/RPMS/x86_64/dlm-kernel-smp-2.6.9-34.0.x86_64.rpm # build install dlm rpmbuild -bb /usr/src/redhat/SPECS/dlm.spec rpm -ivh /usr/src/redhat/RPMS/x86_64/dlm-1.0.0-0.x86_64.rpm \ /usr/src/redhat/RPMS/x86_64/dlm-devel-1.0.0-0.x86_64.rpm \ /usr/src/redhat/RPMS/x86_64/dlm-debuginfo-1.0.0-0.x86_64.rpm # build install fence rpmbuild -bb /usr/src/redhat/SPECS/fence.spec rpm -ivh /usr/src/redhat/RPMS/x86_64/fence-1.32.1-0.x86_64.rpm \ /usr/src/redhat/RPMS/x86_64/fence-debuginfo-1.32.1-0.x86_64.rpm # build install iddev rpmbuild -bb /usr/src/redhat/SPECS/iddev.spec rpm -ivh /usr/src/redhat/RPMS/x86_64/iddev-2.0.0-0.x86_64.rpm \ /usr/src/redhat/RPMS/x86_64/iddev-devel-2.0.0-0.x86_64.rpm \ /usr/src/redhat/RPMS/x86_64/iddev-debuginfo-2.0.0-0.x86_64.rpm # build install rgmanager rpmbuild -bb /usr/src/redhat/SPECS/rgmanager.spec rpm -ivh /usr/src/redhat/RPMS/x86_64/rgmanager-1.9.34-1.x86_64.rpm \ /usr/src/redhat/RPMS/x86_64/rgmanager-debuginfo-1.9.34-1.x86_64.rpm # build install system-config-cluster rpmbuild -bb /usr/src/redhat/SPECS/system-config-cluster.spec /usr/src/redhat/RPMS/noarch/system-config-cluster-1.0.12-1.0.noarch.rpm # build install ipvsadm rpmbuild -bb /usr/src/redhat/SPECS/ipvsadm.spec rpm -ivh /usr/src/redhat/RPMS/x86_64/ipvsadm-1.24-6.x86_64.rpm \ /usr/src/redhat/RPMS/x86_64/ipvsadm-debuginfo-1.24-6.x86_64.rpm # build install piranha rpmbuild -bb /usr/src/redhat/SPECS/piranha.spec rpm -ivh /usr/src/redhat/RPMS/x86_64/piranha-0.8.0-1.x86_64.rpm \ /usr/src/redhat/RPMS/x86_64/piranha-debuginfo-0.8.0-1.x86_64.rpm # I tried to build gfs-kernel here but smp would not compile # finally figured out, after moving to gnbd-kernel that deleting # /usr/src/BUILD/smp allowed it to build. Hmmm. 
rm -rf /usr/src/BUILD/smp # build install gnbd-kernel rpmbuild -bb /usr/src/redhat/SPECS/gnbd-kernel.spec rpm -ivh /usr/src/redhat/RPMS/x86_64/gnbd-kernel-2.6.9-8.27.x86_64.rpm \ /usr/src/redhat/RPMS/x86_64/gnbd-kernheaders-2.6.9-8.27.x86_64.rpm \ /usr/src/redhat/RPMS/x86_64/gnbd-kernel-smp-2.6.9-8.27.x86_64.rpm \ /usr/src/redhat/RPMS/x86_64/gnbd-kernel-debuginfo-2.6.9-8.27.x86_64.rpm rm -rf /usr/src/BUILD/smp # after all the trial and error it appears my BUILD was hosed also rm -rf /usr/src/redhat/BUILD/gfs-kernel-2.6.9-35/ # build install gfs-kernel rpmbuild -bb /usr/src/redhat/SPECS/GFS-kernel.spec rpm -ivh /usr/src/redhat/RPMS/x86_64/GFS-kernel-2.6.9-35.5.x86_64.rpm \ /usr/src/redhat/RPMS/x86_64/GFS-kernheaders-2.6.9-35.5.x86_64.rpm \ /usr/src/redhat/RPMS/x86_64/GFS-kernel-smp-2.6.9-35.5.x86_64.rpm \ /usr/src/redhat/RPMS/x86_64/GFS-kernel-debuginfo-2.6.9-35.5.x86_64.rpm # build install gfs rpmbuild -bb /usr/src/redhat/SPECS/GFS.spec rpm -ivh /usr/src/redhat/RPMS/x86_64/GFS-6.1.0-0.x86_64.rpm \ /usr/src/redhat/RPMS/x86_64/GFS-debuginfo-6.1.0-0.x86_64.rpm # build install gnbd rpmbuild --bb /usr/src/redhat/SPECS/gnbd.spec rpm -ivh /usr/src/redhat/RPMS/x86_64/gnbd-1.0.0-0.x86_64.rpm \ /usr/src/redhat/RPMS/x86_64/gnbd-debuginfo-1.0.0-0.x86_64.rpm # build install lvm2-cluster # On my not so up2date system I had to first upgrade # device-mapper and lvm2 for this to work rpmbuild -bb /usr/src/redhat/SPECS/lvm2-cluster.spec rpm -ivh /usr/src/redhat/RPMS/x86_64/lvm2-cluster-2.01.09-5.0.RHEL4.x86_64.rpm \ /usr/src/redhat/RPMS/x86_64/lvm2-cluster-debuginfo-2.01.09-5.0.RHEL4.x86_64.rpm # Wow that only took 24 hours of my life. # That was on node 1 onto node 2 # copy all newly built rpms to a tmp folder on asteroids # download the following from RHN # kernel-smp-2.6.9-11.EL.x86_64.rpm # kernel-2.6.9-11.EL.x86_64.rpm # device-mapper-1.01.01-1.RHEL4.x86_64.rpm # lvm2-2.01.08-1.0.RHEL4.x86_64.rpm # remove all device-mapper rpms (???) 
there are both # i386 and x86_64 installed, I removed both and installed # the x86_64 and i386 rpm -e device-mapper-1.00.19-2 --allmatches --nodeps # install rpm -Uvh perl-Net-Telnet-3.03-3.noarch.rpm \ system-config-cluster-1.0.12-1.0.noarch.rpm \ ccs-1.0.0-0.x86_64.rpm \ ccs-debuginfo-1.0.0-0.x86_64.rpm \ ccs-devel-1.0.0-0.x86_64.rpm \ cman-1.0.0-0.x86_64.rpm \ cman-debuginfo-1.0.0-0.x86_64.rpm \ cman-kernel-2.6.9-36.0.x86_64.rpm \ cman-kernel-debuginfo-2.6.9-36.0.x86_64.rpm \ cman-kernel-smp-2.6.9-36.0.x86_64.rpm \ cman-kernheaders-2.6.9-36.0.x86_64.rpm \ device-mapper-1.01.01-1.RHEL4.x86_64.rpm \ dlm-1.0.0-0.x86_64.rpm \ dlm-debuginfo-1.0.0-0.x86_64.rpm \ dlm-devel-1.0.0-0.x86_64.rpm \ dlm-kernel-2.6.9-34.0.x86_64.rpm \ dlm-kernel-debuginfo-2.6.9-34.0.x86_64.rpm \ dlm-kernel-dlm-kernheaders-2.6.9-34.0.x86_64.rpm \ dlm-kernel-smp-2.6.9-34.0.x86_64.rpm \ dlm-kernheaders-2.6.9-34.0.x86_64.rpm \ fake-build-provides-1.0-20.x86_64.rpm \ fence-1.32.1-0.x86_64.rpm \ fence-debuginfo-1.32.1-0.x86_64.rpm \ GFS-6.1.0-0.x86_64.rpm \ GFS-debuginfo-6.1.0-0.x86_64.rpm \ GFS-kernel-2.6.9-35.5.x86_64.rpm \ GFS-kernel-debuginfo-2.6.9-35.5.x86_64.rpm \ GFS-kernel-smp-2.6.9-35.5.x86_64.rpm \ GFS-kernheaders-2.6.9-35.5.x86_64.rpm \ gnbd-1.0.0-0.x86_64.rpm \ gnbd-debuginfo-1.0.0-0.x86_64.rpm \ gnbd-kernel-2.6.9-8.27.x86_64.rpm \ gnbd-kernel-debuginfo-2.6.9-8.27.x86_64.rpm \ gnbd-kernel-smp-2.6.9-8.27.x86_64.rpm \ gnbd-kernheaders-2.6.9-8.27.x86_64.rpm \ gulm-1.0.0-0.x86_64.rpm \ gulm-debuginfo-1.0.0-0.x86_64.rpm \ gulm-devel-1.0.0-0.x86_64.rpm \ iddev-2.0.0-0.x86_64.rpm \ iddev-debuginfo-2.0.0-0.x86_64.rpm \ iddev-devel-2.0.0-0.x86_64.rpm \ ipvsadm-1.24-6.x86_64.rpm \ ipvsadm-debuginfo-1.24-6.x86_64.rpm \ kernel-2.6.9-11.EL.x86_64.rpm \ kernel-smp-2.6.9-11.EL.x86_64.rpm \ lvm2-2.01.08-1.0.RHEL4.x86_64.rpm \ lvm2-cluster-2.01.09-5.0.RHEL4.x86_64.rpm \ lvm2-cluster-debuginfo-2.01.09-5.0.RHEL4.x86_64.rpm \ magma-1.0.0-0.x86_64.rpm \ magma-debuginfo-1.0.0-0.x86_64.rpm \ magma-devel-1.0.0-0.x86_64.rpm \ magma-plugins-1.0.0-0.x86_64.rpm \ magma-plugins-debuginfo-1.0.0-0.x86_64.rpm \ piranha-0.8.0-1.x86_64.rpm \ piranha-debuginfo-0.8.0-1.x86_64.rpm \ rgmanager-1.9.34-1.x86_64.rpm \ rgmanager-debuginfo-1.9.34-1.x86_64.rpm # reboot # Configuration pvcreate /dev/sda # carve up your disk with system-config-lvm gfs_mkfs -p lock_dlm -t alpha_cluster:scratch0_LV -j 12 /dev/scratch_VG/scratch0_LV # on all nodes mount -t gfs -oacl /dev/scratch_VG/scratch0_LV /scratch/ # Rinse and repeat on defender, tron, centipede, tapper, paperboy, joust, tempest # galaxian, pacman, and punchout On Wed, 2005-10-12 at 16:18 -0400, Bowie Bailey wrote: > I'm trying to configure three servers to share a GFS 6.1 filesystem. > > I am completely new to all of this and the instructions in the manuals I've > found on the RH website are running me around in circles. > > Can anyone point me to a good how-to that will walk me through a simple > configuration of GFS and DLM? > > Thanks, > > Bowie > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > Sean N. Gray Director of Information Technology United Radio Incorporated, DBA BlueStar 24 Spiral Drive Florence, Kentucky 41042 office: 859.371.4423 x263 toll free: 800.371.4423 x263 fax: 859.371.4425 mobile: 513.616.3379 -------------- next part -------------- An HTML attachment was scrubbed... 
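One piece of the recipe above that trips people up: the "alpha_cluster" half of the gfs_mkfs -t argument must match the cluster name in /etc/cluster/cluster.conf on every node. A stripped-down illustration of the relevant bits (the node names, fencing entries and config_version below are made-up placeholders, not Sean's real configuration):

cat > /etc/cluster/cluster.conf <<'EOF'
<?xml version="1.0"?>
<cluster name="alpha_cluster" config_version="1">
  <clusternodes>
    <clusternode name="defender">
      <fence><method name="single"><device name="manual" nodename="defender"/></method></fence>
    </clusternode>
    <clusternode name="tron">
      <fence><method name="single"><device name="manual" nodename="tron"/></method></fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="manual" agent="fence_manual"/>
  </fencedevices>
</cluster>
EOF

The second half of the -t argument (scratch0_LV) is just the filesystem's own name, as Dave explains further down the thread.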
URL: From bojan at rexursive.com Wed Oct 12 23:16:05 2005 From: bojan at rexursive.com (Bojan Smojver) Date: Thu, 13 Oct 2005 09:16:05 +1000 Subject: [Linux-cluster] GFS 6.1.2 and RHEL4 U2 In-Reply-To: <1129147578.31843.5.camel@coyote.rexursive.com> References: <1129061264.2348.1.camel@coyote.rexursive.com> <20051012162825.23z5latb4ksk8k8c@imp.rexursive.com> <20051012142926.GB7876@redhat.com> <1129147578.31843.5.camel@coyote.rexursive.com> Message-ID: <20051013091605.0vrlyqfun4kgk8wg@imp.rexursive.com> Quoting Bojan Smojver : > Thanks. I'll bounce all nodes today and report back if they don't form > the cluster (I'm sure they will :-). They all came back, so it was just a backward compatibility problem. -- Bojan From Bowie_Bailey at BUC.com Thu Oct 13 13:49:48 2005 From: Bowie_Bailey at BUC.com (=?UTF-8?B?Qm93aWUgQmFpbGV5?=) Date: Thu, 13 Oct 2005 09:49:48 -0400 Subject: =?UTF-8?B?UkU6IFtMaW51eC1jbHVzdGVyXSBHRlMgKyBETE0gaG93dG8/?= Message-ID: <4766EEE585A6D311ADF500E018C154E302133249@bnifex.cis.buc.com> From: Sean Gray [mailto:sgray at bluestarinc.com] > > Following are my notes. Keep in mind that I did a lot of > installing from SRPMs and you may not need to go through all > that. Hope it helps... - Sean > > . > . > . > > # Configuration > > pvcreate /dev/sda > # carve up your disk with system-config-lvm > gfs_mkfs -p lock_dlm -t alpha_cluster:scratch0_LV -j 12 > /dev/scratch_VG/scratch0_LV > # on all nodes > mount -t gfs -oacl /dev/scratch_VG/scratch0_LV /scratch/ > > # Rinse and repeat on defender, tron, centipede, tapper, > paperboy, joust, tempest > # galaxian, pacman, and punchout Ok, but what I am trying to get is instructions on how to configure the "alpha_cluster:scratch0_LV" lock table that you refer to here. Bowie From sgray at bluestarinc.com Thu Oct 13 14:46:59 2005 From: sgray at bluestarinc.com (Sean Gray) Date: Thu, 13 Oct 2005 10:46:59 -0400 Subject: [Linux-cluster] GFS + DLM howto? In-Reply-To: <4766EEE585A6D311ADF500E018C154E302133249@bnifex.cis.buc.com> References: <4766EEE585A6D311ADF500E018C154E302133249@bnifex.cis.buc.com> Message-ID: <1129214819.9819.108.camel@libra.bluestar.cvg0> Create a GFS resource using system-config-cluster. On Thu, 2005-10-13 at 09:49 -0400, Bowie Bailey wrote: > From: Sean Gray [mailto:sgray at bluestarinc.com] > > > > Following are my notes. Keep in mind that I did a lot of > > installing from SRPMs and you may not need to go through all > > that. Hope it helps... - Sean > > > > . > > . > > . > > > > # Configuration > > > > pvcreate /dev/sda > > # carve up your disk with system-config-lvm > > gfs_mkfs -p lock_dlm -t alpha_cluster:scratch0_LV -j 12 > > /dev/scratch_VG/scratch0_LV > > # on all nodes > > mount -t gfs -oacl /dev/scratch_VG/scratch0_LV /scratch/ > > > > # Rinse and repeat on defender, tron, centipede, tapper, > > paperboy, joust, tempest > > # galaxian, pacman, and punchout > > Ok, but what I am trying to get is instructions on how to configure > the "alpha_cluster:scratch0_LV" lock table that you refer to here. > > Bowie > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > Sean N. Gray Director of Information Technology United Radio Incorporated, DBA BlueStar 24 Spiral Drive Florence, Kentucky 41042 office: 859.371.4423 x263 toll free: 800.371.4423 x263 fax: 859.371.4425 mobile: 513.616.3379 -------------- next part -------------- An HTML attachment was scrubbed... 
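For a command-line-only setup (no system-config-cluster), the order of operations from usage.txt comes down to roughly the following on each node once cluster.conf is in place -- a sketch only, with the mkfs line borrowed from Sean's notes and the journal count sized for three nodes:

ccsd                 # config daemon, serves /etc/cluster/cluster.conf
cman_tool join       # join the cluster
fence_tool join      # join the fence domain
clvmd                # clustered LVM, if the storage is under LVM

# once, from any one node:
gfs_mkfs -p lock_dlm -t alpha_cluster:scratch0_LV -j 3 /dev/scratch_VG/scratch0_LV

# then on every node:
mount -t gfs /dev/scratch_VG/scratch0_LV /scratch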
URL: From teigland at redhat.com Thu Oct 13 15:03:45 2005 From: teigland at redhat.com (David Teigland) Date: Thu, 13 Oct 2005 10:03:45 -0500 Subject: [Linux-cluster] GFS + DLM howto? In-Reply-To: <4766EEE585A6D311ADF500E018C154E302133249@bnifex.cis.buc.com> References: <4766EEE585A6D311ADF500E018C154E302133249@bnifex.cis.buc.com> Message-ID: <20051013150345.GA8587@redhat.com> On Thu, Oct 13, 2005 at 09:49:48AM -0400, Bowie Bailey wrote: > From: Sean Gray [mailto:sgray at bluestarinc.com] > > gfs_mkfs -p lock_dlm -t alpha_cluster:scratch0_LV -j 12 > > /dev/scratch_VG/scratch0_LV > > # on all nodes > > mount -t gfs -oacl /dev/scratch_VG/scratch0_LV /scratch/ > Ok, but what I am trying to get is instructions on how to configure > the "alpha_cluster:scratch0_LV" lock table that you refer to here. "alpha_cluster" is the cluster name from cluster.conf "scratch0_LV" is the unique filesystem name that you pick for the fs when you do gfs_mkfs. These are mentioned in usage.txt, man gfs_mkfs. Dave From Bowie_Bailey at BUC.com Thu Oct 13 15:30:23 2005 From: Bowie_Bailey at BUC.com (Bowie Bailey) Date: Thu, 13 Oct 2005 11:30:23 -0400 Subject: [Linux-cluster] GFS + DLM howto? Message-ID: <4766EEE585A6D311ADF500E018C154E30213324D@bnifex.cis.buc.com> From: David Teigland [mailto:teigland at redhat.com] > > On Thu, Oct 13, 2005 at 09:49:48AM -0400, Bowie Bailey wrote: > > From: Sean Gray [mailto:sgray at bluestarinc.com] > > > > gfs_mkfs -p lock_dlm -t alpha_cluster:scratch0_LV -j 12 > > > /dev/scratch_VG/scratch0_LV > > > # on all nodes > > > mount -t gfs -oacl /dev/scratch_VG/scratch0_LV /scratch/ > > > Ok, but what I am trying to get is instructions on how to configure > > the "alpha_cluster:scratch0_LV" lock table that you refer to here. > > "alpha_cluster" is the cluster name from cluster.conf > "scratch0_LV" is the unique filesystem name that you pick for the fs > when you do gfs_mkfs. These are mentioned in usage.txt, man gfs_mkfs. Right. I'm currently working through the usage.txt that you linked me to. I was just replying to Sean to see if he had anything extra to add since his response skipped over the part of the configuration that I'm interested in. I think I'll be able to figure it out from here. I'll be back if I have more questions. :) Thanks for the help! (both of you) Bowie From spwilcox at att.com Thu Oct 13 18:26:59 2005 From: spwilcox at att.com (Steve Wilcox) Date: Thu, 13 Oct 2005 14:26:59 -0400 Subject: [Linux-cluster] Oracle 10G-R2 on GFS install problems Message-ID: <1129228020.27905.17.camel@aptis101.cqtel.com> In the process of installing Oracle 10G-R2 on a RHEL4-U2 x86_64 cluster with GFS 6.1.2, I get the following error when running Oracle's root.sh for cluster ready services (a.k.a clusterware): [ OCROSD][4142143168]utstoragetype: /u00/app/ocr0 is on FS type 18225520. Not supported. I did a little poking around and found that OCFS2 has the same issue, but with OCFS2 it can be circumvented by mounting with -o datavolume... I was unable to find any similar options for GFS mounts. This looks like probably more of an Oracle bug, as 10G-R1 installed without any problems (I have my DBA pursuing the Oracle route), but I was wondering if anyone else has come across this problem and if so, was there any fix? Thanks, -steve From Bowie_Bailey at BUC.com Thu Oct 13 19:01:57 2005 From: Bowie_Bailey at BUC.com (Bowie Bailey) Date: Thu, 13 Oct 2005 15:01:57 -0400 Subject: [Linux-cluster] GFS + DLM howto? 
Message-ID: <4766EEE585A6D311ADF500E018C154E302133251@bnifex.cis.buc.com> I seem to be missing something. I have been able to configure cluster.conf and get ccsd and clvmd running, but it fails when I try to initialize the physical volume. # pvcreate /dev/etherd/e1.0 Device /dev/etherd/e1.0 not found. The device is definitely there and I can ping it with the aoeping command. # ll /dev/etherd/ total 4 -rw-r--r-- 1 root root 1 Oct 13 14:33 discover brw------- 1 root disk 152, 256 Oct 12 17:07 e1.0 All the modules seem to be loaded. (irrelevant modules removed from the list) # lsmod Module Size Used by aoe 26816 0 lock_dlm 43740 0 dlm 113092 4 lock_dlm gfs 280920 0 lock_harness 8992 2 lock_dlm,gfs cman 122720 10 lock_dlm,dlm dm_mod 58949 0 Any suggestions? Bowie From joshua at emailscout.net Sun Oct 2 11:03:02 2005 From: joshua at emailscout.net (Joshua Mouch) Date: Sun, 2 Oct 2005 07:03:02 -0400 Subject: [Linux-cluster] ddraid production release Message-ID: <4enjrj$1gendbr@mxip18a.cluster1.charter.net> I know ddraid is still in its infancy, but do you have an approximate release date in mind? This year. next year. year after? Joshua Mouch -------------- next part -------------- An HTML attachment was scrubbed... URL: From joshua at emailscout.net Mon Oct 3 17:23:59 2005 From: joshua at emailscout.net (Joshua Mouch) Date: Mon, 3 Oct 2005 13:23:59 -0400 Subject: [Linux-cluster] Fedora + GFS & No GNBD scripts?? Message-ID: <4enk62$8thovc@mxip28a.cluster1.charter.net> Hello, I've got GFS set up (almost) perfectly after a few days of following several HOWTOs and the RedHat manual. However, after a reboot, gnbd_serv isn't loaded, nor does it ever get loaded until I do it manually after boot (after disabling fenced so the system will boot because fenced waits forever while trying to communicate with the non-existant gnbd_serv). So, the first issue is: why doesn't Fedora provide a way to load gnbd_serv and the module gnbd and gnbd_client on boot (e.g. /etc/init.d/gnbd)? The second issue is: the devices that I export using gnbd_export aren't remembered between boots. Each time a reboot, I need to re-export like this: gnbd_export -d /dev/VolGroup00/LogVolStorage -e server_storage I did quite a bit of googling on this and found that Gentoo handles this all by providing a /etc/gnbdtab, /etc/init.d/gndb_serv, and /etc/init.d/gnbd_client. The gndb exports & imports are stored in the first file. So what's going on? Do I need to copy Gentoo's way of doing it, or is there a Fedora way that didn't get installed for some reason? Joshua Mouch -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdake at mvista.com Mon Oct 3 20:12:59 2005 From: sdake at mvista.com (Steven Dake) Date: Mon, 03 Oct 2005 13:12:59 -0700 Subject: [Openais] Re: [Linux-cluster] new userland cman In-Reply-To: <1128360058.27430.99.camel@ayanami.boston.redhat.com> References: <433D4134.6080608@redhat.com> <1128109200.8440.14.camel@unnamed.az.mvista.com> <4340D983.7080106@redhat.com> <1128360058.27430.99.camel@ayanami.boston.redhat.com> Message-ID: <1128370379.30850.3.camel@unnamed.az.mvista.com> On Mon, 2005-10-03 at 13:20 -0400, Lon Hohberger wrote: > On Mon, 2005-10-03 at 08:10 +0100, Patrick Caulfield wrote: > > > >>neutral > > >>------- > > >>- Always uses multicast (no broadcast). A default multicast address is supplied > > >>if none is given > > > > > > > > > If broadcast is important, which I guess it may be, we can pretty easily > > > add this support... 
> > > > > > > I was going to look into this but I doubt its really worth it. It's just any > > extra complication and will only apply to IPv4 anyway. > > I think broadcast is quite important, actually - although I also think > that it should *not* be the default. > > Multicast doesn't always work very well (in practice) on existing > networks, and works poorly (if at all) over things like crossover > ethernet cables and hub-based private networks. You know, the cheap > stuff hackers use in their houses to play with cluster software ;) > I have tested the multicast with both crossover point to point as well as hub networks. Actually the way the protocol works, switches are not even necessary. There are very few (less then 1%) link collisions with a hub network even at 90% network load. > Broadcast is far more likely to work out of the box in the above cases, > and isn't hard to implement (... actually, it's easier than multicast). > Adding this should just be a few lines of code. I'll see if I can work out a patch today. Regards -steve > Also, IPv6 isn't what I'd call "mainstream" just yet, so supporting all > the hacks we can with IPv4 isn't necessarily a bad thing ;) > > -- Lon > From tom-fedora at kofler.eu.org Fri Oct 7 18:30:16 2005 From: tom-fedora at kofler.eu.org (Thomas Kofler) Date: Fri, 7 Oct 2005 20:30:16 +0200 Subject: [Linux-cluster] Additional node "Cluster membership rejected" Message-ID: <1128709816.4346beb85a069@mail.devcon.cc> Hi, we are running a 4 node cluster successfully. Now we try to join an additional node - but it fails. We upgraded the cluster.conf file to reflect the new node. [root at www5 ~]# ccs_tool update /etc/cluster/cluster.conf Config file updated from version 10 to 11 Update complete. cluster.conf was checked and is synchron on all nodes, hosts files are also fine. When we try to join the new node gfsserver2 [root at gfsserver2 cluster]# cman_tool join we get [root at gfsserver2 cluster]# CMAN: Cluster membership rejected And the interesting part is: Oct 7 13:51:59 gfsserver ccsd[415]: Update of cluster.conf complete (version 10 -> 11). Oct 7 13:52:00 gfsserver kernel: CMAN: Join request from gfsserver2.devcon.cc rejected, config version local 10 remote 11 Why do the 4 existing nodes not check, that they also have version 11 in use ? Or do we have to "reload" anything additionally to the ccs_tool update command ? Thanks in advance, Regards, Thomas Oct 7 13:51:34 gfsserver2 kernel: GFS 2.6.11.8-20050601.152643.FC4.9 (built Jul 18 2005 10:42:24) installed Oct 7 13:51:39 gfsserver2 kernel: CMAN 2.6.11.5-20050601.152643.FC4.9 (built Jul 18 2005 10:27:35) installed Oct 7 13:51:39 gfsserver2 kernel: NET: Registered protocol family 30 Oct 7 13:51:39 gfsserver2 kernel: DLM 2.6.11.5-20050601.152643.FC4.10 (built Jul 18 2005 10:34:42) installed Oct 7 13:51:39 gfsserver2 kernel: Lock_DLM (built Jul 18 2005 10:42:18) installed Oct 7 13:51:49 gfsserver2 ccsd[839]: Starting ccsd 1.0.0: Oct 7 13:51:49 gfsserver2 ccsd[839]: Built: Jun 16 2005 10:45:39 Oct 7 13:51:49 gfsserver2 ccsd[839]: Copyright (C) Red Hat, Inc. 2004 All rights reserved. Oct 7 13:51:49 gfsserver2 ccsd[839]: IP Protocol:: IPv4 only Oct 7 13:51:59 gfsserver2 ccsd[839]: cluster.conf (cluster name = devconcluster, version = 11) found. Oct 7 13:51:59 gfsserver2 ccsd[839]: Remote copy of cluster.conf is from quorate node. 
Oct 7 13:51:59 gfsserver2 ccsd[839]: Local version # : 11 Oct 7 13:51:59 gfsserver2 ccsd[839]: Remote version #: 11 Oct 7 13:51:59 gfsserver2 kernel: CMAN: Waiting to join or form a Linux- cluster Oct 7 13:51:59 gfsserver2 ccsd[839]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.2 Oct 7 13:51:59 gfsserver2 ccsd[839]: Initial status:: Inquorate Oct 7 13:52:00 gfsserver2 kernel: CMAN: sending membership request Oct 7 13:52:00 gfsserver2 kernel: CMAN: Cluster membership rejected Oct 7 13:52:00 gfsserver2 ccsd[839]: Cluster manager shutdown. Attemping to reconnect... Oct 7 13:52:20 gfsserver2 ccsd[839]: Unable to connect to cluster infrastructure after 30 seconds. Oct 7 13:52:50 gfsserver2 ccsd[839]: Unable to connect to cluster infrastructure after 60 seconds. Oct 7 13:53:20 gfsserver2 ccsd[839]: Unable to connect to cluster infrastructure after 90 seconds. Oct 7 13:53:51 gfsserver2 ccsd[839]: Unable to connect to cluster infrastructure after 120 seconds. Oct 7 13:53:58 gfsserver2 ccsd[839]: Remote copy of cluster.conf is from quorate node. From tom-fedora at kofler.eu.org Sat Oct 8 07:38:15 2005 From: tom-fedora at kofler.eu.org (Thomas Kofler) Date: Sat, 8 Oct 2005 09:38:15 +0200 Subject: [Linux-cluster] Additional node "Cluster membership rejected" Message-ID: <1128757095.43477767a7bcc@mail.devcon.cc> Hi, we are running a 4 node cluster successfully. Now we try to join an additional node - but it fails. We upgraded the cluster.conf file to reflect the new node. [root at www5 ~]# ccs_tool update /etc/cluster/cluster.conf Config file updated from version 10 to 11 Update complete. cluster.conf was checked and is synchron on all nodes, hosts files are also fine. When we try to join the new node gfsserver2 [root at gfsserver2 cluster]# cman_tool join we get [root at gfsserver2 cluster]# CMAN: Cluster membership rejected And the interesting part is: Oct 7 13:51:59 gfsserver ccsd[415]: Update of cluster.conf complete (version 10 -> 11). Oct 7 13:52:00 gfsserver kernel: CMAN: Join request from gfsserver2.devcon.cc rejected, config version local 10 remote 11 Why do the 4 existing nodes not check, that they also have version 11 in use ? Or do we have to "reload" anything additionally to the ccs_tool update command ? Thanks in advance, Regards, Thomas Oct 7 13:51:34 gfsserver2 kernel: GFS 2.6.11.8-20050601.152643.FC4.9 (built Jul 18 2005 10:42:24) installed Oct 7 13:51:39 gfsserver2 kernel: CMAN 2.6.11.5-20050601.152643.FC4.9 (built Jul 18 2005 10:27:35) installed Oct 7 13:51:39 gfsserver2 kernel: NET: Registered protocol family 30 Oct 7 13:51:39 gfsserver2 kernel: DLM 2.6.11.5-20050601.152643.FC4.10 (built Jul 18 2005 10:34:42) installed Oct 7 13:51:39 gfsserver2 kernel: Lock_DLM (built Jul 18 2005 10:42:18) installed Oct 7 13:51:49 gfsserver2 ccsd[839]: Starting ccsd 1.0.0: Oct 7 13:51:49 gfsserver2 ccsd[839]: Built: Jun 16 2005 10:45:39 Oct 7 13:51:49 gfsserver2 ccsd[839]: Copyright (C) Red Hat, Inc. 2004 All rights reserved. Oct 7 13:51:49 gfsserver2 ccsd[839]: IP Protocol:: IPv4 only Oct 7 13:51:59 gfsserver2 ccsd[839]: cluster.conf (cluster name = devconcluster, version = 11) found. Oct 7 13:51:59 gfsserver2 ccsd[839]: Remote copy of cluster.conf is from quorate node. 
Oct 7 13:51:59 gfsserver2 ccsd[839]: Local version # : 11 Oct 7 13:51:59 gfsserver2 ccsd[839]: Remote version #: 11 Oct 7 13:51:59 gfsserver2 kernel: CMAN: Waiting to join or form a Linux- cluster Oct 7 13:51:59 gfsserver2 ccsd[839]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.2 Oct 7 13:51:59 gfsserver2 ccsd[839]: Initial status:: Inquorate Oct 7 13:52:00 gfsserver2 kernel: CMAN: sending membership request Oct 7 13:52:00 gfsserver2 kernel: CMAN: Cluster membership rejected Oct 7 13:52:00 gfsserver2 ccsd[839]: Cluster manager shutdown. Attemping to reconnect... Oct 7 13:52:20 gfsserver2 ccsd[839]: Unable to connect to cluster infrastructure after 30 seconds. Oct 7 13:52:50 gfsserver2 ccsd[839]: Unable to connect to cluster infrastructure after 60 seconds. Oct 7 13:53:20 gfsserver2 ccsd[839]: Unable to connect to cluster infrastructure after 90 seconds. Oct 7 13:53:51 gfsserver2 ccsd[839]: Unable to connect to cluster infrastructure after 120 seconds. Oct 7 13:53:58 gfsserver2 ccsd[839]: Remote copy of cluster.conf is from quorate node. From Bowie_Bailey at BUC.com Fri Oct 14 15:56:12 2005 From: Bowie_Bailey at BUC.com (Bowie Bailey) Date: Fri, 14 Oct 2005 11:56:12 -0400 Subject: [Linux-cluster] Fencing? Message-ID: <4766EEE585A6D311ADF500E018C154E30213325C@bnifex.cis.buc.com> I'm a bit unclear on the concept of fencing. Can anyone point me to a good overview of what it does and how it works? Bowie From adam at popik.pl Fri Oct 14 13:19:33 2005 From: adam at popik.pl (Adam Popik) Date: Fri, 14 Oct 2005 15:19:33 +0200 Subject: [Linux-cluster] iscsi and RHGFS for RHEL4 Message-ID: <434FB065.5050109@popik.pl> Hi, I have questions about rhgfs for rhel4 in documentation is : "... multipath gnbd and iSCSI are not available with this release ..." what that mean : gfs not supported with iscsi or not supported on gnbd with iscsi ? PS Sorry for broken English.. Adam -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 4179 bytes Desc: S/MIME Cryptographic Signature URL: From pcaulfie at redhat.com Fri Oct 14 07:09:56 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Fri, 14 Oct 2005 08:09:56 +0100 Subject: [Linux-cluster] GFS + DLM howto? In-Reply-To: <4766EEE585A6D311ADF500E018C154E302133251@bnifex.cis.buc.com> References: <4766EEE585A6D311ADF500E018C154E302133251@bnifex.cis.buc.com> Message-ID: <434F59C4.8060802@redhat.com> Bowie Bailey wrote: > I seem to be missing something. > > I have been able to configure cluster.conf and get ccsd and clvmd > running, but it fails when I try to initialize the physical volume. > > # pvcreate /dev/etherd/e1.0 > Device /dev/etherd/e1.0 not found. > > The device is definitely there and I can ping it with the aoeping > command. > > # ll /dev/etherd/ > total 4 > -rw-r--r-- 1 root root 1 Oct 13 14:33 discover > brw------- 1 root disk 152, 256 Oct 12 17:07 e1.0 > You might need to add the device to the 'devices' section of /etc/lvm/lvm.conf, eg: types = [ "etherd", 16 ] The name (I've used etherd here as a guess) is the device name that appears in /proc/partitions. The number (16) is the maximum number of partitions per device. -- patrick From alan.gagne at comcast.net Thu Oct 13 21:59:39 2005 From: alan.gagne at comcast.net (Alan Gagne) Date: Thu, 13 Oct 2005 17:59:39 -0400 Subject: [Linux-cluster] Oracle 10G-R2 on GFS install problems Message-ID: <003e01c5d041$69e23050$8432d70a@panhead> GFS is not a certified file system option for Oracle 10gR2 rac. 
There is some limited support for running on GFS though most of the information I have found is for 9i. You can set this up like I have currently. Place the Oracle clusterware voting and cluster registry files on raw devices. You can then create the database on gfs. Alan -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bowie_Bailey at BUC.com Fri Oct 14 17:39:19 2005 From: Bowie_Bailey at BUC.com (Bowie Bailey) Date: Fri, 14 Oct 2005 13:39:19 -0400 Subject: [Linux-cluster] GFS + DLM howto? Message-ID: <4766EEE585A6D311ADF500E018C154E302133260@bnifex.cis.buc.com> From: Patrick Caulfield [mailto:pcaulfie at redhat.com] > > Bowie Bailey wrote: > > I seem to be missing something. > > > > I have been able to configure cluster.conf and get ccsd and clvmd > > running, but it fails when I try to initialize the physical volume. > > > > # pvcreate /dev/etherd/e1.0 > > Device /dev/etherd/e1.0 not found. > > > > The device is definitely there and I can ping it with the aoeping > > command. > > > > # ll /dev/etherd/ > > total 4 > > -rw-r--r-- 1 root root 1 Oct 13 14:33 discover > > brw------- 1 root disk 152, 256 Oct 12 17:07 e1.0 > > > > You might need to add the device to the 'devices' section of > /etc/lvm/lvm.conf, eg: > > types = [ "etherd", 16 ] > > The name (I've used etherd here as a guess) is the device name that > appears in /proc/partitions. The number (16) is the maximum number > of partitions per device. I found the answer to this question on Coraid's site soon after I asked. I added this to the "devices" section of lvm.conf: types = [ "aoe", 16 ] After that, everything worked perfectly! I've now got an operational setup with a single node. Now I've just got to see if I can get the other nodes configured. Thanks for the help (and patience)! Bowie From kanderso at redhat.com Fri Oct 14 17:49:25 2005 From: kanderso at redhat.com (Kevin Anderson) Date: Fri, 14 Oct 2005 12:49:25 -0500 Subject: [Linux-cluster] iscsi and RHGFS for RHEL4 In-Reply-To: <434FB065.5050109@popik.pl> References: <434FB065.5050109@popik.pl> Message-ID: <1129312165.3526.52.camel@dhcp80-225.msp.redhat.com> On Fri, 2005-10-14 at 15:19 +0200, Adam Popik wrote: > Hi, > I have questions about rhgfs for rhel4 in documentation is : > "... multipath gnbd and iSCSI are not available with this release ..." > what that mean : > gfs not supported with iscsi or not supported on gnbd with iscsi ? RHEL4 didn't support iSCSI until the recent RHEL4 U2 release. GFS works fine with iSCSI, is supported and is used quite extensively by the development team. The multipath gnbd refers to the use of a gnbd device under the device mapper multipath module. The device mapper multipath code currently assumes that the devices are pure SCSI devices and submits a scsi command that GNBD currently doesn't provide. So, the GNBD devices aren't recognized by the multipath module as valid devices. So, two separate items, one outdated, the other in progress. Hope this helps Kevin From spwilcox at att.com Fri Oct 14 18:48:48 2005 From: spwilcox at att.com (Steve Wilcox) Date: Fri, 14 Oct 2005 14:48:48 -0400 Subject: [Linux-cluster] Oracle 10G-R2 on GFS install problems In-Reply-To: <003e01c5d041$69e23050$8432d70a@panhead> References: <003e01c5d041$69e23050$8432d70a@panhead> Message-ID: <1129315728.4443.10.camel@aptis101.cqtel.com> I was afraid of that. Interesting that Oracle would make such a change between R1 and R2, but I guess clusterware underwent a fairly extensive re-write. Thanks for the info. 
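For what it's worth, the raw-device part of that layout on RHEL4 is just a binding in /etc/sysconfig/rawdevices -- a sketch with placeholder partitions (which partition holds the OCR and which the voting disk, and the oracle ownership, are assumptions to adjust):

cat >> /etc/sysconfig/rawdevices <<'EOF'
/dev/raw/raw1 /dev/sdb1
/dev/raw/raw2 /dev/sdb2
EOF

service rawdevices restart
chown oracle:oinstall /dev/raw/raw1 /dev/raw/raw2

The datafiles themselves then live on the GFS mount as Alan describes.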
-steve On Thu, 2005-10-13 at 17:59 -0400, Alan Gagne wrote: > GFS is not a certified file system option for Oracle 10gR2 rac. > There is some limited support for running on GFS though most > of the information I have found is for 9i. You can set this up like > I have currently. Place the Oracle clusterware voting and cluster > registry files on raw devices. > You can then create the database on gfs. > > Alan > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From adam at popik.pl Fri Oct 14 18:59:53 2005 From: adam at popik.pl (Adam Popik) Date: Fri, 14 Oct 2005 20:59:53 +0200 Subject: [Linux-cluster] iscsi and RHGFS for RHEL4 In-Reply-To: <1129312165.3526.52.camel@dhcp80-225.msp.redhat.com> References: <434FB065.5050109@popik.pl> <1129312165.3526.52.camel@dhcp80-225.msp.redhat.com> Message-ID: <43500029.2030507@popik.pl> Kevin Anderson wrote: > On Fri, 2005-10-14 at 15:19 +0200, Adam Popik wrote: > >>Hi, >>I have questions about rhgfs for rhel4 in documentation is : >>"... multipath gnbd and iSCSI are not available with this release ..." >>what that mean : >>gfs not supported with iscsi or not supported on gnbd with iscsi ? > > > RHEL4 didn't support iSCSI until the recent RHEL4 U2 release. GFS works > fine with iSCSI, is supported and is used quite extensively by the > development team. > > The multipath gnbd refers to the use of a gnbd device under the device > mapper multipath module. The device mapper multipath code currently > assumes that the devices are pure SCSI devices and submits a scsi > command that GNBD currently doesn't provide. So, the GNBD devices > aren't recognized by the multipath module as valid devices. > > So, two separate items, one outdated, the other in progress. > > Hope this helps > Kevin > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster I now work with rhel3 with gfs and FC and that work fine, but new project have no a lot of money that maybe combination with iscsi should be a good way (gfs will use for home directories rhel's WS and use for working with fluent - big files). Thanks for help Adam -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 4179 bytes Desc: S/MIME Cryptographic Signature URL: From ciril at hclinsys.com Sat Oct 15 03:36:49 2005 From: ciril at hclinsys.com (CIRIL IGNATIOUS T) Date: Sat, 15 Oct 2005 09:06:49 +0530 Subject: [Linux-cluster] Active/Active oracle 10g database with Redhat Cluster Suite. Message-ID: <43507951.5060309@hclinsys.com> Dear All Is it possible to configure Active/Active Cluster of Oracle 10g Database with Redhat Cluster Suite. Please indicate if any useful links. Ciril From omer at faruk.net Mon Oct 17 06:16:42 2005 From: omer at faruk.net (Omer Faruk Sen) Date: Mon, 17 Oct 2005 09:16:42 +0300 (EEST) Subject: [Linux-cluster] Fencing? In-Reply-To: <4766EEE585A6D311ADF500E018C154E30213325C@bnifex.cis.buc.com> References: <4766EEE585A6D311ADF500E018C154E30213325C@bnifex.cis.buc.com> Message-ID: <53866.193.140.74.2.1129529802.squirrel@193.140.74.2> I don't know very much but what I understand from fencing is forcefully disabling a node that is not reachable by cluster to prevent this dead node accidently (maybe the node wasn't dead and will try to write something to shared storage which can cause catastrophic damage if GFS is not used) write something to file system. 
It does this using power switches or other methods such as IPMI or ILO .(I heard there was a new module for fencing that uses vmware ) Thus I think this fencing conecpt is the same as STONITH in linux-ha.org which means Shoot The Other Node In The Head(Heart).... If I am mistaken someone please correct me. > I'm a bit unclear on the concept of fencing. Can anyone point me to a > good > overview of what it does and how it works? > > Bowie > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Omer Faruk Sen http://www.faruk.net From lhh at redhat.com Mon Oct 17 22:24:24 2005 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 17 Oct 2005 18:24:24 -0400 Subject: [Linux-cluster] Fencing? In-Reply-To: <53866.193.140.74.2.1129529802.squirrel@193.140.74.2> References: <4766EEE585A6D311ADF500E018C154E30213325C@bnifex.cis.buc.com> <53866.193.140.74.2.1129529802.squirrel@193.140.74.2> Message-ID: <1129587864.10298.28.camel@ayanami.boston.redhat.com> On Mon, 2005-10-17 at 09:16 +0300, Omer Faruk Sen wrote: > (maybe the node wasn't dead and will try to write > something to shared storage which can cause catastrophic damage if GFS is > not used) write something to file system. Correct, except it causes catastrophic damage in any case, regardless of whether or not GFS is used. GFS requires fencing in order to operate. > It does this using power > switches or other methods such as IPMI or ILO .(I heard there was a new > module for fencing that uses vmware ) GFS can use fabric-level fencing - that is, you can tell the iSCSI server to cut a node off, or ask the fiber-channel switch to disable a port. This is in addition to "power-cycle" fencing. > Thus I think this fencing conecpt is the same as STONITH in linux-ha.org > which means Shoot The Other Node In The Head(Heart).... STONITH, STOMITH, etc. are indeed implementations of I/O fencing. Fencing is the act of forcefully preventing a node from being able to access resources after that node has been evicted from the cluster in an attempt to avoid corruption. The canonical example of when it is needed is the live-hang scenario, as you described: 1. node A hangs with I/Os pending to a shared file system 2. node B and node C decide that node A is dead and recover resources allocated on node A (including the shared file system) 3. node A resumes normal operation 4. node A completes I/Os to shared file system At this point, the shared file system is probably corrupt. If you're lucky, fsck will fix it -- if you're not, you'll need to restore from backup. I/O fencing (STONITH, or whatever we want to call it) prevents the last step (step 4) from happening. How fencing is done (power cycling via external switch, SCSI reservations, FC zoning, integrated methods like IPMI, iLO, manual intervention, etc.) is unimportant - so long as whatever method is used can guarantee that step 4 can not complete. -- Lon From dawson at fnal.gov Tue Oct 18 14:20:14 2005 From: dawson at fnal.gov (Troy Dawson) Date: Tue, 18 Oct 2005 09:20:14 -0500 Subject: [Linux-cluster] write's pausing - which tools to debug? Message-ID: <4355049E.4060606@fnal.gov> Hi, We've been having some problems with doing a write's to our GFS file system, and it will pause, for long periods. (Like from 5 to 10 seconds, to 30 seconds, and occasially 5 minutes) After the pause, it's like nothing happened, whatever the process is, just keeps going happy as can be. Except for these pauses, our GFS is quite zippy, both reads and writes. 
But these pauses are holding us back from going full production. I need to know what tools I should use to figure out what is causing these pauses. Here is the setup. ------------------- All machines: RHEL 4 update 1 (ok, actually S.L. 4.1), kernel 2.6.9-11.ELsmp, GFS 6.1.0, ccs 1.0.0, gulm 1.0.0, rgmanager 1.9.34 I have no ability to do fencing yet, so I chose to use the gulm locking mechanism. I have it setup so that there are 3 lock servers, for failover. I have tested the failover, and it works quite well. I have 5 machines in the cluster. 1 isn't connected to the SAN, or using GFS. It is just a failover gulm lock server incase the other two lock servers go down. So I have 4 machines connected to our SAN and using GFS. 3 are read-only, 1 is read-write. If it is important, the 3 read-only are x86_64, the 1 read-write and the 1 not connected are i386. The read/write machine is our master lock server. Then one of the read-only is a fallback lock server, as is the machine not using GFS. ---------------- Anyway, we're getting these pauses when writting, and I'm having a hard time tracking down where the problem is. I *think* that we can still read from the other machines. But since this comes and goes, I haven't been able to verify that. Anyway, which tools do you think would be best in diagnosing this? Many Thanks Troy Dawson -- __________________________________________________ Troy Dawson dawson at fnal.gov (630)840-6468 Fermilab ComputingDivision/CSS CSI Group __________________________________________________ From haiwu.us at gmail.com Tue Oct 18 23:30:26 2005 From: haiwu.us at gmail.com (hai wu) Date: Tue, 18 Oct 2005 18:30:26 -0500 Subject: [Linux-cluster] GFS cluster and Dell DRAC Message-ID: We are mainly using Dell PowerEdge servers. I know Dell DRAC port was mentioned in GFS document. But I don't know how Dell DRAC would be configured in order to get it to work for GFS (power reset). Can someone explain about its usage for GFS? Thanks, Hai -------------- next part -------------- An HTML attachment was scrubbed... URL: From suran007 at coolgames.com.cn Wed Oct 19 08:09:41 2005 From: suran007 at coolgames.com.cn (=?gb2312?B?c3VyYW4wMDc=?=) Date: Wed, 19 Oct 2005 16:09:41 +0800 Subject: [Linux-cluster] GFS mount hang Message-ID: <20051019080941.4961.qmail@mail.test.com> my system is redhat AS 3 UPDATE3,the kernel is linux-2.4.21-27.0.4.Elsmp,the gfs version is GFS-6.0.2-26,our gfs-cluster operation serveral days well,(suddenly),the cluster is dead,our gfs-cluster is one gnbd server and six gfs nodes,currently,my problem is gfs services (such as ccsd,lock_gulmd) of our six gfs nodes is already started well,but only node02,05,06 can mount the gfs pool,when the node03,04 mount the gfs pool will hang ,I use the gulm_tool nodelist node01 to see the lock stat,the result is well ,I can\'t see any problem . who can help me ,my msn is suran007 at hotmail.com, I hope someone can help me ,thanks~~ ---- iGENUS is a free webmail interface, No fee, Free download --------------------------------------------------------- please visit http://www.igenus.org -------------- next part -------------- An HTML attachment was scrubbed... 
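For pinning down where a pause like Troy's is spent, a handful of stock tools run side by side during a stall will usually say whether it is disk, network or locking -- a sketch, with only the gulm_tool call taken from this thread and the pid/node names as placeholders:

# on the node doing the writes, in separate terminals:
vmstat 1                       # blocked processes and iowait during the stall
iostat -x 2                    # per-device service times on the SAN LUNs (sysstat package)
strace -tt -p <writer-pid>     # shows exactly which syscall hangs, and for how long

# on the lock-server side:
gulm_tool nodelist <master-node>
tail -f /var/log/messages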
URL: From Axel.Thimm at ATrpms.net Wed Oct 19 10:48:16 2005 From: Axel.Thimm at ATrpms.net (Axel Thimm) Date: Wed, 19 Oct 2005 12:48:16 +0200 Subject: [Linux-cluster] Re: write's pausing - which tools to debug? In-Reply-To: <4355049E.4060606@fnal.gov> References: <4355049E.4060606@fnal.gov> Message-ID: <20051019104816.GD31027@neu.nirvana> Hi, On Tue, Oct 18, 2005 at 09:20:14AM -0500, Troy Dawson wrote: > We've been having some problems with doing a write's to our GFS file > system, and it will pause, for long periods. (Like from 5 to 10 > seconds, to 30 seconds, and occasially 5 minutes) After the pause, it's > like nothing happened, whatever the process is, just keeps going happy > as can be. > Except for these pauses, our GFS is quite zippy, both reads and writes. 
> But these pauses are holding us back from going full production. > I need to know what tools I should use to figure out what is causing > these pauses. > > Here is the setup. > All machines: RHEL 4 update 1 (ok, actually S.L. 4.1), kernel > 2.6.9-11.ELsmp, GFS 6.1.0, ccs 1.0.0, gulm 1.0.0, rgmanager 1.9.34 > > I have no ability to do fencing yet, so I chose to use the gulm locking > mechanism. I have it setup so that there are 3 lock servers, for > failover. I have tested the failover, and it works quite well. If this is a testing environment use manual fencing. E.g. if a node needs to get fenced you get a log message saying that you should do that and acknowledge that. > I have 5 machines in the cluster. 1 isn't connected to the SAN, or > using GFS. It is just a failover gulm lock server incase the other two > lock servers go down. > > So I have 4 machines connected to our SAN and using GFS. 3 are > read-only, 1 is read-write. If it is important, the 3 read-only are > x86_64, the 1 read-write and the 1 not connected are i386. > > The read/write machine is our master lock server. Then one of the > read-only is a fallback lock server, as is the machine not using GFS. > > Anyway, we're getting these pauses when writting, and I'm having a hard > time tracking down where the problem is. I *think* that we can still > read from the other machines. But since this comes and goes, I haven't > been able to verify that. What SAN hardware is attached to the nodes? > Anyway, which tools do you think would be best in diagnosing this? I'd suggest to check/monitor networking. Also place the cluster communication on a separate network that the SAN/LAN network. The cluster heartbeat goes over UDP and a congested network may delay these packages or drop the completely. At least that's the CMAN picture, lock_gulm may be different. Also don't mix RHELU1 and U2 or FC. Just in case you'd like to upgrade to SL4.2 one by one. There have been many changes/bug fixes to the cluster bits in RHELU2, and there are also some new spiffy features like multipath. Perhaps it's worth rebasing your testing environment? -- Axel.Thimm at ATrpms.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From teigland at redhat.com Wed Oct 19 15:10:45 2005 From: teigland at redhat.com (David Teigland) Date: Wed, 19 Oct 2005 10:10:45 -0500 Subject: [Linux-cluster] cluster-1.01.00 Message-ID: <20051019151045.GA3975@redhat.com> A new source tarball from the STABLE branch has been released; it builds and runs on 2.6.13: ftp://sources.redhat.com/pub/cluster/releases/cluster-1.01.00.tar.gz Version 1.01.00 - 5 October 2005 ================================ cman-kernel: SM should wait for all recoveries to complete before it processes any group joins/leaves. bz#162014 cman-kernel: Fix barriers. cman-kernel: Fix off-by-one error in find_node_by_nodeid() that can cause an oops in some odd circumstances. dlm-kernel: Don't increment the DLM reference count when connecting to an already extant lockspace. bz#157295 dlm-kernel: Fix refcounting that could cause a memory leak. dlm-kernel: Return locking errors correctly. bz#154990 dlm-kernel: Don't free the lockinfo block if the LKB still exists. bz#161146 cman: "cman_tool join" can now set /proc/cluster/conf/cman values from CCS lock_dlm: The first mounter shouldn't let others mount until others_may_mount() has been called. 
bz#161808 gfs-kernel: If it took too long to sync the dependent inodes back to disk, resource group descriptor could get corrupted. bz#164324 gfs-kernel: It is now possible to toggle acls on and off with -o remount. Also, acls are only displayed when they are enabled. gfs-kernel: No longer check permissions before truncating a file in gfs_setattr. bz#169039 gfs-kernel: Fix oops when copying suid root file to gfs. gfs-kernel: changes to work on 2.6.13 gfs_fsck: Some variables weren't getting initialized properly in pass1b, causing hangs (or segfaults) when duplicate blocks were present. bz#162709 fence: Add support for Dell PowerEdge 1855 to fence_drac. bz#150563 fence: Add support for latest ilo firmware version (1.75). Changes were also added to make sure that power status of the machine is being properlly checked after power change commands have been issued. bz#161352 fence: fence_ipmilan default operation should be reboot. bz#164627 fence: fence_wti default operation should be reboot. bz#162805 ccs: Increase daemon performance by adding local socket communications. rgmanager: Fix ip bugs. bz#157327, bz#163651, bz#166526 rgmanager: Fix hang when specifying nonexistent services. bz#159767 rgmanager: Fix service tree handling. bz#162824, bz#162936 Dave From Bill.Scherer at VerizonWireless.com Wed Oct 19 15:22:14 2005 From: Bill.Scherer at VerizonWireless.com (Bill Scherer) Date: Wed, 19 Oct 2005 11:22:14 -0400 Subject: [Linux-cluster] bladecenter fencing... Message-ID: <435664A6.4090002@VerizonWireless.com> Hello - I have 24 blades in two bladecenters, all running RHEL4. I have successfully configured a cluster with GFS on four nodes in one bladecenter. There appears to be no way to create a cluster composed of blades in different bladecenters because the fence agent setup has no facility to handle multiple bladecenter management modules. Or am I missing something? TIA, Bill Scherer From david.chappel at mindbank.com Fri Oct 14 15:31:07 2005 From: david.chappel at mindbank.com (David A. Chappel) Date: Fri, 14 Oct 2005 09:31:07 -0600 Subject: [Linux-cluster] mounts not spanning Message-ID: <1129303867.4838.30.camel@localhost.localdomain> Hi there clusterites... Anyone have a cluestick? I have created a wee "cluster" of two machines. They seem to be happy in every way, except that when I mount the gfs volumes on each machine, the mounts do not span across the two nodes, but act as a traditional node. In other words, I can echo "haha" > /mnt/shareMe/haha.txt on one machine but it doesn't show up on the other. Vice versa too. I use: mount -t gfs /dev/shareMeVG/shareMeLV /mnt/shareMe I've tried the -o ignore_local_fs option without success. Also, is there a quick/standard way for non-cluster kernel machines to mount the "partition" remotely? 
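A quick way to check whether the filesystem above was actually created for cluster-wide locking, and how a machine outside the cluster can read it. The device, mount point and cluster name ("clusta", from the status output further down) are taken from the posts; the rest is a sketch to be checked against gfs_mkfs(8) and the GFS mount options on the installed release:

# The superblock records the lock protocol and lock table the filesystem was
# made with. "lock_nolock", or a table that doesn't match the cluster name,
# gives exactly this "each node sees only its own writes" behaviour.
gfs_tool sb /dev/shareMeVG/shareMeLV proto
gfs_tool sb /dev/shareMeVG/shareMeLV table

# If it was made without a cluster lock table it has to be recreated
# (this destroys existing data), e.g. for a two-node DLM cluster named "clusta":
gfs_mkfs -p lock_dlm -t clusta:shareMe -j 2 /dev/shareMeVG/shareMeLV

# For a non-cluster machine, the usual route is a local single-node mount --
# safe only while no cluster node has the filesystem mounted:
mount -t gfs -o lockproto=lock_nolock /dev/shareMeVG/shareMeLV /mnt/shareMe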
Cheers, -D [root at JavaTheHut ~]# cat /proc/cluster/status Protocol version: 5.0.1 Config version: 1 Cluster name: clusta Cluster ID: 6621 Cluster Member: Yes Membership state: Cluster-Member Nodes: 2 Expected_votes: 1 Total_votes: 2 Quorum: 1 Active subsystems: 6 Node name: JavaTheHut.mindbankts.com Node addresses: 10.1.1.22 [root at marvin ~]# cat /etc/cluster/cluster.conf From pcaulfie at redhat.com Fri Oct 14 07:07:01 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Fri, 14 Oct 2005 08:07:01 +0100 Subject: [Linux-cluster] Additional node "Cluster membership rejected" In-Reply-To: <1128709816.4346beb85a069@mail.devcon.cc> References: <1128709816.4346beb85a069@mail.devcon.cc> Message-ID: <434F5915.7010904@redhat.com> Thomas Kofler wrote: > Hi, > > we are running a 4 node cluster successfully. Now we try to join an additional > node - but it fails. > > We upgraded the cluster.conf file to reflect the new node. > > [root at www5 ~]# ccs_tool update /etc/cluster/cluster.conf > Config file updated from version 10 to 11 > Update complete. > > cluster.conf was checked and is synchron on all nodes, hosts files are also > fine. > > When we try to join the new node gfsserver2 > [root at gfsserver2 cluster]# cman_tool join > > we get > > [root at gfsserver2 cluster]# CMAN: Cluster membership rejected > > And the interesting part is: > > > Oct 7 13:51:59 gfsserver ccsd[415]: Update of cluster.conf complete (version > 10 -> 11). > Oct 7 13:52:00 gfsserver kernel: CMAN: Join request from gfsserver2.devcon.cc > rejected, config version local 10 remote 11 > > Why do the 4 existing nodes not check, that they also have version 11 in use ? > > Or do we have to "reload" anything additionally to the ccs_tool update > command ? You'll need to run cman_tool version -r 11 on one node in the cluster if you update the CCS file. -- patrick From tom-fedora at kofler.eu.org Fri Oct 14 06:31:09 2005 From: tom-fedora at kofler.eu.org (Thomas Kofler) Date: Fri, 14 Oct 2005 08:31:09 +0200 Subject: [Linux-cluster] Additional node "Cluster membership rejected" In-Reply-To: <1128709816.4346beb85a069@mail.devcon.cc> References: <1128709816.4346beb85a069@mail.devcon.cc> Message-ID: <1129271469.434f50adeead6@mail.devcon.cc> Hm, the list delayed my email for nearly a week, but nevertheless - found the solution: You have to tell cman that a new config version is available: cman_tool version -r 11 Thomas From vojtech.moravek at cz.ibm.com Wed Oct 19 07:48:11 2005 From: vojtech.moravek at cz.ibm.com (Vojtech Moravek) Date: Wed, 19 Oct 2005 09:48:11 +0200 Subject: [Linux-cluster] Performance Problem-GFS 6.1 u2 - LockGulm Message-ID: Hi All, I am testing an HA Samba cluster with one load balancer, two Samba servers (GFS clients), a gfs-server (lock server), and storage connected by FC to the Samba servers; see the picture below. [diagram: load balancer -> ethernet network -> samba1 and samba2; samba1, samba2 and the gfs-server on an internal gfs network; samba1 and samba2 attached to the storage over Fibre Channel] Everything works perfectly, but only for approximately 30-40 minutes under client load. After that time the gfs mount points slow down rapidly :( When I try browsing the directory structure on the servers, all operations like chdir and readdir are very, very slow. But all system resources look ok.. RAM is ok, CPU usage is ok, but traffic on the gfs network is growing. Has anyone run into a problem like this?
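One low-impact way to tell whether the slowdown is lock build-up rather than CPU, RAM or disk is to watch the GFS lock counters over time; a minimal sketch, using the mount points from the fstab later in this message (the interval and log path are arbitrary):

# Dump the glock/lock-module counters for a mounted GFS filesystem. Counts
# that only ever grow while Samba walks lots of files point at lock traffic
# to the gulm server rather than at the storage itself.
gfs_tool counters /vg_pole1/home
gfs_tool counters /vg_pole1/profiles

# Log a sample every 30 seconds so the trend over the first 30-40 minutes
# of client work becomes visible:
while true; do date; gfs_tool counters /vg_pole1/home; sleep 30; done >> /tmp/gfs-counters.log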
And one more point: when I mount the gfs volumes and run the "df" command, the first output is very slow. Is this normal? My configuration files: -------------------------------------- cat /etc/cluster/cluster.conf ------------------------------------- cat /etc/fstab # This file is edited by fstab-sync - see 'man fstab-sync' for details LABEL=/1 / ext3 defaults 1 1 none /dev/pts devpts gid=5,mode=620 0 0 none /dev/shm tmpfs defaults 0 0 none /proc proc defaults 0 0 none /sys sysfs defaults 0 0 LABEL=/var1 /var ext3 defaults 1 2 LABEL=SWAP-sda3 swap swap defaults 0 0 /dev/vg_pole1/profiles_vg1 /vg_pole1/profiles/ gfs noatime 0 0 /dev/vg_pole1/home_vg1 /vg_pole1/home/ gfs noatime 0 0 /dev/vg_pole2/profiles_vg2 /vg_pole2/profiles/ gfs noatime 0 0 /dev/vg_pole2/home_vg2 /vg_pole2/home/ gfs noatime 0 0 Thanks for any help Vojtech Moravek vojtech.moravek at cz.ibm.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From brent at phys.ufl.edu Tue Oct 18 23:27:52 2005 From: brent at phys.ufl.edu (Brent A Nelson) Date: Tue, 18 Oct 2005 19:27:52 -0400 (EDT) Subject: [Linux-cluster] ddraid? Message-ID: There hasn't been mention of ddraid in a while and the CVS hasn't been updated in about 3 months, I believe. Has there been any further progress with it? What are the risks associated with using it in its current form? If the lack of bad block handling is the only real concern, would the risk be substantially mitigated if the underlying devices are raid1, raid1+0, or raid5? Has anyone tried it yet in a production environment? Any comments to share? We'd really like to use this cool little tool. Redundancy and a performance gain, even when across the net; what's not to like (except for the minor nuisance of 2^n+1 devices being required)? Thanks, Brent Nelson Director of Computing Dept. of Physics University of Florida From Stefan.Marx at SCHOBER.DE Fri Oct 14 05:54:51 2005 From: Stefan.Marx at SCHOBER.DE (Stefan Marx) Date: Fri, 14 Oct 2005 07:54:51 +0200 Subject: Antw: [Linux-cluster] Oracle 10G-R2 on GFS install problems Message-ID: Hi Marvin, OCFS2 is also not yet released for Oracle products, even though it comes from Oracle itself. GFS is certified for the 9.2 series, although you have to check whether you can use RHEL3 or RHEL4, depending on whether you need 32-bit or 64-bit support. Oracle is fairly explicit about which products, in which version, are supported on which operating system and version, and additionally on which hardware platform. And most of the time there is a good reason for that :-(. Of course these things also run on other operating systems, as long as the corresponding libraries and kernel versions fit, but then they are simply not supported. Ciao, Stefan >>>spwilcox at att.com 10/13/05 8:26 pm >>> In the process of installing Oracle 10G-R2 on a RHEL4-U2 x86_64 cluster with GFS 6.1.2, I get the following error when running Oracle's root.sh for cluster ready services (a.k.a clusterware): [ OCROSD][4142143168]utstoragetype: /u00/app/ocr0 is on FS type 18225520. Not supported. I did a little poking around and found that OCFS2 has the same issue, but with OCFS2 it can be circumvented by mounting with -o datavolume... I was unable to find any similar options for GFS mounts. This looks like probably more of an Oracle bug, as 10G-R1 installed without any problems (I have my DBA pursuing the Oracle route), but I was wondering if anyone else has come across this problem and if so, was there any fix?
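For reference, the number in that error is just the statfs(2) filesystem type: 18225520 decimal is 0x01161970, the GFS on-disk magic, so Oracle's check is rejecting an f_type it does not recognise rather than anything broken in the mount. A quick way to see exactly what the installer sees (the path is the OCR location from the post):

# GNU coreutils stat: print the filesystem type of the OCR directory in hex
# and human-readable form, i.e. the value root.sh gets back from statfs().
stat -f -c 'f_type=%t (%T)' /u00/app/ocr0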
Thanks, -steve -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From Axel.Thimm at ATrpms.net Fri Oct 14 00:21:24 2005 From: Axel.Thimm at ATrpms.net (Axel Thimm) Date: Fri, 14 Oct 2005 02:21:24 +0200 Subject: [Linux-cluster] Re: Additional node "Cluster membership rejected" In-Reply-To: <1128757095.43477767a7bcc@mail.devcon.cc> References: <1128757095.43477767a7bcc@mail.devcon.cc> Message-ID: <20051014002124.GB4695@neu.nirvana> On Sat, Oct 08, 2005 at 09:38:15AM +0200, Thomas Kofler wrote: > Hi, > > we are running a 4 node cluster successfully. Now we try to join an additional > node - but it fails. > > We upgraded the cluster.conf file to reflect the new node. > > [root at www5 ~]# ccs_tool update /etc/cluster/cluster.conf > Config file updated from version 10 to 11 > Update complete. > > cluster.conf was checked and is synchron on all nodes, hosts files are also > fine. Now you need cman_tool version -r 11 Check out the man page for ccs_tool under "update" > When we try to join the new node gfsserver2 > [root at gfsserver2 cluster]# cman_tool join > > we get > > [root at gfsserver2 cluster]# CMAN: Cluster membership rejected > > And the interesting part is: > > > Oct 7 13:51:59 gfsserver ccsd[415]: Update of cluster.conf complete (version > 10 -> 11). > Oct 7 13:52:00 gfsserver kernel: CMAN: Join request from gfsserver2.devcon.cc > rejected, config version local 10 remote 11 > > Why do the 4 existing nodes not check, that they also have version 11 in use ? > > Or do we have to "reload" anything additionally to the ccs_tool update > command ? > > Thanks in advance, > Regards, > Thomas > > Oct 7 13:51:34 gfsserver2 kernel: GFS 2.6.11.8-20050601.152643.FC4.9 (built > Jul 18 2005 10:42:24) installed > Oct 7 13:51:39 gfsserver2 kernel: CMAN 2.6.11.5-20050601.152643.FC4.9 (built > Jul 18 2005 10:27:35) installed > Oct 7 13:51:39 gfsserver2 kernel: NET: Registered protocol family 30 > Oct 7 13:51:39 gfsserver2 kernel: DLM 2.6.11.5-20050601.152643.FC4.10 (built > Jul 18 2005 10:34:42) installed > Oct 7 13:51:39 gfsserver2 kernel: Lock_DLM (built Jul 18 2005 10:42:18) > installed > Oct 7 13:51:49 gfsserver2 ccsd[839]: Starting ccsd 1.0.0: > Oct 7 13:51:49 gfsserver2 ccsd[839]: Built: Jun 16 2005 10:45:39 > Oct 7 13:51:49 gfsserver2 ccsd[839]: Copyright (C) Red Hat, Inc. 2004 All > rights reserved. > Oct 7 13:51:49 gfsserver2 ccsd[839]: IP Protocol:: IPv4 only > Oct 7 13:51:59 gfsserver2 ccsd[839]: cluster.conf (cluster name = > devconcluster, version = 11) found. > Oct 7 13:51:59 gfsserver2 ccsd[839]: Remote copy of cluster.conf is from > quorate node. > Oct 7 13:51:59 gfsserver2 ccsd[839]: Local version # : 11 > Oct 7 13:51:59 gfsserver2 ccsd[839]: Remote version #: 11 > Oct 7 13:51:59 gfsserver2 kernel: CMAN: Waiting to join or form a Linux- > cluster > Oct 7 13:51:59 gfsserver2 ccsd[839]: Connected to cluster infrastruture via: > CMAN/SM Plugin v1.1.2 > Oct 7 13:51:59 gfsserver2 ccsd[839]: Initial status:: Inquorate > Oct 7 13:52:00 gfsserver2 kernel: CMAN: sending membership request > Oct 7 13:52:00 gfsserver2 kernel: CMAN: Cluster membership rejected > Oct 7 13:52:00 gfsserver2 ccsd[839]: Cluster manager shutdown. Attemping to > reconnect... > Oct 7 13:52:20 gfsserver2 ccsd[839]: Unable to connect to cluster > infrastructure after 30 seconds. > Oct 7 13:52:50 gfsserver2 ccsd[839]: Unable to connect to cluster > infrastructure after 60 seconds. 
> Oct 7 13:53:20 gfsserver2 ccsd[839]: Unable to connect to cluster > infrastructure after 90 seconds. > Oct 7 13:53:51 gfsserver2 ccsd[839]: Unable to connect to cluster > infrastructure after 120 seconds. > Oct 7 13:53:58 gfsserver2 ccsd[839]: Remote copy of cluster.conf is from > quorate node. -- Axel.Thimm at ATrpms.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From tspauld98 at yahoo.com Wed Oct 19 19:49:01 2005 From: tspauld98 at yahoo.com (Tim Spaulding) Date: Wed, 19 Oct 2005 12:49:01 -0700 (PDT) Subject: [Linux-cluster] New Cluster Installation Starts Partitioned Message-ID: <20051019194901.19010.qmail@web60516.mail.yahoo.com> Hi All, I have a couple of machines that I'm trying to cluster. The machines are freshly installed FC4 machines that have been fully updated and running the latest kernel. They are configured to use the lvm2 by default so lvm2 and dm was already installed. I'm following the directions in the usage.txt off RedHat's web site. I compile the cluster tarball, run depmod, and start ccsd without issue. When I do a cman_tool join -w on each node, both nodes start cman and join the cluster, but the cluster is apparently partitioned (i.e. they both see the cluster and are joined to it, but the two nodes cannot see that the other node is joined).
I've searched around and haven't found anything specific to this symptom. I have a feeling that it's something to do with my network configuration. Any help would be appreciated. Both machines are i686 archs with dual NICs. The NICs are connected to networks that do not route to each other. One network (eth0 on both machines) is a development network. The other network (eth1) is our corporate network. I'm trying to configure the cluster to use the dev network (eth0). Here's the output from uname: Linux ctclinux1.clam.com 2.6.13-1.1526_FC4 #1 Wed Sep 28 19:15:10 EDT 2005 i686 i686 i386 GNU/Linux Linux ctclinux2.clam.com 2.6.13-1.1526_FC4 #1 Wed Sep 28 19:15:10 EDT 2005 i686 i686 i386 GNU/Linux Here's the network configuration on ctclinux1: eth0 Link encap:Ethernet HWaddr 00:01:03:26:5C:C9 inet addr:192.168.36.200 Bcast:192.168.36.255 Mask:255.255.255.0 inet6 addr: fe80::201:3ff:fe26:5cc9/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:7260 errors:0 dropped:0 overruns:0 frame:0 TX packets:350 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:449183 (438.6 KiB) TX bytes:27853 (27.2 KiB) Interrupt:10 Base address:0xec00 eth1 Link encap:Ethernet HWaddr 00:B0:D0:41:0F:65 inet addr:10.10.10.200 Bcast:10.10.255.255 Mask:255.255.0.0 inet6 addr: fe80::2b0:d0ff:fe41:f65/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:57450 errors:0 dropped:0 overruns:1 frame:0 TX packets:12957 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:10040767 (9.5 MiB) TX bytes:1962029 (1.8 MiB) Interrupt:5 Base address:0xe880 eth1:1 Link encap:Ethernet HWaddr 00:B0:D0:41:0F:65 inet addr:10.10.10.204 Bcast:10.10.255.255 Mask:255.255.0.0 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 Interrupt:5 Base address:0xe880 lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 inet6 addr: ::1/128 Scope:Host UP LOOPBACK RUNNING MTU:16436 Metric:1 RX packets:17568 errors:0 dropped:0 overruns:0 frame:0 TX packets:17568 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:3692600 (3.5 MiB) TX bytes:3692600 (3.5 MiB) sit0 Link encap:IPv6-in-IPv4 NOARP MTU:1480 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) Kernel IP routing table Destination Gateway Genmask Flags Metric Ref Use Iface 192.168.36.0 * 255.255.255.0 U 0 0 0 eth0 10.74.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 10.72.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 10.75.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 10.73.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 10.10.0.0 * 255.255.0.0 U 0 0 0 eth1 169.254.0.0 * 255.255.0.0 U 0 0 0 eth1 default 10.10.1.1 0.0.0.0 UG 0 0 0 eth1 cat /etc/hosts 10.10.10.200 ctclinux1-svc 192.168.36.200 ctclinux1-cls 192.168.36.201 ctclinux2-cls 10.10.10.201 ctclinux2-svc Here's the network configuration on ctclinux2: ifconfig -a eth0 Link encap:Ethernet HWaddr 00:01:03:D4:80:7C inet addr:192.168.36.201 Bcast:192.168.36.255 Mask:255.255.255.0 inet6 addr: fe80::201:3ff:fed4:807c/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:7702 errors:0 dropped:0 overruns:1 frame:0 TX packets:282 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:477769 (466.5 KiB) TX bytes:22444 (21.9 KiB) Interrupt:10 Base address:0xec00 eth1 Link encap:Ethernet HWaddr 00:B0:D0:41:0F:9B inet addr:10.10.10.201 Bcast:10.10.255.255 Mask:255.255.0.0 inet6 
addr: fe80::2b0:d0ff:fe41:f9b/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:53846 errors:0 dropped:0 overruns:1 frame:0 TX packets:7759 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:5733713 (5.4 MiB) TX bytes:1155588 (1.1 MiB) Interrupt:5 Base address:0xe880 lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 inet6 addr: ::1/128 Scope:Host UP LOOPBACK RUNNING MTU:16436 Metric:1 RX packets:17912 errors:0 dropped:0 overruns:0 frame:0 TX packets:17912 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:3401868 (3.2 MiB) TX bytes:3401868 (3.2 MiB) sit0 Link encap:IPv6-in-IPv4 NOARP MTU:1480 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) route Kernel IP routing table Destination Gateway Genmask Flags Metric Ref Use Iface 192.168.36.0 * 255.255.255.0 U 0 0 0 eth0 10.74.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 10.72.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 10.75.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 10.73.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 10.10.0.0 * 255.255.0.0 U 0 0 0 eth1 169.254.0.0 * 255.255.0.0 U 0 0 0 eth1 default 10.10.1.1 0.0.0.0 UG 0 0 0 eth1 cat /etc/hosts 10.10.10.201 ctclinux2-svc 192.168.36.201 ctclinux2-cls 192.168.36.200 ctclinux1-cls 10.10.10.200 ctclinux1-svc Here's the cluster configuration file: Here's the cluster information from ctclinux1 after the cluster is started and joined: cman_tool -d join -w nodename ctclinux1.clam.com not found nodename ctclinux1 (truncated) not found nodename ctclinux1 doesn't match ctclinux1-cls (ctclinux1-cls in cluster.conf) nodename ctclinux1 doesn't match ctclinux2-cls (ctclinux2-cls in cluster.conf) nodename localhost (if lo) not found selected nodename ctclinux1-cls setup up interface for address: ctclinux1-cls Broadcast address for c824a8c0 is ff24a8c0 cman_tool status Protocol version: 5.0.1 Config version: 1 Cluster name: cl_tic Cluster ID: 6429 Cluster Member: Yes Membership state: Cluster-Member Nodes: 1 Expected_votes: 2 Total_votes: 1 Quorum: 2 Activity blocked Active subsystems: 0 Node name: ctclinux1-cls Node addresses: 192.168.36.200 cman_tool nodes Node Votes Exp Sts Name 1 1 2 M ctclinux1-cls Here's the cluster information from ctclinux2 after the cluster is started and joined: cman_tool -d join -w nodename ctclinux2.clam.com not found nodename ctclinux2 (truncated) not found nodename ctclinux2 doesn't match ctclinux1-cls (ctclinux1-cls in cluster.conf) nodename ctclinux2 doesn't match ctclinux2-cls (ctclinux2-cls in cluster.conf) nodename localhost (if lo) not found selected nodename ctclinux2-cls setup up interface for address: ctclinux2-cls Broadcast address for c924a8c0 is ff24a8c0 cman_tool status Protocol version: 5.0.1 Config version: 1 Cluster name: cl_tic Cluster ID: 6429 Cluster Member: Yes Membership state: Cluster-Member Nodes: 1 Expected_votes: 2 Total_votes: 1 Quorum: 2 Activity blocked Active subsystems: 0 Node name: ctclinux2-cls Node addresses: 192.168.36.201 cman_tool nodes Node Votes Exp Sts Name 1 1 2 M ctclinux2-cls Let me know if there is more information that I need to provide. As an aside, I've tried reducing the quorum count with no difference in behavior and I've tried using multicast which fails on the cman_tool join with an "Unknown Host" error. I'm open to any other suggestions. Thanks, tims __________________________________ Yahoo! 
Mail - PC Magazine Editors' Choice 2005 http://mail.yahoo.com From alexander_rau at yahoo.com Wed Oct 19 19:27:56 2005 From: alexander_rau at yahoo.com (Alexander Rau) Date: Wed, 19 Oct 2005 12:27:56 -0700 (PDT) Subject: [Linux-cluster] application monitoring - apache crash doesn't invoke failover Message-ID: <20051019192756.82030.qmail@web52101.mail.yahoo.com> We are trying to test the failover in a 2 cluster environment by killing apache. The service fails according to clustat, however the cluster mananger does not move the service from the failed node to the fail over node.... /var/log/messages shows the following output (on the node with the forced failure): Oct 19 16:34:59 armstrong clurgmgrd[4269]: status on script "httpd" returned 1 (generic error) Oct 19 16:34:59 armstrong clurgmgrd[4269]: Stopping service http Oct 19 16:34:59 armstrong httpd: httpd shutdown failed Oct 19 16:34:59 armstrong clurgmgrd[4269]: stop on script "httpd" returned 1 (generic error) Oct 19 16:34:59 armstrong clurgmgrd[4269]: #12: RG http failed to stop; intervention required Oct 19 16:34:59 armstrong clurgmgrd[4269]: Service http is failed Anybody any ideas? Thanks AR From eric at bootseg.com Wed Oct 19 20:41:29 2005 From: eric at bootseg.com (Eric Kerin) Date: Wed, 19 Oct 2005 16:41:29 -0400 Subject: [Linux-cluster] application monitoring - apache crash doesn't invoke failover In-Reply-To: <20051019192756.82030.qmail@web52101.mail.yahoo.com> References: <20051019192756.82030.qmail@web52101.mail.yahoo.com> Message-ID: <1129754489.3349.13.camel@auh5-0479.corp.jabil.org> See this bugzilla entry: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=151104 especially the attached patch. Basically RHEL4 (and RHEL3) don't (and at this point, can't) follow the LSB's standard return value for successful stop operations, which is that a stop operation of a service that isn't running should return 0 as it's errorlevel. Thanks, Eric Kerin eric at bootseg.com On Wed, 2005-10-19 at 12:27 -0700, Alexander Rau wrote: > We are trying to test the failover in a 2 cluster > environment by killing apache. > > The service fails according to clustat, however the > cluster mananger does not move the service from the > failed node to the fail over node.... > > /var/log/messages shows the following output (on the > node with the forced failure): > > Oct 19 16:34:59 armstrong clurgmgrd[4269]: > status on script "httpd" returned 1 (generic error) > Oct 19 16:34:59 armstrong clurgmgrd[4269]: > Stopping service http > Oct 19 16:34:59 armstrong httpd: httpd shutdown failed > Oct 19 16:34:59 armstrong clurgmgrd[4269]: > stop on script "httpd" returned 1 (generic error) > Oct 19 16:34:59 armstrong clurgmgrd[4269]: #12: > RG http failed to stop; intervention required > Oct 19 16:34:59 armstrong clurgmgrd[4269]: > Service http is failed > > Anybody any ideas? 
> > Thanks > > AR > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From lhh at redhat.com Wed Oct 19 22:25:21 2005 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 19 Oct 2005 18:25:21 -0400 Subject: [Linux-cluster] application monitoring - apache crash doesn't invoke failover In-Reply-To: <1129754489.3349.13.camel@auh5-0479.corp.jabil.org> References: <20051019192756.82030.qmail@web52101.mail.yahoo.com> <1129754489.3349.13.camel@auh5-0479.corp.jabil.org> Message-ID: <1129760721.25547.89.camel@ayanami.boston.redhat.com> On Wed, 2005-10-19 at 16:41 -0400, Eric Kerin wrote: > See this bugzilla entry: > > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=151104 especially > the attached patch. > > Basically RHEL4 (and RHEL3) don't (and at this point, can't) follow the > LSB's standard return value for successful stop operations, which is > that a stop operation of a service that isn't running should return 0 as > it's errorlevel. Correct. -- Lon From hlawatschek at atix.de Wed Oct 19 22:28:41 2005 From: hlawatschek at atix.de (Mark Hlawatschek) Date: Thu, 20 Oct 2005 00:28:41 +0200 Subject: [Linux-cluster] New Cluster Installation Starts Partitioned In-Reply-To: <20051019194901.19010.qmail@web60516.mail.yahoo.com> References: <20051019194901.19010.qmail@web60516.mail.yahoo.com> Message-ID: <1129760921.3471.6.camel@falballa.gallien.atix> Hi Tim, make sure that the cmans on both nodes can talk to each other. I observed this problem when iptables wasn't configured correctly. If you have an active iptables config shut it down and try again. Hope that helps ... Mark On Wed, 2005-10-19 at 12:49 -0700, Tim Spaulding wrote: > Hi All, > > I have a couple of machines that I'm trying to cluster. The machines are freshly installed FC4 > machines that have been fully updated and running the latest kernel. They are configured to use > the lvm2 by default so lvm2 and dm was already installed. I'm following the directions in the > usage.txt off RedHat's web site. I compile the cluster tarball, run depmod, and start ccsd > without issue. When I do a cman_tool join -w on each node, both nodes start cman and join the > cluster, but the cluster is apparently partitioned (i.e. they both see the cluster and are joined > to it, but the two nodes cannot see that the other node is joined). I've searched around and > haven't found anything specific to this symptom. I have a feeling that it's something to do with > my network configuration. Any help would be appreciated. > > Both machines are i686 archs with dual NICs. The NICs are connected to networks that do not route > to each other. One network (eth0 on both machines) is a development network. The other network > (eth1) is our corporate network. I'm trying to configure the cluster to use the dev network > (eth0). 
> > Here's the output from uname: > > Linux ctclinux1.clam.com 2.6.13-1.1526_FC4 #1 Wed Sep 28 19:15:10 EDT 2005 i686 i686 i386 > GNU/Linux > Linux ctclinux2.clam.com 2.6.13-1.1526_FC4 #1 Wed Sep 28 19:15:10 EDT 2005 i686 i686 i386 > GNU/Linux > > Here's the network configuration on ctclinux1: > > eth0 Link encap:Ethernet HWaddr 00:01:03:26:5C:C9 > inet addr:192.168.36.200 Bcast:192.168.36.255 Mask:255.255.255.0 > inet6 addr: fe80::201:3ff:fe26:5cc9/64 Scope:Link > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > RX packets:7260 errors:0 dropped:0 overruns:0 frame:0 > TX packets:350 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:1000 > RX bytes:449183 (438.6 KiB) TX bytes:27853 (27.2 KiB) > Interrupt:10 Base address:0xec00 > > eth1 Link encap:Ethernet HWaddr 00:B0:D0:41:0F:65 > inet addr:10.10.10.200 Bcast:10.10.255.255 Mask:255.255.0.0 > inet6 addr: fe80::2b0:d0ff:fe41:f65/64 Scope:Link > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > RX packets:57450 errors:0 dropped:0 overruns:1 frame:0 > TX packets:12957 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:1000 > RX bytes:10040767 (9.5 MiB) TX bytes:1962029 (1.8 MiB) > Interrupt:5 Base address:0xe880 > > eth1:1 Link encap:Ethernet HWaddr 00:B0:D0:41:0F:65 > inet addr:10.10.10.204 Bcast:10.10.255.255 Mask:255.255.0.0 > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > Interrupt:5 Base address:0xe880 > > lo Link encap:Local Loopback > inet addr:127.0.0.1 Mask:255.0.0.0 > inet6 addr: ::1/128 Scope:Host > UP LOOPBACK RUNNING MTU:16436 Metric:1 > RX packets:17568 errors:0 dropped:0 overruns:0 frame:0 > TX packets:17568 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:3692600 (3.5 MiB) TX bytes:3692600 (3.5 MiB) > > sit0 Link encap:IPv6-in-IPv4 > NOARP MTU:1480 Metric:1 > RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) > > Kernel IP routing table > Destination Gateway Genmask Flags Metric Ref Use Iface > 192.168.36.0 * 255.255.255.0 U 0 0 0 eth0 > 10.74.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 > 10.72.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 > 10.75.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 > 10.73.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 > 10.10.0.0 * 255.255.0.0 U 0 0 0 eth1 > 169.254.0.0 * 255.255.0.0 U 0 0 0 eth1 > default 10.10.1.1 0.0.0.0 UG 0 0 0 eth1 > > cat /etc/hosts > 10.10.10.200 ctclinux1-svc > 192.168.36.200 ctclinux1-cls > 192.168.36.201 ctclinux2-cls > 10.10.10.201 ctclinux2-svc > > Here's the network configuration on ctclinux2: > > ifconfig -a > eth0 Link encap:Ethernet HWaddr 00:01:03:D4:80:7C > inet addr:192.168.36.201 Bcast:192.168.36.255 Mask:255.255.255.0 > inet6 addr: fe80::201:3ff:fed4:807c/64 Scope:Link > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > RX packets:7702 errors:0 dropped:0 overruns:1 frame:0 > TX packets:282 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:1000 > RX bytes:477769 (466.5 KiB) TX bytes:22444 (21.9 KiB) > Interrupt:10 Base address:0xec00 > > eth1 Link encap:Ethernet HWaddr 00:B0:D0:41:0F:9B > inet addr:10.10.10.201 Bcast:10.10.255.255 Mask:255.255.0.0 > inet6 addr: fe80::2b0:d0ff:fe41:f9b/64 Scope:Link > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > RX packets:53846 errors:0 dropped:0 overruns:1 frame:0 > TX packets:7759 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:1000 > RX bytes:5733713 (5.4 MiB) TX bytes:1155588 (1.1 MiB) > 
Interrupt:5 Base address:0xe880 > > lo Link encap:Local Loopback > inet addr:127.0.0.1 Mask:255.0.0.0 > inet6 addr: ::1/128 Scope:Host > UP LOOPBACK RUNNING MTU:16436 Metric:1 > RX packets:17912 errors:0 dropped:0 overruns:0 frame:0 > TX packets:17912 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:3401868 (3.2 MiB) TX bytes:3401868 (3.2 MiB) > > sit0 Link encap:IPv6-in-IPv4 > NOARP MTU:1480 Metric:1 > RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) > > route > Kernel IP routing table > Destination Gateway Genmask Flags Metric Ref Use Iface > 192.168.36.0 * 255.255.255.0 U 0 0 0 eth0 > 10.74.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 > 10.72.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 > 10.75.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 > 10.73.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 > 10.10.0.0 * 255.255.0.0 U 0 0 0 eth1 > 169.254.0.0 * 255.255.0.0 U 0 0 0 eth1 > default 10.10.1.1 0.0.0.0 UG 0 0 0 eth1 > > cat /etc/hosts > 10.10.10.201 ctclinux2-svc > 192.168.36.201 ctclinux2-cls > 192.168.36.200 ctclinux1-cls > 10.10.10.200 ctclinux1-svc > > Here's the cluster configuration file: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Here's the cluster information from ctclinux1 after the cluster is started and joined: > > cman_tool -d join -w > nodename ctclinux1.clam.com not found > nodename ctclinux1 (truncated) not found > nodename ctclinux1 doesn't match ctclinux1-cls (ctclinux1-cls in cluster.conf) > nodename ctclinux1 doesn't match ctclinux2-cls (ctclinux2-cls in cluster.conf) > nodename localhost (if lo) not found > selected nodename ctclinux1-cls > setup up interface for address: ctclinux1-cls > Broadcast address for c824a8c0 is ff24a8c0 > > cman_tool status > Protocol version: 5.0.1 > Config version: 1 > Cluster name: cl_tic > Cluster ID: 6429 > Cluster Member: Yes > Membership state: Cluster-Member > Nodes: 1 > Expected_votes: 2 > Total_votes: 1 > Quorum: 2 Activity blocked > Active subsystems: 0 > Node name: ctclinux1-cls > Node addresses: 192.168.36.200 > > cman_tool nodes > Node Votes Exp Sts Name > 1 1 2 M ctclinux1-cls > > Here's the cluster information from ctclinux2 after the cluster is started and joined: > > cman_tool -d join -w > nodename ctclinux2.clam.com not found > nodename ctclinux2 (truncated) not found > nodename ctclinux2 doesn't match ctclinux1-cls (ctclinux1-cls in cluster.conf) > nodename ctclinux2 doesn't match ctclinux2-cls (ctclinux2-cls in cluster.conf) > nodename localhost (if lo) not found > selected nodename ctclinux2-cls > setup up interface for address: ctclinux2-cls > Broadcast address for c924a8c0 is ff24a8c0 > > cman_tool status > Protocol version: 5.0.1 > Config version: 1 > Cluster name: cl_tic > Cluster ID: 6429 > Cluster Member: Yes > Membership state: Cluster-Member > Nodes: 1 > Expected_votes: 2 > Total_votes: 1 > Quorum: 2 Activity blocked > Active subsystems: 0 > Node name: ctclinux2-cls > Node addresses: 192.168.36.201 > > cman_tool nodes > Node Votes Exp Sts Name > 1 1 2 M ctclinux2-cls > > Let me know if there is more information that I need to provide. As an aside, I've tried reducing > the quorum count with no difference in behavior and I've tried using multicast which fails on the > cman_tool join with an "Unknown Host" error. I'm open to any other suggestions. > > Thanks, > > tims > > > > > __________________________________ > Yahoo! 
Mail - PC Magazine Editors' Choice 2005 > http://mail.yahoo.com > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Mark Hlawatschek From alexander_rau at yahoo.com Thu Oct 20 00:11:21 2005 From: alexander_rau at yahoo.com (Alexander Rau) Date: Wed, 19 Oct 2005 17:11:21 -0700 (PDT) Subject: [Linux-cluster] mounting using label Message-ID: <20051020001121.10195.qmail@web52102.mail.yahoo.com> Hi: I am trying to mount a file system on the SAN by using the label rather than the device. When I specify "-L label" in either the device line or the Options line the cluster service fails to start. Just wondering if anybody has successfully used labels to mount a file system as a cluster service...? Thanks AR From erwan at seanodes.com Thu Oct 20 12:13:24 2005 From: erwan at seanodes.com (Velu Erwan) Date: Thu, 20 Oct 2005 14:13:24 +0200 Subject: [Linux-cluster] cluster-1.01.00 In-Reply-To: <20051019151045.GA3975@redhat.com> References: <20051019151045.GA3975@redhat.com> Message-ID: <435789E4.70905@seanodes.com> David Teigland wrote: >A new source tarball from the STABLE branch has been released; it builds >and runs on 2.6.13: > > ftp://sources.redhat.com/pub/cluster/releases/cluster-1.01.00.tar.gz > > > I just tried it on my 2.6.13-4 and I had the following error : make[2]: Entering directory `/home/build/cluster-1.01.00/cman/lib' gcc -Wall -g -O -I. -fPIC -I/home/build/cluster-1.01.00/build/incdir/cluster -c -o libcman.o libcman.c libcman.c:31:35: cluster/cnxman-socket.h: No such file or directory I've fixed that with the patch attached to this mail. Everything now compiles fine. Great job, this is really easier than before ;) Is it difficult to make it compile on previous kernels like 2.6.11? -------------- next part -------------- A non-text attachment was scrubbed... Name: cluster-1.01.00-include.patch Type: text/x-patch Size: 343 bytes Desc: not available URL: From tspauld98 at yahoo.com Thu Oct 20 15:49:58 2005 From: tspauld98 at yahoo.com (Tim Spaulding) Date: Thu, 20 Oct 2005 08:49:58 -0700 (PDT) Subject: [Linux-cluster] New Cluster Installation Starts Partitioned In-Reply-To: <1129760921.3471.6.camel@falballa.gallien.atix> Message-ID: <20051020154958.50725.qmail@web60524.mail.yahoo.com> Hi Mark, Thanks, that solved it. I had opened up the right ports on my primary node but had forgotten to do the same on the secondary node reinforcing Murphy's Second Law of Clustering. It's always the little things. :) Thanks again, tims --- Mark Hlawatschek wrote: > Hi Tim, > > make sure that the cmans on both nodes can talk to each other. I > observed this problem when iptables wasn't configured correctly. If you > have an active iptables config shut it down and try again. > > Hope that helps ... > > Mark > > On Wed, 2005-10-19 at 12:49 -0700, Tim Spaulding wrote: > > Hi All, > > > > I have a couple of machines that I'm trying to cluster. The machines are freshly installed > FC4 > > machines that have been fully updated and running the latest kernel. They are configured to > use > > the lvm2 by default so lvm2 and dm was already installed. I'm following the directions in the > > usage.txt off RedHat's web site. I compile the cluster tarball, run depmod, and start ccsd > > without issue. When I do a cman_tool join -w on each node, both nodes start cman and join the > > cluster, but the cluster is apparently partitioned (i.e.
they both see the cluster and are > joined > > to it, but the two nodes cannot see that the other node is joined). I've searched around and > > haven't found anything specific to this symptom. I have a feeling that it's something to do > with > > my network configuration. Any help would be appreciated. > > > > Both machines are i686 archs with dual NICs. The NICs are connected to networks that do not > route > > to each other. One network (eth0 on both machines) is a development network. The other > network > > (eth1) is our corporate network. I'm trying to configure the cluster to use the dev network > > (eth0). > > > > Here's the output from uname: > > > > Linux ctclinux1.clam.com 2.6.13-1.1526_FC4 #1 Wed Sep 28 19:15:10 EDT 2005 i686 i686 i386 > > GNU/Linux > > Linux ctclinux2.clam.com 2.6.13-1.1526_FC4 #1 Wed Sep 28 19:15:10 EDT 2005 i686 i686 i386 > > GNU/Linux > > > > Here's the network configuration on ctclinux1: > > > > eth0 Link encap:Ethernet HWaddr 00:01:03:26:5C:C9 > > inet addr:192.168.36.200 Bcast:192.168.36.255 Mask:255.255.255.0 > > inet6 addr: fe80::201:3ff:fe26:5cc9/64 Scope:Link > > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > > RX packets:7260 errors:0 dropped:0 overruns:0 frame:0 > > TX packets:350 errors:0 dropped:0 overruns:0 carrier:0 > > collisions:0 txqueuelen:1000 > > RX bytes:449183 (438.6 KiB) TX bytes:27853 (27.2 KiB) > > Interrupt:10 Base address:0xec00 > > > > eth1 Link encap:Ethernet HWaddr 00:B0:D0:41:0F:65 > > inet addr:10.10.10.200 Bcast:10.10.255.255 Mask:255.255.0.0 > > inet6 addr: fe80::2b0:d0ff:fe41:f65/64 Scope:Link > > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > > RX packets:57450 errors:0 dropped:0 overruns:1 frame:0 > > TX packets:12957 errors:0 dropped:0 overruns:0 carrier:0 > > collisions:0 txqueuelen:1000 > > RX bytes:10040767 (9.5 MiB) TX bytes:1962029 (1.8 MiB) > > Interrupt:5 Base address:0xe880 > > > > eth1:1 Link encap:Ethernet HWaddr 00:B0:D0:41:0F:65 > > inet addr:10.10.10.204 Bcast:10.10.255.255 Mask:255.255.0.0 > > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > > Interrupt:5 Base address:0xe880 > > > > lo Link encap:Local Loopback > > inet addr:127.0.0.1 Mask:255.0.0.0 > > inet6 addr: ::1/128 Scope:Host > > UP LOOPBACK RUNNING MTU:16436 Metric:1 > > RX packets:17568 errors:0 dropped:0 overruns:0 frame:0 > > TX packets:17568 errors:0 dropped:0 overruns:0 carrier:0 > > collisions:0 txqueuelen:0 > > RX bytes:3692600 (3.5 MiB) TX bytes:3692600 (3.5 MiB) > > > > sit0 Link encap:IPv6-in-IPv4 > > NOARP MTU:1480 Metric:1 > > RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > > TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 > > collisions:0 txqueuelen:0 > > RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) > > > > Kernel IP routing table > > Destination Gateway Genmask Flags Metric Ref Use Iface > > 192.168.36.0 * 255.255.255.0 U 0 0 0 eth0 > > 10.74.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 > > 10.72.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 > > 10.75.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 > > 10.73.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 > > 10.10.0.0 * 255.255.0.0 U 0 0 0 eth1 > > 169.254.0.0 * 255.255.0.0 U 0 0 0 eth1 > > default 10.10.1.1 0.0.0.0 UG 0 0 0 eth1 > > > > cat /etc/hosts > > 10.10.10.200 ctclinux1-svc > > 192.168.36.200 ctclinux1-cls > > 192.168.36.201 ctclinux2-cls > > 10.10.10.201 ctclinux2-svc > > > > Here's the network configuration on ctclinux2: > > > > ifconfig -a > > eth0 Link encap:Ethernet HWaddr 00:01:03:D4:80:7C > > inet addr:192.168.36.201 Bcast:192.168.36.255 Mask:255.255.255.0 > 
> inet6 addr: fe80::201:3ff:fed4:807c/64 Scope:Link > > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > > RX packets:7702 errors:0 dropped:0 overruns:1 frame:0 > > TX packets:282 errors:0 dropped:0 overruns:0 carrier:0 > > collisions:0 txqueuelen:1000 > > RX bytes:477769 (466.5 KiB) TX bytes:22444 (21.9 KiB) > > Interrupt:10 Base address:0xec00 > > > > eth1 Link encap:Ethernet HWaddr 00:B0:D0:41:0F:9B > > inet addr:10.10.10.201 Bcast:10.10.255.255 Mask:255.255.0.0 > > inet6 addr: fe80::2b0:d0ff:fe41:f9b/64 Scope:Link > > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > > RX packets:53846 errors:0 dropped:0 overruns:1 frame:0 > > TX packets:7759 errors:0 dropped:0 overruns:0 carrier:0 > > collisions:0 txqueuelen:1000 > > RX bytes:5733713 (5.4 MiB) TX bytes:1155588 (1.1 MiB) > > Interrupt:5 Base address:0xe880 > > > > lo Link encap:Local Loopback > > inet addr:127.0.0.1 Mask:255.0.0.0 > > inet6 addr: ::1/128 Scope:Host > > UP LOOPBACK RUNNING MTU:16436 Metric:1 > > RX packets:17912 errors:0 dropped:0 overruns:0 frame:0 > > TX packets:17912 errors:0 dropped:0 overruns:0 carrier:0 > > collisions:0 txqueuelen:0 > > RX bytes:3401868 (3.2 MiB) TX bytes:3401868 (3.2 MiB) > > > > sit0 Link encap:IPv6-in-IPv4 > > NOARP MTU:1480 Metric:1 > > RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > > TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 > > collisions:0 txqueuelen:0 > > RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) > > > > route > > Kernel IP routing table > > Destination Gateway Genmask Flags Metric Ref Use Iface > > 192.168.36.0 * 255.255.255.0 U 0 0 0 eth0 > > 10.74.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 > > 10.72.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 > > 10.75.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 > > 10.73.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 > > 10.10.0.0 * 255.255.0.0 U 0 0 0 eth1 > > 169.254.0.0 * 255.255.0.0 U 0 0 0 eth1 > > default 10.10.1.1 0.0.0.0 UG 0 0 0 eth1 > > > > cat /etc/hosts > > 10.10.10.201 ctclinux2-svc > > 192.168.36.201 ctclinux2-cls > > 192.168.36.200 ctclinux1-cls > > 10.10.10.200 ctclinux1-svc > > > > Here's the cluster configuration file: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Here's the cluster information from ctclinux1 after the cluster is started and joined: > > > > cman_tool -d join -w > > nodename ctclinux1.clam.com not found > > nodename ctclinux1 (truncated) not found > > nodename ctclinux1 doesn't match ctclinux1-cls (ctclinux1-cls in cluster.conf) > > nodename ctclinux1 doesn't match ctclinux2-cls (ctclinux2-cls in cluster.conf) > > nodename localhost (if lo) not found > > selected nodename ctclinux1-cls > > setup up interface for address: ctclinux1-cls > > Broadcast address for c824a8c0 is ff24a8c0 > > > > cman_tool status > > Protocol version: 5.0.1 > > Config version: 1 > > Cluster name: cl_tic > > Cluster ID: 6429 > > Cluster Member: Yes > > Membership state: Cluster-Member > > Nodes: 1 > > Expected_votes: 2 > > Total_votes: 1 > > Quorum: 2 Activity blocked > > Active subsystems: 0 > > Node name: ctclinux1-cls > > Node addresses: 192.168.36.200 > > > > cman_tool nodes > > Node Votes Exp Sts Name > > 1 1 2 M ctclinux1-cls > > > > Here's the cluster information from ctclinux2 after the cluster is started and joined: > > > > cman_tool -d join -w > > nodename ctclinux2.clam.com not found > > nodename ctclinux2 (truncated) not found > > nodename ctclinux2 doesn't match ctclinux1-cls (ctclinux1-cls in cluster.conf) > > nodename ctclinux2 doesn't 
match ctclinux2-cls (ctclinux2-cls in cluster.conf) > > nodename localhost (if lo) not found > > selected nodename ctclinux2-cls > > setup up interface for address: ctclinux2-cls > > Broadcast address for c924a8c0 is ff24a8c0 > > > > cman_tool status > > Protocol version: 5.0.1 > > Config version: 1 > > Cluster name: cl_tic > > Cluster ID: 6429 > > Cluster Member: Yes > > Membership state: Cluster-Member > > Nodes: 1 > > Expected_votes: 2 > > Total_votes: 1 > > Quorum: 2 Activity blocked > > Active subsystems: 0 > > Node name: ctclinux2-cls > > Node addresses: 192.168.36.201 > > > > cman_tool nodes > > Node Votes Exp Sts Name > > 1 1 2 M ctclinux2-cls > > > > Let me know if there is more information that I need to provide. As an aside, I've tried > reducing > > the quorum count with no difference in behavior and I've tried using multicast which fails on > the > > cman_tool join with an "Unknown Host" error. I'm open to any other suggestions. > > > > Thanks, > > > > tims > > > > > > > > > > __________________________________ > > Yahoo! Mail - PC Magazine Editors' Choice 2005 > > http://mail.yahoo.com > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Mark Hlawatschek > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > __________________________________ Yahoo! Music Unlimited Access over 1 million songs. Try it free. http://music.yahoo.com/unlimited/ From linux4dave at gmail.com Thu Oct 20 16:05:02 2005 From: linux4dave at gmail.com (dave first) Date: Thu, 20 Oct 2005 09:05:02 -0700 Subject: [Linux-cluster] Clustering Tutorial Message-ID: <207649d0510200905t77ae28b2j7813921f16b1f8e1@mail.gmail.com> Hey Guys, I'm a unix geek going waaaay back, but I haven't been administering Linux. I've taken a job where all the *nix systems are Linux (most RH). There are two clusters. I know nada about clusters. In my experience, when I jump into something w/o learning the basics, I'm always on a learning curve. So, I need to learn basics about Linux Clustering, including terminology - like what is "fencing?" Are there any good online sources you could point me to? Reading this list, I know I could understand a lot more if I had the terminology down... dave -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwill at penguincomputing.com Thu Oct 20 16:18:15 2005 From: mwill at penguincomputing.com (Michael Will) Date: Thu, 20 Oct 2005 09:18:15 -0700 Subject: [Linux-cluster] Clustering Tutorial In-Reply-To: <207649d0510200905t77ae28b2j7813921f16b1f8e1@mail.gmail.com> References: <207649d0510200905t77ae28b2j7813921f16b1f8e1@mail.gmail.com> Message-ID: <4357C347.8000506@jellyfish.highlyscyld.com> http://www.phy.duke.edu/resources/computing/brahma/Resources/beowulf_book.php is a good start, http://www.beowulf.org is another good place, it is also the home of the original beowulf mailinglist. Generally I would recommend digging through recent mailinglist postings because there are often very informed answers to questions. Lon just answered a fencing question a few days ago: "STONITH, STOMITH, etc. are indeed implementations of I/O fencing. Fencing is the act of forcefully preventing a node from being able to access resources after that node has been evicted from the cluster in an attempt to avoid corruption. The canonical example of when it is needed is the live-hang scenario, as you described: 1. 
node A hangs with I/Os pending to a shared file system 2. node B and node C decide that node A is dead and recover resources allocated on node A (including the shared file system) 3. node A resumes normal operation 4. node A completes I/Os to shared file system At this point, the shared file system is probably corrupt. If you're lucky, fsck will fix it -- if you're not, you'll need to restore from backup. I/O fencing (STONITH, or whatever we want to call it) prevents the last step (step 4) from happening. How fencing is done (power cycling via external switch, SCSI reservations, FC zoning, integrated methods like IPMI, iLO, manual intervention, etc.) is unimportant - so long as whatever method is used can guarantee that step 4 can not complete." "GFS can use fabric-level fencing - that is, you can tell the iSCSI server to cut a node off, or ask the fiber-channel switch to disable a port. This is in addition to "power-cycle" fencing." Michael From davegu1 at hotmail.com Thu Oct 20 17:52:58 2005 From: davegu1 at hotmail.com (David Gutierrez) Date: Thu, 20 Oct 2005 12:52:58 -0500 Subject: [Linux-cluster] Clustering Tutorial In-Reply-To: <207649d0510200905t77ae28b2j7813921f16b1f8e1@mail.gmail.com> Message-ID: Dave, There is lots of information on Linux out there on the net. Especially if you do a search on Linux Documentation, there is a website for that too. http://www.tldp.net/index.html From there you can go to the cluster section. But imagine a cluster in Linux as a cluster in AIX, Solaris, HPUX or Tru64. David From: dave first Reply-To: linux clustering To: linux clustering Subject: [Linux-cluster] Clustering Tutorial Date: Thu, 20 Oct 2005 09:05:02 -0700
Hey Guys, I'm a unix geek going waaaay back, but I haven't been administering Linux. I've taken a job where all the *nix systems are Linux (most RH). There are two clusters. I know nada about clusters. In my experience, when I jump into something w/o learning the basics, I'm always on a learning curve. So, I need to learn the basics of Linux clustering, including terminology - like what is "fencing"? Are there any good online sources you could point me to? Reading this list, I know I could understand a lot more if I had the terminology down...

dave

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

From tspauld98 at yahoo.com Thu Oct 20 18:58:24 2005
From: tspauld98 at yahoo.com (Tim Spaulding)
Date: Thu, 20 Oct 2005 11:58:24 -0700 (PDT)
Subject: [Linux-cluster] Clustering Tutorial
In-Reply-To: <4357C347.8000506@jellyfish.highlyscyld.com>
Message-ID: <20051020185824.21929.qmail@web60525.mail.yahoo.com>

Just a note of caution: there's a big difference between High Availability Clustering and High Performance Clustering. AFAIK, Beowulf is an HPC technology. RHCS (Red Hat Cluster Suite) and GFS (Global File System) are HAC technologies. Some of the underlying building blocks are used by both communities, but they are used for fundamentally different purposes.

http://www.linux-ha.org is the home of another Linux-based HAC technology. They have more documentation on clustering and its concepts. Red Hat does a good job on the HOW-TOs of getting a cluster working but a terrible job of telling folks the WHY-TOs of clustering.

I'm currently working on a comparison of linux-ha and RHCS, so if you have questions regarding HAC on Linux then fire away. If you have a beowulf cluster, je ne comprends pas (I don't understand), sorry.

--tims

--- Michael Will wrote:

> http://www.phy.duke.edu/resources/computing/brahma/Resources/beowulf_book.php
> is a good start, http://www.beowulf.org is another good place, it is also the home of the
> original beowulf mailinglist.
>
> Generally I would recommend digging through recent mailinglist postings because
> there are often very informed answers to questions.
>
> Lon just answered a fencing question a few days ago:
>
> "STONITH, STOMITH, etc. are indeed implementations of I/O fencing.
>
> Fencing is the act of forcefully preventing a node from being able to
> access resources after that node has been evicted from the cluster in an
> attempt to avoid corruption.
>
> The canonical example of when it is needed is the live-hang scenario, as
> you described:
>
> 1. node A hangs with I/Os pending to a shared file system
> 2. node B and node C decide that node A is dead and recover resources
> allocated on node A (including the shared file system)
> 3. node A resumes normal operation
> 4. node A completes I/Os to shared file system
>
> At this point, the shared file system is probably corrupt. If you're
> lucky, fsck will fix it -- if you're not, you'll need to restore from
> backup. I/O fencing (STONITH, or whatever we want to call it) prevents
> the last step (step 4) from happening.
>
> How fencing is done (power cycling via external switch, SCSI
> reservations, FC zoning, integrated methods like IPMI, iLO, manual
> intervention, etc.) is unimportant - so long as whatever method is used
> can guarantee that step 4 can not complete."
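(As a concrete illustration of the "power cycling via external switch" case described above, here is a minimal sketch of what actually gets run; the agent name, address, login and outlet number are placeholders, not details from the original posts:

   # ask a network power switch to power-cycle the outlet feeding the hung node
   fence_apc -a 10.0.0.50 -l apc -p apc -n 3 -o reboot
   # or let the fence system use whatever method cluster.conf defines for that node
   fence_node nodeA

Either way the point is the one made above: the node must be guaranteed dead, or cut off from the storage, before its resources are recovered.)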
> > "GFS can use fabric-level fencing - that is, you can tell the iSCSI > server to cut a node off, or ask the fiber-channel switch to disable a > port. This is in addition to "power-cycle" fencing." > > > Michael > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > __________________________________ Yahoo! Music Unlimited Access over 1 million songs. Try it free. http://music.yahoo.com/unlimited/ From lhh at redhat.com Thu Oct 20 19:01:51 2005 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 20 Oct 2005 15:01:51 -0400 Subject: [Linux-cluster] mounting using label In-Reply-To: <20051020001121.10195.qmail@web52102.mail.yahoo.com> References: <20051020001121.10195.qmail@web52102.mail.yahoo.com> Message-ID: <1129834911.17902.48.camel@ayanami.boston.redhat.com> On Wed, 2005-10-19 at 17:11 -0700, Alexander Rau wrote: > Hi: > > I am trying to mount a file system on the SAN by using > the label rather then the device. > > When I specify "-L label" in either the device line or > the Options line the cluster service fails to start. > > Just wondering if anybody has successfully used labels > to mount a file system as a cluster service...? Put: LABEL=label_name ...as the "device" name in the cluster configuration. -- Lon From dawson at fnal.gov Fri Oct 21 13:18:28 2005 From: dawson at fnal.gov (Troy Dawson) Date: Fri, 21 Oct 2005 08:18:28 -0500 Subject: [Linux-cluster] Re: write's pausing - which tools to debug? In-Reply-To: <20051019104816.GD31027@neu.nirvana> References: <4355049E.4060606@fnal.gov> <20051019104816.GD31027@neu.nirvana> Message-ID: <4358EAA4.1080901@fnal.gov> Axel Thimm wrote: > Hi, > > On Tue, Oct 18, 2005 at 09:20:14AM -0500, Troy Dawson wrote: > >>We've been having some problems with doing a write's to our GFS file >>system, and it will pause, for long periods. (Like from 5 to 10 >>seconds, to 30 seconds, and occasially 5 minutes) After the pause, it's >>like nothing happened, whatever the process is, just keeps going happy >>as can be. >>Except for these pauses, our GFS is quite zippy, both reads and writes. >> But these pauses are holding us back from going full production. >>I need to know what tools I should use to figure out what is causing >>these pauses. >> >>Here is the setup. >>All machines: RHEL 4 update 1 (ok, actually S.L. 4.1), kernel >>2.6.9-11.ELsmp, GFS 6.1.0, ccs 1.0.0, gulm 1.0.0, rgmanager 1.9.34 >> >>I have no ability to do fencing yet, so I chose to use the gulm locking >>mechanism. I have it setup so that there are 3 lock servers, for >>failover. I have tested the failover, and it works quite well. > > > If this is a testing environment use manual fencing. E.g. if a node > needs to get fenced you get a log message saying that you should do > that and acknowledge that. > > >>I have 5 machines in the cluster. 1 isn't connected to the SAN, or >>using GFS. It is just a failover gulm lock server incase the other two >>lock servers go down. >> >>So I have 4 machines connected to our SAN and using GFS. 3 are >>read-only, 1 is read-write. If it is important, the 3 read-only are >>x86_64, the 1 read-write and the 1 not connected are i386. >> >>The read/write machine is our master lock server. Then one of the >>read-only is a fallback lock server, as is the machine not using GFS. >> >>Anyway, we're getting these pauses when writting, and I'm having a hard >>time tracking down where the problem is. I *think* that we can still >>read from the other machines. 
But since this comes and goes, I haven't >>been able to verify that. > > > What SAN hardware is attached to the nodes? > > From the switch on down, I don't know. It's a centrally managed SAN, that I have been allowed to plug into and given disk space. I do have Qlogic cards in the machines. >>Anyway, which tools do you think would be best in diagnosing this? > > > I'd suggest to check/monitor networking. Also place the cluster > communication on a separate network that the SAN/LAN network. The > cluster heartbeat goes over UDP and a congested network may delay > these packages or drop the completely. At least that's the CMAN > picture, lock_gulm may be different. > That sounds like a good idea. All of our machines have two ethernet ports, and I'm not using the second one on any of them. That would actually fix some other problems as well. > Also don't mix RHELU1 and U2 or FC. Just in case you'd like to > upgrade to SL4.2 one by one. > Yup, read that, but thanks for the reminder. > There have been many changes/bug fixes to the cluster bits in RHELU2, > and there are also some new spiffy features like multipath. Perhaps > it's worth rebasing your testing environment? > Don't I wish it was a testing enviroment. But at least the machines don't HAVE to be 24x7. And I've only got one of them in production right now, so it's only one going down. Troy -- __________________________________________________ Troy Dawson dawson at fnal.gov (630)840-6468 Fermilab ComputingDivision/CSS CSI Group __________________________________________________ From linux4dave at gmail.com Sat Oct 22 02:18:36 2005 From: linux4dave at gmail.com (dave first) Date: Fri, 21 Oct 2005 19:18:36 -0700 Subject: [Linux-cluster] Clustering Tutorial In-Reply-To: <20051020185824.21929.qmail@web60525.mail.yahoo.com> References: <4357C347.8000506@jellyfish.highlyscyld.com> <20051020185824.21929.qmail@web60525.mail.yahoo.com> Message-ID: <207649d0510211918i29dfe228k3b29befcc6f35f48@mail.gmail.com> Thanks. I should have mentioned that we're doing high performance clustering, and not HA. We have a beowulf cluster (old and decrepid) and an OSCAR cluster. None of our current clusters are RH, but that will probably change once we get our next 4-opteron cpu/box cluster... Yeehaw! And a Big Thanks to everyone who responded. I now have some good resources. A lot of reading... yaaaawn ! heh-heh. dave On 10/20/05, Tim Spaulding wrote: > > Just a note of caution, there's a big difference between High Availability > Clustering and High > Performance Clustering. AFAIK, Beowulf is an HPC technology. RHCS (Red Hat > Cluster Suite) and > GFS (Global File System) are HAC technologies. Some of the underlying > building blocks are used by > both communities but they are used for fundamentally difference purposes. > > http://www.linux-ha.org is the home of another HAC, linux-based > technology. They have more > documentation on clustering and its concepts. Red Hat does a good job on > the HOW-TOs of getting a > cluster working but a terrible job of telling folks the WHY-TOs of > clustering. > > I'm currently working on a comparison of linux-ha and RHCS so if you have > questions regarding HAC > on linux then fire away. If you have a beowulf cluster, je ne comprends > pas, sorry. > > --tims > > --- Michael Will wrote: > > > > http://www.phy.duke.edu/resources/computing/brahma/Resources/beowulf_book.php > > is a good start, > > http://www.beowulf.org is another good place, it is also the home of the > > original beowulf mailinglist. 
> > > > Generally I would recommend digging through recent mailinglist postings > > because > > there are often very informed answers to questions. > > > > Lon just answered a fencing question a few days ago: > > > > "STONITH, STOMITH, etc. are indeed implementations of I/O fencing. > > > > Fencing is the act of forcefully preventing a node from being able to > > access resources after that node has been evicted from the cluster in an > > attempt to avoid corruption. > > > > The canonical example of when it is needed is the live-hang scenario, as > > you described: > > > > 1. node A hangs with I/Os pending to a shared file system > > 2. node B and node C decide that node A is dead and recover resources > > allocated on node A (including the shared file system) > > 3. node A resumes normal operation > > 4. node A completes I/Os to shared file system > > > > At this point, the shared file system is probably corrupt. If you're > > lucky, fsck will fix it -- if you're not, you'll need to restore from > > backup. I/O fencing (STONITH, or whatever we want to call it) prevents > > the last step (step 4) from happening. > > > > How fencing is done (power cycling via external switch, SCSI > > reservations, FC zoning, integrated methods like IPMI, iLO, manual > > intervention, etc.) is unimportant - so long as whatever method is used > > can guarantee that step 4 can not complete." > > > > "GFS can use fabric-level fencing - that is, you can tell the iSCSI > > server to cut a node off, or ask the fiber-channel switch to disable a > > port. This is in addition to "power-cycle" fencing." > > > > > > Michael > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > __________________________________ > Yahoo! Music Unlimited > Access over 1 million songs. Try it free. > http://music.yahoo.com/unlimited/ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bill.Scherer at VerizonWireless.com Mon Oct 24 14:56:14 2005 From: Bill.Scherer at VerizonWireless.com (Bill Scherer) Date: Mon, 24 Oct 2005 10:56:14 -0400 Subject: [Linux-cluster] ssh, ldap, and nfs Message-ID: <435CF60E.1050001@VerizonWireless.com> Sorry if this is a bit off-topic, but does anyone have any idea how to get ssh to accept public-key authorization for accounts that exist only in ldap land and whose home folders are nfs mounted? It should work, right? From vmoravek at atlas.cz Mon Oct 24 23:00:26 2005 From: vmoravek at atlas.cz (vmoravek at atlas.cz) Date: Tue, 25 Oct 2005 01:00:26 +0200 Subject: [Linux-cluster] gfs dlm - SAMBA problem(lock??) Message-ID: Hi all, I having use gfs for samba cluster.GFS works fine but have big trouble with this situation. When 6 computers want downloadind same file od directory everything is ok. But same situation but 8 computers trafic is rappidly going down and I have some strange messages about oplocks (maybe) in smb.log. Have any idea what is wrong? 
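(A common workaround at the time for oplock trouble on shares backed by a cluster filesystem was simply to disable oplocks on those shares. A sketch, with the share name and path as placeholders rather than anything from the original post:

   cat >> /etc/samba/smb.conf <<'EOF'
   [gfsdata]
       path = /mnt/gfs/data
       oplocks = no
       level2 oplocks = no
       kernel oplocks = no
   EOF

With many clients hammering the same files and directories through smbd on GFS, oplock breaks are a plausible cause of the slowdown, so this is worth testing before digging deeper.)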
Best Regard Vojta From sommere+linux-cluster at gac.edu Tue Oct 25 00:48:17 2005 From: sommere+linux-cluster at gac.edu (Ethan Sommer) Date: Mon, 24 Oct 2005 19:48:17 -0500 Subject: [Linux-cluster] Occasional kernel panics Message-ID: <435D80D1.70105@gac.edu> Every few days or so our cluster machines seem to have kernel panics comp laing about GFS locking (although its pretty irregular, we went for a few weeks without an outage) We noticed that this happened a LOT, and it was reproducible when certain users accessed files, when we were serving afp off the cluster. We have changed things since then so that afp is run on a server which nfs mounts the cluster. We are running FC4 with the gfs modules from yum. Here is our most recent kernel panics, followed by one from when we had afp running on the cluster: (it looks like there is relevant info above the cut-here, possibly if it might be helpful) Oct 19 14:44:41 meow kernel: ------------[ cut here ]------------ Oct 19 14:44:41 meow kernel: kernel BUG at /usr/src/build/607755-i686/BUILD/smp/src/lockqueue.c:1144! Oct 19 14:44:41 meow kernel: invalid operand: 0000 [#1] Oct 19 14:44:41 meow kernel: SMP Oct 19 14:44:41 meow kernel: Modules linked in: nfsd exportfs lockd autofs4 lock_dlm(U) gfs(U) lock_harness(U) rfcomm l2cap bluetooth dlm(U) cman(U) md5 ip v6 sunrpc ipt_LOG ipt_limit ipt_state ip_conntrack iptable_filter ip_tables video button battery ac uhci_hcd ehci_hcd hw_random i2c_i801 i2c_core shpchp e1000 floppy ext3 jbd raid1 dm_mod qla2200 qla2xxx scsi_transport_fc ata_piix libata sd_mod scsi_mod Oct 19 14:44:41 meow kernel: CPU: 1 Oct 19 14:44:41 meow kernel: EIP: 0060:[] Not tainted VLI Oct 19 14:44:41 meow kernel: EFLAGS: 00010292 (2.6.12-1.1447_FC4smp) Oct 19 14:44:41 meow kernel: EIP is at process_cluster_request+0xddb/0xdef [dlm] Oct 19 14:44:41 meow kernel: eax: 00000004 ebx: 00000000 ecx: c035fa4c edx: 00000286 Oct 19 14:44:41 meow kernel: esi: f7fb8400 edi: 00000000 ebp: d2988000 esp: f7eefe24 Oct 19 14:44:41 meow kernel: ds: 007b es: 007b ss: 0068 Oct 19 14:44:41 meow kernel: Process dlm_recvd (pid: 2402, threadinfo=f7eef000 task=f7851020) Oct 19 14:44:41 meow kernel: Stack: f8b0621b 00000001 f8b071e0 f8b06217 2583f987 00000001 00000040 00004000 Oct 19 14:44:41 meow kernel: f7eefe48 00000000 c038e1a0 00000a58 f0167b00 c02a26c1 00000a58 00004040 Oct 19 14:44:41 meow kernel: 00000072 f7eefed4 00000000 00000001 00000246 00000000 edd6eeb8 00000000 Oct 19 14:44:41 meow kernel: Call Trace: Oct 19 14:44:41 meow kernel: [] sock_recvmsg+0x103/0x11e Oct 19 14:44:41 meow kernel: [] midcomms_process_incoming_buffer+0x13b/0x25f [dlm] Oct 19 14:44:41 meow kernel: [] load_balance_newidle+0x23/0x82 Oct 19 14:44:41 meow kernel: [] receive_from_sock+0x196/0x2c9 [dlm] Oct 19 14:44:41 meow kernel: [] schedule+0x405/0xc5e Oct 19 14:44:41 meow kernel: [] schedule+0x431/0xc5e Oct 19 14:44:41 meow kernel: [] dlm_recvd+0x0/0x9c [dlm] Oct 19 14:44:41 meow kernel: [] process_sockets+0x75/0xb7 [dlm] Oct 19 14:44:41 meow kernel: [] dlm_recvd+0x70/0x9c [dlm] Oct 19 14:44:41 meow kernel: [] kthread+0x93/0x97 Oct 19 14:44:41 meow kernel: [] kthread+0x0/0x97 Oct 19 14:44:41 meow kernel: [] kernel_thread_helper+0x5/0xb Oct 19 14:44:41 meow kernel: Code: 4f 82 62 c7 89 e8 e8 b1 b4 00 00 8b 4c 24 14 89 4c 24 04 c7 04 24 6d 63 b0 f8 e8 34 82 62 c7 c7 04 24 1b 62 b0 f8 e8 28 82 62 c7 <0f> 0b 78 04 e0 71 b0 f8 c7 04 24 70 72 b0 f8 e8 40 78 62 c7 57 Oct 19 14:44:41 meow kernel: <0>Fatal exception: panic in 5 seconds Panic 2: Oct 10 09:58:39 woof kernel: ------------[ 
cut here ]------------ Oct 10 09:58:39 woof kernel: kernel BUG at /usr/src/build/607778-i686/BUILD/smp/src/dlm/lock.c:411! Oct 10 09:58:39 woof kernel: invalid operand: 0000 [#1] Oct 10 09:58:39 woof kernel: SMP Oct 10 09:58:39 woof kernel: Modules linked in: nfsd exportfs lockd autofs4 lock_dlm(U) gfs(U) lock_harness(U) rfcomm l2cap bluetooth dlm(U) cman(U) md5 ip v6 sunrpc ipt_LOG ipt_limit ipt_state ip_conntrack iptable_filter ip_tables video button battery ac uhci_hcd ehci_hcd hw_random i2c_i801 i2c_core shpchp e1 000 dm_snapshot dm_zero dm_mirror ext3 jbd raid1 dm_mod qla2200 qla2xxx scsi_transport_fc ata_piix libata sd_mod scsi_mod Oct 10 09:58:39 woof kernel: CPU: 1 Oct 10 09:58:39 woof kernel: EIP: 0060:[] Not tainted VLI Oct 10 09:58:39 woof kernel: EFLAGS: 00010292 (2.6.12-1.1447_FC4smp) Oct 10 09:58:39 woof kernel: EIP is at do_dlm_lock+0x1b7/0x21d [lock_dlm] Oct 10 09:58:39 woof kernel: eax: 00000004 ebx: 00000000 ecx: c035fa4c edx: 00000292 Oct 10 09:58:39 woof kernel: esi: f7848140 edi: ffffffea ebp: 00000003 esp: c74b3cfc Oct 10 09:58:39 woof kernel: ds: 007b es: 007b ss: 0068 Oct 10 09:58:39 woof kernel: Process imapd (pid: 24278, threadinfo=c74b3000 task=f4721a80) Oct 10 09:58:39 woof kernel: Stack: f8b9de75 f7848140 00000003 1bbe0000 00000000 ffffffea 00000003 00000005 Oct 10 09:58:39 woof kernel: 0000000d 00000005 00000000 f58c0a00 00000001 0000000d 20200000 20202020 Oct 10 09:58:39 woof kernel: 20203320 20202020 62312020 30306562 00183030 c8fb2f00 00000001 00000001 Oct 10 09:58:39 woof kernel: Call Trace: Oct 10 09:58:39 woof kernel: [] lm_dlm_lock+0x52/0x5e [lock_dlm] Oct 10 09:58:39 woof kernel: [] lm_dlm_lock+0x0/0x5e [lock_dlm] Oct 10 09:58:39 woof kernel: [] gfs_lm_lock+0x3d/0x5c [gfs] Oct 10 09:58:39 woof kernel: [] gfs_glock_xmote_th+0xae/0x1d3 [gfs] Oct 10 09:58:39 woof kernel: [] rq_promote+0x126/0x150 [gfs] Oct 10 09:58:39 woof kernel: [] run_queue+0xee/0x113 [gfs] Oct 10 09:58:39 woof kernel: [] gfs_glock_nq+0x93/0x144 [gfs] Oct 10 09:58:39 woof kernel: [] gfs_glock_nq_init+0x18/0x2d [gfs] Oct 10 09:58:39 woof kernel: [] get_local_rgrp+0xca/0x1b0 [gfs] Oct 10 09:58:39 woof kernel: [] gfs_inplace_reserve_i+0x90/0xd0 [gfs] Oct 10 09:58:39 woof kernel: [] gfs_quota_lock_m+0xbf/0x117 [gfs] Oct 10 09:58:39 woof kernel: [] do_do_write_buf+0x3a1/0x485 [gfs] Oct 10 09:58:39 woof kernel: [] glock_wait_internal+0x16b/0x26a [gfs] Oct 10 09:58:39 woof kernel: [] do_write_buf+0x182/0x1b6 [gfs] Oct 10 09:58:39 woof kernel: [] walk_vm+0xb3/0x111 [gfs] Oct 10 09:58:39 woof kernel: [] gfs_write+0xa0/0xc2 [gfs] Oct 10 09:58:39 woof kernel: [] do_write_buf+0x0/0x1b6 [gfs] Oct 10 09:58:39 woof kernel: [] gfs_write+0x0/0xc2 [gfs] Oct 10 09:58:39 woof kernel: [] vfs_write+0x9e/0x110 Oct 10 09:58:39 woof kernel: [] sys_write+0x41/0x6a Oct 10 09:58:39 woof kernel: [] syscall_call+0x7/0xb Oct 10 09:58:39 woof kernel: Code: 7c 24 14 89 4c 24 0c 89 5c 24 10 89 6c 24 08 89 74 24 04 c7 04 24 28 e6 b9 f8 e8 0e 94 58 c7 c7 04 24 75 de b9 f8 e8 02 94 58 c7 <0f> 0b 9b 01 a0 e4 b9 f8 c7 04 24 3c e5 b9 f8 e8 1a 8a 58 c7 66 Oct 10 09:58:39 woof kernel: <0>Fatal exception: panic in 5 seconds Sep 7 15:37:44 meow kernel: ------------[ cut here ]------------ Sep 7 15:37:44 meow kernel: kernel BUG at /usr/src/build/588748-i686/BUILD/smp/src/dlm/plock.c:500! 
Sep 7 15:37:44 meow kernel: invalid operand: 0000 [#1] Sep 7 15:37:44 meow kernel: SMP Sep 7 15:37:44 meow kernel: Modules linked in: appletalk nfsd exportfs lockd autofs4 lock_dlm(U) gfs(U) lock_harness(U) rfcomm l2cap bluetooth dlm(U) cman (U) sunrpc md5 ipv6 ipt_LOG ipt_limit ipt_state ip_conntrack iptable_filter ip_tables video button battery ac uhci_hcd ehci_hcd hw_random i2c_i801 i2c_core shpchp e1000 floppy ext3 jbd raid1 dm_mod qla2200 qla2xxx scsi_transport_fc ata_piix libata sd_mod scsi_mod Sep 7 15:37:44 meow kernel: CPU: 3 Sep 7 15:37:44 meow kernel: EIP: 0060:[] Tainted: GF VLI Sep 7 15:37:44 meow kernel: EFLAGS: 00010292 (2.6.12-1.1398_FC4smp) Sep 7 15:37:44 meow kernel: EIP is at update_lock+0x87/0x9b [lock_dlm] Sep 7 15:37:44 meow kernel: eax: 00000004 ebx: fffffff5 ecx: c035ca4c edx: 00000282 Sep 7 15:37:44 meow kernel: esi: 00000000 edi: e99c2c00 ebp: 00000000 esp: d05dedb4 Sep 7 15:37:44 meow kernel: ds: 007b es: 007b ss: 0068 Sep 7 15:37:44 meow kernel: Process afpd (pid: 3872, threadinfo=d05de000 task=d6447550) Sep 7 15:37:44 meow kernel: Stack: badc0ded f8b9d0d6 fffffff5 f8b9da70 f8b9d101 06609291 f7943000 00000000 Sep 7 15:37:44 meow kernel: f8b9a499 7ffffff8 00000000 7ffffff8 00000000 d05dede8 d7636700 7ffffff8 Sep 7 15:37:44 meow kernel: 00000000 d05deea8 d05dee28 f8b9a987 00000001 7ffffff8 00000000 7ffffff8 Sep 7 15:37:44 meow kernel: Call Trace: Sep 7 15:37:44 meow kernel: [] add_lock+0x8e/0xed [lock_dlm] Sep 7 15:37:44 meow kernel: [] fill_gaps+0x87/0x10e [lock_dlm] Sep 7 15:37:44 meow kernel: [] lock_case3+0x43/0xac [lock_dlm] Sep 7 15:37:44 meow kernel: [] plock_internal+0x1aa/0x370 [lock_dlm] Sep 7 15:37:44 meow kernel: [] lm_dlm_plock+0x25b/0x2dc [lock_dlm] Sep 7 15:37:44 meow kernel: [] lm_dlm_plock+0x0/0x2dc [lock_dlm] Sep 7 15:37:44 meow kernel: [] gfs_lm_plock+0x45/0x57 [gfs] Sep 7 15:37:44 meow kernel: [] gfs_lock+0xcd/0x11c [gfs] Sep 7 15:37:44 meow kernel: [] gfs_lock+0x0/0x11c [gfs] Sep 7 15:37:44 meow kernel: [] fcntl_setlk64+0x16c/0x26a Sep 7 15:37:44 meow kernel: [] fget+0x3b/0x42 Sep 7 15:37:44 meow kernel: [] sys_fcntl64+0x55/0x97 Sep 7 15:37:44 meow kernel: [] syscall_call+0x7/0xb Sep 7 15:37:44 meow kernel: Code: 01 00 00 c7 04 24 a8 da b9 f8 e8 7c 77 58 c7 89 5c 24 04 c7 04 24 08 d1 b9 f8 e8 6c 77 58 c7 c7 04 24 d6 d0 b9 f8 e8 60 77 58 c7 <0f> 0b f4 01 70 da b9 f8 c7 04 24 10 db b9 f8 e8 78 6d 58 c7 55 Sep 7 15:37:44 meow kernel: <0>Fatal exception: panic in 5 seconds Thanks for any help, Ethan From tpcollier at liberty.edu Tue Oct 25 19:31:06 2005 From: tpcollier at liberty.edu (Collier, Tirus (SA)) Date: Tue, 25 Oct 2005 15:31:06 -0400 Subject: [Linux-cluster] EMCPower Errors Message-ID: Good Day, Request to know if anyone experienced the following errors on there cluster. I'm running a 3 node cluster with following: 1.) 3 PE1850s 2.) CX700 storage 3.) Kernel 2.4.21-32.0.1.ELsmp #1 pool_tool -s | grep error | more /dev/emcpoweraa <- error -> /dev/emcpoweraa1 <- error -> /dev/emcpoweraa10 <- error -> /dev/emcpoweraa11 <- error -> /dev/emcpoweraa12 <- error -> /dev/emcpoweraa13 <- error -> /dev/emcpoweraa14 <- error -> /dev/emcpoweraa15 <- error -> /dev/emcpoweraa2 <- error -> /dev/emcpoweraa3 <- error -> /dev/emcpoweraa4 <- error -> /dev/emcpoweraa5 <- error -> /dev/emcpoweraa6 <- error -> /dev/emcpoweraa7 <- error -> /dev/emcpoweraa8 <- error -> /dev/emcpoweraa9 <- error -> /dev/emcpowerab <- error -> /dev/emcpowerab1 <- error -> /dev/emcpowerab10 <- error -> /dev/emcpowerab11 <- error -> Stonewall T. 
Collier
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From teigland at redhat.com Tue Oct 25 20:00:12 2005
From: teigland at redhat.com (David Teigland)
Date: Tue, 25 Oct 2005 15:00:12 -0500
Subject: [Linux-cluster] Occasional kernel panics
In-Reply-To: <435D80D1.70105@gac.edu>
References: <435D80D1.70105@gac.edu>
Message-ID: <20051025200012.GA15854@redhat.com>

On Mon, Oct 24, 2005 at 07:48:17PM -0500, Ethan Sommer wrote:
> Oct 19 14:44:41 meow kernel: kernel BUG at
> /usr/src/build/607755-i686/BUILD/smp/src/lockqueue.c:1144!
> Oct 10 09:58:39 woof kernel: kernel BUG at
> /usr/src/build/607778-i686/BUILD/smp/src/dlm/lock.c:411!
> Sep 7 15:37:44 meow kernel: kernel BUG at
> /usr/src/build/588748-i686/BUILD/smp/src/dlm/plock.c:500!

I don't have any quick explanation for the first two. It's clear from the third that the afpd application is doing some serious posix locking, where there's ample room for bugs. We'll take a look, thanks for the info.

Dave

From mwill at penguincomputing.com Thu Oct 27 00:47:13 2005
From: mwill at penguincomputing.com (Michael Will)
Date: Wed, 26 Oct 2005 17:47:13 -0700
Subject: Antw: [Linux-cluster] Oracle 10G-R2 on GFS install problems
In-Reply-To: 
References: 
Message-ID: <43602391.3020708@penguincomputing.com>

OCFS2 has not yet been released for Oracle products, even though it comes from Oracle itself. GFS is certified for the 9.2 series, and whether you can use RHEL3 or RHEL4 depends on whether you need 32-bit or 64-bit support. Oracle has strict guidelines about which products it supports in which version, on which OS in which version, and on which hardware platform. Usually there is a good reason for excluding choices. It might work on other versions and OSes not listed, but it won't be supported then.

Stefan, this mailinglist has an international audience and all postings are in english ;-)

Stefan Marx wrote:

>Hi Marvin,
>
>OCFS2 has also not yet been released for Oracle products, even though it comes from Oracle itself. GFS is certified for the 9.2 series, though you have to check whether you can use RHEL3 or RHEL4, depending on whether you need 32-bit or 64-bit support. Oracle is quite explicit about which products are supported in which version, on which operating system in which version, and additionally on which hardware platform. And there is usually a good reason for that :-(. Of course these things also run on other operating systems, as long as the corresponding libraries and kernel versions match, but then they simply aren't supported.
>
>Ciao, Stefan
>
>>>>spwilcox at att.com 10/13/05 8:26 pm >>>
>
>In the process of installing Oracle 10G-R2 on a RHEL4-U2 x86_64 cluster
>with GFS 6.1.2, I get the following error when running Oracle's root.sh
>for cluster ready services (a.k.a clusterware):
>
>[ OCROSD][4142143168]utstoragetype: /u00/app/ocr0 is on FS type
>18225520. Not supported.
>
>I did a little poking around and found that OCFS2 has the same issue,
>but with OCFS2 it can be circumvented by mounting with -o datavolume...
>I was unable to find any similar options for GFS mounts. This looks
>like probably more of an Oracle bug, as 10G-R1 installed without any
>problems (I have my DBA pursuing the Oracle route), but I was wondering
>if anyone else has come across this problem and if so, was there any
>fix?
>
>Thanks,
>-steve

--
Michael Will Penguin Computing Corp.
Sales Engineer 415-954-2822 415-954-2899 fx
mwill at penguincomputing.com

From erwan at seanodes.com Thu Oct 27 12:12:27 2005
From: erwan at seanodes.com (Velu Erwan)
Date: Thu, 27 Oct 2005 14:12:27 +0200
Subject: [Linux-cluster] cluster-1.01.00
In-Reply-To: <20051019151045.GA3975@redhat.com>
References: <20051019151045.GA3975@redhat.com>
Message-ID: <4360C42B.1050107@seanodes.com>

David Teigland wrote:

>A new source tarball from the STABLE branch has been released; it builds
>and runs on 2.6.13:

I've been working on making an rpm of this tarball. I now have one main rpm which contains all the usual binaries, one for the libraries, one for the devel files, and one for the kernel modules.

For the kernel modules I chose to create a dkms rpm. That way we don't ship a binary kernel module, but an rpm which rebuilds the modules on the target host. This is very useful: dkms can automatically rebuild the gfs module if you reboot into another kernel, without needing any help from the user or admin. This makes our lives much easier. ;o)

You can find my specfile and the SRPMS at http://www.seanodes.com/~erwan/SRPMS
These rpms have been tested successfully on White Box 4 and on Mandriva 2006 & cooker. The SRPMS is now included in the Mandriva repository, so an "urpmi" is enough to get a runnable gfs ;o) It would be cool to integrate the specfile into the cvs tree.

Making an rpm of this tarball turned up several problems:

1- The configure architecture doesn't let you choose all options.
The main configure calls a set of sub-configures with the same options for all of them. Some sub-configures accept additional options, like "--plugindir=" for magma. If I call the main configure with --plugindir it fails, because the other sub-configures don't implement "--plugindir". It would be better to ignore unimplemented options, which would prevent these failures.

2- Make dependencies.
In my case I'd like to separate the binary build from the kernel module build. That would let me build the rpm by compiling just the binaries, and then provide the dkms rpm in which the kernel modules are built. This gives a faster build process. Today it seems we must build the kernel modules before the binaries, or the binaries can't be built. Instead of having only an "all:" target, it would be cool to have "binaries:" and "kernel-modules:" targets. Today, if my rpm build machine doesn't have a kernel source tree where make has been run, I get:

from /home/nis/guibo/rpm/BUILD/cluster-1.01.00/cman-kernel/src/cnxman.c:15:
include/linux/config.h:4:28: error: linux/autoconf.h: No such file or directory

3- Soname troubles.
I'm not an expert on this part, but some binaries are linked against the .so library whereas they should be linked against the .so.x. ccsd is a good example:

[root at max4 ~]# ldd /sbin/ccsd
        libxml2.so.2 => /usr/lib64/libxml2.so.2 (0x00002aaaaabc1000)
        libz.so.1 => /lib64/libz.so.1 (0x00002aaaaadd1000)
        libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x00002aaaaaee6000)
        libm.so.6 => /lib64/tls/libm.so.6 (0x00002aaaaaffb000)
        libmagma.so => /usr/lib64/libmagma.so (0x00002aaaab153000)
        libmagmamsg.so => /usr/lib64/libmagmamsg.so (0x00002aaaab25b000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002aaaab35f000)
        libc.so.6 => /lib64/tls/libc.so.6 (0x00002aaaab462000)
        /lib64/ld-linux-x86-64.so.2 (0x00002aaaaaaab000)

I was told that binaries must be linked against .so.x because the bare .so files are only for development. In the Mandriva rpm policy, .so files must go in the lib%name-devel rpms and .so.x files in the lib%name rpms. This linking error makes that impossible.
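(For illustration, the usual soname convention being asked for here looks roughly like the following sketch; the library name and version numbers are placeholders, and ccsd would of course need to be relinked against the versioned library:

   gcc -shared -Wl,-soname,libmagma.so.1 -o libmagma.so.1.0.0 magma.o
   ln -s libmagma.so.1.0.0 libmagma.so.1   # what the runtime linker loads
   ln -s libmagma.so.1 libmagma.so         # build-time only, ships in the -devel package
   ldd /sbin/ccsd | grep magma             # should then show libmagma.so.1, not libmagma.so

That split is what lets the runtime library package and the -devel package be separated cleanly.)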
I don't know if it's related or not, but three of the libraries don't provide enough information for rpm to work out what they provide. I've added a workaround to my spec file by defining:

%ifarch x86_64
Provides: libmagma.so()(64bit) libmagmamsg.so()(64bit) libmagma_nt.so()(64bit)
%endif

- Erwan Velu

From teigland at redhat.com Thu Oct 27 19:02:28 2005
From: teigland at redhat.com (David Teigland)
Date: Thu, 27 Oct 2005 14:02:28 -0500
Subject: [Linux-cluster] cluster-1.01.00
In-Reply-To: <4360C42B.1050107@seanodes.com>
References: <20051019151045.GA3975@redhat.com> <4360C42B.1050107@seanodes.com>
Message-ID: <20051027190228.GC9710@redhat.com>

On Thu, Oct 27, 2005 at 02:12:27PM +0200, Velu Erwan wrote:
> David Teigland wrote:
> Making an rpm of this tarball turned up several problems:

These all sound like good suggestions. We'd be happy to get any patches you have to fix some of them, otherwise it may be some time until someone gets around to working on it.

Thanks,
Dave

From philip.r.dana at nwp01.usace.army.mil Thu Oct 27 19:08:03 2005
From: philip.r.dana at nwp01.usace.army.mil (Philip R. Dana)
Date: Thu, 27 Oct 2005 12:08:03 -0700
Subject: [Linux-cluster] Service/Resource group help needed
Message-ID: <1130440083.2950.25.camel@nwp-wk-79033-l>

I am setting up a two node active/passive cluster to provide DHCP/DNS services, using CentOS 4 U2 and RHCS4. The rpms were compiled from srpms using the info provided by Sean Gray (thanks, Sean). The shared storage is on a NetApp filer using iSCSI. I think I've missed something somewhere. The output from clustat and clusvcadm:

[root at ns1-node1 ~]# clustat
Member Status: Quorate

Not a member of the Resource Manager service group.
Resource Group information unavailable; showing all cluster members.

  Member Name                        State      ID
  ------ ----                        -----      --
  ns1-node2.mydomain.net             Online     0x0000000000000002
  ns1-node1.mydomain.net             Online     0x0000000000000001

[root at ns1-node1 ~]# clusvcadm -m ns1-node1.mydomain.net -e DNS
Member ns1-node1.mydomain.net not in membership list

Any help/advice will be greatly appreciated. TIA.

From eric at bootseg.com Thu Oct 27 19:17:36 2005
From: eric at bootseg.com (Eric Kerin)
Date: Thu, 27 Oct 2005 15:17:36 -0400
Subject: [Linux-cluster] Service/Resource group help needed
In-Reply-To: <1130440083.2950.25.camel@nwp-wk-79033-l>
References: <1130440083.2950.25.camel@nwp-wk-79033-l>
Message-ID: <1130440656.3453.10.camel@auh5-0479.corp.jabil.org>

On Thu, 2005-10-27 at 12:08 -0700, Philip R. Dana wrote:
> [root at ns1-node1 ~]# clustat
> Member Status: Quorate
>
> Not a member of the Resource Manager service group.
> Resource Group information unavailable; showing all cluster members.
>
>   Member Name                        State      ID
>   ------ ----                        -----      --
>   ns1-node2.mydomain.net             Online     0x0000000000000002
>   ns1-node1.mydomain.net             Online     0x0000000000000001
>
Basically this message means the rgmanager service isn't running on the cluster node you ran clustat on. So it's showing the full membership list for the cluster.

Start it up on all the nodes, and you should be good to go.

Hope this helps,
Eric Kerin
eric at bootseg.com

From mwill at penguincomputing.com Thu Oct 27 19:25:04 2005
From: mwill at penguincomputing.com (Michael Will)
Date: Thu, 27 Oct 2005 12:25:04 -0700
Subject: [Linux-cluster] dhcp failover
Message-ID: <43612990.3010703@penguincomputing.com>

Two things to consider:

1. Normally you would run two DHCP servers under two different IPs on the same subnet that serve half of the IP numbers.
When one fails, the other one continues to serve its ip space and hopefully the first one is fixed before the second one runs out of IP numbers. 2. If you have a hard requirement to be only on a single IP address for the DHCP server (rare case for some ISP's DSL hardware) then you can do the active/passive configuration much more easily with a classic heartbeat-setup. We have done both as professional service for customers of ours. Attached storage we usually only use when there is significant data to share, i.e. mysql. The dhcp server configuration can be synced with rsync across the second gigabit ethernet port in realtime. Michael -- Michael Will Penguin Computing Corp. Sales Engineer 415-954-2822 415-954-2899 fx mwill at penguincomputing.com From Philip.R.Dana at nwp01.usace.army.mil Thu Oct 27 19:59:09 2005 From: Philip.R.Dana at nwp01.usace.army.mil (Dana, Philip R NWP Contractor) Date: Thu, 27 Oct 2005 12:59:09 -0700 Subject: [Linux-cluster] Service/Resource group help needed In-Reply-To: <1130440656.3453.10.camel@auh5-0479.corp.jabil.org> References: <1130440083.2950.25.camel@nwp-wk-79033-l> <1130440656.3453.10.camel@auh5-0479.corp.jabil.org> Message-ID: <1130443149.2950.32.camel@nwp-wk-79033-l> Thanks for the quick reply. The rgmanager service is running on both nodes, but I think I have lock (dlm) problems. From a service restart: Oct 27 12:46:41 ns1-node1 clurgmgrd[3372]: Services Initialized Oct 27 12:46:41 ns1-node1 clurgmgrd[3372]: Logged in SG "usrm::manager" Oct 27 12:46:41 ns1-node1 clurgmgrd[3372]: Magma Event: Membership Change Oct 27 12:46:41 ns1-node1 clurgmgrd[3372]: State change: Local UP Oct 27 12:46:41 ns1-node1 clurgmgrd[3372]: #33: Unable to obtain cluster lock: Operation not permitted On Thu, 2005-10-27 at 15:17 -0400, Eric Kerin wrote: > On Thu, 2005-10-27 at 12:08 -0700, Philip R. Dana wrote: > > [root at ns1-node1 ~]# clustat > > Member Status: Quorate > > > > Not a member of the Resource Manager service group. > > Resource Group information unavailable; showing all cluster members. > > > > Member Name State ID > > ------ ---- ----- -- > > ns1-node2.mydomain.net Online 0x0000000000000002 > > ns1-node1.mydomain.net Online 0x0000000000000001 > > > > Basically this message means the rgmanager service isn't running on the > cluster node you ran clustat on. So it's showing the full membership > list for the cluster. > > Start it up on all the nodes, and you should be good to go. > > Hope this helps, > Eric Kerin > eric at bootseg.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From lhh at redhat.com Thu Oct 27 22:06:02 2005 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 27 Oct 2005 18:06:02 -0400 Subject: [Linux-cluster] Service/Resource group help needed In-Reply-To: <1130443149.2950.32.camel@nwp-wk-79033-l> References: <1130440083.2950.25.camel@nwp-wk-79033-l> <1130440656.3453.10.camel@auh5-0479.corp.jabil.org> <1130443149.2950.32.camel@nwp-wk-79033-l> Message-ID: <1130450762.23803.41.camel@ayanami.boston.redhat.com> On Thu, 2005-10-27 at 12:59 -0700, Dana, Philip R NWP Contractor wrote: > Thanks for the quick reply. The rgmanager service is running on both > nodes, but I think I have lock (dlm) problems. 
From a service restart: > > Oct 27 12:46:41 ns1-node1 clurgmgrd[3372]: Services Initialized > Oct 27 12:46:41 ns1-node1 clurgmgrd[3372]: Logged in SG > "usrm::manager" > Oct 27 12:46:41 ns1-node1 clurgmgrd[3372]: Magma Event: > Membership Change > Oct 27 12:46:41 ns1-node1 clurgmgrd[3372]: State change: Local > UP > Oct 27 12:46:41 ns1-node1 clurgmgrd[3372]: #33: Unable to obtain > cluster lock: Operation not permitted service rgmanager stop; modprobe dlm; service rgmanager start -- Lon From pcaulfie at redhat.com Fri Oct 28 13:06:52 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Fri, 28 Oct 2005 14:06:52 +0100 Subject: [Linux-cluster] Re: Xen cluster doku down In-Reply-To: <43621F64.2020609@kofler.eu.org> References: <43621F64.2020609@kofler.eu.org> Message-ID: <4362226C.1050304@redhat.com> Thomas Kofler wrote: > Hi, > > I used your nice guide setting up a xen cluster, but suddenly the link > is broken: > http://www.cix.co.uk/~tykepenguin/xencluster.html > > > Do you have a mirror URL ? Sorry, It's now at http://people.redhat.com/pcaulfie/docs/xencluster.html -- patrick From a_webb_5 at yahoo.com Fri Oct 28 16:14:38 2005 From: a_webb_5 at yahoo.com (Amber Webb) Date: Fri, 28 Oct 2005 09:14:38 -0700 (PDT) Subject: [Linux-cluster] TORQUE 2.0 Message-ID: <20051028161438.49432.qmail@web35710.mail.mud.yahoo.com> Hi, I would like to announce that TORQUE Resource Manager 2.0 was just released, and can be downloaded at www.clusterresources.com/torque. TORQUE, which is built on OpenPBS is one of the most widely used open source batch schedulers. TORQUE's improvements since the last patch include an improved start up feature for quick startup of downed nodes, enhanced internal diagnostics, simplified install, and improved API reporting abilities. TORQUE is a community project with contributions from NCSA, OSC, USC, the U.S. Department of Energy, Sandia, PNNL, University of Buffalo, TeraGrid and many other leading edge HPC organizations. We invite you to download and try TORQUE and visit our user community www.clusterresources.com/torque. We welcome feedback and patch submissions. Regards, Amber __________________________________ Yahoo! Mail - PC Magazine Editors' Choice 2005 http://mail.yahoo.com From david.chappel at mindbank.com Wed Oct 19 17:30:58 2005 From: david.chappel at mindbank.com (David A. Chappel) Date: Wed, 19 Oct 2005 11:30:58 -0600 Subject: [Linux-cluster] mounts not spanning In-Reply-To: <1129303867.4838.30.camel@localhost.localdomain> References: <1129303867.4838.30.camel@localhost.localdomain> Message-ID: <1129743058.5069.12.camel@localhost.localdomain> Hi all; On Fri, 2005-10-14 at 09:31 -0600, David A. Chappel wrote: > Hi there clusterites... Anyone have a cluestick? > The clue stick was meant for me. And for good reason. I'll wait for ddraid. Cheers, -D > I have created a wee "cluster" of two machines. They seem to be happy > in every way, except that when I mount the gfs volumes on each machine, > the mounts do not span across the two nodes, but act as a traditional > node. In other words, I can echo "haha" > /mnt/shareMe/haha.txt on one > machine but it doesn't show up on the other. Vice versa too. > > I use: > mount -t gfs /dev/shareMeVG/shareMeLV /mnt/shareMe > > I've tried the -o ignore_local_fs option without success. > > Also, is there a quick/standard way for non-cluster kernel machines to > mount the "partition" remotely? 
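(Two quick sanity checks that might be worth running in a situation like this; this is only a sketch, the device path is the one from the post, and the exact fields printed vary by GFS version:

   gfs_tool sb /dev/shareMeVG/shareMeLV all   # both nodes should report the same lock protocol and lock table
   cat /proc/cluster/services                 # both nodes should appear in the same DLM/GFS mount group

GFS only gives a single shared view when every node mounts the very same shared block device - a SAN LUN, iSCSI target or GNBD export - with a cluster lock protocol such as lock_dlm; two nodes each mounting a same-named local volume will behave exactly as described.)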
> > Cheers, > -D > > > > [root at JavaTheHut ~]# cat /proc/cluster/status > Protocol version: 5.0.1 > Config version: 1 > Cluster name: clusta > Cluster ID: 6621 > Cluster Member: Yes > Membership state: Cluster-Member > Nodes: 2 > Expected_votes: 1 > Total_votes: 2 > Quorum: 1 > Active subsystems: 6 > Node name: JavaTheHut.mindbankts.com > Node addresses: 10.1.1.22 > > [root at marvin ~]# cat /etc/cluster/cluster.conf > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From david.chappel at mindbank.com Wed Oct 19 20:03:38 2005 From: david.chappel at mindbank.com (David A. Chappel) Date: Wed, 19 Oct 2005 14:03:38 -0600 Subject: [Linux-cluster] New Cluster Installation Starts Partitioned In-Reply-To: <20051019194901.19010.qmail@web60516.mail.yahoo.com> References: <20051019194901.19010.qmail@web60516.mail.yahoo.com> Message-ID: <1129752218.5069.29.camel@localhost.localdomain> Might be a firewall issue. Doing a netstat -nl listed ports that were not mentioned in the "simple setup" docs for me. Specifically 14567. Cheers, -d On Wed, 2005-10-19 at 12:49 -0700, Tim Spaulding wrote: > Hi All, > > I have a couple of machines that I'm trying to cluster. The machines are freshly installed FC4 > machines that have been fully updated and running the latest kernel. They are configured to use > the lvm2 by default so lvm2 and dm was already installed. I'm following the directions in the > usage.txt off RedHat's web site. I compile the cluster tarball, run depmod, and start ccsd > without issue. When I do a cman_tool join -w on each node, both nodes start cman and join the > cluster, but the cluster is apparently partitioned (i.e. they both see the cluster and are joined > to it, but the two nodes cannot see that the other node is joined). I've searched around and > haven't found anything specific to this symptom. I have a feeling that it's something to do with > my network configuration. Any help would be appreciated. > > Both machines are i686 archs with dual NICs. The NICs are connected to networks that do not route > to each other. One network (eth0 on both machines) is a development network. The other network > (eth1) is our corporate network. I'm trying to configure the cluster to use the dev network > (eth0). 
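(Following up on the firewall suggestion above, here is a sketch of iptables rules that open, on the cluster interface, the ports this generation of the cluster stack normally uses. The port numbers are the usual defaults of that release as best I can tell and should be double-checked against netstat -nl on your own nodes, as suggested above:

   iptables -A INPUT -i eth0 -p udp --dport 6809 -j ACCEPT           # cman membership/heartbeat
   iptables -A INPUT -i eth0 -p tcp --dport 21064 -j ACCEPT          # dlm
   iptables -A INPUT -i eth0 -p tcp --dport 50006:50009 -j ACCEPT    # ccsd
   iptables -A INPUT -i eth0 -p udp --dport 50007 -j ACCEPT          # ccsd broadcast

The same rules need to exist on both nodes.)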
> > Here's the output from uname: > > Linux ctclinux1.clam.com 2.6.13-1.1526_FC4 #1 Wed Sep 28 19:15:10 EDT 2005 i686 i686 i386 > GNU/Linux > Linux ctclinux2.clam.com 2.6.13-1.1526_FC4 #1 Wed Sep 28 19:15:10 EDT 2005 i686 i686 i386 > GNU/Linux > > Here's the network configuration on ctclinux1: > > eth0 Link encap:Ethernet HWaddr 00:01:03:26:5C:C9 > inet addr:192.168.36.200 Bcast:192.168.36.255 Mask:255.255.255.0 > inet6 addr: fe80::201:3ff:fe26:5cc9/64 Scope:Link > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > RX packets:7260 errors:0 dropped:0 overruns:0 frame:0 > TX packets:350 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:1000 > RX bytes:449183 (438.6 KiB) TX bytes:27853 (27.2 KiB) > Interrupt:10 Base address:0xec00 > > eth1 Link encap:Ethernet HWaddr 00:B0:D0:41:0F:65 > inet addr:10.10.10.200 Bcast:10.10.255.255 Mask:255.255.0.0 > inet6 addr: fe80::2b0:d0ff:fe41:f65/64 Scope:Link > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > RX packets:57450 errors:0 dropped:0 overruns:1 frame:0 > TX packets:12957 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:1000 > RX bytes:10040767 (9.5 MiB) TX bytes:1962029 (1.8 MiB) > Interrupt:5 Base address:0xe880 > > eth1:1 Link encap:Ethernet HWaddr 00:B0:D0:41:0F:65 > inet addr:10.10.10.204 Bcast:10.10.255.255 Mask:255.255.0.0 > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > Interrupt:5 Base address:0xe880 > > lo Link encap:Local Loopback > inet addr:127.0.0.1 Mask:255.0.0.0 > inet6 addr: ::1/128 Scope:Host > UP LOOPBACK RUNNING MTU:16436 Metric:1 > RX packets:17568 errors:0 dropped:0 overruns:0 frame:0 > TX packets:17568 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:3692600 (3.5 MiB) TX bytes:3692600 (3.5 MiB) > > sit0 Link encap:IPv6-in-IPv4 > NOARP MTU:1480 Metric:1 > RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) > > Kernel IP routing table > Destination Gateway Genmask Flags Metric Ref Use Iface > 192.168.36.0 * 255.255.255.0 U 0 0 0 eth0 > 10.74.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 > 10.72.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 > 10.75.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 > 10.73.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 > 10.10.0.0 * 255.255.0.0 U 0 0 0 eth1 > 169.254.0.0 * 255.255.0.0 U 0 0 0 eth1 > default 10.10.1.1 0.0.0.0 UG 0 0 0 eth1 > > cat /etc/hosts > 10.10.10.200 ctclinux1-svc > 192.168.36.200 ctclinux1-cls > 192.168.36.201 ctclinux2-cls > 10.10.10.201 ctclinux2-svc > > Here's the network configuration on ctclinux2: > > ifconfig -a > eth0 Link encap:Ethernet HWaddr 00:01:03:D4:80:7C > inet addr:192.168.36.201 Bcast:192.168.36.255 Mask:255.255.255.0 > inet6 addr: fe80::201:3ff:fed4:807c/64 Scope:Link > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > RX packets:7702 errors:0 dropped:0 overruns:1 frame:0 > TX packets:282 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:1000 > RX bytes:477769 (466.5 KiB) TX bytes:22444 (21.9 KiB) > Interrupt:10 Base address:0xec00 > > eth1 Link encap:Ethernet HWaddr 00:B0:D0:41:0F:9B > inet addr:10.10.10.201 Bcast:10.10.255.255 Mask:255.255.0.0 > inet6 addr: fe80::2b0:d0ff:fe41:f9b/64 Scope:Link > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > RX packets:53846 errors:0 dropped:0 overruns:1 frame:0 > TX packets:7759 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:1000 > RX bytes:5733713 (5.4 MiB) TX bytes:1155588 (1.1 MiB) > 
Interrupt:5 Base address:0xe880 > > lo Link encap:Local Loopback > inet addr:127.0.0.1 Mask:255.0.0.0 > inet6 addr: ::1/128 Scope:Host > UP LOOPBACK RUNNING MTU:16436 Metric:1 > RX packets:17912 errors:0 dropped:0 overruns:0 frame:0 > TX packets:17912 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:3401868 (3.2 MiB) TX bytes:3401868 (3.2 MiB) > > sit0 Link encap:IPv6-in-IPv4 > NOARP MTU:1480 Metric:1 > RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) > > route > Kernel IP routing table > Destination Gateway Genmask Flags Metric Ref Use Iface > 192.168.36.0 * 255.255.255.0 U 0 0 0 eth0 > 10.74.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 > 10.72.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 > 10.75.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 > 10.73.0.0 192.168.36.10 255.255.255.0 UG 0 0 0 eth0 > 10.10.0.0 * 255.255.0.0 U 0 0 0 eth1 > 169.254.0.0 * 255.255.0.0 U 0 0 0 eth1 > default 10.10.1.1 0.0.0.0 UG 0 0 0 eth1 > > cat /etc/hosts > 10.10.10.201 ctclinux2-svc > 192.168.36.201 ctclinux2-cls > 192.168.36.200 ctclinux1-cls > 10.10.10.200 ctclinux1-svc > > Here's the cluster configuration file: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Here's the cluster information from ctclinux1 after the cluster is started and joined: > > cman_tool -d join -w > nodename ctclinux1.clam.com not found > nodename ctclinux1 (truncated) not found > nodename ctclinux1 doesn't match ctclinux1-cls (ctclinux1-cls in cluster.conf) > nodename ctclinux1 doesn't match ctclinux2-cls (ctclinux2-cls in cluster.conf) > nodename localhost (if lo) not found > selected nodename ctclinux1-cls > setup up interface for address: ctclinux1-cls > Broadcast address for c824a8c0 is ff24a8c0 > > cman_tool status > Protocol version: 5.0.1 > Config version: 1 > Cluster name: cl_tic > Cluster ID: 6429 > Cluster Member: Yes > Membership state: Cluster-Member > Nodes: 1 > Expected_votes: 2 > Total_votes: 1 > Quorum: 2 Activity blocked > Active subsystems: 0 > Node name: ctclinux1-cls > Node addresses: 192.168.36.200 > > cman_tool nodes > Node Votes Exp Sts Name > 1 1 2 M ctclinux1-cls > > Here's the cluster information from ctclinux2 after the cluster is started and joined: > > cman_tool -d join -w > nodename ctclinux2.clam.com not found > nodename ctclinux2 (truncated) not found > nodename ctclinux2 doesn't match ctclinux1-cls (ctclinux1-cls in cluster.conf) > nodename ctclinux2 doesn't match ctclinux2-cls (ctclinux2-cls in cluster.conf) > nodename localhost (if lo) not found > selected nodename ctclinux2-cls > setup up interface for address: ctclinux2-cls > Broadcast address for c924a8c0 is ff24a8c0 > > cman_tool status > Protocol version: 5.0.1 > Config version: 1 > Cluster name: cl_tic > Cluster ID: 6429 > Cluster Member: Yes > Membership state: Cluster-Member > Nodes: 1 > Expected_votes: 2 > Total_votes: 1 > Quorum: 2 Activity blocked > Active subsystems: 0 > Node name: ctclinux2-cls > Node addresses: 192.168.36.201 > > cman_tool nodes > Node Votes Exp Sts Name > 1 1 2 M ctclinux2-cls > > Let me know if there is more information that I need to provide. As an aside, I've tried reducing > the quorum count with no difference in behavior and I've tried using multicast which fails on the > cman_tool join with an "Unknown Host" error. I'm open to any other suggestions. > > Thanks, > > tims > > > > > __________________________________ > Yahoo! 
Mail - PC Magazine Editors' Choice 2005 > http://mail.yahoo.com > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From sommere at gac.edu Fri Oct 21 16:38:13 2005 From: sommere at gac.edu (Ethan Sommer) Date: Fri, 21 Oct 2005 11:38:13 -0500 Subject: [Linux-cluster] Occasional kernel panics Message-ID: <43591975.9070800@gac.edu> Every few days or so our cluster machines seem to have kernel panics comp laing about GFS locking (although its pretty irregular, we went for a few weeks without an outage) We noticed that this happened a LOT, and it was reproducible when certain users accessed files, when we were serving afp off the cluster. We have changed things since then so that afp is run on a server which nfs mounts the cluster. We are running FC4 with the gfs modules from yum. Here is our most recent kernel panics, followed by one from when we had afp running on the cluster: (it looks like there is relevant info above the cut-here, possibly if it might be helpful) Oct 19 14:44:41 meow kernel: ------------[ cut here ]------------ Oct 19 14:44:41 meow kernel: kernel BUG at /usr/src/build/607755-i686/BUILD/smp/src/lockqueue.c:1144! Oct 19 14:44:41 meow kernel: invalid operand: 0000 [#1] Oct 19 14:44:41 meow kernel: SMP Oct 19 14:44:41 meow kernel: Modules linked in: nfsd exportfs lockd autofs4 lock_dlm(U) gfs(U) lock_harness(U) rfcomm l2cap bluetooth dlm(U) cman(U) md5 ip v6 sunrpc ipt_LOG ipt_limit ipt_state ip_conntrack iptable_filter ip_tables video button battery ac uhci_hcd ehci_hcd hw_random i2c_i801 i2c_core shpchp e1000 floppy ext3 jbd raid1 dm_mod qla2200 qla2xxx scsi_transport_fc ata_piix libata sd_mod scsi_mod Oct 19 14:44:41 meow kernel: CPU: 1 Oct 19 14:44:41 meow kernel: EIP: 0060:[] Not tainted VLI Oct 19 14:44:41 meow kernel: EFLAGS: 00010292 (2.6.12-1.1447_FC4smp) Oct 19 14:44:41 meow kernel: EIP is at process_cluster_request+0xddb/0xdef [dlm] Oct 19 14:44:41 meow kernel: eax: 00000004 ebx: 00000000 ecx: c035fa4c edx: 00000286 Oct 19 14:44:41 meow kernel: esi: f7fb8400 edi: 00000000 ebp: d2988000 esp: f7eefe24 Oct 19 14:44:41 meow kernel: ds: 007b es: 007b ss: 0068 Oct 19 14:44:41 meow kernel: Process dlm_recvd (pid: 2402, threadinfo=f7eef000 task=f7851020) Oct 19 14:44:41 meow kernel: Stack: f8b0621b 00000001 f8b071e0 f8b06217 2583f987 00000001 00000040 00004000 Oct 19 14:44:41 meow kernel: f7eefe48 00000000 c038e1a0 00000a58 f0167b00 c02a26c1 00000a58 00004040 Oct 19 14:44:41 meow kernel: 00000072 f7eefed4 00000000 00000001 00000246 00000000 edd6eeb8 00000000 Oct 19 14:44:41 meow kernel: Call Trace: Oct 19 14:44:41 meow kernel: [] sock_recvmsg+0x103/0x11e Oct 19 14:44:41 meow kernel: [] midcomms_process_incoming_buffer+0x13b/0x25f [dlm] Oct 19 14:44:41 meow kernel: [] load_balance_newidle+0x23/0x82 Oct 19 14:44:41 meow kernel: [] receive_from_sock+0x196/0x2c9 [dlm] Oct 19 14:44:41 meow kernel: [] schedule+0x405/0xc5e Oct 19 14:44:41 meow kernel: [] schedule+0x431/0xc5e Oct 19 14:44:41 meow kernel: [] dlm_recvd+0x0/0x9c [dlm] Oct 19 14:44:41 meow kernel: [] process_sockets+0x75/0xb7 [dlm] Oct 19 14:44:41 meow kernel: [] dlm_recvd+0x70/0x9c [dlm] Oct 19 14:44:41 meow kernel: [] kthread+0x93/0x97 Oct 19 14:44:41 meow kernel: [] kthread+0x0/0x97 Oct 19 14:44:41 meow kernel: [] kernel_thread_helper+0x5/0xb Oct 19 14:44:41 meow kernel: Code: 4f 82 62 c7 89 e8 e8 b1 b4 00 00 8b 4c 24 14 89 4c 24 04 c7 04 24 6d 63 b0 f8 e8 34 82 62 c7 c7 04 24 1b 62 b0 f8 e8 28 82 62 c7 <0f> 0b 78 04 e0 71 b0 f8 c7 04 24 70 72 b0 
f8 e8 40 78 62 c7 57 Oct 19 14:44:41 meow kernel: <0>Fatal exception: panic in 5 seconds Panic 2: Oct 10 09:58:39 woof kernel: ------------[ cut here ]------------ Oct 10 09:58:39 woof kernel: kernel BUG at /usr/src/build/607778-i686/BUILD/smp/src/dlm/lock.c:411! Oct 10 09:58:39 woof kernel: invalid operand: 0000 [#1] Oct 10 09:58:39 woof kernel: SMP Oct 10 09:58:39 woof kernel: Modules linked in: nfsd exportfs lockd autofs4 lock_dlm(U) gfs(U) lock_harness(U) rfcomm l2cap bluetooth dlm(U) cman(U) md5 ip v6 sunrpc ipt_LOG ipt_limit ipt_state ip_conntrack iptable_filter ip_tables video button battery ac uhci_hcd ehci_hcd hw_random i2c_i801 i2c_core shpchp e1 000 dm_snapshot dm_zero dm_mirror ext3 jbd raid1 dm_mod qla2200 qla2xxx scsi_transport_fc ata_piix libata sd_mod scsi_mod Oct 10 09:58:39 woof kernel: CPU: 1 Oct 10 09:58:39 woof kernel: EIP: 0060:[] Not tainted VLI Oct 10 09:58:39 woof kernel: EFLAGS: 00010292 (2.6.12-1.1447_FC4smp) Oct 10 09:58:39 woof kernel: EIP is at do_dlm_lock+0x1b7/0x21d [lock_dlm] Oct 10 09:58:39 woof kernel: eax: 00000004 ebx: 00000000 ecx: c035fa4c edx: 00000292 Oct 10 09:58:39 woof kernel: esi: f7848140 edi: ffffffea ebp: 00000003 esp: c74b3cfc Oct 10 09:58:39 woof kernel: ds: 007b es: 007b ss: 0068 Oct 10 09:58:39 woof kernel: Process imapd (pid: 24278, threadinfo=c74b3000 task=f4721a80) Oct 10 09:58:39 woof kernel: Stack: f8b9de75 f7848140 00000003 1bbe0000 00000000 ffffffea 00000003 00000005 Oct 10 09:58:39 woof kernel: 0000000d 00000005 00000000 f58c0a00 00000001 0000000d 20200000 20202020 Oct 10 09:58:39 woof kernel: 20203320 20202020 62312020 30306562 00183030 c8fb2f00 00000001 00000001 Oct 10 09:58:39 woof kernel: Call Trace: Oct 10 09:58:39 woof kernel: [] lm_dlm_lock+0x52/0x5e [lock_dlm] Oct 10 09:58:39 woof kernel: [] lm_dlm_lock+0x0/0x5e [lock_dlm] Oct 10 09:58:39 woof kernel: [] gfs_lm_lock+0x3d/0x5c [gfs] Oct 10 09:58:39 woof kernel: [] gfs_glock_xmote_th+0xae/0x1d3 [gfs] Oct 10 09:58:39 woof kernel: [] rq_promote+0x126/0x150 [gfs] Oct 10 09:58:39 woof kernel: [] run_queue+0xee/0x113 [gfs] Oct 10 09:58:39 woof kernel: [] gfs_glock_nq+0x93/0x144 [gfs] Oct 10 09:58:39 woof kernel: [] gfs_glock_nq_init+0x18/0x2d [gfs] Oct 10 09:58:39 woof kernel: [] get_local_rgrp+0xca/0x1b0 [gfs] Oct 10 09:58:39 woof kernel: [] gfs_inplace_reserve_i+0x90/0xd0 [gfs] Oct 10 09:58:39 woof kernel: [] gfs_quota_lock_m+0xbf/0x117 [gfs] Oct 10 09:58:39 woof kernel: [] do_do_write_buf+0x3a1/0x485 [gfs] Oct 10 09:58:39 woof kernel: [] glock_wait_internal+0x16b/0x26a [gfs] Oct 10 09:58:39 woof kernel: [] do_write_buf+0x182/0x1b6 [gfs] Oct 10 09:58:39 woof kernel: [] walk_vm+0xb3/0x111 [gfs] Oct 10 09:58:39 woof kernel: [] gfs_write+0xa0/0xc2 [gfs] Oct 10 09:58:39 woof kernel: [] do_write_buf+0x0/0x1b6 [gfs] Oct 10 09:58:39 woof kernel: [] gfs_write+0x0/0xc2 [gfs] Oct 10 09:58:39 woof kernel: [] vfs_write+0x9e/0x110 Oct 10 09:58:39 woof kernel: [] sys_write+0x41/0x6a Oct 10 09:58:39 woof kernel: [] syscall_call+0x7/0xb Oct 10 09:58:39 woof kernel: Code: 7c 24 14 89 4c 24 0c 89 5c 24 10 89 6c 24 08 89 74 24 04 c7 04 24 28 e6 b9 f8 e8 0e 94 58 c7 c7 04 24 75 de b9 f8 e8 02 94 58 c7 <0f> 0b 9b 01 a0 e4 b9 f8 c7 04 24 3c e5 b9 f8 e8 1a 8a 58 c7 66 Oct 10 09:58:39 woof kernel: <0>Fatal exception: panic in 5 seconds Sep 7 15:37:44 meow kernel: ------------[ cut here ]------------ Sep 7 15:37:44 meow kernel: kernel BUG at /usr/src/build/588748-i686/BUILD/smp/src/dlm/plock.c:500! 
Sep 7 15:37:44 meow kernel: invalid operand: 0000 [#1] Sep 7 15:37:44 meow kernel: SMP Sep 7 15:37:44 meow kernel: Modules linked in: appletalk nfsd exportfs lockd autofs4 lock_dlm(U) gfs(U) lock_harness(U) rfcomm l2cap bluetooth dlm(U) cman (U) sunrpc md5 ipv6 ipt_LOG ipt_limit ipt_state ip_conntrack iptable_filter ip_tables video button battery ac uhci_hcd ehci_hcd hw_random i2c_i801 i2c_core shpchp e1000 floppy ext3 jbd raid1 dm_mod qla2200 qla2xxx scsi_transport_fc ata_piix libata sd_mod scsi_mod Sep 7 15:37:44 meow kernel: CPU: 3 Sep 7 15:37:44 meow kernel: EIP: 0060:[] Tainted: GF VLI Sep 7 15:37:44 meow kernel: EFLAGS: 00010292 (2.6.12-1.1398_FC4smp) Sep 7 15:37:44 meow kernel: EIP is at update_lock+0x87/0x9b [lock_dlm] Sep 7 15:37:44 meow kernel: eax: 00000004 ebx: fffffff5 ecx: c035ca4c edx: 00000282 Sep 7 15:37:44 meow kernel: esi: 00000000 edi: e99c2c00 ebp: 00000000 esp: d05dedb4 Sep 7 15:37:44 meow kernel: ds: 007b es: 007b ss: 0068 Sep 7 15:37:44 meow kernel: Process afpd (pid: 3872, threadinfo=d05de000 task=d6447550) Sep 7 15:37:44 meow kernel: Stack: badc0ded f8b9d0d6 fffffff5 f8b9da70 f8b9d101 06609291 f7943000 00000000 Sep 7 15:37:44 meow kernel: f8b9a499 7ffffff8 00000000 7ffffff8 00000000 d05dede8 d7636700 7ffffff8 Sep 7 15:37:44 meow kernel: 00000000 d05deea8 d05dee28 f8b9a987 00000001 7ffffff8 00000000 7ffffff8 Sep 7 15:37:44 meow kernel: Call Trace: Sep 7 15:37:44 meow kernel: [] add_lock+0x8e/0xed [lock_dlm] Sep 7 15:37:44 meow kernel: [] fill_gaps+0x87/0x10e [lock_dlm] Sep 7 15:37:44 meow kernel: [] lock_case3+0x43/0xac [lock_dlm] Sep 7 15:37:44 meow kernel: [] plock_internal+0x1aa/0x370 [lock_dlm] Sep 7 15:37:44 meow kernel: [] lm_dlm_plock+0x25b/0x2dc [lock_dlm] Sep 7 15:37:44 meow kernel: [] lm_dlm_plock+0x0/0x2dc [lock_dlm] Sep 7 15:37:44 meow kernel: [] gfs_lm_plock+0x45/0x57 [gfs] Sep 7 15:37:44 meow kernel: [] gfs_lock+0xcd/0x11c [gfs] Sep 7 15:37:44 meow kernel: [] gfs_lock+0x0/0x11c [gfs] Sep 7 15:37:44 meow kernel: [] fcntl_setlk64+0x16c/0x26a Sep 7 15:37:44 meow kernel: [] fget+0x3b/0x42 Sep 7 15:37:44 meow kernel: [] sys_fcntl64+0x55/0x97 Sep 7 15:37:44 meow kernel: [] syscall_call+0x7/0xb Sep 7 15:37:44 meow kernel: Code: 01 00 00 c7 04 24 a8 da b9 f8 e8 7c 77 58 c7 89 5c 24 04 c7 04 24 08 d1 b9 f8 e8 6c 77 58 c7 c7 04 24 d6 d0 b9 f8 e8 60 77 58 c7 <0f> 0b f4 01 70 da b9 f8 c7 04 24 10 db b9 f8 e8 78 6d 58 c7 55 Sep 7 15:37:44 meow kernel: <0>Fatal exception: panic in 5 seconds Thanks for any help, Ethan From fbavandpouri at amcc.com Mon Oct 24 23:02:19 2005 From: fbavandpouri at amcc.com (Farid Bavandpouri) Date: Mon, 24 Oct 2005 16:02:19 -0700 Subject: [Linux-cluster] gfs dlm - SAMBA problem(lock??) Message-ID: <9D1E2BDCB5C57B46B56E6D80843439EBBB88DA@SDCEXCHANGE01.ad.amcc.com> Remobe/unsubscribe mmontaseri at amcc.com -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of vmoravek at atlas.cz Sent: Monday, October 24, 2005 4:00 PM To: Linux-cluster at redhat.com Subject: [Linux-cluster] gfs dlm - SAMBA problem(lock??) Hi all, I having use gfs for samba cluster.GFS works fine but have big trouble with this situation. When 6 computers want downloadind same file od directory everything is ok. But same situation but 8 computers trafic is rappidly going down and I have some strange messages about oplocks (maybe) in smb.log. Have any idea what is wrong? 
Best Regard Vojta -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
From fbavandpouri at amcc.com Tue Oct 25 16:41:27 2005 From: fbavandpouri at amcc.com (Farid Bavandpouri) Date: Tue, 25 Oct 2005 09:41:27 -0700 Subject: [Linux-cluster] Occasional kernel panics Message-ID: <9D1E2BDCB5C57B46B56E6D80843439EBBB89D2@SDCEXCHANGE01.ad.amcc.com> Unsubscribe mmontaseri at amcc.com He no longer works at AMCC. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ethan Sommer Sent: Monday, October 24, 2005 5:48 PM To: linux-cluster at redhat.com Subject: [Linux-cluster] Occasional kernel panics [quoted copy of the original message and its three kernel panic traces omitted; identical to Ethan Sommer's post above] -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
From clusterbuilder at gmail.com Tue Oct 25 17:35:47 2005 From: clusterbuilder at gmail.com (Nick I) Date: Tue, 25 Oct 2005 11:35:47 -0600 Subject: [Linux-cluster] Ask the Cluster Expert Message-ID: Hi.
Thanks to the response from many in the community, I have added sections about diskless clusters and information on 32-bit and 64-bit processors at the site I help run, www.ClusterBuilder.org. I also added a section called Ask the Cluster Expert (http://www.clusterbuilder.org/pages/ask-the-expert.php) for people to submit questions they have about cluster and grid computing. I post the questions at an FAQ page (http://www.clusterbuilder.org/pages/ask-the-expert/faq.php) and then research the answer, as well as allow those knowledgeable in the community to submit a response to the question. I want to build a valuable knowledge base of high performance computing information. I need you to share your knowledge by adding to the question responses and also by submitting questions/answers for common problems you've experienced in the past and are experiencing now. A sample question could be about running certain operating systems on clusters. Thanks, Nick -------------- next part -------------- An HTML attachment was scrubbed... URL:
From garyshi at gmail.com Fri Oct 28 17:00:33 2005 From: garyshi at gmail.com (Gary Shi) Date: Sat, 29 Oct 2005 01:00:33 +0800 Subject: [Linux-cluster] GFS over GNBD servers connected to a SAN? Message-ID: The Administrator's Guide suggests 3 kinds of configurations; in the second one, "GFS and GNBD with a SAN", servers running GFS share devices exported by GNBD servers. I'm wondering about the details of such a configuration. Does it have better performance because it can distribute the load across the GNBD servers rather than on a single one? Compared to the 3rd way, "GFS and GNBD with Directly Connected Storage", it seems the only difference is that we can export the same device through different GNBD servers. Is that true? For example: Suppose the SAN exports only 1 logical device, and we have 4 GNBD servers connected to the SAN, and 32 application servers share the filesystem via GFS. So the disk on the SAN is /dev/sdb on each GNBD server. Can we use "gnbd_export -d /dev/sdb -e test" to export the device under the same name "test" on all GNBD servers, have every 8 GFS nodes share a GNBD server, and so have all 32 GFS nodes access the same SAN device? What configuration is suggested for a high-performance GNBD server? How many clients are fair for a GNBD server? BTW, is it possible to run an NFS service on the GFS nodes, and make different client groups access different NFS servers, resulting in a lot of NFS clients accessing the same shared filesystem? -- regards, Gary Shi -------------- next part -------------- An HTML attachment was scrubbed... URL:
From philip.r.dana at nwp01.usace.army.mil Mon Oct 31 14:45:07 2005 From: philip.r.dana at nwp01.usace.army.mil (Philip R. Dana) Date: Mon, 31 Oct 2005 06:45:07 -0800 Subject: [Linux-cluster] Service/Resource group help needed In-Reply-To: <1130450762.23803.41.camel@ayanami.boston.redhat.com> References: <1130440083.2950.25.camel@nwp-wk-79033-l> <1130440656.3453.10.camel@auh5-0479.corp.jabil.org> <1130443149.2950.32.camel@nwp-wk-79033-l> <1130450762.23803.41.camel@ayanami.boston.redhat.com> Message-ID: <1130769907.2950.60.camel@nwp-wk-79033-l> Modprobe dlm resulted in a module not found error, a result of making the edit to dlm-kernel.spec that Mr. Gray used. If dlm-kernel is built with the spec file unmodified, then building cman-kernel errors out with a file not found error. Back to square one. The replies were all appreciated. Thanks. On Thu, 2005-10-27 at 18:06 -0400, Lon Hohberger wrote: > On Thu, 2005-10-27 at 12:59 -0700, Dana, Philip R NWP Contractor wrote: > > Thanks for the quick reply.
The rgmanager service is running on both > > nodes, but I think I have lock (dlm) problems. From a service restart: > > > > Oct 27 12:46:41 ns1-node1 clurgmgrd[3372]: Services Initialized > > Oct 27 12:46:41 ns1-node1 clurgmgrd[3372]: Logged in SG > > "usrm::manager" > > Oct 27 12:46:41 ns1-node1 clurgmgrd[3372]: Magma Event: > > Membership Change > > Oct 27 12:46:41 ns1-node1 clurgmgrd[3372]: State change: Local > > UP > > Oct 27 12:46:41 ns1-node1 clurgmgrd[3372]: #33: Unable to obtain > > cluster lock: Operation not permitted > > service rgmanager stop; modprobe dlm; service rgmanager start > > -- Lon > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster
From erwan at seanodes.com Mon Oct 31 16:20:15 2005 From: erwan at seanodes.com (Velu Erwan) Date: Mon, 31 Oct 2005 17:20:15 +0100 Subject: [Linux-cluster] Readahead Issues using cluster-1.01.00 Message-ID: <4366443F.9000707@seanodes.com> Hi, I've been playing with cluster-1.01.00 and I found reading very slow. I've been trying to set "max_readahead" using gfs_tool, but performance is unchanged. So I've read the code and placed some printk calls everywhere ;o) The result is: in ops_file.c, when you do a gfs_read you have the following code (this is true for both buffered and directio reads):
if (gfs_is_jdata(ip) ||
    (gfs_is_stuffed(ip) && !test_bit(GIF_PAGED, &ip->i_flags)))
        count = do_read_readi(file, buf, size, offset);
else
        count = generic_file_read(file, buf, size, offset);
In my case, it always uses generic_file_read because all the conditions evaluate to 0. I'm not a gfs expert, so I don't know whether it's normal not to use the do_read_readi call to read the fs. I've watched how do_read_readi() works: after a few calls it calls gfs_start_ra (from dio.c). This sounds perfect, because it computes "uint32_t max_ra = gfs_tune_get(sdp, gt_max_readahead) >> sdp->sd_sb.sb_bsize_shift;", so it honours the max_readahead value you set. Now the other case: what about generic_file_read (defined in the usual Linux tree)? Of course, it doesn't know about the value you set (max_readahead). After reading how it works, I found that the file structure owns an f_ra member which handles the readahead state. I thought about forcing this value before the generic_file_read call. Please find the patch attached. When I set file->f_ra.ra_pages to a default value of 512, my performance is 3 times better! I've jumped from 40MB/sec to 120MB/sec. I'm now reaching the performance I expected. Comments about this patch:
1) I don't know if it's the cleanest way to do it, but it works, so it shows that readahead is not handled when generic_file_read is used.
2) I've used a default value that matches my hardware, but it would be cleaner to use the "gfs_tune_get(sdp, gt_max_readahead)" call.
2bis) I don't know how to call it, because it needs the gfs_lock structure and I don't know how to provide one (I haven't read enough of the code for that).
3) I need a gfs guru to finish this patch with points 2 & 2bis, if my patch sounds relevant.
Erwan, -------------- next part -------------- A non-text attachment was scrubbed... Name: cluster-1.01.00-readahead.patch Type: text/x-patch Size: 447 bytes Desc: not available URL:
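Since the attached cluster-1.01.00-readahead.patch is scrubbed from the archive, here is a minimal sketch of the change Erwan describes, reconstructed from his description rather than taken from the actual 447-byte patch. It assumes the stock gfs_read() path in ops_file.c quoted above (ip, file, buf, size, offset and count come from the surrounding function) and hard-codes the 512-page window he mentions:

    /* Sketch only, assuming cluster-1.01.00's gfs_read() in ops_file.c:
     * force the per-file readahead window before falling through to
     * generic_file_read(), which never consults GFS's max_readahead
     * tunable. */
    if (gfs_is_jdata(ip) ||
        (gfs_is_stuffed(ip) && !test_bit(GIF_PAGED, &ip->i_flags)))
            count = do_read_readi(file, buf, size, offset);
    else {
            /* 512 pages (2 MB with 4 KB pages) matched Erwan's hardware;
             * his point 2 suggests deriving this from
             * gfs_tune_get(sdp, gt_max_readahead) >> PAGE_CACHE_SHIFT
             * instead, given a way to reach the superblock from here. */
            file->f_ra.ra_pages = 512;
            count = generic_file_read(file, buf, size, offset);
    }

Note that f_ra.ra_pages is counted in pages, while gt_max_readahead appears to be a byte value (gfs_start_ra shifts it by the block-size shift), so some conversion like the one sketched in the comment would be needed if point 2 were implemented.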
From bmarzins at redhat.com Mon Oct 31 16:57:10 2005 From: bmarzins at redhat.com (Benjamin Marzinski) Date: Mon, 31 Oct 2005 10:57:10 -0600 Subject: [Linux-cluster] GFS over GNBD servers connected to a SAN? In-Reply-To: References: Message-ID: <20051031165710.GA24441@phlogiston.msp.redhat.com> On Sat, Oct 29, 2005 at 01:00:33AM +0800, Gary Shi wrote:
> The Administrator's Guide suggests 3 kinds of configurations; in the second one, "GFS and GNBD with a SAN", servers running GFS share devices exported by GNBD servers. I'm wondering about the details of such a configuration. Does it have better performance because it can distribute the load across the GNBD servers rather than on a single one?
> Compared to the 3rd way, "GFS and GNBD with Directly Connected Storage", it seems the only difference is that we can export the same device through different GNBD servers. Is that true? For example:
>
> Suppose the SAN exports only 1 logical device, and we have 4 GNBD servers connected to the SAN, and 32 application servers share the filesystem via GFS. So the disk on the SAN is /dev/sdb on each GNBD server. Can we use "gnbd_export -d /dev/sdb -e test" to export the device under the same name "test" on all GNBD servers, have every 8 GFS nodes share a GNBD server, and so have all 32 GFS nodes access the same SAN device?
Well, it depends. Using RHEL3 with pool, you can have multiple GNBD servers exporting the same SAN device. However, GNBD itself does not do the multipathing. It simply has a mode (uncached mode) that allows multipathing software to be run on top of it. The RHEL3 pool code has multipathing support. However, to do this you must give the GNBD devices exported by each server different names. Otherwise, GNBD will not import multiple devices with the same name. Best practice is to name the device _. There are some additional requirements for doing this. For one, you MUST have hardware-based fencing on the GNBD servers, otherwise you risk corruption. You MUST export ALL multipathed GNBD devices uncached, otherwise you WILL see corruption and you WILL eventually destroy your entire filesystem. If you are using the fence_gnbd fencing agent (and this is only recommended if you do not have a hardware fencing mechanism for the gnbd client machines; otherwise use that), you must set it to multipath style fencing, or you risk corruption. You should read the gnbd man pages (especially fence_gnbd.8 and gnbd_export.8). All of the multipath requirements are listed there (search for "WARNING" in the text for the necessary steps to avoid corruption). In RHEL4, there is no pool device. Multipathing is handled by device-mapper-multipath. Unfortunately, this code is currently too SCSI-centric to work with GNBD, so this setup is impossible in RHEL4.
> What configuration is suggested for a high-performance GNBD server? How many clients are fair for a GNBD server?
The largest number of GNBD clients I have heard of in a production setting is 128. There is no reason why there couldn't be more. The performance bottleneck for setups with a high number of clients is in the network connection. Since you have a single thread serving each client-server-device instance, the gnbd server actually performs better (in terms of total throughput) with more clients. Obviously, your per-client performance will drop, usually due to limited network bandwidth. Having only one gnbd server per device is obviously a single point of failure, so if you are running with RHEL3, you may want multiple servers. In practice, people usually do just fine by designating a single node to be exclusively a GNBD server (which means not running GFS on that node). If you are running GULM, and would like to use your GNBD server as a GULM server, you should have two network interfaces: one for lock traffic and one for block traffic. Since gulm uses a lot of memory but no disk, and gnbd uses a lot of disk but little memory, they can do well together. However, if gulm can't send out heartbeats in a timely manner, your nodes can get fenced during periods of high block IO. With RHEL4, the only real difference is that you do not have the option of multiple gnbd servers per SAN device. It's still best to use the gnbd server exclusively for that purpose.
> BTW, is it possible to run an NFS service on the GFS nodes, and make different client groups access different NFS servers, resulting in a lot of NFS clients accessing the same shared filesystem?
>
> --
> regards,
> Gary Shi
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster