From dist-list at LEXUM.UMontreal.CA  Sat Oct  1 21:58:25 2005
From: dist-list at LEXUM.UMontreal.CA (FM)
Date: Sat, 01 Oct 2005 17:58:25 -0400
Subject: [Linux-cluster] RHEL4 active/active cluster without  GFS
Message-ID: <433F0681.4020700@lexum.umontreal.ca>

Hello everybody,
First post here ... for my first cluster attempt. I do not need GFS

I'm trying to install linux-ha (RPMS from 
http://www.ultramonkey.org/download/heartbeat/). But the installation 
always fails because of ipvsadmin missing.

I read that ipvs is in the kernel, so I check in the default .conf and 
ipvs is enable. Ho can I install ipvsadmin ?

Is there a RedHat way to create cluster without GFS ?

What are your advices ?

Thanks !


From carlopmart at gmail.com  Sun Oct  2 08:35:25 2005
From: carlopmart at gmail.com (carlopmart at gmail.com)
Date: Sun, 02 Oct 2005 10:35:25 +0200
Subject: [Linux-cluster] Export directory via gnbd
Message-ID: <433F9BCD.2020303@gmail.com>

Hi all,

  Is it possible to export directorys via gnbd?? GNBD docs only 
descrives how to export files or partitons ...

-- 
CL Martinez
carlopmart {at} gmail {d0t} com


From Axel.Thimm at ATrpms.net  Sun Oct  2 10:23:05 2005
From: Axel.Thimm at ATrpms.net (Axel Thimm)
Date: Sun, 2 Oct 2005 12:23:05 +0200
Subject: [Linux-cluster] Re: SMP and GFS
In-Reply-To: <04f401c588b6$b31db0a0$5001a8c0@spa.isqsolutions.com>
References: <04f401c588b6$b31db0a0$5001a8c0@spa.isqsolutions.com>
Message-ID: <20051002102305.GD13944@neu.nirvana>

On Thu, Jul 14, 2005 at 04:57:51PM -0400, Manuel Bujan wrote:
> Is there any  issue I should be aware of if SMP is enabled in
> my kernel ? What if I compile my kernel to be pre-emptible ? Any problem with that and GFS ?
> 
> I am running GFS in a dual Xeon server from DELL.

> After a lot of time running my GFS setup I got the following error
> in one of our cluster servers, and I had to reboot it in order to
> restablish the service:

> 
> #################################################################################
> Jul 14 14:19:35 atmail-2 kernel:  2
> Jul 14 14:19:35 atmail-2 kernel: gfs001 (18044) req reply einval ae2c0092 fr 1 r 1        2
> Jul 14 14:19:35 atmail-2 kernel: gfs001 (31381) req reply einval bf9901e7 fr 1 r 1        2
> Jul 14 14:19:35 atmail-2 kernel: gfs001 (2023) req reply einval d6c30333 fr 1 r 1        2
> Jul 14 14:19:35 atmail-2 kernel: gfs001 send einval to 1
> Jul 14 14:19:35 atmail-2 last message repeated 2 times

I found similar log sniplets on a RHEL4U1 machine with dual Xeons (HP
Proliant). The machine crashed with a kernel panic shortly after
telling the other nodes to leave the cluster (sorry the staff was
under pressure and noone wrote down the panic's output):

Sep 30 05:08:11 zs01 kernel: nval to 1 (P:kernel)
Sep 30 05:08:11 zs01 kernel: data send einval to 1 (P:kernel)
Sep 30 05:08:11 zs01 kernel: Magma send einval to 1 (P:kernel)
Sep 30 05:08:11 zs01 kernel: data send einval to 1 (P:kernel)
Sep 30 05:08:11 zs01 kernel: Magma send einval to 1 (P:kernel)
Sep 30 05:08:33 zs03 kernel: CMAN: removing node zs02 from the cluster : Missed too many heartbeats (P:kernel)
Sep 30 05:08:39 zs03 kernel: CMAN: removing node zs01 from the cluster : No response to messages (P:kernel)
Sep 30 05:08:45 zs03 kernel: CMAN: quorum lost, blocking activity (P:kernel)

Seeking for the einval messages I found only this post here. So it
doesn't seem to happen that often. OTOH it's the same hardware,
perhaps dual Xeons are not good for GFS and/or the cluster
infrastructure?

In my case kernel and GFS bits are all from Red Hat, no self built
components other than a qla2xxx driver, but the issue is on the
cluster communication side.
-- 
Axel.Thimm at ATrpms.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051002/4fba432f/attachment.sig>

From sequel at neofreak.org  Sun Oct  2 15:06:07 2005
From: sequel at neofreak.org (DeadManMoving)
Date: Sun, 02 Oct 2005 11:06:07 -0400
Subject: [Linux-cluster] Re: SMP and GFS
In-Reply-To: <20051002102305.GD13944@neu.nirvana>
References: <04f401c588b6$b31db0a0$5001a8c0@spa.isqsolutions.com>
	<20051002102305.GD13944@neu.nirvana>
Message-ID: <1128265567.23136.21.camel@saloon.neofreak.org>

I'm running a cluster on two node without GFS (only using clurgmgrd to
export nfs share) on IBM x346 servers (Pentium 4 Xeon (Foster) with the
smp kernel; 2.6.9-11smp) and, while i do not see those errors in my
logs, i do see them in /proc/cluster/dlm_debug :

Magma send einval to 1
Magma send einval to 1
Magma send einval to 1
Magma send einval to 1
Magma send einval to 1
Magma (3055) req reply einval 440255 fr 1 r 1 usrm::rg="home_ma
Magma (3055) req reply einval 4b0262 fr 1 r 1 usrm::rg="home_ma
Magma send einval to 1
Magma (11923) req reply einval 5300f1 fr 1 r 1 usrm::vf
Magma send einval to 1
Magma (3055) req reply einval 530338 fr 1 r 1 usrm::vf

My cluster is highly instable, just this morning i've realized that the
clurgmgrd deamon was dead...

Can someone at Red Hat shed some light on this?

Thanks, Tony Lapointe.

On Sun, 2005-10-02 at 12:23 +0200, Axel Thimm wrote:
> On Thu, Jul 14, 2005 at 04:57:51PM -0400, Manuel Bujan wrote:
> > Is there any  issue I should be aware of if SMP is enabled in
> > my kernel ? What if I compile my kernel to be pre-emptible ? Any problem with that and GFS ?
> > 
> > I am running GFS in a dual Xeon server from DELL.
> 
> > After a lot of time running my GFS setup I got the following error
> > in one of our cluster servers, and I had to reboot it in order to
> > restablish the service:
> 
> > 
> > #################################################################################
> > Jul 14 14:19:35 atmail-2 kernel:  2
> > Jul 14 14:19:35 atmail-2 kernel: gfs001 (18044) req reply einval ae2c0092 fr 1 r 1        2
> > Jul 14 14:19:35 atmail-2 kernel: gfs001 (31381) req reply einval bf9901e7 fr 1 r 1        2
> > Jul 14 14:19:35 atmail-2 kernel: gfs001 (2023) req reply einval d6c30333 fr 1 r 1        2
> > Jul 14 14:19:35 atmail-2 kernel: gfs001 send einval to 1
> > Jul 14 14:19:35 atmail-2 last message repeated 2 times
> 
> I found similar log sniplets on a RHEL4U1 machine with dual Xeons (HP
> Proliant). The machine crashed with a kernel panic shortly after
> telling the other nodes to leave the cluster (sorry the staff was
> under pressure and noone wrote down the panic's output):
> 
> Sep 30 05:08:11 zs01 kernel: nval to 1 (P:kernel)
> Sep 30 05:08:11 zs01 kernel: data send einval to 1 (P:kernel)
> Sep 30 05:08:11 zs01 kernel: Magma send einval to 1 (P:kernel)
> Sep 30 05:08:11 zs01 kernel: data send einval to 1 (P:kernel)
> Sep 30 05:08:11 zs01 kernel: Magma send einval to 1 (P:kernel)
> Sep 30 05:08:33 zs03 kernel: CMAN: removing node zs02 from the cluster : Missed too many heartbeats (P:kernel)
> Sep 30 05:08:39 zs03 kernel: CMAN: removing node zs01 from the cluster : No response to messages (P:kernel)
> Sep 30 05:08:45 zs03 kernel: CMAN: quorum lost, blocking activity (P:kernel)
> 
> Seeking for the einval messages I found only this post here. So it
> doesn't seem to happen that often. OTOH it's the same hardware,
> perhaps dual Xeons are not good for GFS and/or the cluster
> infrastructure?
> 
> In my case kernel and GFS bits are all from Red Hat, no self built
> components other than a qla2xxx driver, but the issue is on the
> cluster communication side.
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From pcaulfie at redhat.com  Mon Oct  3 06:59:22 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Mon, 03 Oct 2005 07:59:22 +0100
Subject: [Linux-cluster] Re: SMP and GFS
In-Reply-To: <20051002102305.GD13944@neu.nirvana>
References: <04f401c588b6$b31db0a0$5001a8c0@spa.isqsolutions.com>
	<20051002102305.GD13944@neu.nirvana>
Message-ID: <4340D6CA.7070105@redhat.com>

Axel Thimm wrote:
> On Thu, Jul 14, 2005 at 04:57:51PM -0400, Manuel Bujan wrote:
> 
>>Is there any  issue I should be aware of if SMP is enabled in
>>my kernel ? What if I compile my kernel to be pre-emptible ? Any problem with that and GFS ?
>>

Pre-emptible kernels will not work with GFS, that's certain.

-- 

patrick


From pcaulfie at redhat.com  Mon Oct  3 07:10:59 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Mon, 03 Oct 2005 08:10:59 +0100
Subject: [Openais] Re: [Linux-cluster] new userland cman
In-Reply-To: <1128109200.8440.14.camel@unnamed.az.mvista.com>
References: <433D4134.6080608@redhat.com>
	<1128109200.8440.14.camel@unnamed.az.mvista.com>
Message-ID: <4340D983.7080106@redhat.com>

Steven Dake wrote:
> Patrick
> 
> Thanks for the work
> 
> I have a few comments inline
> 
> On Fri, 2005-09-30 at 14:44 +0100, Patrick Caulfield wrote:

>>- Hard limit to size of cluster (set at compile time to 32 currently)***
>>
> 
> 
> I hope to have multiring in 2006; then we should scale to hundreds of
> processors...

Nice :)

I have some ideas for shrinking the size of the current packets which will help
the current system and lower the ethernet load. I'll start on those shortly.

> 
>>neutral
>>-------
>>- Always uses multicast (no broadcast). A default multicast address is supplied
>>if none is given
> 
> 
> If broadcast is important, which I guess it may be, we can pretty easily
> add this support...
> 

I was going to look into this but I doubt its really worth it. It's just any
extra complication and will only apply to IPv4 anyway.


>>- libcman is the only API ( a compatible libcman is available for the kernel
>>version)
>>- Simplified CCS schema, but will read old one if it has nodeids in it.****
>>
>>internal
>>--------
>>- Usable messaging API
>>- Robust membership algorithm
>>- Community involvement, multiple developers.
>>
>>
>>* I very much doubt that anyone will notice apart from maybe Dave & me
>>
>>** Could fix this in AIS, but I'm not sure the patch would be popular upstream.
>>It's much more efficient to run them on different ports or multicast addresses
>>anyway. Incidentally: DON'T run an encrypted and a non-encrypted cluster on the
>>same port & multicast address (not that you would!) - the non-encrypted ones
>>will crash.
>>
> 
> 
> On this point, you mention you could fix "this", do you mean having two
> clusters use the same port and ips?  I have also considered and do want
> this by having each "cluster" join a specific group at startup to serve
> as the cluster membership view.  Unfortunately this would require
> process group membership, and the process groups interface is unfinished
> (totempg.c) so this isn't possible today.  Note I'd take a patch from
> someone that finished the job on this interface :)  I for example, would
> like communication for a specific checkpoint to go over a specific named
> group, instead of to everyone connected to totem.  Then the clm could
> join a group and get membership events, the checkpoint service for a
> specific checkpoint could join a group, and communicate on that group,
> and get membership events for that group etc.
> 
> What did you have in mind here?


Actually something /very/ simple. the old cman just had a uint16 in every packet
which was a cluster_id. If the cluster_id in an incoming packet didn't match the
one read from the config file then the packet was dropped. It's really just a
way of simplifying configuration for those using broadcast or a default
multicast address.

In my more evil moments thought it might be worth hijacking the commented out
"filler" in struct message_header :)
-- 

patrick


From Axel.Thimm at ATrpms.net  Mon Oct  3 08:33:36 2005
From: Axel.Thimm at ATrpms.net (Axel Thimm)
Date: Mon, 3 Oct 2005 10:33:36 +0200
Subject: [Linux-cluster] Re: SMP and GFS
In-Reply-To: <4340D6CA.7070105@redhat.com>
References: <04f401c588b6$b31db0a0$5001a8c0@spa.isqsolutions.com>
	<20051002102305.GD13944@neu.nirvana> <4340D6CA.7070105@redhat.com>
Message-ID: <20051003083336.GC10393@neu.nirvana>

On Mon, Oct 03, 2005 at 07:59:22AM +0100, Patrick Caulfield wrote:
> Axel Thimm wrote:
> > On Thu, Jul 14, 2005 at 04:57:51PM -0400, Manuel Bujan wrote:
> > 
> >>Is there any  issue I should be aware of if SMP is enabled in
> >>my kernel ? What if I compile my kernel to be pre-emptible ? Any problem with that and GFS ?
> >>
> 
> Pre-emptible kernels will not work with GFS, that's certain.

My report was on a RHEL4 kernel.
-- 
Axel.Thimm at ATrpms.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051003/ed709f75/attachment.sig>

From pcaulfie at redhat.com  Mon Oct  3 09:31:02 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Mon, 03 Oct 2005 10:31:02 +0100
Subject: [Linux-cluster] Re: SMP and GFS
In-Reply-To: <20051003083336.GC10393@neu.nirvana>
References: <04f401c588b6$b31db0a0$5001a8c0@spa.isqsolutions.com>	<20051002102305.GD13944@neu.nirvana>
	<4340D6CA.7070105@redhat.com> <20051003083336.GC10393@neu.nirvana>
Message-ID: <4340FA56.6090708@redhat.com>

Axel Thimm wrote:
> On Mon, Oct 03, 2005 at 07:59:22AM +0100, Patrick Caulfield wrote:
> 
>>Axel Thimm wrote:
>>
>>>On Thu, Jul 14, 2005 at 04:57:51PM -0400, Manuel Bujan wrote:
>>>
>>>
>>>>Is there any  issue I should be aware of if SMP is enabled in
>>>>my kernel ? What if I compile my kernel to be pre-emptible ? Any problem with that and GFS ?
>>>>
>>
>>Pre-emptible kernels will not work with GFS, that's certain.
> 
> 
> My report was on a RHEL4 kernel.


...but you did ask about pre-emtible kernels :)

The important messages here are these :

> Sep 30 05:08:33 zs03 kernel: CMAN: removing node zs02 from the cluster :
Missed too many heartbeats (P:kernel)
> Sep 30 05:08:39 zs03 kernel: CMAN: removing node zs01 from the cluster : No
response to messages (P:kernel)


showing that a node has been kicked out of the cluster for not responding
quickly enough to messages. You could try increasing the value in

/proc/cluster/config/cman/max_retries

-- 

patrick


From Axel.Thimm at ATrpms.net  Mon Oct  3 10:52:06 2005
From: Axel.Thimm at ATrpms.net (Axel Thimm)
Date: Mon, 3 Oct 2005 12:52:06 +0200
Subject: [Linux-cluster] Re: SMP and GFS
In-Reply-To: <4340FA56.6090708@redhat.com>
References: <04f401c588b6$b31db0a0$5001a8c0@spa.isqsolutions.com>
	<20051002102305.GD13944@neu.nirvana> <4340D6CA.7070105@redhat.com>
	<20051003083336.GC10393@neu.nirvana> <4340FA56.6090708@redhat.com>
Message-ID: <20051003105206.GG10393@neu.nirvana>

On Mon, Oct 03, 2005 at 10:31:02AM +0100, Patrick Caulfield wrote:
> Axel Thimm wrote:
> > On Mon, Oct 03, 2005 at 07:59:22AM +0100, Patrick Caulfield wrote:
> > 
> >>Axel Thimm wrote:
> >>
> >>>On Thu, Jul 14, 2005 at 04:57:51PM -0400, Manuel Bujan wrote:
> >>>
> >>>
> >>>>Is there any  issue I should be aware of if SMP is enabled in
> >>>>my kernel ? What if I compile my kernel to be pre-emptible ? Any problem with that and GFS ?
> >>>>
> >>
> >>Pre-emptible kernels will not work with GFS, that's certain.
> > 
> > 
> > My report was on a RHEL4 kernel.
> 
> 
> ...but you did ask about pre-emtible kernels :)

No, I didn't, that was Manuel Bujan 6 weeks ago. ;)

I replied that I saw the same einval messages on a RHEL4 kernel.

> The important messages here are these :
> 
> > Sep 30 05:08:33 zs03 kernel: CMAN: removing node zs02 from the cluster :
> Missed too many heartbeats (P:kernel)
> > Sep 30 05:08:39 zs03 kernel: CMAN: removing node zs01 from the cluster : No
> response to messages (P:kernel)
> 
> 
> showing that a node has been kicked out of the cluster for not responding
> quickly enough to messages. You could try increasing the value in
> 
> /proc/cluster/config/cman/max_retries

I know, but that doesn't explain the einval messages, or does it? Or
formulated differently: the einval messages show that the dual Xeon
box had some issues with sockets and its being kicked out could be
just a symptom of that.

Also the RHEL4 box should not kernel panic (all involved parties have
the same config, but only the panicing node has dual Xeons on EM64T,
the other two are dual opterons, all run the same smp RHEL4 kernel).

At that time the dual xeon was doing a backup on this interface with
25-30 MB/sec. That could explain the delayed/dropped UDP heartbeat
packages. Can it explain the "send einval to 1" messages and the
kernel panic?
-- 
Axel.Thimm at ATrpms.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051003/74c4948e/attachment.sig>

From pcaulfie at redhat.com  Mon Oct  3 11:02:40 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Mon, 03 Oct 2005 12:02:40 +0100
Subject: [Linux-cluster] Re: SMP and GFS
In-Reply-To: <20051003105206.GG10393@neu.nirvana>
References: <04f401c588b6$b31db0a0$5001a8c0@spa.isqsolutions.com>	<20051002102305.GD13944@neu.nirvana>
	<4340D6CA.7070105@redhat.com>	<20051003083336.GC10393@neu.nirvana>
	<4340FA56.6090708@redhat.com> <20051003105206.GG10393@neu.nirvana>
Message-ID: <43410FD0.5020403@redhat.com>

Axel Thimm wrote:
> On Mon, Oct 03, 2005 at 10:31:02AM +0100, Patrick Caulfield wrote:
> 
>>Axel Thimm wrote:
>>
>>>On Mon, Oct 03, 2005 at 07:59:22AM +0100, Patrick Caulfield wrote:
>>>
>>>
>>>>Axel Thimm wrote:
>>>>
>>>>
>>>>>On Thu, Jul 14, 2005 at 04:57:51PM -0400, Manuel Bujan wrote:
>>>>>
>>>>>
>>>>>
>>>>>>Is there any  issue I should be aware of if SMP is enabled in
>>>>>>my kernel ? What if I compile my kernel to be pre-emptible ? Any problem with that and GFS ?
>>>>>>
>>>>
>>>>Pre-emptible kernels will not work with GFS, that's certain.
>>>
>>>
>>>My report was on a RHEL4 kernel.
>>
>>
>>...but you did ask about pre-emtible kernels :)
> 
> 
> No, I didn't, that was Manuel Bujan 6 weeks ago. ;)
> 
> I replied that I saw the same einval messages on a RHEL4 kernel.
> 
> 
>>The important messages here are these :
>>
>>
>>>Sep 30 05:08:33 zs03 kernel: CMAN: removing node zs02 from the cluster :
>>
>>Missed too many heartbeats (P:kernel)
>>
>>>Sep 30 05:08:39 zs03 kernel: CMAN: removing node zs01 from the cluster : No
>>
>>response to messages (P:kernel)
>>
>>
>>showing that a node has been kicked out of the cluster for not responding
>>quickly enough to messages. You could try increasing the value in
>>
>>/proc/cluster/config/cman/max_retries
> 
> 
> I know, but that doesn't explain the einval messages, or does it? Or
> formulated differently: the einval messages show that the dual Xeon
> box had some issues with sockets and its being kicked out could be
> just a symptom of that.

it probably does explain them. If the node is kicked out of the cluster, the DLM
starts return -EINVAL from lock ops (because the lockspace no longer exists).
This very often causes the GFS lock_dlm module to oops.


The bugzillas are confused about this but it sort-of exists as
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=165160
-- 

patrick


From Axel.Thimm at ATrpms.net  Mon Oct  3 11:40:17 2005
From: Axel.Thimm at ATrpms.net (Axel Thimm)
Date: Mon, 3 Oct 2005 13:40:17 +0200
Subject: [Linux-cluster] Re: SMP and GFS
In-Reply-To: <43410FD0.5020403@redhat.com>
References: <04f401c588b6$b31db0a0$5001a8c0@spa.isqsolutions.com>
	<20051002102305.GD13944@neu.nirvana> <4340D6CA.7070105@redhat.com>
	<20051003083336.GC10393@neu.nirvana> <4340FA56.6090708@redhat.com>
	<20051003105206.GG10393@neu.nirvana> <43410FD0.5020403@redhat.com>
Message-ID: <20051003114017.GH10393@neu.nirvana>

On Mon, Oct 03, 2005 at 12:02:40PM +0100, Patrick Caulfield wrote:
> Axel Thimm wrote:
> >>showing that a node has been kicked out of the cluster for not responding
> >>quickly enough to messages. You could try increasing the value in
> >>
> >>/proc/cluster/config/cman/max_retries
> > 
> > I know, but that doesn't explain the einval messages, or does it? Or
> > formulated differently: the einval messages show that the dual Xeon
> > box had some issues with sockets and its being kicked out could be
> > just a symptom of that.
> 
> it probably does explain them. If the node is kicked out of the cluster, the DLM
> starts return -EINVAL from lock ops (because the lockspace no longer exists).
> This very often causes the GFS lock_dlm module to oops.
> 
> 
> The bugzillas are confused about this but it sort-of exists as
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=165160

Thanks, that bugzilla explains a lot. It's the same situation like
Corey's, two nodes were shut down, quorum was lost, and one of the two
nodes removed was using the filesystem and was having lock_dlm on
it. So it paniced.

It all very much makes sense now. The two remaining issues are

o why did the network interface blow up twice, and killed the
  communication between the nodes (and it looks like it once killed
  all UDP communications permanently including syslog)? We replaced
  all cabling and switches, next thing is to use a dedicated GBit
  network only for cman/dlm. That's of course something we need to
  investigate and should not be an issue with GFS.

o why did the filesystem desync across members? That may or may not be
  a consequence of the previous cman/dlm failures and kernel panics,
  or may be a consequence of the broken networking between the
  nodes. In both cases while the triggering problem seems to be in the
  networking between the nodes, filesystem inconsitency should not
  happen, and reflects some bug in GFS. See also

  https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=169693

BTW what is "revolver"? Is that a stress test used at RH for GFS?
Would it be possible to share this tool?

Thanks!
-- 
Axel.Thimm at ATrpms.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051003/6a79c484/attachment.sig>

From andrewxwang at yahoo.com.tw  Mon Oct  3 13:06:14 2005
From: andrewxwang at yahoo.com.tw (Andrew Wang)
Date: Mon, 3 Oct 2005 21:06:14 +0800 (CST)
Subject: [Linux-cluster] FYI: www.gridengine.info (new site)
Message-ID: <20051003130614.16205.qmail@web18004.mail.tpe.yahoo.com>

Besides the SGE homepage
(http://gridengine.sunsource.net) for HOWTOs, docs,
and news, a new site just released:

http://www.gridengine.info/

It's written by an SGE user outside of Sun.

Andrew.


___________________________________________________  ?????? Yahoo!???????r???? 7.0 beta?????M?W??????????????  http://messenger.yahoo.com.tw/beta.html


From eric at bootseg.com  Mon Oct  3 15:23:17 2005
From: eric at bootseg.com (Eric Kerin)
Date: Mon, 03 Oct 2005 11:23:17 -0400
Subject: [Linux-cluster] Re: rgmanager dieing with no messages [was: Re: SMP
	and GFS]
In-Reply-To: <1128265567.23136.21.camel@saloon.neofreak.org>
References: <04f401c588b6$b31db0a0$5001a8c0@spa.isqsolutions.com>
	<20051002102305.GD13944@neu.nirvana>
	<1128265567.23136.21.camel@saloon.neofreak.org>
Message-ID: <1128352997.3504.9.camel@auh5-0479.corp.jabil.org>

On Sun, 2005-10-02 at 11:06 -0400, DeadManMoving wrote:
> My cluster is highly instable, just this morning i've realized that
> the clurgmgrd deamon was dead...

I'm having this same problem on my cluster, I've been planning on
enabling core dumps for rgmanager once I find a few minutes to restart
the cluster services. With any luck, that will be today.

Eric Kerin
eric at bootseg.com


From mwill at penguincomputing.com  Mon Oct  3 15:11:40 2005
From: mwill at penguincomputing.com (Michael Will)
Date: Mon, 03 Oct 2005 08:11:40 -0700
Subject: [Linux-cluster] Export directory via gnbd
In-Reply-To: <433F9BCD.2020303@gmail.com>
References: <433F9BCD.2020303@gmail.com>
Message-ID: <43414A2C.2020906@penguincomputing.com>

carlopmart at gmail.com wrote:

> Hi all,
>
>  Is it possible to export directorys via gnbd?? GNBD docs only 
> descrives how to export files or partitons ...
>
There is a difference between block-level and file-level export of storage.

block level:

Just like iSCSI would, GNBD does export one chunk of data without
knowing anything about the structure of what you write to it. The client can
use it as if it was a local disk (with some limitations), which means it can
read and write blocks of data. It could be mysql writing database data, 
or it
could be the OS writing a filesystem on it. The GNBD server does not know
or care.  NBD=network block device.

Two clients could access the same blockdevice read-write, but you would need
a network protocol that negotiates locking and caching so that the two 
separate
clients don't step over each others data. GFS is one for filesystems, 
mysql-cluster
implements the same for a relational database.

file-level:

Just like a NAS or an NFS-server, the server exports a filesystem to 
clients.
So instead of requesting read/write of blocks of data, the clients 
requesting
listing directories, locking, reading and writing to files. 

So to answer your question 'how do I export a directory via gnbd' you might
have to reword it with the above clarification. Instead of NBD you can use
NFS to export a directory to multiple clients, or GFS if you plan to 
export from
multiple machines to multiple clients.

Michael

-- 
Michael Will
Penguin Computing Corp.
Sales Engineer
415-954-2822
415-954-2899 fx
mwill at penguincomputing.com 


From lhh at redhat.com  Mon Oct  3 16:19:05 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 03 Oct 2005 12:19:05 -0400
Subject: [Linux-cluster] RHEL4 active/active cluster without  GFS
In-Reply-To: <433F0681.4020700@lexum.umontreal.ca>
References: <433F0681.4020700@lexum.umontreal.ca>
Message-ID: <1128356346.27430.82.camel@ayanami.boston.redhat.com>

On Sat, 2005-10-01 at 17:58 -0400, FM wrote:
> Hello everybody,
> First post here ... for my first cluster attempt. I do not need GFS
> 
> I'm trying to install linux-ha (RPMS from 
> http://www.ultramonkey.org/download/heartbeat/). But the installation 
> always fails because of ipvsadmin missing.
> 
> I read that ipvs is in the kernel, so I check in the default .conf and 
> ipvs is enable. Ho can I install ipvsadmin ?

IPVS is indeed in the kernel.

ipvsadm is a user-land package which is used to administer the kernel
parts of IPVS - you can grab the source RPM here:

ftp://ftp.redhat.com/pub/redhat/linux/enterprise/4/en/RHCS/i386/SRPMS/ipvsadm-1.24-6.src.rpm

It's needed to control the IPVS director.


> Is there a Red Hat way to create cluster without GFS ?

Yes, Red Hat Cluster Suite.

-- Lon


From teigland at redhat.com  Mon Oct  3 16:51:51 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 3 Oct 2005 11:51:51 -0500
Subject: [Linux-cluster] Re: SMP and GFS
In-Reply-To: <20051002102305.GD13944@neu.nirvana>
References: <04f401c588b6$b31db0a0$5001a8c0@spa.isqsolutions.com>
	<20051002102305.GD13944@neu.nirvana>
Message-ID: <20051003165151.GB16574@redhat.com>

On Sun, Oct 02, 2005 at 12:23:05PM +0200, Axel Thimm wrote:
> On Thu, Jul 14, 2005 at 04:57:51PM -0400, Manuel Bujan wrote:
> > Jul 14 14:19:35 atmail-2 kernel: gfs001 (2023) req reply einval
> > d6c30333 fr 1 r 1        2
> > Jul 14 14:19:35 atmail-2 kernel: gfs001 send einval to 1
> > Jul 14 14:19:35 atmail-2 last message repeated 2 times

> I found similar log sniplets on a RHEL4U1 machine with dual Xeons (HP
> Proliant). The machine crashed with a kernel panic shortly after
> telling the other nodes to leave the cluster (sorry the staff was
> under pressure and noone wrote down the panic's output):
> 
> Sep 30 05:08:11 zs01 kernel: nval to 1 (P:kernel)
> Sep 30 05:08:11 zs01 kernel: data send einval to 1 (P:kernel)
> Sep 30 05:08:11 zs01 kernel: Magma send einval to 1 (P:kernel)
> Sep 30 05:08:11 zs01 kernel: data send einval to 1 (P:kernel)
> Sep 30 05:08:11 zs01 kernel: Magma send einval to 1 (P:kernel)

These "einval" messages from the dlm are not necessarily bad and are not
directly related to the "removing from cluster" messages below.  The
einval conditions above can legitimately occur during normal operation and
the dlm should be able to deal with them.  Specifically they mean that:

  1. node A is told that the lock master for resource R is node B
  2. the last lock is removed from R on B
  3. B gives up mastery of R
  4. A sends lock request to B
  5. B doesn't recognize R and returns einval to A
  6. A starts over

The message "send einval to..." is printed on B in step 5.
The message "req reply einval..." is printed on A in step 6.

This is an unfortunate situation, but not lethal.  That said, a spike in
these messages may indicate that something is amiss (and that a "removing
from cluster" may be on the way).  Or, maybe the gfs load has struck the
dlm in a particularly sore way.

> Sep 30 05:08:33 zs03 kernel: CMAN: removing node zs02 from the cluster :
> Missed too many heartbeats (P:kernel)
> Sep 30 05:08:39 zs03 kernel: CMAN: removing node zs01 from the cluster :
> No response to messages (P:kernel)

After this happens, the dlm will often return an error (like -EINVAL) to
lock_dlm.  It's not the same thing as above.  Lock_dlm will always panic
at that point since it can no longer acquire locks for gfs.

Dave


From lhh at redhat.com  Mon Oct  3 17:20:58 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 03 Oct 2005 13:20:58 -0400
Subject: [Openais] Re: [Linux-cluster] new userland cman
In-Reply-To: <4340D983.7080106@redhat.com>
References: <433D4134.6080608@redhat.com>
	<1128109200.8440.14.camel@unnamed.az.mvista.com>
	<4340D983.7080106@redhat.com>
Message-ID: <1128360058.27430.99.camel@ayanami.boston.redhat.com>

On Mon, 2005-10-03 at 08:10 +0100, Patrick Caulfield wrote:

> >>neutral
> >>-------
> >>- Always uses multicast (no broadcast). A default multicast address is supplied
> >>if none is given
> > 
> > 
> > If broadcast is important, which I guess it may be, we can pretty easily
> > add this support...
> > 
> 
> I was going to look into this but I doubt its really worth it. It's just any
> extra complication and will only apply to IPv4 anyway.

I think broadcast is quite important, actually - although I also think
that it should *not* be the default.

Multicast doesn't always work very well (in practice) on existing
networks, and works poorly (if at all) over things like crossover
ethernet cables and hub-based private networks.  You know, the cheap
stuff hackers use in their houses to play with cluster software ;)

Broadcast is far more likely to work out of the box in the above cases,
and isn't hard to implement (... actually, it's easier than multicast).

Also, IPv6 isn't what I'd call "mainstream" just yet, so supporting all
the hacks we can with IPv4 isn't necessarily a bad thing ;)

-- Lon


From Axel.Thimm at ATrpms.net  Mon Oct  3 17:35:46 2005
From: Axel.Thimm at ATrpms.net (Axel Thimm)
Date: Mon, 3 Oct 2005 19:35:46 +0200
Subject: [Linux-cluster] Re: SMP and GFS
In-Reply-To: <20051003165151.GB16574@redhat.com>
References: <04f401c588b6$b31db0a0$5001a8c0@spa.isqsolutions.com>
	<20051002102305.GD13944@neu.nirvana>
	<20051003165151.GB16574@redhat.com>
Message-ID: <20051003173546.GT10393@neu.nirvana>

On Mon, Oct 03, 2005 at 11:51:51AM -0500, David Teigland wrote:
> On Sun, Oct 02, 2005 at 12:23:05PM +0200, Axel Thimm wrote:
> > On Thu, Jul 14, 2005 at 04:57:51PM -0400, Manuel Bujan wrote:
> > > Jul 14 14:19:35 atmail-2 kernel: gfs001 (2023) req reply einval
> > > d6c30333 fr 1 r 1        2
> > > Jul 14 14:19:35 atmail-2 kernel: gfs001 send einval to 1
> > > Jul 14 14:19:35 atmail-2 last message repeated 2 times
> 
> > I found similar log sniplets on a RHEL4U1 machine with dual Xeons (HP
> > Proliant). The machine crashed with a kernel panic shortly after
> > telling the other nodes to leave the cluster (sorry the staff was
> > under pressure and noone wrote down the panic's output):
> > 
> > Sep 30 05:08:11 zs01 kernel: nval to 1 (P:kernel)
> > Sep 30 05:08:11 zs01 kernel: data send einval to 1 (P:kernel)
> > Sep 30 05:08:11 zs01 kernel: Magma send einval to 1 (P:kernel)
> > Sep 30 05:08:11 zs01 kernel: data send einval to 1 (P:kernel)
> > Sep 30 05:08:11 zs01 kernel: Magma send einval to 1 (P:kernel)
> 
> These "einval" messages from the dlm are not necessarily bad and are not
> directly related to the "removing from cluster" messages below.  The
> einval conditions above can legitimately occur during normal operation and
> the dlm should be able to deal with them.  Specifically they mean that:
> 
>   1. node A is told that the lock master for resource R is node B
>   2. the last lock is removed from R on B
>   3. B gives up mastery of R
>   4. A sends lock request to B
>   5. B doesn't recognize R and returns einval to A
>   6. A starts over
> 
> The message "send einval to..." is printed on B in step 5.
> The message "req reply einval..." is printed on A in step 6.
> 
> This is an unfortunate situation, but not lethal.  That said, a spike in
> these messages may indicate that something is amiss (and that a "removing
> from cluster" may be on the way).  Or, maybe the gfs load has struck the
> dlm in a particularly sore way.
> 
> > Sep 30 05:08:33 zs03 kernel: CMAN: removing node zs02 from the cluster :
> > Missed too many heartbeats (P:kernel)
> > Sep 30 05:08:39 zs03 kernel: CMAN: removing node zs01 from the cluster :
> > No response to messages (P:kernel)
> 
> After this happens, the dlm will often return an error (like -EINVAL) to
> lock_dlm.  It's not the same thing as above.  Lock_dlm will always panic
> at that point since it can no longer acquire locks for gfs.

At the time all of this happened, the three node cluster zs01 to zs03
had only zs01 active with nfs and samba exports (both with neglidgible
activity at that time of the day) and a proprietary backup solution
(vertias' netbackup). The latter created a network traffic of 25-30
MB/sec of the interface the cluster heartbeat was also running on. The
backup was running for a couple of hours already.

Can that we the root of evil? Delayed or dropped UDP cman packages?

Can the same scanario explain the (silent!) desyncing of GFS later on,
after all nodes were rebooted?
-- 
Axel.Thimm at ATrpms.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051003/c5cbe42b/attachment.sig>

From joe.fernandez at hp.com  Mon Oct  3 20:17:49 2005
From: joe.fernandez at hp.com (Fernandez, Joe (HP Systems))
Date: Tue, 4 Oct 2005 06:17:49 +1000
Subject: [Linux-cluster] Please remove
Message-ID: <C6D4E314224DD348B00AF127279427E002036961@mbbexc01.asiapacific.cpqcorp.net>

Hi,
 
Could you please remove me off the list, thank you. 
 

Regards,

Joe Fernandez

HP Systems
Hewlett-Packard Australia
Ph. 61.3 8804 7308
Mob. 61.412 830 066


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051004/e0463cfe/attachment.htm>

From lhh at redhat.com  Mon Oct  3 20:21:46 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 03 Oct 2005 16:21:46 -0400
Subject: [Openais] Re: [Linux-cluster] new userland cman
In-Reply-To: <1128370379.30850.3.camel@unnamed.az.mvista.com>
References: <433D4134.6080608@redhat.com>
	<1128109200.8440.14.camel@unnamed.az.mvista.com>
	<4340D983.7080106@redhat.com>
	<1128360058.27430.99.camel@ayanami.boston.redhat.com>
	<1128370379.30850.3.camel@unnamed.az.mvista.com>
Message-ID: <1128370906.27430.130.camel@ayanami.boston.redhat.com>

On Mon, 2005-10-03 at 13:12 -0700, Steven Dake wrote:

> > Broadcast is far more likely to work out of the box in the above cases,
> > and isn't hard to implement (... actually, it's easier than multicast).
> > 
> 
> Adding this should just be a few lines of code.  I'll see if I can work
> out a patch today.

Nice.

-- Lon


From linux4dave at gmail.com  Mon Oct  3 21:49:10 2005
From: linux4dave at gmail.com (dave first)
Date: Mon, 3 Oct 2005 14:49:10 -0700
Subject: [Linux-cluster] PVFS going Wild
Message-ID: <207649d0510031449t52eee8c6je2949869bc4f552@mail.gmail.com>

Hey Guys,

I just took over a couple of clusters for a sysadmin that left the company.
Unfortunately, the hand-off was less than informative. <sigh> So, I've got
an old linux cluster, still well-used, with a PVFS filesystem mounted at
/work. I'm new to clustering, and I sure as hell don't know much about it,
but I've got a sick puppy here. All points to the PVFS filesystem.

lsof: WARNING: can't stat() pvfs file system /work
Output information may be incomplete.


In /var/log/messages:

Oct 3 13:51:34 elvis PAM_pwdb[24431]: (su) session opened for user deb_r by
deb(uid=2626)
Oct 3 13:51:49 elvis kernel: (./ll_pvfs.c, 361): ll_pvfs_getmeta failed on
downcall for 192.168.1.102:300 <http://192.168.1.102:300>
0/pvfs-meta
Oct 3 13:51:49 elvis kernel: (./ll_pvfs.c, 361): ll_pvfs_getmeta failed on
downcall for 192.168.1.102:300 <http://192.168.1.102:300>
0/pvfs-meta/manaa/DFTBNEW
Oct 3 14:16:48 elvis kernel: (./ll_pvfs.c, 409): ll_pvfs_statfs failed on
downcall for 192.168.1.102:3000 <http://192.168.1.102:3000>
/pvfs-meta
Oct 3 14:16:elvis kernel: (./inode.c, 321): pvfs_statfs failed

So the

Linux elvis 2.2.19-13.beosmp #1 SMP Tue Aug 21 20:04:44 EDT 2001 i686
unknown

Red Hat Linux release 6.2 (Zoot)

Can't access /work from the master or any nodes,

elvis [49#] ls /work
ls: /work: Too many open files


I ran a script in /usr/bin called pvfs_client_stop.sh - which killed all the
pvfs daemons, etc

#!/bin/tcsh

# Phil Carns
# pcarns at hubcap.clemson.edu
#
# This is an example script for how to get Scyld Beowulf cluster nodes
# to mount a PVFS file system.

set PVFSD = "/usr/sbin/pvfsd"
set PVFSMOD = "pvfs"
set PVFS_CLIENT_MOUNT_DIR = "/work"
set MOUNT_PVFS = "/sbin/mount.pvfs"

# unmount the file system locally and on all of the slave nodes
/bin/umount $PVFS_CLIENT_MOUNT_DIR
bpsh -pad /bin/umount $PVFS_CLIENT_MOUNT_DIR

# kill all of the pvfsd client daemons
/usr/bin/killall pvfsd

# remove the pvfs module on the local and the slave nodes
/sbin/rmmod $PVFSMOD
bpsh -pad /sbin/rmmod $PVFSMOD

Then I ran pvfs_client_start.sh /work, which seemed to work, except it never
exited...

#!/bin/tcsh

# Phil Carns
# pcarns at hubcap.clemson.edu
#
# This is an example script for how to get Scyld Beowulf cluster nodes
# to mount a PVFS file system.

set PVFSD = "/usr/sbin/pvfsd"
set PVFSMOD = "pvfs"
set PVFS_CLIENT_MOUNT_DIR = "/work"
set MOUNT_PVFS = "/sbin/mount.pvfs"
set PVFS_META_DIR = `bpctl -M -a`:$1

if $1 == "" then
 echo "usage: pvfs_client_start.sh <meta dir>"
 echo "(Causes every machine in the cluster to mount the PVFS file system)"
 exit -1
endif

# insert the pvfs module on the local and slave nodes
/sbin/modprobe $PVFSMOD
bpsh -pad /sbin/modprobe $PVFSMOD

# start the pvfsd client daemon on the local and slave nodes
$PVFSD
bpsh -pad $PVFSD

# actually mount the file system locally and on all of the slave nodes
$MOUNT_PVFS $PVFS_META_DIR $PVFS_CLIENT_MOUNT_DIR
bpsh -pad $MOUNT_PVFS $PVFS_META_DIR $PVFS_CLIENT_MOUNT_DIR


This seemed to work (well, it restarted daemons and such, but I still can't
get into /work and getting resource busy and:

mount.pvfs: Device or resource busy
mount.pvfs: server 192.168.1.102 <http://192.168.1.102> alive, but mount
failed (invalid metadata directory name?)

Comments? Useful ideas? A good joke???

dave
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051003/84992ef7/attachment.htm>

From cboudjnah at squiz.net  Mon Oct  3 23:27:58 2005
From: cboudjnah at squiz.net (Chmouel Boudjnah)
Date: Tue, 04 Oct 2005 09:27:58 +1000
Subject: [Linux-cluster] GFS crash
Message-ID: <1128382078.9653.8.camel@paris.squiz.net>

Hello,

I had a crash on a server using GFS-6.1 with kernel 2.6.9-11.ELsmp, i am
using GFS with an AOE SAN drive. 

I am not sure if the problem is with AOE SAN or with GFS would be great
to tell me so i can redirect the bug report to the CORAID people.

So i have first in the logs some weird stuff about sataide (i am not
sure if the SAN is using that) :

Sep 30 17:43:20 srv kernel: e send einval to 2
Sep 30 17:43:20 srv kernel: sataide send einval to 2
Sep 30 17:43:20 srv last message repeated 38 times
Sep 30 17:43:20 srv kernel: sataide unlock ff050383 no id
Sep 30 17:43:20 srv kernel: 231834 id 0 -1,3 1
Sep 30 17:43:20 srv kernel: 7814 qc 2,59f30e -1,5 id ffbe0378 sts 0 0
Sep 30 17:43:20 srv kernel: 19531 lk 5,59f30e id 0 -1,3 0
Sep 30 17:43:20 srv kernel: 4189 lk 2,2ed6bc id 0 -1,3 10001
Sep 30 17:43:20 srv kernel: 7814 qc 5,231834 -1,3 id 5dc0124 sts 0 0
Sep 30 17:43:20 srv kernel: 7814 qc 5,59f30e -1,3 id 27b00cf sts 0 0
Sep 30 17:43:20 srv kernel: 4189 lk 5,2ed6bc id 0 -1,3 1
Sep 30 17:43:20 srv kernel: 7814 qc 2,2ed6bc -1,3 id 1c0202 sts 0 0
Sep 30 17:43:20 srv kernel: 4189 lk 2,2903b3 id 0 -1,3 10001
Sep 30 17:43:20 srv kernel: 7814 qc 5,2ed6bc -1,3 id 227032a sts 0 0
Sep 30 17:43:20 srv kernel: 4189 lk 5,2903b3 id 0 -1,3 1
Sep 30 17:43:20 srv kernel: 7814 qc 2,2903b3 -1,3 id 23c036d sts 0 0
Sep 30 17:43:20 srv kernel: 4189 lk 2,2ba987 id 0 -1,3 10001
Sep 30 17:43:20 srv kernel: 4189 lk 5,2ba987 id 0 -1,3 1
Sep 30 17:43:20 srv kernel: 7814 qc 2,2ba987 -1,3 id 3ab033c sts 0 0
Sep 30 17:43:20 srv kernel: 7814 qc 5,2903b3 -1,3 id 1c80004 sts 0 0
Sep 30 17:43:20 srv kernel: 4189 lk 2,2ce731 id 0 -1,3 10001
Sep 30 17:43:20 srv kernel: 10052 lk 2,500e75 id 0 -1,5 0
Sep 30 17:43:20 srv kernel: 4189 lk 5,2ce731 id 0 -1,3 1
Sep 30 17:43:20 srv kernel: 7814 qc 5,2ba987 -1,3 id 1f003a sts 0 0
Sep 30 17:43:20 srv kernel: 7814 qc 2,2ce731 -1,3 id ff74033d sts 0 0
Sep 30 17:43:20 srv kernel: 19531 lk 5,500e74 id ffd101bd 3,5 805
Sep 30 17:43:20 srv kernel: 7814 qc 5,500e74 3,5 id ffd101bd sts 0 0
Sep 30 17:43:20 srv kernel: 7814 qc 2,500e75 -1,5 id 1660224 sts 0 0
Sep 30 17:43:20 srv kernel: 10052 lk 5,500e75 id 0 -1,3 0
Sep 30 17:43:20 srv kernel: 7814 qc 5,500e75 -1,3 id 3210323 sts 0 0
Sep 30 17:43:20 srv kernel: 29523 lk 2,217df id 0 -1,3 10000
Sep 30 17:43:20 srv kernel: 7814 qc 2,217df -1,3 id 5019b sts 0 0
Sep 30 17:43:20 srv kernel: 29523 lk 5,217df id 0 -1,3 0
Sep 30 17:43:21 srv kernel: 7814 qc 5,217df -1,3 id 2ae0267 sts 0 0
Sep 30 17:43:21 srv kernel: 7814 qc 5,2ce731 -1,3 id 7d0232 sts 0 0
Sep 30 17:43:21 srv kernel: 4189 lk 2,263a00 id 0 -1,3 10001
Sep 30 17:43:21 srv kernel: 7814 qc 2,263a00 -1,3 id 12700c3 sts 0 0
Sep 30 17:43:21 srv kernel: 4189 lk 5,263a00 id 0 -1,3 1
Sep 30 17:43:21 srv kernel: 4189 lk 2,2c446d id 0 -1,3 10001
Sep 30 17:43:21 srv kernel: 7814 qc 5,263a00 -1,3 id ffc00230 sts 0 0
Sep 30 17:43:21 srv kernel: 4189 lk 5,2c446d id 0 -1,3 1
Sep 30 17:43:21 srv kernel: 7814 qc 2,2c446d -1,3 id 34903b4 sts 0 0
Sep 30 17:43:21 srv kernel: 4189 lk 2,1e7a15 id 0 -1,3 10001
Sep 30 17:43:21 srv kernel: 7814 qc 5,2c446d -1,3 id fea901a1 sts 0 0
Sep 30 17:43:21 srv kernel: 4189 lk 5,1e7a15 id 0 -1,3 1


and the crash of GFS just after :

Sep 30 17:43:22 srv kernel: lock_dlm:  Assertion failed on line 353 of
file /usr/src/build/574067-i686/BUILD/smp/src/dlm/lock.c
Sep 30 17:43:22 srv kernel: lock_dlm:  assertion:  "!error"
Sep 30 17:43:22 srv kernel: lock_dlm:  time = 2509316164
Sep 30 17:43:22 srv kernel: sataide: error=-22 num=5,5bf2f1 lkf=801
flags=84
Sep 30 17:43:22 srv kernel:
Sep 30 17:43:22 srv kernel: ------------[ cut here ]------------
Sep 30 17:43:22 srv kernel: kernel BUG
at /usr/src/build/574067-i686/BUILD/smp/src/dlm/lock.c:353!
Sep 30 17:43:22 srv kernel: invalid operand: 0000 [#1]
Sep 30 17:43:22 srv kernel: SMP
Sep 30 17:43:22 srv kernel: Modules linked in: lock_dlm(U) aoe(U) gfs(U)
lock_harness(U) dlm(U) cman(U) md5 ipv6 joydev button battery
ac uhci_hcd ehci_hcd e1000 floppy sg dm_snapshot dm_zero dm_mirror ext3
jbd dm_mod mptscsih mptbase sd_mod scsi_mod
Sep 30 17:43:22 srv kernel: CPU:    0
Sep 30 17:43:22 srv kernel: EIP:    0060:[<f8b5360d>]    Not tainted VLI
Sep 30 17:43:22 srv kernel: EFLAGS: 00010246   (2.6.9-11.ELsmp)
Sep 30 17:43:22 srv kernel: EIP is at do_dlm_unlock+0xaa/0xbf [lock_dlm]
Sep 30 17:43:22 srv kernel: eax: 00000001   ebx: ffffffea   ecx:
f63f5f04   edx: f8b5809e
Sep 30 17:43:22 srv kernel: esi: cb3ac080   edi: cb3ac080   ebp:
f8b1d000   esp: f63f5f00
Sep 30 17:43:23 srv kernel: ds: 007b   es: 007b   ss: 0068
Sep 30 17:43:23 srv kernel: Process lock_dlm1 (pid: 7818,
threadinfo=f63f5000 task=f75bb0b0)
Sep 30 17:43:23 srv kernel: Stack: f8b5809e f8b1d000 00000003 f8b538c0
f8ab24f2 00000001 dcbdb3c0 dcbdb3a4
Sep 30 17:43:23 srv kernel:        f8aa8852 f8add0c0 d73b9e80 dcbdb3a4
f8add0c0 cb3ac080 f8aa7d4b dcbdb3a4
Sep 30 17:43:23 srv kernel:        00000001 00000001 f8aa7e02 dcbdb3c0
dcbdb3a4 f8aa99af cb3ac080 f7d50e00
Sep 30 17:43:23 srv kernel: Call Trace:
Sep 30 17:43:23 srv kernel:  [<f8b538c0>] lm_dlm_unlock+0x14/0x1c
[lock_dlm]
Sep 30 17:43:23 srv kernel:  [<f8ab24f2>] gfs_lm_unlock+0x2c/0x42 [gfs]
Sep 30 17:43:23 srv kernel:  [<f8aa8852>] gfs_glock_drop_th+0xf3/0x12d
[gfs]
Sep 30 17:43:23 srv kernel:  [<f8aa7d4b>] rq_demote+0x7f/0x98 [gfs]
Sep 30 17:43:23 srv kernel:  [<f8aa7e02>] run_queue+0x5a/0xc1 [gfs]
Sep 30 17:43:23 srv kernel:  [<f8aa99af>] blocking_cb+0x39/0x7a [gfs]
Sep 30 17:43:23 srv kernel:  [<f8b5727b>] process_blocking+0x90/0x93
[lock_dlm]
Sep 30 17:43:23 srv kernel:  [<f8b578c8>] dlm_async+0x28b/0x2ff
[lock_dlm]
Sep 30 17:43:23 srv kernel:  [<c011dc6f>] default_wake_function+0x0/0xc
Sep 30 17:43:23 srv kernel:  [<c011dc6f>] default_wake_function+0x0/0xc
Sep 30 17:43:23 srv kernel:  [<f8b5763d>] dlm_async+0x0/0x2ff [lock_dlm]
Sep 30 17:43:23 srv kernel:  [<c0132e31>] kthread+0x73/0x9b
Sep 30 17:43:23 srv kernel:  [<c0132dbe>] kthread+0x0/0x9b
Sep 30 17:43:23 srv kernel:  [<c01041f1>] kernel_thread_helper+0x5/0xb
Sep 30 17:43:23 srv kernel: Code: 76 34 8b 06 ff 76 2c ff 76 08 ff 76 04
ff 76 0c 53 ff 70 18 68 a9 81 b5 f8 e8 d6 e3 5c c7 83 c4 34 68 9e 80 b5
f8 e8 c9 e3 5c c7 <0f> 0b 61 01 ef 7f b5 f8 68 a0 80 b5 f8 e8 84 db 5c
c7 5b 5e c3
Sep 30 17:43:23 srv kernel:  <0>Fatal exception: panic in 5 seconds


Cheers, Chmouel.


-- 
Chmouel Boudjnah - Squiz.net - http://www.squiz.net


From jnewbigin at ict.swin.edu.au  Tue Oct  4 04:15:51 2005
From: jnewbigin at ict.swin.edu.au (John Newbigin)
Date: Tue, 04 Oct 2005 14:15:51 +1000
Subject: [Linux-cluster] GFS-6.0.2.27 issues
Message-ID: <434201F7.3050107@ict.swin.edu.au>

Is anyone seeing this with GFS-6.0.2.27 (EL3):
ldconfig: /usr/lib/libgulm.so.6 is not a symbolic link

The cause is indeed as the message says.
/usr/lib/libgulm.so  /usr/lib/libgulm.so.6  /usr/lib/libgulm.so.6.0.2 
all appear the be the same file, rather than symlinks to the 6.0.2 
version file.

It all works OK, just makes installing updates print out the error 
message as each packages runs ldconfig.

I also have files in /usr/lib/debug which is not a problem but i wonder 
if they need to be there.

John.

-- 
John Newbigin
Computer Systems Officer
Faculty of Information and Communication Technologies
Swinburne University of Technology
Melbourne, Australia
http://www.ict.swin.edu.au/staff/jnewbigin


From jnewbigin at ict.swin.edu.au  Tue Oct  4 04:29:38 2005
From: jnewbigin at ict.swin.edu.au (John Newbigin)
Date: Tue, 04 Oct 2005 14:29:38 +1000
Subject: [Linux-cluster] Errata page duplicates
Message-ID: <43420532.9060709@ict.swin.edu.au>

On the page http://rhn.redhat.com/errata/RHBA-2005-723.html all the 
files seem to be listed in triplicate.

There does not seem to be a report a bug in this errata email address so 
I figure someone on this list will know what to do.

John.

-- 
John Newbigin
Computer Systems Officer
Faculty of Information and Communication Technologies
Swinburne University of Technology
Melbourne, Australia
http://www.ict.swin.edu.au/staff/jnewbigin


From tom-fedora at kofler.eu.org  Tue Oct  4 11:52:28 2005
From: tom-fedora at kofler.eu.org (Thomas Kofler)
Date: Tue,  4 Oct 2005 13:52:28 +0200
Subject: [Linux-cluster] FC4 kernel-2.6.13-1.1526_FC4 & GFS-kernel package
	mismatch ?
Message-ID: <1128426748.43426cfc13ec5@mail.devcon.cc>

Hi,

we noticed, that after running "yum update" yesterday on our system, that cman 
didn't start up any longer.

We investigated the problem and it depends on the kernel version, if we boot 
with the old 1447 - kernel, cman and the related services startup fine.

The newest GFS-kernel places the modules under /lib/modules/2.6.12-1.1447_FC4

But the kernel itself kernel-2.6.13-1.1526_FC4 has of 
course /lib/modules/2.6.13-1.1526_FC4 as its module path.

Is it a bug or is there to do something by hand? If not, I would open a bug on 
bugzilla, but under which section: kernel or GFS-kernel - which package team 
forget the dependency ?

Thanks for feedback,
Thomas


kernel-2.6.13-1.1526_FC4 
modules: /lib/modules/2.6.13-1.1526_FC4

GFS-kernel-2.6.11.8-20050601.152643.FC4.14
/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs
/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs/gfs.ko
/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking
/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_dlm
/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_dlm/lock_dlm.ko
/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_gulm
/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_gulm/lock_gulm.ko
/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_harness
/lib/modules/2.6.12-
1.1447_FC4/kernel/fs/gfs_locking/lock_harness/lock_harness.ko
/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_nolock
/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_nolock/lock_nolock.ko

 
From sgray at bluestarinc.com  Tue Oct  4 13:16:43 2005
From: sgray at bluestarinc.com (Sean Gray)
Date: Tue, 04 Oct 2005 09:16:43 -0400
Subject: [Linux-cluster] FC4 kernel-2.6.13-1.1526_FC4 & GFS-kernel
	package mismatch ?
In-Reply-To: <1128426748.43426cfc13ec5@mail.devcon.cc>
References: <1128426748.43426cfc13ec5@mail.devcon.cc>
Message-ID: <1128431803.31466.4286.camel@localhost.localdomain>

Thomas,

Why not grab the srpms and recompile the rpms? I have some notes on my
experience compiling srpms for RHEL4 x86_64 2.6.9-11, they me be of
assistance as it was easier said than done.

Sean

On Tue, 2005-10-04 at 13:52 +0200, Thomas Kofler wrote:

> Hi,
> 
> we noticed, that after running "yum update" yesterday on our system, that cman 
> didn't start up any longer.
> 
> We investigated the problem and it depends on the kernel version, if we boot 
> with the old 1447 - kernel, cman and the related services startup fine.
> 
> The newest GFS-kernel places the modules under /lib/modules/2.6.12-1.1447_FC4
> 
> But the kernel itself kernel-2.6.13-1.1526_FC4 has of 
> course /lib/modules/2.6.13-1.1526_FC4 as its module path.
> 
> Is it a bug or is there to do something by hand? If not, I would open a bug on 
> bugzilla, but under which section: kernel or GFS-kernel - which package team 
> forget the dependency ?
> 
> Thanks for feedback,
> Thomas
> 
> 
> kernel-2.6.13-1.1526_FC4 
> modules: /lib/modules/2.6.13-1.1526_FC4
> 
> GFS-kernel-2.6.11.8-20050601.152643.FC4.14
> /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs
> /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs/gfs.ko
> /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking
> /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_dlm
> /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_dlm/lock_dlm.ko
> /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_gulm
> /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_gulm/lock_gulm.ko
> /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_harness
> /lib/modules/2.6.12-
> 1.1447_FC4/kernel/fs/gfs_locking/lock_harness/lock_harness.ko
> /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_nolock
> /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_nolock/lock_nolock.ko
> 
>  
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 

Sean N. Gray
Director of Information Technology
United Radio Incorporated, DBA BlueStar
24 Spiral Drive
Florence, Kentucky 41042
office: 859.371.4423 x263
toll free: 800.371.4423 x263
fax: 859.371.4425
mobile: 513.616.3379
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051004/bfd28a95/attachment.htm>

From tom-fedora at kofler.eu.org  Tue Oct  4 13:33:10 2005
From: tom-fedora at kofler.eu.org (Thomas Kofler)
Date: Tue,  4 Oct 2005 15:33:10 +0200
Subject: [Linux-cluster] FC4 kernel-2.6.13-1.1526_FC4 & GFS-kernel package
	mismatch ?
In-Reply-To: <1128431803.31466.4286.camel@localhost.localdomain>
References: <1128426748.43426cfc13ec5@mail.devcon.cc>
	<1128431803.31466.4286.camel@localhost.localdomain>
Message-ID: <1128432789.43428496039e6@mail.devcon.cc>


> Why not grab the srpms and recompile the rpms? 

In theory no problem, but imagine the default behaviour of a user.
yum update the system and yum install the packages.

And the cluster/GFS system will fail to start up, so its an annoying bug in my 
opinion.

Regards,
Thomas
 

From mailinglists at marvin-lists.freaks.de  Tue Oct  4 15:50:29 2005
From: mailinglists at marvin-lists.freaks.de (Christian Niessner)
Date: Tue, 04 Oct 2005 17:50:29 +0200
Subject: [Linux-cluster] cluster service development - question
Message-ID: <1128441029.25994.125.camel@phanara.ai.arno.vpn>

hi,

i'm currently developing a cluster service for the cluster release 
1.00.00. It's using libmagma for messaging and node membership 
maintainance, and these parts work really well.. But i also have to
maintain a list of all configured nodes.

What is the 'best practice' to get this list? (Node id and name would be
fine...) It doesn't seem do be possible with libmagma...

Thanks,
	chris


From cfeist at redhat.com  Tue Oct  4 16:30:17 2005
From: cfeist at redhat.com (Chris Feist)
Date: Tue, 04 Oct 2005 11:30:17 -0500
Subject: [Linux-cluster] FC4 kernel-2.6.13-1.1526_FC4 & GFS-kernel package
	mismatch ?
In-Reply-To: <1128426748.43426cfc13ec5@mail.devcon.cc>
References: <1128426748.43426cfc13ec5@mail.devcon.cc>
Message-ID: <4342AE19.1070306@redhat.com>

We should have updated rpms in the -test tree shortly, and then if no problems 
are reported they'll be moved to the standard tree.

Thanks,
Chris

Thomas Kofler wrote:
> Hi,
> 
> we noticed, that after running "yum update" yesterday on our system, that cman 
> didn't start up any longer.
> 
> We investigated the problem and it depends on the kernel version, if we boot 
> with the old 1447 - kernel, cman and the related services startup fine.
> 
> The newest GFS-kernel places the modules under /lib/modules/2.6.12-1.1447_FC4
> 
> But the kernel itself kernel-2.6.13-1.1526_FC4 has of 
> course /lib/modules/2.6.13-1.1526_FC4 as its module path.
> 
> Is it a bug or is there to do something by hand? If not, I would open a bug on 
> bugzilla, but under which section: kernel or GFS-kernel - which package team 
> forget the dependency ?
> 
> Thanks for feedback,
> Thomas
> 
> 
> kernel-2.6.13-1.1526_FC4 
> modules: /lib/modules/2.6.13-1.1526_FC4
> 
> GFS-kernel-2.6.11.8-20050601.152643.FC4.14
> /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs
> /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs/gfs.ko
> /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking
> /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_dlm
> /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_dlm/lock_dlm.ko
> /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_gulm
> /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_gulm/lock_gulm.ko
> /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_harness
> /lib/modules/2.6.12-
> 1.1447_FC4/kernel/fs/gfs_locking/lock_harness/lock_harness.ko
> /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_nolock
> /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_nolock/lock_nolock.ko
> 
>  
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From lhh at redhat.com  Tue Oct  4 16:40:54 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 04 Oct 2005 12:40:54 -0400
Subject: [Linux-cluster] cluster service development - question
In-Reply-To: <1128441029.25994.125.camel@phanara.ai.arno.vpn>
References: <1128441029.25994.125.camel@phanara.ai.arno.vpn>
Message-ID: <1128444054.27430.169.camel@ayanami.boston.redhat.com>

On Tue, 2005-10-04 at 17:50 +0200, Christian Niessner wrote:
> hi,
> 
> i'm currently developing a cluster service for the cluster release 
> 1.00.00. It's using libmagma for messaging and node membership 
> maintainance, and these parts work really well.. But i also have to
> maintain a list of all configured nodes.
> 
> What is the 'best practice' to get this list? (Node id and name would be
> fine...) It doesn't seem do be possible with libmagma...

Sure it is.

cluster_member_list_t *mlist;

mlist = clu_member_list();


-- Lon


From lhh at redhat.com  Tue Oct  4 16:44:18 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 04 Oct 2005 12:44:18 -0400
Subject: [Linux-cluster] cluster service development - question
In-Reply-To: <1128441029.25994.125.camel@phanara.ai.arno.vpn>
References: <1128441029.25994.125.camel@phanara.ai.arno.vpn>
Message-ID: <1128444258.27430.173.camel@ayanami.boston.redhat.com>

On Tue, 2005-10-04 at 17:50 +0200, Christian Niessner wrote:
> hi,
> 
> i'm currently developing a cluster service for the cluster release 
> 1.00.00. It's using libmagma for messaging and node membership 
> maintainance, and these parts work really well.. But i also have to
> maintain a list of all configured nodes.
> 
> What is the 'best practice' to get this list? (Node id and name would be
> fine...) It doesn't seem do be possible with libmagma...
> 
> Thanks,
> 	chris

Oh - note - the node ID is 64 bits because gulm doesn't really have a
notion of "node ID" internally, so we use the local ipv6 network address
(lower 8 octets) as the node ID.

Just something to be aware of; i.e. don't cast it to an int. :)

-- Lon


From lhh at redhat.com  Tue Oct  4 16:50:18 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 04 Oct 2005 12:50:18 -0400
Subject: [Linux-cluster] cluster service development - question
In-Reply-To: <1128444054.27430.169.camel@ayanami.boston.redhat.com>
References: <1128441029.25994.125.camel@phanara.ai.arno.vpn>
	<1128444054.27430.169.camel@ayanami.boston.redhat.com>
Message-ID: <1128444618.27430.177.camel@ayanami.boston.redhat.com>

On Tue, 2005-10-04 at 12:40 -0400, Lon Hohberger wrote:
> On Tue, 2005-10-04 at 17:50 +0200, Christian Niessner wrote:
> > hi,
> > 
> > i'm currently developing a cluster service for the cluster release 
> > 1.00.00. It's using libmagma for messaging and node membership 
> > maintainance, and these parts work really well.. But i also have to
> > maintain a list of all configured nodes.
> > 
> > What is the 'best practice' to get this list? (Node id and name would be
> > fine...) It doesn't seem do be possible with libmagma...

Ugh, I read this wrong.

Here you go.  It's part of a rewrite of clustat which has "how to get
this stuff from ccsd" built right in.  I haven't committed this yet, but
will soon as part of a larger clustat rewrite.

ccs_member_list() returns cluster_member_list_t *.

build_member_list does that, but compares it against who is in the
configuration (and in this case, who is running rgmanager).  It's kind
of hackish the way I'm overloading the cm_state field with a bunch of
bitflags, but it works.

-- Lon
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ccs-clustat-merge.c
Type: text/x-csrc
Size: 3379 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051004/1a8e0273/attachment.bin>

From mailinglists at marvin-lists.freaks.de  Tue Oct  4 16:55:43 2005
From: mailinglists at marvin-lists.freaks.de (Christian Niessner)
Date: Tue, 04 Oct 2005 18:55:43 +0200
Subject: [Linux-cluster] cluster service development - question
In-Reply-To: <1128444054.27430.169.camel@ayanami.boston.redhat.com>
References: <1128441029.25994.125.camel@phanara.ai.arno.vpn>
	<1128444054.27430.169.camel@ayanami.boston.redhat.com>
Message-ID: <1128444943.25994.145.camel@phanara.ai.arno.vpn>

Hi Lon,

On Tue, 2005-10-04 at 12:40 -0400, Lon Hohberger wrote:
> Sure it is.
> 
> cluster_member_list_t *mlist;
> 
> mlist = clu_member_list();

It seems clu_member_list() only returns the nodes that have
joined the cluster, not all nodes configured
in /etc/cluster/cluster.conf. In my case, i need all
nodes.

I had a quick look into the haeder files. I only found a
clu_member_list(char *group). But even calling with NULL 
it only reports joined nodes...

Or did I do something wrong?

ciao,
	chris


From liangs at cse.ohio-state.edu  Tue Oct  4 16:58:08 2005
From: liangs at cse.ohio-state.edu (Shuang Liang)
Date: Tue, 04 Oct 2005 12:58:08 -0400
Subject: [Linux-cluster] Gnbd with LVM
Message-ID: <4342B4A0.5040504@cse.ohio-state.edu>

Hi all,
   Does Gnbd work with logic volume manager in Linux, so that data can 
stripe across multiple gnbd device on a single GFS filesytem?
   I am also curious that if it is possible for the 6.1 version of GFS 
to work without cluster tools?

Thanks

Shuang,


From teigland at redhat.com  Tue Oct  4 17:06:05 2005
From: teigland at redhat.com (David Teigland)
Date: Tue, 4 Oct 2005 12:06:05 -0500
Subject: [Linux-cluster] GFS crash
In-Reply-To: <1128382078.9653.8.camel@paris.squiz.net>
References: <1128382078.9653.8.camel@paris.squiz.net>
Message-ID: <20051004170605.GC10135@redhat.com>

On Tue, Oct 04, 2005 at 09:27:58AM +1000, Chmouel Boudjnah wrote:
> Hello,
> 
> I had a crash on a server using GFS-6.1 with kernel 2.6.9-11.ELsmp, i am
> using GFS with an AOE SAN drive. 
> 
> I am not sure if the problem is with AOE SAN or with GFS would be great
> to tell me so i can redirect the bug report to the CORAID people.
> 
> So i have first in the logs some weird stuff about sataide (i am not
> sure if the SAN is using that) :
> 
> Sep 30 17:43:20 srv kernel: e send einval to 2
> Sep 30 17:43:20 srv kernel: sataide send einval to 2
> Sep 30 17:43:20 srv last message repeated 38 times
> Sep 30 17:43:20 srv kernel: sataide unlock ff050383 no id

The dlm is returning errors for both remote and local lock requests,
indicating that it doesn't know about any of the locks being requested.
That's often because the dlm was "shut down" by cman when cman lost its
connection to the cluster.  There are usually log messages from cman, too,
saying what has happened.

Is AOE using the same network as cman?  If so, you might try putting them
on two different networks.

> Sep 30 17:43:22 srv kernel: lock_dlm:  Assertion failed on line 353 of
> file /usr/src/build/574067-i686/BUILD/smp/src/dlm/lock.c
> Sep 30 17:43:22 srv kernel: lock_dlm:  assertion:  "!error"
> Sep 30 17:43:22 srv kernel: lock_dlm:  time = 2509316164
> Sep 30 17:43:22 srv kernel: sataide: error=-22 num=5,5bf2f1 lkf=801
> flags=84

This is the typical assertion failure you get when gfs can't acquire any
locks.

Dave


From jbrassow at redhat.com  Tue Oct  4 18:20:42 2005
From: jbrassow at redhat.com (Jonathan E Brassow)
Date: Tue, 4 Oct 2005 13:20:42 -0500
Subject: [Linux-cluster] Gnbd with LVM
In-Reply-To: <4342B4A0.5040504@cse.ohio-state.edu>
References: <4342B4A0.5040504@cse.ohio-state.edu>
Message-ID: <42045367e86f9fe3320a7837428e8d80@redhat.com>


On Oct 4, 2005, at 11:58 AM, Shuang Liang wrote:

> Hi all,
>   Does Gnbd work with logic volume manager in Linux, so that data can 
> stripe across multiple gnbd device on a single GFS filesytem?

I have gotten it to work.  You may need to add the gnbd devices to your 
filter (in /etc/lvm/lvm.conf).

>   I am also curious that if it is possible for the 6.1 version of GFS 
> to work without cluster tools?
>

GFS will work as a local file system if you mkfs with the '-p 
lock_nolock' option.

  brassow


From pcaulfie at redhat.com  Wed Oct  5 07:00:09 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Wed, 05 Oct 2005 08:00:09 +0100
Subject: [Linux-cluster] cluster service development - question
In-Reply-To: <1128444618.27430.177.camel@ayanami.boston.redhat.com>
References: <1128441029.25994.125.camel@phanara.ai.arno.vpn>	<1128444054.27430.169.camel@ayanami.boston.redhat.com>
	<1128444618.27430.177.camel@ayanami.boston.redhat.com>
Message-ID: <434379F9.2080209@redhat.com>

Lon Hohberger wrote:
> On Tue, 2005-10-04 at 12:40 -0400, Lon Hohberger wrote:
> 
>>On Tue, 2005-10-04 at 17:50 +0200, Christian Niessner wrote:
>>
>>>hi,
>>>
>>>i'm currently developing a cluster service for the cluster release 
>>>1.00.00. It's using libmagma for messaging and node membership 
>>>maintainance, and these parts work really well.. But i also have to
>>>maintain a list of all configured nodes.
>>>
>>>What is the 'best practice' to get this list? (Node id and name would be
>>>fine...) It doesn't seem do be possible with libmagma...
> 
> 
> Ugh, I read this wrong.
> 
> Here you go.  It's part of a rewrite of clustat which has "how to get
> this stuff from ccsd" built right in.  I haven't committed this yet, but
> will soon as part of a larger clustat rewrite.
> 
> ccs_member_list() returns cluster_member_list_t *.
> 
> build_member_list does that, but compares it against who is in the
> configuration (and in this case, who is running rgmanager).  It's kind
> of hackish the way I'm overloading the cm_state field with a bunch of
> bitflags, but it works.
> 

Just to point out that the next cman version (the userland daemon on head of
CVS) will behave as you want - ie requesting the members list will retrieve all
the nodes known to CCS.

-- 

patrick


From cfeist at redhat.com  Wed Oct  5 20:54:29 2005
From: cfeist at redhat.com (Chris Feist)
Date: Wed, 05 Oct 2005 15:54:29 -0500
Subject: [Linux-cluster] FC4 kernel-2.6.13-1.1526_FC4 & GFS-kernel package
	mismatch ?
In-Reply-To: <1128426748.43426cfc13ec5@mail.devcon.cc>
References: <1128426748.43426cfc13ec5@mail.devcon.cc>
Message-ID: <43443D85.8050703@redhat.com>

GFS/CS updated kernel rpms are available from fedora-test.

ftp://download.fedora.redhat.com/pub/fedora/linux/core/updates/testing/4/i386/

Thanks,
Chris

Thomas Kofler wrote:
> Hi,
> 
> we noticed, that after running "yum update" yesterday on our system, that cman 
> didn't start up any longer.
> 
> We investigated the problem and it depends on the kernel version, if we boot 
> with the old 1447 - kernel, cman and the related services startup fine.
> 
> The newest GFS-kernel places the modules under /lib/modules/2.6.12-1.1447_FC4
> 
> But the kernel itself kernel-2.6.13-1.1526_FC4 has of 
> course /lib/modules/2.6.13-1.1526_FC4 as its module path.
> 
> Is it a bug or is there to do something by hand? If not, I would open a bug on 
> bugzilla, but under which section: kernel or GFS-kernel - which package team 
> forget the dependency ?
> 
> Thanks for feedback,
> Thomas
> 
> 
> kernel-2.6.13-1.1526_FC4 
> modules: /lib/modules/2.6.13-1.1526_FC4
> 
> GFS-kernel-2.6.11.8-20050601.152643.FC4.14
> /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs
> /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs/gfs.ko
> /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking
> /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_dlm
> /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_dlm/lock_dlm.ko
> /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_gulm
> /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_gulm/lock_gulm.ko
> /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_harness
> /lib/modules/2.6.12-
> 1.1447_FC4/kernel/fs/gfs_locking/lock_harness/lock_harness.ko
> /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_nolock
> /lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_nolock/lock_nolock.ko
> 
>  
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From lhh at redhat.com  Wed Oct  5 20:59:16 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 05 Oct 2005 16:59:16 -0400
Subject: [Linux-cluster] cluster service development - question
In-Reply-To: <434379F9.2080209@redhat.com>
References: <1128441029.25994.125.camel@phanara.ai.arno.vpn>
	<1128444054.27430.169.camel@ayanami.boston.redhat.com>
	<1128444618.27430.177.camel@ayanami.boston.redhat.com>
	<434379F9.2080209@redhat.com>
Message-ID: <1128545956.27430.249.camel@ayanami.boston.redhat.com>

On Wed, 2005-10-05 at 08:00 +0100, Patrick Caulfield wrote:
> > Here you go.  It's part of a rewrite of clustat which has "how to get
> > this stuff from ccsd" built right in.  I haven't committed this yet, but
> > will soon as part of a larger clustat rewrite.
> > 
> > ccs_member_list() returns cluster_member_list_t *.
> > 
> > build_member_list does that, but compares it against who is in the
> > configuration (and in this case, who is running rgmanager).  It's kind
> > of hackish the way I'm overloading the cm_state field with a bunch of
> > bitflags, but it works.
> > 
> 
> Just to point out that the next cman version (the userland daemon on head of
> CVS) will behave as you want - ie requesting the members list will retrieve all
> the nodes known to CCS.
> 

Nice =)

-- Lon


From lhh at redhat.com  Wed Oct  5 21:08:22 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 05 Oct 2005 17:08:22 -0400
Subject: [Linux-cluster] Re: rgmanager dieing with no messages [was:
	Re: SMP and GFS]
In-Reply-To: <1128352997.3504.9.camel@auh5-0479.corp.jabil.org>
References: <04f401c588b6$b31db0a0$5001a8c0@spa.isqsolutions.com>
	<20051002102305.GD13944@neu.nirvana>
	<1128265567.23136.21.camel@saloon.neofreak.org>
	<1128352997.3504.9.camel@auh5-0479.corp.jabil.org>
Message-ID: <1128546502.27430.252.camel@ayanami.boston.redhat.com>

On Mon, 2005-10-03 at 11:23 -0400, Eric Kerin wrote:
> On Sun, 2005-10-02 at 11:06 -0400, DeadManMoving wrote:
> > My cluster is highly instable, just this morning i've realized that
> > the clurgmgrd deamon was dead...
> 
> I'm having this same problem on my cluster, I've been planning on
> enabling core dumps for rgmanager once I find a few minutes to restart
> the cluster services. With any luck, that will be today.

If you see anything, let me know.  There's a segfault I'm trying to
track down which this is... I haven't been able to reproduce it
internally :(


From jnewbigin at ict.swin.edu.au  Thu Oct  6 02:01:45 2005
From: jnewbigin at ict.swin.edu.au (John Newbigin)
Date: Thu, 06 Oct 2005 12:01:45 +1000
Subject: [Linux-cluster] GFS-6.0.2.27 issues
In-Reply-To: <434201F7.3050107@ict.swin.edu.au>
References: <434201F7.3050107@ict.swin.edu.au>
Message-ID: <43448589.6050608@ict.swin.edu.au>

FYI Bugzilla #169967

John Newbigin wrote:

> Is anyone seeing this with GFS-6.0.2.27 (EL3):
> ldconfig: /usr/lib/libgulm.so.6 is not a symbolic link
> 
> The cause is indeed as the message says.
> /usr/lib/libgulm.so  /usr/lib/libgulm.so.6  /usr/lib/libgulm.so.6.0.2 
> all appear the be the same file, rather than symlinks to the 6.0.2 
> version file.
> 
> It all works OK, just makes installing updates print out the error 
> message as each packages runs ldconfig.
> 
> I also have files in /usr/lib/debug which is not a problem but i wonder 
> if they need to be there.
> 
> John.
> 


-- 
John Newbigin
Computer Systems Officer
Faculty of Information and Communication Technologies
Swinburne University of Technology
Melbourne, Australia
http://www.ict.swin.edu.au/staff/jnewbigin


From phung at cs.columbia.edu  Thu Oct  6 20:37:21 2005
From: phung at cs.columbia.edu (Dan B. Phung)
Date: Thu, 6 Oct 2005 16:37:21 -0400 (EDT)
Subject: [Linux-cluster] new cluster created when new node joins
Message-ID: <Pine.LNX.4.44.0510061634440.10782-100000@minsk.clic.cs.columbia.edu>

I have an existing cluster:

blade04: # cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    1   X   blade01
   4    1    1   M   blade04
  11    1    1   M   blade11

then blade06 joins the cluster, but instead of joining the existing 
cluster, it creates a new one:

blade06: # cman_tool nodes
Node  Votes Exp Sts  Name
   6    1    1   M   blade06

Both machines are using Protocol version: 5.0.1

How can I further debug why this is happening?

thanks,
dan


From jbrassow at redhat.com  Thu Oct  6 21:59:31 2005
From: jbrassow at redhat.com (Jonathan E Brassow)
Date: Thu, 6 Oct 2005 16:59:31 -0500
Subject: [Linux-cluster] new cluster created when new node joins
In-Reply-To: <Pine.LNX.4.44.0510061634440.10782-100000@minsk.clic.cs.columbia.edu>
References: <Pine.LNX.4.44.0510061634440.10782-100000@minsk.clic.cs.columbia.edu>
Message-ID: <98b567529b272baeb6be7f90371aa324@redhat.com>

do they have multiple clusters set up in their environment?  Does the 
/etc/cluster/cluster.xml file match the others?

  brassow

On Oct 6, 2005, at 3:37 PM, Dan B. Phung wrote:

> I have an existing cluster:
>
> blade04: # cman_tool nodes
> Node  Votes Exp Sts  Name
>    1    1    1   X   blade01
>    4    1    1   M   blade04
>   11    1    1   M   blade11
>
> then blade06 joins the cluster, but instead of joining the existing
> cluster, it creates a new one:
>
> blade06: # cman_tool nodes
> Node  Votes Exp Sts  Name
>    6    1    1   M   blade06
>
> Both machines are using Protocol version: 5.0.1
>
> How can I further debug why this is happening?
>
> thanks,
> dan
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


From phung at cs.columbia.edu  Thu Oct  6 22:06:48 2005
From: phung at cs.columbia.edu (Dan B. Phung)
Date: Thu, 6 Oct 2005 18:06:48 -0400 (EDT)
Subject: [Linux-cluster] new cluster created when new node joins
In-Reply-To: <98b567529b272baeb6be7f90371aa324@redhat.com>
Message-ID: <Pine.LNX.4.44.0510061805130.10782-100000@minsk.clic.cs.columbia.edu>

There is another cluster that is running orthogonal of this cluster, but
that's not defined in this cluster.xml.  The cluster.xml is the same
for both these machines.


On 6, Oct, 2005, Jonathan E Brassow declared:

> do they have multiple clusters set up in their environment?  Does the 
> /etc/cluster/cluster.xml file match the others?
> 
>   brassow
> 
> On Oct 6, 2005, at 3:37 PM, Dan B. Phung wrote:
> 
> > I have an existing cluster:
> >
> > blade04: # cman_tool nodes
> > Node  Votes Exp Sts  Name
> >    1    1    1   X   blade01
> >    4    1    1   M   blade04
> >   11    1    1   M   blade11
> >
> > then blade06 joins the cluster, but instead of joining the existing
> > cluster, it creates a new one:
> >
> > blade06: # cman_tool nodes
> > Node  Votes Exp Sts  Name
> >    6    1    1   M   blade06
> >
> > Both machines are using Protocol version: 5.0.1
> >
> > How can I further debug why this is happening?
> >
> > thanks,
> > dan
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 

-- 
email:  phung at cs.columbia.edu
www:    http://www.cs.columbia.edu/~phung
phone:  646-775-6090
fax:    212-666-0140
office: CS Dept. 520, 1214 Amsterdam Ave., MC 0401, New York, NY 10027


From jbrassow at redhat.com  Thu Oct  6 22:11:41 2005
From: jbrassow at redhat.com (Jonathan E Brassow)
Date: Thu, 6 Oct 2005 17:11:41 -0500
Subject: [Linux-cluster] new cluster created when new node joins
In-Reply-To: <Pine.LNX.4.44.0510061805130.10782-100000@minsk.clic.cs.columbia.edu>
References: <Pine.LNX.4.44.0510061805130.10782-100000@minsk.clic.cs.columbia.edu>
Message-ID: <9eebfcf8d2c537ddab93d7d0c4f7c7c8@redhat.com>

Anything in /var/log/messages?

On Oct 6, 2005, at 5:06 PM, Dan B. Phung wrote:

> There is another cluster that is running orthogonal of this cluster, 
> but
> that's not defined in this cluster.xml.  The cluster.xml is the same
> for both these machines.
>
>
> On 6, Oct, 2005, Jonathan E Brassow declared:
>
>> do they have multiple clusters set up in their environment?  Does the
>> /etc/cluster/cluster.xml file match the others?
>>
>>   brassow
>>
>> On Oct 6, 2005, at 3:37 PM, Dan B. Phung wrote:
>>
>>> I have an existing cluster:
>>>
>>> blade04: # cman_tool nodes
>>> Node  Votes Exp Sts  Name
>>>    1    1    1   X   blade01
>>>    4    1    1   M   blade04
>>>   11    1    1   M   blade11
>>>
>>> then blade06 joins the cluster, but instead of joining the existing
>>> cluster, it creates a new one:
>>>
>>> blade06: # cman_tool nodes
>>> Node  Votes Exp Sts  Name
>>>    6    1    1   M   blade06
>>>
>>> Both machines are using Protocol version: 5.0.1
>>>
>>> How can I further debug why this is happening?
>>>
>>> thanks,
>>> dan
>>>
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
> -- 
> email:  phung at cs.columbia.edu
> www:    http://www.cs.columbia.edu/~phung
> phone:  646-775-6090
> fax:    212-666-0140
> office: CS Dept. 520, 1214 Amsterdam Ave., MC 0401, New York, NY 10027
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


From phung at cs.columbia.edu  Thu Oct  6 22:17:13 2005
From: phung at cs.columbia.edu (Dan B. Phung)
Date: Thu, 6 Oct 2005 18:17:13 -0400 (EDT)
Subject: [Linux-cluster] new cluster created when new node joins
In-Reply-To: <9eebfcf8d2c537ddab93d7d0c4f7c7c8@redhat.com>
Message-ID: <Pine.LNX.4.44.0510061814410.10782-100000@minsk.clic.cs.columbia.edu>

on the existing blades in the cluster there's nothing in the logs.  on 
the entering blade, we get the "normal"

Oct  6 16:28:29 blade06 kernel: CMAN: Waiting to join or form a Linux-cluster
Oct  6 16:29:01 blade06 kernel: CMAN: forming a new cluster
Oct  6 16:29:01 blade06 kernel: CMAN: quorum regained, resuming activity
Oct  6 16:31:20 blade06 kernel: CMAN: we are leaving the cluster.
Oct  6 16:31:42 blade06 kernel: CMAN: Waiting to join or form a Linux-cluster
Oct  6 16:32:14 blade06 kernel: CMAN: forming a new cluster
Oct  6 16:32:14 blade06 kernel: CMAN: quorum regained, resuming activity

...so it seems like the messages aren't getting sent/received on the
mutlicast network.  I guess I'll try sniffing the network to see if the
messages are out there.

-dan

On 6, Oct, 2005, Jonathan E Brassow declared:

> Anything in /var/log/messages?
> 
> On Oct 6, 2005, at 5:06 PM, Dan B. Phung wrote:
> 
> > There is another cluster that is running orthogonal of this cluster, 
> > but
> > that's not defined in this cluster.xml.  The cluster.xml is the same
> > for both these machines.
> >
> >
> > On 6, Oct, 2005, Jonathan E Brassow declared:
> >
> >> do they have multiple clusters set up in their environment?  Does the
> >> /etc/cluster/cluster.xml file match the others?
> >>
> >>   brassow
> >>
> >> On Oct 6, 2005, at 3:37 PM, Dan B. Phung wrote:
> >>
> >>> I have an existing cluster:
> >>>
> >>> blade04: # cman_tool nodes
> >>> Node  Votes Exp Sts  Name
> >>>    1    1    1   X   blade01
> >>>    4    1    1   M   blade04
> >>>   11    1    1   M   blade11
> >>>
> >>> then blade06 joins the cluster, but instead of joining the existing
> >>> cluster, it creates a new one:
> >>>
> >>> blade06: # cman_tool nodes
> >>> Node  Votes Exp Sts  Name
> >>>    6    1    1   M   blade06
> >>>
> >>> Both machines are using Protocol version: 5.0.1
> >>>
> >>> How can I further debug why this is happening?
> >>>
> >>> thanks,
> >>> dan
> >>>
> >>> --
> >>> Linux-cluster mailing list
> >>> Linux-cluster at redhat.com
> >>> https://www.redhat.com/mailman/listinfo/linux-cluster
> >>>
> >>
> >> --
> >> Linux-cluster mailing list
> >> Linux-cluster at redhat.com
> >> https://www.redhat.com/mailman/listinfo/linux-cluster
> >>
> >
> > -- 
> > email:  phung at cs.columbia.edu
> > www:    http://www.cs.columbia.edu/~phung
> > phone:  646-775-6090
> > fax:    212-666-0140
> > office: CS Dept. 520, 1214 Amsterdam Ave., MC 0401, New York, NY 10027
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 

-- 
email:  phung at cs.columbia.edu
www:    http://www.cs.columbia.edu/~phung
phone:  646-775-6090
fax:    212-666-0140
office: CS Dept. 520, 1214 Amsterdam Ave., MC 0401, New York, NY 10027


From hernando.garcia at gmail.com  Fri Oct  7 10:09:02 2005
From: hernando.garcia at gmail.com (Hernando Garcia)
Date: Fri, 07 Oct 2005 11:09:02 +0100
Subject: [Linux-cluster] Please remove
In-Reply-To: <C6D4E314224DD348B00AF127279427E002036961@mbbexc01.asiapacific.cpqcorp.net>
References: <C6D4E314224DD348B00AF127279427E002036961@mbbexc01.asiapacific.cpqcorp.net>
Message-ID: <1128679742.4350.0.camel@hgarcia.surrey.redhat.com>

You can DIY from here ;)

https://www.redhat.com/mailman/listinfo/linux-cluster

On Tue, 2005-10-04 at 06:17 +1000, Fernandez, Joe (HP Systems) wrote:
> Hi,
>  
> Could you please remove me off the list, thank you. 
>  
> 
> Regards,
> 
> Joe Fernandez
> 
> HP Systems
> Hewlett-Packard Australia
> Ph. 61.3 8804 7308
> Mob. 61.412 830 066
> 
> 
> 
>  
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From Axel.Thimm at ATrpms.net  Fri Oct  7 10:51:04 2005
From: Axel.Thimm at ATrpms.net (Axel Thimm)
Date: Fri, 7 Oct 2005 12:51:04 +0200
Subject: [Linux-cluster] RHCS/GFS for RHELU2 (was: FC4
	kernel-2.6.13-1.1526_FC4 & GFS-kernel package mismatch ?)
In-Reply-To: <43443D85.8050703@redhat.com>
References: <1128426748.43426cfc13ec5@mail.devcon.cc>
	<43443D85.8050703@redhat.com>
Message-ID: <20051007105104.GA14283@neu.nirvana>

The CS/GFS isos under RHELU2 are still for RHELU1, and the rhn
channels also only have kernel modules for RHELU1's kernel.

Should I bugzilla this?

Thanks!

On Wed, Oct 05, 2005 at 03:54:29PM -0500, Chris Feist wrote:
> GFS/CS updated kernel rpms are available from fedora-test.
> 
> ftp://download.fedora.redhat.com/pub/fedora/linux/core/updates/testing/4/i386/
> 
> Thanks,
> Chris
> 
> Thomas Kofler wrote:
> >Hi,
> >
> >we noticed, that after running "yum update" yesterday on our system, that 
> >cman didn't start up any longer.
> >
> >We investigated the problem and it depends on the kernel version, if we 
> >boot with the old 1447 - kernel, cman and the related services startup 
> >fine.
> >
> >The newest GFS-kernel places the modules under 
> >/lib/modules/2.6.12-1.1447_FC4
> >
> >But the kernel itself kernel-2.6.13-1.1526_FC4 has of 
> >course /lib/modules/2.6.13-1.1526_FC4 as its module path.
> >
> >Is it a bug or is there to do something by hand? If not, I would open a 
> >bug on bugzilla, but under which section: kernel or GFS-kernel - which 
> >package team forget the dependency ?
> >
> >Thanks for feedback,
> >Thomas
> >
> >
> >kernel-2.6.13-1.1526_FC4 
> >modules: /lib/modules/2.6.13-1.1526_FC4
> >
> >GFS-kernel-2.6.11.8-20050601.152643.FC4.14
> >/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs
> >/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs/gfs.ko
> >/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking
> >/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_dlm
> >/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_dlm/lock_dlm.ko
> >/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_gulm
> >/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_gulm/lock_gulm.ko
> >/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_harness
> >/lib/modules/2.6.12-
> >1.1447_FC4/kernel/fs/gfs_locking/lock_harness/lock_harness.ko
> >/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_nolock
> >/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_nolock/lock_nolock.ko
> >
> > 
> >
> 

-- 
Axel.Thimm at ATrpms.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051007/7c9cbf7c/attachment.sig>

From Axel.Thimm at ATrpms.net  Fri Oct  7 12:02:04 2005
From: Axel.Thimm at ATrpms.net (Axel Thimm)
Date: Fri, 7 Oct 2005 14:02:04 +0200
Subject: [Linux-cluster] Re: RHCS/GFS for RHELU2 (was: FC4
	kernel-2.6.13-1.1526_FC4 & GFS-kernel package mismatch ?)
In-Reply-To: <20051007105104.GA14283@neu.nirvana>
References: <1128426748.43426cfc13ec5@mail.devcon.cc>
	<43443D85.8050703@redhat.com> <20051007105104.GA14283@neu.nirvana>
Message-ID: <20051007120204.GA20566@neu.nirvana>

On Fri, Oct 07, 2005 at 12:51:04PM +0200, Axel Thimm wrote:
> The CS/GFS isos under RHELU2 are still for RHELU1, and the rhn
> channels also only have kernel modules for RHELU1's kernel.

OK, looks like they are still in the beta channel.
-- 
Axel.Thimm at ATrpms.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051007/c9a52ff9/attachment.sig>

From cfeist at redhat.com  Fri Oct  7 14:17:39 2005
From: cfeist at redhat.com (Chris Feist)
Date: Fri, 07 Oct 2005 09:17:39 -0500
Subject: [Linux-cluster] Re: RHCS/GFS for RHELU2
In-Reply-To: <20051007105104.GA14283@neu.nirvana>
References: <1128426748.43426cfc13ec5@mail.devcon.cc>
	<43443D85.8050703@redhat.com> <20051007105104.GA14283@neu.nirvana>
Message-ID: <43468383.3050407@redhat.com>

Axel,

Normally it takes a day or two for CS/GFS isos to be released on RHN after 
RHEL is released.  The rpms have been updated and the isos should be appearing 
shortly.

Thanks,
Chris

Axel Thimm wrote:
> The CS/GFS isos under RHELU2 are still for RHELU1, and the rhn
> channels also only have kernel modules for RHELU1's kernel.
> 
> Should I bugzilla this?
> 
> Thanks!
> 
> On Wed, Oct 05, 2005 at 03:54:29PM -0500, Chris Feist wrote:
> 
>>GFS/CS updated kernel rpms are available from fedora-test.
>>
>>ftp://download.fedora.redhat.com/pub/fedora/linux/core/updates/testing/4/i386/
>>
>>Thanks,
>>Chris
>>
>>Thomas Kofler wrote:
>>
>>>Hi,
>>>
>>>we noticed, that after running "yum update" yesterday on our system, that 
>>>cman didn't start up any longer.
>>>
>>>We investigated the problem and it depends on the kernel version, if we 
>>>boot with the old 1447 - kernel, cman and the related services startup 
>>>fine.
>>>
>>>The newest GFS-kernel places the modules under 
>>>/lib/modules/2.6.12-1.1447_FC4
>>>
>>>But the kernel itself kernel-2.6.13-1.1526_FC4 has of 
>>>course /lib/modules/2.6.13-1.1526_FC4 as its module path.
>>>
>>>Is it a bug or is there to do something by hand? If not, I would open a 
>>>bug on bugzilla, but under which section: kernel or GFS-kernel - which 
>>>package team forget the dependency ?
>>>
>>>Thanks for feedback,
>>>Thomas
>>>
>>>
>>>kernel-2.6.13-1.1526_FC4 
>>>modules: /lib/modules/2.6.13-1.1526_FC4
>>>
>>>GFS-kernel-2.6.11.8-20050601.152643.FC4.14
>>>/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs
>>>/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs/gfs.ko
>>>/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking
>>>/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_dlm
>>>/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_dlm/lock_dlm.ko
>>>/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_gulm
>>>/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_gulm/lock_gulm.ko
>>>/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_harness
>>>/lib/modules/2.6.12-
>>>1.1447_FC4/kernel/fs/gfs_locking/lock_harness/lock_harness.ko
>>>/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_nolock
>>>/lib/modules/2.6.12-1.1447_FC4/kernel/fs/gfs_locking/lock_nolock/lock_nolock.ko
>>>
>>>
>>>
>>
> 


From colman at codagenomics.com  Fri Oct  7 14:35:03 2005
From: colman at codagenomics.com (Richard Colman)
Date: Fri, 7 Oct 2005 07:35:03 -0700
Subject: [Linux-cluster] Job Posting - Southern CA
Message-ID: <200510071436.j97Ea4EN021329@mx3.redhat.com>

CODA Genomics in Irvine, CA is expanding and now would like to hire an
absolutely top-notch systems analyst/administrator/programmer for design,
development and administration of systems and software for real-time,
distributed parallel processing on Linux clusters for both genomics research
and commercial production of synthetic genes. We primarily use RED HAT and
Debian software.

Please respond by email to jobs at codagenomics.com to obtain a detailed job
description . No telephone calls please.

Thank You.

Richard Colman


From baesso at ksolutions.it  Mon Oct 10 07:22:07 2005
From: baesso at ksolutions.it (Baesso Mirko)
Date: Mon, 10 Oct 2005 09:22:07 +0200
Subject: [Linux-cluster] Setup Fence_wti on cluster RHES4 U1
Message-ID: <F63C63702B77F94CB4664B4D88660C4D4BACFC@kmail.ksolutions.it>

Hi 
i've setup a Redhat cluster with two node and i try to test failover
using power switch (WTI-NPS230) but seem doesn't work
If I look for error messages I see that fence_wti waits for a command to
execute.
I setup cluster.conf using system-config-cluster and there is no section
regarding command to execute
Could you please let me now how to setup correctly
Thanks

Baesso Mirko - System Engineer
KSolutions.S.p.A.
Via Lenin 132/26
56017  S.Martino Ulmiano (PI) - Italy
tel.+ 39 0 50 898369 fax. + 39 0 50 861200
baesso at ksolutions.it   http//www.ksolutions.it


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051010/f0c6bdc8/attachment.htm>

From herta.vandeneynde at cc.kuleuven.be  Mon Oct 10 14:29:50 2005
From: herta.vandeneynde at cc.kuleuven.be (Herta Van den Eynde)
Date: Mon, 10 Oct 2005 16:29:50 +0200
Subject: [Linux-cluster] umount failed - device is busy
Message-ID: <434A7ADE.108@cc.kuleuven.be>

environment:
- Red Hat AS 3 (kernel-smp-2.4.21-37.EL - custom built to probe all LUNs 
on each SCSI device)
- clumanager 1.2.28

The cluster consists of 2 members running three services which simply 
nfs export a number of directories to five other systems.
The cluster has been operational since February.

Following the latest upgrade (from kernel-smp-2.4.21-32.0.1.EL custom 
built and clumanager-1.2.26.1-1), all services are running on one 
member.  When I try to locate the services, the operation fails, and the 
following message pops up:

	A Problem has occurred while changing ownership
	of this service.  Please check logs for details.

The cluster log reports the following:

==== begin log extract
Member arnebd trying to relocate lepustl to nihald...Oct 10 16:08:06 
arnebd clusvcmgrd: [13627]: <notice> service notice: Stopping service 
lepustl ...
Oct 10 16:08:06 arnebd clurmtabd[26429]: <debug> Signal 15 received; exiting
Oct 10 16:08:12 arnebd clusvcmgrd: [13627]: <err> service error: 'umount 
/dev/sdb2' failed (/usr/local/lepus-tl), error=1
Oct 10 16:08:12 arnebd clusvcmgrd: [13627]: <err> service error: umount: 
/usr/local/lepus-tl: device is busy
Oct 10 16:08:12 arnebd clusvcmgrd: [13627]: <err> service error: umount: 
/usr/local/lepus-tl: device is busy
Oct 10 16:08:12 arnebd clusvcmgrd: [13627]: <err> service error: Cannot 
stop filesystems for lepustl
Oct 10 16:08:12 arnebd clusvcmgrd[13626]: <notice> Starting stopped 
service lepustl
Oct 10 16:08:12 arnebd clusvcmgrd: [14083]: <notice> service notice: 
Starting service lepustl ...
Oct 10 16:08:12 arnebd clurmtabd[14194]: <debug> Log level is now 7
Oct 10 16:08:12 arnebd clurmtabd[14194]: <debug> Polling interval is now 
4 seconds
failed
Oct 10 16:08:12 arnebd clusvcmgrd: [14083]: <notice> service notice: 
Started service lepustl ...
Oct 10 16:08:14 arnebd clurmtabd[6533]: <debug> Detected modified 
/var/lib/nfs/rmtab
Oct 10 16:08:14 arnebd clurmtabd[9655]: <debug> Detected modified 
/var/lib/nfs/rmtab
==== end log extract

FWIIW, no one was logged in but me, and my current directory was not on 
this filesystem.
Neither fuser nor lsof returned any process using the filesystem.
I figured the clurmtabd process may be locking it, so I did verify that 
there is only one clurmtab process for that filesystem.

Any ideas/suggestions?

Kind regards,

Herta

-- 
Herta Van den Eynde              -=- Toledo system management
K.U. Leuven - Ludit              -=- phone: +32 (0)16 322 166
                                  -=- 50?51'27" N 004?40'39" E

"I wish I were two little cats.  Then I could play together."

Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm


From herta.vandeneynde at cc.kuleuven.be  Mon Oct 10 15:59:34 2005
From: herta.vandeneynde at cc.kuleuven.be (Herta Van den Eynde)
Date: Mon, 10 Oct 2005 17:59:34 +0200
Subject: [Linux-cluster] umount failed - device is busy
In-Reply-To: <434A7ADE.108@cc.kuleuven.be>
References: <434A7ADE.108@cc.kuleuven.be>
Message-ID: <434A8FE6.40508@cc.kuleuven.be>

Further investigation suggests that locking may have something to do 
with this.
On the system that currently runs the services, I find these lock files 
in four
-rwx------    1 root     root            0 Oct  8 03:30 lock.0
-rwx------    1 root     root            0 Oct  8 03:30 lock.1
-rwx------    1 root     root            0 Oct  8 03:30 lock.116
-rwx------    1 root     root            0 Oct  8 03:30 lock.2
-rw-r--r--    1 root     root            0 Oct  8 03:31 service.0
-rw-r--r--    1 root     root            0 Oct 10 16:08 service.1
-rw-r--r--    1 root     root            0 Oct  8 03:30 service.2

On the now idel cluster member, I have these lock files:
-rwx------    1 root     root            0 Oct  8 03:30 lock.0
-rwx------    1 root     root            0 Oct  8 03:30 lock.1
-rwx------    1 root     root            0 Oct  8 03:30 lock.116
-rwx------    1 root     root            0 Oct  8 03:30 lock.2

The four lock.n files strike me as odd since I only have three services. 
  Also, should the lock files even be there on the idle cluster member?

Could anyone running a similar cluster please post the content of the 
/var/lock/clumanager/ of the different members along with the the number 
of services currently running on that member?

Kind regards,

Herta

Herta Van den Eynde wrote:
> environment:
> - Red Hat AS 3 (kernel-smp-2.4.21-37.EL - custom built to probe all LUNs 
> on each SCSI device)
> - clumanager 1.2.28
> 
> The cluster consists of 2 members running three services which simply 
> nfs export a number of directories to five other systems.
> The cluster has been operational since February.
> 
> Following the latest upgrade (from kernel-smp-2.4.21-32.0.1.EL custom 
> built and clumanager-1.2.26.1-1), all services are running on one 
> member.  When I try to locate the services, the operation fails, and the 
> following message pops up:
> 
>     A Problem has occurred while changing ownership
>     of this service.  Please check logs for details.
> 
> The cluster log reports the following:
> 
> ==== begin log extract
> Member arnebd trying to relocate lepustl to nihald...Oct 10 16:08:06 
> arnebd clusvcmgrd: [13627]: <notice> service notice: Stopping service 
> lepustl ...
> Oct 10 16:08:06 arnebd clurmtabd[26429]: <debug> Signal 15 received; 
> exiting
> Oct 10 16:08:12 arnebd clusvcmgrd: [13627]: <err> service error: 'umount 
> /dev/sdb2' failed (/usr/local/lepus-tl), error=1
> Oct 10 16:08:12 arnebd clusvcmgrd: [13627]: <err> service error: umount: 
> /usr/local/lepus-tl: device is busy
> Oct 10 16:08:12 arnebd clusvcmgrd: [13627]: <err> service error: umount: 
> /usr/local/lepus-tl: device is busy
> Oct 10 16:08:12 arnebd clusvcmgrd: [13627]: <err> service error: Cannot 
> stop filesystems for lepustl
> Oct 10 16:08:12 arnebd clusvcmgrd[13626]: <notice> Starting stopped 
> service lepustl
> Oct 10 16:08:12 arnebd clusvcmgrd: [14083]: <notice> service notice: 
> Starting service lepustl ...
> Oct 10 16:08:12 arnebd clurmtabd[14194]: <debug> Log level is now 7
> Oct 10 16:08:12 arnebd clurmtabd[14194]: <debug> Polling interval is now 
> 4 seconds
> failed
> Oct 10 16:08:12 arnebd clusvcmgrd: [14083]: <notice> service notice: 
> Started service lepustl ...
> Oct 10 16:08:14 arnebd clurmtabd[6533]: <debug> Detected modified 
> /var/lib/nfs/rmtab
> Oct 10 16:08:14 arnebd clurmtabd[9655]: <debug> Detected modified 
> /var/lib/nfs/rmtab
> ==== end log extract
> 
> FWIIW, no one was logged in but me, and my current directory was not on 
> this filesystem.
> Neither fuser nor lsof returned any process using the filesystem.
> I figured the clurmtabd process may be locking it, so I did verify that 
> there is only one clurmtab process for that filesystem.
> 
> Any ideas/suggestions?
> 
> Kind regards,
> 
> Herta
> 

-- 
Herta Van den Eynde              -=- Toledo system management
K.U. Leuven - Ludit              -=- phone: +32 (0)16 322 166
                                  -=- 50?51'27" N 004?40'39" E

"I wish I were two little cats.  Then I could play together."

Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm


From lhh at redhat.com  Mon Oct 10 17:02:02 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 10 Oct 2005 13:02:02 -0400
Subject: [Linux-cluster] umount failed - device is busy
In-Reply-To: <434A8FE6.40508@cc.kuleuven.be>
References: <434A7ADE.108@cc.kuleuven.be>  <434A8FE6.40508@cc.kuleuven.be>
Message-ID: <1128963722.4680.21.camel@ayanami.boston.redhat.com>

On Mon, 2005-10-10 at 17:59 +0200, Herta Van den Eynde wrote:
> Further investigation suggests that locking may have something to do 
> with this.
> On the system that currently runs the services, I find these lock files 
> in four
> -rwx------    1 root     root            0 Oct  8 03:30 lock.0
> -rwx------    1 root     root            0 Oct  8 03:30 lock.1
> -rwx------    1 root     root            0 Oct  8 03:30 lock.116
> -rwx------    1 root     root            0 Oct  8 03:30 lock.2
> -rw-r--r--    1 root     root            0 Oct  8 03:31 service.0
> -rw-r--r--    1 root     root            0 Oct 10 16:08 service.1
> -rw-r--r--    1 root     root            0 Oct  8 03:30 service.2

> On the now idel cluster member, I have these lock files:
> -rwx------    1 root     root            0 Oct  8 03:30 lock.0
> -rwx------    1 root     root            0 Oct  8 03:30 lock.1
> -rwx------    1 root     root            0 Oct  8 03:30 lock.116
> -rwx------    1 root     root            0 Oct  8 03:30 lock.2

Lock files aren't removed.

> The four lock.n files strike me as odd since I only have three services. 

One is the configuration lock.

>   Also, should the lock files even be there on the idle cluster member?

Yes.

Did you try enabling force unmount in the device/file system
configuration?

-- Lon


From jbrassow at redhat.com  Mon Oct 10 20:03:40 2005
From: jbrassow at redhat.com (Jonathan E Brassow)
Date: Mon, 10 Oct 2005 15:03:40 -0500
Subject: [Linux-cluster] Setup Fence_wti on cluster RHES4 U1
In-Reply-To: <F63C63702B77F94CB4664B4D88660C4D4BACFC@kmail.ksolutions.it>
References: <F63C63702B77F94CB4664B4D88660C4D4BACFC@kmail.ksolutions.it>
Message-ID: <d28e99e29046467dcb8db55cfdd55a19@redhat.com>

I don't have the gui in front of me, but there should be a manage 
fencing button or something...  If you've never specified which port 
that a machine is connected to on the WTI, then you haven't gotten far 
enough in the set-up.

  brassow

On Oct 10, 2005, at 2:22 AM, Baesso Mirko wrote:

> Hi
>
> i?ve setup a Redhat cluster with two node and i try to test failover 
> using power switch (WTI-NPS230) but seem doesn?t work
>
> If I look for error messages I see that fence_wti waits for a command 
> to execute.
>
> I setup cluster.conf using system-config-cluster and there is no 
> section regarding command to execute
>
> Could you please let me now how to setup correctly
>
> Thanks
>
> Baesso Mirko - System Engineer
>
> KSolutions.S.p.A.
>
> Via Lenin 132/26
>
> 56017? S.Martino Ulmiano (PI) - Italy
>
> tel.+ 39 0 50 898369 fax. + 39 0 50 861200
>
> baesso at ksolutions.it?? http//www.ksolutions.it
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: text/enriched
Size: 2095 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051010/6cc6a8f0/attachment.bin>

From herta.vandeneynde at cc.kuleuven.be  Mon Oct 10 20:06:54 2005
From: herta.vandeneynde at cc.kuleuven.be (Herta Van den Eynde)
Date: Mon, 10 Oct 2005 22:06:54 +0200
Subject: [Linux-cluster] umount failed - device is busy
In-Reply-To: <1128963722.4680.21.camel@ayanami.boston.redhat.com>
References: <434A7ADE.108@cc.kuleuven.be> <434A8FE6.40508@cc.kuleuven.be>
	<1128963722.4680.21.camel@ayanami.boston.redhat.com>
Message-ID: <434AC9DE.50606@cc.kuleuven.be>

Lon Hohberger wrote:
> On Mon, 2005-10-10 at 17:59 +0200, Herta Van den Eynde wrote:
> 
>>Further investigation suggests that locking may have something to do 
>>with this.
>>On the system that currently runs the services, I find these lock files 
>>in four
>>-rwx------    1 root     root            0 Oct  8 03:30 lock.0
>>-rwx------    1 root     root            0 Oct  8 03:30 lock.1
>>-rwx------    1 root     root            0 Oct  8 03:30 lock.116
>>-rwx------    1 root     root            0 Oct  8 03:30 lock.2
>>-rw-r--r--    1 root     root            0 Oct  8 03:31 service.0
>>-rw-r--r--    1 root     root            0 Oct 10 16:08 service.1
>>-rw-r--r--    1 root     root            0 Oct  8 03:30 service.2
> 
> 
>>On the now idel cluster member, I have these lock files:
>>-rwx------    1 root     root            0 Oct  8 03:30 lock.0
>>-rwx------    1 root     root            0 Oct  8 03:30 lock.1
>>-rwx------    1 root     root            0 Oct  8 03:30 lock.116
>>-rwx------    1 root     root            0 Oct  8 03:30 lock.2
> 
> 
> Lock files aren't removed.
> 
> 
>>The four lock.n files strike me as odd since I only have three services. 
> 
> 
> One is the configuration lock.
> 
> 
>>  Also, should the lock files even be there on the idle cluster member?
> 
> 
> Yes.
> 
> Did you try enabling force unmount in the device/file system
> configuration?
> 
> -- Lon

Thanks for the explanation, Lon.  Yes, the devices are configured for 
"Force Unmount".
With the device unmounted on all of the nfs clients I even tried to 
'umount -f' manually, but I got the same result.

Kind regards,

Herta

Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm


From lhh at redhat.com  Mon Oct 10 21:02:26 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 10 Oct 2005 17:02:26 -0400
Subject: [Linux-cluster] umount failed - device is busy
In-Reply-To: <434AC9DE.50606@cc.kuleuven.be>
References: <434A7ADE.108@cc.kuleuven.be> <434A8FE6.40508@cc.kuleuven.be>
	<1128963722.4680.21.camel@ayanami.boston.redhat.com>
	<434AC9DE.50606@cc.kuleuven.be>
Message-ID: <1128978146.4680.37.camel@ayanami.boston.redhat.com>

On Mon, 2005-10-10 at 22:06 +0200, Herta Van den Eynde wrote:

> > Did you try enabling force unmount in the device/file system
> > configuration?
> > 
> > -- Lon
> 
> Thanks for the explanation, Lon.  Yes, the devices are configured for 
> "Force Unmount".
> With the device unmounted on all of the nfs clients I even tried to 
> 'umount -f' manually, but I got the same result.

Odd.  Well, "umount -f" actually doesn't do what most people think it
does.

The "force unmount" option looks for and kills any user-land process
holding a reference on the file system using "kill -9".

So, if you're getting EBUSY on unmount even though force-unmount is
working (confirmed by you looking at lsof/fuser), chances are good that
there's a kernel reference on the file system.

It could be something NFS related - try "service nfs stop" and see if
you can umount the file system.

-- Lon


From herta.vandeneynde at cc.kuleuven.be  Mon Oct 10 21:22:20 2005
From: herta.vandeneynde at cc.kuleuven.be (Herta Van den Eynde)
Date: Mon, 10 Oct 2005 23:22:20 +0200
Subject: [Linux-cluster] umount failed - device is busy
In-Reply-To: <1128978146.4680.37.camel@ayanami.boston.redhat.com>
References: <434A7ADE.108@cc.kuleuven.be>
	<434A8FE6.40508@cc.kuleuven.be>	<1128963722.4680.21.camel@ayanami.boston.redhat.com>	<434AC9DE.50606@cc.kuleuven.be>
	<1128978146.4680.37.camel@ayanami.boston.redhat.com>
Message-ID: <434ADB8C.9010508@cc.kuleuven.be>


Lon Hohberger wrote:
> On Mon, 2005-10-10 at 22:06 +0200, Herta Van den Eynde wrote:
> 
> 
>>>Did you try enabling force unmount in the device/file system
>>>configuration?
>>>
>>>-- Lon
>>
>>Thanks for the explanation, Lon.  Yes, the devices are configured for 
>>"Force Unmount".
>>With the device unmounted on all of the nfs clients I even tried to 
>>'umount -f' manually, but I got the same result.
> 
> 
> Odd.  Well, "umount -f" actually doesn't do what most people think it
> does.
> 
> The "force unmount" option looks for and kills any user-land process
> holding a reference on the file system using "kill -9".
> 
> So, if you're getting EBUSY on unmount even though force-unmount is
> working (confirmed by you looking at lsof/fuser), chances are good that
> there's a kernel reference on the file system.
> 
> It could be something NFS related - try "service nfs stop" and see if
> you can umount the file system.
> 
> -- Lon
> 
Unfortunately, this is a production cluster which serves well over 
100,000 users (e-learning environment for our university, a dozen 
associated colleges, and a few hundred K-12 institutions) and I only 
have 4 hour maintenance windows on the 7th of each month, so stopping 
all of nfs is not an option today.  :-(
One of the cluster services is used for admin purposes, and that's the 
only one I can currently use (within limits) to test suggestions.

FWIIW, I don't think the force unmount works.  True, lsof/fuser don't 
report processes against the filesystem, but "df" and "mount" show that 
  it's still there, and I can write to and read from it after I try a 
"umount -f".

Kind regards,

Herta


Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm


From phung at cs.columbia.edu  Mon Oct 10 22:43:21 2005
From: phung at cs.columbia.edu (Dan B. Phung)
Date: Mon, 10 Oct 2005 18:43:21 -0400 (EDT)
Subject: [Linux-cluster] which CVS version for GFS with 2.6.11?
Message-ID: <Pine.LNX.4.44.0510101839560.15521-100000@minsk.clic.cs.columbia.edu>

Can someone advise me as to which tag I should use to checkout
the latest stable snapshot for GFS with 2.6.11?

e.g. 
cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout -rYOUR_TAG? cluster


thanks,
dan


From herta.vandeneynde at cc.kuleuven.be  Tue Oct 11 10:01:22 2005
From: herta.vandeneynde at cc.kuleuven.be (Herta Van den Eynde)
Date: Tue, 11 Oct 2005 12:01:22 +0200
Subject: [Linux-cluster] umount failed - device is busy
In-Reply-To: <434ADB8C.9010508@cc.kuleuven.be>
References: <434A7ADE.108@cc.kuleuven.be>	<434A8FE6.40508@cc.kuleuven.be>	<1128963722.4680.21.camel@ayanami.boston.redhat.com>	<434AC9DE.50606@cc.kuleuven.be>	<1128978146.4680.37.camel@ayanami.boston.redhat.com>
	<434ADB8C.9010508@cc.kuleuven.be>
Message-ID: <434B8D72.3080006@cc.kuleuven.be>

Next attempt at understanding what is going on.

According to the documentation, "the clurmtabd daemon synchronizes NFS
mount entries in /var/lib/nfs/rmtab with a private copy on a service's
mount point."

Assuming the private copy is the one in .clumanager/rmtab, shouldn't
that file contain data?  (They are empty for all three filesystems.)

Kind regards,

Herta


Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm


From lhh at redhat.com  Tue Oct 11 14:40:30 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 11 Oct 2005 10:40:30 -0400
Subject: [Linux-cluster] which CVS version for GFS with 2.6.11?
In-Reply-To: <Pine.LNX.4.44.0510101839560.15521-100000@minsk.clic.cs.columbia.edu>
References: <Pine.LNX.4.44.0510101839560.15521-100000@minsk.clic.cs.columbia.edu>
Message-ID: <1129041630.4680.58.camel@ayanami.boston.redhat.com>

On Mon, 2005-10-10 at 18:43 -0400, Dan B. Phung wrote:
> Can someone advise me as to which tag I should use to checkout
> the latest stable snapshot for GFS with 2.6.11?
> 
> e.g. 
> cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout -rYOUR_TAG? cluster

Possibly FC4, but it's antiquated.  Note that it was for the FC4 2.6.11
kernel; other kernels may or may not work.

FWIW, the -STABLE branch tracks the latest upstream kernel.

-- Lon


From lhh at redhat.com  Tue Oct 11 15:06:37 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 11 Oct 2005 11:06:37 -0400
Subject: [Linux-cluster] umount failed - device is busy
In-Reply-To: <434ADB8C.9010508@cc.kuleuven.be>
References: <434A7ADE.108@cc.kuleuven.be> <434A8FE6.40508@cc.kuleuven.be>
	<1128963722.4680.21.camel@ayanami.boston.redhat.com>
	<434AC9DE.50606@cc.kuleuven.be>
	<1128978146.4680.37.camel@ayanami.boston.redhat.com>
	<434ADB8C.9010508@cc.kuleuven.be>
Message-ID: <1129043197.4680.85.camel@ayanami.boston.redhat.com>

On Mon, 2005-10-10 at 23:22 +0200, Herta Van den Eynde wrote:
 
> > Odd.  Well, "umount -f" actually doesn't do what most people think it
> > does.

> FWIIW, I don't think the force unmount works.  True, lsof/fuser don't 
> report processes against the filesystem, but "df" and "mount" show that 
>   it's still there, and I can write to and read from it after I try a 
> "umount -f".

(/me waves his hand and says, "This is not the force unmount you are
looking for...")

* "umount -f" only works for NFS-mounted file systems, and only then in
certain cases.  If there is pending I/O (e.g. processes in disk wait),
it will not work (I may be wrong about this one, but I think this is the
case).  In any case, it does not currently do anything for local file
systems like ext3, jfs, reiserfs, xfs, etc...  If there are open
references on any local file system, the umount fails with -EBUSY,
regardless of whether or not "umount -f" was used.

This means, if there is a process, say a bash shell, running with CWD in
a local file system's mount point, running "umount -f" on that mount
point will fail the same as running "umount" without the "-f" flag.


* The "force unmount" option in Cluster Manager, by contrast, attempts
to clear references on a locally mounted file systems by killing
processes using those file systems.  Put more clearly: it attempts to do
what most people think "umount -f" does (or should do) in a general way.
So, back to our example:

That bash shell is sitting in the mountpoint.  We look for all processes
using the mount point, see that bash (pid 11034) is using it, and kill
pid 11034 with signal 9 (SIGKILL).  Bash certainly no longer has a
reference on our file system ;)

Now, there are no more processes using the mount point, so we issue
"umount"...  which should work.  However, this is not working in your
case.  There are a couple of things which come to mind which might cause
this:

* nfsd holding a reference (which is why I asked you to stop nfs;
"exportfs -ua" should work too).

* another mounted file system below the mount point (e.g. trying to
umount /a while another file system is mounted on /a/b).

-- Lon


From lhh at redhat.com  Tue Oct 11 15:09:31 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 11 Oct 2005 11:09:31 -0400
Subject: [Linux-cluster] umount failed - device is busy
In-Reply-To: <434B8D72.3080006@cc.kuleuven.be>
References: <434A7ADE.108@cc.kuleuven.be>	<434A8FE6.40508@cc.kuleuven.be>
	<1128963722.4680.21.camel@ayanami.boston.redhat.com>
	<434AC9DE.50606@cc.kuleuven.be>
	<1128978146.4680.37.camel@ayanami.boston.redhat.com>
	<434ADB8C.9010508@cc.kuleuven.be>  <434B8D72.3080006@cc.kuleuven.be>
Message-ID: <1129043371.4680.89.camel@ayanami.boston.redhat.com>

On Tue, 2005-10-11 at 12:01 +0200, Herta Van den Eynde wrote:
> Next attempt at understanding what is going on.
> 
> According to the documentation, "the clurmtabd daemon synchronizes NFS
> mount entries in /var/lib/nfs/rmtab with a private copy on a service's
> mount point."
> 
> Assuming the private copy is the one in .clumanager/rmtab, shouldn't
> that file contain data?  (They are empty for all three filesystems.)

It synchronizes based on exports found in /etc/cluster.xml ...

What kernel and nfs-utils versions are you running?

(Note that this is a separate problem from the previous one.)

-- Lon


From herta.vandeneynde at cc.kuleuven.be  Tue Oct 11 15:48:29 2005
From: herta.vandeneynde at cc.kuleuven.be (Herta Van den Eynde)
Date: Tue, 11 Oct 2005 17:48:29 +0200
Subject: [Linux-cluster] umount failed - device is busy
In-Reply-To: <1129043197.4680.85.camel@ayanami.boston.redhat.com>
References: <434A7ADE.108@cc.kuleuven.be>
	<434A8FE6.40508@cc.kuleuven.be>	<1128963722.4680.21.camel@ayanami.boston.redhat.com>	<434AC9DE.50606@cc.kuleuven.be>	<1128978146.4680.37.camel@ayanami.boston.redhat.com>	<434ADB8C.9010508@cc.kuleuven.be>
	<1129043197.4680.85.camel@ayanami.boston.redhat.com>
Message-ID: <434BDECD.2060303@cc.kuleuven.be>


Lon Hohberger wrote:
> On Mon, 2005-10-10 at 23:22 +0200, Herta Van den Eynde wrote:
>  
> 
>>>Odd.  Well, "umount -f" actually doesn't do what most people think it
>>>does.
> 
> 
>>FWIIW, I don't think the force unmount works.  True, lsof/fuser don't 
>>report processes against the filesystem, but "df" and "mount" show that 
>>  it's still there, and I can write to and read from it after I try a 
>>"umount -f".
> 
> 
> (/me waves his hand and says, "This is not the force unmount you are
> looking for...")
> 
> * "umount -f" only works for NFS-mounted file systems, and only then in
> certain cases.  If there is pending I/O (e.g. processes in disk wait),
> it will not work (I may be wrong about this one, but I think this is the
> case).  In any case, it does not currently do anything for local file
> systems like ext3, jfs, reiserfs, xfs, etc...  If there are open
> references on any local file system, the umount fails with -EBUSY,
> regardless of whether or not "umount -f" was used.
> 
> This means, if there is a process, say a bash shell, running with CWD in
> a local file system's mount point, running "umount -f" on that mount
> point will fail the same as running "umount" without the "-f" flag.
> 
> 
> * The "force unmount" option in Cluster Manager, by contrast, attempts
> to clear references on a locally mounted file systems by killing
> processes using those file systems.  Put more clearly: it attempts to do
> what most people think "umount -f" does (or should do) in a general way.
> So, back to our example:
> 
> That bash shell is sitting in the mountpoint.  We look for all processes
> using the mount point, see that bash (pid 11034) is using it, and kill
> pid 11034 with signal 9 (SIGKILL).  Bash certainly no longer has a
> reference on our file system ;)
> 
> Now, there are no more processes using the mount point, so we issue
> "umount"...  which should work.  However, this is not working in your
> case.  There are a couple of things which come to mind which might cause
> this:
> 
> * nfsd holding a reference (which is why I asked you to stop nfs;
> "exportfs -ua" should work too).
> 
> * another mounted file system below the mount point (e.g. trying to
> umount /a while another file system is mounted on /a/b).
> 
> -- Lon
Thanks for all this info, Lon.  I really appreciate it.

Bit of extra information:  the system that was running the services got 
STONITHed by the other cluster member shortly before midnight.
The services all failed over nicely, but the situation remains:  if I 
try to stop or relocate a service, I get a "device is busy".
I suppose that rules out an intermittent issue.

There's no mounts below mounts.

Kind regards,

Herta

Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm


From herta.vandeneynde at cc.kuleuven.be  Tue Oct 11 15:52:29 2005
From: herta.vandeneynde at cc.kuleuven.be (Herta Van den Eynde)
Date: Tue, 11 Oct 2005 17:52:29 +0200
Subject: [Linux-cluster] clurmtabd question (was: umount failed - device is
	busy)
In-Reply-To: <1129043371.4680.89.camel@ayanami.boston.redhat.com>
References: <434A7ADE.108@cc.kuleuven.be>	<434A8FE6.40508@cc.kuleuven.be>	<1128963722.4680.21.camel@ayanami.boston.redhat.com>	<434AC9DE.50606@cc.kuleuven.be>	<1128978146.4680.37.camel@ayanami.boston.redhat.com>	<434ADB8C.9010508@cc.kuleuven.be>
	<434B8D72.3080006@cc.kuleuven.be>
	<1129043371.4680.89.camel@ayanami.boston.redhat.com>
Message-ID: <434BDFBD.5040900@cc.kuleuven.be>

Lon Hohberger wrote:
> On Tue, 2005-10-11 at 12:01 +0200, Herta Van den Eynde wrote:
> 
>>Next attempt at understanding what is going on.
>>
>>According to the documentation, "the clurmtabd daemon synchronizes NFS
>>mount entries in /var/lib/nfs/rmtab with a private copy on a service's
>>mount point."
>>
>>Assuming the private copy is the one in .clumanager/rmtab, shouldn't
>>that file contain data?  (They are empty for all three filesystems.)
> 
> 
> It synchronizes based on exports found in /etc/cluster.xml ...
> 
> What kernel and nfs-utils versions are you running?
> 
> (Note that this is a separate problem from the previous one.)
> 
> -- Lon
Glad to hear it's a separate problem.  :-(  I changed the subject 
accordingly.

kernel-smp-2.4.21-37.EL - custom built to probe all LUNs on each SCSI device

nfs-util is at 1.0.6-42EL

Kind regards,

Herta

Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm


From lhh at redhat.com  Tue Oct 11 18:18:31 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 11 Oct 2005 14:18:31 -0400
Subject: [Linux-cluster] umount failed - device is busy
In-Reply-To: <434BDECD.2060303@cc.kuleuven.be>
References: <434A7ADE.108@cc.kuleuven.be> <434A8FE6.40508@cc.kuleuven.be>
	<1128963722.4680.21.camel@ayanami.boston.redhat.com>
	<434AC9DE.50606@cc.kuleuven.be>
	<1128978146.4680.37.camel@ayanami.boston.redhat.com>
	<434ADB8C.9010508@cc.kuleuven.be>
	<1129043197.4680.85.camel@ayanami.boston.redhat.com>
	<434BDECD.2060303@cc.kuleuven.be>
Message-ID: <1129054711.4680.119.camel@ayanami.boston.redhat.com>

On Tue, 2005-10-11 at 17:48 +0200, Herta Van den Eynde wrote:

> Bit of extra information:  the system that was running the services got 
> STONITHed by the other cluster member shortly before midnight.
> The services all failed over nicely, but the situation remains:  if I 
> try to stop or relocate a service, I get a "device is busy".
> I suppose that rules out an intermittent issue.
> 
> There's no mounts below mounts.

Drat.

Nfsd is the most likely candidate for holding the reference. 

Unfortunately, this is not something I can track down; you will have to
either file a support request and/or a Bugzilla.  When you get a chance,
you should definitely try stopping nfsd and seeing if that clears the
mystery references (allowing you to unmount).  If the problem comes from
nfsd, it should not be terribly difficult to track down.

Also, you should not need to recompile your kernel to probe all the LUNs
per device; just edit /etc/modules.conf:

options scsi_mod max_scsi_luns=128

... then run mkinitrd to rebuild the initrd image.

-- Lon


From herta.vandeneynde at cc.kuleuven.be  Tue Oct 11 19:16:55 2005
From: herta.vandeneynde at cc.kuleuven.be (Herta Van den Eynde)
Date: Tue, 11 Oct 2005 21:16:55 +0200
Subject: [Linux-cluster] umount failed - device is busy
In-Reply-To: <1129054711.4680.119.camel@ayanami.boston.redhat.com>
References: <434A7ADE.108@cc.kuleuven.be>
	<434A8FE6.40508@cc.kuleuven.be>	<1128963722.4680.21.camel@ayanami.boston.redhat.com>	<434AC9DE.50606@cc.kuleuven.be>	<1128978146.4680.37.camel@ayanami.boston.redhat.com>	<434ADB8C.9010508@cc.kuleuven.be>	<1129043197.4680.85.camel@ayanami.boston.redhat.com>	<434BDECD.2060303@cc.kuleuven.be>
	<1129054711.4680.119.camel@ayanami.boston.redhat.com>
Message-ID: <434C0FA7.9000803@cc.kuleuven.be>

Lon Hohberger wrote:
> On Tue, 2005-10-11 at 17:48 +0200, Herta Van den Eynde wrote:
> 
> 
>>Bit of extra information:  the system that was running the services got 
>>STONITHed by the other cluster member shortly before midnight.
>>The services all failed over nicely, but the situation remains:  if I 
>>try to stop or relocate a service, I get a "device is busy".
>>I suppose that rules out an intermittent issue.
>>
>>There's no mounts below mounts.
> 
> 
> Drat.
> 
> Nfsd is the most likely candidate for holding the reference. 
> 
> Unfortunately, this is not something I can track down; you will have to
> either file a support request and/or a Bugzilla.  When you get a chance,
> you should definitely try stopping nfsd and seeing if that clears the
> mystery references (allowing you to unmount).  If the problem comes from
> nfsd, it should not be terribly difficult to track down.
> 
> Also, you should not need to recompile your kernel to probe all the LUNs
> per device; just edit /etc/modules.conf:
> 
> options scsi_mod max_scsi_luns=128
> 
> ... then run mkinitrd to rebuild the initrd image.
> 
> -- Lon
Next maintenance window is 4 weeks away, so I won't be able to test the 
nfsd hypothesis anytime soon.  In the meantime, I'll file a support 
request.  I'll keep you posted.

At least the unexpected STONITH confirms that the failover still works.

The /etc/modules.conf tip is a big time saver.  Rebuilding the modules 
takes forever.

Thanks, Lon.

Herta

Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm


From bojan at rexursive.com  Tue Oct 11 20:07:44 2005
From: bojan at rexursive.com (Bojan Smojver)
Date: Wed, 12 Oct 2005 06:07:44 +1000
Subject: [Linux-cluster] GFS 6.1.2 and RHEL4 U2
Message-ID: <1129061264.2348.1.camel@coyote.rexursive.com>

I have a 5 node experimental cluster running RHEL4 U1 and GFS 6.1.0. I
upgraded one box to RHEL U2 (kernel 2.6.9-22.ELsmp) and to GFS 6.1.2.
When the box boots up with the new kernel and GFS, it joins the cluster
OK (I can see that on other members), but clvmd and fenced won't start,
so the system hangs.

Did anyone else experience similar stuff? Or is this intentional (i.e.
is the new version of GFS/cluster binary incompatible with U1 version)?

-- 
Bojan


From bojan at rexursive.com  Wed Oct 12 06:28:25 2005
From: bojan at rexursive.com (Bojan Smojver)
Date: Wed, 12 Oct 2005 16:28:25 +1000
Subject: [Linux-cluster] GFS 6.1.2 and RHEL4 U2
In-Reply-To: <1129061264.2348.1.camel@coyote.rexursive.com>
References: <1129061264.2348.1.camel@coyote.rexursive.com>
Message-ID: <20051012162825.23z5latb4ksk8k8c@imp.rexursive.com>

Quoting Bojan Smojver <bojan at rexursive.com>:

> I have a 5 node experimental cluster running RHEL4 U1 and GFS 6.1.0. I
> upgraded one box to RHEL U2 (kernel 2.6.9-22.ELsmp) and to GFS 6.1.2.
> When the box boots up with the new kernel and GFS, it joins the cluster
> OK (I can see that on other members), but clvmd and fenced won't start,
> so the system hangs.
>
> Did anyone else experience similar stuff? Or is this intentional (i.e.
> is the new version of GFS/cluster binary incompatible with U1 version)?

BTW, this is what I get on the upgraded machine when I attempt to start
fenced:

Oct 12 16:24:42 matrix1-5 kernel: SM: process_reply invalid id=0
nodeid=3
Oct 12 16:24:42 matrix1-5 kernel: SM: process_reply invalid id=0
nodeid=2
Oct 12 16:24:42 matrix1-5 kernel: SM: process_reply invalid id=0
nodeid=4
Oct 12 16:24:42 matrix1-5 kernel: SM: process_reply invalid id=0
nodeid=1

Fenced never starts...

-- 
Bojan


From pcaulfie at redhat.com  Wed Oct 12 06:47:45 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Wed, 12 Oct 2005 07:47:45 +0100
Subject: [Linux-cluster] Re: setting the heartbeat interval
In-Reply-To: <1a81f4f4b1d7c534a47c31bd918bea98@redhat.com>
References: <1a81f4f4b1d7c534a47c31bd918bea98@redhat.com>
Message-ID: <434CB191.1090803@redhat.com>

Jonathan E Brassow wrote:
> Looking at the cman_tool man page, I don't see a way to change the
> heartbeat interval.  Dave, is there a way to change this while cman is
> part of a cluster?
> 
> To change the in memory version number or expected votes for cman, you
> would:
> 
> 1) change cluster.xml file
> 2) ccs_tool update <new xml file>
> 3) cman_tool version -r <new #> ; cman_tool expected -e <new expected #>
> 
> If cman_tool can change the heartbeat interval without restarting the
> cluster (or cman on each machine), it would look very much like step
> #3.  This can not be done through the GUI, because the GUI only changes
> the in memory version number.

A reply for the list:

The heartbeat & detection intervals for cman can be set by writing values into
/proc/cluster/conf/cman/hello_timer & /proc/cluster/conf/cman/deadnode_timer

These values are in seconds and take effect immediately (-ish, ie the new hello
timer will take effect after the last hello timer has expired). Because these
really need to be the same on all nodes I don't recommend changing them
on-the-fly though - they should be set between loading the module and running
cman_tool join.

There is also /proc/cluster/conf/cman/max_retries which some may like to
increase if they are seeing "No response to messages" reasons for a node being
kicked out of the cluster - you can change this any time you like with no ill
effects.

The version of cman_tool on the STABLE tag of CVS has code that will read these
values from CCS when "cman_tool join" is run. I think this should be in a future
RHEL4 Update.


-- 

patrick


From baesso at ksolutions.it  Wed Oct 12 08:05:25 2005
From: baesso at ksolutions.it (Baesso Mirko)
Date: Wed, 12 Oct 2005 10:05:25 +0200
Subject: R: [Linux-cluster] Setup Fence_wti on cluster RHES4 U1
Message-ID: <F63C63702B77F94CB4664B4D88660C4D4BAD1A@kmail.ksolutions.it>

Hi

I check fence device section and I setup all, wti ports also

But when i try to unplug one node the other cannot power it off

This is my cluster.conf 

 
<?xml version="1.0"?>

<cluster config_version="65" name="sfpi_cluster">

        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>

        <clusternodes>

                <clusternode name="rhcs-db1.ksolutions.it" votes="1">

                        <fence>

                                <method name="1">

                                        <device name="nps-1" port="2"/>

                                </method>

                        </fence>

                </clusternode>

                <clusternode name="rhcs-db2.ksolutions.it" votes="1">

                        <fence>

                                <method name="1">

                                        <device name="nps-2" port="3"/>

                                </method>

                        </fence>

                </clusternode>

        </clusternodes>

        <cman expected_votes="1" two_node="1"/>

        <fencedevices>

                <fencedevice agent="fence_wti" ipaddr="1.1.1.1" name="nps-1" passwd="passwd"/>

                <fencedevice agent="fence_wti" ipaddr="1.1.1.2" name="nps-2" passwd="passwd"/>

        </fencedevices>

...

 
Baesso Mirko - System Engineer 
KSolutions.S.p.A. 
Via Lenin 132/26 
56017  S.Martino Ulmiano (PI) - Italy 
tel.+ 39 0 50 898369 fax. + 39 0 50 861200 
baesso at ksolutions.it   http//www.ksolutions.it 

  _____  

Da: Jonathan E Brassow [mailto:jbrassow at redhat.com] 
Inviato: luned? 10 ottobre 2005 22.04
A: linux clustering
Oggetto: Re: [Linux-cluster] Setup Fence_wti on cluster RHES4 U1

 
I don't have the gui in front of me, but there should be a manage fencing button or something... If you've never specified which port that a machine is connected to on the WTI, then you haven't gotten far enough in the set-up. 

 
brassow 

 
On Oct 10, 2005, at 2:22 AM, Baesso Mirko wrote: 

	 
	Hi 

	 
	i've setup a Redhat cluster with two node and i try to test failover using power switch (WTI-NPS230) but seem doesn't work 

	 
	If I look for error messages I see that fence_wti waits for a command to execute. 

	 
	I setup cluster.conf using system-config-cluster and there is no section regarding command to execute 

	 
	Could you please let me now how to setup correctly 

	 
	Thanks 

	 
	Baesso Mirko - System Engineer 

	 
	KSolutions.S.p.A. 

	 
	Via Lenin 132/26 

	 
	56017  S.Martino Ulmiano (PI) - Italy 

	 
	tel.+ 39 0 50 898369 fax. + 39 0 50 861200 

	 
	baesso at ksolutions.it   http//www.ksolutions.it 

	 
	-- 

	Linux-cluster mailing list 

	Linux-cluster at redhat.com 

	https://www.redhat.com/mailman/listinfo/linux-cluster

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051012/40c30fc8/attachment.htm>

From teigland at redhat.com  Wed Oct 12 14:29:26 2005
From: teigland at redhat.com (David Teigland)
Date: Wed, 12 Oct 2005 09:29:26 -0500
Subject: [Linux-cluster] GFS 6.1.2 and RHEL4 U2
In-Reply-To: <20051012162825.23z5latb4ksk8k8c@imp.rexursive.com>
References: <1129061264.2348.1.camel@coyote.rexursive.com>
	<20051012162825.23z5latb4ksk8k8c@imp.rexursive.com>
Message-ID: <20051012142926.GB7876@redhat.com>

On Wed, Oct 12, 2005 at 04:28:25PM +1000, Bojan Smojver wrote:
> Quoting Bojan Smojver <bojan at rexursive.com>:
> 
> >I have a 5 node experimental cluster running RHEL4 U1 and GFS 6.1.0. I
> >upgraded one box to RHEL U2 (kernel 2.6.9-22.ELsmp) and to GFS 6.1.2.
> >When the box boots up with the new kernel and GFS, it joins the cluster
> >OK (I can see that on other members), but clvmd and fenced won't start,
> >so the system hangs.
> >
> >Did anyone else experience similar stuff? Or is this intentional (i.e.
> >is the new version of GFS/cluster binary incompatible with U1 version)?
> 
> BTW, this is what I get on the upgraded machine when I attempt to start
> fenced:
> 
> Oct 12 16:24:42 matrix1-5 kernel: SM: process_reply invalid id=0
> nodeid=3
> Oct 12 16:24:42 matrix1-5 kernel: SM: process_reply invalid id=0
> nodeid=2
> Oct 12 16:24:42 matrix1-5 kernel: SM: process_reply invalid id=0
> nodeid=4
> Oct 12 16:24:42 matrix1-5 kernel: SM: process_reply invalid id=0
> nodeid=1
> 
> Fenced never starts...

A bug fix required a minor change to the cman/sm message formats between
U1 and U2 that make the two versions incompatible, so all nodes need to be
running the U2 version.

Dave


From bojan at rexursive.com  Wed Oct 12 20:06:18 2005
From: bojan at rexursive.com (Bojan Smojver)
Date: Thu, 13 Oct 2005 06:06:18 +1000
Subject: [Linux-cluster] GFS 6.1.2 and RHEL4 U2
In-Reply-To: <20051012142926.GB7876@redhat.com>
References: <1129061264.2348.1.camel@coyote.rexursive.com>
	<20051012162825.23z5latb4ksk8k8c@imp.rexursive.com>
	<20051012142926.GB7876@redhat.com>
Message-ID: <1129147578.31843.5.camel@coyote.rexursive.com>

On Wed, 2005-10-12 at 09:29 -0500, David Teigland wrote:

> A bug fix required a minor change to the cman/sm message formats between
> U1 and U2 that make the two versions incompatible, so all nodes need to be
> running the U2 version.

Thanks. I'll bounce all nodes today and report back if they don't form
the cluster (I'm sure they will :-).

I have to admit that I didn't look that hard into RPM release notes, but
I never noticed any warnings about this on RHN...

-- 
Bojan


From Bowie_Bailey at BUC.com  Wed Oct 12 20:18:41 2005
From: Bowie_Bailey at BUC.com (Bowie Bailey)
Date: Wed, 12 Oct 2005 16:18:41 -0400
Subject: [Linux-cluster] GFS + DLM howto?
Message-ID: <4766EEE585A6D311ADF500E018C154E302133245@bnifex.cis.buc.com>

I'm trying to configure three servers to share a GFS 6.1 filesystem.

I am completely new to all of this and the instructions in the manuals I've
found on the RH website are running me around in circles.

Can anyone point me to a good how-to that will walk me through a simple
configuration of GFS and DLM?

Thanks,

Bowie


From teigland at redhat.com  Wed Oct 12 20:28:17 2005
From: teigland at redhat.com (David Teigland)
Date: Wed, 12 Oct 2005 15:28:17 -0500
Subject: [Linux-cluster] GFS + DLM howto?
In-Reply-To: <4766EEE585A6D311ADF500E018C154E302133245@bnifex.cis.buc.com>
References: <4766EEE585A6D311ADF500E018C154E302133245@bnifex.cis.buc.com>
Message-ID: <20051012202817.GD10593@redhat.com>

On Wed, Oct 12, 2005 at 04:18:41PM -0400, Bowie Bailey wrote:
> I'm trying to configure three servers to share a GFS 6.1 filesystem.
> 
> I am completely new to all of this and the instructions in the manuals I've
> found on the RH website are running me around in circles.
> 
> Can anyone point me to a good how-to that will walk me through a simple
> configuration of GFS and DLM?

I think the official documentation assumes you're doing everything through
the gui, at least with respect to the clustering components.  If you're
not, then this is probably the best we have along with the man pages:
http://sources.redhat.com/cluster/doc/usage.txt

Dave


From sgray at bluestarinc.com  Wed Oct 12 21:21:10 2005
From: sgray at bluestarinc.com (Sean Gray)
Date: Wed, 12 Oct 2005 17:21:10 -0400
Subject: [Linux-cluster] GFS + DLM howto?
In-Reply-To: <4766EEE585A6D311ADF500E018C154E302133245@bnifex.cis.buc.com>
References: <4766EEE585A6D311ADF500E018C154E302133245@bnifex.cis.buc.com>
Message-ID: <1129152070.4166.401.camel@libra.bluestar.cvg0>

Following are my notes. Keep in mind that I did a lot of installing from
SRPMs and you may not need to go through all that. Hope it helps... -
Sean


# RHGFS and RHCS on RHEL4 x86_64 2.6.9-11
# v 0.1
# By Sean Gray <sgray @ funkylinux.com> <sgray @ bluestarinc.com> copyright 2005
# Published under the GNU Free Documentation License http://www.gnu.org/licenses/fdl.txt
# Sources:	http://www.google.com
#		http://www.redhat.com/docs/manuals/csgfs/browse/rh-gfs-en/ch-install.html
#		http://www.redhat.com/docs/manuals/csgfs/browse/rh-cs-en/ch-software.html
#		http://sources.redhat.com/cluster/doc/usage.txt
#		http://karan.org/
#		https://www.redhat.com/archives/linux-cluster/index.html
#		http://lists.centos.org/pipermail/centos-devel/2005-August/thread.html#861
#		http://www.hughesjr.com/
# This document claims to have no value whatsoever to anyone, except maybe the author.
#
# 
#
# enable ntp 
system-config-time
# install SRPMS from:
# ftp://ftp.redhat.com/pub/redhat/linux/enterprise/4/en/RHCS/x86_64/SRPMS/*
# ftp://ftp.redhat.com/pub/redhat/linux/enterprise/4/en/RHGFS/x86_64/SRPMS/*

# install kernel source 
rpm -ivh ftp://ftp.redhat.com/pub/redhat/linux/updates/enterprise/4ES/en/os/SRPMS/kernel-2.6.9-11.EL.src.rpm

# build install perl-Net-Telnet-3.03-3
rpmbuild --rebuild /usr/src/redhat/SRPMS/perl-Net-Telnet-3.03-3.src.rpm
rpm -ivh /usr/src/redhat/RPMS/noarch/perl-Net-Telnet-3.03-3.noarch.rpm  

# build install gulm
rpmbuild -bb /usr/src/redhat/SPECS/gulm.spec
rpm -ivh /usr/src/redhat/RPMS/x86_64/gulm-1.0.0-0.x86_64.rpm \
/usr/src/redhat/RPMS/x86_64/gulm-devel-1.0.0-0.x86_64.rpm \
/usr/src/redhat/RPMS/x86_64/gulm-debuginfo-1.0.0-0.x86_64.rpm 

# build install magma
rpmbuild -bb /usr/src/redhat/SPECS/magma.spec
rpm -ivh /usr/src/redhat/RPMS/x86_64/magma-1.0.0-0.x86_64.rpm
rpm -ivh /usr/src/redhat/RPMS/x86_64/magma-debuginfo-1.0.0-0.x86_64.rpm
rpm -ivh /usr/src/redhat/RPMS/x86_64/magma-devel-1.0.0-0.x86_64.rpm

# build install magma-plugins
rpmbuild -bb /usr/src/redhat/SPECS/magma-plugins.spec
rpm -ivh /usr/src/redhat/RPMS/x86_64/magma-plugins-1.0.0-0.x86_64.rpm \
/usr/src/redhat/RPMS/x86_64/magma-plugins-debuginfo-1.0.0-0.x86_64.rpm

# Download and install from RHN
rpm -ivh kernel-2.6.9-11.EL.x86_64.rpm \
kernel-devel-2.6.9-11.EL.x86_64.rpm \
kernel-doc-2.6.9-11.EL.noarch.rpm \ 
kernel-hugemem-2.6.9-11.EL.i686.rpm \
kernel-hugemem-devel-2.6.9-11.EL.i686 \
kernel-smp-2.6.9-11.EL.x86_64.rpm \
kernel-smp-devel-2.6.9-11.EL.x86_64.rpm
# reboot with new kernel
init 6

# Edit all the hugemem kernel stuff out of the spec files
# I don't need hugmem and don't have time to troubleshoot

# uneasily build install 3rd-party fake-build-requires
rpm -ivh http://rpm.karan.org/el4/csgfs/SRPMS/fake-build-provides-1.0-20.src.rpm
rpm -ivh 

# build install ccs
rpmbuild -bb /usr/src/redhat/SPECS/ccs.spec
rpm -ivh /usr/src/redhat/RPMS/x86_64/ccs-1.0.0-0.x86_64.rpm \
/usr/src/redhat/RPMS/x86_64/ccs-debuginfo-1.0.0-0.x86_64.rpm \
/usr/src/redhat/RPMS/x86_64/ccs-devel-1.0.0-0.x86_64.rpm

# build install cman-kernel
rpmbuild -bb /usr/src/redhat/SPECS/cman-kernel.spec   
rpm -ivh /usr/src/redhat/RPMS/x86_64/cman-kernel-2.6.9-36.0.x86_64.rpm \
/usr/src/redhat/RPMS/x86_64/cman-kernel-debuginfo-2.6.9-36.0.x86_64.rpm \
/usr/src/redhat/RPMS/x86_64/cman-kernel-smp-2.6.9-36.0.x86_64.rpm \
/usr/src/redhat/RPMS/x86_64/cman-kernheaders-2.6.9-36.0.x86_64.rpm

 
# build install cman
rpmbuild -bb /usr/src/redhat/SPECS/cman.spec
rpm -ivh /usr/src/redhat/RPMS/x86_64/cman-1.0.0-0.x86_64.rpm \
/usr/src/redhat/RPMS/x86_64/cman-debuginfo-1.0.0-0.x86_64.rpm

# build install dlm-kernel
# edit dlm-kernel.spec
# remove --> $kernel_src/scripts/mod/modpost -m -i /lib/modules/%{kernel_version}$flavor/kernel/cluster/cman.symvers src/dlm.o -o dlm.symvers
# add --> $kernel_src/scripts/mod/modpost -m -i /lib/modules/%{kernel_version}$flavor/kernel/cluster/cman.symvers /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/dlm.o -o dlm.symvers
rpmbuild -bb /usr/src/redhat/SPECS/dlm-kernel.spec
rpm -ivh /usr/src/redhat/RPMS/x86_64/dlm-kernel-2.6.9-34.0.x86_64.rpm \
/usr/src/redhat/RPMS/x86_64/dlm-kernel-debuginfo-2.6.9-34.0.x86_64.rpm \
/usr/src/redhat/RPMS/x86_64/dlm-kernheaders-2.6.9-34.0.x86_64.rpm \
/usr/src/redhat/RPMS/x86_64/dlm-kernel-smp-2.6.9-34.0.x86_64.rpm

# build install dlm
rpmbuild -bb /usr/src/redhat/SPECS/dlm.spec
rpm -ivh /usr/src/redhat/RPMS/x86_64/dlm-1.0.0-0.x86_64.rpm \
/usr/src/redhat/RPMS/x86_64/dlm-devel-1.0.0-0.x86_64.rpm \
/usr/src/redhat/RPMS/x86_64/dlm-debuginfo-1.0.0-0.x86_64.rpm

# build install fence
rpmbuild -bb /usr/src/redhat/SPECS/fence.spec
rpm -ivh /usr/src/redhat/RPMS/x86_64/fence-1.32.1-0.x86_64.rpm \
/usr/src/redhat/RPMS/x86_64/fence-debuginfo-1.32.1-0.x86_64.rpm

# build install iddev
rpmbuild -bb /usr/src/redhat/SPECS/iddev.spec
rpm -ivh /usr/src/redhat/RPMS/x86_64/iddev-2.0.0-0.x86_64.rpm \
/usr/src/redhat/RPMS/x86_64/iddev-devel-2.0.0-0.x86_64.rpm \
/usr/src/redhat/RPMS/x86_64/iddev-debuginfo-2.0.0-0.x86_64.rpm

# build install rgmanager
rpmbuild -bb /usr/src/redhat/SPECS/rgmanager.spec
rpm -ivh /usr/src/redhat/RPMS/x86_64/rgmanager-1.9.34-1.x86_64.rpm \
/usr/src/redhat/RPMS/x86_64/rgmanager-debuginfo-1.9.34-1.x86_64.rpm

# build install system-config-cluster
rpmbuild -bb /usr/src/redhat/SPECS/system-config-cluster.spec
/usr/src/redhat/RPMS/noarch/system-config-cluster-1.0.12-1.0.noarch.rpm

# build install ipvsadm
rpmbuild -bb /usr/src/redhat/SPECS/ipvsadm.spec    
rpm -ivh /usr/src/redhat/RPMS/x86_64/ipvsadm-1.24-6.x86_64.rpm \
/usr/src/redhat/RPMS/x86_64/ipvsadm-debuginfo-1.24-6.x86_64.rpm

# build install piranha
rpmbuild -bb /usr/src/redhat/SPECS/piranha.spec
rpm -ivh /usr/src/redhat/RPMS/x86_64/piranha-0.8.0-1.x86_64.rpm \
/usr/src/redhat/RPMS/x86_64/piranha-debuginfo-0.8.0-1.x86_64.rpm   

# I tried to build gfs-kernel here but smp would not compile
# finally figured out, after moving to gnbd-kernel that deleting
# /usr/src/BUILD/smp allowed it to build. Hmmm.
rm -rf /usr/src/BUILD/smp

# build install gnbd-kernel
rpmbuild -bb /usr/src/redhat/SPECS/gnbd-kernel.spec
rpm -ivh /usr/src/redhat/RPMS/x86_64/gnbd-kernel-2.6.9-8.27.x86_64.rpm \
/usr/src/redhat/RPMS/x86_64/gnbd-kernheaders-2.6.9-8.27.x86_64.rpm \
/usr/src/redhat/RPMS/x86_64/gnbd-kernel-smp-2.6.9-8.27.x86_64.rpm \
/usr/src/redhat/RPMS/x86_64/gnbd-kernel-debuginfo-2.6.9-8.27.x86_64.rpm
rm -rf /usr/src/BUILD/smp
# after all the trial and error it appears my BUILD was hosed also
rm -rf /usr/src/redhat/BUILD/gfs-kernel-2.6.9-35/

# build install gfs-kernel
rpmbuild -bb /usr/src/redhat/SPECS/GFS-kernel.spec  
rpm -ivh /usr/src/redhat/RPMS/x86_64/GFS-kernel-2.6.9-35.5.x86_64.rpm \
/usr/src/redhat/RPMS/x86_64/GFS-kernheaders-2.6.9-35.5.x86_64.rpm \
/usr/src/redhat/RPMS/x86_64/GFS-kernel-smp-2.6.9-35.5.x86_64.rpm \
/usr/src/redhat/RPMS/x86_64/GFS-kernel-debuginfo-2.6.9-35.5.x86_64.rpm

# build install gfs
rpmbuild -bb /usr/src/redhat/SPECS/GFS.spec    
rpm -ivh /usr/src/redhat/RPMS/x86_64/GFS-6.1.0-0.x86_64.rpm \
/usr/src/redhat/RPMS/x86_64/GFS-debuginfo-6.1.0-0.x86_64.rpm 

# build install gnbd
rpmbuild --bb /usr/src/redhat/SPECS/gnbd.spec
rpm -ivh /usr/src/redhat/RPMS/x86_64/gnbd-1.0.0-0.x86_64.rpm \
/usr/src/redhat/RPMS/x86_64/gnbd-debuginfo-1.0.0-0.x86_64.rpm

# build install lvm2-cluster
# On my not so up2date system I had to first upgrade
# device-mapper and lvm2 for this to work
rpmbuild -bb /usr/src/redhat/SPECS/lvm2-cluster.spec
rpm -ivh /usr/src/redhat/RPMS/x86_64/lvm2-cluster-2.01.09-5.0.RHEL4.x86_64.rpm \
/usr/src/redhat/RPMS/x86_64/lvm2-cluster-debuginfo-2.01.09-5.0.RHEL4.x86_64.rpm

# Wow that only took 24 hours of my life.
# That was on node 1 <galaga> onto node 2 <asteroids>

# copy all newly built rpms to a tmp folder on asteroids
# download the following from RHN
#	kernel-smp-2.6.9-11.EL.x86_64.rpm
#	kernel-2.6.9-11.EL.x86_64.rpm   
#	device-mapper-1.01.01-1.RHEL4.x86_64.rpm
#	lvm2-2.01.08-1.0.RHEL4.x86_64.rpm
# remove all device-mapper rpms (???) there are both
# i386 and x86_64 installed, I removed both and installed
# the x86_64 and i386
rpm -e device-mapper-1.00.19-2 --allmatches --nodeps

# install
rpm -Uvh perl-Net-Telnet-3.03-3.noarch.rpm \
system-config-cluster-1.0.12-1.0.noarch.rpm \
ccs-1.0.0-0.x86_64.rpm \
ccs-debuginfo-1.0.0-0.x86_64.rpm \
ccs-devel-1.0.0-0.x86_64.rpm \
cman-1.0.0-0.x86_64.rpm \
cman-debuginfo-1.0.0-0.x86_64.rpm \
cman-kernel-2.6.9-36.0.x86_64.rpm \
cman-kernel-debuginfo-2.6.9-36.0.x86_64.rpm \
cman-kernel-smp-2.6.9-36.0.x86_64.rpm \
cman-kernheaders-2.6.9-36.0.x86_64.rpm \
device-mapper-1.01.01-1.RHEL4.x86_64.rpm \
dlm-1.0.0-0.x86_64.rpm \
dlm-debuginfo-1.0.0-0.x86_64.rpm \
dlm-devel-1.0.0-0.x86_64.rpm \
dlm-kernel-2.6.9-34.0.x86_64.rpm \
dlm-kernel-debuginfo-2.6.9-34.0.x86_64.rpm \
dlm-kernel-dlm-kernheaders-2.6.9-34.0.x86_64.rpm \
dlm-kernel-smp-2.6.9-34.0.x86_64.rpm \
dlm-kernheaders-2.6.9-34.0.x86_64.rpm \
fake-build-provides-1.0-20.x86_64.rpm \
fence-1.32.1-0.x86_64.rpm \
fence-debuginfo-1.32.1-0.x86_64.rpm \
GFS-6.1.0-0.x86_64.rpm \
GFS-debuginfo-6.1.0-0.x86_64.rpm \
GFS-kernel-2.6.9-35.5.x86_64.rpm \
GFS-kernel-debuginfo-2.6.9-35.5.x86_64.rpm \
GFS-kernel-smp-2.6.9-35.5.x86_64.rpm \
GFS-kernheaders-2.6.9-35.5.x86_64.rpm \
gnbd-1.0.0-0.x86_64.rpm \
gnbd-debuginfo-1.0.0-0.x86_64.rpm \
gnbd-kernel-2.6.9-8.27.x86_64.rpm \
gnbd-kernel-debuginfo-2.6.9-8.27.x86_64.rpm \
gnbd-kernel-smp-2.6.9-8.27.x86_64.rpm \
gnbd-kernheaders-2.6.9-8.27.x86_64.rpm \
gulm-1.0.0-0.x86_64.rpm \
gulm-debuginfo-1.0.0-0.x86_64.rpm \
gulm-devel-1.0.0-0.x86_64.rpm \
iddev-2.0.0-0.x86_64.rpm \
iddev-debuginfo-2.0.0-0.x86_64.rpm \
iddev-devel-2.0.0-0.x86_64.rpm \
ipvsadm-1.24-6.x86_64.rpm \
ipvsadm-debuginfo-1.24-6.x86_64.rpm \
kernel-2.6.9-11.EL.x86_64.rpm \
kernel-smp-2.6.9-11.EL.x86_64.rpm \
lvm2-2.01.08-1.0.RHEL4.x86_64.rpm \
lvm2-cluster-2.01.09-5.0.RHEL4.x86_64.rpm \
lvm2-cluster-debuginfo-2.01.09-5.0.RHEL4.x86_64.rpm \
magma-1.0.0-0.x86_64.rpm \
magma-debuginfo-1.0.0-0.x86_64.rpm \
magma-devel-1.0.0-0.x86_64.rpm \
magma-plugins-1.0.0-0.x86_64.rpm \
magma-plugins-debuginfo-1.0.0-0.x86_64.rpm \
piranha-0.8.0-1.x86_64.rpm \
piranha-debuginfo-0.8.0-1.x86_64.rpm \
rgmanager-1.9.34-1.x86_64.rpm \
rgmanager-debuginfo-1.9.34-1.x86_64.rpm
# reboot

# Configuration

pvcreate /dev/sda
# carve up your disk with system-config-lvm
gfs_mkfs -p lock_dlm -t alpha_cluster:scratch0_LV -j 12 /dev/scratch_VG/scratch0_LV
# on all nodes
mount -t gfs -oacl /dev/scratch_VG/scratch0_LV /scratch/

# Rinse and repeat on defender, tron, centipede, tapper, paperboy, joust, tempest
# galaxian, pacman, and punchout


On Wed, 2005-10-12 at 16:18 -0400, Bowie Bailey wrote:

> I'm trying to configure three servers to share a GFS 6.1 filesystem.
> 
> I am completely new to all of this and the instructions in the manuals I've
> found on the RH website are running me around in circles.
> 
> Can anyone point me to a good how-to that will walk me through a simple
> configuration of GFS and DLM?
> 
> Thanks,
> 
> Bowie
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 

Sean N. Gray
Director of Information Technology
United Radio Incorporated, DBA BlueStar
24 Spiral Drive
Florence, Kentucky 41042
office: 859.371.4423 x263
toll free: 800.371.4423 x263
fax: 859.371.4425
mobile: 513.616.3379
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051012/f5896c6e/attachment.htm>

From bojan at rexursive.com  Wed Oct 12 23:16:05 2005
From: bojan at rexursive.com (Bojan Smojver)
Date: Thu, 13 Oct 2005 09:16:05 +1000
Subject: [Linux-cluster] GFS 6.1.2 and RHEL4 U2
In-Reply-To: <1129147578.31843.5.camel@coyote.rexursive.com>
References: <1129061264.2348.1.camel@coyote.rexursive.com>
	<20051012162825.23z5latb4ksk8k8c@imp.rexursive.com>
	<20051012142926.GB7876@redhat.com>
	<1129147578.31843.5.camel@coyote.rexursive.com>
Message-ID: <20051013091605.0vrlyqfun4kgk8wg@imp.rexursive.com>

Quoting Bojan Smojver <bojan at rexursive.com>:

> Thanks. I'll bounce all nodes today and report back if they don't form
> the cluster (I'm sure they will :-).

They all came back, so it was just a backward compatibility problem.

-- 
Bojan


From Bowie_Bailey at BUC.com  Thu Oct 13 13:49:48 2005
From: Bowie_Bailey at BUC.com (=?UTF-8?B?Qm93aWUgQmFpbGV5?=)
Date: Thu, 13 Oct 2005 09:49:48 -0400
Subject: =?UTF-8?B?UkU6IFtMaW51eC1jbHVzdGVyXSBHRlMgKyBETE0gaG93dG8/?=
Message-ID: <4766EEE585A6D311ADF500E018C154E302133249@bnifex.cis.buc.com>

From: Sean Gray [mailto:sgray at bluestarinc.com]
> 
> Following are my notes. Keep in mind that I did a lot of 
> installing from SRPMs and you may not need to go through all 
> that. Hope it helps... - Sean
> 
>   .
>   .
>   .
> 
> # Configuration
> 
> pvcreate /dev/sda
> # carve up your disk with system-config-lvm
> gfs_mkfs -p lock_dlm -t alpha_cluster:scratch0_LV -j 12 
> /dev/scratch_VG/scratch0_LV
> # on all nodes
> mount -t gfs -oacl /dev/scratch_VG/scratch0_LV /scratch/
> 
> # Rinse and repeat on defender, tron, centipede, tapper, 
> paperboy, joust, tempest
> # galaxian, pacman, and punchout

Ok, but what I am trying to get is instructions on how to configure
the "alpha_cluster:scratch0_LV" lock table that you refer to here.

Bowie


From sgray at bluestarinc.com  Thu Oct 13 14:46:59 2005
From: sgray at bluestarinc.com (Sean Gray)
Date: Thu, 13 Oct 2005 10:46:59 -0400
Subject: [Linux-cluster] GFS + DLM howto?
In-Reply-To: <4766EEE585A6D311ADF500E018C154E302133249@bnifex.cis.buc.com>
References: <4766EEE585A6D311ADF500E018C154E302133249@bnifex.cis.buc.com>
Message-ID: <1129214819.9819.108.camel@libra.bluestar.cvg0>

Create a GFS resource using system-config-cluster.


On Thu, 2005-10-13 at 09:49 -0400, Bowie Bailey wrote:

> From: Sean Gray [mailto:sgray at bluestarinc.com]
> > 
> > Following are my notes. Keep in mind that I did a lot of 
> > installing from SRPMs and you may not need to go through all 
> > that. Hope it helps... - Sean
> > 
> >   .
> >   .
> >   .
> > 
> > # Configuration
> > 
> > pvcreate /dev/sda
> > # carve up your disk with system-config-lvm
> > gfs_mkfs -p lock_dlm -t alpha_cluster:scratch0_LV -j 12 
> > /dev/scratch_VG/scratch0_LV
> > # on all nodes
> > mount -t gfs -oacl /dev/scratch_VG/scratch0_LV /scratch/
> > 
> > # Rinse and repeat on defender, tron, centipede, tapper, 
> > paperboy, joust, tempest
> > # galaxian, pacman, and punchout
> 
> Ok, but what I am trying to get is instructions on how to configure
> the "alpha_cluster:scratch0_LV" lock table that you refer to here.
> 
> Bowie
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 

Sean N. Gray
Director of Information Technology
United Radio Incorporated, DBA BlueStar
24 Spiral Drive
Florence, Kentucky 41042
office: 859.371.4423 x263
toll free: 800.371.4423 x263
fax: 859.371.4425
mobile: 513.616.3379
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051013/2d6f79e8/attachment.htm>

From teigland at redhat.com  Thu Oct 13 15:03:45 2005
From: teigland at redhat.com (David Teigland)
Date: Thu, 13 Oct 2005 10:03:45 -0500
Subject: [Linux-cluster] GFS + DLM howto?
In-Reply-To: <4766EEE585A6D311ADF500E018C154E302133249@bnifex.cis.buc.com>
References: <4766EEE585A6D311ADF500E018C154E302133249@bnifex.cis.buc.com>
Message-ID: <20051013150345.GA8587@redhat.com>

On Thu, Oct 13, 2005 at 09:49:48AM -0400, Bowie Bailey wrote:
> From: Sean Gray [mailto:sgray at bluestarinc.com]

> > gfs_mkfs -p lock_dlm -t alpha_cluster:scratch0_LV -j 12 
> > /dev/scratch_VG/scratch0_LV
> > # on all nodes
> > mount -t gfs -oacl /dev/scratch_VG/scratch0_LV /scratch/

> Ok, but what I am trying to get is instructions on how to configure
> the "alpha_cluster:scratch0_LV" lock table that you refer to here.

"alpha_cluster" is the cluster name from cluster.conf
"scratch0_LV" is the unique filesystem name that you pick for the fs
when you do gfs_mkfs.  These are mentioned in usage.txt, man gfs_mkfs.

Dave


From Bowie_Bailey at BUC.com  Thu Oct 13 15:30:23 2005
From: Bowie_Bailey at BUC.com (Bowie Bailey)
Date: Thu, 13 Oct 2005 11:30:23 -0400
Subject: [Linux-cluster] GFS + DLM howto?
Message-ID: <4766EEE585A6D311ADF500E018C154E30213324D@bnifex.cis.buc.com>

From: David Teigland [mailto:teigland at redhat.com]
> 
> On Thu, Oct 13, 2005 at 09:49:48AM -0400, Bowie Bailey wrote:
> > From: Sean Gray [mailto:sgray at bluestarinc.com]
> 
> > > gfs_mkfs -p lock_dlm -t alpha_cluster:scratch0_LV -j 12 
> > > /dev/scratch_VG/scratch0_LV
> > > # on all nodes
> > > mount -t gfs -oacl /dev/scratch_VG/scratch0_LV /scratch/
> 
> > Ok, but what I am trying to get is instructions on how to configure
> > the "alpha_cluster:scratch0_LV" lock table that you refer to here.
> 
> "alpha_cluster" is the cluster name from cluster.conf
> "scratch0_LV" is the unique filesystem name that you pick for the fs
> when you do gfs_mkfs.  These are mentioned in usage.txt, man gfs_mkfs.

Right.  I'm currently working through the usage.txt that you linked me
to.  I was just replying to Sean to see if he had anything extra to
add since his response skipped over the part of the configuration that
I'm interested in.

I think I'll be able to figure it out from here.  I'll be back if I
have more questions.  :)

Thanks for the help!   (both of you)

Bowie


From spwilcox at att.com  Thu Oct 13 18:26:59 2005
From: spwilcox at att.com (Steve Wilcox)
Date: Thu, 13 Oct 2005 14:26:59 -0400
Subject: [Linux-cluster] Oracle 10G-R2 on GFS install problems
Message-ID: <1129228020.27905.17.camel@aptis101.cqtel.com>

In the process of installing Oracle 10G-R2 on a RHEL4-U2 x86_64 cluster
with GFS 6.1.2, I get the following error when running Oracle's root.sh
for cluster ready services (a.k.a clusterware):

 [  OCROSD][4142143168]utstoragetype: /u00/app/ocr0 is on FS type
18225520. Not supported.

I did a little poking around and found that OCFS2 has the same issue,
but with OCFS2 it can be circumvented by mounting with -o datavolume...
I was unable to find any similar options for GFS mounts.  This looks
like probably more of an Oracle bug, as 10G-R1 installed without any
problems (I have my DBA pursuing the Oracle route), but I was wondering
if anyone else has come across this problem and if so, was there any
fix?

Thanks,
-steve


From Bowie_Bailey at BUC.com  Thu Oct 13 19:01:57 2005
From: Bowie_Bailey at BUC.com (Bowie Bailey)
Date: Thu, 13 Oct 2005 15:01:57 -0400
Subject: [Linux-cluster] GFS + DLM howto?
Message-ID: <4766EEE585A6D311ADF500E018C154E302133251@bnifex.cis.buc.com>

I seem to be missing something.

I have been able to configure cluster.conf and get ccsd and clvmd
running, but it fails when I try to initialize the physical volume.

    # pvcreate /dev/etherd/e1.0
      Device /dev/etherd/e1.0 not found.

The device is definitely there and I can ping it with the aoeping
command.

    # ll /dev/etherd/
    total 4
    -rw-r--r--  1 root root        1 Oct 13 14:33 discover
    brw-------  1 root disk 152, 256 Oct 12 17:07 e1.0

All the modules seem to be loaded.  (irrelevant modules removed from
the list)

    # lsmod
    Module                  Size  Used by
    aoe                    26816  0
    lock_dlm               43740  0
    dlm                   113092  4 lock_dlm
    gfs                   280920  0
    lock_harness            8992  2 lock_dlm,gfs
    cman                  122720  10 lock_dlm,dlm
    dm_mod                 58949  0

Any suggestions?

Bowie


From joshua at emailscout.net  Sun Oct  2 11:03:02 2005
From: joshua at emailscout.net (Joshua Mouch)
Date: Sun, 2 Oct 2005 07:03:02 -0400
Subject: [Linux-cluster] ddraid production release
Message-ID: <4enjrj$1gendbr@mxip18a.cluster1.charter.net>

I know ddraid is still in its infancy, but do you have an approximate
release date in mind?  This year. next year. year after?

 
Joshua Mouch

 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051002/082445bc/attachment.htm>

From joshua at emailscout.net  Mon Oct  3 17:23:59 2005
From: joshua at emailscout.net (Joshua Mouch)
Date: Mon, 3 Oct 2005 13:23:59 -0400
Subject: [Linux-cluster] Fedora + GFS & No GNBD scripts??
Message-ID: <4enk62$8thovc@mxip28a.cluster1.charter.net>

Hello,

I've got GFS set up (almost) perfectly after a few days of following several
HOWTOs and the RedHat manual.

However, after a reboot, gnbd_serv isn't loaded, nor does it ever get loaded
until I do it manually after boot (after disabling fenced so the system will
boot because fenced waits forever while trying to communicate with the
non-existant gnbd_serv).

So, the first issue is: why doesn't Fedora provide a way to load gnbd_serv
and the module gnbd and gnbd_client on boot (e.g. /etc/init.d/gnbd)?

The second issue is: the devices that I export using gnbd_export aren't
remembered between boots. Each time a reboot, I need to re-export like this:

gnbd_export -d /dev/VolGroup00/LogVolStorage -e server_storage

I did quite a bit of googling on this and found that Gentoo handles this all
by providing a /etc/gnbdtab, /etc/init.d/gndb_serv, and
/etc/init.d/gnbd_client. The gndb exports & imports are stored in the first
file.

So what's going on? Do I need to copy Gentoo's way of doing it, or is there
a Fedora way that didn't get installed for some reason?

 
Joshua Mouch

 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051003/cb9b9826/attachment.htm>

From sdake at mvista.com  Mon Oct  3 20:12:59 2005
From: sdake at mvista.com (Steven Dake)
Date: Mon, 03 Oct 2005 13:12:59 -0700
Subject: [Openais] Re: [Linux-cluster] new userland cman
In-Reply-To: <1128360058.27430.99.camel@ayanami.boston.redhat.com>
References: <433D4134.6080608@redhat.com>
	<1128109200.8440.14.camel@unnamed.az.mvista.com>
	<4340D983.7080106@redhat.com>
	<1128360058.27430.99.camel@ayanami.boston.redhat.com>
Message-ID: <1128370379.30850.3.camel@unnamed.az.mvista.com>

On Mon, 2005-10-03 at 13:20 -0400, Lon Hohberger wrote:
> On Mon, 2005-10-03 at 08:10 +0100, Patrick Caulfield wrote:
> 
> > >>neutral
> > >>-------
> > >>- Always uses multicast (no broadcast). A default multicast address is supplied
> > >>if none is given
> > > 
> > > 
> > > If broadcast is important, which I guess it may be, we can pretty easily
> > > add this support...
> > > 
> > 
> > I was going to look into this but I doubt its really worth it. It's just any
> > extra complication and will only apply to IPv4 anyway.
> 
> I think broadcast is quite important, actually - although I also think
> that it should *not* be the default.
> 
> Multicast doesn't always work very well (in practice) on existing
> networks, and works poorly (if at all) over things like crossover
> ethernet cables and hub-based private networks.  You know, the cheap
> stuff hackers use in their houses to play with cluster software ;)
> 

I have tested the multicast with both crossover point to point as well
as hub networks.  Actually the way the protocol works, switches are not
even necessary.  There are very few (less then 1%) link collisions with
a hub network even at 90% network load.

> Broadcast is far more likely to work out of the box in the above cases,
> and isn't hard to implement (... actually, it's easier than multicast).
> 

Adding this should just be a few lines of code.  I'll see if I can work
out a patch today.

Regards
-steve

> Also, IPv6 isn't what I'd call "mainstream" just yet, so supporting all
> the hacks we can with IPv4 isn't necessarily a bad thing ;)
> 
> -- Lon
> 


From tom-fedora at kofler.eu.org  Fri Oct  7 18:30:16 2005
From: tom-fedora at kofler.eu.org (Thomas Kofler)
Date: Fri,  7 Oct 2005 20:30:16 +0200
Subject: [Linux-cluster] Additional node "Cluster membership rejected"
Message-ID: <1128709816.4346beb85a069@mail.devcon.cc>

Hi,

we are running a 4 node cluster successfully. Now we try to join an additional 
node - but it fails.

We upgraded the cluster.conf file to reflect the new node.

[root at www5 ~]# ccs_tool update /etc/cluster/cluster.conf
Config file updated from version 10 to 11
Update complete.

cluster.conf was checked and is synchron on all nodes, hosts files are also 
fine.

When we try to join the new node gfsserver2
[root at gfsserver2 cluster]# cman_tool join

we get 

[root at gfsserver2 cluster]# CMAN: Cluster membership rejected

And the interesting part is:


Oct  7 13:51:59 gfsserver ccsd[415]: Update of cluster.conf complete (version 
10 -> 11).
Oct  7 13:52:00 gfsserver kernel: CMAN: Join request from gfsserver2.devcon.cc 
rejected, config version local 10 remote 11

Why do the 4 existing nodes not check, that they also have version 11 in use ?

Or do we have to "reload" anything additionally to the ccs_tool update 
command ?

Thanks in advance,
Regards,
Thomas

Oct  7 13:51:34 gfsserver2 kernel: GFS 2.6.11.8-20050601.152643.FC4.9 (built 
Jul 18 2005 10:42:24) installed
Oct  7 13:51:39 gfsserver2 kernel: CMAN 2.6.11.5-20050601.152643.FC4.9 (built 
Jul 18 2005 10:27:35) installed
Oct  7 13:51:39 gfsserver2 kernel: NET: Registered protocol family 30
Oct  7 13:51:39 gfsserver2 kernel: DLM 2.6.11.5-20050601.152643.FC4.10 (built 
Jul 18 2005 10:34:42) installed
Oct  7 13:51:39 gfsserver2 kernel: Lock_DLM (built Jul 18 2005 10:42:18) 
installed
Oct  7 13:51:49 gfsserver2 ccsd[839]: Starting ccsd 1.0.0:
Oct  7 13:51:49 gfsserver2 ccsd[839]:  Built: Jun 16 2005 10:45:39
Oct  7 13:51:49 gfsserver2 ccsd[839]:  Copyright (C) Red Hat, Inc.  2004  All 
rights reserved.
Oct  7 13:51:49 gfsserver2 ccsd[839]:   IP Protocol:: IPv4 only
Oct  7 13:51:59 gfsserver2 ccsd[839]: cluster.conf (cluster name = 
devconcluster, version = 11) found.
Oct  7 13:51:59 gfsserver2 ccsd[839]: Remote copy of cluster.conf is from 
quorate node.
Oct  7 13:51:59 gfsserver2 ccsd[839]:  Local version # : 11
Oct  7 13:51:59 gfsserver2 ccsd[839]:  Remote version #: 11
Oct  7 13:51:59 gfsserver2 kernel: CMAN: Waiting to join or form a Linux-
cluster
Oct  7 13:51:59 gfsserver2 ccsd[839]: Connected to cluster infrastruture via: 
CMAN/SM Plugin v1.1.2
Oct  7 13:51:59 gfsserver2 ccsd[839]: Initial status:: Inquorate
Oct  7 13:52:00 gfsserver2 kernel: CMAN: sending membership request
Oct  7 13:52:00 gfsserver2 kernel: CMAN: Cluster membership rejected
Oct  7 13:52:00 gfsserver2 ccsd[839]: Cluster manager shutdown.  Attemping to 
reconnect...
Oct  7 13:52:20 gfsserver2 ccsd[839]: Unable to connect to cluster 
infrastructure after 30 seconds.
Oct  7 13:52:50 gfsserver2 ccsd[839]: Unable to connect to cluster 
infrastructure after 60 seconds.
Oct  7 13:53:20 gfsserver2 ccsd[839]: Unable to connect to cluster 
infrastructure after 90 seconds.
Oct  7 13:53:51 gfsserver2 ccsd[839]: Unable to connect to cluster 
infrastructure after 120 seconds.
Oct  7 13:53:58 gfsserver2 ccsd[839]: Remote copy of cluster.conf is from 
quorate node.

<?xml version="1.0"?>
<cluster name="cluster" config_version="11">
<cman expected_votes="3">
 </cman>
    <clusternodes>
    <clusternode name="www5.domainxy.cc" domain="fedora0" votes="1">
     <fence>
      <method name="single">
       <device name="human" ipaddr="192.xxx.2.25"/>
     </method>
    </fence>
   </clusternode>
   <clusternode name="www3.domainxy.cc" domain="fedora1" votes="1">
    <fence>
     <method name="single">
       <device name="human" ipaddr="192.xxx.2.23"/>
     </method>
    </fence>
    </clusternode>
   <clusternode name="www4.domainxy.cc" domain="fedora2" votes="1">
    <fence>
     <method name="single">
       <device name="human" ipaddr="192.xxx.2.24"/>
     </method>
    </fence>
  </clusternode>
  <clusternode name="gfsserver.domainxy.cc" domain="fedora3" votes="1">
    <fence>
     <method name="single">
       <device name="human" ipaddr="192.xxx.2.33"/>
     </method>
    </fence>
  </clusternode>
 <clusternode name="gfsserver2.domainxy.cc" domain="fedora4" votes="1">
    <fence>
     <method name="single">
       <device name="human" ipaddr="192.xxx.2.34"/>
     </method>
    </fence>
  </clusternode>
 </clusternodes>
<fence_devices>
 <fence_device name="human" agent="fence_manual"/>
</fence_devices>
</cluster>


From tom-fedora at kofler.eu.org  Sat Oct  8 07:38:15 2005
From: tom-fedora at kofler.eu.org (Thomas Kofler)
Date: Sat,  8 Oct 2005 09:38:15 +0200
Subject: [Linux-cluster] Additional node "Cluster membership rejected"
Message-ID: <1128757095.43477767a7bcc@mail.devcon.cc>

Hi,

we are running a 4 node cluster successfully. Now we try to join an additional 
node - but it fails.

We upgraded the cluster.conf file to reflect the new node.

[root at www5 ~]# ccs_tool update /etc/cluster/cluster.conf
Config file updated from version 10 to 11
Update complete.

cluster.conf was checked and is synchron on all nodes, hosts files are also 
fine.

When we try to join the new node gfsserver2
[root at gfsserver2 cluster]# cman_tool join

we get 

[root at gfsserver2 cluster]# CMAN: Cluster membership rejected

And the interesting part is:


Oct  7 13:51:59 gfsserver ccsd[415]: Update of cluster.conf complete (version 
10 -> 11).
Oct  7 13:52:00 gfsserver kernel: CMAN: Join request from gfsserver2.devcon.cc 
rejected, config version local 10 remote 11

Why do the 4 existing nodes not check, that they also have version 11 in use ?

Or do we have to "reload" anything additionally to the ccs_tool update 
command ?

Thanks in advance,
Regards,
Thomas

Oct  7 13:51:34 gfsserver2 kernel: GFS 2.6.11.8-20050601.152643.FC4.9 (built 
Jul 18 2005 10:42:24) installed
Oct  7 13:51:39 gfsserver2 kernel: CMAN 2.6.11.5-20050601.152643.FC4.9 (built 
Jul 18 2005 10:27:35) installed
Oct  7 13:51:39 gfsserver2 kernel: NET: Registered protocol family 30
Oct  7 13:51:39 gfsserver2 kernel: DLM 2.6.11.5-20050601.152643.FC4.10 (built 
Jul 18 2005 10:34:42) installed
Oct  7 13:51:39 gfsserver2 kernel: Lock_DLM (built Jul 18 2005 10:42:18) 
installed
Oct  7 13:51:49 gfsserver2 ccsd[839]: Starting ccsd 1.0.0:
Oct  7 13:51:49 gfsserver2 ccsd[839]:  Built: Jun 16 2005 10:45:39
Oct  7 13:51:49 gfsserver2 ccsd[839]:  Copyright (C) Red Hat, Inc.  2004  All 
rights reserved.
Oct  7 13:51:49 gfsserver2 ccsd[839]:   IP Protocol:: IPv4 only
Oct  7 13:51:59 gfsserver2 ccsd[839]: cluster.conf (cluster name = 
devconcluster, version = 11) found.
Oct  7 13:51:59 gfsserver2 ccsd[839]: Remote copy of cluster.conf is from 
quorate node.
Oct  7 13:51:59 gfsserver2 ccsd[839]:  Local version # : 11
Oct  7 13:51:59 gfsserver2 ccsd[839]:  Remote version #: 11
Oct  7 13:51:59 gfsserver2 kernel: CMAN: Waiting to join or form a Linux-
cluster
Oct  7 13:51:59 gfsserver2 ccsd[839]: Connected to cluster infrastruture via: 
CMAN/SM Plugin v1.1.2
Oct  7 13:51:59 gfsserver2 ccsd[839]: Initial status:: Inquorate
Oct  7 13:52:00 gfsserver2 kernel: CMAN: sending membership request
Oct  7 13:52:00 gfsserver2 kernel: CMAN: Cluster membership rejected
Oct  7 13:52:00 gfsserver2 ccsd[839]: Cluster manager shutdown.  Attemping to 
reconnect...
Oct  7 13:52:20 gfsserver2 ccsd[839]: Unable to connect to cluster 
infrastructure after 30 seconds.
Oct  7 13:52:50 gfsserver2 ccsd[839]: Unable to connect to cluster 
infrastructure after 60 seconds.
Oct  7 13:53:20 gfsserver2 ccsd[839]: Unable to connect to cluster 
infrastructure after 90 seconds.
Oct  7 13:53:51 gfsserver2 ccsd[839]: Unable to connect to cluster 
infrastructure after 120 seconds.
Oct  7 13:53:58 gfsserver2 ccsd[839]: Remote copy of cluster.conf is from 
quorate node.

<?xml version="1.0"?>
<cluster name="cluster" config_version="11">
<cman expected_votes="3">
 </cman>
    <clusternodes>
    <clusternode name="www5.domainxy.cc" domain="fedora0" votes="1">
     <fence>
      <method name="single">
       <device name="human" ipaddr="192.xxx.2.25"/>
     </method>
    </fence>
   </clusternode>
   <clusternode name="www3.domainxy.cc" domain="fedora1" votes="1">
    <fence>
     <method name="single">
       <device name="human" ipaddr="192.xxx.2.23"/>
     </method>
    </fence>
    </clusternode>
   <clusternode name="www4.domainxy.cc" domain="fedora2" votes="1">
    <fence>
     <method name="single">
       <device name="human" ipaddr="192.xxx.2.24"/>
     </method>
    </fence>
  </clusternode>
  <clusternode name="gfsserver.domainxy.cc" domain="fedora3" votes="1">
    <fence>
     <method name="single">
       <device name="human" ipaddr="192.xxx.2.33"/>
     </method>
    </fence>
  </clusternode>
 <clusternode name="gfsserver2.domainxy.cc" domain="fedora4" votes="1">
    <fence>
     <method name="single">
       <device name="human" ipaddr="192.xxx.2.34"/>
     </method>
    </fence>
  </clusternode>
 </clusternodes>
<fence_devices>
 <fence_device name="human" agent="fence_manual"/>
</fence_devices>
</cluster>


From Bowie_Bailey at BUC.com  Fri Oct 14 15:56:12 2005
From: Bowie_Bailey at BUC.com (Bowie Bailey)
Date: Fri, 14 Oct 2005 11:56:12 -0400
Subject: [Linux-cluster] Fencing?
Message-ID: <4766EEE585A6D311ADF500E018C154E30213325C@bnifex.cis.buc.com>

I'm a bit unclear on the concept of fencing.  Can anyone point me to a good
overview of what it does and how it works?

Bowie


From adam at popik.pl  Fri Oct 14 13:19:33 2005
From: adam at popik.pl (Adam Popik)
Date: Fri, 14 Oct 2005 15:19:33 +0200
Subject: [Linux-cluster] iscsi and RHGFS for RHEL4
Message-ID: <434FB065.5050109@popik.pl>

Hi,
I have questions about rhgfs for rhel4 in documentation is :
"... multipath gnbd and iSCSI are not available with this release ..."
what that mean :
gfs not supported with iscsi or not supported on gnbd with iscsi ?

PS Sorry for broken English..
Adam
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 4179 bytes
Desc: S/MIME Cryptographic Signature
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051014/76d6686c/attachment.bin>

From pcaulfie at redhat.com  Fri Oct 14 07:09:56 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Fri, 14 Oct 2005 08:09:56 +0100
Subject: [Linux-cluster] GFS + DLM howto?
In-Reply-To: <4766EEE585A6D311ADF500E018C154E302133251@bnifex.cis.buc.com>
References: <4766EEE585A6D311ADF500E018C154E302133251@bnifex.cis.buc.com>
Message-ID: <434F59C4.8060802@redhat.com>

Bowie Bailey wrote:
> I seem to be missing something.
> 
> I have been able to configure cluster.conf and get ccsd and clvmd
> running, but it fails when I try to initialize the physical volume.
> 
>     # pvcreate /dev/etherd/e1.0
>       Device /dev/etherd/e1.0 not found.
> 
> The device is definitely there and I can ping it with the aoeping
> command.
> 
>     # ll /dev/etherd/
>     total 4
>     -rw-r--r--  1 root root        1 Oct 13 14:33 discover
>     brw-------  1 root disk 152, 256 Oct 12 17:07 e1.0
> 

You might need to add the device to the 'devices' section of /etc/lvm/lvm.conf, eg:

   types = [ "etherd", 16 ]

The name (I've used etherd here as a guess) is the device name that appears in
/proc/partitions. The number (16) is the maximum number of partitions per device.

-- 

patrick


From alan.gagne at comcast.net  Thu Oct 13 21:59:39 2005
From: alan.gagne at comcast.net (Alan Gagne)
Date: Thu, 13 Oct 2005 17:59:39 -0400
Subject: [Linux-cluster] Oracle 10G-R2 on GFS install problems
Message-ID: <003e01c5d041$69e23050$8432d70a@panhead>

GFS is not a certified file system option for Oracle 10gR2 rac.
There is some limited support for running on GFS though most
of the information I have found is for 9i. You can set this up like
I have currently. Place the Oracle clusterware voting and cluster registry files on raw devices.
You can then create the database on gfs.

Alan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051013/1fd2deb7/attachment.htm>

From Bowie_Bailey at BUC.com  Fri Oct 14 17:39:19 2005
From: Bowie_Bailey at BUC.com (Bowie Bailey)
Date: Fri, 14 Oct 2005 13:39:19 -0400
Subject: [Linux-cluster] GFS + DLM howto?
Message-ID: <4766EEE585A6D311ADF500E018C154E302133260@bnifex.cis.buc.com>

From: Patrick Caulfield [mailto:pcaulfie at redhat.com]
> 
> Bowie Bailey wrote:
> > I seem to be missing something.
> > 
> > I have been able to configure cluster.conf and get ccsd and clvmd
> > running, but it fails when I try to initialize the physical volume.
> > 
> >     # pvcreate /dev/etherd/e1.0
> >       Device /dev/etherd/e1.0 not found.
> > 
> > The device is definitely there and I can ping it with the aoeping
> > command.
> > 
> >     # ll /dev/etherd/
> >     total 4
> >     -rw-r--r--  1 root root        1 Oct 13 14:33 discover
> >     brw-------  1 root disk 152, 256 Oct 12 17:07 e1.0
> > 
> 
> You might need to add the device to the 'devices' section of 
> /etc/lvm/lvm.conf, eg:
> 
>    types = [ "etherd", 16 ]
> 
> The name (I've used etherd here as a guess) is the device name that
> appears in /proc/partitions. The number (16) is the maximum number
> of partitions per device.

I found the answer to this question on Coraid's site soon after I
asked.

I added this to the "devices" section of lvm.conf:

    types = [ "aoe", 16 ]

After that, everything worked perfectly!

I've now got an operational setup with a single node.  Now I've just
got to see if I can get the other nodes configured.

Thanks for the help (and patience)!

Bowie


From kanderso at redhat.com  Fri Oct 14 17:49:25 2005
From: kanderso at redhat.com (Kevin Anderson)
Date: Fri, 14 Oct 2005 12:49:25 -0500
Subject: [Linux-cluster] iscsi and RHGFS for RHEL4
In-Reply-To: <434FB065.5050109@popik.pl>
References: <434FB065.5050109@popik.pl>
Message-ID: <1129312165.3526.52.camel@dhcp80-225.msp.redhat.com>

On Fri, 2005-10-14 at 15:19 +0200, Adam Popik wrote:
> Hi,
> I have questions about rhgfs for rhel4 in documentation is :
> "... multipath gnbd and iSCSI are not available with this release ..."
> what that mean :
> gfs not supported with iscsi or not supported on gnbd with iscsi ?

RHEL4 didn't support iSCSI until the recent RHEL4 U2 release.  GFS works
fine with iSCSI, is supported and is used quite extensively by the
development team.

The multipath gnbd refers to the use of a gnbd device under the device
mapper multipath module.  The device mapper multipath code currently
assumes that the devices are pure SCSI devices and submits a scsi
command that GNBD currently doesn't provide.  So, the GNBD devices
aren't recognized by the multipath module as valid devices.

So, two separate items, one outdated, the other in progress.

Hope this helps
Kevin  


From spwilcox at att.com  Fri Oct 14 18:48:48 2005
From: spwilcox at att.com (Steve Wilcox)
Date: Fri, 14 Oct 2005 14:48:48 -0400
Subject: [Linux-cluster] Oracle 10G-R2 on GFS install problems
In-Reply-To: <003e01c5d041$69e23050$8432d70a@panhead>
References: <003e01c5d041$69e23050$8432d70a@panhead>
Message-ID: <1129315728.4443.10.camel@aptis101.cqtel.com>

I was afraid of that.  Interesting that Oracle would make such a change
between R1 and R2, but I guess clusterware underwent a fairly extensive
re-write.

Thanks for the info.

-steve

On Thu, 2005-10-13 at 17:59 -0400, Alan Gagne wrote:
> GFS is not a certified file system option for Oracle 10gR2 rac.
> There is some limited support for running on GFS though most
> of the information I have found is for 9i. You can set this up like
> I have currently. Place the Oracle clusterware voting and cluster
> registry files on raw devices.
> You can then create the database on gfs.
>  
> Alan
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From adam at popik.pl  Fri Oct 14 18:59:53 2005
From: adam at popik.pl (Adam Popik)
Date: Fri, 14 Oct 2005 20:59:53 +0200
Subject: [Linux-cluster] iscsi and RHGFS for RHEL4
In-Reply-To: <1129312165.3526.52.camel@dhcp80-225.msp.redhat.com>
References: <434FB065.5050109@popik.pl>
	<1129312165.3526.52.camel@dhcp80-225.msp.redhat.com>
Message-ID: <43500029.2030507@popik.pl>

Kevin Anderson wrote:
> On Fri, 2005-10-14 at 15:19 +0200, Adam Popik wrote:
> 
>>Hi,
>>I have questions about rhgfs for rhel4 in documentation is :
>>"... multipath gnbd and iSCSI are not available with this release ..."
>>what that mean :
>>gfs not supported with iscsi or not supported on gnbd with iscsi ?
> 
> 
> RHEL4 didn't support iSCSI until the recent RHEL4 U2 release.  GFS works
> fine with iSCSI, is supported and is used quite extensively by the
> development team.
> 
> The multipath gnbd refers to the use of a gnbd device under the device
> mapper multipath module.  The device mapper multipath code currently
> assumes that the devices are pure SCSI devices and submits a scsi
> command that GNBD currently doesn't provide.  So, the GNBD devices
> aren't recognized by the multipath module as valid devices.
> 
> So, two separate items, one outdated, the other in progress.
> 
> Hope this helps
> Kevin  
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
I now work with rhel3 with gfs and FC and that work fine, but new 
project have no a lot of money that maybe combination with iscsi should 
be a good way (gfs will use for home directories rhel's WS and use for 
working with fluent - big files).
Thanks for help
Adam
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 4179 bytes
Desc: S/MIME Cryptographic Signature
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051014/239c10be/attachment.bin>

From ciril at hclinsys.com  Sat Oct 15 03:36:49 2005
From: ciril at hclinsys.com (CIRIL IGNATIOUS T)
Date: Sat, 15 Oct 2005 09:06:49 +0530
Subject: [Linux-cluster] Active/Active oracle 10g database with Redhat
	Cluster Suite.
Message-ID: <43507951.5060309@hclinsys.com>

Dear All

Is it possible to configure Active/Active Cluster of Oracle 10g Database 
with Redhat Cluster Suite.
Please indicate if any useful links.


Ciril


From omer at faruk.net  Mon Oct 17 06:16:42 2005
From: omer at faruk.net (Omer Faruk Sen)
Date: Mon, 17 Oct 2005 09:16:42 +0300 (EEST)
Subject: [Linux-cluster] Fencing?
In-Reply-To: <4766EEE585A6D311ADF500E018C154E30213325C@bnifex.cis.buc.com>
References: <4766EEE585A6D311ADF500E018C154E30213325C@bnifex.cis.buc.com>
Message-ID: <53866.193.140.74.2.1129529802.squirrel@193.140.74.2>

I don't know very much but what I understand from fencing is forcefully
disabling a node that is not reachable by cluster to prevent this dead
node accidently (maybe the node wasn't dead and will try to write
something to shared storage which can cause catastrophic damage if GFS is
not used) write something to file system. It does this using power
switches or other methods such as IPMI or ILO .(I heard there was a new
module for fencing that uses vmware )

Thus I think this fencing conecpt is the same as STONITH in linux-ha.org
which means Shoot The Other Node In The Head(Heart)....

If I am mistaken someone please correct me.
> I'm a bit unclear on the concept of fencing.  Can anyone point me to a
> good
> overview of what it does and how it works?
>
> Bowie
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


-- 
Omer Faruk Sen
http://www.faruk.net


From lhh at redhat.com  Mon Oct 17 22:24:24 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 17 Oct 2005 18:24:24 -0400
Subject: [Linux-cluster] Fencing?
In-Reply-To: <53866.193.140.74.2.1129529802.squirrel@193.140.74.2>
References: <4766EEE585A6D311ADF500E018C154E30213325C@bnifex.cis.buc.com>
	<53866.193.140.74.2.1129529802.squirrel@193.140.74.2>
Message-ID: <1129587864.10298.28.camel@ayanami.boston.redhat.com>

On Mon, 2005-10-17 at 09:16 +0300, Omer Faruk Sen wrote:

> (maybe the node wasn't dead and will try to write
> something to shared storage which can cause catastrophic damage if GFS is
> not used) write something to file system.

Correct, except it causes catastrophic damage in any case, regardless of
whether or not GFS is used.  GFS requires fencing in order to operate.

> It does this using power
> switches or other methods such as IPMI or ILO .(I heard there was a new
> module for fencing that uses vmware )

GFS can use fabric-level fencing - that is, you can tell the iSCSI
server to cut a node off, or ask the fiber-channel switch to disable a
port.  This is in addition to "power-cycle" fencing.

> Thus I think this fencing conecpt is the same as STONITH in linux-ha.org
> which means Shoot The Other Node In The Head(Heart)....

STONITH, STOMITH, etc. are indeed implementations of I/O fencing.

Fencing is the act of forcefully preventing a node from being able to
access resources after that node has been evicted from the cluster in an
attempt to avoid corruption.

The canonical example of when it is needed is the live-hang scenario, as
you described:

1. node A hangs with I/Os pending to a shared file system
2. node B and node C decide that node A is dead and recover resources
allocated on node A (including the shared file system)
3. node A resumes normal operation
4. node A completes I/Os to shared file system

At this point, the shared file system is probably corrupt.  If you're
lucky, fsck will fix it -- if you're not, you'll need to restore from
backup.  I/O fencing (STONITH, or whatever we want to call it) prevents
the last step (step 4) from happening.

How fencing is done (power cycling via external switch, SCSI
reservations, FC zoning, integrated methods like IPMI, iLO, manual
intervention, etc.) is unimportant - so long as whatever method is used
can guarantee that step 4 can not complete.

-- Lon


From dawson at fnal.gov  Tue Oct 18 14:20:14 2005
From: dawson at fnal.gov (Troy Dawson)
Date: Tue, 18 Oct 2005 09:20:14 -0500
Subject: [Linux-cluster] write's pausing - which tools to debug?
Message-ID: <4355049E.4060606@fnal.gov>

Hi,
We've been having some problems with doing a write's to our GFS file 
system, and it will pause, for long periods.  (Like from 5 to 10 
seconds, to 30 seconds, and occasially 5 minutes)  After the pause, it's 
like nothing happened, whatever the process is, just keeps going happy 
as can be.
Except for these pauses, our GFS is quite zippy, both reads and writes. 
  But these pauses are holding us back from going full production.
I need to know what tools I should use to figure out what is causing 
these pauses.

Here is the setup.
-------------------
All machines: RHEL 4 update 1 (ok, actually S.L. 4.1), kernel 
2.6.9-11.ELsmp, GFS 6.1.0, ccs 1.0.0, gulm 1.0.0, rgmanager 1.9.34

I have no ability to do fencing yet, so I chose to use the gulm locking 
mechanism.  I have it setup so that there are 3 lock servers, for 
failover.  I have tested the failover, and it works quite well.

I have 5 machines in the cluster.  1 isn't connected to the SAN, or 
using GFS.  It is just a failover gulm lock server incase the other two 
lock servers go down.

So I have 4 machines connected to our SAN and using GFS.  3 are 
read-only, 1 is read-write.  If it is important, the 3 read-only are 
x86_64, the 1 read-write and the 1 not connected are i386.

The read/write machine is our master lock server.  Then one of the 
read-only is a fallback lock server, as is the machine not using GFS.
----------------

Anyway, we're getting these pauses when writting, and I'm having a hard 
time tracking down where the problem is.  I *think* that we can still 
read from the other machines.  But since this comes and goes, I haven't 
been able to verify that.

Anyway, which tools do you think would be best in diagnosing this?

Many Thanks
Troy Dawson
-- 
__________________________________________________
Troy Dawson  dawson at fnal.gov  (630)840-6468
Fermilab  ComputingDivision/CSS  CSI Group
__________________________________________________


From haiwu.us at gmail.com  Tue Oct 18 23:30:26 2005
From: haiwu.us at gmail.com (hai wu)
Date: Tue, 18 Oct 2005 18:30:26 -0500
Subject: [Linux-cluster] GFS cluster and Dell DRAC
Message-ID: <eb7f3f8f0510181630k2cb50cf3s91dc53e350d4e123@mail.gmail.com>

We are mainly using Dell PowerEdge servers. I know Dell DRAC port was
mentioned in GFS document. But I don't know how Dell DRAC would be
configured in order to get it to work for GFS (power reset).
 Can someone explain about its usage for GFS?
 Thanks,
Hai
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051018/9333799d/attachment.htm>

From suran007 at coolgames.com.cn  Wed Oct 19 08:09:41 2005
From: suran007 at coolgames.com.cn (=?gb2312?B?c3VyYW4wMDc=?=)
Date: Wed, 19 Oct 2005 16:09:41 +0800
Subject: [Linux-cluster] GFS mount hang
Message-ID: <20051019080941.4961.qmail@mail.test.com>

my system is redhat AS 3 UPDATE3,the kernel is linux-2.4.21-27.0.4.Elsmp,the gfs version is 

GFS-6.0.2-26,our gfs-cluster operation serveral days well,(suddenly),the cluster is dead,our 

gfs-cluster is one gnbd server and six gfs nodes,currently,my problem is gfs services (such 

as ccsd,lock_gulmd) of our six gfs nodes is already started well,but only node02,05,06 can 

mount the gfs pool,when the node03,04 mount the gfs pool will hang ,I use the gulm_tool 

nodelist node01 to see the lock stat,the result is well ,I can\'t see any problem .
who can help me ,my msn is suran007 at hotmail.com, I hope someone can help me ,thanks~~

----

iGENUS is a free webmail interface, No fee, Free download
---------------------------------------------------------
please visit http://www.igenus.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051019/a0fa81e8/attachment.htm>

From suran007 at coolgames.com.cn  Wed Oct 19 07:59:37 2005
From: suran007 at coolgames.com.cn (=?gb2312?B?c3VyYW4wMDc=?=)
Date: Wed, 19 Oct 2005 15:59:37 +0800
Subject: [Linux-cluster] GFS mount hang
Message-ID: <20051019075937.4864.qmail@mail.test.com>

my system is redhat AS 3 UPDATE3,the kernel is linux-2.4.21-27.0.4.Elsmp,the gfs version is GFS-6.0.2-26,our gfs-cluster operation serveral days well,(suddenly),the cluster is dead,our gfs-cluster is one gnbd server and six gfs nodes,currently,my problem is gfs services (such as ccsd,lock_gulmd) of our six gfs nodes is already started well,but only node02,05,06 can mount the gfs pool,when the node03,04 mount the gfs pool will hang ,I use the gulm_tool nodelist node01 to see the lock stat,the result is well ,I can\'t see any problem .
who can help me ,my msn is suran007 at hotmail.com, I hope someone can help me ,thanks~~

----

iGENUS is a free webmail interface, No fee, Free download
---------------------------------------------------------
please visit http://www.igenus.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051019/fd59ed5e/attachment.htm>

From suran007 at coolgames.com.cn  Wed Oct 19 08:05:05 2005
From: suran007 at coolgames.com.cn (=?gb2312?B?c3VyYW4wMDc=?=)
Date: Wed, 19 Oct 2005 16:05:05 +0800
Subject: [Linux-cluster] GFS mount hang
Message-ID: <20051019080505.4921.qmail@mail.test.com>

my system is redhat AS 3 UPDATE3,the kernel is linux-2.4.21-27.0.4.Elsmp,the gfs version is GFS-6.0.2-26,our gfs-cluster operation serveral days well,(suddenly),the cluster is dead,our gfs-cluster is one gnbd server and six gfs nodes,currently,my problem is gfs services (such as ccsd,lock_gulmd) of our six gfs nodes is already started well,but only node02,05,06 can mount the gfs pool,when the node03,04 mount the gfs pool will hang ,I use the gulm_tool nodelist node01 to see the lock stat,the result is well ,I can\'t see any problem .
who can help me ,my msn is suran007 at hotmail.com, I hope someone can help me ,thanks~~

----

iGENUS is a free webmail interface, No fee, Free download
---------------------------------------------------------
please visit http://www.igenus.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051019/58eb6ad7/attachment.htm>

From suran007 at coolgames.com.cn  Wed Oct 19 07:42:43 2005
From: suran007 at coolgames.com.cn (=?gb2312?B?c3VyYW4wMDc=?=)
Date: Wed, 19 Oct 2005 15:42:43 +0800
Subject: [Linux-cluster] TEST
Message-ID: <20051019074243.4621.qmail@mail.test.com>

MY

----

iGENUS is a free webmail interface, No fee, Free download
---------------------------------------------------------
please visit http://www.igenus.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051019/d261a1ae/attachment.htm>

From Axel.Thimm at ATrpms.net  Wed Oct 19 10:48:16 2005
From: Axel.Thimm at ATrpms.net (Axel Thimm)
Date: Wed, 19 Oct 2005 12:48:16 +0200
Subject: [Linux-cluster] Re: write's pausing - which tools to debug?
In-Reply-To: <4355049E.4060606@fnal.gov>
References: <4355049E.4060606@fnal.gov>
Message-ID: <20051019104816.GD31027@neu.nirvana>

Hi,

On Tue, Oct 18, 2005 at 09:20:14AM -0500, Troy Dawson wrote:
> We've been having some problems with doing a write's to our GFS file 
> system, and it will pause, for long periods.  (Like from 5 to 10 
> seconds, to 30 seconds, and occasially 5 minutes)  After the pause, it's 
> like nothing happened, whatever the process is, just keeps going happy 
> as can be.
> Except for these pauses, our GFS is quite zippy, both reads and writes. 
>  But these pauses are holding us back from going full production.
> I need to know what tools I should use to figure out what is causing 
> these pauses.
> 
> Here is the setup.
> All machines: RHEL 4 update 1 (ok, actually S.L. 4.1), kernel 
> 2.6.9-11.ELsmp, GFS 6.1.0, ccs 1.0.0, gulm 1.0.0, rgmanager 1.9.34
> 
> I have no ability to do fencing yet, so I chose to use the gulm locking 
> mechanism.  I have it setup so that there are 3 lock servers, for 
> failover.  I have tested the failover, and it works quite well.

If this is a testing environment use manual fencing. E.g. if a node
needs to get fenced you get a log message saying that you should do
that and acknowledge that.

> I have 5 machines in the cluster.  1 isn't connected to the SAN, or 
> using GFS.  It is just a failover gulm lock server incase the other two 
> lock servers go down.
> 
> So I have 4 machines connected to our SAN and using GFS.  3 are 
> read-only, 1 is read-write.  If it is important, the 3 read-only are 
> x86_64, the 1 read-write and the 1 not connected are i386.
> 
> The read/write machine is our master lock server.  Then one of the 
> read-only is a fallback lock server, as is the machine not using GFS.
> 
> Anyway, we're getting these pauses when writting, and I'm having a hard 
> time tracking down where the problem is.  I *think* that we can still 
> read from the other machines.  But since this comes and goes, I haven't 
> been able to verify that.

What SAN hardware is attached to the nodes?

> Anyway, which tools do you think would be best in diagnosing this?

I'd suggest to check/monitor networking. Also place the cluster
communication on a separate network that the SAN/LAN network. The
cluster heartbeat goes over UDP and a congested network may delay
these packages or drop the completely. At least that's the CMAN
picture, lock_gulm may be different.

Also don't mix RHELU1 and U2 or FC<N>. Just in case you'd like to
upgrade to SL4.2 one by one.

There have been many changes/bug fixes to the cluster bits in RHELU2,
and there are also some new spiffy features like multipath. Perhaps
it's worth rebasing your testing environment?
-- 
Axel.Thimm at ATrpms.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051019/3ed9fb23/attachment.sig>

From teigland at redhat.com  Wed Oct 19 15:10:45 2005
From: teigland at redhat.com (David Teigland)
Date: Wed, 19 Oct 2005 10:10:45 -0500
Subject: [Linux-cluster] cluster-1.01.00
Message-ID: <20051019151045.GA3975@redhat.com>

A new source tarball from the STABLE branch has been released; it builds
and runs on 2.6.13:

  ftp://sources.redhat.com/pub/cluster/releases/cluster-1.01.00.tar.gz

Version 1.01.00 - 5 October 2005
================================
  cman-kernel: SM should wait for all recoveries to complete before it
  processes any group joins/leaves.  bz#162014
  cman-kernel: Fix barriers.
  cman-kernel: Fix off-by-one error in find_node_by_nodeid() that can
  cause an oops in some odd circumstances.
  dlm-kernel: Don't increment the DLM reference count when connecting to an
  already extant lockspace.  bz#157295
  dlm-kernel: Fix refcounting that could cause a memory leak.
  dlm-kernel: Return locking errors correctly.  bz#154990
  dlm-kernel: Don't free the lockinfo block if the LKB still exists.  bz#161146
  cman: "cman_tool join" can now set /proc/cluster/conf/cman values from CCS
  lock_dlm: The first mounter shouldn't let others mount until
  others_may_mount() has been called.  bz#161808
  gfs-kernel: If it took too long to sync the dependent inodes back to
  disk, resource group descriptor could get corrupted.  bz#164324
  gfs-kernel: It is now possible to toggle acls on and off with -o remount.
  Also, acls are only displayed when they are enabled.
  gfs-kernel: No longer check permissions before truncating a file in
  gfs_setattr.  bz#169039
  gfs-kernel: Fix oops when copying suid root file to gfs.
  gfs-kernel: changes to work on 2.6.13
  gfs_fsck: Some variables weren't getting initialized properly in pass1b,
  causing hangs (or segfaults) when duplicate blocks were present.  bz#162709
  fence: Add support for Dell PowerEdge 1855 to fence_drac.  bz#150563
  fence: Add support for latest ilo firmware version (1.75).  Changes were
  also added to make sure that power status of the machine is being properlly
  checked after power change commands have been issued.  bz#161352
  fence: fence_ipmilan default operation should be reboot.  bz#164627
  fence: fence_wti default operation should be reboot.  bz#162805
  ccs: Increase daemon performance by adding local socket communications.
  rgmanager: Fix ip bugs. bz#157327, bz#163651, bz#166526
  rgmanager: Fix hang when specifying nonexistent services. bz#159767
  rgmanager: Fix service tree handling.  bz#162824, bz#162936

Dave


From Bill.Scherer at VerizonWireless.com  Wed Oct 19 15:22:14 2005
From: Bill.Scherer at VerizonWireless.com (Bill Scherer)
Date: Wed, 19 Oct 2005 11:22:14 -0400
Subject: [Linux-cluster] bladecenter fencing...
Message-ID: <435664A6.4090002@VerizonWireless.com>

Hello -

I have 24 blades in two bladecenters, all running RHEL4. I have 
successfully configured a cluster with GFS on four nodes in one 
bladecenter.

There appears to be no way to create a cluster composed of blades in 
different bladecenters because the fence agent setup has no facility to 
handle multiple bladecenter management modules.

Or am I missing something?


TIA,

Bill Scherer


From david.chappel at mindbank.com  Fri Oct 14 15:31:07 2005
From: david.chappel at mindbank.com (David A. Chappel)
Date: Fri, 14 Oct 2005 09:31:07 -0600
Subject: [Linux-cluster] mounts not spanning
Message-ID: <1129303867.4838.30.camel@localhost.localdomain>

Hi there clusterites...  Anyone have a cluestick?

I have created a wee "cluster" of two machines.  They seem to be happy
in every way, except that when I mount the gfs volumes on each machine,
the mounts do not span across the two nodes, but act as a traditional
node.  In other words, I can echo "haha" > /mnt/shareMe/haha.txt on one
machine but it doesn't show up on the other.  Vice versa too.

I use:
mount -t gfs /dev/shareMeVG/shareMeLV /mnt/shareMe

I've tried the -o ignore_local_fs option without success.

Also, is there a quick/standard way for non-cluster kernel machines to
mount the "partition" remotely?

Cheers,
-D


[root at JavaTheHut ~]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 1
Cluster name: clusta
Cluster ID: 6621
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 2
Expected_votes: 1
Total_votes: 2
Quorum: 1
Active subsystems: 6
Node name: JavaTheHut.mindbankts.com
Node addresses: 10.1.1.22

[root at marvin ~]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster name="clusta" config_version="1">
    <cman two_node="1" expected_votes="1">
    </cman>
    <clusternodes>
      <clusternode name="marvin.mindbankts.com" votes="1">
       <fence>
        <method name="single">
         <device name="human" ipaddr="10.1.1.20"/>
       </method>
      </fence>
     </clusternode>
     <clusternode name="JavaTheHut.mindbankts.com" votes="1">
      <fence>
       <method name="single">
         <device name="human" ipaddr="10.1.1.22"/>
       </method>
      </fence>
    </clusternode>
   </clusternodes>
  <fence_devices>
   <fence_device name="human" agent="fence_manual"/>
  </fence_devices>
 </cluster>


From pcaulfie at redhat.com  Fri Oct 14 07:07:01 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Fri, 14 Oct 2005 08:07:01 +0100
Subject: [Linux-cluster] Additional node "Cluster membership rejected"
In-Reply-To: <1128709816.4346beb85a069@mail.devcon.cc>
References: <1128709816.4346beb85a069@mail.devcon.cc>
Message-ID: <434F5915.7010904@redhat.com>

Thomas Kofler wrote:
> Hi,
> 
> we are running a 4 node cluster successfully. Now we try to join an additional 
> node - but it fails.
> 
> We upgraded the cluster.conf file to reflect the new node.
> 
> [root at www5 ~]# ccs_tool update /etc/cluster/cluster.conf
> Config file updated from version 10 to 11
> Update complete.
> 
> cluster.conf was checked and is synchron on all nodes, hosts files are also 
> fine.
> 
> When we try to join the new node gfsserver2
> [root at gfsserver2 cluster]# cman_tool join
> 
> we get 
> 
> [root at gfsserver2 cluster]# CMAN: Cluster membership rejected
> 
> And the interesting part is:
> 
> 
> Oct  7 13:51:59 gfsserver ccsd[415]: Update of cluster.conf complete (version 
> 10 -> 11).
> Oct  7 13:52:00 gfsserver kernel: CMAN: Join request from gfsserver2.devcon.cc 
> rejected, config version local 10 remote 11
> 
> Why do the 4 existing nodes not check, that they also have version 11 in use ?
> 
> Or do we have to "reload" anything additionally to the ccs_tool update 
> command ?

You'll need to run cman_tool version -r 11 on one node in the cluster if you
update the CCS file.


-- 

patrick


From tom-fedora at kofler.eu.org  Fri Oct 14 06:31:09 2005
From: tom-fedora at kofler.eu.org (Thomas Kofler)
Date: Fri, 14 Oct 2005 08:31:09 +0200
Subject: [Linux-cluster] Additional node "Cluster membership rejected"
In-Reply-To: <1128709816.4346beb85a069@mail.devcon.cc>
References: <1128709816.4346beb85a069@mail.devcon.cc>
Message-ID: <1129271469.434f50adeead6@mail.devcon.cc>

Hm,

the list delayed my email for nearly a week, but neverless - found the 
solutions:

You have to tell cman, that a new config version is available:

 cman_tool version -r 11

Thomas
 

From vojtech.moravek at cz.ibm.com  Wed Oct 19 07:48:11 2005
From: vojtech.moravek at cz.ibm.com (Vojtech Moravek)
Date: Wed, 19 Oct 2005 09:48:11 +0200
Subject: [Linux-cluster] Performance Problem-GFS 6.1 u2 - LockGulm
Message-ID: <OF18D58F6D.6A8FE453-ONC125709F.002A8754-C125709F.002ADCD5@cz.ibm.com>

Hi All
 
I testing HA samba cluster with
 
One -load balancer, 
       -two samba servers(gfs client)
       -gfs-server
        -storage connected by FC to samba servers see picture below.
 
                        
                        loadbalancer
                      -------------------
                            |
                            |  (ethernet network)
           ----------------------------------------------------------    
            |                                    |                  |
            |        (internal gfs network       |                  |
        samba1-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-. samba2-.-.-.-gfs-server
            |                                    |
            |                                    |
            ---------------------------------------
                        Storage (Fibre connect)

Everything works perfectly, but only 30-40 minuts aprox under client work. 
After that time gfs mount points are goint to slow rapidly :(
When I try browsing in directory structure on servers all operation like 
chdir, readir are very very slow. But all system resurces looks  ok..Ram 
is ok, cpu usage is ok, but traffic on gfs network is growing.

Did someone met with problem like this?

And one point ..when I mount gfs volumes and I made system command "df", 
first output is too slow. Is this normal???

My configurations files:

--------------------------------------
cat /etc/cluster/cluster.conf

<?xml version="1.0"?>
<cluster config_version="2" name="gsdc">
        <fence_daemon clean_start="0" post_fail_delay="0" 
post_join_delay="3"/>
        <clusternodes>
                <clusternode name="gfs-samba1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="man_fen" 
nodename="gfs-samba1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="gfs-samba2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="man_fen" 
nodename="gfs-samba2"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="gfs-fake" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="man_fen" 
nodename="gfs-fake"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <gulm>
                <lockserver name="gfs-server"/>
        </gulm>
        <fencedevices>
                <fencedevice agent="fence_manual" name="man_fen"/>
        </fencedevices>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
</cluster>
------------------------------------- 

cat /etc/fstab
# This file is edited by fstab-sync - see 'man fstab-sync' for details
LABEL=/1                /                       ext3    defaults        1 
1
none                    /dev/pts                devpts  gid=5,mode=620  0 
0
none                    /dev/shm                tmpfs   defaults        0 
0
none                    /proc                   proc    defaults        0 
0
none                    /sys                    sysfs   defaults        0 
0
LABEL=/var1             /var                    ext3    defaults        1 
2
LABEL=SWAP-sda3         swap                    swap    defaults        0 
0

/dev/vg_pole1/profiles_vg1      /vg_pole1/profiles/         gfs noatime    
     0 0
/dev/vg_pole1/home_vg1          /vg_pole1/home/         gfs noatime  0 0

/dev/vg_pole2/profiles_vg2     /vg_pole2/profiles/              gfs 
noatime               0 0
/dev/vg_pole2/home_vg2         /vg_pole2/home/         gfs noatime       0 
0


Thanks for any help

Vojtech Moravek
vojtech.moravek at cz.ibm.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051019/108f67a4/attachment.htm>

From brent at phys.ufl.edu  Tue Oct 18 23:27:52 2005
From: brent at phys.ufl.edu (Brent A Nelson)
Date: Tue, 18 Oct 2005 19:27:52 -0400 (EDT)
Subject: [Linux-cluster] ddraid?
Message-ID: <Pine.LNX.4.61.0510181514340.5119@atma.phys.ufl.edu>

There hasn't been mention of ddraid in a while and the CVS hasn't been 
updated in about 3 months, I believe.  Has there been any further progress 
with it?

What are the risks associated with using it in its current form? If the 
lack of bad block handling is the only real concern, would the risk be 
substantially mitigated if the underlying devices are raid1, raid1+0, or 
raid5?

Has anyone tried it yet in a production environment? Any comments to 
share?

We'd really like to use this cool little tool.  Redundancy and a 
performance gain, even when across the net; what's not to like (except for 
the minor nuisance of 2^n+1 devices being required)?

Thanks,

Brent Nelson
Director of Computing
Dept. of Physics
University of Florida


From Stefan.Marx at SCHOBER.DE  Fri Oct 14 05:54:51 2005
From: Stefan.Marx at SCHOBER.DE (Stefan Marx)
Date: Fri, 14 Oct 2005 07:54:51 +0200
Subject: Antw: [Linux-cluster] Oracle 10G-R2 on GFS install problems
Message-ID: <s34f644e.099@sdm-cl2.schober.de>

Hi Marvin,

also OCFS2 ist f?r Oracle Produkte noch nicht freigegeben, auch wenn es von Oracle selebr stammt. GFS ist f?r die 9.2er Serie zertifiziert, wobei man schauen mu?, ob man RHEL3 oder RHEL4 verwenden kann, je nachdem, ob man 32-Bit oder 64-Bit Support ben?tigt. Da gibt es ziemlich klare Aussagen bei Oracle, welche Produkte mit welcher Version auf welchem Betriebssystem mit welcher Version und zus?tzlich auf welcher Hardwareplattform supported sind. Und das hat auch meistens seinen guten Grund :-(. Klar gehen die Sachen auch auf anderen Betriebssystemen, solange die entsprechenden Libraries und Kernelversionen passen, sind dann aber halt nicht supported.

Ciao, Stefan
 
 
>>>spwilcox at att.com 10/13/05 8:26 pm >>> 
In the process of installing Oracle 10G-R2 on a RHEL4-U2 x86_64 cluster 
with GFS 6.1.2, I get the following error when running Oracle's root.sh 
for cluster ready services (a.k.a clusterware): 
 
[  OCROSD][4142143168]utstoragetype: /u00/app/ocr0 is on FS type 
18225520. Not supported. 
 
I did a little poking around and found that OCFS2 has the same issue, 
but with OCFS2 it can be circumvented by mounting with -o datavolume... 
I was unable to find any similar options for GFS mounts.  This looks 
like probably more of an Oracle bug, as 10G-R1 installed without any 
problems (I have my DBA pursuing the Oracle route), but I was wondering 
if anyone else has come across this problem and if so, was there any 
fix? 
 
Thanks, 
-steve 
 
-- 
Linux-cluster mailing list 
Linux-cluster at redhat.com 
https://www.redhat.com/mailman/listinfo/linux-cluster 
 

From Axel.Thimm at ATrpms.net  Fri Oct 14 00:21:24 2005
From: Axel.Thimm at ATrpms.net (Axel Thimm)
Date: Fri, 14 Oct 2005 02:21:24 +0200
Subject: [Linux-cluster] Re: Additional node "Cluster membership rejected"
In-Reply-To: <1128757095.43477767a7bcc@mail.devcon.cc>
References: <1128757095.43477767a7bcc@mail.devcon.cc>
Message-ID: <20051014002124.GB4695@neu.nirvana>

On Sat, Oct 08, 2005 at 09:38:15AM +0200, Thomas Kofler wrote:
> Hi,
> 
> we are running a 4 node cluster successfully. Now we try to join an additional 
> node - but it fails.
> 
> We upgraded the cluster.conf file to reflect the new node.
> 
> [root at www5 ~]# ccs_tool update /etc/cluster/cluster.conf
> Config file updated from version 10 to 11
> Update complete.
> 
> cluster.conf was checked and is synchron on all nodes, hosts files are also 
> fine.

Now you need

cman_tool version -r 11

Check out the man page for ccs_tool under "update"

> When we try to join the new node gfsserver2
> [root at gfsserver2 cluster]# cman_tool join
> 
> we get 
> 
> [root at gfsserver2 cluster]# CMAN: Cluster membership rejected
> 
> And the interesting part is:
> 
> 
> Oct  7 13:51:59 gfsserver ccsd[415]: Update of cluster.conf complete (version 
> 10 -> 11).
> Oct  7 13:52:00 gfsserver kernel: CMAN: Join request from gfsserver2.devcon.cc 
> rejected, config version local 10 remote 11
> 
> Why do the 4 existing nodes not check, that they also have version 11 in use ?
> 
> Or do we have to "reload" anything additionally to the ccs_tool update 
> command ?
> 
> Thanks in advance,
> Regards,
> Thomas
> 
> Oct  7 13:51:34 gfsserver2 kernel: GFS 2.6.11.8-20050601.152643.FC4.9 (built 
> Jul 18 2005 10:42:24) installed
> Oct  7 13:51:39 gfsserver2 kernel: CMAN 2.6.11.5-20050601.152643.FC4.9 (built 
> Jul 18 2005 10:27:35) installed
> Oct  7 13:51:39 gfsserver2 kernel: NET: Registered protocol family 30
> Oct  7 13:51:39 gfsserver2 kernel: DLM 2.6.11.5-20050601.152643.FC4.10 (built 
> Jul 18 2005 10:34:42) installed
> Oct  7 13:51:39 gfsserver2 kernel: Lock_DLM (built Jul 18 2005 10:42:18) 
> installed
> Oct  7 13:51:49 gfsserver2 ccsd[839]: Starting ccsd 1.0.0:
> Oct  7 13:51:49 gfsserver2 ccsd[839]:  Built: Jun 16 2005 10:45:39
> Oct  7 13:51:49 gfsserver2 ccsd[839]:  Copyright (C) Red Hat, Inc.  2004  All 
> rights reserved.
> Oct  7 13:51:49 gfsserver2 ccsd[839]:   IP Protocol:: IPv4 only
> Oct  7 13:51:59 gfsserver2 ccsd[839]: cluster.conf (cluster name = 
> devconcluster, version = 11) found.
> Oct  7 13:51:59 gfsserver2 ccsd[839]: Remote copy of cluster.conf is from 
> quorate node.
> Oct  7 13:51:59 gfsserver2 ccsd[839]:  Local version # : 11
> Oct  7 13:51:59 gfsserver2 ccsd[839]:  Remote version #: 11
> Oct  7 13:51:59 gfsserver2 kernel: CMAN: Waiting to join or form a Linux-
> cluster
> Oct  7 13:51:59 gfsserver2 ccsd[839]: Connected to cluster infrastruture via: 
> CMAN/SM Plugin v1.1.2
> Oct  7 13:51:59 gfsserver2 ccsd[839]: Initial status:: Inquorate
> Oct  7 13:52:00 gfsserver2 kernel: CMAN: sending membership request
> Oct  7 13:52:00 gfsserver2 kernel: CMAN: Cluster membership rejected
> Oct  7 13:52:00 gfsserver2 ccsd[839]: Cluster manager shutdown.  Attemping to 
> reconnect...
> Oct  7 13:52:20 gfsserver2 ccsd[839]: Unable to connect to cluster 
> infrastructure after 30 seconds.
> Oct  7 13:52:50 gfsserver2 ccsd[839]: Unable to connect to cluster 
> infrastructure after 60 seconds.
> Oct  7 13:53:20 gfsserver2 ccsd[839]: Unable to connect to cluster 
> infrastructure after 90 seconds.
> Oct  7 13:53:51 gfsserver2 ccsd[839]: Unable to connect to cluster 
> infrastructure after 120 seconds.
> Oct  7 13:53:58 gfsserver2 ccsd[839]: Remote copy of cluster.conf is from 
> quorate node.
> 
> <?xml version="1.0"?>
> <cluster name="cluster" config_version="11">
> <cman expected_votes="3">
>  </cman>
>     <clusternodes>
>     <clusternode name="www5.domainxy.cc" domain="fedora0" votes="1">
>      <fence>
>       <method name="single">
>        <device name="human" ipaddr="192.xxx.2.25"/>
>      </method>
>     </fence>
>    </clusternode>
>    <clusternode name="www3.domainxy.cc" domain="fedora1" votes="1">
>     <fence>
>      <method name="single">
>        <device name="human" ipaddr="192.xxx.2.23"/>
>      </method>
>     </fence>
>     </clusternode>
>    <clusternode name="www4.domainxy.cc" domain="fedora2" votes="1">
>     <fence>
>      <method name="single">
>        <device name="human" ipaddr="192.xxx.2.24"/>
>      </method>
>     </fence>
>   </clusternode>
>   <clusternode name="gfsserver.domainxy.cc" domain="fedora3" votes="1">
>     <fence>
>      <method name="single">
>        <device name="human" ipaddr="192.xxx.2.33"/>
>      </method>
>     </fence>
>   </clusternode>
>  <clusternode name="gfsserver2.domainxy.cc" domain="fedora4" votes="1">
>     <fence>
>      <method name="single">
>        <device name="human" ipaddr="192.xxx.2.34"/>
>      </method>
>     </fence>
>   </clusternode>
>  </clusternodes>
> <fence_devices>
>  <fence_device name="human" agent="fence_manual"/>
> </fence_devices>
> </cluster>
> 
> 
> 
> 
> 
>  
> 
>  
> 

-- 
Axel.Thimm at ATrpms.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051014/af26b8e5/attachment.sig>

From vojtech.moravek at cz.ibm.com  Wed Oct 19 17:10:18 2005
From: vojtech.moravek at cz.ibm.com (Vojtech Moravek)
Date: Wed, 19 Oct 2005 19:10:18 +0200
Subject: [Linux-cluster] Performance Problem-GFS 6.1 u2 - LockGulm
Message-ID: <OFF7173C4A.783D7F74-ONC125709F.005E48B9-C125709F.005E5416@cz.ibm.com>

Hi All
 
I testing HA samba cluster with
 
One -load balancer, 
       -two samba servers(gfs client)
       -gfs-server
        -storage connected by FC to samba servers see picture below.
 
                        
                        loadbalancer
                      -------------------
                            |
                            |  (ethernet network)
           ----------------------------------------------------------    
            |                                    |                  |
            |        (internal gfs network       |                  |
        samba1-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-. samba2-.-.-.-gfs-server
            |                                    |
            |                                    |
            ---------------------------------------
                        Storage (Fibre connect)

Everything works perfectly, but only 30-40 minuts aprox under client work. 
After that time gfs mount points are goint to slow rapidly :(
When I try browsing in directory structure on servers all operation like 
chdir, readir are very very slow. But all system resurces looks  ok..Ram 
is ok, cpu usage is ok, but traffic on gfs network is growing.

Did someone met with problem like this?

And one point ..when I mount gfs volumes and I made system command "df", 
first output is too slow. Is this normal???

My configurations files:

--------------------------------------
cat /etc/cluster/cluster.conf

<?xml version="1.0"?>
<cluster config_version="2" name="gsdc">
        <fence_daemon clean_start="0" post_fail_delay="0" 
post_join_delay="3"/>
        <clusternodes>
                <clusternode name="gfs-samba1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="man_fen" 
nodename="gfs-samba1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="gfs-samba2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="man_fen" 
nodename="gfs-samba2"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="gfs-fake" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="man_fen" 
nodename="gfs-fake"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <gulm>
                <lockserver name="gfs-server"/>
        </gulm>
        <fencedevices>
                <fencedevice agent="fence_manual" name="man_fen"/>
        </fencedevices>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
</cluster>
------------------------------------- 

cat /etc/fstab
# This file is edited by fstab-sync - see 'man fstab-sync' for details
LABEL=/1                /                       ext3    defaults        1 
1
none                    /dev/pts                devpts  gid=5,mode=620  0 
0
none                    /dev/shm                tmpfs   defaults        0 
0
none                    /proc                   proc    defaults        0 
0
none                    /sys                    sysfs   defaults        0 
0
LABEL=/var1             /var                    ext3    defaults        1 
2
LABEL=SWAP-sda3         swap                    swap    defaults        0 
0

/dev/vg_pole1/profiles_vg1      /vg_pole1/profiles/         gfs noatime    
     0 0
/dev/vg_pole1/home_vg1          /vg_pole1/home/         gfs noatime  0 0

/dev/vg_pole2/profiles_vg2     /vg_pole2/profiles/              gfs 
noatime               0 0
/dev/vg_pole2/home_vg2         /vg_pole2/home/         gfs noatime       0 
0


Thanks for any help

Vojtech Moravek
vojtech.moravek at cz.ibm.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051019/435ed6a1/attachment.htm>

From tspauld98 at yahoo.com  Wed Oct 19 19:49:01 2005
From: tspauld98 at yahoo.com (Tim Spaulding)
Date: Wed, 19 Oct 2005 12:49:01 -0700 (PDT)
Subject: [Linux-cluster] New Cluster Installation Starts Partitioned
Message-ID: <20051019194901.19010.qmail@web60516.mail.yahoo.com>

Hi All,

I have a couple of machines that I'm trying to cluster.  The machines are freshly installed FC4
machines that have been fully updated and running the latest kernel.  They are configured to use
the lvm2 by default so lvm2 and dm was already installed.  I'm following the directions in the
usage.txt off RedHat's web site.  I compile the cluster tarball, run depmod, and start ccsd
without issue.  When I do a cman_tool join -w on each node, both nodes start cman and join the
cluster, but the cluster is apparently partitioned (i.e. they both see the cluster and are joined
to it, but the two nodes cannot see that the other node is joined).  I've searched around and
haven't found anything specific to this symptom.  I have a feeling that it's something to do with
my network configuration.  Any help would be appreciated.

Both machines are i686 archs with dual NICs.  The NICs are connected to networks that do not route
to each other.  One network (eth0 on both machines) is a development network.  The other network
(eth1) is our corporate network.  I'm trying to configure the cluster to use the dev network
(eth0).

Here's the output from uname:

Linux ctclinux1.clam.com 2.6.13-1.1526_FC4 #1 Wed Sep 28 19:15:10 EDT 2005 i686 i686 i386
GNU/Linux
Linux ctclinux2.clam.com 2.6.13-1.1526_FC4 #1 Wed Sep 28 19:15:10 EDT 2005 i686 i686 i386
GNU/Linux

Here's the network configuration on ctclinux1:

eth0      Link encap:Ethernet  HWaddr 00:01:03:26:5C:C9
          inet addr:192.168.36.200  Bcast:192.168.36.255  Mask:255.255.255.0
          inet6 addr: fe80::201:3ff:fe26:5cc9/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:7260 errors:0 dropped:0 overruns:0 frame:0
          TX packets:350 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:449183 (438.6 KiB)  TX bytes:27853 (27.2 KiB)
          Interrupt:10 Base address:0xec00

eth1      Link encap:Ethernet  HWaddr 00:B0:D0:41:0F:65
          inet addr:10.10.10.200  Bcast:10.10.255.255  Mask:255.255.0.0
          inet6 addr: fe80::2b0:d0ff:fe41:f65/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:57450 errors:0 dropped:0 overruns:1 frame:0
          TX packets:12957 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:10040767 (9.5 MiB)  TX bytes:1962029 (1.8 MiB)
          Interrupt:5 Base address:0xe880

eth1:1    Link encap:Ethernet  HWaddr 00:B0:D0:41:0F:65
          inet addr:10.10.10.204  Bcast:10.10.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:5 Base address:0xe880

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:17568 errors:0 dropped:0 overruns:0 frame:0
          TX packets:17568 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:3692600 (3.5 MiB)  TX bytes:3692600 (3.5 MiB)

sit0      Link encap:IPv6-in-IPv4
          NOARP  MTU:1480  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.36.0    *               255.255.255.0   U     0      0        0 eth0
10.74.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
10.72.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
10.75.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
10.73.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
10.10.0.0       *               255.255.0.0     U     0      0        0 eth1
169.254.0.0     *               255.255.0.0     U     0      0        0 eth1
default         10.10.1.1       0.0.0.0         UG    0      0        0 eth1

cat /etc/hosts
10.10.10.200    ctclinux1-svc
192.168.36.200  ctclinux1-cls
192.168.36.201  ctclinux2-cls
10.10.10.201    ctclinux2-svc

Here's the network configuration on ctclinux2:

ifconfig -a
eth0      Link encap:Ethernet  HWaddr 00:01:03:D4:80:7C
          inet addr:192.168.36.201  Bcast:192.168.36.255  Mask:255.255.255.0
          inet6 addr: fe80::201:3ff:fed4:807c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:7702 errors:0 dropped:0 overruns:1 frame:0
          TX packets:282 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:477769 (466.5 KiB)  TX bytes:22444 (21.9 KiB)
          Interrupt:10 Base address:0xec00

eth1      Link encap:Ethernet  HWaddr 00:B0:D0:41:0F:9B
          inet addr:10.10.10.201  Bcast:10.10.255.255  Mask:255.255.0.0
          inet6 addr: fe80::2b0:d0ff:fe41:f9b/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:53846 errors:0 dropped:0 overruns:1 frame:0
          TX packets:7759 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:5733713 (5.4 MiB)  TX bytes:1155588 (1.1 MiB)
          Interrupt:5 Base address:0xe880

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:17912 errors:0 dropped:0 overruns:0 frame:0
          TX packets:17912 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:3401868 (3.2 MiB)  TX bytes:3401868 (3.2 MiB)

sit0      Link encap:IPv6-in-IPv4
          NOARP  MTU:1480  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.36.0    *               255.255.255.0   U     0      0        0 eth0
10.74.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
10.72.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
10.75.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
10.73.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
10.10.0.0       *               255.255.0.0     U     0      0        0 eth1
169.254.0.0     *               255.255.0.0     U     0      0        0 eth1
default         10.10.1.1       0.0.0.0         UG    0      0        0 eth1

cat /etc/hosts
10.10.10.201    ctclinux2-svc
192.168.36.201  ctclinux2-cls
192.168.36.200  ctclinux1-cls
10.10.10.200    ctclinux1-svc

Here's the cluster configuration file:

<?xml version="1.0"?>
<cluster name="cl_tic" config_version="1">
        <cman>
        </cman>

        <clusternodes>
                <clusternode name="ctclinux1-cls">
                        <fence>
                                <method name="single">
                                        <device name="human" nodename="ctclinux1-cls"/>
                                </method>
                        </fence>
                </clusternode>

                <clusternode name="ctclinux2-cls">
                        <fence>
                                <method name="single">
                                        <device name="human" nodename="ctclinux2-cls"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>

        <fence_devices>
                <fence_device name="human" agent="fence_manual"/>
        </fence_devices>
</cluster>

Here's the cluster information from ctclinux1 after the cluster is started and joined:

cman_tool -d join -w
nodename ctclinux1.clam.com not found
nodename ctclinux1 (truncated) not found
nodename ctclinux1 doesn't match ctclinux1-cls (ctclinux1-cls in cluster.conf)
nodename ctclinux1 doesn't match ctclinux2-cls (ctclinux2-cls in cluster.conf)
nodename localhost (if lo) not found
selected nodename ctclinux1-cls
setup up interface for address: ctclinux1-cls
Broadcast address for c824a8c0 is ff24a8c0

cman_tool status
Protocol version: 5.0.1
Config version: 1
Cluster name: cl_tic
Cluster ID: 6429
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 1
Expected_votes: 2
Total_votes: 1
Quorum: 2  Activity blocked
Active subsystems: 0
Node name: ctclinux1-cls
Node addresses: 192.168.36.200

cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    2   M   ctclinux1-cls

Here's the cluster information from ctclinux2 after the cluster is started and joined:

cman_tool -d join -w
nodename ctclinux2.clam.com not found
nodename ctclinux2 (truncated) not found
nodename ctclinux2 doesn't match ctclinux1-cls (ctclinux1-cls in cluster.conf)
nodename ctclinux2 doesn't match ctclinux2-cls (ctclinux2-cls in cluster.conf)
nodename localhost (if lo) not found
selected nodename ctclinux2-cls
setup up interface for address: ctclinux2-cls
Broadcast address for c924a8c0 is ff24a8c0

cman_tool status
Protocol version: 5.0.1
Config version: 1
Cluster name: cl_tic
Cluster ID: 6429
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 1
Expected_votes: 2
Total_votes: 1
Quorum: 2  Activity blocked
Active subsystems: 0
Node name: ctclinux2-cls
Node addresses: 192.168.36.201

cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    2   M   ctclinux2-cls

Let me know if there is more information that I need to provide.  As an aside, I've tried reducing
the quorum count with no difference in behavior and I've tried using multicast which fails on the
cman_tool join with an "Unknown Host" error.  I'm open to any other suggestions.

Thanks,

tims


__________________________________ 
Yahoo! Mail - PC Magazine Editors' Choice 2005 
http://mail.yahoo.com


From alexander_rau at yahoo.com  Wed Oct 19 19:27:56 2005
From: alexander_rau at yahoo.com (Alexander Rau)
Date: Wed, 19 Oct 2005 12:27:56 -0700 (PDT)
Subject: [Linux-cluster] application monitoring - apache crash doesn't
	invoke failover
Message-ID: <20051019192756.82030.qmail@web52101.mail.yahoo.com>

We are trying to test the failover in a 2 cluster
environment by killing apache.

The service fails according to clustat, however the
cluster mananger does not move the service from the
failed node to the fail over node....

/var/log/messages shows the following output (on the
node with the forced failure):

Oct 19 16:34:59 armstrong clurgmgrd[4269]: <notice>
status on script "httpd" returned 1 (generic error)
Oct 19 16:34:59 armstrong clurgmgrd[4269]: <notice>
Stopping service http
Oct 19 16:34:59 armstrong httpd: httpd shutdown failed
Oct 19 16:34:59 armstrong clurgmgrd[4269]: <notice>
stop on script "httpd" returned 1 (generic error)
Oct 19 16:34:59 armstrong clurgmgrd[4269]: <crit> #12:
RG http failed to stop; intervention required
Oct 19 16:34:59 armstrong clurgmgrd[4269]: <notice>
Service http is failed

Anybody any ideas?

Thanks

AR


From eric at bootseg.com  Wed Oct 19 20:41:29 2005
From: eric at bootseg.com (Eric Kerin)
Date: Wed, 19 Oct 2005 16:41:29 -0400
Subject: [Linux-cluster] application monitoring - apache crash doesn't
	invoke failover
In-Reply-To: <20051019192756.82030.qmail@web52101.mail.yahoo.com>
References: <20051019192756.82030.qmail@web52101.mail.yahoo.com>
Message-ID: <1129754489.3349.13.camel@auh5-0479.corp.jabil.org>

See this bugzilla entry:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=151104 especially
the attached patch.

Basically RHEL4 (and RHEL3) don't (and at this point, can't) follow the
LSB's standard return value for successful stop operations, which is
that a stop operation of a service that isn't running should return 0 as
it's errorlevel.

Thanks,
Eric Kerin
eric at bootseg.com


On Wed, 2005-10-19 at 12:27 -0700, Alexander Rau wrote:
> We are trying to test the failover in a 2 cluster
> environment by killing apache.
> 
> The service fails according to clustat, however the
> cluster mananger does not move the service from the
> failed node to the fail over node....
> 
> /var/log/messages shows the following output (on the
> node with the forced failure):
> 
> Oct 19 16:34:59 armstrong clurgmgrd[4269]: <notice>
> status on script "httpd" returned 1 (generic error)
> Oct 19 16:34:59 armstrong clurgmgrd[4269]: <notice>
> Stopping service http
> Oct 19 16:34:59 armstrong httpd: httpd shutdown failed
> Oct 19 16:34:59 armstrong clurgmgrd[4269]: <notice>
> stop on script "httpd" returned 1 (generic error)
> Oct 19 16:34:59 armstrong clurgmgrd[4269]: <crit> #12:
> RG http failed to stop; intervention required
> Oct 19 16:34:59 armstrong clurgmgrd[4269]: <notice>
> Service http is failed
> 
> Anybody any ideas?
> 
> Thanks
> 
> AR
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From lhh at redhat.com  Wed Oct 19 22:25:21 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 19 Oct 2005 18:25:21 -0400
Subject: [Linux-cluster] application monitoring - apache crash doesn't
	invoke failover
In-Reply-To: <1129754489.3349.13.camel@auh5-0479.corp.jabil.org>
References: <20051019192756.82030.qmail@web52101.mail.yahoo.com>
	<1129754489.3349.13.camel@auh5-0479.corp.jabil.org>
Message-ID: <1129760721.25547.89.camel@ayanami.boston.redhat.com>

On Wed, 2005-10-19 at 16:41 -0400, Eric Kerin wrote:
> See this bugzilla entry:
> 
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=151104 especially
> the attached patch.
> 
> Basically RHEL4 (and RHEL3) don't (and at this point, can't) follow the
> LSB's standard return value for successful stop operations, which is
> that a stop operation of a service that isn't running should return 0 as
> it's errorlevel.

Correct.

-- Lon


From hlawatschek at atix.de  Wed Oct 19 22:28:41 2005
From: hlawatschek at atix.de (Mark Hlawatschek)
Date: Thu, 20 Oct 2005 00:28:41 +0200
Subject: [Linux-cluster] New Cluster Installation Starts Partitioned
In-Reply-To: <20051019194901.19010.qmail@web60516.mail.yahoo.com>
References: <20051019194901.19010.qmail@web60516.mail.yahoo.com>
Message-ID: <1129760921.3471.6.camel@falballa.gallien.atix>

Hi Tim,

make sure that the cmans on both nodes can talk to each other. I
observed this problem when iptables wasn't configured correctly. If you
have an active iptables config shut it down and try again.

Hope that helps ...

Mark 

On Wed, 2005-10-19 at 12:49 -0700, Tim Spaulding wrote:
> Hi All,
> 
> I have a couple of machines that I'm trying to cluster.  The machines are freshly installed FC4
> machines that have been fully updated and running the latest kernel.  They are configured to use
> the lvm2 by default so lvm2 and dm was already installed.  I'm following the directions in the
> usage.txt off RedHat's web site.  I compile the cluster tarball, run depmod, and start ccsd
> without issue.  When I do a cman_tool join -w on each node, both nodes start cman and join the
> cluster, but the cluster is apparently partitioned (i.e. they both see the cluster and are joined
> to it, but the two nodes cannot see that the other node is joined).  I've searched around and
> haven't found anything specific to this symptom.  I have a feeling that it's something to do with
> my network configuration.  Any help would be appreciated.
> 
> Both machines are i686 archs with dual NICs.  The NICs are connected to networks that do not route
> to each other.  One network (eth0 on both machines) is a development network.  The other network
> (eth1) is our corporate network.  I'm trying to configure the cluster to use the dev network
> (eth0).
> 
> Here's the output from uname:
> 
> Linux ctclinux1.clam.com 2.6.13-1.1526_FC4 #1 Wed Sep 28 19:15:10 EDT 2005 i686 i686 i386
> GNU/Linux
> Linux ctclinux2.clam.com 2.6.13-1.1526_FC4 #1 Wed Sep 28 19:15:10 EDT 2005 i686 i686 i386
> GNU/Linux
> 
> Here's the network configuration on ctclinux1:
> 
> eth0      Link encap:Ethernet  HWaddr 00:01:03:26:5C:C9
>           inet addr:192.168.36.200  Bcast:192.168.36.255  Mask:255.255.255.0
>           inet6 addr: fe80::201:3ff:fe26:5cc9/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:7260 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:350 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:449183 (438.6 KiB)  TX bytes:27853 (27.2 KiB)
>           Interrupt:10 Base address:0xec00
> 
> eth1      Link encap:Ethernet  HWaddr 00:B0:D0:41:0F:65
>           inet addr:10.10.10.200  Bcast:10.10.255.255  Mask:255.255.0.0
>           inet6 addr: fe80::2b0:d0ff:fe41:f65/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:57450 errors:0 dropped:0 overruns:1 frame:0
>           TX packets:12957 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:10040767 (9.5 MiB)  TX bytes:1962029 (1.8 MiB)
>           Interrupt:5 Base address:0xe880
> 
> eth1:1    Link encap:Ethernet  HWaddr 00:B0:D0:41:0F:65
>           inet addr:10.10.10.204  Bcast:10.10.255.255  Mask:255.255.0.0
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           Interrupt:5 Base address:0xe880
> 
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:17568 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:17568 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:3692600 (3.5 MiB)  TX bytes:3692600 (3.5 MiB)
> 
> sit0      Link encap:IPv6-in-IPv4
>           NOARP  MTU:1480  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
> 
> Kernel IP routing table
> Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
> 192.168.36.0    *               255.255.255.0   U     0      0        0 eth0
> 10.74.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
> 10.72.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
> 10.75.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
> 10.73.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
> 10.10.0.0       *               255.255.0.0     U     0      0        0 eth1
> 169.254.0.0     *               255.255.0.0     U     0      0        0 eth1
> default         10.10.1.1       0.0.0.0         UG    0      0        0 eth1
> 
> cat /etc/hosts
> 10.10.10.200    ctclinux1-svc
> 192.168.36.200  ctclinux1-cls
> 192.168.36.201  ctclinux2-cls
> 10.10.10.201    ctclinux2-svc
> 
> Here's the network configuration on ctclinux2:
> 
> ifconfig -a
> eth0      Link encap:Ethernet  HWaddr 00:01:03:D4:80:7C
>           inet addr:192.168.36.201  Bcast:192.168.36.255  Mask:255.255.255.0
>           inet6 addr: fe80::201:3ff:fed4:807c/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:7702 errors:0 dropped:0 overruns:1 frame:0
>           TX packets:282 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:477769 (466.5 KiB)  TX bytes:22444 (21.9 KiB)
>           Interrupt:10 Base address:0xec00
> 
> eth1      Link encap:Ethernet  HWaddr 00:B0:D0:41:0F:9B
>           inet addr:10.10.10.201  Bcast:10.10.255.255  Mask:255.255.0.0
>           inet6 addr: fe80::2b0:d0ff:fe41:f9b/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:53846 errors:0 dropped:0 overruns:1 frame:0
>           TX packets:7759 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:5733713 (5.4 MiB)  TX bytes:1155588 (1.1 MiB)
>           Interrupt:5 Base address:0xe880
> 
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:17912 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:17912 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:3401868 (3.2 MiB)  TX bytes:3401868 (3.2 MiB)
> 
> sit0      Link encap:IPv6-in-IPv4
>           NOARP  MTU:1480  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
> 
> route
> Kernel IP routing table
> Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
> 192.168.36.0    *               255.255.255.0   U     0      0        0 eth0
> 10.74.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
> 10.72.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
> 10.75.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
> 10.73.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
> 10.10.0.0       *               255.255.0.0     U     0      0        0 eth1
> 169.254.0.0     *               255.255.0.0     U     0      0        0 eth1
> default         10.10.1.1       0.0.0.0         UG    0      0        0 eth1
> 
> cat /etc/hosts
> 10.10.10.201    ctclinux2-svc
> 192.168.36.201  ctclinux2-cls
> 192.168.36.200  ctclinux1-cls
> 10.10.10.200    ctclinux1-svc
> 
> Here's the cluster configuration file:
> 
> <?xml version="1.0"?>
> <cluster name="cl_tic" config_version="1">
>         <cman>
>         </cman>
> 
>         <clusternodes>
>                 <clusternode name="ctclinux1-cls">
>                         <fence>
>                                 <method name="single">
>                                         <device name="human" nodename="ctclinux1-cls"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
> 
>                 <clusternode name="ctclinux2-cls">
>                         <fence>
>                                 <method name="single">
>                                         <device name="human" nodename="ctclinux2-cls"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>         </clusternodes>
> 
>         <fence_devices>
>                 <fence_device name="human" agent="fence_manual"/>
>         </fence_devices>
> </cluster>
> 
> Here's the cluster information from ctclinux1 after the cluster is started and joined:
> 
> cman_tool -d join -w
> nodename ctclinux1.clam.com not found
> nodename ctclinux1 (truncated) not found
> nodename ctclinux1 doesn't match ctclinux1-cls (ctclinux1-cls in cluster.conf)
> nodename ctclinux1 doesn't match ctclinux2-cls (ctclinux2-cls in cluster.conf)
> nodename localhost (if lo) not found
> selected nodename ctclinux1-cls
> setup up interface for address: ctclinux1-cls
> Broadcast address for c824a8c0 is ff24a8c0
> 
> cman_tool status
> Protocol version: 5.0.1
> Config version: 1
> Cluster name: cl_tic
> Cluster ID: 6429
> Cluster Member: Yes
> Membership state: Cluster-Member
> Nodes: 1
> Expected_votes: 2
> Total_votes: 1
> Quorum: 2  Activity blocked
> Active subsystems: 0
> Node name: ctclinux1-cls
> Node addresses: 192.168.36.200
> 
> cman_tool nodes
> Node  Votes Exp Sts  Name
>    1    1    2   M   ctclinux1-cls
> 
> Here's the cluster information from ctclinux2 after the cluster is started and joined:
> 
> cman_tool -d join -w
> nodename ctclinux2.clam.com not found
> nodename ctclinux2 (truncated) not found
> nodename ctclinux2 doesn't match ctclinux1-cls (ctclinux1-cls in cluster.conf)
> nodename ctclinux2 doesn't match ctclinux2-cls (ctclinux2-cls in cluster.conf)
> nodename localhost (if lo) not found
> selected nodename ctclinux2-cls
> setup up interface for address: ctclinux2-cls
> Broadcast address for c924a8c0 is ff24a8c0
> 
> cman_tool status
> Protocol version: 5.0.1
> Config version: 1
> Cluster name: cl_tic
> Cluster ID: 6429
> Cluster Member: Yes
> Membership state: Cluster-Member
> Nodes: 1
> Expected_votes: 2
> Total_votes: 1
> Quorum: 2  Activity blocked
> Active subsystems: 0
> Node name: ctclinux2-cls
> Node addresses: 192.168.36.201
> 
> cman_tool nodes
> Node  Votes Exp Sts  Name
>    1    1    2   M   ctclinux2-cls
> 
> Let me know if there is more information that I need to provide.  As an aside, I've tried reducing
> the quorum count with no difference in behavior and I've tried using multicast which fails on the
> cman_tool join with an "Unknown Host" error.  I'm open to any other suggestions.
> 
> Thanks,
> 
> tims
> 
> 
> 	
> 		
> __________________________________ 
> Yahoo! Mail - PC Magazine Editors' Choice 2005 
> http://mail.yahoo.com
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-- 
Mark Hlawatschek <hlawatschek at atix.de>


From alexander_rau at yahoo.com  Thu Oct 20 00:11:21 2005
From: alexander_rau at yahoo.com (Alexander Rau)
Date: Wed, 19 Oct 2005 17:11:21 -0700 (PDT)
Subject: [Linux-cluster] mounting using label
Message-ID: <20051020001121.10195.qmail@web52102.mail.yahoo.com>

Hi:

I am trying to mount a file system on the SAN by using
the label rather then the device.

When I specify "-L label" in either the device line or
the Options line the cluster service fails to start.

Just wondering if anybody has successfully used labels
to mount a file system as a cluster service...?

Thanks

AR


From erwan at seanodes.com  Thu Oct 20 12:13:24 2005
From: erwan at seanodes.com (Velu Erwan)
Date: Thu, 20 Oct 2005 14:13:24 +0200
Subject: [Linux-cluster] cluster-1.01.00
In-Reply-To: <20051019151045.GA3975@redhat.com>
References: <20051019151045.GA3975@redhat.com>
Message-ID: <435789E4.70905@seanodes.com>

David Teigland a ?crit :

>A new source tarball from the STABLE branch has been released; it builds
>and runs on 2.6.13:
>
>  ftp://sources.redhat.com/pub/cluster/releases/cluster-1.01.00.tar.gz
>
>  
>
I just tried it on my 2.6.13-4 and I had the following error :

make[2]: Entering directory `/home/build/cluster-1.01.00/cman/lib'
gcc -Wall  -g -O -I. -fPIC 
-I/home/build/cluster-1.01.00/build/incdir/cluster -c -o libcman.o libcman.c
libcman.c:31:35: cluster/cnxman-socket.h: No such file or directory

I've fixed that with the patch attached to this mail.

Everything is compiling fine great jobs, this is really easiest than 
before ;)

Is it difficult to make it compile on previous kernels like 2.6.11 ?
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cluster-1.01.00-include.patch
Type: text/x-patch
Size: 343 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051020/8c146bd0/attachment.bin>

From tspauld98 at yahoo.com  Thu Oct 20 15:49:58 2005
From: tspauld98 at yahoo.com (Tim Spaulding)
Date: Thu, 20 Oct 2005 08:49:58 -0700 (PDT)
Subject: [Linux-cluster] New Cluster Installation Starts Partitioned
In-Reply-To: <1129760921.3471.6.camel@falballa.gallien.atix>
Message-ID: <20051020154958.50725.qmail@web60524.mail.yahoo.com>

Hi Mark,

Thanks, that solved it.  I had opened up the right ports on my primary node but had forgotten to
do the same on the secondary node reinforcing Murphy's Second Law of Clustering.  It's always the
little things.  :)

Thanks again,

tims

--- Mark Hlawatschek <hlawatschek at atix.de> wrote:

> Hi Tim,
> 
> make sure that the cmans on both nodes can talk to each other. I
> observed this problem when iptables wasn't configured correctly. If you
> have an active iptables config shut it down and try again.
> 
> Hope that helps ...
> 
> Mark 
> 
> On Wed, 2005-10-19 at 12:49 -0700, Tim Spaulding wrote:
> > Hi All,
> > 
> > I have a couple of machines that I'm trying to cluster.  The machines are freshly installed
> FC4
> > machines that have been fully updated and running the latest kernel.  They are configured to
> use
> > the lvm2 by default so lvm2 and dm was already installed.  I'm following the directions in the
> > usage.txt off RedHat's web site.  I compile the cluster tarball, run depmod, and start ccsd
> > without issue.  When I do a cman_tool join -w on each node, both nodes start cman and join the
> > cluster, but the cluster is apparently partitioned (i.e. they both see the cluster and are
> joined
> > to it, but the two nodes cannot see that the other node is joined).  I've searched around and
> > haven't found anything specific to this symptom.  I have a feeling that it's something to do
> with
> > my network configuration.  Any help would be appreciated.
> > 
> > Both machines are i686 archs with dual NICs.  The NICs are connected to networks that do not
> route
> > to each other.  One network (eth0 on both machines) is a development network.  The other
> network
> > (eth1) is our corporate network.  I'm trying to configure the cluster to use the dev network
> > (eth0).
> > 
> > Here's the output from uname:
> > 
> > Linux ctclinux1.clam.com 2.6.13-1.1526_FC4 #1 Wed Sep 28 19:15:10 EDT 2005 i686 i686 i386
> > GNU/Linux
> > Linux ctclinux2.clam.com 2.6.13-1.1526_FC4 #1 Wed Sep 28 19:15:10 EDT 2005 i686 i686 i386
> > GNU/Linux
> > 
> > Here's the network configuration on ctclinux1:
> > 
> > eth0      Link encap:Ethernet  HWaddr 00:01:03:26:5C:C9
> >           inet addr:192.168.36.200  Bcast:192.168.36.255  Mask:255.255.255.0
> >           inet6 addr: fe80::201:3ff:fe26:5cc9/64 Scope:Link
> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >           RX packets:7260 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:350 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:1000
> >           RX bytes:449183 (438.6 KiB)  TX bytes:27853 (27.2 KiB)
> >           Interrupt:10 Base address:0xec00
> > 
> > eth1      Link encap:Ethernet  HWaddr 00:B0:D0:41:0F:65
> >           inet addr:10.10.10.200  Bcast:10.10.255.255  Mask:255.255.0.0
> >           inet6 addr: fe80::2b0:d0ff:fe41:f65/64 Scope:Link
> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >           RX packets:57450 errors:0 dropped:0 overruns:1 frame:0
> >           TX packets:12957 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:1000
> >           RX bytes:10040767 (9.5 MiB)  TX bytes:1962029 (1.8 MiB)
> >           Interrupt:5 Base address:0xe880
> > 
> > eth1:1    Link encap:Ethernet  HWaddr 00:B0:D0:41:0F:65
> >           inet addr:10.10.10.204  Bcast:10.10.255.255  Mask:255.255.0.0
> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >           Interrupt:5 Base address:0xe880
> > 
> > lo        Link encap:Local Loopback
> >           inet addr:127.0.0.1  Mask:255.0.0.0
> >           inet6 addr: ::1/128 Scope:Host
> >           UP LOOPBACK RUNNING  MTU:16436  Metric:1
> >           RX packets:17568 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:17568 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:0
> >           RX bytes:3692600 (3.5 MiB)  TX bytes:3692600 (3.5 MiB)
> > 
> > sit0      Link encap:IPv6-in-IPv4
> >           NOARP  MTU:1480  Metric:1
> >           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:0
> >           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
> > 
> > Kernel IP routing table
> > Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
> > 192.168.36.0    *               255.255.255.0   U     0      0        0 eth0
> > 10.74.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
> > 10.72.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
> > 10.75.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
> > 10.73.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
> > 10.10.0.0       *               255.255.0.0     U     0      0        0 eth1
> > 169.254.0.0     *               255.255.0.0     U     0      0        0 eth1
> > default         10.10.1.1       0.0.0.0         UG    0      0        0 eth1
> > 
> > cat /etc/hosts
> > 10.10.10.200    ctclinux1-svc
> > 192.168.36.200  ctclinux1-cls
> > 192.168.36.201  ctclinux2-cls
> > 10.10.10.201    ctclinux2-svc
> > 
> > Here's the network configuration on ctclinux2:
> > 
> > ifconfig -a
> > eth0      Link encap:Ethernet  HWaddr 00:01:03:D4:80:7C
> >           inet addr:192.168.36.201  Bcast:192.168.36.255  Mask:255.255.255.0
> >           inet6 addr: fe80::201:3ff:fed4:807c/64 Scope:Link
> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >           RX packets:7702 errors:0 dropped:0 overruns:1 frame:0
> >           TX packets:282 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:1000
> >           RX bytes:477769 (466.5 KiB)  TX bytes:22444 (21.9 KiB)
> >           Interrupt:10 Base address:0xec00
> > 
> > eth1      Link encap:Ethernet  HWaddr 00:B0:D0:41:0F:9B
> >           inet addr:10.10.10.201  Bcast:10.10.255.255  Mask:255.255.0.0
> >           inet6 addr: fe80::2b0:d0ff:fe41:f9b/64 Scope:Link
> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >           RX packets:53846 errors:0 dropped:0 overruns:1 frame:0
> >           TX packets:7759 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:1000
> >           RX bytes:5733713 (5.4 MiB)  TX bytes:1155588 (1.1 MiB)
> >           Interrupt:5 Base address:0xe880
> > 
> > lo        Link encap:Local Loopback
> >           inet addr:127.0.0.1  Mask:255.0.0.0
> >           inet6 addr: ::1/128 Scope:Host
> >           UP LOOPBACK RUNNING  MTU:16436  Metric:1
> >           RX packets:17912 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:17912 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:0
> >           RX bytes:3401868 (3.2 MiB)  TX bytes:3401868 (3.2 MiB)
> > 
> > sit0      Link encap:IPv6-in-IPv4
> >           NOARP  MTU:1480  Metric:1
> >           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:0
> >           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
> > 
> > route
> > Kernel IP routing table
> > Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
> > 192.168.36.0    *               255.255.255.0   U     0      0        0 eth0
> > 10.74.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
> > 10.72.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
> > 10.75.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
> > 10.73.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
> > 10.10.0.0       *               255.255.0.0     U     0      0        0 eth1
> > 169.254.0.0     *               255.255.0.0     U     0      0        0 eth1
> > default         10.10.1.1       0.0.0.0         UG    0      0        0 eth1
> > 
> > cat /etc/hosts
> > 10.10.10.201    ctclinux2-svc
> > 192.168.36.201  ctclinux2-cls
> > 192.168.36.200  ctclinux1-cls
> > 10.10.10.200    ctclinux1-svc
> > 
> > Here's the cluster configuration file:
> > 
> > <?xml version="1.0"?>
> > <cluster name="cl_tic" config_version="1">
> >         <cman>
> >         </cman>
> > 
> >         <clusternodes>
> >                 <clusternode name="ctclinux1-cls">
> >                         <fence>
> >                                 <method name="single">
> >                                         <device name="human" nodename="ctclinux1-cls"/>
> >                                 </method>
> >                         </fence>
> >                 </clusternode>
> > 
> >                 <clusternode name="ctclinux2-cls">
> >                         <fence>
> >                                 <method name="single">
> >                                         <device name="human" nodename="ctclinux2-cls"/>
> >                                 </method>
> >                         </fence>
> >                 </clusternode>
> >         </clusternodes>
> > 
> >         <fence_devices>
> >                 <fence_device name="human" agent="fence_manual"/>
> >         </fence_devices>
> > </cluster>
> > 
> > Here's the cluster information from ctclinux1 after the cluster is started and joined:
> > 
> > cman_tool -d join -w
> > nodename ctclinux1.clam.com not found
> > nodename ctclinux1 (truncated) not found
> > nodename ctclinux1 doesn't match ctclinux1-cls (ctclinux1-cls in cluster.conf)
> > nodename ctclinux1 doesn't match ctclinux2-cls (ctclinux2-cls in cluster.conf)
> > nodename localhost (if lo) not found
> > selected nodename ctclinux1-cls
> > setup up interface for address: ctclinux1-cls
> > Broadcast address for c824a8c0 is ff24a8c0
> > 
> > cman_tool status
> > Protocol version: 5.0.1
> > Config version: 1
> > Cluster name: cl_tic
> > Cluster ID: 6429
> > Cluster Member: Yes
> > Membership state: Cluster-Member
> > Nodes: 1
> > Expected_votes: 2
> > Total_votes: 1
> > Quorum: 2  Activity blocked
> > Active subsystems: 0
> > Node name: ctclinux1-cls
> > Node addresses: 192.168.36.200
> > 
> > cman_tool nodes
> > Node  Votes Exp Sts  Name
> >    1    1    2   M   ctclinux1-cls
> > 
> > Here's the cluster information from ctclinux2 after the cluster is started and joined:
> > 
> > cman_tool -d join -w
> > nodename ctclinux2.clam.com not found
> > nodename ctclinux2 (truncated) not found
> > nodename ctclinux2 doesn't match ctclinux1-cls (ctclinux1-cls in cluster.conf)
> > nodename ctclinux2 doesn't match ctclinux2-cls (ctclinux2-cls in cluster.conf)
> > nodename localhost (if lo) not found
> > selected nodename ctclinux2-cls
> > setup up interface for address: ctclinux2-cls
> > Broadcast address for c924a8c0 is ff24a8c0
> > 
> > cman_tool status
> > Protocol version: 5.0.1
> > Config version: 1
> > Cluster name: cl_tic
> > Cluster ID: 6429
> > Cluster Member: Yes
> > Membership state: Cluster-Member
> > Nodes: 1
> > Expected_votes: 2
> > Total_votes: 1
> > Quorum: 2  Activity blocked
> > Active subsystems: 0
> > Node name: ctclinux2-cls
> > Node addresses: 192.168.36.201
> > 
> > cman_tool nodes
> > Node  Votes Exp Sts  Name
> >    1    1    2   M   ctclinux2-cls
> > 
> > Let me know if there is more information that I need to provide.  As an aside, I've tried
> reducing
> > the quorum count with no difference in behavior and I've tried using multicast which fails on
> the
> > cman_tool join with an "Unknown Host" error.  I'm open to any other suggestions.
> > 
> > Thanks,
> > 
> > tims
> > 
> > 
> > 	
> > 		
> > __________________________________ 
> > Yahoo! Mail - PC Magazine Editors' Choice 2005 
> > http://mail.yahoo.com
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> -- 
> Mark Hlawatschek <hlawatschek at atix.de>
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 


__________________________________ 
Yahoo! Music Unlimited 
Access over 1 million songs. Try it free.
http://music.yahoo.com/unlimited/


From linux4dave at gmail.com  Thu Oct 20 16:05:02 2005
From: linux4dave at gmail.com (dave first)
Date: Thu, 20 Oct 2005 09:05:02 -0700
Subject: [Linux-cluster] Clustering Tutorial
Message-ID: <207649d0510200905t77ae28b2j7813921f16b1f8e1@mail.gmail.com>

Hey Guys,

I'm a unix geek going waaaay back, but I haven't been administering Linux.
I've taken a job where all the *nix systems are Linux (most RH). There are
two clusters. I know nada about clusters.

In my experience, when I jump into something w/o learning the basics, I'm
always on a learning curve. So, I need to learn basics about Linux
Clustering, including terminology - like what is "fencing?"

Are there any good online sources you could point me to? Reading this list,
I know I could understand a lot more if I had the terminology down...


dave
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051020/d711a5c2/attachment.htm>

From mwill at penguincomputing.com  Thu Oct 20 16:18:15 2005
From: mwill at penguincomputing.com (Michael Will)
Date: Thu, 20 Oct 2005 09:18:15 -0700
Subject: [Linux-cluster] Clustering Tutorial
In-Reply-To: <207649d0510200905t77ae28b2j7813921f16b1f8e1@mail.gmail.com>
References: <207649d0510200905t77ae28b2j7813921f16b1f8e1@mail.gmail.com>
Message-ID: <4357C347.8000506@jellyfish.highlyscyld.com>

http://www.phy.duke.edu/resources/computing/brahma/Resources/beowulf_book.php 
is a good start,
http://www.beowulf.org is another good place, it is also the home of the 
original beowulf mailinglist.

Generally I would recommend digging through recent mailinglist postings 
because
there are often very informed answers to questions.

Lon just answered a fencing question a few days ago:

"STONITH, STOMITH, etc. are indeed implementations of I/O fencing.

Fencing is the act of forcefully preventing a node from being able to
access resources after that node has been evicted from the cluster in an
attempt to avoid corruption.

The canonical example of when it is needed is the live-hang scenario, as
you described:

1. node A hangs with I/Os pending to a shared file system
2. node B and node C decide that node A is dead and recover resources
allocated on node A (including the shared file system)
3. node A resumes normal operation
4. node A completes I/Os to shared file system

At this point, the shared file system is probably corrupt.  If you're
lucky, fsck will fix it -- if you're not, you'll need to restore from
backup.  I/O fencing (STONITH, or whatever we want to call it) prevents
the last step (step 4) from happening.

How fencing is done (power cycling via external switch, SCSI
reservations, FC zoning, integrated methods like IPMI, iLO, manual
intervention, etc.) is unimportant - so long as whatever method is used
can guarantee that step 4 can not complete."

"GFS can use fabric-level fencing - that is, you can tell the iSCSI
server to cut a node off, or ask the fiber-channel switch to disable a
port.  This is in addition to "power-cycle" fencing."


Michael


From davegu1 at hotmail.com  Thu Oct 20 17:52:58 2005
From: davegu1 at hotmail.com (David Gutierrez)
Date: Thu, 20 Oct 2005 12:52:58 -0500
Subject: [Linux-cluster] Clustering Tutorial
In-Reply-To: <207649d0510200905t77ae28b2j7813921f16b1f8e1@mail.gmail.com>
Message-ID: <BAY101-F121CB3A0ED3E3154195C2BFA730@phx.gbl>

Dave,
There is lots of information on liinux out there on the net. Specially if 
you do a search on Linux Documentation, there is a website for that too.
http://www.tldp.net/index.html
>From there you can go to the cluster. But imagine a cluster in linux as a 
cluster in AIX, Solaris, HPUX or Tru64. '


David


From: dave first <linux4dave at gmail.com>
Reply-To: linux clustering <linux-cluster at redhat.com>
To: linux clustering <linux-cluster at redhat.com>
Subject: [Linux-cluster] Clustering Tutorial
Date: Thu, 20 Oct 2005 09:05:02 -0700
MIME-Version: 1.0
Received: from hormel.redhat.com ([209.132.177.30]) by mc3-f6.hotmail.com 
with Microsoft SMTPSVC(6.0.3790.211); Thu, 20 Oct 2005 09:05:11 -0700
Received: from listman.util.phx.redhat.com (listman.util.phx.redhat.com 
[10.8.4.110])by hormel.redhat.com (Postfix) with ESMTPid DB7217416F; Thu, 20 
Oct 2005 12:05:10 -0400 (EDT)
Received: from int-mx1.corp.redhat.com 
(int-mx1.corp.redhat.com[172.16.52.254])by listman.util.phx.redhat.com 
(8.13.1/8.13.1) with ESMTP idj9KG59gP001593 for 
<linux-cluster at listman.util.phx.redhat.com>;Thu, 20 Oct 2005 12:05:09 -0400
Received: from mx3.redhat.com (mx3.redhat.com [172.16.48.32])by 
int-mx1.corp.redhat.com (8.11.6/8.11.6) with ESMTP id j9KG58V04078for 
<linux-cluster at redhat.com>; Thu, 20 Oct 2005 12:05:08 -0400
Received: from zproxy.gmail.com (zproxy.gmail.com [64.233.162.195])by 
mx3.redhat.com (8.13.1/8.13.1) with ESMTP id j9KG52TL014034for 
<linux-cluster at redhat.com>; Thu, 20 Oct 2005 12:05:02 -0400
Received: by zproxy.gmail.com with SMTP id m7so209527nzffor 
<linux-cluster at redhat.com>; Thu, 20 Oct 2005 09:05:02 -0700 (PDT)
Received: by 10.36.34.1 with SMTP id h1mr2045048nzh;Thu, 20 Oct 2005 
09:05:02 -0700 (PDT)
Received: by 10.36.252.29 with HTTP; Thu, 20 Oct 2005 09:05:02 -0700 (PDT)
X-Message-Info: EoYTbT2lH2Oqgvb27XXVSOWVJbUxrl8bnGIDAa6+wOE=
DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=beta; 
d=gmail.com;h=received:message-id:date:from:to:subject:mime-version:content-type;b=ZPZj2fOsm/kHT5hJZwsd8+ZwVTTPxlaRoz6+cWYRlIRwFmNogdU1mwjJhks/u0pn0ckQ0uOWaC1UD9za59IRx23rAZd/t2mVR0wmrdCfQ+M2uc2N9ODFSGYk2hXFGN6T6aTqt+Rf/crlM6oaVUN06z1JT/MNiGsG2XfF4XaEKhI=
X-RedHat-Spam-Score: -2.739
X-loop: linux-cluster at redhat.com
X-BeenThere: linux-cluster at redhat.com
X-Mailman-Version: 2.1.5
Precedence: junk
List-Id: linux clustering <linux-cluster.redhat.com>
List-Unsubscribe: 
<https://www.redhat.com/mailman/listinfo/linux-cluster>,<mailto:linux-cluster-request at redhat.com?subject=unsubscribe>
List-Archive: <https://www.redhat.com/archives/linux-cluster>
List-Post: <mailto:linux-cluster at redhat.com>
List-Help: <mailto:linux-cluster-request at redhat.com?subject=help>
List-Subscribe: 
<https://www.redhat.com/mailman/listinfo/linux-cluster>,<mailto:linux-cluster-request at redhat.com?subject=subscribe>
Errors-To: linux-cluster-bounces at redhat.com
Return-Path: linux-cluster-bounces at redhat.com
X-OriginalArrivalTime: 20 Oct 2005 16:05:11.0431 (UTC) 
FILETIME=[0CA89970:01C5D590]

Hey Guys,

I'm a unix geek going waaaay back, but I haven't been administering Linux.
I've taken a job where all the *nix systems are Linux (most RH). There are
two clusters. I know nada about clusters.

In my experience, when I jump into something w/o learning the basics, I'm
always on a learning curve. So, I need to learn basics about Linux
Clustering, including terminology - like what is "fencing?"

Are there any good online sources you could point me to? Reading this list,
I know I could understand a lot more if I had the terminology down...


dave


--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


From tspauld98 at yahoo.com  Thu Oct 20 18:58:24 2005
From: tspauld98 at yahoo.com (Tim Spaulding)
Date: Thu, 20 Oct 2005 11:58:24 -0700 (PDT)
Subject: [Linux-cluster] Clustering Tutorial
In-Reply-To: <4357C347.8000506@jellyfish.highlyscyld.com>
Message-ID: <20051020185824.21929.qmail@web60525.mail.yahoo.com>

Just a note of caution, there's a big difference between High Availability Clustering and High
Performance Clustering.  AFAIK, Beowulf is an HPC technology.  RHCS (Red Hat Cluster Suite) and
GFS (Global File System) are HAC technologies.  Some of the underlying building blocks are used by
both communities but they are used for fundamentally difference purposes.

http://www.linux-ha.org is the home of another HAC, linux-based technology.  They have more
documentation on clustering and its concepts.  Red Hat does a good job on the HOW-TOs of getting a
cluster working but a terrible job of telling folks the WHY-TOs of clustering.

I'm currently working on a comparison of linux-ha and RHCS so if you have questions regarding HAC
on linux then fire away.  If you have a beowulf cluster, je ne comprends pas, sorry.

--tims

--- Michael Will <mwill at penguincomputing.com> wrote:

> http://www.phy.duke.edu/resources/computing/brahma/Resources/beowulf_book.php 
> is a good start,
> http://www.beowulf.org is another good place, it is also the home of the 
> original beowulf mailinglist.
> 
> Generally I would recommend digging through recent mailinglist postings 
> because
> there are often very informed answers to questions.
> 
> Lon just answered a fencing question a few days ago:
> 
> "STONITH, STOMITH, etc. are indeed implementations of I/O fencing.
> 
> Fencing is the act of forcefully preventing a node from being able to
> access resources after that node has been evicted from the cluster in an
> attempt to avoid corruption.
> 
> The canonical example of when it is needed is the live-hang scenario, as
> you described:
> 
> 1. node A hangs with I/Os pending to a shared file system
> 2. node B and node C decide that node A is dead and recover resources
> allocated on node A (including the shared file system)
> 3. node A resumes normal operation
> 4. node A completes I/Os to shared file system
> 
> At this point, the shared file system is probably corrupt.  If you're
> lucky, fsck will fix it -- if you're not, you'll need to restore from
> backup.  I/O fencing (STONITH, or whatever we want to call it) prevents
> the last step (step 4) from happening.
> 
> How fencing is done (power cycling via external switch, SCSI
> reservations, FC zoning, integrated methods like IPMI, iLO, manual
> intervention, etc.) is unimportant - so long as whatever method is used
> can guarantee that step 4 can not complete."
> 
> "GFS can use fabric-level fencing - that is, you can tell the iSCSI
> server to cut a node off, or ask the fiber-channel switch to disable a
> port.  This is in addition to "power-cycle" fencing."
> 
> 
> Michael
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 


__________________________________ 
Yahoo! Music Unlimited 
Access over 1 million songs. Try it free.
http://music.yahoo.com/unlimited/


From lhh at redhat.com  Thu Oct 20 19:01:51 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 20 Oct 2005 15:01:51 -0400
Subject: [Linux-cluster] mounting using label
In-Reply-To: <20051020001121.10195.qmail@web52102.mail.yahoo.com>
References: <20051020001121.10195.qmail@web52102.mail.yahoo.com>
Message-ID: <1129834911.17902.48.camel@ayanami.boston.redhat.com>

On Wed, 2005-10-19 at 17:11 -0700, Alexander Rau wrote:
> Hi:
> 
> I am trying to mount a file system on the SAN by using
> the label rather then the device.
> 
> When I specify "-L label" in either the device line or
> the Options line the cluster service fails to start.
> 
> Just wondering if anybody has successfully used labels
> to mount a file system as a cluster service...?

Put:

LABEL=label_name

...as the "device" name in the cluster configuration.

-- Lon


From dawson at fnal.gov  Fri Oct 21 13:18:28 2005
From: dawson at fnal.gov (Troy Dawson)
Date: Fri, 21 Oct 2005 08:18:28 -0500
Subject: [Linux-cluster] Re: write's pausing - which tools to debug?
In-Reply-To: <20051019104816.GD31027@neu.nirvana>
References: <4355049E.4060606@fnal.gov> <20051019104816.GD31027@neu.nirvana>
Message-ID: <4358EAA4.1080901@fnal.gov>

Axel Thimm wrote:
> Hi,
> 
> On Tue, Oct 18, 2005 at 09:20:14AM -0500, Troy Dawson wrote:
> 
>>We've been having some problems with doing a write's to our GFS file 
>>system, and it will pause, for long periods.  (Like from 5 to 10 
>>seconds, to 30 seconds, and occasially 5 minutes)  After the pause, it's 
>>like nothing happened, whatever the process is, just keeps going happy 
>>as can be.
>>Except for these pauses, our GFS is quite zippy, both reads and writes. 
>> But these pauses are holding us back from going full production.
>>I need to know what tools I should use to figure out what is causing 
>>these pauses.
>>
>>Here is the setup.
>>All machines: RHEL 4 update 1 (ok, actually S.L. 4.1), kernel 
>>2.6.9-11.ELsmp, GFS 6.1.0, ccs 1.0.0, gulm 1.0.0, rgmanager 1.9.34
>>
>>I have no ability to do fencing yet, so I chose to use the gulm locking 
>>mechanism.  I have it setup so that there are 3 lock servers, for 
>>failover.  I have tested the failover, and it works quite well.
> 
> 
> If this is a testing environment use manual fencing. E.g. if a node
> needs to get fenced you get a log message saying that you should do
> that and acknowledge that.
> 
> 
>>I have 5 machines in the cluster.  1 isn't connected to the SAN, or 
>>using GFS.  It is just a failover gulm lock server incase the other two 
>>lock servers go down.
>>
>>So I have 4 machines connected to our SAN and using GFS.  3 are 
>>read-only, 1 is read-write.  If it is important, the 3 read-only are 
>>x86_64, the 1 read-write and the 1 not connected are i386.
>>
>>The read/write machine is our master lock server.  Then one of the 
>>read-only is a fallback lock server, as is the machine not using GFS.
>>
>>Anyway, we're getting these pauses when writting, and I'm having a hard 
>>time tracking down where the problem is.  I *think* that we can still 
>>read from the other machines.  But since this comes and goes, I haven't 
>>been able to verify that.
> 
> 
> What SAN hardware is attached to the nodes?
> 
> 
 From the switch on down, I don't know.  It's a centrally managed SAN, 
that I have been allowed to plug into and given disk space.  I do have 
Qlogic cards in the machines.

>>Anyway, which tools do you think would be best in diagnosing this?
> 
> 
> I'd suggest to check/monitor networking. Also place the cluster
> communication on a separate network that the SAN/LAN network. The
> cluster heartbeat goes over UDP and a congested network may delay
> these packages or drop the completely. At least that's the CMAN
> picture, lock_gulm may be different.
> 

That sounds like a good idea.  All of our machines have two ethernet 
ports, and I'm not using the second one on any of them.  That would 
actually fix some other problems as well.

> Also don't mix RHELU1 and U2 or FC<N>. Just in case you'd like to
> upgrade to SL4.2 one by one.
> 

Yup, read that, but thanks for the reminder.

> There have been many changes/bug fixes to the cluster bits in RHELU2,
> and there are also some new spiffy features like multipath. Perhaps
> it's worth rebasing your testing environment?
> 

Don't I wish it was a testing enviroment.  But at least the machines 
don't HAVE to be 24x7.  And I've only got one of them in production 
right now, so it's only one going down.

Troy
-- 
__________________________________________________
Troy Dawson  dawson at fnal.gov  (630)840-6468
Fermilab  ComputingDivision/CSS  CSI Group
__________________________________________________


From linux4dave at gmail.com  Sat Oct 22 02:18:36 2005
From: linux4dave at gmail.com (dave first)
Date: Fri, 21 Oct 2005 19:18:36 -0700
Subject: [Linux-cluster] Clustering Tutorial
In-Reply-To: <20051020185824.21929.qmail@web60525.mail.yahoo.com>
References: <4357C347.8000506@jellyfish.highlyscyld.com>
	<20051020185824.21929.qmail@web60525.mail.yahoo.com>
Message-ID: <207649d0510211918i29dfe228k3b29befcc6f35f48@mail.gmail.com>

Thanks. I should have mentioned that we're doing high performance
clustering, and not HA. We have a beowulf cluster (old and decrepid) and an
OSCAR cluster. None of our current clusters are RH, but that will probably
change once we get our next 4-opteron cpu/box cluster...

Yeehaw!

And a Big Thanks to everyone who responded. I now have some good resources.
A lot of reading... yaaaawn ! heh-heh.

dave

On 10/20/05, Tim Spaulding <tspauld98 at yahoo.com> wrote:
>
> Just a note of caution, there's a big difference between High Availability
> Clustering and High
> Performance Clustering. AFAIK, Beowulf is an HPC technology. RHCS (Red Hat
> Cluster Suite) and
> GFS (Global File System) are HAC technologies. Some of the underlying
> building blocks are used by
> both communities but they are used for fundamentally difference purposes.
>
> http://www.linux-ha.org is the home of another HAC, linux-based
> technology. They have more
> documentation on clustering and its concepts. Red Hat does a good job on
> the HOW-TOs of getting a
> cluster working but a terrible job of telling folks the WHY-TOs of
> clustering.
>
> I'm currently working on a comparison of linux-ha and RHCS so if you have
> questions regarding HAC
> on linux then fire away. If you have a beowulf cluster, je ne comprends
> pas, sorry.
>
> --tims
>
> --- Michael Will <mwill at penguincomputing.com> wrote:
>
> >
> http://www.phy.duke.edu/resources/computing/brahma/Resources/beowulf_book.php
> > is a good start,
> > http://www.beowulf.org is another good place, it is also the home of the
> > original beowulf mailinglist.
> >
> > Generally I would recommend digging through recent mailinglist postings
> > because
> > there are often very informed answers to questions.
> >
> > Lon just answered a fencing question a few days ago:
> >
> > "STONITH, STOMITH, etc. are indeed implementations of I/O fencing.
> >
> > Fencing is the act of forcefully preventing a node from being able to
> > access resources after that node has been evicted from the cluster in an
> > attempt to avoid corruption.
> >
> > The canonical example of when it is needed is the live-hang scenario, as
> > you described:
> >
> > 1. node A hangs with I/Os pending to a shared file system
> > 2. node B and node C decide that node A is dead and recover resources
> > allocated on node A (including the shared file system)
> > 3. node A resumes normal operation
> > 4. node A completes I/Os to shared file system
> >
> > At this point, the shared file system is probably corrupt. If you're
> > lucky, fsck will fix it -- if you're not, you'll need to restore from
> > backup. I/O fencing (STONITH, or whatever we want to call it) prevents
> > the last step (step 4) from happening.
> >
> > How fencing is done (power cycling via external switch, SCSI
> > reservations, FC zoning, integrated methods like IPMI, iLO, manual
> > intervention, etc.) is unimportant - so long as whatever method is used
> > can guarantee that step 4 can not complete."
> >
> > "GFS can use fabric-level fencing - that is, you can tell the iSCSI
> > server to cut a node off, or ask the fiber-channel switch to disable a
> > port. This is in addition to "power-cycle" fencing."
> >
> >
> > Michael
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> >
>
>
>
>
> __________________________________
> Yahoo! Music Unlimited
> Access over 1 million songs. Try it free.
> http://music.yahoo.com/unlimited/
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051021/22de32ed/attachment.htm>

From Bill.Scherer at VerizonWireless.com  Mon Oct 24 14:56:14 2005
From: Bill.Scherer at VerizonWireless.com (Bill Scherer)
Date: Mon, 24 Oct 2005 10:56:14 -0400
Subject: [Linux-cluster] ssh, ldap, and nfs
Message-ID: <435CF60E.1050001@VerizonWireless.com>


Sorry if this is a bit off-topic, but does anyone have any idea how to 
get ssh to accept public-key authorization for accounts that exist only 
in ldap land and whose home folders are nfs mounted?

It should work, right?


From vmoravek at atlas.cz  Mon Oct 24 23:00:26 2005
From: vmoravek at atlas.cz (vmoravek at atlas.cz)
Date: Tue, 25 Oct 2005 01:00:26 +0200
Subject: [Linux-cluster] gfs dlm - SAMBA problem(lock??)
Message-ID: <c824611b71c8413ba9710d2dd95045e4@atlas.cz>

Hi all,
I having use gfs for samba cluster.GFS works fine but have big trouble with this situation.

When 6 computers want downloadind same file od directory everything is ok. 
But same situation but 8 computers trafic is rappidly going down and I have some strange messages about oplocks (maybe) in smb.log.

Have any idea what is wrong?

Best Regard

Vojta 


From sommere+linux-cluster at gac.edu  Tue Oct 25 00:48:17 2005
From: sommere+linux-cluster at gac.edu (Ethan Sommer)
Date: Mon, 24 Oct 2005 19:48:17 -0500
Subject: [Linux-cluster] Occasional kernel panics
Message-ID: <435D80D1.70105@gac.edu>

Every few days or so our cluster machines seem to have kernel panics
comp laing about GFS locking (although its pretty irregular, we went for
a few weeks without an outage)

We noticed that this happened a LOT, and it was reproducible when
certain users accessed files, when we were serving afp off the cluster.
We have changed things since then so that afp is run on a server which
nfs mounts the cluster.

We are running FC4 with the gfs modules from yum.


Here is our most recent kernel panics, followed by one from when we had
afp running on the cluster: (it looks like there is relevant info above
the cut-here, possibly if it might be helpful)


Oct 19 14:44:41 meow kernel: ------------[ cut here ]------------
Oct 19 14:44:41 meow kernel: kernel BUG at
/usr/src/build/607755-i686/BUILD/smp/src/lockqueue.c:1144!
Oct 19 14:44:41 meow kernel: invalid operand: 0000 [#1]
Oct 19 14:44:41 meow kernel: SMP
Oct 19 14:44:41 meow kernel: Modules linked in: nfsd exportfs lockd
autofs4 lock_dlm(U) gfs(U) lock_harness(U) rfcomm l2cap bluetooth dlm(U)
cman(U) md5 ip
v6 sunrpc ipt_LOG ipt_limit ipt_state ip_conntrack iptable_filter
ip_tables video button battery ac uhci_hcd ehci_hcd hw_random i2c_i801
i2c_core shpchp e1000 floppy ext3 jbd raid1 dm_mod qla2200 qla2xxx
scsi_transport_fc ata_piix libata sd_mod scsi_mod
Oct 19 14:44:41 meow kernel: CPU:    1
Oct 19 14:44:41 meow kernel: EIP:    0060:[<f8af9dcf>]    Not tainted VLI
Oct 19 14:44:41 meow kernel: EFLAGS: 00010292   (2.6.12-1.1447_FC4smp)
Oct 19 14:44:41 meow kernel: EIP is at
process_cluster_request+0xddb/0xdef [dlm]
Oct 19 14:44:41 meow kernel: eax: 00000004   ebx: 00000000   ecx:
c035fa4c   edx: 00000286
Oct 19 14:44:41 meow kernel: esi: f7fb8400   edi: 00000000   ebp:
d2988000   esp: f7eefe24
Oct 19 14:44:41 meow kernel: ds: 007b   es: 007b   ss: 0068
Oct 19 14:44:41 meow kernel: Process dlm_recvd (pid: 2402,
threadinfo=f7eef000 task=f7851020)
Oct 19 14:44:41 meow kernel: Stack: f8b0621b 00000001 f8b071e0 f8b06217
2583f987 00000001 00000040 00004000
Oct 19 14:44:41 meow kernel:        f7eefe48 00000000 c038e1a0 00000a58
f0167b00 c02a26c1 00000a58 00004040
Oct 19 14:44:41 meow kernel:        00000072 f7eefed4 00000000 00000001
00000246 00000000 edd6eeb8 00000000
Oct 19 14:44:41 meow kernel: Call Trace:
Oct 19 14:44:41 meow kernel:  [<c02a26c1>] sock_recvmsg+0x103/0x11e
Oct 19 14:44:41 meow kernel:  [<f8afd46b>]
midcomms_process_incoming_buffer+0x13b/0x25f [dlm]
Oct 19 14:44:41 meow kernel:  [<c011ce54>] load_balance_newidle+0x23/0x82
Oct 19 14:44:41 meow kernel:  [<f8afb3d3>] receive_from_sock+0x196/0x2c9
[dlm]
Oct 19 14:44:41 meow kernel:  [<c0307705>] schedule+0x405/0xc5e
Oct 19 14:44:41 meow kernel:  [<c0307731>] schedule+0x431/0xc5e
Oct 19 14:44:41 meow kernel:  [<f8afc457>] dlm_recvd+0x0/0x9c [dlm]
Oct 19 14:44:41 meow kernel:  [<f8afc2d3>] process_sockets+0x75/0xb7 [dlm]
Oct 19 14:44:41 meow kernel:  [<f8afc4c7>] dlm_recvd+0x70/0x9c [dlm]
Oct 19 14:44:41 meow kernel:  [<c0134c09>] kthread+0x93/0x97
Oct 19 14:44:41 meow kernel:  [<c0134b76>] kthread+0x0/0x97
Oct 19 14:44:41 meow kernel:  [<c01023d1>] kernel_thread_helper+0x5/0xb
Oct 19 14:44:41 meow kernel: Code: 4f 82 62 c7 89 e8 e8 b1 b4 00 00 8b
4c 24 14 89 4c 24 04 c7 04 24 6d 63 b0 f8 e8 34 82 62 c7 c7 04 24 1b 62
b0 f8 e8 28
82 62 c7 <0f> 0b 78 04 e0 71 b0 f8 c7 04 24 70 72 b0 f8 e8 40 78 62 c7 57
Oct 19 14:44:41 meow kernel:  <0>Fatal exception: panic in 5 seconds


Panic 2:

Oct 10 09:58:39 woof kernel: ------------[ cut here ]------------
Oct 10 09:58:39 woof kernel: kernel BUG at
/usr/src/build/607778-i686/BUILD/smp/src/dlm/lock.c:411!
Oct 10 09:58:39 woof kernel: invalid operand: 0000 [#1]
Oct 10 09:58:39 woof kernel: SMP
Oct 10 09:58:39 woof kernel: Modules linked in: nfsd exportfs lockd
autofs4 lock_dlm(U) gfs(U) lock_harness(U) rfcomm l2cap bluetooth dlm(U)
cman(U) md5 ip
v6 sunrpc ipt_LOG ipt_limit ipt_state ip_conntrack iptable_filter
ip_tables video button battery ac uhci_hcd ehci_hcd hw_random i2c_i801
i2c_core shpchp e1
000 dm_snapshot dm_zero dm_mirror ext3 jbd raid1 dm_mod qla2200 qla2xxx
scsi_transport_fc ata_piix libata sd_mod scsi_mod
Oct 10 09:58:39 woof kernel: CPU:    1
Oct 10 09:58:39 woof kernel: EIP:    0060:[<f8b98bf5>]    Not tainted VLI
Oct 10 09:58:39 woof kernel: EFLAGS: 00010292   (2.6.12-1.1447_FC4smp)
Oct 10 09:58:39 woof kernel: EIP is at do_dlm_lock+0x1b7/0x21d [lock_dlm]
Oct 10 09:58:39 woof kernel: eax: 00000004   ebx: 00000000   ecx:
c035fa4c   edx: 00000292
Oct 10 09:58:39 woof kernel: esi: f7848140   edi: ffffffea   ebp:
00000003   esp: c74b3cfc
Oct 10 09:58:39 woof kernel: ds: 007b   es: 007b   ss: 0068
Oct 10 09:58:39 woof kernel: Process imapd (pid: 24278,
threadinfo=c74b3000 task=f4721a80)
Oct 10 09:58:39 woof kernel: Stack: f8b9de75 f7848140 00000003 1bbe0000
00000000 ffffffea 00000003 00000005
Oct 10 09:58:39 woof kernel:        0000000d 00000005 00000000 f58c0a00
00000001 0000000d 20200000 20202020
Oct 10 09:58:39 woof kernel:        20203320 20202020 62312020 30306562
00183030 c8fb2f00 00000001 00000001
Oct 10 09:58:39 woof kernel: Call Trace:
Oct 10 09:58:39 woof kernel:  [<f8b98cff>] lm_dlm_lock+0x52/0x5e [lock_dlm]
Oct 10 09:58:39 woof kernel:  [<f8b98cad>] lm_dlm_lock+0x0/0x5e [lock_dlm]
Oct 10 09:58:39 woof kernel:  [<f8bd000c>] gfs_lm_lock+0x3d/0x5c [gfs]
Oct 10 09:58:39 woof kernel:  [<f8bc5039>] gfs_glock_xmote_th+0xae/0x1d3
[gfs]
Oct 10 09:58:39 woof kernel:  [<f8bc463c>] rq_promote+0x126/0x150 [gfs]
Oct 10 09:58:39 woof kernel:  [<f8bc4840>] run_queue+0xee/0x113 [gfs]
Oct 10 09:58:39 woof kernel:  [<f8bc5af1>] gfs_glock_nq+0x93/0x144 [gfs]
Oct 10 09:58:39 woof kernel:  [<f8bc619d>] gfs_glock_nq_init+0x18/0x2d [gfs]
Oct 10 09:58:39 woof kernel:  [<f8be3926>] get_local_rgrp+0xca/0x1b0 [gfs]
Oct 10 09:58:39 woof kernel:  [<f8be3a9c>]
gfs_inplace_reserve_i+0x90/0xd0 [gfs]
Oct 10 09:58:39 woof kernel:  [<f8be046b>] gfs_quota_lock_m+0xbf/0x117 [gfs]
Oct 10 09:58:39 woof kernel:  [<f8bd8a2b>] do_do_write_buf+0x3a1/0x485 [gfs]
Oct 10 09:58:39 woof kernel:  [<f8bc56a1>]
glock_wait_internal+0x16b/0x26a [gfs]
Oct 10 09:58:39 woof kernel:  [<f8bd8c91>] do_write_buf+0x182/0x1b6 [gfs]
Oct 10 09:58:39 woof kernel:  [<f8bd7be5>] walk_vm+0xb3/0x111 [gfs]
Oct 10 09:58:39 woof kernel:  [<f8bd8d65>] gfs_write+0xa0/0xc2 [gfs]
Oct 10 09:58:39 woof kernel:  [<f8bd8b0f>] do_write_buf+0x0/0x1b6 [gfs]
Oct 10 09:58:39 woof kernel:  [<f8bd8cc5>] gfs_write+0x0/0xc2 [gfs]
Oct 10 09:58:39 woof kernel:  [<c0162987>] vfs_write+0x9e/0x110
Oct 10 09:58:39 woof kernel:  [<c0162aa4>] sys_write+0x41/0x6a
Oct 10 09:58:39 woof kernel:  [<c0104035>] syscall_call+0x7/0xb
Oct 10 09:58:39 woof kernel: Code: 7c 24 14 89 4c 24 0c 89 5c 24 10 89
6c 24 08 89 74 24 04 c7 04 24 28 e6 b9 f8 e8 0e 94 58 c7 c7 04 24 75 de
b9 f8 e8 02
94 58 c7 <0f> 0b 9b 01 a0 e4 b9 f8 c7 04 24 3c e5 b9 f8 e8 1a 8a 58 c7 66
Oct 10 09:58:39 woof kernel:  <0>Fatal exception: panic in 5 seconds


Sep  7 15:37:44 meow kernel: ------------[ cut here ]------------
Sep  7 15:37:44 meow kernel: kernel BUG at
/usr/src/build/588748-i686/BUILD/smp/src/dlm/plock.c:500!
Sep  7 15:37:44 meow kernel: invalid operand: 0000 [#1]
Sep  7 15:37:44 meow kernel: SMP
Sep  7 15:37:44 meow kernel: Modules linked in: appletalk nfsd exportfs
lockd autofs4 lock_dlm(U) gfs(U) lock_harness(U) rfcomm l2cap bluetooth
dlm(U) cman
(U) sunrpc md5 ipv6 ipt_LOG ipt_limit ipt_state ip_conntrack
iptable_filter ip_tables video button battery ac uhci_hcd ehci_hcd
hw_random i2c_i801 i2c_core
shpchp e1000 floppy ext3 jbd raid1 dm_mod qla2200 qla2xxx
scsi_transport_fc ata_piix libata sd_mod scsi_mod
Sep  7 15:37:44 meow kernel: CPU:    3
Sep  7 15:37:44 meow kernel: EIP:    0060:[<f8b9a3f7>]    Tainted:
GF     VLI
Sep  7 15:37:44 meow kernel: EFLAGS: 00010292   (2.6.12-1.1398_FC4smp)
Sep  7 15:37:44 meow kernel: EIP is at update_lock+0x87/0x9b [lock_dlm]
Sep  7 15:37:44 meow kernel: eax: 00000004   ebx: fffffff5   ecx:
c035ca4c   edx: 00000282
Sep  7 15:37:44 meow kernel: esi: 00000000   edi: e99c2c00   ebp:
00000000   esp: d05dedb4
Sep  7 15:37:44 meow kernel: ds: 007b   es: 007b   ss: 0068
Sep  7 15:37:44 meow kernel: Process afpd (pid: 3872,
threadinfo=d05de000 task=d6447550)
Sep  7 15:37:44 meow kernel: Stack: badc0ded f8b9d0d6 fffffff5 f8b9da70
f8b9d101 06609291 f7943000 00000000
Sep  7 15:37:44 meow kernel:        f8b9a499 7ffffff8 00000000 7ffffff8
00000000 d05dede8 d7636700 7ffffff8
Sep  7 15:37:44 meow kernel:        00000000 d05deea8 d05dee28 f8b9a987
00000001 7ffffff8 00000000 7ffffff8
Sep  7 15:37:44 meow kernel: Call Trace:
Sep  7 15:37:44 meow kernel:  [<f8b9a499>] add_lock+0x8e/0xed [lock_dlm]
Sep  7 15:37:44 meow kernel:  [<f8b9a987>] fill_gaps+0x87/0x10e [lock_dlm]
Sep  7 15:37:44 meow kernel:  [<f8b9aa51>] lock_case3+0x43/0xac [lock_dlm]
Sep  7 15:37:44 meow kernel:  [<f8b9aeac>] plock_internal+0x1aa/0x370
[lock_dlm]
Sep  7 15:37:44 meow kernel:  [<f8b9b614>] lm_dlm_plock+0x25b/0x2dc
[lock_dlm]
Sep  7 15:37:44 meow kernel:  [<f8b9b3b9>] lm_dlm_plock+0x0/0x2dc [lock_dlm]
Sep  7 15:37:44 meow kernel:  [<f8bdc1c3>] gfs_lm_plock+0x45/0x57 [gfs]
Sep  7 15:37:44 meow kernel:  [<f8be5731>] gfs_lock+0xcd/0x11c [gfs]
Sep  7 15:37:44 meow kernel:  [<f8be5664>] gfs_lock+0x0/0x11c [gfs]
Sep  7 15:37:44 meow kernel:  [<c0176c4f>] fcntl_setlk64+0x16c/0x26a
Sep  7 15:37:44 meow kernel:  [<c0162e93>] fget+0x3b/0x42
Sep  7 15:37:44 meow kernel:  [<c0172bfd>] sys_fcntl64+0x55/0x97
Sep  7 15:37:44 meow kernel:  [<c0104025>] syscall_call+0x7/0xb
Sep  7 15:37:44 meow kernel: Code: 01 00 00 c7 04 24 a8 da b9 f8 e8 7c
77 58 c7 89 5c 24 04 c7 04 24 08 d1 b9 f8 e8 6c 77 58 c7 c7 04 24 d6 d0
b9 f8 e8 60
77 58 c7 <0f> 0b f4 01 70 da b9 f8 c7 04 24 10 db b9 f8 e8 78 6d 58 c7 55
Sep  7 15:37:44 meow kernel:  <0>Fatal exception: panic in 5 seconds


Thanks for any help,
  Ethan


From tpcollier at liberty.edu  Tue Oct 25 19:31:06 2005
From: tpcollier at liberty.edu (Collier, Tirus (SA))
Date: Tue, 25 Oct 2005 15:31:06 -0400
Subject: [Linux-cluster] EMCPower Errors
Message-ID: <B3BD22E8DFE12B4D8C2FB4AD645600B24E34C6@doc.University.liberty.edu>

Good Day,
 
Request to know if anyone experienced the following errors on there cluster. I'm running a 3 node cluster with following:
 
1.) 3 PE1850s
2.) CX700 storage
3.) Kernel 2.4.21-32.0.1.ELsmp #1
 
pool_tool -s | grep error | more
  /dev/emcpoweraa                                  <- error ->
  /dev/emcpoweraa1                                 <- error ->
  /dev/emcpoweraa10                                <- error ->
  /dev/emcpoweraa11                                <- error ->
  /dev/emcpoweraa12                                <- error ->
  /dev/emcpoweraa13                                <- error ->
  /dev/emcpoweraa14                                <- error ->
  /dev/emcpoweraa15                                <- error ->
  /dev/emcpoweraa2                                 <- error ->
  /dev/emcpoweraa3                                 <- error ->
  /dev/emcpoweraa4                                 <- error ->
  /dev/emcpoweraa5                                 <- error ->
  /dev/emcpoweraa6                                 <- error ->
  /dev/emcpoweraa7                                 <- error ->
  /dev/emcpoweraa8                                 <- error ->
  /dev/emcpoweraa9                                 <- error ->
  /dev/emcpowerab                                  <- error ->
  /dev/emcpowerab1                                 <- error ->
  /dev/emcpowerab10                                <- error ->
  /dev/emcpowerab11                                <- error ->

 
Stonewall T. Collier


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051025/39bffcb6/attachment.htm>

From teigland at redhat.com  Tue Oct 25 20:00:12 2005
From: teigland at redhat.com (David Teigland)
Date: Tue, 25 Oct 2005 15:00:12 -0500
Subject: [Linux-cluster] Occasional kernel panics
In-Reply-To: <435D80D1.70105@gac.edu>
References: <435D80D1.70105@gac.edu>
Message-ID: <20051025200012.GA15854@redhat.com>

On Mon, Oct 24, 2005 at 07:48:17PM -0500, Ethan Sommer wrote:

> Oct 19 14:44:41 meow kernel: kernel BUG at
> /usr/src/build/607755-i686/BUILD/smp/src/lockqueue.c:1144!

> Oct 10 09:58:39 woof kernel: kernel BUG at
> /usr/src/build/607778-i686/BUILD/smp/src/dlm/lock.c:411!

> Sep  7 15:37:44 meow kernel: kernel BUG at
> /usr/src/build/588748-i686/BUILD/smp/src/dlm/plock.c:500!

I don't have any quick explanation for the first two.  It's clear from the
third that the afpd application is doing some serious posix locking where
there's ample room for bugs.  We'll take a look, thanks for the info.
Dave


From mwill at penguincomputing.com  Thu Oct 27 00:47:13 2005
From: mwill at penguincomputing.com (Michael Will)
Date: Wed, 26 Oct 2005 17:47:13 -0700
Subject: Antw: [Linux-cluster] Oracle 10G-R2 on GFS install problems
In-Reply-To: <s34f644e.099@sdm-cl2.schober.de>
References: <s34f644e.099@sdm-cl2.schober.de>
Message-ID: <43602391.3020708@penguincomputing.com>

<babelfish>
OCFS2 has not been released for oracle products, even thought is from 
oracle themselves. GFS has  been certified for the 9.2-series, and if 
you can use RHEL3 or RHEL4 depends on if you need 32 or 64bit support. 
Oracle has strict guidelines what they support in which version on which 
OS again in which version, and on which hardware platform. Usually there 
is a good reason for exclusion of choices. It might work on other 
versions and OS's not listed, but it won't be supported then.
</babelfish>

Stefan, this mailinglist has an international audience and all postings 
are in english ;-)

Stefan Marx wrote:

>Hi Marvin,
>
>also OCFS2 ist f?r Oracle Produkte noch nicht freigegeben, auch wenn es von Oracle selebr stammt. GFS ist f?r die 9.2er Serie zertifiziert, wobei man schauen mu?, ob man RHEL3 oder RHEL4 verwenden kann, je nachdem, ob man 32-Bit oder 64-Bit Support ben?tigt. Da gibt es ziemlich klare Aussagen bei Oracle, welche Produkte mit welcher Version auf welchem Betriebssystem mit welcher Version und zus?tzlich auf welcher Hardwareplattform supported sind. Und das hat auch meistens seinen guten Grund :-(. Klar gehen die Sachen auch auf anderen Betriebssystemen, solange die entsprechenden Libraries und Kernelversionen passen, sind dann aber halt nicht supported.
>
>Ciao, Stefan
> 
> 
>  
>
>>>>spwilcox at att.com 10/13/05 8:26 pm >>> 
>>>>        
>>>>
>In the process of installing Oracle 10G-R2 on a RHEL4-U2 x86_64 cluster 
>with GFS 6.1.2, I get the following error when running Oracle's root.sh 
>for cluster ready services (a.k.a clusterware): 
> 
>[  OCROSD][4142143168]utstoragetype: /u00/app/ocr0 is on FS type 
>18225520. Not supported. 
> 
>I did a little poking around and found that OCFS2 has the same issue, 
>but with OCFS2 it can be circumvented by mounting with -o datavolume... 
>I was unable to find any similar options for GFS mounts.  This looks 
>like probably more of an Oracle bug, as 10G-R1 installed without any 
>problems (I have my DBA pursuing the Oracle route), but I was wondering 
>if anyone else has come across this problem and if so, was there any 
>fix? 
> 
>Thanks, 
>-steve 
> 
>  
>


-- 
Michael Will
Penguin Computing Corp.
Sales Engineer
415-954-2822
415-954-2899 fx
mwill at penguincomputing.com 


From erwan at seanodes.com  Thu Oct 27 12:12:27 2005
From: erwan at seanodes.com (Velu Erwan)
Date: Thu, 27 Oct 2005 14:12:27 +0200
Subject: [Linux-cluster] cluster-1.01.00
In-Reply-To: <20051019151045.GA3975@redhat.com>
References: <20051019151045.GA3975@redhat.com>
Message-ID: <4360C42B.1050107@seanodes.com>

David Teigland a ?crit :
>A new source tarball from the STABLE branch has been released; it builds
>and runs on 2.6.13:
>  

I've been working on making a rpm of this tarball.
Now I have one main rpm which contains all the usuals binaries, one for
the librairies, one for the devel, one for the kernel module.
For the kernel modules, I choose to create a dkms rpm. This allow to not
provide a binary kernel module but a rpm which rebuilt the modules on
the target host.
This is very usefull, i.e dkms is just able to rebuild automatically the
gfs module if you reboot on another kernel. It doesn't needs any
user/admin helps.
This makes our lifes easiest. ;o)
You can find my specfile and the SRPMS at
http://www.seanodes.com/~erwan/SRPMS
This rpms have been tested successfully on the withebox 4 and mandriva
2006 & cooker.
The SRPMS is now included in mandriva repository, an "urpmi" is enough
to have a runnable gfs ;o)
This could be cool to integrate the specfile to the cvs tree.

Making a rpm of this tarball shows me several troubles:

1- The configure architecture doesn't allow to choose all options
I mean the main configure is calling a set of sub configure with the
same options to all of them.
Some "sub configure" accept additionnal options, like "--plugindir=" for
magma.
If I call the main configure with --plugindir it fails because the other
sub configure doesn't implement "--plugindir.
It could be better to ignore un-implemented options which prevent failures.

2-make dependencies
In my case, I'd like to separate the binary and the kernel module build.
This could allow me to build the rpm by just compiling the binary and
then provides the dkms rpm where the kernel modules are located. This
provides a faster build process.
Today, it seems we must make the kernel modules before the binaires else
you can't build the binairies.
Instead of having a "all:" target, it could be cool to have a
"binaries:" and a "kernel-modules:" targets.

Today, if my rpm building machine doesn't have a kernel source where the
make has been done I have
from /home/nis/guibo/rpm/BUILD/cluster-1.01.00/cman-kernel/src/cnxman.c:15:
    include/linux/config.h:4:28: error: linux/autoconf.h: No such file
or directory

3-soname troubles
I'm not an expert about this part but some binaires are linked with the
.so library whereas it should be with the .so.x.

ccsd is a good example:
[root at max4 ~]# ldd /sbin/ccsd
    libxml2.so.2 => /usr/lib64/libxml2.so.2 (0x00002aaaaabc1000)
    libz.so.1 => /lib64/libz.so.1 (0x00002aaaaadd1000)
    libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x00002aaaaaee6000)
    libm.so.6 => /lib64/tls/libm.so.6 (0x00002aaaaaffb000)
    libmagma.so => /usr/lib64/libmagma.so (0x00002aaaab153000)
    libmagmamsg.so => /usr/lib64/libmagmamsg.so (0x00002aaaab25b000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00002aaaab35f000)
    libc.so.6 => /lib64/tls/libc.so.6 (0x00002aaaab462000)
    /lib64/ld-linux-x86-64.so.2 (0x00002aaaaaaab000)

I was told that binaries must be linked with .so.x because .so are only
for development. In the mandriva rpm's policy .so must be in
lib%name-devel rpms and .so.x in lib%name rpms. This error make it
impossible.

I don't know if its related or not but 3 librairies doesn't provides
enought informations to help rpm finding what they provides.
I've add a workaround to my spec file by defining
%ifarch x86_64
Provides:       libmagma.so()(64bit) libmagmamsg.so()(64bit)
libmagma_nt.so()(64bit)
%endif

-
Erwan Velu


From teigland at redhat.com  Thu Oct 27 19:02:28 2005
From: teigland at redhat.com (David Teigland)
Date: Thu, 27 Oct 2005 14:02:28 -0500
Subject: [Linux-cluster] cluster-1.01.00
In-Reply-To: <4360C42B.1050107@seanodes.com>
References: <20051019151045.GA3975@redhat.com> <4360C42B.1050107@seanodes.com>
Message-ID: <20051027190228.GC9710@redhat.com>

On Thu, Oct 27, 2005 at 02:12:27PM +0200, Velu Erwan wrote:
> David Teigland a ?crit :

> Making a rpm of this tarball shows me several troubles:

These all sound like good suggestions.  We'd be happy to get any patches
you have to fix some of them, otherwise it may be some time until someone
gets around to working on it.

Thanks,
Dave


From philip.r.dana at nwp01.usace.army.mil  Thu Oct 27 19:08:03 2005
From: philip.r.dana at nwp01.usace.army.mil (Philip R. Dana)
Date: Thu, 27 Oct 2005 12:08:03 -0700
Subject: [Linux-cluster] Service/Resource group help needed
Message-ID: <1130440083.2950.25.camel@nwp-wk-79033-l>

I am setting up a two node active/passive cluster to provide DHCP/DNS
services, using CentOS 4 U2 and RHCS4. The rpms were compiled from srpms
using the info provided by Sean Gray (thanks, Sean). The shared storage
is on a NetApp filer using ISCSI. I think I've missed something
somewhere. The output from clustat and clusvcadm:

[root at ns1-node1 ~]# clustat
Member Status: Quorate

Not a member of the Resource Manager service group.
Resource Group information unavailable; showing all cluster members.

  Member Name                              State      ID
  ------ ----                              -----      --
  ns1-node2.mydomain.net                   Online     0x0000000000000002
  ns1-node1.mydomain.net                   Online     0x0000000000000001

[root at ns1-node1 ~]# clusvcadm -m ns1-node1.mydomain.net -e DNS
Member ns1-node1.mydomain.net not in membership list

Any help/advice will be greatly appreciated. TIA.


From eric at bootseg.com  Thu Oct 27 19:17:36 2005
From: eric at bootseg.com (Eric Kerin)
Date: Thu, 27 Oct 2005 15:17:36 -0400
Subject: [Linux-cluster] Service/Resource group help needed
In-Reply-To: <1130440083.2950.25.camel@nwp-wk-79033-l>
References: <1130440083.2950.25.camel@nwp-wk-79033-l>
Message-ID: <1130440656.3453.10.camel@auh5-0479.corp.jabil.org>

On Thu, 2005-10-27 at 12:08 -0700, Philip R. Dana wrote:
> [root at ns1-node1 ~]# clustat
> Member Status: Quorate
> 
> Not a member of the Resource Manager service group.
> Resource Group information unavailable; showing all cluster members.
> 
>   Member Name                              State      ID
>   ------ ----                              -----      --
>   ns1-node2.mydomain.net                   Online     0x0000000000000002
>   ns1-node1.mydomain.net                   Online     0x0000000000000001
> 

Basically this message means the rgmanager service isn't running on the
cluster node you ran clustat on.  So it's showing the full membership
list for the cluster.

Start it up on all the nodes, and you should be good to go.

Hope this helps,
Eric Kerin
eric at bootseg.com


From mwill at penguincomputing.com  Thu Oct 27 19:25:04 2005
From: mwill at penguincomputing.com (Michael Will)
Date: Thu, 27 Oct 2005 12:25:04 -0700
Subject: [Linux-cluster] dhcp failover
Message-ID: <43612990.3010703@penguincomputing.com>

Two things to consider:

1. Normally you would run two DHCP servers under two different IP's on 
the same subnet that serve half of the IP numbers.
When one fails, the other one continues to serve its ip space and 
hopefully the first one is fixed before the second one runs out of IP 
numbers.

2. If you have a hard requirement to be only on a single IP address for 
the DHCP server (rare case for some ISP's DSL  hardware)
then you can do the active/passive configuration much more easily with a 
classic heartbeat-setup.

We have done both as professional service for customers of ours. 
Attached storage we usually only use when there is significant data to 
share,
i.e. mysql. The dhcp server configuration can be synced with rsync 
across the second gigabit ethernet port in realtime.

Michael

-- 
Michael Will
Penguin Computing Corp.
Sales Engineer
415-954-2822
415-954-2899 fx
mwill at penguincomputing.com 


From Philip.R.Dana at nwp01.usace.army.mil  Thu Oct 27 19:59:09 2005
From: Philip.R.Dana at nwp01.usace.army.mil (Dana, Philip R NWP Contractor)
Date: Thu, 27 Oct 2005 12:59:09 -0700
Subject: [Linux-cluster] Service/Resource group help needed
In-Reply-To: <1130440656.3453.10.camel@auh5-0479.corp.jabil.org>
References: <1130440083.2950.25.camel@nwp-wk-79033-l>
	<1130440656.3453.10.camel@auh5-0479.corp.jabil.org>
Message-ID: <1130443149.2950.32.camel@nwp-wk-79033-l>

Thanks for the quick reply. The rgmanager service is running on both
nodes, but I think I have lock (dlm) problems. From a service restart:

Oct 27 12:46:41 ns1-node1 clurgmgrd[3372]: <info> Services Initialized
Oct 27 12:46:41 ns1-node1 clurgmgrd[3372]: <info> Logged in SG
"usrm::manager"
Oct 27 12:46:41 ns1-node1 clurgmgrd[3372]: <info> Magma Event:
Membership Change
Oct 27 12:46:41 ns1-node1 clurgmgrd[3372]: <info> State change: Local UP
Oct 27 12:46:41 ns1-node1 clurgmgrd[3372]: <err> #33: Unable to obtain
cluster lock: Operation not permitted

On Thu, 2005-10-27 at 15:17 -0400, Eric Kerin wrote:
> On Thu, 2005-10-27 at 12:08 -0700, Philip R. Dana wrote:
> > [root at ns1-node1 ~]# clustat
> > Member Status: Quorate
> > 
> > Not a member of the Resource Manager service group.
> > Resource Group information unavailable; showing all cluster members.
> > 
> >   Member Name                              State      ID
> >   ------ ----                              -----      --
> >   ns1-node2.mydomain.net                   Online     0x0000000000000002
> >   ns1-node1.mydomain.net                   Online     0x0000000000000001
> > 
> 
> Basically this message means the rgmanager service isn't running on the
> cluster node you ran clustat on.  So it's showing the full membership
> list for the cluster.
> 
> Start it up on all the nodes, and you should be good to go.
> 
> Hope this helps,
> Eric Kerin
> eric at bootseg.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051027/9aa3a8d1/attachment.htm>

From lhh at redhat.com  Thu Oct 27 22:06:02 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 27 Oct 2005 18:06:02 -0400
Subject: [Linux-cluster] Service/Resource group help needed
In-Reply-To: <1130443149.2950.32.camel@nwp-wk-79033-l>
References: <1130440083.2950.25.camel@nwp-wk-79033-l>
	<1130440656.3453.10.camel@auh5-0479.corp.jabil.org>
	<1130443149.2950.32.camel@nwp-wk-79033-l>
Message-ID: <1130450762.23803.41.camel@ayanami.boston.redhat.com>

On Thu, 2005-10-27 at 12:59 -0700, Dana, Philip R NWP Contractor wrote:
> Thanks for the quick reply. The rgmanager service is running on both
> nodes, but I think I have lock (dlm) problems. From a service restart:
> 
> Oct 27 12:46:41 ns1-node1 clurgmgrd[3372]: <info> Services Initialized
> Oct 27 12:46:41 ns1-node1 clurgmgrd[3372]: <info> Logged in SG
> "usrm::manager"
> Oct 27 12:46:41 ns1-node1 clurgmgrd[3372]: <info> Magma Event:
> Membership Change
> Oct 27 12:46:41 ns1-node1 clurgmgrd[3372]: <info> State change: Local
> UP
> Oct 27 12:46:41 ns1-node1 clurgmgrd[3372]: <err> #33: Unable to obtain
> cluster lock: Operation not permitted

service rgmanager stop; modprobe dlm; service rgmanager start

-- Lon


From pcaulfie at redhat.com  Fri Oct 28 13:06:52 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Fri, 28 Oct 2005 14:06:52 +0100
Subject: [Linux-cluster] Re: Xen cluster doku down
In-Reply-To: <43621F64.2020609@kofler.eu.org>
References: <43621F64.2020609@kofler.eu.org>
Message-ID: <4362226C.1050304@redhat.com>

Thomas Kofler wrote:
> Hi,
> 
> I used your nice guide setting up a xen cluster, but suddenly the link
> is broken:
> http://www.cix.co.uk/~tykepenguin/xencluster.html
> <http://www.cix.co.uk/%7Etykepenguin/xencluster.html>
> 
> Do you have a mirror URL ?

Sorry, It's now at
http://people.redhat.com/pcaulfie/docs/xencluster.html

-- 

patrick


From a_webb_5 at yahoo.com  Fri Oct 28 16:14:38 2005
From: a_webb_5 at yahoo.com (Amber Webb)
Date: Fri, 28 Oct 2005 09:14:38 -0700 (PDT)
Subject: [Linux-cluster] TORQUE 2.0
Message-ID: <20051028161438.49432.qmail@web35710.mail.mud.yahoo.com>

Hi,

I would like to announce that TORQUE Resource Manager
2.0 was just released, and can be downloaded at
www.clusterresources.com/torque.  

TORQUE, which is built on OpenPBS is one of the most
widely used open source batch schedulers.  TORQUE's
improvements since the last patch include an improved
start up feature for quick startup of downed nodes,
enhanced internal diagnostics, simplified install, and
improved API reporting abilities.

TORQUE is a community project with contributions from
NCSA, OSC, USC, the U.S. Department of Energy, Sandia,
PNNL, University of Buffalo, TeraGrid and many other
leading edge HPC organizations.

We invite you to download and try TORQUE and visit our
user community www.clusterresources.com/torque.  We
welcome feedback and patch submissions.

Regards,

Amber 


__________________________________ 
Yahoo! Mail - PC Magazine Editors' Choice 2005 
http://mail.yahoo.com


From david.chappel at mindbank.com  Wed Oct 19 17:30:58 2005
From: david.chappel at mindbank.com (David A. Chappel)
Date: Wed, 19 Oct 2005 11:30:58 -0600
Subject: [Linux-cluster] mounts not spanning
In-Reply-To: <1129303867.4838.30.camel@localhost.localdomain>
References: <1129303867.4838.30.camel@localhost.localdomain>
Message-ID: <1129743058.5069.12.camel@localhost.localdomain>

Hi all;

On Fri, 2005-10-14 at 09:31 -0600, David A. Chappel wrote:
> Hi there clusterites...  Anyone have a cluestick?
> 

The clue stick was meant for me.  And for good reason.
I'll wait for ddraid.

Cheers,
-D


> I have created a wee "cluster" of two machines.  They seem to be happy
> in every way, except that when I mount the gfs volumes on each machine,
> the mounts do not span across the two nodes, but act as a traditional
> node.  In other words, I can echo "haha" > /mnt/shareMe/haha.txt on one
> machine but it doesn't show up on the other.  Vice versa too.
> 
> I use:
> mount -t gfs /dev/shareMeVG/shareMeLV /mnt/shareMe
> 
> I've tried the -o ignore_local_fs option without success.
> 
> Also, is there a quick/standard way for non-cluster kernel machines to
> mount the "partition" remotely?
> 
> Cheers,
> -D
> 
> 
> 
> [root at JavaTheHut ~]# cat /proc/cluster/status
> Protocol version: 5.0.1
> Config version: 1
> Cluster name: clusta
> Cluster ID: 6621
> Cluster Member: Yes
> Membership state: Cluster-Member
> Nodes: 2
> Expected_votes: 1
> Total_votes: 2
> Quorum: 1
> Active subsystems: 6
> Node name: JavaTheHut.mindbankts.com
> Node addresses: 10.1.1.22
> 
> [root at marvin ~]# cat /etc/cluster/cluster.conf
> <?xml version="1.0"?>
> <cluster name="clusta" config_version="1">
>     <cman two_node="1" expected_votes="1">
>     </cman>
>     <clusternodes>
>       <clusternode name="marvin.mindbankts.com" votes="1">
>        <fence>
>         <method name="single">
>          <device name="human" ipaddr="10.1.1.20"/>
>        </method>
>       </fence>
>      </clusternode>
>      <clusternode name="JavaTheHut.mindbankts.com" votes="1">
>       <fence>
>        <method name="single">
>          <device name="human" ipaddr="10.1.1.22"/>
>        </method>
>       </fence>
>     </clusternode>
>    </clusternodes>
>   <fence_devices>
>    <fence_device name="human" agent="fence_manual"/>
>   </fence_devices>
>  </cluster>
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From david.chappel at mindbank.com  Wed Oct 19 20:03:38 2005
From: david.chappel at mindbank.com (David A. Chappel)
Date: Wed, 19 Oct 2005 14:03:38 -0600
Subject: [Linux-cluster] New Cluster Installation Starts Partitioned
In-Reply-To: <20051019194901.19010.qmail@web60516.mail.yahoo.com>
References: <20051019194901.19010.qmail@web60516.mail.yahoo.com>
Message-ID: <1129752218.5069.29.camel@localhost.localdomain>

Might be a firewall issue.
Doing a netstat -nl listed ports that were not mentioned in the "simple
setup" docs for me.  Specifically 14567.

Cheers,
-d


On Wed, 2005-10-19 at 12:49 -0700, Tim Spaulding wrote:
> Hi All,
> 
> I have a couple of machines that I'm trying to cluster.  The machines are freshly installed FC4
> machines that have been fully updated and running the latest kernel.  They are configured to use
> the lvm2 by default so lvm2 and dm was already installed.  I'm following the directions in the
> usage.txt off RedHat's web site.  I compile the cluster tarball, run depmod, and start ccsd
> without issue.  When I do a cman_tool join -w on each node, both nodes start cman and join the
> cluster, but the cluster is apparently partitioned (i.e. they both see the cluster and are joined
> to it, but the two nodes cannot see that the other node is joined).  I've searched around and
> haven't found anything specific to this symptom.  I have a feeling that it's something to do with
> my network configuration.  Any help would be appreciated.
> 
> Both machines are i686 archs with dual NICs.  The NICs are connected to networks that do not route
> to each other.  One network (eth0 on both machines) is a development network.  The other network
> (eth1) is our corporate network.  I'm trying to configure the cluster to use the dev network
> (eth0).
> 
> Here's the output from uname:
> 
> Linux ctclinux1.clam.com 2.6.13-1.1526_FC4 #1 Wed Sep 28 19:15:10 EDT 2005 i686 i686 i386
> GNU/Linux
> Linux ctclinux2.clam.com 2.6.13-1.1526_FC4 #1 Wed Sep 28 19:15:10 EDT 2005 i686 i686 i386
> GNU/Linux
> 
> Here's the network configuration on ctclinux1:
> 
> eth0      Link encap:Ethernet  HWaddr 00:01:03:26:5C:C9
>           inet addr:192.168.36.200  Bcast:192.168.36.255  Mask:255.255.255.0
>           inet6 addr: fe80::201:3ff:fe26:5cc9/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:7260 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:350 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:449183 (438.6 KiB)  TX bytes:27853 (27.2 KiB)
>           Interrupt:10 Base address:0xec00
> 
> eth1      Link encap:Ethernet  HWaddr 00:B0:D0:41:0F:65
>           inet addr:10.10.10.200  Bcast:10.10.255.255  Mask:255.255.0.0
>           inet6 addr: fe80::2b0:d0ff:fe41:f65/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:57450 errors:0 dropped:0 overruns:1 frame:0
>           TX packets:12957 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:10040767 (9.5 MiB)  TX bytes:1962029 (1.8 MiB)
>           Interrupt:5 Base address:0xe880
> 
> eth1:1    Link encap:Ethernet  HWaddr 00:B0:D0:41:0F:65
>           inet addr:10.10.10.204  Bcast:10.10.255.255  Mask:255.255.0.0
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           Interrupt:5 Base address:0xe880
> 
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:17568 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:17568 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:3692600 (3.5 MiB)  TX bytes:3692600 (3.5 MiB)
> 
> sit0      Link encap:IPv6-in-IPv4
>           NOARP  MTU:1480  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
> 
> Kernel IP routing table
> Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
> 192.168.36.0    *               255.255.255.0   U     0      0        0 eth0
> 10.74.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
> 10.72.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
> 10.75.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
> 10.73.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
> 10.10.0.0       *               255.255.0.0     U     0      0        0 eth1
> 169.254.0.0     *               255.255.0.0     U     0      0        0 eth1
> default         10.10.1.1       0.0.0.0         UG    0      0        0 eth1
> 
> cat /etc/hosts
> 10.10.10.200    ctclinux1-svc
> 192.168.36.200  ctclinux1-cls
> 192.168.36.201  ctclinux2-cls
> 10.10.10.201    ctclinux2-svc
> 
> Here's the network configuration on ctclinux2:
> 
> ifconfig -a
> eth0      Link encap:Ethernet  HWaddr 00:01:03:D4:80:7C
>           inet addr:192.168.36.201  Bcast:192.168.36.255  Mask:255.255.255.0
>           inet6 addr: fe80::201:3ff:fed4:807c/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:7702 errors:0 dropped:0 overruns:1 frame:0
>           TX packets:282 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:477769 (466.5 KiB)  TX bytes:22444 (21.9 KiB)
>           Interrupt:10 Base address:0xec00
> 
> eth1      Link encap:Ethernet  HWaddr 00:B0:D0:41:0F:9B
>           inet addr:10.10.10.201  Bcast:10.10.255.255  Mask:255.255.0.0
>           inet6 addr: fe80::2b0:d0ff:fe41:f9b/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:53846 errors:0 dropped:0 overruns:1 frame:0
>           TX packets:7759 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:5733713 (5.4 MiB)  TX bytes:1155588 (1.1 MiB)
>           Interrupt:5 Base address:0xe880
> 
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:17912 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:17912 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:3401868 (3.2 MiB)  TX bytes:3401868 (3.2 MiB)
> 
> sit0      Link encap:IPv6-in-IPv4
>           NOARP  MTU:1480  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
> 
> route
> Kernel IP routing table
> Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
> 192.168.36.0    *               255.255.255.0   U     0      0        0 eth0
> 10.74.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
> 10.72.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
> 10.75.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
> 10.73.0.0       192.168.36.10   255.255.255.0   UG    0      0        0 eth0
> 10.10.0.0       *               255.255.0.0     U     0      0        0 eth1
> 169.254.0.0     *               255.255.0.0     U     0      0        0 eth1
> default         10.10.1.1       0.0.0.0         UG    0      0        0 eth1
> 
> cat /etc/hosts
> 10.10.10.201    ctclinux2-svc
> 192.168.36.201  ctclinux2-cls
> 192.168.36.200  ctclinux1-cls
> 10.10.10.200    ctclinux1-svc
> 
> Here's the cluster configuration file:
> 
> <?xml version="1.0"?>
> <cluster name="cl_tic" config_version="1">
>         <cman>
>         </cman>
> 
>         <clusternodes>
>                 <clusternode name="ctclinux1-cls">
>                         <fence>
>                                 <method name="single">
>                                         <device name="human" nodename="ctclinux1-cls"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
> 
>                 <clusternode name="ctclinux2-cls">
>                         <fence>
>                                 <method name="single">
>                                         <device name="human" nodename="ctclinux2-cls"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>         </clusternodes>
> 
>         <fence_devices>
>                 <fence_device name="human" agent="fence_manual"/>
>         </fence_devices>
> </cluster>
> 
> Here's the cluster information from ctclinux1 after the cluster is started and joined:
> 
> cman_tool -d join -w
> nodename ctclinux1.clam.com not found
> nodename ctclinux1 (truncated) not found
> nodename ctclinux1 doesn't match ctclinux1-cls (ctclinux1-cls in cluster.conf)
> nodename ctclinux1 doesn't match ctclinux2-cls (ctclinux2-cls in cluster.conf)
> nodename localhost (if lo) not found
> selected nodename ctclinux1-cls
> setup up interface for address: ctclinux1-cls
> Broadcast address for c824a8c0 is ff24a8c0
> 
> cman_tool status
> Protocol version: 5.0.1
> Config version: 1
> Cluster name: cl_tic
> Cluster ID: 6429
> Cluster Member: Yes
> Membership state: Cluster-Member
> Nodes: 1
> Expected_votes: 2
> Total_votes: 1
> Quorum: 2  Activity blocked
> Active subsystems: 0
> Node name: ctclinux1-cls
> Node addresses: 192.168.36.200
> 
> cman_tool nodes
> Node  Votes Exp Sts  Name
>    1    1    2   M   ctclinux1-cls
> 
> Here's the cluster information from ctclinux2 after the cluster is started and joined:
> 
> cman_tool -d join -w
> nodename ctclinux2.clam.com not found
> nodename ctclinux2 (truncated) not found
> nodename ctclinux2 doesn't match ctclinux1-cls (ctclinux1-cls in cluster.conf)
> nodename ctclinux2 doesn't match ctclinux2-cls (ctclinux2-cls in cluster.conf)
> nodename localhost (if lo) not found
> selected nodename ctclinux2-cls
> setup up interface for address: ctclinux2-cls
> Broadcast address for c924a8c0 is ff24a8c0
> 
> cman_tool status
> Protocol version: 5.0.1
> Config version: 1
> Cluster name: cl_tic
> Cluster ID: 6429
> Cluster Member: Yes
> Membership state: Cluster-Member
> Nodes: 1
> Expected_votes: 2
> Total_votes: 1
> Quorum: 2  Activity blocked
> Active subsystems: 0
> Node name: ctclinux2-cls
> Node addresses: 192.168.36.201
> 
> cman_tool nodes
> Node  Votes Exp Sts  Name
>    1    1    2   M   ctclinux2-cls
> 
> Let me know if there is more information that I need to provide.  As an aside, I've tried reducing
> the quorum count with no difference in behavior and I've tried using multicast which fails on the
> cman_tool join with an "Unknown Host" error.  I'm open to any other suggestions.
> 
> Thanks,
> 
> tims
> 
> 
> 	
> 		
> __________________________________ 
> Yahoo! Mail - PC Magazine Editors' Choice 2005 
> http://mail.yahoo.com
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From sommere at gac.edu  Fri Oct 21 16:38:13 2005
From: sommere at gac.edu (Ethan Sommer)
Date: Fri, 21 Oct 2005 11:38:13 -0500
Subject: [Linux-cluster] Occasional kernel panics
Message-ID: <43591975.9070800@gac.edu>

Every few days or so our cluster machines seem to have kernel panics 
comp laing about GFS locking (although its pretty irregular, we went for 
a few weeks without an outage)

We noticed that this happened a LOT, and it was reproducible when 
certain users accessed files, when we were serving afp off the cluster. 
We have changed things since then so that afp is run on a server which 
nfs mounts the cluster.

We are running FC4 with the gfs modules from yum.


Here is our most recent kernel panics, followed by one from when we had 
afp running on the cluster: (it looks like there is relevant info above 
the cut-here, possibly if it might be helpful)


Oct 19 14:44:41 meow kernel: ------------[ cut here ]------------
Oct 19 14:44:41 meow kernel: kernel BUG at 
/usr/src/build/607755-i686/BUILD/smp/src/lockqueue.c:1144!
Oct 19 14:44:41 meow kernel: invalid operand: 0000 [#1]
Oct 19 14:44:41 meow kernel: SMP
Oct 19 14:44:41 meow kernel: Modules linked in: nfsd exportfs lockd 
autofs4 lock_dlm(U) gfs(U) lock_harness(U) rfcomm l2cap bluetooth dlm(U) 
cman(U) md5 ip
v6 sunrpc ipt_LOG ipt_limit ipt_state ip_conntrack iptable_filter 
ip_tables video button battery ac uhci_hcd ehci_hcd hw_random i2c_i801 
i2c_core shpchp e1000 floppy ext3 jbd raid1 dm_mod qla2200 qla2xxx 
scsi_transport_fc ata_piix libata sd_mod scsi_mod
Oct 19 14:44:41 meow kernel: CPU:    1
Oct 19 14:44:41 meow kernel: EIP:    0060:[<f8af9dcf>]    Not tainted VLI
Oct 19 14:44:41 meow kernel: EFLAGS: 00010292   (2.6.12-1.1447_FC4smp)
Oct 19 14:44:41 meow kernel: EIP is at 
process_cluster_request+0xddb/0xdef [dlm]
Oct 19 14:44:41 meow kernel: eax: 00000004   ebx: 00000000   ecx: 
c035fa4c   edx: 00000286
Oct 19 14:44:41 meow kernel: esi: f7fb8400   edi: 00000000   ebp: 
d2988000   esp: f7eefe24
Oct 19 14:44:41 meow kernel: ds: 007b   es: 007b   ss: 0068
Oct 19 14:44:41 meow kernel: Process dlm_recvd (pid: 2402, 
threadinfo=f7eef000 task=f7851020)
Oct 19 14:44:41 meow kernel: Stack: f8b0621b 00000001 f8b071e0 f8b06217 
2583f987 00000001 00000040 00004000
Oct 19 14:44:41 meow kernel:        f7eefe48 00000000 c038e1a0 00000a58 
f0167b00 c02a26c1 00000a58 00004040
Oct 19 14:44:41 meow kernel:        00000072 f7eefed4 00000000 00000001 
00000246 00000000 edd6eeb8 00000000
Oct 19 14:44:41 meow kernel: Call Trace:
Oct 19 14:44:41 meow kernel:  [<c02a26c1>] sock_recvmsg+0x103/0x11e
Oct 19 14:44:41 meow kernel:  [<f8afd46b>] 
midcomms_process_incoming_buffer+0x13b/0x25f [dlm]
Oct 19 14:44:41 meow kernel:  [<c011ce54>] load_balance_newidle+0x23/0x82
Oct 19 14:44:41 meow kernel:  [<f8afb3d3>] receive_from_sock+0x196/0x2c9 
[dlm]
Oct 19 14:44:41 meow kernel:  [<c0307705>] schedule+0x405/0xc5e
Oct 19 14:44:41 meow kernel:  [<c0307731>] schedule+0x431/0xc5e
Oct 19 14:44:41 meow kernel:  [<f8afc457>] dlm_recvd+0x0/0x9c [dlm]
Oct 19 14:44:41 meow kernel:  [<f8afc2d3>] process_sockets+0x75/0xb7 [dlm]
Oct 19 14:44:41 meow kernel:  [<f8afc4c7>] dlm_recvd+0x70/0x9c [dlm]
Oct 19 14:44:41 meow kernel:  [<c0134c09>] kthread+0x93/0x97
Oct 19 14:44:41 meow kernel:  [<c0134b76>] kthread+0x0/0x97
Oct 19 14:44:41 meow kernel:  [<c01023d1>] kernel_thread_helper+0x5/0xb
Oct 19 14:44:41 meow kernel: Code: 4f 82 62 c7 89 e8 e8 b1 b4 00 00 8b 
4c 24 14 89 4c 24 04 c7 04 24 6d 63 b0 f8 e8 34 82 62 c7 c7 04 24 1b 62 
b0 f8 e8 28
82 62 c7 <0f> 0b 78 04 e0 71 b0 f8 c7 04 24 70 72 b0 f8 e8 40 78 62 c7 57
Oct 19 14:44:41 meow kernel:  <0>Fatal exception: panic in 5 seconds


Panic 2:

Oct 10 09:58:39 woof kernel: ------------[ cut here ]------------
Oct 10 09:58:39 woof kernel: kernel BUG at 
/usr/src/build/607778-i686/BUILD/smp/src/dlm/lock.c:411!
Oct 10 09:58:39 woof kernel: invalid operand: 0000 [#1]
Oct 10 09:58:39 woof kernel: SMP
Oct 10 09:58:39 woof kernel: Modules linked in: nfsd exportfs lockd 
autofs4 lock_dlm(U) gfs(U) lock_harness(U) rfcomm l2cap bluetooth dlm(U) 
cman(U) md5 ip
v6 sunrpc ipt_LOG ipt_limit ipt_state ip_conntrack iptable_filter 
ip_tables video button battery ac uhci_hcd ehci_hcd hw_random i2c_i801 
i2c_core shpchp e1
000 dm_snapshot dm_zero dm_mirror ext3 jbd raid1 dm_mod qla2200 qla2xxx 
scsi_transport_fc ata_piix libata sd_mod scsi_mod
Oct 10 09:58:39 woof kernel: CPU:    1
Oct 10 09:58:39 woof kernel: EIP:    0060:[<f8b98bf5>]    Not tainted VLI
Oct 10 09:58:39 woof kernel: EFLAGS: 00010292   (2.6.12-1.1447_FC4smp)
Oct 10 09:58:39 woof kernel: EIP is at do_dlm_lock+0x1b7/0x21d [lock_dlm]
Oct 10 09:58:39 woof kernel: eax: 00000004   ebx: 00000000   ecx: 
c035fa4c   edx: 00000292
Oct 10 09:58:39 woof kernel: esi: f7848140   edi: ffffffea   ebp: 
00000003   esp: c74b3cfc
Oct 10 09:58:39 woof kernel: ds: 007b   es: 007b   ss: 0068
Oct 10 09:58:39 woof kernel: Process imapd (pid: 24278, 
threadinfo=c74b3000 task=f4721a80)
Oct 10 09:58:39 woof kernel: Stack: f8b9de75 f7848140 00000003 1bbe0000 
00000000 ffffffea 00000003 00000005
Oct 10 09:58:39 woof kernel:        0000000d 00000005 00000000 f58c0a00 
00000001 0000000d 20200000 20202020
Oct 10 09:58:39 woof kernel:        20203320 20202020 62312020 30306562 
00183030 c8fb2f00 00000001 00000001
Oct 10 09:58:39 woof kernel: Call Trace:
Oct 10 09:58:39 woof kernel:  [<f8b98cff>] lm_dlm_lock+0x52/0x5e [lock_dlm]
Oct 10 09:58:39 woof kernel:  [<f8b98cad>] lm_dlm_lock+0x0/0x5e [lock_dlm]
Oct 10 09:58:39 woof kernel:  [<f8bd000c>] gfs_lm_lock+0x3d/0x5c [gfs]
Oct 10 09:58:39 woof kernel:  [<f8bc5039>] gfs_glock_xmote_th+0xae/0x1d3 
[gfs]
Oct 10 09:58:39 woof kernel:  [<f8bc463c>] rq_promote+0x126/0x150 [gfs]
Oct 10 09:58:39 woof kernel:  [<f8bc4840>] run_queue+0xee/0x113 [gfs]
Oct 10 09:58:39 woof kernel:  [<f8bc5af1>] gfs_glock_nq+0x93/0x144 [gfs]
Oct 10 09:58:39 woof kernel:  [<f8bc619d>] gfs_glock_nq_init+0x18/0x2d [gfs]
Oct 10 09:58:39 woof kernel:  [<f8be3926>] get_local_rgrp+0xca/0x1b0 [gfs]
Oct 10 09:58:39 woof kernel:  [<f8be3a9c>] 
gfs_inplace_reserve_i+0x90/0xd0 [gfs]
Oct 10 09:58:39 woof kernel:  [<f8be046b>] gfs_quota_lock_m+0xbf/0x117 [gfs]
Oct 10 09:58:39 woof kernel:  [<f8bd8a2b>] do_do_write_buf+0x3a1/0x485 [gfs]
Oct 10 09:58:39 woof kernel:  [<f8bc56a1>] 
glock_wait_internal+0x16b/0x26a [gfs]
Oct 10 09:58:39 woof kernel:  [<f8bd8c91>] do_write_buf+0x182/0x1b6 [gfs]
Oct 10 09:58:39 woof kernel:  [<f8bd7be5>] walk_vm+0xb3/0x111 [gfs]
Oct 10 09:58:39 woof kernel:  [<f8bd8d65>] gfs_write+0xa0/0xc2 [gfs]
Oct 10 09:58:39 woof kernel:  [<f8bd8b0f>] do_write_buf+0x0/0x1b6 [gfs]
Oct 10 09:58:39 woof kernel:  [<f8bd8cc5>] gfs_write+0x0/0xc2 [gfs]
Oct 10 09:58:39 woof kernel:  [<c0162987>] vfs_write+0x9e/0x110
Oct 10 09:58:39 woof kernel:  [<c0162aa4>] sys_write+0x41/0x6a
Oct 10 09:58:39 woof kernel:  [<c0104035>] syscall_call+0x7/0xb
Oct 10 09:58:39 woof kernel: Code: 7c 24 14 89 4c 24 0c 89 5c 24 10 89 
6c 24 08 89 74 24 04 c7 04 24 28 e6 b9 f8 e8 0e 94 58 c7 c7 04 24 75 de 
b9 f8 e8 02
94 58 c7 <0f> 0b 9b 01 a0 e4 b9 f8 c7 04 24 3c e5 b9 f8 e8 1a 8a 58 c7 66
Oct 10 09:58:39 woof kernel:  <0>Fatal exception: panic in 5 seconds


Sep  7 15:37:44 meow kernel: ------------[ cut here ]------------
Sep  7 15:37:44 meow kernel: kernel BUG at 
/usr/src/build/588748-i686/BUILD/smp/src/dlm/plock.c:500!
Sep  7 15:37:44 meow kernel: invalid operand: 0000 [#1]
Sep  7 15:37:44 meow kernel: SMP
Sep  7 15:37:44 meow kernel: Modules linked in: appletalk nfsd exportfs 
lockd autofs4 lock_dlm(U) gfs(U) lock_harness(U) rfcomm l2cap bluetooth 
dlm(U) cman
(U) sunrpc md5 ipv6 ipt_LOG ipt_limit ipt_state ip_conntrack 
iptable_filter ip_tables video button battery ac uhci_hcd ehci_hcd 
hw_random i2c_i801 i2c_core
 shpchp e1000 floppy ext3 jbd raid1 dm_mod qla2200 qla2xxx 
scsi_transport_fc ata_piix libata sd_mod scsi_mod
Sep  7 15:37:44 meow kernel: CPU:    3
Sep  7 15:37:44 meow kernel: EIP:    0060:[<f8b9a3f7>]    Tainted: 
GF     VLI
Sep  7 15:37:44 meow kernel: EFLAGS: 00010292   (2.6.12-1.1398_FC4smp)
Sep  7 15:37:44 meow kernel: EIP is at update_lock+0x87/0x9b [lock_dlm]
Sep  7 15:37:44 meow kernel: eax: 00000004   ebx: fffffff5   ecx: 
c035ca4c   edx: 00000282
Sep  7 15:37:44 meow kernel: esi: 00000000   edi: e99c2c00   ebp: 
00000000   esp: d05dedb4
Sep  7 15:37:44 meow kernel: ds: 007b   es: 007b   ss: 0068
Sep  7 15:37:44 meow kernel: Process afpd (pid: 3872, 
threadinfo=d05de000 task=d6447550)
Sep  7 15:37:44 meow kernel: Stack: badc0ded f8b9d0d6 fffffff5 f8b9da70 
f8b9d101 06609291 f7943000 00000000
Sep  7 15:37:44 meow kernel:        f8b9a499 7ffffff8 00000000 7ffffff8 
00000000 d05dede8 d7636700 7ffffff8
Sep  7 15:37:44 meow kernel:        00000000 d05deea8 d05dee28 f8b9a987 
00000001 7ffffff8 00000000 7ffffff8
Sep  7 15:37:44 meow kernel: Call Trace:
Sep  7 15:37:44 meow kernel:  [<f8b9a499>] add_lock+0x8e/0xed [lock_dlm]
Sep  7 15:37:44 meow kernel:  [<f8b9a987>] fill_gaps+0x87/0x10e [lock_dlm]
Sep  7 15:37:44 meow kernel:  [<f8b9aa51>] lock_case3+0x43/0xac [lock_dlm]
Sep  7 15:37:44 meow kernel:  [<f8b9aeac>] plock_internal+0x1aa/0x370 
[lock_dlm]
Sep  7 15:37:44 meow kernel:  [<f8b9b614>] lm_dlm_plock+0x25b/0x2dc 
[lock_dlm]
Sep  7 15:37:44 meow kernel:  [<f8b9b3b9>] lm_dlm_plock+0x0/0x2dc [lock_dlm]
Sep  7 15:37:44 meow kernel:  [<f8bdc1c3>] gfs_lm_plock+0x45/0x57 [gfs]
Sep  7 15:37:44 meow kernel:  [<f8be5731>] gfs_lock+0xcd/0x11c [gfs]
Sep  7 15:37:44 meow kernel:  [<f8be5664>] gfs_lock+0x0/0x11c [gfs]
Sep  7 15:37:44 meow kernel:  [<c0176c4f>] fcntl_setlk64+0x16c/0x26a
Sep  7 15:37:44 meow kernel:  [<c0162e93>] fget+0x3b/0x42
Sep  7 15:37:44 meow kernel:  [<c0172bfd>] sys_fcntl64+0x55/0x97
Sep  7 15:37:44 meow kernel:  [<c0104025>] syscall_call+0x7/0xb
Sep  7 15:37:44 meow kernel: Code: 01 00 00 c7 04 24 a8 da b9 f8 e8 7c 
77 58 c7 89 5c 24 04 c7 04 24 08 d1 b9 f8 e8 6c 77 58 c7 c7 04 24 d6 d0 
b9 f8 e8 60
77 58 c7 <0f> 0b f4 01 70 da b9 f8 c7 04 24 10 db b9 f8 e8 78 6d 58 c7 55
Sep  7 15:37:44 meow kernel:  <0>Fatal exception: panic in 5 seconds


Thanks for any help,
  Ethan


From fbavandpouri at amcc.com  Mon Oct 24 23:02:19 2005
From: fbavandpouri at amcc.com (Farid Bavandpouri)
Date: Mon, 24 Oct 2005 16:02:19 -0700
Subject: [Linux-cluster] gfs dlm - SAMBA problem(lock??)
Message-ID: <9D1E2BDCB5C57B46B56E6D80843439EBBB88DA@SDCEXCHANGE01.ad.amcc.com>

Remobe/unsubscribe mmontaseri at amcc.com

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of vmoravek at atlas.cz
Sent: Monday, October 24, 2005 4:00 PM
To: Linux-cluster at redhat.com
Subject: [Linux-cluster] gfs dlm - SAMBA problem(lock??)

Hi all,
I having use gfs for samba cluster.GFS works fine but have big trouble
with this situation.

When 6 computers want downloadind same file od directory everything is
ok. 
But same situation but 8 computers trafic is rappidly going down and I
have some strange messages about oplocks (maybe) in smb.log.

Have any idea what is wrong?

Best Regard

Vojta 


--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
--------------------------------------------------------

CONFIDENTIALITY NOTICE: This e-mail message, including any attachments, is for the sole use of the intended recipient(s) and contains information that is confidential and proprietary to Applied Micro Circuits Corporation or its subsidiaries. It is to be used solely for the purpose of furthering the parties' business relationship. All unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message.


From fbavandpouri at amcc.com  Tue Oct 25 16:41:27 2005
From: fbavandpouri at amcc.com (Farid Bavandpouri)
Date: Tue, 25 Oct 2005 09:41:27 -0700
Subject: [Linux-cluster] Occasional kernel panics
Message-ID: <9D1E2BDCB5C57B46B56E6D80843439EBBB89D2@SDCEXCHANGE01.ad.amcc.com>

Unsubscribe mmontaseri at amcc.com

He no longer works at AMCC.

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ethan Sommer
Sent: Monday, October 24, 2005 5:48 PM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] Occasional kernel panics

Every few days or so our cluster machines seem to have kernel panics
comp laing about GFS locking (although its pretty irregular, we went for
a few weeks without an outage)

We noticed that this happened a LOT, and it was reproducible when
certain users accessed files, when we were serving afp off the cluster.
We have changed things since then so that afp is run on a server which
nfs mounts the cluster.

We are running FC4 with the gfs modules from yum.


Here is our most recent kernel panics, followed by one from when we had
afp running on the cluster: (it looks like there is relevant info above
the cut-here, possibly if it might be helpful)


Oct 19 14:44:41 meow kernel: ------------[ cut here ]------------ Oct 19
14:44:41 meow kernel: kernel BUG at
/usr/src/build/607755-i686/BUILD/smp/src/lockqueue.c:1144!
Oct 19 14:44:41 meow kernel: invalid operand: 0000 [#1] Oct 19 14:44:41
meow kernel: SMP Oct 19 14:44:41 meow kernel: Modules linked in: nfsd
exportfs lockd
autofs4 lock_dlm(U) gfs(U) lock_harness(U) rfcomm l2cap bluetooth dlm(U)
cman(U) md5 ip
v6 sunrpc ipt_LOG ipt_limit ipt_state ip_conntrack iptable_filter
ip_tables video button battery ac uhci_hcd ehci_hcd hw_random i2c_i801
i2c_core shpchp e1000 floppy ext3 jbd raid1 dm_mod qla2200 qla2xxx
scsi_transport_fc ata_piix libata sd_mod scsi_mod
Oct 19 14:44:41 meow kernel: CPU:    1
Oct 19 14:44:41 meow kernel: EIP:    0060:[<f8af9dcf>]    Not tainted
VLI
Oct 19 14:44:41 meow kernel: EFLAGS: 00010292   (2.6.12-1.1447_FC4smp)
Oct 19 14:44:41 meow kernel: EIP is at
process_cluster_request+0xddb/0xdef [dlm]
Oct 19 14:44:41 meow kernel: eax: 00000004   ebx: 00000000   ecx:
c035fa4c   edx: 00000286
Oct 19 14:44:41 meow kernel: esi: f7fb8400   edi: 00000000   ebp:
d2988000   esp: f7eefe24
Oct 19 14:44:41 meow kernel: ds: 007b   es: 007b   ss: 0068
Oct 19 14:44:41 meow kernel: Process dlm_recvd (pid: 2402,
threadinfo=f7eef000 task=f7851020) Oct 19 14:44:41 meow kernel: Stack:
f8b0621b 00000001 f8b071e0 f8b06217
2583f987 00000001 00000040 00004000
Oct 19 14:44:41 meow kernel:        f7eefe48 00000000 c038e1a0 00000a58
f0167b00 c02a26c1 00000a58 00004040
Oct 19 14:44:41 meow kernel:        00000072 f7eefed4 00000000 00000001
00000246 00000000 edd6eeb8 00000000
Oct 19 14:44:41 meow kernel: Call Trace:
Oct 19 14:44:41 meow kernel:  [<c02a26c1>] sock_recvmsg+0x103/0x11e Oct
19 14:44:41 meow kernel:  [<f8afd46b>]
midcomms_process_incoming_buffer+0x13b/0x25f [dlm] Oct 19 14:44:41 meow
kernel:  [<c011ce54>] load_balance_newidle+0x23/0x82 Oct 19 14:44:41
meow kernel:  [<f8afb3d3>] receive_from_sock+0x196/0x2c9 [dlm] Oct 19
14:44:41 meow kernel:  [<c0307705>] schedule+0x405/0xc5e Oct 19 14:44:41
meow kernel:  [<c0307731>] schedule+0x431/0xc5e Oct 19 14:44:41 meow
kernel:  [<f8afc457>] dlm_recvd+0x0/0x9c [dlm] Oct 19 14:44:41 meow
kernel:  [<f8afc2d3>] process_sockets+0x75/0xb7 [dlm] Oct 19 14:44:41
meow kernel:  [<f8afc4c7>] dlm_recvd+0x70/0x9c [dlm] Oct 19 14:44:41
meow kernel:  [<c0134c09>] kthread+0x93/0x97 Oct 19 14:44:41 meow
kernel:  [<c0134b76>] kthread+0x0/0x97 Oct 19 14:44:41 meow kernel:
[<c01023d1>] kernel_thread_helper+0x5/0xb Oct 19 14:44:41 meow kernel:
Code: 4f 82 62 c7 89 e8 e8 b1 b4 00 00 8b 4c 24 14 89 4c 24 04 c7 04 24
6d 63 b0 f8 e8 34 82 62 c7 c7 04 24 1b 62 b0 f8 e8 28
82 62 c7 <0f> 0b 78 04 e0 71 b0 f8 c7 04 24 70 72 b0 f8 e8 40 78 62 c7
57 Oct 19 14:44:41 meow kernel:  <0>Fatal exception: panic in 5 seconds


Panic 2:

Oct 10 09:58:39 woof kernel: ------------[ cut here ]------------ Oct 10
09:58:39 woof kernel: kernel BUG at
/usr/src/build/607778-i686/BUILD/smp/src/dlm/lock.c:411!
Oct 10 09:58:39 woof kernel: invalid operand: 0000 [#1] Oct 10 09:58:39
woof kernel: SMP Oct 10 09:58:39 woof kernel: Modules linked in: nfsd
exportfs lockd
autofs4 lock_dlm(U) gfs(U) lock_harness(U) rfcomm l2cap bluetooth dlm(U)
cman(U) md5 ip
v6 sunrpc ipt_LOG ipt_limit ipt_state ip_conntrack iptable_filter
ip_tables video button battery ac uhci_hcd ehci_hcd hw_random i2c_i801
i2c_core shpchp e1 000 dm_snapshot dm_zero dm_mirror ext3 jbd raid1
dm_mod qla2200 qla2xxx scsi_transport_fc ata_piix libata sd_mod scsi_mod
Oct 10 09:58:39 woof kernel: CPU:    1
Oct 10 09:58:39 woof kernel: EIP:    0060:[<f8b98bf5>]    Not tainted
VLI
Oct 10 09:58:39 woof kernel: EFLAGS: 00010292   (2.6.12-1.1447_FC4smp)
Oct 10 09:58:39 woof kernel: EIP is at do_dlm_lock+0x1b7/0x21d
[lock_dlm]
Oct 10 09:58:39 woof kernel: eax: 00000004   ebx: 00000000   ecx:
c035fa4c   edx: 00000292
Oct 10 09:58:39 woof kernel: esi: f7848140   edi: ffffffea   ebp:
00000003   esp: c74b3cfc
Oct 10 09:58:39 woof kernel: ds: 007b   es: 007b   ss: 0068
Oct 10 09:58:39 woof kernel: Process imapd (pid: 24278,
threadinfo=c74b3000 task=f4721a80) Oct 10 09:58:39 woof kernel: Stack:
f8b9de75 f7848140 00000003 1bbe0000 00000000 ffffffea 00000003 00000005
Oct 10 09:58:39 woof kernel:        0000000d 00000005 00000000 f58c0a00
00000001 0000000d 20200000 20202020
Oct 10 09:58:39 woof kernel:        20203320 20202020 62312020 30306562
00183030 c8fb2f00 00000001 00000001
Oct 10 09:58:39 woof kernel: Call Trace:
Oct 10 09:58:39 woof kernel:  [<f8b98cff>] lm_dlm_lock+0x52/0x5e
[lock_dlm] Oct 10 09:58:39 woof kernel:  [<f8b98cad>]
lm_dlm_lock+0x0/0x5e [lock_dlm] Oct 10 09:58:39 woof kernel:
[<f8bd000c>] gfs_lm_lock+0x3d/0x5c [gfs] Oct 10 09:58:39 woof kernel:
[<f8bc5039>] gfs_glock_xmote_th+0xae/0x1d3 [gfs] Oct 10 09:58:39 woof
kernel:  [<f8bc463c>] rq_promote+0x126/0x150 [gfs] Oct 10 09:58:39 woof
kernel:  [<f8bc4840>] run_queue+0xee/0x113 [gfs] Oct 10 09:58:39 woof
kernel:  [<f8bc5af1>] gfs_glock_nq+0x93/0x144 [gfs] Oct 10 09:58:39 woof
kernel:  [<f8bc619d>] gfs_glock_nq_init+0x18/0x2d [gfs] Oct 10 09:58:39
woof kernel:  [<f8be3926>] get_local_rgrp+0xca/0x1b0 [gfs] Oct 10
09:58:39 woof kernel:  [<f8be3a9c>] gfs_inplace_reserve_i+0x90/0xd0
[gfs] Oct 10 09:58:39 woof kernel:  [<f8be046b>]
gfs_quota_lock_m+0xbf/0x117 [gfs] Oct 10 09:58:39 woof kernel:
[<f8bd8a2b>] do_do_write_buf+0x3a1/0x485 [gfs] Oct 10 09:58:39 woof
kernel:  [<f8bc56a1>] glock_wait_internal+0x16b/0x26a [gfs] Oct 10
09:58:39 woof kernel:  [<f8bd8c91>] do_write_buf+0x182/0x1b6 [gfs] Oct
10 09:58:39 woof kernel:  [<f8bd7be5>] walk_vm+0xb3/0x111 [gfs] Oct 10
09:58:39 woof kernel:  [<f8bd8d65>] gfs_write+0xa0/0xc2 [gfs] Oct 10
09:58:39 woof kernel:  [<f8bd8b0f>] do_write_buf+0x0/0x1b6 [gfs] Oct 10
09:58:39 woof kernel:  [<f8bd8cc5>] gfs_write+0x0/0xc2 [gfs] Oct 10
09:58:39 woof kernel:  [<c0162987>] vfs_write+0x9e/0x110 Oct 10 09:58:39
woof kernel:  [<c0162aa4>] sys_write+0x41/0x6a Oct 10 09:58:39 woof
kernel:  [<c0104035>] syscall_call+0x7/0xb Oct 10 09:58:39 woof kernel:
Code: 7c 24 14 89 4c 24 0c 89 5c 24 10 89 6c 24 08 89 74 24 04 c7 04 24
28 e6 b9 f8 e8 0e 94 58 c7 c7 04 24 75 de
b9 f8 e8 02
94 58 c7 <0f> 0b 9b 01 a0 e4 b9 f8 c7 04 24 3c e5 b9 f8 e8 1a 8a 58 c7
66 Oct 10 09:58:39 woof kernel:  <0>Fatal exception: panic in 5 seconds


Sep  7 15:37:44 meow kernel: ------------[ cut here ]------------ Sep  7
15:37:44 meow kernel: kernel BUG at
/usr/src/build/588748-i686/BUILD/smp/src/dlm/plock.c:500!
Sep  7 15:37:44 meow kernel: invalid operand: 0000 [#1] Sep  7 15:37:44
meow kernel: SMP Sep  7 15:37:44 meow kernel: Modules linked in:
appletalk nfsd exportfs lockd autofs4 lock_dlm(U) gfs(U) lock_harness(U)
rfcomm l2cap bluetooth
dlm(U) cman
(U) sunrpc md5 ipv6 ipt_LOG ipt_limit ipt_state ip_conntrack
iptable_filter ip_tables video button battery ac uhci_hcd ehci_hcd
hw_random i2c_i801 i2c_core shpchp e1000 floppy ext3 jbd raid1 dm_mod
qla2200 qla2xxx scsi_transport_fc ata_piix libata sd_mod scsi_mod
Sep  7 15:37:44 meow kernel: CPU:    3
Sep  7 15:37:44 meow kernel: EIP:    0060:[<f8b9a3f7>]    Tainted:
GF     VLI
Sep  7 15:37:44 meow kernel: EFLAGS: 00010292   (2.6.12-1.1398_FC4smp)
Sep  7 15:37:44 meow kernel: EIP is at update_lock+0x87/0x9b [lock_dlm]
Sep  7 15:37:44 meow kernel: eax: 00000004   ebx: fffffff5   ecx:
c035ca4c   edx: 00000282
Sep  7 15:37:44 meow kernel: esi: 00000000   edi: e99c2c00   ebp:
00000000   esp: d05dedb4
Sep  7 15:37:44 meow kernel: ds: 007b   es: 007b   ss: 0068
Sep  7 15:37:44 meow kernel: Process afpd (pid: 3872,
threadinfo=d05de000 task=d6447550) Sep  7 15:37:44 meow kernel: Stack:
badc0ded f8b9d0d6 fffffff5 f8b9da70
f8b9d101 06609291 f7943000 00000000
Sep  7 15:37:44 meow kernel:        f8b9a499 7ffffff8 00000000 7ffffff8
00000000 d05dede8 d7636700 7ffffff8
Sep  7 15:37:44 meow kernel:        00000000 d05deea8 d05dee28 f8b9a987
00000001 7ffffff8 00000000 7ffffff8
Sep  7 15:37:44 meow kernel: Call Trace:
Sep  7 15:37:44 meow kernel:  [<f8b9a499>] add_lock+0x8e/0xed [lock_dlm]
Sep  7 15:37:44 meow kernel:  [<f8b9a987>] fill_gaps+0x87/0x10e
[lock_dlm] Sep  7 15:37:44 meow kernel:  [<f8b9aa51>]
lock_case3+0x43/0xac [lock_dlm] Sep  7 15:37:44 meow kernel:
[<f8b9aeac>] plock_internal+0x1aa/0x370 [lock_dlm] Sep  7 15:37:44 meow
kernel:  [<f8b9b614>] lm_dlm_plock+0x25b/0x2dc [lock_dlm] Sep  7
15:37:44 meow kernel:  [<f8b9b3b9>] lm_dlm_plock+0x0/0x2dc [lock_dlm]
Sep  7 15:37:44 meow kernel:  [<f8bdc1c3>] gfs_lm_plock+0x45/0x57 [gfs]
Sep  7 15:37:44 meow kernel:  [<f8be5731>] gfs_lock+0xcd/0x11c [gfs] Sep
7 15:37:44 meow kernel:  [<f8be5664>] gfs_lock+0x0/0x11c [gfs] Sep  7
15:37:44 meow kernel:  [<c0176c4f>] fcntl_setlk64+0x16c/0x26a Sep  7
15:37:44 meow kernel:  [<c0162e93>] fget+0x3b/0x42 Sep  7 15:37:44 meow
kernel:  [<c0172bfd>] sys_fcntl64+0x55/0x97 Sep  7 15:37:44 meow kernel:
[<c0104025>] syscall_call+0x7/0xb Sep  7 15:37:44 meow kernel: Code: 01
00 00 c7 04 24 a8 da b9 f8 e8 7c
77 58 c7 89 5c 24 04 c7 04 24 08 d1 b9 f8 e8 6c 77 58 c7 c7 04 24 d6 d0
b9 f8 e8 60
77 58 c7 <0f> 0b f4 01 70 da b9 f8 c7 04 24 10 db b9 f8 e8 78 6d 58 c7
55 Sep  7 15:37:44 meow kernel:  <0>Fatal exception: panic in 5 seconds


Thanks for any help,
  Ethan


--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
--------------------------------------------------------

CONFIDENTIALITY NOTICE: This e-mail message, including any attachments, is for the sole use of the intended recipient(s) and contains information that is confidential and proprietary to Applied Micro Circuits Corporation or its subsidiaries. It is to be used solely for the purpose of furthering the parties' business relationship. All unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message.


From clusterbuilder at gmail.com  Tue Oct 25 17:35:47 2005
From: clusterbuilder at gmail.com (Nick I)
Date: Tue, 25 Oct 2005 11:35:47 -0600
Subject: [Linux-cluster] Ask the Cluster Expert
Message-ID: <e073f9120510251035t5a34bf75j86a20d840ae874e7@mail.gmail.com>

Hi.

Thanks to the response from many in the community I have added sections
about diskless clusters and information on 32-bit and 64-bit processors at
the site I help run,
www.ClusterBuilder.org.<http://www.clusterbuilder.org/>I also added a
section called Ask the Cluster Expert (
http://www.clusterbuilder.org/pages/ask-the-expert.php) for people to submit
questions they have about cluster and grid computing. I post the questions
at an FAQ page (http://www.clusterbuilder.org/pages/ask-the-expert/faq.php)
and then research the answer as well as allow those knowledgeable in the
community to submit a response to the question. I want to build a valuable
knowledgebase of high performance computing information.

I need you to share your knowledge by adding to the question responses and
also submitting questions/answers to common problems you've experienced in
the past and are experiencing now.

A sample question could be certain operating systems on clusters.

Thanks,
Nick
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051025/332f8bff/attachment.htm>

From garyshi at gmail.com  Fri Oct 28 17:00:33 2005
From: garyshi at gmail.com (Gary Shi)
Date: Sat, 29 Oct 2005 01:00:33 +0800
Subject: [Linux-cluster] GFS over GNBD servers connected to a SAN?
Message-ID: <c2321aa40510281000n2dff5229i3dd38d004d6e7f08@mail.gmail.com>

The Administrator's Guide suggests 3 kinds of configurations, the second one
"GFS and GNBD with a SAN", servers running GFS share device export by GNBD
servers. I'm wondering the detail of such configuration. Does it have better
performance because it could distribute the load on single GNBD servers?
Compared to the 3rd way, "GFS and GNBD with Directly Connected Storage",
seems the only difference is we could export the same device through
different GNBD servers. Is it true? For example:

Suppose the SAN exports only 1 logical device, and we have 4 GNBD servers
connect to the SAN, and 32 application servers share the filesystem via GFS.
So the disk on SAN is /dev/sdb on each GNBD server. Can we use "gnbd_export
-d /dev/sdb -e test" to export the device a same name "test" on all GNBD
servers, make every 8 GFS servers share a GNBD server, and the total 32 GFS
nodes finally access the same SAN device?

What configuration is suggested for a high-performance GNBD server? How many
client is fair for a GNBD server?

BTW, is it possible to run NFS service on GFS nodes, and make different
client groups access different NFS servers, resulting in a lot of NFS
clients access a same shared filesystem?

--
regards,
Gary Shi
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051029/5d197a95/attachment.htm>

From philip.r.dana at nwp01.usace.army.mil  Mon Oct 31 14:45:07 2005
From: philip.r.dana at nwp01.usace.army.mil (Philip R. Dana)
Date: Mon, 31 Oct 2005 06:45:07 -0800
Subject: [Linux-cluster] Service/Resource group help needed
In-Reply-To: <1130450762.23803.41.camel@ayanami.boston.redhat.com>
References: <1130440083.2950.25.camel@nwp-wk-79033-l>
	<1130440656.3453.10.camel@auh5-0479.corp.jabil.org>
	<1130443149.2950.32.camel@nwp-wk-79033-l>
	<1130450762.23803.41.camel@ayanami.boston.redhat.com>
Message-ID: <1130769907.2950.60.camel@nwp-wk-79033-l>

Modprobe dlm resulted in a module not found error, a result of making
the edit to dlm-kernel.spec that Mr. Gray used. If dlm-kernel is built
with the spec file unmodified, then building cman-kernel errors out with
a file not found error. Back to square one.

The replies were all appreciated. Thanks.

On Thu, 2005-10-27 at 18:06 -0400, Lon Hohberger wrote:
> On Thu, 2005-10-27 at 12:59 -0700, Dana, Philip R NWP Contractor wrote:
> > Thanks for the quick reply. The rgmanager service is running on both
> > nodes, but I think I have lock (dlm) problems. From a service restart:
> > 
> > Oct 27 12:46:41 ns1-node1 clurgmgrd[3372]: <info> Services Initialized
> > Oct 27 12:46:41 ns1-node1 clurgmgrd[3372]: <info> Logged in SG
> > "usrm::manager"
> > Oct 27 12:46:41 ns1-node1 clurgmgrd[3372]: <info> Magma Event:
> > Membership Change
> > Oct 27 12:46:41 ns1-node1 clurgmgrd[3372]: <info> State change: Local
> > UP
> > Oct 27 12:46:41 ns1-node1 clurgmgrd[3372]: <err> #33: Unable to obtain
> > cluster lock: Operation not permitted
> 
> service rgmanager stop; modprobe dlm; service rgmanager start
> 
> -- Lon
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From erwan at seanodes.com  Mon Oct 31 16:20:15 2005
From: erwan at seanodes.com (Velu Erwan)
Date: Mon, 31 Oct 2005 17:20:15 +0100
Subject: [Linux-cluster] Readhead Issues using cluster-1.01.00
Message-ID: <4366443F.9000707@seanodes.com>

Hi,
I've been playing with cluster-1.01.00 and I found the reading very slow.
I've been trying to setup "max_readahead" using gfs_tool and 
performances are unchanged.
So I've read the code and placed some printk everywhere ;o)

Results is : in opts_file.c when you make a gfs_read you have the 
following code (this is right for both buffered and directio reads)

        if (gfs_is_jdata(ip) ||
            (gfs_is_stuffed(ip) && !test_bit(GIF_PAGED, &ip->i_flags)))
                count = do_read_readi(file, buf, size, offset);
        else
                count = generic_file_read(file, buf, size, offset);

In my case, I always uses generic_file_read because all conditions are 
equal to 0.
I'm not a gfs expert so I don't know if it's normal or not to not use 
the do_read_readi call to read the fs.

I've watched how do_read_readi() works and after a few calls you call 
gfs_start_ra (from dio.c).
This sounds perfect because it defines "uint32_t max_ra = 
gfs_tune_get(sdp, gt_max_readahead) >> sdp->sd_sb.sb_bsize_shift;";
This uses the default max_readahead set.

Now the other case, what's about generic_file_read (defined in the usual 
linux tree)
Of course, it doesn't know about the value you set (max_readahead).
So after reading how it works, the file structure owns a f_ra member 
which is defined for handling the readahead aspects.
I though about forcing this value before the generic_file_read call.
Please found this patch attached.
When I set a default value to file->f_ra.ra_pages to 512, my 
performances are 3 times better ! I've bupped from 40MB/sec to 120MB/sec.
I'm now reaching the normal performance I was expected.

Comments about this patch:
1?) I don't know if it's the cleanest way to do it but it works, so It 
shows that the readahead aspect is not handled while using 
generic_file_read.
2?) I've used a default value that match my hardware but it could be 
cleaner to use the "gfs_tune_get(sdp, gt_max_readahead)" call.
2bis?) I don't know how to call it because it needs the gfs_lock 
structure and I don't know how to provide one (I didn't read enough code 
for that).
3?) I need a gfs guru to finish this patch with point 2 & 2bis if my 
patch sounds relevant

Erwan,
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cluster-1.01.00-readahead.patch
Type: text/x-patch
Size: 447 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20051031/85b2bc2b/attachment.bin>

From bmarzins at redhat.com  Mon Oct 31 16:57:10 2005
From: bmarzins at redhat.com (Benjamin Marzinski)
Date: Mon, 31 Oct 2005 10:57:10 -0600
Subject: [Linux-cluster] GFS over GNBD servers connected to a SAN?
In-Reply-To: <c2321aa40510281000n2dff5229i3dd38d004d6e7f08@mail.gmail.com>
References: <c2321aa40510281000n2dff5229i3dd38d004d6e7f08@mail.gmail.com>
Message-ID: <20051031165710.GA24441@phlogiston.msp.redhat.com>

On Sat, Oct 29, 2005 at 01:00:33AM +0800, Gary Shi wrote:
>    The Administrator's Guide suggests 3 kinds of configurations, the second
>    one "GFS and GNBD with a SAN", servers running GFS share device export by
>    GNBD servers. I'm wondering the detail of such configuration. Does it have
>    better performance because it could distribute the load on single GNBD
>    servers? Compared to the 3rd way, "GFS and GNBD with Directly Connected
>    Storage", seems the only difference is we could export the same device
>    through different GNBD servers. Is it true? For example:
> 
>    Suppose the SAN exports only 1 logical device, and we have 4 GNBD servers
>    connect to the SAN, and 32 application servers share the filesystem via
>    GFS. So the disk on SAN is /dev/sdb on each GNBD server. Can we use
>    "gnbd_export -d /dev/sdb -e test" to export the device a same name "test"
>    on all GNBD servers, make every 8 GFS servers share a GNBD server, and the
>    total 32 GFS nodes finally access the same SAN device?

Well, it depends.  Using RHEL3 with pool, you can have multiple GNBD servers
exporting the same SAN device.  However, GNBD itself does not do the
multipathing. It simply has a mode (uncached mode) that allows multipathing
software to be run on top of it. The RHEL3 pool code has multipathing support.
However to do this, you must name the GNBD devices exported by each server
different names.  Otherwise, GNBD will not import them multiple devices with
the same name. Best practice is to name the device <basename>_<machinename>.
There are some of additional requirements for doing this.  For one, you MUST
have hardware based fencing on the GNBD servers, otherwise you risk corruption.
You MUST export ALL multipathed GNBD devices uncached, otherwise you WILL see
corruption and you WILL eventually destroy your entire filesystem. If you are
using the fence_gnbd fencing agent (and this is only recommended if you do
not have a hardware fencing machanism for the gnbd client machines. Otherwise
use that) you must set it to multipath style fenceing, or you risk corruption.
You should read the gnbd man pages (especially fence_gnbd.8 gnbd_export.8).
All of the multipath requirements are listed there. (search for "WARNING" in
the text for the necessary steps to avoid corruption).

In RHEL4, there is no pool device. Multipathing is handled by
device-mapper-multipath.  Unfortunately, this code is currently too SCSI
centric to work with GNBD, so this setup is impossible in RHEL4.

>    What configuration is suggested for a high-performance GNBD server? How
>    many client is fair for a GNBD server?

The largest number of GNBD clients I have heard of in a production setting is
128.  There is no reason why there couldn't be more.  The performance
bottleneck for setups with a high number of clients is in the network
connection. Since you have a single thread serving each client-server-device
instance, the gnbd server actually performs better (in terms of total
throughput) with more clients. Obviously, your per-client performance will
drop, usually due to limited network bandwith.

Having only one gnbd sever per device is obviously a single point of failure.
So if you are running with RHEL3, you may want multiple servers.  In practice.
People usually do just fine by designating a single node to be exclusively a
GNBD server (which means not running GFS on that node). If you are running
GULM, and would like to use your GNBD server as a GULM server, you should
have two network interfaces. One for lock traffic and one for block traffic.
Since gulm uses a lot of memory, no disk, and gnbd uses a lot of disk but
little memory, they can do well together.  However, if gulm can't send out
heartbeats in a timely manner, your nodes can get fenced durning periods of high
block IO.

With RHEL4, the only real difference is that you do not have the option of
multiple gnbd servers per SAN device. It's still best to use the gnbd server
exclusively for that purpose.

>    BTW, is it possible to run NFS service on GFS nodes, and make different
>    client groups access different NFS servers, resulting in a lot of NFS
>    clients access a same shared filesystem?
> 
>    --
>    regards,
>    Gary Shi

> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster