From buytenh at wantstofly.org Sun Aug 1 10:54:48 2004 From: buytenh at wantstofly.org (Lennert Buytenhek) Date: Sun, 1 Aug 2004 12:54:48 +0200 Subject: [Linux-cluster] new FC2 RPMS for CVS GFS snapshot of today Message-ID: <20040801105448.GA9682@xi.wantstofly.org> Hi, I've made some new FC2 GFS RPMS available for your enjoyment at: http://www2.wantstofly.org/gfs/20040801/ These are totally untested as of yet, since I unfortunately got distracted by some other projects requiring attention. Feedback would be most welcome. --L
From laza at yu.net Sun Aug 1 17:59:19 2004 From: laza at yu.net (Lazar Obradovic) Date: Sun, 01 Aug 2004 19:59:19 +0200 Subject: [Linux-cluster] SNMP modules? In-Reply-To: <1090861715.13809.3.camel@laza.eunet.yu> References: <1090861715.13809.3.camel@laza.eunet.yu> Message-ID: <1091383159.32177.14.camel@laza.eunet.yu> ok, here's the patch for ibm blade fencing agent... qlogic sanbox2, coming up next :) On Mon, 2004-07-26 at 19:08, Lazar Obradovic wrote: > Hello all, > > I'd like to develop my own fencing agents (for IBM BladeCenter and > QLogic SANBox2 switches), but they will require SNMP bindings. > > Is that ok with general development philosophy, since I'd like to > contribute them? net-snmp-5.x.x-based API? -- Lazar Obradovic, System Engineer ----- laza at YU.net YUnet International http://www.EUnet.yu Dubrovacka 35/III, 11000 Belgrade Tel: +381 11 3119901; Fax: +381 11 3119901 ----- This e-mail is confidential and intended only for the recipient. Unauthorized distribution, modification or disclosure of its contents is prohibited. If you have received this e-mail in error, please notify the sender by telephone +381 11 3119901. ----- -------------- next part -------------- A non-text attachment was scrubbed... Name: fence-ibmblade.patch Type: text/x-patch Size: 12640 bytes Desc: not available URL:
From laza at yu.net Sun Aug 1 23:40:55 2004 From: laza at yu.net (Lazar Obradovic) Date: Mon, 02 Aug 2004 01:40:55 +0200 Subject: [Linux-cluster] SNMP modules? In-Reply-To: <1091383159.32177.14.camel@laza.eunet.yu> References: <1090861715.13809.3.camel@laza.eunet.yu> <1091383159.32177.14.camel@laza.eunet.yu> Message-ID: <1091403655.6495.17.camel@laza.eunet.yu> both things in one patch... On Sun, 2004-08-01 at 19:59, Lazar Obradovic wrote: > ok, here's the patch for ibm blade fencing agent... > qlogic sanbox2, coming up next :) > > On Mon, 2004-07-26 at 19:08, Lazar Obradovic wrote: > > Hello all, > > > > I'd like to develop my own fencing agents (for IBM BladeCenter and > > QLogic SANBox2 switches), but they will require SNMP bindings. > > > > Is that ok with general development philosophy, since I'd like to > > contribute them? net-snmp-5.x.x-based API? -- Lazar Obradovic, System Engineer ----- laza at YU.net YUnet International http://www.EUnet.yu Dubrovacka 35/III, 11000 Belgrade Tel: +381 11 3119901; Fax: +381 11 3119901 ----- This e-mail is confidential and intended only for the recipient. Unauthorized distribution, modification or disclosure of its contents is prohibited. If you have received this e-mail in error, please notify the sender by telephone +381 11 3119901. ----- -------------- next part -------------- A non-text attachment was scrubbed... Name: fence-blade_sanbox2.patch Type: text/x-patch Size: 23698 bytes Desc: not available URL:
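Since both patches were scrubbed from the archive above, here is a rough sketch of what an SNMP-based fencing agent of this kind boils down to: one SNMP set against the switch or management module, followed by a read-back to verify. The address, community string and power-control OID below are placeholders, not values taken from Lazar's patches.

  #!/bin/sh
  # Sketch only: fence a blade by powering it off over SNMP with the
  # net-snmp command-line tools. All values here are placeholders.
  MM="192.168.1.10"                      # management module address (placeholder)
  COMMUNITY="private"                    # SNMP write community (placeholder)
  BAY="3"                                # blade bay to fence
  POWER_OID="<power-control-OID>.$BAY"   # hypothetical OID, indexed by bay
  snmpset -v1 -c "$COMMUNITY" "$MM" "$POWER_OID" i 0   # 0 = power off (assumed)
  snmpget -v1 -c "$COMMUNITY" "$MM" "$POWER_OID"       # read back to verify

A real agent would typically also read its options as key=value pairs on standard input, the way the fencing daemon invokes agents, and report success or failure through its exit status.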
From smelkovs at worldsoft.ch Mon Aug 2 11:27:19 2004 From: smelkovs at worldsoft.ch (Konrads Smelkovs) Date: Mon, 02 Aug 2004 13:27:19 +0200 Subject: [Linux-cluster] incompatible configurations Message-ID: <410E2517.4040505@worldsoft.ch> Hello, I have two nodes connected to the SAN (san1 and san2), and one that is not (eth1, for locking purposes). I've set the fencing method to manual. All three nodes run the same cluster configuration (same file). However, when eth1 connects to san1 or san2, the other boxes display: Aug 2 13:26:21 san1 lock_gulmd_core[2762]: Config CRC doesn't match. ( 20681140 != 4177759411 ) Aug 2 13:26:21 san1 lock_gulmd_core[2762]: We gave them(cmsfe01) an error (1004:Incompatible configurations). Aug 2 13:26:21 san1 lock_gulmd_core[2762]: Closing connection idx:4, fd:9 to 192.168.100.151 Why? P.S. This is for purely testing purposes.
From danderso at redhat.com Mon Aug 2 14:45:02 2004 From: danderso at redhat.com (Derek Anderson) Date: Mon, 2 Aug 2004 09:45:02 -0500 Subject: [Linux-cluster] incompatible configurations In-Reply-To: <410E2517.4040505@worldsoft.ch> References: <410E2517.4040505@worldsoft.ch> Message-ID: <200408020945.02895.danderso@redhat.com> Check your /etc/hosts file on each machine. If the hostname is in the loopback address line (127.0.0.1) take it out and retry. i.e. s/127.0.0.1 san1 localhost.localdomain localhost/127.0.0.1 localhost.localdomain localhost/ On Monday 02 August 2004 06:27, Konrads Smelkovs wrote: > Hello, > I have two nodes connected to SAN(san1 and san2), and one that is not > (eth1, for locking purposes). > I've set the fencing method to manual. > All three nodes run the same cluster configuration (same file). However > when eth1 connects to san1 or san2 ,the other boxes display: > Aug 2 13:26:21 san1 lock_gulmd_core[2762]: Config CRC doesn't match. ( > 20681140 != 4177759411 ) > Aug 2 13:26:21 san1 lock_gulmd_core[2762]: We gave them(cmsfe01) an > error (1004:Incompatible configurations). > Aug 2 13:26:21 san1 lock_gulmd_core[2762]: Closing connection idx:4, > fd:9 to 192.168.100.151 > > why? > P.S. This is for purely testing purposes. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster
From mnerren at paracel.com Mon Aug 2 14:48:02 2004 From: mnerren at paracel.com (micah nerren) Date: Mon, 02 Aug 2004 07:48:02 -0700 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine Message-ID: <1091458082.8356.23.camel@angmar> Hello, I am having some problems setting up GFS 6.0. I have the src rpms, and built them against kernel-2.4.21-15.ELsmp on x86_64. The kernel modules all load, and everything seems to build properly. I create the pool device, and am able to create the file system as well. However, when I do the mount command (mount -t gfs /dev/pool/pool_gfs01 /gfs01), the machine crashes instantly. lock_gulm seems to be the culprit; however, I cannot get any useful information out of the system about why this is happening. No logs, just whammo - the system dies instantly.
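For readers following along, the sequence just described boils down to roughly the commands below. The cluster name, filesystem name and journal count are placeholders rather than values taken from this report, and the pool/CCS configuration is assumed to already be in place.

  # Rough sketch of the reported sequence (GFS 6.0 with lock_gulm); the
  # names "alpha"/"gfs01" and the journal count are placeholders.
  rpmbuild --rebuild GFS-6.0.0-1.2.src.rpm              # build GFS for the running kernel
  modprobe pool && modprobe lock_gulm && modprobe gfs   # load the kernel modules
  pool_assemble -a                                      # activate the pool devices
  service lock_gulmd start                              # lock server must be running
  gfs_mkfs -p lock_gulm -t alpha:gfs01 -j 2 /dev/pool/pool_gfs01
  mount -t gfs /dev/pool/pool_gfs01 /gfs01              # the step that kills the machine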
My platform information is as follows: Rocks 3.1 (based on RHEL WS 3.0) Dual 1.4 opterons Kernel from rpm: kernel-smp-2.4.21-15.EL kernel source from rpm: kernel-source-2.4.21-15.EL GFS: GFS-6.0.0-1.2.src.rpm My dev files look like: /dev/pool: brw------- 2 root root 254, 65 Jul 31 01:12 hopkins_cca brw------- 2 root root 254, 66 Jul 31 01:12 pool_gfs01 Modules are loaded: gfs 261792 0 (unused) lock_gulm 68960 0 (unused) lock_harness 4048 0 [gfs lock_gulm] pool 85760 3 uname -a: Linux frontend-0.public 2.4.21-15.ELsmp #1 SMP Thu Apr 22 00:09:01 EDT 2004 x86_64 x86_64 x86_64 GNU/Linux [root at frontend-0 root]# pool_tool -s Device Pool Label ====== ========== /dev/pool/hopkins_cca <- CCA device -> /dev/pool/pool_gfs01 <- GFS filesystem -> /dev/sda <- partition information -> /dev/sda1 hopkins_cca /dev/sda2 pool_gfs01 And when I do the following mount command: mount -t gfs /dev/pool/pool_gfs01 /gfs01 The system crashes. At the console, there are tons of system calls being listed, and at the bottom of the screen: Code: 39 d0 75 f8 85 c9 74 10 8b 44 24 14 39 d0 74 08 8b 44 24 14 Console Shuts up: pid: 3547, lock_gulmd Not tainted RIP: 0010 So... Any ideas on what may be causing this? This seems to be a supported platform for this tool according to redhat. Has anybody used GFS 6.0 on this kernel rev on x86_64 and if so, how did you get it to work? Other information: HBA: QLogic Corp. QLA2312 Fibre Channel Adapter SAN Switch: Qlogic SANBOX Thank you for any help you may have!! Micah From mtilstra at redhat.com Mon Aug 2 14:54:35 2004 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Mon, 2 Aug 2004 09:54:35 -0500 Subject: [Linux-cluster] incompatible configurations In-Reply-To: <410E2517.4040505@worldsoft.ch> References: <410E2517.4040505@worldsoft.ch> Message-ID: <20040802145435.GA3266@redhat.com> On Mon, Aug 02, 2004 at 01:27:19PM +0200, Konrads Smelkovs wrote: > Hello, > I have two nodes connected to SAN(san1 and san2), and one that is not > (eth1, for locking purposes). > I've set the fencing method to manual. > All three nodes run the same cluster configuration (same file). However > when eth1 connects to san1 or san2 ,the other boxes display: > Aug 2 13:26:21 san1 lock_gulmd_core[2762]: Config CRC doesn't match. ( > 20681140 != 4177759411 ) config files aren't matching in the important bits. run lock_gulmd -C (plus any other cmd line params you had) the -C will cause lock_gulmd to print what it thinks the config is, and exit. Do this on both nodes, and see what is different. Common problem is what derek pointed at. (with the loopback address getting the host's name.) -- Michael Conrad Tadpol Tilstra an experation date on distilled water......? -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From amanthei at redhat.com Mon Aug 2 15:46:05 2004 From: amanthei at redhat.com (Adam Manthei) Date: Mon, 2 Aug 2004 10:46:05 -0500 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <1091458082.8356.23.camel@angmar> References: <1091458082.8356.23.camel@angmar> Message-ID: <20040802154605.GC1518@redhat.com> On Mon, Aug 02, 2004 at 07:48:02AM -0700, micah nerren wrote: > > The system crashes. At the console, there are tons of system calls being > listed, and at the bottom of the screen: > > Code: 39 d0 75 f8 85 c9 74 10 8b 44 24 14 39 d0 74 08 8b 44 24 14 > Console Shuts up: > pid: 3547, lock_gulmd Not tainted > RIP: 0010 > > > So... 
Any ideas on what may be causing this? Those "tons of system calls being listed" are really quite useful if not necessary to tell you what the problem is. My gut feeling is that there is a stack overrun that is happening. > This seems to be a supported platform for this tool according to redhat. Correct. > Has anybody used GFS 6.0 on this kernel rev on x86_64 and if so, > how did you get it to work? Yup. I just used the rpms. Perhaps you compiled it with debugging options enabled? (I don't know if that would make the stack bigger) -- Adam Manthei From mnerren at paracel.com Mon Aug 2 16:06:48 2004 From: mnerren at paracel.com (micah nerren) Date: Mon, 02 Aug 2004 09:06:48 -0700 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <20040802154605.GC1518@redhat.com> References: <1091458082.8356.23.camel@angmar> <20040802154605.GC1518@redhat.com> Message-ID: <1091462808.8356.39.camel@angmar> Hi, On Mon, 2004-08-02 at 08:46, Adam Manthei wrote: > On Mon, Aug 02, 2004 at 07:48:02AM -0700, micah nerren wrote: > > > > The system crashes. At the console, there are tons of system calls being > > listed, and at the bottom of the screen: > > > > Code: 39 d0 75 f8 85 c9 74 10 8b 44 24 14 39 d0 74 08 8b 44 24 14 > > Console Shuts up: > > pid: 3547, lock_gulmd Not tainted > > RIP: 0010 > > > > > > So... Any ideas on what may be causing this? > > Those "tons of system calls being listed" are really quite useful if not > necessary to tell you what the problem is. My gut feeling is that there is > a stack overrun that is happening. I could try to post them if anybody would find that useful. I will write all that down and attempt to post it coherently. Is there any way to capture that kind of info to a file? > > This seems to be a supported platform for this tool according to redhat. > > Correct. > > > Has anybody used GFS 6.0 on this kernel rev on x86_64 and if so, > > how did you get it to work? > > Yup. I just used the rpms. Perhaps you compiled it with debugging options > enabled? (I don't know if that would make the stack bigger) All I did was 'rpmbuild --rebuild GFS-6.0.0-1.2.src.rpm' That created the following rpms: GFS-6.0.0-1.2.x86_64.rpm GFS-debuginfo-6.0.0-1.2.x86_64.rpm GFS-devel-6.0.0-1.2.x86_64.rpm GFS-modules-6.0.0-1.2.x86_64.rpm GFS-modules-smp-6.0.0-1.2.x86_64.rpm Of those, I have the following actually installed: GFS-modules-smp-6.0.0-1.2 GFS-6.0.0-1.2 Do you have any build instructions for getting them to work properly? Could something built into my running kernel cause this? I am building a new kernel from source right now to see if the binary kernel rpm I used had some sort of problem. Could it be related to the HBA I am using as well? Thanks! Micah From phillips at redhat.com Mon Aug 2 16:15:01 2004 From: phillips at redhat.com (Daniel Phillips) Date: Mon, 2 Aug 2004 12:15:01 -0400 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <1091462808.8356.39.camel@angmar> References: <1091458082.8356.23.camel@angmar> <20040802154605.GC1518@redhat.com> <1091462808.8356.39.camel@angmar> Message-ID: <200408021215.01082.phillips@redhat.com> On Monday 02 August 2004 12:06, micah nerren wrote: > Hi, > > On Mon, 2004-08-02 at 08:46, Adam Manthei wrote: > > On Mon, Aug 02, 2004 at 07:48:02AM -0700, micah nerren wrote: > > > The system crashes. 
At the console, there are tons of system > > > calls being listed, and at the bottom of the screen: > > > > > > Code: 39 d0 75 f8 85 c9 74 10 8b 44 24 14 39 d0 74 08 8b 44 24 14 > > > Console Shuts up: > > > pid: 3547, lock_gulmd Not tainted > > > RIP: 0010 > > > > > > > > > So... Any ideas on what may be causing this? > > > > Those "tons of system calls being listed" are really quite useful > > if not necessary to tell you what the problem is. My gut feeling > > is that there is a stack overrun that is happening. > > I could try to post them if anybody would find that useful. I will > write all that down and attempt to post it coherently. Is there any > way to capture that kind of info to a file? The traditional way is to connect a serial cable and direct console output to the serial port. You can then cut and paste the console messages from a nice graphical buffer. Other ways are proposed, such as crash dump to disk and network crash dump, but I don't think any have made it to mainline yet, though patches are available. (More voices asking for it on lkml would help.) Regards, Daniel From amanthei at redhat.com Mon Aug 2 16:21:23 2004 From: amanthei at redhat.com (Adam Manthei) Date: Mon, 2 Aug 2004 11:21:23 -0500 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <1091462808.8356.39.camel@angmar> References: <1091458082.8356.23.camel@angmar> <20040802154605.GC1518@redhat.com> <1091462808.8356.39.camel@angmar> Message-ID: <20040802162123.GE1518@redhat.com> On Mon, Aug 02, 2004 at 09:06:48AM -0700, micah nerren wrote: > Hi, > > > On Mon, 2004-08-02 at 08:46, Adam Manthei wrote: > > On Mon, Aug 02, 2004 at 07:48:02AM -0700, micah nerren wrote: > > > > > > The system crashes. At the console, there are tons of system calls being > > > listed, and at the bottom of the screen: > > > > > > Code: 39 d0 75 f8 85 c9 74 10 8b 44 24 14 39 d0 74 08 8b 44 24 14 > > > Console Shuts up: > > > pid: 3547, lock_gulmd Not tainted > > > RIP: 0010 > > > > > > > > > So... Any ideas on what may be causing this? > > > > Those "tons of system calls being listed" are really quite useful if not > > necessary to tell you what the problem is. My gut feeling is that there is > > a stack overrun that is happening. > > I could try to post them if anybody would find that useful. I will write > all that down and attempt to post it coherently. Is there any way to > capture that kind of info to a file? The best way is to connect it to a serial console to grab the output. Depending on the state of the machine, you may even be able to grab that with 'dmesg' or the syslogs (although it's not likely to have made it to the syslog). > > > Has anybody used GFS 6.0 on this kernel rev on x86_64 and if so, > > > how did you get it to work? > > > > Yup. I just used the rpms. Perhaps you compiled it with debugging options > > enabled? (I don't know if that would make the stack bigger) > > All I did was 'rpmbuild --rebuild GFS-6.0.0-1.2.src.rpm' > > That created the following rpms: > GFS-6.0.0-1.2.x86_64.rpm > GFS-debuginfo-6.0.0-1.2.x86_64.rpm > GFS-devel-6.0.0-1.2.x86_64.rpm > GFS-modules-6.0.0-1.2.x86_64.rpm > GFS-modules-smp-6.0.0-1.2.x86_64.rpm > > Of those, I have the following actually installed: > > GFS-modules-smp-6.0.0-1.2 > GFS-6.0.0-1.2 > > > Do you have any build instructions for getting them to work properly? What you did makes sense to me. > Could something built into my running kernel cause this? 
I am building a > new kernel from source right now to see if the binary kernel rpm I used > had some sort of problem. > > Could it be related to the HBA I am using as well? If it is a stack overflow, then yes, it _could_ be related, but I'm not going to blame that just yet ;) > Thanks! > > Micah > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- Adam Manthei From david.n.lombard at intel.com Mon Aug 2 16:33:06 2004 From: david.n.lombard at intel.com (Lombard, David N) Date: Mon, 2 Aug 2004 09:33:06 -0700 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine Message-ID: <187D3A7CAB42A54DB61F1D05F0125722039C1567@orsmsx402.amr.corp.intel.com> From: micah Nerren; Monday, August 02, 2004 9:07 AM > >On Mon, 2004-08-02 at 08:46, Adam Manthei wrote: >> On Mon, Aug 02, 2004 at 07:48:02AM -0700, micah nerren wrote: >> > >> > The system crashes. At the console, there are tons of system calls >being >> > listed, and at the bottom of the screen: >> > >> > Code: 39 d0 75 f8 85 c9 74 10 8b 44 24 14 39 d0 74 08 8b 44 24 14 >> > Console Shuts up: >> > pid: 3547, lock_gulmd Not tainted >> > RIP: 0010 >> > >> > >> > So... Any ideas on what may be causing this? >> >> Those "tons of system calls being listed" are really quite useful if not >> necessary to tell you what the problem is. My gut feeling is that there >is >> a stack overrun that is happening. > >I could try to post them if anybody would find that useful. I will write >all that down and attempt to post it coherently. Is there any way to >capture that kind of info to a file? Set up the kernel for serial console, and capture with minicom on another system via a null modem cable. -- David N. Lombard My comments represent my opinions, not those of Intel Corporation. From jeff at intersystems.com Mon Aug 2 20:20:45 2004 From: jeff at intersystems.com (Jeff) Date: Mon, 2 Aug 2004 16:20:45 -0400 Subject: [Linux-cluster] segfault if dlm is loaded while cman is still joining the cluster Message-ID: <118314539.20040802162045@intersystems.com> Is there a bug tracker somewhere or should we just post them to this list? -------------------------------------------------------------- This is on a dual-cpu box (FC2) with hyperthreading enabled (eg. for a total of 4 logical CPUs). If I issue the following commands where I type each command as soon as the prior command completes I get a segfault loading the dlm. The code is from CVS/latest. [root at lx4 cluster_orig]# ccsd [root at lx4 cluster_orig]# cman_tool join [root at lx4 cluster_orig]# modprobe dlm Segmentation fault [root at lx4 cluster_orig]# modprobe dlm [root at lx4 cluster_orig]# dmesg CMAN: Waiting to join or form a Linux-cluster CMAN (built Aug 2 2004 15:04:09) installed kmem_cache_create: duplicate cache cluster_sock ------------[ cut here ]------------ kernel BUG at mm/slab.c:1392! 
invalid operand: 0000 [#1] SMP Modules linked in: cman parport_pc lp parport autofs4 nfs lockd sunrpc e1000 3c59x floppy sg microcode dm_mod uhci_hcd button battery asus_acpi ac ipv6 ext3 jbd aic7xxx sd_mod scsi_mod CPU: 0 EIP: 0060:[] Not tainted EFLAGS: 00010202 (2.6.7-clu-smp) EIP is at kmem_cache_create+0x4c6/0x660 eax: 00000030 ebx: c22f4770 ecx: c0487c98 edx: 00004ce1 esi: c033a366 edi: f8aa662d ebp: f51d7b80 esp: f3fb0f5c ds: 007b es: 007b ss: 0068 Process modprobe (pid: 5476, threadinfo=f3fb0000 task=f43ce230) Stack: c031b3c8 f8aa6620 f51d7c38 0000000a c0000000 ffffff80 00000080 f8aa6620 00000080 c0356fe0 f8aae200 c0356fc4 c0356fc4 f88a804e 00002000 00000000 00000000 f8aa6605 c013a5c7 f6b7daa0 00000000 40018008 0807a1a0 00ccaffc Call Trace: [] cluster_init+0x4e/0x3f9 [cman] [] sys_init_module+0x107/0x220 [] sysenter_past_esp+0x52/0x71 Code: 0f 0b 70 05 2d ad 31 c0 8b 0b e9 5b ff ff ff 8b 87 b0 00 00 DLM (built Aug 2 2004 15:04:29) installed CMAN: sending membership request CMAN: got node lx3 CMAN: quorum regained, resuming activity [root at lx4 cluster_orig]# The cpuinfo for the 4 cpu's is pretty much the same. Here's one of them: processor : 0 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) XEON(TM) CPU 1.80GHz stepping : 4 cpu MHz : 1779.842 cache size : 512 KB physical id : 0 siblings : 2 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm bogomips : 3514.36 From mnerren at paracel.com Mon Aug 2 20:59:55 2004 From: mnerren at paracel.com (micah nerren) Date: Mon, 02 Aug 2004 13:59:55 -0700 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <20040802154605.GC1518@redhat.com> References: <1091458082.8356.23.camel@angmar> <20040802154605.GC1518@redhat.com> Message-ID: <1091480394.8356.58.camel@angmar> On Mon, 2004-08-02 at 08:46, Adam Manthei wrote: > On Mon, Aug 02, 2004 at 07:48:02AM -0700, micah nerren wrote: > > > > The system crashes. At the console, there are tons of system calls being > > listed, and at the bottom of the screen: > > > > Code: 39 d0 75 f8 85 c9 74 10 8b 44 24 14 39 d0 74 08 8b 44 24 14 > > Console Shuts up: > > pid: 3547, lock_gulmd Not tainted > > RIP: 0010 > > > > > > So... Any ideas on what may be causing this? > > Those "tons of system calls being listed" are really quite useful if not > necessary to tell you what the problem is. My gut feeling is that there is > a stack overrun that is happening. Ok, here is a capture of the crash occurring. Note that the message is slightly different than the one I posted before, the end changes, however the calls it is making look very similar. I also went and upgraded the kernel to the lastest from RHEL 3 WS. I upgraded GFS to GFS-6.0.0-7.src.rpm. Still crashing. Here is my entire boot log from power on, to mount crash. Prior to the crash, I did the following: logged in as root via ssh depmod -a modprobe lock_gulm modprobe gfs (module pool was already loaded at boot time.) mount -t gfs /dev/pool/pool_gfs01 /mnt/gfs CRASH I hope this helps!! 
Micah //////////////////////// Bootdata ok (command line is ro root=LABEL=/ noapic console=ttyS0,38400) Linux version 2.4.21-15.0.3.ELsmp (bhcompile at thor.perf.redhat.com) (gcc version 3.2.3 20030502 (Red Hat Linux 3.2.3-37)) #1 SMP Tue Jun 29 17:46:55 EDT 2004 BIOS-provided physical RAM map: BIOS-e820: 0000000000000000 - 000000000009b800 (usable) BIOS-e820: 000000000009b800 - 00000000000a0000 (reserved) BIOS-e820: 00000000000cc000 - 0000000000100000 (reserved) BIOS-e820: 0000000000100000 - 000000007ff80000 (usable) BIOS-e820: 000000007ff80000 - 0000000080000000 (reserved) BIOS-e820: 00000000fec00000 - 00000000fec00400 (reserved) BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved) BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved) kernel direct mapping tables upto 10100000000 @ 8000-d000 Scanning NUMA topology in Northbridge 24 Node 0 using interleaving mode 1/0 No NUMA configuration found Faking a node at 0000000000000000-000000007ff80000 Bootmem setup node 0 0000000000000000-000000007ff80000 found SMP MP-table at 000f69a0 hm, page 000f6000 reserved twice. hm, page 000f7000 reserved twice. hm, page 0009b000 reserved twice. hm, page 0009c000 reserved twice. setting up node 0 0-7ff80 On node 0 totalpages: 524160 zone(0): 4096 pages. zone(1): 520064 pages. zone(2): 0 pages. ACPI: Unable to locate RSDP Intel MultiProcessor Specification v1.4 Virtual Wire compatibility mode. OEM ID: AMD <6>Product ID: HAMMER <6>APIC at: 0xFEE00000 Processor #0 15:5 APIC version 16 Processor #1 15:5 APIC version 16 I/O APIC #2 Version 17 at 0xFEC00000. I/O APIC #3 Version 17 at 0xFC000000. I/O APIC #4 Version 17 at 0xFC001000. Processors: 2 Kernel command line: ro root=LABEL=/ noapic console=ttyS0,38400 Initializing CPU#0 time.c: Detected 1.193182 MHz PIT timer. time.c: Detected 1403.229 MHz TSC timer. Console: colour VGA+ 80x25 Calibrating delay loop... 2798.38 BogoMIPS Memory: 2034216k/2096640k available (1797k kernel code, 0k reserved, 1862k data, 224k init) Dentry cache hash table entries: 262144 (order: 10, 4194304 bytes) Inode cache hash table entries: 131072 (order: 9, 2097152 bytes) Mount cache hash table entries: 256 (order: 0, 4096 bytes) Buffer cache hash table entries: 131072 (order: 8, 1048576 bytes) Page-cache hash table entries: 524288 (order: 10, 4194304 bytes) CPU: L1 I Cache: 64K (64 bytes/line/2 way), D cache 64K (64 bytes/line/2 way) CPU: L2 Cache: 1024K (64 bytes/line/8 way) Machine Check Reporting enabled for CPU#0 POSIX conformance testing by UNIFIX mtrr: v2.02 (20020716)) CPU: L1 I Cache: 64K (64 bytes/line/2 way), D cache 64K (64 bytes/line/2 way) CPU: L2 Cache: 1024K (64 bytes/line/8 way) CPU0: AMD Opteron(tm) Processor 240 stepping 01 per-CPU timeslice cutoff: 5119.55 usecs. task migration cache decay timeout: 10 msecs. Booting processor 1/1 rip 6000 page 00000100077e2000 Initializing CPU#1 Calibrating delay loop... 2804.94 BogoMIPS CPU: L1 I Cache: 64K (64 bytes/line/2 way), D cache 64K (64 bytes/line/2 way) CPU: L2 Cache: 1024K (64 bytes/line/8 way) Machine Check Reporting enabled for CPU#1 CPU1: AMD Opteron(tm) Processor 240 stepping 01 Total of 2 processors activated (5603.32 BogoMIPS). Using local APIC timer interrupts. Detected 12.528 MHz APIC timer. cpu: 0, clocks: 2004614, slice: 668204 CPU0 cpu: 1, clocks: 2004614, slice: 668204 CPU1 checking TSC synchronization across CPUs: passed. time.c: Using PIT based timekeeping. 
Starting migration thread for cpu 0 Starting migration thread for cpu 1 ACPI: Subsystem revision 20030619 PCI: Using configuration type 1 ACPI: System description tables not found ACPI-0084: *** Error: acpi_load_tables: Could not get RSDP, AE_NOT_FOUND ACPI-0134: *** Error: acpi_load_tables: Could not load tables: AE_NOT_FOUND ACPI: Unable to load the System Description Tables PCI: Probing PCI hardware PCI: Using IRQ router default [1022/746b] at 00:07.3 Linux agpgart interface v0.99 (c) Jeff Hartmann agpgart: Maximum main memory to use for agp memory: 1919M PCI-DMA: Disabling IOMMU. Linux NET4.0 for Linux 2.4 Based upon Swansea University Computer Society NET3.039 Initializing RT netlink socket Starting kswapd VFS: Disk quotas vdquot_6.5.1 aio_setup: num_physpages = 131040 aio_setup: sizeof(struct page) = 104 Hugetlbfs mounted. Total HugeTLB memory allocated, 0 IA32 emulation $Id: sys_ia32.c,v 1.56 2003/04/10 10:45:37 ak Exp $ pty: 2048 Unix98 ptys configured Serial driver version 5.05c (2001-07-08) with MANY_PORTS MULTIPORT SHARE_IRQ SERIAL_PCI SERIAL_ACPI enabled ttyS0 at 0x03f8 (irq = 4) is a 16550A Real Time Clock Driver v1.10e NET4: Frame Diverter 0.46 RAMDISK driver initialized: 256 RAM disks of 8192K size 1024 blocksize Uniform Multi-Platform E-IDE driver Revision: 7.00beta4-2.4 ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx AMD8111: IDE controller at PCI slot 00:07.1 AMD8111: chipset revision 3 AMD8111: not 100% native mode: will probe irqs later ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx AMD_IDE: Advanced Micro Devices [AMD] AMD-8111 IDE (rev 03) UDMA100 controller on pci00:07.1 ide0: BM-DMA at 0x1020-0x1027, BIOS settings: hda:DMA, hdb:pio ide1: BM-DMA at 0x1028-0x102f, BIOS settings: hdc:pio, hdd:pio hda: WDC WD600JB-00CRA1, ATA DISK drive ide0 at 0x1f0-0x1f7,0x3f6 on irq 14 hda: attached ide-disk driver. hda: host protected area => 1 hda: 117231408 sectors (60022 MB) w/8192KiB Cache, CHS=116301/16/63, UDMA(100) ide-floppy driver 0.99.newide Partition check: hda: hda1 hda2 hda3 ide-floppy driver 0.99.newide md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27 md: Autodetecting RAID arrays. md: autorun ... md: ... autorun DONE. Initializing Cryptographic API NET4: Linux TCP/IP 1.0 for NET4.0 IP: routing cache hash table of 8192 buckets, 128Kbytes TCP: Hash tables configured (established 262144 bind 65536) Linux IP multicast router 0.06 plus PIM-SM Initializing IPsec netlink socket NET4: Unix domain sockets 1.0/SMP for Linux NET4.0. RAMDISK: Compressed image found at block 0 VFS: Mounted root (ext2 filesystem). Red Hat nash version 3.5.13 starSCSI subsystem driver Revision: 1.00 ting Loading scsi_mod.o module Loading sd_mod.o module Loadinqla2x00_set_info starts at address = ffffffffa00230c0 g qla2300.o moduqla2x00: Found VID=1077 DID=2312 SSVID=1077 SSDID=101 scsi(0): Found a QLA2312 @ bus 2, device 0x1, irq 5, iobase 0xffffff0000013000 le scsi(0): Allocated 4096 SRB(s). scsi(0): Configure NVRAM parameters... scsi(0): 64 Bit PCI Addressing Enabled. qla2x00_nvram_config ZIO enabled:intr_timer_delay=3 scsi(0): Verifying loaded RISC code... scsi(0): Verifying chip... scsi(0): Waiting for LIP to complete... scsi(0): Cable is unplugged... scsi-qla0-adapter-node=200000e08b17cf0f\; scsi-qla0-adapter-port=210000e08b17cf0f\; qla2x00: Found VID=1077 DID=2312 SSVID=1077 SSDID=101 scsi(1): Found a QLA2312 @ bus 2, device 0x1, irq 10, iobase 0xffffff0000015000 scsi(1): Allocated 4096 SRB(s). scsi(1): Configure NVRAM parameters... 
scsi(1): 64 Bit PCI Addressing Enabled. qla2x00_nvram_config ZIO enabled:intr_timer_delay=3 scsi(1): Verifying loaded RISC code... scsi(1): Verifying chip... scsi(1): Waiting for LIP to complete... scsi(1): LOOP UP detected. scsi(1): Port database changed. scsi(1): Topology - (F_Port), Host Loop address 0xffff qla2x00_configure_fcports(1): LOOP READY scsi-qla1-adapter-node=200100e08b37cf0f\; scsi-qla1-adapter-port=210100e08b37cf0f\; scsi-qla1-tgt-0-di-0-port=22000004cffd1447\; scsi-qla1-tgt-1-di-0-port=22000004cffd1411\; scsi-qla1-tgt-2-di-0-port=22000004cffd0254\; scsi-qla1-tgt-3-di-0-port=22000004cffcec36\; scsi(1) qla2x00_isr MBA_PORT_UPDATE ignored scsi0 : QLogic QLA2312 PCI to Fibre Channel Host Adapter: bus 2 device 1 irq 5 Firmware version: 3.02.24, Driver version 6.07.02-RH2 scsi1 : QLogic QLA2312 PCI to Fibre Channel Host Adapter: bus 2 device 1 irq 10 Firmware version: 3.02.24, Driver version 6.07.02-RH2 Vendor: SEAGATE Model: ST336607FC Rev: 0006 Type: Direct-Access ANSI SCSI revision: 03 Vendor: SEAGATE Model: ST336607FC Rev: 0006 Type: Direct-Access ANSI SCSI revision: 03 Vendor: SEAGATE Model: ST336607FC Rev: 0006 Type: Direct-Access ANSI SCSI revision: 03 Vendor: SEAGATE Model: ST336607FC Rev: 0006 Type: Direct-Access ANSI SCSI revision: 03 scsi(1:0:0:0): Enabled tagged queuing, queue depth 64. scsi(1:0:1:0): Enabled tagged queuing, queue depth 64. scsi(1:0:2:0): Enabled tagged queuing, queue depth 64. scsi(1:0:3:0): Enabled tagged queuing, queue depth 64. Attached scsi disk sda at scsi1, channel 0, id 0, lun 0 Attached scsi disk sdb at scsi1, channel 0, id 1, lun 0 Attached scsi disk sdc at scsi1, channel 0, id 2, lun 0 Attached scsi disk sdd at scsi1, channel 0, id 3, lun 0 SCSI device sda: 71687372 512-byte hdwr sectors (36704 MB) sda: sda1 sda2 SCSI device sdb: 71687372 512-byte hdwr sectors (36704 MB) sdb: sdb1 sdb2 SCSI device sdc: 71687372 512-byte hdwr sectors (36704 MB) sdc: sdc1 sdc2 SCSI device sdd: 71687372 512-byte hdwr sectors (36704 MB) sdd: sdd1 sdd2 Loading jbd.o module Journalled Block Device driver loaded Loading ext3.o module Mounting /proc filesystem Creating block devices Creating root device Mounting root filesystem kjournald starting. Commit interval 5 seconds EXT3-fs: mounted filesystem with ordered data mode. spurious 8259A interrupt: IRQ7. Freeing unused kernel memory: 224k freed INIT: version 2.85 booting Welcome to Rocks Press 'I' to enter interactive startup. Unmounting initrd: [ OK ] Configuring kernel parameters: [ OK ] Setting clock (utc): Mon Aug 2 20:42:30 GMT 2004 [ OK ] Setting hostname frontend-0.public: [ OK ] Initializing USB controller (usb-ohci): [ OK ] Mounting USB filesystem: [ OK ] Initializing USB HID interface: [ OK ] Initializing USB keyboard: [ OK ] Initializing USB mouse: [ OK ] Checking root filesystem /: clean, 158083/7061504 files, 1680159/14116410 blocks [/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a /dev/hda2 [ OK ] Remounting root filesystem in read-write mode: [ OK ] Activating swap partitions: [ OK ] Finding module dependencies: [ OK ] Checking filesystems /boot: clean, 69/25584 files, 75707/102280 blocks Checking all file systems. 
[/sbin/fsck.ext3 (1) -- /boot] fsck.ext3 -a /dev/hda1 [ OK ] Mounting local filesystems: [ OK ] Enabling local filesystem quotas: [ OK ] Enabling swap space: [ OK ] INIT: Entering runlevel: 3 Entering non-interactive startup Applying iptables firewall rules: [ OK ] Setting network parameters: [ OK ] Bringing up loopback interface: [ OK ] Bringing up interface eth0: [ OK ] Bringing up interface eth1: [ OK ] Starting system logger: [ OK ] Starting kernel logger: [ OK ] Starting portmapper: [ OK ] Starting NFS statd: [ OK ] Starting pool: Pool v6.0.0 (built Aug 2 2004 18:51:15) installed [ OK ] Starting ganglia-restore-rrds: [ OK ] Starting ccsd: [ OK ] Starting GANGLIA gmetad: [ OK ] Initializing random number generator: [ OK ] Starting Ganglia Receptor: [ OK ] Starting lock_gulmd: [ OK ] modprobe: Can't locate module pvfs Starting PVFS daemon: (pvfsd.c, 683): Could not setup device /dev/pvfsd. (pvfsd.c, 684): Did you remember to load the pvfs module? (pvfsd.c, 453): pvfsd: setup_pvfsdev() failed [FAILED][ OK ] Mounting other filesystems: [ OK ] Publishing login files via 411...[ OK ] Starting automount:[ OK ] Starting named: [ OK ] Starting sshd:[ OK ] Starting xinetd: [ OK ] ntpd: Synchronizing with time server: [FAILED] Starting ntpd: [ OK ] Starting NFS services: [ OK ] Starting NFS quotas: [ OK ] Starting NFS daemon: [ OK ] Starting NFS mountd: [ OK ] Starting dhcpd: [ OK ] Starting GANGLIA gmond: [ OK ] Starting MySQL: [ OK ] Starting httpd: [ OK ] Starting crond: [ OK ] Starting xfs: [ OK ] Starting atd: [ OK ] Starting firstboot: [ OK ] starting sge_qmaster starting program: /opt/gridengine/bin/amd64linux/sge_commd using service "sge_commd" bound to port 535 Reading in complexes: Complex "host". Complex "queue". Reading in execution hosts. Reading in administrative hosts. Reading in submit hosts. Reading in usersets: Userset "defaultdepartment". Userset "deadlineusers". Reading in queues: Queue "compute-0-0.q". Reading in parallel environments: PE "make". PE "mpich". PE "mpi". 
Reading in scheduler configuration cant load sharetree (cant open file sharetree: No such file or directory), starting up with empty sharetree starting sge_schedd Turn off kernel logging to console: [ OK ] /wet^H^H^HUnable to handle kernel NULL pointer dereference at virtual address 0000000000000000 printing rip: ffffffff8024a875 PML4 78215067 PGD 77f93067 PMD 0 Oops: 0002 CPU 1 Pid: 4027, comm: mount Not tainted RIP: 0010:[]{net_rx_action+213} RSP: 0018:0000010078051048 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ffffffff80607ae8 RCX: ffffffff80607c88 RDX: ffffffff80607ae8 RSI: 0000010078986080 RDI: ffffffff80607ad0 RBP: ffffffff80607968 R08: 0000000080e76a9c R09: 0000000000e780e7 R10: 000000000100007f R11: 0000000000000000 R12: ffffffff80607ae8 R13: ffffffff80607ac0 R14: 00000000000071c2 R15: 0000000000000001 FS: 0000002a955764c0(0000) GS:ffffffff805d98c0(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 00000000079d2000 CR4: 00000000000006e0 Call Trace: []{net_rx_action+173} []{do_softirq+174} []{ip_finish_output2+0} []{dst_output+0} []{do_softirq_thunk+53} []{.text.lock.netfilter+165} []{dst_output+0} []{ip_queue_xmit+1019} []{ip_rcv_finish+0} []{ip_rcv_finish+528} []{nf_hook_slow+305} []{ip_rcv_finish+0} []{tcp_transmit_skb+1295} []{tcp_write_xmit+198} []{tcp_sendmsg+4051} []{inet_sendmsg+69} []{sock_sendmsg+142} []{:lock_gulm:do_tfer+369} []{:lock_gulm:.rodata.str1.1+467} []{:lock_gulm:xdr_send+37} []{:lock_gulm:xdr_enc_flush+56} []{:lock_gulm:lg_lock_login+301} []{:lock_gulm:lt_login+57} []{:lock_gulm:gulm_core_login_reply+164} []{:lock_gulm:core_cb+0} []{:lock_gulm:lg_core_handle_messages+315} []{:lock_gulm:lg_core_login+323} []{:lock_gulm:cm_login+122} []{:lock_gulm:start_gulm_threads+174} []{:lock_gulm:gulm_mount+616} []{:gfs:gfs_glock_cb+0} []{release_task+763} []{:lock_harness:lm_mount_Rsmp_ad6c5c21+355} []{:gfs:gfs_glock_cb+0} []{:gfs:gfs_mount_lockproto+313} []{do_anonymous_page+1234} []{do_no_page+95} []{do_page_fault+627} []{error_exit+0} []{create_elf_tables+211} []{strnlen_user+56} []{create_elf_tables+871} []{:gfs:gfs_read_super+1307} []{:gfs:gfs_fs_type+0} []{get_sb_bdev+588} []{:gfs:gfs_fs_type+0} []{do_kern_mount+121} []{do_add_mount+161} []{do_mount+345} []{__get_free_pages+16} []{sys_mount+197} []{system_call+119} Process mount (pid: 4027, stackpage=10078051000) Stack: 0000010078051048 0000000000000018 ffffffff8024a84d 0000012a80445d20 0000000000000001 ffffffff80606c60 0000000000000001 000000000000000a 0000000000000001 0000000000000002 ffffffff8012a72e ffffffff80267cf0 0000000000000246 0000000000000000 0000000000000003 ffffffff80445d20 ffffffff80267cc0 0000000000000000 ffffffff802b5915 0000000000000043 0000000000000006 000001007a05309e 000001007c97bd80 0000000000000000 0000000000000300 ffffffff8049c688 0000000000000001 ffffffff806077c0 ffffffff802533a7 ffffffff80267cc0 ffffffff80445d20 0000000000000002 000001007c97bd80 ffffffff805abcd0 000001007a0530ac 000001007c97bd80 0000010078986080 0000000000000000 0000010078986080 000001007c97bde8 Call Trace: []{net_rx_action+173} []{do_softirq+174} []{ip_finish_output2+0} []{dst_output+0} []{do_softirq_thunk+53} []{.text.lock.netfilter+165} []{dst_output+0} []{ip_queue_xmit+1019} []{ip_rcv_finish+0} []{ip_rcv_finish+528} []{nf_hook_slow+305} []{ip_rcv_finish+0} []{tcp_transmit_skb+1295} []{tcp_write_xmit+198} []{tcp_sendmsg+4051} []{inet_sendmsg+69} []{sock_sendmsg+142} []{:lock_gulm:do_tfer+369} []{:lock_gulm:.rodata.str1.1+467} []{:lock_gulm:xdr_send+37} 
[]{:lock_gulm:xdr_enc_flush+56} []{:lock_gulm:lg_lock_login+301} []{:lock_gulm:lt_login+57} []{:lock_gulm:gulm_core_login_reply+164} []{:lock_gulm:core_cb+0} []{:lock_gulm:lg_core_handle_messages+315} []{:lock_gulm:lg_core_login+323} []{:lock_gulm:cm_login+122} []{:lock_gulm:start_gulm_threads+174} []{:lock_gulm:gulm_mount+616} []{:gfs:gfs_glock_cb+0} []{release_task+763} []{:lock_harness:lm_mount_Rsmp_ad6c5c21+355} []{:gfs:gfs_glock_cb+0} []{:gfs:gfs_mount_lockproto+313} []{do_anonymous_page+1234} []{do_no_page+95} []{do_page_fault+627} []{error_exit+0} []{create_elf_tables+211} []{strnlen_user+56} []{create_elf_tables+871} []{:gfs:gfs_read_super+1307} []{:gfs:gfs_fs_type+0} []{get_sb_bdev+588} []{:gfs:gfs_fs_type+0} []{do_kern_mount+121} []{do_add_mount+161} []{do_mount+345} []{__get_free_pages+16} []{sys_mount+197} []{system_call+119} Code: 48 89 18 48 89 43 08 8b 85 90 01 00 00 85 c0 79 08 03 85 94 Kernel panic: Fatal exception In interrupt handler - not syncing NMI Watchdog detected LOCKUP on CPU0, eip ffffffff801a5419, registers: CPU 0 Pid: 3532, comm: lock_gulmd Not tainted RIP: 0010:[]{.text.lock.fault+7} RSP: 0018:000001007ba7b978 EFLAGS: 00000086 RAX: 000000000000000f RBX: ffffffff806077e8 RCX: 0000000000000000 RDX: ffffffff803042e0 RSI: ffffffff803042e0 RDI: ffffffff8024a875 RBP: ffffffff80607668 R08: ffffffff803042d0 R09: 0000000000e780e7 R10: 000000000100007f R11: 0000000000000000 R12: 0000010037dcbc00 R13: 0000000000000000 R14: 0000000000000002 R15: 000001007ba7ba58 FS: 0000002a95576ce0(0000) GS:ffffffff805d9840(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0 Call Trace: []{nf_hook_slow+305} []{ip_rcv_finish+0} []{ip_rcv_finish+528} []{ip_local_deliver_finish+0} []{error_exit+0} []{net_rx_action+213} []{net_rx_action+173} []{do_softirq+174} []{ip_finish_output2+0} []{dst_output+0} []{do_softirq_thunk+53} []{.text.lock.netfilter+165} []{dst_output+0} []{ip_queue_xmit+1019} []{tcp_transmit_skb+1295} []{tcp_write_xmit+198} []{tcp_sendmsg+4051} []{inet_sendmsg+69} []{sock_sendmsg+142} []{sys_sendto+195} []{free_pages+132} []{__poll_freewait+136} []{system_call+119} Process lock_gulmd (pid: 3532, stackpage=1007ba7b000) Stack: 000001007ba7b978 0000000000000018 0000000000100000 0000000000000000 00000100079c4c80 ffffffff803e89a0 0000000000000000 00000100000fdea0 ffffffff803e8d00 00000100079bf000 00000100079d6400 0000000000000042 00000100079de280 ffffff0000000000 000000fffffff000 0000000000000000 00000100079d7a80 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000010078050d48 0000000000000000 00000000006d9994 0000000000000003 0000000000000000 0000000000000000 0000000100000000 ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff Call Trace: []{nf_hook_slow+305} []{ip_rcv_finish+0} []{ip_rcv_finish+528} []{ip_local_deliver_finish+0} []{error_exit+0} []{net_rx_action+213} []{net_rx_action+173} []{do_softirq+174} []{ip_finish_output2+0} []{dst_output+0} []{do_softirq_thunk+53} []{.text.lock.netfilter+165} []{dst_output+0} []{ip_queue_xmit+1019} []{tcp_transmit_skb+1295} []{tcp_write_xmit+198} []{tcp_sendmsg+4051} []{inet_sendmsg+69} []{sock_sendmsg+142} []{sys_sendto+195} []{free_pages+132} []{__poll_freewait+136} []{system_call+119} Code: f3 90 7e f5 e9 c8 fd ff ff 90 90 90 90 90 90 90 90 90 90 90 console shuts up ... 
NMI Watchdog detected LOCKUP on CPU1, eip ffffffff8011a948, registers:
From bruce.walker at hp.com Mon Aug 2 22:25:27 2004 From: bruce.walker at hp.com (Walker, Bruce J) Date: Mon, 2 Aug 2004 15:25:27 -0700 Subject: [Linux-cluster] RE: [SSI-devel] Re: [ANNOUNCE] OpenSSI 1.0.0 released!! Message-ID: <3689AF909D816446BA505D21F1461AE4C750F0@cacexc04.americas.cpqcorp.net> As I indicated earlier, we are going to redo the hooks for 2.6 and submit them in a more manageable way. I expect that to take several months. Bruce Walker Project manager for OpenSSI. > -----Original Message----- > From: ssic-linux-devel-admin at lists.sourceforge.net > [mailto:ssic-linux-devel-admin at lists.sourceforge.net] On > Behalf Of Erich Focht > Sent: Monday, August 02, 2004 6:51 AM > To: K V, Aneesh Kumar > Cc: ssic-linux-devel at lists.sourceforge.net; Andi Kleen; Linux > Kernel Mailing List; linux-cluster at redhat.com > Subject: Re: [SSI-devel] Re: [ANNOUNCE] OpenSSI 1.0.0 released!! > > > On Monday 02 August 2004 08:30, Aneesh Kumar K.V wrote: > > > [....] Congratulations. But I was a bit disappointed that there > > > wasn't a tarball with the kernel patches and other sources. > > > Any chance to add that to the site? > > > > I have posted the diff at > > http://www.openssi.org/contrib/linux-ssi.diff.gz > > Hmmm, that's too huge to get an overview on what it does... > The current CVS ci/kernel touches 137 files, openssi/kernel touches > 350 files. Plus the ci/kernel.patches and openssi/kernel.patches... > > > For 2.6 we are planning to group the changes into small > patches that is > > easy to review. > > Sounds great! Having groups sorted by functionality will help a > lot. When will these be visible in the CVS? > > Thanks, > best regards, > Erich > > > > > > ------------------------------------------------------- > This SF.Net email is sponsored by OSTG. Have you noticed the > changes on > Linux.com, ITManagersJournal and NewsForge in the past few weeks? Now, > one more big change to announce. We are now OSTG- Open Source > Technology > Group. Come see the changes on the new OSTG site. www.ostg.com > _______________________________________________ > ssic-linux-devel mailing list > ssic-linux-devel at lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/ssic-linux-devel >
From pcaulfie at redhat.com Tue Aug 3 06:55:51 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 3 Aug 2004 07:55:51 +0100 Subject: [Linux-cluster] segfault if dlm is loaded while cman is still joining the cluster In-Reply-To: <118314539.20040802162045@intersystems.com> References: <118314539.20040802162045@intersystems.com> Message-ID: <20040803065551.GB23467@tykepenguin.com> On Mon, Aug 02, 2004 at 04:20:45PM -0400, Jeff wrote: > Is there a bug tracker somewhere or should we just post > them to this list? There is a bugzilla at bugzilla.redhat.com, but posting to the list is generally OK too. Thanks, I'll have a look at this.
Patrick
From pcaulfie at redhat.com Tue Aug 3 07:28:20 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 3 Aug 2004 08:28:20 +0100 Subject: [Linux-cluster] segfault if dlm is loaded while cman is still joining the cluster In-Reply-To: <118314539.20040802162045@intersystems.com> References: <118314539.20040802162045@intersystems.com> Message-ID: <20040803072819.GE23467@tykepenguin.com> On Mon, Aug 02, 2004 at 04:20:45PM -0400, Jeff wrote: > CMAN: Waiting to join or form a Linux-cluster > CMAN (built Aug 2 2004 15:04:09) installed > kmem_cache_create: duplicate cache cluster_sock Hang on - how did you manage that? The cman code looks like it has been loaded twice, or something.... The module load message is BELOW the "Waiting" message, which says that the cman code was already in the kernel when the "modprobe dlm" was executed, which loaded the cman.ko module as a dependency. The only way I can think this could happen is that you have cman in the kernel AND as a module. -- patrick
From pcaulfie at redhat.com Tue Aug 3 13:11:54 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 3 Aug 2004 14:11:54 +0100 Subject: [Linux-cluster] Is this intentional: specifying a new completion ast routine on a convert In-Reply-To: <1524056551.20040728124948@intersystems.com> References: <1524056551.20040728124948@intersystems.com> Message-ID: <20040803131154.GI23467@tykepenguin.com> I've changed this so that all the AST parameters passed into a lock request will always override the ones that are there. This makes more sense really, and seems to be what VMS does too. Of course, if you change the blocking AST routine or argument during a convert, the DLM makes no guarantees that a call with the old values won't be in flight and waiting for you :-) -- patrick
From mtilstra at redhat.com Tue Aug 3 14:40:19 2004 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Tue, 3 Aug 2004 09:40:19 -0500 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <1091480394.8356.58.camel@angmar> References: <1091458082.8356.23.camel@angmar> <20040802154605.GC1518@redhat.com> <1091480394.8356.58.camel@angmar> Message-ID: <20040803144019.GA4365@redhat.com> On Mon, Aug 02, 2004 at 01:59:55PM -0700, micah nerren wrote: [snip] > I hope this helps!! [snip] yeah, looks like a stack overflow. here's a patch that I put in for 6.0. (patch works on 6.0.0-7) -- Michael Conrad Tadpol Tilstra Duct tape is like the force. It has a light side and a dark side and it holds the universe together.
-------------- next part -------------- =================================================================== RCS file: /mnt/export/cvs/GFS/locking/lock_gulm/kernel/gulm_fs.c,v retrieving revision 1.1.2.16 retrieving revision 1.1.2.17 diff -u -r1.1.2.16 -r1.1.2.17 --- GFS/locking/lock_gulm/kernel/gulm_fs.c 2004/07/20 16:54:18 1.1.2.16 +++ GFS/locking/lock_gulm/kernel/gulm_fs.c 2004/08/02 16:12:39 1.1.2.17 @@ -335,11 +335,17 @@ unsigned int min_lvb_size, struct lm_lockstruct *lockstruct) { gulm_fs_t *gulm; - char work[256], *tbln; + char *work=NULL, *tbln; int first; int error = -1; struct list_head *lltmp; + work = kmalloc(256, GFP_KERNEL); + if(work == NULL ) { + log_err("Out of Memory.\n"); + error = -ENOMEM; + goto fail; + } strncpy (work, table_name, 256); tbln = strstr (work, ":"); @@ -483,6 +489,7 @@ fail: + if(work != NULL ) kfree(work); gulm_cm.starts = FALSE; log_msg (lgm_Always, "fsid=%s: Exiting gulm_mount with errors %d\n", table_name, error); @@ -570,7 +577,7 @@ { gulm_fs_t *fs = (gulm_fs_t *) lockspace; int err; - uint8_t name[256]; + uint8_t name[64]; if (message != LM_RD_SUCCESS) { /* Need to start thinking about how I want to use this... */ @@ -579,7 +586,7 @@ if (jid == fs->fsJID) { /* this may be drifting crud through. */ /* hey! its me! */ - strncpy (name, gulm_cm.myName, 256); + strncpy (name, gulm_cm.myName, 64); } else if (lookup_name_by_jid (fs, jid, name) != 0) { log_msg (lgm_JIDMap, "fsid=%s: Could not find a client for jid %d\n", -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From jeff at intersystems.com Tue Aug 3 14:56:38 2004 From: jeff at intersystems.com (Jeff) Date: Tue, 3 Aug 2004 10:56:38 -0400 Subject: [Linux-cluster] segfault if dlm is loaded while cman is still joining the cluster In-Reply-To: <20040803072819.GE23467@tykepenguin.com> References: <118314539.20040802162045@intersystems.com> <20040803072819.GE23467@tykepenguin.com> Message-ID: <1574160181.20040803105638@intersystems.com> Tuesday, August 3, 2004, 3:28:20 AM, Patrick Caulfield wrote: > On Mon, Aug 02, 2004 at 04:20:45PM -0400, Jeff wrote: >> CMAN: Waiting to join or form a Linux-cluster >> CMAN (built Aug 2 2004 15:04:09) installed >> kmem_cache_create: duplicate cache cluster_sock > Hang on - how did you manage that? the cman code looks like it has been loaded > twice, or something.... > The module load message is BELOW the "Waiting" message which says that the cman > code was already in the kernel when the "modprobe dlm" was executed which loaded > the cman.ko module as a dependancy. > The only way I can think this could happen is that you have cman in the kernel > AND as a module. Hmmm. When I look at the Cluster Infrastructure item in 'menuconfig' its marked with a *. Is this 'cman'? Would changing this to M solve the problem or is there something else going on here. Starting over with a vanilla kernel I find that if I try to build I get a bunch of undefined symbol warnings during 'make install' from dlm-kernel and gfs-kernel. For instance, from dlm-kernel: [root at lx4 src]# make install if [ ! -e cluster ]; then ln -s . cluster; fi if [ ! -e service.h ]; then cp //usr/include/cluster/service.h .; fi if [ ! -e cnxman.h ]; then cp //usr/include/cluster/cnxman.h .; fi if [ ! 
-e cnxman-socket.h ]; then cp //usr/include/cluster/cnxman-socket.h .; fi make -C /usr/src/linux-2.6.7 M=/usr/src/cvs/cluster_orig/dlm-kernel/src modules USING_KBUILD=yes make[1]: Entering directory `/usr/src/linux-2.6.7' Building modules, stage 2. MODPOST *** Warning: "kcl_addref_cluster" [/usr/src/cvs/cluster_orig/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_get_node_by_addr" [/usr/src/cvs/cluster_orig/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_get_node_addresses" [/usr/src/cvs/cluster_orig/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_releaseref_cluster" [/usr/src/cvs/cluster_orig/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_get_current_interface" [/usr/src/cvs/cluster_orig/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_get_node_by_nodeid" [/usr/src/cvs/cluster_orig/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_leave_service" [/usr/src/cvs/cluster_orig/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_remove_callback" [/usr/src/cvs/cluster_orig/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_global_service_id" [/usr/src/cvs/cluster_orig/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_unregister_service" [/usr/src/cvs/cluster_orig/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_join_service" [/usr/src/cvs/cluster_orig/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_start_done" [/usr/src/cvs/cluster_orig/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_add_callback" [/usr/src/cvs/cluster_orig/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_register_service" [/usr/src/cvs/cluster_orig/dlm-kernel/src/dlm.ko] undefined! make[1]: Leaving directory `/usr/src/linux-2.6.7' install -d //lib/modules/2.6.7-clu-smp/kernel/cluster install dlm.ko //lib/modules/2.6.7-clu-smp/kernel/cluster install -d //usr/include/cluster install dlm.h dlm_device.h //usr/include/cluster [root at lx4 src]# From mtilstra at redhat.com Tue Aug 3 14:59:05 2004 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Tue, 3 Aug 2004 09:59:05 -0500 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <20040803144019.GA4365@redhat.com> References: <1091458082.8356.23.camel@angmar> <20040802154605.GC1518@redhat.com> <1091480394.8356.58.camel@angmar> <20040803144019.GA4365@redhat.com> Message-ID: <20040803145905.GA5818@redhat.com> On Tue, Aug 03, 2004 at 09:40:19AM -0500, michael tilstra wrote: > On Mon, Aug 02, 2004 at 01:59:55PM -0700, micah nerren wrote: > [snip] > > I hope this helps!! > [snip] > > yeah, looks like a stack overflow. > here's a patch that I put in for 6.0. (patch works on 6.0.0-7) oh, it is entered at: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=129042 -- Michael Conrad Tadpol Tilstra How? My roommate is dying to get some! -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From phillips at istop.com Sun Aug 1 17:23:01 2004 From: phillips at istop.com (Daniel Phillips) Date: Sun, 1 Aug 2004 13:23:01 -0400 Subject: [Linux-cluster] Re: [ANNOUNCE] OpenSSI 1.0.0 released!! In-Reply-To: <410B80BC.4060100@hp.com> References: <410B80BC.4060100@hp.com> Message-ID: <200408011323.02478.phillips@istop.com> On Saturday 31 July 2004 07:21, Aneesh Kumar K.V wrote: > 10. DLM > * is integrated with CLMS and is HA As briefly mentioned at last week's cluster summit, we'd like to try to integrate the Red Hat (nee Sistina) GDLM, want to give it a try? 
"One DLM to rule them all, one DLM to mind them, one DLM to sync them all, and in the cluster, bind them" Regards, Daniel From phillips at istop.com Sun Aug 1 17:30:01 2004 From: phillips at istop.com (Daniel Phillips) Date: Sun, 1 Aug 2004 13:30:01 -0400 Subject: [Linux-cluster] Re: [ANNOUNCE] OpenSSI 1.0.0 released!! In-Reply-To: <3689AF909D816446BA505D21F1461AE4C750E6@cacexc04.americas.cpqcorp.net> References: <3689AF909D816446BA505D21F1461AE4C750E6@cacexc04.americas.cpqcorp.net> Message-ID: <200408011330.01848.phillips@istop.com> On Saturday 31 July 2004 12:00, Walker, Bruce J wrote: > In the 2.4 implementation, providing this one capability by > leveraging devfs was quite economic, efficient and has been very stable. I wonder if device-mapper (slightly hacked) wouldn't be a better approach for 2.6+. Regards, Daniel From kpfleming at backtobasicsmgmt.com Sun Aug 1 17:32:57 2004 From: kpfleming at backtobasicsmgmt.com (Kevin P. Fleming) Date: Sun, 01 Aug 2004 10:32:57 -0700 Subject: [Linux-cluster] Re: [ANNOUNCE] OpenSSI 1.0.0 released!! In-Reply-To: <200408011330.01848.phillips@istop.com> References: <3689AF909D816446BA505D21F1461AE4C750E6@cacexc04.americas.cpqcorp.net> <200408011330.01848.phillips@istop.com> Message-ID: <410D2949.20503@backtobasicsmgmt.com> Daniel Phillips wrote: > On Saturday 31 July 2004 12:00, Walker, Bruce J wrote: > >>In the 2.4 implementation, providing this one capability by >>leveraging devfs was quite economic, efficient and has been very stable. > > > I wonder if device-mapper (slightly hacked) wouldn't be a better approach for > 2.6+. It appeared from the original posting that their "cluster-wide devfs" actually supported all types of device nodes, not just block devices. I don't know whether accessing a character device on another node would ever be useful, but certainly using device-mapper wouldn't help for that case. From phillips at istop.com Mon Aug 2 01:53:46 2004 From: phillips at istop.com (Daniel Phillips) Date: Sun, 1 Aug 2004 21:53:46 -0400 Subject: [Linux-cluster] Re: [ANNOUNCE] OpenSSI 1.0.0 released!! In-Reply-To: <410D2949.20503@backtobasicsmgmt.com> References: <3689AF909D816446BA505D21F1461AE4C750E6@cacexc04.americas.cpqcorp.net> <200408011330.01848.phillips@istop.com> <410D2949.20503@backtobasicsmgmt.com> Message-ID: <200408012153.46835.phillips@istop.com> On Sunday 01 August 2004 13:32, Kevin P. Fleming wrote: > Daniel Phillips wrote: > > On Saturday 31 July 2004 12:00, Walker, Bruce J wrote: > >>In the 2.4 implementation, providing this one capability by > >>leveraging devfs was quite economic, efficient and has been very stable. > > > > I wonder if device-mapper (slightly hacked) wouldn't be a better approach > > for 2.6+. > > It appeared from the original posting that their "cluster-wide devfs" > actually supported all types of device nodes, not just block devices. I > don't know whether accessing a character device on another node would > ever be useful, but certainly using device-mapper wouldn't help for that > case. Unless device-mapper learned how to deal with char devices... Just a thought. Regards, Daniel From aneesh.kumar at hp.com Mon Aug 2 06:30:58 2004 From: aneesh.kumar at hp.com (Aneesh Kumar K.V) Date: Mon, 02 Aug 2004 12:00:58 +0530 Subject: [Linux-cluster] Re: [ANNOUNCE] OpenSSI 1.0.0 released!! In-Reply-To: References: <2o0e0-6qx-5@gated-at.bofh.it> Message-ID: <410DDFA2.40107@hp.com> Andi Kleen wrote: > "Aneesh Kumar K.V" writes: > > >>Hi, >> >>Sorry for the cross post. I came across this on OpenSSI website. 
I >>guess others may also be interested. > > > > [....] Congratulations. But I was a bit disappointed that there > wasn't a tarball with the kernel patches and other sources. > Any chance to add that to the site? > > I have posted the diff at http://www.openssi.org/contrib/linux-ssi.diff.gz This is against kernel linux-rh-2.4.20-31.9 which can be found in the OpenSSI CVS as srpms/linux-rh-2.4.20-31.9.tar.bz2 $cvs -d:pserver:anonymous at cvs.openssi.org:/cvsroot/ssic-linux login $cvs -z3 -d:pserver:anonymous at cvs.openssi.org:/cvsroot/sic-linux co -r OPENSSI-RH srpms/linux-rh-2.4.20-31.9.tar.bz2 This patch include the IPVS, KDB and OpenSSI changes For 2.6 we are planning to group the changes into small patches that is easy to review. All the other sources can be found as tar.gz at ( http://www.openssi.org/contrib/debian/openssidebs/sources/ )or better by doing apt-get source package on a debian system :) -aneesh From bruce.walker at hp.com Mon Aug 2 00:00:32 2004 From: bruce.walker at hp.com (Walker, Bruce J) Date: Sun, 1 Aug 2004 17:00:32 -0700 Subject: [Linux-cluster] Re: [ANNOUNCE] OpenSSI 1.0.0 released!! Message-ID: <3689AF909D816446BA505D21F1461AE4C750EA@cacexc04.americas.cpqcorp.net> When processes can freely and transparently move around the cluster (at exec time, fork time or during any system call), being able to transparently access your controlling tty is pretty handy. In 2.4 we stack our CFS on top of each node's devfs to give us naming of and access to all devices on all nodes. TBD on how will do this in 2.6. Bruce > > > > I wonder if device-mapper (slightly hacked) wouldn't be a > better approach for > > 2.6+. > > It appeared from the original posting that their "cluster-wide devfs" > actually supported all types of device nodes, not just block > devices. I > don't know whether accessing a character device on another node would > ever be useful, but certainly using device-mapper wouldn't > help for that > case. > From raven at themaw.net Mon Aug 2 03:13:39 2004 From: raven at themaw.net (Ian Kent) Date: Mon, 2 Aug 2004 11:13:39 +0800 (WST) Subject: [Linux-cluster] Re: [ANNOUNCE] OpenSSI 1.0.0 released!! In-Reply-To: <410D2949.20503@backtobasicsmgmt.com> References: <3689AF909D816446BA505D21F1461AE4C750E6@cacexc04.americas.cpqcorp.net> <200408011330.01848.phillips@istop.com> <410D2949.20503@backtobasicsmgmt.com> Message-ID: On Sun, 1 Aug 2004, Kevin P. Fleming wrote: > Daniel Phillips wrote: > > > On Saturday 31 July 2004 12:00, Walker, Bruce J wrote: > > > >>In the 2.4 implementation, providing this one capability by > >>leveraging devfs was quite economic, efficient and has been very stable. > > > > > > I wonder if device-mapper (slightly hacked) wouldn't be a better approach for > > 2.6+. > > It appeared from the original posting that their "cluster-wide devfs" > actually supported all types of device nodes, not just block devices. I > don't know whether accessing a character device on another node would > ever be useful, but certainly using device-mapper wouldn't help for that > case. Does the reduced function 2.6 devfs still have what's needed? If it does then you should have a fair amount of breathing space. From efocht at gmx.net Mon Aug 2 13:50:39 2004 From: efocht at gmx.net (Erich Focht) Date: Mon, 2 Aug 2004 15:50:39 +0200 Subject: [Linux-cluster] Re: [SSI-devel] Re: [ANNOUNCE] OpenSSI 1.0.0 released!! 
In-Reply-To: <410DDFA2.40107@hp.com> References: <2o0e0-6qx-5@gated-at.bofh.it> <410DDFA2.40107@hp.com> Message-ID: <200408021550.39219.efocht@gmx.net> On Monday 02 August 2004 08:30, Aneesh Kumar K.V wrote: > > [....] Congratulations. But I was a bit disappointed that there > > wasn't a tarball with the kernel patches and other sources. > > Any chance to add that to the site? > > I have posted the diff at > http://www.openssi.org/contrib/linux-ssi.diff.gz Hmmm, that's too huge to get an overview on what it does... The current CVS ci/kernel touches 137 files, openssi/kernel touches 350 files. Plus the ci/kernel.patches and openssi/kernel.patches... > For 2.6 we are planning to group the changes into small patches that is > easy to review. Sounds great! Having groups sorted by functionality will help a lot. When will these be visible in the CVS? Thanks, best regards, Erich From bernd.schumacher at hp.com Tue Aug 3 11:55:42 2004 From: bernd.schumacher at hp.com (Schumacher, Bernd) Date: Tue, 3 Aug 2004 13:55:42 +0200 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence Message-ID: Hi, I have three nodes oben, mitte and unten. Test: I have disabled eth0 on mitte, so that mitte will be excluded. Result: Oben and unten are trying to fence mitte and build a new cluster. OK! But mitte tries to fence oben and unten. PROBLEM! Why can this happen? Mitte knows that it can not build a cluster. See Logfile from mitte: "Have 1, need 2" Logfile from mitte: Aug 3 12:53:17 mitte lock_gulmd_core[1845]: Client (oben) expired Aug 3 12:53:17 mitte lock_gulmd_core[1845]: Core lost slave quorum. Have 1, need 2. Switching to Arbitrating. Aug 3 12:53:17 mitte lock_gulmd_core[2120]: Gonna exec fence_node oben Aug 3 12:53:17 mitte lock_gulmd_core[1845]: Forked [2120] fence_node oben with a 0 pause. Aug 3 12:53:17 mitte fence_node[2120]: Performing fence method, manual, on oben. cluster.ccs: cluster { name = "tom" lock_gulm { servers = ["oben", "mitte", "unten"] } } fence.ccs: fence_devices { manual_oben { agent = "fence_manual" } manual_mitte ... nodes.ccs: nodes { oben { ip_interfaces { eth0 = "192.168.100.241" } fence { manual { manual_oben { ipaddr = "192.168.100.241" } } } } mitte ... regards Bernd Schumacher From danderso at redhat.com Tue Aug 3 15:49:26 2004 From: danderso at redhat.com (Derek Anderson) Date: Tue, 3 Aug 2004 10:49:26 -0500 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence In-Reply-To: References: Message-ID: <200408031049.26537.danderso@redhat.com> Bernd, Please see http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=128635. There are some outstanding fence issues here. On Tuesday 03 August 2004 06:55, Schumacher, Bernd wrote: > Hi, > I have three nodes oben, mitte and unten. > > Test: > I have disabled eth0 on mitte, so that mitte will be excluded. > > Result: > Oben and unten are trying to fence mitte and build a new cluster. OK! > But mitte tries to fence oben and unten. PROBLEM! > > Why can this happen? Mitte knows that it can not build a cluster. See > Logfile from mitte: "Have 1, need 2" > > Logfile from mitte: > Aug 3 12:53:17 mitte lock_gulmd_core[1845]: Client (oben) expired Aug > 3 12:53:17 mitte lock_gulmd_core[1845]: Core lost slave quorum. Have 1, > need 2. Switching to Arbitrating. Aug 3 12:53:17 mitte > lock_gulmd_core[2120]: Gonna exec fence_node oben Aug 3 12:53:17 mitte > lock_gulmd_core[1845]: Forked [2120] fence_node oben with a 0 pause. Aug > 3 12:53:17 mitte fence_node[2120]: Performing fence method, manual, on > oben. 
> > cluster.ccs: > cluster { > name = "tom" > lock_gulm { > servers = ["oben", "mitte", "unten"] > } > } > > fence.ccs: > fence_devices { > manual_oben { > agent = "fence_manual" > } > manual_mitte ... > > > nodes.ccs: > nodes { > oben { > ip_interfaces { > eth0 = "192.168.100.241" > } > fence { > manual { > manual_oben { > ipaddr = "192.168.100.241" > } > } > } > } > mitte ... > > regards > Bernd Schumacher > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From danderso at redhat.com Tue Aug 3 16:00:06 2004 From: danderso at redhat.com (Derek Anderson) Date: Tue, 3 Aug 2004 11:00:06 -0500 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence In-Reply-To: <200408031049.26537.danderso@redhat.com> References: <200408031049.26537.danderso@redhat.com> Message-ID: <200408031100.06404.danderso@redhat.com> Please disregard my last post. Too quick of a scan; thought you were referring to the pubilc CVS branch. On Tuesday 03 August 2004 10:49, Derek Anderson wrote: > Bernd, > > Please see http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=128635. > There are some outstanding fence issues here. > > On Tuesday 03 August 2004 06:55, Schumacher, Bernd wrote: > > Hi, > > I have three nodes oben, mitte and unten. > > > > Test: > > I have disabled eth0 on mitte, so that mitte will be excluded. > > > > Result: > > Oben and unten are trying to fence mitte and build a new cluster. OK! > > But mitte tries to fence oben and unten. PROBLEM! > > > > Why can this happen? Mitte knows that it can not build a cluster. See > > Logfile from mitte: "Have 1, need 2" > > > > Logfile from mitte: > > Aug 3 12:53:17 mitte lock_gulmd_core[1845]: Client (oben) expired Aug > > 3 12:53:17 mitte lock_gulmd_core[1845]: Core lost slave quorum. Have 1, > > need 2. Switching to Arbitrating. Aug 3 12:53:17 mitte > > lock_gulmd_core[2120]: Gonna exec fence_node oben Aug 3 12:53:17 mitte > > lock_gulmd_core[1845]: Forked [2120] fence_node oben with a 0 pause. Aug > > 3 12:53:17 mitte fence_node[2120]: Performing fence method, manual, on > > oben. > > > > cluster.ccs: > > cluster { > > name = "tom" > > lock_gulm { > > servers = ["oben", "mitte", "unten"] > > } > > } > > > > fence.ccs: > > fence_devices { > > manual_oben { > > agent = "fence_manual" > > } > > manual_mitte ... > > > > > > nodes.ccs: > > nodes { > > oben { > > ip_interfaces { > > eth0 = "192.168.100.241" > > } > > fence { > > manual { > > manual_oben { > > ipaddr = "192.168.100.241" > > } > > } > > } > > } > > mitte ... > > > > regards > > Bernd Schumacher > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > http://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From mtilstra at redhat.com Tue Aug 3 16:12:47 2004 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Tue, 3 Aug 2004 11:12:47 -0500 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence In-Reply-To: <200408031049.26537.danderso@redhat.com> References: <200408031049.26537.danderso@redhat.com> Message-ID: <20040803161247.GA6095@redhat.com> On Tue, Aug 03, 2004 at 10:49:26AM -0500, Derek Anderson wrote: > Bernd, > > Please see http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=128635. There > are some outstanding fence issues here. except that bug is on cman, and this is about gulm. 
> On Tuesday 03 August 2004 06:55, Schumacher, Bernd wrote: > > Hi, > > I have three nodes oben, mitte and unten. > > > > Test: > > I have disabled eth0 on mitte, so that mitte will be excluded. > > > > Result: > > Oben and unten are trying to fence mitte and build a new cluster. OK! > > But mitte tries to fence oben and unten. PROBLEM! Actually not a problem, just not what you expected. Hopefully I can explain why... (You have a netsplit: neither side knows what the other is doing, and each must assume that the other is dead and that it is right.) > > Why can this happen? Mitte knows that it can not build a cluster. See > > Logfile from mitte: "Have 1, need 2" So looking at what you gave below, mitte was master. (making this guess from the "Core lost slave quorum" part of the message below.) It knows that it doesn't have quorum, but it is still going to try to be the Master. It does not know "that it can not build a cluster." The only thing it knows right now about the other nodes is that they failed to send heartbeats. Therefore they must have left the cluster abnormally. Therefore it must fence them. The other two nodes see that mitte has failed to reply to heartbeats. Therefore it must have left the cluster abnormally. Therefore it must be fenced.
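A minimal sketch of the decision each node is making here (illustration only, not the actual lock_gulmd code; fence_node() and log_msg() are assumed helpers, and MAX_MISSED is an arbitrary threshold):

    /*
     * Illustration only -- not the actual lock_gulmd code.  Each node makes
     * this decision on its own when a peer stops heartbeating: it cannot
     * tell a dead peer from a netsplit, so it expires the peer and tries to
     * fence it, whether or not it still has quorum.
     */
    #define MAX_MISSED 3

    struct peer {
        const char *name;
        int missed_heartbeats;
        int expired;
    };

    extern int fence_node(const char *name);     /* assumed helper */
    extern void log_msg(const char *fmt, ...);   /* assumed helper */

    static void check_peer(struct peer *p, int have_quorum)
    {
        if (p->missed_heartbeats < MAX_MISSED)
            return;                 /* peer still answering, nothing to do */

        p->expired = 1;
        log_msg("Client (%s) expired\n", p->name);

        /*
         * gulm-style behaviour as described above: fence even while merely
         * Arbitrating without quorum.  A quorum-gated design would return
         * here when !have_quorum.
         */
        (void)have_quorum;
        fence_node(p->name);
    }
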
Both sides of the netsplit are trying to resolve things to regain the cluster. From an outsiders view point (which you and I have, the nodes do not.) We can see that mitte's attempts are futile, oben and unten will get control of the cluter. But the node cannot see this. This is what makes netsplits kind of ugly. (using ifdown to test cluster stuff causes extra confusion in my opinion. because you actually are creating a netsplit case. Not a simpler node down case. The power switch is nice for this.) I hope that made some sence. -- Michael Conrad Tadpol Tilstra Blood is thicker than water, and much tastier. From bernd.schumacher at hp.com Tue Aug 3 16:44:06 2004 From: bernd.schumacher at hp.com (Schumacher, Bernd) Date: Tue, 3 Aug 2004 18:44:06 +0200 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence Message-ID: before I tried with manual fencing I tried this with automatic fencing (fence_rib). And always mitte was faster and fenced oben and unten. This means, one faulty node can reboot all other nodes. I think this is not ok. And even after reboot the problem is not solved, because the faulty node is still faulty. A node should only be allowed to fence if it is Master and if it has the qourum. And never if it is in arbitrating mode. > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Steve Landherr > Sent: Dienstag, 3. August 2004 18:23 > To: Discussion of clustering software components including GFS > Subject: RE: [Linux-cluster] GFS 6.0 node without quorum > tries to fence > > > In a netsplit, what does fencing achieve when done by a node > that doesn't have quorum? It still won't have quorum. It > should probably just clean up as best it can and leave the > rest of the cluster alone. > > -steve > -- > Steve Landherr -- landherr at kazeon.com > Kazeon Systems, Inc. > Mountain View, California > > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On > Behalf Of > Michael Conrad Tadpol Tilstra > Sent: Tuesday, August 03, 2004 9:13 AM > To: Discussion of clustering software components including GFS > Subject: Re: [Linux-cluster] GFS 6.0 node without quorum > tries to fence > > So looking at what you gave below, mitte was master. (making > this guess from the "Core lost slave quorum" part of the > message below.) It knows that it doesn't have quorum, it > still is going to try to be the Master. It does not know > "that it can not build a cluster." The only thing it knows > right now about the other nodes is that they failed to send > heartbeats. Therefor they must have left the cluter > abnormally. Therefor it must fence them. > > The other two nodes see that mitte have failed to reply to > heartbeats. Therefor it must have left the cluster > abnormally. Therefor it must be fenced. > > Both sides of the netsplit are trying to resolve things to > regain the cluster. From an outsiders view point (which you > and I have, the nodes do not.) We can see that mitte's > attempts are futile, oben and unten will get control of the > cluter. But the node cannot see this. > > This is what makes netsplits kind of ugly. > > (using ifdown to test cluster stuff causes extra confusion in > my opinion. because you actually are creating a netsplit > case. Not a simpler node down case. The power switch is > nice for this.) > > > I hope that made some sence. > > -- > Michael Conrad Tadpol Tilstra > Blood is thicker than water, and much tastier. 
> > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-> cluster > From laza at yu.net Tue Aug 3 17:17:51 2004 From: laza at yu.net (Lazar Obradovic) Date: Tue, 03 Aug 2004 19:17:51 +0200 Subject: [Linux-cluster] Multicast for GFS? Message-ID: <1091553471.16747.165.camel@laza.eunet.yu> Hi, can someone, please, give some advice about configuring multicast with GFS? I know it might go out of topic, but it's perhaps useful for others. I'd use broadcast instead, but I have a problem that two groups of servers sharing the same storage, but that are located in different vlans, separated by router-on-a-stick, so I guess I have to use multicast. I've configured the router for multicast (config is right below), but it doesn't seem to work. Here's ascii pic of what I'm trying to make: +--------+ | router | +--------+ / \ +----------+ +----------+ | switch A | | switch B | | vlan 100 | | vlan 200 | +----------+ +----------+ | | +----------+ +----------+ | server A | | server B | +----------+ +----------+ | | +---------------------+ | san / storage | +---------------------+ and relevant config (that I made this far): router (cisco ios): ip multicast-routing ! interface FastEthernet0/0 description Branch A ip address 1.1.1.1 255.255.255.0 ip pim sparse-dense-mode ip igmp version 1 encapsulation dot1q 100 ! interface FastEthernet0/1 description Branch B ip address 1.1.2.1 255.255.255.0 ip pim sparse-dense-mode ip igmp version 1 encapsulation dot1q 200 ! ip pim send-rp-announce FastEthernet0/0 scope 16 ip pim send-rp-discovery scope 16 ! switch A and switch B are manageable Intel switches (dunno the exact model; they are bundled with my IBM Blades), but have IGMP Snooping turned on on every interface, and show default cisco pim groups (224.0.0.40 and 224.0.0.39) on upstream ports. /etc/cluster/cluster.conf for each cluster node is same (only important part of config is here, ask for more, if needed): hosts ping each other, so networking part, as far as basic ip and unicast is concerned, is working properly. When starting, cman_tool says: cluster # cman_tool join -d multicast address 224.0.0.11 if eth0 for mcast address 224.0.0.11 setup up interface for address: node1 and as I can see from strace, "cman_tool join", this is what happens: socket(0x1f /* PF_??? */, SOCK_DGRAM, 2) = 3 ioctl(3, 0x780b, 0x2) = 0 setsockopt(3, 0x2 /* SOL_?? */, 109, [6516590], 4) = 0 [...] socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4 bind(4, {sa_family=AF_INET, sin_port=htons(6809), sin_addr=inet_addr("224.0.0.11")}, 16) = 0 socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 5 bind(5, {sa_family=AF_INET, sin_port=htons(6809), sin_addr=inet_addr("1.1.1.2")}, 16) = 0 setsockopt(3, 0x2 /* SOL_?? */, 100, "\4\0\0\0\0\0\0\0", 8) = 0 setsockopt(3, 0x2 /* SOL_?? */, 103, "\5\0\0\0\0\0\0\0", 8) = 0 setsockopt(3, 0x2 /* SOL_?? */, 101, "\1\344\377\277\3\0\0\0\0\0\0\0\1\0\0\0smtp\0\'\1@\210\0"..., 36) = 0 close(3) = 0 exit_group(0) = ? I've checked some programming examples on multicast as well as code for cman, and I thing cman_tool/join.c has two problems: - it never seems to issue setsockopt(..., IP_ADD_MEMBERSHIP...), thus, never joins the group. I believe the problem is in if (!bcast) check, which, if replaced with "if (bhe)" should work fine... - it binds the socket with multicast address (fd = 4 in my case) instead of local address. If the examples I looked are true, one should bind local interface, and then specify mcast address in setsockopt call. 
Can someone comment this issue? Am I going in completly wrong direction, or multicast support isn't ready yet? -- Lazar Obradovic, System Engineer ----- laza at YU.net YUnet International http://www.EUnet.yu Dubrovacka 35/III, 11000 Belgrade Tel: +381 11 3119901; Fax: +381 11 3119901 ----- This e-mail is confidential and intended only for the recipient. Unauthorized distribution, modification or disclosure of its contents is prohibited. If you have received this e-mail in error, please notify the sender by telephone +381 11 3119901. ----- From mtilstra at redhat.com Tue Aug 3 17:58:12 2004 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Tue, 3 Aug 2004 12:58:12 -0500 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence In-Reply-To: <4E022DDAB8F45741914ACD6EDFE2309B324A8D@BIGFOOT.kazeon.local> References: <4E022DDAB8F45741914ACD6EDFE2309B324A8D@BIGFOOT.kazeon.local> Message-ID: <20040803175812.GA6470@redhat.com> On Tue, Aug 03, 2004 at 09:22:59AM -0700, Steve Landherr wrote: > In a netsplit, what does fencing achieve when done by a node that > doesn't have quorum? It still won't have quorum. It should probably > just clean up as best it can and leave the rest of the cluster alone. Think of the world as the nodes see it, not as you see it. You cannot tell if it is a netsplit or truely nodes dieing. It looks the same. -- Michael Conrad Tadpol Tilstra Caution: breathing may be hazardous to your health. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From alewis at redhat.com Tue Aug 3 18:03:07 2004 From: alewis at redhat.com (AJ Lewis) Date: Tue, 3 Aug 2004 13:03:07 -0500 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence In-Reply-To: <20040803175812.GA6470@redhat.com> References: <4E022DDAB8F45741914ACD6EDFE2309B324A8D@BIGFOOT.kazeon.local> <20040803175812.GA6470@redhat.com> Message-ID: <20040803180307.GD25464@null.msp.redhat.com> On Tue, Aug 03, 2004 at 12:58:12PM -0500, Michael Conrad Tadpol Tilstra wrote: > On Tue, Aug 03, 2004 at 09:22:59AM -0700, Steve Landherr wrote: > > In a netsplit, what does fencing achieve when done by a node that > > doesn't have quorum? It still won't have quorum. It should probably > > just clean up as best it can and leave the rest of the cluster alone. > > Think of the world as the nodes see it, not as you see it. You cannot > tell if it is a netsplit or truely nodes dieing. It looks the same. Or in other words, if 2 of 3 nodes go AWOL, do you really want your cluster to stop until someone comes in an manually starts things up again - especially if in the meantime, the 2 bad nodes can write to the disks? The fact that a single node can kill of the other nodes is a good thing! -- AJ Lewis Voice: 612-638-0500 Red Hat Inc. E-Mail: alewis at redhat.com 720 Washington Ave. SE, Suite 200 Minneapolis, MN 55414 Current GPG fingerprint = D9F8 EDCE 4242 855F A03D 9B63 F50C 54A8 578C 8715 Grab the key at: http://people.redhat.com/alewis/gpg.html or one of the many keyservers out there... -----Begin Obligatory Humorous Quote---------------------------------------- Behind every good computer -- is a jumble of wires 'n stuff. -----End Obligatory Humorous Quote------------------------------------------ -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From landherr at kazeon.com Tue Aug 3 18:03:40 2004 From: landherr at kazeon.com (Steve Landherr) Date: Tue, 3 Aug 2004 11:03:40 -0700 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence Message-ID: <4E022DDAB8F45741914ACD6EDFE2309B324AAD@BIGFOOT.kazeon.local> Agreed. But I though that the quorum concept was there to protect against a netsplit causing independent clusters to form. If I'm a node and I can't contact enough peers to gain quorum, then I can't be a part of the cluster. If I'm not part of the cluster, then I shouldn't be fencing other nodes. Am I missing something? -steve -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Michael Conrad Tadpol Tilstra Sent: Tuesday, August 03, 2004 10:58 AM To: Discussion of clustering software components including GFS Subject: Re: [Linux-cluster] GFS 6.0 node without quorum tries to fence On Tue, Aug 03, 2004 at 09:22:59AM -0700, Steve Landherr wrote: > In a netsplit, what does fencing achieve when done by a node that > doesn't have quorum? It still won't have quorum. It should probably > just clean up as best it can and leave the rest of the cluster alone. Think of the world as the nodes see it, not as you see it. You cannot tell if it is a netsplit or truely nodes dieing. It looks the same. From laza at yu.net Tue Aug 3 18:04:39 2004 From: laza at yu.net (Lazar Obradovic) Date: Tue, 03 Aug 2004 20:04:39 +0200 Subject: [Linux-cluster] Multicast for GFS? In-Reply-To: <1091553471.16747.165.camel@laza.eunet.yu> References: <1091553471.16747.165.camel@laza.eunet.yu> Message-ID: <1091556279.30938.179.camel@laza.eunet.yu> Just an update, I've wrote my own mcast server and client based on the examples i've got, and they work perfectly with this kind of network setup (separate vlans and everything else), so I guess problem is actually in cman and it's mcast interface... I'll try to correct this today and send a patch. On Tue, 2004-08-03 at 19:17, Lazar Obradovic wrote: > Hi, > > can someone, please, give some advice about configuring multicast with > GFS? I know it might go out of topic, but it's perhaps useful for > others. > > I'd use broadcast instead, but I have a problem that two groups of > servers sharing the same storage, but that are located in different > vlans, separated by router-on-a-stick, so I guess I have to use > multicast. > > I've configured the router for multicast (config is right below), but it > doesn't seem to work. > > Here's ascii pic of what I'm trying to make: > > +--------+ > | router | > +--------+ > / \ > +----------+ +----------+ > | switch A | | switch B | > | vlan 100 | | vlan 200 | > +----------+ +----------+ > | | > +----------+ +----------+ > | server A | | server B | > +----------+ +----------+ > | | > +---------------------+ > | san / storage | > +---------------------+ > > and relevant config (that I made this far): > > router (cisco ios): > > ip multicast-routing > ! > interface FastEthernet0/0 > description Branch A > ip address 1.1.1.1 255.255.255.0 > ip pim sparse-dense-mode > ip igmp version 1 > encapsulation dot1q 100 > ! > interface FastEthernet0/1 > description Branch B > ip address 1.1.2.1 255.255.255.0 > ip pim sparse-dense-mode > ip igmp version 1 > encapsulation dot1q 200 > ! > ip pim send-rp-announce FastEthernet0/0 scope 16 > ip pim send-rp-discovery scope 16 > ! 
> > switch A and switch B are manageable Intel switches (dunno the exact > model; they are bundled with my IBM Blades), but have IGMP Snooping > turned on on every interface, and show default cisco pim groups > (224.0.0.40 and 224.0.0.39) on upstream ports. > > /etc/cluster/cluster.conf for each cluster node is same (only important > part of config is here, ask for more, if needed): > > > > > > > > > > > > > > > > hosts ping each other, so networking part, as far as basic ip and > unicast is concerned, is working properly. > > When starting, cman_tool says: > cluster # cman_tool join -d > multicast address 224.0.0.11 > if eth0 for mcast address 224.0.0.11 > setup up interface for address: node1 > > and as I can see from strace, "cman_tool join", this is what happens: > > socket(0x1f /* PF_??? */, SOCK_DGRAM, 2) = 3 > ioctl(3, 0x780b, 0x2) = 0 > setsockopt(3, 0x2 /* SOL_?? */, 109, [6516590], 4) = 0 > [...] > socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4 > bind(4, {sa_family=AF_INET, sin_port=htons(6809), sin_addr=inet_addr("224.0.0.11")}, 16) = 0 > socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 5 > bind(5, {sa_family=AF_INET, sin_port=htons(6809), sin_addr=inet_addr("1.1.1.2")}, 16) = 0 > setsockopt(3, 0x2 /* SOL_?? */, 100, "\4\0\0\0\0\0\0\0", 8) = 0 > setsockopt(3, 0x2 /* SOL_?? */, 103, "\5\0\0\0\0\0\0\0", 8) = 0 > setsockopt(3, 0x2 /* SOL_?? */, 101, "\1\344\377\277\3\0\0\0\0\0\0\0\1\0\0\0smtp\0\'\1@\210\0"..., 36) = 0 > close(3) = 0 > exit_group(0) = ? > > I've checked some programming examples on multicast as well as code for > cman, and I thing cman_tool/join.c has two problems: > > - it never seems to issue setsockopt(..., IP_ADD_MEMBERSHIP...), thus, > never joins the group. I believe the problem is in if (!bcast) check, > which, if replaced with "if (bhe)" should work fine... > > - it binds the socket with multicast address (fd = 4 in my case) instead > of local address. If the examples I looked are true, one should bind > local interface, and then specify mcast address in setsockopt call. > > Can someone comment this issue? Am I going in completly wrong direction, > or multicast support isn't ready yet? -- Lazar Obradovic, System Engineer ----- laza at YU.net YUnet International http://www.EUnet.yu Dubrovacka 35/III, 11000 Belgrade Tel: +381 11 3119901; Fax: +381 11 3119901 ----- This e-mail is confidential and intended only for the recipient. Unauthorized distribution, modification or disclosure of its contents is prohibited. If you have received this e-mail in error, please notify the sender by telephone +381 11 3119901. ----- From mtilstra at redhat.com Tue Aug 3 18:44:37 2004 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Tue, 3 Aug 2004 13:44:37 -0500 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence In-Reply-To: <4E022DDAB8F45741914ACD6EDFE2309B324AAD@BIGFOOT.kazeon.local> References: <4E022DDAB8F45741914ACD6EDFE2309B324AAD@BIGFOOT.kazeon.local> Message-ID: <20040803184437.GA6865@redhat.com> On Tue, Aug 03, 2004 at 11:03:40AM -0700, Steve Landherr wrote: > Agreed. But I though that the quorum concept was there to protect > against a netsplit causing independent clusters to form. If I'm a node > and I can't contact enough peers to gain quorum, then I can't be a part > of the cluster. If I'm not part of the cluster, then I shouldn't be > fencing other nodes. > > Am I missing something? In a lights out setup. (no user required to keep things running) 2 of three nodes go AWOL. 
One node now needs to reset (NPS fencing) the other two, without quorum, to keep things running. -- Michael Conrad Tadpol Tilstra A hacker is a machine for turning caffeine into code. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From landherr at kazeon.com Tue Aug 3 19:03:15 2004 From: landherr at kazeon.com (Steve Landherr) Date: Tue, 3 Aug 2004 12:03:15 -0700 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence Message-ID: <4E022DDAB8F45741914ACD6EDFE2309B324AC2@BIGFOOT.kazeon.local> But if you do it that way, and you really have a netsplit, won't you get into a "quickdraw" situation where each of the newly formed clusters are trying to fence out the others? In the worst case, all the nodes get reset and nobody is happy. But maybe the worst case happens so infrequently that it is better than always losing the cluster whenever quorum is lost. But then again, from any one node's perspective, how often do multiple nodes drop out of a cluster at the same time and the problem is not either a netsplit or a glitch on the local node? Just pondering... -steve -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Michael Conrad Tadpol Tilstra Sent: Tuesday, August 03, 2004 11:45 AM To: Discussion of clustering software components including GFS Subject: Re: [Linux-cluster] GFS 6.0 node without quorum tries to fence In a lights out setup. (no user required to keep things running) 2 of three nodes go AWOL. One node now needs to reset (NPS fencing) the other two, without quorum, to keep things running. From gshi at ncsa.uiuc.edu Tue Aug 3 19:19:43 2004 From: gshi at ncsa.uiuc.edu (Guochun Shi) Date: Tue, 03 Aug 2004 14:19:43 -0500 Subject: [Linux-cluster] gnbd: finiband supprt and multi-device to one device mapping Message-ID: <5.1.0.14.2.20040803140638.03a99680@pop.ncsa.uiuc.edu> hi, I am interested in adding 2 features in gnbd: 1. finiband support for communication between a gnbd client and a gnbd server. 2. multi-device to one device mapping: multiples servers export their devices and one client import from those servers. The client sees one device, who is a wrapper device for all devices in servers. Writing/reading to the device will be distributed to different servers according to different writing/reading position, size, etc .... I did not find any design document for gnbd so far. I will certainly be grateful if someone can point me to any URL for that. comments and suggestions are welcome. thanks -Guochun From mtilstra at redhat.com Tue Aug 3 19:41:54 2004 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Tue, 3 Aug 2004 14:41:54 -0500 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence In-Reply-To: <4E022DDAB8F45741914ACD6EDFE2309B324AC2@BIGFOOT.kazeon.local> References: <4E022DDAB8F45741914ACD6EDFE2309B324AC2@BIGFOOT.kazeon.local> Message-ID: <20040803194154.GA7790@redhat.com> On Tue, Aug 03, 2004 at 12:03:15PM -0700, Steve Landherr wrote: > But if you do it that way, and you really have a netsplit, won't you get > into a "quickdraw" situation where each of the newly formed clusters are > trying to fence out the others? In the worst case, all the nodes get > reset and nobody is happy. But maybe the worst case happens so > infrequently that it is better than always losing the cluster whenever > quorum is lost. yes, worst case everyone reboots. BUT! 
the data on disc is safe. There is probbly a better way to go about this, but we've currently kept the idea that an unexpected reboot is better than looking for backup tapes. > But then again, from any one node's perspective, how often do multiple > nodes drop out of a cluster at the same time and the problem is not > either a netsplit or a glitch on the local node? no idea. > Just pondering... cool. -- Michael Conrad Tadpol Tilstra It's never too late to have a happy childhood. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From smelkovs at worldsoft.ch Tue Aug 3 20:04:53 2004 From: smelkovs at worldsoft.ch (Konrads Smelkovs) Date: Tue, 03 Aug 2004 22:04:53 +0200 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence In-Reply-To: <20040803194154.GA7790@redhat.com> References: <4E022DDAB8F45741914ACD6EDFE2309B324AC2@BIGFOOT.kazeon.local> <20040803194154.GA7790@redhat.com> Message-ID: <410FEFE5.8090301@worldsoft.ch> Michael Conrad Tadpol Tilstra wrote: > >yes, worst case everyone reboots. BUT! the data on disc is safe. There >is probbly a better way to go about this, but we've currently kept the >idea that an unexpected reboot is better than looking for backup tapes. > > > I don't think it is smart enough. This kind of assumes that the fencing method is power. Suppose people are running only on san fencing. From bmarzins at redhat.com Tue Aug 3 20:25:15 2004 From: bmarzins at redhat.com (Benjamin Marzinski) Date: Tue, 3 Aug 2004 15:25:15 -0500 Subject: [Linux-cluster] gnbd: finiband supprt and multi-device to one device mapping In-Reply-To: <5.1.0.14.2.20040803140638.03a99680@pop.ncsa.uiuc.edu> References: <5.1.0.14.2.20040803140638.03a99680@pop.ncsa.uiuc.edu> Message-ID: <20040803202515.GR23619@phlogiston.msp.redhat.com> On Tue, Aug 03, 2004 at 02:19:43PM -0500, Guochun Shi wrote: > hi, > > I am interested in adding 2 features in gnbd: > > 1. finiband support for communication between a gnbd client and a gnbd server. > > 2. multi-device to one device mapping: multiples servers export their devices and one client import from those servers. The client sees one device, who is a wrapper device for all devices in servers. Writing/reading to the device will be distributed to different servers according to different writing/reading position, size, etc .... > > I did not find any design document for gnbd so far. I will certainly be grateful if someone can point me to any URL for that. > > comments and suggestions are welcome. There isn't very much in the way of current GNBD documentation https://open.datacore.ch/DCwiki.open/Wiki.jsp?page=GFS https://open.datacore.ch/DCwiki.open/Wiki.jsp?page=GFS.GNBD.Usage has the most current info Otherwise, there are the man pages, which I haven't got around to updating yet, but are still mostly accurate. Oh, and http://www.redhat.com/docs/manuals/csgfs/admin-guide/ This is the old administrator's guide, but again, it's mostly uptodate. As soon as I finish up some coding that needs to get done, I will set to work on some documentation. If this isn't enough, I'd be glad to answer any questions you have, either via email, or via IRC (#linux-cluster on freenode) Good luck. 
-Ben Marzinski bmarzins at redhat.com > thanks > -Guochun > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From patrick.seinguerlet at e-asc.com Tue Aug 3 18:00:00 2004 From: patrick.seinguerlet at e-asc.com (Seinguerlet Patrick) Date: Tue, 3 Aug 2004 20:00:00 +0200 Subject: [Linux-cluster] lock_dlm: init_fence error -1 Message-ID: <003701c47983$b59787e0$0100a8c0@amdk6> When I would like to mount the GFS file system, this messages appear. What can I do? mount -t gfs /dev/test_gfs/lv_test /mnt lock_dlm: init_fence error -1 GFS: can't mount proto = lock_dlm, table = test:partage1, hostdata = mount: permission denied I use a debian and I use the documentation file for install. Patrick From amir at datacore.ch Tue Aug 3 21:02:10 2004 From: amir at datacore.ch (Amir Guindehi) Date: Tue, 03 Aug 2004 23:02:10 +0200 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence In-Reply-To: <20040803194154.GA7790@redhat.com> References: <4E022DDAB8F45741914ACD6EDFE2309B324AC2@BIGFOOT.kazeon.local> <20040803194154.GA7790@redhat.com> Message-ID: <410FFD52.2050705@datacore.ch> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi all, |>But if you do it that way, and you really have a netsplit, won't you get |>into a "quickdraw" situation where each of the newly formed clusters are |>trying to fence out the others? In the worst case, all the nodes get |>reset and nobody is happy. But maybe the worst case happens so |>infrequently that it is better than always losing the cluster whenever |>quorum is lost. | | yes, worst case everyone reboots. BUT! the data on disc is safe. There | is probbly a better way to go about this, but we've currently kept the | idea that an unexpected reboot is better than looking for backup tapes. I think you would need a form of /atomic fencing/ )which is not really possible afaik) to resolve this problem. It's a race, and as long as you can't do it atomically you can always reach above worst case state. - - Amir - -- Amir Guindehi, nospam.amir at datacore.ch DataCore GmbH, Witikonerstrasse 289, 8053 Zurich, Switzerland -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.2-nr1 (Windows 2000) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFBD/1PbycOjskSVCwRAoRcAJ9cq5+FiQnIx817IdEthaB6HTgPTQCg9G3n MLA+ulKC4Jh3BQZLbPq59/0= =Gy9x -----END PGP SIGNATURE----- From amanthei at redhat.com Tue Aug 3 21:17:44 2004 From: amanthei at redhat.com (Adam Manthei) Date: Tue, 3 Aug 2004 16:17:44 -0500 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence In-Reply-To: <410FEFE5.8090301@worldsoft.ch> References: <4E022DDAB8F45741914ACD6EDFE2309B324AC2@BIGFOOT.kazeon.local> <20040803194154.GA7790@redhat.com> <410FEFE5.8090301@worldsoft.ch> Message-ID: <20040803211744.GD26705@redhat.com> > > Michael Conrad Tadpol Tilstra wrote: > >yes, worst case everyone reboots. BUT! the data on disc is safe. There > >is probbly a better way to go about this, but we've currently kept the > >idea that an unexpected reboot is better than looking for backup tapes. On Tue, Aug 03, 2004 at 10:04:53PM +0200, Konrads Smelkovs wrote: > I don't think it is smart enough. This kind of assumes that the fencing > method is power. Suppose people are running only on san fencing. Is a node/resource that has access to the cluster only through IP (say if it is a dedicated lock_gulmd server) really fenced if SAN fencing is used? I would argue not. 
After the a SAN fencing action has successfully returned, the lock_gulmd resources still would have access to the rest of the cluster. In this case, you may as well have used /bin/true as your fencing agent. SAN fencing for the lock_gulmd cluster resource is the wrong tool for the job (unless you are doing IP traffic through that SAN). So, you're right, it's not smart enough. What's worse is that it relies upon the admins being smart enough to realize this before their cluster is configured ;) This is probably a point worth adding to our FAQ if it's not already there. -- Adam Manthei From alewis at redhat.com Tue Aug 3 22:11:45 2004 From: alewis at redhat.com (AJ Lewis) Date: Tue, 3 Aug 2004 17:11:45 -0500 Subject: [Linux-cluster] Re: Call for presentation materials; Attendee List In-Reply-To: <200408031427.46419.phillips@redhat.com> References: <200408031427.46419.phillips@redhat.com> Message-ID: <20040803221145.GG25464@null.msp.redhat.com> On Tue, Aug 03, 2004 at 02:27:46PM -0400, Daniel Phillips wrote: > And thanks for your part in making the first-ever Minneapolis Cluster > Summit a great success. Here is the attendee list as promised, in the > form of a massive cc list. You're encouraged to "Reply All" with any > comments, suggestions, gripes, flames or other constructive material. > Please don't worry about generating n-squared traffic, as n is yet low. The attendee list is good for everyone who attended to have, but let's not use it as a primary means of communication (see below) > There may be a few names on the list who didn't actually make it; no > matter, I'd rather err on the side of not leaving anybody out who was > there. If you know of anybody I left out, could you please email me. Trying to keep track of people that need to be added to this CC list is going to be a real pain. I know for a fact at least 3 people from the Red Hat team have been accidentally left off of the list, not to mention what happens if people's e-mail address change, or they don't want to get this traffic anymore. I strongly recommend people post to linux-cluster at redhat.com if they want everyone involved to see what's going on. There is also the added advantage that discussions will be preserved for future reference in the mailing list archives. > Speakers: > > Including non-Red Hat speakers, please send me your slides and any > supporting materials you feel are relevant. Even (pointers to) code > would be fine, and white papers certainly qualify. Even png scans of > napkins might get posted :-) This is all for posting to the community > cluster site. > > Please send attachments, not URL's, except for code and/or links to > project homepages. Please indicate which text/links in your email are > to appear on the web page. Please include a short bio, just a few > words. Anyone on linux-cluster who has material to post on the source.redhat.com/cluster web page, please post to the list asking for its inclusion. > Everybody: > > This cc list is for ongoing discussion of the material we covered at the > summit, in particular: > > * Possible amendments to CMAN interfaces to accommodate your > own project - now is a good time to speak. > > * Possible amendments to GDLM, with a view to adoption by other > projects besides GFS > > * Mainline submission track. Code and [RFC]'s should start > appearing on lkml in October. What code? How changed? What > supporting arguments? Now is the time to sort that out. 
> > * Clustered Samba: anybody out there willing to look at how to > graft oplocks, weird case translations, clustered tdb, etc onto > gfs, please shout > > Anything else that's on your mind Seems to me all this should be discussed on linux-cluster as well. > The conference information page has been updated with the "as built" > conference schedule: > > http://sources.redhat.com/cluster/events/summit2004/info.html > > Thanks once again for your tremendous support. > > Regards, > > Daniel Regards, -- AJ Lewis Voice: 612-638-0500 Red Hat Inc. E-Mail: alewis at redhat.com 720 Washington Ave. SE, Suite 200 Minneapolis, MN 55414 Current GPG fingerprint = D9F8 EDCE 4242 855F A03D 9B63 F50C 54A8 578C 8715 Grab the key at: http://people.redhat.com/alewis/gpg.html or one of the many keyservers out there... -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From phillips at redhat.com Wed Aug 4 00:54:25 2004 From: phillips at redhat.com (Daniel Phillips) Date: Tue, 3 Aug 2004 20:54:25 -0400 Subject: [Linux-cluster] Re: Call for presentation materials; Attendee List In-Reply-To: <20040803221145.GG25464@null.msp.redhat.com> References: <200408031427.46419.phillips@redhat.com> <20040803221145.GG25464@null.msp.redhat.com> Message-ID: <200408032054.25297.phillips@redhat.com> On Tuesday 03 August 2004 18:11, AJ Lewis wrote: Anybody who spoke at the cluster summit, please email me their slides and other supporting material. Thanks for your input, AJ. Regards, Daniel From mnerren at paracel.com Wed Aug 4 01:12:01 2004 From: mnerren at paracel.com (micah nerren) Date: Tue, 03 Aug 2004 18:12:01 -0700 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <20040803144019.GA4365@redhat.com> References: <1091458082.8356.23.camel@angmar> <20040802154605.GC1518@redhat.com> <1091480394.8356.58.camel@angmar> <20040803144019.GA4365@redhat.com> Message-ID: <1091581920.8356.257.camel@angmar> Hi, On Tue, 2004-08-03 at 07:40, Michael Conrad Tadpol Tilstra wrote: > On Mon, Aug 02, 2004 at 01:59:55PM -0700, micah nerren wrote: > [snip] > > I hope this helps!! > [snip] > > yeah, looks like a stack overflow. > here's a patch that I put in for 6.0. (patch works on 6.0.0-7) > I applied the patch to 6.0.0-7, rebuild the entire package, and I still get the crash when I mount. Below is the text of the crash. Any ideas? I double and triple checked that the patch was indeed applied to the code I was building and it was. 
Thanks, Micah /////////////// Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 printing rip: ffffffff8024a875 PML4 77caf067 PGD 7a78f067 PMD 0 Oops: 0002 CPU 0 Pid: 4056, comm: mount Not tainted RIP: 0010:[]{net_rx_action+213} RSP: 0018:0000010077d93048 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ffffffff806077e8 RCX: ffffffff80607988 RDX: ffffffff806077e8 RSI: 0000010077d68800 RDI: ffffffff806077d0 RBP: ffffffff80607668 R08: 00000000824c6a9c R09: 00000000004c824c R10: 000000000100007f R11: 0000000000000000 R12: ffffffff806077e8 R13: ffffffff806077c0 R14: 000000000000ed06 R15: 0000000000000000 FS: 0000002a955764c0(0000) GS:ffffffff805d9840(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0 Call Trace: []{net_rx_action+173} []{do_softirq+174} []{ip_finish_output2+0} []{dst_output+0} []{do_softirq_thunk+53} []{.text.lock.netfilter+165} []{dst_output+0} []{ip_queue_xmit+1019} []{ip_rcv_finish+0} []{ip_rcv_finish+528} []{nf_hook_slow+305} []{ip_rcv_finish+0} []{tcp_transmit_skb+1295} []{tcp_write_xmit+198} []{tcp_sendmsg+4051} []{inet_sendmsg+69} []{sock_sendmsg+142} []{:lock_gulm:do_tfer+369} []{:lock_gulm:.rodata.str1.1+467} []{:lock_gulm:xdr_send+37} []{:lock_gulm:xdr_enc_flush+56} []{:lock_gulm:lg_lock_login+301} []{:lock_gulm:lt_login+57} []{:lock_gulm:gulm_core_login_reply+164} []{:lock_gulm:core_cb+0} []{:lock_gulm:lg_core_handle_messages+315} []{:lock_gulm:lg_core_login+323} []{:lock_gulm:cm_login+122} []{:lock_gulm:start_gulm_threads+174} []{:lock_gulm:gulm_mount+616} []{:gfs:gfs_glock_cb+0} []{:lock_harness:lm_mount_Rsmp_ad6c5c21+355} []{:gfs:gfs_glock_cb+0} []{:gfs:gfs_mount_lockproto+313} []{do_anonymous_page+1234} []{do_no_page+95} []{do_page_fault+627} []{error_exit+0} []{create_elf_tables+261} []{__alloc_pages+156} []{:gfs:gfs_read_super+1307} []{:gfs:gfs_fs_type+0} []{get_sb_bdev+588} []{:gfs:gfs_fs_type+0} []{do_kern_mount+121} []{do_add_mount+161} []{do_mount+345} []{__get_free_pages+16} []{sys_mount+197} []{system_call+119} Process mount (pid: 4056, stackpage=10077d93000) Stack: 0000010077d93048 0000000000000018 ffffffff8024a84d 0000012a80445d20 0000000000000001 ffffffff80606c60 0000000000000000 000000000000000a 0000000000000000 0000000000000002 ffffffff8012a72e ffffffff80267cf0 0000000000000246 0000000000000000 0000000000000003 ffffffff80445d20 ffffffff80267cc0 0000000000000000 ffffffff802b5915 0000000000000043 0000000000000006 00000100796a109e 000001007c6231c0 0000000000000000 0000000000000000 ffffffff8049c648 0000000000000000 ffffffff806077c0 ffffffff802533a7 ffffffff80267cc0 ffffffff80445d20 0000000000000002 000001007c6231c0 ffffffff805abcd0 00000100796a10ac 000001007c6231c0 0000010077d68800 0000000000000000 0000010077d68800 000001007c623228 Call Trace: []{net_rx_action+173} []{do_softirq+174} []{ip_finish_output2+0} []{dst_output+0} []{do_softirq_thunk+53} []{.text.lock.netfilter+165} []{dst_output+0} []{ip_queue_xmit+1019} []{ip_rcv_finish+0} []{ip_rcv_finish+528} []{nf_hook_slow+305} []{ip_rcv_finish+0} []{tcp_transmit_skb+1295} []{tcp_write_xmit+198} []{tcp_sendmsg+4051} []{inet_sendmsg+69} []{sock_sendmsg+142} []{:lock_gulm:do_tfer+369} []{:lock_gulm:.rodata.str1.1+467} []{:lock_gulm:xdr_send+37} []{:lock_gulm:xdr_enc_flush+56} []{:lock_gulm:lg_lock_login+301} []{:lock_gulm:lt_login+57} []{:lock_gulm:gulm_core_login_reply+164} []{:lock_gulm:core_cb+0} []{:lock_gulm:lg_core_handle_messages+315} []{:lock_gulm:lg_core_login+323} 
[]{:lock_gulm:cm_login+122} []{:lock_gulm:start_gulm_threads+174} []{:lock_gulm:gulm_mount+616} []{:gfs:gfs_glock_cb+0} []{:lock_harness:lm_mount_Rsmp_ad6c5c21+355} []{:gfs:gfs_glock_cb+0} []{:gfs:gfs_mount_lockproto+313} []{do_anonymous_page+1234} []{do_no_page+95} []{do_page_fault+627} []{error_exit+0} []{create_elf_tables+261} []{__alloc_pages+156} []{:gfs:gfs_read_super+1307} []{:gfs:gfs_fs_type+0} []{get_sb_bdev+588} []{:gfs:gfs_fs_type+0} []{do_kern_mount+121} []{do_add_mount+161} []{do_mount+345} []{__get_free_pages+16} []{sys_mount+197} []{system_call+119} Code: 48 89 18 48 89 43 08 8b 85 90 01 00 00 85 c0 79 08 03 85 94 Kernel panic: Fatal exception In interrupt handler - not syncing NMI Watchdog detected LOCKUP on CPU0, eip ffffffff8011a948, registers: CPU 0 Pid: 4056, comm: mount Not tainted RIP: 0010:[]{smp_call_function+120} RSP: 0018:0000010077d92d48 EFLAGS: 00000097 RAX: 0000000000000000 RBX: ffffffff802cfc1a RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffffff8011a970 RBP: 0000000000000002 R08: 0000000000000005 R09: 0000000000000000 R10: 0000000000000000 R11: 00000000000003c8 R12: ffffffff802da247 R13: 0000000000000000 R14: 0000000000000002 R15: 0000010077d92f98 FS: 0000002a955764c0(0000) GS:ffffffff805d9840(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0 Call Trace: []{stop_this_cpu+0} []{smp_send_stop+25} []{panic+312} []{show_trace+666} []{show_stack+205} []{show_registers+304} []{die+268} []{do_page_fault+989} []{nf_hook_slow+305} []{ip_rcv_finish+0} []{ip_rcv_finish+528} []{error_exit+0} []{net_rx_action+213} []{net_rx_action+173} []{do_softirq+174} []{ip_finish_output2+0} []{dst_output+0} []{do_softirq_thunk+53} []{.text.lock.netfilter+165} []{dst_output+0} []{ip_queue_xmit+1019} []{ip_rcv_finish+0} []{ip_rcv_finish+528} []{nf_hook_slow+305} []{ip_rcv_finish+0} []{tcp_transmit_skb+1295} []{tcp_write_xmit+198} []{tcp_sendmsg+4051} []{inet_sendmsg+69} []{sock_sendmsg+142} []{:lock_gulm:do_tfer+369} []{:lock_gulm:.rodata.str1.1+467} []{:lock_gulm:xdr_send+37} []{:lock_gulm:xdr_enc_flush+56} []{:lock_gulm:lg_lock_login+301} []{:lock_gulm:lt_login+57} []{:lock_gulm:gulm_core_login_reply+164} []{:lock_gulm:core_cb+0} []{:lock_gulm:lg_core_handle_messages+315} []{:lock_gulm:lg_core_login+323} []{:lock_gulm:cm_login+122} []{:lock_gulm:start_gulm_threads+174} []{:lock_gulm:gulm_mount+616} []{:gfs:gfs_glock_cb+0} []{:lock_harness:lm_mount_Rsmp_ad6c5c21+355} []{:gfs:gfs_glock_cb+0} []{:gfs:gfs_mount_lockproto+313} []{do_anonymous_page+1234} []{do_no_page+95} []{do_page_fault+627} []{error_exit+0} []{create_elf_tables+261} []{__alloc_pages+156} []{:gfs:gfs_read_super+1307} []{:gfs:gfs_fs_type+0} []{get_sb_bdev+588} []{:gfs:gfs_fs_type+0} []{do_kern_mount+121} []{do_add_mount+161} []{do_mount+345} []{__get_free_pages+16} []{sys_mount+197} []{system_call+119} Process mount (pid: 4056, stackpage=10077d93000) Stack: 0000010077d92d48 0000000000000018 0000000000100000 0000000000000000 00000100079c4c80 ffffffff803e89a0 0000000000000000 00000100000fdea0 ffffffff803e8d00 00000100079bf000 00000100079d6400 0000000000000042 00000100079de280 ffffff0000000000 000000fffffff000 0000000000000000 00000100079d7a80 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000010077d92d48 0000000000000000 00000000006d9994 0000000000000003 0000000000000000 0000000000000000 0000000100000000 ffffffffffffffff 
ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff Call Trace: []{stop_this_cpu+0} []{smp_send_stop+25} []{panic+312} []{show_trace+666} []{show_stack+205} []{show_registers+304} []{die+268} []{do_page_fault+989} []{nf_hook_slow+305} []{ip_rcv_finish+0} []{ip_rcv_finish+528} []{error_exit+0} []{net_rx_action+213} []{net_rx_action+173} []{do_softirq+174} []{ip_finish_output2+0} []{dst_output+0} []{do_softirq_thunk+53} []{.text.lock.netfilter+165} []{dst_output+0} []{ip_queue_xmit+1019} []{ip_rcv_finish+0} []{ip_rcv_finish+528} []{nf_hook_slow+305} []{ip_rcv_finish+0} []{tcp_transmit_skb+1295} []{tcp_write_xmit+198} []{tcp_sendmsg+4051} []{inet_sendmsg+69} []{sock_sendmsg+142} []{:lock_gulm:do_tfer+369} []{:lock_gulm:.rodata.str1.1+467} []{:lock_gulm:xdr_send+37} []{:lock_gulm:xdr_enc_flush+56} []{:lock_gulm:lg_lock_login+301} []{:lock_gulm:lt_login+57} []{:lock_gulm:gulm_core_login_reply+164} []{:lock_gulm:core_cb+0} []{:lock_gulm:lg_core_handle_messages+315} []{:lock_gulm:lg_core_login+323} []{:lock_gulm:cm_login+122} []{:lock_gulm:start_gulm_threads+174} []{:lock_gulm:gulm_mount+616} []{:gfs:gfs_glock_cb+0} []{:lock_harness:lm_mount_Rsmp_ad6c5c21+355} []{:gfs:gfs_glock_cb+0} []{:gfs:gfs_mount_lockproto+313} []{do_anonymous_page+1234} []{do_no_page+95} []{do_page_fault+627} []{error_exit+0} []{create_elf_tables+261} []{__alloc_pages+156} []{:gfs:gfs_read_super+1307} []{:gfs:gfs_fs_type+0} []{get_sb_bdev+588} []{:gfs:gfs_fs_type+0} []{do_kern_mount+121} []{do_add_mount+161} []{do_mount+345} []{__get_free_pages+16} []{sys_mount+197} []{system_call+119} Code: 39 d0 75 f8 85 c9 74 10 8b 44 24 14 39 d0 74 08 8b 44 24 14 console shuts up ... NM I Watchdog detected LOCKUP on CPU1, eip ffffffff801a5419, registers: From bernd.schumacher at hp.com Wed Aug 4 06:12:51 2004 From: bernd.schumacher at hp.com (Schumacher, Bernd) Date: Wed, 4 Aug 2004 08:12:51 +0200 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence Message-ID: So, what I have learned from all answers is very bad news for me. It seems, what happened is as expected by most of you. But this means: ----------------------------------------------------------------------- --- One single point of failure in one node can stop the whole gfs. --- ----------------------------------------------------------------------- The single point of failure is: The lancard specified in "nodes.ccs:ip_interfaces" stops working on one node. No matter if this node was master or slave. The whole gfs is stopped: The rest of the cluster seems to need time to form a new cluster. The bad node does not need so much time for switching to arbitrary mode. So the bad node has enough time to fence all other nodes, before it would be fenced by the new master. The bad node lives but it can not form a cluster. GFS is not working. Now all other nodes will reboot. But after reboot they can not join the cluster, because they can not contact the bad node. The lancard is still broken. GFS is not working. Did I miss something? Please tell me that I am wrong! > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of > Schumacher, Bernd > Sent: Dienstag, 3. August 2004 13:56 > To: linux-cluster at redhat.com > Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence > > > Hi, > I have three nodes oben, mitte and unten. > > Test: > I have disabled eth0 on mitte, so that mitte will be excluded. 
> > Result: > Oben and unten are trying to fence mitte and build a new > cluster. OK! But mitte tries to fence oben and unten. PROBLEM! > > Why can this happen? Mitte knows that it can not build a > cluster. See Logfile from mitte: "Have 1, need 2" > > Logfile from mitte: > Aug 3 12:53:17 mitte lock_gulmd_core[1845]: Client (oben) > expired Aug 3 12:53:17 mitte lock_gulmd_core[1845]: Core lost > slave quorum. Have 1, need 2. Switching to Arbitrating. Aug > 3 12:53:17 mitte > lock_gulmd_core[2120]: Gonna exec fence_node oben Aug 3 > 12:53:17 mitte > lock_gulmd_core[1845]: Forked [2120] fence_node oben with a 0 > pause. Aug 3 12:53:17 mitte fence_node[2120]: Performing > fence method, manual, on oben. > > cluster.ccs: > cluster { > name = "tom" > lock_gulm { > servers = ["oben", "mitte", "unten"] > } > } > > fence.ccs: > fence_devices { > manual_oben { > agent = "fence_manual" > } > manual_mitte ... > > > nodes.ccs: > nodes { > oben { > ip_interfaces { > eth0 = "192.168.100.241" > } > fence { > manual { > manual_oben { > ipaddr = "192.168.100.241" > } > } > } > } > mitte ... > > regards > Bernd Schumacher > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-> cluster > From teigland at redhat.com Wed Aug 4 06:51:26 2004 From: teigland at redhat.com (David Teigland) Date: Wed, 4 Aug 2004 14:51:26 +0800 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence In-Reply-To: References: Message-ID: <20040804065126.GB13816@redhat.com> On Wed, Aug 04, 2004 at 08:12:51AM +0200, Schumacher, Bernd wrote: > So, what I have learned from all answers is very bad news for me. It > seems, what happened is as expected by most of you. But this means: > > ----------------------------------------------------------------------- > --- One single point of failure in one node can stop the whole gfs. --- > ----------------------------------------------------------------------- > > The single point of failure is: > The lancard specified in "nodes.ccs:ip_interfaces" stops working on one > node. No matter if this node was master or slave. > > The whole gfs is stopped: > The rest of the cluster seems to need time to form a new cluster. The > bad node does not need so much time for switching to arbitrary mode. So > the bad node has enough time to fence all other nodes, before it would > be fenced by the new master. > > The bad node lives but it can not form a cluster. GFS is not working. > > Now all other nodes will reboot. But after reboot they can not join the > cluster, because they can not contact the bad node. The lancard is still > broken. GFS is not working. > > Did I miss something? > Please tell me that I am wrong! Although it's still in development/testing, what you're looking for is the way cman/fenced works. When there's a network partition, the group with quorum will fence the group without quorum. If neither has quorum then no one will be fenced and neither side can run. Gulm could probably be designed to do fencing differently but I'm not sure how likely that is at this point. 
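To make that concrete, here is a minimal sketch of the quorum rule -- illustrative code only, not the actual cman/fenced source, and the vote counts are made up. A partition may fence the other side only while it still holds a strict majority of the expected votes:

    /* illustrative only: quorum-gated fencing decision */
    #include <stdio.h>

    static int have_quorum(int votes, int expected_votes)
    {
            /* quorate = strictly more than half of the expected votes */
            return 2 * votes > expected_votes;
    }

    int main(void)
    {
            int expected_votes = 3;   /* e.g. three single-vote nodes */
            int votes_visible = 1;    /* what an isolated node still sees */

            if (have_quorum(votes_visible, expected_votes))
                    printf("quorate: may fence the other partition\n");
            else
                    printf("inquorate: must not fence; suspend activity\n");
            return 0;
    }

Under that rule an isolated node sees only 1 of 3 votes, stays inquorate and never gets the chance to shoot the healthy majority -- which is the behaviour the gulm arbitrating node described above does not have.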
-- Dave Teigland From tom at regio.net Wed Aug 4 08:39:57 2004 From: tom at regio.net (tom at regio.net) Date: Wed, 4 Aug 2004 10:39:57 +0200 Subject: [Linux-cluster] errors on inode.c Message-ID: Hi all, im getting errors on inode.c make[4]: Entering directory `/usr/src/linux-2.6.7' CC [M] /tmp/rhgfs/cluster/gfs-kernel/src/gfs/inode.o /tmp/rhgfs/cluster/gfs-kernel/src/gfs/inode.c: In function `inode_init_and_link': /tmp/rhgfs/cluster/gfs-kernel/src/gfs/inode.c:1139: error: structure has no member named `ar_suiddir' make[5]: *** [/tmp/rhgfs/cluster/gfs-kernel/src/gfs/inode.o] Error 1 make[4]: *** [_module_/tmp/rhgfs/cluster/gfs-kernel/src/gfs] Error 2 anyone have an idea? m.f.G. regio[.NET] GmbH, Support Thomas Marmetschke Bahnhofstrasse 16 36037 Fulda Tel. +49 661 25000-0 Fax. +49 661 25000-49 From jeff at intersystems.com Wed Aug 4 10:37:16 2004 From: jeff at intersystems.com (Jeff) Date: Wed, 4 Aug 2004 06:37:16 -0400 Subject: [Linux-cluster] errors on inode.c In-Reply-To: References: Message-ID: <775913971.20040804063716@intersystems.com> Wednesday, August 4, 2004, 4:39:57 AM, tom at regio.net wrote: > Hi all, > im getting errors on inode.c > make[4]: Entering directory `/usr/src/linux-2.6.7' > CC [M] /tmp/rhgfs/cluster/gfs-kernel/src/gfs/inode.o > /tmp/rhgfs/cluster/gfs-kernel/src/gfs/inode.c: In function > `inode_init_and_link': > /tmp/rhgfs/cluster/gfs-kernel/src/gfs/inode.c:1139: error: structure has no > member named `ar_suiddir' > make[5]: *** [/tmp/rhgfs/cluster/gfs-kernel/src/gfs/inode.o] Error 1 > make[4]: *** [_module_/tmp/rhgfs/cluster/gfs-kernel/src/gfs] Error 2 > anyone have an idea? I ran into this when I moved from one of the snapshots to the cvs-latest. Issue "updatedb" and then "locate gfs_ioctl.h". Remove the copies outside of the source tree. The make script looks for header files in various places other than the source tree and if it finds them, it uses them in preference to the source tree. There may be similar problems with header files for cman-kernel and gfs-kernel. Also, the libraries moved between the snapshots and latest so if you did install the snapshot you need to execute: rm -rf /lib/libmagma* /lib/magma /lib/libgulm* rm -rf /lib/libccs* /lib/libdlm* before you build from cvs. From alewis at redhat.com Wed Aug 4 13:54:20 2004 From: alewis at redhat.com (AJ Lewis) Date: Wed, 4 Aug 2004 08:54:20 -0500 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence In-Reply-To: References: Message-ID: <20040804135420.GI25464@null.msp.redhat.com> On Wed, Aug 04, 2004 at 08:12:51AM +0200, Schumacher, Bernd wrote: > So, what I have learned from all answers is very bad news for me. It > seems, what happened is as expected by most of you. But this means: > > ----------------------------------------------------------------------- > --- One single point of failure in one node can stop the whole gfs. --- > ----------------------------------------------------------------------- > > The single point of failure is: > The lancard specified in "nodes.ccs:ip_interfaces" stops working on one > node. No matter if this node was master or slave. > > The whole gfs is stopped: > The rest of the cluster seems to need time to form a new cluster. The > bad node does not need so much time for switching to arbitrary mode. So > the bad node has enough time to fence all other nodes, before it would > be fenced by the new master. > > The bad node lives but it can not form a cluster. GFS is not working. > > Now all other nodes will reboot. 
But after reboot they can not join the > cluster, because they can not contact the bad node. The lancard is still > broken. GFS is not working. > > Did I miss something? > Please tell me that I am wrong! Well, I guess I'm confused how the node with the bad lan card can contact the fencing device to fence the other nodes. If it can't communicate with the other nodes because it's NIC is down, it can't contact the fencing device over that NIC either, right? Or are you using some alternate transport to contact the fencing device? > > -----Original Message----- > > From: linux-cluster-bounces at redhat.com > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of > > Schumacher, Bernd > > Sent: Dienstag, 3. August 2004 13:56 > > To: linux-cluster at redhat.com > > Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence > > > > > > Hi, > > I have three nodes oben, mitte and unten. > > > > Test: > > I have disabled eth0 on mitte, so that mitte will be excluded. > > > > Result: > > Oben and unten are trying to fence mitte and build a new > > cluster. OK! But mitte tries to fence oben and unten. PROBLEM! > > > > Why can this happen? Mitte knows that it can not build a > > cluster. See Logfile from mitte: "Have 1, need 2" > > > > Logfile from mitte: > > Aug 3 12:53:17 mitte lock_gulmd_core[1845]: Client (oben) > > expired Aug 3 12:53:17 mitte lock_gulmd_core[1845]: Core lost > > slave quorum. Have 1, need 2. Switching to Arbitrating. Aug > > 3 12:53:17 mitte > > lock_gulmd_core[2120]: Gonna exec fence_node oben Aug 3 > > 12:53:17 mitte > > lock_gulmd_core[1845]: Forked [2120] fence_node oben with a 0 > > pause. Aug 3 12:53:17 mitte fence_node[2120]: Performing > > fence method, manual, on oben. > > > > cluster.ccs: > > cluster { > > name = "tom" > > lock_gulm { > > servers = ["oben", "mitte", "unten"] > > } > > } > > > > fence.ccs: > > fence_devices { > > manual_oben { > > agent = "fence_manual" > > } > > manual_mitte ... > > > > > > nodes.ccs: > > nodes { > > oben { > > ip_interfaces { > > eth0 = "192.168.100.241" > > } > > fence { > > manual { > > manual_oben { > > ipaddr = "192.168.100.241" > > } > > } > > } > > } > > mitte ... > > > > regards > > Bernd Schumacher > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > http://www.redhat.com/mailman/listinfo/linux-> cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- AJ Lewis Voice: 612-638-0500 Red Hat Inc. E-Mail: alewis at redhat.com 720 Washington Ave. SE, Suite 200 Minneapolis, MN 55414 Current GPG fingerprint = D9F8 EDCE 4242 855F A03D 9B63 F50C 54A8 578C 8715 Grab the key at: http://people.redhat.com/alewis/gpg.html or one of the many keyservers out there... -----Begin Obligatory Humorous Quote---------------------------------------- "In this time of war against Osama bin Laden and the oppressive Taliban regime, we are thankful that OUR leader isn't the spoiled son of a powerful politician from a wealthy oil family who is supported by religious fundamentalists, operates through clandestine organizations, has no respect for the democratic electoral process, bombs innocents, and uses war to deny people their civil liberties." --The Boondocks -----End Obligatory Humorous Quote------------------------------------------ -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From bernd.schumacher at hp.com Wed Aug 4 14:06:32 2004 From: bernd.schumacher at hp.com (Schumacher, Bernd) Date: Wed, 4 Aug 2004 16:06:32 +0200 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence Message-ID: > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of AJ Lewis > Sent: Mittwoch, 4. August 2004 15:54 > To: Discussion of clustering software components including GFS > Subject: Re: [Linux-cluster] GFS 6.0 node without quorum > tries to fence > > > On Wed, Aug 04, 2004 at 08:12:51AM +0200, Schumacher, Bernd wrote: > > So, what I have learned from all answers is very bad news > for me. It > > seems, what happened is as expected by most of you. But this means: > > > > > ---------------------------------------------------------------------- > > - > > --- One single point of failure in one node can stop the > whole gfs. --- > > > -------------------------------------------------------------- > --------- > > > > The single point of failure is: > > The lancard specified in "nodes.ccs:ip_interfaces" stops working on > > one node. No matter if this node was master or slave. > > > > The whole gfs is stopped: > > The rest of the cluster seems to need time to form a new > cluster. The > > bad node does not need so much time for switching to > arbitrary mode. > > So the bad node has enough time to fence all other nodes, before it > > would be fenced by the new master. > > > > The bad node lives but it can not form a cluster. GFS is > not working. > > > > Now all other nodes will reboot. But after reboot they can not join > > the cluster, because they can not contact the bad node. The > lancard is > > still broken. GFS is not working. > > > > Did I miss something? > > Please tell me that I am wrong! > > Well, I guess I'm confused how the node with the bad lan card > can contact the fencing device to fence the other nodes. If > it can't communicate with the other nodes because it's NIC is > down, it can't contact the fencing device over that NIC > either, right? Or are you using some alternate transport to > contact the fencing device? There is a second admin Lan which is used for fencing. Could I probably use this second admin Lan for GFS Heartbeats too. Can I define two LAN-Cards in "nodes.ccs:ip_interfaces". If this works I would not have a single point of failure anymore. But the documentation seems not to allow this. I will test this tomorrow. > > > > -----Original Message----- > > > From: linux-cluster-bounces at redhat.com > > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of > > > Schumacher, Bernd > > > Sent: Dienstag, 3. August 2004 13:56 > > > To: linux-cluster at redhat.com > > > Subject: [Linux-cluster] GFS 6.0 node without quorum > tries to fence > > > > > > > > > Hi, > > > I have three nodes oben, mitte and unten. > > > > > > Test: > > > I have disabled eth0 on mitte, so that mitte will be excluded. > > > > > > Result: > > > Oben and unten are trying to fence mitte and build a new > > > cluster. OK! But mitte tries to fence oben and unten. PROBLEM! > > > > > > Why can this happen? Mitte knows that it can not build a > > > cluster. See Logfile from mitte: "Have 1, need 2" > > > > > > Logfile from mitte: > > > Aug 3 12:53:17 mitte lock_gulmd_core[1845]: Client (oben) > > > expired Aug 3 12:53:17 mitte lock_gulmd_core[1845]: Core lost > > > slave quorum. Have 1, need 2. 
Switching to Arbitrating. Aug > > > 3 12:53:17 mitte > > > lock_gulmd_core[2120]: Gonna exec fence_node oben Aug 3 > > > 12:53:17 mitte > > > lock_gulmd_core[1845]: Forked [2120] fence_node oben with a 0 > > > pause. Aug 3 12:53:17 mitte fence_node[2120]: Performing > > > fence method, manual, on oben. > > > > > > cluster.ccs: > > > cluster { > > > name = "tom" > > > lock_gulm { > > > servers = ["oben", "mitte", "unten"] > > > } > > > } > > > > > > fence.ccs: > > > fence_devices { > > > manual_oben { > > > agent = "fence_manual" > > > } > > > manual_mitte ... > > > > > > > > > nodes.ccs: > > > nodes { > > > oben { > > > ip_interfaces { > > > eth0 = "192.168.100.241" > > > } > > > fence { > > > manual { > > > manual_oben { > > > ipaddr = "192.168.100.241" > > > } > > > } > > > } > > > } > > > mitte ... > > > > > > regards > > > Bernd Schumacher > > > > > > -- > > > Linux-cluster mailing list > > > Linux-cluster at redhat.com > > > http://www.redhat.com/mailman/listinfo/linux-> cluster > > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > http://www.redhat.com/mailman/listinfo/linux-cluster > > -- > AJ Lewis Voice: 612-638-0500 > Red Hat Inc. E-Mail: alewis at redhat.com > 720 Washington Ave. SE, Suite 200 > Minneapolis, MN 55414 > > Current GPG fingerprint = D9F8 EDCE 4242 855F A03D 9B63 F50C > 54A8 578C 8715 Grab the key at: > http://people.redhat.com/alewis/gpg.html or > one of the many > keyservers out there... -----Begin Obligatory Humorous > Quote---------------------------------------- > "In this time of war against Osama bin Laden and the > oppressive Taliban regime, we are thankful that OUR leader > isn't the spoiled son of a powerful politician from a wealthy > oil family who is supported by religious fundamentalists, > operates through clandestine organizations, has no respect > for the democratic electoral process, bombs innocents, and > uses war to deny people their civil liberties." --The > Boondocks -----End Obligatory Humorous > Quote------------------------------------------ > From amanthei at redhat.com Wed Aug 4 14:20:22 2004 From: amanthei at redhat.com (Adam Manthei) Date: Wed, 4 Aug 2004 09:20:22 -0500 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence In-Reply-To: References: Message-ID: <20040804142022.GG26705@redhat.com> On Wed, Aug 04, 2004 at 04:06:32PM +0200, Schumacher, Bernd wrote: > > > The single point of failure is: > > > The lancard specified in "nodes.ccs:ip_interfaces" stops working on > > > one node. No matter if this node was master or slave. > > > > > > The whole gfs is stopped: > > > The rest of the cluster seems to need time to form a new cluster. The > > > bad node does not need so much time for switching to > > > arbitrary mode. So the bad node has enough time to fence all other > > > nodes, before it would be fenced by the new master. > > > > > > The bad node lives but it can not form a cluster. GFS is not working. > > > > > > Now all other nodes will reboot. But after reboot they can not join > > > the cluster, because they can not contact the bad node. The > > > lancard is still broken. GFS is not working. > > > > > > Did I miss something? > > > Please tell me that I am wrong! > > > > Well, I guess I'm confused how the node with the bad lan card > > can contact the fencing device to fence the other nodes. If > > it can't communicate with the other nodes because it's NIC is > > down, it can't contact the fencing device over that NIC > > either, right? 
Or are you using some alternate transport to > > contact the fencing device? > > There is a second admin Lan which is used for fencing. > > Could I probably use this second admin Lan for GFS Heartbeats too. Can I > > define two LAN-Cards in "nodes.ccs:ip_interfaces". If this works I would > > not have a single point of failure anymore. But the documentation seems > > not to allow this. > > I will test this tomorrow. GULM does not support multiple ethernet devices. In this case, you would want to architect your network so that the fence devices are on the same network as the heartbeats. However, if you did _NOT_ do that, the problem isn't as bad as you make it out to be. You're correct in thinking that there will be a shootout. One of your gulm servers will try to fence the others, and the others will try to fence the one. When the smoke clears, you will at worst be left with a single server. If that remaining server can no longer talk to the other lock_gulmd servers due to a net split, it will continue to sit in the arbitrating state waiting for the other nodes to login. The other nodes however will be able to start a new generation of the cluster when they restart because they will be quorate. If the other quorate part of the netsplit wins the shootout, you only lose the one node. If this is not acceptable, then you really need to rethink why the heartbeats are not going over the same interface as the fencing device. -Adam > > > > -----Original Message----- > > > > From: linux-cluster-bounces at redhat.com > > > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of > > > > Schumacher, Bernd > > > > Sent: Dienstag, 3. August 2004 13:56 > > > > To: linux-cluster at redhat.com > > > > Subject: [Linux-cluster] GFS 6.0 node without quorum > > tries to fence > > > > > > > > > > > > Hi, > > > > I have three nodes oben, mitte and unten. > > > > > > > > Test: > > > > I have disabled eth0 on mitte, so that mitte will be excluded. > > > > > > > > Result: > > > > Oben and unten are trying to fence mitte and build a new > > > > cluster. OK! But mitte tries to fence oben and unten. PROBLEM! > > > > > > > > Why can this happen? Mitte knows that it can not build a > > > > cluster. See Logfile from mitte: "Have 1, need 2" > > > > > > > > Logfile from mitte: > > > > Aug 3 12:53:17 mitte lock_gulmd_core[1845]: Client (oben) > > > > expired Aug 3 12:53:17 mitte lock_gulmd_core[1845]: Core lost > > > > slave quorum. Have 1, need 2. Switching to Arbitrating. Aug > > > > 3 12:53:17 mitte > > > > lock_gulmd_core[2120]: Gonna exec fence_node oben Aug 3 > > > > 12:53:17 mitte > > > > lock_gulmd_core[1845]: Forked [2120] fence_node oben with a 0 > > > > pause. Aug 3 12:53:17 mitte fence_node[2120]: Performing > > > > fence method, manual, on oben. > > > > > > > > cluster.ccs: > > > > cluster { > > > > name = "tom" > > > > lock_gulm { > > > > servers = ["oben", "mitte", "unten"] > > > > } > > > > } > > > > > > > > fence.ccs: > > > > fence_devices { > > > > manual_oben { > > > > agent = "fence_manual" > > > > } > > > > manual_mitte ... > > > > > > > > > > > > nodes.ccs: > > > > nodes { > > > > oben { > > > > ip_interfaces { > > > > eth0 = "192.168.100.241" > > > > } > > > > fence { > > > > manual { > > > > manual_oben { > > > > ipaddr = "192.168.100.241" > > > > } > > > > } > > > > } > > > > } > > > > mitte ...
> > > > > > > > regards > > > > Bernd Schumacher > > > > > > > > -- > > > > Linux-cluster mailing list > > > > Linux-cluster at redhat.com > > > > http://www.redhat.com/mailman/listinfo/linux-> cluster > > > > > > > > > > -- > > > Linux-cluster mailing list > > > Linux-cluster at redhat.com > > > http://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > > AJ Lewis Voice: 612-638-0500 > > Red Hat Inc. E-Mail: alewis at redhat.com > > 720 Washington Ave. SE, Suite 200 > > Minneapolis, MN 55414 > > > > Current GPG fingerprint = D9F8 EDCE 4242 855F A03D 9B63 F50C > > 54A8 578C 8715 Grab the key at: > > http://people.redhat.com/alewis/gpg.html or > one of the many > > keyservers out there... -----Begin Obligatory Humorous > > Quote---------------------------------------- > > "In this time of war against Osama bin Laden and the > > oppressive Taliban regime, we are thankful that OUR leader > > isn't the spoiled son of a powerful politician from a wealthy > > oil family who is supported by religious fundamentalists, > > operates through clandestine organizations, has no respect > > for the democratic electoral process, bombs innocents, and > > uses war to deny people their civil liberties." --The > > Boondocks -----End Obligatory Humorous > > Quote------------------------------------------ > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- Adam Manthei From mtilstra at redhat.com Wed Aug 4 15:33:38 2004 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Wed, 4 Aug 2004 10:33:38 -0500 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <1091581920.8356.257.camel@angmar> References: <1091458082.8356.23.camel@angmar> <20040802154605.GC1518@redhat.com> <1091480394.8356.58.camel@angmar> <20040803144019.GA4365@redhat.com> <1091581920.8356.257.camel@angmar> Message-ID: <20040804153338.GA10091@redhat.com> On Tue, Aug 03, 2004 at 06:12:01PM -0700, micah nerren wrote: > Hi, > > On Tue, 2004-08-03 at 07:40, Michael Conrad Tadpol Tilstra wrote: > > On Mon, Aug 02, 2004 at 01:59:55PM -0700, micah nerren wrote: > > [snip] > > > I hope this helps!! > > [snip] > > > > yeah, looks like a stack overflow. > > here's a patch that I put in for 6.0. (patch works on 6.0.0-7) > > > > I applied the patch to 6.0.0-7, rebuild the entire package, and I still > get the crash when I mount. Below is the text of the crash. > > Any ideas? I double and triple checked that the patch was indeed applied > to the code I was building and it was. well, it could still be a stack overflow, just some other function pushing it over the edge. I'll look over things later. Mostly just looking for things in the stack space of the functions listed in the backtrace for things that can take out of the stack and put onto the heap. (run sentence run!) -- Michael Conrad Tadpol Tilstra Today, I am the bug. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From mtilstra at redhat.com Wed Aug 4 15:40:36 2004 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Wed, 4 Aug 2004 10:40:36 -0500 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence In-Reply-To: <20040804142022.GG26705@redhat.com> References: <20040804142022.GG26705@redhat.com> Message-ID: <20040804154036.GB10091@redhat.com> On Wed, Aug 04, 2004 at 09:20:22AM -0500, adam manthei wrote: > On Wed, Aug 04, 2004 at 04:06:32PM +0200, Schumacher, Bernd wrote: > > > > The single point of failure is: > > > > The lancard specified in "nodes.ccs:ip_interfaces" stops working on > > > > one node. No matter if this node was master or slave. > > > > > > > > The whole gfs is stopped: > > > > The rest of the cluster seems to need time to form a new cluster. The > > > > bad node does not need so much time for switching to > > > > arbitrary mode. So the bad node has enough time to fence all other > > > > nodes, before it would be fenced by the new master. > > > > > > > > The bad node lives but it can not form a cluster. GFS is not working. > > > > > > > > Now all other nodes will reboot. But after reboot they can not join > > > > the cluster, because they can not contact the bad node. The > > > > lancard is still broken. GFS is not working. > > > > > > > > Did I miss something? > > > > Please tell me that I am wrong! > > > > > > Well, I guess I'm confused how the node with the bad lan card > > > can contact the fencing device to fence the other nodes. If > > > it can't communicate with the other nodes because it's NIC is > > > down, it can't contact the fencing device over that NIC > > > either, right? Or are you using some alternate transport to > > > contact the fencing device? > > > > There is a second admin Lan which is used for fencing. > > > > Could I probably use this second admin Lan for GFS Heartbeats too. Can I > > define two LAN-Cards in "nodes.ccs:ip_interfaces". If this works I would > > not have a single point of failure anymore. But the documentation seems > > not to allow this. > > I will test this tomorrow. > > GULM does not support multiple ethernet devices. In this case, you would > want to architect your network so that the fence devices are on the same > network as the heartbeats. > > However, if you did _NOT_ do that, the problem isn't as bad as you make it out > to be. You're correct in thinking that there will be a shootout. One of > your gulm servers will try to fence the others, and the others will try to > fence the one. When the smoke clears, you will at worst be left with a > single server. If that remaining server can no longer talk to the other > lock_gulmd servers due to a net split, it will continue to sit in the > arbitrating state waiting for the other nodes to login. The other nodes > however will be able to start a new generation of the cluster when they > restart because they will be quorate. If the other quorate part of the > netsplit wins the shootout, you only lose the one node. > > If this is not acceptable, then you really need to rethink why the > heartbeats are not going over the same interface as the fencing device. Unfortunately gulm has not yet had multiple network device support added. We've always meant to, but lacked the time and resources to do it. You really *must* put heartbeats/locktraffic/fencing/etc on the same network device. Things won't work the way they should otherwise.
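For example -- a sketch only, copying the style of the ccs fragments earlier in this thread; the fence device name, address and credentials below are invented -- a network power switch used for fencing would sit on the same 192.168.100.x network that nodes.ccs:ip_interfaces already uses, so heartbeats, lock traffic and fencing all ride on the same interface:

    fence.ccs:
    fence_devices {
        apc1 {
            agent = "fence_apc"
            ipaddr = "192.168.100.250"
            login = "apc"
            passwd = "apc"
        }
    }

    nodes.ccs:
    nodes {
        oben {
            ip_interfaces {
                eth0 = "192.168.100.241"
            }
            fence {
                power {
                    apc1 {
                        port = 1
                    }
                }
            }
        }
        ...
    }

Laid out like that, a node that loses eth0 also loses its path to the fence device, so it cannot shoot the healthy majority in the first place.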
-- Michael Conrad Tadpol Tilstra I used to be indecisive, but now I'm not sure. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From danderso at redhat.com Wed Aug 4 18:07:42 2004 From: danderso at redhat.com (Derek Anderson) Date: Wed, 4 Aug 2004 13:07:42 -0500 Subject: [Linux-cluster] lock_dlm: init_fence error -1 In-Reply-To: <003701c47983$b59787e0$0100a8c0@amdk6> References: <003701c47983$b59787e0$0100a8c0@amdk6> Message-ID: <200408041307.42543.danderso@redhat.com> Patrick, Please attach 'cat /proc/cluster/nodes' and 'cat /proc/cluster/services' from each of the nodes prior to the mount attempt. Also, any messages produced in /var/log/messages from the mount. On Tuesday 03 August 2004 13:00, Seinguerlet Patrick wrote: > When I would like to mount the GFS file system, this messages appear. > What can I do? > > mount -t gfs /dev/test_gfs/lv_test /mnt > lock_dlm: init_fence error -1 > GFS: can't mount proto = lock_dlm, table = test:partage1, hostdata = > mount: permission denied > > I use a debian and I use the documentation file for install. > > Patrick > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From hanafim at asc.hpc.mil Wed Aug 4 18:37:42 2004 From: hanafim at asc.hpc.mil (MAHMOUD HANAFI) Date: Wed, 04 Aug 2004 14:37:42 -0400 Subject: [Linux-cluster] GFS 5.2.1-28.3.0.11 file system corruption Message-ID: <41112CF6.1070609@asc.hpc.mil> We are currently running GFS with 8 IO nodes attached to a DDN S2A. We recently had GFS crashing any time a large number of files were being accessed. It turned out that one of the file systems was corrupted. We discovered the issue only by chance when I ran fsck.gfs on the file system. It ran for 20+ and corrected many corruptions. My question is: how robust is GFS? How can one test a file system for corruption without running fsck? Thanks From phillips at redhat.com Wed Aug 4 15:31:47 2004 From: phillips at redhat.com (Daniel Phillips) Date: Wed, 4 Aug 2004 11:31:47 -0400 Subject: [Linux-cluster] Nag for summit presentation materials Message-ID: <200408041131.47099.phillips@redhat.com> Hi all, It looks like this now: http://sources.redhat.com/cluster/events/summit2004/presentations.html * Patrick, thanks for the slides, but could you please suggest how to distribute them across your three presentations? * Lon, I don't have anything on "Cluster Resource Management", do I? * Mike... mike... earth to mike... :-) * Alan and Bruce, you've got lots of great stuff, can I please have some? * Alasdair, perhaps you didn't see the first two emails? It would be very nice to complete this process today. Regards, Daniel From lhh at redhat.com Wed Aug 4 16:20:41 2004 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 04 Aug 2004 12:20:41 -0400 Subject: [Linux-cluster] Re: Nag for summit presentation materials In-Reply-To: <200408041131.47099.phillips@redhat.com> References: <200408041131.47099.phillips@redhat.com> Message-ID: <1091636441.13608.206.camel@atlantis.boston.redhat.com> On Wed, 2004-08-04 at 11:31 -0400, Daniel Phillips wrote: > * Lon, I don't have anything on "Cluster Resource Management", do I? I sent it to you a few days ago; you noted that you would rename it to "lon.resources.sxi".
In any case: http://metamorphism.com/~lon/resources.sxi From phillips at redhat.com Wed Aug 4 16:28:41 2004 From: phillips at redhat.com (Daniel Phillips) Date: Wed, 4 Aug 2004 12:28:41 -0400 Subject: [Linux-cluster] Re: Nag for summit presentation materials In-Reply-To: <1091636441.13608.206.camel@atlantis.boston.redhat.com> References: <200408041131.47099.phillips@redhat.com> <1091636441.13608.206.camel@atlantis.boston.redhat.com> Message-ID: <200408041228.41434.phillips@redhat.com> On Wednesday 04 August 2004 12:20, Lon Hohberger wrote: > On Wed, 2004-08-04 at 11:31 -0400, Daniel Phillips wrote: > > * Lon, I don't have anything on "Cluster Resource Management", do > > I? > > I sent it to you a few days ago; you noted that you would rename it > to "lon.resources.sxi". Oh, then I put it under the wrong talk, I'll move it. But then, what do I put under "Magma - User level Cluster and Lock manager transparent library interface" ? Regards, Daniel From lhh at redhat.com Wed Aug 4 16:44:45 2004 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 04 Aug 2004 12:44:45 -0400 Subject: [Linux-cluster] Re: Nag for summit presentation materials In-Reply-To: <200408041228.41434.phillips@redhat.com> References: <200408041131.47099.phillips@redhat.com> <1091636441.13608.206.camel@atlantis.boston.redhat.com> <200408041228.41434.phillips@redhat.com> Message-ID: <1091637885.13608.209.camel@atlantis.boston.redhat.com> On Wed, 2004-08-04 at 12:28 -0400, Daniel Phillips wrote: > > Oh, then I put it under the wrong talk, I'll move it. > > But then, what do I put under "Magma - User level Cluster and Lock > manager transparent library interface" ? Oh, right. I'll bring that in tomorrow. It's on by defunct notebook. -- Lon From john.l.villalovos at intel.com Wed Aug 4 19:58:18 2004 From: john.l.villalovos at intel.com (Villalovos, John L) Date: Wed, 4 Aug 2004 12:58:18 -0700 Subject: [Linux-cluster] Re: Nag for summit presentation materials Message-ID: <60C14C611F1DDD4198D53F2F43D8CA3B018A9299@orsmsx410> linux-cluster-bounces at redhat.com wrote: > On Wednesday 04 August 2004 12:20, Lon Hohberger wrote: >> On Wed, 2004-08-04 at 11:31 -0400, Daniel Phillips wrote: >>> * Lon, I don't have anything on "Cluster Resource Management", do >>> I? >> >> I sent it to you a few days ago; you noted that you would rename it >> to "lon.resources.sxi". The web page: http://sources.redhat.com/cluster/events/summit2004/presentations.html Cluster resource management Presented by: Lon Hohberger, Red Hat Slides - Cluster Resources Has a link of: file:///src/sources.redhat.cvs/htdocs/events/summit2004/lon.magma.resour ces.sxi Which doesn't work :( John From jeff at intersystems.com Wed Aug 4 22:30:28 2004 From: jeff at intersystems.com (Jeff) Date: Wed, 4 Aug 2004 18:30:28 -0400 Subject: [Linux-cluster] cman doesn't load building out of cvs outside of the kernel Message-ID: <1502355489.20040804183028@intersystems.com> Following the current doc/usage.txt instructions for building outside of the kernel from cvs/latest (as of this afternoon) I get the following error trying to load the cman module. [root at lx3 cman-kernel]# modprobe cman FATAL: Error inserting cman (/lib/modules/2.6.7-smp/kernel/cluster/cman.ko): Operation not permitted [root at lx3 cman-kernel]# dmesg CMAN (built Aug 4 2004 12:34:28) installed NET: Registered protocol family 31 Unable to register cluster socket type Any suggestions on what I need to do to resolve this? 
TIA From buytenh at wantstofly.org Wed Aug 4 23:02:36 2004 From: buytenh at wantstofly.org (Lennert Buytenhek) Date: Thu, 5 Aug 2004 01:02:36 +0200 Subject: [Linux-cluster] cman doesn't load building out of cvs outside of the kernel In-Reply-To: <1502355489.20040804183028@intersystems.com> References: <1502355489.20040804183028@intersystems.com> Message-ID: <20040804230236.GB10696@xi.wantstofly.org> On Wed, Aug 04, 2004 at 06:30:28PM -0400, Jeff wrote: > [root at lx3 cman-kernel]# modprobe cman > FATAL: Error inserting cman (/lib/modules/2.6.7-smp/kernel/cluster/cman.ko): > Operation not permitted > [root at lx3 cman-kernel]# dmesg > > CMAN (built Aug 4 2004 12:34:28) installed > NET: Registered protocol family 31 > Unable to register cluster socket type > > Any suggestions on what I need to do to resolve this? Remove the bluetooth modules that you already have loaded (there is an AF_* identifier conflict still), then manually load cman, dlm, and such. --L From jeff at intersystems.com Thu Aug 5 03:41:45 2004 From: jeff at intersystems.com (Jeff) Date: Wed, 4 Aug 2004 23:41:45 -0400 Subject: [Linux-cluster] Strange behavior(s) of DLM Message-ID: <1909350721.20040804234145@intersystems.com> The attached routine demonstrates some strange behavior in the DLM and it was responsible for the dmesg text at the end of this note. This is on a FC2, SMP box running cvs/latest version of cman and the dlm. Its a 2 CPU box configured with 4 logical CPUs. I have a two node cluster and the two machines are identical as far as I can tell with the exception of which order they are listed in the cluster config file. On node #1 (in the config file) when I run the attached test from two terminals the output looks reasonable. The same as it does if I run it on Tru64 or VMS (more or less). 8923: over last 10.000 seconds, grant 8922, blkast 0, cancel 0 18730: over last 9.001 seconds, grant 9807, blkast 0, cancel 0 28403: over last 9.001 seconds, grant 9673, blkast 0, cancel 0 If you shut this down and start it up on node #2 (lx4) you start to get messages that look like: 91280: over last 10.000 seconds, grant 91279, blkast 0, cancel 0 125138: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ 125138: NL Blocking Notification on lockid 0x00010312 (mode 0) 125138: NL Blocking Notification Rountine End ^^^^^^^^^^^^^^^^^^^^ 141370: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ 141371: NL Blocking Notification on lockid 0x00010312 (mode 0) 141371: NL Blocking Notification Rountine End ^^^^^^^^^^^^^^^^^^^^ 141373: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ There are two strange things about this: 1) why does node #2 behave differently than node #1. I get the same results if I reboot both nodes and only node #2 joins the cluster. This seems to imply that the nodes aren't as identical as I think they are but... They are running the same kernel build and the same source from cvs (moved over as a tar file from one to another). 2) Why is a blocking ast routine associated with a NL lock being triggered. The test code may be a bit hard to follow but you can look at where this message comes from (nlblkrtn) and where nlblkrtn is used (DLM_CVT requests to convert to NULL). This looks like a race condition between queuing a new conversion request and delivering a blocking AST on the existing lock. I'm guessing that the conversion to NL is updating the AST pointers at a time when the blocking AST can still be delivered for the existing lock. 
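One possible defensive measure at the application level -- a sketch only; the struct, field and comments below are made up and not taken from the attached test program, and it assumes the userland libdlm header with its struct dlm_lksb and DLM_LOCK_NL definitions -- is to have the blocking-AST routine simply ignore notifications for a lock whose granted mode is already NL:

    #include <libdlm.h>    /* assumed: struct dlm_lksb, DLM_LOCK_NL */

    struct app_lock {                  /* hypothetical per-lock state */
            struct dlm_lksb lksb;
            int granted_mode;          /* updated by the completion AST */
    };

    /* Blocking-AST callback that tolerates a late or spurious
     * notification delivered while (or after) the lock is being
     * converted down to NL. */
    static void blocking_ast(void *astarg)
    {
            struct app_lock *lk = astarg;

            if (lk->granted_mode == DLM_LOCK_NL)
                    return;            /* nothing to give up; ignore it */

            /* otherwise queue the usual down-conversion here */
    }

That only papers over the symptom, though; the race in the lock module itself still needs fixing.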
I tripped over this because do_dlm_dispatch() ends in /* Call AST */ result.astaddr(result.astparam); return 0; and it doesn't check whether result.astaddr() is null or not. Its not valid to have a NULL completion AST routine but it is valid to have a NULL blocking AST routine. To go a bit further, its pretty common to have a null blocking AST routine on a conversion to NULL because the NULL lock can't block any other locks. dmesg output: ------------------------------------------------------------ CMAN: quorum regained, resuming activity dlm: default: recover event 1 (first) dlm: default: add nodes dlm: got connection from 1 dlm: default: total nodes 2 dlm: default: rebuild resource directory dlm: default: rebuilt 0 resources dlm: default: recover event 1 done dlm: default: recover event 1 finished dlm: default: release lkb with status 3 dlm: lkb id 102c9 remid 0 flags 4000 status 3 rqmode 5 grmode 3 nodeid 0 lqstate 0 lqflags 44 name "Test Lock" flags 4 nodeid 4294967295 ref 0 grant queue 000102c9 gr 5 rq -1 flg 24000 sts 2 node 0 remid 0 lq 0,44 est Lock" default cv 5 102c9 "Test Lock" default cv 3 1018a "Test Lock" default cv 0 102c9 "Test Lock" default cv 3 102c9 "Test Lock" default cv 5 1018a "Test Lock" default cv 0 102c9 "Test Lock" default cv 0 1018a "Test Lock" default cv 3 102c9 "Test Lock" default cv 3 1018a "Test Lock" default cv 5 1018a "Test Lock" default cv 0 102c9 "Test Lock" default cv 0 1018a "Test Lock" default cv 3 1018a "Test Lock" default cv 5 1018a "Test Lock" default cv 0 1018a "Test Lock" default cv 3 1018a "Test Lock" default cv 5 1018a "Test Lock" default cv 0 1018a "Test Lock" default cv 3 1018a "Test Lock" default cv 5 1018a "Test Lock" default cv 0 1018a "Test Lock" default cv 3 1018a "Test Lock" default cv 5 1018a "Test Lock" default cv 0 1018a "Test Lock" default cv 3 1018a "Test Lock" default cv 5 1018a "Test Lock" default cv 0 1018a "Test Lock" default cv 3 102c9 "Test Lock" default cv 3 1018a "Test Lock" default cv 5 102c9 "Test Lock" default cv 5 1018a "Test Lock" default un 1018a ref 1 flg 4 nodeid 0/-1 "Test Lock" DLM: Assertion failed on line 64 of file /usr/src/cvs/cluster_orig/dlm-kernel/src/rsb.c DLM: assertion: "list_empty(&r->res_grantqueue)" DLM: time = 948604 name "Test Lock" flags 4 nodeid 4294967295 ref 0 convert queue 000102c9 gr 5 rq 0 flg 4000 sts 3 node 0 remid 0 lq 2,44 est Lock" default cv 3 1018a "Test Lock" default cv 0 102c9 "Test Lock" default cv 3 102c9 "Test Lock" default cv 5 1018a "Test Lock" default cv 0 102c9 "Test Lock" default cv 0 1018a "Test Lock" default cv 3 102c9 "Test Lock" default cv 3 1018a "Test Lock" default cv 5 1018a "Test Lock" default cv 0 102c9 "Test Lock" default cv 0 1018a "Test Lock" default cv 3 1018a "Test Lock" default cv 5 1018a "Test Lock" default cv 0 1018a "Test Lock" default cv 3 1018a "Test Lock" default cv 5 1018a "Test Lock" default cv 0 1018a "Test Lock" default cv 3 1018a "Test Lock" default cv 5 1018a "Test Lock" default cv 0 1018a "Test Lock" default cv 3 1018a "Test Lock" default cv 5 1018adlm: rsb name "Test Lock" nodeid -1 flags 4 ref 0 "Test Lock" default cv 0 1018a "Test Lock" default cv 3 1018a "Test Lock" default cv 5 1018a "Test Lock" default cv 0 1018a "Test Lock" default cv 3 102c9 "Test Lock" default cv 3 1018a "Test Lock" default cv 5 102c9 "Test Lock" default cv 5 1018a "Test Lock" default un 1018a ref 1 flg 4 nodeid 0/-1 "Test Lock" default cv 0 102c9 "Test Lock" DLM: Assertion failed on line 661 of file /usr/src/cvs/cluster_orig/dlm-kernel/src/lockqueue.c DLM: assertion: 
"target_nodeid && target_nodeid != -1" DLM: time = 948606 dlm: lkb id 102c9 remid 0 flags 4000 status 3 rqmode 0 grmode 5 nodeid 0 lqstate 2 lqflags 44 dlm: rsb name "Test Lock" nodeid -1 flags 4 ref 0 target_nodeid 0 ------------[ cut here ]------------ kernel BUG at /usr/src/cvs/cluster_orig/dlm-kernel/src/lockqueue.c:661! invalid operand: 0000 [#1] SMP Modules linked in: dlm cman parport_pc lp parport autofs4 nfs lockd sunrpc e1000 3c59x floppy sg microcode dm_mod uhci_hcd button battery asus_acpi ac ipv6 ext3 jbd aic7xxx sd_mod scsi_mod CPU: 2 EIP: 0060:[] Not tainted EFLAGS: 00010246 (2.6.7-smp) EIP is at send_cluster_request+0x577/0x590 [dlm] eax: 00000001 ebx: f4442810 ecx: f40c5dec edx: 000059ab esi: 00000000 edi: 00000295 ebp: f8ad6b48 esp: f40c5de8 ds: 007b es: 007b ss: 0068 Process cp (pid: 3565, threadinfo=f40c4000 task=f6cca1b0) Stack: f8ad5994 00000000 f8ad6b48 f8ad6e54 000e797e f4442810 f7fcca00 f438f934 f4442810 f4442810 00000002 f438f934 f7fcca00 f8ac6510 f4442810 f7fcca80 f4442810 f7fcca80 f8ac58e4 f7fcca00 f8ad5871 00000000 000102c9 f438f9ad Call Trace: [] remote_stage+0x20/0x50 [dlm] [] convert_lock+0x1a4/0x1d0 [dlm] [] dlm_lock+0x347/0x350 [dlm] [] ast_routine+0x0/0x150 [dlm] [] bast_routine+0x0/0x20 [dlm] [] do_user_lock+0x123/0x220 [dlm] [] ast_routine+0x0/0x150 [dlm] [] bast_routine+0x0/0x20 [dlm] [] sigprocmask+0x59/0xe0 [] dlm_write+0xbb/0xe0 [dlm] [] vfs_write+0xd1/0x120 [] sys_write+0x38/0x60 [] sysenter_past_esp+0x52/0x71 Code: 0f 0b 95 02 48 6b ad f8 e9 09 fc ff ff e8 37 bd ff ff 89 c6 ------------[ cut here ]------------ kernel BUG at /usr/src/cvs/cluster_orig/dlm-kernel/src/rsb.c:64! invalid operand: 0000 [#2] SMP Modules linked in: dlm cman parport_pc lp parport autofs4 nfs lockd sunrpc e1000 3c59x floppy sg microcode dm_mod uhci_hcd button battery asus_acpi ac ipv6 ext3 jbd aic7xxx sd_mod scsi_mod CPU: 3 EIP: 0060:[] Not tainted EFLAGS: 00010246 (2.6.7-smp) EIP is at _release_rsb+0x29d/0x2b0 [dlm] eax: 00000001 ebx: f7fcca80 ecx: f6a93f40 edx: 000059af esi: f438f934 edi: f7fcca00 ebp: 00000000 esp: f6a93f3c ds: 007b es: 007b ss: 0068 Process dlm_astd (pid: 3404, threadinfo=f6a92000 task=f6af48c0) Stack: f8ad640e 00000040 f8ad8370 f8ad83ec 000e797c f7fcca80 f4442ec4 f7fcca00 00000005 f8ac1465 f8adfa80 dd2514c0 000f431c f6d1e640 c2031ce0 f438f934 f7b83f58 f8ac2590 f8ac25b0 f8adfa68 f6a92000 f6a93fb4 f6a93fc0 f8ac1f5a Call Trace: [] process_asts+0xe5/0x1b0 [dlm] [] bast_routine+0x0/0x20 [dlm] [] ast_routine+0x0/0x150 [dlm] [] dlm_astd+0x29a/0x2b0 [dlm] [] default_wake_function+0x0/0x10 [] ret_from_fork+0x6/0x14 [] default_wake_function+0x0/0x10 [] dlm_astd+0x0/0x2b0 [dlm] [] kernel_thread_helper+0x5/0x10 Code: 0f 0b 40 00 70 83 ad f8 e9 43 ff ff ff 8d b6 00 00 00 00 31 [root at lx4 -------------- next part -------------- A non-text attachment was scrubbed... Name: conv_play.c Type: application/octet-stream Size: 20488 bytes Desc: not available URL: From patrick.seinguerlet at e-asc.com Thu Aug 5 09:39:20 2004 From: patrick.seinguerlet at e-asc.com (SEINGUERLET Patrick) Date: Thu, 5 Aug 2004 11:39:20 +0200 Subject: [Linux-cluster] lock_dlm: init_fence error -1 In-Reply-To: <200408041307.42543.danderso@redhat.com> Message-ID: <001201c47ad0$1800d330$1d0224d5@porpat> Thanks for all, but I do a error when I configure my cluster.xml file. -----Message d'origine----- De : linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] De la part de Derek Anderson Envoy? : mercredi 4 ao?t 2004 20:08 ? 
: Discussion of clustering software components including GFS; Seinguerlet Patrick Objet : Re: [Linux-cluster] lock_dlm: init_fence error -1 Patrick, Please attach 'cat /proc/cluster/nodes' and 'cat /proc/cluster/services' from each of the nodes prior to the mount attempt. Also, any messages produced in /var/log/messages from the mount. On Tuesday 03 August 2004 13:00, Seinguerlet Patrick wrote: > When I would like to mount the GFS file system, this messages appear. > What can I do? > > mount -t gfs /dev/test_gfs/lv_test /mnt > lock_dlm: init_fence error -1 > GFS: can't mount proto = lock_dlm, table = test:partage1, hostdata = > mount: permission denied > > I use a debian and I use the documentation file for install. > > Patrick > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com http://www.redhat.com/mailman/listinfo/linux-cluster From danderso at redhat.com Thu Aug 5 13:41:41 2004 From: danderso at redhat.com (Derek Anderson) Date: Thu, 5 Aug 2004 08:41:41 -0500 Subject: [Linux-cluster] cman doesn't load building out of cvs outside of the kernel In-Reply-To: <20040804230236.GB10696@xi.wantstofly.org> References: <1502355489.20040804183028@intersystems.com> <20040804230236.GB10696@xi.wantstofly.org> Message-ID: <200408050841.41294.danderso@redhat.com> On Wednesday 04 August 2004 18:02, Lennert Buytenhek wrote: > On Wed, Aug 04, 2004 at 06:30:28PM -0400, Jeff wrote: > > [root at lx3 cman-kernel]# modprobe cman > > FATAL: Error inserting cman > > (/lib/modules/2.6.7-smp/kernel/cluster/cman.ko): Operation not permitted > > [root at lx3 cman-kernel]# dmesg > > > > CMAN (built Aug 4 2004 12:34:28) installed > > NET: Registered protocol family 31 > > Unable to register cluster socket type > > > > Any suggestions on what I need to do to resolve this? > > Remove the bluetooth modules that you already have loaded (there is > an AF_* identifier conflict still), then manually load cman, dlm, and > such. Yep. Or make sure that the cman module is loaded before ccsd is started. http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=127019 > > > --L > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From phillips at redhat.com Thu Aug 5 14:34:44 2004 From: phillips at redhat.com (Daniel Phillips) Date: Thu, 5 Aug 2004 10:34:44 -0400 Subject: [Linux-cluster] Another nag for summit presentation materials Message-ID: <200408051034.44608.phillips@redhat.com> If you are cc'd on this mail then there is still a cluster summit presentation that you made, for which I haven't received any presentation materials. Patrick and Lon have both sent me stuff (thanks) but not for every presentation. Alasdair, I haven't seen anything from you. Surely you must have something, somewhere, that you can send. 
Presentations still lacking slides or other supporting material: Patrick: * CMAN - Kernel cluster membership * DLM - Kernel distributed lock manager Lon: * Magma - User level Cluster and Lock manager transparent library interface Alasdair: * CLVM - Architecture and extensions of LVM2 Regards, Daniel From lhh at redhat.com Thu Aug 5 15:40:42 2004 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 05 Aug 2004 11:40:42 -0400 Subject: [Linux-cluster] Cluster Summit Pictures Message-ID: <1091720442.25665.2.camel@atlantis.boston.redhat.com> http://people.redhat.com/lhh/cs-pics/ I didn't include the 2048x1536 originals to save space and bandwidth. These will probably be migrated to sources.redhat.com, but for now, this should work. -- Lon From laza at yu.net Thu Aug 5 18:28:21 2004 From: laza at yu.net (Lazar Obradovic) Date: Thu, 05 Aug 2004 20:28:21 +0200 Subject: [Linux-cluster] bug in cman-kernel / membership.c Message-ID: <1091730500.15503.302.camel@laza.eunet.yu> Just got this when joining one node. CMAN: Waiting to join or form a Linux-cluster CMAN: sending membership request CMAN: got node new-noc Got ENDTRANS from a node not the master: master: 6, sender: 1 CMAN: node new-noc is not responding - removing from the cluster ------------[ cut here ]------------ kernel BUG at /usr/src/cvs/cluster/cman-kernel/src/membership.c:2892! invalid operand: 0000 [#1] PREEMPT SMP Modules linked in: ipv6 qla2300 qla2xxx ohci_hcd gfs lock_dlm lock_harness dlm cman CPU: 2 EIP: 0060:[] Tainted: GF EFLAGS: 00010246 (2.6.7-gentoo-r11) EIP is at elect_master+0x2a/0x41 [cman] eax: 00000080 ebx: 00000080 ecx: f88a4000 edx: 00000000 esi: f8870c08 edi: f8870c00 ebp: f7139fc0 esp: f7139f90 ds: 007b es: 007b ss: 0068 Process cman_memb (pid: 7327, threadinfo=f7138000 task=c22ed1e0) Stack: f7afdb28 f8859d34 f7139fa4 00000001 f8858883 f7bf7494 fffffffb 00000000 f7138000 0000001f 00000000 c0103fb6 00000000 c22ed1e0 c01176e2 00100100 00200200 00000000 00000000 00000000 f88584d8 00000000 00000000 00000000 Call Trace: [] a_node_just_died+0x130/0x181 [cman] [] membership_kthread+0x3ab/0x3e4 [cman] [] ret_from_fork+0x6/0x14 [] default_wake_function+0x0/0x12 [] membership_kthread+0x0/0x3e4 [cman] [] kernel_thread_helper+0x5/0xb Code: 0f 0b 4c 0b a0 51 86 f8 31 c0 5b c3 8b 44 24 08 89 10 8b 42 -- Lazar Obradovic, System Engineer ----- laza at YU.net YUnet International http://www.EUnet.yu Dubrovacka 35/III, 11000 Belgrade Tel: +381 11 3119901; Fax: +381 11 3119901 ----- This e-mail is confidential and intended only for the recipient. Unauthorized distribution, modification or disclosure of its contents is prohibited. If you have received this e-mail in error, please notify the sender by telephone +381 11 3119901. ----- From agk at redhat.com Thu Aug 5 19:13:37 2004 From: agk at redhat.com (Alasdair G Kergon) Date: Thu, 5 Aug 2004 20:13:37 +0100 Subject: [Linux-cluster] Cluster Summit Pictures In-Reply-To: <1091720442.25665.2.camel@atlantis.boston.redhat.com> References: <1091720442.25665.2.camel@atlantis.boston.redhat.com> Message-ID: <20040805191337.GF18235@agk.surrey.redhat.com> On Thu, Aug 05, 2004 at 11:40:42AM -0400, Lon Hohberger wrote: > http://people.redhat.com/lhh/cs-pics/ > These will probably be migrated to sources.redhat.com, but for now, this > should work. We can *link* to them from sources.redhat.com, but photos are best stored elsewhere e.g. in one of the many professional photo gallery websites such as fotango.com, fotopic.net to mention a couple of UK-based ones. 
Alasdair -- agk at redhat.com From phillips at redhat.com Thu Aug 5 19:37:26 2004 From: phillips at redhat.com (Daniel Phillips) Date: Thu, 5 Aug 2004 15:37:26 -0400 Subject: [Linux-cluster] Cluster Summit Pictures In-Reply-To: <20040805191337.GF18235@agk.surrey.redhat.com> References: <1091720442.25665.2.camel@atlantis.boston.redhat.com> <20040805191337.GF18235@agk.surrey.redhat.com> Message-ID: <200408051537.26614.phillips@redhat.com> On Thursday 05 August 2004 15:13, Alasdair G Kergon wrote: > On Thu, Aug 05, 2004 at 11:40:42AM -0400, Lon Hohberger wrote: > > http://people.redhat.com/lhh/cs-pics/ > > > > These will probably be migrated to sources.redhat.com, but for now, > > this should work. > > We can *link* to them from sources.redhat.com, but photos are best > stored elsewhere e.g. in one of the many professional photo gallery > websites such as fotango.com, fotopic.net to mention a couple of > UK-based ones. Where they will summarily disappear in the shifting sands of the internet. Regards, Daniel From laza at yu.net Thu Aug 5 20:02:52 2004 From: laza at yu.net (Lazar Obradovic) Date: Thu, 05 Aug 2004 22:02:52 +0200 Subject: [Linux-cluster] Multicast for GFS? In-Reply-To: <1091556279.30938.179.camel@laza.eunet.yu> References: <1091553471.16747.165.camel@laza.eunet.yu> <1091556279.30938.179.camel@laza.eunet.yu> Message-ID: <1091736172.19762.336.camel@laza.eunet.yu> It took some time... Attached is the patch for correcting cman_tool join into ipv4 mcast group. Note to ipv6 developers / users: you may also need to set mcast ttl to higher value via setsockopt() if you want to have cluster with network traversing L3 devices, since linux default ttl is set for local scope (ttl = 1), which would make first router drop the packet. -- Lazar Obradovic, System Engineer ----- laza at YU.net YUnet International http://www.EUnet.yu Dubrovacka 35/III, 11000 Belgrade Tel: +381 11 3119901; Fax: +381 11 3119901 ----- This e-mail is confidential and intended only for the recipient. Unauthorized distribution, modification or disclosure of its contents is prohibited. If you have received this e-mail in error, please notify the sender by telephone +381 11 3119901. ----- -------------- next part -------------- A non-text attachment was scrubbed... Name: cman-mcast.diff Type: text/x-patch Size: 1150 bytes Desc: not available URL: From phillips at redhat.com Thu Aug 5 20:23:27 2004 From: phillips at redhat.com (Daniel Phillips) Date: Thu, 5 Aug 2004 16:23:27 -0400 Subject: [Linux-cluster] Final nag for summit presentation materials Message-ID: <200408051623.27287.phillips@redhat.com> Hi all, This is the final nag for presentation materials. Patrick, I didn't receive anything for the CMAN or DLM presentations so I extracted portions of your overview and placed them on those talks. Therefore, some material is duplicated. Please send me more stuff if you have issues with this. That leaves only Alasdair with nothing on the site. Alasdair?? Otherwise, it's starting to look decent, though it does tend to send the message "we can't be bothered to post more detailed information". It also sends the message "at least there's something, we're trying". Speakers, please look over your own presentations and at least try all the links. I you think your presentation looks threadbare, just send more material. http://sources.redhat.com/cluster/events/summit2004/presentations.html Thanks to everybody who sent stuff. In general it is first-rate, even if terse. 
Regards, Daniel From mnerren at paracel.com Thu Aug 5 23:52:53 2004 From: mnerren at paracel.com (micah nerren) Date: Thu, 05 Aug 2004 16:52:53 -0700 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <20040804153338.GA10091@redhat.com> References: <1091458082.8356.23.camel@angmar> <20040802154605.GC1518@redhat.com> <1091480394.8356.58.camel@angmar> <20040803144019.GA4365@redhat.com> <1091581920.8356.257.camel@angmar> <20040804153338.GA10091@redhat.com> Message-ID: <1091749973.18842.70.camel@angmar> On Wed, 2004-08-04 at 08:33, Michael Conrad Tadpol Tilstra wrote: > On Tue, Aug 03, 2004 at 06:12:01PM -0700, micah nerren wrote: > > Hi, > > > > On Tue, 2004-08-03 at 07:40, Michael Conrad Tadpol Tilstra wrote: > > > On Mon, Aug 02, 2004 at 01:59:55PM -0700, micah nerren wrote: > > > [snip] > > > > I hope this helps!! > > > [snip] > > > > > > yeah, looks like a stack overflow. > > > here's a patch that I put in for 6.0. (patch works on 6.0.0-7) > > > > > > > I applied the patch to 6.0.0-7, rebuild the entire package, and I still > > get the crash when I mount. Below is the text of the crash. > > > > Any ideas? I double and triple checked that the patch was indeed applied > > to the code I was building and it was. > > well, it could still be a stack overflow, just some other function > pushing it over the edge. I'll look over things later. Mostly just > looking for things in the stack space of the functions listed in the > backtrace for things that can take out of the stack and put onto the > heap. (run sentence run!) FYI, I tried this with a few different HBA's, that didn't work. I thought perhaps it could be some funny interaction with the driver but that doesn't seem to be the case. If there is anything I can do to help, please let me know! Up to and including allowing access to the machines running the software if that will help you debug it. Thanks, Micah From fuscof at cli.di.unipi.it Fri Aug 6 08:45:27 2004 From: fuscof at cli.di.unipi.it (Francesco Fusco) Date: Fri, 6 Aug 2004 10:45:27 +0200 (CEST) Subject: [Linux-cluster] CLVM and redundant RAID levels Message-ID: Hi! I want to use some IDE servers to build a highly available cluster filesystem. I don't have a Fibre Channel/SCSI disk array, only inexpensive IDE disks. Can GFS be a good choice? Does it support redundant RAID levels between GNBD servers? Thanks -- Fusco Francesco From anton at hq.310.ru Fri Aug 6 08:52:50 2004 From: anton at hq.310.ru (Anton Nekhoroshikh) Date: Fri, 6 Aug 2004 12:52:50 +0400 Subject: [Linux-cluster] announce services through DLM Message-ID: <828806003.20040806125250@hq.310.ru> Hi all! Can I announce the services that are running on each node through the DLM? I need to determine which services are running on which nodes, so that requests are sent only to those nodes. -- e-mail: anton at hq.310.ru http://www.310.ru From teigland at redhat.com Fri Aug 6 12:54:29 2004 From: teigland at redhat.com (David Teigland) Date: Fri, 6 Aug 2004 20:54:29 +0800 Subject: [Linux-cluster] Strange behavior(s) of DLM In-Reply-To: <1909350721.20040804234145@intersystems.com> References: <1909350721.20040804234145@intersystems.com> Message-ID: <20040806125429.GG16109@redhat.com> On Wed, Aug 04, 2004 at 11:41:45PM -0400, Jeff wrote: > The attached routine demonstrates some strange > behavior in the DLM and it was responsible for the > dmesg text at the end of this note. > > This is on a FC2, SMP box running cvs/latest version of > cman and the dlm. Its a 2 CPU box configured with 4 logical > CPUs.
> > I have a two node cluster and the two machines are identical > as far as I can tell with the exception of which order they are > listed in the cluster config file. > > On node #1 (in the config file) when I run the attached test from > two terminals the output looks reasonable. The same as it does if > I run it on Tru64 or VMS (more or less). > > 8923: over last 10.000 seconds, grant 8922, blkast 0, cancel 0 > 18730: over last 9.001 seconds, grant 9807, blkast 0, cancel 0 > 28403: over last 9.001 seconds, grant 9673, blkast 0, cancel 0 > > If you shut this down and start it up on node #2 (lx4) you start > to get messages that look like: > 91280: over last 10.000 seconds, grant 91279, blkast 0, cancel 0 > 125138: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ > 125138: NL Blocking Notification on lockid 0x00010312 (mode 0) > 125138: NL Blocking Notification Rountine End ^^^^^^^^^^^^^^^^^^^^ > 141370: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ > 141371: NL Blocking Notification on lockid 0x00010312 (mode 0) > 141371: NL Blocking Notification Rountine End ^^^^^^^^^^^^^^^^^^^^ > 141373: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ You're running the program on two nodes at once right? The line with "*" is when I started the program on a second node, so it appears I get the same thing. I don't get any assertion failure, though. That may be the result of changes I've checked in for some other bugs over the past couple days. 57150: over last 10.000 seconds, grant 57149, blkast 0, cancel 0 116825: over last 9.001 seconds, grant 59675, blkast 0, cancel 0 * 123790: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ 123790: NL Blocking Notification on lockid 0x00010373 (mode 0) 123790: NL Blocking Notification Rountine End ^^^^^^^^^^^^^^^^^^^^ 123822: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ 123822: NL Blocking Notification on lockid 0x00010373 (mode 0) 123822: NL Blocking Notification Rountine End ^^^^^^^^^^^^^^^^^^^^ -- Dave Teigland From jeff at intersystems.com Fri Aug 6 13:35:39 2004 From: jeff at intersystems.com (Jeff) Date: Fri, 6 Aug 2004 09:35:39 -0400 Subject: [Linux-cluster] Strange behavior(s) of DLM In-Reply-To: <20040806125429.GG16109@redhat.com> References: <1909350721.20040804234145@intersystems.com> <20040806125429.GG16109@redhat.com> Message-ID: <1323209748.20040806093539@intersystems.com> Friday, August 6, 2004, 8:54:29 AM, David Teigland wrote: > On Wed, Aug 04, 2004 at 11:41:45PM -0400, Jeff wrote: >> The attached routine demonstrates some strange >> behavior in the DLM and it was responsible for the >> dmesg text at the end of this note. >> >> This is on a FC2, SMP box running cvs/latest version of >> cman and the dlm. Its a 2 CPU box configured with 4 logical >> CPUs. >> >> I have a two node cluster and the two machines are identical >> as far as I can tell with the exception of which order they are >> listed in the cluster config file. >> >> On node #1 (in the config file) when I run the attached test from >> two terminals the output looks reasonable. The same as it does if >> I run it on Tru64 or VMS (more or less). 
>> >> 8923: over last 10.000 seconds, grant 8922, blkast 0, cancel 0 >> 18730: over last 9.001 seconds, grant 9807, blkast 0, cancel 0 >> 28403: over last 9.001 seconds, grant 9673, blkast 0, cancel 0 >> >> If you shut this down and start it up on node #2 (lx4) you start >> to get messages that look like: >> 91280: over last 10.000 seconds, grant 91279, blkast 0, cancel 0 >> 125138: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ >> 125138: NL Blocking Notification on lockid 0x00010312 (mode 0) >> 125138: NL Blocking Notification Rountine End ^^^^^^^^^^^^^^^^^^^^ >> 141370: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ >> 141371: NL Blocking Notification on lockid 0x00010312 (mode 0) >> 141371: NL Blocking Notification Rountine End ^^^^^^^^^^^^^^^^^^^^ >> 141373: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ > You're running the program on two nodes at once right? The line with "*" > is when I started the program on a second node, so it appears I get the > same thing. I don't get any assertion failure, though. That may be the > result of changes I've checked in for some other bugs over the past couple > days. > 57150: over last 10.000 seconds, grant 57149, blkast 0, cancel 0 > 116825: over last 9.001 seconds, grant 59675, blkast 0, cancel 0 > * 123790: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ > 123790: NL Blocking Notification on lockid 0x00010373 (mode 0) > 123790: NL Blocking Notification Rountine End ^^^^^^^^^^^^^^^^^^^^ > 123822: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ > 123822: NL Blocking Notification on lockid 0x00010373 (mode 0) > 123822: NL Blocking Notification Rountine End ^^^^^^^^^^^^^^^^^^^^ I'm running the program from two processes on a single node. On the two nodes if I run the program from two processes on node #1, I don't get the above behavior. If I run it from two processes on node #2, I do (the 'NL Blocking'). When you run it from two nodes I suspect you only see the NL blocking on one of the nodes, never on the other one. I'll update the lock module with the recent changes and try to reproduce the assertion failure. The way I produce it is: Starting from both nodes rebooted... install the modules and have both nodes join the cluster. First node #1 then node #2. Run the program on node #1 and ctrl/c it to stop after a minute or so. Start the program on node #2 (one process) and let it run for 10-20 seconds (one or two status lines). Start another copy on node #2. This usually generates the NL messages. CTRL/C that copy and start it again. Maybe CTRL/C the other copy and start it again. At some point after CTRL/Cing and restarting, the program just hangs. At that point the process doesn't respond to CTRL/C any more and dmesg will show the various failure messages. From mtilstra at redhat.com Fri Aug 6 16:45:10 2004 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Fri, 6 Aug 2004 11:45:10 -0500 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <1091749973.18842.70.camel@angmar> References: <1091458082.8356.23.camel@angmar> <20040802154605.GC1518@redhat.com> <1091480394.8356.58.camel@angmar> <20040803144019.GA4365@redhat.com> <1091581920.8356.257.camel@angmar> <20040804153338.GA10091@redhat.com> <1091749973.18842.70.camel@angmar> Message-ID: <20040806164510.GA20479@redhat.com> On Thu, Aug 05, 2004 at 04:52:53PM -0700, micah nerren wrote: > > FYI, I tried this with a few different HBA's, that didn't work. 
I > thought perhaps it could be some funny interaction with the driver but > that doesn't seem to be the case. > > If there is anything I can do to help, please let me know! Up to and > including allowing access to the machines running the software if that > will help you debug it. well, at this point I'd try things without the hbas and without gulm. So first off, try mounting gfs using nolock instead of gulm on a single node. Then gets some space on a local drive to put gfs (without pool first) and use gulm to mount that. (kinda pointless other than just seeing if it does an oops.) If that works, put pool onto the local disk and try again. That should give us a good idea of what parts need to be involved to get the oops. -- Michael Conrad Tadpol Tilstra Sharpies don't just sniff themselves. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From lhh at redhat.com Fri Aug 6 19:01:09 2004 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 06 Aug 2004 15:01:09 -0400 Subject: [Linux-cluster] ccsd patch to allow retrieval of child type + CDATA Message-ID: <1091818869.23658.43.camel@atlantis.boston.redhat.com> This is so we have a way to figure out child types as well as the CDATA value. Ex: stuff Old behavior: [root at red lhh]# ccs_test connect Connect successful. Connection descriptor = 0 [root at red lhh]# ccs_test get 0 /cluster/nodes/child::*[1] Get successful. Value = New behavior: [root at red lhh]# ccs_test connect Connect successful. Connection descriptor = 0 [root at red lhh]# ccs_test get 0 /cluster/nodes/child::*[1] Get successful. Value = -------------- next part -------------- A non-text attachment was scrubbed... Name: ccsd-child.patch Type: text/x-patch Size: 900 bytes Desc: not available URL: From mnerren at paracel.com Fri Aug 6 22:03:39 2004 From: mnerren at paracel.com (micah nerren) Date: Fri, 06 Aug 2004 15:03:39 -0700 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <20040806164510.GA20479@redhat.com> References: <1091458082.8356.23.camel@angmar> <20040802154605.GC1518@redhat.com> <1091480394.8356.58.camel@angmar> <20040803144019.GA4365@redhat.com> <1091581920.8356.257.camel@angmar> <20040804153338.GA10091@redhat.com> <1091749973.18842.70.camel@angmar> <20040806164510.GA20479@redhat.com> Message-ID: <1091829819.22512.14.camel@angmar> On Fri, 2004-08-06 at 09:45, Michael Conrad Tadpol Tilstra wrote: > On Thu, Aug 05, 2004 at 04:52:53PM -0700, micah nerren wrote: > > > > FYI, I tried this with a few different HBA's, that didn't work. I > > thought perhaps it could be some funny interaction with the driver but > > that doesn't seem to be the case. > > > > If there is anything I can do to help, please let me know! Up to and > > including allowing access to the machines running the software if that > > will help you debug it. > > well, at this point I'd try things without the hbas and without gulm. > So first off, try mounting gfs using nolock instead of gulm on a single > node. > Then gets some space on a local drive to put gfs (without pool first) > and use gulm to mount that. (kinda pointless other than just seeing if > it does an oops.) > If that works, put pool onto the local disk and try again. > > That should give us a good idea of what parts need to be involved to get > the oops. Alrighty, I thought I'd give you the latest on our efforts along these lines. 
We are progressing down the paths you suggested, and wanted to post a few results before the weekend. We have used nolock instead of gulm, still on the pool device over the HBA, and received a crash. Attached are two traces of the crashes. We edited the code sprinkling printk's throughout to get some output. Using lock_nolock instead of lock_gulm still crashes, but slightly differently. See koops-nolock.txt The tracing printk()'s added to lock_gulm and gfs don't show much, but the crash is different yet again. See koops-gulm-traced.txt The tracing messages use -> for enter, <- for leave and ?? for "This function returns in far too many places to bother." Later this evening or monday, I will attempt building a local file system without a pool, then with a pool, to give you some more data. Thanks, Micah -------------- next part -------------- Lock_Harness v6.0.0 (built Aug 6 2004 20:27:11) installed Gulm v6.0.0 (built Aug 6 2004 20:27:09) installed Debugging printks added at paracel. GFS v6.0.0 (built Aug 6 2004 20:26:48) installed ->gfs_read_super(774e0000, 0, 0) ->gfs_mount_lockproto({proto="", table="", host=""}, 0) ->gulm_mount("hopkins:gfs02", "", a0128980, 1cf000, 32, 24f6b8) ->start_gulm_threads("hopkins", "") ->cm_login() ??lg_core_login(7768200, 1) ??xdr_enc_flush(776515c0) ??lg_core_handle_messages(7768200, a010ca00, 0) ??gulm_core_login_reply(0, 0, 0, -1, 3) ->lt_login() ??lg_lock_login(7768200, {71, 70, 83, 32}) Unable to handle kernel paging request at virtual address 0000000100000000 printing rip: ffffffff802b5dd2 PML4 775d3067 PGD 0 Oops: 0000 CPU 0 Pid: 4026, comm: mount Not tainted RIP: 0010:[]{memcpy+18} RSP: 0018:00000100775fb238 EFLAGS: 00010002 RAX: ffffffff805d3928 RBX: 00000100775fa760 RCX: 0000000000000001 RDX: 0000000000000080 RSI: 0000000100000000 RDI: ffffffff805d3928 RBP: 0000000000000000 R08: 00000000ffffffff R09: 00000100076bf840 R10: 0000002a95782200 R11: 0000000000000246 R12: 000001007bf46760 R13: 00000100775fa000 R14: 000001007bf46000 R15: ffffffff805d38c0 FS: 0000002a955764c0(0000) GS:ffffffff805d9840(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000100000000 CR3: 0000000000101000 CR4: 00000000000006e0 Call Trace: []{__switch_to+499} []{thread_r []{init_level4_pgt+0} []{schedule_ti []{do_softirq_thunk+53} []{inet_wait []{inet_stream_connect+339} []{:lock []{:lock_gulm:xdr_connect+28} []{:lo []{:lock_gulm:lt_login+63} []{:lock_ []{:lock_gulm:core_cb+0} []{:lock_gu []{:lock_gulm:lg_core_login+346} []{:lock_gulm:cm_login+136} []{:lock []{:lock_gulm:gulm_mount+665} []{:gf []{:lock_harness:lm_mount_Rsmp_ad6c5c21+355} []{:gfs:gfs_glock_cb+0} []{:gfs:gfs_ []{do_anonymous_page+1234} []{do_no_ []{do_page_fault+627} []{error_exit+ []{serial_in+41} []{wake_up_cpu+29} []{:gfs:gfs_read_super+1338} []{:gfs []{get_sb_bdev+588} []{:gfs:gfs_fs_t []{do_kern_mount+121} []{do_add_moun []{do_mount+345} []{__get_free_pages []{sys_mount+197} []{system_call+119 Process mount (pid: 4026, stackpage=100775fb000) Stack: 00000100775fb238 0000000000000018 0000000000000040 00000100775fa760 ffffffff8010ed13 0000000000000006 00000100775fa000 000001007bf47ed8 000001007bf46000 ffffffff805e02c0 0000000000000000 0000000000000079 ffffffff8011f8c2 00000100775fb328 000001007ad88000 0000000000000020 0000000000000006 00000100775ffb40 0000000000000000 ffffffff80101000 000001007b571000 0000000000000069 0000000000000000 ffffffff805e02c0 00000100775fa000 00000100775fa000 000001007ad88000 0000000000000010 00000100775260c0 7fffffffffffffff 0000010077526108 
00000100775fb3c8 0000000000000010 7fffffffffffffff ffffffff8012f9b5 00000100775fb3c8 ffffffff802b5915 0000000000000020 0000000000000006 000001000478929e Call Trace: []{__switch_to+499} []{thread_r []{init_level4_pgt+0} []{schedule_ti []{do_softirq_thunk+53} []{inet_wait []{inet_stream_connect+339} []{:lock []{:lock_gulm:xdr_connect+28} []{:lo []{:lock_gulm:lt_login+63} []{:lock_ []{:lock_gulm:core_cb+0} []{:lock_gu []{:lock_gulm:lg_core_login+346} []{:lock_gulm:cm_login+136} []{:lock []{:lock_gulm:gulm_mount+665} []{:gf []{:lock_harness:lm_mount_Rsmp_ad6c5c21+355} []{:gfs:gfs_glock_cb+0} []{:gfs:gfs_ []{do_anonymous_page+1234} []{do_no_ []{do_page_fault+627} []{error_exit+ []{serial_in+41} []{wake_up_cpu+29} []{:gfs:gfs_read_super+1338} []{:gfs []{get_sb_bdev+588} []{:gfs:gfs_fs_t []{do_kern_mount+121} []{do_add_moun []{do_mount+345} []{__get_free_pages []{sys_mount+197} []{system_call+119 Code: 4c 8b 1e 4c 8b 46 08 4c 89 1f 4c 89 47 08 4c 8b 4e 10 4c 8b Kernel panic: Fatal exception NMI Watchdog detected LOCKUP on CPU0, eip ffffffff8012162f, registers: CPU 0 Pid: 4026, comm: mount Not tainted RIP: 0010:[]{.text.lock.sched+131} RSP: 0018:ffffffff805de5c0 EFLAGS: 00000086 RAX: 0000000000000000 RBX: 00000100775fa000 RCX: 00000000000a6040 RDX: ffffffff8049d6a0 RSI: ffffffff8049d6b0 RDI: 0000000000000000 RBP: ffffffff805de5f0 R08: 0000000000000000 R09: ffffffff8049d6a0 R10: ffffffff8049d690 R11: 00000100775fad28 R12: ffffffff805e02c0 R13: 000000000000000b R14: 0000000000000000 R15: 00000000000033c5 FS: 0000002a955764c0(0000) GS:ffffffff805d9840(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000100000000 CR3: 0000000000101000 CR4: 00000000000006e0 Call Trace: []{smp_apic_timer_interrupt+291} []{apic_timer_interrupt+64} []{handl []{do_IRQ+274} []{common_interrupt+9 []{do_softirq+153} []{do_IRQ+339} []{common_interrupt+95} []{ip []{dev_queue_xmit+453} []{__make_req []{__make_request+1159} []{generic_m []{submit_bh_rsector+97} []{write_lo []{write_some_buffers+372} []{printk []{write_unlocked_buffers+23} []{syn []{fsync_dev+10} []{sys_sync+11} []{panic+286} []{show_trace+666} []{show_stack+205} []{show_registers []{die+268} []{do_page_fault+989} []{tcp_v4_rcv+1330} []{ip_local_deli []{ip_local_deliver_finish+244} []{n []{ip_local_deliver_finish+0} []{err []{memcpy+18} []{__switch_to+499} []{thread_return+0} []{init_level4_p []{schedule_timeout+37} []{do_softir []{inet_wait_for_connect+287} []{ine []{:lock_gulm:.rodata.str1.1+583} []{:lock_gulm:xdr_connect+28} []{:lo []{:lock_gulm:lt_login+63} []{:lock_ []{:lock_gulm:core_cb+0} []{:lock_gu []{:lock_gulm:lg_core_login+346} []{:lock_gulm:cm_login+136} []{:lock []{:lock_gulm:gulm_mount+665} []{:gf []{:lock_harness:lm_mount_Rsmp_ad6c5c21+355} []{:gfs:gfs_glock_cb+0} []{:gfs:gfs_ []{do_anonymous_page+1234} []{do_no_ []{do_page_fault+627} []{error_exit+ []{serial_in+41} []{wake_up_cpu+29} []{:gfs:gfs_read_super+1338} []{:gfs []{get_sb_bdev+588} []{:gfs:gfs_fs_t []{do_kern_mount+121} []{do_add_moun []{do_mount+345} []{__get_free_pages []{sys_mount+197} []{system_call+119 Process mount (pid: 4026, stackpage=100775fb000) Stack: ffffffff805de5c0 0000000000000018 0000000000100000 0000000000000000 00000100079c4c80 ffffffff803e89a0 0000000000000000 00000100000fdea0 ffffffff803e8d00 00000100079bf000 00000100079d6400 0000000000000042 00000100079de280 ffffff0000000000 000000fffffff000 0000000000000000 00000100079d7a80 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
0000000000000000 00000100775fbc28 0000000000000000 00000000006d9994 0000000000000003 0000000000000000 0000000000000000 0000000100000000 ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff Call Trace: []{smp_apic_timer_interrupt+291} []{apic_timer_interrupt+64} []{handl []{do_IRQ+274} []{common_interrupt+9 []{do_softirq+153} []{do_IRQ+339} []{common_interrupt+95} []{ip []{dev_queue_xmit+453} []{__make_req []{__make_request+1159} []{generic_m []{submit_bh_rsector+97} []{write_lo []{write_some_buffers+372} []{printk []{write_unlocked_buffers+23} []{syn []{fsync_dev+10} []{sys_sync+11} []{panic+286} []{show_trace+666} []{show_stack+205} []{show_registers []{die+268} []{do_page_fault+989} []{tcp_v4_rcv+1330} []{ip_local_deli []{ip_local_deliver_finish+244} []{n []{ip_local_deliver_finish+0} []{err []{memcpy+18} []{__switch_to+499} []{thread_return+0} []{init_level4_p []{schedule_timeout+37} []{do_softir []{inet_wait_for_connect+287} []{ine []{:lock_gulm:.rodata.str1.1+583} []{:lock_gulm:xdr_connect+28} []{:lo []{:lock_gulm:lt_login+63} []{:lock_ []{:lock_gulm:core_cb+0} []{:lock_gu []{:lock_gulm:lg_core_login+346} []{:lock_gulm:cm_login+136} []{:lock []{:lock_gulm:gulm_mount+665} []{:gf []{:lock_harness:lm_mount_Rsmp_ad6c5c21+355} []{:gfs:gfs_glock_cb+0} []{:gfs:gfs_ []{do_anonymous_page+1234} []{do_no_ []{do_page_fault+627} []{error_exit+ []{serial_in+41} []{wake_up_cpu+29} []{:gfs:gfs_read_super+1338} []{:gfs []{get_sb_bdev+588} []{:gfs:gfs_fs_t []{do_kern_mount+121} []{do_add_moun []{do_mount+345} []{__get_free_pages []{sys_mount+197} []{system_call+119 Code: f3 90 7e f7 e9 0b db ff ff 80 ba c0 02 5e 80 00 f3 90 7e f5 console shuts up ... -------------- next part -------------- Lock_Harness v6.0.0 (built Aug 6 2004 20:27:11) installed Lock_Nolock v6.0.0 (built Aug 6 2004 20:27:12) installed GFS v6.0.0 (built Aug 6 2004 20:26:48) installed ->gfs_read_super(78a4d000, 0, 0) ->gfs_mount_lockproto({proto="", table="", host=""}, 0) Gulm v6.0.0 (built Aug 5 2004 16:27:11) installed Unable to handle kernel NULL pointer dereference at virtual address 000000000000 printing rip: ffffffff8024a875 PML4 77ae7067 PGD 7798c067 PMD 0 Oops: 0002 CPU 0 Pid: 4027, comm: mount Not tainted RIP: 0010:[]{net_rx_action+213} RSP: 0018:0000010077605048 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ffffffff806077e8 RCX: ffffffff80607988 RDX: ffffffff806077e8 RSI: 0000010077b27080 RDI: ffffffff806077d0 RBP: ffffffff80607668 R08: 0000000080a56a9c R09: 0000000000a580a5 R10: 000000000100007f R11: 0000000000000000 R12: ffffffff806077e8 R13: ffffffff806077c0 R14: 00000000000046e6 R15: 0000000000000000 FS: 0000002a955764c0(0000) GS:ffffffff805d9840(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0 Call Trace: []{net_rx_action+173} []{do_softirq+174} []{ip_finish_outp []{dst_output+0} []{do_softirq_thunk []{.text.lock.netfilter+165} []{dst_ []{ip_queue_xmit+1019} []{ip_rcv_fin []{ip_rcv_finish+528} []{nf_hook_slo []{ip_rcv_finish+0} []{tcp_transmit_ []{tcp_write_xmit+198} []{tcp_sendms []{inet_sendmsg+69} []{sock_sendmsg+ []{:lock_gulm:do_tfer+369} []{:lock_ []{:lock_gulm:xdr_send+37} []{:lock_ []{:lock_gulm:lg_lock_login+301} []{:lock_gulm:lt_login+57} []{:lock_ []{:lock_gulm:core_cb+0} []{:lock_gu []{:lock_gulm:lg_core_login+323} []{:lock_gulm:cm_login+122} []{:lock []{:lock_gulm:gulm_mount+616} []{:gf []{release_task+763} 
[]{:lock_harnes []{:gfs:gfs_glock_cb+0} []{:gfs:gfs_ []{do_anonymous_page+1234} []{do_no_ []{do_page_fault+627} []{error_exit+ []{serial_in+41} []{wake_up_cpu+29} []{:gfs:gfs_read_super+1338} []{:gfs []{get_sb_bdev+588} []{:gfs:gfs_fs_t []{do_kern_mount+121} []{do_add_moun []{do_mount+345} []{__get_free_pages []{sys_mount+197} []{system_call+119 Process mount (pid: 4027, stackpage=10077605000) Stack: 0000010077605048 0000000000000018 ffffffff8024a84d 0000012a80445d20 0000000000000001 ffffffff80606c60 0000000000000000 000000000000000a 0000000000000000 0000000000000002 ffffffff8012a72e ffffffff80267cf0 0000000000000246 0000000000000000 0000000000000003 ffffffff80445d20 ffffffff80267cc0 0000000000000000 ffffffff802b5915 0000000000000043 0000000000000006 000001007a56a09e 0000010077a32d80 0000000000000000 0000000000000000 ffffffff8049c648 0000000000000000 ffffffff806077c0 ffffffff802533a7 ffffffff80267cc0 ffffffff80445d20 0000000000000002 0000010077a32d80 ffffffff805abcd0 000001007a56a0ac 0000010077a32d80 0000010077b27080 0000000000000000 0000010077b27080 0000010077a32de8 Call Trace: []{net_rx_action+173} []{do_softirq+174} []{ip_finish_outp []{dst_output+0} []{do_softirq_thunk []{.text.lock.netfilter+165} []{dst_ []{ip_queue_xmit+1019} []{ip_rcv_fin []{ip_rcv_finish+528} []{nf_hook_slo []{ip_rcv_finish+0} []{tcp_transmit_ []{tcp_write_xmit+198} []{tcp_sendms []{inet_sendmsg+69} []{sock_sendmsg+ []{:lock_gulm:do_tfer+369} []{:lock_ []{:lock_gulm:xdr_send+37} []{:lock_ []{:lock_gulm:lg_lock_login+301} []{:lock_gulm:lt_login+57} []{:lock_ []{:lock_gulm:core_cb+0} []{:lock_gu []{:lock_gulm:lg_core_login+323} []{:lock_gulm:cm_login+122} []{:lock []{:lock_gulm:gulm_mount+616} []{:gf []{release_task+763} []{:lock_harnes []{:gfs:gfs_glock_cb+0} []{:gfs:gfs_ []{do_anonymous_page+1234} []{do_no_ []{do_page_fault+627} []{error_exit+ []{serial_in+41} []{wake_up_cpu+29} []{:gfs:gfs_read_super+1338} []{:gfs []{get_sb_bdev+588} []{:gfs:gfs_fs_t []{do_kern_mount+121} []{do_add_moun []{do_mount+345} []{__get_free_pages []{sys_mount+197} []{system_call+119 Code: 48 89 18 48 89 43 08 8b 85 90 01 00 00 85 c0 79 08 03 85 94 Kernel panic: Fatal exception In interrupt handler - not syncing NMI Watchdog detected LOCKUP on CPU1, eip ffffffff801a5419, registers: CPU 1 Pid: 3534, comm: lock_gulmd Not tainted RIP: 0010:[]{.text.lock.fault+7} RSP: 0018:000001007adc1978 EFLAGS: 00000086 RAX: 000000000000000f RBX: ffffffff80607ae8 RCX: 0000000000000000 RDX: ffffffff803042e0 RSI: ffffffff803042e0 RDI: ffffffff8024a875 RBP: ffffffff80607968 R08: ffffffff803042d0 R09: 0000000000a580a5 R10: 000000000100007f R11: 0000000000000000 R12: 0000010007a0e9c0 R13: 0000000000000000 R14: 0000000000000002 R15: 000001007adc1a58 FS: 0000002a95576ce0(0000) GS:ffffffff805d98c0(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 00000000079d2000 CR4: 00000000000006e0 Call Trace: Process lock_gulmd (pid: 3534, stackpage=1007adc1000) Stack: 000001007adc1978 0000000000000018 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 Call Trace: Code: f3 90 7e f5 e9 c8 fd ff ff 90 90 90 90 90 90 90 90 90 90 90 console shuts up ... From mtilstra at redhat.com Fri Aug 6 22:35:57 2004 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Fri, 6 Aug 2004 17:35:57 -0500 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <1091829819.22512.14.camel@angmar> References: <1091458082.8356.23.camel@angmar> <20040802154605.GC1518@redhat.com> <1091480394.8356.58.camel@angmar> <20040803144019.GA4365@redhat.com> <1091581920.8356.257.camel@angmar> <20040804153338.GA10091@redhat.com> <1091749973.18842.70.camel@angmar> <20040806164510.GA20479@redhat.com> <1091829819.22512.14.camel@angmar> Message-ID: <20040806223557.GA21731@redhat.com> On Fri, Aug 06, 2004 at 03:03:39PM -0700, micah nerren wrote: > We have used nolock instead of gulm, still on the pool device over the > HBA, and received a crash. Attached are two traces of the crashes. We > edited the code sprinkling printk's throughout to get some output. > > Using lock_nolock instead of lock_gulm still crashes, but slightly > differently. See koops-nolock.txt er, you might want to double check this run, looking at the oops and loging, it looks like it is still trying to use gulm. the line: Gulm v6.0.0 (built Aug 5 2004 16:27:11) installed in the file: koops-nolock.txt lead me to believe this, along with the lock_gulm sysmbols in the oops. i could be imaging things too... -- Michael Conrad Tadpol Tilstra At night as I lay in bed looking at the stars I thought 'Where the hell is the ceiling?' -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From mnerren at paracel.com Fri Aug 6 22:37:35 2004 From: mnerren at paracel.com (micah nerren) Date: Fri, 06 Aug 2004 15:37:35 -0700 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <20040806223557.GA21731@redhat.com> References: <1091458082.8356.23.camel@angmar> <20040802154605.GC1518@redhat.com> <1091480394.8356.58.camel@angmar> <20040803144019.GA4365@redhat.com> <1091581920.8356.257.camel@angmar> <20040804153338.GA10091@redhat.com> <1091749973.18842.70.camel@angmar> <20040806164510.GA20479@redhat.com> <1091829819.22512.14.camel@angmar> <20040806223557.GA21731@redhat.com> Message-ID: <1091831854.22512.18.camel@angmar> On Fri, 2004-08-06 at 15:35, Michael Conrad Tadpol Tilstra wrote: > On Fri, Aug 06, 2004 at 03:03:39PM -0700, micah nerren wrote: > > We have used nolock instead of gulm, still on the pool device over the > > HBA, and received a crash. Attached are two traces of the crashes. We > > edited the code sprinkling printk's throughout to get some output. > > > > Using lock_nolock instead of lock_gulm still crashes, but slightly > > differently. See koops-nolock.txt > > er, you might want to double check this run, looking at the oops and > loging, it looks like it is still trying to use gulm. > > the line: > Gulm v6.0.0 (built Aug 5 2004 16:27:11) installed > in the file: koops-nolock.txt > lead me to believe this, along with the lock_gulm sysmbols in the oops. > > i could be imaging things too... Yeah you are right, I had caught that and am attempting a true nolock at the moment. 
From mnerren at paracel.com Fri Aug 6 23:31:55 2004 From: mnerren at paracel.com (micah nerren) Date: Fri, 06 Aug 2004 16:31:55 -0700 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <20040806223557.GA21731@redhat.com> References: <1091458082.8356.23.camel@angmar> <20040802154605.GC1518@redhat.com> <1091480394.8356.58.camel@angmar> <20040803144019.GA4365@redhat.com> <1091581920.8356.257.camel@angmar> <20040804153338.GA10091@redhat.com> <1091749973.18842.70.camel@angmar> <20040806164510.GA20479@redhat.com> <1091829819.22512.14.camel@angmar> <20040806223557.GA21731@redhat.com> Message-ID: <1091835114.22512.46.camel@angmar> On Fri, 2004-08-06 at 15:35, Michael Conrad Tadpol Tilstra wrote: > On Fri, Aug 06, 2004 at 03:03:39PM -0700, micah nerren wrote: > > We have used nolock instead of gulm, still on the pool device over the > > HBA, and received a crash. Attached are two traces of the crashes. We > > edited the code sprinkling printk's throughout to get some output. > > > > Using lock_nolock instead of lock_gulm still crashes, but slightly > > differently. See koops-nolock.txt > > er, you might want to double check this run, looking at the oops and > loging, it looks like it is still trying to use gulm. > > the line: > Gulm v6.0.0 (built Aug 5 2004 16:27:11) installed > in the file: koops-nolock.txt > lead me to believe this, along with the lock_gulm sysmbols in the oops. > > i could be imaging things too... Ok, using a disk attached via fibre channel to a single machine via an LSI hba I can create and mount a GFS file system using lock_nolock without a pool. A start! Console log: GFS: fsid=(8,2).0: Joined cluster. Now mounting FS... GFS: fsid=(8,2).0: jid=0: Trying to acquire journal lock... GFS: fsid=(8,2).0: jid=0: Looking at journal... GFS: fsid=(8,2).0: jid=0: Done I then tried to do lock_nolock on a pool device, and that worked as well: GFS: fsid=hopkins:gfs01.0: Joined cluster. Now mounting FS... GFS: fsid=hopkins:gfs01.0: jid=0: Trying to acquire journal lock... GFS: fsid=hopkins:gfs01.0: jid=0: Looking at journal... GFS: fsid=hopkins:gfs01.0: jid=0: Done So it appears to be specifically related to lock_gulm. Anything else I should try? I really appreciate all your help in debugging this! Thanks, Micah From phillips at redhat.com Sat Aug 7 02:43:03 2004 From: phillips at redhat.com (Daniel Phillips) Date: Fri, 6 Aug 2004 22:43:03 -0400 Subject: [Linux-cluster] CLVM and redundant RAID levels In-Reply-To: References: Message-ID: <200408062243.03666.phillips@redhat.com> On Friday 06 August 2004 04:45, Francesco Fusco wrote: > Hi! > I want to use some ide servers to have an high available cluster > filesystem. > I don't have Fibre Channel/Scsi disk array, but only inexpensive > ide disks. > > Can GFS be a good choice? > Does it support redoundant raid levels between GNBD servers? It's in the pipeline. Regards, Daniel From pcaulfie at redhat.com Mon Aug 9 07:36:34 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Mon, 9 Aug 2004 08:36:34 +0100 Subject: [Linux-cluster] bug in cman-kernel / membership.c In-Reply-To: <1091730500.15503.302.camel@laza.eunet.yu> References: <1091730500.15503.302.camel@laza.eunet.yu> Message-ID: <20040809073634.GB8035@tykepenguin.com> On Thu, Aug 05, 2004 at 08:28:21PM +0200, Lazar Obradovic wrote: > Just got this when joining one node. 
> > > CMAN: Waiting to join or form a Linux-cluster > CMAN: sending membership request > CMAN: got node new-noc > Got ENDTRANS from a node not the master: master: 6, sender: 1 > CMAN: node new-noc is not responding - removing from the cluster > ------------[ cut here ]------------ > kernel BUG at /usr/src/cvs/cluster/cman-kernel/src/membership.c:2892! Just checking: Is this fixed by your cman_tool patch or shall I put it into bugzilla ? patrick From pcaulfie at redhat.com Mon Aug 9 07:50:40 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Mon, 9 Aug 2004 08:50:40 +0100 Subject: [Linux-cluster] Multicast for GFS? In-Reply-To: <1091736172.19762.336.camel@laza.eunet.yu> References: <1091553471.16747.165.camel@laza.eunet.yu> <1091556279.30938.179.camel@laza.eunet.yu> <1091736172.19762.336.camel@laza.eunet.yu> Message-ID: <20040809075039.GA9240@tykepenguin.com> On Thu, Aug 05, 2004 at 10:02:52PM +0200, Lazar Obradovic wrote: > It took some time... > > Attached is the patch for correcting cman_tool join into ipv4 mcast > group. > > Note to ipv6 developers / users: you may also need to set mcast ttl to > higher value via setsockopt() if you want to have cluster with network > traversing L3 devices, since linux default ttl is set for local scope > (ttl = 1), which would make first router drop the packet. > > --- cluster/cman/cman_tool/join.c 2004-07-23 09:48:16.000000000 +0200 > +++ new-cluster/cman/cman_tool/join.c 2004-08-06 05:59:20.353829392 +0200 > @@ -118,13 +118,22 @@ > die("Cannot bind multicast address: %s", strerror(errno)); > > /* Join the multicast group */ > - if (!bcast) { > + if (bhe) { > struct ip_mreq mreq; > + u_char mcast_opt; > > memcpy(&mreq.imr_multiaddr, bhe->h_addr, bhe->h_length); > - memcpy(&mreq.imr_interface, he->h_addr, he->h_length); > + mreq.imr_interface.s_addr = htonl(INADDR_ANY); Can you explain why this should be INADDR_ANY rather than the local IP address? You also mentioned in another email that the "cman_tool leave" should issue a setsockopt to leave the multicast group, does this not happen automatically when the socket is closed? If it isn't then cman_tool leave can do this I suppose. In the case where the cluster software exits without the help of cman_tool it will be fenced anyway so there shoudn't be a problem :-) -- patrick From laza at yu.net Mon Aug 9 09:26:45 2004 From: laza at yu.net (Lazar Obradovic) Date: Mon, 09 Aug 2004 11:26:45 +0200 Subject: [Linux-cluster] bug in cman-kernel / membership.c In-Reply-To: <20040809073634.GB8035@tykepenguin.com> References: <1091730500.15503.302.camel@laza.eunet.yu> <20040809073634.GB8035@tykepenguin.com> Message-ID: <1092043604.8966.0.camel@laza.eunet.yu> it's not fixed... it appeared on clean cvs node... On Mon, 2004-08-09 at 09:36, Patrick Caulfield wrote: > On Thu, Aug 05, 2004 at 08:28:21PM +0200, Lazar Obradovic wrote: > > Just got this when joining one node. > > > > > > CMAN: Waiting to join or form a Linux-cluster > > CMAN: sending membership request > > CMAN: got node new-noc > > Got ENDTRANS from a node not the master: master: 6, sender: 1 > > CMAN: node new-noc is not responding - removing from the cluster > > ------------[ cut here ]------------ > > kernel BUG at /usr/src/cvs/cluster/cman-kernel/src/membership.c:2892! > > Just checking: Is this fixed by your cman_tool patch or shall I put it into > bugzilla ? 
> > patrick > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- Lazar Obradovic, System Engineer ----- laza at YU.net YUnet International http://www.EUnet.yu Dubrovacka 35/III, 11000 Belgrade Tel: +381 11 3119901; Fax: +381 11 3119901 ----- This e-mail is confidential and intended only for the recipient. Unauthorized distribution, modification or disclosure of its contents is prohibited. If you have received this e-mail in error, please notify the sender by telephone +381 11 3119901. -----
From pcaulfie at redhat.com Mon Aug 9 09:38:23 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Mon, 9 Aug 2004 10:38:23 +0100 Subject: [Linux-cluster] cman_tool interface change Message-ID: <20040809093822.GC11723@tykepenguin.com> I've changed the way cman_tool starts the cluster, so if you upgrade CVS you'll need to make sure that kernel & userspace match. What I've done is to remove all the setsockopt() calls and replace them with ioctls. Sorry for the inconvenience here, but those things really are /not/ socket options, and if this code got submitted to the kernel team I'd get a roasting! -- patrick
From laza at yu.net Mon Aug 9 11:32:23 2004 From: laza at yu.net (Lazar Obradovic) Date: Mon, 09 Aug 2004 13:32:23 +0200 Subject: [Linux-cluster] Multicast for GFS? In-Reply-To: <20040809075039.GA9240@tykepenguin.com> References: <1091553471.16747.165.camel@laza.eunet.yu> <1091556279.30938.179.camel@laza.eunet.yu> <1091736172.19762.336.camel@laza.eunet.yu> <20040809075039.GA9240@tykepenguin.com> Message-ID: <1092051143.1114.130.camel@laza.eunet.yu> > Can you explain why this should be INADDR_ANY rather than the local IP address? I knew I forgot something... I've been having some trouble with memcpy(&mreq.imr_interface, he->h_addr, he->h_length); since it contains my mcast addr instead of the real host addr. I'm debugging it now to see where the error happens... Temporarily, I've changed it to htonl(INADDR_ANY)... seems like "temporarily" is a bit of a stretch for me :) > You also mentioned in another email that the "cman_tool leave" should issue > a setsockopt to leave the multicast group, does this not happen automatically > when the socket is closed? Actually no. When you close a socket that was a member of a mcast group, the node does not send an "IGMP leave" message to the router, so mcast packets continue to arrive until the router runs its "membership refresh" procedure (which, depending on configuration, is every 30 seconds). If the node does not confirm its membership (and it won't, since the kernel can't find any socket for the multicast group), the mcast path gets pruned at the router, but it stays valid for another 2-3 minutes (also depending on configuration). > If it isn't then cman_tool leave can do this I > suppose. In the case where the cluster software exits without the help of > cman_tool it will be fenced anyway so there shoudn't be a problem :-) That's true for fenced nodes, but it's not a clean solution, so, if it isn't too much trouble, I'd really like to have the membership drop implemented before the socket gets closed. -- Lazar Obradovic, System Engineer ----- laza at YU.net YUnet International http://www.EUnet.yu Dubrovacka 35/III, 11000 Belgrade Tel: +381 11 3119901; Fax: +381 11 3119901 ----- This e-mail is confidential and intended only for the recipient. Unauthorized distribution, modification or disclosure of its contents is prohibited.
When you close the socket that was a member of Mcast Group, > node does not sent "IGMP leave" message to router, so mcast packets > continue to arrive until router issues "membership refresh" procedure > (which, depending on configuration, is at every 30 seconds). > > If node does not confirm it's membership (and it won't since kernel > can't find any socket for the multicast group), mcast path gets pruned > at a router, but stays valid for another 2-3 minutes (also depending on > configuration). > > > If it isn't then cman_tool leave can do this I > > suppose. In the case where the cluster software exits without the help of > > cman_tool it will be fenced anyway so there shoudn't be a problem :-) > > That's true for fenced nodes, but it's not a clean solution, so, if it > isn't much of a trouble, I'd really like to have membership drop > implemented before socket gets closed. > OK, thanks for clearing that up. It seems I have some thinking to do... patrick From mtilstra at redhat.com Mon Aug 9 15:12:08 2004 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Mon, 9 Aug 2004 10:12:08 -0500 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <1091835114.22512.46.camel@angmar> References: <20040802154605.GC1518@redhat.com> <1091480394.8356.58.camel@angmar> <20040803144019.GA4365@redhat.com> <1091581920.8356.257.camel@angmar> <20040804153338.GA10091@redhat.com> <1091749973.18842.70.camel@angmar> <20040806164510.GA20479@redhat.com> <1091829819.22512.14.camel@angmar> <20040806223557.GA21731@redhat.com> <1091835114.22512.46.camel@angmar> Message-ID: <20040809151208.GA2189@redhat.com> On Fri, Aug 06, 2004 at 04:31:55PM -0700, micah nerren wrote: > So it appears to be specifically related to lock_gulm. hrms, so no pushing this off onto someone else. oh well. ;) > Anything else I should try? well, it still pretty much looks like a stack overflow. And looking at the calling tree, there is not much left to take out of the stacks. So I guess we'll have to try making the stack shorter. So, another patch. This still works on my intels, give it a go and lets see how it does on your opterons. > I really appreciate all your help in debugging this! np. -- Michael Conrad Tadpol Tilstra To be, or not to be, those are the parameters. -------------- next part -------------- Index: gulm_core.c =================================================================== RCS file: /cvs/GFS/locking/lock_gulm/kernel/gulm_core.c,v retrieving revision 1.1.2.14 diff -u -b -B -r1.1.2.14 gulm_core.c --- gulm_core.c 25 May 2004 20:11:23 -0000 1.1.2.14 +++ gulm_core.c 9 Aug 2004 15:11:19 -0000 @@ -51,13 +51,6 @@ } gulm_cm.GenerationID = gen; - error = lt_login (); - if (error != 0) { - log_err ("lt_login failed. %d\n", error); - lg_core_logout (gulm_cm.hookup); /* XXX is this safe? */ - return error; - } - log_msg (lgm_Network2, "Logged into local core.\n"); return 0; Index: gulm_fs.c =================================================================== RCS file: /cvs/GFS/locking/lock_gulm/kernel/gulm_fs.c,v retrieving revision 1.1.2.17 diff -u -b -B -r1.1.2.17 gulm_fs.c --- gulm_fs.c 2 Aug 2004 16:12:39 -0000 1.1.2.17 +++ gulm_fs.c 9 Aug 2004 15:11:19 -0000 @@ -287,9 +287,11 @@ goto fail; } - /* lt_login() is called after the success packet for cm_login() - * returns. - */ + error = lt_login(); + if (error != 0) { + log_err ("lt_login failed. %d\n", error); + goto fail; + } } fail: up (&start_stop_lock); -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From jeff at intersystems.com Mon Aug 9 15:53:51 2004 From: jeff at intersystems.com (Jeff) Date: Mon, 9 Aug 2004 11:53:51 -0400 Subject: [Linux-cluster] Strange behavior(s) of DLM In-Reply-To: <20040806125429.GG16109@redhat.com> References: <1909350721.20040804234145@intersystems.com> <20040806125429.GG16109@redhat.com> Message-ID: <602087288.20040809115351@intersystems.com> Friday, August 6, 2004, 8:54:29 AM, David Teigland wrote: > On Wed, Aug 04, 2004 at 11:41:45PM -0400, Jeff wrote: >> The attached routine demonstrates some strange >> behavior in the DLM and it was responsible for the >> dmesg text at the end of this note. >> >> This is on a FC2, SMP box running cvs/latest version of >> cman and the dlm. Its a 2 CPU box configured with 4 logical >> CPUs. >> >> I have a two node cluster and the two machines are identical >> as far as I can tell with the exception of which order they are >> listed in the cluster config file. >> >> On node #1 (in the config file) when I run the attached test from >> two terminals the output looks reasonable. The same as it does if >> I run it on Tru64 or VMS (more or less). >> >> 8923: over last 10.000 seconds, grant 8922, blkast 0, cancel 0 >> 18730: over last 9.001 seconds, grant 9807, blkast 0, cancel 0 >> 28403: over last 9.001 seconds, grant 9673, blkast 0, cancel 0 >> >> If you shut this down and start it up on node #2 (lx4) you start >> to get messages that look like: >> 91280: over last 10.000 seconds, grant 91279, blkast 0, cancel 0 >> 125138: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ >> 125138: NL Blocking Notification on lockid 0x00010312 (mode 0) >> 125138: NL Blocking Notification Rountine End ^^^^^^^^^^^^^^^^^^^^ >> 141370: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ >> 141371: NL Blocking Notification on lockid 0x00010312 (mode 0) >> 141371: NL Blocking Notification Rountine End ^^^^^^^^^^^^^^^^^^^^ >> 141373: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ > You're running the program on two nodes at once right? The line with "*" > is when I started the program on a second node, so it appears I get the > same thing. I don't get any assertion failure, though. That may be the > result of changes I've checked in for some other bugs over the past couple > days. > 57150: over last 10.000 seconds, grant 57149, blkast 0, cancel 0 > 116825: over last 9.001 seconds, grant 59675, blkast 0, cancel 0 > * 123790: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ > 123790: NL Blocking Notification on lockid 0x00010373 (mode 0) > 123790: NL Blocking Notification Rountine End ^^^^^^^^^^^^^^^^^^^^ > 123822: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ > 123822: NL Blocking Notification on lockid 0x00010373 (mode 0) > 123822: NL Blocking Notification Rountine End ^^^^^^^^^^^^^^^^^^^^ I updated my sources this morning and I get neither the NL Blocking routine start messages nor the assertion failures. In the past I was able to get this quite easily so I suspect you have resolved them. From laza at yu.net Mon Aug 9 15:58:41 2004 From: laza at yu.net (Lazar Obradovic) Date: Mon, 09 Aug 2004 17:58:41 +0200 Subject: [Linux-cluster] Multicast for GFS? 
In-Reply-To: <20040809133438.GI11723@tykepenguin.com> References: <1091553471.16747.165.camel@laza.eunet.yu> <1091556279.30938.179.camel@laza.eunet.yu> <1091736172.19762.336.camel@laza.eunet.yu> <20040809075039.GA9240@tykepenguin.com> <1092051143.1114.130.camel@laza.eunet.yu> <20040809133438.GI11723@tykepenguin.com> Message-ID: <1092067121.23273.235.camel@laza.eunet.yu> On Mon, 2004-08-09 at 15:34, Patrick Caulfield wrote: > OK, thanks for clearing that up. It seems I have some thinking to do... holiday? egypt? :) ok, attached is the file with two new functions: my_gethostbyname2() and my_freehe(). First should be used everywhere instead of gethostbyname(), and the later should be called to free up all the used memory. I deliberately didn't send a patch, since I haven't CO'ed yet, so patch would fail with new version. I'm counting on you for incorporating it into new tree. :) With this, you can also keep the memcpy(mreq.imr_interaface, ...) since it's working now :) cheers -- Lazar Obradovic, System Engineer ----- laza at YU.net YUnet International http://www.EUnet.yu Dubrovacka 35/III, 11000 Belgrade Tel: +381 11 3119901; Fax: +381 11 3119901 ----- This e-mail is confidential and intended only for the recipient. Unauthorized distribution, modification or disclosure of its contents is prohibited. If you have received this e-mail in error, please notify the sender by telephone +381 11 3119901. ----- -------------- next part -------------- A non-text attachment was scrubbed... Name: mygethostbyname.c Type: text/x-csrc Size: 1247 bytes Desc: not available URL: From pcaulfie at redhat.com Mon Aug 9 16:11:14 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Mon, 9 Aug 2004 17:11:14 +0100 Subject: [Linux-cluster] Multicast for GFS? In-Reply-To: <1092067121.23273.235.camel@laza.eunet.yu> References: <1091553471.16747.165.camel@laza.eunet.yu> <1091556279.30938.179.camel@laza.eunet.yu> <1091736172.19762.336.camel@laza.eunet.yu> <20040809075039.GA9240@tykepenguin.com> <1092051143.1114.130.camel@laza.eunet.yu> <20040809133438.GI11723@tykepenguin.com> <1092067121.23273.235.camel@laza.eunet.yu> Message-ID: <20040809161114.GM11723@tykepenguin.com> On Mon, Aug 09, 2004 at 05:58:41PM +0200, Lazar Obradovic wrote: > On Mon, 2004-08-09 at 15:34, Patrick Caulfield wrote: > > OK, thanks for clearing that up. It seems I have some thinking to do... > > holiday? egypt? :) Soon, soon...(and not too far from you either, Dubrovnik!) > ok, attached is the file with two new functions: my_gethostbyname2() and > my_freehe(). First should be used everywhere instead of gethostbyname(), > and the later should be called to free up all the used memory. > > I deliberately didn't send a patch, since I haven't CO'ed yet, so patch > would fail with new version. I'm counting on you for incorporating it > into new tree. :) > > With this, you can also keep the memcpy(mreq.imr_interaface, ...) since Thanks very much - I'll merge that tomorrow -- patrick From lhh at redhat.com Mon Aug 9 20:16:29 2004 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 09 Aug 2004 16:16:29 -0400 Subject: [Linux-cluster] Kernel oops Message-ID: <1092082589.20439.2.camel@atlantis.boston.redhat.com> While doing a bunch of 'while [ 0 ]; relocate_resource_group foo; done' simultaneously, I triggered this in the DLM: I haven't updated since last week; will do so and attempt to reproduce. This is just a heads-up. 
-- Lon DLM: Assertion failed on line 328 of file cluster/dlm/lockqueue.c DLM: assertion: "rsb->res_nodeid == -1 || rsb->res_nodeid == 0" DLM: time = 2154223 dlm: lkb id 200ca remid 0 flags 0 status 0 rqmode 5 grmode -1 nodeid 4294967295 lqstate 0 lqflags 0 dlm: rsb name "usrm::vf" nodeid 1 ref 2 dlm: reply rh_cmd 5 rh_lkid 200ca lockstate 0 nodeid 1 status 0 lkid c02bf515 ------------[ cut here ]------------ kernel BUG at cluster/dlm/lockqueue.c:328! invalid operand: 0000 [#1] PREEMPT SMP Modules linked in: dlm cman ipv6 CPU: 0 EIP: 0060:[] Not tainted EFLAGS: 00010286 (2.6.7cman20040804) EIP is at process_lockqueue_reply+0x5e6/0x720 [dlm] eax: 00000001 ebx: 00000001 ecx: c039df74 edx: 00000282 esi: c9bbb04c edi: c9bbc708 ebp: caf55e24 esp: caf55dfc ds: 007b es: 007b ss: 0068 Process dlm_recvd (pid: 2109, threadinfo=caf54000 task=cad33360) Stack: d09ad257 00000148 d09ad23f d09ae8a0 0020deef c1398200 000200ca c9bbc708 c1398200 caf55ee0 caf55eac d099de86 c9bbc708 caf55ee0 00000001 c03b94c0 caf55f88 caf55e90 caf55e74 c030a4d4 caf55e90 00000000 00000000 00000fc4 Call Trace: [] show_stack+0x7f/0xa0 [] show_registers+0x15e/0x1c0 [] die+0xa2/0x120 [] do_invalid_op+0xb5/0xc0 [] error_code+0x2d/0x38 [] process_cluster_request+0x746/0xde0 [dlm] [] midcomms_process_incoming_buffer+0x167/0x250 [dlm] [] receive_from_sock+0x189/0x360 [dlm] [] process_sockets+0xd8/0x110 [dlm] [] dlm_recvd+0xad/0x110 [dlm] [] kernel_thread_helper+0x5/0x10 Code: 0f 0b 48 01 3f d2 9a d0 e9 0d 01 00 00 e8 e8 f0 ff ff e8 33 From danderso at redhat.com Mon Aug 9 20:53:01 2004 From: danderso at redhat.com (Derek Anderson) Date: Mon, 9 Aug 2004 15:53:01 -0500 Subject: [Linux-cluster] Kernel oops In-Reply-To: <1092082589.20439.2.camel@atlantis.boston.redhat.com> References: <1092082589.20439.2.camel@atlantis.boston.redhat.com> Message-ID: <200408091553.01433.danderso@redhat.com> Lon, May be the same thing as this bug: http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=128679 On Monday 09 August 2004 15:16, Lon Hohberger wrote: > While doing a bunch of 'while [ 0 ]; relocate_resource_group foo; done' > simultaneously, I triggered this in the DLM: > > I haven't updated since last week; will do so and attempt to reproduce. > This is just a heads-up. > > -- Lon > > > DLM: Assertion failed on line 328 of file cluster/dlm/lockqueue.c > DLM: assertion: "rsb->res_nodeid == -1 || rsb->res_nodeid == 0" > DLM: time = 2154223 > dlm: lkb > id 200ca > remid 0 > flags 0 > status 0 > rqmode 5 > grmode -1 > nodeid 4294967295 > lqstate 0 > lqflags 0 > dlm: rsb > name "usrm::vf" > nodeid 1 > ref 2 > dlm: reply > rh_cmd 5 > rh_lkid 200ca > lockstate 0 > nodeid 1 > status 0 > lkid c02bf515 > > ------------[ cut here ]------------ > kernel BUG at cluster/dlm/lockqueue.c:328! 
> invalid operand: 0000 [#1] > PREEMPT SMP > Modules linked in: dlm cman ipv6 > CPU: 0 > EIP: 0060:[] Not tainted > EFLAGS: 00010286 (2.6.7cman20040804) > EIP is at process_lockqueue_reply+0x5e6/0x720 [dlm] > eax: 00000001 ebx: 00000001 ecx: c039df74 edx: 00000282 > esi: c9bbb04c edi: c9bbc708 ebp: caf55e24 esp: caf55dfc > ds: 007b es: 007b ss: 0068 > Process dlm_recvd (pid: 2109, threadinfo=caf54000 task=cad33360) > Stack: d09ad257 00000148 d09ad23f d09ae8a0 0020deef c1398200 000200ca > c9bbc708 > c1398200 caf55ee0 caf55eac d099de86 c9bbc708 caf55ee0 00000001 > c03b94c0 > caf55f88 caf55e90 caf55e74 c030a4d4 caf55e90 00000000 00000000 > 00000fc4 > Call Trace: > [] show_stack+0x7f/0xa0 > [] show_registers+0x15e/0x1c0 > [] die+0xa2/0x120 > [] do_invalid_op+0xb5/0xc0 > [] error_code+0x2d/0x38 > [] process_cluster_request+0x746/0xde0 [dlm] > [] midcomms_process_incoming_buffer+0x167/0x250 [dlm] > [] receive_from_sock+0x189/0x360 [dlm] > [] process_sockets+0xd8/0x110 [dlm] > [] dlm_recvd+0xad/0x110 [dlm] > [] kernel_thread_helper+0x5/0x10 > > Code: 0f 0b 48 01 3f d2 9a d0 e9 0d 01 00 00 e8 e8 f0 ff ff e8 33 > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From mnerren at paracel.com Mon Aug 9 20:57:00 2004 From: mnerren at paracel.com (micah nerren) Date: Mon, 09 Aug 2004 13:57:00 -0700 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <20040809151208.GA2189@redhat.com> References: <20040802154605.GC1518@redhat.com> <1091480394.8356.58.camel@angmar> <20040803144019.GA4365@redhat.com> <1091581920.8356.257.camel@angmar> <20040804153338.GA10091@redhat.com> <1091749973.18842.70.camel@angmar> <20040806164510.GA20479@redhat.com> <1091829819.22512.14.camel@angmar> <20040806223557.GA21731@redhat.com> <1091835114.22512.46.camel@angmar> <20040809151208.GA2189@redhat.com> Message-ID: <1092085020.14561.3.camel@angmar> On Mon, 2004-08-09 at 08:12, Michael Conrad Tadpol Tilstra wrote: > On Fri, Aug 06, 2004 at 04:31:55PM -0700, micah nerren wrote: > > So it appears to be specifically related to lock_gulm. > > hrms, so no pushing this off onto someone else. oh well. ;) > > > > Anything else I should try? > well, it still pretty much looks like a stack overflow. And looking at > the calling tree, there is not much left to take out of the stacks. So > I guess we'll have to try making the stack shorter. > > So, another patch. This still works on my intels, give it a go and > lets see how it does on your opterons. > > > I really appreciate all your help in debugging this! > np. > I tried the patch, it still crashes with the same oops. However, I tried something I hadn't tried before which may shed some light on this. I rebooted the system into UP mode, loaded the UP modules, and did the mount of the file system. This time, no oops. It still doesn't work, but the machine lives. The mount process simply hangs. When I go to another terminal and kill the mount process, this appears in the syslog: lock_gulm: ERROR cm_login failed. -512 lock_gulm: ERROR Got a -512 trying to start the threads. lock_gulm: fsid=hopkins:gfs01: Exiting gulm_mount with errors -512 GFS: can't mount proto = lock_gulm, table = hopkins:gfs01, hostdata = So, does that shed some light onto things? Something specific to SMP and lock_gulm. It still doesn't work in UP mode, but it does not oops. 
From john.l.villalovos at intel.com Mon Aug 9 22:46:38 2004 From: john.l.villalovos at intel.com (Villalovos, John L) Date: Mon, 9 Aug 2004 15:46:38 -0700 Subject: [Linux-cluster] GNBD spec file? Message-ID: <60C14C611F1DDD4198D53F2F43D8CA3B0194B025@orsmsx410> Is there a spec file available for GNBD? How about RPMS? John From pcaulfie at redhat.com Tue Aug 10 09:29:01 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 10 Aug 2004 10:29:01 +0100 Subject: [Linux-cluster] Multicast for GFS? In-Reply-To: <1092067121.23273.235.camel@laza.eunet.yu> References: <1091553471.16747.165.camel@laza.eunet.yu> <1091556279.30938.179.camel@laza.eunet.yu> <1091736172.19762.336.camel@laza.eunet.yu> <20040809075039.GA9240@tykepenguin.com> <1092051143.1114.130.camel@laza.eunet.yu> <20040809133438.GI11723@tykepenguin.com> <1092067121.23273.235.camel@laza.eunet.yu> Message-ID: <20040810092900.GB13291@tykepenguin.com> On Mon, Aug 09, 2004 at 05:58:41PM +0200, Lazar Obradovic wrote: > > With this, you can also keep the memcpy(mreq.imr_interaface, ...) since > it's working now :) > OK, I've had to do this slightly differently, using gethostname2_r, it has the same effect. It behaves itself on my cluster but let me know if I've missed anything. -- patrick From laza at yu.net Tue Aug 10 12:11:50 2004 From: laza at yu.net (Lazar Obradovic) Date: Tue, 10 Aug 2004 14:11:50 +0200 Subject: [Linux-cluster] Multicast for GFS? In-Reply-To: <20040810092900.GB13291@tykepenguin.com> References: <1091553471.16747.165.camel@laza.eunet.yu> <1091556279.30938.179.camel@laza.eunet.yu> <1091736172.19762.336.camel@laza.eunet.yu> <20040809075039.GA9240@tykepenguin.com> <1092051143.1114.130.camel@laza.eunet.yu> <20040809133438.GI11723@tykepenguin.com> <1092067121.23273.235.camel@laza.eunet.yu> <20040810092900.GB13291@tykepenguin.com> Message-ID: <1092139910.32187.1098.camel@laza.eunet.yu> On Tue, 2004-08-10 at 11:29, Patrick Caulfield wrote: > OK, I've had to do this slightly differently, using gethostname2_r, it has > the same effect. It behaves itself on my cluster but let me know if I've missed > anything. It's ok, just that I wanted cleaner solution (cleaner = w/o additional vars). As you might notice, TTL is fixed to a value of 10, and it might be interesting to take this out of the code, and place it somewhere in cluster.conf. Do you think this is ok? how about something like: #define MCAST_TTL_PATH "//cluster/cman/multicast/@ttl" Let me know if this is generaly a good idea, I'll work on details if you do agree. I also started to change ccs a bit for mcast support. It turns out that ccs has a lot of definitions hardcoded. Can I take 'em out and put into separate header file (comm_header.h looks nice :)? -- Lazar Obradovic, System Engineer ----- laza at YU.net YUnet International http://www.EUnet.yu Dubrovacka 35/III, 11000 Belgrade Tel: +381 11 3119901; Fax: +381 11 3119901 ----- This e-mail is confidential and intended only for the recipient. Unauthorized distribution, modification or disclosure of its contents is prohibited. If you have received this e-mail in error, please notify the sender by telephone +381 11 3119901. ----- From pcaulfie at redhat.com Tue Aug 10 12:20:43 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 10 Aug 2004 13:20:43 +0100 Subject: [Linux-cluster] Multicast for GFS? 
In-Reply-To: <1092139910.32187.1098.camel@laza.eunet.yu> References: <1091553471.16747.165.camel@laza.eunet.yu> <1091556279.30938.179.camel@laza.eunet.yu> <1091736172.19762.336.camel@laza.eunet.yu> <20040809075039.GA9240@tykepenguin.com> <1092051143.1114.130.camel@laza.eunet.yu> <20040809133438.GI11723@tykepenguin.com> <1092067121.23273.235.camel@laza.eunet.yu> <20040810092900.GB13291@tykepenguin.com> <1092139910.32187.1098.camel@laza.eunet.yu> Message-ID: <20040810122043.GE13291@tykepenguin.com> On Tue, Aug 10, 2004 at 02:11:50PM +0200, Lazar Obradovic wrote: > On Tue, 2004-08-10 at 11:29, Patrick Caulfield wrote: > > OK, I've had to do this slightly differently, using gethostname2_r, it has > > the same effect. It behaves itself on my cluster but let me know if I've missed > > anything. > > It's ok, just that I wanted cleaner solution (cleaner = w/o additional > vars). > > As you might notice, TTL is fixed to a value of 10, and it might be > interesting to take this out of the code, and place it somewhere in > cluster.conf. Do you think this is ok? > > how about something like: > > #define MCAST_TTL_PATH "//cluster/cman/multicast/@ttl" > > Let me know if this is generaly a good idea, I'll work on details if you > do agree. Certainly. I'm all in favour of moving hard-coded values into configuration files - so long as the defaults are reasonable ! > I also started to change ccs a bit for mcast support. It turns out that > ccs has a lot of definitions hardcoded. Can I take 'em out and put into > separate header file (comm_header.h looks nice :)? I think ccs_join.h would be reasonable, then it's obvious which .c file it holds the defaults for. -- patrick From john.l.villalovos at intel.com Tue Aug 10 15:19:48 2004 From: john.l.villalovos at intel.com (Villalovos, John L) Date: Tue, 10 Aug 2004 08:19:48 -0700 Subject: [Linux-cluster] Where to get RPMS for GFS components Message-ID: <60C14C611F1DDD4198D53F2F43D8CA3B019A78AA@orsmsx410> Where can I find the RPMS for the various GFS components? I didn't see it on the website: http://sources.redhat.com/cluster/ Hopefully I'm not blind. I did check out from CVS the Cluster code: cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout cluster And I didn't find any SPEC files. Thanks, John From patrick.seinguerlet at e-asc.com Tue Aug 10 15:26:35 2004 From: patrick.seinguerlet at e-asc.com (SEINGUERLET Patrick) Date: Tue, 10 Aug 2004 17:26:35 +0200 Subject: [Linux-cluster] Where to get RPMS for GFS components In-Reply-To: <60C14C611F1DDD4198D53F2F43D8CA3B019A78AA@orsmsx410> Message-ID: <000001c47eee$70f001b0$8000a8c0@porpat> You can use http://sources.redhat.com/cluster/releases/cvs_snapshots/ to have a snapshot of cvs files. And when you compile files, you have got GFS components. For more information see doc/usage.txt Good luck. Patrick. -----Message d'origine----- De : linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] De la part de Villalovos, John L Envoy? : mardi 10 ao?t 2004 17:20 ? : Discussion of clustering software components including GFS Objet : [Linux-cluster] Where to get RPMS for GFS components Where can I find the RPMS for the various GFS components? I didn't see it on the website: http://sources.redhat.com/cluster/ Hopefully I'm not blind. I did check out from CVS the Cluster code: cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout cluster And I didn't find any SPEC files. 
Thanks, John -- Linux-cluster mailing list Linux-cluster at redhat.com http://www.redhat.com/mailman/listinfo/linux-cluster From mtilstra at redhat.com Tue Aug 10 15:56:47 2004 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Tue, 10 Aug 2004 10:56:47 -0500 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <1092085020.14561.3.camel@angmar> References: <20040803144019.GA4365@redhat.com> <1091581920.8356.257.camel@angmar> <20040804153338.GA10091@redhat.com> <1091749973.18842.70.camel@angmar> <20040806164510.GA20479@redhat.com> <1091829819.22512.14.camel@angmar> <20040806223557.GA21731@redhat.com> <1091835114.22512.46.camel@angmar> <20040809151208.GA2189@redhat.com> <1092085020.14561.3.camel@angmar> Message-ID: <20040810155647.GA10149@redhat.com> On Mon, Aug 09, 2004 at 01:57:00PM -0700, micah nerren wrote: > I tried the patch, it still crashes with the same oops. evil butterscotch. > However, I tried something I hadn't tried before which may shed some > light on this. I rebooted the system into UP mode, loaded the UP > modules, and did the mount of the file system. This time, no oops. It > still doesn't work, but the machine lives. The mount process simply > hangs. When I go to another terminal and kill the mount process, this > appears in the syslog: > > lock_gulm: ERROR cm_login failed. -512 > lock_gulm: ERROR Got a -512 trying to start the threads. > lock_gulm: fsid=hopkins:gfs01: Exiting gulm_mount with errors -512 > GFS: can't mount proto = lock_gulm, table = hopkins:gfs01, hostdata = yeah, the -512s are just how signal interrupts get moved in the kernel space. > So, does that shed some light onto things? Something specific to SMP and > lock_gulm. It still doesn't work in UP mode, but it does not oops. um. possibly means I've been looking in the wrong place for the solution. I'll dig in some more, but if anyone else reading this has ideas, please share. -- Michael Conrad Tadpol Tilstra It's never too late to have a happy childhood. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From tru at pasteur.fr Tue Aug 10 16:26:55 2004 From: tru at pasteur.fr (Tru Huynh) Date: Tue, 10 Aug 2004 18:26:55 +0200 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <20040810155647.GA10149@redhat.com>; from mtilstra@redhat.com on Tue, Aug 10, 2004 at 10:56:47AM -0500 References: <1091581920.8356.257.camel@angmar> <20040804153338.GA10091@redhat.com> <1091749973.18842.70.camel@angmar> <20040806164510.GA20479@redhat.com> <1091829819.22512.14.camel@angmar> <20040806223557.GA21731@redhat.com> <1091835114.22512.46.camel@angmar> <20040809151208.GA2189@redhat.com> <1092085020.14561.3.camel@angmar> <20040810155647.GA10149@redhat.com> Message-ID: <20040810182655.A22653@xiii.bis.pasteur.fr> On Tue, Aug 10, 2004 at 10:56:47AM -0500, Michael Conrad Tadpol Tilstra wrote: ... > > um. possibly means I've been looking in the wrong place for the > solution. I'll dig in some more, but if anyone else reading this has > ideas, please share. newbie (not kernel hacker) idea (taken from the XFS mailing list) On Mon, Aug 09, 2004 at 08:52:40AM -0500, Eric Sandeen wrote: > what does your ffs() look like? I added this patch to our kernels, but > it may not be in Dan's kernels (hm, I need to update the 1.3.3 packages > we have on oss...) 
> > --- linux/include/asm-x86_64/bitops.h.orig 2004-07-26 > 12:33:54.000000000 -0500 > +++ linux/include/asm-x86_64/bitops.h 2004-07-26 12:35:23.000000000 -0500 > @@ -473,7 +473,7 @@ static __inline__ int ffs(int x) > > __asm__("bsfl %1,%0\n\t" > "cmovzl %2,%0" > - : "=r" (r) : "g" (x), "r" (32)); > + : "=r" (r) : "rm" (x), "r" (-1)); > return r+1; > } just .02 cents (no flame please) Tru From laza at yu.net Wed Aug 11 00:04:51 2004 From: laza at yu.net (Lazar Obradovic) Date: Wed, 11 Aug 2004 02:04:51 +0200 Subject: [Linux-cluster] SNMP modules? In-Reply-To: <1091403655.6495.17.camel@laza.eunet.yu> References: <1090861715.13809.3.camel@laza.eunet.yu> <1091383159.32177.14.camel@laza.eunet.yu> <1091403655.6495.17.camel@laza.eunet.yu> Message-ID: <1092182691.26185.58.camel@laza.eunet.yu> is this ok? will it be a part of cvs tree or it needs additional work? On Mon, 2004-08-02 at 01:40, Lazar Obradovic wrote: > both things in one patch... > > On Sun, 2004-08-01 at 19:59, Lazar Obradovic wrote: > > ok, here's the patch for ibm blade fencing agent... > > qlogic sanbox2, comming up next :) > > > > On Mon, 2004-07-26 at 19:08, Lazar Obradovic wrote: > > > Hello all, > > > > > > I'd like to develop my own fencing agents (for IBM BladeCenter and > > > QLogic SANBox2 switches), but they will require SNMP bindings. > > > > > > Is that ok with general development philosophy, since I'd like to > > > contribude them? net-snmp-5.x.x-based API? -- Lazar Obradovic, System Engineer ----- laza at YU.net YUnet International http://www.EUnet.yu Dubrovacka 35/III, 11000 Belgrade Tel: +381 11 3119901; Fax: +381 11 3119901 ----- This e-mail is confidential and intended only for the recipient. Unauthorized distribution, modification or disclosure of its contents is prohibited. If you have received this e-mail in error, please notify the sender by telephone +381 11 3119901. ----- From rbrown at metservice.com Wed Aug 11 03:14:44 2004 From: rbrown at metservice.com (Royce Brown) Date: Wed, 11 Aug 2004 15:14:44 +1200 Subject: [Linux-cluster] Clumembd heartbeat problem. Message-ID: <200408111514542.SM01912@rbrown> I am not sure if this is the correct place to post this. If not and you know where I should, could you please tell me. (This effects RedHat ES 3.0 clumanager software) I have found a problem with the clumembd daemon where the heartbeat message is rejected by other nodes causing the node to be powered off. If you have a Ethernet interface with an alias and are using multicast the source address may contain the main IP address or the alias address. If it contains the alias address the message is then rejected by all other nodes as it now contains the wrong IP address. The software correctly creates a socket on the main interface and at first the correct IP address is send. Some time later on the same socket the alias address seems to get into the packets. I have extract the relevant parts from my log file showing the output from the debugging lines I inserted into the code. 
Computer has Interfaces:
bond0 addr 10.10.197.11
bond0:0 addr 10.10.197.6

Multicast set up:
clumembd[2]: add_interface fd:4 name:bond0
clumembd[2]: Interface IP is 10.10.197.11
clumembd[2]: Setting up multicast 225.0.0.11 on 10.10.197.11
clumembd[2]: Multicast send fd:5 (10.10.197.11)
clumembd[2]: Multicast receive fd:6

Sending and receiving message (correct behaviour):
clumembd[2]: sending multicast message fd:5 ,nodeid:1 ,addr:225.0.0.11,token:0x0002881d4119638e
clumembd[2]: update_seen new msg nodeid:1 token:0x0002881d4119638e

After a while you get the following (sinp = source address, nsp = expected address):
clumembd[2]: sending multicast message fd:5 ,nodeid:1 ,addr:225.0.0.11,token:0x0002881d4119638e
clumembd[2]: update_seen new msg nodeid:1 token:0x0002881d4119638e
clumembd[2]: IP/NodeID mismatch: Probably another cluster on our subnet... msg from nodeid:1 sinp:10.10.197.6 nsp:10.10.197.11

The source address now carries the bond0:0 address where it previously had bond0's address. The socket has not changed. This looks to me like a bug in the sending routine (it uses sendto() from the standard library). Has anyone else noticed this sort of behaviour when sending multicast messages to an Ethernet device with multiple addresses?

Cheers Royce

From lhh at redhat.com Wed Aug 11 13:08:16 2004 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 11 Aug 2004 09:08:16 -0400 Subject: [Linux-cluster] Clumembd heartbeat problem. In-Reply-To: <200408111514542.SM01912@rbrown> References: <200408111514542.SM01912@rbrown> Message-ID: <1092229696.20439.37.camel@atlantis.boston.redhat.com> On Wed, 2004-08-11 at 15:14 +1200, Royce Brown wrote: > I am not sure if this is the correct place to post this. If not > and you know where I should, could you please tell me. Better = taroon-list Best = bugzilla! > Sending and receiving message (Correct behaviour) > clumembd[2]: sending multicast message fd:5 ,nodeid:1 > ,addr:225.0.0.11,token:0x0002881d4119638e > clumembd[2]: update_seen new msg nodeid:1 token:0x0002881d4119638e > > After a while you get. sinp = source address, nsp = expected address > > clumembd[2]: sending multicast message fd:5 ,nodeid:1 > ,addr:225.0.0.11,token:0x0002881d4119638e > clumembd[2]: update_seen new msg nodeid:1 token:0x0002881d4119638e > clumembd[2]: IP/NodeID mismatch: Probably another cluster on our > subnet... msg from nodeid:1 sinp:10.10.197.6 nsp:10.10.197.11 Hmm... this doesn't make a lot of sense; clumembd doesn't rebind anything. It's possible to work around it. > Has anyone else noticed this sort of behaviour on sending multicast messages > to an Ethernet device with multiple addresses. It's news; probably worthy of a bugzilla. -- Lon From laza at yu.net Wed Aug 11 13:22:52 2004 From: laza at yu.net (Lazar Obradovic) Date: Wed, 11 Aug 2004 15:22:52 +0200 Subject: [Linux-cluster] Clumembd heartbeat problem. In-Reply-To: <1092229696.20439.37.camel@atlantis.boston.redhat.com> References: <200408111514542.SM01912@rbrown> <1092229696.20439.37.camel@atlantis.boston.redhat.com> Message-ID: <1092230572.19386.69.camel@laza.eunet.yu> On Wed, 2004-08-11 at 15:08, Lon Hohberger wrote: > > Has anyone else noticed this sort of behaviour on sending multicast messages > > to an Ethernet device with multiple addresses. btw, you shouldn't use 224.0.0.0/24 since it's assigned to various mcast-related things (all hosts, all routers, routing protocols and the like). Check http://www.iana.org/assignments/multicast-addresses for the complete list of assignments.
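To make the two suggestions in this thread concrete, here is a small user-space sketch; it is not clumembd or cman source, and the group address, port and TTL are invented example values. It binds the sending socket to the primary interface address so an alias such as bond0:0 cannot become the source address, pins the outgoing interface with IP_MULTICAST_IF, sets the TTL explicitly, and uses an administratively scoped group in 239.0.0.0/8 rather than anything in the reserved 224.0.0.0/24 block.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in local, group;
        struct in_addr ifaddr;
        unsigned char ttl = 10;          /* could be read from cluster.conf */
        const char *msg = "heartbeat";

        if (fd < 0) {
                perror("socket");
                return 1;
        }

        /* primary address of bond0 in the example above */
        inet_aton("10.10.197.11", &ifaddr);

        /* Binding to the primary address fixes the source address, so an
         * alias (bond0:0) can no longer leak into outgoing packets. */
        memset(&local, 0, sizeof(local));
        local.sin_family = AF_INET;
        local.sin_addr = ifaddr;
        if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0)
                perror("bind");

        /* Send multicast out of that interface, with an explicit TTL. */
        setsockopt(fd, IPPROTO_IP, IP_MULTICAST_IF, &ifaddr, sizeof(ifaddr));
        setsockopt(fd, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, sizeof(ttl));

        memset(&group, 0, sizeof(group));
        group.sin_family = AF_INET;
        group.sin_port = htons(50007);               /* example port */
        inet_aton("239.0.0.11", &group.sin_addr);    /* example group, 239/8 */

        if (sendto(fd, msg, strlen(msg), 0,
                   (struct sockaddr *)&group, sizeof(group)) < 0)
                perror("sendto");

        close(fd);
        return 0;
}

Whether binding like this is the right fix inside clumembd is of course a separate question; the sketch is only meant to show where the alias address can sneak in and how to pin it down.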
-- Lazar Obradovic, System Engineer ----- laza at YU.net YUnet International http://www.EUnet.yu Dubrovacka 35/III, 11000 Belgrade Tel: +381 11 3119901; Fax: +381 11 3119901 ----- This e-mail is confidential and intended only for the recipient. Unauthorized distribution, modification or disclosure of its contents is prohibited. If you have received this e-mail in error, please notify the sender by telephone +381 11 3119901. ----- From amanthei at redhat.com Wed Aug 11 13:58:42 2004 From: amanthei at redhat.com (Adam Manthei) Date: Wed, 11 Aug 2004 08:58:42 -0500 Subject: [Linux-cluster] SNMP modules? In-Reply-To: <1092182691.26185.58.camel@laza.eunet.yu> References: <1090861715.13809.3.camel@laza.eunet.yu> <1091383159.32177.14.camel@laza.eunet.yu> <1091403655.6495.17.camel@laza.eunet.yu> <1092182691.26185.58.camel@laza.eunet.yu> Message-ID: <20040811135842.GD26705@redhat.com> On Wed, Aug 11, 2004 at 02:04:51AM +0200, Lazar Obradovic wrote: > is this ok? > will it be a part of cvs tree or it needs additional work? Thanks for the reminder. I'll take a look at them and let you know. > On Mon, 2004-08-02 at 01:40, Lazar Obradovic wrote: > > both things in one patch... > > > > On Sun, 2004-08-01 at 19:59, Lazar Obradovic wrote: > > > ok, here's the patch for ibm blade fencing agent... > > > qlogic sanbox2, comming up next :) > > > > > > On Mon, 2004-07-26 at 19:08, Lazar Obradovic wrote: > > > > Hello all, > > > > > > > > I'd like to develop my own fencing agents (for IBM BladeCenter and > > > > QLogic SANBox2 switches), but they will require SNMP bindings. > > > > > > > > Is that ok with general development philosophy, since I'd like to > > > > contribude them? net-snmp-5.x.x-based API? > -- > Lazar Obradovic, System Engineer > ----- > laza at YU.net > YUnet International http://www.EUnet.yu > Dubrovacka 35/III, 11000 Belgrade > Tel: +381 11 3119901; Fax: +381 11 3119901 > ----- > This e-mail is confidential and intended only for the recipient. > Unauthorized distribution, modification or disclosure of its > contents is prohibited. If you have received this e-mail in error, > please notify the sender by telephone +381 11 3119901. > ----- > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- Adam Manthei From anton at hq.310.ru Wed Aug 11 15:19:30 2004 From: anton at hq.310.ru (=?Windows-1251?B?wO3y7u0gzeX17vDu+Oj1?=) Date: Wed, 11 Aug 2004 19:19:30 +0400 Subject: [Linux-cluster] make problem Message-ID: <19610389547.20040811191930@hq.310.ru> Hi all, after "cvs up -r HEAD -Pd ." for cluster folder i have problem with make # make cd cman-kernel && make all make[1]: Entering directory `/usr/src/cluster/cman-kernel' cd src && make all make[2]: Entering directory `/usr/src/cluster/cman-kernel/src' rm -f cluster ln -s . 
cluster make -C /usr/src/linux-2.6 M=/usr/src/cluster/cman-kernel/src modules USING_KBUILD=yes make[3]: Entering directory `/usr/src/linux-2.6.7' CC [M] /usr/src/cluster/cman-kernel/src/cnxman.o /usr/src/cluster/cman-kernel/src/cnxman.c: In function `do_ioctl_pass_socket': /usr/src/cluster/cman-kernel/src/cnxman.c:1504: error: storage size of `sock_info' isn't known /usr/src/cluster/cman-kernel/src/cnxman.c:1504: warning: unused variable `sock_info' /usr/src/cluster/cman-kernel/src/cnxman.c: In function `cl_ioctl': /usr/src/cluster/cman-kernel/src/cnxman.c:1740: error: `SIOCCLUSTER_PASS_SOCKET' undeclared (first u se in this function) /usr/src/cluster/cman-kernel/src/cnxman.c:1740: error: (Each undeclared identifier is reported only once /usr/src/cluster/cman-kernel/src/cnxman.c:1740: error: for each function it appears in.) /usr/src/cluster/cman-kernel/src/cnxman.c:1747: error: `SIOCCLUSTER_SET_NODENAME' undeclared (first use in this function) /usr/src/cluster/cman-kernel/src/cnxman.c:1754: error: `SIOCCLUSTER_SET_NODEID' undeclared (first us e in this function) /usr/src/cluster/cman-kernel/src/cnxman.c:1761: error: `SIOCCLUSTER_JOIN_CLUSTER' undeclared (first use in this function) /usr/src/cluster/cman-kernel/src/cnxman.c:1768: error: `SIOCCLUSTER_LEAVE_CLUSTER' undeclared (first use in this function) make[4]: *** [/usr/src/cluster/cman-kernel/src/cnxman.o] Error 1 make[3]: *** [_module_/usr/src/cluster/cman-kernel/src] Error 2 make[3]: Leaving directory `/usr/src/linux-2.6.7' make[2]: *** [all] Error 2 make[2]: Leaving directory `/usr/src/cluster/cman-kernel/src' make[1]: *** [all] Error 2 make[1]: Leaving directory `/usr/src/cluster/cman-kernel' make: *** [all] Error 2 # uname -a Linux c5.310.ru 2.6.7 #17 SMP Wed Jul 21 19:34:27 MSD 2004 i686 i686 i386 GNU/Linux -- e-mail: anton at hq.310.ru http://www.310.ru From pcaulfie at redhat.com Wed Aug 11 15:28:58 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Wed, 11 Aug 2004 16:28:58 +0100 Subject: [Linux-cluster] make problem In-Reply-To: <19610389547.20040811191930@hq.310.ru> References: <19610389547.20040811191930@hq.310.ru> Message-ID: <20040811152858.GH24727@tykepenguin.com> On Wed, Aug 11, 2004 at 07:19:30PM +0400, ????? ????????? wrote: > Hi all, > > after "cvs up -r HEAD -Pd ." for cluster folder > i have problem with make > It's including an old version of cnxman-socket.h. Check you have the updated one in /usr/include/cluster. -- patrick From patrick.seinguerlet at e-asc.com Wed Aug 11 15:25:06 2004 From: patrick.seinguerlet at e-asc.com (Seinguerlet Patrick) Date: Wed, 11 Aug 2004 17:25:06 +0200 Subject: [Linux-cluster] make problem References: <19610389547.20040811191930@hq.310.ru> Message-ID: <000b01c47fb7$6635a170$0100a8c0@amdk6> see instruction in doc/usage.txt for installation. Patrick ----- Original Message ----- From: "????? ?????????" To: Sent: Wednesday, August 11, 2004 5:19 PM Subject: [Linux-cluster] make problem Hi all, after "cvs up -r HEAD -Pd ." for cluster folder i have problem with make # make cd cman-kernel && make all make[1]: Entering directory `/usr/src/cluster/cman-kernel' cd src && make all make[2]: Entering directory `/usr/src/cluster/cman-kernel/src' rm -f cluster ln -s . 
cluster make -C /usr/src/linux-2.6 M=/usr/src/cluster/cman-kernel/src modules USING_KBUILD=yes make[3]: Entering directory `/usr/src/linux-2.6.7' CC [M] /usr/src/cluster/cman-kernel/src/cnxman.o /usr/src/cluster/cman-kernel/src/cnxman.c: In function `do_ioctl_pass_socket': /usr/src/cluster/cman-kernel/src/cnxman.c:1504: error: storage size of `sock_info' isn't known /usr/src/cluster/cman-kernel/src/cnxman.c:1504: warning: unused variable `sock_info' /usr/src/cluster/cman-kernel/src/cnxman.c: In function `cl_ioctl': /usr/src/cluster/cman-kernel/src/cnxman.c:1740: error: `SIOCCLUSTER_PASS_SOCKET' undeclared (first u se in this function) /usr/src/cluster/cman-kernel/src/cnxman.c:1740: error: (Each undeclared identifier is reported only once /usr/src/cluster/cman-kernel/src/cnxman.c:1740: error: for each function it appears in.) /usr/src/cluster/cman-kernel/src/cnxman.c:1747: error: `SIOCCLUSTER_SET_NODENAME' undeclared (first use in this function) /usr/src/cluster/cman-kernel/src/cnxman.c:1754: error: `SIOCCLUSTER_SET_NODEID' undeclared (first us e in this function) /usr/src/cluster/cman-kernel/src/cnxman.c:1761: error: `SIOCCLUSTER_JOIN_CLUSTER' undeclared (first use in this function) /usr/src/cluster/cman-kernel/src/cnxman.c:1768: error: `SIOCCLUSTER_LEAVE_CLUSTER' undeclared (first use in this function) make[4]: *** [/usr/src/cluster/cman-kernel/src/cnxman.o] Error 1 make[3]: *** [_module_/usr/src/cluster/cman-kernel/src] Error 2 make[3]: Leaving directory `/usr/src/linux-2.6.7' make[2]: *** [all] Error 2 make[2]: Leaving directory `/usr/src/cluster/cman-kernel/src' make[1]: *** [all] Error 2 make[1]: Leaving directory `/usr/src/cluster/cman-kernel' make: *** [all] Error 2 # uname -a Linux c5.310.ru 2.6.7 #17 SMP Wed Jul 21 19:34:27 MSD 2004 i686 i686 i386 GNU/Linux -- e-mail: anton at hq.310.ru http://www.310.ru -- Linux-cluster mailing list Linux-cluster at redhat.com http://www.redhat.com/mailman/listinfo/linux-cluster From jbrassow at redhat.com Wed Aug 11 16:03:46 2004 From: jbrassow at redhat.com (Jonathan E Brassow) Date: Wed, 11 Aug 2004 11:03:46 -0500 Subject: [Linux-cluster] Multicast for GFS? In-Reply-To: <20040810122043.GE13291@tykepenguin.com> References: <1091553471.16747.165.camel@laza.eunet.yu> <1091556279.30938.179.camel@laza.eunet.yu> <1091736172.19762.336.camel@laza.eunet.yu> <20040809075039.GA9240@tykepenguin.com> <1092051143.1114.130.camel@laza.eunet.yu> <20040809133438.GI11723@tykepenguin.com> <1092067121.23273.235.camel@laza.eunet.yu> <20040810092900.GB13291@tykepenguin.com> <1092139910.32187.1098.camel@laza.eunet.yu> <20040810122043.GE13291@tykepenguin.com> Message-ID: <068D22A6-EBB0-11D8-9B62-000A957BB1F6@redhat.com> >> I also started to change ccs a bit for mcast support. It turns out >> that >> ccs has a lot of definitions hardcoded. Can I take 'em out and put >> into >> separate header file (comm_header.h looks nice :)? > > I think ccs_join.h would be reasonable, then it's obvious which .c file > it holds the defaults for. > i don't think there is a ccs_join.c (you're thinking of cman_tool (?)). comm_header.h would be fine. I'll take a look at it when your ready. brassow From anton at hq.310.ru Wed Aug 11 16:13:10 2004 From: anton at hq.310.ru (=?ISO-8859-15?B?wO3y7u0gzeX17vDu+Oj1?=) Date: Wed, 11 Aug 2004 20:13:10 +0400 Subject: [Linux-cluster] make problem In-Reply-To: <20040811152858.GH24727@tykepenguin.com> References: <19610389547.20040811191930@hq.310.ru> <20040811152858.GH24727@tykepenguin.com> Message-ID: <1182024470.20040811201310@hq.310.ru> ?????? ???? 
Patrick, Wednesday, August 11, 2004, 7:28:58 PM, you wrote: Patrick Caulfield> On Wed, Aug 11, 2004 at Patrick Caulfield> 07:19:30PM +0400, Anton Nekhoroshikh wrote: >> Hi all, >> >> after "cvs up -r HEAD -Pd ." for cluster folder >> i have problem with make >> Patrick Caulfield> It's including an old version Patrick Caulfield> of cnxman-socket.h. Check you have the updated one Patrick Caulfield> in /usr/include/cluster. It would be worth adding to usage.txt that you need to update not only /usr/local/cluster but also /path/to/kernel/include/cluster. Even better would be to automate this for the first install method. -- e-mail: anton at hq.310.ru http://www.310.ru From cherry at osdl.org Wed Aug 11 16:18:50 2004 From: cherry at osdl.org (John Cherry) Date: Wed, 11 Aug 2004 09:18:50 -0700 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Cluster summit materials In-Reply-To: <20040810205817.GB18086@marowsky-bree.de> References: <1092170965.2468.86.camel@cherrybomb.pdx.osdl.net> <20040810205817.GB18086@marowsky-bree.de> Message-ID: <1092241130.5683.48.camel@cherrybomb.pdx.osdl.net> On Tue, 2004-08-10 at 13:58, Lars Marowsky-Bree wrote: > On 2004-08-10T13:49:26, > John Cherry said: > > Hi John, minor correction here... > > This is a work in progress since Daniel Phillips is continuing to add > > The time was right to consider common cluster components. While we > > expected a fair amount of contention at the meetings, it was good to see > > a fairly unanimous desire to identify common components that could be > > leveraged over the various cluster implementations and to drive these > > common components to mainline acceptance. The common cluster components > > identified at the summit were... > > > > cman - cluster manager (membership/quorum/heartbeat, recovery > > control) > > fence - userland daemon which decides which nodes need fencing > > dlm - fully distributed, fully symmetrical lock manager > > gfs - clustered filesystem > > > > While these common components all have RHAT/Sistina roots, these > > components are in the best position for mainline acceptance. As APIs > > are defined for these services, other implementations could also be used > > (the vfs model). > > > > This isn't quite true. cman as a whole is not quite in the best position > for mainline acceptance; actually, most isn't. I realize that cman will probably be at "alpha" level maturity in October, but we did not discuss any other possibilities for kernel level membership/communication. linux-ha and openais have user level components. I suppose SSI membership could be considered as a candidate implementation for the initial merge, but the consensus was that we would focus on cman, define the APIs, and use cman as the initial membership/communication module. Multiple implementations would be good and if we do a good job defining the APIs (membership, communication, fencing), other membership services could be used down the road. Was I at a different summit than you attended, or is that your understanding of the strategic direction of moving Linux to be a "clusterable kernel"? > > However, what was identified was that the following components > > - membership How can we have membership without some form of communication service?
(communication-based membership or connectivity-based membership) The low level cluster communication mechanism is one of those services that I believe we need an API definition for since it will also be leveraged by higher level services such as group messaging or an event service. So you can call the core service "membership", but what we really need is membership/communication, which is what cman provides. Do you have another suggestion for this? TIPC + membership? > - DLM > - Fencing At the summit I attended, we also talked about using GFS as the initial "consumer" of the cluster infrastructure. The cluster infrastructure doesn't stand a chance of mainline acceptance without a consumer that both validates the interfaces and hardens the services. I am not being as subtile as RHAT was at the summit. If we are going to start the process to mainline the components needed to make linux a "clusterable kernel" this year, we will need to get behind the core services that we discussed at the summit. John > > would be the best ones to work on merging first, but it was acknowledged > that there's quite some work left for these to be done, in particular on > the API and the conceptual model behind it. > > > Sincerely, > Lars Marowsky-Br?e From cherry at osdl.org Wed Aug 11 17:22:15 2004 From: cherry at osdl.org (John Cherry) Date: Wed, 11 Aug 2004 10:22:15 -0700 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Cluster summit materials In-Reply-To: <20040811101104.F1924@build.pdx.osdl.net> References: <1092170965.2468.86.camel@cherrybomb.pdx.osdl.net> <20040810205817.GB18086@marowsky-bree.de> <1092241130.5683.48.camel@cherrybomb.pdx.osdl.net> <20040811101104.F1924@build.pdx.osdl.net> Message-ID: <1092244934.5685.59.camel@cherrybomb.pdx.osdl.net> On Wed, 2004-08-11 at 10:11, Chris Wright wrote: > * John Cherry (cherry at osdl.org) wrote: > > At the summit I attended, we also talked about using GFS as the initial > > "consumer" of the cluster infrastructure. The cluster infrastructure > > doesn't stand a chance of mainline acceptance without a consumer that > > both validates the interfaces and hardens the services. > > > > I am not being as subtile as RHAT was at the summit. If we are going to > > start the process to mainline the components needed to make linux a > > "clusterable kernel" this year, we will need to get behind the core > > services that we discussed at the summit. > > I read Lars' comments as something like: > There's still a lot of work to do, and it's not a foregone conclusion > that any of this would hit mainline. Agreed. There are no guarentees of mainline acceptance. We just need to line up against the unwritten "criteria" for mainline acceptance of this kind of code. These include infrastructure (common services) that would support multiple cluster implementations, not invasive to the core kernel, provide real value (i.e. infrastructure for a clustered filesystem), maintainable, active development community behind the code, etc. > > Maybe I extrapolated too far. However, the kernel summit included > a reasonable bit of pushback on placing this in the kernel without > convincing arguments to the contrary. So I think it's reasonable to > consider part of the work is still clearly defining that need. There were some user vs kernel discussions on the list prior to the summit, but the consensus at the summit was that the core common services would be in the kernel. 
After all, the initial consumer of the cluster infrastructure (clustered filesystem) is in the kernel. John From bruce.walker at hp.com Wed Aug 11 18:19:27 2004 From: bruce.walker at hp.com (Walker, Bruce J) Date: Wed, 11 Aug 2004 11:19:27 -0700 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Clustersummit materials Message-ID: <3689AF909D816446BA505D21F1461AE4C75110@cacexc04.americas.cpqcorp.net> > * John Cherry (cherry at osdl.org) wrote: > At the summit I attended, we also talked about using GFS as the initial > "consumer" of the cluster infrastructure. The cluster infrastructure > doesn't stand a chance of mainline acceptance without a consumer that > both validates the interfaces and hardens the services. Given cman etc. was written for GFS, it doesn't prove much that it works with GFS. Having an independent cluster effort (like OpenSSI) use the underlying infrastructure presents a much more compelling case. The OpenSSI project has started to look into this but help from OSDL, Intel and/or RedHat wouldn't be discouraged. Also, having SAF layered and/or ha-linux layered would also bolster the case as a general infrastructure. Bruce walker OpenSSI project lead From phillips at redhat.com Wed Aug 11 18:54:08 2004 From: phillips at redhat.com (Daniel Phillips) Date: Wed, 11 Aug 2004 14:54:08 -0400 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Cluster summit materials In-Reply-To: <1092241130.5683.48.camel@cherrybomb.pdx.osdl.net> References: <1092170965.2468.86.camel@cherrybomb.pdx.osdl.net> <20040810205817.GB18086@marowsky-bree.de> <1092241130.5683.48.camel@cherrybomb.pdx.osdl.net> Message-ID: <200408111454.08677.phillips@redhat.com> On Wednesday 11 August 2004 12:18, John Cherry wrote: > On Tue, 2004-08-10 at 13:58, Lars Marowsky-Bree wrote: > > On 2004-08-10T13:49:26, John Cherry said: > > > While these common components all have RHAT/Sistina roots, these > > > components are in the best position for mainline acceptance. As > > > APIs are defined for these services, other implementations could > > > also be used (the vfs model). > > > > This isn't quite true. cman as a whole is not quite in the best > > position for mainline acceptance; actually, most isn't. That's accurate, that's why I keep beating on the 'read the code' issue, not to mention trying it, and hacking it. > I realize that cman will probably be at "alpha" level maturity in > October, but we did not discuss any other possibilities for kernel > level membership/communication. I believe it was briefly mentioned that we mainly use bog-standard tcp socket streams for communication. I'll add that various subsystems incorporate their own reliability logic, and maybe one day far from now, we'll be able to unify all of that. For now, it's a little ambitions, not to mention unnecessary. > linux-ha and openais have user level > components. I suppose SSI membership could be considered as a > candidate implementation for the initial merge, but the consensus was > that we would focus on cman, define the APIs, and use cman as the > initial membership/communication module. Multiple implementations > would be good and if we do a good job defining the APIs (membership, > communication, fencing), other membership services could be used down > the road. IMHO, for the time being only failure detection and failover really has to be unified, and that is CMAN, including interaction with other bits and pieces, i.e., Magma and fencing, and hopefully other systems like Lars' SCRAT. 
As far as CMAN goes, Lars and Alan seem to be the main parties outside Red Hat. Lon and Patrick are most active inside Red Hat. I think we'd advance fastest if they start hacking each other's code (anybody I just overlooked, please bellow). However it goes, this process is going to take time. Two months would be blindingly fast, and that is before we even think about pushing to Andrew. > Was I at a different summit than you attended, or is that your > understanding of the strategic direction of moving Linux to be a > "clusterable kernel"? That seemed to be the concensus at the summit I attended. Note that we've already got the basic changes to the VFS in place, with a few small exceptions. I still think that gdlm can go to Andrew before CMAN, however that is contingent on working out a way to invert the link-level dependency on CMAN so that the OCFS2 guys and people who want to experiment with dlm-style coding can try it without being forced to adopt a lot of other, less stable infrastructure at the same time. This will be going forward in parallel with the CMAN api work. > How can we have membership without some form of communication > service? (communication-based membership or connectivity-based > membership) > > The low level cluster communication mechanism is one of those > services that I believe we need an API definition for since it will > also be leveraged by higher level services such as group messaging or > an event service. > > So you can call the core service "membership", but what we really > need is membership/communication, which is what cman provides. Do > you have another suggestion for this? TIPC + membership? I think you really mean "connection manager", not "communication service" I'll step back from this now and watch you guys sort it out :-) Regards, Daniel From lmb at suse.de Wed Aug 11 20:55:15 2004 From: lmb at suse.de (Lars Marowsky-Bree) Date: Wed, 11 Aug 2004 22:55:15 +0200 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Cluster summit materials In-Reply-To: <1092241130.5683.48.camel@cherrybomb.pdx.osdl.net> References: <1092170965.2468.86.camel@cherrybomb.pdx.osdl.net> <20040810205817.GB18086@marowsky-bree.de> <1092241130.5683.48.camel@cherrybomb.pdx.osdl.net> Message-ID: <20040811205515.GB10855@marowsky-bree.de> On 2004-08-11T09:18:50, John Cherry said: > I realize that cman will probably be at "alpha" level maturity in > October, but we did not discuss any other possibilities for kernel level > membership/communication. linux-ha and openais have user level > components. Let's be a bit more specific, we have so far agreed on defining the membership API in the kernel (and likely starting from the cman one here), but via a vfs-like "Virtual Cluster Switch" with pluggable components right from the start, of which cman may be one, or a module to go out and talk to a user-level membership implementation another. That all these components need to be in the kernel hasn't been quite agreed on, just that their information needs to be available there. > membership/communication module. Multiple implementations would be good > and if we do a good job defining the APIs (membership, communication, > fencing), other membership services could be used down the road. Right. > > However, what was identified was that the following components > > > > - membership > How can we have membership without some form of communication service? 
> (communication-based membership or connectivity-based membership) Communication was specifically excluded because the communication APIs are much more complex to define; how the membership is computed internally is, well, internal to the membership module, and thus is it's communication method... > The low level cluster communication mechanism is one of those services > that I believe we need an API definition for since it will also be > leveraged by higher level services such as group messaging or an event > service. Eventually, but it's also more complex and was thus excluded. We specifically listed those three components I gave, for good reasons... > At the summit I attended, we also talked about using GFS as the initial > "consumer" of the cluster infrastructure. The cluster infrastructure > doesn't stand a chance of mainline acceptance without a consumer that > both validates the interfaces and hardens the services. GFS for one doesn't need any further communication channels beyond the DLM and membership. There's more components which are needed here, ie the recovery coordination provided by their Service Manager and some others, however for very good reasons (both their technical as their political complexity) those were left out of the initial go at this. > I am not being as subtile as RHAT was at the summit. If we are going to > start the process to mainline the components needed to make linux a > "clusterable kernel" this year, we will need to get behind the core > services that we discussed at the summit. You better be as careful as everyone was at the Summit, or you'll quickly be treading very loose ground ;-) Sincerely, Lars Marowsky-Br?e -- High Availability & Clustering \ Philosophy proclaiming reason to be SUSE Labs, Research and Development | the supreme human virtue is falling SUSE LINUX AG - A Novell company \ prey to self-adulation. From daniel at osdl.org Wed Aug 11 21:24:49 2004 From: daniel at osdl.org (Daniel McNeil) Date: Wed, 11 Aug 2004 14:24:49 -0700 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Cluster summit materials In-Reply-To: <200408111454.08677.phillips@redhat.com> References: <1092170965.2468.86.camel@cherrybomb.pdx.osdl.net> <20040810205817.GB18086@marowsky-bree.de> <1092241130.5683.48.camel@cherrybomb.pdx.osdl.net> <200408111454.08677.phillips@redhat.com> Message-ID: <1092259489.14012.55.camel@ibm-c.pdx.osdl.net> On Wed, 2004-08-11 at 11:54, Daniel Phillips wrote: > On Wednesday 11 August 2004 12:18, John Cherry wrote: > > On Tue, 2004-08-10 at 13:58, Lars Marowsky-Bree wrote: > > > On 2004-08-10T13:49:26, John Cherry said: > > > > While these common components all have RHAT/Sistina roots, these > > > > components are in the best position for mainline acceptance. As > > > > APIs are defined for these services, other implementations could > > > > also be used (the vfs model). > > > > > > This isn't quite true. cman as a whole is not quite in the best > > > position for mainline acceptance; actually, most isn't. > > That's accurate, that's why I keep beating on the 'read the code' issue, > not to mention trying it, and hacking it. > > > I realize that cman will probably be at "alpha" level maturity in > > October, but we did not discuss any other possibilities for kernel > > level membership/communication. > > I believe it was briefly mentioned that we mainly use bog-standard tcp > socket streams for communication. 
I'll add that various subsystems > incorporate their own reliability logic, and maybe one day far from > now, we'll be able to unify all of that. For now, it's a little > ambitions, not to mention unnecessary. > > > linux-ha and openais have user level > > components. I suppose SSI membership could be considered as a > > candidate implementation for the initial merge, but the consensus was > > that we would focus on cman, define the APIs, and use cman as the > > initial membership/communication module. Multiple implementations > > would be good and if we do a good job defining the APIs (membership, > > communication, fencing), other membership services could be used down > > the road. > > IMHO, for the time being only failure detection and failover really has > to be unified, and that is CMAN, including interaction with other bits > and pieces, i.e., Magma and fencing, and hopefully other systems like > Lars' SCRAT. As far as CMAN goes, Lars and Alan seem to be the main > parties outside Red Hat. Lon and Patrick are most active inside Red > Hat. I think we'd advance fastest if they start hacking each other's > code (anybody I just overlooked, please bellow). I not sure what you mean by "failure detection and failover". Do you mean node failure detection and consensus membership change? I thought Magma is just redhat's backward compatibility layer. What "interaction" are you worried about? How fencing integrates and when it occurs might be issues we will need to think about more. > > However it goes, this process is going to take time. Two months would > be blindingly fast, and that is before we even think about pushing to > Andrew. > > > Was I at a different summit than you attended, or is that your > > understanding of the strategic direction of moving Linux to be a > > "clusterable kernel"? > > That seemed to be the concensus at the summit I attended. Note that > we've already got the basic changes to the VFS in place, with a few > small exceptions. > > I still think that gdlm can go to Andrew before CMAN, however that is > contingent on working out a way to invert the link-level dependency on > CMAN so that the OCFS2 guys and people who want to experiment with > dlm-style coding can try it without being forced to adopt a lot of > other, less stable infrastructure at the same time. This will be going > forward in parallel with the CMAN api work. How can the DLM go to Andrew without a membership layer to provide membership? I would think we need the DLM to actually be working... > > > How can we have membership without some form of communication > > service? (communication-based membership or connectivity-based > > membership) > > > > The low level cluster communication mechanism is one of those > > services that I believe we need an API definition for since it will > > also be leveraged by higher level services such as group messaging or > > an event service. > > > > So you can call the core service "membership", but what we really > > need is membership/communication, which is what cman provides. Do > > you have another suggestion for this? TIPC + membership? > > I think you really mean "connection manager", not "communication > service" I'll step back from this now and watch you guys sort it > out :-) I think John really does mean communication. For high availability, the cluster should have no single point of failure. This usually means multiple ethernet links. (I assume CMAN supports multiple links). 
To determine membership there needs to be a way of sending messages between the nodes to determine membership. Ideally, losing one ethernet link could/would be handle without causing any membership change. This kind of intra-cluster communication would be valuable for other cluster components as well. Example: a cluster snapshot :) or cluster mirror device should be able to send messages to other nodes in the cluster without having to worry about which specific link to use and what to do if a link fails. This would also be valuable for the DLM. Does CMAN provide this kind of functionality? If so, then it really is a communication service. Daniel McNeil > > Regards, > > Daniel > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From cherry at osdl.org Wed Aug 11 21:42:27 2004 From: cherry at osdl.org (John Cherry) Date: Wed, 11 Aug 2004 14:42:27 -0700 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Cluster summit materials In-Reply-To: <20040811205515.GB10855@marowsky-bree.de> References: <1092170965.2468.86.camel@cherrybomb.pdx.osdl.net> <20040810205817.GB18086@marowsky-bree.de> <1092241130.5683.48.camel@cherrybomb.pdx.osdl.net> <20040811205515.GB10855@marowsky-bree.de> Message-ID: <1092260546.6232.76.camel@cherrybomb.pdx.osdl.net> On Wed, 2004-08-11 at 13:55, Lars Marowsky-Bree wrote: > On 2004-08-11T09:18:50, > John Cherry said: > > > I realize that cman will probably be at "alpha" level maturity in > > October, but we did not discuss any other possibilities for kernel level > > membership/communication. linux-ha and openais have user level > > components. > > Let's be a bit more specific, we have so far agreed on defining the > membership API in the kernel (and likely starting from the cman one > here), but via a vfs-like "Virtual Cluster Switch" with pluggable > components right from the start, of which cman may be one, or a module > to go out and talk to a user-level membership implementation another. > > That all these components need to be in the kernel hasn't been quite > agreed on, just that their information needs to be available there. Quite right. The primary "next step" we defined was to agree on the membership API in the kernel. In the OSS community, this means to provide code that works and define the API based on the working code. You have to start with something, and cman (or whatever it becomes) is a likely first candidate. With a vfs-like cluster switch, any membership service could be plugged in, including one that goes out and talks to a user level membership implementation. I wonder which one you might have in mind? :) > > > membership/communication module. Multiple implementations would be good > > and if we do a good job defining the APIs (membership, communication, > > fencing), other membership services could be used down the road. > > Right. > > > > However, what was identified was that the following components > > > > > > - membership > > How can we have membership without some form of communication service? > > (communication-based membership or connectivity-based membership) > > Communication was specifically excluded because the communication APIs > are much more complex to define; how the membership is computed > internally is, well, internal to the membership module, and thus is it's > communication method... Yes. Perhaps the communication API definitions were excluded in the first go-around. 
However, you have to admit that cluster communication IS required, if for nothing else to provide redundant communication paths, and exposing this communication API would be valuable for higher level services. For instance, you don't want an event service to provide a completely orthogonal communication mechanism in the cluster when it could use the one that also provides the cluster heartbeat mechanism. > > > The low level cluster communication mechanism is one of those services > > that I believe we need an API definition for since it will also be > > leveraged by higher level services such as group messaging or an event > > service. > > Eventually, but it's also more complex and was thus excluded. We > specifically listed those three components I gave, for good reasons... OK. I admit that we were not going to focus on the communication API in the first go-around. > > > At the summit I attended, we also talked about using GFS as the initial > > "consumer" of the cluster infrastructure. The cluster infrastructure > > doesn't stand a chance of mainline acceptance without a consumer that > > both validates the interfaces and hardens the services. > > GFS for one doesn't need any further communication channels beyond the > DLM and membership. I can agree with that. > > There's more components which are needed here, ie the recovery > coordination provided by their Service Manager and some others, however > for very good reasons (both their technical as their political > complexity) those were left out of the initial go at this. Agreed. However, fencing will probably need to be addressed as we define the membership API. > > > I am not being as subtile as RHAT was at the summit. If we are going to > > start the process to mainline the components needed to make linux a > > "clusterable kernel" this year, we will need to get behind the core > > services that we discussed at the summit. > > You better be as careful as everyone was at the Summit, or you'll > quickly be treading very loose ground ;-) Sorry. I didn't mean to put words into anybody's mouth. However, if there was disagreement with the basic strategy moving forward....it was pretty stealthy. :) John From lmb at suse.de Wed Aug 11 22:04:00 2004 From: lmb at suse.de (Lars Marowsky-Bree) Date: Thu, 12 Aug 2004 00:04:00 +0200 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Cluster summit materials In-Reply-To: <1092259489.14012.55.camel@ibm-c.pdx.osdl.net> References: <1092170965.2468.86.camel@cherrybomb.pdx.osdl.net> <20040810205817.GB18086@marowsky-bree.de> <1092241130.5683.48.camel@cherrybomb.pdx.osdl.net> <200408111454.08677.phillips@redhat.com> <1092259489.14012.55.camel@ibm-c.pdx.osdl.net> Message-ID: <20040811220400.GG10855@marowsky-bree.de> On 2004-08-11T14:24:49, Daniel McNeil said: > How can the DLM go to Andrew without a membership layer to > provide membership? I'd agree with this question. Membership is really the first and foremost question, then the DLM can go in. Fencing turns out to be a more difficult beast, because the way how the GFS stack handles it's recovery (a static priority list) is somewhat fundamentally incompatible with the way how a more powerful dependency based cluster recovery manager might wish to handle things. We've just run into this discussion ourselves, and as soon as we have an idea, will propose that adequately for discussion... > I think John really does mean communication. For high availability, > the cluster should have no single point of failure. 
Exposing the communication APIs begs a ton of questions regarding the semantics; atomic, causal or total ordering?; communication groups; access controls to those; sync or async; broadcast, multicast or pair-wise channels? All of these and some more can/should be supported, however most systems just provide subsets. How to expose that, how to handle it? That's a bit more difficult than answering the question about membership, which is even complex enough - do you get to see membership before or after fencing, with or without quorum etc. Don't rush this. Don't get sidetracked. (And trust me, I've been there at OCF for that one.) Concentrate on the slightly more palatable ones like membership and DLM, and after we've established prior art, then lets tackle the bigger issues. Nobody denies that communication, recovery coordination etc are required and very important, just that we don't wish to start there. > Does CMAN provide this kind of functionality? If so, then it > really is a communication service. It provides a very limitted subset of it which is, for example, not even useable to the low requirements SCRAT (heartbeat's new recovery/resource manager) has, as far as I can see, because it's not performing well enough. And it's not meant to, because they architect their stack differently (around DLM + TCP etc), but it means we'll need to work on this area some more first. Sincerely, Lars Marowsky-Br?e -- High Availability & Clustering \ Philosophy proclaiming reason to be SUSE Labs, Research and Development | the supreme human virtue is falling SUSE LINUX AG - A Novell company \ prey to self-adulation. From pcaulfie at redhat.com Thu Aug 12 06:57:53 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Thu, 12 Aug 2004 07:57:53 +0100 Subject: [Linux-cluster] Multicast for GFS? In-Reply-To: <068D22A6-EBB0-11D8-9B62-000A957BB1F6@redhat.com> References: <1091556279.30938.179.camel@laza.eunet.yu> <1091736172.19762.336.camel@laza.eunet.yu> <20040809075039.GA9240@tykepenguin.com> <1092051143.1114.130.camel@laza.eunet.yu> <20040809133438.GI11723@tykepenguin.com> <1092067121.23273.235.camel@laza.eunet.yu> <20040810092900.GB13291@tykepenguin.com> <1092139910.32187.1098.camel@laza.eunet.yu> <20040810122043.GE13291@tykepenguin.com> <068D22A6-EBB0-11D8-9B62-000A957BB1F6@redhat.com> Message-ID: <20040812065752.GA21565@tykepenguin.com> On Wed, Aug 11, 2004 at 11:03:46AM -0500, Jonathan E Brassow wrote: > >>I also started to change ccs a bit for mcast support. It turns out > >>that > >>ccs has a lot of definitions hardcoded. Can I take 'em out and put > >>into > >>separate header file (comm_header.h looks nice :)? > > > >I think ccs_join.h would be reasonable, then it's obvious which .c file > >it holds the defaults for. > > > > i don't think there is a ccs_join.c (you're thinking of cman_tool (?)). > comm_header.h would be fine. I'll take a look at it when your ready. Sorry, yes I was still in cman_tool mode ! 
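Picking up the comm_header.h idea (and the earlier multicast TTL thread), here is a hypothetical sketch of what such a header might collect; apart from MCAST_TTL_PATH, which is quoted from the proposal above, none of these names or values come from the CVS tree.

/* comm_header.h (hypothetical layout) -- gather the values that are
 * currently hard-coded in the .c files, so cluster.conf can override
 * them while the compiled-in numbers remain sane defaults. */
#ifndef COMM_HEADER_DOT_H
#define COMM_HEADER_DOT_H

/* compiled-in defaults, used when cluster.conf is silent */
#define DEFAULT_MCAST_TTL       10
#define DEFAULT_MCAST_ADDR      "239.0.0.11"    /* example value only */
#define DEFAULT_CLUSTER_PORT    50007           /* example value only */

/* cluster.conf queries, in the xpath style proposed in this thread */
#define MCAST_TTL_PATH          "//cluster/cman/multicast/@ttl"
#define MCAST_ADDR_PATH         "//cluster/cman/multicast/@addr"   /* assumed */

#endif /* COMM_HEADER_DOT_H */

The .c files would then fall back to the DEFAULT_* values whenever the corresponding cluster.conf query comes back empty, which keeps the "reasonable defaults" requirement intact.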
-- patrick From chrisw at osdl.org Wed Aug 11 17:11:04 2004 From: chrisw at osdl.org (Chris Wright) Date: Wed, 11 Aug 2004 10:11:04 -0700 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Cluster summit materials In-Reply-To: <1092241130.5683.48.camel@cherrybomb.pdx.osdl.net>; from cherry@osdl.org on Wed, Aug 11, 2004 at 09:18:50AM -0700 References: <1092170965.2468.86.camel@cherrybomb.pdx.osdl.net> <20040810205817.GB18086@marowsky-bree.de> <1092241130.5683.48.camel@cherrybomb.pdx.osdl.net> Message-ID: <20040811101104.F1924@build.pdx.osdl.net> * John Cherry (cherry at osdl.org) wrote: > At the summit I attended, we also talked about using GFS as the initial > "consumer" of the cluster infrastructure. The cluster infrastructure > doesn't stand a chance of mainline acceptance without a consumer that > both validates the interfaces and hardens the services. > > I am not being as subtile as RHAT was at the summit. If we are going to > start the process to mainline the components needed to make linux a > "clusterable kernel" this year, we will need to get behind the core > services that we discussed at the summit. I read Lars' comments as something like: There's still a lot of work to do, and it's not a foregone conclusion that any of this would hit mainline. Maybe I extrapolated too far. However, the kernel summit included a reasonable bit of pushback on placing this in the kernel without convincing arguments to the contrary. So I think it's reasonable to consider part of the work is still clearly defining that need. thanks, -chris -- Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net From lmb at suse.de Thu Aug 12 09:57:36 2004 From: lmb at suse.de (Lars Marowsky-Bree) Date: Thu, 12 Aug 2004 11:57:36 +0200 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Clustersummit materials In-Reply-To: <1092249962.4717.21.camel@persist.az.mvista.com> References: <3689AF909D816446BA505D21F1461AE4C75110@cacexc04.americas.cpqcorp.net> <1092249962.4717.21.camel@persist.az.mvista.com> Message-ID: <20040812095736.GE4096@marowsky-bree.de> On 2004-08-11T11:46:03, Steven Dake said: > If we can't live with the cluster services in userland (although I'm > still not convinced), then atleast the group messaging protocol in the > kernel could be based upon 20 years of research in group messaging and > work properly under _all_ fault scenarios. Right. Another important alternative maybe the Transis group communication suite, which has been released as GPL/LGPL now. This all just highlights that we need to think about communication some more before we can tackle it sensibly, but of course I'll be glad if someone proves me wrong and Just Does It ;-) Sincerely, Lars Marowsky-Br?e -- High Availability & Clustering \ I allow neither my experience SUSE Labs, Research and Development | nor my cynicism to deter my SUSE LINUX AG - A Novell company \ optimistic outlook on life. From phillips at redhat.com Wed Aug 11 19:58:57 2004 From: phillips at redhat.com (Daniel Phillips) Date: Wed, 11 Aug 2004 15:58:57 -0400 Subject: [Linux-cluster] [ANNOUNCE] Minneapolis Cluster Summit Wrapup Message-ID: <200408111558.57898.phillips@redhat.com> Hi All, The Minneapolis Cluster Summit came and went 10 days ago, with excellent attendance and high-quality interaction all round. 
Over the last few days I've been collecting slide presentations and related material onto this page: http://sources.redhat.com/cluster/events/summit2004/presentations.html Unfortunately, due to manpower limitations and short lead time, we weren't able to arrange for audio recordings, which would have been great since both presentations and discussion were packed full of useful material. I guess this means we have to do it again next year, this time with a tape recorder! As for the results... discussion continues on linux-cluster and other mailing lists, please judge for yourself. https://www.redhat.com/mailman/listinfo/linux-cluster http://lists.osdl.org/mailman/listinfo/cgl_discussion http://lists.osdl.org/mailman/listinfo/dcl_discussion Regards, Daniel From phillips at redhat.com Thu Aug 12 22:47:17 2004 From: phillips at redhat.com (Daniel Phillips) Date: Thu, 12 Aug 2004 18:47:17 -0400 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Cluster summit materials In-Reply-To: <1092259489.14012.55.camel@ibm-c.pdx.osdl.net> References: <1092170965.2468.86.camel@cherrybomb.pdx.osdl.net> <200408111454.08677.phillips@redhat.com> <1092259489.14012.55.camel@ibm-c.pdx.osdl.net> Message-ID: <200408121847.17158.phillips@redhat.com> On Wednesday 11 August 2004 17:24, Daniel McNeil wrote: > > IMHO, for the time being only failure detection and failover really > > has to be unified, and that is CMAN, including interaction with > > other bits and pieces, i.e., Magma and fencing, and hopefully other > > systems like Lars' SCRAT. As far as CMAN goes, Lars and Alan seem > > to be the main parties outside Red Hat. Lon and Patrick are most > > active inside Red Hat. I think we'd advance fastest if they start > > hacking each other's code (anybody I just overlooked, please > > bellow). > > I not sure what you mean by "failure detection and failover". > Do you mean node failure detection and consensus membership change? I mean anything in the cluster that can fail and be reinstantiated. This would include server processes for cluster block devices such as the ones I've designed, as well as whole nodes. It would also include communication paths, such as socket connections. But by now you may have detected a bias against trying to deal with the latter in a one-size-fits-all automagic, never-stop-never-give-up cluster communications thingamajig layer. What we really need is just a framework for failure detection, including methods supplied by various cluster components, and methods for re-instantiating failed components. Note note note: while a "cluster component" could conceivably be a whole node, that's a special case and we really need to cater to the case that will eventually be much more common, where cluster nodes may be doing all kinds of other things besides just participating in clusters. So by "cluster component" I really mean something closer to "task". > I thought Magma is just redhat's backward compatibility layer. > What "interaction" are you worried about? You might want to ask Lon about that... > How fencing integrates and when it occurs might be issues we > will need to think about more. Understatement of the day. > How can the DLM go to Andrew without a membership layer to > provide membership? By having a simple registration api that allows one to register a membership layer, in place of what is there now, i.e., function links between modules. 
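To make the registration idea just described concrete, here is a minimal sketch of what such a hook could look like. The names below (member_callbacks, membership_register and so on) are hypothetical illustrations, not the actual CMAN or DLM symbols; the point is only that a consumer such as the DLM registers callbacks with whatever membership provider is loaded, instead of being hard-linked to one particular implementation.

    /* Hypothetical sketch only -- none of these names exist in the real code. */
    struct member_callbacks {
            void (*node_up)(int nodeid);
            void (*node_down)(int nodeid);
            void (*quorum_change)(int quorate);
    };

    /* Implemented by whichever membership provider is loaded (cman, heartbeat, ...). */
    int  membership_register(const char *subsystem, const struct member_callbacks *cb);
    void membership_unregister(const char *subsystem);

A lockspace would then drive its recovery from node_down() events without caring which provider produced them.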
> > > So you can call the core service "membership", but what we really > > > need is membership/communication, which is what cman provides. > > > Do you have another suggestion for this? TIPC + membership? > > > > I think you really mean "connection manager", not "communication > > service" I'll step back from this now and watch you guys sort it > > out :-) > > I think John really does mean communication. For high availability, > the cluster should have no single point of failure. This usually > means multiple ethernet links. But it's not the business of the cluster framework to operate the links, only to know when they have failed and to be able to arrange for new connections. So John really does mean "connection" and not "communication", I hope. > (I assume CMAN supports multiple > links). To determine membership there needs to be a way of sending > messages between the nodes to determine membership. Ideally, losing > one ethernet link could/would be handle without causing any > membership change. "Ideally" is not a strong enough word, imho. > This kind of intra-cluster communication would be valuable for > other cluster components as well. Example: a cluster snapshot :) > or cluster mirror device should be able to send messages to > other nodes in the cluster without having to worry about which > specific link to use and what to do if a link fails. This would > also be valuable for the DLM. OK, we've seen lots of warnings about not getting derailed by trying to invent the perfect cluster communication system, we should heed those warnings. Instead, let's get down to precise specification of the methods we need to have, and compare it to what already exists, for establishing and re-establishing connections. > Does CMAN provide this kind of functionality? If so, then it > really is a communication service. http://people.redhat.com/~teigland/sca.pdf Regards, Daniel From sdake at mvista.com Thu Aug 12 17:42:16 2004 From: sdake at mvista.com (Steven Dake) Date: Thu, 12 Aug 2004 10:42:16 -0700 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Clustersummit materials In-Reply-To: <20040812095736.GE4096@marowsky-bree.de> References: <3689AF909D816446BA505D21F1461AE4C75110@cacexc04.americas.cpqcorp.net> <1092249962.4717.21.camel@persist.az.mvista.com> <20040812095736.GE4096@marowsky-bree.de> Message-ID: <1092332536.7315.1.camel@persist.az.mvista.com> On Thu, 2004-08-12 at 02:57, Lars Marowsky-Bree wrote: > On 2004-08-11T11:46:03, > Steven Dake said: > > > If we can't live with the cluster services in userland (although I'm > > still not convinced), then atleast the group messaging protocol in the > > kernel could be based upon 20 years of research in group messaging and > > work properly under _all_ fault scenarios. > > Right. Another important alternative maybe the Transis group > communication suite, which has been released as GPL/LGPL now. > > This all just highlights that we need to think about communication some > more before we can tackle it sensibly, but of course I'll be glad if > someone proves me wrong and Just Does It ;-) > agreed... Transis in kernel would be a fine alternative to openais gmi in kernel. Speaking of transis, is the code posted anywhere? I'd like to have a look. 
Thanks -steve > > Sincerely, > Lars Marowsky-Br?e From lmb at suse.de Thu Aug 12 20:37:38 2004 From: lmb at suse.de (Lars Marowsky-Bree) Date: Thu, 12 Aug 2004 22:37:38 +0200 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Clustersummit materials In-Reply-To: <1092332536.7315.1.camel@persist.az.mvista.com> References: <3689AF909D816446BA505D21F1461AE4C75110@cacexc04.americas.cpqcorp.net> <1092249962.4717.21.camel@persist.az.mvista.com> <20040812095736.GE4096@marowsky-bree.de> <1092332536.7315.1.camel@persist.az.mvista.com> Message-ID: <20040812203738.GK9722@marowsky-bree.de> On 2004-08-12T10:42:16, Steven Dake said: > agreed... Transis in kernel would be a fine alternative to openais gmi > in kernel. > > Speaking of transis, is the code posted anywhere? I'd like to have a > look. It's not yet at the final location, but we put up what we got at http://wiki.trick.ca/linux-ha/Transis . Sincerely, Lars Marowsky-Br?e -- High Availability & Clustering \ This space / SUSE Labs, Research and Development | intentionally | SUSE LINUX AG - A Novell company \ left blank / From sdake at mvista.com Thu Aug 12 22:59:10 2004 From: sdake at mvista.com (Steven Dake) Date: Thu, 12 Aug 2004 15:59:10 -0700 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Clustersummit materials In-Reply-To: <20040812203738.GK9722@marowsky-bree.de> References: <3689AF909D816446BA505D21F1461AE4C75110@cacexc04.americas.cpqcorp.net> <1092249962.4717.21.camel@persist.az.mvista.com> <20040812095736.GE4096@marowsky-bree.de> <1092332536.7315.1.camel@persist.az.mvista.com> <20040812203738.GK9722@marowsky-bree.de> Message-ID: <1092351549.7315.5.camel@persist.az.mvista.com> On Thu, 2004-08-12 at 13:37, Lars Marowsky-Bree wrote: > On 2004-08-12T10:42:16, > Steven Dake said: > > > agreed... Transis in kernel would be a fine alternative to openais gmi > > in kernel. > > > > Speaking of transis, is the code posted anywhere? I'd like to have a > > look. > > It's not yet at the final location, but we put up what we got at > http://wiki.trick.ca/linux-ha/Transis . > > Lars Thanks for posting transis. I had a look at the examples and API. The API is of course different then openais and focused on client/server architecture. I tried a performance test by sending a 64k message, and then receiving it 10 times with two nodes. This operation takes about 5 seconds on my hardware which is 128k/sec. I was expecting more like 8-10MB/sec. Is there anything that can be done to improve the performance? Thanks -steve Certainly a different sort of API then openais... > Sincerely, > Lars Marowsky-Br?e From sdake at mvista.com Thu Aug 12 23:08:08 2004 From: sdake at mvista.com (Steven Dake) Date: Thu, 12 Aug 2004 16:08:08 -0700 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Cluster summit materials In-Reply-To: <200408121847.17158.phillips@redhat.com> References: <1092170965.2468.86.camel@cherrybomb.pdx.osdl.net> <200408111454.08677.phillips@redhat.com> <1092259489.14012.55.camel@ibm-c.pdx.osdl.net> <200408121847.17158.phillips@redhat.com> Message-ID: <1092352087.7315.15.camel@persist.az.mvista.com> Daniel comments below On Thu, 2004-08-12 at 15:47, Daniel Phillips wrote: > On Wednesday 11 August 2004 17:24, Daniel McNeil wrote: > > > IMHO, for the time being only failure detection and failover really > > > has to be unified, and that is CMAN, including interaction with > > > other bits and pieces, i.e., Magma and fencing, and hopefully other > > > systems like Lars' SCRAT. 
As far as CMAN goes, Lars and Alan seem > > > to be the main parties outside Red Hat. Lon and Patrick are most > > > active inside Red Hat. I think we'd advance fastest if they start > > > hacking each other's code (anybody I just overlooked, please > > > bellow). > > > > I not sure what you mean by "failure detection and failover". > > Do you mean node failure detection and consensus membership change? > > I mean anything in the cluster that can fail and be reinstantiated. > This would include server processes for cluster block devices such as > the ones I've designed, as well as whole nodes. It would also include > communication paths, such as socket connections. But by now you may > have detected a bias against trying to deal with the latter in a > one-size-fits-all automagic, never-stop-never-give-up cluster > communications thingamajig layer. What we really need is just a > framework for failure detection, including methods supplied by various > cluster components, and methods for re-instantiating failed components. > There really is no reason to reinvent the wheel here. An API has already been developed in the SA Forum Availability Management Framework, and an implementation already exists (http://developer.osdl.org/dev/openais). I suspect there is some work that linux-ha has done on this topic as well. > Note note note: while a "cluster component" could conceivably be a whole > node, that's a special case and we really need to cater to the case > that will eventually be much more common, where cluster nodes may be > doing all kinds of other things besides just participating in clusters. > So by "cluster component" I really mean something closer to "task". > > > I thought Magma is just redhat's backward compatibility layer. > > What "interaction" are you worried about? > > You might want to ask Lon about that... > > > How fencing integrates and when it occurs might be issues we > > will need to think about more. > > Understatement of the day. > > > How can the DLM go to Andrew without a membership layer to > > provide membership? > > By having a simple registration api that allows one to register a > membership layer, in place of what is there now, i.e., function links > between modules. > I think what you are missing is that membership and messaging are strongly related to one another. When a message is sent, it is sent under a certain membership view. When it is received, it should also be received under that same membership view. Otherwise, the view of the membership cannot be used to make decisions along with the message contents. If the distributed system must make decisions about a message based upon the view of the membership (which obviously DLM must do to be reliable) then integrating these two features is the only approach that works. For this reason, membersihp and messaging are tightly integrated, atleast if a reliable distributed system is desired. > > > > So you can call the core service "membership", but what we really > > > > need is membership/communication, which is what cman provides. > > > > Do you have another suggestion for this? TIPC + membership? > > > > > > I think you really mean "connection manager", not "communication > > > service" I'll step back from this now and watch you guys sort it > > > out :-) > > > > I think John really does mean communication. For high availability, > > the cluster should have no single point of failure. This usually > > means multiple ethernet links. 
> > But it's not the business of the cluster framework to operate the links, > only to know when they have failed and to be able to arrange for new > connections. So John really does mean "connection" and not > "communication", I hope. > > > (I assume CMAN supports multiple > > links). To determine membership there needs to be a way of sending > > messages between the nodes to determine membership. Ideally, losing > > one ethernet link could/would be handle without causing any > > membership change. > > "Ideally" is not a strong enough word, imho. > > > This kind of intra-cluster communication would be valuable for > > other cluster components as well. Example: a cluster snapshot :) > > or cluster mirror device should be able to send messages to > > other nodes in the cluster without having to worry about which > > specific link to use and what to do if a link fails. This would > > also be valuable for the DLM. > > OK, we've seen lots of warnings about not getting derailed by trying to > invent the perfect cluster communication system, we should heed those > warnings. Instead, let's get down to precise specification of the > methods we need to have, and compare it to what already exists, for > establishing and re-establishing connections. > The perfect cluster communication model has already been invented: its called virtual synchrony and backed up by 20 years of research. There are several protocols that implement this model. If there is no need for agreed ordering or group communication in dlm, then maybe an argument could be made that virtual synchrony is not appropriate for dlm. But, DLM benefits strongly from the semantics of virtual synchrony and makes implementing a distributed lock service trivial. Thanks for listening -steve > > Does CMAN provide this kind of functionality? If so, then it > > really is a communication service. > > http://people.redhat.com/~teigland/sca.pdf > > Regards, > > Daniel > _______________________________________________ > cgl_discussion mailing list > cgl_discussion at lists.osdl.org > http://lists.osdl.org/mailman/listinfo/cgl_discussion From lmb at suse.de Fri Aug 13 09:40:24 2004 From: lmb at suse.de (Lars Marowsky-Bree) Date: Fri, 13 Aug 2004 11:40:24 +0200 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Clustersummit materials In-Reply-To: <1092351549.7315.5.camel@persist.az.mvista.com> References: <3689AF909D816446BA505D21F1461AE4C75110@cacexc04.americas.cpqcorp.net> <1092249962.4717.21.camel@persist.az.mvista.com> <20040812095736.GE4096@marowsky-bree.de> <1092332536.7315.1.camel@persist.az.mvista.com> <20040812203738.GK9722@marowsky-bree.de> <1092351549.7315.5.camel@persist.az.mvista.com> Message-ID: <20040813094024.GH4161@marowsky-bree.de> On 2004-08-12T15:59:10, Steven Dake said: > Thanks for posting transis. I had a look at the examples and API. The > API is of course different then openais and focused on client/server > architecture. Right. > I tried a performance test by sending a 64k message, and then receiving > it 10 times with two nodes. This operation takes about 5 seconds on my > hardware which is 128k/sec. I was expecting more like 8-10MB/sec. Is > there anything that can be done to improve the performance? I've not yet done any real tests with it, so I'm not sure. We were mostly going from the theoretical description ;) But I think 128k/s is really a bit low, so I assume something ain't quite right yet... We'll figure it out. 
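Taking the numbers in the quoted test at face value, the arithmetic behind the 128k/sec figure is simply:

    10 messages x 64 KB = 640 KB transferred
    640 KB / ~5 s       = ~128 KB/s observed
    versus the expected 8-10 MB/s, i.e. roughly 60-80x slower than hoped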
It's possible that maybe it's not the way to go afterall, but before we could go looking we first needed it as GPL/LGPL (for not becoming IP-tainted). Sincerely, Lars Marowsky-Br?e -- High Availability & Clustering \ This space / SUSE Labs, Research and Development | intentionally | SUSE LINUX AG - A Novell company \ left blank / From jonathan at cnds.jhu.edu Fri Aug 13 15:54:41 2004 From: jonathan at cnds.jhu.edu (Jonathan Stanton) Date: Fri, 13 Aug 2004 11:54:41 -0400 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Clustersummit materials In-Reply-To: <1092351549.7315.5.camel@persist.az.mvista.com> References: <3689AF909D816446BA505D21F1461AE4C75110@cacexc04.americas.cpqcorp.net> <1092249962.4717.21.camel@persist.az.mvista.com> <20040812095736.GE4096@marowsky-bree.de> <1092332536.7315.1.camel@persist.az.mvista.com> <20040812203738.GK9722@marowsky-bree.de> <1092351549.7315.5.camel@persist.az.mvista.com> Message-ID: <20040813155441.GA16662@cnds.jhu.edu> Hi, I just joined the linux-cluster list after seeing a few of the messages that were cross-posted to linux-kernel. On Thu, Aug 12, 2004 at 03:59:10PM -0700, Steven Dake wrote: > Lars > > Thanks for posting transis. I had a look at the examples and API. The > API is of course different then openais and focused on client/server > architecture. If you havn't looked at it already, you might want to try out the Spread group communication system. http://www.spread.org/ It is, conceptually although not code-wise, a decendant of the Transis work (and the Totem system from UCSB) and is relatively widely used as a production quality group messaging system (Some apache modules use it along with a number of large web-clusters, a few commercial clustered storage systems, and a lot of custom replication apps). It is not under GPL but is open-source under a bsd-style (but not exactly the same) license. Like transis it has a client-server architecture (and a simpler API). > I tried a performance test by sending a 64k message, and then receiving > it 10 times with two nodes. This operation takes about 5 seconds on my > hardware which is 128k/sec. I was expecting more like 8-10MB/sec. Is > there anything that can be done to improve the performance? I would expect transis to definitely do better then 128k/s given tests we ran a number of years ago, but on upto medium sized lan environments the totem/spread protocols are generally faster with less cpu overhead. I know Spread could get 80Mb/s a number of years ago. We recently re-ran a clean set of benchmarks and wrote them up. You can find them at: http://www.cnds.jhu.edu/pub/papers/cnds-2004-1.pdf I admit some bias as I'm one of the lead developers of Spread, and we (the developers) have been building group messaging systems since the early 90's -- so I may look at things a bit differently -- so I would be very intersted in your thoughts on how you could use GCS and whether Spread would be useful. Cheers, Jonathan -- ------------------------------------------------------- Jonathan R. Stanton jonathan at cs.jhu.edu Dept. of Computer Science Johns Hopkins University ------------------------------------------------------- From angel at telvia.it Sat Aug 14 18:57:37 2004 From: angel at telvia.it (Angelo Ovidi) Date: Sat, 14 Aug 2004 20:57:37 +0200 Subject: [Linux-cluster] Error compiling GFS patched kernel Message-ID: <001d01c48230$a4877b80$0a14a8c0@venus.it> Hi. I am trying to compile a 2.6.7 kernel patched with cvs version of cluster package of redhat. 
I have no error applying the patches but the compile give me this error: CC [M] fs/gfs/bmap.o CC [M] fs/gfs/daemon.o CC [M] fs/gfs/dio.o CC [M] fs/gfs/dir.o CC [M] fs/gfs/eattr.o CC [M] fs/gfs/file.o CC [M] fs/gfs/flock.o CC [M] fs/gfs/glock.o CC [M] fs/gfs/glops.o CC [M] fs/gfs/inode.o fs/gfs/inode.c: In function `inode_init_and_link': fs/gfs/inode.c:1214: invalid lvalue in unary `&' fs/gfs/inode.c: In function `inode_alloc_hidden': fs/gfs/inode.c:1933: invalid lvalue in unary `&' make[2]: *** [fs/gfs/inode.o] Error 1 make[1]: *** [fs/gfs] Error 2 make: *** [fs] Error 2 What's the problem? Best regards, Angelo Ovidi Venere Net Spa Rome, Italy -------------- next part -------------- An HTML attachment was scrubbed... URL: From lmb at suse.de Fri Aug 13 20:30:29 2004 From: lmb at suse.de (Lars Marowsky-Bree) Date: Fri, 13 Aug 2004 22:30:29 +0200 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Clustersummit materials In-Reply-To: <20040813155441.GA16662@cnds.jhu.edu> References: <3689AF909D816446BA505D21F1461AE4C75110@cacexc04.americas.cpqcorp.net> <1092249962.4717.21.camel@persist.az.mvista.com> <20040812095736.GE4096@marowsky-bree.de> <1092332536.7315.1.camel@persist.az.mvista.com> <20040812203738.GK9722@marowsky-bree.de> <1092351549.7315.5.camel@persist.az.mvista.com> <20040813155441.GA16662@cnds.jhu.edu> Message-ID: <20040813203029.GW4161@marowsky-bree.de> On 2004-08-13T11:54:41, Jonathan Stanton said: > If you havn't looked at it already, you might want to try out the Spread > group communication system. > > http://www.spread.org/ The intel lawyers have identified the Spread license to be GPL-incompatible. Otherwise, I agree, Spread is very nice. If those issues could be resolved, that may be an interesting option too. (I think the advertising clause and something else clash with the (L)GPL; I can put you in contact with the Intel folks if you wish to resolve this.) Sincerely, Lars Marowsky-Br?e -- High Availability & Clustering \ This space / SUSE Labs, Research and Development | intentionally | SUSE LINUX AG - A Novell company \ left blank / From jonathan at cnds.jhu.edu Fri Aug 13 22:53:15 2004 From: jonathan at cnds.jhu.edu (Jonathan Stanton) Date: Fri, 13 Aug 2004 18:53:15 -0400 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Clustersummit materials In-Reply-To: <20040813203029.GW4161@marowsky-bree.de> References: <3689AF909D816446BA505D21F1461AE4C75110@cacexc04.americas.cpqcorp.net> <1092249962.4717.21.camel@persist.az.mvista.com> <20040812095736.GE4096@marowsky-bree.de> <1092332536.7315.1.camel@persist.az.mvista.com> <20040812203738.GK9722@marowsky-bree.de> <1092351549.7315.5.camel@persist.az.mvista.com> <20040813155441.GA16662@cnds.jhu.edu> <20040813203029.GW4161@marowsky-bree.de> Message-ID: <20040813225315.GD16662@cnds.jhu.edu> On Fri, Aug 13, 2004 at 10:30:29PM +0200, Lars Marowsky-Bree wrote: > On 2004-08-13T11:54:41, > Jonathan Stanton said: > > > If you havn't looked at it already, you might want to try out the Spread > > group communication system. > > > > http://www.spread.org/ > > The intel lawyers have identified the Spread license to be > GPL-incompatible. > > Otherwise, I agree, Spread is very nice. If those issues could be > resolved, that may be an interesting option too. > > (I think the advertising clause and something else clash with the > (L)GPL; I can put you in contact with the Intel folks if you wish to > resolve this.) I would appreciate that. 
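Returning to the "invalid lvalue in unary `&'" build failure in fs/gfs/inode.c reported earlier in this section: that diagnostic is the classic symptom of compiling code that relies on GCC's old cast-as-lvalue extension with a newer compiler (the extension was deprecated and then removed around the gcc 3.4/4.0 timeframe). The fragment below is a generic illustration of the pattern and the usual style of fix; it is not the actual GFS source.

    /* Generic illustration only -- this is not the code from fs/gfs/inode.c. */
    long raw;

    void broken_style(void)
    {
            /* int *p = &(int)raw;   <-- rejected: a cast produces an rvalue,
             *                           so it has no address to take */
    }

    void fixed_style(void)
    {
            int *p = (int *)&raw;    /* cast the resulting pointer instead */
            (void)p;                 /* silence unused-variable warnings   */
    }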
We did choose our licensing for what I think are good reasons, but we have also worked in the past with outside projects with possible license conflicts and have been able to resolve them. So I would like to understand exactly what the issues are. Cheers, Jonathan -- ------------------------------------------------------- Jonathan R. Stanton jonathan at cs.jhu.edu Dept. of Computer Science Johns Hopkins University ------------------------------------------------------- From jeff at intersystems.com Mon Aug 16 14:02:39 2004 From: jeff at intersystems.com (Jeff) Date: Mon, 16 Aug 2004 10:02:39 -0400 Subject: [Linux-cluster] 'make distclean ' leaves generated files behind Message-ID: <364609576.20040816100239@intersystems.com> 'make distclean' fails to clean up the following directories: fence/bin gndb/bin gfs/bin From ben.m.cahill at intel.com Mon Aug 16 18:26:34 2004 From: ben.m.cahill at intel.com (Cahill, Ben M) Date: Mon, 16 Aug 2004 11:26:34 -0700 Subject: [Linux-cluster] trouble trying to get ccs/cman working on onemachine, not the other Message-ID: <0604335B7764D141945E202153105960033E24FD@orsmsx404.amr.corp.intel.com> > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of > Lennert Buytenhek > Sent: Saturday, June 26, 2004 6:08 PM > To: linux-cluster at redhat.com > Subject: Re: [Linux-cluster] trouble trying to get ccs/cman > working on onemachine, not the other > > On Sat, Jun 26, 2004 at 11:30:57PM +0200, Lennert Buytenhek wrote: > > > OK, found out why they didn't see each other. If your /etc/hosts has > something like this: > > 127.0.0.1 phi localhost.localdomain localhost > > (which might be a remnant from an earlier Red Hat install on this box, > created by the installer if you install without initially > configuring a > network adapter) the port 6809 broadcasts will happily be > sent out over > the loopback interface towards 10.255.255.255, and no wonder that your > machines are not going to see each other. > I ran into similar problem on fresh FC2 install (not upgrade), in which I configured my cluster nodes to have static addresses 192.168.1.110 and 192.168.1.111. (IOW, neither of Lennert's guesses above as to source of problem applied to my situation). I manually changed /etc/hosts to replace 127.0.0.1 with, e.g., 192.168.1.110 and was able to join the cluster. Was this the "right thing to do"? Is this a bug in FC2? What should set up /etc/hosts?? Thanks. -- Ben -- Opinions are mine, not Intel's From vijay at cs.umass.edu Tue Aug 17 13:51:24 2004 From: vijay at cs.umass.edu (Vijay Sundaram) Date: Tue, 17 Aug 2004 09:51:24 -0400 (EDT) Subject: [Linux-cluster] gnbd_import: ERROR cannot parse /sys/class/gnbd/gnbd0/server Message-ID: I followed the steps on the following page. (Basically setting up a two node cluster) https://open.datacore.ch/DCwiki.open/Wiki.jsp?page=GFS.GNBD.Usage everything seems to work except that I am unable to import any devices. I get the error gnbd_import: ERROR cannot parse /sys/class/gnbd/gnbd0/server gnbd_import -e correctly lists the devices being exported by the server. Is it because the fields in /sys/class/gnbd/gnbd0/server are not getting set. A cat /sys/class/gnbd/gnbd0/server gives 00000000:0 When do these fields get set? What am I doing wrong? 
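One quick check worth doing for the gnbd_import parse error above (and suggested in the reply that follows): make sure only one copy of the gnbd module is installed for the running kernel, since a stale gnbd.ko left over from an earlier build, whose sysfs layout may no longer match the userland tools, is a common culprit. Standard module paths assumed:

    find /lib/modules/$(uname -r) -name 'gnbd*.ko'
    # more than one hit means a stale copy; remove it and run depmod -a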
thanks, -- Vijay From danderso at redhat.com Tue Aug 17 14:07:05 2004 From: danderso at redhat.com (Derek Anderson) Date: Tue, 17 Aug 2004 09:07:05 -0500 Subject: [Linux-cluster] gnbd_import: ERROR cannot parse /sys/class/gnbd/gnbd0/server In-Reply-To: References: Message-ID: <200408170907.05077.danderso@redhat.com> Vijay, Please see: http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=126935 Make sure you have the latest from CVS. Also make sure that there are not duplicate gnbd.ko modules under /lib/modules/`uname -r`/drivers/block. On Tuesday 17 August 2004 08:51, Vijay Sundaram wrote: > I followed the steps on the following page. > (Basically setting up a two node cluster) > > https://open.datacore.ch/DCwiki.open/Wiki.jsp?page=GFS.GNBD.Usage > > everything seems to work except that I am unable to import any devices. > I get the error > > gnbd_import: ERROR cannot parse /sys/class/gnbd/gnbd0/server > > gnbd_import -e correctly lists the devices being exported by the > server. > > Is it because the fields in /sys/class/gnbd/gnbd0/server are not getting > set. > > A > cat /sys/class/gnbd/gnbd0/server > gives > 00000000:0 > > When do these fields get set? > What am I doing wrong? > > thanks, > -- Vijay > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From pavel at ucw.cz Mon Aug 16 19:26:02 2004 From: pavel at ucw.cz (Pavel Machek) Date: Mon, 16 Aug 2004 21:26:02 +0200 Subject: [Linux-cluster] Re: [ANNOUNCE] OpenSSI 1.0.0 released!! In-Reply-To: <410D2949.20503@backtobasicsmgmt.com> References: <3689AF909D816446BA505D21F1461AE4C750E6@cacexc04.americas.cpqcorp.net> <200408011330.01848.phillips@istop.com> <410D2949.20503@backtobasicsmgmt.com> Message-ID: <20040816192602.GA467@openzaurus.ucw.cz> Hi! > >I wonder if device-mapper (slightly hacked) wouldn't be a better > >approach for 2.6+. > > It appeared from the original posting that their "cluster-wide devfs" > actually supported all types of device nodes, not just block devices. > I don't know whether accessing a character device on another node > would ever be useful, but certainly using device-mapper wouldn't help > for that case. Remote character devices seem extremely usefull to me... mpg456 --device /dev/kitchen/dsp cat /dev/roof/dsp > /dev/laptop/dsp cat picture-to-scare-pigeons.raw > /dev/roof/fb0 X --device=/dev/livingroom/fb0 .... Okay, it will probably take a while until SSI cluster is the right tool to network your home :-). Pavel -- 64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms From vijay at cs.umass.edu Tue Aug 17 14:12:15 2004 From: vijay at cs.umass.edu (Vijay Sundaram) Date: Tue, 17 Aug 2004 10:12:15 -0400 (EDT) Subject: [Linux-cluster] gnbd_import: ERROR cannot parse /sys/class/gnbd/gnbd0/server In-Reply-To: <200408170907.05077.danderso@redhat.com> Message-ID: Hi Derek, No, I do not have a duplicate modules. Also, I picked the snapshot from this page http://sources.redhat.com/cluster/releases/cvs_snapshots/ Is that good enough? thanks, -- Vijay On Tue, 17 Aug 2004, Derek Anderson wrote: > Vijay, > > Please see: > http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=126935 > > Make sure you have the latest from CVS. Also make sure that there are not > duplicate gnbd.ko modules under /lib/modules/`uname -r`/drivers/block. > > On Tuesday 17 August 2004 08:51, Vijay Sundaram wrote: > > I followed the steps on the following page. 
> > (Basically setting up a two node cluster) > > > > https://open.datacore.ch/DCwiki.open/Wiki.jsp?page=GFS.GNBD.Usage > > > > everything seems to work except that I am unable to import any devices. > > I get the error > > > > gnbd_import: ERROR cannot parse /sys/class/gnbd/gnbd0/server > > > > gnbd_import -e correctly lists the devices being exported by the > > server. > > > > Is it because the fields in /sys/class/gnbd/gnbd0/server are not getting > > set. > > > > A > > cat /sys/class/gnbd/gnbd0/server > > gives > > 00000000:0 > > > > When do these fields get set? > > What am I doing wrong? > > > > thanks, > > -- Vijay > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > http://www.redhat.com/mailman/listinfo/linux-cluster > -- -- Vijay From danderso at redhat.com Tue Aug 17 14:28:47 2004 From: danderso at redhat.com (Derek Anderson) Date: Tue, 17 Aug 2004 09:28:47 -0500 Subject: [Linux-cluster] gnbd_import: ERROR cannot parse /sys/class/gnbd/gnbd0/server In-Reply-To: References: Message-ID: <200408170928.47731.danderso@redhat.com> Latest snapshot is from June 26; already pretty old. You'll have better luck checking a tree directly out of cvs. Look under the "Source code" section at http://sources.redhat.com/cluster/ s.r.c/cluster maintainers: Time for an updated snapshot already? On Tuesday 17 August 2004 09:12, Vijay Sundaram wrote: > Hi Derek, > > No, I do not have a duplicate modules. > Also, I picked the snapshot from this page > > http://sources.redhat.com/cluster/releases/cvs_snapshots/ > > Is that good enough? > > thanks, > -- Vijay > > On Tue, 17 Aug 2004, Derek Anderson wrote: > > Vijay, > > > > Please see: > > http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=126935 > > > > Make sure you have the latest from CVS. Also make sure that there are > > not duplicate gnbd.ko modules under /lib/modules/`uname > > -r`/drivers/block. > > > > On Tuesday 17 August 2004 08:51, Vijay Sundaram wrote: > > > I followed the steps on the following page. > > > (Basically setting up a two node cluster) > > > > > > https://open.datacore.ch/DCwiki.open/Wiki.jsp?page=GFS.GNBD.Usage > > > > > > everything seems to work except that I am unable to import any devices. > > > I get the error > > > > > > gnbd_import: ERROR cannot parse /sys/class/gnbd/gnbd0/server > > > > > > gnbd_import -e correctly lists the devices being exported by > > > the server. > > > > > > Is it because the fields in /sys/class/gnbd/gnbd0/server are not > > > getting set. > > > > > > A > > > cat /sys/class/gnbd/gnbd0/server > > > gives > > > 00000000:0 > > > > > > When do these fields get set? > > > What am I doing wrong? > > > > > > thanks, > > > -- Vijay > > > > > > -- > > > Linux-cluster mailing list > > > Linux-cluster at redhat.com > > > http://www.redhat.com/mailman/listinfo/linux-cluster From jeff at intersystems.com Tue Aug 17 14:56:53 2004 From: jeff at intersystems.com (Jeff) Date: Tue, 17 Aug 2004 10:56:53 -0400 Subject: [Linux-cluster] DLM patch: owner pid, lock ordering and new expedite flag Message-ID: <619335021.20040817105653@intersystems.com> The attached patch contains the following changes: 1) Track process which owns locks Support is added for tracking the pid of the process which owns a lock. This is returned from a query operation and used in debug_log() messages. 
2) Change rules for granting new locks When the LSFL_NOCONVGRANT flag is specified for a lockspace the rules for granting a new lock are: 1) There must be no locks on the conversion queue 2) There must be no other locks on the grant queue 3) Change rules for granting locks when a lock is released or converted to a lower mode When the LSFL_NOCONVGRANT flag is specified for a lockspace the rules for granting pending locks when a lock is released/converted down are: 1) Only the lock at the head of a queue and any compatible locks which immediately follow it are eligible to be granted. 2) The waiting queue is only processed if the conversion queue is empty 4) Added LKF_GRNLEXPEDITE The current LKF_EXPEDITE flag means that if the lock has to be queued, it is queued at the head of the queue. LKF_GRNLEXPEDITE has meaning when LSFL_NOCONVGRANT is specified. It is only valid on a grant request for a NL lock and it means that the lock is granted regardless of whether there are any locks waiting on a queue. -------------- next part -------------- A non-text attachment was scrubbed... Name: patch.pid-and-lockorder Type: application/octet-stream Size: 12962 bytes Desc: not available URL: From jeff at intersystems.com Wed Aug 18 13:16:05 2004 From: jeff at intersystems.com (Jeff) Date: Wed, 18 Aug 2004 09:16:05 -0400 Subject: [Linux-cluster] Permissions in create_dlm_namespace() call ignored Message-ID: <323211221.20040818091605@intersystems.com> Assuming that the named DLM namespace does not already exist, the following code should create a namespace which any process on the system can open. However it doesn't work and subsequent processes must be root or else the open_lockspace call fails with "Error opening dlm namespace: Permission denied" dlm_lshandle_t dlmnamesp; int i; i = umask(0); dlmnamesp = dlm_create_lockspace("play",0777); /* S_IRWXU|S_IRWXG|S_IRWXO */ if (!dlmnamesp) { dlmnamesp = dlm_open_lockspace("play"); if (!dlmnamesp) { umask(i); perror("Error opening dlm namespace"); exit(1); } } umask(i); if (dlm_ls_pthread_init(dlmnamesp)) { perror("dlm_pthread_init failed"); exit(1); } [jeff at lx3 ~]$ ls -l /dev/misc total 0 ?--------- ? ? ? ? ? dlm-control ?--------- ? ? ? ? ? dlm_play ?--------- ? ? ? ? ? dlm_default ?--------- ? ? ? ? ? dlm_testls [jeff at lx3 ~]$ From pcaulfie at redhat.com Wed Aug 18 13:46:45 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Wed, 18 Aug 2004 14:46:45 +0100 Subject: [Linux-cluster] Permissions in create_dlm_namespace() call ignored In-Reply-To: <323211221.20040818091605@intersystems.com> References: <323211221.20040818091605@intersystems.com> Message-ID: <20040818134643.GD31539@tykepenguin.com> On Wed, Aug 18, 2004 at 09:16:05AM -0400, Jeff wrote: > Assuming that the named DLM namespace does not > already exist, the following code should > create a namespace which any process on the system > can open. However it doesn't work and subsequent > processes must be root or else the open_lockspace > call fails with Odd, it works here: dlm_create_lockspace(lsname, 0755); # ls -l /dev/misc/ total 0 crw-r--r-- 1 root root 10, 62 Jun 11 08:20 dlm-control crw------- 1 root root 10, 61 Aug 17 13:39 dlm_default crwxr-xr-x 1 root root 10, 60 Aug 18 14:44 dlm_testls crw-r--r-- 1 root root 10, 62 Feb 19 08:38 gdlm Have you checked the value of umask ? or is SELinux getting in the way ?(eek!) 
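The follow-up that closes this thread traces the problem to directory permissions rather than umask: a device node can only be opened if every directory on the path to it carries the execute (search) bit. A quick way to check and fix, using the path and lockspace name from this report:

    ls -ld /dev/misc           # drw-rw-rw- has no 'x', so nothing inside can be opened
    chmod a+x /dev/misc        # now drwxrwxrwx; per-device modes apply again
    ls -l /dev/misc/dlm_play   # the 0777 passed to dlm_create_lockspace() takes effect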
-- patrick From jeff at intersystems.com Wed Aug 18 14:01:34 2004 From: jeff at intersystems.com (Jeff) Date: Wed, 18 Aug 2004 10:01:34 -0400 Subject: [Linux-cluster] Permissions in create_dlm_namespace() call ignored In-Reply-To: <20040818134643.GD31539@tykepenguin.com> References: <323211221.20040818091605@intersystems.com> <20040818134643.GD31539@tykepenguin.com> Message-ID: <1183043277.20040818100134@intersystems.com> Wednesday, August 18, 2004, 9:46:45 AM, Patrick Caulfield wrote: > On Wed, Aug 18, 2004 at 09:16:05AM -0400, Jeff wrote: >> Assuming that the named DLM namespace does not >> already exist, the following code should >> create a namespace which any process on the system >> can open. However it doesn't work and subsequent >> processes must be root or else the open_lockspace >> call fails with > Odd, it works here: > dlm_create_lockspace(lsname, 0755); > # ls -l /dev/misc/ > total 0 > crw-r--r-- 1 root root 10, 62 Jun 11 08:20 dlm-control > crw------- 1 root root 10, 61 Aug 17 13:39 dlm_default > crwxr-xr-x 1 root root 10, 60 Aug 18 14:44 dlm_testls > crw-r--r-- 1 root root 10, 62 Feb 19 08:38 gdlm > Have you checked the value of umask ? > or is SELinux getting in the way ?(eek!) Apologies for the earlier ls -l output, that was from a user process, not a root job. It really looks like: [root at lx3]# ls -l /dev/misc total 0 crw-r--r-- 1 root root 10, 62 Jul 21 06:24 dlm-control crwxrwxrwx 1 root root 10, 61 Jul 21 06:24 dlm_default crwxrwxrwx 1 root root 10, 59 Aug 18 09:15 dlm_play crwxrwxrwx 1 root root 10, 60 Jul 21 06:26 dlm_testls The problem is that /dev/misc was missing x permission: drw-rw-rw- 2 root root 4096 Aug 18 09:15 /dev/misc/ Changing this to drwxrwxrwx 2 root root 4096 Aug 18 09:55 /dev/misc/ allows non-root jobs to connect to namespaces based on the namespace's permissions. From rmayhew at mweb.com Wed Aug 18 14:17:34 2004 From: rmayhew at mweb.com (Richard Mayhew) Date: Wed, 18 Aug 2004 16:17:34 +0200 Subject: [Linux-cluster] GFS Node Limit? Message-ID: <91C4F1A7C418014D9F88E938C135545881A519@mwjdc2.mweb.com> Hi, I have 4 gulm_lock servers setup and 6 gulm_lock clients. I can mount the GFS file systems on all the lock servers and only 1 client. When I try and add another client (which is specified in the nodes list) it logs in to the master gulm lock server with no problems. When I try mount the gfs file systems it hangs, until I unmount the file system from another client. Its as if there is a max of either 1 Client or a total of 5 servers/clients that can mount the GFS FS;s.......Doesn't make sense... Any ideas? -- Regards Richard Mayhew Unix Specialist From danderso at redhat.com Wed Aug 18 14:30:35 2004 From: danderso at redhat.com (Derek Anderson) Date: Wed, 18 Aug 2004 09:30:35 -0500 Subject: [Linux-cluster] GFS Node Limit? In-Reply-To: <91C4F1A7C418014D9F88E938C135545881A519@mwjdc2.mweb.com> References: <91C4F1A7C418014D9F88E938C135545881A519@mwjdc2.mweb.com> Message-ID: <200408180930.35597.danderso@redhat.com> How many journals on your filesystem? On Wednesday 18 August 2004 09:17, Richard Mayhew wrote: > Hi, > > I have 4 gulm_lock servers setup and 6 gulm_lock clients. > > I can mount the GFS file systems on all the lock servers and only 1 > client. When I try and add another client (which is specified in the > nodes list) it logs in to the master gulm lock server with no problems. > When I try mount the gfs file systems it hangs, until I unmount the file > system from another client. 
Its as if there is a max of either 1 Client > or a total of 5 servers/clients that can mount the GFS > FS;s.......Doesn't make sense... > > Any ideas? > > > -- > > Regards > > Richard Mayhew > Unix Specialist > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From rmayhew at mweb.com Wed Aug 18 14:40:19 2004 From: rmayhew at mweb.com (Richard Mayhew) Date: Wed, 18 Aug 2004 16:40:19 +0200 Subject: [Linux-cluster] GFS Node Limit? Message-ID: <91C4F1A7C418014D9F88E938C135545881A533@mwjdc2.mweb.com> I have 4 mounts of 50GB each. Each mount has 8 Journals...(this amount I found in the manual somewhere) Is this a prob? Do you have a recommended FS layout etc? -- Regards Richard Mayhew Unix Specialist -----Original Message----- From: Derek Anderson [mailto:danderso at redhat.com] Sent: 18 August 2004 04:31 PM To: Discussion of clustering software components including GFS; Richard Mayhew Subject: Re: [Linux-cluster] GFS Node Limit? How many journals on your filesystem? On Wednesday 18 August 2004 09:17, Richard Mayhew wrote: > Hi, > > I have 4 gulm_lock servers setup and 6 gulm_lock clients. > > I can mount the GFS file systems on all the lock servers and only 1 > client. When I try and add another client (which is specified in the > nodes list) it logs in to the master gulm lock server with no problems. > When I try mount the gfs file systems it hangs, until I unmount the > file system from another client. Its as if there is a max of either 1 > Client or a total of 5 servers/clients that can mount the GFS > FS;s.......Doesn't make sense... > > Any ideas? > > > -- > > Regards > > Richard Mayhew > Unix Specialist > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From rmayhew at mweb.com Wed Aug 18 21:06:28 2004 From: rmayhew at mweb.com (Richard Mayhew) Date: Wed, 18 Aug 2004 23:06:28 +0200 Subject: [Linux-cluster] GFS Node Limit? Message-ID: <91C4F1A7C418014D9F88E938C135545881A543@mwjdc2.mweb.com> I have increased the number of journals on each FS from the 8 previously discussed to 16. At present the max number of servers that I intend on using will be 10, leaving 6 journals open for expansion. When trying to mount the 6th server, I end up with the same problem. The mount hangs until I remove a mount from another system before the 6th system is able to complete the GFS mount. After adding on full verbosity on the gulm lock server I still can't see anything of interest. Dmesg offers no clue either. The only message I'm seeing is the following : "GFS Kernel Interface" is logged out. fd:10" etc. After searching the net, the only advice I can find is to increase the number of journals to at least the number of nodes. This I have done with no success.... Any other ideas? -- Regards Richard Mayhew Unix Specialist -----Original Message----- From: Richard Mayhew [mailto:rmayhew at mweb.com] Sent: 18 August 2004 04:40 PM To: Derek Anderson Cc: linux-cluster at redhat.com Subject: RE: [Linux-cluster] GFS Node Limit? I have 4 mounts of 50GB each. Each mount has 8 Journals...(this amount I found in the manual somewhere) Is this a prob? Do you have a recommended FS layout etc? -- Regards Richard Mayhew Unix Specialist -----Original Message----- From: Derek Anderson [mailto:danderso at redhat.com] Sent: 18 August 2004 04:31 PM To: Discussion of clustering software components including GFS; Richard Mayhew Subject: Re: [Linux-cluster] GFS Node Limit? 
How many journals on your filesystem? On Wednesday 18 August 2004 09:17, Richard Mayhew wrote: > Hi, > > I have 4 gulm_lock servers setup and 6 gulm_lock clients. > > I can mount the GFS file systems on all the lock servers and only 1 > client. When I try and add another client (which is specified in the > nodes list) it logs in to the master gulm lock server with no problems. > When I try mount the gfs file systems it hangs, until I unmount the > file system from another client. Its as if there is a max of either 1 > Client or a total of 5 servers/clients that can mount the GFS > FS;s.......Doesn't make sense... > > Any ideas? > > > -- > > Regards > > Richard Mayhew > Unix Specialist > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com http://www.redhat.com/mailman/listinfo/linux-cluster From ben.m.cahill at intel.com Wed Aug 18 21:08:37 2004 From: ben.m.cahill at intel.com (Cahill, Ben M) Date: Wed, 18 Aug 2004 14:08:37 -0700 Subject: [Linux-cluster] man page for gfs_mount? Message-ID: <0604335B7764D141945E202153105960033E2507@orsmsx404.amr.corp.intel.com> I don't see a man page for gfs_mount in CVS anywhere. There's one in OpenGFS, originated by Sistina ... do you want to put this (after being properly updated) in current GFS man page suite? If so, I can take a shot at updating it, and submit as a patch. Or maybe you've got a current one that just didn't get into CVS? -- Ben -- From phillips at redhat.com Wed Aug 18 23:12:55 2004 From: phillips at redhat.com (Daniel Phillips) Date: Wed, 18 Aug 2004 19:12:55 -0400 Subject: [Linux-cluster] man page for gfs_mount? In-Reply-To: <0604335B7764D141945E202153105960033E2507@orsmsx404.amr.corp.intel.com> References: <0604335B7764D141945E202153105960033E2507@orsmsx404.amr.corp.intel.com> Message-ID: <200408181912.55732.phillips@redhat.com> Hi Ben, On Wednesday 18 August 2004 17:08, Cahill, Ben M wrote: > I don't see a man page for gfs_mount in CVS anywhere. > > There's one in OpenGFS, originated by Sistina ... do you want to put > this (after being properly updated) in current GFS man page suite? > If so, I can take a shot at updating it, and submit as a patch. That sounds great. Regards, Daniel From amanthei at redhat.com Wed Aug 18 23:29:23 2004 From: amanthei at redhat.com (Adam Manthei) Date: Wed, 18 Aug 2004 18:29:23 -0500 Subject: [Linux-cluster] GFS Node Limit? In-Reply-To: <91C4F1A7C418014D9F88E938C135545881A543@mwjdc2.mweb.com> References: <91C4F1A7C418014D9F88E938C135545881A543@mwjdc2.mweb.com> Message-ID: <20040818232923.GG8038@redhat.com> On Wed, Aug 18, 2004 at 11:06:28PM +0200, Richard Mayhew wrote: > > I have increased the number of journals on each FS from the 8 previously > discussed to 16. At present the max number of servers that I intend on > using will be 10, leaving 6 journals open for expansion. When trying to > mount the 6th server, I end up with the same problem. The mount hangs > until I remove a mount from another system before the 6th system is able > to complete the GFS mount. > > After adding on full verbosity on the gulm lock server I still can't see > anything of interest. Dmesg offers no clue either. The only message I'm > seeing is the following : "GFS Kernel Interface" is logged out. fd:10" > etc. > > After searching the net, the only advice I can find is to increase the > number of journals to at least the number of nodes. This I have done > with no success.... > > Any other ideas? 
Gulm can only use 5 nodes in the servers list in cluster.ccs/cluster/lock_gulmd/servers. This could be why you are having difficulties. Try trimming the list and see if that yields better results. You may also want to make sure that your node names are uniquely identifiable by their first 8 characters (this was a bug in version 6.0.0-1.2 of the RPMs, but since you didn't post the version of the code you are using, I can only offer this as a suggestion ;) There would be failure messages to the console stating that you didn't have enough journals if that was the problem. good luck -- Adam Manthei From rmayhew at mweb.com Thu Aug 19 09:35:40 2004 From: rmayhew at mweb.com (Richard Mayhew) Date: Thu, 19 Aug 2004 11:35:40 +0200 Subject: [Linux-cluster] GFS Node Limit? Message-ID: <91C4F1A7C418014D9F88E938C135545881A60A@mwjdc2.mweb.com> Hi Brilliant! All sorted now...I had no idea about this bug. Is it documented anywhere noticeable? I had all my service servers named services-01, services-02. Renamed them to serv-01 ... And it worked! Thanks for all your help! -- Regards Richard Mayhew Unix Specialist -----Original Message----- From: Adam Manthei [mailto:amanthei at redhat.com] Sent: 19 August 2004 01:29 AM To: Discussion of clustering software components including GFS Subject: Re: [Linux-cluster] GFS Node Limit? On Wed, Aug 18, 2004 at 11:06:28PM +0200, Richard Mayhew wrote: > > I have increased the number of journals on each FS from the 8 > previously discussed to 16. At present the max number of servers that > I intend on using will be 10, leaving 6 journals open for expansion. > When trying to mount the 6th server, I end up with the same problem. > The mount hangs until I remove a mount from another system before the > 6th system is able to complete the GFS mount. > > After adding on full verbosity on the gulm lock server I still can't > see anything of interest. Dmesg offers no clue either. The only > message I'm seeing is the following : "GFS Kernel Interface" is logged out. fd:10" > etc. > > After searching the net, the only advice I can find is to increase the > number of journals to at least the number of nodes. This I have done > with no success.... > > Any other ideas? Gulm can only use 5 nodes in the servers list in cluster.ccs/cluster/lock_gulmd/servers. This could be why you are having difficulties. Try trimming the list and see if that yields better results. You may also want to make sure that your node names are uniquely identifiable by their first 8 characters (this was a bug in version 6.0.0-1.2 of the RPMs, but since you didn't post the version of the code you are using, I can only offer this as a suggestion ;) There would be failure messages to the console stating that you didn't have enough journals if that was the problem. good luck -- Adam Manthei -- Linux-cluster mailing list Linux-cluster at redhat.com http://www.redhat.com/mailman/listinfo/linux-cluster From Axel.Thimm at ATrpms.net Thu Aug 19 10:24:25 2004 From: Axel.Thimm at ATrpms.net (Axel Thimm) Date: Thu, 19 Aug 2004 12:24:25 +0200 Subject: [Linux-cluster] RHEL3/kernel 2.4/GFS 6.0 and 2TB limit Message-ID: <20040819102425.GF7626@neu.physik.fu-berlin.de> Within the next weeks/months I'd like to setup a GFS cluster on > 2TB storage backends. There are 2TB (or 1TB) limits for fs (and/or block devices?) for 2.4/32bits. What is the best way to proceed/plan? I understand kernel 2.6 and GFS/cvs lift the limits, should I replace the RHEL3's 2.4 kernel with 2.6 and go GFS/cvs? 
What about the cluster suite, would it still play nice with GFS/cvs and kernel 2.6? Are there any plans for pushing out a GFS release within the next 1-2 months, or any known RHEL plans wrt GFS? If I know that RHEL3 or some other setup will support > 2TB storage backends in the near future I can start setting up and testing the cluster. Thanks! -- Axel.Thimm at ATrpms.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From Paul_Besett at raytheon.com Thu Aug 19 12:40:56 2004 From: Paul_Besett at raytheon.com (Paul Besett) Date: Thu, 19 Aug 2004 08:40:56 -0400 Subject: [Linux-cluster] GFS Performance Best Practices Message-ID: I've got GFS 6.0 successfully loaded and running with two nodes. Now that I'm satisfied with that, I want to expand this and move into an operational environment. Something like 10 nodes with 4-5 partitions. There will be a mix of nodes and partitions, like node 1 mounting partitions 2, 3, and 4; node 2 mounting partitions 3, 4, and 5; and so on. The administration guide is pretty weak on best practices, and I want to maximize performance. Can anyone provide pointers to where I might find this or offer some tips on set up or things to avoid? Thanks, Paul From anton at hq.310.ru Thu Aug 19 12:37:25 2004 From: anton at hq.310.ru (=?Windows-1251?B?wO3y7u0gzeX17vDu+Oj1?=) Date: Thu, 19 Aug 2004 16:37:25 +0400 Subject: [Linux-cluster] gfs_eattr tool ? Message-ID: <13410176857.20040819163725@hq.310.ru> Hi all, i found manual for gfs_eattr but not found this tool ? -- e-mail: anton at hq.310.ru From macfisherman at gmail.com Wed Aug 18 20:38:51 2004 From: macfisherman at gmail.com (Jeff Macdonald) Date: Wed, 18 Aug 2004 16:38:51 -0400 Subject: [Linux-cluster] Re: [ANNOUNCE] OpenSSI 1.0.0 released!! In-Reply-To: <20040816192602.GA467@openzaurus.ucw.cz> References: <3689AF909D816446BA505D21F1461AE4C750E6@cacexc04.americas.cpqcorp.net> <200408011330.01848.phillips@istop.com> <410D2949.20503@backtobasicsmgmt.com> <20040816192602.GA467@openzaurus.ucw.cz> Message-ID: <45ae90370408181338680f71bd@mail.gmail.com> On Mon, 16 Aug 2004 21:26:02 +0200, Pavel Machek wrote: > Remote character devices seem extremely usefull to me... > > mpg456 --device /dev/kitchen/dsp > > cat /dev/roof/dsp > /dev/laptop/dsp > > cat picture-to-scare-pigeons.raw > /dev/roof/fb0 > > X --device=/dev/livingroom/fb0 > > ..... Okay, it will probably take a while until SSI cluster is the > right tool to network your home :-). Isn't that what Inferno is suppose to be able to do? -- Jeff Macdonald Ayer, MA From Paul_Besett at raytheon.com Wed Aug 18 22:22:04 2004 From: Paul_Besett at raytheon.com (Paul Besett) Date: Wed, 18 Aug 2004 18:22:04 -0400 Subject: [Linux-cluster] GFS Performance Best Practices Message-ID: I've got GFS 6.0 successfully loaded and running with two nodes. Now that I'm satisfied with that, I want to expand this and move into an operational environment. Something like 10 nodes with 4-5 partitions. There will be a mix of nodes and partitions, like node 1 mounting partitions 2, 3, and 4; node 2 mounting partitions 3, 4, and 5; and so on. The documentation is pretty weak on best practices, and I want to maximize performance. Can anyone provide pointers to where I might find this or offer some tips on set up or things to avoid? 
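One concrete sizing rule that matters for a layout like this (and that the GFS Node Limit thread elsewhere in this section runs into): each GFS filesystem needs at least one journal per node that will ever mount it, and the journal count is fixed when the filesystem is made. A sketch of what that looks like at mkfs time with GULM locking; the cluster name, filesystem name and pool device below are placeholders:

    gfs_mkfs -p lock_gulm -t mycluster:gfs01 -j 12 /dev/pool/pool0
    # -j 12 leaves a couple of spare journals beyond the ten planned nodes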
Thanks, Paul From danderso at redhat.com Thu Aug 19 13:36:16 2004 From: danderso at redhat.com (Derek Anderson) Date: Thu, 19 Aug 2004 08:36:16 -0500 Subject: [Linux-cluster] gfs_eattr tool ? In-Reply-To: <13410176857.20040819163725@hq.310.ru> References: <13410176857.20040819163725@hq.310.ru> Message-ID: <200408190836.16708.danderso@redhat.com> Hmm. This man page should not have been included. The standard setfattr(1) and getfattr(1) should be used to set and get extended attributes on GFS. On Thursday 19 August 2004 07:37, Anton Nekhoroshikh wrote: > Hi all, > > i found manual for gfs_eattr but not found this tool ? From amanthei at redhat.com Thu Aug 19 13:40:01 2004 From: amanthei at redhat.com (Adam Manthei) Date: Thu, 19 Aug 2004 08:40:01 -0500 Subject: [Linux-cluster] GFS Node Limit? In-Reply-To: <91C4F1A7C418014D9F88E938C135545881A60A@mwjdc2.mweb.com> References: <91C4F1A7C418014D9F88E938C135545881A60A@mwjdc2.mweb.com> Message-ID: <20040819134001.GJ8038@redhat.com> On Thu, Aug 19, 2004 at 11:35:40AM +0200, Richard Mayhew wrote: > Hi > Brilliant! > > All sorted now...I had no idea about this bug. Is it documented anywhere > noticeable? http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=127828 I would have posted the bugzilla number last night, but was just too lazy to do the search ;) > I had all my service servers named services-01, services-02. Renamed > them to serv-01 ... And it worked! > Thanks for all your help! > -----Original Message----- > From: Adam Manthei [mailto:amanthei at redhat.com] > Sent: 19 August 2004 01:29 AM > To: Discussion of clustering software components including GFS > Subject: Re: [Linux-cluster] GFS Node Limit? > > On Wed, Aug 18, 2004 at 11:06:28PM +0200, Richard Mayhew wrote: > > > > I have increased the number of journals on each FS from the 8 > > previously discussed to 16. At present the max number of servers that > > I intend on using will be 10, leaving 6 journals open for expansion. > > When trying to mount the 6th server, I end up with the same problem. > > The mount hangs until I remove a mount from another system before the > > 6th system is able to complete the GFS mount. > > > > After adding on full verbosity on the gulm lock server I still can't > > see anything of interest. Dmesg offers no clue either. The only > > message I'm seeing is the following : "GFS Kernel Interface" is logged > out. fd:10" > > etc. > > > > After searching the net, the only advice I can find is to increase the > > > number of journals to at least the number of nodes. This I have done > > with no success.... > > > > Any other ideas? > > Gulm can only use 5 nodes in the servers list in > cluster.ccs/cluster/lock_gulmd/servers. This could be why you are > having difficulties. Try trimming the list and see if that yields > better results. > > You may also want to make sure that your node names are uniquely > identifiable by their first 8 characters (this was a bug in version > 6.0.0-1.2 of the RPMs, but since you didn't post the version of the code > you are using, I can only offer this as a suggestion ;) > > There would be failure messages to the console stating that you didn't > have enough journals if that was the problem.
> > good luck > -- > Adam Manthei > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- Adam Manthei From anton at hq.310.ru Thu Aug 19 13:45:39 2004 From: anton at hq.310.ru (=?Windows-1251?B?wO3y7u0gzeX17vDu+Oj1?=) Date: Thu, 19 Aug 2004 17:45:39 +0400 Subject: [Linux-cluster] gfs_eattr tool ? In-Reply-To: <200408190836.16708.danderso@redhat.com> References: <13410176857.20040819163725@hq.310.ru> <200408190836.16708.danderso@redhat.com> Message-ID: <426365758.20040819174539@hq.310.ru> Hi Derek, Thursday, August 19, 2004, 5:36:16 PM, you wrote: Derek Anderson> Hmm. This man page should have been Derek Anderson> not be included. The standard Derek Anderson> setfattr(1) and getfattr(1) should be Derek Anderson> used to set and get extended attributes Derek Anderson> on GFS. On ext2(3) i using chattr +i file for set immutable flag. I can set this flag thru setfattr on GFS? chattr on GFS: # chattr +i sh chattr: Inappropriate ioctl for device while reading flags on sh Derek Anderson> On Thursday 19 August 2004 07:37, Derek Anderson> ????? ????????? wrote: >> Hi all, >> >> i found manual for gfs_eattr but not found this tool ? -- e-mail: anton at hq.310.ru From anton at hq.310.ru Thu Aug 19 13:51:14 2004 From: anton at hq.310.ru (=?Windows-1251?B?wO3y7u0gzeX17vDu+Oj1?=) Date: Thu, 19 Aug 2004 17:51:14 +0400 Subject: [Linux-cluster] gfs_eattr tool ? In-Reply-To: <200408190836.16708.danderso@redhat.com> References: <13410176857.20040819163725@hq.310.ru> <200408190836.16708.danderso@redhat.com> Message-ID: <16310644182.20040819175114@hq.310.ru> Hi Derek, Thursday, August 19, 2004, 5:36:16 PM, you wrote: Derek Anderson> Hmm. This man page should have been Derek Anderson> not be included. The standard Derek Anderson> setfattr(1) and getfattr(1) should be Derek Anderson> used to set and get extended attributes Derek Anderson> on GFS. On ext2(3) i using chattr +i file for set immutable flag. I can set this flag thru setfattr on GFS? chattr on GFS: # chattr +i sh chattr: Inappropriate ioctl for device while reading flags on sh Derek Anderson> On Thursday 19 August 2004 07:37, Derek Anderson> ????? ????????? wrote: >> Hi all, >> >> i found manual for gfs_eattr but not found this tool ? -- e-mail: anton at hq.310.ru From ritesh.a at net4india.net Thu Aug 19 03:24:19 2004 From: ritesh.a at net4india.net (Ritesh Agrawal) Date: Thu, 19 Aug 2004 08:54:19 +0530 Subject: [Linux-cluster] Active Active Configuration Message-ID: <41241D63.5090102@net4india.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi All, ~ I am using Redhat Linux Cluster suit and EL for High avalilblity server with active/passive configuration. In active /passive configuration , one server(master) act as load balancer,and another one comes on the scene as fail over server , It means only one server's computation power (as load balancer) is used at a time .But i want to use both server's computational power (active/ active) configuration as well as take over the responsiblities of each others in case of one's failure. any suggestion or tutorials regarding this, you have ? one more thing , how to implement GFS in cluster with better optimization. - -- Regards Ritesh Agrawal Senior Engineer-Systems Net 4 India Ltd, B-4/47, Safdarjung Enclave, New Delhi- 110 029, India +----------------------------------------------------+ ~ I think, therefore I am. 
...................................................... Public Key Server: http://keyserver.veridis.com/en/ GPG Key Fingerprint : D017 1B21 A699 BDF8 CFDD 2D78 168C FE3F DE63 9D32 +----------------------------------------------------+ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.2 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFBJB1jFoz+P95jnTIRAvwKAKCtQuXXmt3xsQBX480kZophr02engCfY4rU oBAij0iqwZiiL09ySkWIHXQ= =IBGm -----END PGP SIGNATURE----- From bmarzins at redhat.com Thu Aug 19 15:21:22 2004 From: bmarzins at redhat.com (Benjamin Marzinski) Date: Thu, 19 Aug 2004 10:21:22 -0500 Subject: [Linux-cluster] gfs_eattr tool ? In-Reply-To: <16310644182.20040819175114@hq.310.ru> References: <13410176857.20040819163725@hq.310.ru> <200408190836.16708.danderso@redhat.com> <16310644182.20040819175114@hq.310.ru> Message-ID: <20040819152122.GA12234@phlogiston.msp.redhat.com> On Thu, Aug 19, 2004 at 05:51:14PM +0400, ????? ????????? wrote: > Hi Derek, > > Thursday, August 19, 2004, 5:36:16 PM, you wrote: > > Derek Anderson> Hmm. This man page should have been > Derek Anderson> not be included. The standard > Derek Anderson> setfattr(1) and getfattr(1) should be > Derek Anderson> used to set and get extended attributes > Derek Anderson> on GFS. > > On ext2(3) i using chattr +i file for set immutable flag. > > I can set this flag thru setfattr on GFS? > > chattr on GFS: > # chattr +i sh > chattr: Inappropriate ioctl for device while reading flags on sh I might be wrong, but I don't think chattr has anything to do with extended attributes. gfs_eattr was an old tool for setting extended attributes on GFS, used before there was a syscall for it. I don't know offhand of any way to set all the file attributes in GFS, that chattr can set in ext2/3 "gfs_tool setflag" allows you to set some file attributes, but not make the inode immutable. If someone else knows another way, speak up. -Ben > Derek Anderson> On Thursday 19 August 2004 07:37, > Derek Anderson> ????? ????????? wrote: > >> Hi all, > >> > >> i found manual for gfs_eattr but not found this tool ? > > > > > > -- > e-mail: anton at hq.310.ru > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From mauelshagen at redhat.com Wed Aug 18 19:06:10 2004 From: mauelshagen at redhat.com (Heinz Mauelshagen) Date: Wed, 18 Aug 2004 21:06:10 +0200 Subject: [Linux-cluster] *** Announcement: dmraid 1.0.0-rc3 *** Message-ID: <20040818190610.GA6259@redhat.com> *** Announcement: dmraid 1.0.0-rc3 *** dmraid 1.0.0-rc3 is available at http://people.redhat.com:/~heinzm/sw/dmraid/ in source, source rpm and i386 rpm. dmraid (Device-Mapper Raid tool) discovers, [de]activates and displays properties of software RAID sets (ie. ATARAID) and contained DOS partitions using the device-mapper runtime of the 2.6 kernel. The following ATARAID types are supported on Linux 2.6: Highpoint HPT37X Highpoint HPT45X Intel Software RAID Promise FastTrack Silicon Image Medley This ATARAID type is only basically supported in this version (I need better metadata format specs; please help): LSI Logic MegaRAID Please provide insight to support those metadata formats completely. Thanks. See files README and CHANGELOG, which come with the source tarball for prerequisites to run this software, further instructions on installing and using dmraid! CHANGELOG is contained below for your convenience as well. 
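(A hedged usage sketch, inserted for illustration and not part of the original announcement; the option summaries below are drawn from the behaviour described in this thread, and 'man dmraid' should be treated as authoritative:

    dmraid -r        # list block devices carrying recognized ATARAID metadata
    dmraid -s        # show the RAID sets assembled from that metadata
    dmraid -t -ay    # print the device-mapper tables activation would use, without activating
    dmraid -ay       # activate all discovered RAID sets through device-mapper
)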
Call for testers: ----------------- I need testers with the above ATARAID types, to check that the mapping created by this tool is correct (see options "-t -ay") and access to the ATARAID data is proper. You can activate your ATARAID sets without danger of overwriting your metadata, because dmraid accesses it read-only unless you use option -E with -r in order to erase ATARAID metadata (see 'man dmraid')! This is a release candidate version, so you want to have backups of your valuable data *and* you want to test accessing your data read-only first in order to make sure that the mapping is correct before you go for read-write access. The author is reachable at . For test results, mapping information, discussions, questions, patches, enhancement requests and the like, please subscribe and mail to . -- Regards, Heinz -- The LVM Guy -- CHANGELOG: --------- Changelog from dmraid 1.0.0-rc2 to 1.0.0-rc3 2004.08.18 FIXES: ------ o HPT37X mapping on first disk of set o dietlibc sscanf() use prevented activation o le*_to_cpu() for certain glibc environments (Luca Berra) o sysfs discovery (Luca Berra) o permissions to write on binary, which is needed by newer strip versions (Luca Berra) o SCSI serial number string length bug o valgrinded memory leaks o updated design document o comments FEATURES: --------- o added basic support for activation of LSI Logic MegaRAID/MegaIDE; more reengineering of the metadata needed! o root check using certain options (eg, activation of RAID sets) o implemented locking abstraction o implemented writing device metadata offsets with "-r[D/E]" for ease of manual restore o file based locking to avoid parallel tool runs competing with each other for the same resources o streamlined library context o implemented access functions for library context o streamlined RAID set consistency checks o implemented log function and removed macros to shrink binary size further o removed superfluous disk geometry code o cleaned up metadata.c collapsing free_*() functions o slimmed down minimal binary (configure option DMRAID_MINI for early boot environment) =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- Heinz Mauelshagen Red Hat GmbH Consulting Development Engineer Am Sonnenhang 11 56242 Marienrachdorf Germany Mauelshagen at RedHat.com +49 2626 141200 FAX 924446 =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- From andriy at druzhba.lviv.ua Fri Aug 20 12:57:38 2004 From: andriy at druzhba.lviv.ua (Andriy Galetski) Date: Fri, 20 Aug 2004 15:57:38 +0300 Subject: [Linux-cluster] GFS configuration for 2 node Cluster References: <41241D63.5090102@net4india.net> Message-ID: <002001c486b5$46ab23c0$f13cc90a@druzhba.com> Hi All, I want to build an RH Cluster with GFS, but I have only 2 nodes connected to shared storage. GFS-6.0.0-7 is now installed on both nodes (RHEL3 2.4.21-15.0.3.ELsmp). When both nodes operate normally I have no problems mounting/unmounting and reading/writing the GFS filesystem. When one node fails / shuts down / loses communication, all operations on the GFS filesystem stop (because lock_gulmd loses its quorum). To return the GFS filesystem to a normal state I need to bring the failed node back up and give the fence_ack_manual command. Q: Is there any trick in a 2-node GFS configuration to keep one node fully operational when the other node is disconnected from the cluster? Thanks for all suggestions.
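(A hedged sketch of the usual answer for GFS 6.0 with GULM, rather than a quote from the thread: lock_gulmd needs a majority of its configured lock servers alive, so with exactly two servers the loss of either one freezes the filesystem. The common workarounds are to run a single dedicated lock server outside the two GFS nodes, or to configure three lock servers, for example the two GFS nodes plus a small third machine that runs lock_gulmd but never mounts GFS, so that quorum (2 of 3) survives one failure. The node names below are hypothetical and the exact key layout should match your existing cluster.ccs:

    cluster {
        name = "testcluster"
        lock_gulm {
            servers = ["node-a", "node-b", "tiebreaker"]
        }
    }
)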
From amir at datacore.ch Fri Aug 20 21:13:45 2004 From: amir at datacore.ch (Amir Guindehi) Date: Fri, 20 Aug 2004 23:13:45 +0200 Subject: [Linux-cluster] GFS configuration for 2 node Cluster In-Reply-To: <002001c486b5$46ab23c0$f13cc90a@druzhba.com> References: <41241D63.5090102@net4india.net> <002001c486b5$46ab23c0$f13cc90a@druzhba.com> Message-ID: <41266989.2070101@datacore.ch> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi Andriy, | Q: Is it any trick in 2 node GFS configuration to get one node full operate | when other node disconnected from cluster ?? Dunno if it's the same in RH Cluster as in Linux Cluster, but for Linux Cluster I've described how to do it at: https://open.datacore.ch/page/GFS.Install#section-GFS.Install-ConfigurationOfGFSOnASystemRunningTheGenTooLinuxDistributionhttpgentoo.org Regards - - Amir - -- Amir Guindehi, nospam.amir at datacore.ch DataCore GmbH, Witikonerstrasse 289, 8053 Zurich, Switzerland -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.5 (MingW32) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFBJlzObycOjskSVCwRAriYAJwMQHKKlhYQZrnGGxzgH3ZG9seM9QCgvsmi 8eqyyVFk+Cfn12iIHvDDytQ= =MW3I -----END PGP SIGNATURE----- From jopet at staff.spray.se Mon Aug 23 12:43:29 2004 From: jopet at staff.spray.se (Johan Pettersson) Date: Mon, 23 Aug 2004 14:43:29 +0200 Subject: [Linux-cluster] Modules for kernel 2.4 Message-ID: <1093265009.8980.42.camel@zombie.i.spray.se> Hello! I can't find any tarball at http://sources.redhat.com/cluster/gfs/ for kernel 2.4. Is only 2.6.7 supported? http://sources.redhat.com/cluster/releases/GFS-kernel/gfs-kernel-2.6.7-2.tar.gz Thx /J -- In disk space, nobody can hear your files scream. From lhh at redhat.com Mon Aug 23 14:11:34 2004 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 23 Aug 2004 10:11:34 -0400 Subject: [Linux-cluster] GFS configuration for 2 node Cluster In-Reply-To: <41266989.2070101@datacore.ch> References: <41241D63.5090102@net4india.net> <002001c486b5$46ab23c0$f13cc90a@druzhba.com> <41266989.2070101@datacore.ch> Message-ID: <1093270294.3467.26.camel@atlantis.boston.redhat.com> On Fri, 2004-08-20 at 23:13 +0200, Amir Guindehi wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Hi Andriy, > > | Q: Is it any trick in 2 node GFS configuration to get one node full > operate > | when other node disconnected from cluster ?? > > Dunno if it's the same in RH Cluster as in Linux Cluster, but for Linux > Cluster I've described how to do it at: > > https://open.datacore.ch/page/GFS.Install#section-GFS.Install-ConfigurationOfGFSOnASystemRunningTheGenTooLinuxDistributionhttpgentoo.org Hi Amir, I think he meant for 6.0.0, which is the pappy of linux-cluster. I don't think you can do it with 6.0.0. -- Lon From notiggy at gmail.com Mon Aug 23 14:15:23 2004 From: notiggy at gmail.com (Brian Jackson) Date: Mon, 23 Aug 2004 09:15:23 -0500 Subject: [Linux-cluster] Modules for kernel 2.4 In-Reply-To: <1093265009.8980.42.camel@zombie.i.spray.se> References: <1093265009.8980.42.camel@zombie.i.spray.se> Message-ID: On Mon, 23 Aug 2004 14:43:29 +0200, Johan Pettersson wrote: > Hello! > > I can't find any tarball at http://sources.redhat.com/cluster/gfs/ for > kernel 2.4. Is only 2.6.7 supported? Yes, only 2.6 is supported, for 2.4, look around for the gfs-6 src.rpms. 
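(Hedged sketch only; the exact source RPM names vary by release, so treat these as placeholders. On a RHEL3-era 2.4 kernel the usual route is to rebuild the GFS 6.0 source RPMs against the installed kernel and kernel-source, roughly:

    rpmbuild --rebuild GFS-6.0.0-7.src.rpm
    rpmbuild --rebuild GFS-modules-6.0.0-7.src.rpm
    # resulting binary packages land under /usr/src/redhat/RPMS/<arch>/
)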
> > http://sources.redhat.com/cluster/releases/GFS-kernel/gfs-kernel-2.6.7-2.tar.gz > > Thx > > /J From phillips at redhat.com Mon Aug 23 15:43:11 2004 From: phillips at redhat.com (Daniel Phillips) Date: Mon, 23 Aug 2004 11:43:11 -0400 Subject: [Linux-cluster] Subversion? Message-ID: <200408231143.11372.phillips@redhat.com> Hi everybody, I was just taking a look at this article and I thought, maybe this would be a good time to show some leadership as a project, and take the Subversion plunge: http://www.onlamp.com/pub/a/onlamp/2004/08/19/subversiontips.html Subversion is basically CVS as it should have been. It's mature now. The number of complaints I have noticed from users out there is roughly zero. Subversion _versions directories_. Etc. Etc. The only negative I can think of is that some folks may not have Subversion installed. But that is what tarballs are for. Our project development is not highly parallel at this point, so our repository serves more as a place for maintainers of the individual subprojects to post current code. So there isn't a great need for a distributed VCS like Bitkeeper or Arch. Thoughts? Regards, Daniel From kpfleming at backtobasicsmgmt.com Mon Aug 23 15:49:26 2004 From: kpfleming at backtobasicsmgmt.com (Kevin P. Fleming) Date: Mon, 23 Aug 2004 08:49:26 -0700 Subject: [Linux-cluster] Subversion? In-Reply-To: <200408231143.11372.phillips@redhat.com> References: <200408231143.11372.phillips@redhat.com> Message-ID: <412A1206.5040103@backtobasicsmgmt.com> Daniel Phillips wrote: > I was just taking a look at this article and I thought, maybe this would > be a good time to show some leadership as a project, and take the > Subversion plunge: > > http://www.onlamp.com/pub/a/onlamp/2004/08/19/subversiontips.html I am part of another project that recently switched to Subversion, and we like it quite a bit. It's a major improvement over CVS. > Subversion is basically CVS as it should have been. It's mature now. > The number of complaints I have noticed from users out there is roughly > zero. Subversion _versions directories_. Etc. Etc. Yes, and if you haven't already, read the first few chapters of the "redbean" Subversion book to get a feel for how it works. Branching/tagging is painless and low-cost (including dropping dead branches/tags, something that CVS can't do well at all), there are cool methods to include parts of the repository inside other parts at checkout time, etc. > The only negative I can think of is that some folks may not have > Subversion installed. But that is what tarballs are for. Subversion is an easy install. There is another negative, though: the current release of Subversion uses Berkeley DB as its storage means, and we've had problems with it getting randomly locked and causing issues. We don't know if this is due to running ViewCVS against the repo as well, or what else it may be. Given the problems we've had, we are anxiously awaiting the 1.1 release of Subversion that will have a filesystem-based backend, rather than bdb. > Our project development is not highly parallel at this point, so our > repository serves more as a place for maintainers of the individual > subprojects to post current code. So there isn't a great need for a > distributed VCS like Bitkeeper or Arch. I also like BK quite a bit, and it has one major advantage over CVS/Subversion: you can have local trees and actually _commit_ to them, including changeset comments and everything else. 
This is very nice when you are working on multiple bits of a project and are not ready to commit them to the "real" repositories. From aneesh.kumar at hp.com Mon Aug 23 15:55:23 2004 From: aneesh.kumar at hp.com (Aneesh Kumar K.V) Date: Mon, 23 Aug 2004 21:25:23 +0530 Subject: [Linux-cluster] Subversion? In-Reply-To: <412A1206.5040103@backtobasicsmgmt.com> References: <200408231143.11372.phillips@redhat.com> <412A1206.5040103@backtobasicsmgmt.com> Message-ID: <412A136B.4000004@hp.com> Kevin P. Fleming wrote: > > Subversion is an easy install. There is another negative, though: the > current release of Subversion uses Berkeley DB as its storage means, and > we've had problems with it getting randomly locked and causing issues. > We don't know if this is due to running ViewCVS against the repo as > well, or what else it may be. Given the problems we've had, we are > anxiously awaiting the 1.1 release of Subversion that will have a > filesystem-based backend, rather than bdb. > I hit this last weekend. My subversion crashed/locked completely. I am not sure whether it is a lockup or a crash. But there is nothing you can do by going to the repo directory. BTW I didn't attempt to recover it. -aneesh From jeff at intersystems.com Mon Aug 23 16:37:01 2004 From: jeff at intersystems.com (Jeff) Date: Mon, 23 Aug 2004 12:37:01 -0400 Subject: [Linux-cluster] patch to return lkid on new lock requests as part of initial lock request processing Message-ID: <1206740724.20040823123701@intersystems.com> In order to allow a new lock request to be canceled, the lock id of the lock must be returned to the user as part of the initial write() call which queues the new lock request. -------------- next part -------------- A non-text attachment was scrubbed... Name: patch.newlock-lockid Type: application/octet-stream Size: 701 bytes Desc: not available URL: From lhh at redhat.com Mon Aug 23 16:55:21 2004 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 23 Aug 2004 12:55:21 -0400 Subject: [Linux-cluster] Subversion? In-Reply-To: <200408231143.11372.phillips@redhat.com> References: <200408231143.11372.phillips@redhat.com> Message-ID: <1093280121.3467.80.camel@atlantis.boston.redhat.com> On Mon, 2004-08-23 at 11:43 -0400, Daniel Phillips wrote: > Hi everybody, > > I was just taking a look at this article and I thought, maybe this would > be a good time to show some leadership as a project, and take the > Subversion plunge: > > http://www.onlamp.com/pub/a/onlamp/2004/08/19/subversiontips.html > > Subversion is basically CVS as it should have been. It's mature now. > The number of complaints I have noticed from users out there is roughly > zero. Subversion _versions directories_. Etc. Etc. Disagree. We should use GNU arch. Here's a comparison from someone you know: http://wiki.gnuarch.org/moin.cgi/SubVersionAndCvsComparison http://better-scm.berlios.de/comparison/comparison.html Arch supports repeated merging (incl. renames) and digitally signed changesets (which may or may not be helpful in our case). Mirroring and replication are part of the core architecture. Arch applies versions to directories too ;) > Our project development is not highly parallel at this point, so our > repository serves more as a place for maintainers of the individual > subprojects to post current code. True. For now. Switching again in the future (if needed) will be more painful as we attract more developers. > So there isn't a great need for a > distributed VCS like Bitkeeper or Arch. The more users of arch, the more mature it will become. 
Someday, perhaps, it will replace BK for some major open source projects near and dear to our hearts. Perhaps this is a pipe dream. ;) For projects that don't need the parallel features of arch, nothing requires that the parallelism be used. -- Lon From lhh at redhat.com Mon Aug 23 16:56:55 2004 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 23 Aug 2004 12:56:55 -0400 Subject: [Linux-cluster] Subversion? In-Reply-To: <1093280121.3467.80.camel@atlantis.boston.redhat.com> References: <200408231143.11372.phillips@redhat.com> <1093280121.3467.80.camel@atlantis.boston.redhat.com> Message-ID: <1093280215.3467.82.camel@atlantis.boston.redhat.com> On Mon, 2004-08-23 at 12:55 -0400, Lon Hohberger wrote: > Disagree. We should use GNU arch. Here's a comparison from someone you > know: > > http://wiki.gnuarch.org/moin.cgi/SubVersionAndCvsComparison ;) I think, anyway. The lower one is by someone else. > http://better-scm.berlios.de/comparison/comparison.html -- Lon From cherry at osdl.org Mon Aug 23 17:07:22 2004 From: cherry at osdl.org (John Cherry) Date: Mon, 23 Aug 2004 10:07:22 -0700 Subject: [Linux-cluster] Subversion? In-Reply-To: <200408231143.11372.phillips@redhat.com> References: <200408231143.11372.phillips@redhat.com> Message-ID: <1093280841.12873.43.camel@cherrybomb.pdx.osdl.net> I understand that subversion is quite nice, but kernel developers have adopted bitkeeper (at least Linus and several of his maintainers). While you may not need all the distributed capabilities of bitkeeper now, it is sure nice to have a tool that allows for non-local repositories and change set tracking outside of the main repository (as Kevin so clearly stated). Since mainline kernel acceptance of the core services is one of the objectives here, I would certainly recommend that you consider bitkeeper for source control as well. Regards, John On Mon, 2004-08-23 at 08:43, Daniel Phillips wrote: > Hi everybody, > > I was just taking a look at this article and I thought, maybe this would > be a good time to show some leadership as a project, and take the > Subversion plunge: > > http://www.onlamp.com/pub/a/onlamp/2004/08/19/subversiontips.html > > Subversion is basically CVS as it should have been. It's mature now. > The number of complaints I have noticed from users out there is roughly > zero. Subversion _versions directories_. Etc. Etc. > > The only negative I can think of is that some folks may not have > Subversion installed. But that is what tarballs are for. > > Our project development is not highly parallel at this point, so our > repository serves more as a place for maintainers of the individual > subprojects to post current code. So there isn't a great need for a > distributed VCS like Bitkeeper or Arch. > > Thoughts? > > Regards, > > Daniel > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From crh at ubiqx.mn.org Mon Aug 23 17:48:37 2004 From: crh at ubiqx.mn.org (Christopher R. Hertel) Date: Mon, 23 Aug 2004 12:48:37 -0500 Subject: [Linux-cluster] Subversion? 
In-Reply-To: <200408231143.11372.phillips@redhat.com> References: <200408231143.11372.phillips@redhat.com> Message-ID: <20040823174837.GC22622@Favog.ubiqx.mn.org> On Mon, Aug 23, 2004 at 11:43:11AM -0400, Daniel Phillips wrote: > Hi everybody, > > I was just taking a look at this article and I thought, maybe this would > be a good time to show some leadership as a project, and take the > Subversion plunge: > > http://www.onlamp.com/pub/a/onlamp/2004/08/19/subversiontips.html We're using SVN to maintain Samba now. There have been glitches, but most have been fixed. The biggest problems (currently) are with the web front-ends. Chris -)----- -- "Implementing CIFS - the Common Internet FileSystem" ISBN: 013047116X Samba Team -- http://www.samba.org/ -)----- Christopher R. Hertel jCIFS Team -- http://jcifs.samba.org/ -)----- ubiqx development, uninq. ubiqx Team -- http://www.ubiqx.org/ -)----- crh at ubiqx.mn.org OnLineBook -- http://ubiqx.org/cifs/ -)----- crh at ubiqx.org From phillips at redhat.com Mon Aug 23 18:02:22 2004 From: phillips at redhat.com (Daniel Phillips) Date: Mon, 23 Aug 2004 14:02:22 -0400 Subject: [Linux-cluster] Subversion? In-Reply-To: <1093280841.12873.43.camel@cherrybomb.pdx.osdl.net> References: <200408231143.11372.phillips@redhat.com> <1093280841.12873.43.camel@cherrybomb.pdx.osdl.net> Message-ID: <200408231402.22863.phillips@redhat.com> Hi John, On Monday 23 August 2004 13:07, John Cherry wrote: > I understand that subversion is quite nice, but kernel developers > have adopted bitkeeper (at least Linus and several of his > maintainers). While you may not need all the distributed capabilities > of bitkeeper now, it is sure nice to have a tool that allows for > non-local repositories and change set tracking outside of the main > repository (as Kevin so clearly stated). In my humble opinion, Bitkeeper does not have a snowball's chance in hell of getting established on sources.redhat.com. > Since mainline kernel acceptance of the core services is one of the > objectives here, I would certainly recommend that you consider > bitkeeper for source control as well. Just read the license. http://www.taniwha.org/bitkeeper.html "Sometimes it is tempting to sacrifice our rights and freedoms for convinience, but we should not do so... with the increasing popularity of alternative licenses, it is important [to] determine whether they preserve the minimum acceptable amount of freedom and be responsible about choosing software that that meets these minimum criteria and advances our goals as a community" This is 3 years old, however there has been no improvement, quite the contrary. Regards, Daniel From kpfleming at backtobasicsmgmt.com Mon Aug 23 18:14:11 2004 From: kpfleming at backtobasicsmgmt.com (Kevin P. Fleming) Date: Mon, 23 Aug 2004 11:14:11 -0700 Subject: [Linux-cluster] Subversion? In-Reply-To: <200408231402.22863.phillips@redhat.com> References: <200408231143.11372.phillips@redhat.com> <1093280841.12873.43.camel@cherrybomb.pdx.osdl.net> <200408231402.22863.phillips@redhat.com> Message-ID: <412A33F3.6020808@backtobasicsmgmt.com> Daniel Phillips wrote: > Just read the license. > > http://www.taniwha.org/bitkeeper.html Wow, an entire treatise predicated on proving that BitKeeper is not Free Software, when noone from BitMover ever claimed it was. From cherry at osdl.org Mon Aug 23 18:17:18 2004 From: cherry at osdl.org (John Cherry) Date: Mon, 23 Aug 2004 11:17:18 -0700 Subject: [Linux-cluster] Subversion? 
In-Reply-To: <200408231402.22863.phillips@redhat.com> References: <200408231143.11372.phillips@redhat.com> <1093280841.12873.43.camel@cherrybomb.pdx.osdl.net> <200408231402.22863.phillips@redhat.com> Message-ID: <1093285037.12874.70.camel@cherrybomb.pdx.osdl.net> On Mon, 2004-08-23 at 11:02, Daniel Phillips wrote: > Hi John, > > On Monday 23 August 2004 13:07, John Cherry wrote: > > I understand that subversion is quite nice, but kernel developers > > have adopted bitkeeper (at least Linus and several of his > > maintainers). While you may not need all the distributed capabilities > > of bitkeeper now, it is sure nice to have a tool that allows for > > non-local repositories and change set tracking outside of the main > > repository (as Kevin so clearly stated). > > In my humble opinion, Bitkeeper does not have a snowball's chance in > hell of getting established on sources.redhat.com. I kinda figured it didn't have a chance at sources.redhat.com. But what about bkbits.net? > > > Since mainline kernel acceptance of the core services is one of the > > objectives here, I would certainly recommend that you consider > > bitkeeper for source control as well. > > Just read the license. > > http://www.taniwha.org/bitkeeper.html > > "Sometimes it is tempting to sacrifice our rights and freedoms for > convinience, but we should not do so... with the increasing popularity > of alternative licenses, it is important [to] determine whether they > preserve the minimum acceptable amount of freedom and be responsible > about choosing software that that meets these minimum criteria and > advances our goals as a community" > > This is 3 years old, however there has been no improvement, quite the > contrary. I understand the concerns about the license. It is one of the strangest licenses I have ever read and it sounds like the licensees are at the mercy of the licenser in many respects (the rights and freedoms arguements). However, bk is being used across the kernel development community and this does not appear to be changing anytime soon. BTW, most developers do just fine with up to date tarballs, so source control is not a huge issue for most of them. John From phillips at redhat.com Mon Aug 23 18:23:41 2004 From: phillips at redhat.com (Daniel Phillips) Date: Mon, 23 Aug 2004 14:23:41 -0400 Subject: [Linux-cluster] Subversion? In-Reply-To: <412A33F3.6020808@backtobasicsmgmt.com> References: <200408231143.11372.phillips@redhat.com> <200408231402.22863.phillips@redhat.com> <412A33F3.6020808@backtobasicsmgmt.com> Message-ID: <200408231423.41542.phillips@redhat.com> On Monday 23 August 2004 14:14, Kevin P. Fleming wrote: > Daniel Phillips wrote: > > Just read the license. > > > > http://www.taniwha.org/bitkeeper.html > > Wow, an entire treatise predicated on proving that BitKeeper is not > Free Software, when noone from BitMover ever claimed it was. Sources.redhat.com not only consists entirely of free software, but shows leadership to the free software community. We[1] are interested in advancing not only our own projects, but other open source projects such as Subversion and Arch. [1] Presumptively speaking for what I presume is the majority. Regards, Daniel From kpfleming at backtobasicsmgmt.com Mon Aug 23 18:36:41 2004 From: kpfleming at backtobasicsmgmt.com (Kevin P. Fleming) Date: Mon, 23 Aug 2004 11:36:41 -0700 Subject: [Linux-cluster] Subversion? 
In-Reply-To: <200408231423.41542.phillips@redhat.com> References: <200408231143.11372.phillips@redhat.com> <200408231402.22863.phillips@redhat.com> <412A33F3.6020808@backtobasicsmgmt.com> <200408231423.41542.phillips@redhat.com> Message-ID: <412A3939.9050301@backtobasicsmgmt.com> Daniel Phillips wrote: > Sources.redhat.com not only consists entirely of free software, but > shows leadership to the free software community. We[1] are interested > in advancing not only our own projects, but other open source projects > such as Subversion and Arch. > > [1] Presumptively speaking for what I presume is the majority. I wholeheartedly agree with these statements, and if using Free Software projects to advance your own is the right decision then I fully support it. I just don't like to see decisions made using inaccurate, politicized arguments. In this case, you are far better off (IMO) to say "We won't use BitKeeper because it is not open source", rather than to rely on arguments about its licensing model. It's likely that even if the binary-only free use license for BitKeeper came with _no_ restrictions whatsoever, it still would not be your choice for an SCM, because it is not open source. From ananth at osc.edu Mon Aug 23 18:47:12 2004 From: ananth at osc.edu (Ananth Devulapalli) Date: Mon, 23 Aug 2004 14:47:12 -0400 (EDT) Subject: [Linux-cluster] Problem compiling dlm module. Message-ID: Hello: I am following instructions for compilation of dlm module described at http://opendlm.sourceforge.net/doc.php and i am through with installation of libnet and heartbeat modules. but opendlm breaks. my m/c is currently running 2.6.8-1.521smp on a dual xeon opendlm was configured using --with-heartbeat_include=/usr/include/heartbeat I am pasting output of make at the end of this mail. All the errors seem to be in files included in cccp_deliver.c. It appears to me like I am missing some files and have configured the tree incorrectly. For e.g. I dont have include/linux/modversions.h. Another error is MOD_INC_USE_COUNT is defined in /usr/include/linux/module.h but not in /lib/modules/2.6.8-1.521smp/build. configure is assigning linux_src variable to /lib/modules/2.6.8-1.521smp/build instead of /usr/include/linux/ hence its not able to locate that symbol. Is it a known problem? any pointers will be of great help. thanks, -Ananth if gcc -DHAVE_CONFIG_H -I. -I. -I../../.. 
-D__KERNEL__ -DMODULE -DMODVERSIONS -I/lib/modules/2.6.8-1.521smp/build/include -I../../../src/include -include /lib/modules/2.6.8-1.521smp/build/include/linux/modversions.h -I../../../src/api -DOLD_MARSHAL -I/lib/modules/2.6.8-1.521smp/build/include -I/usr/include/glib-1.2 -I/usr/lib/glib/include -pipe -O2 -Wall -g -O2 -MT cccp_deliver.o -MD -MP -MF ".deps/cccp_deliver.Tpo" -c -o cccp_deliver.o cccp_deliver.c; \ then mv -f ".deps/cccp_deliver.Tpo" ".deps/cccp_deliver.Po"; else rm -f ".deps/cccp_deliver.Tpo"; exit 1; fi :167113902:62704: /lib/modules/2.6.8-1.521smp/build/include/linux/modversions.h: No such file or directory In file included from /lib/modules/2.6.8-1.521smp/build/include/asm/processor.h:18, from /lib/modules/2.6.8-1.521smp/build/include/asm/thread_info.h:16, from /lib/modules/2.6.8-1.521smp/build/include/linux/thread_info.h:21, from /lib/modules/2.6.8-1.521smp/build/include/linux/spinlock.h:12, from /lib/modules/2.6.8-1.521smp/build/include/linux/capability.h:45, from /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:7, from /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, from cccp_deliver.c:47: /lib/modules/2.6.8-1.521smp/build/include/asm/system.h: In function `__set_64bit_var': /lib/modules/2.6.8-1.521smp/build/include/asm/system.h:193: warning: dereferencing type-punned pointer will break strict-aliasing rules /lib/modules/2.6.8-1.521smp/build/include/asm/system.h:193: warning: dereferencing type-punned pointer will break strict-aliasing rules In file included from /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h:18, from /lib/modules/2.6.8-1.521smp/build/include/linux/smp.h:17, from /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:23, from /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, from cccp_deliver.c:47: /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:6:25: mach_mpspec.h: No such file or directory In file included from /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h:18, from /lib/modules/2.6.8-1.521smp/build/include/linux/smp.h:17, from /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:23, from /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, from cccp_deliver.c:47: /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h: At top level: /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:8: error: `MAX_MP_BUSSES' undeclared here (not in a function) /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:9: error: `MAX_MP_BUSSES' undeclared here (not in a function) /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:10: error: `MAX_MP_BUSSES' undeclared here (not in a function) /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:12: error: `MAX_MP_BUSSES' undeclared here (not in a function) /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:20: error: `MAX_MP_BUSSES' undeclared here (not in a function) /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:20: error: conflicting types for `mp_bus_id_to_type' /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:8: error: previous declaration of `mp_bus_id_to_type' /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:22: error: `MAX_IRQ_SOURCES' undeclared here (not in a function) /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:24: error: `MAX_MP_BUSSES' undeclared here (not in a function) /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:24: error: conflicting types for `mp_bus_id_to_pci_bus' /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:12: error: previous declaration of `mp_bus_id_to_pci_bus' In file 
included from /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h:20, from /lib/modules/2.6.8-1.521smp/build/include/linux/smp.h:17, from /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:23, from /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, from cccp_deliver.c:47: /lib/modules/2.6.8-1.521smp/build/include/asm/io_apic.h:160: error: `MAX_IRQ_SOURCES' undeclared here (not in a function) /lib/modules/2.6.8-1.521smp/build/include/asm/io_apic.h:160: error: conflicting types for `mp_irqs' /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:22: error: previous declaration of `mp_irqs' In file included from /lib/modules/2.6.8-1.521smp/build/include/linux/smp.h:17, from /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:23, from /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, from cccp_deliver.c:47: /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h:71:26: mach_apicdef.h: No such file or directory In file included from /lib/modules/2.6.8-1.521smp/build/include/linux/smp.h:17, from /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:23, from /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, from cccp_deliver.c:47: /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h: In function `hard_smp_processor_id': /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h:75: warning: implicit declaration of function `GET_APIC_ID' In file included from cccp_private.h:57, from cccp_deliver.c:73: ../../../src/include/dlm_kernel.h:227:2: warning: #warning Untested signal handlers for Linux-2.6!! cccp_deliver.c: In function `cccp_msg_delivery_loop': cccp_deliver.c:246: error: `MOD_INC_USE_COUNT' undeclared (first use in this function) cccp_deliver.c:246: error: (Each undeclared identifier is reported only once cccp_deliver.c:246: error: for each function it appears in.) cccp_deliver.c:294: error: `MOD_DEC_USE_COUNT' undeclared (first use in this function) make: *** [cccp_deliver.o] Error 1 From erik at debian.franken.de Mon Aug 23 18:40:33 2004 From: erik at debian.franken.de (Erik Tews) Date: Mon, 23 Aug 2004 20:40:33 +0200 Subject: [Linux-cluster] Subversion? In-Reply-To: <20040823174837.GC22622@Favog.ubiqx.mn.org> References: <200408231143.11372.phillips@redhat.com> <20040823174837.GC22622@Favog.ubiqx.mn.org> Message-ID: <1093286433.11335.0.camel@localhost.localdomain> Am Mo, den 23.08.2004 schrieb Christopher R. Hertel um 19:48: > We're using SVN to maintain Samba now. There have been glitches, but most > have been fixed. The biggest problems (currently) are with the web > front-ends. I use viewcvs here, it works fine and seems to have all features I need. From crh at ubiqx.mn.org Mon Aug 23 19:06:56 2004 From: crh at ubiqx.mn.org (Christopher R. Hertel) Date: Mon, 23 Aug 2004 14:06:56 -0500 Subject: [Linux-cluster] Subversion? In-Reply-To: <1093286433.11335.0.camel@localhost.localdomain> References: <200408231143.11372.phillips@redhat.com> <20040823174837.GC22622@Favog.ubiqx.mn.org> <1093286433.11335.0.camel@localhost.localdomain> Message-ID: <20040823190656.GG22622@Favog.ubiqx.mn.org> On Mon, Aug 23, 2004 at 08:40:33PM +0200, Erik Tews wrote: > Am Mo, den 23.08.2004 schrieb Christopher R. Hertel um 19:48: > > We're using SVN to maintain Samba now. There have been glitches, but most > > have been fixed. The biggest problems (currently) are with the web > > front-ends. > > I use viewcvs here, it works fine and seems to have all features I need. It's my understanding that Samba source web access will be moving (has already been moved) to viewcvs. 
I've passed along the caution, given earlier, about database locking. There's some talk of sharing the single database amonst several mirrors using Samba and CIFS-VFS. :) We'll see what flys. Chris -)----- -- "Implementing CIFS - the Common Internet FileSystem" ISBN: 013047116X Samba Team -- http://www.samba.org/ -)----- Christopher R. Hertel jCIFS Team -- http://jcifs.samba.org/ -)----- ubiqx development, uninq. ubiqx Team -- http://www.ubiqx.org/ -)----- crh at ubiqx.mn.org OnLineBook -- http://ubiqx.org/cifs/ -)----- crh at ubiqx.org From erik at debian.franken.de Mon Aug 23 19:12:06 2004 From: erik at debian.franken.de (Erik Tews) Date: Mon, 23 Aug 2004 21:12:06 +0200 Subject: [Linux-cluster] Subversion? In-Reply-To: <20040823190656.GG22622@Favog.ubiqx.mn.org> References: <200408231143.11372.phillips@redhat.com> <20040823174837.GC22622@Favog.ubiqx.mn.org> <1093286433.11335.0.camel@localhost.localdomain> <20040823190656.GG22622@Favog.ubiqx.mn.org> Message-ID: <1093288326.11335.3.camel@localhost.localdomain> Am Mo, den 23.08.2004 schrieb Christopher R. Hertel um 21:06: > On Mon, Aug 23, 2004 at 08:40:33PM +0200, Erik Tews wrote: > > Am Mo, den 23.08.2004 schrieb Christopher R. Hertel um 19:48: > > > We're using SVN to maintain Samba now. There have been glitches, but most > > > have been fixed. The biggest problems (currently) are with the web > > > front-ends. > > > > I use viewcvs here, it works fine and seems to have all features I need. > > It's my understanding that Samba source web access will be moving (has > already been moved) to viewcvs. I've passed along the caution, given > earlier, about database locking. There's some talk of sharing the single > database amonst several mirrors using Samba and CIFS-VFS. :) Well, there is a package called libsvn-mirror-perl which should make it possible to mirror a subversion server on the subversion protocol level, so there should be no problem with any kind of locking. But your approach could work too. From arekm at pld-linux.org Mon Aug 23 20:27:26 2004 From: arekm at pld-linux.org (Arkadiusz Miskiewicz) Date: Mon, 23 Aug 2004 22:27:26 +0200 Subject: [Linux-cluster] Subversion? In-Reply-To: <412A1206.5040103@backtobasicsmgmt.com> References: <200408231143.11372.phillips@redhat.com> <412A1206.5040103@backtobasicsmgmt.com> Message-ID: <200408232227.26770.arekm@pld-linux.org> On Monday 23 of August 2004 17:49, Kevin P. Fleming wrote: > I also like BK quite a bit, and it has one major advantage over > CVS/Subversion: you can have local trees and actually _commit_ to them, > including changeset comments and everything else. This is very nice when > you are working on multiple bits of a project and are not ready to > commit them to the "real" repositories. Try http://svk.elixus.org/. It uses subversion lower layers and it's able to merge from/to normal subversion repository. BK main problem is licence. For example I'm not allowed to use it since I've sent few small patches to subversion people :/ -- Arkadiusz Mi?kiewicz CS at FoE, Wroclaw University of Technology arekm.pld-linux.org, 1024/3DB19BBD, JID: arekm.jabber.org, PLD/Linux From erik at debian.franken.de Mon Aug 23 21:18:05 2004 From: erik at debian.franken.de (Erik Tews) Date: Mon, 23 Aug 2004 23:18:05 +0200 Subject: [Linux-cluster] Subversion? 
In-Reply-To: <200408232227.26770.arekm@pld-linux.org> References: <200408231143.11372.phillips@redhat.com> <412A1206.5040103@backtobasicsmgmt.com> <200408232227.26770.arekm@pld-linux.org> Message-ID: <1093295884.3373.3.camel@localhost.localdomain> Am Mo, den 23.08.2004 schrieb Arkadiusz Miskiewicz um 22:27: > On Monday 23 of August 2004 17:49, Kevin P. Fleming wrote: > > > I also like BK quite a bit, and it has one major advantage over > > CVS/Subversion: you can have local trees and actually _commit_ to them, > > including changeset comments and everything else. This is very nice when > > you are working on multiple bits of a project and are not ready to > > commit them to the "real" repositories. > Try http://svk.elixus.org/. It uses subversion lower layers and it's able to > merge from/to normal subversion repository. Is this for one time merging, like converting a repository once from cvs to svn, or can this be done on every commit, could I setup a local svk server, and merge all my changes to a upstream svn server, which has no special modifications? From notiggy at gmail.com Mon Aug 23 21:25:53 2004 From: notiggy at gmail.com (Brian Jackson) Date: Mon, 23 Aug 2004 16:25:53 -0500 Subject: [Linux-cluster] Problem compiling dlm module. In-Reply-To: References: Message-ID: opendlm and the dlm that is used with gfs/linux-cluster are 2 different things. You shouldn't use the docs from one to build the other. The usage.txt file linked off of http://sources.redhat.com/cluster will get you up and running with the proper dlm. --Brian Jackson On Mon, 23 Aug 2004 14:47:12 -0400 (EDT), Ananth Devulapalli wrote: > Hello: > > I am following instructions for compilation of dlm module > described at http://opendlm.sourceforge.net/doc.php and i am through with > installation of libnet and heartbeat modules. but opendlm breaks. > > my m/c is currently running 2.6.8-1.521smp on a dual xeon > > opendlm was configured using > --with-heartbeat_include=/usr/include/heartbeat > > I am pasting output of make at the end of this mail. All the > errors seem to be in files included in cccp_deliver.c. It appears to me > like I am missing some files and have configured the tree incorrectly. For > e.g. I dont have include/linux/modversions.h. Another error is > MOD_INC_USE_COUNT is defined in /usr/include/linux/module.h but not in > /lib/modules/2.6.8-1.521smp/build. configure is assigning linux_src > variable to /lib/modules/2.6.8-1.521smp/build instead of > /usr/include/linux/ hence its not able to locate that symbol. Is it a > known problem? any pointers will be of great help. > > thanks, > -Ananth > > > > if gcc -DHAVE_CONFIG_H -I. -I. -I../../.. 
-D__KERNEL__ -DMODULE > -DMODVERSIONS -I/lib/modules/2.6.8-1.521smp/build/include > -I../../../src/include -include > /lib/modules/2.6.8-1.521smp/build/include/linux/modversions.h > -I../../../src/api -DOLD_MARSHAL > -I/lib/modules/2.6.8-1.521smp/build/include -I/usr/include/glib-1.2 > -I/usr/lib/glib/include -pipe -O2 -Wall -g -O2 -MT cccp_deliver.o -MD -MP > -MF ".deps/cccp_deliver.Tpo" -c -o cccp_deliver.o cccp_deliver.c; \ > then mv -f ".deps/cccp_deliver.Tpo" ".deps/cccp_deliver.Po"; else rm -f > ".deps/cccp_deliver.Tpo"; exit 1; fi > :167113902:62704: > /lib/modules/2.6.8-1.521smp/build/include/linux/modversions.h: No such > file or directory > In file included from > /lib/modules/2.6.8-1.521smp/build/include/asm/processor.h:18, > from > /lib/modules/2.6.8-1.521smp/build/include/asm/thread_info.h:16, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/thread_info.h:21, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/spinlock.h:12, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/capability.h:45, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:7, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, > from cccp_deliver.c:47: > /lib/modules/2.6.8-1.521smp/build/include/asm/system.h: In function > `__set_64bit_var': > /lib/modules/2.6.8-1.521smp/build/include/asm/system.h:193: warning: > dereferencing type-punned pointer will break strict-aliasing rules > /lib/modules/2.6.8-1.521smp/build/include/asm/system.h:193: warning: > dereferencing type-punned pointer will break strict-aliasing rules > In file included from > /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h:18, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/smp.h:17, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:23, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, > from cccp_deliver.c:47: > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:6:25: > mach_mpspec.h: No such file or directory > In file included from > /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h:18, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/smp.h:17, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:23, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, > from cccp_deliver.c:47: > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h: At top level: > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:8: error: > `MAX_MP_BUSSES' undeclared here (not in a function) > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:9: error: > `MAX_MP_BUSSES' undeclared here (not in a function) > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:10: error: > `MAX_MP_BUSSES' undeclared here (not in a function) > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:12: error: > `MAX_MP_BUSSES' undeclared here (not in a function) > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:20: error: > `MAX_MP_BUSSES' undeclared here (not in a function) > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:20: error: > conflicting types for `mp_bus_id_to_type' > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:8: error: previous > declaration of `mp_bus_id_to_type' > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:22: error: > `MAX_IRQ_SOURCES' undeclared here (not in a function) > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:24: error: > `MAX_MP_BUSSES' undeclared here (not in a function) > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:24: error: > conflicting types for 
`mp_bus_id_to_pci_bus' > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:12: error: previous > declaration of `mp_bus_id_to_pci_bus' > In file included from > /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h:20, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/smp.h:17, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:23, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, > from cccp_deliver.c:47: > /lib/modules/2.6.8-1.521smp/build/include/asm/io_apic.h:160: error: > `MAX_IRQ_SOURCES' undeclared here (not in a function) > /lib/modules/2.6.8-1.521smp/build/include/asm/io_apic.h:160: error: > conflicting types for `mp_irqs' > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:22: error: previous > declaration of `mp_irqs' > In file included from > /lib/modules/2.6.8-1.521smp/build/include/linux/smp.h:17, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:23, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, > from cccp_deliver.c:47: > /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h:71:26: mach_apicdef.h: > No such file or directory > In file included from > /lib/modules/2.6.8-1.521smp/build/include/linux/smp.h:17, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:23, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, > from cccp_deliver.c:47: > /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h: In function > `hard_smp_processor_id': > /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h:75: warning: implicit > declaration of function `GET_APIC_ID' > In file included from cccp_private.h:57, > from cccp_deliver.c:73: > .../../../src/include/dlm_kernel.h:227:2: warning: #warning Untested signal > handlers for Linux-2.6!! > cccp_deliver.c: In function `cccp_msg_delivery_loop': > cccp_deliver.c:246: error: `MOD_INC_USE_COUNT' undeclared (first use in > this function) > cccp_deliver.c:246: error: (Each undeclared identifier is reported only > once > cccp_deliver.c:246: error: for each function it appears in.) > cccp_deliver.c:294: error: `MOD_DEC_USE_COUNT' undeclared (first use in > this function) > make: *** [cccp_deliver.o] Error 1 > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > From ananth at osc.edu Mon Aug 23 21:44:41 2004 From: ananth at osc.edu (Ananth Devulapalli) Date: Mon, 23 Aug 2004 17:44:41 -0400 (EDT) Subject: [Linux-cluster] Problem compiling dlm module. In-Reply-To: References: Message-ID: It was my bad. Thanks for pointing my mistake. I got thought both were same since opendlm links to redhat cluster's page. regards, -Ananth On Mon, 23 Aug 2004, Brian Jackson wrote: > opendlm and the dlm that is used with gfs/linux-cluster are 2 > different things. You shouldn't use the docs from one to build the > other. The usage.txt file linked off of > http://sources.redhat.com/cluster will get you up and running with the > proper dlm. > > --Brian Jackson > > On Mon, 23 Aug 2004 14:47:12 -0400 (EDT), Ananth Devulapalli > wrote: > > Hello: > > > > I am following instructions for compilation of dlm module > > described at http://opendlm.sourceforge.net/doc.php and i am through with > > installation of libnet and heartbeat modules. but opendlm breaks. > > > > my m/c is currently running 2.6.8-1.521smp on a dual xeon > > > > opendlm was configured using > > --with-heartbeat_include=/usr/include/heartbeat > > > > I am pasting output of make at the end of this mail. 
All the > > errors seem to be in files included in cccp_deliver.c. It appears to me > > like I am missing some files and have configured the tree incorrectly. For > > e.g. I dont have include/linux/modversions.h. Another error is > > MOD_INC_USE_COUNT is defined in /usr/include/linux/module.h but not in > > /lib/modules/2.6.8-1.521smp/build. configure is assigning linux_src > > variable to /lib/modules/2.6.8-1.521smp/build instead of > > /usr/include/linux/ hence its not able to locate that symbol. Is it a > > known problem? any pointers will be of great help. > > > > thanks, > > -Ananth > > > > > > > > if gcc -DHAVE_CONFIG_H -I. -I. -I../../.. -D__KERNEL__ -DMODULE > > -DMODVERSIONS -I/lib/modules/2.6.8-1.521smp/build/include > > -I../../../src/include -include > > /lib/modules/2.6.8-1.521smp/build/include/linux/modversions.h > > -I../../../src/api -DOLD_MARSHAL > > -I/lib/modules/2.6.8-1.521smp/build/include -I/usr/include/glib-1.2 > > -I/usr/lib/glib/include -pipe -O2 -Wall -g -O2 -MT cccp_deliver.o -MD -MP > > -MF ".deps/cccp_deliver.Tpo" -c -o cccp_deliver.o cccp_deliver.c; \ > > then mv -f ".deps/cccp_deliver.Tpo" ".deps/cccp_deliver.Po"; else rm -f > > ".deps/cccp_deliver.Tpo"; exit 1; fi > > :167113902:62704: > > /lib/modules/2.6.8-1.521smp/build/include/linux/modversions.h: No such > > file or directory > > In file included from > > /lib/modules/2.6.8-1.521smp/build/include/asm/processor.h:18, > > from > > /lib/modules/2.6.8-1.521smp/build/include/asm/thread_info.h:16, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/thread_info.h:21, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/spinlock.h:12, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/capability.h:45, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:7, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, > > from cccp_deliver.c:47: > > /lib/modules/2.6.8-1.521smp/build/include/asm/system.h: In function > > `__set_64bit_var': > > /lib/modules/2.6.8-1.521smp/build/include/asm/system.h:193: warning: > > dereferencing type-punned pointer will break strict-aliasing rules > > /lib/modules/2.6.8-1.521smp/build/include/asm/system.h:193: warning: > > dereferencing type-punned pointer will break strict-aliasing rules > > In file included from > > /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h:18, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/smp.h:17, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:23, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, > > from cccp_deliver.c:47: > > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:6:25: > > mach_mpspec.h: No such file or directory > > In file included from > > /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h:18, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/smp.h:17, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:23, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, > > from cccp_deliver.c:47: > > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h: At top level: > > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:8: error: > > `MAX_MP_BUSSES' undeclared here (not in a function) > > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:9: error: > > `MAX_MP_BUSSES' undeclared here (not in a function) > > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:10: error: > > `MAX_MP_BUSSES' undeclared here (not in a function) > > 
/lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:12: error: > > `MAX_MP_BUSSES' undeclared here (not in a function) > > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:20: error: > > `MAX_MP_BUSSES' undeclared here (not in a function) > > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:20: error: > > conflicting types for `mp_bus_id_to_type' > > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:8: error: previous > > declaration of `mp_bus_id_to_type' > > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:22: error: > > `MAX_IRQ_SOURCES' undeclared here (not in a function) > > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:24: error: > > `MAX_MP_BUSSES' undeclared here (not in a function) > > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:24: error: > > conflicting types for `mp_bus_id_to_pci_bus' > > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:12: error: previous > > declaration of `mp_bus_id_to_pci_bus' > > In file included from > > /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h:20, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/smp.h:17, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:23, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, > > from cccp_deliver.c:47: > > /lib/modules/2.6.8-1.521smp/build/include/asm/io_apic.h:160: error: > > `MAX_IRQ_SOURCES' undeclared here (not in a function) > > /lib/modules/2.6.8-1.521smp/build/include/asm/io_apic.h:160: error: > > conflicting types for `mp_irqs' > > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:22: error: previous > > declaration of `mp_irqs' > > In file included from > > /lib/modules/2.6.8-1.521smp/build/include/linux/smp.h:17, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:23, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, > > from cccp_deliver.c:47: > > /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h:71:26: mach_apicdef.h: > > No such file or directory > > In file included from > > /lib/modules/2.6.8-1.521smp/build/include/linux/smp.h:17, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:23, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, > > from cccp_deliver.c:47: > > /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h: In function > > `hard_smp_processor_id': > > /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h:75: warning: implicit > > declaration of function `GET_APIC_ID' > > In file included from cccp_private.h:57, > > from cccp_deliver.c:73: > > .../../../src/include/dlm_kernel.h:227:2: warning: #warning Untested signal > > handlers for Linux-2.6!! > > cccp_deliver.c: In function `cccp_msg_delivery_loop': > > cccp_deliver.c:246: error: `MOD_INC_USE_COUNT' undeclared (first use in > > this function) > > cccp_deliver.c:246: error: (Each undeclared identifier is reported only > > once > > cccp_deliver.c:246: error: for each function it appears in.) 
> > cccp_deliver.c:294: error: `MOD_DEC_USE_COUNT' undeclared (first use in > > this function) > > make: *** [cccp_deliver.o] Error 1 > > > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > http://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > From phillips at redhat.com Mon Aug 23 22:30:26 2004 From: phillips at redhat.com (Daniel Phillips) Date: Mon, 23 Aug 2004 18:30:26 -0400 Subject: [Linux-cluster] Subversion? In-Reply-To: <1093285037.12874.70.camel@cherrybomb.pdx.osdl.net> References: <200408231143.11372.phillips@redhat.com> <200408231402.22863.phillips@redhat.com> <1093285037.12874.70.camel@cherrybomb.pdx.osdl.net> Message-ID: <200408231830.26171.phillips@redhat.com> On Monday 23 August 2004 14:17, John Cherry wrote: > On Mon, 2004-08-23 at 11:02, Daniel Phillips wrote: > > In my humble opinion, Bitkeeper does not have a snowball's chance > > in hell of getting established on sources.redhat.com. > > I kinda figured it didn't have a chance at sources.redhat.com. But > what about bkbits.net? Why don't you take a poll? ;) > However, bk is being used across the kernel development community and > this does not appear to be changing anytime soon. Regardless of its effect on Linus's scalability, the kernel development community is deeply fractured over Bitkeeper. Please don't be fooled by the apparent low profile of this subject on lkml. We do not need a self-inflicted wound like that in the cluster community. > BTW, most developers do just fine with up to date tarballs, so source > control is not a huge issue for most of them. Yes, I personally prefer tarballs when I'm checking out a project for the first time. However we still need a repository somewhere. Regards, Daniel From arekm at pld-linux.org Mon Aug 23 23:24:46 2004 From: arekm at pld-linux.org (Arkadiusz Miskiewicz) Date: Tue, 24 Aug 2004 01:24:46 +0200 Subject: [Linux-cluster] Subversion? In-Reply-To: <1093295884.3373.3.camel@localhost.localdomain> References: <200408231143.11372.phillips@redhat.com> <200408232227.26770.arekm@pld-linux.org> <1093295884.3373.3.camel@localhost.localdomain> Message-ID: <200408240124.46669.arekm@pld-linux.org> On Monday 23 of August 2004 23:18, Erik Tews wrote: > > Try http://svk.elixus.org/. It uses subversion lower layers and it's able > > to merge from/to normal subversion repository. > > Is this for one time merging, like converting a repository once from cvs > to svn, or can this be done on every commit, could I setup a local svk > server, and merge all my changes to a upstream svn server, which has no > special modifications? You can merge all your local changes upstream then fetch new changes from subversion repo and so on. It's for making soft of decentralized subversion. http://svk.elixus.org/index.cgi?SVKTutorial -- Arkadiusz Mi?kiewicz CS at FoE, Wroclaw University of Technology arekm.pld-linux.org, 1024/3DB19BBD, JID: arekm.jabber.org, PLD/Linux From yjcho at cs.hongik.ac.kr Tue Aug 24 06:29:22 2004 From: yjcho at cs.hongik.ac.kr (Cho Yool Je) Date: Tue, 24 Aug 2004 15:29:22 +0900 Subject: [Linux-cluster] i wanna use gfs with firewire.... Message-ID: <412AE042.9060203@cs.hongik.ac.kr> i wanna use gfs with firewire.... but i can't search for document about it... anybody has docs? 
From walters at redhat.com Mon Aug 23 17:23:06 2004 From: walters at redhat.com (Colin Walters) Date: Mon, 23 Aug 2004 13:23:06 -0400 Subject: [Linux-cluster] Subversion? In-Reply-To: <1093280121.3467.80.camel@atlantis.boston.redhat.com> References: <200408231143.11372.phillips@redhat.com> <1093280121.3467.80.camel@atlantis.boston.redhat.com> Message-ID: <1093281786.20301.34.camel@nexus.verbum.private> On Mon, 2004-08-23 at 12:55 -0400, Lon Hohberger wrote: > On Mon, 2004-08-23 at 11:43 -0400, Daniel Phillips wrote: > > Hi everybody, > > > > I was just taking a look at this article and I thought, maybe this would > > be a good time to show some leadership as a project, and take the > > Subversion plunge: > > > > http://www.onlamp.com/pub/a/onlamp/2004/08/19/subversiontips.html > > > > Subversion is basically CVS as it should have been. It's mature now. > > The number of complaints I have noticed from users out there is roughly > > zero. Subversion _versions directories_. Etc. Etc. > > Disagree. We should use GNU arch. Here's a comparison from someone you > know: > > http://wiki.gnuarch.org/moin.cgi/SubVersionAndCvsComparison > http://better-scm.berlios.de/comparison/comparison.html Here also is a presentation giving an introduction to Arch from the "bottom up", which gives you a much better idea I think of why it is the best architecture, rather than just comparing checkboxes on some list. http://web.verbum.org/tla/grokking-arch/img0.html > True. For now. Switching again in the future (if needed) will be more > painful as we attract more developers. Right - switching revision control systems is always painful. You want to make the choice once. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From hanafim at asc.hpc.mil Tue Aug 24 14:01:12 2004 From: hanafim at asc.hpc.mil (MAHMOUD HANAFI) Date: Tue, 24 Aug 2004 10:01:12 -0400 Subject: [Linux-cluster] Re: update 1-11456576 Re: LSF job pends... In-Reply-To: <9CD190F4AD92EC499A9397F49C8B0E4E69B752@catoexm04.noam.corp.platform.com> References: <9CD190F4AD92EC499A9397F49C8B0E4E69B752@catoexm04.noam.corp.platform.com> Message-ID: <412B4A28.7060308@asc.hpc.mil> You may close this issue. This was only applied to jobs that had been submitted before installing the job weight plugin. New jobs do not have the same pending condition. thanks, -Mahmoud Mohammad Asim Khan wrote: > Hi Mahmoud, > Can you please send me the output of the following commands: > bhosts > bhpart -r > lshosts > > Your lsb.hosts file, lsf.cluster and lsf.shared file. > > > Regards > ______________________________________________________________________ > TECHNICAL SUPPORT > Mohammad Asim Khan FTP : ftp.platform.com > Technical Support Engineer > Platform Computing Corporation WWW : www.platform.com > 3760, 14th Avenue Support : support at platform.com > Markham Ontario L3R 3T7 Canada License : license at platform.com > Phone : (905) 948-4325 Inquiries : info at platform.com > Fax : (905) 948-9975 Sales : sales at platform.com > E-mail : mkhan at platform.com Phone : 1-905-948-4297 > > Note : Please cc all emails to support at platform.com > _____________________________________________________________________ > Platform. Accelerating Intelligence > "Unleash the Power" of LSF by attending a Platform LSF Administration Training Class. 
> distributed and Grid Computing > > To receive periodic Patch Update information, critical bug notification and general support Notification from platform support email supportnotice-request at platform.com with the subject line containing the word "subscribe". > > To receive security related issue notification from Platform support email > securenotice-request at platform.com with the subject line containing the word "subscribe". > > From lhh at redhat.com Tue Aug 24 14:48:04 2004 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 24 Aug 2004 10:48:04 -0400 Subject: [Linux-cluster] Subversion? In-Reply-To: <200408240124.46669.arekm@pld-linux.org> References: <200408231143.11372.phillips@redhat.com> <200408232227.26770.arekm@pld-linux.org> <1093295884.3373.3.camel@localhost.localdomain> <200408240124.46669.arekm@pld-linux.org> Message-ID: <1093358884.3467.130.camel@atlantis.boston.redhat.com> On Tue, 2004-08-24 at 01:24 +0200, Arkadiusz Miskiewicz wrote: > On Monday 23 of August 2004 23:18, Erik Tews wrote: > > > > Try http://svk.elixus.org/. It uses subversion lower layers and it's able > > > to merge from/to normal subversion repository. > > > > Is this for one time merging, like converting a repository once from cvs > > to svn, or can this be done on every commit, could I setup a local svk > > server, and merge all my changes to a upstream svn server, which has no > > special modifications? > You can merge all your local changes upstream then fetch new changes from > subversion repo and so on. It's for making soft of decentralized subversion. > > http://svk.elixus.org/index.cgi?SVKTutorial > http://www.gnuarch.org It was _designed_ to handle distributed repositories (like BK). -- Lon From tomc at teamics.com Tue Aug 24 14:51:58 2004 From: tomc at teamics.com (tomc at teamics.com) Date: Tue, 24 Aug 2004 09:51:58 -0500 Subject: [Linux-cluster] unusual GFS problem Message-ID: Looking for some direction on this, please. What is this message telling me? This node was the master in a three node setup: Aug 24 03:11:21 lvs2 lock_gulmd_LT000[1332]: Asking for exl, where I hold the lock Shr, and someone else is queued before me. Aug 24 03:11:21 lvs2 lock_gulmd_LT000[1332]: Asking for exl, where I hold the lock Shr, and someone else is queued before me. This repeated for about 3 hours, then one of the other nodes had a GFS panic and had to be rebooted. Any suggestions would be appreciated. tc From notiggy at gmail.com Tue Aug 24 15:18:10 2004 From: notiggy at gmail.com (Brian Jackson) Date: Tue, 24 Aug 2004 10:18:10 -0500 Subject: [Linux-cluster] i wanna use gfs with firewire.... In-Reply-To: <412AE042.9060203@cs.hongik.ac.kr> References: <412AE042.9060203@cs.hongik.ac.kr> Message-ID: There's a doc on the opengfs site about it, you can read the firewire specific parts out of it, then use the linux-cluster docs to do the filesystem setup. --Brian Jackson On Tue, 24 Aug 2004 15:29:22 +0900, Cho Yool Je wrote: > i wanna use gfs with firewire.... > but i can't search for document about it... > anybody has docs? > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > From amanthei at redhat.com Tue Aug 24 15:36:54 2004 From: amanthei at redhat.com (Adam Manthei) Date: Tue, 24 Aug 2004 10:36:54 -0500 Subject: [Linux-cluster] unusual GFS problem In-Reply-To: References: Message-ID: <20040824153654.GC27527@redhat.com> On Tue, Aug 24, 2004 at 09:51:58AM -0500, tomc at teamics.com wrote: > Looking for some direction on this, please. 
What is this message telling > me? This node was the master in a three node setup: The message is telling you that you turned on the "Locking" gulm verbosity flag :) The short answer is that it's just letting you know you have lock contention. These messages are rather common and can be ignored (especially if your applications are modifying common files or directories from more than one node). > > > Aug 24 03:11:21 lvs2 lock_gulmd_LT000[1332]: Asking for exl, where I hold > the lock Shr, and someone else is > queued before me. > Aug 24 03:11:21 lvs2 lock_gulmd_LT000[1332]: Asking for exl, where I hold > the lock Shr, and someone else is > queued before me. > > This repeated for about 3 hours, then one of the other nodes had a GFS > panic and had to be rebooted. Any suggestions would be appreciated. Without the panic message and relevant syslog messages, we can't really help you. -- Adam Manthei From phillips at redhat.com Tue Aug 24 16:12:53 2004 From: phillips at redhat.com (Daniel Phillips) Date: Tue, 24 Aug 2004 12:12:53 -0400 Subject: [Linux-cluster] Subversion? In-Reply-To: <1093358884.3467.130.camel@atlantis.boston.redhat.com> References: <200408231143.11372.phillips@redhat.com> <200408240124.46669.arekm@pld-linux.org> <1093358884.3467.130.camel@atlantis.boston.redhat.com> Message-ID: <200408241212.53357.phillips@redhat.com> Hi Lon, On Tuesday 24 August 2004 10:48, Lon Hohberger wrote: > It was _designed_ to handle distributed repositories (like BK). Well, what wind is blowing, seems to be blowing in the direction of Arch. I'd be equally happy with either, and in any case, much happier than with CVS. Does anybody else have a strong opinion? Regards, Daniel From amanthei at redhat.com Tue Aug 24 16:15:45 2004 From: amanthei at redhat.com (Adam Manthei) Date: Tue, 24 Aug 2004 11:15:45 -0500 Subject: [Linux-cluster] SNMP modules? In-Reply-To: <1090861715.13809.3.camel@laza.eunet.yu> References: <1090861715.13809.3.camel@laza.eunet.yu> Message-ID: <20040824161545.GA31079@redhat.com> On Mon, Jul 26, 2004 at 07:08:35PM +0200, Lazar Obradovic wrote: > Hello all, > > I'd like to develop my own fencing agents (for IBM BladeCenter and > QLogic SANBox2 switches), but they will require SNMP bindings. > > Is that ok with general development philosophy, since I'd like to > contribude them? net-snmp-5.x.x-based API? I've added these to the repository. I also made the following adjustment: o removed the deprecated "fm" and "name" stdin parameters that were residue of the old GFS-5.1.x fencing system o changed a couple errors to warnings. If a powered off the blade, the fencing agent would detect that it was powered off and fail. If it knows that the blade is off, it should still succeed. Patch is attached (but already checked into CVS) -- Adam Manthei -------------- next part -------------- diff -urNp ibmblade/fence_ibmblade.pl ibmblade.mantis/fence_ibmblade.pl --- ibmblade/fence_ibmblade.pl 2004-08-24 11:09:55.680240183 -0500 +++ ibmblade.mantis/fence_ibmblade.pl 2004-08-24 10:58:57.558573326 -0500 @@ -112,13 +112,6 @@ sub get_options_stdin # DO NOTHING -- this field is used by fenced elsif ($name eq "agent" ) { } - # FIXME -- depricated. use "port" instead. - elsif ($name eq "fm" ) - { - (my $dummy,$opt_n) = split /\s+/,$val; - print STDERR "Depricated \"fm\" entry detected. 
refer to man page.\n"; - } - elsif ($name eq "ipaddr" ) { $opt_a = $val; @@ -127,8 +120,6 @@ sub get_options_stdin { $opt_c = $val; } - # FIXME -- depreicated residue of old fencing system - elsif ($name eq "name" ) { } elsif ($name eq "option" ) { @@ -204,15 +195,15 @@ if (defined ($opt_t)) { if ($opt_o =~ /^(reboot|off)$/i) { if ($result->{$oid} == "0") { - printf ("$FENCE_RELEASE_NAME ERROR: Port %d on %s already down.\n", $opt_n, $opt_a); + printf ("$FENCE_RELEASE_NAME WARNING: Port %d on %s already down.\n", $opt_n, $opt_a); $snmpsess->close; - exit 1; + exit 0; }; } else { if ($result->{$oid} == "1") { - printf ("$FENCE_RELEASE_NAME ERROR: Port %d on %s already up.\n", $opt_n, $opt_a); + printf ("$FENCE_RELEASE_NAME WARNING: Port %d on %s already up.\n", $opt_n, $opt_a); $snmpsess->close; - exit 1; + exit 0; }; }; From tomc at teamics.com Tue Aug 24 16:22:31 2004 From: tomc at teamics.com (tomc at teamics.com) Date: Tue, 24 Aug 2004 11:22:31 -0500 Subject: [Linux-cluster] unusual GFS problem Message-ID: I think the problem is actually a FC-SAN problem, but I just wanted to follow up on this particular message. GFS panics when Linux loses the SCSI device. The SCSI device disapepars because of a SAN communications failure. I don't think it is a GFS problem. Thanks for the info on that message. tc Adam Manthei To: Discussion of clustering software components including GFS Sent by: linux-cluster-bounces cc: (bcc: Tom Currie/teamics) @redhat.com Subject: Re: [Linux-cluster] unusual GFS problem 08/24/04 10:36 AM Please respond to Discussion of clustering software components including GFS On Tue, Aug 24, 2004 at 09:51:58AM -0500, tomc at teamics.com wrote: > Looking for some direction on this, please. What is this message telling > me? This node was the master in a three node setup: The message is telling you that you turned on the "Locking" gulm verbosity flag :) The short answer is that it's just letting you know you have lock contention. These messages are rather common and can be ignored (especially if your applications are modifying common files or directories from more than one node). > > > Aug 24 03:11:21 lvs2 lock_gulmd_LT000[1332]: Asking for exl, where I hold > the lock Shr, and someone else is > queued before me. > Aug 24 03:11:21 lvs2 lock_gulmd_LT000[1332]: Asking for exl, where I hold > the lock Shr, and someone else is > queued before me. > > This repeated for about 3 hours, then one of the other nodes had a GFS > panic and had to be rebooted. Any suggestions would be appreciated. Without the panic message and relevant syslog messages, we can't really help you. -- Adam Manthei -- Linux-cluster mailing list Linux-cluster at redhat.com http://www.redhat.com/mailman/listinfo/linux-cluster From lhh at redhat.com Tue Aug 24 17:11:04 2004 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 24 Aug 2004 13:11:04 -0400 Subject: [Linux-cluster] Re: Arch? In-Reply-To: <200408241212.53357.phillips@redhat.com> References: <200408231143.11372.phillips@redhat.com> <200408240124.46669.arekm@pld-linux.org> <1093358884.3467.130.camel@atlantis.boston.redhat.com> <200408241212.53357.phillips@redhat.com> Message-ID: <1093367464.3467.133.camel@atlantis.boston.redhat.com> On Tue, 2004-08-24 at 12:12 -0400, Daniel Phillips wrote: > Hi Lon, > > On Tuesday 24 August 2004 10:48, Lon Hohberger wrote: > > It was _designed_ to handle distributed repositories (like BK). > > Well, what wind is blowing, seems to be blowing in the direction of > Arch. 
I'd be equally happy with either, and in any case, much happier > than with CVS. Does anybody else have a strong opinion? We still have to get the current maintainers to agree, which might prove more of a problem than deciding on what new software to use in the first place. -- Lon From erik at debian.franken.de Tue Aug 24 17:27:42 2004 From: erik at debian.franken.de (Erik Tews) Date: Tue, 24 Aug 2004 19:27:42 +0200 Subject: [Linux-cluster] Subversion? In-Reply-To: <200408241212.53357.phillips@redhat.com> References: <200408231143.11372.phillips@redhat.com> <200408240124.46669.arekm@pld-linux.org> <1093358884.3467.130.camel@atlantis.boston.redhat.com> <200408241212.53357.phillips@redhat.com> Message-ID: <1093368462.20290.7.camel@localhost.localdomain> Am Di, den 24.08.2004 schrieb Daniel Phillips um 18:12: > Well, what wind is blowing, seems to be blowing in the direction of > Arch. I'd be equally happy with either, and in any case, much happier > than with CVS. Does anybody else have a strong opinion? My opinion when I have to choose one is that 3. party support should be good too. CVS is supported on many systems and there are plugins for ides and guis everywhere. SVN is very good in this point too. And I usually need clients for the most common operating systems. But this is less important on sources.redhat.com, all people accessing this site will be running linux at least (not only redhat but linux) and will use no special ides because most of the software makes use of the gnu tools for building and testing. From xiaofeng.ling at intel.com Wed Aug 25 01:59:31 2004 From: xiaofeng.ling at intel.com (Ling, Xiaofeng) Date: Wed, 25 Aug 2004 09:59:31 +0800 Subject: [Linux-cluster] bug? mount hangs. Message-ID: <3ACA40606221794F80A5670F0AF15F840547798E@pdsmsx403> Hi, When I trying to setup GFS on two node. some times it triggers the kdb and the mount hangs. Follow is the dmesg and config file. I use kernel 2.6.6up no preemption with kdb patch. two nodes are both DELL desktop with Intel P3 and P4 CPU. Is this a know issue? ------------------------------------------------------------------------ --------------------------------------------------- Unable to handle kernel NULL pointer dereference at virtual address 00000046 printing eip: d087c916 *pde = 00000000 Oops: 0000 [#1] CPU: 0 EIP: 0060:[] Not tainted EFLAGS: 00010286 (2.6.6kdb) EIP is at send_to_sock+0x41/0x20a [dlm] eax: 00000002 ebx: c763c060 ecx: 00000000 edx: 00000000 esi: d089124c edi: c763c060 ebp: 00000000 esp: c7007f88 ds: 007b es: 007b ss: 0068 Process dlm_sendd (pid: 19895, threadinfo=c7006000 task=caeaa7b0) Stack: d0887851 0000002b c7007fb0 c011475f c994bc68 c12a31b0 c763c068 00000286 c763c060 d089124c c7006000 00000002 d087cce0 c763c060 c7006000 00000000 00000000 00000000 d087cfda d0887905 00000000 00000000 0000007b 0000007b Call Trace: [] __wake_up_common+0x31/0x50 [] process_output_queue+0x55/0x75 [dlm] [] dlm_sendd+0x95/0xe9 [dlm] [] dlm_sendd+0x0/0xe9 [dlm] [] kernel_thread_helper+0x5/0xb Code: 8b 40 44 89 44 24 1c 8d 47 30 89 44 24 14 8b 5f 30 3b 5c 24 <6>CMAN: Being told to leave the cluster by node 2 CMAN: we are leaving the cluster SM: 00000001 sm_stop: SG still joined SM: 01000002 sm_stop: SG still joined input: AT Translated Set 2 keyboard on isa0060/serio0 my config file. ---------------------------------------------------------------- ------------------- Ling Xiaofeng(Daniel) Intel China Software Lab. 
iNet: 8-752-1243 8621-52574545-1243(O) xfling at users.sourceforge.net Opinions are my own and don't represent those of my employer From adam.cassar at netregistry.com.au Wed Aug 25 05:07:28 2004 From: adam.cassar at netregistry.com.au (Adam Cassar) Date: Wed, 25 Aug 2004 15:07:28 +1000 Subject: [Linux-cluster] what does this mean? Message-ID: <1093410448.17936.232.camel@akira2.nro.au.com> (against latest gfs in cvs) scenario: - 3 machines in cluster - one importing gnbd, two directly mounted to shared fc raid * all 3 performing io * reboot one machine, all of the machines hung on io attempted to get rebooted machine to join the cluster, one machine spat out the following: kernel BUG at /usr/src/GFS/cluster/cman-kernel/src/membership.c:611! invalid operand: 0000 [#1] SMP Modules linked in: gnbd gfs lock_dlm dlm cman lock_harness 8250 serial_core dm_mod CPU: 0 EIP: 0060:[] Not tainted EFLAGS: 00010246 (2.6.8.1) EIP is at send_joinconf+0x10/0x75 [cman] eax: 00000000 ebx: 00000003 ecx: 018a60be edx: c180e08c esi: f7ff6fc0 edi: 00000000 ebp: 00000001 esp: f324fe74 ds: 007b es: 007b ss: 0068 Process cman_memb (pid: 822, threadinfo=f324e000 task=f3242c70) Stack: f8907380 00000000 00000003 f7ff6fc0 00000000 00000003 f88f15ec f8907380 018a7446 c01193bf c0331140 c0401a8d f7ff6fc0 0000824d 0000824d 0000824d 00000000 f7ff6fc0 00000000 00000007 f88f087f 00000000 00000000 00000038 Call Trace: [] do_process_startack+0x14d/0x38e [cman] [] __call_console_drivers+0x55/0x57 [] start_transition+0x20f/0x2c1 [cman] [] cman_callback+0x35/0x38 [dlm] [] notify_kernel_listeners+0x41/0x68 [cman] [] a_node_just_died+0x163/0x181 [cman] [] do_process_leave+0x6b/0x7d [cman] [] do_membership_packet+0x98/0x1f0 [cman] [] dispatch_messages+0xe3/0x104 [cman] [] membership_kthread+0x216/0x3e6 [cman] [] ret_from_fork+0x6/0x14 [] default_wake_function+0x0/0x12 [] membership_kthread+0x0/0x3e6 [cman] [] kernel_thread_helper+0x5/0xb Code: 0f 0b 63 02 40 bd 8f f8 89 44 24 10 c7 05 f0 73 90 f8 02 00 kernel BUG at /usr/src/GFS/cluster/cman-kernel/src/membership.c:611! 
invalid operand: 0000 [#1] CPU: 0 EIP: 0060:[] Not tainted Using defaults from ksymoops -t elf32-i386 -a i386 EFLAGS: 00010246 (2.6.8.1) eax: 00000000 ebx: 00000003 ecx: 018a60be edx: c180e08c esi: f7ff6fc0 edi: 00000000 ebp: 00000001 esp: f324fe74 ds: 007b es: 007b ss: 0068 Stack: f8907380 00000000 00000003 f7ff6fc0 00000000 00000003 f88f15ec f8907380 018a7446 c01193bf c0331140 c0401a8d f7ff6fc0 0000824d 0000824d 0000824d 00000000 f7ff6fc0 00000000 00000007 f88f087f 00000000 00000000 00000038 [] do_process_startack+0x14d/0x38e [cman] [] __call_console_drivers+0x55/0x57 [] start_transition+0x20f/0x2c1 [cman] [] cman_callback+0x35/0x38 [dlm] [] notify_kernel_listeners+0x41/0x68 [cman] [] a_node_just_died+0x163/0x181 [cman] [] do_process_leave+0x6b/0x7d [cman] [] do_membership_packet+0x98/0x1f0 [cman] [] dispatch_messages+0xe3/0x104 [cman] [] membership_kthread+0x216/0x3e6 [cman] [] ret_from_fork+0x6/0x14 [] default_wake_function+0x0/0x12 [] membership_kthread+0x0/0x3e6 [cman] [] kernel_thread_helper+0x5/0xb Code: 0f 0b 63 02 40 bd 8f f8 89 44 24 10 c7 05 f0 73 90 f8 02 00 >>EIP; f88efc18 <===== >>ecx; 018a60be Before first symbol >>edx; c180e08c >>esi; f7ff6fc0 >>esp; f324fe74 Code; f88efc18 00000000 <_EIP>: Code; f88efc18 <===== 0: 0f 0b ud2a <===== Code; f88efc1a 2: 63 02 arpl %ax,(%edx) Code; f88efc1c 4: 40 inc %eax Code; f88efc1d 5: bd 8f f8 89 44 mov $0x4489f88f,%ebp Code; f88efc22 a: 24 10 and $0x10,%al Code; f88efc24 c: c7 05 f0 73 90 f8 02 movl $0x2,0xf89073f0 Code; f88efc2b 13: 00 00 00 From pcaulfie at redhat.com Wed Aug 25 07:23:32 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Wed, 25 Aug 2004 08:23:32 +0100 Subject: [Linux-cluster] bug? mount hangs. In-Reply-To: <3ACA40606221794F80A5670F0AF15F840547798E@pdsmsx403> References: <3ACA40606221794F80A5670F0AF15F840547798E@pdsmsx403> Message-ID: <20040825072331.GB11961@tykepenguin.com> On Wed, Aug 25, 2004 at 09:59:31AM +0800, Ling, Xiaofeng wrote: > Hi, > When I trying to setup GFS on two node. some times it triggers the > kdb and the mount hangs. > Follow is the dmesg and config file. > I use kernel 2.6.6up no preemption with kdb patch. two nodes are both > DELL desktop with Intel P3 and P4 CPU. > Is this a know issue? If it was preceded by a "can't bind to port 21064" message then yes. It has been fixed in CVS. It certainly looks like that bug. -- patrick From adam.cassar at netregistry.com.au Wed Aug 25 07:33:34 2004 From: adam.cassar at netregistry.com.au (Adam Cassar) Date: Wed, 25 Aug 2004 17:33:34 +1000 Subject: [Linux-cluster] 2 node vs 3 node cluster Message-ID: <1093419214.17936.313.camel@akira2.nro.au.com> Hi Guys, What are the benefits of running a 3 node cluster, as only one node can fail before bringing the entire cluster down? It appears that a single node cannot be a member of a cluster if the other hosts are missing. I take it that is to prevent single nodes splitting and making themselves independent clusters? From njd at ndietsch.com Wed Aug 25 08:16:07 2004 From: njd at ndietsch.com (Nathan Dietsch) Date: Wed, 25 Aug 2004 18:16:07 +1000 Subject: [Linux-cluster] 2 node vs 3 node cluster In-Reply-To: <1093419214.17936.313.camel@akira2.nro.au.com> References: <1093419214.17936.313.camel@akira2.nro.au.com> Message-ID: <412C4AC7.5050801@ndietsch.com> Hello Adam, Adam Cassar wrote: >Hi Guys, > >What are the benefits of running a 3 node cluster, as only one node can >fail before bringing the entire cluster down? 
> >It appears that a single node cannot be a member of a cluster if the >other hosts are missing. I take it that is to prevent single nodes >splitting and making themselves independent clusters? > > I think something is missing in your understanding of cluster concepts in general and someone please correct me if I am wrong in my explanation. I am sure others can answer the linux-cluster specific attributes , however this is a matter of quorum (finding a majority view of the cluster) in general. It is important that in the case of failure, split-brain scenarios (as you pointed out) are avoided. If the number of nodes is even and each node has one vote, you face a problem. How this is resolved is implementation dependent, but the explanation below might help; In other clusters this is handled by allocating a device which both machines have access to. (Each machine has a vote, plus the device has a vote making an odd number of votes). When the machines lose sight of each other, they race to grab hold of the device and whoever gets it (using SCSI-3 reservations usually) gets to remain " in the cluster". The other node is "fenced off" from the disks containing the data, usually panics and then reboots, only being allowed back into the cluster once it can communicate with its peers. Quorum can also be handled by allocating a higher-number of votes to a specific node (I believe linux-cluster handles things this way from what I have read). So to answer your question. Having a three-node (or any odd number) cluster is ideal because it reduces the complexity of quorum issues. However, if all you need is the power of two nodes, properly configuring quorum (implementation dependent) can alleviate your problems. FYI, the notion of quorum is used in other scenarios such as the meta-databases in Solaris Volume Manager (formerly Sun Disksuite). I never really understood this one completely, but it does provide an example. I hope this helps, I am sure others will have different and better explanations for the linux-cluster specifics. For more general cluster information, I recommend Gregory Pfister's book "In Search of Clusters". Regards, Nathan Dietsch From xiaofeng.ling at intel.com Wed Aug 25 08:46:57 2004 From: xiaofeng.ling at intel.com (Ling, Xiaofeng) Date: Wed, 25 Aug 2004 16:46:57 +0800 Subject: [Linux-cluster] bug? mount hangs. Message-ID: <3ACA40606221794F80A5670F0AF15F84054B2BFD@pdsmsx403> >-----Original Message----- >From: linux-cluster-bounces at redhat.com >[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Patrick >Caulfield >Sent: 2004?8?25? 15:24 >To: Discussion of clustering software components including GFS >Subject: Re: [Linux-cluster] bug? mount hangs. > >On Wed, Aug 25, 2004 at 09:59:31AM +0800, Ling, Xiaofeng wrote: >> Hi, >> When I trying to setup GFS on two node. some times it >triggers the >> kdb and the mount hangs. >> Follow is the dmesg and config file. >> I use kernel 2.6.6up no preemption with kdb patch. two nodes are both >> DELL desktop with Intel P3 and P4 CPU. >> Is this a know issue? > >If it was preceded by a "can't bind to port 21064" message >then yes. It has been >fixed in CVS. It certainly looks like that bug. Yes, it is. Thanks. 
From lhh at redhat.com Wed Aug 25 13:19:53 2004 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 25 Aug 2004 09:19:53 -0400 Subject: [Linux-cluster] 2 node vs 3 node cluster In-Reply-To: <412C4AC7.5050801@ndietsch.com> References: <1093419214.17936.313.camel@akira2.nro.au.com> <412C4AC7.5050801@ndietsch.com> Message-ID: <1093439993.17698.37.camel@atlantis.boston.redhat.com> On Wed, 2004-08-25 at 18:16 +1000, Nathan Dietsch wrote: > Quorum can also be handled by allocating a higher-number of votes to a > specific node (I believe linux-cluster handles things this way from what > I have read). This is currently one way you can do it. I think you can also just put 'cman' into a '2-node' mode where it races to fence the other (no SCSI device needed). -- Lon From lhh at redhat.com Wed Aug 25 13:20:02 2004 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 25 Aug 2004 09:20:02 -0400 Subject: [Linux-cluster] 2 node vs 3 node cluster In-Reply-To: <412C4AC7.5050801@ndietsch.com> References: <1093419214.17936.313.camel@akira2.nro.au.com> <412C4AC7.5050801@ndietsch.com> Message-ID: <1093440002.17698.39.camel@atlantis.boston.redhat.com> On Wed, 2004-08-25 at 18:16 +1000, Nathan Dietsch wrote: > In other clusters this is handled by allocating a device which both > machines have access to. (Each machine has a vote, plus the device has a > vote making an odd number of votes). > When the machines lose sight of each other, they race to grab hold of > the device and whoever gets it (using SCSI-3 reservations usually) gets > to remain " in the cluster". The other node is "fenced off" from the > disks containing the data, usually panics and then reboots, only being > allowed back into the cluster once it can communicate with its peers. Similar to the above is the use of a disk-based membership+quorum model ("it which is writing to the disk is a member and is in the quorum"). This works well in the 2-node case, but doesn't ensure network connectivity, and isn't terribly scalable. One can also use a disk-based membership as a backup to network membership (e.g. membership determined over network; only in the event of a potential split brain is the disk checked), but again, this requires that each node be accessing the disk. Both of the above allow continued concurrent access from all nodes to shared partitions on a single device - but require allocation of space on shared devices for the membership/quorum data. Another popular method of fixing the split-brain in even-node cases is adding a dummy vote to a router or something which responds to ICMP_ECHO ;) Again, similar to Nathan's example, these models require fencing to ensure data integrity. To be precise, "split brain" in data-sharing clusters is typically equated to "data corruption". -- Lon From teigland at redhat.com Wed Aug 25 13:46:27 2004 From: teigland at redhat.com (David Teigland) Date: Wed, 25 Aug 2004 21:46:27 +0800 Subject: [Linux-cluster] 2 node vs 3 node cluster In-Reply-To: <1093419214.17936.313.camel@akira2.nro.au.com> References: <1093419214.17936.313.camel@akira2.nro.au.com> Message-ID: <20040825134627.GB16586@redhat.com> On Wed, Aug 25, 2004 at 05:33:34PM +1000, Adam Cassar wrote: > Hi Guys, > > What are the benefits of running a 3 node cluster, as only one node can > fail before bringing the entire cluster down? > > It appears that a single node cannot be a member of a cluster if the > other hosts are missing. I take it that is to prevent single nodes > splitting and making themselves independent clusters? You're right. 
Both 2 and 3 node clusters can tolerate the failure of 1 node. If 2 nodes fail, a 2 node cluster would obviously be out of commission, while the single remaining node in a 3 node cluster would be stalled. So, there's no advantage to a 3 node cluster in that sense. This is assuming all nodes have the default 1 vote -- probably the most sensible configuration. In another sense, having one remaining node in a 3 node cluster would make bringing things back up nicer after the failures. All you'd need is one failed node to join the cluster again to make the cluster quorate and allow the stalled node to continue running. Another option for the expert user: if you know the two failed nodes have been reset (or detached from storage) you could manually reduce expected votes to allow the stalled node to continue running. This is dangerous, of course, unless you know what you're doing. -- Dave Teigland From pcaulfie at redhat.com Wed Aug 25 14:46:51 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Wed, 25 Aug 2004 15:46:51 +0100 Subject: [Linux-cluster] what does this mean? In-Reply-To: <1093410448.17936.232.camel@akira2.nro.au.com> References: <1093410448.17936.232.camel@akira2.nro.au.com> Message-ID: <20040825144651.GA20829@tykepenguin.com> On Wed, Aug 25, 2004 at 03:07:28PM +1000, Adam Cassar wrote: > (against latest gfs in cvs) > > scenario: > > - 3 machines in cluster > - one importing gnbd, two directly mounted to shared fc raid > > * all 3 performing io > * reboot one machine, all of the machines hung on io > > attempted to get rebooted machine to join the cluster, one machine spat > out the following: > > > kernel BUG at /usr/src/GFS/cluster/cman-kernel/src/membership.c:611! Hmmm, it means a bug I was sure had been fixed, hasn't :-( Was there anything interesting on the other two nodes ? -- patrick From anton at hq.310.ru Wed Aug 25 16:18:47 2004 From: anton at hq.310.ru (=?Windows-1251?B?wO3y7u0gzeX17vDu+Oj1?=) Date: Wed, 25 Aug 2004 20:18:47 +0400 Subject: [Linux-cluster] can't open cluster socket: Socket type not supported Message-ID: <1114844837.20040825201847@hq.310.ru> Hi all, After update from cvs i have problem with gfs # cman_tool join can't open cluster socket: Socket type not supported # uname -a Linux c5.310.ru 2.6.8.1 #19 SMP Wed Aug 25 20:15:23 MSD 2004 i686 i686 i386 GNU/Linux What could the problem be? -- e-mail: anton at hq.310.ru From jens.dreger at physik.fu-berlin.de Wed Aug 25 16:33:00 2004 From: jens.dreger at physik.fu-berlin.de (Jens Dreger) Date: Wed, 25 Aug 2004 18:33:00 +0200 Subject: [Linux-cluster] can't open cluster socket: Socket type not supported In-Reply-To: <1114844837.20040825201847@hq.310.ru> References: <1114844837.20040825201847@hq.310.ru> Message-ID: <20040825163300.GC12982@smart.physik.fu-berlin.de> On Wed, Aug 25, 2004 at 08:18:47PM +0400, Anton Nekhoroshikh wrote: > Hi all, > > After update from cvs i have problem with gfs > > # cman_tool join > can't open cluster socket: Socket type not supported I had a similiar problem when trying to start clvmd and could track it back to AF_CLUSTER being defined differently in cluser/cman-kernel/src/cnxman-socket.h (30) and LVM2/daemons/clvmd/cnxman-socket.h (31) This might be related. HTH, Jens.
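To illustrate the point, here is a minimal sketch of why that kind of header skew shows up as a failed cluster socket. The 30/31 values are the ones from the two headers named above; the SOCK_DGRAM type and 0 protocol are just placeholders, not the real cman constants:

    #include <stdio.h>
    #include <sys/socket.h>

    /* Userspace has to ask for the same address family number that the
     * cman kernel module registered: the cman-kernel header says 30,
     * while the stale copy shipped in the LVM2 tree still said 31. */
    #define AF_CLUSTER 30

    int main(void)
    {
            /* placeholder type/protocol; the point is only that a wrong
             * AF_CLUSTER number makes socket() fail straight away */
            int fd = socket(AF_CLUSTER, SOCK_DGRAM, 0);
            if (fd < 0)
                    perror("can't open cluster socket");
            return fd < 0 ? 1 : 0;
    }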
From jeff at intersystems.com Wed Aug 25 16:35:10 2004 From: jeff at intersystems.com (Jeff) Date: Wed, 25 Aug 2004 12:35:10 -0400 Subject: [Linux-cluster] can't open cluster socket: Socket type not supported In-Reply-To: <20040825163300.GC12982@smart.physik.fu-berlin.de> References: <1114844837.20040825201847@hq.310.ru> <20040825163300.GC12982@smart.physik.fu-berlin.de> Message-ID: <2910736404.20040825123510@intersystems.com> Wednesday, August 25, 2004, 12:33:00 PM, Jens Dreger wrote: > On Wed, Aug 25, 2004 at 08:18:47PM +0400, ????? ????????? wrote: >> Hi all, >> >> After update from cvs i have problem with gfs >> >> # cman_tool join >> can't open cluster socket: Socket type not supported > I had a similiar problem when trying to start clvmd and could track it > back to AF_CLUSTER being defined differently in > cluser/cman-kernel/src/cnxman-socket.h (30) > and > LVM2/daemons/clvmd/cnxman-socket.h (31) > This might be related. > HTH, > Jens. > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster Try 'modprobe dlm' after ccsd, before cman_tool join. From danderso at redhat.com Wed Aug 25 17:53:29 2004 From: danderso at redhat.com (Derek Anderson) Date: Wed, 25 Aug 2004 12:53:29 -0500 Subject: [Linux-cluster] can't open cluster socket: Socket type not supported In-Reply-To: <20040825163300.GC12982@smart.physik.fu-berlin.de> References: <1114844837.20040825201847@hq.310.ru> <20040825163300.GC12982@smart.physik.fu-berlin.de> Message-ID: <200408251253.29390.danderso@redhat.com> http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=127019 AF_CLUSTER should be 30 in the latest versions of the cluster and LVM2 tree. On Wednesday 25 August 2004 11:33, Jens Dreger wrote: > On Wed, Aug 25, 2004 at 08:18:47PM +0400, ????? ????????? wrote: > > Hi all, > > > > After update from cvs i have problem with gfs > > > > # cman_tool join > > can't open cluster socket: Socket type not supported > > I had a similiar problem when trying to start clvmd and could track it > back to AF_CLUSTER being defined differently in > > cluser/cman-kernel/src/cnxman-socket.h (30) > > and > > LVM2/daemons/clvmd/cnxman-socket.h (31) > > This might be related. > > HTH, > > Jens. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From adam.cassar at netregistry.com.au Wed Aug 25 22:40:49 2004 From: adam.cassar at netregistry.com.au (Adam Cassar) Date: Thu, 26 Aug 2004 08:40:49 +1000 Subject: [Linux-cluster] what does this mean? In-Reply-To: <20040825144651.GA20829@tykepenguin.com> References: <1093410448.17936.232.camel@akira2.nro.au.com> <20040825144651.GA20829@tykepenguin.com> Message-ID: <1093473649.2330.2.camel@akira2.nro.au.com> If by interesting you mean one being a NFS server, then yes. Would this matter? On Thu, 2004-08-26 at 00:46, Patrick Caulfield wrote: > On Wed, Aug 25, 2004 at 03:07:28PM +1000, Adam Cassar wrote: > > (against latest gfs in cvs) > > > > scenario: > > > > - 3 machines in cluster > > - one importing gnbd, two directly mounted to shared fc raid > > > > * all 3 performing io > > * reboot one machine, all of the machines hung on io > > > > attempted to get rebooted machine to join the cluster, one machine spat > > out the following: > > > > > > kernel BUG at /usr/src/GFS/cluster/cman-kernel/src/membership.c:611! > > Hmmm, it means a bug I was sure had been fixed, hasn't :-( > > Was there anything interesting on the other two nodes ? 
From adam.cassar at netregistry.com.au Wed Aug 25 22:42:42 2004 From: adam.cassar at netregistry.com.au (Adam Cassar) Date: Thu, 26 Aug 2004 08:42:42 +1000 Subject: [Linux-cluster] fsck time Message-ID: <1093473762.2330.5.camel@akira2.nro.au.com> Using the latest CVS code I attempted to run fsck on an 800G GFS partition (of which about 200 meg was used). The fsck took: Pass 7: done (3:53:57) gfs_fsck Complete (3:54:41). Is this time to be expected? From jbrassow at redhat.com Thu Aug 26 01:18:37 2004 From: jbrassow at redhat.com (Jonathan E Brassow) Date: Wed, 25 Aug 2004 20:18:37 -0500 Subject: [Linux-cluster] fsck time In-Reply-To: <1093473762.2330.5.camel@akira2.nro.au.com> References: <1093473762.2330.5.camel@akira2.nro.au.com> Message-ID: Seems a bit high... but it's possible. The fsck, while very thorough and functionally correct, is not very well optimized. Pass 7 checks for block conflicts. That is, if a file (or dir) has a block in common with another file (or dir). Each pass is designed to check something different. So, depending on the access patterns, some passes will take longer than others. Your numbers seem high - especially for the amount of space you are actually using. Part of that time is chewed up looking over the portions of the file system that are not used... If you're just playing around, you may wish to see how the fsck does if the fs is smaller but the contents are the same. brassow On Aug 25, 2004, at 5:42 PM, Adam Cassar wrote: > > Using the latest CVS code I attempted to run fsck on an 800G GFS > partition (of which about 200 meg was used). The fsck took: > > Pass 7: done (3:53:57) > gfs_fsck Complete (3:54:41). > > Is this time to be expected? > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > From adam.cassar at netregistry.com.au Thu Aug 26 07:09:37 2004 From: adam.cassar at netregistry.com.au (Adam Cassar) Date: Thu, 26 Aug 2004 17:09:37 +1000 Subject: [Linux-cluster] kernel oops Message-ID: <1093504177.2330.167.camel@akira2.nro.au.com> I received the following trying to unmount a GFS partition. I tried to unmount a GFS partition shared between three nodes and it hung. I discovered that one of the nodes had become unresponsive so I manually ACKED the fence request and attempted to unmount. 
The following occurred: Unable to handle kernel paging request at virtual address 001dae44 printing eip: f88cda59 *pde = 00000000 Oops: 0000 [#1] SMP Modules linked in: lock_dlm dlm cman gfs lock_harness 8250 serial_core dm_mod CPU: 0 EIP: 0060:[] Not tainted EFLAGS: 00010286 (2.6.8.1) EIP is at name_to_directory_nodeid+0x15/0xf9 [dlm] eax: 001dae00 ebx: e8dc304c ecx: c1b5ae3c edx: e8dc304c esi: 00000000 edi: 001dae00 ebp: e8dc304c esp: f7297ec0 ds: 007b es: 007b ss: 0068 Process dlm_recoverd (pid: 859, threadinfo=f7296000 task=f7125930) Stack: f706c000 f706c000 c1b5ae00 00000000 f88db235 00000000 e8dc304c c1b5ae00 c1b5aef0 e8dc304c f88cdb5e 001dae00 e8dc30c5 00000018 f88dc727 e8dc304c 00000003 00000003 f706c000 00000000 001dae00 e8dc304c e8dc304c c1b5ae00 Call Trace: [] rcom_send_message+0xe1/0x217 [dlm] [] get_directory_nodeid+0x21/0x25 [dlm] [] rsb_master_lookup+0x1a/0x126 [dlm] [] restbl_rsb_update+0x142/0x165 [dlm] [] ls_reconfig+0xd5/0x220 [dlm] [] dlm_recoverd+0x0/0x66 [dlm] [] do_ls_recovery+0x16c/0x444 [dlm] [] dlm_recoverd+0x4c/0x66 [dlm] [] kthread+0xb7/0xbd [] kthread+0x0/0xbd [] kernel_thread_helper+0x5/0xb Code: 83 7f 44 01 74 65 8b 44 24 34 89 44 24 04 8b 44 24 30 89 04 Unable to handle kernel paging request at virtual address 001dae44 f88cda59 *pde = 00000000 Oops: 0000 [#1] CPU: 0 EIP: 0060:[] Not tainted Using defaults from ksymoops -t elf32-i386 -a i386 EFLAGS: 00010286 (2.6.8.1) eax: 001dae00 ebx: e8dc304c ecx: c1b5ae3c edx: e8dc304c esi: 00000000 edi: 001dae00 ebp: e8dc304c esp: f7297ec0 ds: 007b es: 007b ss: 0068 Stack: f706c000 f706c000 c1b5ae00 00000000 f88db235 00000000 e8dc304c c1b5ae00 c1b5aef0 e8dc304c f88cdb5e 001dae00 e8dc30c5 00000018 f88dc727 e8dc304c 00000003 00000003 f706c000 00000000 001dae00 e8dc304c e8dc304c c1b5ae00 [] rcom_send_message+0xe1/0x217 [dlm] [] get_directory_nodeid+0x21/0x25 [dlm] [] rsb_master_lookup+0x1a/0x126 [dlm] [] restbl_rsb_update+0x142/0x165 [dlm] [] ls_reconfig+0xd5/0x220 [dlm] [] dlm_recoverd+0x0/0x66 [dlm] [] do_ls_recovery+0x16c/0x444 [dlm] [] dlm_recoverd+0x4c/0x66 [dlm] [] kthread+0xb7/0xbd [] kthread+0x0/0xbd [] kernel_thread_helper+0x5/0xb Code: 83 7f 44 01 74 65 8b 44 24 34 89 44 24 04 8b 44 24 30 89 04 >>EIP; f88cda59 <===== >>eax; 001dae00 Before first symbol >>ebx; e8dc304c >>ecx; c1b5ae3c >>edx; e8dc304c >>edi; 001dae00 Before first symbol >>ebp; e8dc304c >>esp; f7297ec0 Code; f88cda59 00000000 <_EIP>: Code; f88cda59 <===== 0: 83 7f 44 01 cmpl $0x1,0x44(%edi) <===== Code; f88cda5d 4: 74 65 je 6b <_EIP+0x6b> f88cdac4 Code; f88cda5f 6: 8b 44 24 34 mov 0x34(%esp,1),%eax Code; f88cda63 a: 89 44 24 04 mov %eax,0x4(%esp,1) Code; f88cda67 e: 8b 44 24 30 mov 0x30(%esp,1),%eax Code; f88cda6b 12: 89 04 00 mov %eax,(%eax,%eax,1) From adam.cassar at netregistry.com.au Thu Aug 26 07:19:21 2004 From: adam.cassar at netregistry.com.au (Adam Cassar) Date: Thu, 26 Aug 2004 17:19:21 +1000 Subject: [Linux-cluster] kernel oops In-Reply-To: <1093504177.2330.167.camel@akira2.nro.au.com> References: <1093504177.2330.167.camel@akira2.nro.au.com> Message-ID: <1093504760.12947.0.camel@akira2.nro.au.com> I also got quite a few of these: dlm: dude: restbl_rsb_update_recv rsb not found 2447 dlm: dude: restbl_rsb_update_recv rsb not found 2448 dlm: dude: restbl_rsb_update_recv rsb not found 2449 dlm: dude: restbl_rsb_update_recv rsb not found 2450 dlm: dude: restbl_rsb_update_recv rsb not found 2451 dlm: dude: restbl_rsb_update_recv rsb not found 2452 dlm: dude: restbl_rsb_update_recv rsb not found 2453 dlm: dude: restbl_rsb_update_recv rsb not 
found 2454 dlm: dude: restbl_rsb_update_recv rsb not found 2455 dlm: dude: restbl_rsb_update_recv rsb not found 2456 dlm: dude: restbl_rsb_update_recv rsb not found 2457 dlm: dude: restbl_rsb_update_recv rsb not found 2458 dlm: dude: restbl_rsb_update_recv rsb not found 2459 dlm: dude: restbl_rsb_update_recv rsb not found 2460 dlm: dude: restbl_rsb_update_recv rsb not found 2461 dlm: dude: restbl_rsb_update_recv rsb not found 2462 dlm: dude: restbl_rsb_update_recv rsb not found 2463 dlm: dude: restbl_rsb_update_recv rsb not found 2464 dlm: dude: restbl_rsb_update_recv rsb not found 2465 dlm: dude: restbl_rsb_update_recv rsb not found 2466 On Thu, 2004-08-26 at 17:09, Adam Cassar wrote: > I received the following trying to unmount a GFS partition. > > I tried to unmount a GFS partition shared between three nodes and it > hung. > > I discovered that one of the nodes had become unresponsive so I manually > ACKED the fence request and attempted to unmount. The following > occurred: > > Unable to handle kernel paging request at virtual address 001dae44 > printing eip: > f88cda59 > *pde = 00000000 > Oops: 0000 [#1] > SMP > Modules linked in: lock_dlm dlm cman gfs lock_harness 8250 serial_core > dm_mod > CPU: 0 > EIP: 0060:[] Not tainted > EFLAGS: 00010286 (2.6.8.1) > EIP is at name_to_directory_nodeid+0x15/0xf9 [dlm] > eax: 001dae00 ebx: e8dc304c ecx: c1b5ae3c edx: e8dc304c > esi: 00000000 edi: 001dae00 ebp: e8dc304c esp: f7297ec0 > ds: 007b es: 007b ss: 0068 > Process dlm_recoverd (pid: 859, threadinfo=f7296000 task=f7125930) > Stack: f706c000 f706c000 c1b5ae00 00000000 f88db235 00000000 e8dc304c > c1b5ae00 > c1b5aef0 e8dc304c f88cdb5e 001dae00 e8dc30c5 00000018 f88dc727 > e8dc304c > 00000003 00000003 f706c000 00000000 001dae00 e8dc304c e8dc304c > c1b5ae00 > Call Trace: > [] rcom_send_message+0xe1/0x217 [dlm] > [] get_directory_nodeid+0x21/0x25 [dlm] > [] rsb_master_lookup+0x1a/0x126 [dlm] > [] restbl_rsb_update+0x142/0x165 [dlm] > [] ls_reconfig+0xd5/0x220 [dlm] > [] dlm_recoverd+0x0/0x66 [dlm] > [] do_ls_recovery+0x16c/0x444 [dlm] > [] dlm_recoverd+0x4c/0x66 [dlm] > [] kthread+0xb7/0xbd > [] kthread+0x0/0xbd > [] kernel_thread_helper+0x5/0xb > Code: 83 7f 44 01 74 65 8b 44 24 34 89 44 24 04 8b 44 24 30 89 04 > Unable to handle kernel paging request at virtual address 001dae44 > f88cda59 > *pde = 00000000 > Oops: 0000 [#1] > CPU: 0 > EIP: 0060:[] Not tainted > Using defaults from ksymoops -t elf32-i386 -a i386 > EFLAGS: 00010286 (2.6.8.1) > eax: 001dae00 ebx: e8dc304c ecx: c1b5ae3c edx: e8dc304c > esi: 00000000 edi: 001dae00 ebp: e8dc304c esp: f7297ec0 > ds: 007b es: 007b ss: 0068 > Stack: f706c000 f706c000 c1b5ae00 00000000 f88db235 00000000 e8dc304c > c1b5ae00 > c1b5aef0 e8dc304c f88cdb5e 001dae00 e8dc30c5 00000018 f88dc727 > e8dc304c > 00000003 00000003 f706c000 00000000 001dae00 e8dc304c e8dc304c > c1b5ae00 > [] rcom_send_message+0xe1/0x217 [dlm] > [] get_directory_nodeid+0x21/0x25 [dlm] > [] rsb_master_lookup+0x1a/0x126 [dlm] > [] restbl_rsb_update+0x142/0x165 [dlm] > [] ls_reconfig+0xd5/0x220 [dlm] > [] dlm_recoverd+0x0/0x66 [dlm] > [] do_ls_recovery+0x16c/0x444 [dlm] > [] dlm_recoverd+0x4c/0x66 [dlm] > [] kthread+0xb7/0xbd > [] kthread+0x0/0xbd > [] kernel_thread_helper+0x5/0xb > Code: 83 7f 44 01 74 65 8b 44 24 34 89 44 24 04 8b 44 24 30 89 04 > > > >>EIP; f88cda59 <===== > > >>eax; 001dae00 Before first symbol > >>ebx; e8dc304c > >>ecx; c1b5ae3c > >>edx; e8dc304c > >>edi; 001dae00 Before first symbol > >>ebp; e8dc304c > >>esp; f7297ec0 > > Code; f88cda59 > 00000000 <_EIP>: > Code; 
f88cda59 <===== > 0: 83 7f 44 01 cmpl $0x1,0x44(%edi) <===== > Code; f88cda5d > 4: 74 65 je 6b <_EIP+0x6b> f88cdac4 > > Code; f88cda5f > 6: 8b 44 24 34 mov 0x34(%esp,1),%eax > Code; f88cda63 > a: 89 44 24 04 mov %eax,0x4(%esp,1) > Code; f88cda67 > e: 8b 44 24 30 mov 0x30(%esp,1),%eax > Code; f88cda6b > 12: 89 04 00 mov %eax,(%eax,%eax,1) > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- Adam Cassar IT Manager NetRegistry Pty Ltd ______________________________________________ http://www.netregistry.com.au Tel: 02 9699 6099 Fax: 02 9699 6088 PO Box 270 Broadway NSW 2007 Domains |Business Email|Web Hosting|E-Commerce Trusted by 10,000s of businesses since 1997 ______________________________________________ From jopet at staff.spray.se Thu Aug 26 08:25:59 2004 From: jopet at staff.spray.se (Johan Pettersson) Date: Thu, 26 Aug 2004 10:25:59 +0200 Subject: [Linux-cluster] Installation problem Message-ID: <1093508759.31675.35.camel@zombie.i.spray.se> Hello! I'm trying to install (http://gfs.wikidev.net/Installation) gfs and clvm, but have some problem. Checked out sources for `device-mapper', `lvm2' and `cluster' yesterday. And have patched a vanilla-2.6.7 kernel with following patches: /cluster/cman-kernel/patches/2.6.8.1/*patch /cluster/dlm-kernel/patches/2.6.8.1/*patch /cluster/gfs-kernel/patches/2.6.8.1/*patch /cluster/gnbd-kernel/patches/2.6.7/*patch I would like to use a vanilla-2.6.8.1 kernel, but I guess `gndb.kernel' wouldn't work then!? cd cluster; ./configure --kernel_src=/build/linux-2.6.7; make cd cman-kernel && make install make[1]: Entering directory `/build/cluster/cman-kernel' cd src && make install make[2]: Entering directory `/build/cluster/cman-kernel/src' rm -f cluster ln -s . cluster make -C /build/linux-2.6.7 M=/home/jopet/build/cluster/cman-kernel/src modules USING_KBUILD=yes make[3]: Entering directory `/build/linux-2.6.7' CC [M] /build/cluster/cman-kernel/src/cnxman.o /build/cluster/cman-kernel/src/cnxman.c: In function `send_to_userport': /build/cluster/cman-kernel/src/cnxman.c:796: error: `MSG_REPLYEXP' undeclared (first use in this function) /build/cluster/cman-kernel/src/cnxman.c:796: error: (Each undeclared identifier is reported only once /build/cluster/cman-kernel/src/cnxman.c:796: error: for each function it appears in.) /build/cluster/cman-kernel/src/cnxman.c: In function `__sendmsg': /build/cluster/cman-kernel/src/cnxman.c:2242: error: `MSG_REPLYEXP' undeclared (first use in this function) /build/cluster/cman-kernel/src/cnxman.c: In function `send_listen_request': /build/cluster/cman-kernel/src/cnxman.c:2464: error: `MSG_REPLYEXP' undeclared (first use in this function) make[4]: *** [/build/cluster/cman-kernel/src/cnxman.o] Error 1 make[3]: *** [_module_/build/cluster/cman-kernel/src] Error 2 make[3]: Leaving directory `/build/linux-2.6.7' make[2]: *** [all] Error 2 make[2]: Leaving directory `/build/cluster/cman-kernel/src' make[1]: *** [install] Error 2 make[1]: Leaving directory `/build/cluster/cman-kernel' make: *** [install] Error 2 Thx /Johan -- In disk space, nobody can hear your files scream. 
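A quick way to check which copies of the cman headers a build is actually picking up -- the paths below are only the ones that appear in the output above plus the usual installed location /usr/include/cluster, so adjust them to your tree:

    # copies of the cluster headers that do not define MSG_REPLYEXP are stale
    grep -rl MSG_REPLYEXP /usr/include/cluster \
        /build/linux-2.6.7/include/cluster \
        /build/cluster/cman-kernel/src 2>/dev/null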
From pcaulfie at redhat.com Thu Aug 26 08:34:26 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Thu, 26 Aug 2004 09:34:26 +0100 Subject: [Linux-cluster] Installation problem In-Reply-To: <1093508759.31675.35.camel@zombie.i.spray.se> References: <1093508759.31675.35.camel@zombie.i.spray.se> Message-ID: <20040826083426.GB6682@tykepenguin.com> On Thu, Aug 26, 2004 at 10:25:59AM +0200, Johan Pettersson wrote: > Hello! > > make[3]: Entering directory `/build/linux-2.6.7' > CC [M] /build/cluster/cman-kernel/src/cnxman.o > /build/cluster/cman-kernel/src/cnxman.c: In function `send_to_userport': > /build/cluster/cman-kernel/src/cnxman.c:796: error: `MSG_REPLYEXP' > undeclared (first use in this function) > /build/cluster/cman-kernel/src/cnxman.c:796: error: (Each undeclared > identifier is reported only once > /build/cluster/cman-kernel/src/cnxman.c:796: error: for each function it > appears in.) > /build/cluster/cman-kernel/src/cnxman.c: In function `__sendmsg': > /build/cluster/cman-kernel/src/cnxman.c:2242: error: `MSG_REPLYEXP' > undeclared (first use in this function) > /build/cluster/cman-kernel/src/cnxman.c: In function > `send_listen_request': > /build/cluster/cman-kernel/src/cnxman.c:2464: error: `MSG_REPLYEXP' > undeclared (first use in this function) it looks like you might have an old version of cnxman.h lying around, maybe in /usr/include/cluster -- patrick From anton at hq.310.ru Thu Aug 26 08:40:09 2004 From: anton at hq.310.ru (=?Windows-1251?B?wO3y7u0gzeX17vDu+Oj1?=) Date: Thu, 26 Aug 2004 12:40:09 +0400 Subject: [Linux-cluster] kernel: lock_dlm: init_fence error -1 Message-ID: <05345590.20040826124009@hq.310.ru> Hi all, After update gfs from cvs i have problem with mount gfs i see in messages kernel: lock_dlm: init_fence error -1 kernel: GFS: can't mount proto = lock_dlm, table = 310farm:gfs01, hostdata = before mount "cman_tool join" and "fence_tool join" loaded without error In what there can be a problem? -- e-mail: anton at hq.310.ru From anton at hq.310.ru Thu Aug 26 09:22:50 2004 From: anton at hq.310.ru (=?Windows-1251?B?wO3y7u0gzeX17vDu+Oj1?=) Date: Thu, 26 Aug 2004 13:22:50 +0400 Subject: [Linux-cluster] fence_sanbox2 Message-ID: <1445867274.20040826132250@hq.310.ru> Hi all, Have probably forgotten :) *** cluster/fence/bin/Makefile.orig 2004-08-26 13:23:07.084421744 +0400 --- cluster/fence/bin/Makefile 2004-08-26 13:23:02.482121400 +0400 *************** *** 31,36 **** --- 31,37 ---- fence_wti \ fence_xcat \ fence_zvm \ + fence_sanbox2 \ fenced -- e-mail: anton at hq.310.ru From jopet at staff.spray.se Thu Aug 26 09:58:19 2004 From: jopet at staff.spray.se (Johan Pettersson) Date: Thu, 26 Aug 2004 11:58:19 +0200 Subject: [Linux-cluster] Installation problem In-Reply-To: <20040826083426.GB6682@tykepenguin.com> References: <1093508759.31675.35.camel@zombie.i.spray.se> <20040826083426.GB6682@tykepenguin.com> Message-ID: <1093514299.31675.44.camel@zombie.i.spray.se> On Thu, 2004-08-26 at 10:34, Patrick Caulfield wrote: > On Thu, Aug 26, 2004 at 10:25:59AM +0200, Johan Pettersson wrote: > > Hello! 
> > > > make[3]: Entering directory `/build/linux-2.6.7' > > CC [M] /build/cluster/cman-kernel/src/cnxman.o > > /build/cluster/cman-kernel/src/cnxman.c: In function `send_to_userport': > > /build/cluster/cman-kernel/src/cnxman.c:796: error: `MSG_REPLYEXP' > > undeclared (first use in this function) > > /build/cluster/cman-kernel/src/cnxman.c:796: error: (Each undeclared > > identifier is reported only once > > /build/cluster/cman-kernel/src/cnxman.c:796: error: for each function it > > appears in.) > > /build/cluster/cman-kernel/src/cnxman.c: In function `__sendmsg': > > /build/cluster/cman-kernel/src/cnxman.c:2242: error: `MSG_REPLYEXP' > > undeclared (first use in this function) > > /build/cluster/cman-kernel/src/cnxman.c: In function > > `send_listen_request': > > /build/cluster/cman-kernel/src/cnxman.c:2464: error: `MSG_REPLYEXP' > > undeclared (first use in this function) > > > it looks like you might have an old version of cnxman.h lying around, maybe in > /usr/include/cluster I have only 2 cnxman.h in the system and they do not differ =/ build/cluster/cman-kernel/src/cnxman.h build/linux-2.6.7/include/cluster/cnxman.h /J -- In disk space, nobody can hear your files scream. From pcaulfie at redhat.com Thu Aug 26 10:21:19 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Thu, 26 Aug 2004 11:21:19 +0100 Subject: [Linux-cluster] Installation problem In-Reply-To: <1093514299.31675.44.camel@zombie.i.spray.se> References: <1093508759.31675.35.camel@zombie.i.spray.se> <20040826083426.GB6682@tykepenguin.com> <1093514299.31675.44.camel@zombie.i.spray.se> Message-ID: <20040826102119.GA9523@tykepenguin.com> On Thu, Aug 26, 2004 at 11:58:19AM +0200, Johan Pettersson wrote: > > > I have only 2 cnxman.h in the system and they do not differ =/ > > build/cluster/cman-kernel/src/cnxman.h > build/linux-2.6.7/include/cluster/cnxman.h Sorry that should have been cnxman-socket.h patrick From yjcho at cs.hongik.ac.kr Thu Aug 26 10:32:01 2004 From: yjcho at cs.hongik.ac.kr (Cho Yool Je) Date: Thu, 26 Aug 2004 19:32:01 +0900 Subject: [Linux-cluster] cluster.conf Message-ID: <412DBC21.8080502@cs.hongik.ac.kr> hi..everybody.. i have three machines.... i will use one server & two client2 with firewire for GFS but...i can't creat a cluster.conf.. plz show me related cluster.conf... thx... From amanthei at redhat.com Thu Aug 26 13:27:22 2004 From: amanthei at redhat.com (Adam Manthei) Date: Thu, 26 Aug 2004 08:27:22 -0500 Subject: [Linux-cluster] fence_sanbox2 In-Reply-To: <1445867274.20040826132250@hq.310.ru> References: <1445867274.20040826132250@hq.310.ru> Message-ID: <20040826132722.GB20552@redhat.com> On Thu, Aug 26, 2004 at 01:22:50PM +0400, Anton Nekhoroshikh wrote: > Hi all, > > > Have probably forgotten :) Indeed. I also forgot to add the fence_ibmblade agent too. Thanks for the catch.
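For reference, the ibmblade agent needs the same kind of one-line addition to fence/bin/Makefile as the sanbox2 hunk quoted below. This is only a sketch of the idea, not the actual commit:

    --- cluster/fence/bin/Makefile
    +++ cluster/fence/bin/Makefile
    @@ ... @@
         fence_xcat \
         fence_zvm \
    +    fence_ibmblade \
    +    fence_sanbox2 \
         fenced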
> > *** cluster/fence/bin/Makefile.orig 2004-08-26 13:23:07.084421744 +0400 > --- cluster/fence/bin/Makefile 2004-08-26 13:23:02.482121400 +0400 > *************** > *** 31,36 **** > --- 31,37 ---- > fence_wti \ > fence_xcat \ > fence_zvm \ > + fence_sanbox2 \ > fenced > > -- > e-mail: anton at hq.310.ru > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- Adam Manthei From jbrassow at redhat.com Thu Aug 26 14:26:24 2004 From: jbrassow at redhat.com (Jonathan E Brassow) Date: Thu, 26 Aug 2004 09:26:24 -0500 Subject: [Linux-cluster] cluster.conf In-Reply-To: <412DBC21.8080502@cs.hongik.ac.kr> References: <412DBC21.8080502@cs.hongik.ac.kr> Message-ID: man 5 cluster.conf if using dlm, also man 5 cman if using gulm in place of dlm, read man 5 lock_gulmd brassow On Aug 26, 2004, at 5:32 AM, Cho Yool Je wrote: > hi..everybody.. > > i have three machines.... > i will use one server & two client2 with firewire for GFS > > but...i can't creat a cluster.conf.. > plz show me related cluster.conf... > > thx... > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > From iisaman at citi.umich.edu Thu Aug 26 15:42:30 2004 From: iisaman at citi.umich.edu (Fredric Isaman) Date: Thu, 26 Aug 2004 11:42:30 -0400 (EDT) Subject: [Linux-cluster] Compile problem - kernel patches updated? In-Reply-To: <20040824160026.6BFC873D58@hormel.redhat.com> References: <20040824160026.6BFC873D58@hormel.redhat.com> Message-ID: I am trying to compile from a CVS download taken on Aug 22. (Using linux kernel 2.6.7) After patching the kernel, I try to compile in the /cluster using 'configure --kernel_src=/path/to/kernel; make install'. I get the following error: CC [M] /nfs/iisaman/gfs/cvs20040822/cluster/dlm-kernel/src/queries.o /nfs/iisaman/gfs/cvs20040822/cluster/dlm-kernel/src/queries.c: In function `remote_query': /nfs/iisaman/gfs/cvs20040822/cluster/dlm-kernel/src/queries.c:338: error: structure has no member named `lki_ownpid' The compile seems to be using the kernel source includes, which do not match those in the cluster directories. Do I need to do a diff and create my own patches, or am I doing something wrong? For example: > diff -u dlm-kernel/src/dlm.h $KERNELSRC/include/cluster/dlm.h @@ -241,7 +241,6 @@ int lki_mstlkid; /* Lock ID on master node */ int lki_parent; int lki_node; /* Originating node (not master) */ - int lki_ownpid; /* Owner pid on originating node */ uint8_t lki_state; /* Queue the lock is on */ uint8_t lki_grmode; /* Granted mode */ uint8_t lki_rqmode; /* Requested mode */ Thanks, Fred From tomc at teamics.com Thu Aug 26 16:32:37 2004 From: tomc at teamics.com (tomc at teamics.com) Date: Thu, 26 Aug 2004 10:32:37 -0600 Subject: [Linux-cluster] What is this GFS pipe doing here: Message-ID: in /tmp I found this pipe. It appears quite old. Any idea what it's for? prw------- 1 root root 0 May 22 06:18 fence.manual.fifo There is one on each of the GFS nodes except the current master. (Can I)/(Should I) delete it? tc From yjcho at cs.hongik.ac.kr Thu Aug 26 17:29:15 2004 From: yjcho at cs.hongik.ac.kr (Cho Yool Je) Date: Fri, 27 Aug 2004 02:29:15 +0900 Subject: [Linux-cluster] cluster.conf Message-ID: <412E1DEB.2030600@cs.hongik.ac.kr> thx a lot...but...i refereced "man 5 cluster.conf" & "man 5 lock_gulmd" some ago... (i'm using gulm...) when i excuted lock_gulmd, result is... 
============================================================================= [root at client1 root]# lock_gulmd I cannot find the name for ip "servers=client1". gf->node_cnt = 0 In src/config_main.c:332 (DEVEL.1093502033) death by: ASSERTION FAILED: gf->node_cnt > 0 && gf->node_cnt < 5 && gf->node_cnt != 2 I cannot find the name for ip "servers=client1". gf->node_cnt = 0 In src/config_main.c:332 (DEVEL.1093502033) death by: ASSERTION FAILED: gf->node_cnt > 0 && gf->node_cnt < 5 && gf->node_cnt != 2 [root at client1 root]# I cannot find the name for ip "servers=client1". gf->node_cnt = 0 In src/config_main.c:332 (DEVEL.1093502033) death by: ASSERTION FAILED: gf->node_cnt > 0 && gf->node_cnt < 5 && gf->node_cnt != 2 ============================================================================= my cluster.conf is ... ============================================================================= [root at client1 root]# cat /etc/cluster/cluster.conf client1 client2 ============================================================================ i registerd ip of client1 & client2 in /etc/hosts instead of DNS (and testing with only two nodes..) plz give me a advice... thx... -------------------------------------------- man 5 cluster.conf if using dlm, also man 5 cman if using gulm in place of dlm, read man 5 lock_gulmd brassow On Aug 26, 2004, at 5:32 AM, Cho Yool Je wrote: hi..everybody.. i have three machines.... i will use one server & two client2 with firewire for GFS but...i can't creat a cluster.conf.. plz show me related cluster.conf... thx... -- Linux-cluster mailing list Linux-cluster redhat com http://www.redhat.com/mailman/listinfo/linux-cluster From alewis at redhat.com Thu Aug 26 17:35:25 2004 From: alewis at redhat.com (AJ Lewis) Date: Thu, 26 Aug 2004 12:35:25 -0500 Subject: [Linux-cluster] cluster.conf In-Reply-To: <412E1DEB.2030600@cs.hongik.ac.kr> References: <412E1DEB.2030600@cs.hongik.ac.kr> Message-ID: <20040826173525.GC13272@null.msp.redhat.com> On Fri, Aug 27, 2004 at 02:29:15AM +0900, Cho Yool Je wrote: > thx a lot...but...i refereced "man 5 cluster.conf" & "man 5 lock_gulmd" > some ago... > (i'm using gulm...) > > when i excuted lock_gulmd, result is... > ============================================================================= > [root at client1 root]# lock_gulmd > I cannot find the name for ip "servers=client1". There was a bug introduced in ccs that has since been fixed - grab the latest cvs code and recompile, and it should work. > my cluster.conf is ... > ============================================================================= > [root at client1 root]# cat /etc/cluster/cluster.conf > > > > > client1 client2 > > > > > > > > > > > > > > > > > > > > > > > > ============================================================================ > > i registerd ip of client1 & client2 in /etc/hosts instead of DNS > (and testing with only two nodes..) > > plz give me a advice... > > thx... > > > > > > > -------------------------------------------- > man 5 cluster.conf > > if using dlm, also > > man 5 cman > > if using gulm in place of dlm, read > > man 5 lock_gulmd > > brassow > > On Aug 26, 2004, at 5:32 AM, Cho Yool Je wrote: > > hi..everybody.. > > i have three machines.... > i will use one server & two client2 with firewire for GFS > > > but...i can't creat a cluster.conf.. > plz show me related cluster.conf... > > > thx... 
> > -- > Linux-cluster mailing list > Linux-cluster redhat com > http://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- AJ Lewis Voice: 612-638-0500 Red Hat Inc. E-Mail: alewis at redhat.com 720 Washington Ave. SE, Suite 200 Minneapolis, MN 55414 Current GPG fingerprint = D9F8 EDCE 4242 855F A03D 9B63 F50C 54A8 578C 8715 Grab the key at: http://people.redhat.com/alewis/gpg.html or one of the many keyservers out there... -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From ben.m.cahill at intel.com Thu Aug 26 17:55:29 2004 From: ben.m.cahill at intel.com (Cahill, Ben M) Date: Thu, 26 Aug 2004 10:55:29 -0700 Subject: [Linux-cluster] Can anyone (Ken?) explain why num_glockd mount option is there? TIA. EOM. Message-ID: <0604335B7764D141945E202153105960033E2520@orsmsx404.amr.corp.intel.com> From john.l.villalovos at intel.com Thu Aug 26 18:20:04 2004 From: john.l.villalovos at intel.com (Villalovos, John L) Date: Thu, 26 Aug 2004 11:20:04 -0700 Subject: [Linux-cluster] Subversion? Message-ID: <60C14C611F1DDD4198D53F2F43D8CA3B01C7F989@orsmsx410> linux-cluster-bounces at redhat.com wrote: > On Tuesday 24 August 2004 10:48, Lon Hohberger wrote: >> It was _designed_ to handle distributed repositories (like BK). > > Well, what wind is blowing, seems to be blowing in the direction of > Arch. I'd be equally happy with either, and in any case, > much happier > than with CVS. Does anybody else have a strong opinion? I'd prefer to use Subversion. It works through our proxy servers. We already use it for some projects we connect to. I guess it depends on what you think the development methodology will be. If you think it will be this great big distributed development with tons of merging of people's patches from all over the place then probably something like Bitkeeper or GNU Arch. If you are going to stick with your centralized development model then CVS or Subversion is probably the way to go. Plus Subversion comes with Fedora Core 2 by default. Not sure about GNU Arch. The change from CVS to SVN (Subversion) is very very easy. I am not sure that we can say the same about going to GNU Arch. (Note: I have never used GNU Arch). Here is some articles on Arch versus Subversion: http://web.mit.edu/ghudson/thoughts/undiagnosing http://web.mit.edu/ghudson/thoughts/diagnosing http://www.reverberate.org/computers/ArchAndSVN.html John From kpreslan at redhat.com Thu Aug 26 18:48:39 2004 From: kpreslan at redhat.com (Ken Preslan) Date: Thu, 26 Aug 2004 13:48:39 -0500 Subject: [Linux-cluster] Can anyone (Ken?) explain why num_glockd mount option is there? TIA. EOM. In-Reply-To: <0604335B7764D141945E202153105960033E2520@orsmsx404.amr.corp.intel.com> References: <0604335B7764D141945E202153105960033E2520@orsmsx404.amr.corp.intel.com> Message-ID: <20040826184839.GA18435@potassium.msp.redhat.com> On Thu, Aug 26, 2004 at 10:55:29AM -0700, Cahill, Ben M wrote: > Can anyone (Ken?) explain why num_glockd mount > option is there? TIA. EOM. One of the things (probably the major thing) that GFS uses memory for that's different from other filesystems is locks. That introduces the interesting property that in order for GFS to free that memory, it has to do network I/O to unlock locks. (The same is true for memory that contains dirty disk blocks, but the VM knows about that.) 
This means that freeing memory can only happen so quickly. In the past, GFS had one thread (gfs_glockd) that would scan through the glock hash table looking for cached locks that were no longer needed. It would then unlock those locks and free the memory associated with them. But as it turned out, there were memory problems, if you had a few processes that were scanning through huge directory trees. You could get into a situation where you were acquiring new locks much faster than gfs_glockd could release old ones. (There were many threads acquiring, but only one thread releasing.) If this kept up, you'd soon run out of memory and bad things would happen. My first quick-n-dirty solution to this was to add the num_glockd mount option. It would create many many threads that would look for unused locks and unlock them. You could balance out acquiring and releasing processes if you knew what workload you were running. So, that's why the option was there originally. Multiple gfs_glockd processes didn't completely solve the problem, though. You could still get into the situation where GFS wasn't responding quickly enough to memory pressure. So I made a bunch of changes that made things a lot better: 1) I broke gfs_glockd into two threads: A) gfs_scand - scans the glock hash table looking for glocks to demote. When it finds one, it puts it onto a reclaim list of unneeded locks, and wakes up gfs_glockd. B) gfs_glockd - looks at that the glocks on that reclaim list and starts demoting them. 2) When the number of locks on the reclaim list becomes too great, threads that want to acquire new locks will pitch in and release a couple of locks before acquiring a new one. 3) GFS is more proactive about putting locks that it knows it won't need onto the reclaim list. This reduces the need for gfs_scand and actually walking the hash table. There's more work that could be done in this area. So, I left the num_glockd option there to in case it's still needed for some reason. But because of #2 above, I don't think it will be. The option may go away in the future. -- Ken Preslan From anton at hq.310.ru Thu Aug 26 21:01:15 2004 From: anton at hq.310.ru (anton at hq.310.ru) Date: Fri, 27 Aug 2004 01:01:15 +0400 Subject: [Linux-cluster] fence init problem Message-ID: <20040827010115.spihyrcdxc8880wc@mail.310.ru> hi all gfs from cvs after run fence_tool join in /var/log/messages i see ccsd[10248]: Error while processing connect: Connection refused last message repeated 31 times last message repeated 61 time .... # strace -p 10248 (ccsd) [pid 10248] select(1024, [6 9], NULL, NULL, NULL [pid 10249] futex(0x4bfcab28, FUTEX_WAIT, 2, NULL [pid 10248] <... 
select resumed> ) = 1 (in [6]) [pid 10248] accept(6, {sa_family=AF_INET, sin_port=htons(855), sin_addr=inet_addr("127.0.0.1")}, [16]) = 11 [pid 10248] read(11, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 20) = 20 [pid 10248] time([1093554303]) = 1093554303 [pid 10248] rt_sigaction(SIGPIPE, {0x298450, [], 0}, {SIG_DFL}, 8) = 0 [pid 10248] send(12, "<27>Aug 27 01:05:03 ccsd[10248]:"..., 84, 0) = 84 [pid 10248] rt_sigaction(SIGPIPE, {SIG_DFL}, NULL, 8) = 0 [pid 10248] write(11, "\1\0\0\0\0\0\0\0\0\0\0\0\221\377\377\377\0\0\0\0", 20) = 20 [pid 10248] close(11) = 0 # strace -p 10269 (fenced) Process 10269 attached - interrupt to quit setup() = 0 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 1 bind(1, {sa_family=AF_INET, sin_port=htons(902), sin_addr=inet_addr("127.0.0.1")}, 16) = 0 connect(1, {sa_family=AF_INET, sin_port=htons(50006), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 write(1, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 20) = 20 read(1, "\1\0\0\0\0\0\0\0\0\0\0\0\221\377\377\377\0\0\0\0", 20) = 20 close(1) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 nanosleep({1, 0}, {1, 0}) = 0 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 1 bind(1, {sa_family=AF_INET, sin_port=htons(903), sin_addr=inet_addr("127.0.0.1")}, 16) = 0 connect(1, {sa_family=AF_INET, sin_port=htons(50006), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 write(1, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 20) = 20 read(1, "\1\0\0\0\0\0\0\0\0\0\0\0\221\377\377\377\0\0\0\0", 20) = 20 close(1) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 nanosleep({1, 0}, Process 10269 detached In what a problem? ---------------------------------------------------------------- This message was sent using IMP, the Internet Messaging Program. From amir at datacore.ch Thu Aug 26 23:34:44 2004 From: amir at datacore.ch (Amir Guindehi) Date: Fri, 27 Aug 2004 01:34:44 +0200 Subject: [Linux-cluster] GFS configuration for 2 node Cluster In-Reply-To: <1093270294.3467.26.camel@atlantis.boston.redhat.com> References: <41241D63.5090102@net4india.net> <002001c486b5$46ab23c0$f13cc90a@druzhba.com> <41266989.2070101@datacore.ch> <1093270294.3467.26.camel@atlantis.boston.redhat.com> Message-ID: <412E7394.60505@datacore.ch> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi Lon, | I think he meant for 6.0.0, which is the pappy of linux-cluster. I | don't think you can do it with 6.0.0. Uops. I'm sorry, seems I missed that part. - - Amir - -- Amir Guindehi, nospam.amir at datacore.ch DataCore GmbH, Witikonerstrasse 289, 8053 Zurich, Switzerland -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.5 (MingW32) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFBLnOSbycOjskSVCwRAveaAJ4xbUWvk9O/QFl3dvHNOHxWVy1lowCdFTda x5UCHyfV9pNG3DENBleMPZM= =21ec -----END PGP SIGNATURE----- From erik at debian.franken.de Fri Aug 27 00:10:04 2004 From: erik at debian.franken.de (Erik Tews) Date: Fri, 27 Aug 2004 02:10:04 +0200 Subject: [Linux-cluster] Subversion? In-Reply-To: <60C14C611F1DDD4198D53F2F43D8CA3B01C7F989@orsmsx410> References: <60C14C611F1DDD4198D53F2F43D8CA3B01C7F989@orsmsx410> Message-ID: <1093565404.13004.42.camel@localhost.localdomain> Am Do, den 26.08.2004 schrieb Villalovos, John L um 20:20: > linux-cluster-bounces at redhat.com wrote: > > On Tuesday 24 August 2004 10:48, Lon Hohberger wrote: > >> It was _designed_ to handle distributed repositories (like BK). 
> > > > Well, what wind is blowing, seems to be blowing in the direction of > > Arch. I'd be equally happy with either, and in any case, > > much happier > > than with CVS. Does anybody else have a strong opinion? > > I'd prefer to use Subversion. It works through our proxy servers. We > already use it for some projects we connect to. Wait, I had a problem here, my university seems to have any kind of cisco transparent proxy which somehow has eaten my subversion-requests (some strange errors in the client, usually only on commit, update worked fine), after I moved my server away from port 80 the problem disappeard. I don't know what they were doing (I am only a student, not an administrator). > If you are going to stick with your centralized development model then > CVS or Subversion is probably the way to go. Subversion > Plus Subversion comes with Fedora Core 2 by default. Not sure about GNU > Arch. > > The change from CVS to SVN (Subversion) is very very easy. I am not > sure that we can say the same about going to GNU Arch. (Note: I have > never used GNU Arch). Thats really true, if you have used cvs before, you need round about 5-10 minutes untill you can do all the things an average cvs user does day by day. From teigland at redhat.com Fri Aug 27 03:13:22 2004 From: teigland at redhat.com (David Teigland) Date: Fri, 27 Aug 2004 11:13:22 +0800 Subject: [Linux-cluster] fence init problem In-Reply-To: <20040827010115.spihyrcdxc8880wc@mail.310.ru> References: <20040827010115.spihyrcdxc8880wc@mail.310.ru> Message-ID: <20040827031322.GC18381@redhat.com> On Fri, Aug 27, 2004 at 01:01:15AM +0400, anton at hq.310.ru wrote: > hi all > > gfs from cvs > > after run fence_tool join > in /var/log/messages i see > ccsd[10248]: Error while processing connect: Connection refused This is probably the same old problem everyone else ran into where ccsd is finding bad/old magma libs that were left behind in /lib instead of the new ones installed to /usr/lib. You may need to go through /lib and remove anything that was previously installed. -- Dave Teigland From teigland at redhat.com Fri Aug 27 03:01:30 2004 From: teigland at redhat.com (David Teigland) Date: Fri, 27 Aug 2004 11:01:30 +0800 Subject: [Linux-cluster] What is this GFS pipe doing here: In-Reply-To: References: Message-ID: <20040827030130.GB18381@redhat.com> On Thu, Aug 26, 2004 at 10:32:37AM -0600, tomc at teamics.com wrote: > in /tmp I found this pipe. It appears quite old. Any idea what it's > for? > > prw------- 1 root root 0 May 22 06:18 > fence.manual.fifo > > There is one on each of the GFS nodes except the current master. (Can > I)/(Should I) delete it? It's left over from a fence_manual that was never completed. You can delete it, but it won't harm anything if you don't. -- Dave Teigland From ben.m.cahill at intel.com Fri Aug 27 04:42:23 2004 From: ben.m.cahill at intel.com (Cahill, Ben M) Date: Thu, 26 Aug 2004 21:42:23 -0700 Subject: [Linux-cluster] fence init problem Message-ID: <0604335B7764D141945E202153105960033E252B@orsmsx404.amr.corp.intel.com> Just as a suggestion ... It would probably save everyone some time and frustration if you could add a few hints in usage.txt about "well-known" (but not well enough) gotchas like this ... I tried to do that as much as possible with "NOTE:", "HINT:" and "Check for success" verbiage in the HOWTOs in OpenGFS (http://opengfs.sourceforge.net/docs.php (e.g. "HOWTO Build and Install OpenGFS (nopool, new)", but I don't have enough experience with the RH stack to contribute much yet. 
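In the same spirit, here is a minimal sketch of a pre-flight check for the stale-library gotcha Dave describes above; it only lists candidate leftovers rather than removing them, and the paths are the ones that come up in this thread (a current build is assumed to install into /usr/lib):

  # stale cluster libraries left behind in /lib by an older install
  ls -l /lib/libmagma* /lib/libdlm* 2>/dev/null
  ls -ld /lib/magma /lib/magma/plugins 2>/dev/null
  # the freshly installed copies, for comparison
  ls -l /usr/lib/libmagma* /usr/lib/libdlm* 2>/dev/null

Anything present under /lib but not under /usr/lib is a likely leftover.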
-- Ben -- Opinions are mine, not Intel's > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of David Teigland > Sent: Thursday, August 26, 2004 11:13 PM > To: anton at hq.310.ru > Cc: linux-cluster at redhat.com > Subject: Re: [Linux-cluster] fence init problem > > > On Fri, Aug 27, 2004 at 01:01:15AM +0400, anton at hq.310.ru wrote: > > hi all > > > > gfs from cvs > > > > after run fence_tool join > > in /var/log/messages i see > > ccsd[10248]: Error while processing connect: Connection refused > > This is probably the same old problem everyone else ran into where > ccsd is finding bad/old magma libs that were left behind in /lib > instead of the new ones installed to /usr/lib. You may need to go > through /lib and remove anything that was previously installed. > > -- > Dave Teigland > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > From andriy at druzhba.lviv.ua Fri Aug 27 09:03:32 2004 From: andriy at druzhba.lviv.ua (Andriy Galetski) Date: Fri, 27 Aug 2004 12:03:32 +0300 Subject: [Linux-cluster] Linux cluster startup Message-ID: <010001c48c14$baae0da0$f13cc90a@druzhba.com> Hi ! Can anyone tell me how to get mount GFS partition when GFS (latest CVS version with CMAN and DLM) instaled to only one from 2 nodes future cluster system. Do I need to install GFS on both nodes ? Now when I do fence_tool join .... Geting ccsd[3164]: Cluster is not quorate. Refusing connection. Thanks From lhh at redhat.com Fri Aug 27 13:10:50 2004 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 27 Aug 2004 09:10:50 -0400 Subject: [Linux-cluster] fence init problem In-Reply-To: <20040827031322.GC18381@redhat.com> References: <20040827010115.spihyrcdxc8880wc@mail.310.ru> <20040827031322.GC18381@redhat.com> Message-ID: <1093612250.17473.85.camel@atlantis.boston.redhat.com> On Fri, 2004-08-27 at 11:13 +0800, David Teigland wrote: > This is probably the same old problem everyone else ran into where > ccsd is finding bad/old magma libs that were left behind in /lib > instead of the new ones installed to /usr/lib. You may need to go > through /lib and remove anything that was previously installed. Specficially: rm -f /lib/libmagma* /lib/magma/plugins/* /lib/magma/plugins /lib/magma Should do the trick. -- Lon From lhh at redhat.com Fri Aug 27 13:44:04 2004 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 27 Aug 2004 09:44:04 -0400 Subject: [Linux-cluster] fence init problem In-Reply-To: <105310906.20040827171824@hq.310.ru> References: <20040827010115.spihyrcdxc8880wc@mail.310.ru> <20040827031322.GC18381@redhat.com> <1093612250.17473.85.camel@atlantis.boston.redhat.com> <105310906.20040827171824@hq.310.ru> Message-ID: <1093614244.17473.88.camel@atlantis.boston.redhat.com> On Fri, 2004-08-27 at 17:18 +0400, ????? ????????? wrote: > Hi Lon, > Lon Hohberger> Specficially: > > Lon Hohberger> rm -f /lib/libmagma* > Lon Hohberger> /lib/magma/plugins/* /lib/magma/plugins /lib/magma > > and /lib/libdlm* :) Yes, and that! -- Lon From andriy at druzhba.lviv.ua Fri Aug 27 15:08:27 2004 From: andriy at druzhba.lviv.ua (Andriy Galetski) Date: Fri, 27 Aug 2004 18:08:27 +0300 Subject: [Linux-cluster] System can not join to fance domain (Quorum is Ok !) References: <010001c48c14$baae0da0$f13cc90a@druzhba.com> Message-ID: <014d01c48c47$b71fae90$f13cc90a@druzhba.com> Hi again ! 
Now I setup GFS for 2 nodes in exactly the same way like in http://sources.redhat.com/cluster/doc/usage.txt # cat /proc/cluster/status /proc/cluster/nodes Version: 2.0.1 Config version: 1 Cluster name: alpha Cluster ID: 3169 Membership state: Cluster-Member Nodes: 2 Expected_votes: 1 Total_votes: 2 Quorum: 1 Active subsystems: 0 Node addresses: 10.201.60.12 192.168.0.10 Node Votes Exp Sts Name 1 1 1 M cl10 2 1 1 M cl20 But when I try fence_tool join get errors: Aug 27 17:50:33 cl10 ccsd[5031]: Error while processing connect: Connection refused Aug 27 17:50:34 cl10 ccsd[5031]: Cluster is not quorate. Refusing connection. Why System can not join to fance domain ? The Quorum is Ok ! My /etc/cluster/cluster.conf : Thanks for any Help. From anton at hq.310.ru Fri Aug 27 15:25:32 2004 From: anton at hq.310.ru (=?Windows-1251?B?wO3y7u0gzeX17vDu+Oj1?=) Date: Fri, 27 Aug 2004 19:25:32 +0400 Subject: [Linux-cluster] System can not join to fance domain (Quorum is Ok !) In-Reply-To: <014d01c48c47$b71fae90$f13cc90a@druzhba.com> References: <010001c48c14$baae0da0$f13cc90a@druzhba.com> <014d01c48c47$b71fae90$f13cc90a@druzhba.com> Message-ID: <1507579153.20040827192532@hq.310.ru> ?????? ???? Andriy, Friday, August 27, 2004, 7:08:27 PM, you wrote: rm -f /lib/libmagma* /lib/magma/plugins/* /lib/magma/plugins /lib/magma /lib/libdlm* and try again :) Andriy Galetski> Hi again ! Andriy Galetski> Now I setup GFS for 2 nodes in Andriy Galetski> exactly the same way like in Andriy Galetski> http://sources.redhat.com/cluster/doc/usage.txt Andriy Galetski> # cat /proc/cluster/status /proc/cluster/nodes Andriy Galetski> Version: 2.0.1 Andriy Galetski> Config version: 1 Andriy Galetski> Cluster name: alpha Andriy Galetski> Cluster ID: 3169 Andriy Galetski> Membership state: Cluster-Member Andriy Galetski> Nodes: 2 Andriy Galetski> Expected_votes: 1 Andriy Galetski> Total_votes: 2 Andriy Galetski> Quorum: 1 Andriy Galetski> Active subsystems: 0 Andriy Galetski> Node addresses: 10.201.60.12 192.168.0.10 Andriy Galetski> Node Votes Exp Sts Name Andriy Galetski> 1 1 1 M cl10 Andriy Galetski> 2 1 1 M cl20 Andriy Galetski> But when I try fence_tool join Andriy Galetski> get errors: Andriy Galetski> Aug 27 17:50:33 cl10 ccsd[5031]: Andriy Galetski> Error while processing connect: Connection Andriy Galetski> refused Andriy Galetski> Aug 27 17:50:34 cl10 ccsd[5031]: Andriy Galetski> Cluster is not quorate. Refusing Andriy Galetski> connection. Andriy Galetski> Why System can not join to fance domain ? Andriy Galetski> The Quorum is Ok ! Andriy Galetski> My /etc/cluster/cluster.conf : Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Thanks for any Help. Andriy Galetski> -- Andriy Galetski> Linux-cluster mailing list Andriy Galetski> Linux-cluster at redhat.com Andriy Galetski> http://www.redhat.com/mailman/listinfo/linux-cluster -- ? ?????????, ????? ????????? ???????????? ??????? ?????? ??????? ???????????????????? ?????? ???. (095) 363 3 310 ???? 
(095) 363 3 310 e-mail: anton at hq.310.ru http://www.310.ru From yjcho at cs.hongik.ac.kr Fri Aug 27 21:27:16 2004 From: yjcho at cs.hongik.ac.kr (Cho Yool Je) Date: Sat, 28 Aug 2004 06:27:16 +0900 Subject: [Linux-cluster] erro log... Message-ID: <412FA734.6090701@cs.hongik.ac.kr> hi~ when i excute "mount -t gfs /dev/sda1 /mnt/gfs", my log is written Aug 28 06:22:04 gfs lock_gulmd_core[1894]: ERROR [src/core_io.c:1317] Node (client1.cs.xxxx.ac.kr ::ffff:xxx.xxx.xxx.xxx) has been denied from connecting here. what does that mean? thx... From adam.cassar at netregistry.com.au Mon Aug 30 02:25:20 2004 From: adam.cassar at netregistry.com.au (Adam Cassar) Date: Mon, 30 Aug 2004 12:25:20 +1000 Subject: [Linux-cluster] Error after upgrading from CVS Message-ID: <1093832720.28391.26.camel@akira2.nro.au.com> What does the following mean? CMAN: Waiting to join or form a Linux-cluster CMAN: sending membership request CMAN: sending membership request CMAN: got node cluster3 CMAN: got node cluster1 CMAN: quorum regained, resuming activity CMAN: killed by STARTTRANS or NOMINATE CMAN: we are leaving the cluster SM: 00000000 sm_stop: SG still joined SM: send_nodeid_message error -107 to 2 SM: send_broadcast_message error -107 From adam.cassar at netregistry.com.au Mon Aug 30 03:19:29 2004 From: adam.cassar at netregistry.com.au (Adam Cassar) Date: Mon, 30 Aug 2004 13:19:29 +1000 Subject: [Linux-cluster] Error after upgrading from CVS In-Reply-To: <1093832720.28391.26.camel@akira2.nro.au.com> References: <1093832720.28391.26.camel@akira2.nro.au.com> Message-ID: <1093835969.28391.37.camel@akira2.nro.au.com> I found the problem. One of the hosts was still using the old kernel modules. On Mon, 2004-08-30 at 12:25, Adam Cassar wrote: > What does the following mean? > > CMAN: Waiting to join or form a Linux-cluster > CMAN: sending membership request > CMAN: sending membership request > CMAN: got node cluster3 > CMAN: got node cluster1 > CMAN: quorum regained, resuming activity > CMAN: killed by STARTTRANS or NOMINATE > CMAN: we are leaving the cluster > SM: 00000000 sm_stop: SG still joined > SM: send_nodeid_message error -107 to 2 > SM: send_broadcast_message error -107 > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- Adam Cassar IT Manager NetRegistry Pty Ltd ______________________________________________ http://www.netregistry.com.au Tel: 02 9699 6099 Fax: 02 9699 6088 PO Box 270 Broadway NSW 2007 Domains |Business Email|Web Hosting|E-Commerce Trusted by 10,000s of businesses since 1997 ______________________________________________ From tomc at teamics.com Mon Aug 30 03:59:05 2004 From: tomc at teamics.com (tomc at teamics.com) Date: Sun, 29 Aug 2004 22:59:05 -0500 Subject: [Linux-cluster] tunables question Message-ID: I am using an IBM FastT 200 and QLA2200 adapters with Sistina GFS. Performance varies wildly between very good to abysmal. Any suggestions on tuning (queue depth, buffering, etc)? Any good docs available on tuning, tweaking and troublehsooting? (Other than the Admin guide, I already read that.) tc From adam.cassar at netregistry.com.au Mon Aug 30 04:13:11 2004 From: adam.cassar at netregistry.com.au (Adam Cassar) Date: Mon, 30 Aug 2004 14:13:11 +1000 Subject: [Linux-cluster] tunables question In-Reply-To: References: Message-ID: <1093839191.28391.62.camel@akira2.nro.au.com> Check out the IBM redbooks for FASTt performance tuning. comp.unix.aix and comp.arch.storage are where all the FASTt people hang out. 
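Before digging into the controller itself, a few host-side knobs are worth a look. A minimal sketch, with /dev/sda and /mnt/gfs standing in for the actual pool device and mount point (both names are placeholders; substitute your own):

  # current read-ahead on the block device, in 512-byte sectors
  blockdev --getra /dev/sda
  # raise it for large sequential transfers
  blockdev --setra 1024 /dev/sda
  # list GFS's own tunables on a mounted filesystem
  gfs_tool gettune /mnt/gfs
  # and change one with: gfs_tool settune /mnt/gfs <parameter> <value>

The controller-side settings (cache block size, segment size, read-ahead) are the ones covered in the redbooks mentioned above.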
On Mon, 2004-08-30 at 13:59, tomc at teamics.com wrote: > I am using an IBM FastT 200 and QLA2200 adapters with Sistina GFS. > Performance varies wildly between very good to abysmal. Any suggestions > on tuning (queue depth, buffering, etc)? Any good docs available on > tuning, tweaking and troublehsooting? (Other than the Admin guide, I > already read that.) > > tc > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From anton at hq.310.ru Mon Aug 30 09:30:11 2004 From: anton at hq.310.ru (=?Windows-1251?B?wO3y7u0gzeX17vDu+Oj1?=) Date: Mon, 30 Aug 2004 13:30:11 +0400 Subject: [Linux-cluster] immutable flag on gfs Message-ID: <12410128065.20040830133011@hq.310.ru> Hi all, Guys, it is very necessary to set immutable a flag on GFS, how? -- e-mail: anton at hq.310.ru From mtilstra at redhat.com Mon Aug 30 16:23:04 2004 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Mon, 30 Aug 2004 11:23:04 -0500 Subject: [Linux-cluster] erro log... In-Reply-To: <412FA734.6090701@cs.hongik.ac.kr> References: <412FA734.6090701@cs.hongik.ac.kr> Message-ID: <20040830162304.GB6777@redhat.com> On Sat, Aug 28, 2004 at 06:27:16AM +0900, Cho Yool Je wrote: > when i excute "mount -t gfs /dev/sda1 /mnt/gfs", my log is written > > Aug 28 06:22:04 gfs lock_gulmd_core[1894]: ERROR [src/core_io.c:1317] > Node (client1.cs.xxxx.ac.kr ::ffff:xxx.xxx.xxx.xxx) has been denied from > connecting here. > > what does that mean? it means that: 1) gulm failed to match the name and ip from /etc/resolv.conf (probably not the reason here) 2) tcpwrappers was configured not to allow that node to connect. (unless you fiddled with tcpwrappers, not the problem either.) 3) Node entry in ccs doesn't correctly match or is missing. (this is probably your problem.) An other slightly realated problem is if you have a nodes host name mapped to 127.0.0.1 in the /etc/hosts file. -- Michael Conrad Tadpol Tilstra At night as I lay in bed looking at the stars I thought 'Where the hell is the ceiling?' -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From john.l.villalovos at intel.com Mon Aug 30 18:51:22 2004 From: john.l.villalovos at intel.com (Villalovos, John L) Date: Mon, 30 Aug 2004 11:51:22 -0700 Subject: [Linux-cluster] Build & Installation instructions for GNBD? Message-ID: <60C14C611F1DDD4198D53F2F43D8CA3B01CC9E77@orsmsx410> Are there any build and installation instructions for GNBD? The documentation file ( http://sources.redhat.com/cluster/doc/usage.txt ) does not mention a word about GNBD. Thanks, John From ecashin at coraid.com Mon Aug 30 21:30:29 2004 From: ecashin at coraid.com (Ed L Cashin) Date: Mon, 30 Aug 2004 17:30:29 -0400 Subject: [Linux-cluster] Re: Error compiling GFS patched kernel References: <001d01c48230$a4877b80$0a14a8c0@venus.it> Message-ID: <87u0ukl3ey.fsf@coraid.com> "Angelo Ovidi" writes: > Hi. > > I am trying to compile a 2.6.7 kernel patched with cvs version of > cluster package of redhat. > > I have no error applying the patches but the compile give me this > error: I get the same error using the "Method 2: using kernel patches" procedure from http://sources.redhat.com/cluster/doc/usage.txt with this tarball: cluster_0406282100.tgz ... 
> CC [M] fs/gfs/inode.o > fs/gfs/inode.c: In function `inode_init_and_link': > fs/gfs/inode.c:1214: invalid lvalue in unary `&' > fs/gfs/inode.c: In function `inode_alloc_hidden': > fs/gfs/inode.c:1933: invalid lvalue in unary `&' > make[2]: *** [fs/gfs/inode.o] Error 1 > make[1]: *** [fs/gfs] Error 2 > make: *** [fs] Error 2 -- Ed L Cashin From jens.dreger at physik.fu-berlin.de Mon Aug 30 21:54:02 2004 From: jens.dreger at physik.fu-berlin.de (Jens Dreger) Date: Mon, 30 Aug 2004 23:54:02 +0200 Subject: [Linux-cluster] Re: Error compiling GFS patched kernel In-Reply-To: <87u0ukl3ey.fsf@coraid.com> References: <001d01c48230$a4877b80$0a14a8c0@venus.it> <87u0ukl3ey.fsf@coraid.com> Message-ID: <20040830215402.GT3794@smart.physik.fu-berlin.de> On Mon, Aug 30, 2004 at 05:30:29PM -0400, Ed L Cashin wrote: > "Angelo Ovidi" writes: > > > Hi. > > > > I am trying to compile a 2.6.7 kernel patched with cvs version of > > cluster package of redhat. > > > > I have no error applying the patches but the compile give me this > > error: > > I get the same error using the "Method 2: using kernel patches" > procedure from http://sources.redhat.com/cluster/doc/usage.txt > with this tarball: > > cluster_0406282100.tgz Try upgrading gcc. I got that error with gcc 2.95. Upgrading to gcc >3 solved the problem. HTH, Jens. From arekm at pld-linux.org Mon Aug 30 23:35:32 2004 From: arekm at pld-linux.org (Arkadiusz Miskiewicz) Date: Tue, 31 Aug 2004 01:35:32 +0200 Subject: [Linux-cluster] [PATCH]: avoid local_nodeid conflict with ia64/numa define Message-ID: <200408310135.32411.arekm@pld-linux.org> Little patch by qboosh at pld-linux.org: - avoid local_nodeid conflict with ia64/numa define http://cvs.pld-linux.org/cgi-bin/cvsweb/SOURCES/linux-cluster-dlm.patch?r1=1.1.2.3&r2=1.1.2.4 -- Arkadiusz Mi?kiewicz CS at FoE, Wroclaw University of Technology arekm.pld-linux.org, 1024/3DB19BBD, JID: arekm.jabber.org, PLD/Linux From adam.cassar at netregistry.com.au Tue Aug 31 03:50:58 2004 From: adam.cassar at netregistry.com.au (Adam Cassar) Date: Tue, 31 Aug 2004 13:50:58 +1000 Subject: [Linux-cluster] repeatable assertion failure extending file system while active Message-ID: <1093924258.28391.200.camel@akira2.nro.au.com> Hi Guys, I have a repeatable assertion failure. machine1:/mnt# bonnie++ -u 0 Using uid:0, gid:0. Writing with putc()... ------------- machine2# /usr/local/lvm/sbin/lvextend -L +1G /dev/GIPETE/lv0 Using stripesize of last segment 64KB Extending logical volume lv0 to 24.00 GB Logical volume lv0 successfully resized machine2# /usr/local/gfs/sbin/gfs_jadd -j 2 /mnt FS: Mount Point: /mnt FS: Device: /dev/GIPETE/lv0 FS: Options: rw,noatime,nodiratime FS: Size: 5242880 DEV: Size: 6291456 Preparing to write new FS information... Done. ------------ machine2 # /usr/local/gfs/sbin/gfs_grow /mnt FS: Mount Point: /mnt FS: Device: /dev/GIPETE/lv0 FS: Options: rw,noatime,nodiratime FS: Size: 5308416 DEV: Size: 6291456 Preparing to write new FS information... Done. 
------------- machine1# attempt to access beyond end of device dm-0: rw=0, want=48380304, limit=44040192 attempt to access beyond end of device dm-0: rw=0, want=48380312, limit=44040192 attempt to access beyond end of device dm-0: rw=0, want=48380320, limit=44040192 attempt to access beyond end of device dm-0: rw=0, want=48380328, limit=44040192 attempt to access beyond end of device dm-0: rw=0, want=48380336, limit=44040192 attempt to access beyond end of device dm-0: rw=0, want=48380344, limit=44040192 attempt to access beyond end of device dm-0: rw=0, want=48380352, limit=44040192 attempt to access beyond end of device dm-0: rw=0, want=48380360, limit=44040192 attempt to access beyond end of device dm-0: rw=0, want=48380368, limit=44040192 attempt to access beyond end of device dm-0: rw=0, want=48380376, limit=44040192 attempt to access beyond end of device dm-0: rw=0, want=48380384, limit=44040192 attempt to access beyond end of device dm-0: rw=0, want=48380392, limit=44040192 attempt to access beyond end of device dm-0: rw=0, want=48380400, limit=44040192 attempt to access beyond end of device dm-0: rw=0, want=48380408, limit=44040192 attempt to access beyond end of device dm-0: rw=0, want=48380416, limit=44040192 GFS: fsid=cluster:donkey.0: I/O error on block 6047551 GFS: Assertion failed on line 307 of file /usr/src/GFS/cluster/gfs-kernel/src/gfs/util.c GFS: assertion: "FALSE" GFS: time = 1093923913 GFS: fsid=cluster:donkey.0 Kernel panic: GFS: Record message above and reboot. From adam.cassar at netregistry.com.au Tue Aug 31 05:12:40 2004 From: adam.cassar at netregistry.com.au (Adam Cassar) Date: Tue, 31 Aug 2004 15:12:40 +1000 Subject: [Linux-cluster] fencing behaviour In-Reply-To: <1093924258.28391.200.camel@akira2.nro.au.com> References: <1093924258.28391.200.camel@akira2.nro.au.com> Message-ID: <1093929160.28391.230.camel@akira2.nro.au.com> The below node in question does not get fenced by the other nodes! It only occurs after a reboot and it joins the cluster. Only then does the other node want to fence it. Is this normal? On Tue, 2004-08-31 at 13:50, Adam Cassar wrote: > Hi Guys, > > I have a repeatable assertion failure. > > machine1:/mnt# bonnie++ -u 0 > Using uid:0, gid:0. > Writing with putc()... > > ------------- > > machine2# /usr/local/lvm/sbin/lvextend -L +1G /dev/GIPETE/lv0 > Using stripesize of last segment 64KB > Extending logical volume lv0 to 24.00 GB > Logical volume lv0 successfully resized > > machine2# /usr/local/gfs/sbin/gfs_jadd -j 2 /mnt > FS: Mount Point: /mnt > FS: Device: /dev/GIPETE/lv0 > FS: Options: rw,noatime,nodiratime > FS: Size: 5242880 > DEV: Size: 6291456 > Preparing to write new FS information... > Done. > > ------------ > > machine2 # /usr/local/gfs/sbin/gfs_grow /mnt > FS: Mount Point: /mnt > FS: Device: /dev/GIPETE/lv0 > FS: Options: rw,noatime,nodiratime > FS: Size: 5308416 > DEV: Size: 6291456 > Preparing to write new FS information... > Done. 
> > ------------- > > machine1# > > attempt to access beyond end of device > dm-0: rw=0, want=48380304, limit=44040192 > attempt to access beyond end of device > dm-0: rw=0, want=48380312, limit=44040192 > attempt to access beyond end of device > dm-0: rw=0, want=48380320, limit=44040192 > attempt to access beyond end of device > dm-0: rw=0, want=48380328, limit=44040192 > attempt to access beyond end of device > dm-0: rw=0, want=48380336, limit=44040192 > attempt to access beyond end of device > dm-0: rw=0, want=48380344, limit=44040192 > attempt to access beyond end of device > dm-0: rw=0, want=48380352, limit=44040192 > attempt to access beyond end of device > dm-0: rw=0, want=48380360, limit=44040192 > attempt to access beyond end of device > dm-0: rw=0, want=48380368, limit=44040192 > attempt to access beyond end of device > dm-0: rw=0, want=48380376, limit=44040192 > attempt to access beyond end of device > dm-0: rw=0, want=48380384, limit=44040192 > attempt to access beyond end of device > dm-0: rw=0, want=48380392, limit=44040192 > attempt to access beyond end of device > dm-0: rw=0, want=48380400, limit=44040192 > attempt to access beyond end of device > dm-0: rw=0, want=48380408, limit=44040192 > attempt to access beyond end of device > dm-0: rw=0, want=48380416, limit=44040192 > GFS: fsid=cluster:donkey.0: I/O error on block 6047551 > > GFS: Assertion failed on line 307 of file > /usr/src/GFS/cluster/gfs-kernel/src/gfs/util.c > GFS: assertion: "FALSE" > GFS: time = 1093923913 > GFS: fsid=cluster:donkey.0 > > Kernel panic: GFS: Record message above and reboot. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- Adam Cassar IT Manager NetRegistry Pty Ltd ______________________________________________ http://www.netregistry.com.au Tel: 02 9699 6099 Fax: 02 9699 6088 PO Box 270 Broadway NSW 2007 Domains |Business Email|Web Hosting|E-Commerce Trusted by 10,000s of businesses since 1997 ______________________________________________ From teigland at redhat.com Tue Aug 31 05:30:06 2004 From: teigland at redhat.com (David Teigland) Date: Tue, 31 Aug 2004 13:30:06 +0800 Subject: [Linux-cluster] fencing behaviour In-Reply-To: <1093929160.28391.230.camel@akira2.nro.au.com> References: <1093924258.28391.200.camel@akira2.nro.au.com> <1093929160.28391.230.camel@akira2.nro.au.com> Message-ID: <20040831053006.GA15784@redhat.com> On Tue, Aug 31, 2004 at 03:12:40PM +1000, Adam Cassar wrote: > The below node in question does not get fenced by the other nodes! > > It only occurs after a reboot and it joins the cluster. Only then does > the other node want to fence it. Is this normal? Yes. When the node dies, services (fencing, dlm, gfs) are suspended. If the cluster still has quorum, these services are re-enabled immediately. If the cluster has lost quorum, the services are not re-enabled until the cluster regains quorum, which is what you're seeing. Fencing occurs when the fencing service is re-enabled and performs recovery. This is happening when your failed node rejoins the cluster, giving it quorum again. When doing recovery, the fencing daemon is smart enough to see that the failed node has rejoined the cluster and will bypass the now useless fencing operation for it. 
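You can watch this from a surviving node; the fields below are the same ones shown in the /proc/cluster/status output quoted earlier in the thread:

  # quorum state before and after the failed node rejoins
  grep -E 'Membership state|Expected_votes|Total_votes|Quorum' /proc/cluster/status
  # known nodes and their current status
  cat /proc/cluster/nodes

While the cluster is inquorate the vote count stays below the quorum value and fencing, along with DLM and GFS recovery, remains suspended; once the node rejoins and quorum is regained, recovery proceeds as described above.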
-- Dave Teigland From ecashin at coraid.com Tue Aug 31 14:45:42 2004 From: ecashin at coraid.com (Ed L Cashin) Date: Tue, 31 Aug 2004 10:45:42 -0400 Subject: [Linux-cluster] Re: Error compiling GFS patched kernel References: <001d01c48230$a4877b80$0a14a8c0@venus.it> <87u0ukl3ey.fsf@coraid.com> <20040830215402.GT3794@smart.physik.fu-berlin.de> Message-ID: <87656zl621.fsf@coraid.com> Jens Dreger writes: > On Mon, Aug 30, 2004 at 05:30:29PM -0400, Ed L Cashin wrote: >> "Angelo Ovidi" writes: >> >> > Hi. >> > >> > I am trying to compile a 2.6.7 kernel patched with cvs version of >> > cluster package of redhat. >> > >> > I have no error applying the patches but the compile give me this >> > error: >> >> I get the same error using the "Method 2: using kernel patches" >> procedure from http://sources.redhat.com/cluster/doc/usage.txt >> with this tarball: >> >> cluster_0406282100.tgz > > Try upgrading gcc. I got that error with gcc 2.95. Upgrading to gcc >3 > solved the problem. Thanks. Yes, I noticed that it works with gcc 3.3, but the snapshot has been up for a while, and it doesn't work with gcc 2, so it seems like gcc 2 is not supported by gfs, which merits a big warning in any usage.txt-type docs. -- Ed L Cashin From eoey at shopping.com Mon Aug 30 19:03:27 2004 From: eoey at shopping.com (Edy Oey) Date: Mon, 30 Aug 2004 12:03:27 -0700 Subject: [Linux-cluster] QLogic QLA2342 Drivers for 2.6.x kernel? Message-ID: <0C99873E269DF9449701C2CA6D9782C902057B3A@mail-na.shopping.com> Hi, Anybody knows where I can get QLogic QLA2342 Drivers for 2.6.x kernel? Thanks. -edy -------------- next part -------------- An HTML attachment was scrubbed... URL: From ecashin at coraid.com Tue Aug 31 15:56:04 2004 From: ecashin at coraid.com (Ed L Cashin) Date: Tue, 31 Aug 2004 11:56:04 -0400 Subject: [Linux-cluster] Re: Installation problem References: <1093508759.31675.35.camel@zombie.i.spray.se> <20040826083426.GB6682@tykepenguin.com> <1093514299.31675.44.camel@zombie.i.spray.se> <20040826102119.GA9523@tykepenguin.com> Message-ID: <87y8jvjo8b.fsf@coraid.com> Patrick Caulfield writes: > On Thu, Aug 26, 2004 at 11:58:19AM +0200, Johan Pettersson wrote: >> >> >> I have only 2 cnxman.h in the system and they do not differ =/ >> >> build/cluster/cman-kernel/src/cnxman.h >> build/linux-2.6.7/include/cluster/cnxman.h > > Sorry that should have been cnxman-socket.h In today's cvs cluster sources, the cnxman-socket.h in the kernel patches is different from the one in the cluster source tree. If you build the cluster sources according to method 2 of usage.txt on a machine that has never had GFS installed before, I think you'll also find that there are some problems with the include paths. The different parts of the cluster software can't see one anothers headers. The problem is masked when there are headers in /usr/include, but shouldn't system-wide headers be ignored when building from newer sources? -- Ed L Cashin From coughlan at redhat.com Tue Aug 31 16:19:25 2004 From: coughlan at redhat.com (Tom Coughlan) Date: Tue, 31 Aug 2004 12:19:25 -0400 Subject: [Linux-cluster] QLogic QLA2342 Drivers for 2.6.x kernel? In-Reply-To: <0C99873E269DF9449701C2CA6D9782C902057B3A@mail-na.shopping.com> References: <0C99873E269DF9449701C2CA6D9782C902057B3A@mail-na.shopping.com> Message-ID: <1093969165.7121.2617.camel@bianchi.boston.redhat.com> On Mon, 2004-08-30 at 15:03, Edy Oey wrote: > Hi, > Anybody knows where I can get QLogic QLA2342 Drivers for 2.6.x kernel? They are built-in. See scsi/qla2xxx/. 
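A quick way to confirm that on a given 2.6 build is sketched below. The module split has changed between 2.6.x releases, so treat the qla2300 name as an assumption and check drivers/scsi/qla2xxx/ in your own tree; the 2342 is a 23xx-family board:

  # was the driver enabled in this kernel's config?
  grep -i qla2 /boot/config-$(uname -r)   # or grep your kernel source .config
  # which qla modules were actually built
  find /lib/modules/$(uname -r) -name 'qla2*'
  # load the core driver and the 23xx-family module, if modular
  modprobe qla2xxx
  modprobe qla2300

If the driver was built in rather than modular, the HBA should simply show up in dmesg and /proc/scsi/scsi at boot.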
From ecashin at coraid.com Tue Aug 31 16:44:49 2004 From: ecashin at coraid.com (Ed L Cashin) Date: Tue, 31 Aug 2004 12:44:49 -0400 Subject: [Linux-cluster] cluster depends on tcp_wrappers? Message-ID: <87u0ujjlz2.fsf@coraid.com> Hi. Does cluster, and gulm/src/utils_ip.c from today's CVS specifically, depend on tcp_wrappers? If so, it deserves mentioning in usage.txt. -- Ed L Cashin From mtilstra at redhat.com Tue Aug 31 16:52:53 2004 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Tue, 31 Aug 2004 11:52:53 -0500 Subject: [Linux-cluster] cluster depends on tcp_wrappers? In-Reply-To: <87u0ujjlz2.fsf@coraid.com> References: <87u0ujjlz2.fsf@coraid.com> Message-ID: <20040831165253.GA14574@redhat.com> On Tue, Aug 31, 2004 at 12:44:49PM -0400, Ed L Cashin wrote: > Hi. Does cluster, and gulm/src/utils_ip.c from today's CVS > specifically, depend on tcp_wrappers? gulm does use tcpwrappers, it always has. -- Michael Conrad Tadpol Tilstra Don't look back, the lemmings are gaining on you. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From bmarzins at redhat.com Tue Aug 31 17:00:07 2004 From: bmarzins at redhat.com (Benjamin Marzinski) Date: Tue, 31 Aug 2004 12:00:07 -0500 Subject: [Linux-cluster] Build & Installation instructions for GNBD? In-Reply-To: <60C14C611F1DDD4198D53F2F43D8CA3B01CC9E77@orsmsx410> References: <60C14C611F1DDD4198D53F2F43D8CA3B01CC9E77@orsmsx410> Message-ID: <20040831170007.GL12234@phlogiston.msp.redhat.com> On Mon, Aug 30, 2004 at 11:51:22AM -0700, Villalovos, John L wrote: > Are there any build and installation instructions for GNBD? > > The documentation file ( http://sources.redhat.com/cluster/doc/usage.txt > ) does not mention a word about GNBD. http://sources.redhat.com/cluster/gnbd/gnbd_usage.txt There is now a link to this from the gnbd page. If you have any questions or comments about it, just let me know. -Ben > Thanks, > John > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From ecashin at coraid.com Tue Aug 31 18:27:03 2004 From: ecashin at coraid.com (Ed L Cashin) Date: Tue, 31 Aug 2004 14:27:03 -0400 Subject: [Linux-cluster] Re: cluster depends on tcp_wrappers? References: <87u0ujjlz2.fsf@coraid.com> <20040831165253.GA14574@redhat.com> Message-ID: <87r7pnjh8o.fsf@coraid.com> Michael Conrad Tadpol Tilstra writes: > On Tue, Aug 31, 2004 at 12:44:49PM -0400, Ed L Cashin wrote: >> Hi. Does cluster, and gulm/src/utils_ip.c from today's CVS >> specifically, depend on tcp_wrappers? > > gulm does use tcpwrappers, it always has. OK, here's a patch. Without tcp wrappers already installed, following the directions in usage.txt results in a cryptic message about tcpd.h being missing, so either a check in the configure script or some documentation is necessary. 
--- cluster-cvs/doc/usage.txt.20040831 Tue Aug 31 14:21:57 2004 +++ cluster-cvs/doc/usage.txt Tue Aug 31 14:22:39 2004 @@ -25,6 +25,10 @@ cvs -d :pserver:cvs at sources.redhat.com:/cvs/lvm2 checkout LVM2 cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout cluster +- satisfy dependencies + + gulm requires tcp_wrappers + Build and install ----------------- -- Ed L Cashin From iisaman at citi.umich.edu Tue Aug 31 18:49:13 2004 From: iisaman at citi.umich.edu (Fredric Isaman) Date: Tue, 31 Aug 2004 14:49:13 -0400 (EDT) Subject: [Linux-cluster] cluster send request failed: Bad address Message-ID: I am trying to set up a simple 3-node cluster (containing iota6-8). I get up to running clvmd on each node. At this point, iota8 works fine, all lvm commands work (although with some error messages about lock failures on the other nodes). However, any attempt to use a lvm command on the other nodes gives some sort of locking error. For example: [root at iota8g LVM2]# pvremove /baddev /baddev: Couldn't find device. [root at iota6g LVM2]# pvremove /baddev cluster send request failed: Bad address Can't get lock for orphan PVs I have tracked the failure down to the fact that the call to dlm_ls_lock() from sync_lock() in LVM2/daemons/clvmd/clvmd-cman.c is failing, but I can not figure out why. In particular, I am perplexed that it works on the one machine and not the others. Any hints about what might be causing this would be appreciated. Thanks, Fred The failure in more detail: [root at iota6g cluster]# pvremove -vvv /baddev Setting global/locking_type to 2 Setting global/locking_library to liblvm2clusterlock.so Setting global/library_dir to /lib Opening shared locking library /lib/liblvm2clusterlock.so Loaded external locking library liblvm2clusterlock.so External locking enabled. FRED - called lock_resource(cmd, , 0x24) Locking P_orphans at 0x4 FRED - called _lock_for_cluster(51, 0x4, P_orphans) FRED - _cluster_request(51, ., data='\x04\x00P_orphans\x00', len=12) FRED - in _send_request: outheader = cmd=1, flags=0x0, xid=0, cid=134912944, status=-14, arglen=1, node= cluster send request failed: Bad address Can't get lock for orphan PVs [root at iota6g root]# clvmd -d CLVMD[13066]: 1093975495 CLVMD started CLVMD[13066]: 1093975495 FRED - init_cluster CLVMD[13066]: 1093975496 Cluster ready, doing some more initialisation CLVMD[13066]: 1093975496 starting LVM thread CLVMD[13066]: 1093975496 LVM thread function started CLVMD[13066]: 1093975496 clvmd ready for work CLVMD[13066]: 1093975496 Using timeout of 60 seconds No volume groups found CLVMD[13066]: 1093975496 LVM thread waiting for work CLVMD[13066]: 1093975500 Got new connection on fd 7 CLVMD[13066]: 1093975500 Read on local socket 7, len = 30 CLVMD[13066]: 1093975500 creating pipe, [8, 9] CLVMD[13066]: 1093975500 in sub thread: client = 0x80a8b60 CLVMD[13066]: 1093975500 doing PRE command LOCK_VG P_orphans at 4 CLVMD[13066]: 1093975500 FRED - sync_lock(P_orphans, 4, 0x0) CLVMD[13066]: 1093975500 FRED - sync_lock status = -1 CLVMD[13066]: 1093975500 hold_lock. 
lock at 4 failed: Bad address CLVMD[13066]: 1093975500 Writing status 14 down pipe 9 CLVMD[13066]: 1093975500 Waiting to do post command - state = 0 CLVMD[13066]: 1093975500 read on PIPE 8: 4 bytes: status: 14 CLVMD[13066]: 1093975500 background routine status was 14, sock_client=0x80a8b60CLVMD[13066]: 1093975500 Send local reply CLVMD[13066]: 1093975500 Read on local socket 7, len = -1 CLVMD[13066]: 1093975500 EOF on local socket: inprogress=0 CLVMD[13066]: 1093975500 Waiting for child thread CLVMD[13066]: 1093975500 SIGUSR2 received CLVMD[13066]: 1093975500 Joined child thread CLVMD[13066]: 1093975500 ret == 0, errno = 104. removing client From ben.m.cahill at intel.com Tue Aug 31 19:01:07 2004 From: ben.m.cahill at intel.com (Cahill, Ben M) Date: Tue, 31 Aug 2004 12:01:07 -0700 Subject: [Linux-cluster] man page for gfs_mount Message-ID: <0604335B7764D141945E202153105960033E2541@orsmsx404.amr.corp.intel.com> Hi all, Attached please find a new man page for gfs_mount, as a submission to be included in cluster/gfs/man. I tried to write it so it would be useful for newbies as well as veterans. Anyone who can, please review and let me know about any problems you see. Thanks! -- Ben -- Opinions are mine, not Intel's -------------- next part -------------- A non-text attachment was scrubbed... Name: gfs_mount.8 Type: application/octet-stream Size: 8813 bytes Desc: gfs_mount.8 URL: From sdake at mvista.com Tue Aug 31 19:50:43 2004 From: sdake at mvista.com (Steven Dake) Date: Tue, 31 Aug 2004 12:50:43 -0700 Subject: [Linux-cluster] New virtual synchrony API for the kernel: was Re: [Openais] New API in openais In-Reply-To: <1093973757.5933.56.camel@cherrybomb.pdx.osdl.net> References: <1093941076.3613.14.camel@persist.az.mvista.com> <1093973757.5933.56.camel@cherrybomb.pdx.osdl.net> Message-ID: <1093981842.3613.42.camel@persist.az.mvista.com> John, As it appears the redhat clusters project is interested in a kernel implementation of cluster messaging, this interface would have to be available to both the kernel and user applications. It possible to provide EVS services to both kernel and user space applications. There currently is no kernel implementation of group messaging, though only a user space interface. TIPC could probably export this sort of interface, or openais's gmi could be ported to the kernel. Then openais, redhat's cluster technologies, linux ha, or other group messaging applications (and there are quite a few) could use that technology and standardize on the EVS API. It would be useful for linux cluster developers for a common low level group communication API to be agreed upon by relevant clusters projects. Without this approach, we may end up with several systems all using different cluster communication & membership mechanisms that are incompatible. Thanks -steve On Tue, 2004-08-31 at 10:35, John Cherry wrote: > Steve, > > This sounds like a low level cluster communication service which would > be potentially leveraged by other services, such as the event service or > a group messaging service. Are you envisioning this to be a public > interface for applications? > > We discussed a low level cluster communication interface at the cluster > summit. The rhat/sistina interface would be used by the cluster manager > (CMAN) and the lock manager (GDLM), but there was no real momentum to > make this a public application interface. It would be great if we could > derive a common cluster communication interface with the rhat/sistina > project as well as the TIPC project. 
What do you think? > > John > > > On Tue, 2004-08-31 at 01:31, Steven Dake wrote: > > Folks > > > > Its with alot of pleasure that I announce a new API that I implemented > > over the weekend. > > > > The api is called the "EVS" API and is provided by a seperate library > > libevs.so/.a. The standard openais executive is used. There are two > > test programs testevs and evsbench which demonstrate the API. evsbench > > will benchmark throughput rates. I get about 9MB/sec on my hardware, > > however, flow control in the group messaging protocol is slowing this > > down. I've gotten 10MB/sec with tweaking the algorithm some. > > > > The API name EVS means "Extended Virtual Syncrhony". This API provides > > EVS semantics for those that require the guarantees provided in the face > > of partitions and merges. > > > > The API provides the following > > multiple instances may exist at one time > > group keys of 32 bytes > > an instance may join one or more groups at one time > > an instance may leave one or more groups at one time > > an instance may multicast to the currently joined groups > > an instance may multicast to unjoined groups > > any message for a joined group will be delivered via callback > > configuration changes are delivered via callback > > > > Your comments welcome > > > > Thanks > > -steve > > > > > > ______________________________________________________________________ > > _______________________________________________ > > Openais mailing list > > Openais at lists.osdl.org > > http://lists.osdl.org/mailman/listinfo/openais >