From phung at cs.columbia.edu  Mon May  2 05:59:56 2005
From: phung at cs.columbia.edu (Dan B. Phung)
Date: Mon, 2 May 2005 01:59:56 -0400 (EDT)
Subject: [Linux-cluster] GFS/cman related problems?
Message-ID: <Pine.LNX.4.44.0505020126230.7754-100000@algiers.clic.cs.columbia.edu>

Hello, 

I hope I'm sending to the correct list.  I'm having problems starting up
gfs, and hopefully it's just something incorrect with my configuration.

For the sources, I checked out the latest of:
  device-mapper LVM2 cluster
from
  cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster 

for the kernel sources, I'm using a vanilla 2.6.8.1 kernel, so I had to
update the cluster/*kernel from old versions.  (this may be my problem)
The kernel built and installed fine.  The kernel modules load fine.

root # modprobe dm-mod
root # device-mapper/scripts/devmap_mknod.sh
root # modprobe gfs
root # modprobe lock_dlm

  Lock_Harness <CVS> (built May  1 2005 15:31:12) installed
  GFS <CVS> (built May  1 2005 15:30:54) 
  CMAN <CVS> (built May  1 2005 14:54:13) 
  NET: Registered protocol family 30
  DLM <CVS> (built May  1 2005 14:54:32) 
  udev[5023]: creating device node '/dev/dlm-control'
  Lock_DLM (built May  1 2005 15:31:03) 

Then I try to startup the daemons

root # route add -net 224.0.0.0 netmask 255.0.0.0 dev eth0

root # ccsd -V
ccsd DEVEL.1114967270 (built May  1 2005 13:07:54)
Copyright (C) Red Hat, Inc.  2004  All rights reserved.

root # ccsd 
Starting ccsd DEVEL.1114967270: 
May  1 16:02:38 localhost ccsd[5035]:  Built: May  1 2005 13:07:54 
May  1 16:02:38 localhost ccsd[5035]:  Copyright (C) Red Hat, Inc.  2004 All rights reserved.

root # cman_tool -V
cman_tool DEVEL.1114967270 (built May  1 2005 13:07:59)
Copyright (C) Red Hat, Inc.  2004  All rights reserved.

root # cman_tool -d join
selected nodename blade1
multicast address 224.0.0.0
if eth0 for mcast address 224.0.0.0
setup up interface for address: blade1
cman: CMAN DEVEL.1114967270 (built May  1 2005 20:17:15) started
root # cman: Waiting to join or form a Linux-cluster
cman: forming a new cluster
cman: quorum regained, resuming activity

root # clvmd -V
Cluster LVM daemon version: 2.01.10-cvs (2005-04-04)
Protocol version:           0.2.1

root # clvmd   
clvmd could not connect to cluster manager
Consult syslog for more information

root # syslog | tail -1
Unable to connect to cluster infrastructure after 60 seconds
(there are many of these)

so from here I tried to startup ccsd and cman from my other blade.
'cman_tool join' on the other blade never joins and gives these messages:

  cman: sending membership request

I'm following the instructions from:
http://gfs.wikidev.net/Installation#Build_and_install

Here is my configuration:

<?xml version="1.0"?>
        <cluster name="blade_cluster" config_version="2">
        <cman two_node="1" expected_votes="1">
        <multicast addr="224.0.0.0"/>
        </cman>

        <clusternodes>
          <clusternode name="blade03" nodeid="1" votes="1">
          <multicast addr="224.0.0.0" interface="eth0"/>
          <fence>
                <method name="human">
                <device name="eth0" ipaddr="blade1"/>
                </method>
          </fence>
          </clusternode>

          <clusternode name="blade04" nodeid="2" votes="1">
          <multicast addr="224.0.0.0" interface="eth0"/>
          <fence>
                <method name="human">
                <device name="eth0" ipaddr="blade2"/>
                </method>
          </fence>
          </clusternode>
        </clusternodes>

        <fencedevices>
          <fencedevice name="human" agent="fence_manual"/>
        </fencedevices>

        <fence_daemon clean_start="0">
        </fence_daemon>
</cluster>

any help is much appreciated.

regards,
Dan



From teigland at redhat.com  Mon May  2 06:14:38 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 2 May 2005 14:14:38 +0800
Subject: [Linux-cluster] GFS/cman related problems?
In-Reply-To: <Pine.LNX.4.44.0505020126230.7754-100000@algiers.clic.cs.columbia.edu>
References: <Pine.LNX.4.44.0505020126230.7754-100000@algiers.clic.cs.columbia.edu>
Message-ID: <20050502061438.GC9072@redhat.com>

On Mon, May 02, 2005 at 01:59:56AM -0400, Dan B. Phung wrote:
> Hello, 
> 
> I hope I'm sending to the correct list.  I'm having problems starting up
> gfs, and hopefully it's just something incorrect with my configuration.
> 
> For the sources, I checked out the latest of:
>   device-mapper LVM2 cluster
> from
>   cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster 

You need to checkout the RHEL4 branch, checkout -r RHEL4 cluster.
The version of cman in cvs head is userspace and incompatible with
everything else at the moment.

Dave



From birger at birger.sh  Mon May  2 12:45:55 2005
From: birger at birger.sh (birger)
Date: Mon, 02 May 2005 14:45:55 +0200
Subject: [Linux-cluster] Problems compiling RHEL4 branch on FC3
Message-ID: <42762103.5050204@birger.sh>

I just tried fetching the RHEL4 branch to update the old head branch I 
have used.

Here is a list of problems compiling this branch so far:

LOCK_USE_CLNT is gone from newer kernels. I just defined it to 1 to get 
compilation past the problems. Don't know how this will affect the 
cluster software... But at least it compiles.

In line 964 of gfs-kernel/src/gfs/quota.c the compiler complains about 
argument count to the write function. Looks right to me, so I'm a bit 
confused (quite normal state for me I'm afraid). Somehow the compiler 
seems to think this is a call to write (2), and not to the tty_drivers 
write.
I just removed the line (and the if statement in the previous line) to 
get past it. I don't need no quota messages. :-D

gulm/Makefile doesn't include libgulm.a in the all: target, but it's 
part of the install: actions so installation fails. I just added 
lib/libgulm.a to all: to finish the make.

-- 
birger



From lhh at redhat.com  Mon May  2 13:37:16 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 02 May 2005 09:37:16 -0400
Subject: [Linux-cluster] Re: Problems compiling RHEL4 branch on FC3
In-Reply-To: <42762103.5050204@birger.sh>
References: <42762103.5050204@birger.sh>
Message-ID: <1115041036.20618.593.camel@ayanami.boston.redhat.com>

On Mon, 2005-05-02 at 14:45 +0200, birger wrote:
> I just removed the line (and the if statement in the previous line) to 
> get past it. I don't need no quota messages. :-D

I don't think that will make it into CVS. .. :)

> gulm/Makefile doesn't include libgulm.a in the all: target, but it's 
> part of the install: actions so installation fails. I just added 
> lib/libgulm.a to all: to finish the make.

Interesting.  I'm surprised we haven't seen these.

Are you building with GCC4?

-- Lon



From mtilstra at redhat.com  Mon May  2 14:02:51 2005
From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra)
Date: Mon, 2 May 2005 09:02:51 -0500
Subject: [Linux-cluster] Re: Problems compiling RHEL4 branch on FC3
In-Reply-To: <1115041036.20618.593.camel@ayanami.boston.redhat.com>
References: <42762103.5050204@birger.sh>
	<1115041036.20618.593.camel@ayanami.boston.redhat.com>
Message-ID: <20050502140251.GB24275@redhat.com>

On Mon, May 02, 2005 at 09:37:16AM -0400, Lon Hohberger wrote:
> > gulm/Makefile doesn't include libgulm.a in the all: target, but it's 
> > part of the install: actions so installation fails. I just added 
> > lib/libgulm.a to all: to finish the make.
> 
> Interesting.  I'm surprised we haven't seen these.

Except I have seen these.  I just fixed it too.  I was cleaning up the
building of the lib, and somewhere along the line I forgot to change the
all target in the RHEL4 branch.  Its all fixed now.

-- 
Michael Conrad Tadpol Tilstra
It's a lot of fun being alive ... I wonder if my bed is made?!? 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050502/3e1b8ed8/attachment.sig>

From rajkum2002 at rediffmail.com  Mon May  2 16:55:40 2005
From: rajkum2002 at rediffmail.com (Raj  Kumar)
Date: 2 May 2005 16:55:40 -0000
Subject: [Linux-cluster] umount hang
Message-ID: <20050502165540.7195.qmail@webmail47.rediffmail.com>

Hello,

I tried to unmount one of the GFS filesystems and that command did not succeed even after waiting for a long time. I found this in the log when I was restarting the server:

The system did not shutdown either. Any clue?
 ?
May  2 11:46:38 node1 kernel: GFS: fsid=gfs1:gfs02.1: Unmount seems to be stalled. Dumping lock s
tate...
May  2 11:46:38 node1 kernel: Glock (7, 26)
May  2 11:46:38 node1 kernel:   gl_flags =                                                         
May  2 11:46:38 node1 kernel:   gl_count = 2
May  2 11:46:38 node1 kernel:   gl_state = 3
May  2 11:46:38 node1 kernel:   lvb_count = 0
May  2 11:46:38 node1 kernel:   object = yes
May  2 11:46:38 node1 kernel:   dependencies = no
May  2 11:46:38 node1 kernel:   reclaim = no
May  2 11:46:38 node1 kernel:   Holder
May  2 11:46:38 node1 kernel:     owner = -1
May  2 11:46:38 node1 kernel:     gh_state = 3
May  2 11:46:38 node1 kernel:     gh_flags = 5 7 
May  2 11:46:39 node1 kernel:     error = 0
May  2 11:46:39 node1 kernel:     gh_iflags = 1 5 6 
May  2 11:46:40 node1 kernel: Glock (4, 26)
May  2 11:46:41 node1 kernel:   gl_flags = 
May  2 11:46:42 node1 kernel:   gl_count = 2
May  2 11:46:42 node1 kernel:   gl_state = 0
May  2 11:46:43 node1 kernel:   lvb_count = 0
May  2 11:46:44 node1 kernel:   object = yes
May  2 11:46:44 node1 kernel:   dependencies = no
May  2 11:46:44 node1 kernel:   reclaim = no
May  2 11:46:44 node1 kernel:   Inode:
May  2 11:46:45 node1 kernel:     num = 26/26
May  2 11:46:45 node1 kernel:     type = 2
May  2 11:46:45 node1 kernel:     i_count = 1
May  2 11:46:45 node1 kernel:     i_flags = 
May  2 11:46:45 node1 kernel:     vnode = yes

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050502/d9375604/attachment.htm>

From ranaldo at unina.it  Mon May  2 17:57:56 2005
From: ranaldo at unina.it (Nicola Ranaldo)
Date: Mon, 2 May 2005 19:57:56 +0200
Subject: [Linux-cluster] gfs/iscsi over 2.6.11 and gfs listener
Message-ID: <200505021957.56474.ranaldo@unina.it>

Hello,
i have to setup gfs over kernel 2.6.11 (due to requirements of 
iscsi-initiator).
So i cvs-checked out gfs and compiled it.
Cman does not join the cluster, so i used cman from stable (2.6.9) gfs 
release.
Now it's working well, gfs is up and running, but i got two not reproducible 
oops (i do not remember weel but is it possible rebooting the system without 
umount gfs filesystems).
I read about different branch in cvs, may you descript them better?
What version should i use exactly in order to have a quite stable system?

The second:
Are there an ioctl call in order to "listen" at gfs?
I need a process on system "a" of two clustered nodes that listen about 
open/close/read/write operation of processes on the node "b".
Is it possible?

Thank you

	Nicola Ranaldo



From ranaldo at unina.it  Mon May  2 18:02:47 2005
From: ranaldo at unina.it (Nicola Ranaldo)
Date: Mon, 2 May 2005 20:02:47 +0200
Subject: [Linux-cluster] gfs/iscsi over 2.6.11 and gfs listener
In-Reply-To: <200505021957.56474.ranaldo@unina.it>
References: <200505021957.56474.ranaldo@unina.it>
Message-ID: <200505022002.47600.ranaldo@unina.it>

> Now it's working well, gfs is up and running, but i got two not
> reproducible oops (i do not remember weel but is it possible rebooting the
> system without umount gfs filesystems).

Ok... i reproduced it:

CMAN: removing node simapplication from the cluster : Missed too many 
heartbeats
CMAN: node simapplication rejoining
scheduling while atomic: cman_comms/0x00000001/2794
 [<c03a66f2>] schedule+0x522/0x530
 [<ded1c72e>] start_ack_timer+0x2e/0x40 [cman]
 [<ded1f52c>] __sendmsg+0x4cc/0x670 [cman]
 [<ded2806d>] add_barrier_callback+0x7d/0x170 [cman]
 [<ded281dd>] callback_startdone_barrier+0x1d/0x30 [cman]
 [<ded20dc0>] check_barrier_complete_phase2+0xe0/0x130 [cman]
 [<ded20ed9>] process_barrier_msg+0x99/0x110 [cman]
 [<ded1d0cf>] process_incoming_packet+0x18f/0x290 [cman]
 [<ded1c1d1>] receive_message+0xd1/0xf0 [cman]
 [<ded1c38c>] cluster_kthread+0x19c/0x3a0 [cman]
 [<c0102622>] ret_from_fork+0x6/0x14
 [<c0113140>] default_wake_function+0x0/0x20
 [<ded1c1f0>] cluster_kthread+0x0/0x3a0 [cman]
 [<c010087d>] kernel_thread_helper+0x5/0x18

due i think it's by cman

	Niko



From phung at cs.columbia.edu  Tue May  3 01:26:43 2005
From: phung at cs.columbia.edu (Dan B. Phung)
Date: Mon, 2 May 2005 21:26:43 -0400 (EDT)
Subject: [Linux-cluster] Re: Problems compiling RHEL4 branch on FC3
In-Reply-To: <20050502140251.GB24275@redhat.com>
Message-ID: <Pine.LNX.4.44.0505022122000.12538-100000@algiers.clic.cs.columbia.edu>

this may be related, but when I compile from the RHEL4 branch, I 
get many errors which seem related to the kernel patched headers 
in <kernel-src>/include/cluster not being updated, such as:
cman-kernel/src/cnxman.c:478: error: `CLUSTER_LEAVEFLAG_NORESPONSE' 
undeclared

things are fine after I copy the cnxman.h from cman-kernel/src.

-dan

On 2, May, 2005, Michael Conrad Tadpol Tilstra declared:

> On Mon, May 02, 2005 at 09:37:16AM -0400, Lon Hohberger wrote:
> > > gulm/Makefile doesn't include libgulm.a in the all: target, but it's 
> > > part of the install: actions so installation fails. I just added 
> > > lib/libgulm.a to all: to finish the make.
> > 
> > Interesting.  I'm surprised we haven't seen these.
> 
> Except I have seen these.  I just fixed it too.  I was cleaning up the
> building of the lib, and somewhere along the line I forgot to change the
> all target in the RHEL4 branch.  Its all fixed now.
> 
> 

-- 



From birger at birger.sh  Tue May  3 08:08:48 2005
From: birger at birger.sh (birger)
Date: Tue, 03 May 2005 10:08:48 +0200
Subject: [Linux-cluster] Re: Problems compiling RHEL4 branch on FC3
In-Reply-To: <Pine.LNX.4.44.0505022122000.12538-100000@algiers.clic.cs.columbia.edu>
References: <Pine.LNX.4.44.0505022122000.12538-100000@algiers.clic.cs.columbia.edu>
Message-ID: <42773190.1020101@birger.sh>

Dan B. Phung wrote:

>this may be related, but when I compile from the RHEL4 branch, I 
>get many errors which seem related to the kernel patched headers 
>in <kernel-src>/include/cluster not being updated, such as:
>cman-kernel/src/cnxman.c:478: error: `CLUSTER_LEAVEFLAG_NORESPONSE' 
>undeclared
>  
>
I run
./configure  /lib/modules/`uname -r`/build
so I use the headers that come with the kernel binary rpm.  I have not 
installed kernel source to compile the cluster software. I have not seen 
the problem you mention.

-- 
birger



From pcaulfie at redhat.com  Tue May  3 08:32:13 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Tue, 03 May 2005 09:32:13 +0100
Subject: [Linux-cluster] gfs/iscsi over 2.6.11 and gfs listener
In-Reply-To: <200505022002.47600.ranaldo@unina.it>
References: <200505021957.56474.ranaldo@unina.it>
	<200505022002.47600.ranaldo@unina.it>
Message-ID: <4277370D.5030906@redhat.com>

Nicola Ranaldo wrote:
>>Now it's working well, gfs is up and running, but i got two not
>>reproducible oops (i do not remember weel but is it possible rebooting the
>>system without umount gfs filesystems).
> 
> 
> Ok... i reproduced it:
> 
> CMAN: removing node simapplication from the cluster : Missed too many 
> heartbeats
> CMAN: node simapplication rejoining
> scheduling while atomic: cman_comms/0x00000001/2794
>  [<c03a66f2>] schedule+0x522/0x530
>  [<ded1c72e>] start_ack_timer+0x2e/0x40 [cman]
>  [<ded1f52c>] __sendmsg+0x4cc/0x670 [cman]
>  [<ded2806d>] add_barrier_callback+0x7d/0x170 [cman]
>  [<ded281dd>] callback_startdone_barrier+0x1d/0x30 [cman]
>  [<ded20dc0>] check_barrier_complete_phase2+0xe0/0x130 [cman]
>  [<ded20ed9>] process_barrier_msg+0x99/0x110 [cman]
>  [<ded1d0cf>] process_incoming_packet+0x18f/0x290 [cman]
>  [<ded1c1d1>] receive_message+0xd1/0xf0 [cman]
>  [<ded1c38c>] cluster_kthread+0x19c/0x3a0 [cman]
>  [<c0102622>] ret_from_fork+0x6/0x14
>  [<c0113140>] default_wake_function+0x0/0x20
>  [<ded1c1f0>] cluster_kthread+0x0/0x3a0 [cman]
>  [<c010087d>] kernel_thread_helper+0x5/0x18
> 
> due i think it's by cman

Which branch did you pull this from? CVS HEAD is currently highly unstable. FC4 is recommended.

It looks like the SM_RETRY is still in sm_barrier.c:add_barrier_callback - that should have come out on most branches by now.

-- 

patrick



From ranaldo at unina.it  Tue May  3 12:13:43 2005
From: ranaldo at unina.it (Nicola Ranaldo)
Date: Tue, 3 May 2005 14:13:43 +0200
Subject: [Linux-cluster] gfs/iscsi over 2.6.11 and gfs listener
In-Reply-To: <4277370D.5030906@redhat.com>
References: <200505021957.56474.ranaldo@unina.it>
	<200505022002.47600.ranaldo@unina.it> <4277370D.5030906@redhat.com>
Message-ID: <200505031413.43833.ranaldo@unina.it>

> Which branch did you pull this from? CVS HEAD is currently highly unstable.
> FC4 is recommended.
>
> It looks like the SM_RETRY is still in sm_barrier.c:add_barrier_callback -
> that should have come out on most branches by now.

I tested both cvs head and cvs RHEL4, i do not know anythink about FC4.
Why so many branches??? Which of them is definitively more stable?

Howewer *IT SEEMS* recompiling 2.6.11 without PREEMPTIBLE KERNEL avoids the 
random oops!
Actually my gfs is happy :) this could be a good starting point, due to 
different messages in the mailing list referencing the same problem.

Howewer some opss continues to present when i reboot the servers without 
umount the filesystem cleanly, But i suppose that this could be while iscsi 
is died before the system umounts local filesystem during shutdown procedure.

May  2 23:04:21 simlistener kernel: Unable to handle kernel NULL pointer 
dereference at virtual address 00000024
May  2 23:04:21 simlistener kernel:  printing eip:
May  2 23:04:21 simlistener kernel: dedb978a
May  2 23:04:21 simlistener kernel: *pde = 00000000
May  2 23:04:21 simlistener kernel: Oops: 0000 [#1]
May  2 23:04:21 simlistener kernel: PREEMPT
May  2 23:04:21 simlistener kernel: Modules linked in: gfs lock_dlm dlm 
lock_harness cman iscsi_tcp iscsi_if
May  2 23:04:21 simlistener kernel: CPU:    0
May  2 23:04:21 simlistener kernel: EIP:    0060:[<dedb978a>]    Not tainted 
VLI
May  2 23:04:21 simlistener kernel: EFLAGS: 00010212   (2.6.11.7)
May  2 23:04:21 simlistener kernel: EIP is at gfs_ail_start_trans+0x4a/0x1f0 
[gfs]
May  2 23:04:21 simlistener kernel: eax: 00000000   ebx: da609ec0   ecx: 
da609e80   edx: ded5e420
May  2 23:04:21 simlistener kernel: esi: dacd66d0   edi: da609ebc   ebp: 
da609000   esp: da609e48
May  2 23:04:21 simlistener kernel: ds: 007b   es: 007b   ss: 0068
May  2 23:04:21 simlistener kernel: Process umount (pid: 3690, 
threadinfo=da609000 task=dd5a1a80)
May  2 23:04:21 simlistener kernel: Stack: da609ec0 da609e60 00000000 da609000 
da609ec0 00000000 00000000 da609e60
May  2 23:04:21 simlistener kernel:        ded5c000 ded705c0 da609e60 dedd19e5 
ded5c000 da609e60 00000000 ded705ac
May  2 23:04:21 simlistener kernel:        ded5c000 da609000 da609ec0 da1fd100 
dedbb041 ded5c000 00000400 ded5c000
May  2 23:04:21 simlistener kernel: Call Trace:
May  2 23:04:21 simlistener kernel:  [<dedd19e5>] gfs_ail_start+0x75/0xc0 
[gfs]
May  2 23:04:21 simlistener kernel:  [<dedbb041>] gfs_sync_meta+0x31/0x60 
[gfs]
May  2 23:04:21 simlistener kernel:  [<dedeb6c3>] gfs_make_fs_ro+0x53/0xb0 
[gfs]
May  2 23:04:21 simlistener kernel:  [<dede12bd>] gfs_put_super+0x2bd/0x300 
[gfs]
May  2 23:04:21 simlistener kernel:  [<c015bad7>] 
generic_shutdown_super+0x127/0x140
May  2 23:04:21 simlistener kernel:  [<dedde4a2>] gfs_kill_sb+0x32/0x6e [gfs]
May  2 23:04:21 simlistener kernel:  [<c015b88e>] deactivate_super+0x6e/0xa0
May  2 23:04:21 simlistener kernel:  [<c017313f>] sys_umount+0x3f/0xa0
May  2 23:04:21 simlistener kernel:  [<c014882d>] do_munmap+0x13d/0x180
May  2 23:04:21 simlistener kernel:  [<c01488c0>] sys_munmap+0x50/0x80
May  2 23:04:21 simlistener kernel:  [<c01731b5>] sys_oldumount+0x15/0x20
May  2 23:04:21 simlistener kernel:  [<c010273f>] syscall_call+0x7/0xb
May  2 23:04:21 simlistener kernel: Code: 00 31 c0 89 44 24 14 b8 01 00 00 00 
e8 50 99 35 e1 8b 5f 04 39 fb 8b 73 04 74 2b$
May  2 23:04:21 simlistener kernel:  <6>note: umount[3690] exited with 
preempt_count 1
May  2 23:04:21 simlistener kernel: scheduling while atomic: 
umount/0x10000001/3690
May  2 23:04:21 simlistener kernel:  [<c03a66f2>] schedule+0x522/0x530
May  2 23:04:21 simlistener kernel:  [<c0143c0e>] unmap_page_range+0x7e/0xa0

May  2 23:04:21 simlistener kernel:  [<c0143de0>] unmap_vmas+0x1b0/0x210
May  2 23:04:21 simlistener kernel:  [<c0148bec>] exit_mmap+0x7c/0x170
May  2 23:04:21 simlistener kernel:  [<c0114677>] mmput+0x37/0xb0
May  2 23:04:21 simlistener kernel:  [<c011938b>] do_exit+0x9b/0x3c0
May  2 23:04:21 simlistener kernel:  [<c010392b>] die+0x18b/0x190
May  2 23:04:21 simlistener kernel:  [<c0116e27>] printk+0x17/0x20
May  2 23:04:21 simlistener kernel:  [<c01113da>] do_page_fault+0x2da/0x5d5
May  2 23:04:21 simlistener kernel:  [<c03a7261>] __wait_on_bit+0x51/0x70
May  2 23:04:21 simlistener kernel:  [<c012c050>] wake_bit_function+0x0/0x60
May  2 23:04:21 simlistener kernel:  [<dedd2938>] log_free_buf+0x58/0x60 [gfs]
May  2 23:04:21 simlistener kernel:  [<c03a68b9>] 
wait_for_completion+0xc9/0xf0
May  2 23:04:21 simlistener kernel:  [<c0111100>] do_page_fault+0x0/0x5d5
May  2 23:04:21 simlistener kernel:  [<c0103177>] error_code+0x2b/0x30
May  2 23:04:21 simlistener kernel:  [<dedb978a>] 
gfs_ail_start_trans+0x4a/0x1f0 [gfs]
May  2 23:04:21 simlistener kernel:  [<dedd19e5>] gfs_ail_start+0x75/0xc0 
[gfs]
May  2 23:04:21 simlistener kernel:  [<dedbb041>] gfs_sync_meta+0x31/0x60 
[gfs]
May  2 23:04:21 simlistener kernel:  [<dedeb6c3>] gfs_make_fs_ro+0x53/0xb0 
[gfs]
May  2 23:04:21 simlistener kernel:  [<dede12bd>] gfs_put_super+0x2bd/0x300 
[gfs]
May  2 23:04:21 simlistener kernel:  [<c015bad7>] 
generic_shutdown_super+0x127/0x140
May  2 23:04:21 simlistener kernel:  [<dedde4a2>] gfs_kill_sb+0x32/0x6e [gfs]
May  2 23:04:21 simlistener kernel:  [<c015b88e>] deactivate_super+0x6e/0xa0
May  2 23:04:21 simlistener kernel:  [<c017313f>] sys_umount+0x3f/0xa0
May  2 23:04:21 simlistener kernel:  [<c014882d>] do_munmap+0x13d/0x180
May  2 23:04:21 simlistener kernel:  [<c01488c0>] sys_munmap+0x50/0x80
May  2 23:04:21 simlistener kernel:  [<c01731b5>] sys_oldumount+0x15/0x20
May  2 23:04:21 simlistener kernel:  [<c010273f>] syscall_call+0x7/0xb



Best Regards

-- 
Dott. Ranaldo Nicola
Sistemi di Elaborazione
C.S.I. (Centro di Servizi Informativi di Ateneo)
Sede di Monte Sant'Angelo
Via Cinthia n.4 80126 Napoli
Tel. 081/676638
Fax. 081/676628
Email: ranaldo at unina.it



From pcaulfie at redhat.com  Tue May  3 12:33:02 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Tue, 03 May 2005 13:33:02 +0100
Subject: [Linux-cluster] New cluster.conf generator
Message-ID: <42776F7E.7070800@redhat.com>

For the GUI-phobes and XML-haters out there I've written a small command-line utility to create cluster.conf files from the
command-line.

It's in CVS head ccs/cluster_conf/ and, although there's no man page (yet) there should be sufficient command-line help to get you
started.

Although it is in CVS head, it should also generate perfectly sound config files for the FC4 or RHEL4 branches too.
-- 

patrick



From pcaulfie at redhat.com  Tue May  3 12:53:33 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Tue, 03 May 2005 13:53:33 +0100
Subject: [Linux-cluster] gfs/iscsi over 2.6.11 and gfs listener
In-Reply-To: <200505031413.43833.ranaldo@unina.it>
References: <200505021957.56474.ranaldo@unina.it>	<200505022002.47600.ranaldo@unina.it>
	<4277370D.5030906@redhat.com> <200505031413.43833.ranaldo@unina.it>
Message-ID: <4277744D.60508@redhat.com>

Nicola Ranaldo wrote:

> I tested both cvs head and cvs RHEL4, i do not know anythink about FC4.
> Why so many branches??? Which of them is definitively more stable?

Well, FC4 is for Fedora Core 4, and RHEL4 is for RHEL4. Pick whichever is nearst to the kernel you are running. RHEL4 should be the
most stable - but only if you are running it against a RHEL4 kernel.

If you need support for later kernels then FC4 is your best bet.

> Howewer *IT SEEMS* recompiling 2.6.11 without PREEMPTIBLE KERNEL avoids the 
> random oops!

No surprise there. CONFIG_PREEMPT is not supported. If that's not mentioned anywhere it should be.

-- 

patrick



From ranaldo at unina.it  Tue May  3 13:59:24 2005
From: ranaldo at unina.it (Nicola Ranaldo)
Date: Tue, 3 May 2005 15:59:24 +0200
Subject: [Linux-cluster] gfs/iscsi over 2.6.11 and gfs listener
In-Reply-To: <4277744D.60508@redhat.com>
References: <200505021957.56474.ranaldo@unina.it>
	<200505031413.43833.ranaldo@unina.it> <4277744D.60508@redhat.com>
Message-ID: <200505031559.24396.ranaldo@unina.it>

> Well, FC4 is for Fedora Core 4, and RHEL4 is for RHEL4. Pick whichever is
> nearst to the kernel you are running. RHEL4 should be the most stable - but
> only if you are running it against a RHEL4 kernel.
>
> If you need support for later kernels then FC4 is your best bet.
> No surprise there. CONFIG_PREEMPT is not supported. If that's not mentioned
> anywhere it should be.

Umh... FC4 does not compile, it cannot find some definitions (eg. 
SOCK_ZAPPED), and give me other problems on cman, with kernel 2.6.11, 
2.6.11.7, 2.6.11.8.
I use official and untouched kernel distributions.
Need i some patches?

Regards

-- 
Dott. Ranaldo Nicola
Sistemi di Elaborazione
C.S.I. (Centro di Servizi Informativi di Ateneo)
Sede di Monte Sant'Angelo
Via Cinthia n.4 80126 Napoli
Tel. 081/676638
Fax. 081/676628
Email: ranaldo at unina.it



From pcaulfie at redhat.com  Tue May  3 15:48:52 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Tue, 03 May 2005 16:48:52 +0100
Subject: [Linux-cluster] New cluster.conf generator
In-Reply-To: <42776F7E.7070800@redhat.com>
References: <42776F7E.7070800@redhat.com>
Message-ID: <42779D64.1010803@redhat.com>

By popular request, this has been folded into ccs_tool.
-- 

patrick



From phung at cs.columbia.edu  Wed May  4 07:33:51 2005
From: phung at cs.columbia.edu (Dan B. Phung)
Date: Wed, 4 May 2005 03:33:51 -0400 (EDT)
Subject: [Linux-cluster] two node cluster, 2nd node hangs in join
Message-ID: <Pine.LNX.4.44.0505040324090.31956-100000@algiers.clic.cs.columbia.edu>

Hello, hopefully someone has ran into this and it's a quick fix. I'm using
a vanilla 2.6.9 kernel and the newest (as of tonite)  cvs branch from
-rRHEL4.  My sequence is to startup ccsd on both nodes, and then I try to
have both of them join (with a brief wait before I have the 2nd one try).
Here's what I get from the cman_tool's view of the nodes.

phung # cman_tool nodes
Node  Votes Exp Sts  Name
   3    1    1   J   blade03
   4    1    1   M   blade04

and in /var/log/messages, I see this:
  CMAN: sending membership request

followed by many:
  last message repeated 7 times

In addition I ran a tcpdump, and there seem to be UDP packets flying
around from node to node, using port 6809, so the network seems fine.
How would I debug this further?  What kinds of tools are people using
to debug their config/setup?

here's my config.

<?xml version="1.0"?>
<cluster name="blade_cluster" config_version="3">
        <fencedevices>
          <fencedevice name="blade_san" agent="fence_manual"/>
        </fencedevices>

        <fence_daemon clean_start="0">
        </fence_daemon>

        <cman two_node="1" expected_votes="1">
          <multicast addr="224.0.0.1"/>
        </cman>

        <clusternodes>
          <clusternode name="blade03" nodeid="3" votes="1">
          <multicast addr="224.0.0.1" interface="eth0"/>
             <fence>
               <method name="human">
                 <device name="last_resort" ipaddr="blade03"/>
               </method>
             </fence>
          </clusternode>

          <clusternode name="blade04" nodeid="4" votes="1">
             <multicast addr="224.0.0.1" interface="eth0"/>
             <fence>
               <method name="human">
                 <device name="last_resort" ipaddr="blade04"/>
               </method>
             </fence>
          </clusternode>
        </clusternodes>
</cluster>

regards,
Dan

-- 



From phung at cs.columbia.edu  Wed May  4 07:38:36 2005
From: phung at cs.columbia.edu (Dan B. Phung)
Date: Wed, 4 May 2005 03:38:36 -0400 (EDT)
Subject: [Linux-cluster] order of gfs services startup?
Message-ID: <Pine.LNX.4.44.0505040333530.31956-100000@algiers.clic.cs.columbia.edu>

Hello, I'm confused the exact ordering of the startup of the
services.  on the GFS Wiki, it says to run:

ccsd
cman_tool join
clvm
<mount your gfs stuff>
fence_tool join

however, from the document 
 "Symmetric Cluster Architecture and Component Technical Specifications" 
                                       - David Teigland

I read that I must startup fenced after joining the cluster and before
using GFS (so in the sequence about I would run it before clvm).

Does the ordering between these services matter?  If so, what's the
correct order?

regards,
Dan

-- 



From teigland at redhat.com  Wed May  4 07:51:57 2005
From: teigland at redhat.com (David Teigland)
Date: Wed, 4 May 2005 15:51:57 +0800
Subject: [Linux-cluster] order of gfs services startup?
In-Reply-To: <Pine.LNX.4.44.0505040333530.31956-100000@algiers.clic.cs.columbia.edu>
References: <Pine.LNX.4.44.0505040333530.31956-100000@algiers.clic.cs.columbia.edu>
Message-ID: <20050504075157.GB15486@redhat.com>

On Wed, May 04, 2005 at 03:38:36AM -0400, Dan B. Phung wrote:
> Hello, I'm confused the exact ordering of the startup of the
> services.  on the GFS Wiki, it says to run:
> 
> ccsd
> cman_tool join
> clvm
> <mount your gfs stuff>
> fence_tool join

This is wrong, you run fence_tool join after cman_tool join.
The primary source of instructions are here:
http://sources.redhat.com/cluster/doc/usage.txt

The GFS wiki was not done by our group.

> however, from the document 
>  "Symmetric Cluster Architecture and Component Technical Specifications" 

This is old and an unreliable source when it comes to specific details.

> Does the ordering between these services matter?  If so, what's the
> correct order?

usage.txt is the place to look

Dave



From phung at cs.columbia.edu  Wed May  4 09:54:26 2005
From: phung at cs.columbia.edu (Dan B. Phung)
Date: Wed, 4 May 2005 05:54:26 -0400 (EDT)
Subject: [Linux-cluster] two node cluster, 2nd node hangs in join
In-Reply-To: <Pine.LNX.4.44.0505040324090.31956-100000@algiers.clic.cs.columbia.edu>
Message-ID: <Pine.LNX.4.44.0505040553160.31956-100000@algiers.clic.cs.columbia.edu>

my problem was one from yesteryear that I found a solution for
in the mailing list.   the problem was a listing in my /etc/hosts
which I commented out.

 127.0.0.1     localhost.localdomain   localhost

-dan

On 4, May, 2005, Dan B. Phung declared:

> Hello, hopefully someone has ran into this and it's a quick fix. I'm using
> a vanilla 2.6.9 kernel and the newest (as of tonite)  cvs branch from
> -rRHEL4.  My sequence is to startup ccsd on both nodes, and then I try to
> have both of them join (with a brief wait before I have the 2nd one try).
> Here's what I get from the cman_tool's view of the nodes.
> 
> phung # cman_tool nodes
> Node  Votes Exp Sts  Name
>    3    1    1   J   blade03
>    4    1    1   M   blade04
> 
> and in /var/log/messages, I see this:
>   CMAN: sending membership request
> 
> followed by many:
>   last message repeated 7 times
> 
> In addition I ran a tcpdump, and there seem to be UDP packets flying
> around from node to node, using port 6809, so the network seems fine.
> How would I debug this further?  What kinds of tools are people using
> to debug their config/setup?
> 
> here's my config.
> 
> <?xml version="1.0"?>
> <cluster name="blade_cluster" config_version="3">
>         <fencedevices>
>           <fencedevice name="blade_san" agent="fence_manual"/>
>         </fencedevices>
> 
>         <fence_daemon clean_start="0">
>         </fence_daemon>
> 
>         <cman two_node="1" expected_votes="1">
>           <multicast addr="224.0.0.1"/>
>         </cman>
> 
>         <clusternodes>
>           <clusternode name="blade03" nodeid="3" votes="1">
>           <multicast addr="224.0.0.1" interface="eth0"/>
>              <fence>
>                <method name="human">
>                  <device name="last_resort" ipaddr="blade03"/>
>                </method>
>              </fence>
>           </clusternode>
> 
>           <clusternode name="blade04" nodeid="4" votes="1">
>              <multicast addr="224.0.0.1" interface="eth0"/>
>              <fence>
>                <method name="human">
>                  <device name="last_resort" ipaddr="blade04"/>
>                </method>
>              </fence>
>           </clusternode>
>         </clusternodes>
> </cluster>
> 
> regards,
> Dan
> 
> 

-- 



From pcaulfie at redhat.com  Wed May  4 10:03:58 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Wed, 04 May 2005 11:03:58 +0100
Subject: [Linux-cluster] two node cluster, 2nd node hangs in join
In-Reply-To: <Pine.LNX.4.44.0505040324090.31956-100000@algiers.clic.cs.columbia.edu>
References: <Pine.LNX.4.44.0505040324090.31956-100000@algiers.clic.cs.columbia.edu>
Message-ID: <42789E0E.90506@redhat.com>

Dan B. Phung wrote:
> Hello, hopefully someone has ran into this and it's a quick fix. I'm using
> a vanilla 2.6.9 kernel and the newest (as of tonite)  cvs branch from
> -rRHEL4.  My sequence is to startup ccsd on both nodes, and then I try to
> have both of them join (with a brief wait before I have the 2nd one try).
> Here's what I get from the cman_tool's view of the nodes.
> 
> phung # cman_tool nodes
> Node  Votes Exp Sts  Name
>    3    1    1   J   blade03
>    4    1    1   M   blade04
> 
> and in /var/log/messages, I see this:
>   CMAN: sending membership request
> 
> followed by many:
>   last message repeated 7 times
> 
> In addition I ran a tcpdump, and there seem to be UDP packets flying
> around from node to node, using port 6809, so the network seems fine.
> How would I debug this further?  What kinds of tools are people using
> to debug their config/setup?
>  

It looks like the return join-ack messages are not arriving at blade03. Check you don't have a firewall active that would block
these. cman uses unicast as well as multicast packets so you need to enable both.


-- 

patrick



From phung at cs.columbia.edu  Wed May  4 18:27:37 2005
From: phung at cs.columbia.edu (Dan B. Phung)
Date: Wed, 4 May 2005 14:27:37 -0400 (EDT)
Subject: [Linux-cluster] SMP w/ GFS?
Message-ID: <Pine.LNX.4.44.0505041425070.3092-100000@algiers.clic.cs.columbia.edu>

Are there any concerns or issues I should be aware of if SMP is enabled in
my kernel?  I'm aware of the preemptible-kernel issue, but don't quite
understand yet where in the code the problem lies.  Can someone tell me
what components have a problem with pre-emptible kernel and, if there's a
problem, SMP as well.

regards,
Dan

-- 



From andreslc at cs.toronto.edu  Wed May  4 22:29:31 2005
From: andreslc at cs.toronto.edu (Andres Lagar Cavilla)
Date: Wed, 04 May 2005 17:29:31 -0500
Subject: [Linux-cluster] cman_tool: Can't find broadcast address for node
Message-ID: <42794CCB.4030505@cs.toronto.edu>

Hi there,
I'm trying to set up a cluster here, to use gnbd. I downloaded the 
source from cvs (May 4), branch RHEL4. After a few tweaks compiled 
succesfully, and I have all the modules loaded
modprobe gfs
modprobe lock_dlm

I start ccsd with no problems. ccsd_test connect does not work. I do not 
despair and proceed to do cman_tool join -c cluster1 (my cluster.conf is 
based on the simple example out there, adapted for three machines; 
transcribed below). This works on one of my three nodes (toshiba, a SuSE 
9.2 box). However, on both the other nodes (Debian Sarge's) cman_tool fails:
cman_tool: Can't find broadcast address for node xxx

All machines run latest kernel 2.6.11, through the respective 
distribution packages. The debian machines can also boot into xen, using 
kernel 2.6.11, and it doesn't work there either.

Any pointers? don't really know what other information may be useful, so 
please ask at will and I'll cut&paste.
Thanks a lot

Andres

<?xml version="1.0"?>
<cluster name="cluster1" config_version="1">

  <cman>
  </cman>

  <clusternodes>
    <clusternode name="toshiba" votes="1">
      <fence>
        <method name="single">
          <device name="human" ipaddr="192.168.70.11"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="debian" votes="1">
      <fence>
        <method name="single">
          <device name="human" ipaddr="192.168.70.44"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="simmons" votes="1">
      <fence>
        <method name="single">
          <device name="human" ipaddr="192.168.70.112"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>

  <fence_devices>
    <fence_device name="human" agent="fence_manual"/>
  </fence_devices>

</cluster>




From phung at cs.columbia.edu  Wed May  4 21:41:57 2005
From: phung at cs.columbia.edu (Dan B. Phung)
Date: Wed, 4 May 2005 17:41:57 -0400 (EDT)
Subject: [Linux-cluster] cman_tool: Can't find broadcast address for node
In-Reply-To: <42794CCB.4030505@cs.toronto.edu>
Message-ID: <Pine.LNX.4.44.0505041740320.3092-100000@algiers.clic.cs.columbia.edu>

try using multicast instead of broadcast, as in:

(from http://sources.redhat.com/cluster/doc/usage.txt)

<cman>
    <multicast addr="224.0.0.1"/>
</cman>

<clusternode name="nd1">
    <multicast addr="224.0.0.1" interface="eth0"/>
</clusternode>

-dan


On 4, May, 2005, Andres Lagar Cavilla declared:

> Hi there,
> I'm trying to set up a cluster here, to use gnbd. I downloaded the 
> source from cvs (May 4), branch RHEL4. After a few tweaks compiled 
> succesfully, and I have all the modules loaded
> modprobe gfs
> modprobe lock_dlm
> 
> I start ccsd with no problems. ccsd_test connect does not work. I do not 
> despair and proceed to do cman_tool join -c cluster1 (my cluster.conf is 
> based on the simple example out there, adapted for three machines; 
> transcribed below). This works on one of my three nodes (toshiba, a SuSE 
> 9.2 box). However, on both the other nodes (Debian Sarge's) cman_tool fails:
> cman_tool: Can't find broadcast address for node xxx
> 
> All machines run latest kernel 2.6.11, through the respective 
> distribution packages. The debian machines can also boot into xen, using 
> kernel 2.6.11, and it doesn't work there either.
> 
> Any pointers? don't really know what other information may be useful, so 
> please ask at will and I'll cut&paste.
> Thanks a lot
> 
> Andres
> 
> <?xml version="1.0"?>
> <cluster name="cluster1" config_version="1">
> 
>   <cman>
>   </cman>
> 
>   <clusternodes>
>     <clusternode name="toshiba" votes="1">
>       <fence>
>         <method name="single">
>           <device name="human" ipaddr="192.168.70.11"/>
>         </method>
>       </fence>
>     </clusternode>
>     <clusternode name="debian" votes="1">
>       <fence>
>         <method name="single">
>           <device name="human" ipaddr="192.168.70.44"/>
>         </method>
>       </fence>
>     </clusternode>
>     <clusternode name="simmons" votes="1">
>       <fence>
>         <method name="single">
>           <device name="human" ipaddr="192.168.70.112"/>
>         </method>
>       </fence>
>     </clusternode>
>   </clusternodes>
> 
>   <fence_devices>
>     <fence_device name="human" agent="fence_manual"/>
>   </fence_devices>
> 
> </cluster>
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
> 

-- 



From phung at cs.columbia.edu  Wed May  4 22:11:28 2005
From: phung at cs.columbia.edu (Dan B. Phung)
Date: Wed, 4 May 2005 18:11:28 -0400 (EDT)
Subject: [Linux-cluster] gfs segfault on umount
Message-ID: <Pine.LNX.4.44.0505041801330.3092-100000@algiers.clic.cs.columbia.edu>

Nicola Ranaldo reported a segfault in a previous thread:
> [Linux-cluster] gfs/iscsi over 2.6.11 and gfs listener

though there was no follow up on the segfault.  I'm getting the same
segfault when trying to umount the gfs directory.  I'm running the latest
cvs head from the RHEL4 branch on 2.6.9 with preemptible kernel disabled.

GFS: fsid=blade_cluster:cheesefs.1: Joined cluster. Now mounting FS...
GFS: fsid=blade_cluster:cheesefs.1: jid=1: Trying to acquire journal 
lock...
GFS: fsid=blade_cluster:cheesefs.1: jid=1: Looking at journal...
GFS: fsid=blade_cluster:cheesefs.1: jid=1: Done
GFS: fsid=blade_cluster:cheesefs.1: Scanning for log elements...
GFS: fsid=blade_cluster:cheesefs.1: Found 0 unlinked inodes
GFS: fsid=blade_cluster:cheesefs.1: Found quota changes for 0 IDs
GFS: fsid=blade_cluster:cheesefs.1: Done
Unable to handle kernel NULL pointer dereference at virtual address 
00000004
 printing eip:
f8c1da05
*pde = 00000000
Oops: 0000 [#1]
Modules linked in: lock_dlm dlm cman gfs lock_harness dm_mod ipv6 rtc 
pcspkr psmouse sworks_agp agpgart tsdev mousedev joydev evdev usbhid 
ohci_hcd usbcore tg3 qla2300 qla2xxx scsi_transport_fc sg sr_mod sd_mod 
scsi_mod ide_cd cdrom reiserfs isofs ext3 jbd mbcache ide_generic 
via82cxxx trm290 triflex slc90e66 sis5513 siimage serverworks sc1200 
rz1000 piix pdc202xx_old pdc202xx_new opti621 ns87415 hpt366 ide_disk 
hpt34x generic cy82c693 cs5530 cs5520 cmd64x atiixp amd74xx alim15x3 
aec62xx ide_core unix
CPU:    0
EIP:    0060:[<f8c1da05>]    Tainted: GF  VLI
EFLAGS: 00010213   (2.6.9)
EIP is at gfs_ail_start_trans+0x15/0x180 [gfs]
eax: f8bdb5a8   ebx: 00000000   ecx: 00000400   edx: 00000000
esi: f8bc7000   edi: f8bdb5bc   ebp: f667665c   esp: f6747e4c
ds: 007b   es: 007b   ss: 0068
Process umount (pid: 5117, threadinfo=f6746000 task=f650a020)
Stack: f6747ec0 00000282 00000000 f6747ec0 f8bc7000 f66476b0 f6676600 
f8bc7000
       f8bdb5bc f6676600 f8c37945 f8bc7000 f6676600 00000000 f8bdb5a8 
f8bc7000
       f6746000 f6747ec0 f6bc2e00 f8c1f1b1 f8bc7000 00000400 f8bc7000 
00000000
Call Trace:
 [<f8c37945>] gfs_ail_start+0x75/0xc0 [gfs]
 [<f8c1f1b1>] gfs_sync_meta+0x31/0x60 [gfs]
 [<f8c53be3>] gfs_make_fs_ro+0x53/0xb0 [gfs]
 [<f8c49e8b>] gfs_put_super+0x2cb/0x310 [gfs]
 [<c0154f58>] generic_shutdown_super+0xe8/0x100
 [<f8c46f12>] gfs_kill_sb+0x32/0x6e [gfs]
 [<c0154de8>] deactivate_super+0x48/0x70
 [<c0169d5f>] sys_umount+0x3f/0xa0
 [<c01434ce>] do_munmap+0x11e/0x160
 [<c0143560>] sys_munmap+0x50/0x80
 [<c0169dd5>] sys_oldumount+0x15/0x20
 [<c0105f5f>] syscall_call+0x7/0xb
Code: 04 83 c4 08 e9 dd a1 54 c7 8d b6 00 00 00 00 8d bc 27 00 00 00 00 55 
57 56 53 83 ec 18 8b 6c 24 30 83 c5 5c 89 f6 8b 5d 04 39 eb <8b> 7b 04 74 
2e 8d b6 00 00 00 00 8b 43 c0 8d 73 c0 89 44 24 14



-dan

-- 







From teigland at redhat.com  Thu May  5 02:45:38 2005
From: teigland at redhat.com (David Teigland)
Date: Thu, 5 May 2005 10:45:38 +0800
Subject: [Linux-cluster] gfs segfault on umount
In-Reply-To: <Pine.LNX.4.44.0505041801330.3092-100000@algiers.clic.cs.columbia.edu>
References: <Pine.LNX.4.44.0505041801330.3092-100000@algiers.clic.cs.columbia.edu>
Message-ID: <20050505024538.GA10543@redhat.com>

On Wed, May 04, 2005 at 06:11:28PM -0400, Dan B. Phung wrote:
> EIP is at gfs_ail_start_trans+0x15/0x180 [gfs]

This bug crept in as part of a different fix.  It effected everyone and
I've heard that it's now fixed, if you update from cvs.

Dave



From teigland at redhat.com  Thu May  5 02:55:04 2005
From: teigland at redhat.com (David Teigland)
Date: Thu, 5 May 2005 10:55:04 +0800
Subject: [Linux-cluster] cman_tool: Can't find broadcast address for node
In-Reply-To: <42794CCB.4030505@cs.toronto.edu>
References: <42794CCB.4030505@cs.toronto.edu>
Message-ID: <20050505025504.GB10543@redhat.com>

On Wed, May 04, 2005 at 05:29:31PM -0500, Andres Lagar Cavilla wrote:

> However, on both the other nodes (Debian Sarge's) cman_tool fails:
> cman_tool: Can't find broadcast address for node xxx

> Any pointers? don't really know what other information may be useful, so 
> please ask at will and I'll cut&paste.

debug option:

cman_tool join -d

Dave



From pcaulfie at redhat.com  Thu May  5 06:56:35 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Thu, 05 May 2005 07:56:35 +0100
Subject: [Linux-cluster] cman_tool: Can't find broadcast address for node
In-Reply-To: <42794CCB.4030505@cs.toronto.edu>
References: <42794CCB.4030505@cs.toronto.edu>
Message-ID: <4279C3A3.9010809@redhat.com>

Andres Lagar Cavilla wrote:

> cman_tool: Can't find broadcast address for node xxx
> > 

The usual cause of this is that the node name is bound to the loopback interface rather than and ethernet one.
Have a look in /etc/hosts for a line like

127.0.0.1	myhostname

and remove it...oh, and put a real IP address in there.

-- 

patrick



From hyperbaba at neobee.net  Thu May  5 07:52:40 2005
From: hyperbaba at neobee.net (Vladimir Grujic)
Date: Thu, 5 May 2005 09:52:40 +0200
Subject: [Linux-cluster] cman_tool: Can't find broadcast address for node
In-Reply-To: <42794CCB.4030505@cs.toronto.edu>
References: <42794CCB.4030505@cs.toronto.edu>
Message-ID: <200505050952.41886.hyperbaba@neobee.net>

On Thursday 05 May 2005 00:29, Andres Lagar Cavilla wrote:

I had same problems when bluetooth was active in my system. Try using 
ip link list to see is there anything else except ethernets. Remove those 
interfaces and the cman_tool  will work again. 

> Hi there,
> I'm trying to set up a cluster here, to use gnbd. I downloaded the
> source from cvs (May 4), branch RHEL4. After a few tweaks compiled
> succesfully, and I have all the modules loaded
> modprobe gfs
> modprobe lock_dlm
>
> I start ccsd with no problems. ccsd_test connect does not work. I do not
> despair and proceed to do cman_tool join -c cluster1 (my cluster.conf is
> based on the simple example out there, adapted for three machines;
> transcribed below). This works on one of my three nodes (toshiba, a SuSE
> 9.2 box). However, on both the other nodes (Debian Sarge's) cman_tool
> fails: cman_tool: Can't find broadcast address for node xxx
>
> All machines run latest kernel 2.6.11, through the respective
> distribution packages. The debian machines can also boot into xen, using
> kernel 2.6.11, and it doesn't work there either.
>
> Any pointers? don't really know what other information may be useful, so
> please ask at will and I'll cut&paste.
> Thanks a lot
>
> Andres
>
> <?xml version="1.0"?>
> <cluster name="cluster1" config_version="1">
>
>   <cman>
>   </cman>
>
>   <clusternodes>
>     <clusternode name="toshiba" votes="1">
>       <fence>
>         <method name="single">
>           <device name="human" ipaddr="192.168.70.11"/>
>         </method>
>       </fence>
>     </clusternode>
>     <clusternode name="debian" votes="1">
>       <fence>
>         <method name="single">
>           <device name="human" ipaddr="192.168.70.44"/>
>         </method>
>       </fence>
>     </clusternode>
>     <clusternode name="simmons" votes="1">
>       <fence>
>         <method name="single">
>           <device name="human" ipaddr="192.168.70.112"/>
>         </method>
>       </fence>
>     </clusternode>
>   </clusternodes>
>
>   <fence_devices>
>     <fence_device name="human" agent="fence_manual"/>
>   </fence_devices>
>
> </cluster>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster

-- 
-------------------------------------------------------------------------------------------
If a program actually fits in memory and has enough disk space, it is 
guaranteed to crash. 
	-- Murphy's Computer Laws n?5
-------------------------------------------------------------------------------------------



From birger at birger.sh  Thu May  5 11:55:20 2005
From: birger at birger.sh (birger)
Date: Thu, 05 May 2005 13:55:20 +0200
Subject: [Linux-cluster] Proposed change to /etc/init.d/fenced
Message-ID: <427A09A8.9050408@birger.sh>

Would it be a good idea to do something like this change to 
/etc/init.d/fenced to enable fenced command line options to be set in 
/etc/sysconfig/cluster? This would keep -w as the default if the 
variable is unset.

fence_tool join ${FENCED_OPTIONS=-w} > /dev/null 2>&1

I'm sure the same applies to other startup-files...

-- 
birger



From stas at core.310.ru  Wed May  4 23:53:58 2005
From: stas at core.310.ru (Stanislav Sedov)
Date: Thu, 5 May 2005 03:53:58 +0400
Subject: [Linux-cluster] gfs segfault on umount
In-Reply-To: <Pine.LNX.4.44.0505041801330.3092-100000@algiers.clic.cs.columbia.edu>
References: <Pine.LNX.4.44.0505041801330.3092-100000@algiers.clic.cs.columbia.edu>
Message-ID: <20050504235358.GC68187@core.310.ru>

On Wed, May 04, 2005 at 06:11:28PM -0400, Dan B. Phung wrote:
> Nicola Ranaldo reported a segfault in a previous thread:
> > [Linux-cluster] gfs/iscsi over 2.6.11 and gfs listener
> 
> though there was no follow up on the segfault.  I'm getting the same
> segfault when trying to umount the gfs directory.  I'm running the latest
> cvs head from the RHEL4 branch on 2.6.9 with preemptible kernel disabled.
> 
> GFS: fsid=blade_cluster:cheesefs.1: Joined cluster. Now mounting FS...
> GFS: fsid=blade_cluster:cheesefs.1: jid=1: Trying to acquire journal 
> lock...
> GFS: fsid=blade_cluster:cheesefs.1: jid=1: Looking at journal...
> GFS: fsid=blade_cluster:cheesefs.1: jid=1: Done
> GFS: fsid=blade_cluster:cheesefs.1: Scanning for log elements...
> GFS: fsid=blade_cluster:cheesefs.1: Found 0 unlinked inodes
> GFS: fsid=blade_cluster:cheesefs.1: Found quota changes for 0 IDs
> GFS: fsid=blade_cluster:cheesefs.1: Done
> Unable to handle kernel NULL pointer dereference at virtual address 
> 00000004
>  printing eip:
> f8c1da05
> *pde = 00000000
> Oops: 0000 [#1]
> Modules linked in: lock_dlm dlm cman gfs lock_harness dm_mod ipv6 rtc 
> pcspkr psmouse sworks_agp agpgart tsdev mousedev joydev evdev usbhid 
> ohci_hcd usbcore tg3 qla2300 qla2xxx scsi_transport_fc sg sr_mod sd_mod 
> scsi_mod ide_cd cdrom reiserfs isofs ext3 jbd mbcache ide_generic 
> via82cxxx trm290 triflex slc90e66 sis5513 siimage serverworks sc1200 
> rz1000 piix pdc202xx_old pdc202xx_new opti621 ns87415 hpt366 ide_disk 
> hpt34x generic cy82c693 cs5530 cs5520 cmd64x atiixp amd74xx alim15x3 
> aec62xx ide_core unix
> CPU:    0
> EIP:    0060:[<f8c1da05>]    Tainted: GF  VLI
> EFLAGS: 00010213   (2.6.9)
> EIP is at gfs_ail_start_trans+0x15/0x180 [gfs]
> eax: f8bdb5a8   ebx: 00000000   ecx: 00000400   edx: 00000000
> esi: f8bc7000   edi: f8bdb5bc   ebp: f667665c   esp: f6747e4c
> ds: 007b   es: 007b   ss: 0068
> Process umount (pid: 5117, threadinfo=f6746000 task=f650a020)
> Stack: f6747ec0 00000282 00000000 f6747ec0 f8bc7000 f66476b0 f6676600 
> f8bc7000
>        f8bdb5bc f6676600 f8c37945 f8bc7000 f6676600 00000000 f8bdb5a8 
> f8bc7000
>        f6746000 f6747ec0 f6bc2e00 f8c1f1b1 f8bc7000 00000400 f8bc7000 
> 00000000
> Call Trace:
>  [<f8c37945>] gfs_ail_start+0x75/0xc0 [gfs]
>  [<f8c1f1b1>] gfs_sync_meta+0x31/0x60 [gfs]
>  [<f8c53be3>] gfs_make_fs_ro+0x53/0xb0 [gfs]
>  [<f8c49e8b>] gfs_put_super+0x2cb/0x310 [gfs]
>  [<c0154f58>] generic_shutdown_super+0xe8/0x100
>  [<f8c46f12>] gfs_kill_sb+0x32/0x6e [gfs]
>  [<c0154de8>] deactivate_super+0x48/0x70
>  [<c0169d5f>] sys_umount+0x3f/0xa0
>  [<c01434ce>] do_munmap+0x11e/0x160
>  [<c0143560>] sys_munmap+0x50/0x80
>  [<c0169dd5>] sys_oldumount+0x15/0x20
>  [<c0105f5f>] syscall_call+0x7/0xb
> Code: 04 83 c4 08 e9 dd a1 54 c7 8d b6 00 00 00 00 8d bc 27 00 00 00 00 55 
> 57 56 53 83 ec 18 8b 6c 24 30 83 c5 5c 89 f6 8b 5d 04 39 eb <8b> 7b 04 74 
> 2e 8d b6 00 00 00 00 8b 43 c0 8d 73 c0 89 44 24 14
> 
> 
> 
> -dan
> 
> -- 
> 
> 
> 
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
> 
> 
> 
Same problem. I'm using gfs with GULM on RHEL4.



From andreslc at cs.toronto.edu  Thu May  5 15:49:12 2005
From: andreslc at cs.toronto.edu (Andres Lagar Cavilla)
Date: Thu, 05 May 2005 10:49:12 -0500
Subject: [Linux-cluster] cman_tool: Can't find broadcast address for node
Message-ID: <427A4078.2020604@cs.toronto.edu>

Patrick,
That did work, thanks a lot.
However, my cluster is very unstable. I can execute cman_tool join -c cluster1 on my three nodes, 
but rarely does cman_tool status report 3 Total_votes.
Again, my SuSE box is steady, but the debian machines seem to de-associate from the cluster at will; I have no clue as to why this may happen

Reason I'm trying to get the cluster infrastructure functional is to avoid gnbd caching. I noticed I can use gnbd by directly:

gnbd_srv -n
gnbd_export -c ...
and
gnbd_import -n -i ...

However, I want all my machines to see a consistent state of the block device at all times, which happens not be the case
with caching. Unless I have something else not working properly...

Thanks a lot
Andres

Andres Lagar Cavilla wrote:

> cman_tool: Can't find broadcast address for node xxx
> > 

The usual cause of this is that the node name is bound to the loopback interface rather than and ethernet one.
Have a look in /etc/hosts for a line like

127.0.0.1	myhostname

and remove it...oh, and put a real IP address in there.

-- 

patrick



From pcaulfie at redhat.com  Thu May  5 14:58:20 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Thu, 05 May 2005 15:58:20 +0100
Subject: [Linux-cluster] cman_tool: Can't find broadcast address for node
In-Reply-To: <427A4078.2020604@cs.toronto.edu>
References: <427A4078.2020604@cs.toronto.edu>
Message-ID: <427A348C.4000208@redhat.com>

Andres Lagar Cavilla wrote:
> Patrick,
> That did work, thanks a lot.
> However, my cluster is very unstable. I can execute cman_tool join -c
> cluster1 on my three nodes, but rarely does cman_tool status report 3
> Total_votes.
> Again, my SuSE box is steady, but the debian machines seem to
> de-associate from the cluster at will; I have no clue as to why this may
> happen

I have no idea why that would happen, unless the network interface is being
taken up & down a lot! Have a look in syslog to see if there are any clues in there.
-- 

patrick



From lhh at redhat.com  Thu May  5 15:37:39 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 05 May 2005 11:37:39 -0400
Subject: [Linux-cluster] Proposed change to /etc/init.d/fenced
In-Reply-To: <427A09A8.9050408@birger.sh>
References: <427A09A8.9050408@birger.sh>
Message-ID: <1115307459.20618.692.camel@ayanami.boston.redhat.com>

On Thu, 2005-05-05 at 13:55 +0200, birger wrote:
> Would it be a good idea to do something like this change to 
> /etc/init.d/fenced to enable fenced command line options to be set in 
> /etc/sysconfig/cluster? This would keep -w as the default if the 
> variable is unset.
> 
> fence_tool join ${FENCED_OPTIONS=-w} > /dev/null 2>&1
> 
> I'm sure the same applies to other startup-files...

/me shrugs

I can see reasons to go both ways.

-- Lon



From birger at birger.sh  Thu May  5 16:25:08 2005
From: birger at birger.sh (birger)
Date: Thu, 05 May 2005 18:25:08 +0200
Subject: [Linux-cluster] Proposed change to /etc/init.d/fenced
In-Reply-To: <1115307459.20618.692.camel@ayanami.boston.redhat.com>
References: <427A09A8.9050408@birger.sh>
	<1115307459.20618.692.camel@ayanami.boston.redhat.com>
Message-ID: <427A48E4.1050506@birger.sh>

Lon Hohberger wrote:

>/me shrugs
>
>I can see reasons to go both ways.
>  
>

The reason I propose this is that whenever I update my source and 
rebuild the files in /etc/init.d get reinstalled as well, so I have to 
remember to reset my parameters before rebooting.

If the startup files would let me set parameters in 
/etc/sysconfig/cluster (it's already being sourced by the fenced script 
at least) I wouldn't have to worry...

-- 
birger



From lhh at redhat.com  Thu May  5 19:05:03 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 05 May 2005 15:05:03 -0400
Subject: [Linux-cluster] Proposed change to /etc/init.d/fenced
In-Reply-To: <427A48E4.1050506@birger.sh>
References: <427A09A8.9050408@birger.sh>
	<1115307459.20618.692.camel@ayanami.boston.redhat.com>
	<427A48E4.1050506@birger.sh>
Message-ID: <1115319903.20618.723.camel@ayanami.boston.redhat.com>

On Thu, 2005-05-05 at 18:25 +0200, birger wrote:

> The reason I propose this is that whenever I update my source and 
> rebuild the files in /etc/init.d get reinstalled as well, so I have to 
> remember to reset my parameters before rebooting.
> 
> If the startup files would let me set parameters in 
> /etc/sysconfig/cluster (it's already being sourced by the fenced script 
> at least) I wouldn't have to worry...

Don't all (or most) of them do this?

(They probably all *should*, if they're not currently...)

-- Lon



From lhh at redhat.com  Thu May  5 21:09:54 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 05 May 2005 17:09:54 -0400
Subject: [Linux-cluster] rgmanager "arbitrary" resource trees
Message-ID: <1115327394.20618.744.camel@ayanami.boston.redhat.com>

By popular demand, now appearing in CVS.

You can now do things like the following (without tinkering with the
individual resource scripts):

<service>
  <fs>
    <fs/>
  </fs>
</service>

-- Lon



From phung at cs.columbia.edu  Thu May  5 22:49:19 2005
From: phung at cs.columbia.edu (Dan B. Phung)
Date: Thu, 5 May 2005 18:49:19 -0400 (EDT)
Subject: [Linux-cluster] GFS on SAN, does a quorum make sense?
Message-ID: <Pine.LNX.4.44.0505051841000.9730-100000@algiers.clic.cs.columbia.edu>

Hello, I was wondering if a quorum makes sense when I have one underlying
shared device.  My setup is this:


           blade1 b2  b3  b4  b5  .....
              \    |   |   |   |  |  |  /
               [ fiber switch module ] 
                        |  | 
                  [FastT500/EXP500] 

and I want any blade to be able to access the storage at anytime.  right
now I have my configuration such that each node has the number of votes
equivalent to the quorum.  Does this make sense?  From my understanding,
the quorum/voting procedure is to prevent split-brain scenarios where two
nodes coming up for the first time might try to form two separate clusters
of the same name, which will cause data corruption.  How would I prevent
that, while still allowing any one node, even by itself, to access the
storage media.

Another use of the quorum is for distributed disks in the case of a node
failure the I/O to that disk is fenced.  Is that correct?

regards,
Dan

-- 



From phung at cs.columbia.edu  Thu May  5 23:01:23 2005
From: phung at cs.columbia.edu (Dan B. Phung)
Date: Thu, 5 May 2005 19:01:23 -0400 (EDT)
Subject: [Linux-cluster] GFS on SAN, does a quorum make sense?
In-Reply-To: <Pine.LNX.4.44.0505051841000.9730-100000@algiers.clic.cs.columbia.edu>
Message-ID: <Pine.LNX.4.44.0505051859570.9730-100000@algiers.clic.cs.columbia.edu>

actually, it doesn't even work with what I thought I could do, which is to
have each node have a vote count equal to the quorum, since the vote count
is summed and halved.  ...you guys are tricky ;)

so help is needed, or maybe a pointer to a thread on where this has
been discussed previously.  

thanks,
dan


On 5, May, 2005, Dan B. Phung declared:

> Hello, I was wondering if a quorum makes sense when I have one underlying
> shared device.  My setup is this:
> 
> 
>            blade1 b2  b3  b4  b5  .....
>               \    |   |   |   |  |  |  /
>                [ fiber switch module ] 
>                         |  | 
>                   [FastT500/EXP500] 
> 
> and I want any blade to be able to access the storage at anytime.  right
> now I have my configuration such that each node has the number of votes
> equivalent to the quorum.  Does this make sense?  From my understanding,
> the quorum/voting procedure is to prevent split-brain scenarios where two
> nodes coming up for the first time might try to form two separate clusters
> of the same name, which will cause data corruption.  How would I prevent
> that, while still allowing any one node, even by itself, to access the
> storage media.
> 
> Another use of the quorum is for distributed disks in the case of a node
> failure the I/O to that disk is fenced.  Is that correct?
> 
> regards,
> Dan
> 
> 

-- 



From lhh at redhat.com  Fri May  6 16:42:30 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 06 May 2005 12:42:30 -0400
Subject: [Linux-cluster] GFS on SAN, does a quorum make sense?
In-Reply-To: <Pine.LNX.4.44.0505051841000.9730-100000@algiers.clic.cs.columbia.edu>
References: <Pine.LNX.4.44.0505051841000.9730-100000@algiers.clic.cs.columbia.edu>
Message-ID: <1115397751.20618.824.camel@ayanami.boston.redhat.com>

On Thu, 2005-05-05 at 18:49 -0400, Dan B. Phung wrote:
> From my understanding,
> the quorum/voting procedure is to prevent split-brain scenarios where two
> nodes coming up for the first time might try to form two separate clusters
> of the same name, which will cause data corruption.  How would I prevent
> that, while still allowing any one node, even by itself, to access the
> storage media.

It's not to prevent two nodes: it's to prevent "less than a majority" of
the nodes (votes really) from forming their own cluster.

What you're trying to do is exactly what the algorithm is designed to
prevent :)

Consider a case where any one node can become quorate (by itself) in an
N-node cluster.  If you unplug the network cables on each node and start
up the cluster software on all N nodes, you'll end up with an N-way
split brain!  I think that is probably not a good thing.

You can do it manually by adjusting cman_tool's expected votes down to a
small number while doing a one-node boot, but please ensure the rest of
the cluster is down before doing so.

> Another use of the quorum is for distributed disks in the case of a node
> failure the I/O to that disk is fenced.  Is that correct?

Yes.

-- Lon



From phung at cs.columbia.edu  Fri May  6 17:58:08 2005
From: phung at cs.columbia.edu (Dan B. Phung)
Date: Fri, 6 May 2005 13:58:08 -0400 (EDT)
Subject: [Linux-cluster] GFS on SAN, does a quorum make sense?
In-Reply-To: <1115397751.20618.824.camel@ayanami.boston.redhat.com>
Message-ID: <Pine.LNX.4.44.0505061338140.15020-100000@algiers.clic.cs.columbia.edu>

On 6, May, 2005, Lon Hohberger declared:

> On Thu, 2005-05-05 at 18:49 -0400, Dan B. Phung wrote:
> > From my understanding,
> > the quorum/voting procedure is to prevent split-brain scenarios where two
> > nodes coming up for the first time might try to form two separate clusters
> > of the same name, which will cause data corruption.  How would I prevent
> > that, while still allowing any one node, even by itself, to access the
> > storage media.
> 
> It's not to prevent two nodes: it's to prevent "less than a majority" of
> the nodes (votes really) from forming their own cluster.

right, sorry, that was just an eg, which you extend to the N-node
case below.
 
> What you're trying to do is exactly what the algorithm is designed to
> prevent :)

that's what I was afraid of!  
 
> Consider a case where any one node can become quorate (by itself) in an
> N-node cluster.  If you unplug the network cables on each node and start
> up the cluster software on all N nodes, you'll end up with an N-way
> split brain!  I think that is probably not a good thing.
> 
> You can do it manually by adjusting cman_tool's expected votes down to a
> small number while doing a one-node boot, but please ensure the rest of
> the cluster is down before doing so.

ah, I see, and that would overide the cluster.conf node/vote counts.  

Woudn't another way be to somehow mark the file system with a unique
cluster id (randomly generated) so that mounting would fail if the cluster
isn't part of the other clusters?  ...or rather, at a higher level, what
would help is an inband locking mechanism.  I guess I should start reading
the design docs and published papers to see how this would work.

 
> > Another use of the quorum is for distributed disks in the case of a node
> > failure the I/O to that disk is fenced.  Is that correct?
> 
> Yes.
> 
> -- Lon
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
> 

-- 



From pmeharg at redhat.com  Fri May  6 18:19:43 2005
From: pmeharg at redhat.com (Paul Meharg)
Date: Fri, 06 May 2005 13:19:43 -0500
Subject: [Linux-cluster] Dell/EMC AX100 experience?
Message-ID: <1115403583.5605.55.camel@grahem>

Has any one had any experience using the Dell/EMC AX100 Fibre Channel
storage system?  I have some people asking about these devices.

Thanks,
Paul

---
Paul Meharg
Sales Engineer                     Red Hat, Inc.   
pmeharg at redhat.com      |    512-468-5705 - cell      |
512-219-5968

Learn, Network and Experience Open Source.
Red Hat Summit, New Orleans 2005
http://www.redhat.com/promo/summit/


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050506/d1a388da/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050506/d1a388da/attachment.sig>

From greg.freemyer at gmail.com  Fri May  6 18:22:42 2005
From: greg.freemyer at gmail.com (Greg Freemyer)
Date: Fri, 6 May 2005 14:22:42 -0400
Subject: [Linux-cluster] GFS on SAN, does a quorum make sense?
In-Reply-To: <Pine.LNX.4.44.0505051841000.9730-100000@algiers.clic.cs.columbia.edu>
References: <Pine.LNX.4.44.0505051841000.9730-100000@algiers.clic.cs.columbia.edu>
Message-ID: <87f94c3705050611222b0c3051@mail.gmail.com>

On 5/5/05, Dan B. Phung wrote:
> Hello, I was wondering if a quorum makes sense when I have one underlying
> shared device.  My setup is this:
> 
>            blade1 b2  b3  b4  b5  .....
>               \    |   |   |   |  |  |  /
>                [ fiber switch module ]
>                         |  |
>                   [FastT500/EXP500]
> 
> and I want any blade to be able to access the storage at anytime.  right
> now I have my configuration such that each node has the number of votes
> equivalent to the quorum.  Does this make sense?  From my understanding,
> the quorum/voting procedure is to prevent split-brain scenarios where two
> nodes coming up for the first time might try to form two separate clusters
> of the same name, which will cause data corruption.  How would I prevent
> that, while still allowing any one node, even by itself, to access the
> storage media.
> 
> Another use of the quorum is for distributed disks in the case of a node
> failure the I/O to that disk is fenced.  Is that correct?
> 
> regards,
> Dan
> 

You could designate a single "master node" that had N+1 votes.

Anytime it was online and part of the cluster, and of the other nodes
could come and go as they please.

Unfortunately, if it ever goes offline, you lose all access to the
shared storage.

If you made your "master node" be a low-cost but reliable node that
had zero job responsibility besides being the master, it should be
able to stay up for very long periods of time.

It is similar to the concept most companies use for DNS servers.  Give
them a single simple job and they are very reliable and the job is
simple enough to run on a 486 for most small businesses.

Greg
-- 
Greg Freemyer
The Norcross Group
Forensics for the 21st Century



From lhh at redhat.com  Fri May  6 18:41:37 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 06 May 2005 14:41:37 -0400
Subject: [Linux-cluster] GFS on SAN, does a quorum make sense?
In-Reply-To: <Pine.LNX.4.44.0505061338140.15020-100000@algiers.clic.cs.columbia.edu>
References: <Pine.LNX.4.44.0505061338140.15020-100000@algiers.clic.cs.columbia.edu>
Message-ID: <1115404897.20618.848.camel@ayanami.boston.redhat.com>

On Fri, 2005-05-06 at 13:58 -0400, Dan B. Phung wrote:
> > You can do it manually by adjusting cman_tool's expected votes down to a
> > small number while doing a one-node boot, but please ensure the rest of
> > the cluster is down before doing so.
> 
> ah, I see, and that would overide the cluster.conf node/vote counts.  

Yes.

> Woudn't another way be to somehow mark the file system with a unique
> cluster id (randomly generated) so that mounting would fail if the cluster
> isn't part of the other clusters?  ...or rather, at a higher level, what
> would help is an inband locking mechanism.  I guess I should start reading
> the design docs and published papers to see how this would work.

Well, the cluster doesn't form a quorum based on the ability to mount
the file system.  Rather, to mount on a given node, that node must
already be a member of the cluster quorum.

You could use something like SCSI reservations or fibre-channel fencing
(FC zoning) as an additional measure to prevent the other nodes from
being able to access the storage at all, but you'll still have to do the
CMAN trick in order to get it up to a quorate state where you can mount.

-- Lon



From pbruna at linuxcenterla.com  Fri May  6 19:16:55 2005
From: pbruna at linuxcenterla.com (Patricio Bruna V.)
Date: Fri, 6 May 2005 15:16:55 -0400
Subject: [Linux-cluster] Dell/EMC AX100 experience?
In-Reply-To: <1115403583.5605.55.camel@grahem>
References: <1115403583.5605.55.camel@grahem>
Message-ID: <200505061517.00021.pbruna@linuxcenterla.com>

El Vie 06 May 2005 14:19, Paul Meharg escribi?:
> Has any one had any experience using the Dell/EMC AX100 Fibre Channel
> storage system?  I have some people asking about these devices.
>
> Thanks,
> Paul
>
i do, i dont recommend it. :)

-- 
Patricio Bruna
pbruna at linuxcenterla.com
RHCE/RHCI
Jefe Soporte y Operaciones LinuxCenter S.A.
Canada 239, 5to piso, Providencia, Chile
http://www.linuxcenterla.com +56-2-2745000
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050506/cab9c098/attachment.sig>

From tpcollier at liberty.edu  Fri May  6 20:57:54 2005
From: tpcollier at liberty.edu (Collier, Tirus)
Date: Fri, 6 May 2005 16:57:54 -0400
Subject: [Linux-cluster] pool_tool -c
Message-ID: <5002B7C6DF422D48A2C8E1CF6397B1DD04211595@doc.University.liberty.edu>

Good Day,
 
Question in regards to running the pool_tool -c utility. What file is required when running pool_tool -c utility? Is it reading /etc/cluster.xml or /etc/cluster.conf?
 
Please advise, thanks.
 
T
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050506/deb8394d/attachment.htm>

From jbrassow at redhat.com  Fri May  6 21:40:17 2005
From: jbrassow at redhat.com (Jonathan E Brassow)
Date: Fri, 6 May 2005 16:40:17 -0500
Subject: [Linux-cluster] pool_tool -c
In-Reply-To: <5002B7C6DF422D48A2C8E1CF6397B1DD04211595@doc.University.liberty.edu>
References: <5002B7C6DF422D48A2C8E1CF6397B1DD04211595@doc.University.liberty.edu>
Message-ID: <72f8920467b293c6316997daebd3eab7@redhat.com>

it requires a file describing the layout of the pool you wish to create 
- something completely different from cluster.xml or cluster.conf.

This command is one that you would only run once to write labels out to 
the devices.  Once the labels are written, pool_assemble is used to 
assemble them.  you can find more information by looking at the various 
pool man pages.

an example file might be:
#begin pool config file
poolname foo
subpools 1
subpool 0 0 1 gfs_data
pooldevice 0 0 /dev/sdb
#end pool config file

The above file would write the label onto sdb.  a subsequent 
pool_assemble would create a device (/dev/pool/foo) composed of 
/dev/sdb.

For more information: 'man 5 pool_conf'
  brassow

On May 6, 2005, at 3:57 PM, Collier, Tirus wrote:

> Good Day,
> ?
> Question?in regards to?running the pool_tool -c utility. What file 
> is?required when running pool_tool -c utility? Is it reading 
> /etc/cluster.xml or /etc/cluster.conf?
> ?
> Please advise, thanks.
> ?
> T--
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: text/enriched
Size: 1361 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050506/05472f7a/attachment.bin>

From tpcollier at liberty.edu  Fri May  6 22:27:48 2005
From: tpcollier at liberty.edu (Collier, Tirus)
Date: Fri, 6 May 2005 18:27:48 -0400
Subject: [Linux-cluster] Dell/EMC AX100 experience?
Message-ID: <5002B7C6DF422D48A2C8E1CF6397B1DD05556F45@doc.University.liberty.edu>

Buenos,

Pregunta, request to know why you do not recommend DELL/EMC AX100?



Tirus Collier
Liberty University
Systems Administration
1971 University Blvd.
Lynchburg, VA 24502
434.582.2822


-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Patricio Bruna V.
Sent: Friday, May 06, 2005 3:17 PM
To: linux clustering
Subject: Re: [Linux-cluster] Dell/EMC AX100 experience?

El Vie 06 May 2005 14:19, Paul Meharg escribi?:
> Has any one had any experience using the Dell/EMC AX100 Fibre Channel
> storage system?  I have some people asking about these devices.
>
> Thanks,
> Paul
>
i do, i dont recommend it. :)

-- 
Patricio Bruna
pbruna at linuxcenterla.com
RHCE/RHCI
Jefe Soporte y Operaciones LinuxCenter S.A.
Canada 239, 5to piso, Providencia, Chile
http://www.linuxcenterla.com +56-2-2745000



From phung at cs.columbia.edu  Sat May  7 02:08:22 2005
From: phung at cs.columbia.edu (Dan B. Phung)
Date: Fri, 6 May 2005 22:08:22 -0400 (EDT)
Subject: [Linux-cluster] GFS init scripts order?
Message-ID: <Pine.LNX.4.44.0505062157380.20042-100000@algiers.clic.cs.columbia.edu>

Hello, 

I'm using GFS with Debian, and want to make the symlinks for the init
scripts.  

The first thing I notice is this:

blade04:/etc/rc6.d# /etc/init.d/ccsd start
/etc/init.d/ccsd: line 15: /etc/init.d/functions: No such file or directory
Starting ccsd:/etc/init.d/ccsd: line 26: success: command not found

however, I notice that ccsd is up and running.  is init.d/functions
something that's standard on redhat installations?  I was going to add
them to rc6.d and have the startup go in the obvious order:

S35networking
S36ifupdown
S40umountfs

S50ccsd
S50cman
S50fenced
S51clvmd

but what do I used to call vgchange -aly to activate my logical volumes?
init.d/gfs doesn't seem to do this, and I can add it, but I was
wondering if that is what everybody else was doing.

I'll have to deal with this for shutdown as well...

regards,
Dan


-- 



From teigland at redhat.com  Sat May  7 03:21:15 2005
From: teigland at redhat.com (David Teigland)
Date: Sat, 7 May 2005 11:21:15 +0800
Subject: [Linux-cluster] GFS on SAN, does a quorum make sense?
In-Reply-To: <87f94c3705050611222b0c3051@mail.gmail.com>
References: <Pine.LNX.4.44.0505051841000.9730-100000@algiers.clic.cs.columbia.edu>
	<87f94c3705050611222b0c3051@mail.gmail.com>
Message-ID: <20050507032115.GA8892@redhat.com>

On Fri, May 06, 2005 at 02:22:42PM -0400, Greg Freemyer wrote:

> You could designate a single "master node" that had N+1 votes.
> 
> Anytime it was online and part of the cluster, and of the other nodes
> could come and go as they please.
> 
> Unfortunately, if it ever goes offline, you lose all access to the
> shared storage.
> 
> If you made your "master node" be a low-cost but reliable node that
> had zero job responsibility besides being the master, it should be
> able to stay up for very long periods of time.

Yes, if you had seven nodes b1-b7, you could give b1 7 votes and everyone
else 1 vote.  Total votes would be 13 and quorum would require 7 or more
votes.  So as long as b1 was a member of the cluster (with its 7 votes),
there would be quorum and others could do whatever they like.

As you say, to reduce the likelihood of b1 failing you may want to just
have it join the cluster and do nothing else.

Dave



From lhh at redhat.com  Sun May  8 19:34:27 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Sun, 08 May 2005 15:34:27 -0400
Subject: [Linux-cluster] GFS init scripts order?
In-Reply-To: <Pine.LNX.4.44.0505062157380.20042-100000@algiers.clic.cs.columbia.edu>
References: <Pine.LNX.4.44.0505062157380.20042-100000@algiers.clic.cs.columbia.edu>
Message-ID: <1115580867.8698.20.camel@ayanami.boston.redhat.com>

On Fri, 2005-05-06 at 22:08 -0400, Dan B. Phung wrote:
> is init.d/functions
> something that's standard on redhat installations?

Yes.

>   I was going to add
> them to rc6.d and have the startup go in the obvious order:

> S35networking
> S36ifupdown
> S40umountfs
> 
> S50ccsd
> S50cman
> S50fenced
> S51clvmd

ccsd needs to be started before cman
cman needs to be started before fenced

I think putting them at the same start level 50 won't guarantee that,
so:

S50ccsd
S51cman
S52fenced
S53clvmd

-- Lon



From wkenji at labs.fujitsu.com  Sun May  8 20:04:37 2005
From: wkenji at labs.fujitsu.com (Kenji Wakamiya)
Date: Mon, 09 May 2005 05:04:37 +0900
Subject: [Linux-cluster] GFS and CLVM-snapshot
Message-ID: <427E70D5.6000107@labs.fujitsu.com>

Hello,

I have some basic questions.
I'm using GFS with the following configuration.

  node1  node2  node3 (FC3)
   GFS    GFS    GFS
   LV     LV     LV   <-- CLVM
  GNBD   GNBD   GNBD  <-- gnbd_import -n
   |      |      |
   +------+------+
          |
         GNBD  <-- gnbd_serv -n / gnbd_export -c
         node0 (file server: FC3)

Only for snapshot and backup, CLVM has been introduced here.
Is a setup such as this correct?
If so (or not :), why "sometimes" the following error occor?
Don't I need freeze/unfreeze GFS manually, like XFS?

  (on any node of node1-3)
  # gfs_tool freeze /mnt/gfs
  # lvcreate -s -n lv0ss0 -L 2G /dev/vg0/lv0
    Logical volume "lv0ss0" created
  # gfs_tool unfreeze /mnt/gfs
  # mount -t gfs -o ro,lockproto=lock_nolock /dev/vg0/lv0ss0 /mnt/gfs-ss
  mount: cannot mount block device /dev/vg0/lv0ss0 read-only

But if once I did gfs_fsck, the read-only mount succeeds.
Should I always do gfs_fsck before to use a snapshot?

  # gfs_fsck /dev/vg0/lv0ss0
  Initializing fsck
  Starting pass1
  Pass1 complete
  --snip--
  Starting pass5
  Pass5 complete
  # mount -t gfs -o ro,lockproto=lock_nolock /dev/vg0/lv0ss0 /mnt/gfs-ss

But on rare occasions, gfs_fsck also cause some errors.
For example,

  # gfs_fsck /dev/vg0/lv0ss0
  Initializing fsck
  Buffer #1736310 (1 of 4) is neither GFS_METATYPE_RB nor GFS_METATYPE_RG.
  Resource group is corrupted.
  Unable to read in rgrp descriptor.
  Unable to fill in resource group information.

The origin and snapshot LVs are as follows.

  # lvdisplay
    --- Logical volume ---
    LV Name                /dev/vg0/lv0
    VG Name                vg0
    LV UUID                fVPY4J-cndk-YD5n-wU0m-QR0S-2txI-LvnnQ5
    LV Write Access        read/write
    LV snapshot status     source of
			   /dev/vg0/lv0ss0 [active]
    LV Status              available
    # open                 1
    LV Size                20.00 GB
    Current LE             10240
    Segments               1
    Allocation             inherit
    Read ahead sectors     0
    Block device           253:0

    --- Logical volume ---
    LV Name                /dev/vg0/lv0ss0
    VG Name                vg0
    LV UUID                hZea21-flge-bI5J-gcCy-pKug-0YHN-ZXR7I4
    LV Write Access        read/write
    LV snapshot status     active destination for /dev/vg0/lv0
    LV Status              available
    # open                 1
    LV Size                20.00 GB
    Current LE             10240
    COW-table size         2.00 GB
    COW-table LE           1024
    Allocated to snapshot  2.35%
    Snapshot chunk size    8.00 KB
    Segments               1
    Allocation             inherit
    Read ahead sectors     0
    Block device           253:4

[Version info]
      Linux: 2.6.11-1.14_FC3smp
GFS/Cluster: CVS head of about 08 April (with CMAN/DLM)
 Dev-Mapper: 1.01.01 / CVS head of about 03 May
       LVM2: 2.2.01.09 / 2.2.01.10
       (--disable-selinux --with-clvmd=cman --with-cluster=shared)

Thanks, and sorry for my broken English.

-- Kenji



From teigland at redhat.com  Mon May  9 02:26:37 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 9 May 2005 10:26:37 +0800
Subject: [Linux-cluster] GFS and CLVM-snapshot
In-Reply-To: <427E70D5.6000107@labs.fujitsu.com>
References: <427E70D5.6000107@labs.fujitsu.com>
Message-ID: <20050509022637.GA8925@redhat.com>

On Mon, May 09, 2005 at 05:04:37AM +0900, Kenji Wakamiya wrote:
> Hello,
> 
> I have some basic questions.
> I'm using GFS with the following configuration.
> 
>   node1  node2  node3 (FC3)
>    GFS    GFS    GFS
>    LV     LV     LV   <-- CLVM
>   GNBD   GNBD   GNBD  <-- gnbd_import -n
>    |      |      |
>    +------+------+
>           |
>          GNBD  <-- gnbd_serv -n / gnbd_export -c
>          node0 (file server: FC3)
> 
> Only for snapshot and backup, CLVM has been introduced here.
> Is a setup such as this correct?
> If so (or not :), why "sometimes" the following error occor?
> Don't I need freeze/unfreeze GFS manually, like XFS?

CLVM snapshots do not work with GFS, sorry.  Maybe sometime in the future.
However, this should be possible:

   node1  node2  node3 (FC3)
    GFS    GFS    GFS
   GNBD   GNBD   GNBD  <-- gnbd_import -n
    |      |      |
    +------+------+
           |
          GNBD  <-- gnbd_serv -n / gnbd_export -c
          LV    <-- LVM2
          node0 (file server: FC3)
 
In this case you should run:

node1-3# gfs_tool freeze /mnt/gfs
node0#   lvcreate -s -n lv0ss0 ...
node1-3# gfs_tool unfreeze /mnt/gfs
node0#   gnbd_export -c -d lv0ss0 -n nbd0ss0
node1#   gnbd_import -n -i node0
node1#   mount -t gfs -o ro,lockproto=lock_nolock /dev/gnbd/nbd0ss0

I've not tried this myself, but I don't think there's anything wrong with
it in principle.

Dave



From phillips at redhat.com  Mon May  9 08:00:18 2005
From: phillips at redhat.com (Daniel Phillips)
Date: Mon, 9 May 2005 04:00:18 -0400
Subject: [Linux-cluster] GFS and CLVM-snapshot
In-Reply-To: <20050509022637.GA8925@redhat.com>
References: <427E70D5.6000107@labs.fujitsu.com>
	<20050509022637.GA8925@redhat.com>
Message-ID: <200505090400.18716.phillips@redhat.com>

On Sunday 08 May 2005 22:26, David Teigland wrote:
> On Mon, May 09, 2005 at 05:04:37AM +0900, Kenji Wakamiya wrote:
> > Hello,
> >
> > I have some basic questions.
> > I'm using GFS with the following configuration.
> >
> >   node1  node2  node3 (FC3)
> >    GFS    GFS    GFS
> >    LV     LV     LV   <-- CLVM
> >   GNBD   GNBD   GNBD  <-- gnbd_import -n
> >
> >    +------+------+
> >
> >          GNBD  <-- gnbd_serv -n / gnbd_export -c
> >          node0 (file server: FC3)
> >
> > Only for snapshot and backup, CLVM has been introduced here.
> > Is a setup such as this correct?
> > If so (or not :), why "sometimes" the following error occor?
> > Don't I need freeze/unfreeze GFS manually, like XFS?
>
> CLVM snapshots do not work with GFS, sorry.  Maybe sometime in the
> future. However, this should be possible:

I believe Dave actually meant to say, take a look here:

 http://sourceware.org/cluster/csnap/

It's a cluster snapshot.  This is ready to test if you are a developer, 
but not ready to use in a production system.

Regards,

Daniel



From naoki at valuecommerce.com  Mon May  9 08:06:03 2005
From: naoki at valuecommerce.com (Naoki)
Date: Mon, 09 May 2005 17:06:03 +0900
Subject: [Linux-cluster] Dell/EMC AX100 experience?
In-Reply-To: <200505061517.00021.pbruna@linuxcenterla.com>
References: <1115403583.5605.55.camel@grahem>
	<200505061517.00021.pbruna@linuxcenterla.com>
Message-ID: <1115625964.32239.46.camel@dragon.sys.intra>

Shockingly limited raid options (raid 5 only). 
Bizarre stuff like only allowing the hot spare on the second row of
disks which means you can't have the top and bottom row of disks as
separate raid sets.
Poor management system, setup is the proverbial.
No LED on the front.


The features of the Maxtronic (http://www.maxtronic.com/) blow it away.
I'm pretty sure most other arrays would as well.

On Fri, 2005-05-06 at 15:16 -0400, Patricio Bruna V. wrote:
> El Vie 06 May 2005 14:19, Paul Meharg escribi?:
> > Has any one had any experience using the Dell/EMC AX100 Fibre Channel
> > storage system?  I have some people asking about these devices.
> >
> > Thanks,
> > Paul
> >
> i do, i dont recommend it. :)
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster




From bruce.walker at hp.com  Mon May  9 13:04:53 2005
From: bruce.walker at hp.com (Walker, Bruce J)
Date: Mon, 9 May 2005 06:04:53 -0700
Subject: [Linux-cluster] Planning a Cluster meeting at OLS
Message-ID: <3689AF909D816446BA505D21F1461AE4039BAE00@cacexc04.americas.cpqcorp.net>

I am trying to gauge interest in a cluster meeting at OLS (before OLS
starts).  Topics would include, but not be limited to:
   membership, fencing, apis, (kernel and non-kernel)
   DLM 
   cluster filesystems (hooks, DLMs, recovery, membership, etc.)
   common hooks for clusterwide process management
      - openssi, openmosix, bproc, kerrighed, cassat, ...
Goal would be to make progress toward common infrastructure or common
infrastructure interfaces that various groups could work with.

If you are interested and can attend, please let me (not the whole
lists) know asap:
   -  one day only (Tuesday July 19)
   - two days (Monday July 18 and Tuesday July 19)
   - 3 days (Sunday-Tuesday).

Feel free to extend this request to other mailing lists.  Based on the
responses received by May 12 (it doesn't have to be a firm commitment),
I will let everyone know if and how long we will meet so we all can plan
travel.

Bruce Walker
 



From wkenji at nifty.com  Sun May  8 19:44:30 2005
From: wkenji at nifty.com (Kenji Wakamiya)
Date: Mon, 09 May 2005 04:44:30 +0900
Subject: [Linux-cluster] GFS and CLVM-snapshot
Message-ID: <427E6C1E.5020002@nifty.com>

Hello,

I have some basic questions.
I'm using GFS with the following configuration.

  node1  node2  node3 (FC3)
   GFS    GFS    GFS
   LV     LV     LV   <-- CLVM
  GNBD   GNBD   GNBD  <-- gnbd_import -n
   |      |      |
   +------+------+
          |
         GNBD  <-- gnbd_serv -n / gnbd_export -c
         node0 (file server: FC3)

Only for snapshot and backup, CLVM has been introduced here.
Is a setup such as this correct?
If so (or not :), why "sometimes" the following error occor?
Don't I need freeze/unfreeze GFS manually, like XFS?

  (on any node of node1-3)
  # gfs_tool freeze /mnt/gfs
  # lvcreate -s -n lv0ss0 -L 2G /dev/vg0/lv0
    Logical volume "lv0ss0" created
  # gfs_tool unfreeze /mnt/gfs
  # mount -t gfs -o ro,lockproto=lock_nolock /dev/vg0/lv0ss0 /mnt/gfs-ss
  mount: cannot mount block device /dev/vg0/lv0ss0 read-only

But if once I did gfs_fsck, the read-only mount succeeds.
Should I always do gfs_fsck before to use a snapshot?

  # gfs_fsck /dev/vg0/lv0ss0
  Initializing fsck
  Starting pass1
  Pass1 complete
  --snip--
  Starting pass5
  Pass5 complete
  # mount -t gfs -o ro,lockproto=lock_nolock /dev/vg0/lv0ss0 /mnt/gfs-ss

But on rare occasions, gfs_fsck also cause some errors.
For example,

  # gfs_fsck /dev/vg0/lv0ss0
  Initializing fsck
  Buffer #1736310 (1 of 4) is neither GFS_METATYPE_RB nor GFS_METATYPE_RG.
  Resource group is corrupted.
  Unable to read in rgrp descriptor.
  Unable to fill in resource group information.

The origin and snapshot LVs are as follows.

  # lvdisplay
    --- Logical volume ---
    LV Name                /dev/vg0/lv0
    VG Name                vg0
    LV UUID                fVPY4J-cndk-YD5n-wU0m-QR0S-2txI-LvnnQ5
    LV Write Access        read/write
    LV snapshot status     source of
			   /dev/vg0/lv0ss0 [active]
    LV Status              available
    # open                 1
    LV Size                20.00 GB
    Current LE             10240
    Segments               1
    Allocation             inherit
    Read ahead sectors     0
    Block device           253:0

    --- Logical volume ---
    LV Name                /dev/vg0/lv0ss0
    VG Name                vg0
    LV UUID                hZea21-flge-bI5J-gcCy-pKug-0YHN-ZXR7I4
    LV Write Access        read/write
    LV snapshot status     active destination for /dev/vg0/lv0
    LV Status              available
    # open                 1
    LV Size                20.00 GB
    Current LE             10240
    COW-table size         2.00 GB
    COW-table LE           1024
    Allocated to snapshot  2.35%
    Snapshot chunk size    8.00 KB
    Segments               1
    Allocation             inherit
    Read ahead sectors     0
    Block device           253:4

[Version info]
      Linux: 2.6.11-1.14_FC3smp
GFS/Cluster: CVS head of about 08 April (with CMAN/DLM)
 Dev-Mapper: 1.01.01 / CVS head of about 03 May
       LVM2: 2.2.01.09 / 2.2.01.10
       (--disable-selinux --with-clvmd=cman --with-cluster=shared)

Thanks, and sorry for my broken English.

-- Kenji



From schlegel at riege.com  Tue May 10 08:50:07 2005
From: schlegel at riege.com (Gunther Schlegel)
Date: Tue, 10 May 2005 10:50:07 +0200
Subject: [Linux-cluster] Dell/EMC AX100 experience?
In-Reply-To: <1115625964.32239.46.camel@dragon.sys.intra>
References: <1115403583.5605.55.camel@grahem>
	<200505061517.00021.pbruna@linuxcenterla.com>
	<1115625964.32239.46.camel@dragon.sys.intra>
Message-ID: <1115715007.17039.4.camel@gauss.riege.de>

> Shockingly limited raid options (raid 5 only). 

That has been changed with the firmware released in january. It now does
raid 0 / 1 / 5 / 10 as far as I remember the changelog. It also got
iSCSI support.

The AX100 sold by Dell has an LED on the front, 2 in fact. Whatever that
is good for... ;)

I am quite satisfied with it, but it is my first SAN, so my point of
view is certainly limited. Setup was easy, it worked right out of the
box. 

regards, Gunther

-- 
Gunther Schlegel                    Riege Software International GmbH
Manager IT Infrastructure                                Mollsfeld 10
                                             40670 Meerbusch, Germany
Email: schlegel at riege.com                     Phone: +49-2159-9148-0
                                              Fax:   +49-2159-9148-11
---------------------------------------------------------------------

Disclaimer:
You may grab my GPG key from http://www.keyserver.net .
A nonproportional font is recommended for reading.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050510/022da24b/attachment.sig>

From naoki at valuecommerce.com  Tue May 10 10:12:14 2005
From: naoki at valuecommerce.com (Naoki)
Date: Tue, 10 May 2005 19:12:14 +0900
Subject: [Linux-cluster] Dell/EMC AX100 experience?
In-Reply-To: <1115715007.17039.4.camel@gauss.riege.de>
References: <1115403583.5605.55.camel@grahem>
	<200505061517.00021.pbruna@linuxcenterla.com>
	<1115625964.32239.46.camel@dragon.sys.intra>
	<1115715007.17039.4.camel@gauss.riege.de>
Message-ID: <1115719934.21236.50.camel@dragon.sys.intra>

On Tue, 2005-05-10 at 10:50 +0200, Gunther Schlegel wrote:
> > Shockingly limited raid options (raid 5 only). 
> 
> That has been changed with the firmware released in january. It now does
> raid 0 / 1 / 5 / 10 as far as I remember the changelog. It also got
> iSCSI support.

Hey that's nice.. You'd think Dell would update the specs on their
website though ;)





From James.Bottomley at SteelEye.com  Mon May  9 15:21:13 2005
From: James.Bottomley at SteelEye.com (James Bottomley)
Date: Mon, 09 May 2005 10:21:13 -0500
Subject: [Linux-cluster] Re: [Clusters_sig] Planning a Cluster meeting at OLS
In-Reply-To: <3689AF909D816446BA505D21F1461AE4039BAE00@cacexc04.americas.cpqcorp.net>
References: <3689AF909D816446BA505D21F1461AE4039BAE00@cacexc04.americas.cpqcorp.net>
Message-ID: <1115652073.5051.20.camel@mulgrave>

On Mon, 2005-05-09 at 06:04 -0700, Walker, Bruce J wrote:
> I am trying to gauge interest in a cluster meeting at OLS (before OLS
> starts).  Topics would include, but not be limited to:
>    membership, fencing, apis, (kernel and non-kernel)
>    DLM 
>    cluster filesystems (hooks, DLMs, recovery, membership, etc.)
>    common hooks for clusterwide process management
>       - openssi, openmosix, bproc, kerrighed, cassat, ...
> Goal would be to make progress toward common infrastructure or common
> infrastructure interfaces that various groups could work with.

I have two topics that seem to me to be important for this:

1. Since almost every one of those clusters has pieces in the kernel and
pieces in userland, getting information across the kernel/user boundary
in a uniform fashion for all of them seems to be crucial.  At the
moment, it looks like the OCFS usysfs mechanism is really the only
candidate, but that's still encountering turbulence from Al Viro.
Perhaps agreeing on a common such mechanism and how it should be
implemented would be useful

2. I'd like to see a readout from the Red Hat Walldorf cluster event
added to the agenda (assuming such an event takes place, that is).

James

P.S. I'd like to come, but as you know I'll be at the Kernel Summit on
Monday and Tuesday, so it would only be to pop in briefly for particular
sessions.




From Gwenaelle.Romac at alcatel.fr  Tue May 10 09:07:06 2005
From: Gwenaelle.Romac at alcatel.fr (Gwenaelle.Romac at alcatel.fr)
Date: Tue, 10 May 2005 11:07:06 +0200
Subject: [Linux-cluster] Proliant G4 / MSA500 G2 & RedHat Cluster v3.0
Message-ID: <428079BA.8090302@alcatel.fr>

Hello,

I'm looking for experience of people having installed RedHat Cluster 3 
on HP Proliant G4 / MSA500 G2.
As far as I know, this configuration is supported by RedHat but not by HP.
Does it work fine ? Is there any issue encountered on this configuration 
? Do you have any sucess story ?

Best regards,

Gwenaelle Romac



From lmb at suse.de  Tue May 10 13:10:52 2005
From: lmb at suse.de (Lars Marowsky-Bree)
Date: Tue, 10 May 2005 15:10:52 +0200
Subject: [Linux-cluster] Re: [Clusters_sig] Planning a Cluster meeting at
	OLS
In-Reply-To: <1115652073.5051.20.camel@mulgrave>
References: <3689AF909D816446BA505D21F1461AE4039BAE00@cacexc04.americas.cpqcorp.net>
	<1115652073.5051.20.camel@mulgrave>
Message-ID: <20050510131052.GT9398@marowsky-bree.de>

On 2005-05-09T10:21:13, James Bottomley <James.Bottomley at SteelEye.com> wrote:

> 1. Since almost every one of those clusters has pieces in the kernel and
> pieces in userland, getting information across the kernel/user boundary
> in a uniform fashion for all of them seems to be crucial.  At the
> moment, it looks like the OCFS usysfs mechanism is really the only
> candidate, but that's still encountering turbulence from Al Viro.
> Perhaps agreeing on a common such mechanism and how it should be
> implemented would be useful

Indeed, I think that would be the most important discussion to have.


Sincerely,
    Lars Marowsky-Br?e <lmb at suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"



From teigland at redhat.com  Tue May 10 14:01:58 2005
From: teigland at redhat.com (David Teigland)
Date: Tue, 10 May 2005 22:01:58 +0800
Subject: [Linux-cluster] Re: [Clusters_sig] Planning a Cluster meeting at OLS
In-Reply-To: <1115652073.5051.20.camel@mulgrave>
References: <3689AF909D816446BA505D21F1461AE4039BAE00@cacexc04.americas.cpqcorp.net>
	<1115652073.5051.20.camel@mulgrave>
Message-ID: <20050510140158.GE9008@redhat.com>

On Mon, May 09, 2005 at 10:21:13AM -0500, James Bottomley wrote:

> 1. Since almost every one of those clusters has pieces in the kernel and
> pieces in userland, getting information across the kernel/user boundary
> in a uniform fashion for all of them seems to be crucial.  At the
> moment, it looks like the OCFS usysfs mechanism is really the only
> candidate, but that's still encountering turbulence from Al Viro.
> Perhaps agreeing on a common such mechanism and how it should be
> implemented would be useful

Sysfs works fine for me (for configuring dlm and gfs), but there's very
little I need to get across.  If there's something different I should use
I'd be glad to switch, configfs looked fine.

Dave



From eric at bootseg.com  Tue May 10 19:51:23 2005
From: eric at bootseg.com (Eric Kerin)
Date: Tue, 10 May 2005 15:51:23 -0400
Subject: [Linux-cluster] [PATCH} rename resourcegroup to service in
	rgmanager's example cluster.conf file
Message-ID: <1115754684.7301.5.camel@auh5-0478>

Hello, 

The attached patch changes all references of rm's child node
resourcegroup in rgmanager/examples/cluster.conf to service.

Thanks, 
Eric Kerin <eric at bootseg.com>


-------------- next part --------------
A non-text attachment was scrubbed...
Name: example.conf.patch
Type: text/x-patch
Size: 1498 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050510/a447c112/attachment.bin>

From phillips at redhat.com  Wed May 11 01:00:36 2005
From: phillips at redhat.com (Daniel Phillips)
Date: Tue, 10 May 2005 21:00:36 -0400
Subject: [Linux-cluster] Re: [Clusters_sig] Planning a Cluster meeting at OLS
In-Reply-To: <1115652073.5051.20.camel@mulgrave>
References: <3689AF909D816446BA505D21F1461AE4039BAE00@cacexc04.americas.cpqcorp.net>
	<1115652073.5051.20.camel@mulgrave>
Message-ID: <200505102100.37148.phillips@redhat.com>

On Monday 09 May 2005 11:21, James Bottomley wrote:
> On Mon, 2005-05-09 at 06:04 -0700, Walker, Bruce J wrote:
> > I am trying to gauge interest in a cluster meeting at OLS (before
> > OLS starts).  Topics would include, but not be limited to:
> >    membership, fencing, apis, (kernel and non-kernel)
> >    DLM
> >    cluster filesystems (hooks, DLMs, recovery, membership, etc.)
> >    common hooks for clusterwide process management
> >       - openssi, openmosix, bproc, kerrighed, cassat, ...
> > Goal would be to make progress toward common infrastructure or
> > common infrastructure interfaces that various groups could work
> > with.

Now the obvious question: what is the difference between a BOF and a 
cluster workshop at OLS?  The former is a greatly expanded version of 
the latter?  Why not just expand the OLS BOF into two or more sessions?  
(Currently it is already two, as the session lead by me and the SSI 
session lead by Bruce haven't actually been merged yet, as far as I 
know.)

I do see logistical issues.  A venue would have to be arranged, which 
takes money.  Has anybody stepped up?

> I have two topics that seem to me to be important for this:
>
> 1. Since almost every one of those clusters has pieces in the kernel
> and pieces in userland, getting information across the kernel/user
> boundary in a uniform fashion for all of them seems to be crucial. 
> At the moment, it looks like the OCFS usysfs mechanism is really the
> only candidate, but that's still encountering turbulence from Al
> Viro. Perhaps agreeing on a common such mechanism and how it should
> be implemented would be useful

For that, I use a message transport similar to dbus, with great success.

> 2. I'd like to see a readout from the Red Hat Walldorf cluster event
> added to the agenda (assuming such an event takes place, that is).

Such an event is taking place.  An official announcement should be 
possible in a few days.  Unofficial details have already been 
circulated informally.

Regards,

Daniel



From bruce.walker at hp.com  Wed May 11 01:07:56 2005
From: bruce.walker at hp.com (Walker, Bruce J)
Date: Tue, 10 May 2005 18:07:56 -0700
Subject: [Linux-cluster] RE: [Clusters_sig] Planning a Cluster meeting at OLS
Message-ID: <3689AF909D816446BA505D21F1461AE403A3B001@cacexc04.americas.cpqcorp.net>

Daniel,
There is a single BoF slot of perhaps 1.5 hours.  The difference is
12-16 hours vs. 1.5.  The way I see it we will continue the work done at
the Germany summit at the pre-OLS meeting. At the BoF we will just have
time to report on what has been discussed/agreed/proposed in the summit
and pre-OLS meetings. 

bruce

-----Original Message-----
From: Daniel Phillips [mailto:phillips at redhat.com] 
Sent: Tuesday, May 10, 2005 6:01 PM
To: clusters_sig at lists.osdl.org
Cc: James Bottomley; Walker, Bruce J; linux-cluster at redhat.com
Subject: Re: [Clusters_sig] Planning a Cluster meeting at OLS

On Monday 09 May 2005 11:21, James Bottomley wrote:
> On Mon, 2005-05-09 at 06:04 -0700, Walker, Bruce J wrote:
> > I am trying to gauge interest in a cluster meeting at OLS (before 
> > OLS starts).  Topics would include, but not be limited to:
> >    membership, fencing, apis, (kernel and non-kernel)
> >    DLM
> >    cluster filesystems (hooks, DLMs, recovery, membership, etc.)
> >    common hooks for clusterwide process management
> >       - openssi, openmosix, bproc, kerrighed, cassat, ...
> > Goal would be to make progress toward common infrastructure or 
> > common infrastructure interfaces that various groups could work 
> > with.

Now the obvious question: what is the difference between a BOF and a
cluster workshop at OLS?  The former is a greatly expanded version of
the latter?  Why not just expand the OLS BOF into two or more sessions?

(Currently it is already two, as the session lead by me and the SSI
session lead by Bruce haven't actually been merged yet, as far as I
know.)

I do see logistical issues.  A venue would have to be arranged, which
takes money.  Has anybody stepped up?

> I have two topics that seem to me to be important for this:
>
> 1. Since almost every one of those clusters has pieces in the kernel 
> and pieces in userland, getting information across the kernel/user 
> boundary in a uniform fashion for all of them seems to be crucial.
> At the moment, it looks like the OCFS usysfs mechanism is really the 
> only candidate, but that's still encountering turbulence from Al Viro.

> Perhaps agreeing on a common such mechanism and how it should be 
> implemented would be useful

For that, I use a message transport similar to dbus, with great success.

> 2. I'd like to see a readout from the Red Hat Walldorf cluster event 
> added to the agenda (assuming such an event takes place, that is).

Such an event is taking place.  An official announcement should be
possible in a few days.  Unofficial details have already been circulated
informally.

Regards,

Daniel



From phillips at redhat.com  Wed May 11 02:04:06 2005
From: phillips at redhat.com (Daniel Phillips)
Date: Tue, 10 May 2005 22:04:06 -0400
Subject: [Linux-cluster] Re: [Clusters_sig] Planning a Cluster meeting at OLS
In-Reply-To: <3689AF909D816446BA505D21F1461AE403A3B001@cacexc04.americas.cpqcorp.net>
References: <3689AF909D816446BA505D21F1461AE403A3B001@cacexc04.americas.cpqcorp.net>
Message-ID: <200505102204.06944.phillips@redhat.com>

On Tuesday 10 May 2005 21:07, Walker, Bruce J wrote:
> Daniel,
> There is a single BoF slot of perhaps 1.5 hours.  The difference is
> 12-16 hours vs. 1.5.  The way I see it we will continue the work done
> at the Germany summit at the pre-OLS meeting. At the BoF we will just
> have time to report on what has been discussed/agreed/proposed in the
> summit and pre-OLS meetings.

If there is support for this then I will do what I can to pitch in and 
help.  But it's a pretty tall order.  You did not address the question 
of sponsorship, and I can't vouch for the chances of Red Hat funding 
travel to two dedicated cluster events a month apart.  Maybe back in 
1999 that would have worked!

We might have better luck getting the BOF spots at OLS expanded.

Regards,

Daniel



From wkenji at labs.fujitsu.com  Wed May 11 02:11:03 2005
From: wkenji at labs.fujitsu.com (Kenji Wakamiya)
Date: Wed, 11 May 2005 11:11:03 +0900
Subject: [Linux-cluster] GFS and CLVM-snapshot
In-Reply-To: <20050509022637.GA8925@redhat.com>
References: <427E70D5.6000107@labs.fujitsu.com>
	<20050509022637.GA8925@redhat.com>
Message-ID: <428169B7.4080707@labs.fujitsu.com>

David Teigland wrote:
> CLVM snapshots do not work with GFS, sorry.  Maybe sometime in the future.

I see.  It's a very helpful info for me.

>    node1  node2  node3 (FC3)
>     GFS    GFS    GFS
>    GNBD   GNBD   GNBD  <-- gnbd_import -n
>     |      |      |
>     +------+------+
>            |
>           GNBD  <-- gnbd_serv -n / gnbd_export -c
>           LV    <-- LVM2
>           node0 (file server: FC3)

For the time being, this setup is worthy for our use.
But in the future, we will replace GNBD with iSCSI using NetApp.
If then, LVM is not able to be used on file-server side.
So I'm expecting CLVM.  But, NetApp also might have some
block-level snapshots, I guess...

> node1-3# gfs_tool freeze /mnt/gfs
> node0#   lvcreate -s -n lv0ss0 ...
> node1-3# gfs_tool unfreeze /mnt/gfs
> node0#   gnbd_export -c -d lv0ss0 -n nbd0ss0
> node1#   gnbd_import -n -i node0
> node1#   mount -t gfs -o ro,lockproto=lock_nolock /dev/gnbd/nbd0ss0

Thank you, I tried it with CVS head of about April 8th.
I don't know why, but "gfs_tool freeze" causes the following error:
gfs_tool: unknown mountpoint /mnt/gfs

I tried also today's CVS head (cmand wasn't compiled automatically;),
but fence_tool failed (It could not create a socket).
FC4 branch was not able to be compiled on FC3 (SOCK_ZAPPED matter).
I will consider using FC4-test3.

Thanks,

-- Kenji



From phillips at redhat.com  Wed May 11 02:18:15 2005
From: phillips at redhat.com (Daniel Phillips)
Date: Tue, 10 May 2005 22:18:15 -0400
Subject: [Linux-cluster] RE: [Clusters_sig] Planning a Cluster meeting at
	OLS
In-Reply-To: <1115776315.9716.145.camel@persist.az.mvista.com>
References: <3689AF909D816446BA505D21F1461AE403A3B001@cacexc04.americas.cpqcorp.net>
	<1115776315.9716.145.camel@persist.az.mvista.com>
Message-ID: <200505102218.15965.phillips@redhat.com>

On Tuesday 10 May 2005 21:51, Steven Dake wrote:
> I agree that a larger time slot is very important to discuss all of
> the relevant facts and approaches we have.  I for example would like
> an hour or so to discuss group messaging approaches.

I will go to that!

> The approach of continuing the work taking place in Germany is very
> problematic..  Many of the key people will not be at the Germany
> summit because 1) it hasn't yet been announced and 2) its too late to
> get any kind of travel approval 3) the cost of a germany summit trip
> being 4k places it outside the boundaries of what most U.S. 
> companies are willing to fork out.

I booked my travel just 4 days ago and got a very decent rate ($750 CAD 
direct Toronto/Frankfurt return).  It is not too late (but very nearly 
so...)

> OLS is a far superior venue for any kind of cluster summit meeting
> since extending travel for 1-2 days is pretty approachable for those
> already attending OLS.

In truth, pre-OLS was my first choice as well, however the generous 
sponsorship offer of S.A.P. prevailed.  And there are certain very big 
advantages in collaborating with the potential developers of Linux 
cluster _applications_ now, rather than later after infrastructure is 
already set in concrete.  Also, we fully expect that a number of 
European cluster developers who might otherwise be excluded from the 
event to be able to attend.  Who knows, we might even be able to 
attract Andrea Arcangelli, who is not a cluster developer but no doubt 
would be a very good one after a few days study.

We will just try to do the best we can with audio/visual support in 
order to include those who can't make the trip, to the fullest extent 
possible.

> I suggest the cluster meeting at OLS be a 
> standalone meeting, perhaps based upon some report out of the results
> of the germany meeting instead of a continuation of that effort.

That sounds good to me.

Regards,

Daniel



From wkenji at labs.fujitsu.com  Wed May 11 02:40:58 2005
From: wkenji at labs.fujitsu.com (Kenji Wakamiya)
Date: Wed, 11 May 2005 11:40:58 +0900
Subject: [Linux-cluster] GFS and CLVM-snapshot
In-Reply-To: <200505090400.18716.phillips@redhat.com>
References: <427E70D5.6000107@labs.fujitsu.com>
	<20050509022637.GA8925@redhat.com>
	<200505090400.18716.phillips@redhat.com>
Message-ID: <428170BA.9040000@labs.fujitsu.com>

Daniel Phillips wrote:
> I believe Dave actually meant to say, take a look here:
> 
>  http://sourceware.org/cluster/csnap/
> 
> It's a cluster snapshot.  This is ready to test if you are a developer, 
> but not ready to use in a production system.

Thank you, Daniel.  We are considering about a small web server for
in-house use.  I've known about CSNAP, but I've never tried it because
the tarball is for kernel 2.4.  Is it able to be used on kernel 2.6?
In the meantime, I will begin with GNBD+LVM on FC4-test.

Thanks,

-- Kenji



From teigland at redhat.com  Wed May 11 03:02:39 2005
From: teigland at redhat.com (David Teigland)
Date: Wed, 11 May 2005 11:02:39 +0800
Subject: [Linux-cluster] GFS and CLVM-snapshot
In-Reply-To: <428169B7.4080707@labs.fujitsu.com>
References: <427E70D5.6000107@labs.fujitsu.com>
	<20050509022637.GA8925@redhat.com>
	<428169B7.4080707@labs.fujitsu.com>
Message-ID: <20050511030239.GB10988@redhat.com>

On Wed, May 11, 2005 at 11:11:03AM +0900, Kenji Wakamiya wrote:

> For the time being, this setup is worthy for our use.
> But in the future, we will replace GNBD with iSCSI using NetApp.
> If then, LVM is not able to be used on file-server side.
> So I'm expecting CLVM.  But, NetApp also might have some
> block-level snapshots, I guess...

If the NetApp has block-level snapshots, that would be an excellent
solution (better than any kind of host-based snapshots).


> Thank you, I tried it with CVS head of about April 8th.
> I don't know why, but "gfs_tool freeze" causes the following error:
> gfs_tool: unknown mountpoint /mnt/gfs

There's a bug in gfs_tool so it doesn't recognize fs's on gnbd devices.
You need to reference the fs using its 'list' value instead.  Here's how
it works on my machine:

[root at va03 ~]# df
Filesystem            Size  Used Avail Use% Mounted on
rootfs                7.4G  3.2G  3.8G  46% /
/dev/gnbd/nda         7.7G   20K  7.7G   1% /gfs

[root at va03 ~]# gfs_tool freeze /gfs
gfs_tool: unknown mountpoint /gfs

[root at va03 ~]# gfs_tool list
3501862912 gnbd0 foo:a.0

[root at va03 ~]# gfs_tool freeze 3501862912
[root at va03 ~]# gfs_tool unfreeze 3501862912


> I tried also today's CVS head (cmand wasn't compiled automatically;),
> but fence_tool failed (It could not create a socket).
> FC4 branch was not able to be compiled on FC3 (SOCK_ZAPPED matter).
> I will consider using FC4-test3.

The cluster components are changing radically in CVS head and don't all
work together yet.  So, you need to checkout code from the RHEL4 or FC4
cvs branches.

Dave



From teigland at redhat.com  Wed May 11 07:06:44 2005
From: teigland at redhat.com (David Teigland)
Date: Wed, 11 May 2005 15:06:44 +0800
Subject: [Linux-cluster] Re: [Clusters_sig] minutes from 07apr2005
In-Reply-To: <200505110047.56612.phillips@redhat.com>
References: <1113417918.31312.26.camel@ibm-c.pdx.osdl.net>
	<200505102222.59675.phillips@redhat.com>
	<20050511032417.GC10988@redhat.com>
	<200505110047.56612.phillips@redhat.com>
Message-ID: <20050511070644.GE10988@redhat.com>

On Wed, May 11, 2005 at 12:47:56AM -0400, Daniel Phillips wrote:
> On Tuesday 10 May 2005 23:24, David Teigland wrote:
> > > See here:
> > >    http://people.redhat.com/~teigland/sca.pdf
> >
> > That's extremely outdated and became largely irrelevant long ago.
> > The successor is still taking shape, but integrating with the
> > development of other groups is a key factor and something we're
> > making good progress on.
> 
> So when are you going to reveal to the rest of us, your new grand 
> designs?

Well, derision certainly won't get you very far.  But, I'm happy to answer
polite questions, do you have one?

There's been a lot of email discussing the new, and still evolving,
approaches we're taking to infrastructure surrounding gfs.  If you missed
all that perhaps you'll find the archives a helpful place to start.

https://www.redhat.com/archives/linux-cluster/
http://lists.osdl.org/pipermail/clusters_sig/
http://lists.osdl.org/pipermail/openais/
http://marc.theaimsgroup.com/?l=linux-kernel&r=1&w=2

Once things become more certain and I've made more progress, I'll
summarize how the new things are working for those who don't follow
closely.

Dave



From phillips at redhat.com  Wed May 11 07:49:24 2005
From: phillips at redhat.com (Daniel Phillips)
Date: Wed, 11 May 2005 03:49:24 -0400
Subject: [Linux-cluster] Re: [Clusters_sig] minutes from 07apr2005
In-Reply-To: <20050511070644.GE10988@redhat.com>
References: <1113417918.31312.26.camel@ibm-c.pdx.osdl.net>
	<200505110047.56612.phillips@redhat.com>
	<20050511070644.GE10988@redhat.com>
Message-ID: <200505110349.24630.phillips@redhat.com>

On Wednesday 11 May 2005 03:06, David Teigland wrote:
> > So when are you going to reveal to the rest of us, your new grand
> > designs?
>
> Well, derision certainly won't get you very far.  But, I'm happy to
> answer polite questions, do you have one?

Yes: where is your design documentation?

Regards,

Daniel



From teigland at redhat.com  Wed May 11 08:35:00 2005
From: teigland at redhat.com (David Teigland)
Date: Wed, 11 May 2005 16:35:00 +0800
Subject: [Linux-cluster] Re: [Clusters_sig] minutes from 07apr2005
In-Reply-To: <200505110349.24630.phillips@redhat.com>
References: <1113417918.31312.26.camel@ibm-c.pdx.osdl.net>
	<200505110047.56612.phillips@redhat.com>
	<20050511070644.GE10988@redhat.com>
	<200505110349.24630.phillips@redhat.com>
Message-ID: <20050511083500.GF10988@redhat.com>

On Wed, May 11, 2005 at 03:49:24AM -0400, Daniel Phillips wrote:
> On Wednesday 11 May 2005 03:06, David Teigland wrote:
> > > So when are you going to reveal to the rest of us, your new grand
> > > designs?
> >
> > Well, derision certainly won't get you very far.  But, I'm happy to
> > answer polite questions, do you have one?
> 
> Yes: where is your design documentation?

Ah yes, the standard spurious fare from Daniel Phillips.

My last email made it abundantly clear that there's no design document to
be had and that you need to actually follow the email discussions where I
explain how things are developing, or wait for a summary to be written.
Spare us the rhetorical games, they're a waste of time.

Dave



From lmb at suse.de  Wed May 11 10:05:44 2005
From: lmb at suse.de (Lars Marowsky-Bree)
Date: Wed, 11 May 2005 12:05:44 +0200
Subject: [Linux-cluster] Re: [Clusters_sig] Planning a Cluster meeting at OLS
In-Reply-To: <200505102204.06944.phillips@redhat.com>
References: <3689AF909D816446BA505D21F1461AE403A3B001@cacexc04.americas.cpqcorp.net>
	<200505102204.06944.phillips@redhat.com>
Message-ID: <20050511100544.GS25070@marowsky-bree.de>

On 2005-05-10T22:04:06, Daniel Phillips <phillips at redhat.com> wrote:

> If there is support for this then I will do what I can to pitch in and 
> help.  But it's a pretty tall order.  You did not address the question 
> of sponsorship, and I can't vouch for the chances of Red Hat funding 
> travel to two dedicated cluster events a month apart.  Maybe back in 
> 1999 that would have worked!

All we need is room. If we're just 10-20 people, hell, even one of the
business suites in Les Suites will do - or we crouch in the hallway of
the Kernel Summit ;-)

Bruce is looking at getting a real conference room at Les Suites
though. I'm sure between OSDL, RHAT, Novell, HP etc we'll be able to
sort that one out.

Given that many people will be able to make OLS + one or two days before
that for the Kernel Summit (or now the Cluster Summit), but not Walldorf
- and maybe vice-versa -, I think having both venues is a pretty good
idea.
  

Sincerely,
    Lars Marowsky-Br?e <lmb at suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"



From sdake at mvista.com  Wed May 11 01:51:56 2005
From: sdake at mvista.com (Steven Dake)
Date: Tue, 10 May 2005 18:51:56 -0700
Subject: [Linux-cluster] RE: [Clusters_sig] Planning a Cluster meeting
	at OLS
In-Reply-To: <3689AF909D816446BA505D21F1461AE403A3B001@cacexc04.americas.cpqcorp.net>
References: <3689AF909D816446BA505D21F1461AE403A3B001@cacexc04.americas.cpqcorp.net>
Message-ID: <1115776315.9716.145.camel@persist.az.mvista.com>

On Tue, 2005-05-10 at 18:07, Walker, Bruce J wrote:
> Daniel,
> There is a single BoF slot of perhaps 1.5 hours.  The difference is
> 12-16 hours vs. 1.5.  The way I see it we will continue the work done at
> the Germany summit at the pre-OLS meeting. At the BoF we will just have
> time to report on what has been discussed/agreed/proposed in the summit
> and pre-OLS meetings. 
> 

Bruce

I agree that a larger time slot is very important to discuss all of the
relevant facts and approaches we have.  I for example would like an hour
or so to discuss group messaging approaches.

The approach of continuing the work taking place in Germany is very
problematic..  Many of the key people will not be at the Germany summit
because 1) it hasn't yet been announced and 2) its too late to get any
kind of travel approval 3) the cost of a germany summit trip being 4k
places it outside the boundaries of what most U.S.  companies are
willing to fork out.  OLS is a far superior venue for any kind of
cluster summit meeting since extending travel for 1-2 days is pretty
approachable for those already attending OLS.  I suggest the cluster
meeting at OLS be a standalone meeting, perhaps based upon some report
out of the results of the germany meeting instead of a continuation of
that effort.

regards
-steve

> bruce
> 
> -----Original Message-----
> From: Daniel Phillips [mailto:phillips at redhat.com] 
> Sent: Tuesday, May 10, 2005 6:01 PM
> To: clusters_sig at lists.osdl.org
> Cc: James Bottomley; Walker, Bruce J; linux-cluster at redhat.com
> Subject: Re: [Clusters_sig] Planning a Cluster meeting at OLS
> 
> On Monday 09 May 2005 11:21, James Bottomley wrote:
> > On Mon, 2005-05-09 at 06:04 -0700, Walker, Bruce J wrote:
> > > I am trying to gauge interest in a cluster meeting at OLS (before 
> > > OLS starts).  Topics would include, but not be limited to:
> > >    membership, fencing, apis, (kernel and non-kernel)
> > >    DLM
> > >    cluster filesystems (hooks, DLMs, recovery, membership, etc.)
> > >    common hooks for clusterwide process management
> > >       - openssi, openmosix, bproc, kerrighed, cassat, ...
> > > Goal would be to make progress toward common infrastructure or 
> > > common infrastructure interfaces that various groups could work 
> > > with.
> 
> Now the obvious question: what is the difference between a BOF and a
> cluster workshop at OLS?  The former is a greatly expanded version of
> the latter?  Why not just expand the OLS BOF into two or more sessions?
> 
> (Currently it is already two, as the session lead by me and the SSI
> session lead by Bruce haven't actually been merged yet, as far as I
> know.)
> 
> I do see logistical issues.  A venue would have to be arranged, which
> takes money.  Has anybody stepped up?
> 
> > I have two topics that seem to me to be important for this:
> >
> > 1. Since almost every one of those clusters has pieces in the kernel 
> > and pieces in userland, getting information across the kernel/user 
> > boundary in a uniform fashion for all of them seems to be crucial.
> > At the moment, it looks like the OCFS usysfs mechanism is really the 
> > only candidate, but that's still encountering turbulence from Al Viro.
> 
> > Perhaps agreeing on a common such mechanism and how it should be 
> > implemented would be useful
> 
> For that, I use a message transport similar to dbus, with great success.
> 
> > 2. I'd like to see a readout from the Red Hat Walldorf cluster event 
> > added to the agenda (assuming such an event takes place, that is).
> 
> Such an event is taking place.  An official announcement should be
> possible in a few days.  Unofficial details have already been circulated
> informally.
> 
> Regards,
> 
> Daniel
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster



From jpraher at yahoo.de  Wed May 11 14:02:03 2005
From: jpraher at yahoo.de (Jakob Praher)
Date: Wed, 11 May 2005 16:02:03 +0200
Subject: [Linux-cluster] [ddraid] extensibility
Message-ID: <d5t2qo$lbn$1@sea.gmane.org>

hi philipp and all,

I am very interested in ddraid for having sios systems over the network.
A few questions though:

I've looked shortly at your implementation (ddraid-0.5.0), but have a
few questions, which I thought to ask you:

* how will you track extensibilty:

afaik you assign a fixed number of logical sectors to the device mapper
 device. but lets take a trivial example:

you have 3 disks (2^k+1), each in a sperate computer.
then you want to plug a 4th and 5th disk (2^k+1) disk (again separate
computer) in the ddraid - how should that look like regarding the dm setup?

what problems could arise given the striping information on the blocks.
do you simply extend a stripe from 2 to 4 data blocks? is it possible to
raise the number of logical sectors with device mapper (sorry for my
ignorance - I haven't looked the details of the dmsetup man page). do
you have to rearrange block sizes or stripe information?

* have you investigated raid-x - would that be possible to implement via
device mapper?

I've looked closer on a the papers by hwang et al regarding Raid-x,
which also provides a single io space for servlerless clusters. The
design is basically a mixture of mirroring and striping, making it
possible for one node to fail completly.
On stripe is made of (n-1) blocks and you have (n-1) mirror blocks. They
have implemented as part of their trojan cluster project, wich I think
isn't alive any more.

I look forward to hearing from you.

-- Jakob



From lhh at redhat.com  Wed May 11 17:54:09 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 11 May 2005 13:54:09 -0400
Subject: [Linux-cluster] Re: [PATCH} rename resourcegroup to service in
	rgmanager's example cluster.conf file
In-Reply-To: <1115754684.7301.5.camel@auh5-0478>
References: <1115754684.7301.5.camel@auh5-0478>
Message-ID: <1115834049.8698.86.camel@ayanami.boston.redhat.com>

On Tue, 2005-05-10 at 15:51 -0400, Eric Kerin wrote:
> Hello, 
> 
> The attached patch changes all references of rm's child node
> resourcegroup in rgmanager/examples/cluster.conf to service.
> 
> Thanks, 
> Eric Kerin <eric at bootseg.com>
> 
> 

Thanks -- I'll commit this later today.

-- Lon




From jpraher at yahoo.de  Wed May 11 18:38:58 2005
From: jpraher at yahoo.de (Jakob Praher)
Date: Wed, 11 May 2005 20:38:58 +0200
Subject: [Linux-cluster] Re: [ddraid] extensibility
In-Reply-To: <d5t2qo$lbn$1@sea.gmane.org>
References: <d5t2qo$lbn$1@sea.gmane.org>
Message-ID: <d5tj1q$i1m$1@sea.gmane.org>

sorry, I meant daniel and all that are working with/on ddraid,

Jakob Praher wrote:
> 
> I am very interested in ddraid for having sios systems over the network.
> A few questions though:
> 
> I've looked shortly at your implementation (ddraid-0.5.0), but have a
> few questions, which I thought to ask you:
> 
> * how will you track extensibilty:
> 
> afaik you assign a fixed number of logical sectors to the device mapper
>  device. but lets take a trivial example:
> 
> you have 3 disks (2^k+1), each in a sperate computer.
> then you want to plug a 4th and 5th disk (2^k+1) disk (again separate
> computer) in the ddraid - how should that look like regarding the dm setup?
> 
> what problems could arise given the striping information on the blocks.
> do you simply extend a stripe from 2 to 4 data blocks? is it possible to
> raise the number of logical sectors with device mapper (sorry for my
> ignorance - I haven't looked the details of the dmsetup man page). do
> you have to rearrange block sizes or stripe information?
> 
> * have you investigated raid-x - would that be possible to implement via
> device mapper?
> 
> I've looked closer on a the papers by hwang et al regarding Raid-x,
> which also provides a single io space for servlerless clusters. The
> design is basically a mixture of mirroring and striping, making it
> possible for one node to fail completly.
> On stripe is made of (n-1) blocks and you have (n-1) mirror blocks. They
> have implemented as part of their trojan cluster project, wich I think
> isn't alive any more.
> 
> I look forward to hearing from you.
> 
> -- Jakob
> 



From mls at skayser.de  Wed May 11 20:50:18 2005
From: mls at skayser.de (Sebastian Kayser)
Date: Wed, 11 May 2005 22:50:18 +0200
Subject: [Linux-cluster] Compile problems with 2.6.9 source tarballs
Message-ID: <20050511205018.GA2059@planet.home>

Hi there,

today i have been trying to get linux-cluster up and running on a debian 
sarge box (installed today via debian-installer rc3). 

Therefor i downloaded the tarballs from

    http://people.redhat.com/cfeist/cluster/tgz/

and ran all the .patch files against a vanilla 2.6.9 kernel. The patched 
kernel compiled fine.

I was about to compile & install the additional tools, daemons, and 
modules when i encountered several errors. For example cman-kernel
(although i just noticed that this builds the module i already 
built inside the kernel tree) gave me

1. cd /usr/local/src/linux-cluster/cman-kernel-2.6.9-31
2. ./configure --kernel_src=/usr/src/linux (link to linux-2.6.9)
3. make (results beneath)

<compile_attempt>

sarge-fc1:/usr/local/src/linux-cluster/cman-kernel-2.6.9-31# make
cd src && make all
make[1]: Entering directory
`/usr/local/src/linux-cluster/cman-kernel-2.6.9-31/src'
rm -f cluster
ln -s . cluster
make -C /usr/src/linux
M=/usr/local/src/linux-cluster/cman-kernel-2.6.9-31/src modules
USING_KBUILD=yes
make[2]: Entering directory `/usr/src/linux-2.6.9'
  CC [M]  /usr/local/src/linux-cluster/cman-kernel-2.6.9-31/src/cnxman.o
  /usr/local/src/linux-cluster/cman-kernel-2.6.9-31/src/cnxman.c: In
  Funktion ??check_for_unacked_nodes??:
  /usr/local/src/linux-cluster/cman-kernel-2.6.9-31/src/cnxman.c:478:
  error: `CLUSTER_LEAVEFLAG_NORESPONSE' undeclared (first use in this
  function)
  ....

</compile_attempt>

I noticed that /usr/src/linux/include/cluster/cnxman-socket.h differed from 
the header file beneath the cman-kernel src dir (the above #define was
missing), copied this over to the kernel include dir and that fixed 
the problem. 

Nevertheless i also get errors when trying to compile the other components.  
For example the make process for cman complains about not finding ccs.h
(which naturally exists within the ccs directory but somehow isn't
referenced the right way...)

<compile_attempt>

sarge-fc1:/usr/local/src/linux-cluster/cman-1.0-pre32# make
cd cman_tool && make all
make[1]: Entering directory
`/usr/local/src/linux-cluster/cman-1.0-pre32/cman_tool'
gcc -Wall  -g -I//usr/include -I../config
-DCMAN_RELEASE_NAME=\"1.0-pre32\" -I/usr/src/linux/include/cluster -c -o
join_ccs.o join_ccs.c
join_ccs.c:21:17: ccs.h: Datei oder Verzeichnis nicht gefunden

...

</compile_attempt>

I get the slight feeling that i am missing something genereal regarding 
the build process for the linux-cluster package. Can anyone point me in 
the right direction?

Regards,

Sebastian



From mtilstra at redhat.com  Wed May 11 21:05:31 2005
From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra)
Date: Wed, 11 May 2005 16:05:31 -0500
Subject: [Linux-cluster] Compile problems with 2.6.9 source tarballs
In-Reply-To: <20050511205018.GA2059@planet.home>
References: <20050511205018.GA2059@planet.home>
Message-ID: <20050511210531.GA18372@redhat.com>

On Wed, May 11, 2005 at 10:50:18PM +0200, Sebastian Kayser wrote:
> Hi there,
> 
> today i have been trying to get linux-cluster up and running on a debian 
> sarge box (installed today via debian-installer rc3). 
> 
> Therefor i downloaded the tarballs from
> 
>     http://people.redhat.com/cfeist/cluster/tgz/
> 
> and ran all the .patch files against a vanilla 2.6.9 kernel. The patched 
> kernel compiled fine.
> 
> I was about to compile & install the additional tools, daemons, and 
> modules when i encountered several errors. For example cman-kernel
> (although i just noticed that this builds the module i already 
> built inside the kernel tree) gave me
> 
> 1. cd /usr/local/src/linux-cluster/cman-kernel-2.6.9-31
> 2. ./configure --kernel_src=/usr/src/linux (link to linux-2.6.9)
> 3. make (results beneath)

last i checked, you need to run 'make install'
the cluster cvs tree is amde of multiple seperate componates.  Many of
which depend on the others.  So the various headers and libraries need
to be installed before the other parts will compile.

try that first.

-- 
Michael Conrad Tadpol Tilstra
The sum of the intelligence on a planet is a constant; the population is
growing.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050511/995b8128/attachment.sig>

From wkenji at labs.fujitsu.com  Thu May 12 02:28:37 2005
From: wkenji at labs.fujitsu.com (Kenji Wakamiya)
Date: Thu, 12 May 2005 11:28:37 +0900
Subject: [Linux-cluster] GFS and CLVM-snapshot
In-Reply-To: <20050511030239.GB10988@redhat.com>
References: <427E70D5.6000107@labs.fujitsu.com>
	<20050509022637.GA8925@redhat.com>
	<428169B7.4080707@labs.fujitsu.com>
	<20050511030239.GB10988@redhat.com>
Message-ID: <4282BF55.40504@labs.fujitsu.com>

David Teigland wrote:
> If the NetApp has block-level snapshots, that would be an excellent
> solution (better than any kind of host-based snapshots).

 From what I've heard, apparently NetApp has LUN-level snapshots.
That sounds promising!

> [root at va03 ~]# gfs_tool freeze /gfs
> gfs_tool: unknown mountpoint /gfs
> 
> [root at va03 ~]# gfs_tool list
> 3501862912 gnbd0 foo:a.0
> 
> [root at va03 ~]# gfs_tool freeze 3501862912
> [root at va03 ~]# gfs_tool unfreeze 3501862912

I see how it is.  It worked stably!

By the way, sometimes I get warnings from gfs_fsck for snapshot-LV,
even though I froze GFS before doing lvcreate -s.
For example, the followings are.  Is this usual thing?

   # gfs_fsck -y /dev/vg0/lv0ss0
   Initializing fsck
   Starting pass1
   Pass1 complete
   Starting pass1b
   Pass1b complete
   Starting pass1c
   Pass1c complete
   Starting pass2
   Pass2 complete
   Starting pass3
   Pass3 complete
   Starting pass4
   Found unlinked inode at 1929233
	  Adjusting freemeta block count (59 -> 60).
	  Adjusting used dinode block count (10 -> 9).
   l+f directory at 29
   Added inode #1929233 to l+f dir
   Found unlinked inode at 1800635
   Added inode #1800635 to l+f dir
   Link count inconsistent for inode 25 - 5 6
   Link count updated for inode 25
   Pass4 complete
   Starting pass5
   ondisk and fsck bitmaps differ at block 29
   Succeeded.
   used inode count inconsistent: is 9 should be 10
   free meta count inconsistent: is 60 should be 59
   Pass5 complete
   #

> The cluster components are changing radically in CVS head and don't all
> work together yet.  So, you need to checkout code from the RHEL4 or FC4
> cvs branches.

I see, I will try with FC4-test3 from now.
Thanks!

-- Kenji



From phillips at redhat.com  Thu May 12 05:36:07 2005
From: phillips at redhat.com (Daniel Phillips)
Date: Thu, 12 May 2005 01:36:07 -0400
Subject: [Linux-cluster] [ddraid] extensibility
In-Reply-To: <d5t2qo$lbn$1@sea.gmane.org>
References: <d5t2qo$lbn$1@sea.gmane.org>
Message-ID: <200505120136.07300.phillips@redhat.com>

Hi Jakob,

On Wednesday 11 May 2005 10:02, Jakob Praher wrote:
> hi philipp and all,
>
> I am very interested in ddraid for having sios systems over the
> network. A few questions though:
>
> I've looked shortly at your implementation (ddraid-0.5.0), but have a
> few questions, which I thought to ask you:
>
> * how will you track extensibilty:
>
> afaik you assign a fixed number of logical sectors to the device
> mapper device. but lets take a trivial example:
>
> you have 3 disks (2^k+1), each in a sperate computer.
> then you want to plug a 4th and 5th disk (2^k+1) disk (again separate
> computer) in the ddraid - how should that look like regarding the dm
> setup?

You want to increase the "order" of the ddraid array?  This is not 
supported yet.  The way to do it would be to create a new, higher order 
ddraid array with an initial size of zero, then pvmove from the old 
array to the new one, expanding the new array and recovering space from 
the beginning of the old array as the move proceeds.  This will be 
tricky to implement!  But it is possible, and in time we can expect to 
see more sophisticated functionality like that arrive.

Adding an extra spindle to each ddraid member will be much easier.  In 
that case, you would change each ddraid member to a linear combination 
of two devices, by changing your dmsetup commands.

> what problems could arise given the striping information on the
> blocks. do you simply extend a stripe from 2 to 4 data blocks?

This is quite a tricky problem because you want to start using the new 
geometry while some of the old geometry is still in use.  There is no 
obstacle to implementing this, except a lot of work.

> is it 
> possible to raise the number of logical sectors with device mapper
> (sorry for my ignorance - I haven't looked the details of the dmsetup
> man page). do you have to rearrange block sizes or stripe
> information?

If you raise the number of logical sectors in the device mapper command, 
then you must actually have the new capacity available.  In other 
words, you would need to increase the partition size of each member of 
the ddraid array as well.  CLVM is supposed to be able to do such 
things automatically, but ddraid has not been integrated with CLVM yet.

> * have you investigated raid-x - would that be possible to implement
> via device mapper?
> I've looked closer on a the papers by hwang et al regarding Raid-x,

I do not have that paper.  Is it freely available online?  I see from 
the abstract that "the RAID-x architecture is based on an orthogonal 
striping and mirroring (OSM) scheme".  The answer is: you probably can 
implement this in device mapper.  But why?

> which also provides a single io space for servlerless clusters.

DDRaid provides that, in an efficient form.

> The design is basically a mixture of mirroring and striping, making it
> possible for one node to fail completly.

DDRaid does this without using either mirroring or striping.  It sounds 
like Raid-x is fairly wasteful of disk space and IO bandwidth, compared 
to ddraid.

> On stripe is made of (n-1) blocks and you have (n-1) mirror blocks.
> They have implemented as part of their trojan cluster project, wich I
> think isn't alive any more.

What is "n" in this description?

Regards,

Daniel



From Birger.Wathne at ift.uib.no  Thu May 12 10:43:55 2005
From: Birger.Wathne at ift.uib.no (Birger Wathne)
Date: Thu, 12 May 2005 12:43:55 +0200
Subject: [Linux-cluster] Proposed change to /etc/init.d/fenced
In-Reply-To: <1115319903.20618.723.camel@ayanami.boston.redhat.com>
References: <427A09A8.9050408@birger.sh>	<1115307459.20618.692.camel@ayanami.boston.redhat.com>	<427A48E4.1050506@birger.sh>
	<1115319903.20618.723.camel@ayanami.boston.redhat.com>
Message-ID: <4283336B.8040004@ift.uib.no>

Lon Hohberger wrote:

>On Thu, 2005-05-05 at 18:25 +0200, birger wrote:
>
>  
>
>>If the startup files would let me set parameters in 
>>/etc/sysconfig/cluster (it's already being sourced by the fenced script 
>>at least) I wouldn't have to worry...
>>    
>>
>
>Don't all (or most) of them do this?
>
>(They probably all *should*, if they're not currently...)
>  
>
The startup file for ccsd does, but fenced doesn't. I have changed my 
fenced file to follow the coding and naming convention from the ccsd 
startup file. Here is a diff.

10a11,12
 > FENCED_OPTS=-w
 >
21c23
<       fence_tool join -w > /dev/null 2>&1
---
 >       fence_tool join ${FENCED_OPTS} > /dev/null 2>&1

-- 
birger



From phung at cs.columbia.edu  Thu May 12 10:51:22 2005
From: phung at cs.columbia.edu (Dan B. Phung)
Date: Thu, 12 May 2005 06:51:22 -0400 (EDT)
Subject: [Linux-cluster] cman_tool kill, cluster stuck
Message-ID: <Pine.LNX.4.44.0505120648250.28123-100000@algiers.clic.cs.columbia.edu>

Hi all, my cluster seems to be stuck!  one of the nodes
went down, and I see a message on one of the live nodes
that repeatedly says:

  fencing node "blade12"
  fence "blade12" failed

so I take a view of my nodes:

cluster # cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    1   M   blade01
   4    1    1   M   blade04
   9    1    1   M   blade09
  10    1    1   M   blade10
  11    1    1   M   blade11
  12    1    1   X   blade12

blade09 and 10 report in, but they don't come all the way up (can't ssh
in) I think because it's hanging on the fencing of blade12.  so I
try:

cluster # cman_tool kill -n12
Can't kill node 12 : No such file or directory

cluster # cman_tool kill -nblade12
kill node failed: Invalid argument

is there a better way to repair my cluster without rebooting everybody?

thanks,
dan

--



From teigland at redhat.com  Thu May 12 11:14:53 2005
From: teigland at redhat.com (David Teigland)
Date: Thu, 12 May 2005 19:14:53 +0800
Subject: [Linux-cluster] cman_tool kill, cluster stuck
In-Reply-To: <Pine.LNX.4.44.0505120648250.28123-100000@algiers.clic.cs.columbia.edu>
References: <Pine.LNX.4.44.0505120648250.28123-100000@algiers.clic.cs.columbia.edu>
Message-ID: <20050512111453.GB14166@redhat.com>

On Thu, May 12, 2005 at 06:51:22AM -0400, Dan B. Phung wrote:
> Hi all, my cluster seems to be stuck!  one of the nodes
> went down, and I see a message on one of the live nodes
> that repeatedly says:
> 
>   fencing node "blade12"
>   fence "blade12" failed

I've gone back to some of your previous messages where you included your
cluster.conf and you had fencing misconfigured.  Here's how manual fencing
should look.  Notice that the device name "human" under node refers to the
fencedevice with the same name below.

<clusternode name="node01">
        <fence>
                <method name="single">
                        <device name="human" nodename="node01"/>
                </method>
        </fence>
</clusternode>


<fencedevices>
        <fencedevice name="human" agent="fence_manual"/>
</fencedevices>

If that doesn't help, please send your current cluster.conf.
Dave



From mls at skayser.de  Thu May 12 11:52:39 2005
From: mls at skayser.de (Sebastian Kayser)
Date: Thu, 12 May 2005 13:52:39 +0200
Subject: [Linux-cluster] Compile problems with 2.6.9 source tarballs
In-Reply-To: <20050511210531.GA18372@redhat.com>
References: <20050511205018.GA2059@planet.home>
	<20050511210531.GA18372@redhat.com>
Message-ID: <20050512115239.GA1700@planet.home>

On Wed, May 11, 2005 at 04:05:31PM -0500, Michael Conrad Tadpol Tilstra wrote:
> On Wed, May 11, 2005 at 10:50:18PM +0200, Sebastian Kayser wrote:
> > Hi there,
> > 
> > today i have been trying to get linux-cluster up and running on a debian 
> > sarge box (installed today via debian-installer rc3). 
> > 
> > Therefor i downloaded the tarballs from
> > 
> >     http://people.redhat.com/cfeist/cluster/tgz/
> > 
> > and ran all the .patch files against a vanilla 2.6.9 kernel. The patched 
> > kernel compiled fine.
> > 
> > I was about to compile & install the additional tools, daemons, and 
> > modules when i encountered several errors. For example cman-kernel
> > (although i just noticed that this builds the module i already 
> > built inside the kernel tree) gave me
> > 
> > 1. cd /usr/local/src/linux-cluster/cman-kernel-2.6.9-31
> > 2. ./configure --kernel_src=/usr/src/linux (link to linux-2.6.9)
> > 3. make (results beneath)
> 
> last i checked, you need to run 'make install'
> the cluster cvs tree is amde of multiple seperate componates.  Many of
> which depend on the others.  So the various headers and libraries need
> to be installed before the other parts will compile.
> 
> try that first.

thanks alot. That brought me on the right track.  Manual appliing the
kernel patches from the components directories wasn't a good idea at all
(that messed up the includes in the kernel tree). 

Initially i was confused because i used the separate component tarballs
from the above url and they lack the main Makefile for all components
(which is included in CVS and contains the right order to build the
components).

Today i fetched the CVS snapshot from 

    http://people.redhat.com/cfeist/cluster/cvs_snapshots/

and just ran "configure" and "make install" from the contained cluster/
directory and everything went fine. Yippiee!, up on my way to the next
obstacles =). 

Regards,

Sebastian



From alewis at redhat.com  Thu May 12 13:40:06 2005
From: alewis at redhat.com (AJ Lewis)
Date: Thu, 12 May 2005 08:40:06 -0500
Subject: [Linux-cluster] GFS and CLVM-snapshot
In-Reply-To: <4282BF55.40504@labs.fujitsu.com>
References: <427E70D5.6000107@labs.fujitsu.com>
	<20050509022637.GA8925@redhat.com>
	<428169B7.4080707@labs.fujitsu.com>
	<20050511030239.GB10988@redhat.com>
	<4282BF55.40504@labs.fujitsu.com>
Message-ID: <20050512134006.GC16497@null.msp.redhat.com>

On Thu, May 12, 2005 at 11:28:37AM +0900, Kenji Wakamiya wrote:
> By the way, sometimes I get warnings from gfs_fsck for snapshot-LV,
> even though I froze GFS before doing lvcreate -s.
> For example, the followings are.  Is this usual thing?
> 
>   # gfs_fsck -y /dev/vg0/lv0ss0
>   Initializing fsck
>   Starting pass1
>   Pass1 complete
>   Starting pass1b
>   Pass1b complete
>   Starting pass1c
>   Pass1c complete
>   Starting pass2
>   Pass2 complete
>   Starting pass3
>   Pass3 complete
>   Starting pass4
>   Found unlinked inode at 1929233
> 	  Adjusting freemeta block count (59 -> 60).
> 	  Adjusting used dinode block count (10 -> 9).
>   l+f directory at 29
>   Added inode #1929233 to l+f dir
>   Found unlinked inode at 1800635
>   Added inode #1800635 to l+f dir
>   Link count inconsistent for inode 25 - 5 6
>   Link count updated for inode 25
>   Pass4 complete
>   Starting pass5
>   ondisk and fsck bitmaps differ at block 29
>   Succeeded.
>   used inode count inconsistent: is 9 should be 10
>   free meta count inconsistent: is 60 should be 59
>   Pass5 complete
>   #

Interesting.  Have you noticed what inode gets stuck in l+f?

-- 
AJ Lewis                                   Voice:  612-638-0500
Red Hat Inc.                               E-Mail: alewis at redhat.com
720 Washington Ave. SE, Suite 200
Minneapolis, MN 55414
   
Current GPG fingerprint = D9F8 EDCE 4242 855F A03D  9B63 F50C 54A8 578C 8715
Grab the key at: http://people.redhat.com/alewis/gpg.html or one of the
many keyservers out there...

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050512/e703bae5/attachment.sig>

From birger at uib.no  Thu May 12 16:40:04 2005
From: birger at uib.no (Birger Wathne)
Date: Thu, 12 May 2005 18:40:04 +0200
Subject: [Linux-cluster] problem stopping dead samba service
Message-ID: <428386E4.50005@uib.no>


i just had a problem that may need looking into.

my samba processes had been killed. the init.d scripts for samba (smb 
and winbind) both return a non-zero status when you try to stop the 
service when it's already dead. clusvcadm couldn't stop the service 
because of this non-zero status, and i was also unable to start it.

i fixed it for now by making the stop functions in smb and winbind return 0.

are there any accepted standards for /etc/init.d scripts? What is 
supposed to happen when stopping a non-running service? in other words, 
is this a samba or a cluster problem?

-- 
birger



From phung at cs.columbia.edu  Thu May 12 17:41:19 2005
From: phung at cs.columbia.edu (Dan B. Phung)
Date: Thu, 12 May 2005 13:41:19 -0400 (EDT)
Subject: [Linux-cluster] cman_tool kill, cluster stuck
In-Reply-To: <20050512111453.GB14166@redhat.com>
Message-ID: <Pine.LNX.4.44.0505121340240.28123-100000@algiers.clic.cs.columbia.edu>

this config seems to have helped.  a node went down again, but I was able
to recover with fence_ack_manual.

thanks,
dan

On 12, May, 2005, David Teigland declared:

> On Thu, May 12, 2005 at 06:51:22AM -0400, Dan B. Phung wrote:
> > Hi all, my cluster seems to be stuck!  one of the nodes
> > went down, and I see a message on one of the live nodes
> > that repeatedly says:
> > 
> >   fencing node "blade12"
> >   fence "blade12" failed
> 
> I've gone back to some of your previous messages where you included your
> cluster.conf and you had fencing misconfigured.  Here's how manual fencing
> should look.  Notice that the device name "human" under node refers to the
> fencedevice with the same name below.
> 
> <clusternode name="node01">
>         <fence>
>                 <method name="single">
>                         <device name="human" nodename="node01"/>
>                 </method>
>         </fence>
> </clusternode>
> 
> 
> <fencedevices>
>         <fencedevice name="human" agent="fence_manual"/>
> </fencedevices>
> 
> If that doesn't help, please send your current cluster.conf.
> Dave
> 

-- 



From phung at cs.columbia.edu  Thu May 12 19:07:15 2005
From: phung at cs.columbia.edu (Dan B. Phung)
Date: Thu, 12 May 2005 15:07:15 -0400 (EDT)
Subject: [Linux-cluster] fence_manual node failure clarification
Message-ID: <Pine.LNX.4.44.0505121456330.28123-100000@algiers.clic.cs.columbia.edu>

My question is in reference to node failures using fence_manual
>From 'man fenced'

  Node failure
  When a domain member fails, the actual fencing must be completed before
  GFS recovery can begin.  This means any delay in carrying out the 
  fencing operation will also delay the completion of GFS file system
  operations; most file system operations will hang during this period.

So this is what I'm seeing now when a node fails, ie. the rest of the
nodes notice that the heartbeats of a certain node A has timed out. Node A
is fenced by ther remaining nodes, and the file system is hung.  My
questions are:

1) can I call fence_ack_manual right when I see that node A is fenced, or
do I have to wait for node A to reboot, come back, and join the cluster?

2) if I set the post_fail_delay to -1, the fence daemon waits indefinitely
for the failed node to rejoin the cluster, which it seems to be doing, 
so is this the default?  The man page shows:
  <fence_daemon post_fail_delay="0">

So with my assumption of the delay being 0, I expected the node to be
fenced instantly on timeout, recovery to begin and complete, and my file
system for the rest of the nodes to be usable in a relatively short time.
I guess if the answer to 1) is that this recovery is done manually with
the fence_ack_manual, then it all makes sense.

thanks,
dan

-- 




From jpraher at yahoo.de  Thu May 12 19:34:57 2005
From: jpraher at yahoo.de (Jakob Praher)
Date: Thu, 12 May 2005 21:34:57 +0200
Subject: [Linux-cluster] Re: [ddraid] extensibility
In-Reply-To: <200505120136.07300.phillips@redhat.com>
References: <d5t2qo$lbn$1@sea.gmane.org>
	<200505120136.07300.phillips@redhat.com>
Message-ID: <d60am9$ttb$1@sea.gmane.org>

thank you for your reply.

Daniel Phillips wrote:
> Hi Jakob,
> 
> On Wednesday 11 May 2005 10:02, Jakob Praher wrote:
> 
>>I am very interested in ddraid for having sios systems over the
>>network. A few questions though:
>>
>>I've looked shortly at your implementation (ddraid-0.5.0), but have a
>>few questions, which I thought to ask you:


> You want to increase the "order" of the ddraid array?  This is not 
> supported yet.  The way to do it would be to create a new, higher order 
> ddraid array with an initial size of zero, then pvmove from the old 
> array to the new one, expanding the new array and recovering space from 
> the beginning of the old array as the move proceeds.  This will be 
> tricky to implement!  But it is possible, and in time we can expect to 
> see more sophisticated functionality like that arrive.

That sounds promising.

> 
> Adding an extra spindle to each ddraid member will be much easier.  In 
> that case, you would change each ddraid member to a linear combination 
> of two devices, by changing your dmsetup commands.
>
okay so you would use a linear /dev/mapper/.. device instead of a
physical device.

> 
>>what problems could arise given the striping information on the
>>blocks. do you simply extend a stripe from 2 to 4 data blocks?
> 
> 
> This is quite a tricky problem because you want to start using the new 
> geometry while some of the old geometry is still in use.  There is no 
> obstacle to implementing this, except a lot of work.

Yes that should be tricky. So would need 2 tables (if there is no
mapping table, but a funciton, than you need to track at least which
logical blocks are migrated), one for the new and one for the old (this
remainds me somewhat of the newspace and oldspace stuff for copying
garbage collection, where you can do it also incrementally). It would be
interesting to discuss more about the technical implications about such
an approach.

> 
> 
>>is it 
>>possible to raise the number of logical sectors with device mapper
>>(sorry for my ignorance - I haven't looked the details of the dmsetup
>>man page). do you have to rearrange block sizes or stripe
>>information?
> 
> 
> If you raise the number of logical sectors in the device mapper command, 
> then you must actually have the new capacity available.  In other 
> words, you would need to increase the partition size of each member of 
> the ddraid array as well.  CLVM is supposed to be able to do such 
> things automatically, but ddraid has not been integrated with CLVM yet.
> 
> 
> I do not have that paper.  Is it freely available online?  I see from 
> the abstract that "the RAID-x architecture is based on an orthogonal 
> striping and mirroring (OSM) scheme".  The answer is: you probably can 
> implement this in device mapper.  But why?

Sure. Sorry for my ignorance:
- Presentation -
http://csce.uark.edu/~aapon/courses/ioparallel/presentations/Raid-x.ppt
http://www.cs.plu.edu/courses/csce480/arts/distributedraid.pdf

I want to get a better system understanding. Thus the question.
And one of my biggest interests is dynamical extensibility. Having the
ablity to add more storage if its needed. Here the RAID-x is
interesting, since it scales linearly - you can simply add an additional
node, without the need to follow a special formula. (like the 2^k+1) rule.

And since the backup information isn't stored in parity form, but in
plain form, it can be also used for load balancing and fast error
recovery. (chained declustering) - the orthogonal storage algorithm
stores all the blocks for one stripe on one mirror drive.

Sure it eats more physical space. But on the other hand the costs of the
gigabytes for sata aren't that much compared to the huge amount of money
you have to buy for a sophisticated storage system.

> 
>>The design is basically a mixture of mirroring and striping, making it
>>possible for one node to fail completly.
> 
> 
> DDRaid does this without using either mirroring or striping.  It sounds 
> like Raid-x is fairly wasteful of disk space and IO bandwidth, compared 
> to ddraid.

The following is for me a path to understand the differences.

Aha. So I've read over the raid3.5 paper, but haven't analized what the
ddraid driver implements in this regard.

You have an explicit drive full of parity, and don't distribute the
parity information accross the other drives.

Raid 3.5 splits a logical unit (or block) into 2^K drives (1 drive is
full of parity information).


If one fails, it is either:
a) the paritiy one - where parity can be rebuild
b) a data one, in which case the data can be extracted from the parity
and the other data.

These are all nice properties.


One problem is that the funciton N(k) = 2^k+1 grows pretty sooon to
quite large numbers:

3,5,9,17,33,65,129,...

So it would be problematic to plug in additional nodes. I mean you can't
grow the array linearly. To reach the next array size you've to always
nearly double the last size. N_1 = N_0 + N_0 - 1 = 2(N_0) - 1


> 
> 
>>On stripe is made of (n-1) blocks and you have (n-1) mirror blocks.
>>They have implemented as part of their trojan cluster project, wich I
>>think isn't alive any more.
> 
> What is "n" in this description?
> 
n is the number of disks. (so N in your paper)
so you have the same amount of mirror blocks as you have stripe block,
which is somewhat orthogonal to itself. The advantage is that all of the
mirror blocks of one stripe set go are stored on one device, which makes
mirroring easy to implement.

--Jakob



From lhh at redhat.com  Thu May 12 22:06:55 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 12 May 2005 18:06:55 -0400
Subject: [Linux-cluster] problem stopping dead samba service
In-Reply-To: <428386E4.50005@uib.no>
References: <428386E4.50005@uib.no>
Message-ID: <1115935616.31928.3.camel@ayanami.boston.redhat.com>

On Thu, 2005-05-12 at 18:40 +0200, Birger Wathne wrote:
> i just had a problem that may need looking into.
> 
> my samba processes had been killed. the init.d scripts for samba (smb 
> and winbind) both return a non-zero status when you try to stop the 
> service when it's already dead. clusvcadm couldn't stop the service 
> because of this non-zero status, and i was also unable to start it.
> 
> i fixed it for now by making the stop functions in smb and winbind return 0.
> 
> are there any accepted standards for /etc/init.d scripts? What is 
> supposed to happen when stopping a non-running service? in other words, 
> is this a samba or a cluster problem?
> 

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=151104

Basically, our default return was 1 if it wasn't running.  The LSB says
it should be 0 for the stop-case if it's not running.  Also, there was a
bug recently reported in which stop ordering was using the start levels:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=157248

This would cause the file systems to be unmounted before Samba --
killing it uncleanly if force-unmount was used.  This is fixed in CVS.

-- Lon



From phung at cs.columbia.edu  Thu May 12 22:49:33 2005
From: phung at cs.columbia.edu (Dan B. Phung)
Date: Thu, 12 May 2005 18:49:33 -0400 (EDT)
Subject: [Linux-cluster] fence_manual node failure clarification
In-Reply-To: <Pine.LNX.4.44.0505121456330.28123-100000@algiers.clic.cs.columbia.edu>
Message-ID: <Pine.LNX.4.44.0505121848080.31835-100000@algiers.clic.cs.columbia.edu>

...answered my own question...or, the helpful message answered
my question.  I can reset it manually using fence_ack_manual.

Node blade09 needs to be reset before recovery can procede.  W aiting for
blade09 to rejoin the cluster or for manual acknowledgement that it has
been reset (i.e. fence_ack_manual -n blade09)



On 12, May, 2005, Dan B. Phung declared:

> My question is in reference to node failures using fence_manual
> >From 'man fenced'
> 
>   Node failure
>   When a domain member fails, the actual fencing must be completed before
>   GFS recovery can begin.  This means any delay in carrying out the 
>   fencing operation will also delay the completion of GFS file system
>   operations; most file system operations will hang during this period.
> 
> So this is what I'm seeing now when a node fails, ie. the rest of the
> nodes notice that the heartbeats of a certain node A has timed out. Node A
> is fenced by ther remaining nodes, and the file system is hung.  My
> questions are:
> 
> 1) can I call fence_ack_manual right when I see that node A is fenced, or
> do I have to wait for node A to reboot, come back, and join the cluster?
> 
> 2) if I set the post_fail_delay to -1, the fence daemon waits indefinitely
> for the failed node to rejoin the cluster, which it seems to be doing, 
> so is this the default?  The man page shows:
>   <fence_daemon post_fail_delay="0">
> 
> So with my assumption of the delay being 0, I expected the node to be
> fenced instantly on timeout, recovery to begin and complete, and my file
> system for the rest of the nodes to be usable in a relatively short time.
> I guess if the answer to 1) is that this recovery is done manually with
> the fence_ack_manual, then it all makes sense.
> 
> thanks,
> dan
> 
> 

-- 



From Alexander.Laamanen at tecnomen.com  Fri May 13 09:48:08 2005
From: Alexander.Laamanen at tecnomen.com (Alexander.Laamanen at tecnomen.com)
Date: Fri, 13 May 2005 12:48:08 +0300
Subject: [Linux-cluster] GFS write performance with small files
Message-ID: <EA901BEDB473E8488BF6A9B33EC19A8704156EF6@doormant.tecnomen.fi>

Hi,
 
We did some GFS benchmarking using an application that writes many small
files. There seems to be a big overhead (in the amount of actual data
written to the device) with these small files. 
Info:
- Mount options: noatime,lockproto=lock_nolock
- GFS version: 20050506 of RHEL4 branch
- Number of files written: 10000
 
File size | block write speed | fs write speed
----------------------------------------------
10k       | 13-14 MB/s        | 2.2 MB/s
50k       | 18-19 MB/s        | 8.3 MB/s
100k      | 21-22 MB/s        | 13.5 MB/s

- "fs write speed" is the write speed as seen by the application
- "block write speed" is the write speed to the underlying block device, as
reposted by iostat (and verified with the raid controller's monitoring
utility).

So from this it seems that GFS is writing ~50k "extra data" for each file
through the block device. Note: this "extra data" is not taken away from the
free disk space as reposted by df.

Any ideas why this is happening? Any ideas of parameters that could be
tuned?

( ext3 on the same block device does not have this issue.)

Thanks,

- Alexander



From pcaulfie at redhat.com  Fri May 13 13:25:10 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Fri, 13 May 2005 14:25:10 +0100
Subject: [Linux-cluster] Clusters on Xen
Message-ID: <4284AAB6.9060701@redhat.com>

I've put together some documentation on how to get a Xen cluster up and running.

at http://www.cix.co.uk/~tykepenguin/xencluster.html

It's currently a little vague but it does contain some helpful (I hope) scripts and
config files.
-- 

patrick



From bruce.walker at hp.com  Fri May 13 15:58:58 2005
From: bruce.walker at hp.com (Walker, Bruce J)
Date: Fri, 13 May 2005 08:58:58 -0700
Subject: [Linux-cluster] RE: [Clusters_sig] Planning a Cluster meeting at OLS
Message-ID: <3689AF909D816446BA505D21F1461AE403AB87A2@cacexc04.americas.cpqcorp.net>

Yes.  There are 3 upcomming meetings.  I can't speak for the Walldorf.
The meeting at OLS will be Monday and Tuesday before OLS (thus July 18
and 19).  It will be a working meeting to see if we an put together a
common infrastructure or interfaces or architecture proposal for a
variety of different cluster projects.  Hopefully a few people from the
Walldorf summit will attend to share the progress made at that summit.

I can't speak for the HA Bof but I would expect them to be more of a
summary on progress made to date on various clustering projects as well
as progress made at the two preceeding meetings/summit.

I will be hosting the pre-OLS meeting (hopefully with the OSDL folks on
clusters_sig).  I'm expecting perhaps 10-20 people.

Bruce walker
 

-----Original Message-----
From: clusters_sig-bounces at lists.osdl.org
[mailto:clusters_sig-bounces at lists.osdl.org] On Behalf Of Ashutosh
Rajekar
Sent: Friday, May 13, 2005 8:14 AM
To: clusters_sig at lists.osdl.org; linux-cluster at redhat.com
Subject: Re: [Clusters_sig] Planning a Cluster meeting at OLS

Hi everyone,

I am a bit confused. From what I can gather through the past
discussions, there are apparently three "meetings":

1) The Walldorf Cluster Summit
2) OLS BOFs
3) some other "summit" to be finalised for either before or after OLS
but to be held in Ottawa.

For me attending more than one is out of question.

Hence I want to make sure that the one I end up attending is the main
one, leaving the others to be summarisations, etc.

So I'd like to know what is the exact agenda for each summit (I still
don't know the agenda for the Walldorf summit).

Of course I don't wanna pressurise the organizers, but from Andrew
Hutton's emails it seems that OLS registration is running real tight
with a few seats and hotel rooms left in Ottawa; and if I end up going
to Walldorf then I need to buy tickets in advance since the prices are
gonna go through the roof pretty soon.

Thanks and Regards,
Ash




From Scott.Money at lycos-inc.com  Fri May 13 16:54:25 2005
From: Scott.Money at lycos-inc.com (Scott.Money at lycos-inc.com)
Date: Fri, 13 May 2005 12:54:25 -0400
Subject: [Linux-cluster] GFS Issues   
Message-ID: <OF6311F4D4.A09B26D2-ON85257000.005CDB00-85257000.005D33A2@ma.lycos.com>

[Had trouble with first posting ,sorry if this is duplicate]

Hello,
I am new to the list but I was hoping to run a few questions by you. 
Hopefully this is the correct forum for these questions.

We are in the process of evaluating GFS for an Oracle 9i RAC 
implementation.

I originally downloaded and installed this GFS-6.0.0-1.2.src.rpm using 
this http://www.gyrate.org/misc/gfs.txt , the Admin Guide (from RedHat) 
and a Sistina RAC install guide(from RedHat).

That worked to a degree in that I could mount one node on the system, but 
all other nodes are blocked from mounting the gfs. This problem led me to 
the 8 char limitation 
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=127828 . My node 
names are unique in the first 8 chars (e.g. bwing.domain.com and 
xwing.domain.com) . 

First Attempt Environment
 GFS-6.0.0-1.2.src.rpm
 RedHat AS3
 Kernel Version 2.4.21-15.ELsmp
 Using GNBD to export an external scsi array
 Used the example pool configs from the RAC install guide
 manual fencing (status for all nodes is "Logged In")
 2 nodes, 1 GNBD and dedicated lock_gulm server
 
Not sure what to do I looked for a newer version of GFS.

Now I am going through the steps I did before except now I am getting a 
build error on the rpm-build.


rpmbuild --rebuild --target i686 
/usr/src/redhat/SRPMS/GFS-6.0.2-26.src.rpm
--------------------
/usr/src/redhat/BUILD/GFS-6.0.2/linuxsmp/include/linux/brlock.h:88: 
warning: `always_inline' attribute directive ignored 
 /usr/src/redhat/BUILD/GFS-6.0.2/linuxsmp/include/linux/brlock.h:101: 
warning: `always_inline' attribute directive ignored 
 /usr/src/redhat/BUILD/GFS-6.0.2/linuxsmp/include/linux/brlock.h:171: 
warning: `always_inline' attribute directive ignored 
 /usr/src/redhat/BUILD/GFS-6.0.2/linuxsmp/include/linux/brlock.h:179: 
warning: `always_inline' attribute directive ignored 
 make[4]: Leaving directory 
`/usr/src/redhat/BUILD/GFS-6.0.2/linuxsmp/drivers/char/joystick' 
 ld -m elf_i386 -r -o sis.o sis_drv.o sis_ds.o sis_mm.o 
 make[4]: Leaving directory 
`/usr/src/redhat/BUILD/GFS-6.0.2/linuxsmp/drivers/char/drm' 
 make[3]: Leaving directory 
`/usr/src/redhat/BUILD/GFS-6.0.2/linuxsmp/drivers/char'
make[2]: Leaving directory 
`/usr/src/redhat/BUILD/GFS-6.0.2/linuxsmp/drivers' 
 make[1]: *** [_mod_drivers] Error 2 
 make: *** [all] Error 2 
 error: Bad exit status from /var/tmp/rpm-tmp.15039 (%build) 
 
 
 RPM build errors: 
     Bad exit status from /var/tmp/rpm-tmp.15039 (%build) 
---------- 
The warnings I think I can ignore, but I am not sure what to do about the 
build errors.

New Version environment
 GFS-6.0.2-26.src.rpm
 RedHat AS3
 Kernel version 2.4.21-27.0.4.ELsmp


Got my GFS src rpm from 
ftp://ftp.redhat.com/pub/redhat/linux/updates/enterprise/3AS/en/RHGFS/SRPMS/

Any help would be appreciated on either issue and please let me know if 
there is more information that is required.

$cott
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050513/28a8103b/attachment.htm>

From phung at cs.columbia.edu  Sun May 15 15:06:23 2005
From: phung at cs.columbia.edu (Dan B. Phung)
Date: Sun, 15 May 2005 11:06:23 -0400 (EDT)
Subject: [Linux-cluster] cman_tool join causes other nodes to kernel panic
Message-ID: <Pine.LNX.4.44.0505151052030.6562-100000@algiers.clic.cs.columbia.edu>

I was adding another node to my cluster, so I updated the configurations
and did cman_tool join -w, which caused all the other nodes to kernel
panic, which prompted reboot of the cluster.  I pasted the syslog of the
blade I just added and the kernel panic message from the other blades
below.  I've done this same procedure several times before, so I don't
know why this time it caused this assertion.

on the other machines, I see this:

SM:  Assertion failed on line 52 of file 
/usr/src/cluster-2.6.9/cman-kernel/src/sm_misc.c
SM:  assertion:  "!error"
SM:  time = 272181619

Kernel panic - not syncing: SM:  Record message above and reboot.

on the just added blade I see this:

May 15 10:44:58 localhost kernel: device-mapper: 4.1.0-ioctl (2003-12-10) 
initialised: dm at uk.sistina.com
May 15 10:45:02 localhost kernel: Lock_Harness <CVS> (built May 15 2005 
10:28:33) installed
May 15 10:45:02 localhost kernel: GFS <CVS> (built May 15 2005 10:28:54) 
installed
May 15 10:45:29 localhost kernel: CMAN <CVS> (built May 15 2005 10:28:17) 
installed
May 15 10:45:29 localhost kernel: NET: Registered protocol family 30
May 15 10:45:29 localhost kernel: dlm: no version for 
"kcl_register_service" found: kernel tainted.
May 15 10:45:29 localhost kernel: DLM <CVS> (built May 15 2005 10:28:29) 
installed
May 15 10:45:29 localhost kernel: Lock_DLM (built May 15 2005 10:28:36) 
installed
May 15 10:55:06 localhost ccsd[3815]: Starting ccsd DEVEL.1115264594:
May 15 10:55:06 localhost ccsd[3815]:  Built: May  4 2005 23:48:37
May 15 10:55:06 localhost ccsd[3815]:  Copyright (C) Red Hat, Inc.  2004  
All rights reserved.
May 15 10:55:07 localhost ccsd[3815]: cluster.conf (cluster name = 
blade_cluster, version = 3) found.
May 15 10:55:07 localhost ccsd[3815]: Remote copy of cluster.conf is from 
quorate node.
May 15 10:55:07 localhost ccsd[3815]:  Local version # : 3
May 15 10:55:07 localhost ccsd[3815]:  Remote version #: 3
May 15 10:55:07 localhost kernel: CMAN: Waiting to join or form a 
Linux-cluster
May 15 10:55:08 localhost ccsd[3815]: Connected to cluster infrastruture 
via: CMAN/SM Plugin v1.1.2
May 15 10:55:08 localhost ccsd[3815]: Initial status:: Inquorate
May 15 10:55:39 localhost kernel: CMAN: forming a new cluster
May 15 10:55:39 localhost kernel: CMAN: quorum regained, resuming activity
May 15 10:55:39 localhost ccsd[3815]: Cluster is quorate.  Allowing 
connections.
May 15 10:55:45 localhost fenced[3853]: blade01 not a cluster member after 
6 sec post_join_delay
May 15 10:55:45 localhost fenced[3853]: blade02 not a cluster member after 
6 sec post_join_delay
May 15 10:55:45 localhost fenced[3853]: blade04 not a cluster member after 
6 sec post_join_delay
May 15 10:55:45 localhost fenced[3853]: blade09 not a cluster member after 
6 sec post_join_delay
May 15 10:55:45 localhost fenced[3853]: blade10 not a cluster member after 
6 sec post_join_delay
May 15 10:55:45 localhost fenced[3853]: blade11 not a cluster member after 
6 sec post_join_delay
May 15 10:55:45 localhost fenced[3853]: blade12 not a cluster member after 
6 sec post_join_delay
May 15 10:55:45 localhost fenced[3853]: fencing node "blade01"

- -

let me know if you need any other info would be helpful.

regards,
dan

-- 



From phung at cs.columbia.edu  Mon May 16 06:19:23 2005
From: phung at cs.columbia.edu (Dan B. Phung)
Date: Mon, 16 May 2005 02:19:23 -0400 (EDT)
Subject: [Linux-cluster] auto send fence_ack_manual?
Message-ID: <Pine.LNX.4.44.0505160211170.6562-100000@algiers.clic.cs.columbia.edu>

In the usage of our cluster, we have machines getting rebooted often,
which causes gfs to hang in waiting for the fence_ack_manual.  Usually I'm
sure the rebooting node(s) fenced, so I find the manual acking redundant.
I could also use the fence_sanbox2 fenced, but I don't like the idea of
having to enable the switch ports before the node can reenter the cluster.

I was thinking of "hacking" the daemon on my stable master node so that it
sends the fence ack as soon as it's requested.  Does something like my
usage scenario already exist?  Does this violate some principal design in
the fenced?

-dan

-- 



From birger at uib.no  Mon May 16 06:39:55 2005
From: birger at uib.no (Birger Wathne)
Date: Mon, 16 May 2005 08:39:55 +0200
Subject: [Linux-cluster] auto send fence_ack_manual?
In-Reply-To: <Pine.LNX.4.44.0505160211170.6562-100000@algiers.clic.cs.columbia.edu>
References: <Pine.LNX.4.44.0505160211170.6562-100000@algiers.clic.cs.columbia.edu>
Message-ID: <4288403B.4010208@uib.no>

Dan B. Phung wrote:

>I was thinking of "hacking" the daemon on my stable master node so that it
>sends the fence ack as soon as it's requested.  Does something like my
>usage scenario already exist?  Does this violate some principal design in
>the fenced?
>

Or perhaps make the rebooting node do it by itself as the last part of a 
controlled shutdown? I guess a node can't issue the command itself after 
leaving the cluster, so you may have to hack some code to let a node 
notify your master node as the last part of shutdown, and then let the 
master node ack. If you can have the rebooting nodes forward syslog 
messages to your stable master it should be relatively easy to detect a 
controlled and successful shutdown using some existing log watcher tool.

-- 
birger





From yazan at ccs.com.jo  Mon May 16 07:43:09 2005
From: yazan at ccs.com.jo (Yazan Al-Sheyyab)
Date: Mon, 16 May 2005 09:43:09 +0200
Subject: [Linux-cluster] cluster with LVM ...question
Message-ID: <002f01c559ea$e790e820$69050364@yazanz>


Hello,

    I need to rebuild my cluster from the beggining again, but i have a question : 

  i have two nodes and a shared storage
and i will install RedHat ES V3 on each node and oracle software on each node
and only the data base will be on the shared.

  My question is that i want to use the GFS file system on the shared partitions , So can i make the hole storage space a LVm and then use the gfs on the logicals partitions ?

 I mean can i make the all storage as LVM then part it into logical partitions and then use the gfs on these logical ??

  have you any comments ?
  any documents ?

    all raw devices will be on the shared and all of the logical partitions on the shared will be formated as GFS File System ..... can i do this ??


 Thanks alot for any help.



Regards
-------------------------------------------------

Yazan 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050516/05f2b018/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tech.gif
Type: image/gif
Size: 862 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050516/05f2b018/attachment.gif>

From phung at cs.columbia.edu  Mon May 16 06:45:12 2005
From: phung at cs.columbia.edu (Dan B. Phung)
Date: Mon, 16 May 2005 02:45:12 -0400 (EDT)
Subject: [Linux-cluster] auto send fence_ack_manual?
In-Reply-To: <4288403B.4010208@uib.no>
Message-ID: <Pine.LNX.4.44.0505160242550.6562-100000@algiers.clic.cs.columbia.edu>

what i've done for now is this (until someone tells me how crazy this is :)

fence/agents/manual.c::L305
        for (;;) {
                rv = check_ack();
+                rv = 1;
+                printf("HACK! detected that a node was fenced, automatically"
                         "starting cleanup sequence for %s\n", 
                          victim);

                if (rv)
                        break;

                rv = check_cluster();
                if (rv)


On 16, May, 2005, Birger Wathne declared:

> Dan B. Phung wrote:
> 
> >I was thinking of "hacking" the daemon on my stable master node so that it
> >sends the fence ack as soon as it's requested.  Does something like my
> >usage scenario already exist?  Does this violate some principal design in
> >the fenced?
> >
> 
> Or perhaps make the rebooting node do it by itself as the last part of a 
> controlled shutdown? I guess a node can't issue the command itself after 
> leaving the cluster, so you may have to hack some code to let a node 
> notify your master node as the last part of shutdown, and then let the 
> master node ack. If you can have the rebooting nodes forward syslog 
> messages to your stable master it should be relatively easy to detect a 
> controlled and successful shutdown using some existing log watcher tool.
> 
> 

-- 



From phung at cs.columbia.edu  Mon May 16 06:49:19 2005
From: phung at cs.columbia.edu (Dan B. Phung)
Date: Mon, 16 May 2005 02:49:19 -0400 (EDT)
Subject: [Linux-cluster] cluster with LVM ...question
In-Reply-To: <002f01c559ea$e790e820$69050364@yazanz>
Message-ID: <Pine.LNX.4.44.0505160246280.6562-100000@algiers.clic.cs.columbia.edu>

this seems like a pretty typical usage of GFS.

if you haven't yet, you should start with:
http://sources.redhat.com/cluster/doc/usage.txt 

which also has an example of a 2-node cluster configuration.

-dan

On 16, May, 2005, Yazan Al-Sheyyab declared:

> 
> Hello,
> 
>     I need to rebuild my cluster from the beggining again, but i have a question : 
> 
>   i have two nodes and a shared storage
> and i will install RedHat ES V3 on each node and oracle software on each node
> and only the data base will be on the shared.
> 
>   My question is that i want to use the GFS file system on the shared partitions , So can i make the hole storage space a LVm and then use the gfs on the logicals partitions ?
> 
>  I mean can i make the all storage as LVM then part it into logical partitions and then use the gfs on these logical ??
> 
>   have you any comments ?
>   any documents ?
> 
>     all raw devices will be on the shared and all of the logical partitions on the shared will be formated as GFS File System ..... can i do this ??
> 
> 
>  Thanks alot for any help.
> 
> 
> 
> Regards
> -------------------------------------------------
> 
> Yazan

-- 



From teigland at redhat.com  Mon May 16 06:58:59 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 16 May 2005 14:58:59 +0800
Subject: [Linux-cluster] cman_tool join causes other nodes to kernel panic
In-Reply-To: <Pine.LNX.4.44.0505151052030.6562-100000@algiers.clic.cs.columbia.edu>
References: <Pine.LNX.4.44.0505151052030.6562-100000@algiers.clic.cs.columbia.edu>
Message-ID: <20050516065859.GA7094@redhat.com>

On Sun, May 15, 2005 at 11:06:23AM -0400, Dan B. Phung wrote:
> I was adding another node to my cluster, so I updated the configurations
> and did cman_tool join -w, which caused all the other nodes to kernel
> panic, which prompted reboot of the cluster.  I pasted the syslog of the
> blade I just added and the kernel panic message from the other blades
> below.  I've done this same procedure several times before, so I don't
> know why this time it caused this assertion.
> 
> on the other machines, I see this:
> 
> SM:  Assertion failed on line 52 of file 
> /usr/src/cluster-2.6.9/cman-kernel/src/sm_misc.c
> SM:  assertion:  "!error"
> SM:  time = 272181619

This means there's some sort of internal consistency error within cman.
If you could explain in more detail the steps you took prior to this I'll
try to reproduce it.  It sounds like you may have been updating
cluster.conf while the cluster was running.  If so, what exactly did you
change?

Dave



From teigland at redhat.com  Mon May 16 07:05:56 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 16 May 2005 15:05:56 +0800
Subject: [Linux-cluster] auto send fence_ack_manual?
In-Reply-To: <Pine.LNX.4.44.0505160211170.6562-100000@algiers.clic.cs.columbia.edu>
References: <Pine.LNX.4.44.0505160211170.6562-100000@algiers.clic.cs.columbia.edu>
Message-ID: <20050516070556.GB7094@redhat.com>

On Mon, May 16, 2005 at 02:19:23AM -0400, Dan B. Phung wrote:
> In the usage of our cluster, we have machines getting rebooted often,
> which causes gfs to hang in waiting for the fence_ack_manual.  Usually I'm
> sure the rebooting node(s) fenced, so I find the manual acking redundant.
> I could also use the fence_sanbox2 fenced, but I don't like the idea of
> having to enable the switch ports before the node can reenter the cluster.

If you're doing controlled reboots, you should have the node
shutdown/leave the cluster cleanly.  Then it won't be fenced.

If that's not possible, you can have a node automatically re-enable its
own switch port when it starts up by calling 'fence_sanbox2 -o enable'
directly.


> I was thinking of "hacking" the daemon on my stable master node so that it
> sends the fence ack as soon as it's requested.  Does something like my
> usage scenario already exist?  

no

> Does this violate some principal design in the fenced?

Yes, it's equivalant to doing no fencing at all.

Dave



From phung at cs.columbia.edu  Mon May 16 07:02:49 2005
From: phung at cs.columbia.edu (Dan B. Phung)
Date: Mon, 16 May 2005 03:02:49 -0400 (EDT)
Subject: [Linux-cluster] cman_tool join causes other nodes to kernel panic
In-Reply-To: <20050516065859.GA7094@redhat.com>
Message-ID: <Pine.LNX.4.44.0505160258510.6562-100000@algiers.clic.cs.columbia.edu>

yes, I updated the cluster.conf by adding more nodes.  

e.g., I added a couple of these blocks (6 more nodes to be exact)

        <clusternode name="blade06" nodeid="1" votes="1">
          <multicast addr="224.0.0.18" interface="eth0"/>
             <fence>
               <method name="single">
                 <device name="human" ipaddr="129.58.15.6"/>
               </method>
             </fence>
          </clusternode>

I first updated the file (incrementing the version), and then ran:
 ccs_tool update cluster.conf
 cman_tool version -r 3

These commands completed without incident.  The failure occured when
running 'cman_tool join -w' on the new node.


On 16, May, 2005, David Teigland declared:

> On Sun, May 15, 2005 at 11:06:23AM -0400, Dan B. Phung wrote:
> > I was adding another node to my cluster, so I updated the configurations
> > and did cman_tool join -w, which caused all the other nodes to kernel
> > panic, which prompted reboot of the cluster.  I pasted the syslog of the
> > blade I just added and the kernel panic message from the other blades
> > below.  I've done this same procedure several times before, so I don't
> > know why this time it caused this assertion.
> > 
> > on the other machines, I see this:
> > 
> > SM:  Assertion failed on line 52 of file 
> > /usr/src/cluster-2.6.9/cman-kernel/src/sm_misc.c
> > SM:  assertion:  "!error"
> > SM:  time = 272181619
> 
> This means there's some sort of internal consistency error within cman.
> If you could explain in more detail the steps you took prior to this I'll
> try to reproduce it.  It sounds like you may have been updating
> cluster.conf while the cluster was running.  If so, what exactly did you
> change?
> 
> Dave
> 

-- 



From phung at cs.columbia.edu  Mon May 16 07:07:35 2005
From: phung at cs.columbia.edu (Dan B. Phung)
Date: Mon, 16 May 2005 03:07:35 -0400 (EDT)
Subject: [Linux-cluster] auto send fence_ack_manual?
In-Reply-To: <20050516070556.GB7094@redhat.com>
Message-ID: <Pine.LNX.4.44.0505160303550.6562-100000@algiers.clic.cs.columbia.edu>

On 16, May, 2005, David Teigland declared:

> On Mon, May 16, 2005 at 02:19:23AM -0400, Dan B. Phung wrote:
> > In the usage of our cluster, we have machines getting rebooted often,
> > which causes gfs to hang in waiting for the fence_ack_manual.  Usually I'm
> > sure the rebooting node(s) fenced, so I find the manual acking redundant.
> > I could also use the fence_sanbox2 fenced, but I don't like the idea of
> > having to enable the switch ports before the node can reenter the cluster.
> 
> If you're doing controlled reboots, you should have the node
> shutdown/leave the cluster cleanly.  Then it won't be fenced.

we're doing some kernel hacking and the module we're loading currently
doesn't unload (i know, it's stupid...).  The module uses the mounted file
systems, thus cman_tool can't unregister because there are still some
active subsystems.  this is the reason for the node being fenced.
 
> If that's not possible, you can have a node automatically re-enable its
> own switch port when it starts up by calling 'fence_sanbox2 -o enable'
> directly.

ah, cool.  but wouldn't the cluster be "hung" while it waits for the node
to come back up?
 
> 
> > I was thinking of "hacking" the daemon on my stable master node so that it
> > sends the fence ack as soon as it's requested.  Does something like my
> > usage scenario already exist?  
> 
> no
> 
> > Does this violate some principal design in the fenced?
> 
> Yes, it's equivalant to doing no fencing at all.

:) again, i'm misusing your wonderful software!

-dan

-- 



From teigland at redhat.com  Mon May 16 07:18:51 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 16 May 2005 15:18:51 +0800
Subject: [Linux-cluster] auto send fence_ack_manual?
In-Reply-To: <Pine.LNX.4.44.0505160303550.6562-100000@algiers.clic.cs.columbia.edu>
References: <20050516070556.GB7094@redhat.com>
	<Pine.LNX.4.44.0505160303550.6562-100000@algiers.clic.cs.columbia.edu>
Message-ID: <20050516071851.GD7094@redhat.com>

On Mon, May 16, 2005 at 03:07:35AM -0400, Dan B. Phung wrote:
> On 16, May, 2005, David Teigland declared:
> > If that's not possible, you can have a node automatically re-enable its
> > own switch port when it starts up by calling 'fence_sanbox2 -o enable'
> > directly.
> 
> ah, cool.  but wouldn't the cluster be "hung" while it waits for the node
> to come back up?

No, when fencing disables the node's switch port, then everything will go
ahead right after that.

Dave



From teigland at redhat.com  Mon May 16 07:19:49 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 16 May 2005 15:19:49 +0800
Subject: [Linux-cluster] [PATCH 0/8] dlm: overview
Message-ID: <20050516071949.GE7094@redhat.com>

These are the distributed lock manager (dlm) patches against 2.6.12-rc4
that we'd like to see added to the kernel.  We've made changes based on
the suggestions that were sent, and are happy to get more, of course.

The original overview
http://marc.theaimsgroup.com/?l=linux-kernel&m=111444188703106&w=2

For those interested in performance, there is some information here
http://sources.redhat.com/cluster/dlm/

Dave



From pcaulfie at redhat.com  Mon May 16 07:16:16 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Mon, 16 May 2005 08:16:16 +0100
Subject: [Linux-cluster] cman_tool join causes other nodes to kernel panic
In-Reply-To: <Pine.LNX.4.44.0505160258510.6562-100000@algiers.clic.cs.columbia.edu>
References: <Pine.LNX.4.44.0505160258510.6562-100000@algiers.clic.cs.columbia.edu>
Message-ID: <428848C0.9060601@redhat.com>

Dan B. Phung wrote:
> yes, I updated the cluster.conf by adding more nodes.  
> 
> e.g., I added a couple of these blocks (6 more nodes to be exact)
> 
>         <clusternode name="blade06" nodeid="1" votes="1">
                                      ^^^^^^^^^^

Errr. I hope you changed the nodeid for each node ?


-- 

patrick



From teigland at redhat.com  Mon May 16 07:20:54 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 16 May 2005 15:20:54 +0800
Subject: [Linux-cluster] [PATCH 5/8] dlm: configuration
Message-ID: <20050516072054.GJ7094@redhat.com>

Per-lockspace configuration happens through files under:
/sys/kernel/dlm/<lockspace_name>/.  This includes telling each lockspace
which nodes are using the lockspace and suspending locking in a lockspace.

Lockspace-independent configuration involves telling the dlm communication
layer the IP address of each node ID that's being used.  These addresses
are set using an ioctl on a misc device.

Signed-off-by: Dave Teigland <teigland at redhat.com>
Signed-off-by: Patrick Caulfield <pcaulfie at redhat.com>

---

 drivers/dlm/config.c       |   47 ++++++
 drivers/dlm/config.h       |   34 ++++
 drivers/dlm/member_sysfs.c |  322 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/dlm/member_sysfs.h |   21 ++
 drivers/dlm/node_ioctl.c   |  126 +++++++++++++++++
 include/linux/dlm_node.h   |   44 ++++++
 6 files changed, 594 insertions(+)
--- a/include/linux/dlm_node.h	1970-01-01 07:30:00.000000000 +0730
+++ b/include/linux/dlm_node.h	2005-05-12 23:13:15.828485664 +0800
@@ -0,0 +1,44 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) 2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#ifndef __DLM_NODE_DOT_H__
+#define __DLM_NODE_DOT_H__
+
+#define DLM_ADDR_LEN			256
+#define DLM_MAX_ADDR_COUNT		3
+#define DLM_NODE_MISC_NAME		"dlm-node"
+
+#define DLM_NODE_VERSION_MAJOR		1
+#define DLM_NODE_VERSION_MINOR		0
+#define DLM_NODE_VERSION_PATCH		0
+
+struct dlm_node_ioctl {
+	__u32	version[3];
+	int	nodeid;
+	int	weight;
+	char	addr[DLM_ADDR_LEN];
+};
+
+enum {
+	DLM_NODE_VERSION_CMD = 0,
+	DLM_SET_NODE_CMD,
+	DLM_SET_LOCAL_CMD,
+};
+
+#define DLM_IOCTL			0xd1
+
+#define DLM_NODE_VERSION _IOWR(DLM_IOCTL, DLM_NODE_VERSION_CMD, struct dlm_node_ioctl)
+#define DLM_SET_NODE     _IOWR(DLM_IOCTL, DLM_SET_NODE_CMD, struct dlm_node_ioctl)
+#define DLM_SET_LOCAL    _IOWR(DLM_IOCTL, DLM_SET_LOCAL_CMD, struct dlm_node_ioctl)
+
+#endif
+
--- a/drivers/dlm/config.c	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/config.c	2005-05-12 23:13:15.827485816 +0800
@@ -0,0 +1,47 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) Sistina Software, Inc.  1997-2003  All rights reserved.
+**  Copyright (C) 2004-2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#include "dlm_internal.h"
+#include "config.h"
+
+/* Config file defaults */
+#define DEFAULT_TCP_PORT       21064
+#define DEFAULT_BUFFER_SIZE     4096
+#define DEFAULT_RSBTBL_SIZE      256
+#define DEFAULT_LKBTBL_SIZE     1024
+#define DEFAULT_DIRTBL_SIZE      512
+#define DEFAULT_RECOVER_TIMER      5
+#define DEFAULT_TOSS_SECS         10
+#define DEFAULT_SCAN_SECS          5
+
+struct dlm_config_info dlm_config = {
+	.tcp_port = DEFAULT_TCP_PORT,
+	.buffer_size = DEFAULT_BUFFER_SIZE,
+	.rsbtbl_size = DEFAULT_RSBTBL_SIZE,
+	.lkbtbl_size = DEFAULT_LKBTBL_SIZE,
+	.dirtbl_size = DEFAULT_DIRTBL_SIZE,
+	.recover_timer = DEFAULT_RECOVER_TIMER,
+	.toss_secs = DEFAULT_TOSS_SECS,
+	.scan_secs = DEFAULT_SCAN_SECS
+};
+
+int dlm_config_init(void)
+{
+	/* FIXME: hook the config values into sysfs */
+	return 0;
+}
+
+void dlm_config_exit(void)
+{
+}
+
--- a/drivers/dlm/config.h	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/config.h	2005-05-12 23:13:15.827485816 +0800
@@ -0,0 +1,34 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) Sistina Software, Inc.  1997-2003  All rights reserved.
+**  Copyright (C) 2004-2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#ifndef __CONFIG_DOT_H__
+#define __CONFIG_DOT_H__
+
+struct dlm_config_info {
+	int tcp_port;
+	int buffer_size;
+	int rsbtbl_size;
+	int lkbtbl_size;
+	int dirtbl_size;
+	int recover_timer;
+	int toss_secs;
+	int scan_secs;
+};
+
+extern struct dlm_config_info dlm_config;
+
+extern int dlm_config_init(void);
+extern void dlm_config_exit(void);
+
+#endif				/* __CONFIG_DOT_H__ */
+
--- a/drivers/dlm/member_sysfs.c	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/member_sysfs.c	2005-05-12 23:13:15.829485512 +0800
@@ -0,0 +1,322 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) 2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#include "dlm_internal.h"
+#include "member.h"
+
+/*
+/dlm/lsname/stop       RW  write 1 to suspend operation
+/dlm/lsname/start      RW  write event_nr to start recovery
+/dlm/lsname/finish     RW  write event_nr to finish recovery
+/dlm/lsname/terminate  RW  write event_nr to term recovery
+/dlm/lsname/done       RO  event_nr dlm is done processing
+/dlm/lsname/id         RW  global id of lockspace
+/dlm/lsname/members    RW  read = current members, write = next members
+*/
+
+
+static ssize_t dlm_stop_show(struct dlm_ls *ls, char *buf)
+{
+	ssize_t ret;
+	int val;
+
+	spin_lock(&ls->ls_recover_lock);
+	val = ls->ls_last_stop;
+	spin_unlock(&ls->ls_recover_lock);
+	ret = sprintf(buf, "%d\n", val);
+	return ret;
+}
+
+static ssize_t dlm_stop_store(struct dlm_ls *ls, const char *buf, size_t len)
+{
+	ssize_t ret = -EINVAL;
+
+	if (simple_strtol(buf, NULL, 0) == 1) {
+		dlm_ls_stop(ls);
+		ret = len;
+	}
+	return ret;
+}
+
+static ssize_t dlm_start_show(struct dlm_ls *ls, char *buf)
+{
+	ssize_t ret;
+	int val;
+
+	spin_lock(&ls->ls_recover_lock);
+	val = ls->ls_last_start;
+	spin_unlock(&ls->ls_recover_lock);
+	ret = sprintf(buf, "%d\n", val);
+	return ret;
+}
+
+static ssize_t dlm_start_store(struct dlm_ls *ls, const char *buf, size_t len)
+{
+	ssize_t ret;
+	ret = dlm_ls_start(ls, simple_strtol(buf, NULL, 0));
+	return ret ? ret : len;
+}
+
+static ssize_t dlm_finish_show(struct dlm_ls *ls, char *buf)
+{
+	ssize_t ret;
+	int val;
+
+	spin_lock(&ls->ls_recover_lock);
+	val = ls->ls_last_finish;
+	spin_unlock(&ls->ls_recover_lock);
+	ret = sprintf(buf, "%d\n", val);
+	return ret;
+}
+
+static ssize_t dlm_finish_store(struct dlm_ls *ls, const char *buf, size_t len)
+{
+	dlm_ls_finish(ls, simple_strtol(buf, NULL, 0));
+	return len;
+}
+
+static ssize_t dlm_terminate_show(struct dlm_ls *ls, char *buf)
+{
+	ssize_t ret;
+	int val = 0;
+
+	spin_lock(&ls->ls_recover_lock);
+	if (test_bit(LSFL_LS_TERMINATE, &ls->ls_flags))
+		val = 1;
+	spin_unlock(&ls->ls_recover_lock);
+	ret = sprintf(buf, "%d\n", val);
+	return ret;
+}
+
+static ssize_t dlm_terminate_store(struct dlm_ls *ls, const char *buf, size_t len)
+{
+	ssize_t ret = -EINVAL;
+
+	if (simple_strtol(buf, NULL, 0) == 1) {
+		dlm_ls_terminate(ls);
+		ret = len;
+	}
+	return ret;
+}
+
+static ssize_t dlm_done_show(struct dlm_ls *ls, char *buf)
+{
+	ssize_t ret;
+	int val;
+
+	spin_lock(&ls->ls_recover_lock);
+	val = ls->ls_startdone;
+	spin_unlock(&ls->ls_recover_lock);
+	ret = sprintf(buf, "%d\n", val);
+	return ret;
+}
+
+static ssize_t dlm_id_show(struct dlm_ls *ls, char *buf)
+{
+	return sprintf(buf, "%u\n", ls->ls_global_id);
+}
+
+static ssize_t dlm_id_store(struct dlm_ls *ls, const char *buf, size_t len)
+{
+	ls->ls_global_id = simple_strtol(buf, NULL, 0);
+	return len;
+}
+
+static ssize_t dlm_members_show(struct dlm_ls *ls, char *buf)
+{
+	struct dlm_member *memb;
+	ssize_t ret = 0;
+
+	if (!down_read_trylock(&ls->ls_in_recovery))
+		return -EBUSY;
+	list_for_each_entry(memb, &ls->ls_nodes, list)
+		ret += sprintf(buf+ret, "%u ", memb->nodeid);
+	ret += sprintf(buf+ret, "\n");
+	up_read(&ls->ls_in_recovery);
+	return ret;
+}
+
+static ssize_t dlm_members_store(struct dlm_ls *ls, const char *buf, size_t len)
+{
+	int *nodeids, id, count = 1, i;
+	ssize_t ret = len;
+	char *p, *t;
+
+	/* count number of id's in buf, assumes no trailing spaces */
+	for (i = 0; i < len; i++)
+		if (isspace(buf[i]))
+			count++;
+
+	nodeids = kmalloc(sizeof(int) * count, GFP_KERNEL);
+	if (!nodeids)
+		return -ENOMEM;
+
+	p = kmalloc(len+1, GFP_KERNEL);
+	if (!p) {
+		kfree(nodeids);
+		return -ENOMEM;
+	}
+	memcpy(p, buf, len);
+	p[len+1] = '\0';
+
+	for (i = 0; i < count; i++) {
+		if ((t = strsep(&p, " ")) == NULL)
+			break;
+		if (sscanf(t, "%u", &id) != 1)
+			break;
+		nodeids[i] = id;
+	}
+
+	if (i != count) {
+		kfree(nodeids);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	spin_lock(&ls->ls_recover_lock);
+	if (ls->ls_nodeids_next) {
+		kfree(nodeids);
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+	ls->ls_nodeids_next = nodeids;
+	ls->ls_nodeids_next_count = count;
+
+ out_unlock:
+	spin_unlock(&ls->ls_recover_lock);
+ out:
+	kfree(p);
+	return ret;
+}
+
+struct dlm_attr {
+	struct attribute attr;
+	ssize_t (*show)(struct dlm_ls *, char *);
+	ssize_t (*store)(struct dlm_ls *, const char *, size_t);
+};
+
+static struct dlm_attr dlm_attr_stop = {
+	.attr  = {.name = "stop", .mode = S_IRUGO | S_IWUSR},
+	.show  = dlm_stop_show,
+	.store = dlm_stop_store
+};
+
+static struct dlm_attr dlm_attr_start = {
+	.attr  = {.name = "start", .mode = S_IRUGO | S_IWUSR},
+	.show  = dlm_start_show,
+	.store = dlm_start_store
+};
+
+static struct dlm_attr dlm_attr_finish = {
+	.attr  = {.name = "finish", .mode = S_IRUGO | S_IWUSR},
+	.show  = dlm_finish_show,
+	.store = dlm_finish_store
+};
+
+static struct dlm_attr dlm_attr_terminate = {
+	.attr  = {.name = "terminate", .mode = S_IRUGO | S_IWUSR},
+	.show  = dlm_terminate_show,
+	.store = dlm_terminate_store
+};
+
+static struct dlm_attr dlm_attr_done = {
+	.attr  = {.name = "done", .mode = S_IRUGO},
+	.show  = dlm_done_show,
+};
+
+static struct dlm_attr dlm_attr_id = {
+	.attr  = {.name = "id", .mode = S_IRUGO | S_IWUSR},
+	.show  = dlm_id_show,
+	.store = dlm_id_store
+};
+
+static struct dlm_attr dlm_attr_members = {
+	.attr  = {.name = "members", .mode = S_IRUGO | S_IWUSR},
+	.show  = dlm_members_show,
+	.store = dlm_members_store
+};
+
+static struct attribute *dlm_attrs[] = {
+	&dlm_attr_stop.attr,
+	&dlm_attr_start.attr,
+	&dlm_attr_finish.attr,
+	&dlm_attr_terminate.attr,
+	&dlm_attr_done.attr,
+	&dlm_attr_id.attr,
+	&dlm_attr_members.attr,
+	NULL,
+};
+
+static ssize_t dlm_attr_show(struct kobject *kobj, struct attribute *attr,
+			     char *buf)
+{
+	struct dlm_ls *ls  = container_of(kobj, struct dlm_ls, ls_kobj);
+	struct dlm_attr *a = container_of(attr, struct dlm_attr, attr);
+	return a->show ? a->show(ls, buf) : 0;
+}
+
+static ssize_t dlm_attr_store(struct kobject *kobj, struct attribute *attr,
+			      const char *buf, size_t len)
+{
+	struct dlm_ls *ls  = container_of(kobj, struct dlm_ls, ls_kobj);
+	struct dlm_attr *a = container_of(attr, struct dlm_attr, attr);
+	return a->store ? a->store(ls, buf, len) : len;
+}
+
+static struct sysfs_ops dlm_attr_ops = {
+	.show  = dlm_attr_show,
+	.store = dlm_attr_store,
+};
+
+static struct kobj_type dlm_ktype = {
+	.default_attrs = dlm_attrs,
+	.sysfs_ops     = &dlm_attr_ops,
+};
+
+static struct kset dlm_kset = {
+	.subsys = &kernel_subsys,
+	.kobj   = {.name = "dlm",},
+	.ktype  = &dlm_ktype,
+};
+
+int dlm_member_sysfs_init(void)
+{
+	int error;
+
+	error = kset_register(&dlm_kset);
+	if (error)
+		printk("dlm_lockspace_init: cannot register kset %d\n", error);
+	return error;
+}
+
+void dlm_member_sysfs_exit(void)
+{
+	kset_unregister(&dlm_kset);
+}
+
+int dlm_kobject_setup(struct dlm_ls *ls)
+{
+	char lsname[DLM_LOCKSPACE_LEN];
+	int error;
+
+	memset(lsname, 0, DLM_LOCKSPACE_LEN);
+	snprintf(lsname, DLM_LOCKSPACE_LEN, "%s", ls->ls_name);
+
+	error = kobject_set_name(&ls->ls_kobj, "%s", lsname);
+	if (error)
+		return error;
+
+	ls->ls_kobj.kset = &dlm_kset;
+	ls->ls_kobj.ktype = &dlm_ktype;
+	return 0;
+}
+
--- a/drivers/dlm/member_sysfs.h	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/member_sysfs.h	2005-05-12 23:13:15.830485360 +0800
@@ -0,0 +1,21 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) 2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#ifndef __MEMBER_SYSFS_DOT_H__
+#define __MEMBER_SYSFS_DOT_H__
+
+int dlm_member_sysfs_init(void);
+void dlm_member_sysfs_exit(void);
+int dlm_kobject_setup(struct dlm_ls *ls);
+
+#endif                          /* __MEMBER_SYSFS_DOT_H__ */
+
--- a/drivers/dlm/node_ioctl.c	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/node_ioctl.c	2005-05-12 23:13:15.840483840 +0800
@@ -0,0 +1,126 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) 2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#include <linux/miscdevice.h>
+#include <linux/fs.h>
+
+#include <linux/dlm_node.h>
+
+#include "dlm_internal.h"
+#include "lowcomms.h"
+
+
+static int check_version(unsigned int cmd,
+			 struct dlm_node_ioctl __user *u_param)
+{
+	u32 version[3];
+	int error = 0;
+
+	if (copy_from_user(version, u_param->version, sizeof(version)))
+		return -EFAULT;
+
+	if ((DLM_NODE_VERSION_MAJOR != version[0]) ||
+	    (DLM_NODE_VERSION_MINOR < version[1])) {
+		log_print("node_ioctl: interface mismatch: "
+		          "kernel(%u.%u.%u), user(%u.%u.%u), cmd(%d)",
+		          DLM_NODE_VERSION_MAJOR,
+		          DLM_NODE_VERSION_MINOR,
+		          DLM_NODE_VERSION_PATCH,
+		          version[0], version[1], version[2], cmd);
+		error = -EINVAL;
+	}
+
+	version[0] = DLM_NODE_VERSION_MAJOR;
+	version[1] = DLM_NODE_VERSION_MINOR;
+	version[2] = DLM_NODE_VERSION_PATCH;
+
+	if (copy_to_user(u_param->version, version, sizeof(version)))
+		return -EFAULT;
+	return error;
+}
+
+static int node_ioctl(struct inode *inode, struct file *file,
+	              uint command, ulong u)
+{
+	struct dlm_node_ioctl *k_param;
+	struct dlm_node_ioctl __user *u_param;
+	unsigned int cmd, type;
+	int error;
+
+	u_param = (struct dlm_node_ioctl __user *) u;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EACCES;
+
+	type = _IOC_TYPE(command);
+	cmd = _IOC_NR(command);
+
+	if (type != DLM_IOCTL) {
+		log_print("node_ioctl: bad ioctl 0x%x 0x%x 0x%x",
+		          command, type, cmd);
+		return -ENOTTY;
+	}
+
+	error = check_version(cmd, u_param);
+	if (error)
+		return error;
+
+	if (cmd == DLM_NODE_VERSION_CMD)
+		return 0;
+
+	k_param = kmalloc(sizeof(*k_param), GFP_KERNEL);
+	if (!k_param)
+		return -ENOMEM;
+
+	if (copy_from_user(k_param, u_param, sizeof(*k_param))) {
+		kfree(k_param);
+		return -EFAULT;
+	}
+
+	if (cmd == DLM_SET_NODE_CMD)
+		error = dlm_set_node(k_param->nodeid, k_param->weight,
+				     k_param->addr);
+	else if (cmd == DLM_SET_LOCAL_CMD)
+		error = dlm_set_local(k_param->nodeid, k_param->weight,
+				      k_param->addr);
+
+	kfree(k_param);
+	return error;
+}
+
+static struct file_operations node_fops = {
+	.ioctl	= node_ioctl,
+	.owner	= THIS_MODULE,
+};
+
+static struct miscdevice node_misc = {
+	.minor	= MISC_DYNAMIC_MINOR,
+	.name	= DLM_NODE_MISC_NAME,
+	.fops	= &node_fops
+};
+
+int dlm_node_ioctl_init(void)
+{
+	int error;
+
+	error = misc_register(&node_misc);
+	if (error)
+		log_print("node_ioctl: misc_register failed %d", error);
+	return error;
+}
+
+void dlm_node_ioctl_exit(void)
+{
+	if (misc_deregister(&node_misc) < 0)
+		log_print("node_ioctl: misc_deregister failed");
+}
+



From teigland at redhat.com  Mon May 16 07:21:03 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 16 May 2005 15:21:03 +0800
Subject: [Linux-cluster] [PATCH 6/8] dlm: device interface
Message-ID: <20050516072103.GK7094@redhat.com>

This is a separate module from the dlm.  It exports the dlm api to user
space through a misc device.  Applications use a library (libdlm) which
communicates with the kernel through this device.

Signed-off-by: Dave Teigland <teigland at redhat.com>
Signed-off-by: Patrick Caulfield <pcaulfie at redhat.com>

---

 drivers/dlm/device.c       | 1124 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/dlm/device.h       |   21 
 include/linux/dlm_device.h |   84 +++
 3 files changed, 1229 insertions(+)

--- a/include/linux/dlm_device.h	1970-01-01 07:30:00.000000000 +0730
+++ b/include/linux/dlm_device.h	2005-05-12 23:13:15.828485664 +0800
@@ -0,0 +1,84 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) Sistina Software, Inc.  1997-2003  All rights reserved.
+**  Copyright (C) 2004-2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+/* This is the device interface for dlm, most users will use a library
+ * interface.
+ */
+
+#define DLM_USER_LVB_LEN	32
+
+/* Version of the device interface */
+#define DLM_DEVICE_VERSION_MAJOR 3
+#define DLM_DEVICE_VERSION_MINOR 0
+#define DLM_DEVICE_VERSION_PATCH 0
+
+/* struct passed to the lock write */
+struct dlm_lock_params {
+	__u8 mode;
+	__u16 flags;
+	__u32 lkid;
+	__u32 parent;
+	struct dlm_range range;
+	__u8 namelen;
+        void __user *castparam;
+	void __user *castaddr;
+	void __user *bastparam;
+        void __user *bastaddr;
+	struct dlm_lksb __user *lksb;
+	char lvb[DLM_USER_LVB_LEN];
+	char name[1];
+};
+
+struct dlm_lspace_params {
+	__u32 flags;
+	__u32 minor;
+	char name[1];
+};
+
+struct dlm_write_request {
+	__u32 version[3];
+	__u8 cmd;
+
+	union  {
+		struct dlm_lock_params   lock;
+		struct dlm_lspace_params lspace;
+	} i;
+};
+
+/* struct read from the "device" fd,
+   consists mainly of userspace pointers for the library to use */
+struct dlm_lock_result {
+	__u32 length;
+	void __user * user_astaddr;
+	void __user * user_astparam;
+	struct dlm_lksb __user * user_lksb;
+	struct dlm_lksb lksb;
+	__u8 bast_mode;
+	/* Offsets may be zero if no data is present */
+	__u32 lvb_offset;
+};
+
+/* Commands passed to the device */
+#define DLM_USER_LOCK         1
+#define DLM_USER_UNLOCK       2
+#define DLM_USER_QUERY        3
+#define DLM_USER_CREATE_LOCKSPACE  4
+#define DLM_USER_REMOVE_LOCKSPACE  5
+
+/* Arbitrary length restriction */
+#define MAX_LS_NAME_LEN 64
+
+/* Lockspace flags */
+#define DLM_USER_LSFLG_AUTOFREE   1
+#define DLM_USER_LSFLG_FORCEFREE  2
+
--- a/drivers/dlm/device.c	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/device.c	2005-05-12 23:13:15.837484296 +0800
@@ -0,0 +1,1124 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) Sistina Software, Inc.  1997-2003  All rights reserved.
+**  Copyright (C) 2004-2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+/*
+ * device.c
+ *
+ * This is the userland interface to the DLM.
+ *
+ * The locking is done via a misc char device (find the
+ * registered minor number in /proc/misc).
+ *
+ * User code should not use this interface directly but
+ * call the library routines in libdlm.a instead.
+ *
+ */
+
+#include <linux/miscdevice.h>
+#include <linux/init.h>
+#include <linux/wait.h>
+#include <linux/module.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/signal.h>
+#include <linux/spinlock.h>
+#include <linux/idr.h>
+
+#include <linux/dlm.h>
+#include <linux/dlm_device.h>
+
+#include "lvb_table.h"
+#include "device.h"
+
+static struct file_operations _dlm_fops;
+static const char *name_prefix="dlm";
+static struct list_head user_ls_list;
+static struct semaphore user_ls_lock;
+
+/* Lock infos are stored in here indexed by lock ID */
+static DEFINE_IDR(lockinfo_idr);
+static rwlock_t lockinfo_lock;
+
+/* Flags in li_flags */
+#define LI_FLAG_COMPLETE   1
+#define LI_FLAG_FIRSTLOCK  2
+#define LI_FLAG_PERSISTENT 3
+
+/* flags in ls_flags*/
+#define LS_FLAG_DELETED   1
+#define LS_FLAG_AUTOFREE  2
+
+
+#define LOCKINFO_MAGIC 0x53595324
+
+struct lock_info {
+	uint32_t li_magic;
+	uint8_t li_cmd;
+	int8_t	li_grmode;
+	int8_t  li_rqmode;
+	struct dlm_lksb li_lksb;
+	wait_queue_head_t li_waitq;
+	unsigned long li_flags;
+	void __user *li_castparam;
+	void __user *li_castaddr;
+	void __user *li_bastparam;
+	void __user *li_bastaddr;
+	void __user *li_pend_bastparam;
+	void __user *li_pend_bastaddr;
+	struct list_head li_ownerqueue;
+	struct file_info *li_file;
+	struct dlm_lksb __user *li_user_lksb;
+	struct semaphore li_firstlock;
+};
+
+/* A queued AST no less */
+struct ast_info {
+	struct dlm_lock_result result;
+	struct list_head list;
+	uint32_t lvb_updated;
+	uint32_t progress;      /* How much has been read */
+};
+
+/* One of these per userland lockspace */
+struct user_ls {
+	void    *ls_lockspace;
+	atomic_t ls_refcnt;
+	long     ls_flags;
+
+	/* Passed into misc_register() */
+	struct miscdevice ls_miscinfo;
+	struct list_head  ls_list;
+};
+
+/* misc_device info for the control device */
+static struct miscdevice ctl_device;
+
+/*
+ * Stuff we hang off the file struct.
+ * The first two are to cope with unlocking all the
+ * locks help by a process when it dies.
+ */
+struct file_info {
+	struct list_head    fi_li_list;  /* List of active lock_infos */
+	spinlock_t          fi_li_lock;
+	struct list_head    fi_ast_list; /* Queue of ASTs to be delivered */
+	spinlock_t          fi_ast_lock;
+	wait_queue_head_t   fi_wait;
+	struct user_ls     *fi_ls;
+	atomic_t            fi_refcnt;   /* Number of users */
+	unsigned long       fi_flags;    /* Bit 1 means the device is open */
+};
+
+
+/* get and put ops for file_info.
+   Actually I don't really like "get" and "put", but everyone
+   else seems to use them and I can't think of anything
+   nicer at the moment */
+static void get_file_info(struct file_info *f)
+{
+	atomic_inc(&f->fi_refcnt);
+}
+
+static void put_file_info(struct file_info *f)
+{
+	if (atomic_dec_and_test(&f->fi_refcnt))
+		kfree(f);
+}
+
+static void release_lockinfo(struct lock_info *li)
+{
+	put_file_info(li->li_file);
+
+	write_lock(&lockinfo_lock);
+	idr_remove(&lockinfo_idr, li->li_lksb.sb_lkid);
+	write_unlock(&lockinfo_lock);
+
+	if (li->li_lksb.sb_lvbptr)
+		kfree(li->li_lksb.sb_lvbptr);
+	kfree(li);
+
+	module_put(THIS_MODULE);
+}
+
+static struct lock_info *get_lockinfo(uint32_t lockid)
+{
+	struct lock_info *li;
+
+	read_lock(&lockinfo_lock);
+	li = idr_find(&lockinfo_idr, lockid);
+	read_lock(&lockinfo_lock);
+
+	return li;
+}
+
+static int add_lockinfo(struct lock_info *li)
+{
+	int n;
+	int r;
+	int ret = -EINVAL;
+
+	write_lock(&lockinfo_lock);
+
+	if (idr_find(&lockinfo_idr, li->li_lksb.sb_lkid))
+		goto out_up;
+
+	ret = -ENOMEM;
+	r = idr_pre_get(&lockinfo_idr, GFP_KERNEL);
+	if (!r)
+		goto out_up;
+
+	r = idr_get_new_above(&lockinfo_idr, li, li->li_lksb.sb_lkid, &n);
+	if (r)
+		goto out_up;
+
+	if (n != li->li_lksb.sb_lkid) {
+		idr_remove(&lockinfo_idr, n);
+		goto out_up;
+	}
+
+	ret = 0;
+
+ out_up:
+	write_unlock(&lockinfo_lock);
+
+	return ret;
+}
+
+
+static struct user_ls *__find_lockspace(int minor)
+{
+	struct user_ls *lsinfo;
+
+	list_for_each_entry(lsinfo, &user_ls_list, ls_list) {
+		if (lsinfo->ls_miscinfo.minor == minor)
+			return lsinfo;
+	}
+	return NULL;
+}
+
+/* Find a lockspace struct given the device minor number */
+static struct user_ls *find_lockspace(int minor)
+{
+	struct user_ls *lsinfo;
+
+	down(&user_ls_lock);
+	lsinfo = __find_lockspace(minor);
+	up(&user_ls_lock);
+
+	return lsinfo;
+}
+
+static void add_lockspace_to_list(struct user_ls *lsinfo)
+{
+	down(&user_ls_lock);
+	list_add(&lsinfo->ls_list, &user_ls_list);
+	up(&user_ls_lock);
+}
+
+/* Register a lockspace with the DLM and create a misc
+   device for userland to access it */
+static int register_lockspace(char *name, struct user_ls **ls, int flags)
+{
+	struct user_ls *newls;
+	int status;
+	int namelen;
+
+	namelen = strlen(name)+strlen(name_prefix)+2;
+
+	newls = kmalloc(sizeof(struct user_ls), GFP_KERNEL);
+	if (!newls)
+		return -ENOMEM;
+	memset(newls, 0, sizeof(struct user_ls));
+
+	newls->ls_miscinfo.name = kmalloc(namelen, GFP_KERNEL);
+	if (!newls->ls_miscinfo.name) {
+		kfree(newls);
+		return -ENOMEM;
+	}
+
+	status = dlm_new_lockspace(name, strlen(name), &newls->ls_lockspace, 0,
+				   DLM_USER_LVB_LEN);
+	if (status != 0) {
+		kfree(newls->ls_miscinfo.name);
+		kfree(newls);
+		return status;
+	}
+
+	snprintf((char*)newls->ls_miscinfo.name, namelen, "%s_%s",
+		 name_prefix, name);
+
+	newls->ls_miscinfo.fops = &_dlm_fops;
+	newls->ls_miscinfo.minor = MISC_DYNAMIC_MINOR;
+
+	status = misc_register(&newls->ls_miscinfo);
+	if (status) {
+		printk(KERN_ERR "dlm: misc register failed for %s", name);
+		dlm_release_lockspace(newls->ls_lockspace, 0);
+		kfree(newls->ls_miscinfo.name);
+		kfree(newls);
+		return status;
+	}
+
+	if (flags & DLM_USER_LSFLG_AUTOFREE)
+		set_bit(LS_FLAG_AUTOFREE, &newls->ls_flags);
+
+	add_lockspace_to_list(newls);
+	*ls = newls;
+	return 0;
+}
+
+/* Called with the user_ls_lock semaphore held */
+static int unregister_lockspace(struct user_ls *lsinfo, int force)
+{
+	int status;
+
+	status = dlm_release_lockspace(lsinfo->ls_lockspace, force);
+	if (status)
+		return status;
+
+	status = misc_deregister(&lsinfo->ls_miscinfo);
+	if (status)
+		return status;
+
+	list_del(&lsinfo->ls_list);
+	set_bit(LS_FLAG_DELETED, &lsinfo->ls_flags);
+	lsinfo->ls_lockspace = NULL;
+	if (atomic_read(&lsinfo->ls_refcnt) == 0) {
+		kfree(lsinfo->ls_miscinfo.name);
+		kfree(lsinfo);
+	}
+
+	return 0;
+}
+
+/* Add it to userland's AST queue */
+static void add_to_astqueue(struct lock_info *li, void *astaddr, void *astparam,
+			    int lvb_updated)
+{
+	struct ast_info *ast = kmalloc(sizeof(struct ast_info), GFP_KERNEL);
+	if (!ast)
+		return;
+
+	memset(ast, 0, sizeof(*ast));
+	ast->result.user_astparam = astparam;
+	ast->result.user_astaddr  = astaddr;
+	ast->result.user_lksb     = li->li_user_lksb;
+	memcpy(&ast->result.lksb, &li->li_lksb, sizeof(struct dlm_lksb));
+	ast->lvb_updated = lvb_updated;
+
+	spin_lock(&li->li_file->fi_ast_lock);
+	list_add_tail(&ast->list, &li->li_file->fi_ast_list);
+	spin_unlock(&li->li_file->fi_ast_lock);
+	wake_up_interruptible(&li->li_file->fi_wait);
+}
+
+static void bast_routine(void *param, int mode)
+{
+	struct lock_info *li = param;
+
+	if (li && li->li_bastaddr)
+		add_to_astqueue(li, li->li_bastaddr, li->li_bastparam, 0);
+}
+
+/*
+ * This is the kernel's AST routine.
+ * All lock, unlock & query operations complete here.
+ * The only syncronous ops are those done during device close.
+ */
+static void ast_routine(void *param)
+{
+	struct lock_info *li = param;
+
+	/* Param may be NULL if a persistent lock is unlocked by someone else */
+	if (!li)
+		return;
+
+	/* If this is a succesful conversion then activate the blocking ast
+	 * args from the conversion request */
+	if (!test_bit(LI_FLAG_FIRSTLOCK, &li->li_flags) &&
+	    li->li_lksb.sb_status == 0) {
+
+		li->li_bastparam = li->li_pend_bastparam;
+		li->li_bastaddr = li->li_pend_bastaddr;
+		li->li_pend_bastaddr = NULL;
+	}
+
+	/* If it's an async request then post data to the user's AST queue. */
+	if (li->li_castaddr) {
+		int lvb_updated = 0;
+
+		/* See if the lvb has been updated */
+		if (dlm_lvb_operations[li->li_grmode+1][li->li_rqmode+1] == 1)
+			lvb_updated = 1;
+
+		if (li->li_lksb.sb_status == 0)
+			li->li_grmode = li->li_rqmode;
+
+		/* Only queue AST if the device is still open */
+		if (test_bit(1, &li->li_file->fi_flags))
+			add_to_astqueue(li, li->li_castaddr, li->li_castparam,
+					lvb_updated);
+
+		/* If it's a new lock operation that failed, then
+		 * remove it from the owner queue and free the
+		 * lock_info.
+		 */
+		if (test_and_clear_bit(LI_FLAG_FIRSTLOCK, &li->li_flags) &&
+		    li->li_lksb.sb_status != 0) {
+
+			/* Wait till dlm_lock() has finished */
+			down(&li->li_firstlock);
+			up(&li->li_firstlock);
+
+			spin_lock(&li->li_file->fi_li_lock);
+			list_del(&li->li_ownerqueue);
+			spin_unlock(&li->li_file->fi_li_lock);
+			release_lockinfo(li);
+			return;
+		}
+		/* Free unlocks & queries */
+		if (li->li_lksb.sb_status == -DLM_EUNLOCK ||
+		    li->li_cmd == DLM_USER_QUERY) {
+			release_lockinfo(li);
+		}
+	} else {
+		/* Synchronous request, just wake up the caller */
+		set_bit(LI_FLAG_COMPLETE, &li->li_flags);
+		wake_up_interruptible(&li->li_waitq);
+	}
+}
+
+/*
+ * Wait for the lock op to complete and return the status.
+ */
+static int wait_for_ast(struct lock_info *li)
+{
+	/* Wait for the AST routine to complete */
+	set_task_state(current, TASK_INTERRUPTIBLE);
+	while (!test_bit(LI_FLAG_COMPLETE, &li->li_flags))
+		schedule();
+
+	set_task_state(current, TASK_RUNNING);
+
+	return li->li_lksb.sb_status;
+}
+
+
+/* Open on control device */
+static int dlm_ctl_open(struct inode *inode, struct file *file)
+{
+	file->private_data = NULL;
+	return 0;
+}
+
+/* Close on control device */
+static int dlm_ctl_close(struct inode *inode, struct file *file)
+{
+	return 0;
+}
+
+/* Open on lockspace device */
+static int dlm_open(struct inode *inode, struct file *file)
+{
+	struct file_info *f;
+	struct user_ls *lsinfo;
+
+	lsinfo = find_lockspace(iminor(inode));
+	if (!lsinfo)
+		return -ENOENT;
+
+	f = kmalloc(sizeof(struct file_info), GFP_KERNEL);
+	if (!f)
+		return -ENOMEM;
+
+	atomic_inc(&lsinfo->ls_refcnt);
+	INIT_LIST_HEAD(&f->fi_li_list);
+	INIT_LIST_HEAD(&f->fi_ast_list);
+	spin_lock_init(&f->fi_li_lock);
+	spin_lock_init(&f->fi_ast_lock);
+	init_waitqueue_head(&f->fi_wait);
+	f->fi_ls = lsinfo;
+	atomic_set(&f->fi_refcnt, 1);
+	f->fi_flags = 0;
+	set_bit(1, &f->fi_flags);
+
+	file->private_data = f;
+
+	return 0;
+}
+
+/* Check the user's version matches ours */
+static int check_version(struct dlm_write_request *req)
+{
+	if (req->version[0] != DLM_DEVICE_VERSION_MAJOR ||
+	    (req->version[0] == DLM_DEVICE_VERSION_MAJOR &&
+	     req->version[1] > DLM_DEVICE_VERSION_MINOR)) {
+
+		printk(KERN_DEBUG "dlm: process %s (%d) version mismatch "
+		       "user (%d.%d.%d) kernel (%d.%d.%d),",
+		       current->comm,
+		       current->pid,
+		       req->version[0],
+		       req->version[1],
+		       req->version[2],
+		       DLM_DEVICE_VERSION_MAJOR,
+		       DLM_DEVICE_VERSION_MINOR,
+		       DLM_DEVICE_VERSION_PATCH);
+		return -EINVAL;
+	}
+	return 0;
+}
+
+/* Close on lockspace device */
+static int dlm_close(struct inode *inode, struct file *file)
+{
+	struct file_info *f = file->private_data;
+	struct lock_info li;
+	struct lock_info *old_li, *safe;
+	sigset_t tmpsig;
+	sigset_t allsigs;
+	struct user_ls *lsinfo;
+	DECLARE_WAITQUEUE(wq, current);
+
+	lsinfo = find_lockspace(iminor(inode));
+	if (!lsinfo)
+		return -ENOENT;
+
+	/* Mark this closed so that ASTs will not be delivered any more */
+	clear_bit(1, &f->fi_flags);
+
+	/* Block signals while we are doing this */
+	sigfillset(&allsigs);
+	sigprocmask(SIG_BLOCK, &allsigs, &tmpsig);
+
+	/* We use our own lock_info struct here, so that any
+	 * outstanding "real" ASTs will be delivered with the
+	 * corresponding "real" params, thus freeing the lock_info
+	 * that belongs the lock. This catches the corner case where
+	 * a lock is BUSY when we try to unlock it here
+	 */
+	memset(&li, 0, sizeof(li));
+	clear_bit(LI_FLAG_COMPLETE, &li.li_flags);
+	init_waitqueue_head(&li.li_waitq);
+	add_wait_queue(&li.li_waitq, &wq);
+
+	/*
+	 * Free any outstanding locks, they are on the
+	 * list in LIFO order so there should be no problems
+	 * about unlocking parents before children.
+	 */
+	list_for_each_entry_safe(old_li, safe, &f->fi_li_list, li_ownerqueue) {
+		int status;
+		int flags = 0;
+
+		/* Don't unlock persistent locks, just mark them orphaned */
+		if (test_bit(LI_FLAG_PERSISTENT, &old_li->li_flags)) {
+			list_del(&old_li->li_ownerqueue);
+
+			/* Update master copy */
+			/* TODO: Check locking core updates the local and
+			   remote ORPHAN flags */
+			li.li_lksb.sb_lkid = old_li->li_lksb.sb_lkid;
+			status = dlm_lock(f->fi_ls->ls_lockspace,
+					  old_li->li_grmode, &li.li_lksb,
+					  DLM_LKF_CONVERT|DLM_LKF_ORPHAN,
+					  NULL, 0, 0, ast_routine, NULL,
+					  NULL, NULL);
+			if (status != 0)
+				printk("dlm: Error orphaning lock %x: %d\n",
+				       old_li->li_lksb.sb_lkid, status);
+
+			/* But tidy our references in it */
+			release_lockinfo(old_li);
+			continue;
+		}
+
+		clear_bit(LI_FLAG_COMPLETE, &li.li_flags);
+
+		/* If it's not granted then cancel the request.
+		 * If the lock was WAITING then it will be dropped,
+		 *    if it was converting then it will be reverted to GRANTED,
+		 *    then we will unlock it.
+		 */
+
+		if (old_li->li_grmode != old_li->li_rqmode)
+			flags = DLM_LKF_CANCEL;
+
+		if (old_li->li_grmode >= DLM_LOCK_PW)
+			flags |= DLM_LKF_IVVALBLK;
+
+		status = dlm_unlock(f->fi_ls->ls_lockspace,
+				    old_li->li_lksb.sb_lkid, flags,
+				    &li.li_lksb, &li);
+		/* Must wait for it to complete as the next lock could be its
+		 * parent */
+		if (status == 0)
+			wait_for_ast(&li);
+
+		/* If it was waiting for a conversion, it will
+		   now be granted so we can unlock it properly */
+		if (flags & DLM_LKF_CANCEL) {
+			flags &= ~DLM_LKF_CANCEL;
+			clear_bit(LI_FLAG_COMPLETE, &li.li_flags);
+			status = dlm_unlock(f->fi_ls->ls_lockspace,
+					    old_li->li_lksb.sb_lkid, flags,
+					    &li.li_lksb, &li);
+			if (status == 0)
+				wait_for_ast(&li);
+		}
+		/* Unlock suceeded, free the lock_info struct. */
+		if (status == 0)
+			release_lockinfo(old_li);
+	}
+
+	remove_wait_queue(&li.li_waitq, &wq);
+
+	/*
+	 * If this is the last reference to the lockspace
+	 * then free the struct. If it's an AUTOFREE lockspace
+	 * then free the whole thing.
+	 */
+	down(&user_ls_lock);
+	if (atomic_dec_and_test(&lsinfo->ls_refcnt)) {
+
+		if (lsinfo->ls_lockspace) {
+			if (test_bit(LS_FLAG_AUTOFREE, &lsinfo->ls_flags)) {
+				/* TODO this breaks!
+				unregister_lockspace(lsinfo, 1); */
+			}
+		} else {
+			kfree(lsinfo->ls_miscinfo.name);
+			kfree(lsinfo);
+		}
+	}
+	up(&user_ls_lock);
+
+	/* Restore signals */
+	sigprocmask(SIG_SETMASK, &tmpsig, NULL);
+	recalc_sigpending();
+
+	return 0;
+}
+
+static int do_user_create_lockspace(struct file_info *fi, uint8_t cmd,
+				    struct dlm_lspace_params *kparams)
+{
+	int status;
+	struct user_ls *lsinfo;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	status = register_lockspace(kparams->name, &lsinfo, kparams->flags);
+
+	/* If it succeeded then return the minor number */
+	if (status == 0)
+		status = lsinfo->ls_miscinfo.minor;
+
+	return status;
+}
+
+static int do_user_remove_lockspace(struct file_info *fi, uint8_t cmd,
+				    struct dlm_lspace_params *kparams)
+{
+	int status;
+	int force = 1;
+	struct user_ls *lsinfo;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	down(&user_ls_lock);
+	lsinfo = __find_lockspace(kparams->minor);
+	if (!lsinfo) {
+		up(&user_ls_lock);
+		return -EINVAL;
+	}
+
+	if (kparams->flags & DLM_USER_LSFLG_FORCEFREE)
+		force = 2;
+
+	status = unregister_lockspace(lsinfo, force);
+	up(&user_ls_lock);
+
+	return status;
+}
+
+/* Read call, might block if no ASTs are waiting.
+ * It will only ever return one message at a time, regardless
+ * of how many are pending.
+ */
+static ssize_t dlm_read(struct file *file, char __user *buffer, size_t count,
+			loff_t *ppos)
+{
+	struct file_info *fi = file->private_data;
+	struct ast_info *ast;
+	int data_size;
+	int offset;
+	DECLARE_WAITQUEUE(wait, current);
+
+	if (count < sizeof(struct dlm_lock_result))
+		return -EINVAL;
+
+	spin_lock(&fi->fi_ast_lock);
+	if (list_empty(&fi->fi_ast_list)) {
+
+		/* No waiting ASTs.
+		 * Return EOF if the lockspace been deleted.
+		 */
+		if (test_bit(LS_FLAG_DELETED, &fi->fi_ls->ls_flags))
+			return 0;
+
+		if (file->f_flags & O_NONBLOCK) {
+			spin_unlock(&fi->fi_ast_lock);
+			return -EAGAIN;
+		}
+
+		add_wait_queue(&fi->fi_wait, &wait);
+
+	repeat:
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (list_empty(&fi->fi_ast_list) &&
+		    !signal_pending(current)) {
+
+			spin_unlock(&fi->fi_ast_lock);
+			schedule();
+			spin_lock(&fi->fi_ast_lock);
+			goto repeat;
+		}
+
+		current->state = TASK_RUNNING;
+		remove_wait_queue(&fi->fi_wait, &wait);
+
+		if (signal_pending(current)) {
+			spin_unlock(&fi->fi_ast_lock);
+			return -ERESTARTSYS;
+		}
+	}
+
+	ast = list_entry(fi->fi_ast_list.next, struct ast_info, list);
+	list_del(&ast->list);
+	spin_unlock(&fi->fi_ast_lock);
+
+	/* Work out the size of the returned data */
+	data_size = sizeof(struct dlm_lock_result);
+	if (ast->lvb_updated && ast->result.lksb.sb_lvbptr)
+		data_size += DLM_USER_LVB_LEN;
+
+	offset = sizeof(struct dlm_lock_result);
+
+	/* Room for the extended data ? */
+	if (count >= data_size) {
+
+		if (ast->lvb_updated && ast->result.lksb.sb_lvbptr) {
+			if (copy_to_user(buffer+offset,
+					 ast->result.lksb.sb_lvbptr,
+					 DLM_USER_LVB_LEN))
+				return -EFAULT;
+			ast->result.lvb_offset = offset;
+			offset += DLM_USER_LVB_LEN;
+		}
+	}
+
+	ast->result.length = data_size;
+	/* Copy the header now it has all the offsets in it */
+	if (copy_to_user(buffer, &ast->result, sizeof(struct dlm_lock_result)))
+		offset = -EFAULT;
+
+	/* If we only returned a header and there's more to come then put it
+	   back on the list */
+	if (count < data_size) {
+		spin_lock(&fi->fi_ast_lock);
+		list_add(&ast->list, &fi->fi_ast_list);
+		spin_unlock(&fi->fi_ast_lock);
+	} else
+		kfree(ast);
+	return offset;
+}
+
+static unsigned int dlm_poll(struct file *file, poll_table *wait)
+{
+	struct file_info *fi = file->private_data;
+
+	poll_wait(file, &fi->fi_wait, wait);
+
+	spin_lock(&fi->fi_ast_lock);
+	if (!list_empty(&fi->fi_ast_list)) {
+		spin_unlock(&fi->fi_ast_lock);
+		return POLLIN | POLLRDNORM;
+	}
+
+	spin_unlock(&fi->fi_ast_lock);
+	return 0;
+}
+
+static struct lock_info *allocate_lockinfo(struct file_info *fi, uint8_t cmd,
+					   struct dlm_lock_params *kparams)
+{
+	struct lock_info *li;
+
+	if (!try_module_get(THIS_MODULE))
+		return NULL;
+
+	li = kmalloc(sizeof(struct lock_info), GFP_KERNEL);
+	if (li) {
+		li->li_magic     = LOCKINFO_MAGIC;
+		li->li_file      = fi;
+		li->li_cmd       = cmd;
+		li->li_flags     = 0;
+		li->li_grmode    = -1;
+		li->li_rqmode    = -1;
+		li->li_pend_bastparam = NULL;
+		li->li_pend_bastaddr  = NULL;
+		li->li_castaddr   = NULL;
+		li->li_castparam  = NULL;
+		li->li_lksb.sb_lvbptr = NULL;
+		li->li_bastaddr  = kparams->bastaddr;
+		li->li_bastparam = kparams->bastparam;
+
+		get_file_info(fi);
+	}
+	return li;
+}
+
+static int do_user_lock(struct file_info *fi, uint8_t cmd,
+			struct dlm_lock_params *kparams)
+{
+	struct lock_info *li;
+	int status;
+
+	/*
+	 * Validate things that we need to have correct.
+	 */
+	if (!kparams->castaddr)
+		return -EINVAL;
+
+	if (!kparams->lksb)
+		return -EINVAL;
+
+	/* Persistent child locks are not available yet */
+	if ((kparams->flags & DLM_LKF_PERSISTENT) && kparams->parent)
+		return -EINVAL;
+
+        /* For conversions, there should already be a lockinfo struct,
+	   unless we are adopting an orphaned persistent lock */
+	if (kparams->flags & DLM_LKF_CONVERT) {
+
+		li = get_lockinfo(kparams->lkid);
+
+		/* If this is a persistent lock we will have to create a
+		   lockinfo again */
+		if (!li && DLM_LKF_PERSISTENT) {
+			li = allocate_lockinfo(fi, cmd, kparams);
+
+			li->li_lksb.sb_lkid = kparams->lkid;
+			li->li_castaddr  = kparams->castaddr;
+			li->li_castparam = kparams->castparam;
+
+			/* OK, this isn;t exactly a FIRSTLOCK but it is the
+			   first time we've used this lockinfo, and if things
+			   fail we want rid of it */
+			init_MUTEX_LOCKED(&li->li_firstlock);
+			set_bit(LI_FLAG_FIRSTLOCK, &li->li_flags);
+			add_lockinfo(li);
+
+			/* TODO: do a query to get the current state ?? */
+		}
+		if (!li)
+			return -EINVAL;
+
+		if (li->li_magic != LOCKINFO_MAGIC)
+			return -EINVAL;
+
+		/* For conversions don't overwrite the current blocking AST
+		   info so that:
+		   a) if a blocking AST fires before the conversion is queued
+		      it runs the current handler
+		   b) if the conversion is cancelled, the original blocking AST
+		      declaration is active
+		   The pend_ info is made active when the conversion
+		   completes.
+		*/
+		li->li_pend_bastaddr  = kparams->bastaddr;
+		li->li_pend_bastparam = kparams->bastparam;
+	} else {
+		li = allocate_lockinfo(fi, cmd, kparams);
+		if (!li)
+			return -ENOMEM;
+
+		/* semaphore to allow us to complete our work before
+  		   the AST routine runs. In fact we only need (and use) this
+		   when the initial lock fails */
+		init_MUTEX_LOCKED(&li->li_firstlock);
+		set_bit(LI_FLAG_FIRSTLOCK, &li->li_flags);
+	}
+
+	li->li_user_lksb = kparams->lksb;
+	li->li_castaddr  = kparams->castaddr;
+	li->li_castparam = kparams->castparam;
+	li->li_lksb.sb_lkid = kparams->lkid;
+	li->li_rqmode    = kparams->mode;
+	if (kparams->flags & DLM_LKF_PERSISTENT)
+		set_bit(LI_FLAG_PERSISTENT, &li->li_flags);
+
+	/* Copy in the value block */
+	if (kparams->flags & DLM_LKF_VALBLK) {
+		if (!li->li_lksb.sb_lvbptr) {
+			li->li_lksb.sb_lvbptr = kmalloc(DLM_USER_LVB_LEN,
+							GFP_KERNEL);
+			if (!li->li_lksb.sb_lvbptr) {
+				status = -ENOMEM;
+				goto out_err;
+			}
+		}
+
+		memcpy(li->li_lksb.sb_lvbptr, kparams->lvb, DLM_USER_LVB_LEN);
+	}
+
+	/* Lock it ... */
+	status = dlm_lock(fi->fi_ls->ls_lockspace,
+			  kparams->mode, &li->li_lksb,
+			  kparams->flags,
+			  kparams->name, kparams->namelen,
+			  kparams->parent,
+			  ast_routine,
+			  li,
+			  (li->li_pend_bastaddr || li->li_bastaddr) ?
+			   bast_routine : NULL,
+			  kparams->range.ra_end ? &kparams->range : NULL);
+	if (status)
+		goto out_err;
+
+	/* If it succeeded (this far) with a new lock then keep track of
+	   it on the file's lockinfo list */
+	if (!status && test_bit(LI_FLAG_FIRSTLOCK, &li->li_flags)) {
+
+		spin_lock(&fi->fi_li_lock);
+		list_add(&li->li_ownerqueue, &fi->fi_li_list);
+		spin_unlock(&fi->fi_li_lock);
+		if (add_lockinfo(li))
+			printk(KERN_WARNING "Add lockinfo failed\n");
+
+		up(&li->li_firstlock);
+	}
+
+	/* Return the lockid as the user needs it /now/ */
+	return li->li_lksb.sb_lkid;
+
+ out_err:
+	if (test_bit(LI_FLAG_FIRSTLOCK, &li->li_flags))
+		release_lockinfo(li);
+	return status;
+
+}
+
+static int do_user_unlock(struct file_info *fi, uint8_t cmd,
+			  struct dlm_lock_params *kparams)
+{
+	struct lock_info *li;
+	int status;
+	int convert_cancel = 0;
+
+	li = get_lockinfo(kparams->lkid);
+	if (!li) {
+		li = allocate_lockinfo(fi, cmd, kparams);
+		spin_lock(&fi->fi_li_lock);
+		list_add(&li->li_ownerqueue, &fi->fi_li_list);
+		spin_unlock(&fi->fi_li_lock);
+	}
+ 	if (!li)
+		return -ENOMEM;
+
+	if (li->li_magic != LOCKINFO_MAGIC)
+		return -EINVAL;
+
+	li->li_user_lksb = kparams->lksb;
+	li->li_castparam = kparams->castparam;
+	li->li_cmd       = cmd;
+
+	/* Cancelling a conversion doesn't remove the lock...*/
+	if (kparams->flags & DLM_LKF_CANCEL && li->li_grmode != -1)
+		convert_cancel = 1;
+
+	/* dlm_unlock() passes a 0 for castaddr which means don't overwrite
+	   the existing li_castaddr as that's the completion routine for
+	   unlocks. dlm_unlock_wait() specifies a new AST routine to be
+	   executed when the unlock completes. */
+	if (kparams->castaddr)
+		li->li_castaddr = kparams->castaddr;
+
+	/* Use existing lksb & astparams */
+	status = dlm_unlock(fi->fi_ls->ls_lockspace,
+			     kparams->lkid,
+			     kparams->flags, &li->li_lksb, li);
+
+	if (!status && !convert_cancel) {
+		spin_lock(&fi->fi_li_lock);
+		list_del(&li->li_ownerqueue);
+		spin_unlock(&fi->fi_li_lock);
+	}
+
+	return status;
+}
+
+/* Write call, submit a locking request */
+static ssize_t dlm_write(struct file *file, const char __user *buffer,
+			 size_t count, loff_t *ppos)
+{
+	struct file_info *fi = file->private_data;
+	struct dlm_write_request *kparams;
+	sigset_t tmpsig;
+	sigset_t allsigs;
+	int status;
+
+	/* -1 because lock name is optional */
+	if (count < sizeof(struct dlm_write_request)-1)
+		return -EINVAL;
+
+	/* Has the lockspace been deleted */
+	if (fi && test_bit(LS_FLAG_DELETED, &fi->fi_ls->ls_flags))
+		return -ENOENT;
+
+	kparams = kmalloc(count, GFP_KERNEL);
+	if (!kparams)
+		return -ENOMEM;
+
+	status = -EFAULT;
+	/* Get the command info */
+	if (copy_from_user(kparams, buffer, count))
+		goto out_free;
+
+	status = -EBADE;
+	if (check_version(kparams))
+		goto out_free;
+
+	/* Block signals while we are doing this */
+	sigfillset(&allsigs);
+	sigprocmask(SIG_BLOCK, &allsigs, &tmpsig);
+
+	status = -EINVAL;
+	switch (kparams->cmd)
+	{
+	case DLM_USER_LOCK:
+		if (!fi) goto out_sig;
+		status = do_user_lock(fi, kparams->cmd, &kparams->i.lock);
+		break;
+
+	case DLM_USER_UNLOCK:
+		if (!fi) goto out_sig;
+		status = do_user_unlock(fi, kparams->cmd, &kparams->i.lock);
+		break;
+
+	case DLM_USER_CREATE_LOCKSPACE:
+		if (fi) goto out_sig;
+		status = do_user_create_lockspace(fi, kparams->cmd,
+						  &kparams->i.lspace);
+		break;
+
+	case DLM_USER_REMOVE_LOCKSPACE:
+		if (fi) goto out_sig;
+		status = do_user_remove_lockspace(fi, kparams->cmd,
+						  &kparams->i.lspace);
+		break;
+	default:
+		printk("Unknown command passed to DLM device : %d\n",
+			kparams->cmd);
+		break;
+	}
+
+ out_sig:
+	/* Restore signals */
+	sigprocmask(SIG_SETMASK, &tmpsig, NULL);
+	recalc_sigpending();
+
+ out_free:
+	kfree(kparams);
+	if (status == 0)
+		return count;
+	else
+		return status;
+}
+
+/* Called when the cluster is shutdown uncleanly, all lockspaces
+   have been summarily removed */
+void dlm_device_free_devices()
+{
+	struct user_ls *tmp;
+	struct user_ls *lsinfo;
+
+	down(&user_ls_lock);
+	list_for_each_entry_safe(lsinfo, tmp, &user_ls_list, ls_list) {
+		misc_deregister(&lsinfo->ls_miscinfo);
+
+		/* Tidy up, but don't delete the lsinfo struct until
+		   all the users have closed their devices */
+		list_del(&lsinfo->ls_list);
+		set_bit(LS_FLAG_DELETED, &lsinfo->ls_flags);
+		lsinfo->ls_lockspace = NULL;
+	}
+	up(&user_ls_lock);
+}
+
+static struct file_operations _dlm_fops = {
+      .open    = dlm_open,
+      .release = dlm_close,
+      .read    = dlm_read,
+      .write   = dlm_write,
+      .poll    = dlm_poll,
+      .owner   = THIS_MODULE,
+};
+
+static struct file_operations _dlm_ctl_fops = {
+      .open    = dlm_ctl_open,
+      .release = dlm_ctl_close,
+      .write   = dlm_write,
+      .owner   = THIS_MODULE,
+};
+
+/*
+ * Create control device
+ */
+int __init dlm_device_init(void)
+{
+	int r;
+
+	INIT_LIST_HEAD(&user_ls_list);
+	init_MUTEX(&user_ls_lock);
+	rwlock_init(&lockinfo_lock);
+
+	ctl_device.name = "dlm-control";
+	ctl_device.fops = &_dlm_ctl_fops;
+	ctl_device.minor = MISC_DYNAMIC_MINOR;
+
+	r = misc_register(&ctl_device);
+	if (r) {
+		printk(KERN_ERR "dlm: misc_register failed for control device");
+		return r;
+	}
+
+	return 0;
+}
+
+void __exit dlm_device_exit(void)
+{
+	misc_deregister(&ctl_device);
+}
+
+MODULE_DESCRIPTION("Distributed Lock Manager device interface");
+MODULE_AUTHOR("Red Hat, Inc.");
+MODULE_LICENSE("GPL");
+
+module_init(dlm_device_init);
+module_exit(dlm_device_exit);
+
--- a/drivers/dlm/device.h	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/device.h	2005-05-12 23:13:15.828485664 +0800
@@ -0,0 +1,21 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) Sistina Software, Inc.  1997-2003  All rights reserved.
+**  Copyright (C) 2004-2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#ifndef __DEVICE_DOT_H__
+#define __DEVICE_DOT_H__
+
+extern void dlm_device_free_devices(void);
+extern int dlm_device_init(void);
+extern void dlm_device_exit(void);
+#endif				/* __DEVICE_DOT_H__ */
+



From teigland at redhat.com  Mon May 16 07:21:11 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 16 May 2005 15:21:11 +0800
Subject: [Linux-cluster] [PATCH 7/8] dlm: debug fs
Message-ID: <20050516072111.GL7094@redhat.com>

A CONFIG setting optionally adds this file which creates a debugfs file
for each lockspace: /debug/dlm/<lockspace_name>.  Reading the debugfs file
displays all resources/locks currently managed in the given lockspace.

Signed-off-by: Dave Teigland <teigland at redhat.com>
Signed-off-by: Patrick Caulfield <pcaulfie at redhat.com>

---

 drivers/dlm/debug_fs.c |  305 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 305 insertions(+)

--- a/drivers/dlm/debug_fs.c	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/debug_fs.c	2005-05-12 23:13:15.828485664 +0800
@@ -0,0 +1,305 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) 2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#include <linux/pagemap.h>
+#include <linux/seq_file.h>
+#include <linux/module.h>
+#include <linux/ctype.h>
+#include <linux/debugfs.h>
+
+#include "dlm_internal.h"
+
+
+static struct dentry *dlm_root;
+
+struct rsb_iter {
+	int entry;
+	struct dlm_ls *ls;
+	struct list_head *next;
+	struct dlm_rsb *rsb;
+};
+
+static char *print_lockmode(int mode)
+{
+	switch (mode) {
+	case DLM_LOCK_IV:
+		return "--";
+	case DLM_LOCK_NL:
+		return "NL";
+	case DLM_LOCK_CR:
+		return "CR";
+	case DLM_LOCK_CW:
+		return "CW";
+	case DLM_LOCK_PR:
+		return "PR";
+	case DLM_LOCK_PW:
+		return "PW";
+	case DLM_LOCK_EX:
+		return "EX";
+	default:
+		return "??";
+	}
+}
+
+static void print_lock(struct seq_file *s, struct dlm_lkb *lkb,
+		       struct dlm_rsb *res)
+{
+	seq_printf(s, "%08x %s", lkb->lkb_id, print_lockmode(lkb->lkb_grmode));
+
+	if (lkb->lkb_status == DLM_LKSTS_CONVERT
+	    || lkb->lkb_status == DLM_LKSTS_WAITING)
+		seq_printf(s, " (%s)", print_lockmode(lkb->lkb_rqmode));
+
+	if (lkb->lkb_range) {
+		/* FIXME: this warns on Alpha */
+		if (lkb->lkb_status == DLM_LKSTS_CONVERT
+		    || lkb->lkb_status == DLM_LKSTS_GRANTED)
+			seq_printf(s, " %" PRIx64 "-%" PRIx64,
+				   lkb->lkb_range[GR_RANGE_START],
+				   lkb->lkb_range[GR_RANGE_END]);
+		if (lkb->lkb_status == DLM_LKSTS_CONVERT
+		    || lkb->lkb_status == DLM_LKSTS_WAITING)
+			seq_printf(s, " (%" PRIx64 "-%" PRIx64 ")",
+				   lkb->lkb_range[RQ_RANGE_START],
+				   lkb->lkb_range[RQ_RANGE_END]);
+	}
+
+	if (lkb->lkb_nodeid) {
+		if (lkb->lkb_nodeid != res->res_nodeid)
+			seq_printf(s, " Remote: %3d %08x", lkb->lkb_nodeid,
+				   lkb->lkb_remid);
+		else
+			seq_printf(s, " Master:     %08x", lkb->lkb_remid);
+	}
+
+	if (lkb->lkb_wait_type)
+		seq_printf(s, " wait_type: %d", lkb->lkb_wait_type);
+
+	seq_printf(s, "\n");
+}
+
+static int print_resource(struct dlm_rsb *res, struct seq_file *s)
+{
+	struct dlm_lkb *lkb;
+	int i, lvblen = res->res_ls->ls_lvblen;
+
+	seq_printf(s, "\nResource %p Name (len=%d) \"", res, res->res_length);
+	for (i = 0; i < res->res_length; i++) {
+		if (isprint(res->res_name[i]))
+			seq_printf(s, "%c", res->res_name[i]);
+		else
+			seq_printf(s, "%c", '.');
+	}
+	if (res->res_nodeid)
+		seq_printf(s, "\"  \nLocal Copy, Master is node %d\n",
+			   res->res_nodeid);
+	else
+		seq_printf(s, "\"  \nMaster Copy\n");
+
+	/* Print the LVB: */
+	if (res->res_lvbptr) {
+		seq_printf(s, "LVB: ");
+		for (i = 0; i < lvblen; i++) {
+			if (i == lvblen / 2)
+				seq_printf(s, "\n     ");
+			seq_printf(s, "%02x ",
+				   (unsigned char) res->res_lvbptr[i]);
+		}
+		if (test_bit(RESFL_VALNOTVALID, &res->res_flags))
+			seq_printf(s, " (INVALID)");
+		seq_printf(s, "\n");
+	}
+
+	/* Print the locks attached to this resource */
+	seq_printf(s, "Granted Queue\n");
+	list_for_each_entry(lkb, &res->res_grantqueue, lkb_statequeue)
+		print_lock(s, lkb, res);
+
+	seq_printf(s, "Conversion Queue\n");
+	list_for_each_entry(lkb, &res->res_convertqueue, lkb_statequeue)
+		print_lock(s, lkb, res);
+
+	seq_printf(s, "Waiting Queue\n");
+	list_for_each_entry(lkb, &res->res_waitqueue, lkb_statequeue)
+		print_lock(s, lkb, res);
+
+	return 0;
+}
+
+static int rsb_iter_next(struct rsb_iter *ri)
+{
+	struct dlm_ls *ls = ri->ls;
+	int i;
+
+	if (!ri->next) {
+ top:
+		/* Find the next non-empty hash bucket */
+		for (i = ri->entry; i < ls->ls_rsbtbl_size; i++) {
+			read_lock(&ls->ls_rsbtbl[i].lock);
+			if (!list_empty(&ls->ls_rsbtbl[i].list)) {
+				ri->next = ls->ls_rsbtbl[i].list.next;
+				read_unlock(&ls->ls_rsbtbl[i].lock);
+				break;
+			}
+			read_unlock(&ls->ls_rsbtbl[i].lock);
+                }
+		ri->entry = i;
+
+		if (ri->entry >= ls->ls_rsbtbl_size)
+			return 1;
+	} else {
+		i = ri->entry;
+		read_lock(&ls->ls_rsbtbl[i].lock);
+		ri->next = ri->next->next;
+		if (ri->next->next == ls->ls_rsbtbl[i].list.next) {
+			/* End of list - move to next bucket */
+			ri->next = NULL;
+			ri->entry++;
+			read_unlock(&ls->ls_rsbtbl[i].lock);
+			goto top;
+                }
+		read_unlock(&ls->ls_rsbtbl[i].lock);
+	}
+	ri->rsb = list_entry(ri->next, struct dlm_rsb, res_hashchain);
+
+	return 0;
+}
+
+static void rsb_iter_free(struct rsb_iter *ri)
+{
+	kfree(ri);
+}
+
+static struct rsb_iter *rsb_iter_init(struct dlm_ls *ls)
+{
+	struct rsb_iter *ri;
+
+	ri = kmalloc(sizeof *ri, GFP_KERNEL);
+	if (!ri)
+		return NULL;
+
+	ri->ls = ls;
+	ri->entry = 0;
+	ri->next = NULL;
+
+	if (rsb_iter_next(ri)) {
+		rsb_iter_free(ri);
+		return NULL;
+	}
+
+	return ri;
+}
+
+static void *seq_start(struct seq_file *file, loff_t *pos)
+{
+	struct rsb_iter *ri;
+	loff_t n = *pos;
+
+	ri = rsb_iter_init(file->private);
+	if (!ri)
+		return NULL;
+
+	while (n--) {
+		if (rsb_iter_next(ri)) {
+			rsb_iter_free(ri);
+			return NULL;
+		}
+	}
+
+	return ri;
+}
+
+static void *seq_next(struct seq_file *file, void *iter_ptr, loff_t *pos)
+{
+	struct rsb_iter *ri = iter_ptr;
+
+	(*pos)++;
+
+	if (rsb_iter_next(ri)) {
+		rsb_iter_free(ri);
+		return NULL;
+	}
+
+	return ri;
+}
+
+static void seq_stop(struct seq_file *file, void *iter_ptr)
+{
+	/* nothing for now */
+}
+
+static int seq_show(struct seq_file *file, void *iter_ptr)
+{
+	struct rsb_iter *ri = iter_ptr;
+
+	print_resource(ri->rsb, file);
+
+	return 0;
+}
+
+static struct seq_operations dlm_seq_ops = {
+	.start = seq_start,
+	.next  = seq_next,
+	.stop  = seq_stop,
+	.show  = seq_show,
+};
+
+static int do_open(struct inode *inode, struct file *file)
+{
+	struct seq_file *seq;
+	int ret;
+
+	ret = seq_open(file, &dlm_seq_ops);
+	if (ret)
+		return ret;
+
+	seq = file->private_data;
+	seq->private = inode->u.generic_ip;
+
+	return 0;
+}
+
+static struct file_operations dlm_fops = {
+	.owner   = THIS_MODULE,
+	.open    = do_open,
+	.read    = seq_read,
+	.llseek  = seq_lseek,
+	.release = seq_release
+};
+
+int dlm_create_debug_file(struct dlm_ls *ls)
+{
+	ls->ls_debug_dentry = debugfs_create_file(ls->ls_name,
+						  S_IFREG | S_IRUGO,
+						  dlm_root,
+						  ls,
+						  &dlm_fops);
+	return ls->ls_debug_dentry ? 0 : -ENOMEM;
+}
+
+void dlm_delete_debug_file(struct dlm_ls *ls)
+{
+	if (ls->ls_debug_dentry)
+		debugfs_remove(ls->ls_debug_dentry);
+}
+
+int dlm_register_debugfs(void)
+{
+	dlm_root = debugfs_create_dir("dlm", NULL);
+	return dlm_root ? 0 : -ENOMEM;
+}
+
+void dlm_unregister_debugfs(void)
+{
+	debugfs_remove(dlm_root);
+}
+



From teigland at redhat.com  Mon May 16 07:21:23 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 16 May 2005 15:21:23 +0800
Subject: [Linux-cluster] [PATCH 8/8] dlm: build
Message-ID: <20050516072123.GM7094@redhat.com>

Adds the dlm to the build system.

Signed-off-by: Dave Teigland <teigland at redhat.com>
Signed-off-by: Patrick Caulfield <pcaulfie at redhat.com>

---

 drivers/Kconfig      |    2 ++
 drivers/Makefile     |    1 +
 drivers/dlm/Kconfig  |   27 +++++++++++++++++++++++++++
 drivers/dlm/Makefile |   23 +++++++++++++++++++++++
 4 files changed, 53 insertions(+)

--- a/drivers/dlm/Makefile	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/Makefile	2005-05-12 23:13:15.832485056 +0800
@@ -0,0 +1,23 @@
+obj-$(CONFIG_DLM) +=		dlm.o
+obj-$(CONFIG_DLM_DEVICE) +=	dlm_device.o
+
+dlm-y :=			ast.o \
+				config.o \
+				dir.o \
+				lock.o \
+				lockspace.o \
+				lowcomms.o \
+				main.o \
+				member.o \
+				member_sysfs.o \
+				memory.o \
+				midcomms.o \
+				node_ioctl.o \
+				rcom.o \
+				recover.o \
+				recoverd.o \
+				requestqueue.o \
+				util.o
+dlm-$(CONFIG_DLM_DEBUG) +=	debug_fs.o
+
+dlm_device-y :=			device.o
--- a/drivers/Makefile  2005-04-25 15:40:15.000000000 +0800
+++ b/drivers/Makefile  2005-04-25 16:10:10.228660648 +0800
@@ -64,3 +64,4 @@
 obj-$(CONFIG_BLK_DEV_SGIIOC4)	+= sn/
 obj-y				+= firmware/
 obj-$(CONFIG_CRYPTO)		+= crypto/
+obj-$(CONFIG_DLM)		+= dlm/
--- a/drivers/dlm/Kconfig	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/Kconfig	2005-05-12 23:13:15.833484904 +0800
@@ -0,0 +1,27 @@
+menu "Distributed Lock Manager"
+	depends on INET && EXPERIMENTAL
+
+config DLM 
+	tristate "Distributed Lock Manager (DLM)"
+	select IP_SCTP
+	help
+	A general purpose distributed lock manager for kernel or userspace
+	applications.
+
+config DLM_DEVICE
+	tristate "DLM device for userspace access"
+	depends on DLM
+	help
+	This module creates a misc device through which the dlm lockspace
+	and locking functions become available to userspace applications
+	(usually through the libdlm library).
+
+config DLM_DEBUG
+	bool "DLM debugging"
+	depends on DLM
+	help
+	Under the debugfs mount point, the name of each lockspace will
+	appear as a file in the "dlm" directory.  The output is the
+	list of resource and locks the local node knows about.
+
+endmenu
--- a/drivers/Kconfig       2005-03-02 15:38:26.000000000 +0800
+++ b/drivers/Kconfig       2005-04-25 16:01:50.476634504 +0800
@@ -58,4 +58,6 @@
 
 source "drivers/infiniband/Kconfig"
 
+source "drivers/dlm/Kconfig"
+
 endmenu



From phung at cs.columbia.edu  Mon May 16 07:17:52 2005
From: phung at cs.columbia.edu (Dan B. Phung)
Date: Mon, 16 May 2005 03:17:52 -0400 (EDT)
Subject: [Linux-cluster] cman_tool join causes other nodes to kernel panic
In-Reply-To: <428848C0.9060601@redhat.com>
Message-ID: <Pine.LNX.4.44.0505160317250.6562-100000@algiers.clic.cs.columbia.edu>

yes, I made sure that each node has a unique nodeid.

On 16, May, 2005, Patrick Caulfield declared:

> Dan B. Phung wrote:
> > yes, I updated the cluster.conf by adding more nodes.  
> > 
> > e.g., I added a couple of these blocks (6 more nodes to be exact)
> > 
> >         <clusternode name="blade06" nodeid="1" votes="1">
>                                       ^^^^^^^^^^
> 
> Errr. I hope you changed the nodeid for each node ?
> 
> 
> 

-- 



From anushkumar at hotmail.com  Mon May 16 13:54:56 2005
From: anushkumar at hotmail.com (anush kumar)
Date: Mon, 16 May 2005 19:24:56 +0530
Subject: [Linux-cluster] reg clsuter
Message-ID: <BAY103-F56FAA1062C3312135099ACE150@phx.gbl>

Hi,

I have installed redhat 9.0.I configured the system as NIS client.I 
installed openmosix. the installlation was successful.When i booted up with 
the new kernel ypbind is not working.
done ps -eaf | grep ypbind. It does not show ypbind process running.run 
manually ypbind from root prompt no go. I checked from /var/log/messages 
this is messages i get on system
cluster3 ypbind: Setting NIS domain name migv:  succeeded
cluster3 ypbind: ypbind startup succeeded
cluster3 ypbind: ypbind shutdown failed
cluster3 ypbind: attempting to contact yp server failed

thanks in advance

Rgs,
Anush.R




From teigland at redhat.com  Tue May 17 09:57:04 2005
From: teigland at redhat.com (David Teigland)
Date: Tue, 17 May 2005 17:57:04 +0800
Subject: [Linux-cluster] Re: [PATCH 0/8] dlm: overview
In-Reply-To: <20050517001133.64d50d8c.akpm@osdl.org>
References: <20050516071949.GE7094@redhat.com>
	<20050517001133.64d50d8c.akpm@osdl.org>
Message-ID: <20050517095704.GA12081@redhat.com>

On Tue, May 17, 2005 at 12:11:33AM -0700, Andrew Morton wrote:

> The usual fallback is to identify all the stakeholders and get them to say
> "yes Andrew, this code is cool and we can use it", but I don't think the
> clustering teams have sufficent act-togetherness to be able to do that.
> 
> Am I correct in believing that this DLM is designed to be used by multiple
> clustering products?  If so, which ones, and how far along are they?

Correct.  Red Hat has multiple clustering products that do and will use
this.  GFS and CLVM are two notable ones that do now.  GFS kernel patches
that use this will be sent in the near future; CLVM uses the dlm from user
space.

Here are my impressions of where other clustering groups are at:

OpenSSI: this project is interested in integrating an existing dlm into
their system.  They don't have any effort under way to develop a dlm
themselves.  They are also interested in using gfs which is indirectly an
interest in the dlm.

Linux-HA: seem to be in a similar situation as OpenSSI

Lustre: this project includes a locking system called "dlm".  The api is
similar, but the implementation is not distributed in the style of a
traditional dlm; what they have is largely Lustre-specific [1].

OCFS2: this project includes a dlm intended for OCFS2, but there have been
some steps to make it more generic.  I believe their goal is still to
develop their dlm primarily for ocfs2's needs, not to develop an entirely
general purpose dlm such as this one.  So again, what's there is limited
and largely OCFS2-specific [1].

It would be very good to hear from any others who are interested in using
a dlm, too.

[1] Application-specific lock managers such as Lustre's and OCFS2's can be
good ideas, and I'm not criticising them.  It allows you to make
specializations and simplifications to suit your particular needs.  So,
while I'm sure it's possible for them to use this general-purpose dlm,
some may still want to do their own specialized thing.  We'll try to add
features and options to the general-purpose dlm to meet specialized needs,
but there's a limit to that.


> Looking at Ted's latest Kernel Summit agenda I see
> 
>  Clustering
> 
>  	We need to make progress on the kernel integration of things
>  	like message passing, membership, DLM etc.
> 
>  	We seem to have at least two comparable kernel-side offerings
>  	(OpenSSI and RHAT's), as well as a need to hash out how user-space
>  	plays into this.
> 
>  	(There is now a plan for a Clustering Summit prior to KS - need
>  	to validate if this will be useful, still)
> 
> So right now I'm inclined to duck any decision making and see what happens
> in July.  Does that sounds sane?  Is that Clustering Summit going to happen?

To some extent I'm sure the different clustering meetings will happen,
although I won't be at any of them.  I'm forging ahead trying to work
things out rather than waiting.  (Frankly, I don't think waiting for a
cluster summit will matter much.)

It's worth noting that most of the clustering discussions are now about
user space stuff.  Since the dlm is agnostic about what's in user space,
the dlm discussion and other clustering topics are largely independent.


> In the meanwhile I can pop this code into -mm so it gets a bit more
> compile testing, review and exposure, but that wouldn't signify anything
> further.

Sure, more feedback is what we're after.

Dave



From mishagreen at gmail.com  Tue May 17 10:14:02 2005
From: mishagreen at gmail.com (Michael Green)
Date: Tue, 17 May 2005 13:14:02 +0300
Subject: [Linux-cluster] LVS error: "bad load average"
Message-ID: <17e909a4050517031458b6846b@mail.gmail.com>

Getting the following errors in the log file from the LVS:

May 17 12:57:58 biocsm nanny[7310]: bad load average returned: biocl1 
      up  13+02:00,     0 users,  load 2.00, 1.98, 1.90 biocl2     up 
13+02:13,     0 users,  load 1.98, 1.96, 1.90 biocl3        up 
13+02:13,     0 users,  load 0.98, 0.96, 0.91 biocl4        up
13+02:12,     0 users,  load 2.00, 2.00, 1.91 biocl5        up 
13+02:13,     0 users,  load 1.98, 1.96, 1.90 biocl6        up 
13+02:13,     0 users,  load 1.98, 1.96, 1.90 biocsm        up 
13+22:08,     0 users,  load 0.00, 0.00, 0.00
May 17 12:57:58 biocsm nanny[7305]: bad load average returned: biocl1 
      up  13+02:00,     0 users,  load 2.00, 1.98, 1.90 biocl2     up 
13+02:13,     0 users,  load 1.98, 1.96, 1.90 biocl3        up 
13+02:13,     0 users,  load 0.98, 0.96, 0.91 biocl4        up
13+02:12,     0 users,  load 2.00, 2.00, 1.91 biocl5        up 
13+02:13,     0 users,  load 1.98, 1.96, 1.90 biocl6        up 
13+02:13,     0 users,  load 1.98, 1.96, 1.90 biocsm        up 
13+22:08,     0 users,  load 0.00, 0.00, 0.00

and so on... the /var/log/messages is literally flooded with these...

I've googled, I've search through the LVS & Piranha mail lists and
found that other people also had same problem, but I havn't found any
definitive answer.

Please help.

My lvs.cf is as following:

[root at biocsm ha]# more lvs.cf
serial_no = 35
primary = 132.77.90.131
service = lvs
backup = 0.0.0.0
heartbeat = 1
heartbeat_port = 539
keepalive = 6
deadtime = 18
network = nat
nat_router = 192.168.1.30 eth1:1
debug_level = NONE
virtual HTTP {
     active = 1
     address = 132.77.90.131 eth0:1
     vip_nmask = 255.255.255.0
     port = 80
     send = "GET / HTTP/1.0\r\n\r\n"
     expect = "HTTP"
     use_regex = 0
     load_monitor = ruptime
     scheduler = wlc
     protocol = tcp
     timeout = 6
     reentry = 15
     quiesce_server = 1
     server biocl1 {
         address = 192.168.1.31
         active = 1
         weight = 1
     }
     server biocl2 {
         address = 192.168.1.32
         active = 1
         weight = 1
     }
     server biocl3 {
         address = 192.168.1.33
         active = 1
         weight = 1
     }
     server biocl4 {
         address = 192.168.1.34
         active = 1
         weight = 1
     }
     server biocl5 {
         address = 192.168.1.35
         active = 1
         weight = 1
     }
     server biocl6 {
         address = 192.168.1.36
         active = 1
         weight = 1
     }
}


-- 
Warm regards,
Michael Green



From lroland at gmail.com  Tue May 17 13:26:09 2005
From: lroland at gmail.com (Lars Roland)
Date: Tue, 17 May 2005 15:26:09 +0200
Subject: [Linux-cluster] Re: [PATCH 0/8] dlm: overview
In-Reply-To: <20050517001133.64d50d8c.akpm@osdl.org>
References: <20050516071949.GE7094@redhat.com>
	<20050517001133.64d50d8c.akpm@osdl.org>
Message-ID: <4ad99e0505051706265ea06f7@mail.gmail.com>

On 5/17/05, Andrew Morton <akpm at osdl.org> wrote:
> The usual fallback is to identify all the stakeholders and get them to say
> "yes Andrew, this code is cool and we can use it", but I don't think the
> clustering teams have sufficent act-togetherness to be able to do that.

It is highly unlikely that this will ever happen because of the
different nature of these projects. That said getting a kernel system
for dlm and message passing should be doable given the wide variety
off uses they have.



From phillips at istop.com  Tue May 17 17:07:02 2005
From: phillips at istop.com (Daniel Phillips)
Date: Tue, 17 May 2005 13:07:02 -0400
Subject: [Linux-cluster] Re: [PATCH 0/8] dlm: overview
In-Reply-To: <20050517001133.64d50d8c.akpm@osdl.org>
References: <20050516071949.GE7094@redhat.com>
	<20050517001133.64d50d8c.akpm@osdl.org>
Message-ID: <200505171307.03075.phillips@istop.com>

Hi Andrew,

On Tuesday 17 May 2005 03:11, you wrote:
>   (There is now a plan for a Clustering Summit prior to KS - need
>   to validate if this will be useful, still)

The public announcement, with venue and travel details will be tomorrow.  This 
is a technical workshop that will take place on June 20 and 21st in Walldorf 
Germany, near Heidelberg and south of Frankfurt.   The goal is to either 
validate, reject, modify, or combine the kernel interfaces that Red Hat has 
proposed, and that other groups may propose in the next few weeks.  We will 
only be looking at code that actually exists.

For those who cannot physically attend, we will do our utmost to provide a 
two-way audio connection, hopefully without toll charges.

This Second Annual Cluster Summit (sorry if it sounds presumptuous) is 
sponsored by Red Hat, Hewlett-Packard, Fujitsu-Siemens, Dell, and Oracle.  My 
apologies in advance for leaving only four weeks leadtime for the public 
announcement.

Regards,

Daniel



From scottsl at ornl.gov  Fri May 13 05:07:18 2005
From: scottsl at ornl.gov (Stephen L. Scott)
Date: Fri, 13 May 2005 01:07:18 -0400
Subject: [Linux-cluster] Re: [Clusters_sig] Planning a Cluster meeting at OLS
In-Reply-To: <20050511100544.GS25070@marowsky-bree.de>
References: <3689AF909D816446BA505D21F1461AE403A3B001@cacexc04.americas.cpqcorp.net>
	<200505102204.06944.phillips@redhat.com>
	<20050511100544.GS25070@marowsky-bree.de>
Message-ID: <42843606.3050708@ornl.gov>

can we at least decide before or after ols dates - some of us have to 
wait on a lengthy process for foreign travel approval.


------------------------------------------------------------------------
Stephen L. Scott, Ph.D.                 voice: 865-574-3144
Oak Ridge National Laboratory           fax:   865-576-5491
P. O. Box 2008, Bldg. 5600, MS-6016     scottsl at ornl.gov
Oak Ridge, TN 37831-6016                http://www.csm.ornl.gov/~sscott/
------------------------------------------------------------------------



Lars Marowsky-Bree wrote:
> On 2005-05-10T22:04:06, Daniel Phillips <phillips at redhat.com> wrote:
> 
> 
>>If there is support for this then I will do what I can to pitch in and 
>>help.  But it's a pretty tall order.  You did not address the question 
>>of sponsorship, and I can't vouch for the chances of Red Hat funding 
>>travel to two dedicated cluster events a month apart.  Maybe back in 
>>1999 that would have worked!
> 
> 
> All we need is room. If we're just 10-20 people, hell, even one of the
> business suites in Les Suites will do - or we crouch in the hallway of
> the Kernel Summit ;-)
> 
> Bruce is looking at getting a real conference room at Les Suites
> though. I'm sure between OSDL, RHAT, Novell, HP etc we'll be able to
> sort that one out.
> 
> Given that many people will be able to make OLS + one or two days before
> that for the Kernel Summit (or now the Cluster Summit), but not Walldorf
> - and maybe vice-versa -, I think having both venues is a pretty good
> idea.
>   
> 
> Sincerely,
>     Lars Marowsky-Br?e <lmb at suse.de>
> 



From debug at MIT.EDU  Sun May 15 22:36:18 2005
From: debug at MIT.EDU (Cluster 2005)
Date: Sun, 15 May 2005 18:36:18 -0400
Subject: [Linux-cluster] Cluster 2005 Boston Paper deadline extended to May
	21
Message-ID: <5.2.1.1.2.20050515183540.03ae0cb8@hesiod>


               The 2005 IEEE International Conference on Cluster Computing
                               September 27-30, 2005
               Burlington Marriott Boston, Burlington, Massachusetts, USA
                                http://cluster2005.org
                                 info at cluster2005.org

--------------------------------------------------------------------------------

Commodity clusters today provide a convenient and cost-effective platform for
executing complex computation-, data- and/or transaction-centric applications.
Many research and development challenges remain in all areas of cluster 
computing,
including middleware, networking, algorithms and applications, resource 
management,
platform deployment and maintenance, and integration of clusters into 
computational
grids. Cluster 2005 provides an open forum for researchers, practitioners 
and users
to present and discuss issues, directions, and results that will shape the 
future
of cluster computing. This years event is to be held in and around Boston / 
Cambridge.

See the upcoming Call for Participation for more details.

SCOPE
-----

    Cluster 2005 welcomes paper and poster submissions from engineers and
    scientists in academia and industry describing original research work in
    all areas of cluster computing. In addition, Cluster 2005 will welcome
    proposals for tutorials and workshops to be held concurrently with the
    conference. Topics of interest include, but are not limited to:

    Cluster Middleware:
     - Single-System Image Services         - Software Environments and Tools
     - Standard Software for Clusters       - I/O Libraries, File Systems,
                                              and Distributed RAID

    Cluster Networking:
     - High-Speed System Interconnects      - Lightweight Communication 
Protocols
     - Fast Message Passing Libraries

    Cluster Management and Maintenance:
     - Cluster Security and Reliability     - Tools for Managing Clusters
     - Cluster Job and Resource Management  - High-Availability Cluster 
Solutions

    Applications:
     - Scientific and E-Commerce Apps       - Data Distribution and Load 
Balancing
     - Scheduling and Parallel Algorithms   - Innovative Cluster Applications
     - Tools and Environments               - Computational Sciences

    Performance Analysis and Evaluation:
     - Benchmarking and Profiling Tools     - Performance Prediction and 
Modeling
     - Analysis and Visualization

    Grid Computing and Clusters:
     - Grid / Clusters Integration          - Network-Based Distributed 
Computing
     - Mobile Agents and Java for
       Clusters/Grids



TECHNICAL PAPERS
----------------

Format for submission: Full paper drafts, not to exceed 25 double-spaced, 
numbered pages
(including title, author affiliations, abstract, keywords, figures, tables and
bibliographic references) using 12 point font on 8.5x11-inch pages with 
1-inch margins
all around. A web based submission mechanisms will be activated on the 
conference web site
two weeks before the submission deadline. Authors should submit a 
PostScript (level 2) or
PDF file. Hard copy submissions can not be accepted.

POSTERS
-------

Format for submission: 1 page abstract in PDF including names of authors, 
their affiliation
and a 200 word abstract sent as an e-mail attachment to the poster chair, 
Shawn Houston
<poster at cluster2005.org>. Poster presentations will also be offered to the 
authors of
technical papers that were not accepted for oral presentation but are 
recommended by the
committee for this form of publication.

IMPORTANT DATES
---------------

      - Workshop Proposals Due                March 16, 2005
      - Paper Submissions Due                 May 21, 2005
      - Tutorial Proposals Due                May 21, 2005
      - Paper Acceptance Notification         June 18, 2005
      - Poster Submissions Due                June 23, 2005
      - Exhibition Proposal Due               June 23, 2005
      - Poster Acceptance Notification        June 30, 2005
      - Camera-Ready Paper Manuscripts Due    July 16, 2005
      - Camera-Ready Poster Abstracts Due     July 16, 2005
      - Cut-off for group hotel rates         September 5, 2005
      - Conference                            September 27-30, 2005
      - Post-Conference Workshops             September 30, 2005

ORGANIZING COMMITTEE
--------------------

     * General Chair:
       - Dimiter Avresky, Northeastern University, Boston, USA

     * General Vice Chair:
       - Daniel S. Katz, Jet Propulsion Laboratory, Caltech, USA

     * Program Chair:
       - Thomas Stricker, Google Engineering, Switzerland

     * Workshops Chair:
       - Rajkumar Buyya, Senior Lecturer, Dept. of Computer Science and
         Software Engineering, The University of Melbourne, Australia

     * Tutorials Chair:
       - Box Leangsuksun, Lousiana Tech, Louisiana, USA

     * Exhibits/Sponsors Co-Chairs:
        - Ivan Judson, Argonne National Laboratory, USA
        - Rosa Badia, CEPBA-IBM Research Institute, UPC, Spain

     * Posters Chair:
        - Shawn Houston, University of Alaska, USA

     * Publications Chair:
        - Kurt Keville, Massachusetts Institute of Technology, USA

     * Publicity Chair:
        - Kevin Gleason, Mount Ida College, Newton, USA

     * Finance/Registration Chair:
        - Madeleine Furlotte, Genzyme, USA










From teigland at redhat.com  Mon May 16 07:20:03 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 16 May 2005 15:20:03 +0800
Subject: [Linux-cluster] [PATCH 1/8] dlm: core locking
Message-ID: <20050516072003.GF7094@redhat.com>

The core dlm functions.  Processes dlm_lock() and dlm_unlock() requests.
Manages locks on resources' grant/convert/wait queues.  Sends and receives
high level locking operations between nodes.

Signed-off-by: Dave Teigland <teigland at redhat.com>
Signed-off-by: Patrick Caulfield <pcaulfie at redhat.com>

---

 drivers/dlm/lock.c | 3569 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/dlm/lock.h |   51 
 2 files changed, 3620 insertions(+)

--- a/drivers/dlm/lock.c	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/lock.c	2005-05-12 23:13:15.829485512 +0800
@@ -0,0 +1,3569 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) 2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+/* Central locking logic has four stages:
+
+   dlm_lock()
+   dlm_unlock()
+
+   request_lock(ls, lkb)
+   convert_lock(ls, lkb)
+   unlock_lock(ls, lkb)
+   cancel_lock(ls, lkb)
+
+   _request_lock(r, lkb)
+   _convert_lock(r, lkb)
+   _unlock_lock(r, lkb)
+   _cancel_lock(r, lkb)
+
+   do_request(r, lkb)
+   do_convert(r, lkb)
+   do_unlock(r, lkb)
+   do_cancel(r, lkb)
+
+   Stage 1 (lock, unlock) is mainly about checking input args and
+   splitting into one of the four main operations:
+
+       dlm_lock          = request_lock
+       dlm_lock+CONVERT  = convert_lock
+       dlm_unlock        = unlock_lock
+       dlm_unlock+CANCEL = cancel_lock
+
+   Stage 2, xxxx_lock(), just finds and locks the relevant rsb which is
+   provided to the next stage.
+
+   Stage 3, _xxxx_lock(), determines if the operation is local or remote.
+   When remote, it calls send_xxxx(), when local it calls do_xxxx().
+
+   Stage 4, do_xxxx(), is the guts of the operation.  It manipulates the
+   given rsb and lkb and queues callbacks.
+
+   For remote operations, send_xxxx() results in the corresponding do_xxxx()
+   function being executed on the remote node.  The connecting send/receive
+   calls on local (L) and remote (R) nodes:
+
+   L: send_xxxx()              ->  R: receive_xxxx()
+                                   R: do_xxxx()
+   L: receive_xxxx_reply()     <-  R: send_xxxx_reply()
+*/
+
+#include "dlm_internal.h"
+#include "memory.h"
+#include "lowcomms.h"
+#include "requestqueue.h"
+#include "util.h"
+#include "dir.h"
+#include "member.h"
+#include "lockspace.h"
+#include "ast.h"
+#include "lock.h"
+#include "rcom.h"
+#include "recover.h"
+#include "lvb_table.h"
+#include "config.h"
+
+static int send_request(struct dlm_rsb *r, struct dlm_lkb *lkb);
+static int send_convert(struct dlm_rsb *r, struct dlm_lkb *lkb);
+static int send_unlock(struct dlm_rsb *r, struct dlm_lkb *lkb);
+static int send_cancel(struct dlm_rsb *r, struct dlm_lkb *lkb);
+static int send_grant(struct dlm_rsb *r, struct dlm_lkb *lkb);
+static int send_bast(struct dlm_rsb *r, struct dlm_lkb *lkb, int mode);
+static int send_lookup(struct dlm_rsb *r, struct dlm_lkb *lkb);
+static int send_remove(struct dlm_rsb *r);
+static int _request_lock(struct dlm_rsb *r, struct dlm_lkb *lkb);
+static void __receive_convert_reply(struct dlm_rsb *r, struct dlm_lkb *lkb,
+				    struct dlm_message *ms);
+static int receive_extralen(struct dlm_message *ms);
+
+/*
+ * Lock compatibilty matrix - thanks Steve
+ * UN = Unlocked state. Not really a state, used as a flag
+ * PD = Padding. Used to make the matrix a nice power of two in size
+ * Other states are the same as the VMS DLM.
+ * Usage: matrix[grmode+1][rqmode+1]  (although m[rq+1][gr+1] is the same)
+ */
+
+const int __dlm_compat_matrix[8][8] = {
+      /* UN NL CR CW PR PW EX PD */
+        {1, 1, 1, 1, 1, 1, 1, 0},       /* UN */
+        {1, 1, 1, 1, 1, 1, 1, 0},       /* NL */
+        {1, 1, 1, 1, 1, 1, 0, 0},       /* CR */
+        {1, 1, 1, 1, 0, 0, 0, 0},       /* CW */
+        {1, 1, 1, 0, 1, 0, 0, 0},       /* PR */
+        {1, 1, 1, 0, 0, 0, 0, 0},       /* PW */
+        {1, 1, 0, 0, 0, 0, 0, 0},       /* EX */
+        {0, 0, 0, 0, 0, 0, 0, 0}        /* PD */
+};
+
+#define modes_compat(gr, rq) \
+	__dlm_compat_matrix[(gr)->lkb_grmode + 1][(rq)->lkb_rqmode + 1]
+
+int dlm_modes_compat(int mode1, int mode2)
+{
+	return __dlm_compat_matrix[mode1 + 1][mode2 + 1];
+}
+
+/*
+ * Compatibility matrix for conversions with QUECVT set.
+ * Granted mode is the row; requested mode is the column.
+ * Usage: matrix[grmode+1][rqmode+1]
+ */
+
+const int __quecvt_compat_matrix[8][8] = {
+      /* UN NL CR CW PR PW EX PD */
+        {0, 0, 0, 0, 0, 0, 0, 0},       /* UN */
+        {0, 0, 1, 1, 1, 1, 1, 0},       /* NL */
+        {0, 0, 0, 1, 1, 1, 1, 0},       /* CR */
+        {0, 0, 0, 0, 1, 1, 1, 0},       /* CW */
+        {0, 0, 0, 1, 0, 1, 1, 0},       /* PR */
+        {0, 0, 0, 0, 0, 0, 1, 0},       /* PW */
+        {0, 0, 0, 0, 0, 0, 0, 0},       /* EX */
+        {0, 0, 0, 0, 0, 0, 0, 0}        /* PD */
+};
+
+void dlm_print_lkb(struct dlm_lkb *lkb)
+{
+	printk(KERN_ERR "lkb: nodeid %d id %x remid %x exflags %x flags %x\n"
+	       "     status %d rqmode %d grmode %d wait_type %d ast_type %d\n",
+	       lkb->lkb_nodeid, lkb->lkb_id, lkb->lkb_remid, lkb->lkb_exflags,
+	       lkb->lkb_flags, lkb->lkb_status, lkb->lkb_rqmode,
+	       lkb->lkb_grmode, lkb->lkb_wait_type, lkb->lkb_ast_type);
+}
+
+void dlm_print_rsb(struct dlm_rsb *r)
+{
+	printk(KERN_ERR "rsb: nodeid %d flags %lx trial %x rlc %d name %s\n",
+	       r->res_nodeid, r->res_flags, r->res_trial_lkid,
+	       r->res_recover_locks_count, r->res_name);
+}
+
+/* Threads cannot use the lockspace while it's being recovered */
+
+static inline void lock_recovery(struct dlm_ls *ls)
+{
+	down_read(&ls->ls_in_recovery);
+}
+
+static inline void unlock_recovery(struct dlm_ls *ls)
+{
+	up_read(&ls->ls_in_recovery);
+}
+
+static inline int lock_recovery_try(struct dlm_ls *ls)
+{
+	return down_read_trylock(&ls->ls_in_recovery);
+}
+
+static inline int can_be_queued(struct dlm_lkb *lkb)
+{
+	return !(lkb->lkb_exflags & DLM_LKF_NOQUEUE);
+}
+
+static inline int force_blocking_asts(struct dlm_lkb *lkb)
+{
+	return (lkb->lkb_exflags & DLM_LKF_NOQUEUEBAST);
+}
+
+static inline int is_demoted(struct dlm_lkb *lkb)
+{
+	return (lkb->lkb_sbflags & DLM_SBF_DEMOTED);
+}
+
+static inline int is_remote(struct dlm_rsb *r)
+{
+	DLM_ASSERT(r->res_nodeid >= 0, dlm_print_rsb(r););
+	return !!r->res_nodeid;
+}
+
+static inline int is_process_copy(struct dlm_lkb *lkb)
+{
+	return (lkb->lkb_nodeid && !(lkb->lkb_flags & DLM_IFL_MSTCPY));
+}
+
+static inline int is_master_copy(struct dlm_lkb *lkb)
+{
+	if (lkb->lkb_flags & DLM_IFL_MSTCPY)
+		DLM_ASSERT(lkb->lkb_nodeid, dlm_print_lkb(lkb););
+	return (lkb->lkb_flags & DLM_IFL_MSTCPY) ? TRUE : FALSE;
+}
+
+static inline int middle_conversion(struct dlm_lkb *lkb)
+{
+	if ((lkb->lkb_grmode==DLM_LOCK_PR && lkb->lkb_rqmode==DLM_LOCK_CW) ||
+	    (lkb->lkb_rqmode==DLM_LOCK_PR && lkb->lkb_grmode==DLM_LOCK_CW))
+		return TRUE;
+	return FALSE;
+}
+
+static inline int down_conversion(struct dlm_lkb *lkb)
+{
+	return (!middle_conversion(lkb) && lkb->lkb_rqmode < lkb->lkb_grmode);
+}
+
+static void queue_cast(struct dlm_rsb *r, struct dlm_lkb *lkb, int rv)
+{
+	if (is_master_copy(lkb))
+		return;
+
+	DLM_ASSERT(lkb->lkb_lksb, dlm_print_lkb(lkb););
+
+	lkb->lkb_lksb->sb_status = rv;
+	lkb->lkb_lksb->sb_flags = lkb->lkb_sbflags;
+
+	dlm_add_ast(lkb, AST_COMP);
+}
+
+static void queue_bast(struct dlm_rsb *r, struct dlm_lkb *lkb, int rqmode)
+{
+	if (is_master_copy(lkb))
+		send_bast(r, lkb, rqmode);
+	else {
+		lkb->lkb_bastmode = rqmode;
+		dlm_add_ast(lkb, AST_BAST);
+	}
+}
+
+/*
+ * Basic operations on rsb's and lkb's
+ */
+
+static struct dlm_rsb *create_rsb(struct dlm_ls *ls, char *name, int len)
+{
+	struct dlm_rsb *r;
+
+	r = allocate_rsb(ls, len);
+	if (!r)
+		return NULL;
+
+	r->res_ls = ls;
+	r->res_length = len;
+	memcpy(r->res_name, name, len);
+	init_MUTEX(&r->res_sem);
+
+	INIT_LIST_HEAD(&r->res_lookup);
+	INIT_LIST_HEAD(&r->res_grantqueue);
+	INIT_LIST_HEAD(&r->res_convertqueue);
+	INIT_LIST_HEAD(&r->res_waitqueue);
+	INIT_LIST_HEAD(&r->res_root_list);
+	INIT_LIST_HEAD(&r->res_recover_list);
+
+	return r;
+}
+
+static int search_rsb_list(struct list_head *head, char *name, int len,
+			   unsigned int flags, struct dlm_rsb **r_ret)
+{
+	struct dlm_rsb *r;
+	int error = 0;
+
+	list_for_each_entry(r, head, res_hashchain) {
+		if (len == r->res_length && !memcmp(name, r->res_name, len))
+			goto found;
+	}
+	return -ENOENT;
+
+ found:
+	if (r->res_nodeid && (flags & R_MASTER))
+		error = -ENOTBLK;
+	*r_ret = r;
+	return error;
+}
+
+static int _search_rsb(struct dlm_ls *ls, char *name, int len, int b,
+		       unsigned int flags, struct dlm_rsb **r_ret)
+{
+	struct dlm_rsb *r;
+	int error;
+
+	error = search_rsb_list(&ls->ls_rsbtbl[b].list, name, len, flags, &r);
+	if (!error) {
+		kref_get(&r->res_ref);
+		goto out;
+	}
+	error = search_rsb_list(&ls->ls_rsbtbl[b].toss, name, len, flags, &r);
+	if (error)
+		goto out;
+
+	list_move(&r->res_hashchain, &ls->ls_rsbtbl[b].list);
+
+	if (r->res_nodeid == -1) {
+		clear_bit(RESFL_MASTER_WAIT, &r->res_flags);
+		clear_bit(RESFL_MASTER_UNCERTAIN, &r->res_flags);
+		r->res_trial_lkid = 0;
+	} else if (r->res_nodeid > 0) {
+		clear_bit(RESFL_MASTER_WAIT, &r->res_flags);
+		set_bit(RESFL_MASTER_UNCERTAIN, &r->res_flags);
+		r->res_trial_lkid = 0;
+	} else {
+		DLM_ASSERT(r->res_nodeid == 0, dlm_print_rsb(r););
+		DLM_ASSERT(!test_bit(RESFL_MASTER_WAIT, &r->res_flags),
+			   dlm_print_rsb(r););
+		DLM_ASSERT(!test_bit(RESFL_MASTER_UNCERTAIN, &r->res_flags),);
+	}
+ out:
+	*r_ret = r;
+	return error;
+}
+
+static int search_rsb(struct dlm_ls *ls, char *name, int len, int b,
+		      unsigned int flags, struct dlm_rsb **r_ret)
+{
+	int error;
+	write_lock(&ls->ls_rsbtbl[b].lock);
+	error = _search_rsb(ls, name, len, b, flags, r_ret);
+	write_unlock(&ls->ls_rsbtbl[b].lock);
+	return error;
+}
+
+/*
+ * Find rsb in rsbtbl and potentially create/add one
+ *
+ * Delaying the release of rsb's has a similar benefit to applications keeping
+ * NL locks on an rsb, but without the guarantee that the cached master value
+ * will still be valid when the rsb is reused.  Apps aren't always smart enough
+ * to keep NL locks on an rsb that they may lock again shortly; this can lead
+ * to excessive master lookups and removals if we don't delay the release.
+ *
+ * Searching for an rsb means looking through both the normal list and toss
+ * list.  When found on the toss list the rsb is moved to the normal list with
+ * ref count of 1; when found on normal list the ref count is incremented.
+ */
+
+static int find_rsb(struct dlm_ls *ls, char *name, int namelen,
+		    unsigned int flags, struct dlm_rsb **r_ret)
+{
+	struct dlm_rsb *r, *tmp;
+	uint32_t bucket;
+	int error = 0;
+
+	bucket = dlm_hash(name, namelen);
+	bucket &= (ls->ls_rsbtbl_size - 1);
+
+	error = search_rsb(ls, name, namelen, bucket, flags, &r);
+	if (!error)
+		goto out;
+
+	if (error == -ENOENT && !(flags & R_CREATE))
+		goto out;
+
+	/* the rsb was found but wasn't a master copy */
+	if (error == -ENOTBLK)
+		goto out;
+
+	error = -ENOMEM;
+	r = create_rsb(ls, name, namelen);
+	if (!r)
+		goto out;
+
+	r->res_bucket = bucket;
+	r->res_nodeid = -1;
+	kref_init(&r->res_ref);
+
+	write_lock(&ls->ls_rsbtbl[bucket].lock);
+	error = _search_rsb(ls, name, namelen, bucket, 0, &tmp);
+	if (!error) {
+		write_unlock(&ls->ls_rsbtbl[bucket].lock);
+		free_rsb(r);
+		r = tmp;
+		goto out;
+	}
+	list_add(&r->res_hashchain, &ls->ls_rsbtbl[bucket].list);
+	write_unlock(&ls->ls_rsbtbl[bucket].lock);
+	error = 0;
+ out:
+	*r_ret = r;
+	return error;
+}
+
+int dlm_find_rsb(struct dlm_ls *ls, char *name, int namelen,
+		 unsigned int flags, struct dlm_rsb **r_ret)
+{
+	return find_rsb(ls, name, namelen, flags, r_ret);
+}
+
+/* This is only called to add a reference when the code already holds
+   a valid reference to the rsb, so there's no need for locking. */
+
+static inline void hold_rsb(struct dlm_rsb *r)
+{
+	kref_get(&r->res_ref);
+}
+
+void dlm_hold_rsb(struct dlm_rsb *r)
+{
+	hold_rsb(r);
+}
+
+static void toss_rsb(struct kref *kref)
+{
+	struct dlm_rsb *r = container_of(kref, struct dlm_rsb, res_ref);
+	struct dlm_ls *ls = r->res_ls;
+
+	DLM_ASSERT(list_empty(&r->res_root_list), dlm_print_rsb(r););
+	kref_init(&r->res_ref);
+	list_move(&r->res_hashchain, &ls->ls_rsbtbl[r->res_bucket].toss);
+	r->res_toss_time = jiffies;
+	if (r->res_lvbptr) {
+		free_lvb(r->res_lvbptr);
+		r->res_lvbptr = NULL;
+	}
+}
+
+/* When all references to the rsb are gone it's transfered to
+   the tossed list for later disposal. */
+
+static void put_rsb(struct dlm_rsb *r)
+{
+	struct dlm_ls *ls = r->res_ls;
+	uint32_t bucket = r->res_bucket;
+
+	write_lock(&ls->ls_rsbtbl[bucket].lock);
+	kref_put(&r->res_ref, toss_rsb);
+	write_unlock(&ls->ls_rsbtbl[bucket].lock);
+}
+
+void dlm_put_rsb(struct dlm_rsb *r)
+{
+	put_rsb(r);
+}
+
+/* See comment for unhold_lkb */
+
+static void unhold_rsb(struct dlm_rsb *r)
+{
+	int rv;
+	rv = kref_put(&r->res_ref, toss_rsb);
+	DLM_ASSERT(!rv, dlm_print_rsb(r););
+}
+
+static void kill_rsb(struct kref *kref)
+{
+	struct dlm_rsb *r = container_of(kref, struct dlm_rsb, res_ref);
+
+	/* All work is done after the return from kref_put() so we
+	   can release the write_lock before the remove and free. */
+
+	DLM_ASSERT(list_empty(&r->res_lookup),);
+	DLM_ASSERT(list_empty(&r->res_grantqueue),);
+	DLM_ASSERT(list_empty(&r->res_convertqueue),);
+	DLM_ASSERT(list_empty(&r->res_waitqueue),);
+	DLM_ASSERT(list_empty(&r->res_root_list),);
+	DLM_ASSERT(list_empty(&r->res_recover_list),);
+}
+
+/* Attaching/detaching lkb's from rsb's is for rsb reference counting.
+   The rsb must exist as long as any lkb's for it do. */
+
+static void attach_lkb(struct dlm_rsb *r, struct dlm_lkb *lkb)
+{
+	hold_rsb(r);
+	lkb->lkb_resource = r;
+}
+
+static void detach_lkb(struct dlm_lkb *lkb)
+{
+	if (lkb->lkb_resource) {
+		put_rsb(lkb->lkb_resource);
+		lkb->lkb_resource = NULL;
+	}
+}
+
+static int create_lkb(struct dlm_ls *ls, struct dlm_lkb **lkb_ret)
+{
+	struct dlm_lkb *lkb, *tmp;
+	uint32_t lkid = 0;
+	uint16_t bucket;
+
+	lkb = allocate_lkb(ls);
+	if (!lkb)
+		return -ENOMEM;
+
+	lkb->lkb_nodeid = -1;
+	lkb->lkb_grmode = DLM_LOCK_IV;
+	kref_init(&lkb->lkb_ref);
+
+	get_random_bytes(&bucket, sizeof(bucket));
+	bucket &= (ls->ls_lkbtbl_size - 1);
+
+	write_lock(&ls->ls_lkbtbl[bucket].lock);
+
+	/* counter can roll over so we must verify lkid is not in use */
+
+	while (lkid == 0) {
+		lkid = bucket | (ls->ls_lkbtbl[bucket].counter++ << 16);
+
+		list_for_each_entry(tmp, &ls->ls_lkbtbl[bucket].list,
+				    lkb_idtbl_list) {
+			if (tmp->lkb_id != lkid)
+				continue;
+			lkid = 0;
+			break;
+		}
+	}
+
+	lkb->lkb_id = lkid;
+	list_add(&lkb->lkb_idtbl_list, &ls->ls_lkbtbl[bucket].list);
+	write_unlock(&ls->ls_lkbtbl[bucket].lock);
+
+	*lkb_ret = lkb;
+	return 0;
+}
+
+static struct dlm_lkb *__find_lkb(struct dlm_ls *ls, uint32_t lkid)
+{
+	uint16_t bucket = lkid & 0xFFFF;
+	struct dlm_lkb *lkb;
+
+	list_for_each_entry(lkb, &ls->ls_lkbtbl[bucket].list, lkb_idtbl_list) {
+		if (lkb->lkb_id == lkid)
+			return lkb;
+	}
+	return NULL;
+}
+
+static int find_lkb(struct dlm_ls *ls, uint32_t lkid, struct dlm_lkb **lkb_ret)
+{
+	struct dlm_lkb *lkb;
+	uint16_t bucket = lkid & 0xFFFF;
+
+	if (bucket >= ls->ls_lkbtbl_size)
+		return -EBADSLT;
+
+	read_lock(&ls->ls_lkbtbl[bucket].lock);
+	lkb = __find_lkb(ls, lkid);
+	if (lkb)
+		kref_get(&lkb->lkb_ref);
+	read_unlock(&ls->ls_lkbtbl[bucket].lock);
+
+	*lkb_ret = lkb;
+	return lkb ? 0 : -ENOENT;
+}
+
+static void kill_lkb(struct kref *kref)
+{
+	struct dlm_lkb *lkb = container_of(kref, struct dlm_lkb, lkb_ref);
+
+	/* All work is done after the return from kref_put() so we
+	   can release the write_lock before the detach_lkb */
+
+	DLM_ASSERT(!lkb->lkb_status, dlm_print_lkb(lkb););
+}
+
+static int put_lkb(struct dlm_lkb *lkb)
+{
+	struct dlm_ls *ls = lkb->lkb_resource->res_ls;
+	uint16_t bucket = lkb->lkb_id & 0xFFFF;
+
+	write_lock(&ls->ls_lkbtbl[bucket].lock);
+	if (kref_put(&lkb->lkb_ref, kill_lkb)) {
+		list_del(&lkb->lkb_idtbl_list);
+		write_unlock(&ls->ls_lkbtbl[bucket].lock);
+
+		detach_lkb(lkb);
+
+		/* for local/process lkbs, lvbptr points to caller's lksb */
+		if (lkb->lkb_lvbptr && is_master_copy(lkb))
+			free_lvb(lkb->lkb_lvbptr);
+		if (lkb->lkb_range)
+			free_range(lkb->lkb_range);
+		free_lkb(lkb);
+		return 1;
+	} else {
+		write_unlock(&ls->ls_lkbtbl[bucket].lock);
+		return 0;
+	}
+}
+
+int dlm_put_lkb(struct dlm_lkb *lkb)
+{
+	return put_lkb(lkb);
+}
+
+/* This is only called to add a reference when the code already holds
+   a valid reference to the lkb, so there's no need for locking. */
+
+static inline void hold_lkb(struct dlm_lkb *lkb)
+{
+	kref_get(&lkb->lkb_ref);
+}
+
+/* This is called when we need to remove a reference and are certain
+   it's not the last ref.  e.g. del_lkb is always called between a
+   find_lkb/put_lkb and is always the inverse of a previous add_lkb.
+   put_lkb would work fine, but would involve unnecessary locking */
+
+static inline void unhold_lkb(struct dlm_lkb *lkb)
+{
+	int rv;
+	rv = kref_put(&lkb->lkb_ref, kill_lkb);
+	DLM_ASSERT(!rv, dlm_print_lkb(lkb););
+}
+
+static void lkb_add_ordered(struct list_head *new, struct list_head *head,
+			    int mode)
+{
+	struct dlm_lkb *lkb = NULL;
+
+	list_for_each_entry(lkb, head, lkb_statequeue)
+		if (lkb->lkb_rqmode < mode)
+			break;
+
+	if (!lkb)
+		list_add_tail(new, head);
+	else
+		__list_add(new, lkb->lkb_statequeue.prev, &lkb->lkb_statequeue);
+}
+
+/* add/remove lkb to rsb's grant/convert/wait queue */
+
+static void add_lkb(struct dlm_rsb *r, struct dlm_lkb *lkb, int status)
+{
+	kref_get(&lkb->lkb_ref);
+
+	DLM_ASSERT(!lkb->lkb_status, dlm_print_lkb(lkb););
+
+	lkb->lkb_status = status;
+
+	switch (status) {
+	case DLM_LKSTS_WAITING:
+		if (lkb->lkb_exflags & DLM_LKF_HEADQUE)
+			list_add(&lkb->lkb_statequeue, &r->res_waitqueue);
+		else
+			list_add_tail(&lkb->lkb_statequeue, &r->res_waitqueue);
+		break;
+	case DLM_LKSTS_GRANTED:
+		/* convention says granted locks kept in order of grmode */
+		lkb_add_ordered(&lkb->lkb_statequeue, &r->res_grantqueue,
+				lkb->lkb_grmode);
+		break;
+	case DLM_LKSTS_CONVERT:
+		if (lkb->lkb_exflags & DLM_LKF_HEADQUE)
+			list_add(&lkb->lkb_statequeue, &r->res_convertqueue);
+		else
+			list_add_tail(&lkb->lkb_statequeue,
+				      &r->res_convertqueue);
+		break;
+	default:
+		DLM_ASSERT(0, dlm_print_lkb(lkb); printk("sts=%d\n", status););
+	}
+}
+
+static void del_lkb(struct dlm_rsb *r, struct dlm_lkb *lkb)
+{
+	lkb->lkb_status = 0;
+	list_del(&lkb->lkb_statequeue);
+	unhold_lkb(lkb);
+}
+
+static void move_lkb(struct dlm_rsb *r, struct dlm_lkb *lkb, int sts)
+{
+	hold_lkb(lkb);
+	del_lkb(r, lkb);
+	add_lkb(r, lkb, sts);
+	unhold_lkb(lkb);
+}
+
+/* add/remove lkb from global waiters list of lkb's waiting for
+   a reply from a remote node */
+
+static void add_to_waiters(struct dlm_lkb *lkb, int mstype)
+{
+	struct dlm_ls *ls = lkb->lkb_resource->res_ls;
+
+	down(&ls->ls_waiters_sem);
+	if (lkb->lkb_wait_type) {
+		log_print("add_to_waiters error %d", lkb->lkb_wait_type);
+		goto out;
+	}
+	lkb->lkb_wait_type = mstype;
+	kref_get(&lkb->lkb_ref);
+	list_add(&lkb->lkb_wait_reply, &ls->ls_waiters);
+ out:
+	up(&ls->ls_waiters_sem);
+}
+
+static int _remove_from_waiters(struct dlm_lkb *lkb)
+{
+	int error = 0;
+
+	if (!lkb->lkb_wait_type) {
+		log_print("remove_from_waiters error");
+		error = -EINVAL;
+		goto out;
+	}
+	lkb->lkb_wait_type = 0;
+	list_del(&lkb->lkb_wait_reply);
+	unhold_lkb(lkb);
+ out:
+	return error;
+}
+
+static int remove_from_waiters(struct dlm_lkb *lkb)
+{
+	struct dlm_ls *ls = lkb->lkb_resource->res_ls;
+	int error;
+
+	down(&ls->ls_waiters_sem);
+	error = _remove_from_waiters(lkb);
+	up(&ls->ls_waiters_sem);
+	return error;
+}
+
+int dlm_remove_from_waiters(struct dlm_lkb *lkb)
+{
+	return remove_from_waiters(lkb);
+}
+
+static void dir_remove(struct dlm_rsb *r)
+{
+	int to_nodeid = dlm_dir_nodeid(r);
+
+	if (to_nodeid != dlm_our_nodeid())
+		send_remove(r);
+	else
+		dlm_dir_remove_entry(r->res_ls, to_nodeid,
+				     r->res_name, r->res_length);
+}
+
+/* FIXME: shouldn't this be able to exit as soon as one non-due rsb is
+   found since they are in order of newest to oldest? */
+
+static int shrink_bucket(struct dlm_ls *ls, int b)
+{
+	struct dlm_rsb *r;
+	int count = 0, found;
+
+	for (;;) {
+		found = FALSE;
+		write_lock(&ls->ls_rsbtbl[b].lock);
+		list_for_each_entry_reverse(r, &ls->ls_rsbtbl[b].toss,
+					    res_hashchain) {
+			if (!time_after_eq(jiffies, r->res_toss_time +
+					   dlm_config.toss_secs * HZ))
+				continue;
+			found = TRUE;
+			break;
+		}
+
+		if (!found) {
+			write_unlock(&ls->ls_rsbtbl[b].lock);
+			break;
+		}
+
+		if (kref_put(&r->res_ref, kill_rsb)) {
+			list_del(&r->res_hashchain);
+			write_unlock(&ls->ls_rsbtbl[b].lock);
+
+			if (is_master(r))
+				dir_remove(r);
+			free_rsb(r);
+			count++;
+		} else {
+			write_unlock(&ls->ls_rsbtbl[b].lock);
+			log_error(ls, "tossed rsb in use %s", r->res_name);
+		}
+	}
+
+	return count;
+}
+
+void dlm_scan_rsbs(struct dlm_ls *ls)
+{
+	int i;
+
+	if (!test_bit(LSFL_LS_RUN, &ls->ls_flags))
+		return;
+
+	for (i = 0; i < ls->ls_rsbtbl_size; i++) {
+		shrink_bucket(ls, i);
+		cond_resched();
+	}
+}
+
+/* lkb is master or local copy */
+
+static void set_lvb_lock(struct dlm_rsb *r, struct dlm_lkb *lkb)
+{
+	int b, len = r->res_ls->ls_lvblen;
+
+	/* b=1 lvb returned to caller
+	   b=0 lvb written to rsb or invalidated
+	   b=-1 do nothing */
+
+	b =  dlm_lvb_operations[lkb->lkb_grmode + 1][lkb->lkb_rqmode + 1];
+
+	if (b == 1) {
+		if (!lkb->lkb_lvbptr)
+			return;
+
+		if (!(lkb->lkb_exflags & DLM_LKF_VALBLK))
+			return;
+
+		if (!r->res_lvbptr)
+			return;
+
+		memcpy(lkb->lkb_lvbptr, r->res_lvbptr, len);
+		lkb->lkb_lvbseq = r->res_lvbseq;
+
+	} else if (b == 0) {
+		if (lkb->lkb_exflags & DLM_LKF_IVVALBLK) {
+			set_bit(RESFL_VALNOTVALID, &r->res_flags);
+			return;
+		}
+
+		if (!lkb->lkb_lvbptr)
+			return;
+
+		if (!(lkb->lkb_exflags & DLM_LKF_VALBLK))
+			return;
+
+		if (!r->res_lvbptr)
+			r->res_lvbptr = allocate_lvb(r->res_ls);
+
+		if (!r->res_lvbptr)
+			return;
+
+		memcpy(r->res_lvbptr, lkb->lkb_lvbptr, len);
+		r->res_lvbseq++;
+		lkb->lkb_lvbseq = r->res_lvbseq;
+		clear_bit(RESFL_VALNOTVALID, &r->res_flags);
+	}
+
+	if (test_bit(RESFL_VALNOTVALID, &r->res_flags))
+		lkb->lkb_sbflags |= DLM_SBF_VALNOTVALID;
+}
+
+static void set_lvb_unlock(struct dlm_rsb *r, struct dlm_lkb *lkb)
+{
+	if (lkb->lkb_grmode < DLM_LOCK_PW)
+		return;
+
+	if (lkb->lkb_exflags & DLM_LKF_IVVALBLK) {
+		set_bit(RESFL_VALNOTVALID, &r->res_flags);
+		return;
+	}
+
+	if (!lkb->lkb_lvbptr)
+		return;
+
+	if (!(lkb->lkb_exflags & DLM_LKF_VALBLK))
+		return;
+
+	if (!r->res_lvbptr)
+		r->res_lvbptr = allocate_lvb(r->res_ls);
+
+	if (!r->res_lvbptr)
+		return;
+
+	memcpy(r->res_lvbptr, lkb->lkb_lvbptr, r->res_ls->ls_lvblen);
+	r->res_lvbseq++;
+	clear_bit(RESFL_VALNOTVALID, &r->res_flags);
+}
+
+/* lkb is process copy (pc) */
+
+static void set_lvb_lock_pc(struct dlm_rsb *r, struct dlm_lkb *lkb,
+			    struct dlm_message *ms)
+{
+	int b;
+
+	if (!lkb->lkb_lvbptr)
+		return;
+
+	if (!(lkb->lkb_exflags & DLM_LKF_VALBLK))
+		return;
+
+	b =  dlm_lvb_operations[lkb->lkb_grmode + 1][lkb->lkb_rqmode + 1];
+	if (b == 1) {
+		int len = receive_extralen(ms);
+		memcpy(lkb->lkb_lvbptr, ms->m_extra, len);
+		lkb->lkb_lvbseq = ms->m_lvbseq;
+	}
+}
+
+/* Manipulate lkb's on rsb's convert/granted/waiting queues
+   remove_lock -- used for unlock, removes lkb from granted
+   revert_lock -- used for cancel, moves lkb from convert to granted
+   grant_lock  -- used for request and convert, adds lkb to granted or
+                  moves lkb from convert or waiting to granted
+
+   Each of these is used for master or local copy lkb's.  There is
+   also a _pc() variation used to make the corresponding change on
+   a process copy (pc) lkb. */
+
+static void _remove_lock(struct dlm_rsb *r, struct dlm_lkb *lkb)
+{
+	del_lkb(r, lkb);
+	lkb->lkb_grmode = DLM_LOCK_IV;
+	/* this unhold undoes the original ref from create_lkb()
+	   so this leads to the lkb being freed */
+	unhold_lkb(lkb);
+}
+
+static void remove_lock(struct dlm_rsb *r, struct dlm_lkb *lkb)
+{
+	set_lvb_unlock(r, lkb);
+	_remove_lock(r, lkb);
+}
+
+static void remove_lock_pc(struct dlm_rsb *r, struct dlm_lkb *lkb)
+{
+	_remove_lock(r, lkb);
+}
+
+static void revert_lock(struct dlm_rsb *r, struct dlm_lkb *lkb)
+{
+	lkb->lkb_rqmode = DLM_LOCK_IV;
+
+	switch (lkb->lkb_status) {
+	case DLM_LKSTS_CONVERT:
+		move_lkb(r, lkb, DLM_LKSTS_GRANTED);
+		break;
+	case DLM_LKSTS_WAITING:
+		del_lkb(r, lkb);
+		lkb->lkb_grmode = DLM_LOCK_IV;
+		/* this unhold undoes the original ref from create_lkb()
+		   so this leads to the lkb being freed */
+		unhold_lkb(lkb);
+		break;
+	default:
+		log_print("invalid status for revert %d", lkb->lkb_status);
+	}
+}
+
+static void revert_lock_pc(struct dlm_rsb *r, struct dlm_lkb *lkb)
+{
+	revert_lock(r, lkb);
+}
+
+static void _grant_lock(struct dlm_rsb *r, struct dlm_lkb *lkb)
+{
+	if (lkb->lkb_grmode != lkb->lkb_rqmode) {
+		lkb->lkb_grmode = lkb->lkb_rqmode;
+		if (lkb->lkb_status)
+			move_lkb(r, lkb, DLM_LKSTS_GRANTED);
+		else
+			add_lkb(r, lkb, DLM_LKSTS_GRANTED);
+	}
+
+	lkb->lkb_rqmode = DLM_LOCK_IV;
+
+	if (lkb->lkb_range) {
+		lkb->lkb_range[GR_RANGE_START] = lkb->lkb_range[RQ_RANGE_START];
+		lkb->lkb_range[GR_RANGE_END] = lkb->lkb_range[RQ_RANGE_END];
+	}
+}
+
+static void grant_lock(struct dlm_rsb *r, struct dlm_lkb *lkb)
+{
+	set_lvb_lock(r, lkb);
+	_grant_lock(r, lkb);
+	lkb->lkb_highbast = 0;
+}
+
+static void grant_lock_pc(struct dlm_rsb *r, struct dlm_lkb *lkb,
+			  struct dlm_message *ms)
+{
+	set_lvb_lock_pc(r, lkb, ms);
+	_grant_lock(r, lkb);
+}
+
+/* called by grant_pending_locks() which means an async grant message must
+   be sent to the requesting node in addition to granting the lock if the
+   lkb belongs to a remote node. */
+
+static void grant_lock_pending(struct dlm_rsb *r, struct dlm_lkb *lkb)
+{
+	grant_lock(r, lkb);
+	if (is_master_copy(lkb))
+		send_grant(r, lkb);
+	else
+		queue_cast(r, lkb, 0);
+}
+
+static inline int first_in_list(struct dlm_lkb *lkb, struct list_head *head)
+{
+	struct dlm_lkb *first = list_entry(head->next, struct dlm_lkb,
+					   lkb_statequeue);
+	if (lkb->lkb_id == first->lkb_id)
+		return TRUE;
+
+	return FALSE;
+}
+
+/* Return 1 if the locks' ranges overlap.  If the lkb has no range then it is
+   assumed to cover 0-ffffffff.ffffffff */
+
+static inline int ranges_overlap(struct dlm_lkb *lkb1, struct dlm_lkb *lkb2)
+{
+	if (!lkb1->lkb_range || !lkb2->lkb_range)
+		return TRUE;
+
+	if (lkb1->lkb_range[RQ_RANGE_END] < lkb2->lkb_range[GR_RANGE_START] ||
+	    lkb1->lkb_range[RQ_RANGE_START] > lkb2->lkb_range[GR_RANGE_END])
+		return FALSE;
+
+	return TRUE;
+}
+
+/* Check if the given lkb conflicts with another lkb on the queue. */
+
+static int queue_conflict(struct list_head *head, struct dlm_lkb *lkb)
+{
+	struct dlm_lkb *this;
+
+	list_for_each_entry(this, head, lkb_statequeue) {
+		if (this == lkb)
+			continue;
+		if (ranges_overlap(lkb, this) && !modes_compat(this, lkb))
+			return TRUE;
+	}
+	return FALSE;
+}
+
+/*
+ * "A conversion deadlock arises with a pair of lock requests in the converting
+ * queue for one resource.  The granted mode of each lock blocks the requested
+ * mode of the other lock."
+ *
+ * Part 2: if the granted mode of lkb is preventing the first lkb in the
+ * convert queue from being granted, then demote lkb (set grmode to NL).
+ * This second form requires that we check for conv-deadlk even when
+ * now == 0 in _can_be_granted().
+ *
+ * Example:
+ * Granted Queue: empty
+ * Convert Queue: NL->EX (first lock)
+ *                PR->EX (second lock)
+ *
+ * The first lock can't be granted because of the granted mode of the second
+ * lock and the second lock can't be granted because it's not first in the
+ * list.  We demote the granted mode of the second lock (the lkb passed to this
+ * function).
+ *
+ * After the resolution, the "grant pending" function needs to go back and try
+ * to grant locks on the convert queue again since the first lock can now be
+ * granted.
+ */
+
+static int conversion_deadlock_detect(struct dlm_rsb *rsb, struct dlm_lkb *lkb)
+{
+	struct dlm_lkb *this, *first = NULL, *self = NULL;
+
+	list_for_each_entry(this, &rsb->res_convertqueue, lkb_statequeue) {
+		if (!first)
+			first = this;
+		if (this == lkb) {
+			self = lkb;
+			continue;
+		}
+
+		if (!ranges_overlap(lkb, this))
+			continue;
+
+		if (!modes_compat(this, lkb) && !modes_compat(lkb, this))
+			return TRUE;
+	}
+
+	/* if lkb is on the convert queue and is preventing the first
+	   from being granted, then there's deadlock and we demote lkb.
+	   multiple converting locks may need to do this before the first
+	   converting lock can be granted. */
+
+	if (self && self != first) {
+		if (!modes_compat(lkb, first) &&
+		    !queue_conflict(&rsb->res_grantqueue, first))
+			return TRUE;
+	}
+
+	return FALSE;
+}
+
+/*
+ * Return 1 if the lock can be granted, 0 otherwise.
+ * Also detect and resolve conversion deadlocks.
+ *
+ * lkb is the lock to be granted
+ *
+ * now is 1 if the function is being called in the context of the
+ * immediate request, it is 0 if called later, after the lock has been
+ * queued.
+ *
+ * References are from chapter 6 of "VAXcluster Principles" by Roy Davis
+ */
+
+static int _can_be_granted(struct dlm_rsb *r, struct dlm_lkb *lkb, int now)
+{
+	int8_t conv = (lkb->lkb_grmode != DLM_LOCK_IV);
+
+	/*
+	 * 6-10: Version 5.4 introduced an option to address the phenomenon of
+	 * a new request for a NL mode lock being blocked.
+	 *
+	 * 6-11: If the optional EXPEDITE flag is used with the new NL mode
+	 * request, then it would be granted.  In essence, the use of this flag
+	 * tells the Lock Manager to expedite theis request by not considering
+	 * what may be in the CONVERTING or WAITING queues...  As of this
+	 * writing, the EXPEDITE flag can be used only with new requests for NL
+	 * mode locks.  This flag is not valid for conversion requests.
+	 *
+	 * A shortcut.  Earlier checks return an error if EXPEDITE is used in a
+	 * conversion or used with a non-NL requested mode.  We also know an
+	 * EXPEDITE request is always granted immediately, so now must always
+	 * be 1.  The full condition to grant an expedite request: (now &&
+	 * !conv && lkb->rqmode == DLM_LOCK_NL && (flags & EXPEDITE)) can
+	 * therefore be shortened to just checking the flag.
+	 */
+
+	if (lkb->lkb_exflags & DLM_LKF_EXPEDITE)
+		return TRUE;
+
+	/*
+	 * A shortcut. Without this, !queue_conflict(grantqueue, lkb) would be
+	 * added to the remaining conditions.
+	 */
+
+	if (queue_conflict(&r->res_grantqueue, lkb))
+		goto out;
+
+	/*
+	 * 6-3: By default, a conversion request is immediately granted if the
+	 * requested mode is compatible with the modes of all other granted
+	 * locks
+	 */
+
+	if (queue_conflict(&r->res_convertqueue, lkb))
+		goto out;
+
+	/*
+	 * 6-5: But the default algorithm for deciding whether to grant or
+	 * queue conversion requests does not by itself guarantee that such
+	 * requests are serviced on a "first come first serve" basis.  This, in
+	 * turn, can lead to a phenomenon known as "indefinate postponement".
+	 *
+	 * 6-7: This issue is dealt with by using the optional QUECVT flag with
+	 * the system service employed to request a lock conversion.  This flag
+	 * forces certain conversion requests to be queued, even if they are
+	 * compatible with the granted modes of other locks on the same
+	 * resource.  Thus, the use of this flag results in conversion requests
+	 * being ordered on a "first come first servce" basis.
+	 *
+	 * DCT: This condition is all about new conversions being able to occur
+	 * "in place" while the lock remains on the granted queue (assuming
+	 * nothing else conflicts.)  IOW if QUECVT isn't set, a conversion
+	 * doesn't _have_ to go onto the convert queue where it's processed in
+	 * order.  The "now" variable is necessary to distinguish converts
+	 * being received and processed for the first time now, because once a
+	 * convert is moved to the conversion queue the condition below applies
+	 * requiring fifo granting.
+	 */
+
+	if (now && conv && !(lkb->lkb_exflags & DLM_LKF_QUECVT))
+		return TRUE;
+
+	/*
+	 * When using range locks the NOORDER flag is set to avoid the standard
+	 * vms rules on grant order.
+	 */
+
+	if (lkb->lkb_exflags & DLM_LKF_NOORDER)
+		return TRUE;
+
+	/*
+	 * 6-3: Once in that queue [CONVERTING], a conversion request cannot be
+	 * granted until all other conversion requests ahead of it are granted
+	 * and/or canceled.
+	 */
+
+	if (!now && conv && first_in_list(lkb, &r->res_convertqueue))
+		return TRUE;
+
+	/*
+	 * 6-4: By default, a new request is immediately granted only if all
+	 * three of the following conditions are satisfied when the request is
+	 * issued:
+	 * - The queue of ungranted conversion requests for the resource is
+	 *   empty.
+	 * - The queue of ungranted new requests for the resource is empty.
+	 * - The mode of the new request is compatible with the most
+	 *   restrictive mode of all granted locks on the resource.
+	 */
+
+	if (now && !conv && list_empty(&r->res_convertqueue) &&
+	    list_empty(&r->res_waitqueue))
+		return TRUE;
+
+	/*
+	 * 6-4: Once a lock request is in the queue of ungranted new requests,
+	 * it cannot be granted until the queue of ungranted conversion
+	 * requests is empty, all ungranted new requests ahead of it are
+	 * granted and/or canceled, and it is compatible with the granted mode
+	 * of the most restrictive lock granted on the resource.
+	 */
+
+	if (!now && !conv && list_empty(&r->res_convertqueue) &&
+	    first_in_list(lkb, &r->res_waitqueue))
+		return TRUE;
+
+ out:
+	/*
+	 * The following, enabled by CONVDEADLK, departs from VMS.
+	 */
+
+	if (conv && (lkb->lkb_exflags & DLM_LKF_CONVDEADLK) &&
+	    conversion_deadlock_detect(r, lkb)) {
+		lkb->lkb_grmode = DLM_LOCK_NL;
+		lkb->lkb_sbflags |= DLM_SBF_DEMOTED;
+	}
+
+	return FALSE;
+}
+
+/*
+ * The ALTPR and ALTCW flags aren't traditional lock manager flags, but are a
+ * simple way to provide a big optimization to applications that can use them.
+ */
+
+static int can_be_granted(struct dlm_rsb *r, struct dlm_lkb *lkb, int now)
+{
+	uint32_t flags = lkb->lkb_exflags;
+	int rv;
+	int8_t alt = 0, rqmode = lkb->lkb_rqmode;
+
+	rv = _can_be_granted(r, lkb, now);
+	if (rv)
+		goto out;
+
+	if (lkb->lkb_sbflags & DLM_SBF_DEMOTED)
+		goto out;
+
+	if (rqmode != DLM_LOCK_PR && flags & DLM_LKF_ALTPR)
+		alt = DLM_LOCK_PR;
+	else if (rqmode != DLM_LOCK_CW && flags & DLM_LKF_ALTCW)
+		alt = DLM_LOCK_CW;
+
+	if (alt) {
+		lkb->lkb_rqmode = alt;
+		rv = _can_be_granted(r, lkb, now);
+		if (rv)
+			lkb->lkb_sbflags |= DLM_SBF_ALTMODE;
+		else
+			lkb->lkb_rqmode = rqmode;
+	}
+ out:
+	return rv;
+}
+
+static int grant_pending_convert(struct dlm_rsb *r, int high)
+{
+	struct dlm_lkb *lkb, *s;
+	int hi, demoted, quit, grant_restart, demote_restart;
+
+	quit = 0;
+ restart:
+	grant_restart = 0;
+	demote_restart = 0;
+	hi = DLM_LOCK_IV;
+
+	list_for_each_entry_safe(lkb, s, &r->res_convertqueue, lkb_statequeue) {
+		demoted = is_demoted(lkb);
+		if (can_be_granted(r, lkb, FALSE)) {
+			grant_lock_pending(r, lkb);
+			grant_restart = 1;
+		} else {
+			hi = max_t(int, lkb->lkb_rqmode, hi);
+			if (!demoted && is_demoted(lkb))
+				demote_restart = 1;
+		}
+	}
+
+	if (grant_restart)
+		goto restart;
+	if (demote_restart && !quit) {
+		quit = 1;
+		goto restart;
+	}
+
+	return max_t(int, high, hi);
+}
+
+static int grant_pending_wait(struct dlm_rsb *r, int high)
+{
+	struct dlm_lkb *lkb, *s;
+
+	list_for_each_entry_safe(lkb, s, &r->res_waitqueue, lkb_statequeue) {
+		if (can_be_granted(r, lkb, FALSE))
+			grant_lock_pending(r, lkb);
+                else
+			high = max_t(int, lkb->lkb_rqmode, high);
+	}
+
+	return high;
+}
+
+static void grant_pending_locks(struct dlm_rsb *r)
+{
+	struct dlm_lkb *lkb, *s;
+	int high = DLM_LOCK_IV;
+
+	DLM_ASSERT(is_master(r), dlm_print_rsb(r););
+
+	high = grant_pending_convert(r, high);
+	high = grant_pending_wait(r, high);
+
+	if (high == DLM_LOCK_IV)
+		return;
+
+	/*
+	 * If there are locks left on the wait/convert queue then send blocking
+	 * ASTs to granted locks based on the largest requested mode (high)
+	 * found above.  This can generate spurious blocking ASTs for range
+	 * locks. FIXME: highbast < high comparison not valid for PR/CW.
+	 */
+
+	list_for_each_entry_safe(lkb, s, &r->res_grantqueue, lkb_statequeue) {
+		if (lkb->lkb_bastaddr && (lkb->lkb_highbast < high) &&
+		    !__dlm_compat_matrix[lkb->lkb_grmode+1][high+1]) {
+			queue_bast(r, lkb, high);
+			lkb->lkb_highbast = high;
+		}
+	}
+}
+
+static void send_bast_queue(struct dlm_rsb *r, struct list_head *head,
+			    struct dlm_lkb *lkb)
+{
+	struct dlm_lkb *gr;
+
+	list_for_each_entry(gr, head, lkb_statequeue) {
+		if (gr->lkb_bastaddr &&
+		    gr->lkb_highbast < lkb->lkb_rqmode &&
+		    ranges_overlap(lkb, gr) && !modes_compat(gr, lkb)) {
+			queue_bast(r, gr, lkb->lkb_rqmode);
+			gr->lkb_highbast = lkb->lkb_rqmode;
+		}
+	}
+}
+
+static void send_blocking_asts(struct dlm_rsb *r, struct dlm_lkb *lkb)
+{
+	send_bast_queue(r, &r->res_grantqueue, lkb);
+}
+
+static void send_blocking_asts_all(struct dlm_rsb *r, struct dlm_lkb *lkb)
+{
+	send_bast_queue(r, &r->res_grantqueue, lkb);
+	send_bast_queue(r, &r->res_convertqueue, lkb);
+}
+
+/* set_master(r, lkb) -- set the master nodeid of a resource
+
+   The purpose of this function is to set the nodeid field in the given
+   lkb using the nodeid field in the given rsb.  If the rsb's nodeid is
+   known, it can just be copied to the lkb and the function will return
+   0.  If the rsb's nodeid is _not_ known, it needs to be looked up
+   before it can be copied to the lkb.
+
+   When the rsb nodeid is being looked up remotely, the initial lkb
+   causing the lookup is kept on the ls_waiters list waiting for the
+   lookup reply.  Other lkb's waiting for the same rsb lookup are kept
+   on the rsb's res_lookup list until the master is verified.
+
+   After a remote lookup or when a tossed rsb is retrived that specifies
+   a remote master, that master value is uncertain -- it may have changed
+   by the time we send it a request.  While it's uncertain, only one lkb
+   is allowed to go ahead and use the master value; that lkb is specified
+   by res_trial_lkid.  Once the trial lkb is queued on the master node
+   we know the rsb master is correct and any other lkbs on res_lookup
+   can get the rsb nodeid and go ahead with their request.
+
+   Return values:
+   0: nodeid is set in rsb/lkb and the caller should go ahead and use it
+   1: the rsb master is not available and the lkb has been placed on
+      a wait queue
+   -EXXX: there was some error in processing
+*/
+
+static int set_master(struct dlm_rsb *r, struct dlm_lkb *lkb)
+{
+	struct dlm_ls *ls = r->res_ls;
+	int error, dir_nodeid, ret_nodeid, our_nodeid = dlm_our_nodeid();
+
+	if (test_and_clear_bit(RESFL_MASTER_UNCERTAIN, &r->res_flags)) {
+		set_bit(RESFL_MASTER_WAIT, &r->res_flags);
+		r->res_trial_lkid = lkb->lkb_id;
+		lkb->lkb_nodeid = r->res_nodeid;
+		return 0;
+	}
+
+	if (r->res_nodeid == 0) {
+		lkb->lkb_nodeid = 0;
+		return 0;
+	}
+
+	if (r->res_trial_lkid == lkb->lkb_id) {
+		DLM_ASSERT(lkb->lkb_id, dlm_print_lkb(lkb););
+		lkb->lkb_nodeid = r->res_nodeid;
+		return 0;
+	}
+
+	if (test_bit(RESFL_MASTER_WAIT, &r->res_flags)) {
+		list_add_tail(&lkb->lkb_rsb_lookup, &r->res_lookup);
+		return 1;
+	}
+
+	if (r->res_nodeid > 0) {
+		lkb->lkb_nodeid = r->res_nodeid;
+		return 0;
+	}
+
+	/* This is the first lkb requested on this rsb since the rsb
+	   was created.  We need to figure out who the rsb master is. */
+
+	DLM_ASSERT(r->res_nodeid == -1, );
+
+	dir_nodeid = dlm_dir_nodeid(r);
+
+	if (dir_nodeid != our_nodeid) {
+		set_bit(RESFL_MASTER_WAIT, &r->res_flags);
+		send_lookup(r, lkb);
+		return 1;
+	}
+
+	for (;;) {
+		/* It's possible for dlm_scand to remove an old rsb for
+		   this same resource from the toss list, us to create
+		   a new one, look up the master locally, and find it
+		   already exists just before dlm_scand does the
+		   dir_remove() on the previous rsb. */
+
+		error = dlm_dir_lookup(ls, our_nodeid, r->res_name,
+				       r->res_length, &ret_nodeid);
+		if (!error)
+			break;
+		log_debug(ls, "dir_lookup error %d %s", error, r->res_name);
+		schedule();
+	}
+
+	if (ret_nodeid == our_nodeid) {
+		r->res_nodeid = 0;
+		lkb->lkb_nodeid = 0;
+		return 0;
+	}
+
+	set_bit(RESFL_MASTER_WAIT, &r->res_flags);
+	r->res_trial_lkid = lkb->lkb_id;
+	r->res_nodeid = ret_nodeid;
+	lkb->lkb_nodeid = ret_nodeid;
+	return 0;
+}
+
+/* confirm_master -- confirm (or deny) an rsb's master nodeid
+
+   This is called when we get a request reply from a remote node
+   who we believe is the master.  The return value (error) we got
+   back indicates whether it's really the master or not.  If it
+   wasn't, we need to start over and do another master lookup.  If
+   it was and our lock was queued, then we know the master won't
+   change.  If it was and our lock wasn't queued, we need to do
+   another trial with the next lkb.
+*/
+
+static void confirm_master(struct dlm_rsb *r, int error)
+{
+	struct dlm_lkb *lkb, *safe;
+
+	if (!test_bit(RESFL_MASTER_WAIT, &r->res_flags))
+		return;
+
+	switch (error) {
+	case 0:
+	case -EINPROGRESS:
+		/* the remote master queued our request, or
+		   the remote dir node told us we're the master */
+
+		clear_bit(RESFL_MASTER_WAIT, &r->res_flags);
+		r->res_trial_lkid = 0;
+
+		list_for_each_entry_safe(lkb, safe, &r->res_lookup,
+					 lkb_rsb_lookup) {
+			list_del(&lkb->lkb_rsb_lookup);
+			_request_lock(r, lkb);
+			schedule();
+		}
+		break;
+	
+	case -EAGAIN:
+		/* the remote master didn't queue our NOQUEUE request;
+		   do another trial with the next waiting lkb */
+
+		if (!list_empty(&r->res_lookup)) {
+			lkb = list_entry(r->res_lookup.next, struct dlm_lkb,
+					 lkb_rsb_lookup);
+			list_del(&lkb->lkb_rsb_lookup);
+			r->res_trial_lkid = lkb->lkb_id;
+			_request_lock(r, lkb);
+			break;
+		}
+		/* fall through so the rsb looks new */
+
+	case -ENOENT:
+	case -ENOTBLK:
+		/* the remote master wasn't really the master, i.e.  our
+		   trial failed; so we start over with another lookup */
+
+		r->res_nodeid = -1;
+		r->res_trial_lkid = 0;
+		clear_bit(RESFL_MASTER_WAIT, &r->res_flags);
+		break;
+
+	default:
+		log_error(r->res_ls, "confirm_master unknown error %d", error);
+	}
+}
+
+static int set_lock_args(int mode, struct dlm_lksb *lksb, uint32_t flags,
+			 int namelen, uint32_t parent_lkid, void *ast,
+			 void *astarg, void *bast, struct dlm_range *range,
+			 struct dlm_args *args)
+{
+	int rv = -EINVAL;
+
+	/* check for invalid arg usage */
+
+	if (mode < 0 || mode > DLM_LOCK_EX)
+		goto out;
+
+	if (!(flags & DLM_LKF_CONVERT) && (namelen > DLM_RESNAME_MAXLEN))
+		goto out;
+
+	if (flags & DLM_LKF_CANCEL)
+		goto out;
+
+	if (flags & DLM_LKF_QUECVT && !(flags & DLM_LKF_CONVERT))
+		goto out;
+
+	if (flags & DLM_LKF_CONVDEADLK && !(flags & DLM_LKF_CONVERT))
+		goto out;
+
+	if (flags & DLM_LKF_CONVDEADLK && flags & DLM_LKF_NOQUEUE)
+		goto out;
+
+	if (flags & DLM_LKF_EXPEDITE && flags & DLM_LKF_CONVERT)
+		goto out;
+
+	if (flags & DLM_LKF_EXPEDITE && flags & DLM_LKF_QUECVT)
+		goto out;
+
+	if (flags & DLM_LKF_EXPEDITE && flags & DLM_LKF_NOQUEUE)
+		goto out;
+
+	if (flags & DLM_LKF_EXPEDITE && mode != DLM_LOCK_NL)
+		goto out;
+
+	if (!ast || !lksb)
+		goto out;
+
+	if (flags & DLM_LKF_VALBLK && !lksb->sb_lvbptr)
+		goto out;
+
+	/* parent/child locks not yet supported */
+	if (parent_lkid)
+		goto out;
+
+	if (flags & DLM_LKF_CONVERT && !lksb->sb_lkid)
+		goto out;
+
+	/* these args will be copied to the lkb in validate_lock_args,
+	   it cannot be done now because when converting locks, fields in
+	   an active lkb cannot be modified before locking the rsb */
+
+	args->flags = flags;
+	args->astaddr = ast;
+	args->astparam = (long) astarg;
+	args->bastaddr = bast;
+	args->mode = mode;
+	args->lksb = lksb;
+	args->range = range;
+	rv = 0;
+ out:
+	return rv;
+}
+
+static int set_unlock_args(uint32_t flags, void *astarg, struct dlm_args *args)
+{
+	if (flags & ~(DLM_LKF_CANCEL | DLM_LKF_VALBLK | DLM_LKF_IVVALBLK))
+		return -EINVAL;
+
+	args->flags = flags;
+	args->astparam = (long) astarg;
+	return 0;
+}
+
+int validate_lock_args(struct dlm_ls *ls, struct dlm_lkb *lkb,
+		       struct dlm_args *args)
+{
+	int rv = -EINVAL;
+
+	if (args->flags & DLM_LKF_CONVERT) {
+		if (lkb->lkb_flags & DLM_IFL_MSTCPY)
+			goto out;
+
+		if (args->flags & DLM_LKF_QUECVT &&
+		    !__quecvt_compat_matrix[lkb->lkb_grmode+1][args->mode+1])
+			goto out;
+
+		rv = -EBUSY;
+		if (lkb->lkb_status != DLM_LKSTS_GRANTED)
+			goto out;
+
+		if (lkb->lkb_wait_type)
+			goto out;
+	}
+
+	lkb->lkb_exflags = args->flags;
+	lkb->lkb_sbflags = 0;
+	lkb->lkb_astaddr = args->astaddr;
+	lkb->lkb_astparam = args->astparam;
+	lkb->lkb_bastaddr = args->bastaddr;
+	lkb->lkb_rqmode = args->mode;
+	lkb->lkb_lksb = args->lksb;
+	lkb->lkb_lvbptr = args->lksb->sb_lvbptr;
+	lkb->lkb_ownpid = (int) current->pid;
+
+	rv = 0;
+	if (!args->range)
+		goto out;
+
+	if (!lkb->lkb_range) {
+		rv = -ENOMEM;
+		lkb->lkb_range = allocate_range(ls);
+		if (!lkb->lkb_range)
+			goto out;
+		/* This is needed for conversions that contain ranges
+		   where the original lock didn't but it's harmless for
+		   new locks too. */
+		lkb->lkb_range[GR_RANGE_START] = 0LL;
+		lkb->lkb_range[GR_RANGE_END] = 0xffffffffffffffffULL;
+	}
+
+	lkb->lkb_range[RQ_RANGE_START] = args->range->ra_start;
+	lkb->lkb_range[RQ_RANGE_END] = args->range->ra_end;
+	lkb->lkb_flags |= DLM_IFL_RANGE;
+	rv = 0;
+ out:
+	return rv;
+}
+
+int validate_unlock_args(struct dlm_lkb *lkb, struct dlm_args *args)
+{
+	int rv = -EINVAL;
+
+	if (lkb->lkb_flags & DLM_IFL_MSTCPY)
+		goto out;
+
+	if (args->flags & DLM_LKF_CANCEL &&
+	    lkb->lkb_status == DLM_LKSTS_GRANTED)
+		goto out;
+
+	if (!(args->flags & DLM_LKF_CANCEL) &&
+	    lkb->lkb_status != DLM_LKSTS_GRANTED)
+		goto out;
+
+	rv = -EBUSY;
+	if (lkb->lkb_wait_type)
+		goto out;
+
+	lkb->lkb_exflags = args->flags;
+	lkb->lkb_sbflags = 0;
+	lkb->lkb_astparam = args->astparam;
+	rv = 0;
+ out:
+	return rv;
+}
+
+/*
+ * Four stage 4 varieties:
+ * do_request(), do_convert(), do_unlock(), do_cancel()
+ * These are called on the master node for the given lock and
+ * from the central locking logic.
+ */
+
+static int do_request(struct dlm_rsb *r, struct dlm_lkb *lkb)
+{
+	int error = 0;
+
+	if (can_be_granted(r, lkb, TRUE)) {
+		grant_lock(r, lkb);
+		queue_cast(r, lkb, 0);
+		goto out;
+	}
+
+	if (can_be_queued(lkb)) {
+		error = -EINPROGRESS;
+		add_lkb(r, lkb, DLM_LKSTS_WAITING);
+		send_blocking_asts(r, lkb);
+		goto out;
+	}
+
+	error = -EAGAIN;
+	if (force_blocking_asts(lkb))
+		send_blocking_asts_all(r, lkb);
+	queue_cast(r, lkb, -EAGAIN);
+
+ out:
+	return error;
+}
+
+static int do_convert(struct dlm_rsb *r, struct dlm_lkb *lkb)
+{
+	int error = 0;
+
+	/* changing an existing lock may allow others to be granted */
+
+	if (can_be_granted(r, lkb, TRUE)) {
+		grant_lock(r, lkb);
+		queue_cast(r, lkb, 0);
+		grant_pending_locks(r);
+		goto out;
+	}
+
+	if (can_be_queued(lkb)) {
+		if (is_demoted(lkb))
+			grant_pending_locks(r);
+		error = -EINPROGRESS;
+		del_lkb(r, lkb);
+		add_lkb(r, lkb, DLM_LKSTS_CONVERT);
+		send_blocking_asts(r, lkb);
+		goto out;
+	}
+
+	error = -EAGAIN;
+	if (force_blocking_asts(lkb))
+		send_blocking_asts_all(r, lkb);
+	queue_cast(r, lkb, -EAGAIN);
+
+ out:
+	return error;
+}
+
+static int do_unlock(struct dlm_rsb *r, struct dlm_lkb *lkb)
+{
+	remove_lock(r, lkb);
+	queue_cast(r, lkb, -DLM_EUNLOCK);
+	grant_pending_locks(r);
+	return -DLM_EUNLOCK;
+}
+
+static int do_cancel(struct dlm_rsb *r, struct dlm_lkb *lkb)
+{
+	revert_lock(r, lkb);
+	queue_cast(r, lkb, -DLM_ECANCEL);
+	grant_pending_locks(r);
+	return -DLM_ECANCEL;
+}
+
+/*
+ * Four stage 3 varieties:
+ * _request_lock(), _convert_lock(), _unlock_lock(), _cancel_lock()
+ */
+
+/* add a new lkb to a possibly new rsb, called by requesting process */
+
+static int _request_lock(struct dlm_rsb *r, struct dlm_lkb *lkb)
+{
+	int error;
+
+	/* set_master: sets lkb nodeid from r */
+
+	error = set_master(r, lkb);
+	if (error < 0)
+		goto out;
+	if (error) {
+		error = 0;
+		goto out;
+	}
+
+	if (is_remote(r))
+		/* receive_request() calls do_request() on remote node */
+		error = send_request(r, lkb);
+	else
+		error = do_request(r, lkb);
+ out:
+	return error;
+}
+
+/* change some property of an existing lkb, e.g. mode, range */
+
+static int _convert_lock(struct dlm_rsb *r, struct dlm_lkb *lkb)
+{
+	int error;
+
+	if (is_remote(r))
+		/* receive_convert() calls do_convert() on remote node */
+		error = send_convert(r, lkb);
+	else
+		error = do_convert(r, lkb);
+
+	return error;
+}
+
+/* remove an existing lkb from the granted queue */
+
+static int _unlock_lock(struct dlm_rsb *r, struct dlm_lkb *lkb)
+{
+	int error;
+
+	if (is_remote(r))
+		/* receive_unlock() calls do_unlock() on remote node */
+		error = send_unlock(r, lkb);
+	else
+		error = do_unlock(r, lkb);
+
+	return error;
+}
+
+/* remove an existing lkb from the convert or wait queue */
+
+static int _cancel_lock(struct dlm_rsb *r, struct dlm_lkb *lkb)
+{
+	int error;
+
+	if (is_remote(r))
+		/* receive_cancel() calls do_cancel() on remote node */
+		error = send_cancel(r, lkb);
+	else
+		error = do_cancel(r, lkb);
+
+	return error;
+}
+
+/*
+ * Four stage 2 varieties:
+ * request_lock(), convert_lock(), unlock_lock(), cancel_lock()
+ */
+
+static int request_lock(struct dlm_ls *ls, struct dlm_lkb *lkb, char *name,
+			int len, struct dlm_args *args)
+{
+	struct dlm_rsb *r;
+	int error;
+
+	error = validate_lock_args(ls, lkb, args);
+	if (error)
+		goto out;
+
+	error = find_rsb(ls, name, len, R_CREATE, &r);
+	if (error)
+		goto out;
+
+	lock_rsb(r);
+
+	attach_lkb(r, lkb);
+	lkb->lkb_lksb->sb_lkid = lkb->lkb_id;
+
+	error = _request_lock(r, lkb);
+
+	unlock_rsb(r);
+	put_rsb(r);
+
+ out:
+	return error;
+}
+
+static int convert_lock(struct dlm_ls *ls, struct dlm_lkb *lkb,
+			struct dlm_args *args)
+{
+	struct dlm_rsb *r;
+	int error;
+
+	r = lkb->lkb_resource;
+
+	hold_rsb(r);
+	lock_rsb(r);
+
+	error = validate_lock_args(ls, lkb, args);
+	if (error)
+		goto out;
+
+	error = _convert_lock(r, lkb);
+ out:
+	unlock_rsb(r);
+	put_rsb(r);
+	return error;
+}
+
+static int unlock_lock(struct dlm_ls *ls, struct dlm_lkb *lkb,
+		       struct dlm_args *args)
+{
+	struct dlm_rsb *r;
+	int error;
+
+	r = lkb->lkb_resource;
+
+	hold_rsb(r);
+	lock_rsb(r);
+
+	error = validate_unlock_args(lkb, args);
+	if (error)
+		goto out;
+
+	error = _unlock_lock(r, lkb);
+ out:
+	unlock_rsb(r);
+	put_rsb(r);
+	return error;
+}
+
+static int cancel_lock(struct dlm_ls *ls, struct dlm_lkb *lkb,
+		       struct dlm_args *args)
+{
+	struct dlm_rsb *r;
+	int error;
+
+	r = lkb->lkb_resource;
+
+	hold_rsb(r);
+	lock_rsb(r);
+
+	error = validate_unlock_args(lkb, args);
+	if (error)
+		goto out;
+
+	error = _cancel_lock(r, lkb);
+ out:
+	unlock_rsb(r);
+	put_rsb(r);
+	return error;
+}
+
+/*
+ * Two stage 1 varieties:  dlm_lock() and dlm_unlock()
+ */
+
+int dlm_lock(dlm_lockspace_t *lockspace,
+	     int mode,
+	     struct dlm_lksb *lksb,
+	     uint32_t flags,
+	     void *name,
+	     unsigned int namelen,
+	     uint32_t parent_lkid,
+	     void (*ast) (void *astarg),
+	     void *astarg,
+	     void (*bast) (void *astarg, int mode),
+	     struct dlm_range *range)
+{
+	struct dlm_ls *ls;
+	struct dlm_lkb *lkb;
+	struct dlm_args args;
+	int error, convert = flags & DLM_LKF_CONVERT;
+
+	ls = dlm_find_lockspace_local(lockspace);
+	if (!ls)
+		return -EINVAL;
+
+	lock_recovery(ls);
+
+	if (convert)
+		error = find_lkb(ls, lksb->sb_lkid, &lkb);
+	else
+		error = create_lkb(ls, &lkb);
+
+	if (error)
+		goto out;
+
+	error = set_lock_args(mode, lksb, flags, namelen, parent_lkid, ast,
+			      astarg, bast, range, &args);
+	if (error)
+		goto out_put;
+
+	if (convert)
+		error = convert_lock(ls, lkb, &args);
+	else
+		error = request_lock(ls, lkb, name, namelen, &args);
+
+	if (error == -EINPROGRESS)
+		error = 0;
+ out_put:
+	if (convert || error)
+		put_lkb(lkb);
+	if (error == -EAGAIN)
+		error = 0;
+ out:
+	unlock_recovery(ls);
+	dlm_put_lockspace(ls);
+	return error;
+}
+
+int dlm_unlock(dlm_lockspace_t *lockspace,
+	       uint32_t lkid,
+	       uint32_t flags,
+	       struct dlm_lksb *lksb,
+	       void *astarg)
+{
+	struct dlm_ls *ls;
+	struct dlm_lkb *lkb;
+	struct dlm_args args;
+	int error;
+
+	ls = dlm_find_lockspace_local(lockspace);
+	if (!ls)
+		return -EINVAL;
+
+	lock_recovery(ls);
+
+	error = find_lkb(ls, lkid, &lkb);
+	if (error)
+		goto out;
+
+	error = set_unlock_args(flags, astarg, &args);
+	if (error)
+		goto out_put;
+
+	if (flags & DLM_LKF_CANCEL)
+		error = cancel_lock(ls, lkb, &args);
+	else
+		error = unlock_lock(ls, lkb, &args);
+
+	if (error == -DLM_EUNLOCK || error == -DLM_ECANCEL)
+		error = 0;
+ out_put:
+	put_lkb(lkb);
+ out:
+	unlock_recovery(ls);
+	dlm_put_lockspace(ls);
+	return error;
+}
+
+/*
+ * send/receive routines for remote operations and replies
+ *
+ * send_args
+ * send_common
+ * send_request			receive_request
+ * send_convert			receive_convert
+ * send_unlock			receive_unlock
+ * send_cancel			receive_cancel
+ * send_grant			receive_grant
+ * send_bast			receive_bast
+ * send_lookup			receive_lookup
+ * send_remove			receive_remove
+ *
+ * 				send_common_reply
+ * receive_request_reply	send_request_reply
+ * receive_convert_reply	send_convert_reply
+ * receive_unlock_reply		send_unlock_reply
+ * receive_cancel_reply		send_cancel_reply
+ * receive_lookup_reply		send_lookup_reply
+ */
+
+static int create_message(struct dlm_rsb *r, struct dlm_lkb *lkb,
+			  int to_nodeid, int mstype,
+			  struct dlm_message **ms_ret,
+			  struct dlm_mhandle **mh_ret)
+{
+	struct dlm_message *ms;
+	struct dlm_mhandle *mh;
+	char *mb;
+	int mb_len = sizeof(struct dlm_message);
+
+	switch (mstype) {
+	case DLM_MSG_REQUEST:
+	case DLM_MSG_LOOKUP:
+	case DLM_MSG_REMOVE:
+		mb_len += r->res_length;
+		break;
+	case DLM_MSG_CONVERT:
+	case DLM_MSG_UNLOCK:
+	case DLM_MSG_REQUEST_REPLY:
+	case DLM_MSG_CONVERT_REPLY:
+	case DLM_MSG_GRANT:
+		if (lkb && lkb->lkb_lvbptr)
+			mb_len += r->res_ls->ls_lvblen;
+		break;
+	}
+
+	/* get_buffer gives us a message handle (mh) that we need to
+	   pass into lowcomms_commit and a message buffer (mb) that we
+	   write our data into */
+
+	mh = dlm_lowcomms_get_buffer(to_nodeid, mb_len, GFP_KERNEL, &mb);
+	if (!mh)
+		return -ENOBUFS;
+
+	memset(mb, 0, mb_len);
+
+	ms = (struct dlm_message *) mb;
+
+	ms->m_header.h_version = (DLM_HEADER_MAJOR | DLM_HEADER_MINOR);
+	ms->m_header.h_lockspace = r->res_ls->ls_global_id;
+	ms->m_header.h_nodeid = dlm_our_nodeid();
+	ms->m_header.h_length = mb_len;
+	ms->m_header.h_cmd = DLM_MSG;
+
+	ms->m_type = mstype;
+
+	*mh_ret = mh;
+	*ms_ret = ms;
+	return 0;
+}
+
+/* further lowcomms enhancements or alternate implementations may make
+   the return value from this function useful at some point */
+
+static int send_message(struct dlm_mhandle *mh, struct dlm_message *ms)
+{
+	dlm_message_out(ms);
+	dlm_lowcomms_commit_buffer(mh);
+	return 0;
+}
+
+static void send_args(struct dlm_rsb *r, struct dlm_lkb *lkb,
+		      struct dlm_message *ms)
+{
+	ms->m_nodeid   = lkb->lkb_nodeid;
+	ms->m_pid      = lkb->lkb_ownpid;
+	ms->m_lkid     = lkb->lkb_id;
+	ms->m_remid    = lkb->lkb_remid;
+	ms->m_exflags  = lkb->lkb_exflags;
+	ms->m_sbflags  = lkb->lkb_sbflags;
+	ms->m_flags    = lkb->lkb_flags;
+	ms->m_lvbseq   = lkb->lkb_lvbseq;
+	ms->m_status   = lkb->lkb_status;
+	ms->m_grmode   = lkb->lkb_grmode;
+	ms->m_rqmode   = lkb->lkb_rqmode;
+
+	/* m_result and m_bastmode are set from function args,
+	   not from lkb fields */
+
+	if (lkb->lkb_bastaddr)
+		ms->m_asts |= AST_BAST;
+	if (lkb->lkb_astaddr)
+		ms->m_asts |= AST_COMP;
+
+	if (lkb->lkb_range) {
+		ms->m_range[0] = lkb->lkb_range[RQ_RANGE_START];
+		ms->m_range[1] = lkb->lkb_range[RQ_RANGE_END];
+	}
+
+	if (ms->m_type == DLM_MSG_REQUEST || ms->m_type == DLM_MSG_LOOKUP)
+		memcpy(ms->m_extra, r->res_name, r->res_length);
+
+	else if (lkb->lkb_lvbptr)
+		memcpy(ms->m_extra, lkb->lkb_lvbptr, r->res_ls->ls_lvblen);
+	
+}
+
+static int send_common(struct dlm_rsb *r, struct dlm_lkb *lkb, int mstype)
+{
+	struct dlm_message *ms;
+	struct dlm_mhandle *mh;
+	int to_nodeid, error;
+
+	add_to_waiters(lkb, mstype);
+
+	to_nodeid = r->res_nodeid;
+
+	error = create_message(r, lkb, to_nodeid, mstype, &ms, &mh);
+	if (error)
+		goto fail;
+
+	send_args(r, lkb, ms);
+
+	error = send_message(mh, ms);
+	if (error)
+		goto fail;
+	return 0;
+
+ fail:
+	remove_from_waiters(lkb);
+	return error;
+}
+
+static int send_request(struct dlm_rsb *r, struct dlm_lkb *lkb)
+{
+	return send_common(r, lkb, DLM_MSG_REQUEST);
+}
+
+static int send_convert(struct dlm_rsb *r, struct dlm_lkb *lkb)
+{
+	int error;
+
+	error = send_common(r, lkb, DLM_MSG_CONVERT);
+
+	/* down conversions go without a reply from the master */
+	if (!error && down_conversion(lkb)) {
+		remove_from_waiters(lkb);
+		r->res_ls->ls_stub_ms.m_result = 0;
+		__receive_convert_reply(r, lkb, &r->res_ls->ls_stub_ms);
+	}
+
+	return error;
+}
+
+/* FIXME: if this lkb is the only lock we hold on the rsb, then set
+   MASTER_UNCERTAIN to force the next request on the rsb to confirm
+   that the master is still correct. */
+
+static int send_unlock(struct dlm_rsb *r, struct dlm_lkb *lkb)
+{
+	return send_common(r, lkb, DLM_MSG_UNLOCK);
+}
+
+static int send_cancel(struct dlm_rsb *r, struct dlm_lkb *lkb)
+{
+	return send_common(r, lkb, DLM_MSG_CANCEL);
+}
+
+static int send_grant(struct dlm_rsb *r, struct dlm_lkb *lkb)
+{
+	struct dlm_message *ms;
+	struct dlm_mhandle *mh;
+	int to_nodeid, error;
+
+	to_nodeid = lkb->lkb_nodeid;
+
+	error = create_message(r, lkb, to_nodeid, DLM_MSG_GRANT, &ms, &mh);
+	if (error)
+		goto out;
+
+	send_args(r, lkb, ms);
+
+	ms->m_result = 0;
+
+	error = send_message(mh, ms);
+ out:
+	return error;
+}
+
+static int send_bast(struct dlm_rsb *r, struct dlm_lkb *lkb, int mode)
+{
+	struct dlm_message *ms;
+	struct dlm_mhandle *mh;
+	int to_nodeid, error;
+
+	to_nodeid = lkb->lkb_nodeid;
+
+	error = create_message(r, NULL, to_nodeid, DLM_MSG_BAST, &ms, &mh);
+	if (error)
+		goto out;
+
+	send_args(r, lkb, ms);
+
+	ms->m_bastmode = mode;
+
+	error = send_message(mh, ms);
+ out:
+	return error;
+}
+
+static int send_lookup(struct dlm_rsb *r, struct dlm_lkb *lkb)
+{
+	struct dlm_message *ms;
+	struct dlm_mhandle *mh;
+	int to_nodeid, error;
+
+	add_to_waiters(lkb, DLM_MSG_LOOKUP);
+
+	to_nodeid = dlm_dir_nodeid(r);
+
+	error = create_message(r, NULL, to_nodeid, DLM_MSG_LOOKUP, &ms, &mh);
+	if (error)
+		goto fail;
+
+	send_args(r, lkb, ms);
+
+	error = send_message(mh, ms);
+	if (error)
+		goto fail;
+	return 0;
+
+ fail:
+	remove_from_waiters(lkb);
+	return error;
+}
+
+static int send_remove(struct dlm_rsb *r)
+{
+	struct dlm_message *ms;
+	struct dlm_mhandle *mh;
+	int to_nodeid, error;
+
+	to_nodeid = dlm_dir_nodeid(r);
+
+	error = create_message(r, NULL, to_nodeid, DLM_MSG_REMOVE, &ms, &mh);
+	if (error)
+		goto out;
+
+	memcpy(ms->m_extra, r->res_name, r->res_length);
+
+	error = send_message(mh, ms);
+ out:
+	return error;
+}
+
+static int send_common_reply(struct dlm_rsb *r, struct dlm_lkb *lkb,
+			     int mstype, int rv)
+{
+	struct dlm_message *ms;
+	struct dlm_mhandle *mh;
+	int to_nodeid, error;
+
+	to_nodeid = lkb->lkb_nodeid;
+
+	error = create_message(r, lkb, to_nodeid, mstype, &ms, &mh);
+	if (error)
+		goto out;
+
+	send_args(r, lkb, ms);
+
+	ms->m_result = rv;
+
+	error = send_message(mh, ms);
+ out:
+	return error;
+}
+
+static int send_request_reply(struct dlm_rsb *r, struct dlm_lkb *lkb, int rv)
+{
+	return send_common_reply(r, lkb, DLM_MSG_REQUEST_REPLY, rv);
+}
+
+static int send_convert_reply(struct dlm_rsb *r, struct dlm_lkb *lkb, int rv)
+{
+	return send_common_reply(r, lkb, DLM_MSG_CONVERT_REPLY, rv);
+}
+
+static int send_unlock_reply(struct dlm_rsb *r, struct dlm_lkb *lkb, int rv)
+{
+	return send_common_reply(r, lkb, DLM_MSG_UNLOCK_REPLY, rv);
+}
+
+static int send_cancel_reply(struct dlm_rsb *r, struct dlm_lkb *lkb, int rv)
+{
+	return send_common_reply(r, lkb, DLM_MSG_CANCEL_REPLY, rv);
+}
+
+static int send_lookup_reply(struct dlm_ls *ls, struct dlm_message *ms_in,
+			     int ret_nodeid, int rv)
+{
+	struct dlm_rsb *r = &ls->ls_stub_rsb;
+	struct dlm_message *ms;
+	struct dlm_mhandle *mh;
+	int error, nodeid = ms_in->m_header.h_nodeid;
+
+	error = create_message(r, NULL, nodeid, DLM_MSG_LOOKUP_REPLY, &ms, &mh);
+	if (error)
+		goto out;
+
+	ms->m_lkid = ms_in->m_lkid;
+	ms->m_result = rv;
+	ms->m_nodeid = ret_nodeid;
+
+	error = send_message(mh, ms);
+ out:
+	return error;
+}
+
+/* which args we save from a received message depends heavily on the type
+   of message, unlike the send side where we can safely send everything about
+   the lkb for any type of message */
+
+static void receive_flags(struct dlm_lkb *lkb, struct dlm_message *ms)
+{
+	lkb->lkb_exflags = ms->m_exflags;
+	lkb->lkb_flags = (lkb->lkb_flags & 0xFFFF0000) |
+		         (ms->m_flags & 0x0000FFFF);
+}
+
+static void receive_flags_reply(struct dlm_lkb *lkb, struct dlm_message *ms)
+{
+	lkb->lkb_sbflags = ms->m_sbflags;
+	lkb->lkb_flags = (lkb->lkb_flags & 0xFFFF0000) |
+		         (ms->m_flags & 0x0000FFFF);
+}
+
+static int receive_extralen(struct dlm_message *ms)
+{
+	return (ms->m_header.h_length - sizeof(struct dlm_message));
+}
+
+static int receive_range(struct dlm_ls *ls, struct dlm_lkb *lkb,
+			 struct dlm_message *ms)
+{
+	if (lkb->lkb_flags & DLM_IFL_RANGE) {
+		if (!lkb->lkb_range)
+			lkb->lkb_range = allocate_range(ls);
+		if (!lkb->lkb_range)
+			return -ENOMEM;
+		lkb->lkb_range[RQ_RANGE_START] = ms->m_range[0];
+		lkb->lkb_range[RQ_RANGE_END] = ms->m_range[1];
+	}
+	return 0;
+}
+
+static int receive_lvb(struct dlm_ls *ls, struct dlm_lkb *lkb,
+		       struct dlm_message *ms)
+{
+	int len;
+
+	if (lkb->lkb_exflags & DLM_LKF_VALBLK) {
+		if (!lkb->lkb_lvbptr)
+			lkb->lkb_lvbptr = allocate_lvb(ls);
+		if (!lkb->lkb_lvbptr)
+			return -ENOMEM;
+		len = receive_extralen(ms);
+		memcpy(lkb->lkb_lvbptr, ms->m_extra, len);
+	}
+	return 0;
+}
+
+static int receive_request_args(struct dlm_ls *ls, struct dlm_lkb *lkb,
+				struct dlm_message *ms)
+{
+	lkb->lkb_nodeid = ms->m_header.h_nodeid;
+	lkb->lkb_ownpid = ms->m_pid;
+	lkb->lkb_remid = ms->m_lkid;
+	lkb->lkb_grmode = DLM_LOCK_IV;
+	lkb->lkb_rqmode = ms->m_rqmode;
+	lkb->lkb_bastaddr = (void *) (long) (ms->m_asts & AST_BAST);
+	lkb->lkb_astaddr = (void *) (long) (ms->m_asts & AST_COMP);
+
+	DLM_ASSERT(is_master_copy(lkb), dlm_print_lkb(lkb););
+
+	if (receive_range(ls, lkb, ms))
+		return -ENOMEM;
+
+	if (receive_lvb(ls, lkb, ms))
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int receive_convert_args(struct dlm_ls *ls, struct dlm_lkb *lkb,
+				struct dlm_message *ms)
+{
+	if (lkb->lkb_nodeid != ms->m_header.h_nodeid) {
+		log_error(ls, "convert_args nodeid %d %d lkid %x %x",
+			  lkb->lkb_nodeid, ms->m_header.h_nodeid,
+			  lkb->lkb_id, lkb->lkb_remid);
+		return -EINVAL;
+	}
+
+	if (!is_master_copy(lkb))
+		return -EINVAL;
+
+	if (lkb->lkb_status != DLM_LKSTS_GRANTED)
+		return -EBUSY;
+
+	if (receive_range(ls, lkb, ms))
+		return -ENOMEM;
+	if (lkb->lkb_range) {
+		lkb->lkb_range[GR_RANGE_START] = 0LL;
+		lkb->lkb_range[GR_RANGE_END] = 0xffffffffffffffffULL;
+	}
+
+	if (receive_lvb(ls, lkb, ms))
+		return -ENOMEM;
+
+	lkb->lkb_rqmode = ms->m_rqmode;
+	lkb->lkb_lvbseq = ms->m_lvbseq;
+
+	return 0;
+}
+
+static int receive_unlock_args(struct dlm_ls *ls, struct dlm_lkb *lkb,
+			       struct dlm_message *ms)
+{
+	if (!is_master_copy(lkb))
+		return -EINVAL;
+	if (receive_lvb(ls, lkb, ms))
+		return -ENOMEM;
+	return 0;
+}
+
+/* We fill in the stub-lkb fields with the info that send_xxxx_reply()
+   uses to send a reply and that the remote end uses to process the reply. */
+
+static void setup_stub_lkb(struct dlm_ls *ls, struct dlm_message *ms)
+{
+	struct dlm_lkb *lkb = &ls->ls_stub_lkb;
+	lkb->lkb_nodeid = ms->m_header.h_nodeid;
+	lkb->lkb_remid = ms->m_lkid;
+}
+
+static void receive_request(struct dlm_ls *ls, struct dlm_message *ms)
+{
+	struct dlm_lkb *lkb;
+	struct dlm_rsb *r;
+	int error, namelen;
+
+	error = create_lkb(ls, &lkb);
+	if (error)
+		goto fail;
+
+	receive_flags(lkb, ms);
+	lkb->lkb_flags |= DLM_IFL_MSTCPY;
+	error = receive_request_args(ls, lkb, ms);
+	if (error) {
+		put_lkb(lkb);
+		goto fail;
+	}
+
+	namelen = receive_extralen(ms);
+
+	error = find_rsb(ls, ms->m_extra, namelen, R_MASTER, &r);
+	if (error) {
+		put_lkb(lkb);
+		goto fail;
+	}
+
+	lock_rsb(r);
+
+	attach_lkb(r, lkb);
+	error = do_request(r, lkb);
+	send_request_reply(r, lkb, error);
+
+	unlock_rsb(r);
+	put_rsb(r);
+
+	if (error == -EINPROGRESS)
+		error = 0;
+	if (error)
+		put_lkb(lkb);
+	return;
+
+ fail:
+	setup_stub_lkb(ls, ms);
+	send_request_reply(&ls->ls_stub_rsb, &ls->ls_stub_lkb, error);
+}
+
+static void receive_convert(struct dlm_ls *ls, struct dlm_message *ms)
+{
+	struct dlm_lkb *lkb;
+	struct dlm_rsb *r;
+	int error, reply = TRUE;
+
+	error = find_lkb(ls, ms->m_remid, &lkb);
+	if (error)
+		goto fail;
+
+	r = lkb->lkb_resource;
+
+	hold_rsb(r);
+	lock_rsb(r);
+
+	receive_flags(lkb, ms);
+	error = receive_convert_args(ls, lkb, ms);
+	if (error)
+		goto out;
+	reply = !down_conversion(lkb);
+
+	error = do_convert(r, lkb);
+ out:
+	if (reply)
+		send_convert_reply(r, lkb, error);
+
+	unlock_rsb(r);
+	put_rsb(r);
+	put_lkb(lkb);
+	return;
+
+ fail:
+	setup_stub_lkb(ls, ms);
+	send_convert_reply(&ls->ls_stub_rsb, &ls->ls_stub_lkb, error);
+}
+
+static void receive_unlock(struct dlm_ls *ls, struct dlm_message *ms)
+{
+	struct dlm_lkb *lkb;
+	struct dlm_rsb *r;
+	int error;
+
+	error = find_lkb(ls, ms->m_remid, &lkb);
+	if (error)
+		goto fail;
+
+	r = lkb->lkb_resource;
+
+	hold_rsb(r);
+	lock_rsb(r);
+
+	receive_flags(lkb, ms);
+	error = receive_unlock_args(ls, lkb, ms);
+	if (error)
+		goto out;
+
+	error = do_unlock(r, lkb);
+ out:
+	send_unlock_reply(r, lkb, error);
+
+	unlock_rsb(r);
+	put_rsb(r);
+	put_lkb(lkb);
+	return;
+
+ fail:
+	setup_stub_lkb(ls, ms);
+	send_unlock_reply(&ls->ls_stub_rsb, &ls->ls_stub_lkb, error);
+}
+
+static void receive_cancel(struct dlm_ls *ls, struct dlm_message *ms)
+{
+	struct dlm_lkb *lkb;
+	struct dlm_rsb *r;
+	int error;
+
+	error = find_lkb(ls, ms->m_remid, &lkb);
+	if (error)
+		goto fail;
+
+	receive_flags(lkb, ms);
+
+	r = lkb->lkb_resource;
+
+	hold_rsb(r);
+	lock_rsb(r);
+
+	error = do_cancel(r, lkb);
+	send_cancel_reply(r, lkb, error);
+
+	unlock_rsb(r);
+	put_rsb(r);
+	put_lkb(lkb);
+	return;
+
+ fail:
+	setup_stub_lkb(ls, ms);
+	send_cancel_reply(&ls->ls_stub_rsb, &ls->ls_stub_lkb, error);
+}
+
+static void receive_grant(struct dlm_ls *ls, struct dlm_message *ms)
+{
+	struct dlm_lkb *lkb;
+	struct dlm_rsb *r;
+	int error;
+
+	error = find_lkb(ls, ms->m_remid, &lkb);
+	if (error) {
+		log_error(ls, "receive_grant no lkb");
+		return;
+	}
+	DLM_ASSERT(is_process_copy(lkb), dlm_print_lkb(lkb););
+
+	r = lkb->lkb_resource;
+
+	hold_rsb(r);
+	lock_rsb(r);
+
+	receive_flags_reply(lkb, ms);
+	grant_lock_pc(r, lkb, ms);
+	queue_cast(r, lkb, 0);
+
+	unlock_rsb(r);
+	put_rsb(r);
+	put_lkb(lkb);
+}
+
+static void receive_bast(struct dlm_ls *ls, struct dlm_message *ms)
+{
+	struct dlm_lkb *lkb;
+	struct dlm_rsb *r;
+	int error;
+
+	error = find_lkb(ls, ms->m_remid, &lkb);
+	if (error) {
+		log_error(ls, "receive_bast no lkb");
+		return;
+	}
+	DLM_ASSERT(is_process_copy(lkb), dlm_print_lkb(lkb););
+
+	r = lkb->lkb_resource;
+
+	hold_rsb(r);
+	lock_rsb(r);
+
+	queue_bast(r, lkb, ms->m_bastmode);
+
+	unlock_rsb(r);
+	put_rsb(r);
+	put_lkb(lkb);
+}
+
+static void receive_lookup(struct dlm_ls *ls, struct dlm_message *ms)
+{
+	int len, error, ret_nodeid, dir_nodeid, from_nodeid, our_nodeid;
+
+	from_nodeid = ms->m_header.h_nodeid;
+	our_nodeid = dlm_our_nodeid();
+
+	len = receive_extralen(ms);
+
+	dir_nodeid = dlm_dir_name2nodeid(ls, ms->m_extra, len);
+	if (dir_nodeid != our_nodeid) {
+		log_error(ls, "lookup dir_nodeid %d from %d",
+			  dir_nodeid, from_nodeid);
+		error = -EINVAL;
+		ret_nodeid = -1;
+		goto out;
+	}
+
+	error = dlm_dir_lookup(ls, from_nodeid, ms->m_extra, len, &ret_nodeid);
+
+	/* Optimization: we're master so treat lookup as a request */
+	if (!error && ret_nodeid == our_nodeid) {
+		receive_request(ls, ms);
+		return;
+	}
+ out:
+	send_lookup_reply(ls, ms, ret_nodeid, error);
+}
+
+static void receive_remove(struct dlm_ls *ls, struct dlm_message *ms)
+{
+	int len, dir_nodeid, from_nodeid;
+
+	from_nodeid = ms->m_header.h_nodeid;
+
+	len = receive_extralen(ms);
+
+	dir_nodeid = dlm_dir_name2nodeid(ls, ms->m_extra, len);
+	if (dir_nodeid != dlm_our_nodeid()) {
+		log_error(ls, "remove dir entry dir_nodeid %d from %d",
+			  dir_nodeid, from_nodeid);
+		return;
+	}
+
+	dlm_dir_remove_entry(ls, from_nodeid, ms->m_extra, len);
+}
+
+static void receive_request_reply(struct dlm_ls *ls, struct dlm_message *ms)
+{
+	struct dlm_lkb *lkb;
+	struct dlm_rsb *r;
+	int error, mstype;
+
+	error = find_lkb(ls, ms->m_remid, &lkb);
+	if (error) {
+		log_error(ls, "receive_request_reply no lkb");
+		return;
+	}
+	DLM_ASSERT(is_process_copy(lkb), dlm_print_lkb(lkb););
+
+	mstype = lkb->lkb_wait_type;
+	error = remove_from_waiters(lkb);
+	if (error) {
+		log_error(ls, "receive_request_reply not on waiters");
+		goto out;
+	}
+
+	/* this is the value returned from do_request() on the master */
+	error = ms->m_result;
+
+	r = lkb->lkb_resource;
+	hold_rsb(r);
+	lock_rsb(r);
+
+	/* Optimization: the dir node was also the master, so it took our
+	   lookup as a request and sent request reply instead of lookup reply */
+	if (mstype == DLM_MSG_LOOKUP) {
+		r->res_nodeid = ms->m_header.h_nodeid;
+		lkb->lkb_nodeid = r->res_nodeid;
+		r->res_trial_lkid = lkb->lkb_id;
+	}
+
+	switch (error) {
+	case -EAGAIN:
+		/* request would block (be queued) on remote master;
+		   the unhold undoes the original ref from create_lkb()
+		   so it leads to the lkb being freed */
+		queue_cast(r, lkb, -EAGAIN);
+		confirm_master(r, -EAGAIN);
+		unhold_lkb(lkb);
+		break;
+
+	case -EINPROGRESS:
+	case 0:
+		/* request was queued or granted on remote master */
+		receive_flags_reply(lkb, ms);
+		lkb->lkb_remid = ms->m_lkid;
+		if (error)
+			add_lkb(r, lkb, DLM_LKSTS_WAITING);
+		else {
+			grant_lock_pc(r, lkb, ms);
+			queue_cast(r, lkb, 0);
+		}
+		confirm_master(r, error);
+		break;
+
+	case -ENOENT:
+	case -ENOTBLK:
+		/* find_rsb failed to find rsb or rsb wasn't master */
+
+		DLM_ASSERT(test_bit(RESFL_MASTER_WAIT, &r->res_flags),
+		           log_print("receive_request_reply error %d", error);
+		           dlm_print_lkb(lkb);
+		           dlm_print_rsb(r););
+
+		confirm_master(r, error);
+		lkb->lkb_nodeid = -1;
+		_request_lock(r, lkb);
+		break;
+
+	default:
+		log_error(ls, "receive_request_reply error %d", error);
+	}
+
+	unlock_rsb(r);
+	put_rsb(r);
+ out:
+	put_lkb(lkb);
+}
+
+static void __receive_convert_reply(struct dlm_rsb *r, struct dlm_lkb *lkb,
+				    struct dlm_message *ms)
+{
+	int error = ms->m_result;
+
+	/* this is the value returned from do_convert() on the master */
+
+	switch (error) {
+	case -EAGAIN:
+		/* convert would block (be queued) on remote master */
+		queue_cast(r, lkb, -EAGAIN);
+		break;
+
+	case -EINPROGRESS:
+		/* convert was queued on remote master */
+		del_lkb(r, lkb);
+		add_lkb(r, lkb, DLM_LKSTS_CONVERT);
+		break;
+
+	case 0:
+		/* convert was granted on remote master */
+		receive_flags_reply(lkb, ms);
+		grant_lock_pc(r, lkb, ms);
+		queue_cast(r, lkb, 0);
+		break;
+
+	default:
+		log_error(r->res_ls, "receive_convert_reply error %d", error);
+	}
+}
+
+static void _receive_convert_reply(struct dlm_lkb *lkb, struct dlm_message *ms)
+{
+	struct dlm_rsb *r = lkb->lkb_resource;
+
+	hold_rsb(r);
+	lock_rsb(r);
+
+	__receive_convert_reply(r, lkb, ms);
+
+	unlock_rsb(r);
+	put_rsb(r);
+}
+
+static void receive_convert_reply(struct dlm_ls *ls, struct dlm_message *ms)
+{
+	struct dlm_lkb *lkb;
+	int error;
+
+	error = find_lkb(ls, ms->m_remid, &lkb);
+	if (error) {
+		log_error(ls, "receive_convert_reply no lkb");
+		return;
+	}
+	DLM_ASSERT(is_process_copy(lkb), dlm_print_lkb(lkb););
+
+	error = remove_from_waiters(lkb);
+	if (error) {
+		log_error(ls, "receive_convert_reply not on waiters");
+		goto out;
+	}
+
+	_receive_convert_reply(lkb, ms);
+ out:
+	put_lkb(lkb);
+}
+
+static void _receive_unlock_reply(struct dlm_lkb *lkb, struct dlm_message *ms)
+{
+	struct dlm_rsb *r = lkb->lkb_resource;
+	int error = ms->m_result;
+
+	hold_rsb(r);
+	lock_rsb(r);
+
+	/* this is the value returned from do_unlock() on the master */
+
+	switch (error) {
+	case -DLM_EUNLOCK:
+		receive_flags_reply(lkb, ms);
+		remove_lock_pc(r, lkb);
+		queue_cast(r, lkb, -DLM_EUNLOCK);
+		break;
+	default:
+		log_error(r->res_ls, "receive_unlock_reply error %d", error);
+	}
+
+	unlock_rsb(r);
+	put_rsb(r);
+}
+
+static void receive_unlock_reply(struct dlm_ls *ls, struct dlm_message *ms)
+{
+	struct dlm_lkb *lkb;
+	int error;
+
+	error = find_lkb(ls, ms->m_remid, &lkb);
+	if (error) {
+		log_error(ls, "receive_unlock_reply no lkb");
+		return;
+	}
+	DLM_ASSERT(is_process_copy(lkb), dlm_print_lkb(lkb););
+
+	error = remove_from_waiters(lkb);
+	if (error) {
+		log_error(ls, "receive_unlock_reply not on waiters");
+		goto out;
+	}
+
+	_receive_unlock_reply(lkb, ms);
+ out:
+	put_lkb(lkb);
+}
+
+static void _receive_cancel_reply(struct dlm_lkb *lkb, struct dlm_message *ms)
+{
+	struct dlm_rsb *r = lkb->lkb_resource;
+	int error = ms->m_result;
+
+	hold_rsb(r);
+	lock_rsb(r);
+
+	/* this is the value returned from do_cancel() on the master */
+
+	switch (error) {
+	case -DLM_ECANCEL:
+		receive_flags_reply(lkb, ms);
+		revert_lock_pc(r, lkb);
+		queue_cast(r, lkb, -DLM_ECANCEL);
+		break;
+	default:
+		log_error(r->res_ls, "receive_cancel_reply error %d", error);
+	}
+
+	unlock_rsb(r);
+	put_rsb(r);
+}
+
+static void receive_cancel_reply(struct dlm_ls *ls, struct dlm_message *ms)
+{
+	struct dlm_lkb *lkb;
+	int error;
+
+	error = find_lkb(ls, ms->m_remid, &lkb);
+	if (error) {
+		log_error(ls, "receive_cancel_reply no lkb");
+		return;
+	}
+	DLM_ASSERT(is_process_copy(lkb), dlm_print_lkb(lkb););
+
+	error = remove_from_waiters(lkb);
+	if (error) {
+		log_error(ls, "receive_cancel_reply not on waiters");
+		goto out;
+	}
+
+	_receive_cancel_reply(lkb, ms);
+ out:
+	put_lkb(lkb);
+}
+
+static void receive_lookup_reply(struct dlm_ls *ls, struct dlm_message *ms)
+{
+	struct dlm_lkb *lkb;
+	struct dlm_rsb *r;
+	int error, ret_nodeid;
+
+	error = find_lkb(ls, ms->m_lkid, &lkb);
+	if (error) {
+		log_error(ls, "receive_lookup_reply no lkb");
+		return;
+	}
+
+	error = remove_from_waiters(lkb);
+	if (error) {
+		log_error(ls, "receive_lookup_reply not on waiters");
+		goto out;
+	}
+
+	/* this is the value returned by dlm_dir_lookup on dir node
+	   FIXME: will a non-zero error ever be returned? */
+	error = ms->m_result;
+
+	r = lkb->lkb_resource;
+	hold_rsb(r);
+	lock_rsb(r);
+
+	ret_nodeid = ms->m_nodeid;
+	if (ret_nodeid == dlm_our_nodeid())
+		r->res_nodeid = ret_nodeid = 0;
+	else {
+		r->res_nodeid = ret_nodeid;
+		r->res_trial_lkid = lkb->lkb_id;
+	}
+
+	_request_lock(r, lkb);
+
+	if (!ret_nodeid)
+		confirm_master(r, 0);
+
+	unlock_rsb(r);
+	put_rsb(r);
+ out:
+	put_lkb(lkb);
+}
+
+int dlm_receive_message(struct dlm_header *hd, int nodeid, int recovery)
+{
+	struct dlm_message *ms = (struct dlm_message *) hd;
+	struct dlm_ls *ls;
+	int error;
+
+	if (!recovery)
+		dlm_message_in(ms);
+
+	ls = dlm_find_lockspace_global(hd->h_lockspace);
+	if (!ls) {
+		log_print("drop message %d from %d for unknown lockspace %d",
+			  ms->m_type, nodeid, hd->h_lockspace);
+		return -EINVAL;
+	}
+
+	/* recovery may have just ended leaving a bunch of backed-up requests
+	   in the requestqueue; wait while dlm_recoverd clears them */
+
+	if (!recovery)
+		dlm_wait_requestqueue(ls);
+
+	/* recovery may have just started while there were a bunch of
+	   in-flight requests -- save them in requestqueue to be processed
+	   after recovery.  we can't let dlm_recvd block on the recovery
+	   lock.  if dlm_recoverd is calling this function to clear the
+	   requestqueue, it needs to be interrupted (-EINTR) if another
+	   recovery operation is starting. */
+
+	while (1) {
+		if (!test_bit(LSFL_LS_RUN, &ls->ls_flags)) {
+			if (!recovery)
+				dlm_add_requestqueue(ls, nodeid, hd);
+			error = -EINTR;
+			goto out;
+		}
+
+		if (lock_recovery_try(ls))
+			break;
+		schedule();
+	}
+
+	switch (ms->m_type) {
+
+	/* messages sent to a master node */
+
+	case DLM_MSG_REQUEST:
+		receive_request(ls, ms);
+		break;
+
+	case DLM_MSG_CONVERT:
+		receive_convert(ls, ms);
+		break;
+
+	case DLM_MSG_UNLOCK:
+		receive_unlock(ls, ms);
+		break;
+
+	case DLM_MSG_CANCEL:
+		receive_cancel(ls, ms);
+		break;
+
+	/* messages sent from a master node (replies to above) */
+
+	case DLM_MSG_REQUEST_REPLY:
+		receive_request_reply(ls, ms);
+		break;
+
+	case DLM_MSG_CONVERT_REPLY:
+		receive_convert_reply(ls, ms);
+		break;
+
+	case DLM_MSG_UNLOCK_REPLY:
+		receive_unlock_reply(ls, ms);
+		break;
+
+	case DLM_MSG_CANCEL_REPLY:
+		receive_cancel_reply(ls, ms);
+		break;
+
+	/* messages sent from a master node (only two types of async msg) */
+
+	case DLM_MSG_GRANT:
+		receive_grant(ls, ms);
+		break;
+
+	case DLM_MSG_BAST:
+		receive_bast(ls, ms);
+		break;
+
+	/* messages sent to a dir node */
+
+	case DLM_MSG_LOOKUP:
+		receive_lookup(ls, ms);
+		break;
+
+	case DLM_MSG_REMOVE:
+		receive_remove(ls, ms);
+		break;
+
+	/* messages sent from a dir node (remove has no reply) */
+
+	case DLM_MSG_LOOKUP_REPLY:
+		receive_lookup_reply(ls, ms);
+		break;
+
+	default:
+		log_error(ls, "unknown message type %d", ms->m_type);
+	}
+
+	unlock_recovery(ls);
+ out:
+	dlm_put_lockspace(ls);
+	dlm_astd_wake();
+	return 0;
+}
+
+
+/*
+ * Recovery related
+ */
+
+static void recover_convert_waiter(struct dlm_ls *ls, struct dlm_lkb *lkb)
+{
+	if (middle_conversion(lkb)) {
+		hold_lkb(lkb);
+		ls->ls_stub_ms.m_result = -EINPROGRESS;
+		_remove_from_waiters(lkb);
+		_receive_convert_reply(lkb, &ls->ls_stub_ms);
+
+		/* Same special case as in receive_rcom_lock_args() */
+		lkb->lkb_grmode = DLM_LOCK_IV;
+		set_bit(RESFL_RECOVER_CONVERT, &lkb->lkb_resource->res_flags);
+		unhold_lkb(lkb);
+
+	} else if (lkb->lkb_rqmode >= lkb->lkb_grmode) {
+		lkb->lkb_flags |= DLM_IFL_RESEND;
+	}
+	
+	/* lkb->lkb_rqmode < lkb->lkb_grmode shouldn't happen since down
+	   conversions are async; there's no reply from the remote master */
+}
+
+/* Recovery for locks that are waiting for replies from nodes that are now
+   gone.  We can just complete unlocks and cancels by faking a reply from the
+   dead node.  Requests and up-conversions we flag to be resent after
+   recovery.  Down-conversions can just be completed with a fake reply like
+   unlocks.  Conversions between PR and CW need special attention. */
+
+void dlm_recover_waiters_pre(struct dlm_ls *ls)
+{
+	struct dlm_lkb *lkb, *safe;
+
+	down(&ls->ls_waiters_sem);
+
+	list_for_each_entry_safe(lkb, safe, &ls->ls_waiters, lkb_wait_reply) {
+		if (!dlm_is_removed(ls, lkb->lkb_nodeid))
+			continue;
+
+		log_debug(ls, "pre recover waiter lkid %x type %d flags %x",
+			  lkb->lkb_id, lkb->lkb_wait_type, lkb->lkb_flags);
+
+		switch (lkb->lkb_wait_type) {
+
+		case DLM_MSG_REQUEST:
+			lkb->lkb_flags |= DLM_IFL_RESEND;
+			break;
+
+		case DLM_MSG_CONVERT:
+			recover_convert_waiter(ls, lkb);
+			break;
+
+		case DLM_MSG_UNLOCK:
+			hold_lkb(lkb);
+			ls->ls_stub_ms.m_result = -DLM_EUNLOCK;
+			_remove_from_waiters(lkb);
+			_receive_unlock_reply(lkb, &ls->ls_stub_ms);
+			put_lkb(lkb);
+			break;
+
+		case DLM_MSG_CANCEL:
+			hold_lkb(lkb);
+			ls->ls_stub_ms.m_result = -DLM_ECANCEL;
+			_remove_from_waiters(lkb);
+			_receive_cancel_reply(lkb, &ls->ls_stub_ms);
+			put_lkb(lkb);
+			break;
+
+		case DLM_MSG_LOOKUP:
+			/* all outstanding lookups, regardless of dest.
+			   will be resent after recovery is done */
+			break;
+
+		default:
+			log_error(ls, "invalid lkb wait_type %d",
+				  lkb->lkb_wait_type);
+		}
+	}
+	up(&ls->ls_waiters_sem);
+}
+
+static int remove_resend_waiter(struct dlm_ls *ls, struct dlm_lkb **lkb_ret)
+{
+	struct dlm_lkb *lkb;
+	int rv = 0;
+
+	down(&ls->ls_waiters_sem);
+	list_for_each_entry(lkb, &ls->ls_waiters, lkb_wait_reply) {
+		if (lkb->lkb_flags & DLM_IFL_RESEND) {
+			rv = lkb->lkb_wait_type;
+			_remove_from_waiters(lkb);
+			lkb->lkb_flags &= ~DLM_IFL_RESEND;
+			break;
+		}
+	}
+	up(&ls->ls_waiters_sem);
+
+	if (!rv)
+		lkb = NULL;
+	*lkb_ret = lkb;
+	return rv;
+}
+
+/* Deal with lookups and lkb's marked RESEND from _pre.  We may now be the
+   master or dir-node for r.  Processing the lkb may result in it being placed
+   back on waiters. */
+
+int dlm_recover_waiters_post(struct dlm_ls *ls)
+{
+	struct dlm_lkb *lkb;
+	struct dlm_rsb *r;
+	int error = 0, mstype;
+
+	while (1) {
+		if (!test_bit(LSFL_LS_RUN, &ls->ls_flags)) {
+			log_debug(ls, "recover_waiters_post aborted");
+			error = -EINTR;
+			break;
+		}
+
+		mstype = remove_resend_waiter(ls, &lkb);
+		if (!mstype)
+			break;
+
+		r = lkb->lkb_resource;
+
+		log_debug(ls, "recover_waiters_post %x type %d flags %x %s",
+			  lkb->lkb_id, mstype, lkb->lkb_flags, r->res_name);
+
+		switch (mstype) {
+
+		case DLM_MSG_LOOKUP:
+		case DLM_MSG_REQUEST:
+			hold_rsb(r);
+			lock_rsb(r);
+			_request_lock(r, lkb);
+			unlock_rsb(r);
+			put_rsb(r);
+			break;
+
+		case DLM_MSG_CONVERT:
+			hold_rsb(r);
+			lock_rsb(r);
+			_convert_lock(r, lkb);
+			unlock_rsb(r);
+			put_rsb(r);
+			break;
+
+		default:
+			log_error(ls, "recover_waiters_post type %d", mstype);
+		}
+	}
+
+	return error;
+}
+
+static void purge_queue(struct dlm_rsb *r, struct list_head *queue,
+			int (*test)(struct dlm_ls *ls, struct dlm_lkb *lkb))
+{
+	struct dlm_ls *ls = r->res_ls;
+	struct dlm_lkb *lkb, *safe;
+
+	list_for_each_entry_safe(lkb, safe, queue, lkb_statequeue) {
+		if (test(ls, lkb)) {
+			del_lkb(r, lkb);
+			/* this put should free the lkb */
+			if (!put_lkb(lkb))
+				log_error(ls, "purged lkb not released");
+		}
+	}
+}
+
+static int purge_dead_test(struct dlm_ls *ls, struct dlm_lkb *lkb)
+{
+	return (is_master_copy(lkb) && dlm_is_removed(ls, lkb->lkb_nodeid));
+}
+
+static void purge_dead_locks(struct dlm_rsb *r)
+{
+	purge_queue(r, &r->res_grantqueue, &purge_dead_test);
+	purge_queue(r, &r->res_convertqueue, &purge_dead_test);
+	purge_queue(r, &r->res_waitqueue, &purge_dead_test);
+}
+
+/* Get rid of locks held by nodes that are gone. */
+
+int dlm_purge_locks(struct dlm_ls *ls)
+{
+	struct dlm_rsb *r;
+
+	log_debug(ls, "dlm_purge_locks");
+
+	down_write(&ls->ls_root_sem);
+	list_for_each_entry(r, &ls->ls_root_list, res_root_list) {
+		hold_rsb(r);
+		lock_rsb(r);
+		if (is_master(r))
+			purge_dead_locks(r);
+		unlock_rsb(r);
+		unhold_rsb(r);
+
+		schedule();
+	}
+	up_write(&ls->ls_root_sem);
+
+	return 0;
+}
+
+int dlm_grant_after_purge(struct dlm_ls *ls)
+{
+	struct dlm_rsb *r;
+	int i;
+
+	for (i = 0; i < ls->ls_rsbtbl_size; i++) {
+		read_lock(&ls->ls_rsbtbl[i].lock);
+		list_for_each_entry(r, &ls->ls_rsbtbl[i].list, res_hashchain) {
+			hold_rsb(r);
+			lock_rsb(r);
+			if (is_master(r)) {
+				grant_pending_locks(r);
+				confirm_master(r, 0);
+			}
+			unlock_rsb(r);
+			put_rsb(r);
+		}
+		read_unlock(&ls->ls_rsbtbl[i].lock);
+	}
+
+	return 0;
+}
+
+static struct dlm_lkb *search_remid_list(struct list_head *head, int nodeid,
+					 uint32_t remid)
+{
+	struct dlm_lkb *lkb;
+
+	list_for_each_entry(lkb, head, lkb_statequeue) {
+		if (lkb->lkb_nodeid == nodeid && lkb->lkb_remid == remid)
+			return lkb;
+	}
+	return NULL;
+}
+
+static struct dlm_lkb *search_remid(struct dlm_rsb *r, int nodeid,
+				    uint32_t remid)
+{
+	struct dlm_lkb *lkb;
+
+	lkb = search_remid_list(&r->res_grantqueue, nodeid, remid);
+	if (lkb)
+		return lkb;
+	lkb = search_remid_list(&r->res_convertqueue, nodeid, remid);
+	if (lkb)
+		return lkb;
+	lkb = search_remid_list(&r->res_waitqueue, nodeid, remid);
+	if (lkb)
+		return lkb;
+	return NULL;
+}
+
+static int receive_rcom_lock_args(struct dlm_ls *ls, struct dlm_lkb *lkb,
+				  struct dlm_rsb *r, struct dlm_rcom *rc)
+{
+	struct rcom_lock *rl = (struct rcom_lock *) rc->rc_buf;
+	int lvblen;
+
+	lkb->lkb_nodeid = rc->rc_header.h_nodeid;
+	lkb->lkb_ownpid = rl->rl_ownpid;
+	lkb->lkb_remid = rl->rl_lkid;
+	lkb->lkb_exflags = rl->rl_exflags;
+	lkb->lkb_flags = rl->rl_flags & 0x0000FFFF;
+	lkb->lkb_flags |= DLM_IFL_MSTCPY;
+	lkb->lkb_lvbseq = rl->rl_lvbseq;
+	lkb->lkb_rqmode = rl->rl_rqmode;
+	lkb->lkb_grmode = rl->rl_grmode;
+	/* don't set lkb_status because add_lkb wants to itself */
+
+	lkb->lkb_bastaddr = (void *) (long) (rl->rl_asts & AST_BAST);
+	lkb->lkb_astaddr = (void *) (long) (rl->rl_asts & AST_COMP);
+
+	if (lkb->lkb_flags & DLM_IFL_RANGE) {
+		lkb->lkb_range = allocate_range(ls);
+		if (!lkb->lkb_range)
+			return -ENOMEM;
+		memcpy(lkb->lkb_range, rl->rl_range, 4*sizeof(uint64_t));
+	}
+
+	if (lkb->lkb_exflags & DLM_LKF_VALBLK) {
+		lkb->lkb_lvbptr = allocate_lvb(ls);
+		if (!lkb->lkb_lvbptr)
+			return -ENOMEM;
+		lvblen = rc->rc_header.h_length - sizeof(struct dlm_rcom) -
+			 sizeof(struct rcom_lock);
+		memcpy(lkb->lkb_lvbptr, rl->rl_lvb, lvblen);
+	}
+
+	/* Conversions between PR and CW (middle modes) need special handling.
+	   The real granted mode of these converting locks cannot be determined
+	   until all locks have been rebuilt on the rsb (recover_conversion) */
+
+	if (rl->rl_wait_type == DLM_MSG_CONVERT && middle_conversion(lkb)) {
+		rl->rl_status = DLM_LKSTS_CONVERT;
+		lkb->lkb_grmode = DLM_LOCK_IV;
+		set_bit(RESFL_RECOVER_CONVERT, &r->res_flags);
+	}
+
+	return 0;
+}
+
+/* This lkb may have been recovered in a previous aborted recovery so we need
+   to check if the rsb already has an lkb with the given remote nodeid/lkid.
+   If so we just send back a standard reply.  If not, we create a new lkb with
+   the given values and send back our lkid.  We send back our lkid by sending
+   back the rcom_lock struct we got but with the remid field filled in. */
+
+int dlm_recover_master_copy(struct dlm_ls *ls, struct dlm_rcom *rc)
+{
+	struct rcom_lock *rl = (struct rcom_lock *) rc->rc_buf;
+	struct dlm_rsb *r;
+	struct dlm_lkb *lkb;
+	int error;
+
+	if (rl->rl_parent_lkid) {
+		error = -EOPNOTSUPP;
+		goto out;
+	}
+
+	error = find_rsb(ls, rl->rl_name, rl->rl_namelen, R_MASTER, &r);
+	if (error)
+		goto out;
+
+	lock_rsb(r);
+
+	lkb = search_remid(r, rc->rc_header.h_nodeid, rl->rl_lkid);
+	if (lkb) {
+		error = -EEXIST;
+		goto out_remid;
+	}
+
+	error = create_lkb(ls, &lkb);
+	if (error)
+		goto out_unlock;
+
+	error = receive_rcom_lock_args(ls, lkb, r, rc);
+	if (error) {
+		put_lkb(lkb);
+		goto out_unlock;
+	}
+
+	attach_lkb(r, lkb);
+	add_lkb(r, lkb, rl->rl_status);
+	error = 0;
+
+ out_remid:
+	/* this is the new value returned to the lock holder for
+	   saving in its process-copy lkb */
+	rl->rl_remid = lkb->lkb_id;
+
+ out_unlock:
+	unlock_rsb(r);
+	put_rsb(r);
+ out:
+	if (error)
+		log_print("recover_master_copy %d %x", error, rl->rl_lkid);
+	rl->rl_result = error;
+	return error;
+}
+
+int dlm_recover_process_copy(struct dlm_ls *ls, struct dlm_rcom *rc)
+{
+	struct rcom_lock *rl = (struct rcom_lock *) rc->rc_buf;
+	struct dlm_rsb *r;
+	struct dlm_lkb *lkb;
+	int error;
+
+	error = find_lkb(ls, rl->rl_lkid, &lkb);
+	if (error) {
+		log_error(ls, "recover_process_copy no lkid %x", rl->rl_lkid);
+		return error;
+	}
+
+	DLM_ASSERT(is_process_copy(lkb), dlm_print_lkb(lkb););
+
+	error = rl->rl_result;
+
+	r = lkb->lkb_resource;
+	hold_rsb(r);
+	lock_rsb(r);
+
+	switch (error) {
+	case -EEXIST:
+		log_debug(ls, "master copy exists %x", lkb->lkb_id);
+		/* fall through */
+	case 0:
+		lkb->lkb_remid = rl->rl_remid;
+		break;
+	default:
+		log_error(ls, "dlm_recover_process_copy unknown error %d %x",
+			  error, lkb->lkb_id);
+	}
+
+	/* an ack for dlm_recover_locks() which waits for replies from
+	   all the locks it sends to new masters */
+	dlm_recovered_lock(r);
+
+	unlock_rsb(r);
+	put_rsb(r);
+	put_lkb(lkb);
+
+	return 0;
+}
+
--- a/drivers/dlm/lock.h	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/lock.h	2005-05-12 23:13:15.829485512 +0800
@@ -0,0 +1,51 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) 2005 Red Hat, Inc.  All rights reserved.
+**  
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#ifndef __LOCK_DOT_H__
+#define __LOCK_DOT_H__
+
+void dlm_print_lkb(struct dlm_lkb *lkb);
+void dlm_print_rsb(struct dlm_rsb *r);
+int dlm_receive_message(struct dlm_header *hd, int nodeid, int recovery);
+int dlm_modes_compat(int mode1, int mode2);
+int dlm_find_rsb(struct dlm_ls *ls, char *name, int namelen,
+	unsigned int flags, struct dlm_rsb **r_ret);
+void dlm_put_rsb(struct dlm_rsb *r);
+void dlm_hold_rsb(struct dlm_rsb *r);
+int dlm_put_lkb(struct dlm_lkb *lkb);
+int dlm_remove_from_waiters(struct dlm_lkb *lkb);
+void dlm_scan_rsbs(struct dlm_ls *ls);
+
+int dlm_purge_locks(struct dlm_ls *ls);
+int dlm_grant_after_purge(struct dlm_ls *ls);
+int dlm_recover_waiters_post(struct dlm_ls *ls);
+void dlm_recover_waiters_pre(struct dlm_ls *ls);
+int dlm_recover_master_copy(struct dlm_ls *ls, struct dlm_rcom *rc);
+int dlm_recover_process_copy(struct dlm_ls *ls, struct dlm_rcom *rc);
+
+static inline int is_master(struct dlm_rsb *r)
+{
+	return !r->res_nodeid;
+}
+
+static inline void lock_rsb(struct dlm_rsb *r)
+{
+	down(&r->res_sem);
+}
+
+static inline void unlock_rsb(struct dlm_rsb *r)
+{
+	up(&r->res_sem);
+}
+
+#endif
+



From akpm at osdl.org  Tue May 17 07:11:33 2005
From: akpm at osdl.org (Andrew Morton)
Date: Tue, 17 May 2005 00:11:33 -0700
Subject: [Linux-cluster] Re: [PATCH 0/8] dlm: overview
In-Reply-To: <20050516071949.GE7094@redhat.com>
References: <20050516071949.GE7094@redhat.com>
Message-ID: <20050517001133.64d50d8c.akpm@osdl.org>

David Teigland <teigland at redhat.com> wrote:
>
>  These are the distributed lock manager (dlm) patches against 2.6.12-rc4
>  that we'd like to see added to the kernel.

Squawk.

Not only do I not know whether this stuff should be merged: I don't even
know how to find that out.  Unless I'm prepared to become a full-on
cluster/dlm person, which isn't looking likely.

The usual fallback is to identify all the stakeholders and get them to say
"yes Andrew, this code is cool and we can use it", but I don't think the
clustering teams have sufficent act-togetherness to be able to do that.

Am I correct in believing that this DLM is designed to be used by multiple
clustering products?  If so, which ones, and how far along are they?

Looking at Ted's latest Kernel Summit agenda I see

 Clustering

 	We need to make progress on the kernel integration of things
 	like message passing, membership, DLM etc.

 	We seem to have at least two comparable kernel-side offerings
 	(OpenSSI and RHAT's), as well as a need to hash out how user-space
 	plays into this.

 	(There is now a plan for a Clustering Summit prior to KS - need
 	to validate if this will be useful, still)

So right now I'm inclined to duck any decision making and see what happens
in July.  Does that sounds sane?  Is that Clustering Summit going to happen?

In the meanwhile I can pop this code into -mm so it gets a bit more
compile testing, review and exposure, but that wouldn't signify anything
further.



From ak at muc.de  Tue May 17 13:10:46 2005
From: ak at muc.de (Andi Kleen)
Date: Tue, 17 May 2005 15:10:46 +0200
Subject: [Linux-cluster] Re: [PATCH 0/8] dlm: overview
In-Reply-To: <20050517001133.64d50d8c.akpm@osdl.org> (Andrew Morton's
	message of "Tue, 17 May 2005 00:11:33 -0700")
References: <20050516071949.GE7094@redhat.com>
	<20050517001133.64d50d8c.akpm@osdl.org>
Message-ID: <m1fywm80bt.fsf@muc.de>

Andrew Morton <akpm at osdl.org> writes:
>
> Squawk.
>
> Not only do I not know whether this stuff should be merged: I don't even
> know how to find that out.  Unless I'm prepared to become a full-on
> cluster/dlm person, which isn't looking likely.
>
> The usual fallback is to identify all the stakeholders and get them to say
> "yes Andrew, this code is cool and we can use it", but I don't think the
> clustering teams have sufficent act-togetherness to be able to do that.

My impression is that it is unlikely everybody will agree on a single
cluster setup anyways, so it might be best to use a similar strategy
as with file systems ("multiple implementations - standard API to the
outside world")

This would mean the DLM could be merged if the other cluster people
agree that the interface it presents to the outside world is good
for them too. Is it? Perhaps these interfaces should be discussed
first.

-Andi



From teigland at redhat.com  Mon May 16 07:20:11 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 16 May 2005 15:20:11 +0800
Subject: [Linux-cluster] [PATCH 2/8] dlm: lockspaces, callbacks, directory
Message-ID: <20050516072011.GG7094@redhat.com>

Creates lockspaces which give applications separate contexts/namespaces in
which to do their locking.  Delivers completion and blocking callbacks
(ast's) to lock holders.  Manages the distributed directory that tracks
the current master node for each resource.  Includes the main headers.

Signed-off-by: Dave Teigland <teigland at redhat.com>
Signed-off-by: Patrick Caulfield <pcaulfie at redhat.com>

---

 drivers/dlm/ast.c          |  166 +++++++++++++
 drivers/dlm/ast.h          |   26 ++
 drivers/dlm/dir.c          |  419 +++++++++++++++++++++++++++++++++++
 drivers/dlm/dir.h          |   30 ++
 drivers/dlm/dlm_internal.h |  508 ++++++++++++++++++++++++++++++++++++++++++
 drivers/dlm/lockspace.c    |  537 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/dlm/lockspace.h    |   24 ++
 drivers/dlm/lvb_table.h    |   38 +++
 drivers/dlm/main.c         |   91 +++++++
 drivers/dlm/memory.c       |  121 ++++++++++
 drivers/dlm/memory.h       |   31 ++
 drivers/dlm/util.c         |  204 +++++++++++++++++
 drivers/dlm/util.h         |   24 ++
 include/linux/dlm.h        |  303 +++++++++++++++++++++++++
 14 files changed, 2522 insertions(+)

--- a/include/linux/dlm.h	1970-01-01 07:30:00.000000000 +0730
+++ b/include/linux/dlm.h	2005-05-12 23:13:15.828485664 +0800
@@ -0,0 +1,303 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) Sistina Software, Inc.  1997-2003  All rights reserved.
+**  Copyright (C) 2004-2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#ifndef __DLM_DOT_H__
+#define __DLM_DOT_H__
+
+/*
+ * Interface to Distributed Lock Manager (DLM)
+ * routines and structures to use DLM lockspaces
+ */
+
+/*
+ * Lock Modes
+ */
+
+#define DLM_LOCK_IV		-1	/* invalid */
+#define DLM_LOCK_NL		0	/* null */
+#define DLM_LOCK_CR		1	/* concurrent read */
+#define DLM_LOCK_CW		2	/* concurrent write */
+#define DLM_LOCK_PR		3	/* protected read */
+#define DLM_LOCK_PW		4	/* protected write */
+#define DLM_LOCK_EX		5	/* exclusive */
+
+/*
+ * Maximum size in bytes of a dlm_lock name
+ */
+
+#define DLM_RESNAME_MAXLEN	64
+
+/*
+ * Flags to dlm_lock
+ *
+ * DLM_LKF_NOQUEUE
+ *
+ * Do not queue the lock request on the wait queue if it cannot be granted
+ * immediately.  If the lock cannot be granted because of this flag, DLM will
+ * either return -EAGAIN from the dlm_lock call or will return 0 from
+ * dlm_lock and -EAGAIN in the lock status block when the AST is executed.
+ *
+ * DLM_LKF_CANCEL
+ *
+ * Used to cancel a pending lock request or conversion.  A converting lock is
+ * returned to its previously granted mode.
+ *
+ * DLM_LKF_CONVERT
+ *
+ * Indicates a lock conversion request.  For conversions the name and namelen
+ * are ignored and the lock ID in the LKSB is used to identify the lock.
+ *
+ * DLM_LKF_VALBLK
+ *
+ * Requests DLM to return the current contents of the lock value block in the
+ * lock status block.  When this flag is set in a lock conversion from PW or EX
+ * modes, DLM assigns the value specified in the lock status block to the lock
+ * value block of the lock resource.  The LVB is a DLM_LVB_LEN size array
+ * containing application-specific information.
+ *
+ * DLM_LKF_QUECVT
+ *
+ * Force a conversion request to be queued, even if it is compatible with
+ * the granted modes of other locks on the same resource.
+ *
+ * DLM_LKF_IVVALBLK
+ *
+ * Invalidate the lock value block.
+ *
+ * DLM_LKF_CONVDEADLK
+ *
+ * Allows the dlm to resolve conversion deadlocks internally by demoting the
+ * granted mode of a converting lock to NL.  The DLM_SBF_DEMOTED flag is
+ * returned for a conversion that's been effected by this.
+ *
+ * DLM_LKF_PERSISTENT
+ *
+ * Only relevant to locks originating in userspace.  A persistent lock will not
+ * be removed if the process holding the lock exits.
+ *
+ * DLM_LKF_NODLKWT
+ * DLM_LKF_NODLCKBLK
+ *
+ * net yet implemented
+ *
+ * DLM_LKF_EXPEDITE
+ *
+ * Used only with new requests for NL mode locks.  Tells the lock manager
+ * to grant the lock, ignoring other locks in convert and wait queues.
+ *
+ * DLM_LKF_NOQUEUEBAST
+ *
+ * Send blocking AST's before returning -EAGAIN to the caller.  It is only
+ * used along with the NOQUEUE flag.  Blocking AST's are not sent for failed
+ * NOQUEUE requests otherwise.
+ *
+ * DLM_LKF_HEADQUE
+ *
+ * Add a lock to the head of the convert or wait queue rather than the tail.
+ *
+ * DLM_LKF_NOORDER
+ *
+ * Disregard the standard grant order rules and grant a lock as soon as it
+ * is compatible with other granted locks.
+ *
+ * DLM_LKF_ORPHAN
+ *
+ * not yet implemented
+ *
+ * DLM_LKF_ALTPR
+ *
+ * If the requested mode cannot be granted immediately, try to grant the lock
+ * in PR mode instead.  If this alternate mode is granted instead of the
+ * requested mode, DLM_SBF_ALTMODE is returned in the lksb.
+ *
+ * DLM_LKF_ALTCW
+ *
+ * The same as ALTPR, but the alternate mode is CW.
+ */
+
+#define DLM_LKF_NOQUEUE		0x00000001
+#define DLM_LKF_CANCEL		0x00000002
+#define DLM_LKF_CONVERT		0x00000004
+#define DLM_LKF_VALBLK		0x00000008
+#define DLM_LKF_QUECVT		0x00000010
+#define DLM_LKF_IVVALBLK	0x00000020
+#define DLM_LKF_CONVDEADLK	0x00000040
+#define DLM_LKF_PERSISTENT	0x00000080
+#define DLM_LKF_NODLCKWT	0x00000100
+#define DLM_LKF_NODLCKBLK	0x00000200
+#define DLM_LKF_EXPEDITE	0x00000400
+#define DLM_LKF_NOQUEUEBAST	0x00000800
+#define DLM_LKF_HEADQUE		0x00001000
+#define DLM_LKF_NOORDER		0x00002000
+#define DLM_LKF_ORPHAN		0x00004000
+#define DLM_LKF_ALTPR		0x00008000
+#define DLM_LKF_ALTCW		0x00010000
+
+/*
+ * Some return codes that are not in errno.h
+ */
+
+#define DLM_ECANCEL		0x10001
+#define DLM_EUNLOCK		0x10002
+
+typedef void dlm_lockspace_t;
+
+/*
+ * Lock range structure
+ */
+
+struct dlm_range {
+	uint64_t ra_start;
+	uint64_t ra_end;
+};
+
+/*
+ * Lock status block
+ *
+ * Use this structure to specify the contents of the lock value block.  For a
+ * conversion request, this structure is used to specify the lock ID of the
+ * lock.  DLM writes the status of the lock request and the lock ID assigned
+ * to the request in the lock status block.
+ *
+ * sb_lkid: the returned lock ID.  It is set on new (non-conversion) requests.
+ * It is available when dlm_lock returns.
+ *
+ * sb_lvbptr: saves or returns the contents of the lock's LVB according to rules
+ * shown for the DLM_LKF_VALBLK flag.
+ *
+ * sb_flags: DLM_SBF_DEMOTED is returned if in the process of promoting a lock,
+ * it was first demoted to NL to avoid conversion deadlock.
+ * DLM_SBF_VALNOTVALID is returned if the resource's LVB is marked invalid.
+ *
+ * sb_status: the returned status of the lock request set prior to AST
+ * execution.  Possible return values:
+ *
+ * 0 if lock request was successful
+ * -EAGAIN if request would block and is flagged DLM_LKF_NOQUEUE
+ * -ENOMEM if there is no memory to process request
+ * -EINVAL if there are invalid parameters
+ * -DLM_EUNLOCK if unlock request was successful
+ * -DLM_ECANCEL if a cancel completed successfully
+ */
+
+#define DLM_SBF_DEMOTED		0x01
+#define DLM_SBF_VALNOTVALID	0x02
+#define DLM_SBF_ALTMODE		0x04
+
+struct dlm_lksb {
+	int 	 sb_status;
+	uint32_t sb_lkid;
+	char 	 sb_flags;
+	char *	 sb_lvbptr;
+};
+
+
+#ifdef __KERNEL__
+
+/*
+ * dlm_new_lockspace
+ *
+ * Starts a lockspace with the given name.  If the named lockspace exists in
+ * the cluster, the calling node joins it.
+ */
+
+int dlm_new_lockspace(char *name, int namelen, dlm_lockspace_t **lockspace,
+		      int flags, int lvblen);
+
+/*
+ * dlm_release_lockspace
+ *
+ * Stop a lockspace.
+ */
+
+int dlm_release_lockspace(dlm_lockspace_t *lockspace, int force);
+
+/*
+ * dlm_lock
+ *
+ * Make an asyncronous request to acquire or convert a lock on a named
+ * resource.
+ *
+ * lockspace: context for the request
+ * mode: the requested mode of the lock (DLM_LOCK_)
+ * lksb: lock status block for input and async return values
+ * flags: input flags (DLM_LKF_)
+ * name: name of the resource to lock, can be binary
+ * namelen: the length in bytes of the resource name (MAX_RESNAME_LEN)
+ * parent: the lock ID of a parent lock or 0 if none
+ * lockast: function DLM executes when it completes processing the request
+ * astarg: argument passed to lockast and bast functions
+ * bast: function DLM executes when this lock later blocks another request
+ *
+ * Returns:
+ * 0 if request is successfully queued for processing
+ * -EINVAL if any input parameters are invalid
+ * -EAGAIN if request would block and is flagged DLM_LKF_NOQUEUE
+ * -ENOMEM if there is no memory to process request
+ * -ENOTCONN if there is a communication error
+ *
+ * If the call to dlm_lock returns an error then the operation has failed and
+ * the AST routine will not be called.  If dlm_lock returns 0 it is still
+ * possible that the lock operation will fail. The AST routine will be called
+ * when the locking is complete and the status is returned in the lksb.
+ *
+ * If the AST routines or parameter are passed to a conversion operation then
+ * they will overwrite those values that were passed to a previous dlm_lock
+ * call.
+ *
+ * AST routines should not block (at least not for long), but may make
+ * any locking calls they please.
+ */
+
+int dlm_lock(dlm_lockspace_t *lockspace,
+	     int mode,
+	     struct dlm_lksb *lksb,
+	     uint32_t flags,
+	     void *name,
+	     unsigned int namelen,
+	     uint32_t parent_lkid,
+	     void (*lockast) (void *astarg),
+	     void *astarg,
+	     void (*bast) (void *astarg, int mode),
+	     struct dlm_range *range);
+
+/*
+ * dlm_unlock
+ *
+ * Asynchronously release a lock on a resource.  The AST routine is called
+ * when the resource is successfully unlocked.
+ *
+ * lockspace: context for the request
+ * lkid: the lock ID as returned in the lksb
+ * flags: input flags (DLM_LKF_)
+ * lksb: if NULL the lksb parameter passed to last lock request is used
+ * astarg: the arg used with the completion ast for the unlock
+ *
+ * Returns:
+ * 0 if request is successfully queued for processing
+ * -EINVAL if any input parameters are invalid
+ * -ENOTEMPTY if the lock still has sublocks
+ * -EBUSY if the lock is waiting for a remote lock operation
+ * -ENOTCONN if there is a communication error
+ */
+
+int dlm_unlock(dlm_lockspace_t *lockspace,
+	       uint32_t lkid,
+	       uint32_t flags,
+	       struct dlm_lksb *lksb,
+	       void *astarg);
+
+#endif				/* __KERNEL__ */
+
+#endif				/* __DLM_DOT_H__ */
+
--- a/drivers/dlm/dlm_internal.h	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/dlm_internal.h	2005-05-12 23:13:15.838484144 +0800
@@ -0,0 +1,508 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) Sistina Software, Inc.  1997-2003  All rights reserved.
+**  Copyright (C) 2004-2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#ifndef __DLM_INTERNAL_DOT_H__
+#define __DLM_INTERNAL_DOT_H__
+
+/*
+ * This is the main header file to be included in each DLM source file.
+ */
+
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include <linux/types.h>
+#include <linux/ctype.h>
+#include <linux/spinlock.h>
+#include <linux/vmalloc.h>
+#include <linux/list.h>
+#include <linux/errno.h>
+#include <linux/random.h>
+#include <linux/delay.h>
+#include <linux/socket.h>
+#include <linux/kthread.h>
+#include <linux/kobject.h>
+#include <linux/kref.h>
+#include <linux/kernel.h>
+#include <asm/semaphore.h>
+#include <asm/uaccess.h>
+
+#include <linux/dlm.h>
+
+#define DLM_LOCKSPACE_LEN	64
+
+#ifndef TRUE
+#define TRUE 1
+#endif
+
+#ifndef FALSE
+#define FALSE 0
+#endif
+
+#if (BITS_PER_LONG == 64)
+#define PRIx64 "lx"
+#else
+#define PRIx64 "Lx"
+#endif
+
+/* Size of the temp buffer midcomms allocates on the stack.
+   We try to make this large enough so most messages fit.
+   FIXME: should sctp make this unnecessary? */
+
+#define DLM_INBUF_LEN		148
+
+struct dlm_ls;
+struct dlm_lkb;
+struct dlm_rsb;
+struct dlm_member;
+struct dlm_lkbtable;
+struct dlm_rsbtable;
+struct dlm_dirtable;
+struct dlm_direntry;
+struct dlm_recover;
+struct dlm_header;
+struct dlm_message;
+struct dlm_rcom;
+struct dlm_mhandle;
+
+#define log_print(fmt, args...) \
+	printk(KERN_ERR "dlm: "fmt"\n", ##args)
+#define log_error(ls, fmt, args...) \
+	printk(KERN_ERR "dlm: %s: " fmt "\n", (ls)->ls_name, ##args)
+
+#ifdef CONFIG_DLM_DEBUG
+int dlm_create_debug_file(struct dlm_ls *ls);
+void dlm_delete_debug_file(struct dlm_ls *ls);
+#else
+static inline int dlm_create_debug_file(struct dlm_ls *ls) { return 0; }
+static inline void dlm_delete_debug_file(struct dlm_ls *ls) { }
+#endif
+
+#ifdef DLM_LOG_DEBUG
+#define log_debug(ls, fmt, args...) log_error(ls, fmt, ##args)
+#else
+#define log_debug(ls, fmt, args...)
+#endif
+
+#define DLM_ASSERT(x, do) \
+{ \
+  if (!(x)) \
+  { \
+    printk(KERN_ERR "\nDLM:  Assertion failed on line %d of file %s\n" \
+               "DLM:  assertion:  \"%s\"\n" \
+               "DLM:  time = %lu\n", \
+               __LINE__, __FILE__, #x, jiffies); \
+    {do} \
+    printk("\n"); \
+    BUG(); \
+    panic("DLM:  Record message above and reboot.\n"); \
+  } \
+}
+
+
+struct dlm_direntry {
+	struct list_head	list;
+	uint32_t		master_nodeid;
+	uint16_t		length;
+	char			name[1];
+};
+
+struct dlm_dirtable {
+	struct list_head	list;
+	rwlock_t		lock;
+};
+
+struct dlm_rsbtable {
+	struct list_head	list;
+	struct list_head	toss;
+	rwlock_t		lock;
+};
+
+struct dlm_lkbtable {
+	struct list_head	list;
+	rwlock_t		lock;
+	uint16_t		counter;
+};
+
+/*
+ * Lockspace member (per node in a ls)
+ */
+
+struct dlm_member {
+	struct list_head	list;
+	int			nodeid;
+	int			gone_event;
+};
+
+/*
+ * Save and manage recovery state for a lockspace.
+ */
+
+struct dlm_recover {
+	struct list_head	list;
+	int			*nodeids;
+	int			node_count;
+	int			event_id;
+};
+
+/*
+ * Pass input args to second stage locking function.
+ */
+
+struct dlm_args {
+	uint32_t		flags;
+	void			*astaddr;
+	long			astparam;
+	void			*bastaddr;
+	int			mode;
+	struct dlm_lksb		*lksb;
+	struct dlm_range	*range;
+};
+
+
+/*
+ * Lock block
+ *
+ * A lock can be one of three types:
+ *
+ * local copy      lock is mastered locally
+ *                 (lkb_nodeid is zero and DLM_LKF_MSTCPY is not set)
+ * process copy    lock is mastered on a remote node
+ *                 (lkb_nodeid is non-zero and DLM_LKF_MSTCPY is not set)
+ * master copy     master node's copy of a lock owned by remote node
+ *                 (lkb_nodeid is non-zero and DLM_LKF_MSTCPY is set)
+ *
+ * lkb_exflags: a copy of the most recent flags arg provided to dlm_lock or
+ * dlm_unlock.  The dlm does not modify these or use any private flags in
+ * this field; it only contains DLM_LKF_ flags from dlm.h.  These flags
+ * are sent as-is to the remote master when the lock is remote.
+ *
+ * lkb_flags: internal dlm flags (DLM_IFL_ prefix) from dlm_internal.h.
+ * Some internal flags are shared between the master and process nodes;
+ * these shared flags are kept in the lower two bytes.  One of these
+ * flags set on the master copy will be propagated to the process copy
+ * and v.v.  Other internal flags are private to the master or process
+ * node (e.g. DLM_IFL_MSTCPY).  These are kept in the high two bytes.
+ *
+ * lkb_sbflags: status block flags.  These flags are copied directly into
+ * the caller's lksb.sb_flags prior to the dlm_lock/dlm_unlock completion
+ * ast.  All defined in dlm.h with DLM_SBF_ prefix.
+ *
+ * lkb_status: the lock status indicates which rsb queue the lock is
+ * on, grant, convert, or wait.  DLM_LKSTS_ WAITING/GRANTED/CONVERT
+ *
+ * lkb_wait_type: the dlm message type (DLM_MSG_ prefix) for which a
+ * reply is needed.  Only set when the lkb is on the lockspace waiters
+ * list awaiting a reply from a remote node.
+ *
+ * lkb_nodeid: when the lkb is a local copy, nodeid is 0; when the lkb
+ * is a master copy, nodeid specifies the remote lock holder, when the
+ * lkb is a process copy, the nodeid specifies the lock master.
+ */
+
+/* lkb_ast_type */
+
+#define AST_COMP		1
+#define AST_BAST		2
+
+/* lkb_range[] */
+
+#define GR_RANGE_START		0
+#define GR_RANGE_END		1
+#define RQ_RANGE_START		2
+#define RQ_RANGE_END		3
+
+/* lkb_status */
+
+#define DLM_LKSTS_WAITING	1
+#define DLM_LKSTS_GRANTED	2
+#define DLM_LKSTS_CONVERT	3
+
+/* lkb_flags */
+
+#define DLM_IFL_MSTCPY		0x00010000
+#define DLM_IFL_RESEND		0x00020000
+#define DLM_IFL_RANGE		0x00000001
+
+struct dlm_lkb {
+	struct dlm_rsb		*lkb_resource;	/* the rsb */
+	struct kref		lkb_ref;
+	int			lkb_nodeid;	/* copied from rsb */
+	int			lkb_ownpid;	/* pid of lock owner */
+	uint32_t		lkb_id;		/* our lock ID */
+	uint32_t		lkb_remid;	/* lock ID on remote partner */
+	uint32_t		lkb_exflags;	/* external flags from caller */
+	uint32_t		lkb_sbflags;	/* lksb flags */
+	uint32_t		lkb_flags;	/* internal flags */
+	uint32_t		lkb_lvbseq;	/* lvb sequence number */
+
+	int8_t			lkb_status;     /* granted, waiting, convert */
+	int8_t			lkb_rqmode;	/* requested lock mode */
+	int8_t			lkb_grmode;	/* granted lock mode */
+	int8_t			lkb_bastmode;	/* requested mode */
+	int8_t			lkb_highbast;	/* highest mode bast sent for */
+
+	int8_t			lkb_wait_type;	/* type of reply waiting for */
+	int8_t			lkb_ast_type;	/* type of ast queued for */
+
+	struct list_head	lkb_idtbl_list;	/* lockspace lkbtbl */
+	struct list_head	lkb_statequeue;	/* rsb g/c/w list */
+	struct list_head	lkb_rsb_lookup;	/* waiting for rsb lookup */
+	struct list_head	lkb_wait_reply;	/* waiting for remote reply */
+	struct list_head	lkb_astqueue;	/* need ast to be sent */
+
+	uint64_t		*lkb_range;	/* array of gr/rq ranges */
+	char			*lkb_lvbptr;
+	struct dlm_lksb		*lkb_lksb;      /* caller's status block */
+	void			*lkb_astaddr;	/* caller's ast function */
+	void			*lkb_bastaddr;	/* caller's bast function */
+	long			lkb_astparam;	/* caller's ast arg */
+};
+
+
+/* find_rsb() flags */
+
+#define R_MASTER		1	/* only return rsb if it's a master */
+#define R_CREATE		2	/* create/add rsb if not found */
+
+#define RESFL_MASTER_WAIT	0
+#define RESFL_MASTER_UNCERTAIN	1
+#define RESFL_VALNOTVALID	2
+#define RESFL_VALNOTVALID_PREV	3
+#define RESFL_NEW_MASTER	4
+#define RESFL_NEW_MASTER2	5
+#define RESFL_RECOVER_CONVERT	6
+
+struct dlm_rsb {
+	struct dlm_ls		*res_ls;	/* the lockspace */
+	struct kref		res_ref;
+	struct semaphore	res_sem;
+	unsigned long		res_flags;	/* RESFL_ */
+	int			res_length;	/* length of rsb name */
+	int			res_nodeid;
+	uint32_t                res_lvbseq;
+	uint32_t		res_bucket;	/* rsbtbl */
+	unsigned long		res_toss_time;
+	uint32_t		res_trial_lkid;	/* lkb trying lookup result */
+	struct list_head	res_lookup;	/* lkbs waiting lookup confirm*/
+	struct list_head	res_hashchain;	/* rsbtbl */
+	struct list_head	res_grantqueue;
+	struct list_head	res_convertqueue;
+	struct list_head	res_waitqueue;
+
+	struct list_head	res_root_list;	    /* used for recovery */
+	struct list_head	res_recover_list;   /* used for recovery */
+	int			res_recover_locks_count;
+
+	char			*res_lvbptr;
+	char			res_name[1];
+};
+
+
+/* dlm_header is first element of all structs sent between nodes */
+
+#define DLM_HEADER_MAJOR	0x00020000
+#define DLM_HEADER_MINOR	0x00000001
+
+#define DLM_MSG			1
+#define DLM_RCOM		2
+
+struct dlm_header {
+	uint32_t		h_version;
+	uint32_t		h_lockspace;
+	uint32_t		h_nodeid;	/* nodeid of sender */
+	uint16_t		h_length;
+	uint8_t			h_cmd;		/* DLM_MSG, DLM_RCOM */
+	uint8_t			h_pad;
+};
+
+
+#define DLM_MSG_REQUEST		1
+#define DLM_MSG_CONVERT		2
+#define DLM_MSG_UNLOCK		3
+#define DLM_MSG_CANCEL		4
+#define DLM_MSG_REQUEST_REPLY	5
+#define DLM_MSG_CONVERT_REPLY	6
+#define DLM_MSG_UNLOCK_REPLY	7
+#define DLM_MSG_CANCEL_REPLY	8
+#define DLM_MSG_GRANT		9
+#define DLM_MSG_BAST		10
+#define DLM_MSG_LOOKUP		11
+#define DLM_MSG_REMOVE		12
+#define DLM_MSG_LOOKUP_REPLY	13
+
+struct dlm_message {
+	struct dlm_header	m_header;
+	uint32_t		m_type;		/* DLM_MSG_ */
+	uint32_t		m_nodeid;
+	uint32_t		m_pid;
+	uint32_t		m_lkid;		/* lkid on sender */
+	uint32_t		m_remid;	/* lkid on receiver */
+	uint32_t		m_parent_lkid;
+	uint32_t		m_parent_remid;
+	uint32_t		m_exflags;
+	uint32_t		m_sbflags;
+	uint32_t		m_flags;
+	uint32_t		m_lvbseq;
+	int			m_status;
+	int			m_grmode;
+	int			m_rqmode;
+	int			m_bastmode;
+	int			m_asts;
+	int			m_result;	/* 0 or -EXXX */
+	uint64_t		m_range[2];
+	char			m_extra[0];	/* name or lvb */
+};
+
+
+#define DIR_VALID		0x00000001
+#define DIR_ALL_VALID		0x00000002
+#define NODES_VALID		0x00000004
+#define NODES_ALL_VALID		0x00000008
+#define LOCKS_VALID		0x00000010
+#define LOCKS_ALL_VALID		0x00000020
+
+#define DLM_RCOM_STATUS		1
+#define DLM_RCOM_NAMES		2
+#define DLM_RCOM_LOOKUP		3
+#define DLM_RCOM_LOCK		4
+#define DLM_RCOM_STATUS_REPLY	5
+#define DLM_RCOM_NAMES_REPLY	6
+#define DLM_RCOM_LOOKUP_REPLY	7
+#define DLM_RCOM_LOCK_REPLY	8
+
+struct dlm_rcom {
+	struct dlm_header	rc_header;
+	uint32_t		rc_type;	/* DLM_RCOM_ */
+	int			rc_result;	/* multi-purpose */
+	uint64_t		rc_id;		/* match reply with request */
+	char			rc_buf[0];
+};
+
+struct rcom_config {
+	uint32_t		rf_lvblen;
+	uint32_t		rf_lsflags;
+	uint64_t		rf_unused;
+};
+
+struct rcom_lock {
+	uint32_t		rl_ownpid;
+	uint32_t		rl_lkid;
+	uint32_t		rl_remid;
+	uint32_t		rl_parent_lkid;
+	uint32_t		rl_parent_remid;
+	uint32_t		rl_exflags;
+	uint32_t		rl_flags;
+	uint32_t		rl_lvbseq;
+	int			rl_result;
+	int8_t			rl_rqmode;
+	int8_t			rl_grmode;
+	int8_t			rl_status;
+	int8_t			rl_asts;
+	uint16_t		rl_wait_type;
+	uint16_t		rl_namelen;
+	uint64_t		rl_range[4];
+	char			rl_name[DLM_RESNAME_MAXLEN];
+	char			rl_lvb[0];
+};
+
+
+#define LSST_NONE		0
+#define LSST_INIT		1
+#define LSST_INIT_DONE		2
+#define LSST_CLEAR		3
+#define LSST_WAIT_START		4
+#define LSST_RECONFIG_DONE	5
+
+#define LSFL_WORK		0
+#define LSFL_LS_RUN		1
+#define LSFL_LS_STOP		2
+#define LSFL_LS_START		3
+#define LSFL_LS_FINISH		4
+#define LSFL_RCOM_READY		5
+#define LSFL_FINISH_RECOVERY	6
+#define LSFL_DIR_VALID		7
+#define LSFL_ALL_DIR_VALID	8
+#define LSFL_NODES_VALID	9
+#define LSFL_ALL_NODES_VALID	10
+#define LSFL_LS_TERMINATE	11
+#define LSFL_JOIN_DONE		12
+#define LSFL_LEAVE_DONE		13
+#define LSFL_LOCKS_VALID	14
+#define LSFL_ALL_LOCKS_VALID	15
+
+struct dlm_ls {
+	struct list_head	ls_list;	/* list of lockspaces */
+	uint32_t		ls_global_id;	/* global unique lockspace ID */
+	int			ls_lvblen;
+	int			ls_count;	/* reference count */
+	unsigned long		ls_flags;	/* LSFL_ */
+	struct kobject		ls_kobj;
+
+	struct dlm_rsbtable	*ls_rsbtbl;
+	uint32_t		ls_rsbtbl_size;
+
+	struct dlm_lkbtable	*ls_lkbtbl;
+	uint32_t		ls_lkbtbl_size;
+
+	struct dlm_dirtable	*ls_dirtbl;
+	uint32_t		ls_dirtbl_size;
+
+	struct semaphore	ls_waiters_sem;
+	struct list_head	ls_waiters;	/* lkbs needing a reply */
+
+	struct list_head	ls_nodes;	/* current nodes in ls */
+	struct list_head	ls_nodes_gone;	/* dead node list, recovery */
+	int			ls_num_nodes;	/* number of nodes in ls */
+	int			ls_low_nodeid;
+	int			*ls_node_array;
+	int			*ls_nodeids_next;
+	int			ls_nodeids_next_count;
+
+	struct dlm_rsb		ls_stub_rsb;	/* for returning errors */
+	struct dlm_lkb		ls_stub_lkb;	/* for returning errors */
+	struct dlm_message	ls_stub_ms;	/* for faking a reply */
+
+	struct dentry		*ls_debug_dentry; /* debugfs */
+
+	/* recovery related */
+
+	wait_queue_head_t	ls_wait_member;
+	struct task_struct	*ls_recoverd_task;
+	struct semaphore	ls_recoverd_active;
+	struct list_head	ls_recover;	/* dlm_recover structs */
+	spinlock_t		ls_recover_lock;
+	int			ls_last_stop;
+	int			ls_last_start;
+	int			ls_last_finish;
+	int			ls_startdone;
+	int			ls_state;	/* recovery states */
+
+	struct rw_semaphore	ls_in_recovery;	/* block local requests */
+	struct list_head	ls_requestqueue;/* queue remote requests */
+	struct semaphore	ls_requestqueue_lock;
+	char			*ls_recover_buf;
+	struct list_head	ls_recover_list;
+	spinlock_t		ls_recover_list_lock;
+	int			ls_recover_list_count;
+	wait_queue_head_t	ls_wait_general;
+
+	struct list_head	ls_root_list;	/* root resources */
+	struct rw_semaphore	ls_root_sem;	/* protect root_list */
+
+	int			ls_namelen;
+	char			ls_name[1];
+};
+
+#endif				/* __DLM_INTERNAL_DOT_H__ */
+
--- a/drivers/dlm/lvb_table.h	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/lvb_table.h	2005-05-12 23:13:15.829485512 +0800
@@ -0,0 +1,38 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) 2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#ifndef __LVB_TABLE_DOT_H__
+#define __LVB_TABLE_DOT_H__
+
+/*
+ * This defines the direction of transfer of LVB data.
+ * Granted mode is the row; requested mode is the column.
+ * Usage: matrix[grmode+1][rqmode+1]
+ * 1 = LVB is returned to the caller
+ * 0 = LVB is written to the resource
+ * -1 = nothing happens to the LVB
+ */
+
+const int dlm_lvb_operations[8][8] = {
+        /* UN   NL  CR  CW  PR  PW  EX  PD*/
+        {  -1,  1,  1,  1,  1,  1,  1, -1 }, /* UN */
+        {  -1,  1,  1,  1,  1,  1,  1,  0 }, /* NL */
+        {  -1, -1,  1,  1,  1,  1,  1,  0 }, /* CR */
+        {  -1, -1, -1,  1,  1,  1,  1,  0 }, /* CW */
+        {  -1, -1, -1, -1,  1,  1,  1,  0 }, /* PR */
+        {  -1,  0,  0,  0,  0,  0,  1,  0 }, /* PW */
+        {  -1,  0,  0,  0,  0,  0,  0,  0 }, /* EX */
+        {  -1,  0,  0,  0,  0,  0,  0,  0 }  /* PD */
+};
+
+#endif
+
--- a/drivers/dlm/ast.c	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/ast.c	2005-05-12 23:13:15.827485816 +0800
@@ -0,0 +1,166 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) Sistina Software, Inc.  1997-2003  All rights reserved.
+**  Copyright (C) 2004-2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#include "dlm_internal.h"
+#include "lock.h"
+
+#define WAKE_ASTS  0
+
+static struct list_head		ast_queue;
+static spinlock_t		ast_queue_lock;
+static struct task_struct *	astd_task;
+static unsigned long		astd_wakeflags;
+static struct semaphore		astd_running;
+
+
+void dlm_del_ast(struct dlm_lkb *lkb)
+{
+	spin_lock(&ast_queue_lock);
+	if (lkb->lkb_ast_type & (AST_COMP | AST_BAST))
+		list_del(&lkb->lkb_astqueue);
+	spin_unlock(&ast_queue_lock);
+}
+
+void dlm_add_ast(struct dlm_lkb *lkb, int type)
+{
+	spin_lock(&ast_queue_lock);
+	if (!(lkb->lkb_ast_type & (AST_COMP | AST_BAST))) {
+		kref_get(&lkb->lkb_ref);
+		list_add_tail(&lkb->lkb_astqueue, &ast_queue);
+	}
+	lkb->lkb_ast_type |= type;
+	spin_unlock(&ast_queue_lock);
+
+	set_bit(WAKE_ASTS, &astd_wakeflags);
+	wake_up_process(astd_task);
+}
+
+static void process_asts(void)
+{
+	struct dlm_ls *ls = NULL;
+	struct dlm_rsb *r = NULL;
+	struct dlm_lkb *lkb;
+	void (*cast) (long param);
+	void (*bast) (long param, int mode);
+	int type = 0, found, bmode;
+
+	for (;;) {
+		found = FALSE;
+		spin_lock(&ast_queue_lock);
+		list_for_each_entry(lkb, &ast_queue, lkb_astqueue) {
+			r = lkb->lkb_resource;
+			ls = r->res_ls;
+
+			if (!test_bit(LSFL_LS_RUN, &ls->ls_flags))
+				continue;
+
+			list_del(&lkb->lkb_astqueue);
+			type = lkb->lkb_ast_type;
+			lkb->lkb_ast_type = 0;
+			found = TRUE;
+			break;
+		}
+		spin_unlock(&ast_queue_lock);
+
+		if (!found)
+			break;
+
+		cast = lkb->lkb_astaddr;
+		bast = lkb->lkb_bastaddr;
+		bmode = lkb->lkb_bastmode;
+
+		if ((type & AST_COMP) && cast)
+			cast(lkb->lkb_astparam);
+
+		/* FIXME: Is it safe to look at lkb_grmode here
+		   without doing a lock_rsb() ?
+		   Look at other checks in v1 to avoid basts. */
+
+		if ((type & AST_BAST) && bast)
+			if (!dlm_modes_compat(lkb->lkb_grmode, bmode))
+				bast(lkb->lkb_astparam, bmode);
+
+		/* this removes the reference added by dlm_add_ast
+		   and may result in the lkb being freed */
+		dlm_put_lkb(lkb);
+
+		schedule();
+	}
+}
+
+static inline int no_asts(void)
+{
+	int ret;
+
+	spin_lock(&ast_queue_lock);
+	ret = list_empty(&ast_queue);
+	spin_unlock(&ast_queue_lock);
+	return ret;
+}
+
+static int dlm_astd(void *data)
+{
+	while (!kthread_should_stop()) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (!test_bit(WAKE_ASTS, &astd_wakeflags))
+			schedule();
+		set_current_state(TASK_RUNNING);
+
+		down(&astd_running);
+		if (test_and_clear_bit(WAKE_ASTS, &astd_wakeflags))
+			process_asts();
+		up(&astd_running);
+	}
+	return 0;
+}
+
+void dlm_astd_wake(void)
+{
+	if (!no_asts()) {
+		set_bit(WAKE_ASTS, &astd_wakeflags);
+		wake_up_process(astd_task);
+	}
+}
+
+int dlm_astd_start(void)
+{
+	struct task_struct *p;
+	int error = 0;
+
+	INIT_LIST_HEAD(&ast_queue);
+	spin_lock_init(&ast_queue_lock);
+	init_MUTEX(&astd_running);
+
+	p = kthread_run(dlm_astd, NULL, "dlm_astd");
+	if (IS_ERR(p))
+		error = PTR_ERR(p);
+	else
+		astd_task = p;
+	return error;
+}
+
+void dlm_astd_stop(void)
+{
+	kthread_stop(astd_task);
+}
+
+void dlm_astd_suspend(void)
+{
+	down(&astd_running);
+}
+
+void dlm_astd_resume(void)
+{
+	up(&astd_running);
+}
+
--- a/drivers/dlm/ast.h	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/ast.h	2005-05-12 23:13:15.827485816 +0800
@@ -0,0 +1,26 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) 2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#ifndef __ASTD_DOT_H__
+#define __ASTD_DOT_H__
+
+void dlm_add_ast(struct dlm_lkb *lkb, int type);
+void dlm_del_ast(struct dlm_lkb *lkb);
+
+void dlm_astd_wake(void);
+int dlm_astd_start(void);
+void dlm_astd_stop(void);
+void dlm_astd_suspend(void);
+void dlm_astd_resume(void);
+
+#endif
+
--- a/drivers/dlm/dir.c	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/dir.c	2005-05-12 23:13:15.828485664 +0800
@@ -0,0 +1,419 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) Sistina Software, Inc.  1997-2003  All rights reserved.
+**  Copyright (C) 2004-2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#include "dlm_internal.h"
+#include "lockspace.h"
+#include "member.h"
+#include "lowcomms.h"
+#include "rcom.h"
+#include "config.h"
+#include "memory.h"
+#include "recover.h"
+#include "util.h"
+#include "lock.h"
+
+
+static void put_free_de(struct dlm_ls *ls, struct dlm_direntry *de)
+{
+	spin_lock(&ls->ls_recover_list_lock);
+	list_add(&de->list, &ls->ls_recover_list);
+	spin_unlock(&ls->ls_recover_list_lock);
+}
+
+static struct dlm_direntry *get_free_de(struct dlm_ls *ls, int len)
+{
+	int found = FALSE;
+	struct dlm_direntry *de;
+
+	spin_lock(&ls->ls_recover_list_lock);
+	list_for_each_entry(de, &ls->ls_recover_list, list) {
+		if (de->length == len) {
+			list_del(&de->list);
+			de->master_nodeid = 0;
+			memset(de->name, 0, len);
+			found = TRUE;
+			break;
+		}
+	}
+	spin_unlock(&ls->ls_recover_list_lock);
+
+	if (!found)
+		de = allocate_direntry(ls, len);
+	return de;
+}
+
+void dlm_clear_free_entries(struct dlm_ls *ls)
+{
+	struct dlm_direntry *de;
+
+	spin_lock(&ls->ls_recover_list_lock);
+	while (!list_empty(&ls->ls_recover_list)) {
+		de = list_entry(ls->ls_recover_list.next, struct dlm_direntry,
+				list);
+		list_del(&de->list);
+		free_direntry(de);
+	}
+	spin_unlock(&ls->ls_recover_list_lock);
+}
+
+/*
+ * We use the upper 16 bits of the hash value to select the directory node.
+ * Low bits are used for distribution of rsb's among hash buckets on each node.
+ *
+ * To give the exact range wanted (0 to num_nodes-1), we apply a modulus of
+ * num_nodes to the hash value.  This value in the desired range is used as an
+ * offset into the sorted list of nodeid's to give the particular nodeid of the
+ * directory node.
+ */
+
+int dlm_dir_name2nodeid(struct dlm_ls *ls, char *name, int length)
+{
+	struct list_head *tmp;
+	struct dlm_member *memb = NULL;
+	uint32_t hash, node, n = 0;
+	int nodeid;
+
+	if (ls->ls_num_nodes == 1) {
+		nodeid = dlm_our_nodeid();
+		goto out;
+	}
+
+	hash = dlm_hash(name, length);
+	node = (hash >> 16) % ls->ls_num_nodes;
+
+	if (ls->ls_node_array) {
+		nodeid = ls->ls_node_array[node];
+		goto out;
+	}
+
+	list_for_each(tmp, &ls->ls_nodes) {
+		if (n++ != node)
+			continue;
+		memb = list_entry(tmp, struct dlm_member, list);
+		break;
+	}
+
+	DLM_ASSERT(memb , printk("num_nodes=%u n=%u node=%u\n",
+				 ls->ls_num_nodes, n, node););
+	nodeid = memb->nodeid;
+ out:
+	return nodeid;
+}
+
+int dlm_dir_nodeid(struct dlm_rsb *rsb)
+{
+	return dlm_dir_name2nodeid(rsb->res_ls, rsb->res_name, rsb->res_length);
+}
+
+static inline uint32_t dir_hash(struct dlm_ls *ls, char *name, int len)
+{
+	uint32_t val;
+
+	val = dlm_hash(name, len);
+	val &= (ls->ls_dirtbl_size - 1);
+
+	return val;
+}
+
+static void add_entry_to_hash(struct dlm_ls *ls, struct dlm_direntry *de)
+{
+	uint32_t bucket;
+
+	bucket = dir_hash(ls, de->name, de->length);
+	list_add_tail(&de->list, &ls->ls_dirtbl[bucket].list);
+}
+
+static struct dlm_direntry *search_bucket(struct dlm_ls *ls, char *name,
+					  int namelen, uint32_t bucket)
+{
+	struct dlm_direntry *de;
+
+	list_for_each_entry(de, &ls->ls_dirtbl[bucket].list, list) {
+		if (de->length == namelen && !memcmp(name, de->name, namelen))
+			goto out;
+	}
+	de = NULL;
+ out:
+	return de;
+}
+
+void dlm_dir_remove_entry(struct dlm_ls *ls, int nodeid, char *name, int namelen)
+{
+	struct dlm_direntry *de;
+	uint32_t bucket;
+
+	bucket = dir_hash(ls, name, namelen);
+
+	write_lock(&ls->ls_dirtbl[bucket].lock);
+
+	de = search_bucket(ls, name, namelen, bucket);
+
+	if (!de) {
+		log_error(ls, "remove fr %u none", nodeid);
+		goto out;
+	}
+
+	if (de->master_nodeid != nodeid) {
+		log_error(ls, "remove fr %u ID %u", nodeid, de->master_nodeid);
+		goto out;
+	}
+
+	list_del(&de->list);
+	free_direntry(de);
+ out:
+	write_unlock(&ls->ls_dirtbl[bucket].lock);
+}
+
+void dlm_dir_clear(struct dlm_ls *ls)
+{
+	struct list_head *head;
+	struct dlm_direntry *de;
+	int i;
+
+	DLM_ASSERT(list_empty(&ls->ls_recover_list), );
+
+	for (i = 0; i < ls->ls_dirtbl_size; i++) {
+		write_lock(&ls->ls_dirtbl[i].lock);
+		head = &ls->ls_dirtbl[i].list;
+		while (!list_empty(head)) {
+			de = list_entry(head->next, struct dlm_direntry, list);
+			list_del(&de->list);
+			put_free_de(ls, de);
+		}
+		write_unlock(&ls->ls_dirtbl[i].lock);
+	}
+}
+
+int dlm_recover_directory(struct dlm_ls *ls)
+{
+	struct dlm_member *memb;
+	struct dlm_direntry *de;
+	char *b, *last_name;
+	int error = -ENOMEM, last_len, count = 0;
+	uint16_t namelen;
+
+	log_debug(ls, "dlm_recover_directory");
+
+	dlm_dir_clear(ls);
+
+	last_name = kmalloc(DLM_RESNAME_MAXLEN, GFP_KERNEL);
+	if (!last_name)
+		goto out;
+
+	list_for_each_entry(memb, &ls->ls_nodes, list) {
+		memset(last_name, 0, DLM_RESNAME_MAXLEN);
+		last_len = 0;
+
+		for (;;) {
+			error = dlm_recovery_stopped(ls);
+			if (error)
+				goto free_last;
+
+			error = dlm_rcom_names(ls, memb->nodeid,
+					       last_name, last_len);
+			if (error)
+				goto free_last;
+
+			schedule();
+
+			/*
+			 * pick namelen/name pairs out of received buffer
+			 */
+
+			b = ls->ls_recover_buf + sizeof(struct dlm_rcom);
+
+			for (;;) {
+				memcpy(&namelen, b, sizeof(uint16_t));
+				namelen = be16_to_cpu(namelen);
+				b += sizeof(uint16_t);
+
+				/* namelen of 0xFFFFF marks end of names for
+				   this node; namelen of 0 marks end of the
+				   buffer */
+
+				if (namelen == 0xFFFF)
+					goto done;
+				if (!namelen)
+					break;
+
+				error = -ENOMEM;
+				de = get_free_de(ls, namelen);
+				if (!de)
+					goto free_last;
+
+				de->master_nodeid = memb->nodeid;
+				de->length = namelen;
+				last_len = namelen;
+				memcpy(de->name, b, namelen);
+				memcpy(last_name, b, namelen);
+				b += namelen;
+
+				add_entry_to_hash(ls, de);
+				count++;
+			}
+		}
+         done:
+		;
+	}
+
+	set_bit(LSFL_DIR_VALID, &ls->ls_flags);
+	error = 0;
+
+	log_debug(ls, "dlm_recover_directory %d entries", count);
+
+ free_last:
+	kfree(last_name);
+ out:
+	dlm_clear_free_entries(ls);
+	return error;
+}
+
+static int get_entry(struct dlm_ls *ls, int nodeid, char *name,
+		     int namelen, int *r_nodeid)
+{
+	struct dlm_direntry *de, *tmp;
+	uint32_t bucket;
+
+	bucket = dir_hash(ls, name, namelen);
+
+	write_lock(&ls->ls_dirtbl[bucket].lock);
+	de = search_bucket(ls, name, namelen, bucket);
+	if (de) {
+		*r_nodeid = de->master_nodeid;
+		write_unlock(&ls->ls_dirtbl[bucket].lock);
+		if (*r_nodeid == nodeid)
+			return -EEXIST;
+		return 0;
+	}
+
+	write_unlock(&ls->ls_dirtbl[bucket].lock);
+
+	de = allocate_direntry(ls, namelen);
+	if (!de)
+		return -ENOMEM;
+
+	de->master_nodeid = nodeid;
+	de->length = namelen;
+	memcpy(de->name, name, namelen);
+
+	write_lock(&ls->ls_dirtbl[bucket].lock);
+	tmp = search_bucket(ls, name, namelen, bucket);
+	if (tmp) {
+		free_direntry(de);
+		de = tmp;
+	} else {
+		list_add_tail(&de->list, &ls->ls_dirtbl[bucket].list);
+	}
+	*r_nodeid = de->master_nodeid;
+	write_unlock(&ls->ls_dirtbl[bucket].lock);
+	return 0;
+}
+
+int dlm_dir_lookup(struct dlm_ls *ls, int nodeid, char *name, int namelen,
+		   int *r_nodeid)
+{
+	return get_entry(ls, nodeid, name, namelen, r_nodeid);
+}
+
+/* Copy the names of master rsb's into the buffer provided.
+   Only select names whose dir node is the given nodeid. */
+
+void dlm_copy_master_names(struct dlm_ls *ls, char *inbuf, int inlen,
+ 			   char *outbuf, int outlen, int nodeid)
+{
+	struct list_head *list;
+	struct dlm_rsb *start_r = NULL, *r = NULL;
+	int offset = 0, start_namelen, error, dir_nodeid;
+	char *start_name;
+	uint16_t be_namelen;
+
+	/*
+	 * Find the rsb where we left off (or start again)
+	 */
+
+	start_namelen = inlen;
+	start_name = inbuf;
+
+	if (start_namelen > 1) {
+		/*
+		 * We could also use a find_rsb_root() function here that
+		 * searched the ls_root_list.
+		 */
+		error = dlm_find_rsb(ls, start_name, start_namelen, R_MASTER,
+				     &start_r);
+		DLM_ASSERT(!error && start_r,
+			   printk("error %d\n", error););
+		DLM_ASSERT(!list_empty(&start_r->res_root_list),
+			   dlm_print_rsb(start_r););
+		dlm_put_rsb(start_r);
+	}
+
+	/*
+	 * Send rsb names for rsb's we're master of and whose directory node
+	 * matches the requesting node.
+	 */
+
+	down_read(&ls->ls_root_sem);
+	if (start_r)
+		list = start_r->res_root_list.next;
+	else
+		list = ls->ls_root_list.next;
+
+	for (offset = 0; list != &ls->ls_root_list; list = list->next) {
+		r = list_entry(list, struct dlm_rsb, res_root_list);
+		if (r->res_nodeid)
+			continue;
+
+		dir_nodeid = dlm_dir_nodeid(r);
+		if (dir_nodeid != nodeid)
+			continue;
+		
+		/*
+		 * The block ends when we can't fit the following in the
+		 * remaining buffer space:
+		 * namelen (uint16_t) +
+		 * name (r->res_length) +
+		 * end-of-block record 0x0000 (uint16_t)
+		 */
+
+		if (offset + sizeof(uint16_t)*2 + r->res_length > outlen) {
+			/* Write end-of-block record */
+			be_namelen = 0;
+			memcpy(outbuf + offset, &be_namelen, sizeof(uint16_t));
+			offset += sizeof(uint16_t);
+			goto out;
+		}
+
+		be_namelen = cpu_to_be16(r->res_length);
+		memcpy(outbuf + offset, &be_namelen, sizeof(uint16_t));
+		offset += sizeof(uint16_t);
+		memcpy(outbuf + offset, r->res_name, r->res_length);
+		offset += r->res_length;
+	}
+
+	/*
+	 * If we've reached the end of the list (and there's room) write a
+	 * terminating record.
+	 */
+
+	if ((list == &ls->ls_root_list) &&
+	    (offset + sizeof(uint16_t) <= outlen)) {
+		be_namelen = 0xFFFF;
+		memcpy(outbuf + offset, &be_namelen, sizeof(uint16_t));
+		offset += sizeof(uint16_t);
+	}
+
+ out:
+	up_read(&ls->ls_root_sem);
+}
+
--- a/drivers/dlm/dir.h	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/dir.h	2005-05-12 23:13:15.828485664 +0800
@@ -0,0 +1,30 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) Sistina Software, Inc.  1997-2003  All rights reserved.
+**  Copyright (C) 2004-2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#ifndef __DIR_DOT_H__
+#define __DIR_DOT_H__
+
+
+int dlm_dir_nodeid(struct dlm_rsb *rsb);
+int dlm_dir_name2nodeid(struct dlm_ls *ls, char *name, int length);
+void dlm_dir_remove_entry(struct dlm_ls *ls, int nodeid, char *name, int len);
+void dlm_dir_clear(struct dlm_ls *ls);
+void dlm_clear_free_entries(struct dlm_ls *ls);
+int dlm_recover_directory(struct dlm_ls *ls);
+int dlm_dir_lookup(struct dlm_ls *ls, int nodeid, char *name, int namelen,
+	int *r_nodeid);
+void dlm_copy_master_names(struct dlm_ls *ls, char *inbuf, int inlen,
+	char *outbuf, int outlen, int nodeid);
+
+#endif				/* __DIR_DOT_H__ */
+
--- a/drivers/dlm/lockspace.c	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/lockspace.c	2005-05-12 23:13:15.829485512 +0800
@@ -0,0 +1,537 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) Sistina Software, Inc.  1997-2003  All rights reserved.
+**  Copyright (C) 2004-2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#include "dlm_internal.h"
+#include "lockspace.h"
+#include "member.h"
+#include "member_sysfs.h"
+#include "recoverd.h"
+#include "ast.h"
+#include "dir.h"
+#include "lowcomms.h"
+#include "config.h"
+#include "memory.h"
+#include "lock.h"
+
+static int			ls_count;
+static struct semaphore		ls_lock;
+static struct list_head		lslist;
+static spinlock_t		lslist_lock;
+static struct task_struct *	scand_task;
+
+
+int dlm_lockspace_init(void)
+{
+	ls_count = 0;
+	init_MUTEX(&ls_lock);
+	INIT_LIST_HEAD(&lslist);
+	spin_lock_init(&lslist_lock);
+	return 0;
+}
+
+int dlm_scand(void *data)
+{
+	struct dlm_ls *ls;
+
+	while (!kthread_should_stop()) {
+		list_for_each_entry(ls, &lslist, ls_list)
+			dlm_scan_rsbs(ls);
+		set_current_state(TASK_INTERRUPTIBLE);
+		schedule_timeout(dlm_config.scan_secs * HZ);
+	}
+	return 0;
+}
+
+int dlm_scand_start(void)
+{
+	struct task_struct *p;
+	int error = 0;
+
+	p = kthread_run(dlm_scand, NULL, "dlm_scand");
+	if (IS_ERR(p))
+		error = PTR_ERR(p);
+	else
+		scand_task = p;
+	return error;
+}
+
+void dlm_scand_stop(void)
+{
+	kthread_stop(scand_task);
+}
+
+static struct dlm_ls *find_lockspace_name(char *name, int namelen)
+{
+	struct dlm_ls *ls;
+
+	spin_lock(&lslist_lock);
+
+	list_for_each_entry(ls, &lslist, ls_list) {
+		if (ls->ls_namelen == namelen &&
+		    memcmp(ls->ls_name, name, namelen) == 0)
+			goto out;
+	}
+	ls = NULL;
+ out:
+	spin_unlock(&lslist_lock);
+	return ls;
+}
+
+struct dlm_ls *dlm_find_lockspace_global(uint32_t id)
+{
+	struct dlm_ls *ls;
+
+	spin_lock(&lslist_lock);
+
+	list_for_each_entry(ls, &lslist, ls_list) {
+		if (ls->ls_global_id == id) {
+			ls->ls_count++;
+			goto out;
+		}
+	}
+	ls = NULL;
+ out:
+	spin_unlock(&lslist_lock);
+	return ls;
+}
+
+struct dlm_ls *dlm_find_lockspace_local(void *id)
+{
+	struct dlm_ls *ls = id;
+
+	spin_lock(&lslist_lock);
+	ls->ls_count++;
+	spin_unlock(&lslist_lock);
+	return ls;
+}
+
+void dlm_put_lockspace(struct dlm_ls *ls)
+{
+	spin_lock(&lslist_lock);
+	ls->ls_count--;
+	spin_unlock(&lslist_lock);
+}
+
+static void remove_lockspace(struct dlm_ls *ls)
+{
+	for (;;) {
+		spin_lock(&lslist_lock);
+		if (ls->ls_count == 0) {
+			list_del(&ls->ls_list);
+			spin_unlock(&lslist_lock);
+			return;
+		}
+		spin_unlock(&lslist_lock);
+		ssleep(1);
+	}
+}
+
+static int threads_start(void)
+{
+	int error;
+
+	/* Thread which process lock requests for all lockspace's */
+	error = dlm_astd_start();
+	if (error) {
+		log_print("cannot start dlm_astd thread %d", error);
+		goto fail;
+	}
+
+	error = dlm_scand_start();
+	if (error) {
+		log_print("cannot start dlm_scand thread %d", error);
+		goto astd_fail;
+	}
+
+	/* Thread for sending/receiving messages for all lockspace's */
+	error = dlm_lowcomms_start();
+	if (error) {
+		log_print("cannot start dlm lowcomms %d", error);
+		goto scand_fail;
+	}
+
+	return 0;
+
+ scand_fail:
+	dlm_scand_stop();
+ astd_fail:
+	dlm_astd_stop();
+ fail:
+	return error;
+}
+
+static void threads_stop(void)
+{
+	dlm_scand_stop();
+	dlm_lowcomms_stop();
+	dlm_astd_stop();
+}
+
+static int new_lockspace(char *name, int namelen, void **lockspace, int flags,
+			 int lvblen)
+{
+	struct dlm_ls *ls;
+	int i, size, error = -ENOMEM;
+
+	if (namelen > DLM_LOCKSPACE_LEN)
+		return -EINVAL;
+
+	if (!lvblen || (lvblen % 8))
+		return -EINVAL;
+
+	if (!try_module_get(THIS_MODULE))
+		return -EINVAL;
+
+	ls = find_lockspace_name(name, namelen);
+	if (ls) {
+		*lockspace = ls;
+		module_put(THIS_MODULE);
+		return -EEXIST;
+	}
+
+	ls = kmalloc(sizeof(struct dlm_ls) + namelen, GFP_KERNEL);
+	if (!ls)
+		goto out;
+	memset(ls, 0, sizeof(struct dlm_ls) + namelen);
+	memcpy(ls->ls_name, name, namelen);
+	ls->ls_namelen = namelen;
+	ls->ls_lvblen = lvblen;
+	ls->ls_count = 0;
+	ls->ls_flags = 0;
+
+	size = dlm_config.rsbtbl_size;
+	ls->ls_rsbtbl_size = size;
+
+	ls->ls_rsbtbl = kmalloc(sizeof(struct dlm_rsbtable) * size, GFP_KERNEL);
+	if (!ls->ls_rsbtbl)
+		goto out_lsfree;
+	for (i = 0; i < size; i++) {
+		INIT_LIST_HEAD(&ls->ls_rsbtbl[i].list);
+		INIT_LIST_HEAD(&ls->ls_rsbtbl[i].toss);
+		rwlock_init(&ls->ls_rsbtbl[i].lock);
+	}
+
+	size = dlm_config.lkbtbl_size;
+	ls->ls_lkbtbl_size = size;
+
+	ls->ls_lkbtbl = kmalloc(sizeof(struct dlm_lkbtable) * size, GFP_KERNEL);
+	if (!ls->ls_lkbtbl)
+		goto out_rsbfree;
+	for (i = 0; i < size; i++) {
+		INIT_LIST_HEAD(&ls->ls_lkbtbl[i].list);
+		rwlock_init(&ls->ls_lkbtbl[i].lock);
+		ls->ls_lkbtbl[i].counter = 1;
+	}
+
+	size = dlm_config.dirtbl_size;
+	ls->ls_dirtbl_size = size;
+
+	ls->ls_dirtbl = kmalloc(sizeof(struct dlm_dirtable) * size, GFP_KERNEL);
+	if (!ls->ls_dirtbl)
+		goto out_lkbfree;
+	for (i = 0; i < size; i++) {
+		INIT_LIST_HEAD(&ls->ls_dirtbl[i].list);
+		rwlock_init(&ls->ls_dirtbl[i].lock);
+	}
+
+	ls->ls_recover_buf = kmalloc(dlm_config.buffer_size, GFP_KERNEL);
+	if (!ls->ls_recover_buf)
+		goto out_dirfree;
+
+	init_waitqueue_head(&ls->ls_wait_member);
+	INIT_LIST_HEAD(&ls->ls_nodes);
+	INIT_LIST_HEAD(&ls->ls_nodes_gone);
+	INIT_LIST_HEAD(&ls->ls_waiters);
+	ls->ls_num_nodes = 0;
+	ls->ls_node_array = NULL;
+	ls->ls_recoverd_task = NULL;
+	init_MUTEX(&ls->ls_recoverd_active);
+	INIT_LIST_HEAD(&ls->ls_recover);
+	spin_lock_init(&ls->ls_recover_lock);
+	INIT_LIST_HEAD(&ls->ls_recover_list);
+	ls->ls_recover_list_count = 0;
+	spin_lock_init(&ls->ls_recover_list_lock);
+	init_waitqueue_head(&ls->ls_wait_general);
+	INIT_LIST_HEAD(&ls->ls_root_list);
+	INIT_LIST_HEAD(&ls->ls_requestqueue);
+	ls->ls_last_stop = 0;
+	ls->ls_last_start = 0;
+	ls->ls_last_finish = 0;
+	init_MUTEX(&ls->ls_waiters_sem);
+	init_MUTEX(&ls->ls_requestqueue_lock);
+	init_rwsem(&ls->ls_root_sem);
+	init_rwsem(&ls->ls_in_recovery);
+
+	memset(&ls->ls_stub_rsb, 0, sizeof(struct dlm_rsb));
+	ls->ls_stub_rsb.res_ls = ls;
+
+	down_write(&ls->ls_in_recovery);
+
+	error = dlm_recoverd_start(ls);
+	if (error) {
+		log_error(ls, "can't start dlm_recoverd %d", error);
+		goto out_rcomfree;
+	}
+
+	ls->ls_state = LSST_INIT;
+
+	spin_lock(&lslist_lock);
+	list_add(&ls->ls_list, &lslist);
+	spin_unlock(&lslist_lock);
+
+	dlm_create_debug_file(ls);
+
+	error = dlm_kobject_setup(ls);
+	if (error)
+		goto out_del;
+
+	error = kobject_register(&ls->ls_kobj);
+	if (error)
+		goto out_del;
+
+	kobject_uevent(&ls->ls_kobj, KOBJ_ONLINE, NULL);
+
+	/* Now we depend on userspace to notice the new ls, join it and
+	   give us a start or terminate.  The ls isn't actually running
+	   until it receives a start. */
+
+	error = wait_event_interruptible(ls->ls_wait_member,
+				test_bit(LSFL_JOIN_DONE, &ls->ls_flags));
+	if (error)
+		goto out_unreg;
+
+	if (test_bit(LSFL_LS_TERMINATE, &ls->ls_flags)) {
+		error = -ERESTARTSYS;
+		goto out_unreg;
+	}
+
+	*lockspace = ls;
+	return 0;
+
+ out_unreg:
+	kobject_unregister(&ls->ls_kobj);
+ out_del:
+	dlm_delete_debug_file(ls);
+	spin_lock(&lslist_lock);
+	list_del(&ls->ls_list);
+	spin_unlock(&lslist_lock);
+	dlm_recoverd_stop(ls);
+ out_rcomfree:
+	kfree(ls->ls_recover_buf);
+ out_dirfree:
+	kfree(ls->ls_dirtbl);
+ out_lkbfree:
+	kfree(ls->ls_lkbtbl);
+ out_rsbfree:
+	kfree(ls->ls_rsbtbl);
+ out_lsfree:
+	kfree(ls);
+ out:
+	module_put(THIS_MODULE);
+	return error;
+}
+
+int dlm_new_lockspace(char *name, int namelen, void **lockspace, int flags,
+		      int lvblen)
+{
+	int error = 0;
+
+	down(&ls_lock);
+	if (!ls_count)
+		error = threads_start();
+	if (error)
+		goto out;
+
+	error = new_lockspace(name, namelen, lockspace, flags, lvblen);
+	if (!error)
+		ls_count++;
+ out:
+	up(&ls_lock);
+	return error;
+}
+
+/* Return 1 if the lockspace still has active remote locks,
+ *        2 if the lockspace still has active local locks.
+ */
+static int lockspace_busy(struct dlm_ls *ls)
+{
+	int i, lkb_found = 0;
+	struct dlm_lkb *lkb;
+
+	/* NOTE: We check the lockidtbl here rather than the resource table.
+	   This is because there may be LKBs queued as ASTs that have been
+	   unlinked from their RSBs and are pending deletion once the AST has
+	   been delivered */
+
+	for (i = 0; i < ls->ls_lkbtbl_size; i++) {
+		read_lock(&ls->ls_lkbtbl[i].lock);
+		if (!list_empty(&ls->ls_lkbtbl[i].list)) {
+			lkb_found = 1;
+			list_for_each_entry(lkb, &ls->ls_lkbtbl[i].list,
+					    lkb_idtbl_list) {
+				if (!lkb->lkb_nodeid) {
+					read_unlock(&ls->ls_lkbtbl[i].lock);
+					return 2;
+				}
+			}
+		}
+		read_unlock(&ls->ls_lkbtbl[i].lock);
+	}
+	return lkb_found;
+}
+
+static int release_lockspace(struct dlm_ls *ls, int force)
+{
+	struct dlm_lkb *lkb;
+	struct dlm_rsb *rsb;
+	struct dlm_recover *rv;
+	struct list_head *head;
+	int i, error;
+	int busy = lockspace_busy(ls);
+
+	if (busy > force)
+		return -EBUSY;
+
+	if (force < 3) {
+		error = kobject_uevent(&ls->ls_kobj, KOBJ_OFFLINE, NULL);
+		error = wait_event_interruptible(ls->ls_wait_member,
+				test_bit(LSFL_LEAVE_DONE, &ls->ls_flags));
+	}
+
+	dlm_recoverd_stop(ls);
+
+	remove_lockspace(ls);
+
+	dlm_delete_debug_file(ls);
+
+	dlm_astd_suspend();
+
+	kfree(ls->ls_recover_buf);
+
+	/*
+	 * Free direntry structs.
+	 */
+
+	dlm_dir_clear(ls);
+	kfree(ls->ls_dirtbl);
+
+	/*
+	 * Free all lkb's on lkbtbl[] lists.
+	 */
+
+	for (i = 0; i < ls->ls_lkbtbl_size; i++) {
+		head = &ls->ls_lkbtbl[i].list;
+		while (!list_empty(head)) {
+			lkb = list_entry(head->next, struct dlm_lkb,
+					 lkb_idtbl_list);
+
+			list_del(&lkb->lkb_idtbl_list);
+
+			dlm_del_ast(lkb);
+
+			if (lkb->lkb_lvbptr && lkb->lkb_flags & DLM_IFL_MSTCPY)
+				free_lvb(lkb->lkb_lvbptr);
+
+			free_lkb(lkb);
+		}
+	}
+	dlm_astd_resume();
+
+	kfree(ls->ls_lkbtbl);
+
+	/*
+	 * Free all rsb's on rsbtbl[] lists
+	 */
+
+	for (i = 0; i < ls->ls_rsbtbl_size; i++) {
+		head = &ls->ls_rsbtbl[i].list;
+		while (!list_empty(head)) {
+			rsb = list_entry(head->next, struct dlm_rsb,
+					 res_hashchain);
+
+			list_del(&rsb->res_hashchain);
+
+			if (rsb->res_lvbptr)
+				free_lvb(rsb->res_lvbptr);
+
+			free_rsb(rsb);
+		}
+
+		head = &ls->ls_rsbtbl[i].toss;
+		while (!list_empty(head)) {
+			rsb = list_entry(head->next, struct dlm_rsb,
+					 res_hashchain);
+			list_del(&rsb->res_hashchain);
+
+			if (rsb->res_lvbptr)
+				free_lvb(rsb->res_lvbptr);
+
+			free_rsb(rsb);
+		}
+	}
+
+	kfree(ls->ls_rsbtbl);
+
+	/*
+	 * Free structures on any other lists
+	 */
+
+	head = &ls->ls_recover;
+	while (!list_empty(head)) {
+		rv = list_entry(head->next, struct dlm_recover, list);
+		list_del(&rv->list);
+		kfree(rv);
+	}
+
+	dlm_clear_free_entries(ls);
+	dlm_clear_members(ls);
+	dlm_clear_members_gone(ls);
+	kfree(ls->ls_node_array);
+	kobject_unregister(&ls->ls_kobj);
+	kfree(ls);
+
+	down(&ls_lock);
+	ls_count--;
+	if (!ls_count)
+		threads_stop();
+	up(&ls_lock);
+	
+	module_put(THIS_MODULE);
+	return 0;
+}
+
+/*
+ * Called when a system has released all its locks and is not going to use the
+ * lockspace any longer.  We free everything we're managing for this lockspace.
+ * Remaining nodes will go through the recovery process as if we'd died.  The
+ * lockspace must continue to function as usual, participating in recoveries,
+ * until this returns.
+ *
+ * Force has 4 possible values:
+ * 0 - don't destroy locksapce if it has any LKBs
+ * 1 - destroy lockspace if it has remote LKBs but not if it has local LKBs
+ * 2 - destroy lockspace regardless of LKBs
+ * 3 - destroy lockspace as part of a forced shutdown
+ */
+
+int dlm_release_lockspace(void *lockspace, int force)
+{
+	struct dlm_ls *ls;
+
+	ls = dlm_find_lockspace_local(lockspace);
+	if (!ls)
+		return -EINVAL;
+	dlm_put_lockspace(ls);
+	return release_lockspace(ls, force);
+}
+
--- a/drivers/dlm/lockspace.h	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/lockspace.h	2005-05-12 23:13:15.829485512 +0800
@@ -0,0 +1,24 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) Sistina Software, Inc.  1997-2003  All rights reserved.
+**  Copyright (C) 2004-2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#ifndef __LOCKSPACE_DOT_H__
+#define __LOCKSPACE_DOT_H__
+
+int dlm_lockspace_init(void);
+struct dlm_ls *dlm_find_lockspace_global(uint32_t id);
+struct dlm_ls *dlm_find_lockspace_local(void *id);
+struct dlm_ls *dlm_find_lockspace_name(char *name, int namelen);
+void dlm_put_lockspace(struct dlm_ls *ls);
+
+#endif				/* __LOCKSPACE_DOT_H__ */
+
--- a/drivers/dlm/main.c	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/main.c	2005-05-12 23:13:15.829485512 +0800
@@ -0,0 +1,91 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) Sistina Software, Inc.  1997-2003  All rights reserved.
+**  Copyright (C) 2004-2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#include "dlm_internal.h"
+#include "lockspace.h"
+#include "member_sysfs.h"
+#include "lock.h"
+#include "device.h"
+#include "memory.h"
+#include "lowcomms.h"
+
+int dlm_register_debugfs(void);
+void dlm_unregister_debugfs(void);
+int dlm_node_ioctl_init(void);
+void dlm_node_ioctl_exit(void);
+
+int __init init_dlm(void)
+{
+	int error;
+
+	error = dlm_memory_init();
+	if (error)
+		goto out;
+
+	error = dlm_lockspace_init();
+	if (error)
+		goto out_mem;
+
+	error = dlm_node_ioctl_init();
+	if (error)
+		goto out_mem;
+
+	error = dlm_member_sysfs_init();
+	if (error)
+		goto out_node;
+
+	error = dlm_register_debugfs();
+	if (error)
+		goto out_member;
+
+	error = dlm_lowcomms_init();
+	if (error)
+		goto out_debug;
+
+	printk("DLM (built %s %s) installed\n", __DATE__, __TIME__);
+
+	return 0;
+
+ out_debug:
+	dlm_unregister_debugfs();
+ out_member:
+	dlm_member_sysfs_exit();
+ out_node:
+	dlm_node_ioctl_exit();
+ out_mem:
+	dlm_memory_exit();
+ out:
+	return error;
+}
+
+void __exit exit_dlm(void)
+{
+	dlm_lowcomms_exit();
+	dlm_member_sysfs_exit();
+	dlm_node_ioctl_exit();
+	dlm_memory_exit();
+	dlm_unregister_debugfs();
+}
+
+module_init(init_dlm);
+module_exit(exit_dlm);
+
+MODULE_DESCRIPTION("Distributed Lock Manager");
+MODULE_AUTHOR("Red Hat, Inc.");
+MODULE_LICENSE("GPL");
+
+EXPORT_SYMBOL(dlm_new_lockspace);
+EXPORT_SYMBOL(dlm_release_lockspace);
+EXPORT_SYMBOL(dlm_lock);
+EXPORT_SYMBOL(dlm_unlock);
+
--- a/drivers/dlm/memory.c	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/memory.c	2005-05-12 23:13:15.830485360 +0800
@@ -0,0 +1,121 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) Sistina Software, Inc.  1997-2003  All rights reserved.
+**  Copyright (C) 2004-2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#include "dlm_internal.h"
+#include "config.h"
+
+static kmem_cache_t *lkb_cache;
+
+
+int dlm_memory_init(void)
+{
+	int ret = 0;
+
+	lkb_cache = kmem_cache_create("dlm_lkb", sizeof(struct dlm_lkb),
+				__alignof__(struct dlm_lkb), 0, NULL, NULL);
+	if (!lkb_cache)
+		ret = -ENOMEM;
+	return ret;
+}
+
+void dlm_memory_exit(void)
+{
+	if (lkb_cache)
+		kmem_cache_destroy(lkb_cache);
+}
+
+char *allocate_lvb(struct dlm_ls *ls)
+{
+	char *p;
+
+	p = kmalloc(ls->ls_lvblen, GFP_KERNEL);
+	if (p)
+		memset(p, 0, ls->ls_lvblen);
+	return p;
+}
+
+void free_lvb(char *p)
+{
+	kfree(p);
+}
+
+uint64_t *allocate_range(struct dlm_ls *ls)
+{
+	int ralen = 4*sizeof(uint64_t);
+	uint64_t *p;
+
+	p = kmalloc(ralen, GFP_KERNEL);
+	if (p)
+		memset(p, 0, ralen);
+	return p;
+}
+
+void free_range(uint64_t *p)
+{
+	kfree(p);
+}
+
+/* FIXME: have some minimal space built-in to rsb for the name and
+   kmalloc a separate name if needed, like dentries are done */
+
+struct dlm_rsb *allocate_rsb(struct dlm_ls *ls, int namelen)
+{
+	struct dlm_rsb *r;
+
+	DLM_ASSERT(namelen <= DLM_RESNAME_MAXLEN,);
+
+	r = kmalloc(sizeof(*r) + namelen, GFP_KERNEL);
+	if (r)
+		memset(r, 0, sizeof(*r) + namelen);
+	return r;
+}
+
+void free_rsb(struct dlm_rsb *r)
+{
+	if (r->res_lvbptr)
+		free_lvb(r->res_lvbptr);
+	kfree(r);
+}
+
+struct dlm_lkb *allocate_lkb(struct dlm_ls *ls)
+{
+	struct dlm_lkb *lkb;
+
+	lkb = kmem_cache_alloc(lkb_cache, GFP_KERNEL);
+	if (lkb)
+		memset(lkb, 0, sizeof(*lkb));
+	return lkb;
+}
+
+void free_lkb(struct dlm_lkb *lkb)
+{
+	kmem_cache_free(lkb_cache, lkb);
+}
+
+struct dlm_direntry *allocate_direntry(struct dlm_ls *ls, int namelen)
+{
+	struct dlm_direntry *de;
+
+	DLM_ASSERT(namelen <= DLM_RESNAME_MAXLEN,);
+
+	de = kmalloc(sizeof(*de) + namelen, GFP_KERNEL);
+	if (de)
+		memset(de, 0, sizeof(*de) + namelen);
+	return de;
+}
+
+void free_direntry(struct dlm_direntry *de)
+{
+	kfree(de);
+}
+
--- a/drivers/dlm/memory.h	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/memory.h	2005-05-12 23:13:15.830485360 +0800
@@ -0,0 +1,31 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) Sistina Software, Inc.  1997-2003  All rights reserved.
+**  Copyright (C) 2004-2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#ifndef __MEMORY_DOT_H__
+#define __MEMORY_DOT_H__
+
+int dlm_memory_init(void);
+void dlm_memory_exit(void);
+struct dlm_rsb *allocate_rsb(struct dlm_ls *ls, int namelen);
+void free_rsb(struct dlm_rsb *r);
+struct dlm_lkb *allocate_lkb(struct dlm_ls *ls);
+void free_lkb(struct dlm_lkb *l);
+struct dlm_direntry *allocate_direntry(struct dlm_ls *ls, int namelen);
+void free_direntry(struct dlm_direntry *de);
+char *allocate_lvb(struct dlm_ls *ls);
+void free_lvb(char *l);
+uint64_t *allocate_range(struct dlm_ls *ls);
+void free_range(uint64_t *l);
+
+#endif		/* __MEMORY_DOT_H__ */
+
--- a/drivers/dlm/util.c	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/util.c	2005-05-12 23:13:15.831485208 +0800
@@ -0,0 +1,204 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) 2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#include "dlm_internal.h"
+#include "rcom.h"
+
+/**
+ * dlm_hash - hash an array of data
+ * @data: the data to be hashed
+ * @len: the length of data to be hashed
+ *
+ * Copied from GFS which copied from...
+ *
+ * Take some data and convert it to a 32-bit hash.
+ * This is the 32-bit FNV-1a hash from:
+ * http://www.isthe.com/chongo/tech/comp/fnv/
+ */
+
+static inline uint32_t hash_more_internal(const void *data, unsigned int len,
+					  uint32_t hash)
+{
+	unsigned char *p = (unsigned char *)data;
+	unsigned char *e = p + len;
+	uint32_t h = hash;
+
+	while (p < e) {
+		h ^= (uint32_t)(*p++);
+		h *= 0x01000193;
+	}
+
+	return h;
+}
+
+uint32_t dlm_hash(const void *data, int len)
+{
+	uint32_t h = 0x811C9DC5;
+	h = hash_more_internal(data, len, h);
+	return h;
+}
+
+static void header_out(struct dlm_header *hd)
+{
+	hd->h_version		= cpu_to_le32(hd->h_version);
+	hd->h_lockspace		= cpu_to_le32(hd->h_lockspace);
+	hd->h_nodeid		= cpu_to_le32(hd->h_nodeid);
+	hd->h_length		= cpu_to_le16(hd->h_length);
+}
+
+static void header_in(struct dlm_header *hd)
+{
+	hd->h_version		= le32_to_cpu(hd->h_version);
+	hd->h_lockspace		= le32_to_cpu(hd->h_lockspace);
+	hd->h_nodeid		= le32_to_cpu(hd->h_nodeid);
+	hd->h_length		= le16_to_cpu(hd->h_length);
+}
+
+void dlm_message_out(struct dlm_message *ms)
+{
+	struct dlm_header *hd = (struct dlm_header *) ms;
+
+	header_out(hd);
+
+	ms->m_type		= cpu_to_le32(ms->m_type);
+	ms->m_nodeid		= cpu_to_le32(ms->m_nodeid);
+	ms->m_pid		= cpu_to_le32(ms->m_pid);
+	ms->m_lkid		= cpu_to_le32(ms->m_lkid);
+	ms->m_remid		= cpu_to_le32(ms->m_remid);
+	ms->m_parent_lkid	= cpu_to_le32(ms->m_parent_lkid);
+	ms->m_parent_remid	= cpu_to_le32(ms->m_parent_remid);
+	ms->m_exflags		= cpu_to_le32(ms->m_exflags);
+	ms->m_sbflags		= cpu_to_le32(ms->m_sbflags);
+	ms->m_flags		= cpu_to_le32(ms->m_flags);
+	ms->m_lvbseq		= cpu_to_le32(ms->m_lvbseq);
+	ms->m_status		= cpu_to_le32(ms->m_status);
+	ms->m_grmode		= cpu_to_le32(ms->m_grmode);
+	ms->m_rqmode		= cpu_to_le32(ms->m_rqmode);
+	ms->m_bastmode		= cpu_to_le32(ms->m_bastmode);
+	ms->m_asts		= cpu_to_le32(ms->m_asts);
+	ms->m_result		= cpu_to_le32(ms->m_result);
+	ms->m_range[0]		= cpu_to_le64(ms->m_range[0]);
+	ms->m_range[1]		= cpu_to_le64(ms->m_range[1]);
+}
+
+void dlm_message_in(struct dlm_message *ms)
+{
+	struct dlm_header *hd = (struct dlm_header *) ms;
+
+	header_in(hd);
+
+	ms->m_type		= le32_to_cpu(ms->m_type);
+	ms->m_nodeid		= le32_to_cpu(ms->m_nodeid);
+	ms->m_pid		= le32_to_cpu(ms->m_pid);
+	ms->m_lkid		= le32_to_cpu(ms->m_lkid);
+	ms->m_remid		= le32_to_cpu(ms->m_remid);
+	ms->m_parent_lkid	= le32_to_cpu(ms->m_parent_lkid);
+	ms->m_parent_remid	= le32_to_cpu(ms->m_parent_remid);
+	ms->m_exflags		= le32_to_cpu(ms->m_exflags);
+	ms->m_sbflags		= le32_to_cpu(ms->m_sbflags);
+	ms->m_flags		= le32_to_cpu(ms->m_flags);
+	ms->m_lvbseq		= le32_to_cpu(ms->m_lvbseq);
+	ms->m_status		= le32_to_cpu(ms->m_status);
+	ms->m_grmode		= le32_to_cpu(ms->m_grmode);
+	ms->m_rqmode		= le32_to_cpu(ms->m_rqmode);
+	ms->m_bastmode		= le32_to_cpu(ms->m_bastmode);
+	ms->m_asts		= le32_to_cpu(ms->m_asts);
+	ms->m_result		= le32_to_cpu(ms->m_result);
+	ms->m_range[0]		= le64_to_cpu(ms->m_range[0]);
+	ms->m_range[1]		= le64_to_cpu(ms->m_range[1]);
+}
+
+static void rcom_lock_out(struct rcom_lock *rl)
+{
+	rl->rl_ownpid		= cpu_to_le32(rl->rl_ownpid);
+	rl->rl_lkid		= cpu_to_le32(rl->rl_lkid);
+	rl->rl_remid		= cpu_to_le32(rl->rl_remid);
+	rl->rl_parent_lkid	= cpu_to_le32(rl->rl_parent_lkid);
+	rl->rl_parent_remid	= cpu_to_le32(rl->rl_parent_remid);
+	rl->rl_exflags		= cpu_to_le32(rl->rl_exflags);
+	rl->rl_flags		= cpu_to_le32(rl->rl_flags);
+	rl->rl_lvbseq		= cpu_to_le32(rl->rl_lvbseq);
+	rl->rl_result		= cpu_to_le32(rl->rl_result);
+	rl->rl_wait_type	= cpu_to_le16(rl->rl_wait_type);
+	rl->rl_namelen		= cpu_to_le16(rl->rl_namelen);
+	rl->rl_range[0]		= cpu_to_le64(rl->rl_range[0]);
+	rl->rl_range[1]		= cpu_to_le64(rl->rl_range[1]);
+	rl->rl_range[2]		= cpu_to_le64(rl->rl_range[2]);
+	rl->rl_range[3]		= cpu_to_le64(rl->rl_range[3]);
+}
+
+static void rcom_lock_in(struct rcom_lock *rl)
+{
+	rl->rl_ownpid		= le32_to_cpu(rl->rl_ownpid);
+	rl->rl_lkid		= le32_to_cpu(rl->rl_lkid);
+	rl->rl_remid		= le32_to_cpu(rl->rl_remid);
+	rl->rl_parent_lkid	= le32_to_cpu(rl->rl_parent_lkid);
+	rl->rl_parent_remid	= le32_to_cpu(rl->rl_parent_remid);
+	rl->rl_exflags		= le32_to_cpu(rl->rl_exflags);
+	rl->rl_flags		= le32_to_cpu(rl->rl_flags);
+	rl->rl_lvbseq		= le32_to_cpu(rl->rl_lvbseq);
+	rl->rl_result		= le32_to_cpu(rl->rl_result);
+	rl->rl_wait_type	= le16_to_cpu(rl->rl_wait_type);
+	rl->rl_namelen		= le16_to_cpu(rl->rl_namelen);
+	rl->rl_range[0]		= le64_to_cpu(rl->rl_range[0]);
+	rl->rl_range[1]		= le64_to_cpu(rl->rl_range[1]);
+	rl->rl_range[2]		= le64_to_cpu(rl->rl_range[2]);
+	rl->rl_range[3]		= le64_to_cpu(rl->rl_range[3]);
+}
+
+static void rcom_config_out(struct rcom_config *rf)
+{
+	rf->rf_lvblen		= cpu_to_le32(rf->rf_lvblen);
+	rf->rf_lsflags		= cpu_to_le32(rf->rf_lsflags);
+}
+
+static void rcom_config_in(struct rcom_config *rf)
+{
+	rf->rf_lvblen		= le32_to_cpu(rf->rf_lvblen);
+	rf->rf_lsflags		= le32_to_cpu(rf->rf_lsflags);
+}
+
+void dlm_rcom_out(struct dlm_rcom *rc)
+{
+	struct dlm_header *hd = (struct dlm_header *) rc;
+	int type = rc->rc_type;
+
+	header_out(hd);
+
+	rc->rc_type		= cpu_to_le32(rc->rc_type);
+	rc->rc_result		= cpu_to_le32(rc->rc_result);
+	rc->rc_id		= cpu_to_le64(rc->rc_id);
+
+	if (type == DLM_RCOM_LOCK)
+		rcom_lock_out((struct rcom_lock *) rc->rc_buf);
+
+	else if (type == DLM_RCOM_STATUS_REPLY)
+		rcom_config_out((struct rcom_config *) rc->rc_buf);
+}
+
+void dlm_rcom_in(struct dlm_rcom *rc)
+{
+	struct dlm_header *hd = (struct dlm_header *) rc;
+
+	header_in(hd);
+
+	rc->rc_type		= le32_to_cpu(rc->rc_type);
+	rc->rc_result		= le32_to_cpu(rc->rc_result);
+	rc->rc_id		= le64_to_cpu(rc->rc_id);
+
+	if (rc->rc_type == DLM_RCOM_LOCK)
+		rcom_lock_in((struct rcom_lock *) rc->rc_buf);
+
+	else if (rc->rc_type == DLM_RCOM_STATUS_REPLY)
+		rcom_config_in((struct rcom_config *) rc->rc_buf);
+}
+
--- a/drivers/dlm/util.h	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/util.h	2005-05-12 23:13:15.831485208 +0800
@@ -0,0 +1,24 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) 2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#ifndef __UTIL_DOT_H__
+#define __UTIL_DOT_H__
+
+uint32_t dlm_hash(const char *data, int len);
+
+void dlm_message_out(struct dlm_message *ms);
+void dlm_message_in(struct dlm_message *ms);
+void dlm_rcom_out(struct dlm_rcom *rc);
+void dlm_rcom_in(struct dlm_rcom *rc);
+
+#endif
+



From teigland at redhat.com  Mon May 16 07:20:27 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 16 May 2005 15:20:27 +0800
Subject: [Linux-cluster] [PATCH 3/8] dlm: communication
Message-ID: <20050516072027.GH7094@redhat.com>

Inter-node communiction using SCTP.  This level is not aware of locks or
resources or other dlm objects, only data buffers.  These functions also
batch (and extract) lots of small messages bound for one node into larger
chunks.

Signed-off-by: Dave Teigland <teigland at redhat.com>
Signed-off-by: Patrick Caulfield <pcaulfie at redhat.com>

---

 drivers/dlm/lowcomms.c | 1329 +++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/dlm/lowcomms.h |   28 +
 drivers/dlm/midcomms.c |  139 +++++
 drivers/dlm/midcomms.h |   21 
 4 files changed, 1517 insertions(+)

--- a/drivers/dlm/lowcomms.c	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/lowcomms.c	2005-05-12 23:13:15.839483992 +0800
@@ -0,0 +1,1329 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) Sistina Software, Inc.  1997-2003  All rights reserved.
+**  Copyright (C) 2004-2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+/*
+ * lowcomms.c
+ *
+ * This is the "low-level" comms layer.
+ *
+ * It is responsible for sending/receiving messages
+ * from other nodes in the cluster.
+ *
+ * Cluster nodes are referred to by their nodeids. nodeids are
+ * simply 32 bit numbers to the locking module - if they need to
+ * be expanded for the cluster infrastructure then that is it's
+ * responsibility. It is this layer's
+ * responsibility to resolve these into IP address or
+ * whatever it needs for inter-node communication.
+ *
+ * The comms level is two kernel threads that deal mainly with
+ * the receiving of messages from other nodes and passing them
+ * up to the mid-level comms layer (which understands the
+ * message format) for execution by the locking core, and
+ * a send thread which does all the setting up of connections
+ * to remote nodes and the sending of data. Threads are not allowed
+ * to send their own data because it may cause them to wait in times
+ * of high load. Also, this way, the sending thread can collect together
+ * messages bound for one node and send them in one block.
+ *
+ * I don't see any problem with the recv thread executing the locking
+ * code on behalf of remote processes as the locking code is
+ * short, efficient and never (well, hardly ever) waits.
+ *
+ */
+
+#include <asm/ioctls.h>
+#include <net/sock.h>
+#include <net/tcp.h>
+#include <net/sctp/user.h>
+#include <linux/pagemap.h>
+#include <linux/socket.h>
+#include <linux/idr.h>
+
+#include <linux/dlm_node.h>
+
+#include "dlm_internal.h"
+#include "lowcomms.h"
+#include "config.h"
+#include "member.h"
+#include "midcomms.h"
+
+static struct sockaddr_storage *local_addr[DLM_MAX_ADDR_COUNT];
+static int			local_nodeid;
+static int			local_weight;
+static int			local_count;
+static struct list_head		nodes;
+static struct semaphore		nodes_sem;
+
+/* One of these per configured node */
+
+struct dlm_node {
+	struct list_head	list;
+	int			nodeid;
+	int			weight;
+	struct sockaddr_storage	addr;
+};
+
+/* One of these per connected node */
+
+#define NI_INIT_PENDING 1
+#define NI_WRITE_PENDING 2
+
+struct nodeinfo {
+	spinlock_t		lock;
+	sctp_assoc_t		assoc_id;
+	unsigned long		flags;
+	struct list_head	write_list; /* nodes with pending writes */
+	struct list_head	writequeue; /* outgoing writequeue_entries */
+	spinlock_t		writequeue_lock;
+	int			nodeid;
+};
+
+static DEFINE_IDR(nodeinfo_idr);
+static struct rw_semaphore	nodeinfo_lock;
+static int			max_nodeid;
+
+struct cbuf {
+	unsigned		base;
+	unsigned		len;
+	unsigned		mask;
+};
+
+/* Just the one of these, now. But this struct keeps
+   the connection-specific variables together */
+
+#define CF_READ_PENDING 1
+
+struct connection {
+	struct socket          *sock;
+	unsigned long		flags;
+	struct page            *rx_page;
+	atomic_t		waiting_requests;
+	struct cbuf		cb;
+	int                     eagain_flag;
+};
+
+/* An entry waiting to be sent */
+
+struct writequeue_entry {
+	struct list_head	list;
+	struct page            *page;
+	int			offset;
+	int			len;
+	int			end;
+	int			users;
+	struct nodeinfo        *ni;
+};
+
+#define CBUF_ADD(cb, n) do { (cb)->len += n; } while(0)
+#define CBUF_EMPTY(cb) ((cb)->len == 0)
+#define CBUF_MAY_ADD(cb, n) (((cb)->len + (n)) < ((cb)->mask + 1))
+#define CBUF_DATA(cb) (((cb)->base + (cb)->len) & (cb)->mask)
+
+#define CBUF_INIT(cb, size) \
+do { \
+	(cb)->base = (cb)->len = 0; \
+	(cb)->mask = ((size)-1); \
+} while(0)
+
+#define CBUF_EAT(cb, n) \
+do { \
+	(cb)->len  -= (n); \
+	(cb)->base += (n); \
+	(cb)->base &= (cb)->mask; \
+} while(0)
+
+
+/* List of nodes which have writes pending */
+static struct list_head write_nodes;
+static spinlock_t write_nodes_lock;
+
+/* Maximum number of incoming messages to process before
+ * doing a schedule()
+ */
+#define MAX_RX_MSG_COUNT 25
+
+/* Manage daemons */
+static struct task_struct *recv_task;
+static struct task_struct *send_task;
+static wait_queue_head_t lowcomms_recv_wait;
+static atomic_t accepting;
+
+/* The SCTP connection */
+static struct connection sctp_con;
+
+
+static struct dlm_node *search_node(int nodeid)
+{
+	struct dlm_node *node;
+
+	list_for_each_entry(node, &nodes, list) {
+		if (node->nodeid == nodeid)
+			goto out;
+	}
+	node = NULL;
+ out:
+	return node;
+}
+
+static struct dlm_node *search_node_addr(struct sockaddr_storage *addr)
+{
+	struct dlm_node *node;
+
+	list_for_each_entry(node, &nodes, list) {
+		if (!memcmp(&node->addr, addr, sizeof(*addr)))
+			goto out;
+	}
+	node = NULL;
+ out:
+	return node;
+}
+
+static int _get_node(int nodeid, struct dlm_node **node_ret)
+{
+	struct dlm_node *node;
+	int error = 0;
+
+	node = search_node(nodeid);
+	if (node)
+		goto out;
+
+	node = kmalloc(sizeof(struct dlm_node), GFP_KERNEL);
+	if (!node) {
+		error = -ENOMEM;
+		goto out;
+	}
+	memset(node, 0, sizeof(struct dlm_node));
+	node->nodeid = nodeid;
+	list_add_tail(&node->list, &nodes);
+ out:
+	*node_ret = node;
+	return error;
+}
+
+static int addr_to_nodeid(struct sockaddr_storage *addr, int *nodeid)
+{
+	struct dlm_node *node;
+
+	down(&nodes_sem);
+	node = search_node_addr(addr);
+	up(&nodes_sem);
+	if (!node)
+		return -1;
+	*nodeid = node->nodeid;
+	return 0;
+}
+
+static int nodeid_to_addr(int nodeid, struct sockaddr *retaddr)
+{
+	struct dlm_node *node;
+	struct sockaddr_storage *addr;
+
+	if (!local_count)
+		return -1;
+
+	down(&nodes_sem);
+	node = search_node(nodeid);
+	up(&nodes_sem);
+	if (!node)
+		return -1;
+
+	addr = &node->addr;
+
+	if (local_addr[0]->ss_family == AF_INET) {
+	        struct sockaddr_in *in4  = (struct sockaddr_in *) addr;
+		struct sockaddr_in *ret4 = (struct sockaddr_in *) retaddr;
+		ret4->sin_addr.s_addr = in4->sin_addr.s_addr;
+	} else {
+	        struct sockaddr_in6 *in6  = (struct sockaddr_in6 *) addr;
+		struct sockaddr_in6 *ret6 = (struct sockaddr_in6 *) retaddr;
+		memcpy(&ret6->sin6_addr, &in6->sin6_addr,
+		       sizeof(in6->sin6_addr));
+	}
+
+	return 0;
+}
+
+int dlm_set_node(int nodeid, int weight, char *addr_buf)
+{
+	struct dlm_node *node;
+	int error;
+
+	down(&nodes_sem);
+	error = _get_node(nodeid, &node);
+	if (!error) {
+		memcpy(&node->addr, addr_buf, sizeof(struct sockaddr_storage));
+		node->weight = weight;
+	}
+	up(&nodes_sem);
+	return error;
+}
+
+int dlm_set_local(int nodeid, int weight, char *addr_buf)
+{
+	struct sockaddr_storage *addr;
+
+	if (local_count > DLM_MAX_ADDR_COUNT - 1) {
+		log_print("too many local addresses set %d", local_count);
+		return -EINVAL;
+	}
+	local_nodeid = nodeid;
+	local_weight = weight;
+
+	addr = kmalloc(sizeof(*addr), GFP_KERNEL);
+	if (!addr)
+		return -ENOMEM;
+	memcpy(addr, addr_buf, sizeof(*addr));
+	local_addr[local_count++] = addr;
+	return 0;
+}
+
+int dlm_our_nodeid(void)
+{
+	return local_nodeid;
+}
+
+static struct nodeinfo *nodeid2nodeinfo(int nodeid, int alloc)
+{
+	struct nodeinfo *ni;
+	int r;
+	int n;
+
+	down_read(&nodeinfo_lock);
+	ni = idr_find(&nodeinfo_idr, nodeid);
+	up_read(&nodeinfo_lock);
+
+	if (!ni && alloc) {
+		down_write(&nodeinfo_lock);
+
+		ni = idr_find(&nodeinfo_idr, nodeid);
+		if (ni)
+			goto out_up;
+
+		r = idr_pre_get(&nodeinfo_idr, alloc);
+		if (!r)
+			goto out_up;
+
+		ni = kmalloc(sizeof(struct nodeinfo), alloc);
+		if (!ni)
+			goto out_up;
+
+		r = idr_get_new_above(&nodeinfo_idr, ni, nodeid, &n);
+		if (r) {
+			kfree(ni);
+			ni = NULL;
+			goto out_up;
+		}
+		if (n != nodeid) {
+			idr_remove(&nodeinfo_idr, n);
+			kfree(ni);
+			ni = NULL;
+			goto out_up;
+		}
+		memset(ni, 0, sizeof(struct nodeinfo));
+		spin_lock_init(&ni->lock);
+		INIT_LIST_HEAD(&ni->writequeue);
+		spin_lock_init(&ni->writequeue_lock);
+		ni->nodeid = nodeid;
+
+		if (nodeid > max_nodeid)
+			max_nodeid = nodeid;
+	out_up:
+		up_write(&nodeinfo_lock);
+	}
+
+	return ni;
+}
+
+/* Don't call this too often... */
+static struct nodeinfo *assoc2nodeinfo(sctp_assoc_t assoc)
+{
+	int i;
+	struct nodeinfo *ni;
+
+	for (i=1; i<=max_nodeid; i++) {
+		ni = nodeid2nodeinfo(i, 0);
+		if (ni && ni->assoc_id == assoc)
+			return ni;
+	}
+	return NULL;
+}
+
+/* Data or notification available on socket */
+static void lowcomms_data_ready(struct sock *sk, int count_unused)
+{
+	atomic_inc(&sctp_con.waiting_requests);
+	if (test_and_set_bit(CF_READ_PENDING, &sctp_con.flags))
+		return;
+
+	wake_up_interruptible(&lowcomms_recv_wait);
+}
+
+
+/* Add the port number to an IP6 or 4 sockaddr and return the address length.
+   Also padd out the struct with zeros to make comparisons meaningful */
+
+static void make_sockaddr(struct sockaddr_storage *saddr, uint16_t port,
+			  int *addr_len)
+{
+	struct sockaddr_in *local4_addr;
+	struct sockaddr_in6 *local6_addr;
+
+	if (!local_count)
+		return;
+
+	if (!port) {
+		if (local_addr[0]->ss_family == AF_INET) {
+			local4_addr = (struct sockaddr_in *)local_addr[0];
+			port = be16_to_cpu(local4_addr->sin_port);
+		} else {
+			local6_addr = (struct sockaddr_in6 *)local_addr[0];
+			port = be16_to_cpu(local6_addr->sin6_port);
+		}
+	}
+
+	saddr->ss_family = local_addr[0]->ss_family;
+	if (local_addr[0]->ss_family == AF_INET) {
+		struct sockaddr_in *in4_addr = (struct sockaddr_in *)saddr;
+		in4_addr->sin_port = cpu_to_be16(port);
+		memset(&in4_addr->sin_zero, 0, sizeof(in4_addr->sin_zero));
+		memset(in4_addr+1, 0, sizeof(struct sockaddr_storage) -
+				      sizeof(struct sockaddr_in));
+		*addr_len = sizeof(struct sockaddr_in);
+	} else {
+		struct sockaddr_in6 *in6_addr = (struct sockaddr_in6 *)saddr;
+		in6_addr->sin6_port = cpu_to_be16(port);
+		memset(in6_addr+1, 0, sizeof(struct sockaddr_storage) -
+				      sizeof(struct sockaddr_in6));
+		*addr_len = sizeof(struct sockaddr_in6);
+	}
+}
+
+/* Close the connection and tidy up */
+static void close_connection(void)
+{
+	if (sctp_con.sock) {
+		sock_release(sctp_con.sock);
+		sctp_con.sock = NULL;
+	}
+
+	if (sctp_con.rx_page) {
+		__free_page(sctp_con.rx_page);
+		sctp_con.rx_page = NULL;
+	}
+}
+
+/* We only send shutdown messages to nodes that are not part of the cluster */
+static void send_shutdown(sctp_assoc_t associd)
+{
+	static char outcmsg[CMSG_SPACE(sizeof(struct sctp_sndrcvinfo))];
+	struct msghdr outmessage;
+	struct cmsghdr *cmsg;
+	struct sctp_sndrcvinfo *sinfo;
+	int ret;
+
+	outmessage.msg_name = NULL;
+	outmessage.msg_namelen = 0;
+	outmessage.msg_control = outcmsg;
+	outmessage.msg_controllen = sizeof(outcmsg);
+	outmessage.msg_flags = MSG_EOR;
+
+	cmsg = CMSG_FIRSTHDR(&outmessage);
+	cmsg->cmsg_level = IPPROTO_SCTP;
+	cmsg->cmsg_type = SCTP_SNDRCV;
+	cmsg->cmsg_len = CMSG_LEN(sizeof(struct sctp_sndrcvinfo));
+	outmessage.msg_controllen = cmsg->cmsg_len;
+	sinfo = (struct sctp_sndrcvinfo *)CMSG_DATA(cmsg);
+	memset(sinfo, 0x00, sizeof(struct sctp_sndrcvinfo));
+
+	sinfo->sinfo_flags |= MSG_EOF;
+	sinfo->sinfo_assoc_id = associd;
+
+	ret = kernel_sendmsg(sctp_con.sock, &outmessage, NULL, 0, 0);
+
+	if (ret != 0)
+		log_print("send EOF to node failed: %d", ret);
+}
+
+
+/* INIT failed but we don't know which node...
+   restart INIT on all pending nodes */
+static void init_failed(void)
+{
+	int i;
+	struct nodeinfo *ni;
+
+	for (i=1; i<=max_nodeid; i++) {
+		ni = nodeid2nodeinfo(i, 0);
+		if (!ni)
+			continue;
+
+		if (test_and_clear_bit(NI_INIT_PENDING, &ni->flags)) {
+			ni->assoc_id = 0;
+			if (!test_and_set_bit(NI_WRITE_PENDING, &ni->flags)) {
+				spin_lock_bh(&write_nodes_lock);
+				list_add_tail(&ni->write_list, &write_nodes);
+				spin_unlock_bh(&write_nodes_lock);
+			}
+		}
+	}
+	wake_up_process(send_task);
+}
+
+/* Something happened to an association */
+static void process_sctp_notification(struct msghdr *msg, char *buf)
+{
+	union sctp_notification *sn = (union sctp_notification *)buf;
+
+	if (sn->sn_header.sn_type == SCTP_ASSOC_CHANGE) {
+		switch (sn->sn_assoc_change.sac_state) {
+
+		case SCTP_COMM_UP:
+		case SCTP_RESTART:
+		{
+			/* Check that the new node is in the lockspace */
+			struct sctp_prim prim;
+			mm_segment_t fs;
+			int nodeid;
+			int prim_len, ret;
+			int addr_len;
+			struct nodeinfo *ni;
+
+			/* This seems to happen when we received a connection
+			 * too early... or something...  anyway, it happens but
+			 * we always seem to get a real message too, see
+			 * receive_from_sock */
+
+			if ((int)sn->sn_assoc_change.sac_assoc_id <= 0) {
+				log_print("COMM_UP for invalid assoc ID %d",
+					 (int)sn->sn_assoc_change.sac_assoc_id);
+				init_failed();
+				return;
+			}
+			memset(&prim, 0, sizeof(struct sctp_prim));
+			prim_len = sizeof(struct sctp_prim);
+			prim.ssp_assoc_id = sn->sn_assoc_change.sac_assoc_id;
+
+			fs = get_fs();
+			set_fs(get_ds());
+			ret = sctp_con.sock->ops->getsockopt(sctp_con.sock,
+						IPPROTO_SCTP, SCTP_PRIMARY_ADDR,
+						(char*)&prim, &prim_len);
+			set_fs(fs);
+			if (ret < 0) {
+				struct nodeinfo *ni;
+
+				log_print("getsockopt/sctp_primary_addr on "
+					  "new assoc %d failed : %d",
+				    (int)sn->sn_assoc_change.sac_assoc_id, ret);
+
+				/* Retry INIT later */
+				ni = assoc2nodeinfo(sn->sn_assoc_change.sac_assoc_id);
+				if (ni)
+					clear_bit(NI_INIT_PENDING, &ni->flags);
+				return;
+			}
+			make_sockaddr(&prim.ssp_addr, 0, &addr_len);
+			if (addr_to_nodeid(&prim.ssp_addr, &nodeid)) {
+				log_print("reject connect from unknown addr");
+				send_shutdown(prim.ssp_assoc_id);
+				return;
+			}
+
+			ni = nodeid2nodeinfo(nodeid, GFP_KERNEL);
+			if (!ni)
+				return;
+
+			/* Save the assoc ID */
+			spin_lock(&ni->lock);
+			ni->assoc_id = sn->sn_assoc_change.sac_assoc_id;
+			spin_unlock(&ni->lock);
+
+			log_print("got new/restarted association %d nodeid %d",
+			       (int)sn->sn_assoc_change.sac_assoc_id, nodeid);
+
+			/* Send any pending writes */
+			clear_bit(NI_INIT_PENDING, &ni->flags);
+			if (!test_and_set_bit(NI_WRITE_PENDING, &ni->flags)) {
+				spin_lock_bh(&write_nodes_lock);
+				list_add_tail(&ni->write_list, &write_nodes);
+				spin_unlock_bh(&write_nodes_lock);
+			}
+			wake_up_process(send_task);
+		}
+		break;
+
+		case SCTP_COMM_LOST:
+		case SCTP_SHUTDOWN_COMP:
+		{
+			struct nodeinfo *ni;
+
+			ni = assoc2nodeinfo(sn->sn_assoc_change.sac_assoc_id);
+			if (ni) {
+				spin_lock(&ni->lock);
+				ni->assoc_id = 0;
+				spin_unlock(&ni->lock);
+			}
+		}
+		break;
+
+		/* We don't know which INIT failed, so clear the PENDING flags
+		 * on them all.  if assoc_id is zero then it will then try
+		 * again */
+
+		case SCTP_CANT_STR_ASSOC:
+		{
+			log_print("Can't start SCTP association - retrying");
+			init_failed();
+		}
+		break;
+
+		default:
+			log_print("unexpected SCTP assoc change id=%d state=%d",
+				  (int)sn->sn_assoc_change.sac_assoc_id,
+				  sn->sn_assoc_change.sac_state);
+		}
+	}
+}
+
+/* Data received from remote end */
+static int receive_from_sock(void)
+{
+	int ret = 0;
+	struct msghdr msg;
+	struct kvec iov[2];
+	unsigned len;
+	int r;
+	struct sctp_sndrcvinfo *sinfo;
+	struct cmsghdr *cmsg;
+	struct nodeinfo *ni;
+
+	/* These two are marginally too big for stack allocation, but this
+	 * function is (currently) only called by dlm_recvd so static should be
+	 * OK.
+	 */
+	static struct sockaddr_storage msgname;
+	static char incmsg[CMSG_SPACE(sizeof(struct sctp_sndrcvinfo))];
+
+	if (sctp_con.sock == NULL)
+		goto out;
+
+	if (sctp_con.rx_page == NULL) {
+		/*
+		 * This doesn't need to be atomic, but I think it should
+		 * improve performance if it is.
+		 */
+		sctp_con.rx_page = alloc_page(GFP_ATOMIC);
+		if (sctp_con.rx_page == NULL)
+			goto out_resched;
+		CBUF_INIT(&sctp_con.cb, PAGE_CACHE_SIZE);
+	}
+
+	memset(&incmsg, 0, sizeof(incmsg));
+	memset(&msgname, 0, sizeof(msgname));
+
+	memset(incmsg, 0, sizeof(incmsg));
+	msg.msg_name = &msgname;
+	msg.msg_namelen = sizeof(msgname);
+	msg.msg_flags = 0;
+	msg.msg_control = incmsg;
+	msg.msg_controllen = sizeof(incmsg);
+
+	/* I don't see why this circular buffer stuff is necessary for SCTP
+	 * which is a packet-based protocol, but the whole thing breaks under
+	 * load without it! The overhead is minimal (and is in the TCP lowcomms
+	 * anyway, of course) so I'll leave it in until I can figure out what's
+	 * really happening.
+	 */
+
+	/*
+	 * iov[0] is the bit of the circular buffer between the current end
+	 * point (cb.base + cb.len) and the end of the buffer.
+	 */
+	iov[0].iov_len = sctp_con.cb.base - CBUF_DATA(&sctp_con.cb);
+	iov[0].iov_base = page_address(sctp_con.rx_page) +
+			  CBUF_DATA(&sctp_con.cb);
+	iov[1].iov_len = 0;
+
+	/*
+	 * iov[1] is the bit of the circular buffer between the start of the
+	 * buffer and the start of the currently used section (cb.base)
+	 */
+	if (CBUF_DATA(&sctp_con.cb) >= sctp_con.cb.base) {
+		iov[0].iov_len = PAGE_CACHE_SIZE - CBUF_DATA(&sctp_con.cb);
+		iov[1].iov_len = sctp_con.cb.base;
+		iov[1].iov_base = page_address(sctp_con.rx_page);
+		msg.msg_iovlen = 2;
+	}
+	len = iov[0].iov_len + iov[1].iov_len;
+
+	r = ret = kernel_recvmsg(sctp_con.sock, &msg, iov, 1, len,
+				 MSG_NOSIGNAL | MSG_DONTWAIT);
+	if (ret <= 0)
+		goto out_close;
+
+	msg.msg_control = incmsg;
+	msg.msg_controllen = sizeof(incmsg);
+	cmsg = CMSG_FIRSTHDR(&msg);
+	sinfo = (struct sctp_sndrcvinfo *)CMSG_DATA(cmsg);
+
+	if (msg.msg_flags & MSG_NOTIFICATION) {
+		process_sctp_notification(&msg, page_address(sctp_con.rx_page));
+		return 0;
+	}
+
+	/* Is this a new association ? */
+	ni = nodeid2nodeinfo(le32_to_cpu(sinfo->sinfo_ppid), GFP_KERNEL);
+	if (ni) {
+		ni->assoc_id = sinfo->sinfo_assoc_id;
+		if (test_and_clear_bit(NI_INIT_PENDING, &ni->flags)) {
+
+			if (!test_and_set_bit(NI_WRITE_PENDING, &ni->flags)) {
+				spin_lock_bh(&write_nodes_lock);
+				list_add_tail(&ni->write_list, &write_nodes);
+				spin_unlock_bh(&write_nodes_lock);
+			}
+			wake_up_process(send_task);
+		}
+	}
+
+	/* INIT sends a message with length of 1 - ignore it */
+	if (r == 1)
+		return 0;
+
+	CBUF_ADD(&sctp_con.cb, ret);
+	ret = dlm_process_incoming_buffer(cpu_to_le32(sinfo->sinfo_ppid),
+					  page_address(sctp_con.rx_page),
+					  sctp_con.cb.base, sctp_con.cb.len,
+					  PAGE_CACHE_SIZE);
+	if (ret < 0)
+		goto out_close;
+	CBUF_EAT(&sctp_con.cb, ret);
+
+      out:
+	ret = 0;
+	goto out_ret;
+
+      out_resched:
+	lowcomms_data_ready(sctp_con.sock->sk, 0);
+	ret = 0;
+	schedule();
+	goto out_ret;
+
+      out_close:
+	if (ret != -EAGAIN)
+		log_print("error reading from sctp socket: %d", ret);
+      out_ret:
+	return ret;
+}
+
+/* Bind to an IP address. SCTP allows multiple address so it can do multi-homing */
+static int add_bind_addr(struct sockaddr_storage *addr, int addr_len, int num)
+{
+	mm_segment_t fs;
+	int result = 0;
+
+	fs = get_fs();
+	set_fs(get_ds());
+	if (num == 1)
+		result = sctp_con.sock->ops->bind(sctp_con.sock,
+					(struct sockaddr *) addr, addr_len);
+	else
+		result = sctp_con.sock->ops->setsockopt(sctp_con.sock, SOL_SCTP,
+				SCTP_SOCKOPT_BINDX_ADD, (char *)addr, addr_len);
+	set_fs(fs);
+
+	if (result < 0)
+		log_print("Can't bind to port %d addr number %d",
+			  dlm_config.tcp_port, num);
+
+	return result;
+}
+
+
+/* Initialise SCTP socket and bind to all interfaces */
+static int init_sock(void)
+{
+	mm_segment_t fs;
+	struct socket *sock = NULL;
+	struct sockaddr_storage localaddr;
+	struct sctp_event_subscribe subscribe;
+	int result = -EINVAL, num = 1, i, addr_len;
+
+	if (!local_count) {
+		log_print("no local IP address has been set");
+		goto out;
+	}
+
+	result = sock_create_kern(local_addr[0]->ss_family, SOCK_SEQPACKET,
+				  IPPROTO_SCTP, &sock);
+	if (result < 0) {
+		log_print("Can't create comms socket, check SCTP is loaded");
+		goto out;
+	}
+
+	/* Listen for events */
+	memset(&subscribe, 0, sizeof(subscribe));
+	subscribe.sctp_data_io_event = 1;
+	subscribe.sctp_association_event = 1;
+	subscribe.sctp_send_failure_event = 1;
+	subscribe.sctp_shutdown_event = 1;
+	subscribe.sctp_partial_delivery_event = 1;
+
+	fs = get_fs();
+	set_fs(get_ds());
+	result = sock->ops->setsockopt(sock, SOL_SCTP, SCTP_EVENTS,
+				       (char *)&subscribe, sizeof(subscribe));
+	set_fs(fs);
+
+	if (result < 0) {
+		log_print("Failed to set SCTP_EVENTS on socket: result=%d",
+			  result);
+		goto create_delsock;
+	}
+
+	/* Init con struct */
+	sock->sk->sk_user_data = &sctp_con;
+	sctp_con.sock = sock;
+	sctp_con.sock->sk->sk_data_ready = lowcomms_data_ready;
+
+	/* Bind to all interfaces. */
+	for (i = 0; i < local_count; i++) {
+		memcpy(&localaddr, local_addr[i], sizeof(localaddr));
+		make_sockaddr(&localaddr, dlm_config.tcp_port, &addr_len);
+
+		result = add_bind_addr(&localaddr, addr_len, num);
+		if (result)
+			goto create_delsock;
+		++num;
+	}
+
+	result = sock->ops->listen(sock, 5);
+	if (result < 0) {
+		log_print("Can't set socket listening");
+		goto create_delsock;
+	}
+
+	return 0;
+
+ create_delsock:
+	sock_release(sock);
+	sctp_con.sock = NULL;
+ out:
+	return result;
+}
+
+
+static struct writequeue_entry *new_writequeue_entry(int allocation)
+{
+	struct writequeue_entry *entry;
+
+	entry = kmalloc(sizeof(struct writequeue_entry), allocation);
+	if (!entry)
+		return NULL;
+
+	entry->page = alloc_page(allocation);
+	if (!entry->page) {
+		kfree(entry);
+		return NULL;
+	}
+
+	entry->offset = 0;
+	entry->len = 0;
+	entry->end = 0;
+	entry->users = 0;
+
+	return entry;
+}
+
+void *dlm_lowcomms_get_buffer(int nodeid, int len, int allocation, char **ppc)
+{
+	struct writequeue_entry *e;
+	int offset = 0;
+	int users = 0;
+	struct nodeinfo *ni;
+
+	if (!atomic_read(&accepting))
+		return NULL;
+
+	ni = nodeid2nodeinfo(nodeid, allocation);
+	if (!ni)
+		return NULL;
+
+	spin_lock(&ni->writequeue_lock);
+	e = list_entry(ni->writequeue.prev, struct writequeue_entry, list);
+	if (((struct list_head *) e == &ni->writequeue) ||
+	    (PAGE_CACHE_SIZE - e->end < len)) {
+		e = NULL;
+	} else {
+		offset = e->end;
+		e->end += len;
+		users = e->users++;
+	}
+	spin_unlock(&ni->writequeue_lock);
+
+	if (e) {
+	      got_one:
+		if (users == 0)
+			kmap(e->page);
+		*ppc = page_address(e->page) + offset;
+		return e;
+	}
+
+	e = new_writequeue_entry(allocation);
+	if (e) {
+		spin_lock(&ni->writequeue_lock);
+		offset = e->end;
+		e->end += len;
+		e->ni = ni;
+		users = e->users++;
+		list_add_tail(&e->list, &ni->writequeue);
+		spin_unlock(&ni->writequeue_lock);
+		goto got_one;
+	}
+	return NULL;
+}
+
+void dlm_lowcomms_commit_buffer(void *arg)
+{
+	struct writequeue_entry *e = (struct writequeue_entry *) arg;
+	int users;
+	struct nodeinfo *ni = e->ni;
+
+	if (!atomic_read(&accepting))
+		return;
+
+	spin_lock(&ni->writequeue_lock);
+	users = --e->users;
+	if (users)
+		goto out;
+	e->len = e->end - e->offset;
+	kunmap(e->page);
+	spin_unlock(&ni->writequeue_lock);
+
+	if (!test_and_set_bit(NI_WRITE_PENDING, &ni->flags)) {
+		spin_lock_bh(&write_nodes_lock);
+		list_add_tail(&ni->write_list, &write_nodes);
+		spin_unlock_bh(&write_nodes_lock);
+		wake_up_process(send_task);
+	}
+	return;
+
+      out:
+	spin_unlock(&ni->writequeue_lock);
+	return;
+}
+
+static void free_entry(struct writequeue_entry *e)
+{
+	__free_page(e->page);
+	kfree(e);
+}
+
+/* Initiate an SCTP association. In theory we could just use sendmsg() on
+   the first IP address and it should work, but this allows us to set up the
+   association before sending any valuable data that we can't afford to lose.
+   It also keeps the send path clean as it can now always use the association ID */
+static void initiate_association(int nodeid)
+{
+	struct sockaddr_storage rem_addr;
+	static char outcmsg[CMSG_SPACE(sizeof(struct sctp_sndrcvinfo))];
+	struct msghdr outmessage;
+	struct cmsghdr *cmsg;
+	struct sctp_sndrcvinfo *sinfo;
+	int ret;
+	int addrlen;
+	char buf[1];
+	struct kvec iov[1];
+	struct nodeinfo *ni;
+
+	log_print("Initiating association with node %d", nodeid);
+
+	ni = nodeid2nodeinfo(nodeid, GFP_KERNEL);
+	if (!ni)
+		return;
+
+	if (nodeid_to_addr(nodeid, (struct sockaddr *)&rem_addr)) {
+		log_print("no address for nodeid %d", nodeid);
+		return;
+	}
+
+	make_sockaddr(&rem_addr, dlm_config.tcp_port, &addrlen);
+
+	outmessage.msg_name = &rem_addr;
+	outmessage.msg_namelen = addrlen;
+	outmessage.msg_control = outcmsg;
+	outmessage.msg_controllen = sizeof(outcmsg);
+	outmessage.msg_flags = MSG_EOR;
+
+	iov[0].iov_base = buf;
+	iov[0].iov_len = 1;
+
+	/* Real INIT messages seem to cause trouble. Just send a 1 byte message
+	   we can afford to lose */
+	cmsg = CMSG_FIRSTHDR(&outmessage);
+	cmsg->cmsg_level = IPPROTO_SCTP;
+	cmsg->cmsg_type = SCTP_SNDRCV;
+	cmsg->cmsg_len = CMSG_LEN(sizeof(struct sctp_sndrcvinfo));
+	sinfo = (struct sctp_sndrcvinfo *)CMSG_DATA(cmsg);
+	memset(sinfo, 0x00, sizeof(struct sctp_sndrcvinfo));
+	sinfo->sinfo_ppid = cpu_to_le32(local_nodeid);
+
+	outmessage.msg_controllen = cmsg->cmsg_len;
+	ret = kernel_sendmsg(sctp_con.sock, &outmessage, iov, 1, 1);
+	if (ret < 0) {
+		log_print("send INIT to node failed: %d", ret);
+		/* Try again later */
+		clear_bit(NI_INIT_PENDING, &ni->flags);
+	}
+}
+
+/* Send a message */
+static int send_to_sock(struct nodeinfo *ni)
+{
+	int ret = 0;
+	struct writequeue_entry *e;
+	int len, offset;
+	struct msghdr outmsg;
+	static char outcmsg[CMSG_SPACE(sizeof(struct sctp_sndrcvinfo))];
+	struct cmsghdr *cmsg;
+	struct sctp_sndrcvinfo *sinfo;
+	struct kvec iov;
+
+        /* See if we need to init an association before we start
+	   sending precious messages */
+	spin_lock(&ni->lock);
+	if (!ni->assoc_id && !test_and_set_bit(NI_INIT_PENDING, &ni->flags)) {
+		spin_unlock(&ni->lock);
+		initiate_association(ni->nodeid);
+		return 0;
+	}
+	spin_unlock(&ni->lock);
+
+	outmsg.msg_name = NULL; /* We use assoc_id */
+	outmsg.msg_namelen = 0;
+	outmsg.msg_control = outcmsg;
+	outmsg.msg_controllen = sizeof(outcmsg);
+	outmsg.msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL | MSG_EOR;
+
+	cmsg = CMSG_FIRSTHDR(&outmsg);
+	cmsg->cmsg_level = IPPROTO_SCTP;
+	cmsg->cmsg_type = SCTP_SNDRCV;
+	cmsg->cmsg_len = CMSG_LEN(sizeof(struct sctp_sndrcvinfo));
+	sinfo = (struct sctp_sndrcvinfo *)CMSG_DATA(cmsg);
+	memset(sinfo, 0x00, sizeof(struct sctp_sndrcvinfo));
+	sinfo->sinfo_ppid = cpu_to_le32(local_nodeid);
+	sinfo->sinfo_assoc_id = ni->assoc_id;
+	outmsg.msg_controllen = cmsg->cmsg_len;
+
+	spin_lock(&ni->writequeue_lock);
+	for (;;) {
+		if (list_empty(&ni->writequeue))
+			break;
+		e = list_entry(ni->writequeue.next, struct writequeue_entry,
+			       list);
+		kmap(e->page);
+		len = e->len;
+		offset = e->offset;
+		BUG_ON(len == 0 && e->users == 0);
+		spin_unlock(&ni->writequeue_lock);
+
+		ret = 0;
+		if (len) {
+			iov.iov_base = page_address(e->page)+offset;
+			iov.iov_len = len;
+
+			ret = kernel_sendmsg(sctp_con.sock, &outmsg, &iov, 1,
+					     len);
+			if (ret == -EAGAIN) {
+				sctp_con.eagain_flag = 1;
+				goto out;
+			} else if (ret < 0)
+				goto send_error;
+		} else {
+			/* Don't starve people filling buffers */
+			schedule();
+		}
+
+		spin_lock(&ni->writequeue_lock);
+		e->offset += ret;
+		e->len -= ret;
+
+		if (e->len == 0 && e->users == 0) {
+			list_del(&e->list);
+			free_entry(e);
+			continue;
+		}
+	}
+	spin_unlock(&ni->writequeue_lock);
+ out:
+	return ret;
+
+ send_error:
+	log_print("Error sending to node %d %d", ni->nodeid, ret);
+	spin_lock(&ni->lock);
+	if (!test_and_set_bit(NI_INIT_PENDING, &ni->flags)) {
+		ni->assoc_id = 0;
+		spin_unlock(&ni->lock);
+		initiate_association(ni->nodeid);
+	} else
+		spin_unlock(&ni->lock);
+
+	return ret;
+}
+
+/* Try to send any messages that are pending */
+static void process_output_queue(void)
+{
+	struct list_head *list;
+	struct list_head *temp;
+
+	spin_lock_bh(&write_nodes_lock);
+	list_for_each_safe(list, temp, &write_nodes) {
+		struct nodeinfo *ni =
+		    list_entry(list, struct nodeinfo, write_list);
+		list_del(&ni->write_list);
+		clear_bit(NI_WRITE_PENDING, &ni->flags);
+
+		spin_unlock_bh(&write_nodes_lock);
+
+		send_to_sock(ni);
+		spin_lock_bh(&write_nodes_lock);
+	}
+	spin_unlock_bh(&write_nodes_lock);
+}
+
+/* Called after we've had -EAGAIN and been woken up */
+static void refill_write_queue(void)
+{
+	int i;
+
+	for (i=1; i<=max_nodeid; i++) {
+		struct nodeinfo *ni = nodeid2nodeinfo(i, 0);
+
+		if (ni) {
+			if (!test_and_set_bit(NI_WRITE_PENDING, &ni->flags)) {
+				spin_lock_bh(&write_nodes_lock);
+				list_add_tail(&ni->write_list, &write_nodes);
+				spin_unlock_bh(&write_nodes_lock);
+			}
+		}
+	}
+}
+
+static void clean_one_writequeue(struct nodeinfo *ni)
+{
+	struct list_head *list;
+	struct list_head *temp;
+
+	spin_lock(&ni->writequeue_lock);
+	list_for_each_safe(list, temp, &ni->writequeue) {
+		struct writequeue_entry *e =
+			list_entry(list, struct writequeue_entry, list);
+		list_del(&e->list);
+		free_entry(e);
+	}
+	spin_unlock(&ni->writequeue_lock);
+}
+
+static void clean_writequeues(void)
+{
+	int i;
+
+	for (i=1; i<=max_nodeid; i++) {
+		struct nodeinfo *ni = nodeid2nodeinfo(i, 0);
+		if (ni)
+			clean_one_writequeue(ni);
+	}
+}
+
+
+static void dealloc_nodeinfo(void)
+{
+	int i;
+
+	for (i=1; i<=max_nodeid; i++) {
+		struct nodeinfo *ni = nodeid2nodeinfo(i, 0);
+		if (ni) {
+			idr_remove(&nodeinfo_idr, i);
+			kfree(ni);
+		}
+	}
+}
+
+static int write_list_empty(void)
+{
+	int status;
+
+	spin_lock_bh(&write_nodes_lock);
+	status = list_empty(&write_nodes);
+	spin_unlock_bh(&write_nodes_lock);
+
+	return status;
+}
+
+static int dlm_recvd(void *data)
+{
+	DECLARE_WAITQUEUE(wait, current);
+
+	while (!kthread_should_stop()) {
+		int count = 0;
+
+		set_current_state(TASK_INTERRUPTIBLE);
+		add_wait_queue(&lowcomms_recv_wait, &wait);
+		if (!test_bit(CF_READ_PENDING, &sctp_con.flags))
+			schedule();
+		remove_wait_queue(&lowcomms_recv_wait, &wait);
+		set_current_state(TASK_RUNNING);
+
+		if (test_and_clear_bit(CF_READ_PENDING, &sctp_con.flags)) {
+			int ret;
+
+			do {
+				ret = receive_from_sock();
+
+				/* Don't starve out everyone else */
+				if (++count >= MAX_RX_MSG_COUNT) {
+					schedule();
+					count = 0;
+				}
+			} while (!kthread_should_stop() && ret >=0);
+		}
+		schedule();
+	}
+
+	return 0;
+}
+
+static int dlm_sendd(void *data)
+{
+	DECLARE_WAITQUEUE(wait, current);
+
+	add_wait_queue(sctp_con.sock->sk->sk_sleep, &wait);
+
+	while (!kthread_should_stop()) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (write_list_empty())
+			schedule();
+		set_current_state(TASK_RUNNING);
+
+		if (sctp_con.eagain_flag) {
+			sctp_con.eagain_flag = 0;
+			refill_write_queue();
+		}
+		process_output_queue();
+	}
+
+	remove_wait_queue(sctp_con.sock->sk->sk_sleep, &wait);
+
+	return 0;
+}
+
+static void daemons_stop(void)
+{
+	kthread_stop(recv_task);
+	kthread_stop(send_task);
+}
+
+static int daemons_start(void)
+{
+	struct task_struct *p;
+	int error;
+
+	p = kthread_run(dlm_recvd, NULL, "dlm_recvd");
+	error = IS_ERR(p);
+       	if (error) {
+		log_print("can't start dlm_recvd %d", error);
+		return error;
+	}
+	recv_task = p;
+
+	p = kthread_run(dlm_sendd, NULL, "dlm_sendd");
+	error = IS_ERR(p);
+       	if (error) {
+		log_print("can't start dlm_sendd %d", error);
+		kthread_stop(recv_task);
+		return error;
+	}
+	send_task = p;
+
+	return 0;
+}
+
+/*
+ * This is quite likely to sleep...
+ * Temporarily initialise the waitq head so that lowcomms_send_message
+ * doesn't crash if it gets called before the thread is fully
+ * initialised
+ */
+
+int dlm_lowcomms_start(void)
+{
+	int error;
+
+	spin_lock_init(&write_nodes_lock);
+	INIT_LIST_HEAD(&write_nodes);
+	init_rwsem(&nodeinfo_lock);
+
+	error = init_sock();
+	if (error)
+		goto fail_sock;
+	error = daemons_start();
+	if (error)
+		goto fail_sock;
+	atomic_set(&accepting, 1);
+	return 0;
+
+ fail_sock:
+	close_connection();
+	return error;
+}
+
+/* Set all the activity flags to prevent any socket activity. */
+
+void dlm_lowcomms_stop(void)
+{
+	atomic_set(&accepting, 0);
+	sctp_con.flags = 0x7;
+	daemons_stop();
+	clean_writequeues();
+	close_connection();
+	dealloc_nodeinfo();
+	max_nodeid = 0;
+}
+
+int dlm_lowcomms_init(void)
+{
+	init_waitqueue_head(&lowcomms_recv_wait);
+	INIT_LIST_HEAD(&nodes);
+	init_MUTEX(&nodes_sem);
+	return 0;
+}
+
+void dlm_lowcomms_exit(void)
+{
+	struct dlm_node *node, *safe;
+	int i;
+
+	for (i = 0; i < local_count; i++)
+		kfree(local_addr[i]);
+	local_nodeid = 0;
+	local_weight = 0;
+	local_count = 0;
+
+	list_for_each_entry_safe(node, safe, &nodes, list) {
+		list_del(&node->list);
+		kfree(node);
+	}
+}
+
--- a/drivers/dlm/lowcomms.h	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/lowcomms.h	2005-05-12 23:13:15.829485512 +0800
@@ -0,0 +1,28 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) Sistina Software, Inc.  1997-2003  All rights reserved.
+**  Copyright (C) 2004-2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#ifndef __LOWCOMMS_DOT_H__
+#define __LOWCOMMS_DOT_H__
+
+int dlm_lowcomms_init(void);
+void dlm_lowcomms_exit(void);
+int dlm_lowcomms_start(void);
+void dlm_lowcomms_stop(void);
+void *dlm_lowcomms_get_buffer(int nodeid, int len, int allocation, char **ppc);
+void dlm_lowcomms_commit_buffer(void *mh);
+int dlm_set_node(int nodeid, int weight, char *addr_buf);
+int dlm_set_local(int nodeid, int weight, char *addr_buf);
+int dlm_our_nodeid(void);
+
+#endif				/* __LOWCOMMS_DOT_H__ */
+
--- a/drivers/dlm/midcomms.c	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/midcomms.c	2005-05-12 23:13:15.830485360 +0800
@@ -0,0 +1,139 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) Sistina Software, Inc.  1997-2003  All rights reserved.
+**  Copyright (C) 2004-2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+/*
+ * midcomms.c
+ *
+ * This is the appallingly named "mid-level" comms layer.
+ *
+ * Its purpose is to take packets from the "real" comms layer,
+ * split them up into packets and pass them to the interested
+ * part of the locking mechanism.
+ *
+ * It also takes messages from the locking layer, formats them
+ * into packets and sends them to the comms layer.
+ */
+
+#include "dlm_internal.h"
+#include "lowcomms.h"
+#include "config.h"
+#include "rcom.h"
+#include "lock.h"
+
+
+static void copy_from_cb(void *dst, const void *base, unsigned offset,
+			 unsigned len, unsigned limit)
+{
+	unsigned copy = len;
+
+	if ((copy + offset) > limit)
+		copy = limit - offset;
+	memcpy(dst, base + offset, copy);
+	len -= copy;
+	if (len)
+		memcpy(dst + copy, base, len);
+}
+
+/*
+ * Called from the low-level comms layer to process a buffer of
+ * commands.
+ *
+ * Only complete messages are processed here, any "spare" bytes from
+ * the end of a buffer are saved and tacked onto the front of the next
+ * message that comes in. I doubt this will happen very often but we
+ * need to be able to cope with it and I don't want the task to be waiting
+ * for packets to come in when there is useful work to be done.
+ */
+
+int dlm_process_incoming_buffer(int nodeid, const void *base,
+				unsigned offset, unsigned len, unsigned limit)
+{
+	unsigned char __tmp[DLM_INBUF_LEN];
+	struct dlm_header *msg = (struct dlm_header *) __tmp;
+	int ret = 0;
+	int err = 0;
+	uint16_t msglen;
+	uint32_t lockspace;
+
+	while (len > sizeof(struct dlm_header)) {
+
+		/* Copy just the header to check the total length.  The
+		   message may wrap around the end of the buffer back to the
+		   start, so we need to use a temp buffer and copy_from_cb. */
+
+		copy_from_cb(msg, base, offset, sizeof(struct dlm_header),
+			     limit);
+
+		msglen = le16_to_cpu(msg->h_length);
+		lockspace = msg->h_lockspace;
+
+		err = -EINVAL;
+		if (msglen < sizeof(struct dlm_header))
+			break;
+		err = -E2BIG;
+		if (msglen > dlm_config.buffer_size) {
+			log_print("message size %d from %d too big, buf len %d",
+				  msglen, nodeid, len);
+			break;
+		}
+		err = 0;
+
+		/* If only part of the full message is contained in this
+		   buffer, then do nothing and wait for lowcomms to call
+		   us again later with more data.  We return 0 meaning
+		   we've consumed none of the input buffer. */
+
+		if (msglen > len)
+			break;
+
+		/* Allocate a larger temp buffer if the full message won't fit
+		   in the buffer on the stack (which should work for most
+		   ordinary messages). */
+
+		if (msglen > sizeof(__tmp) &&
+		    msg == (struct dlm_header *) __tmp) {
+			msg = kmalloc(dlm_config.buffer_size, GFP_KERNEL);
+			if (msg == NULL)
+				return ret;
+		}
+
+		copy_from_cb(msg, base, offset, msglen, limit);
+
+		BUG_ON(lockspace != msg->h_lockspace);
+
+		ret += msglen;
+		offset += msglen;
+		offset &= (limit - 1);
+		len -= msglen;
+
+		switch (msg->h_cmd) {
+		case DLM_MSG:
+			dlm_receive_message(msg, nodeid, FALSE);
+			break;
+
+		case DLM_RCOM:
+			dlm_receive_rcom(msg, nodeid);
+			break;
+
+		default:
+			log_print("unknown msg type %x from %u: %u %u %u %u",
+				  msg->h_cmd, nodeid, msglen, len, offset, ret);
+		}
+	}
+
+	if (msg != (struct dlm_header *) __tmp)
+		kfree(msg);
+
+	return err ? err : ret;
+}
+
--- a/drivers/dlm/midcomms.h	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/midcomms.h	2005-05-12 23:13:15.830485360 +0800
@@ -0,0 +1,21 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) Sistina Software, Inc.  1997-2003  All rights reserved.
+**  Copyright (C) 2004-2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#ifndef __MIDCOMMS_DOT_H__
+#define __MIDCOMMS_DOT_H__
+
+int dlm_process_incoming_buffer(int nodeid, const void *base, unsigned offset,
+				unsigned len, unsigned limit);
+
+#endif				/* __MIDCOMMS_DOT_H__ */
+



From teigland at redhat.com  Mon May 16 07:20:41 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 16 May 2005 15:20:41 +0800
Subject: [Linux-cluster] [PATCH 4/8] dlm: recovery
Message-ID: <20050516072041.GI7094@redhat.com>

When a node is removed from a lockspace, recovery is required for that
lockspace on all the remaining lockspace members.  Recovery involves: a
full rebuild of the distributed resource directory, selecting a new master
node for locks/resources previously mastered on the removed node, and
rebuilding master-copy locks on newly selected masters.

Signed-off-by: Dave Teigland <teigland at redhat.com>
Signed-off-by: Patrick Caulfield <pcaulfie at redhat.com>

---

 drivers/dlm/member.c       |  356 +++++++++++++++++++++
 drivers/dlm/member.h       |   29 +
 drivers/dlm/rcom.c         |  466 ++++++++++++++++++++++++++++
 drivers/dlm/rcom.h         |   24 +
 drivers/dlm/recover.c      |  729 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/dlm/recover.h      |   32 +
 drivers/dlm/recoverd.c     |  714 ++++++++++++++++++++++++++++++++++++++++++++
 drivers/dlm/recoverd.h     |   24 +
 drivers/dlm/requestqueue.c |  144 ++++++++
 drivers/dlm/requestqueue.h |   22 +
 10 files changed, 2540 insertions(+)

--- a/drivers/dlm/rcom.c	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/rcom.c	2005-05-12 23:13:15.830485360 +0800
@@ -0,0 +1,466 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) Sistina Software, Inc.  1997-2003  All rights reserved.
+**  Copyright (C) 2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#include "dlm_internal.h"
+#include "lockspace.h"
+#include "member.h"
+#include "lowcomms.h"
+#include "midcomms.h"
+#include "rcom.h"
+#include "recover.h"
+#include "dir.h"
+#include "config.h"
+#include "memory.h"
+#include "lock.h"
+#include "util.h"
+
+
+static int rcom_response(struct dlm_ls *ls)
+{
+	return test_bit(LSFL_RCOM_READY, &ls->ls_flags);
+}
+
+static int create_rcom(struct dlm_ls *ls, int to_nodeid, int type, int len,
+		       struct dlm_rcom **rc_ret, struct dlm_mhandle **mh_ret)
+{
+	struct dlm_rcom *rc;
+	struct dlm_mhandle *mh;
+	char *mb;
+	int mb_len = sizeof(struct dlm_rcom) + len;
+
+	mh = dlm_lowcomms_get_buffer(to_nodeid, mb_len, GFP_KERNEL, &mb);
+	if (!mh) {
+		log_print("create_rcom to %d type %d len %d ENOBUFS",
+			  to_nodeid, type, len);
+		return -ENOBUFS;
+	}
+	memset(mb, 0, mb_len);
+
+	rc = (struct dlm_rcom *) mb;
+
+	rc->rc_header.h_version = (DLM_HEADER_MAJOR | DLM_HEADER_MINOR);
+	rc->rc_header.h_lockspace = ls->ls_global_id;
+	rc->rc_header.h_nodeid = dlm_our_nodeid();
+	rc->rc_header.h_length = mb_len;
+	rc->rc_header.h_cmd = DLM_RCOM;
+
+	rc->rc_type = type;
+
+	*mh_ret = mh;
+	*rc_ret = rc;
+	return 0;
+}
+
+static void send_rcom(struct dlm_ls *ls, struct dlm_mhandle *mh,
+		      struct dlm_rcom *rc)
+{
+	dlm_rcom_out(rc);
+	dlm_lowcomms_commit_buffer(mh);
+}
+
+/* When replying to a status request, a node also sends back its
+   configuration values.  The requesting node then checks that the remote
+   node is configured the same way as itself. */
+
+static void make_config(struct dlm_ls *ls, struct rcom_config *rf)
+{
+	rf->rf_lvblen = ls->ls_lvblen;
+}
+
+static int check_config(struct dlm_ls *ls, struct rcom_config *rf)
+{
+	if (rf->rf_lvblen != ls->ls_lvblen)
+		return -EINVAL;
+	return 0;
+}
+
+static int make_status(struct dlm_ls *ls)
+{
+	int status = 0;
+
+	if (test_bit(LSFL_NODES_VALID, &ls->ls_flags))
+		status |= NODES_VALID;
+
+	if (test_bit(LSFL_ALL_NODES_VALID, &ls->ls_flags))
+		status |= NODES_ALL_VALID;
+
+	if (test_bit(LSFL_DIR_VALID, &ls->ls_flags))
+		status |= DIR_VALID;
+
+	if (test_bit(LSFL_ALL_DIR_VALID, &ls->ls_flags))
+		status |= DIR_ALL_VALID;
+
+	if (test_bit(LSFL_LOCKS_VALID, &ls->ls_flags))
+		status |= LOCKS_VALID;
+
+	if (test_bit(LSFL_ALL_LOCKS_VALID, &ls->ls_flags))
+		status |= LOCKS_ALL_VALID;
+
+	return status;
+}
+
+int dlm_rcom_status(struct dlm_ls *ls, int nodeid)
+{
+	struct dlm_rcom *rc;
+	struct dlm_mhandle *mh;
+	int error = 0;
+
+	memset(ls->ls_recover_buf, 0, dlm_config.buffer_size);
+
+	if (nodeid == dlm_our_nodeid()) {
+		rc = (struct dlm_rcom *) ls->ls_recover_buf;
+		rc->rc_result = make_status(ls);
+		goto out;
+	}
+
+	error = create_rcom(ls, nodeid, DLM_RCOM_STATUS, 0, &rc, &mh);
+	if (error)
+		goto out;
+
+	send_rcom(ls, mh, rc);
+
+	error = dlm_wait_function(ls, &rcom_response);
+	clear_bit(LSFL_RCOM_READY, &ls->ls_flags);
+	if (error)
+		goto out;
+
+	rc = (struct dlm_rcom *) ls->ls_recover_buf;
+	error = check_config(ls, (struct rcom_config *) rc->rc_buf);
+ out:
+	return error;
+}
+
+static void receive_rcom_status(struct dlm_ls *ls, struct dlm_rcom *rc_in)
+{
+	struct dlm_rcom *rc;
+	struct dlm_mhandle *mh;
+	int error, nodeid = rc_in->rc_header.h_nodeid;
+
+	error = create_rcom(ls, nodeid, DLM_RCOM_STATUS_REPLY,
+			    sizeof(struct rcom_config), &rc, &mh);
+	if (error)
+		return;
+	rc->rc_result = make_status(ls);
+	make_config(ls, (struct rcom_config *) rc->rc_buf);
+
+	send_rcom(ls, mh, rc);
+}
+
+static void receive_rcom_status_reply(struct dlm_ls *ls, struct dlm_rcom *rc_in)
+{
+	memcpy(ls->ls_recover_buf, rc_in, rc_in->rc_header.h_length);
+	set_bit(LSFL_RCOM_READY, &ls->ls_flags);
+	wake_up(&ls->ls_wait_general);
+}
+
+int dlm_rcom_names(struct dlm_ls *ls, int nodeid, char *last_name, int last_len)
+{
+	struct dlm_rcom *rc;
+	struct dlm_mhandle *mh;
+	int error = 0, len = sizeof(struct dlm_rcom);
+
+	memset(ls->ls_recover_buf, 0, dlm_config.buffer_size);
+
+	if (nodeid == dlm_our_nodeid()) {
+		dlm_copy_master_names(ls, last_name, last_len,
+		                      ls->ls_recover_buf + len,
+		                      dlm_config.buffer_size - len, nodeid);
+		goto out;
+	}
+
+	error = create_rcom(ls, nodeid, DLM_RCOM_NAMES, last_len, &rc, &mh);
+	if (error)
+		goto out;
+	memcpy(rc->rc_buf, last_name, last_len);
+
+	send_rcom(ls, mh, rc);
+
+	error = dlm_wait_function(ls, &rcom_response);
+	clear_bit(LSFL_RCOM_READY, &ls->ls_flags);
+ out:
+	return error;
+}
+
+static void receive_rcom_names(struct dlm_ls *ls, struct dlm_rcom *rc_in)
+{
+	struct dlm_rcom *rc;
+	struct dlm_mhandle *mh;
+	int error, inlen, outlen;
+	int nodeid = rc_in->rc_header.h_nodeid;
+
+	/*
+	 * We can't run dlm_dir_rebuild_send (which uses ls_nodes) while
+	 * dlm_recoverd is running ls_nodes_reconfig (which changes ls_nodes).
+	 * It could only happen in rare cases where we get a late NAMES
+	 * message from a previous instance of recovery.
+	 */
+
+	if (!test_bit(LSFL_NODES_VALID, &ls->ls_flags)) {
+		log_debug(ls, "ignoring RCOM_NAMES from %u", nodeid);
+		return;
+	}
+		
+	nodeid = rc_in->rc_header.h_nodeid;
+	inlen = rc_in->rc_header.h_length - sizeof(struct dlm_rcom);
+	outlen = dlm_config.buffer_size - sizeof(struct dlm_rcom);
+
+	error = create_rcom(ls, nodeid, DLM_RCOM_NAMES_REPLY, outlen, &rc, &mh);
+	if (error)
+		return;
+
+	dlm_copy_master_names(ls, rc_in->rc_buf, inlen, rc->rc_buf, outlen,
+			      nodeid);
+	send_rcom(ls, mh, rc);
+}
+
+static void receive_rcom_names_reply(struct dlm_ls *ls, struct dlm_rcom *rc_in)
+{
+	memcpy(ls->ls_recover_buf, rc_in, rc_in->rc_header.h_length);
+	set_bit(LSFL_RCOM_READY, &ls->ls_flags);
+	wake_up(&ls->ls_wait_general);
+}
+
+int dlm_send_rcom_lookup(struct dlm_rsb *r, int dir_nodeid)
+{
+	struct dlm_rcom *rc;
+	struct dlm_mhandle *mh;
+	struct dlm_ls *ls = r->res_ls;
+	int error;
+
+	error = create_rcom(ls, dir_nodeid, DLM_RCOM_LOOKUP, r->res_length,
+			    &rc, &mh);
+	if (error)
+		goto out;
+	memcpy(rc->rc_buf, r->res_name, r->res_length);
+	rc->rc_id = (unsigned long) r;
+
+	send_rcom(ls, mh, rc);
+ out:
+	return error;
+}
+
+static void receive_rcom_lookup(struct dlm_ls *ls, struct dlm_rcom *rc_in)
+{
+	struct dlm_rcom *rc;
+	struct dlm_mhandle *mh;
+	int error, ret_nodeid, nodeid = rc_in->rc_header.h_nodeid;
+	int len = rc_in->rc_header.h_length - sizeof(struct dlm_rcom);
+
+	error = create_rcom(ls, nodeid, DLM_RCOM_LOOKUP_REPLY, 0, &rc, &mh);
+	if (error)
+		return;
+
+	error = dlm_dir_lookup(ls, nodeid, rc_in->rc_buf, len, &ret_nodeid);
+	if (error)
+		ret_nodeid = error;
+	rc->rc_result = ret_nodeid;
+	rc->rc_id = rc_in->rc_id;
+
+	send_rcom(ls, mh, rc);
+}
+
+static void receive_rcom_lookup_reply(struct dlm_ls *ls, struct dlm_rcom *rc_in)
+{
+	dlm_recover_master_reply(ls, rc_in);
+}
+
+static void pack_rcom_lock(struct dlm_rsb *r, struct dlm_lkb *lkb,
+			   struct rcom_lock *rl)
+{
+	memset(rl, 0, sizeof(*rl));
+
+	rl->rl_ownpid = lkb->lkb_ownpid;
+	rl->rl_lkid = lkb->lkb_id;
+	rl->rl_exflags = lkb->lkb_exflags;
+	rl->rl_flags = lkb->lkb_flags;
+	rl->rl_lvbseq = lkb->lkb_lvbseq;
+	rl->rl_rqmode = lkb->lkb_rqmode;
+	rl->rl_grmode = lkb->lkb_grmode;
+	rl->rl_status = lkb->lkb_status;
+	rl->rl_wait_type = lkb->lkb_wait_type;
+
+	if (lkb->lkb_bastaddr)
+		rl->rl_asts |= AST_BAST;
+	if (lkb->lkb_astaddr)
+		rl->rl_asts |= AST_COMP;
+
+	if (lkb->lkb_range)
+		memcpy(rl->rl_range, lkb->lkb_range, 4*sizeof(uint64_t));
+
+	rl->rl_namelen = r->res_length;
+	memcpy(rl->rl_name, r->res_name, r->res_length);
+
+	/* FIXME: might we have an lvb without DLM_LKF_VALBLK set ?
+	   If so, receive_rcom_lock_args() won't take this copy. */
+
+	if (lkb->lkb_lvbptr)
+		memcpy(rl->rl_lvb, lkb->lkb_lvbptr, r->res_ls->ls_lvblen);
+}
+
+int dlm_send_rcom_lock(struct dlm_rsb *r, struct dlm_lkb *lkb)
+{
+	struct dlm_ls *ls = r->res_ls;
+	struct dlm_rcom *rc;
+	struct dlm_mhandle *mh;
+	struct rcom_lock *rl;
+	int error, len = sizeof(struct rcom_lock);
+
+	if (lkb->lkb_lvbptr)
+		len += ls->ls_lvblen;
+
+	error = create_rcom(ls, r->res_nodeid, DLM_RCOM_LOCK, len, &rc, &mh);
+	if (error)
+		goto out;
+
+	rl = (struct rcom_lock *) rc->rc_buf;
+	pack_rcom_lock(r, lkb, rl);
+	rc->rc_id = (unsigned long) r;
+
+	send_rcom(ls, mh, rc);
+ out:
+	return error;
+}
+
+static void receive_rcom_lock(struct dlm_ls *ls, struct dlm_rcom *rc_in)
+{
+	struct dlm_rcom *rc;
+	struct dlm_mhandle *mh;
+	int error, nodeid = rc_in->rc_header.h_nodeid;
+
+	dlm_recover_master_copy(ls, rc_in);
+
+	error = create_rcom(ls, nodeid, DLM_RCOM_LOCK_REPLY,
+			    sizeof(struct rcom_lock), &rc, &mh);
+	if (error)
+		return;
+
+	/* We send back the same rcom_lock struct we received, but
+	   dlm_recover_master_copy() has filled in rl_remid and rl_result */
+
+	memcpy(rc->rc_buf, rc_in->rc_buf, sizeof(struct rcom_lock));
+	rc->rc_id = rc_in->rc_id;
+
+	send_rcom(ls, mh, rc);
+}
+
+static void receive_rcom_lock_reply(struct dlm_ls *ls, struct dlm_rcom *rc_in)
+{
+	if (!test_bit(LSFL_DIR_VALID, &ls->ls_flags)) {
+		log_debug(ls, "ignoring RCOM_LOCK_REPLY from %u",
+			  rc_in->rc_header.h_nodeid);
+		return;
+	}
+
+	dlm_recover_process_copy(ls, rc_in);
+}
+
+static int send_ls_not_ready(int nodeid, struct dlm_rcom *rc_in)
+{
+	struct dlm_rcom *rc;
+	struct dlm_mhandle *mh;
+	char *mb;
+	int mb_len = sizeof(struct dlm_rcom);
+
+	mh = dlm_lowcomms_get_buffer(nodeid, mb_len, GFP_KERNEL, &mb);
+	if (!mh)
+		return -ENOBUFS;
+	memset(mb, 0, mb_len);
+
+	rc = (struct dlm_rcom *) mb;
+
+	rc->rc_header.h_version = (DLM_HEADER_MAJOR | DLM_HEADER_MINOR);
+	rc->rc_header.h_lockspace = rc_in->rc_header.h_lockspace;
+	rc->rc_header.h_nodeid = dlm_our_nodeid();
+	rc->rc_header.h_length = mb_len;
+	rc->rc_header.h_cmd = DLM_RCOM;
+
+	rc->rc_type = DLM_RCOM_STATUS_REPLY;
+	rc->rc_result = 0;
+
+	dlm_rcom_out(rc);
+	dlm_lowcomms_commit_buffer(mh);
+
+	return 0;
+}
+
+/* Called by dlm_recvd; corresponds to dlm_receive_message() but special
+   recovery-only comms are sent through here. */
+
+void dlm_receive_rcom(struct dlm_header *hd, int nodeid)
+{
+	struct dlm_rcom *rc = (struct dlm_rcom *) hd;
+	struct dlm_ls *ls;
+
+	dlm_rcom_in(rc);
+
+	/* If the lockspace doesn't exist then still send a status message
+	   back; it's possible that it just doesn't have its global_id yet. */
+
+	ls = dlm_find_lockspace_global(hd->h_lockspace);
+	if (!ls) {
+		send_ls_not_ready(nodeid, rc);
+		return;
+	}
+
+	if (dlm_recovery_stopped(ls) && (rc->rc_type != DLM_RCOM_STATUS)) {
+		log_error(ls, "ignoring recovery message %x from %d",
+			  rc->rc_type, nodeid);
+		goto out;
+	}
+
+	if (nodeid != rc->rc_header.h_nodeid) {
+		log_error(ls, "bad rcom nodeid %d from %d",
+			  rc->rc_header.h_nodeid, nodeid);
+		goto out;
+	}
+
+	switch (rc->rc_type) {
+	case DLM_RCOM_STATUS:
+		receive_rcom_status(ls, rc);
+		break;
+
+	case DLM_RCOM_NAMES:
+		receive_rcom_names(ls, rc);
+		break;
+
+	case DLM_RCOM_LOOKUP:
+		receive_rcom_lookup(ls, rc);
+		break;
+
+	case DLM_RCOM_LOCK:
+		receive_rcom_lock(ls, rc);
+		break;
+
+	case DLM_RCOM_STATUS_REPLY:
+		receive_rcom_status_reply(ls, rc);
+		break;
+
+	case DLM_RCOM_NAMES_REPLY:
+		receive_rcom_names_reply(ls, rc);
+		break;
+
+	case DLM_RCOM_LOOKUP_REPLY:
+		receive_rcom_lookup_reply(ls, rc);
+		break;
+
+	case DLM_RCOM_LOCK_REPLY:
+		receive_rcom_lock_reply(ls, rc);
+		break;
+
+	default:
+		DLM_ASSERT(0, printk("rc_type=%x\n", rc->rc_type););
+	}
+ out:
+	dlm_put_lockspace(ls);
+}
+
--- a/drivers/dlm/rcom.h	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/rcom.h	2005-05-12 23:13:15.830485360 +0800
@@ -0,0 +1,24 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) Sistina Software, Inc.  1997-2003  All rights reserved.
+**  Copyright (C) 2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#ifndef __RCOM_DOT_H__
+#define __RCOM_DOT_H__
+
+int dlm_rcom_status(struct dlm_ls *ls, int nodeid);
+int dlm_rcom_names(struct dlm_ls *ls, int nodeid, char *last_name,int last_len);
+int dlm_send_rcom_lookup(struct dlm_rsb *r, int dir_nodeid);
+int dlm_send_rcom_lock(struct dlm_rsb *r, struct dlm_lkb *lkb);
+void dlm_receive_rcom(struct dlm_header *hd, int nodeid);
+
+#endif
+
--- a/drivers/dlm/recover.c	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/recover.c	2005-05-12 23:13:15.830485360 +0800
@@ -0,0 +1,729 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) Sistina Software, Inc.  1997-2003  All rights reserved.
+**  Copyright (C) 2004-2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#include "dlm_internal.h"
+#include "lockspace.h"
+#include "dir.h"
+#include "config.h"
+#include "ast.h"
+#include "memory.h"
+#include "rcom.h"
+#include "lock.h"
+#include "lowcomms.h"
+#include "member.h"
+
+static struct timer_list dlm_timer;
+
+
+/*
+ * Recovery waiting routines: these functions wait for a particular reply from
+ * a remote node, or for the remote node to report a certain status.  They need
+ * to abort if the lockspace is stopped indicating a node has failed (perhaps
+ * the one being waited for).
+ */
+
+int dlm_recovery_stopped(struct dlm_ls *ls)
+{
+	return test_bit(LSFL_LS_STOP, &ls->ls_flags);
+}
+
+/*
+ * Wait until given function returns non-zero or lockspace is stopped (LS_STOP
+ * set due to failure of a node in ls_nodes).  When another function thinks it
+ * could have completed the waited-on task, they should wake up ls_wait_general
+ * to get an immediate response rather than waiting for the timer to detect the
+ * result.  A timer wakes us up periodically while waiting to see if we should
+ * abort due to a node failure.  This should only be called by the dlm_recoverd
+ * thread.
+ */
+
+static void dlm_wait_timer_fn(unsigned long data)
+{
+	struct dlm_ls *ls = (struct dlm_ls *) data;
+	mod_timer(&dlm_timer, jiffies + (dlm_config.recover_timer * HZ));
+	wake_up(&ls->ls_wait_general);
+}
+
+int dlm_wait_function(struct dlm_ls *ls, int (*testfn) (struct dlm_ls *ls))
+{
+	int error = 0;
+
+	init_timer(&dlm_timer);
+	dlm_timer.function = dlm_wait_timer_fn;
+	dlm_timer.data = (long) ls;
+	dlm_timer.expires = jiffies + (dlm_config.recover_timer * HZ);
+	add_timer(&dlm_timer);
+
+	wait_event(ls->ls_wait_general, testfn(ls) || dlm_recovery_stopped(ls));
+	del_timer_sync(&dlm_timer);
+
+	if (dlm_recovery_stopped(ls))
+		error = -EINTR;
+	return error;
+}
+
+/*
+ * An efficient way for all nodes to wait for all others to have a certain
+ * status.  The node with the lowest nodeid polls all the others for their
+ * status (dlm_wait_status_all) and all the others poll the node with the low
+ * id for its accumulated result (dlm_wait_status_low).
+ */
+
+static int dlm_wait_status_all(struct dlm_ls *ls, unsigned int wait_status)
+{
+	struct dlm_rcom *rc = (struct dlm_rcom *) ls->ls_recover_buf;
+	struct dlm_member *memb;
+	int error = 0, delay;
+
+	list_for_each_entry(memb, &ls->ls_nodes, list) {
+		delay = 0;
+		for (;;) {
+			error = dlm_recovery_stopped(ls);
+			if (error)
+				goto out;
+
+			error = dlm_rcom_status(ls, memb->nodeid);
+			if (error)
+				goto out;
+
+			if (rc->rc_result & wait_status)
+				break;
+			if (delay < 1000)
+				delay += 20;
+			msleep(delay);
+		}
+	}
+ out:
+	return error;
+}
+
+static int dlm_wait_status_low(struct dlm_ls *ls, unsigned int wait_status)
+{
+	struct dlm_rcom *rc = (struct dlm_rcom *) ls->ls_recover_buf;
+	int error = 0, delay = 0, nodeid = ls->ls_low_nodeid;
+
+	for (;;) {
+		error = dlm_recovery_stopped(ls);
+		if (error)
+			goto out;
+
+		error = dlm_rcom_status(ls, nodeid);
+		if (error)
+			break;
+
+		if (rc->rc_result & wait_status)
+			break;
+		if (delay < 1000)
+			delay += 20;
+		msleep(delay);
+	}
+ out:
+	return error;
+}
+
+int dlm_recover_members_wait(struct dlm_ls *ls)
+{
+	int error;
+
+	if (ls->ls_low_nodeid == dlm_our_nodeid()) {
+		error = dlm_wait_status_all(ls, NODES_VALID);
+		if (!error)
+			set_bit(LSFL_ALL_NODES_VALID, &ls->ls_flags);
+	} else
+		error = dlm_wait_status_low(ls, NODES_ALL_VALID);
+
+	return error;
+}
+
+int dlm_recover_directory_wait(struct dlm_ls *ls)
+{
+	int error;
+
+	if (ls->ls_low_nodeid == dlm_our_nodeid()) {
+		error = dlm_wait_status_all(ls, DIR_VALID);
+		if (!error)
+			set_bit(LSFL_ALL_DIR_VALID, &ls->ls_flags);
+	} else
+		error = dlm_wait_status_low(ls, DIR_ALL_VALID);
+
+	return error;
+}
+
+int dlm_recover_locks_wait(struct dlm_ls *ls)
+{
+	int error;
+
+	if (ls->ls_low_nodeid == dlm_our_nodeid()) {
+		error = dlm_wait_status_all(ls, LOCKS_VALID);
+		if (!error)
+			set_bit(LSFL_ALL_LOCKS_VALID, &ls->ls_flags);
+	} else
+		error = dlm_wait_status_low(ls, LOCKS_ALL_VALID);
+
+	return error;
+}
+
+/*
+ * The recover_list contains all the rsb's for which we've requested the new
+ * master nodeid.  As replies are returned from the resource directories the
+ * rsb's are removed from the list.  When the list is empty we're done.
+ *
+ * The recover_list is later similarly used for all rsb's for which we've sent
+ * new lkb's and need to receive new corresponding lkid's.
+ *
+ * We use the address of the rsb struct as a simple local identifier for the
+ * rsb so we can match an rcom reply with the rsb it was sent for.
+ */
+
+static int recover_list_empty(struct dlm_ls *ls)
+{
+	int empty;
+
+	spin_lock(&ls->ls_recover_list_lock);
+	empty = list_empty(&ls->ls_recover_list);
+	spin_unlock(&ls->ls_recover_list_lock);
+
+	return empty;
+}
+
+static void recover_list_add(struct dlm_rsb *r)
+{
+	struct dlm_ls *ls = r->res_ls;
+
+	spin_lock(&ls->ls_recover_list_lock);
+	if (list_empty(&r->res_recover_list)) {
+		list_add_tail(&r->res_recover_list, &ls->ls_recover_list);
+		ls->ls_recover_list_count++;
+		dlm_hold_rsb(r);
+	}
+	spin_unlock(&ls->ls_recover_list_lock);
+}
+
+static void recover_list_del(struct dlm_rsb *r)
+{
+	struct dlm_ls *ls = r->res_ls;
+
+	spin_lock(&ls->ls_recover_list_lock);
+	list_del_init(&r->res_recover_list);
+	ls->ls_recover_list_count--;
+	spin_unlock(&ls->ls_recover_list_lock);
+
+	dlm_put_rsb(r);
+}
+
+static struct dlm_rsb *recover_list_find(struct dlm_ls *ls, uint64_t id)
+{
+	struct dlm_rsb *r = NULL;
+
+	spin_lock(&ls->ls_recover_list_lock);
+
+	list_for_each_entry(r, &ls->ls_recover_list, res_recover_list) {
+		if (id == (unsigned long) r)
+			goto out;
+	}
+	r = NULL;
+ out:
+	spin_unlock(&ls->ls_recover_list_lock);
+	return r;
+}
+
+void recover_list_clear(struct dlm_ls *ls)
+{
+	struct dlm_rsb *r, *s;
+
+	spin_lock(&ls->ls_recover_list_lock);
+	list_for_each_entry_safe(r, s, &ls->ls_recover_list, res_recover_list) {
+		list_del_init(&r->res_recover_list);
+		dlm_put_rsb(r);
+		ls->ls_recover_list_count--;
+	}
+
+	if (ls->ls_recover_list_count != 0) {
+		log_error(ls, "warning: recover_list_count %d",
+			  ls->ls_recover_list_count);
+		ls->ls_recover_list_count = 0;
+	}
+	spin_unlock(&ls->ls_recover_list_lock);
+}
+
+
+/* Master recovery: find new master node for rsb's that were
+   mastered on nodes that have been removed.
+
+   dlm_recover_masters
+   recover_master
+   dlm_send_rcom_lookup            ->  receive_rcom_lookup
+                                       dlm_dir_lookup
+   receive_rcom_lookup_reply       <-
+   dlm_recover_master_reply
+   set_new_master
+   set_master_lkbs
+   set_lock_master
+*/
+
+/*
+ * Set the lock master for all LKBs in a lock queue
+ * If we are the new master of the rsb, we may have received new
+ * MSTCPY locks from other nodes already which we need to ignore
+ * when setting the new nodeid.
+ */
+
+static void set_lock_master(struct list_head *queue, int nodeid)
+{
+	struct dlm_lkb *lkb;
+
+	list_for_each_entry(lkb, queue, lkb_statequeue)
+		if (!(lkb->lkb_flags & DLM_IFL_MSTCPY))
+			lkb->lkb_nodeid = nodeid;
+}
+
+static void set_master_lkbs(struct dlm_rsb *r)
+{
+	set_lock_master(&r->res_grantqueue, r->res_nodeid);
+	set_lock_master(&r->res_convertqueue, r->res_nodeid);
+	set_lock_master(&r->res_waitqueue, r->res_nodeid);
+}
+
+/*
+ * Propogate the new master nodeid to locks
+ * The NEW_MASTER flag tells dlm_recover_locks() which rsb's to consider.
+ * The NEW_MASTER2 flag tells recover_lvb() which rsb's to consider.
+ */
+
+static void set_new_master(struct dlm_rsb *r, int nodeid)
+{
+	lock_rsb(r);
+
+	if (nodeid == dlm_our_nodeid())
+		r->res_nodeid = 0;
+	else
+		r->res_nodeid = nodeid;
+
+	set_master_lkbs(r);
+
+	set_bit(RESFL_NEW_MASTER, &r->res_flags);
+	set_bit(RESFL_NEW_MASTER2, &r->res_flags);
+	unlock_rsb(r);
+}
+
+/*
+ * We do async lookups on rsb's that need new masters.  The rsb's
+ * waiting for a lookup reply are kept on the recover_list.
+ */
+
+static int recover_master(struct dlm_rsb *r)
+{
+	struct dlm_ls *ls = r->res_ls;
+	int error, dir_nodeid, ret_nodeid, our_nodeid = dlm_our_nodeid();
+
+	dir_nodeid = dlm_dir_nodeid(r);
+
+	if (dir_nodeid == our_nodeid) {
+		error = dlm_dir_lookup(ls, our_nodeid, r->res_name,
+				       r->res_length, &ret_nodeid);
+		if (error)
+			log_error(ls, "recover dir lookup error %d", error);
+
+		set_new_master(r, ret_nodeid);
+	} else {
+		recover_list_add(r);
+		error = dlm_send_rcom_lookup(r, dir_nodeid);
+	}
+
+	return error;
+}
+
+/*
+ * Go through local root resources and for each rsb which has a master which
+ * has departed, get the new master nodeid from the directory.  The dir will
+ * assign mastery to the first node to look up the new master.  That means
+ * we'll discover in this lookup if we're the new master of any rsb's.
+ *
+ * We fire off all the dir lookup requests individually and asynchronously to
+ * the correct dir node.
+ */
+
+int dlm_recover_masters(struct dlm_ls *ls)
+{
+	struct dlm_rsb *r;
+	int error, count = 0;
+
+	log_debug(ls, "dlm_recover_masters");
+
+	down_read(&ls->ls_root_sem);
+	list_for_each_entry(r, &ls->ls_root_list, res_root_list) {
+		if (is_master(r))
+			continue;
+
+		error = dlm_recovery_stopped(ls);
+		if (error) {
+			up_read(&ls->ls_root_sem);
+			goto out;
+		}
+
+		if (dlm_is_removed(ls, r->res_nodeid)) {
+			recover_master(r);
+			count++;
+		}
+
+		schedule();
+	}
+	up_read(&ls->ls_root_sem);
+
+	log_debug(ls, "dlm_recover_masters %d resources", count);
+
+	error = dlm_wait_function(ls, &recover_list_empty);
+ out:
+	if (error)
+		recover_list_clear(ls);
+	return error;
+}
+
+int dlm_recover_master_reply(struct dlm_ls *ls, struct dlm_rcom *rc)
+{
+	struct dlm_rsb *r;
+
+	r = recover_list_find(ls, rc->rc_id);
+	if (!r) {
+		log_error(ls, "dlm_recover_master_reply no id %"PRIx64"",
+			  rc->rc_id);
+		goto out;
+	}
+
+	set_new_master(r, rc->rc_result);
+	recover_list_del(r);
+
+	if (recover_list_empty(ls))
+		wake_up(&ls->ls_wait_general);
+ out:
+	return 0;
+}
+
+
+/* Lock recovery: rebuild the process-copy locks we hold on a
+   remastered rsb on the new rsb master.
+
+   dlm_recover_locks
+   recover_locks
+   recover_locks_queue
+   dlm_send_rcom_lock              ->  receive_rcom_lock
+                                       dlm_recover_master_copy
+   receive_rcom_lock_reply         <-
+   dlm_recover_process_copy
+*/
+
+
+/*
+ * keep a count of the number of lkb's we send to the new master; when we get
+ * an equal number of replies then recovery for the rsb is done
+ */
+
+static int recover_locks_queue(struct dlm_rsb *r, struct list_head *head)
+{
+	struct dlm_lkb *lkb;
+	int error = 0;
+
+	list_for_each_entry(lkb, head, lkb_statequeue) {
+	   	error = dlm_send_rcom_lock(r, lkb);
+		if (error)
+			break;
+		r->res_recover_locks_count++;
+	}
+
+	return error;
+}
+
+static int all_queues_empty(struct dlm_rsb *r)
+{
+	if (!list_empty(&r->res_grantqueue) ||
+	    !list_empty(&r->res_convertqueue) ||
+	    !list_empty(&r->res_waitqueue))
+		return FALSE;
+	return TRUE;
+}
+
+static int recover_locks(struct dlm_rsb *r)
+{
+	int error = 0;
+
+	lock_rsb(r);
+	if (all_queues_empty(r))
+		goto out;
+
+	DLM_ASSERT(!r->res_recover_locks_count, dlm_print_rsb(r););
+
+	error = recover_locks_queue(r, &r->res_grantqueue);
+	if (error)
+		goto out;
+	error = recover_locks_queue(r, &r->res_convertqueue);
+	if (error)
+		goto out;
+	error = recover_locks_queue(r, &r->res_waitqueue);
+	if (error)
+		goto out;
+
+	if (r->res_recover_locks_count)
+		recover_list_add(r);
+ out:
+	unlock_rsb(r);
+	return error;
+}
+
+int dlm_recover_locks(struct dlm_ls *ls)
+{
+	struct dlm_rsb *r;
+	int error, count = 0;
+
+	log_debug(ls, "dlm_recover_locks");
+
+	down_read(&ls->ls_root_sem);
+	list_for_each_entry(r, &ls->ls_root_list, res_root_list) {
+		if (is_master(r)) {
+			clear_bit(RESFL_NEW_MASTER, &r->res_flags);
+			continue;
+		}
+
+		if (!test_bit(RESFL_NEW_MASTER, &r->res_flags))
+			continue;
+
+		error = dlm_recovery_stopped(ls);
+		if (error) {
+			up_read(&ls->ls_root_sem);
+			goto out;
+		}
+
+		error = recover_locks(r);
+		if (error) {
+			up_read(&ls->ls_root_sem);
+			goto out;
+		}
+
+		count += r->res_recover_locks_count;
+	}
+	up_read(&ls->ls_root_sem);
+
+	log_debug(ls, "dlm_recover_locks %d locks", count);
+
+	error = dlm_wait_function(ls, &recover_list_empty);
+ out:
+	if (error)
+		recover_list_clear(ls);
+	else
+		set_bit(LSFL_LOCKS_VALID, &ls->ls_flags);
+	return error;
+}
+
+void dlm_recovered_lock(struct dlm_rsb *r)
+{
+	r->res_recover_locks_count--;
+	if (!r->res_recover_locks_count) {
+		clear_bit(RESFL_NEW_MASTER, &r->res_flags);
+		recover_list_del(r);
+	}
+
+	if (recover_list_empty(r->res_ls))
+		wake_up(&r->res_ls->ls_wait_general);
+}
+
+/*
+ * The lvb needs to be recovered on all master rsb's.  This includes setting
+ * the VALNOTVALID flag if necessary, and determining the correct lvb contents
+ * based on the lvb's of the locks held on the rsb.
+ *
+ * RESFL_VALNOTVALID is set if there are only NL/CR locks on the rsb.  If it
+ * was already set prior to recovery, it's not cleared, regardless of locks.
+ *
+ * The LVB contents are only considered for changing when this is a new master
+ * of the rsb (NEW_MASTER2).  Then, the rsb's lvb is taken from any lkb with
+ * mode > CR.  If no lkb's exist with mode above CR, the lvb contents are taken
+ * from the lkb with the largest lvb sequence number.
+ */
+
+static void recover_lvb(struct dlm_rsb *r)
+{
+	struct dlm_lkb *lkb, *high_lkb = NULL;
+	uint32_t high_seq = 0;
+	int lock_lvb_exists = FALSE;
+	int big_lock_exists = FALSE;
+	int lvblen = r->res_ls->ls_lvblen;
+
+	list_for_each_entry(lkb, &r->res_grantqueue, lkb_statequeue) {
+		if (!(lkb->lkb_exflags & DLM_LKF_VALBLK))
+			continue;
+
+		lock_lvb_exists = TRUE;
+
+		if (lkb->lkb_grmode > DLM_LOCK_CR) {
+			big_lock_exists = TRUE;
+			goto setflag;
+		}
+
+		if (((int)lkb->lkb_lvbseq - (int)high_seq) >= 0) {
+			high_lkb = lkb;
+			high_seq = lkb->lkb_lvbseq;
+		}
+	}
+
+	list_for_each_entry(lkb, &r->res_convertqueue, lkb_statequeue) {
+		if (!(lkb->lkb_exflags & DLM_LKF_VALBLK))
+			continue;
+
+		lock_lvb_exists = TRUE;
+
+		if (lkb->lkb_grmode > DLM_LOCK_CR) {
+			big_lock_exists = TRUE;
+			goto setflag;
+		}
+
+		if (((int)lkb->lkb_lvbseq - (int)high_seq) >= 0) {
+			high_lkb = lkb;
+			high_seq = lkb->lkb_lvbseq;
+		}
+	}
+
+ setflag:
+	if (!lock_lvb_exists)
+		goto out;
+
+	if (!big_lock_exists)
+		set_bit(RESFL_VALNOTVALID, &r->res_flags);
+
+	/* don't mess with the lvb unless we're the new master */
+	if (!test_bit(RESFL_NEW_MASTER2, &r->res_flags))
+		goto out;
+
+	if (!r->res_lvbptr) {
+		r->res_lvbptr = allocate_lvb(r->res_ls);
+		if (!r->res_lvbptr)
+			goto out;
+	}
+
+	if (big_lock_exists) {
+		r->res_lvbseq = lkb->lkb_lvbseq;
+		memcpy(r->res_lvbptr, lkb->lkb_lvbptr, lvblen);
+	} else if (high_lkb) {
+		r->res_lvbseq = high_lkb->lkb_lvbseq;
+		memcpy(r->res_lvbptr, high_lkb->lkb_lvbptr, lvblen);
+	} else {
+		r->res_lvbseq = 0;
+		memset(r->res_lvbptr, 0, lvblen);
+	}
+ out:
+	return;
+}
+
+/* All master rsb's flagged RECOVER_CONVERT need to be looked at.  The locks
+   converting PR->CW or CW->PR need to have their lkb_grmode set. */
+
+static void recover_conversion(struct dlm_rsb *r)
+{
+	struct dlm_lkb *lkb;
+	int grmode = -1;
+
+	list_for_each_entry(lkb, &r->res_grantqueue, lkb_statequeue) {
+		if (lkb->lkb_grmode == DLM_LOCK_PR ||
+		    lkb->lkb_grmode == DLM_LOCK_CW) {
+			grmode = lkb->lkb_grmode;
+			break;
+		}
+	}
+
+	list_for_each_entry(lkb, &r->res_convertqueue, lkb_statequeue) {
+		if (lkb->lkb_grmode != DLM_LOCK_IV)
+			continue;
+		if (grmode == -1)
+			lkb->lkb_grmode = lkb->lkb_rqmode;
+		else
+			lkb->lkb_grmode = grmode;
+	}
+}
+
+void dlm_recover_rsbs(struct dlm_ls *ls)
+{
+	struct dlm_rsb *r;
+	int count = 0;
+
+	log_debug(ls, "dlm_recover_rsbs");
+
+	down_read(&ls->ls_root_sem);
+	list_for_each_entry(r, &ls->ls_root_list, res_root_list) {
+		lock_rsb(r);
+		if (is_master(r)) {
+			if (test_bit(RESFL_RECOVER_CONVERT, &r->res_flags))
+				recover_conversion(r);
+			recover_lvb(r);
+			count++;
+		}
+		clear_bit(RESFL_RECOVER_CONVERT, &r->res_flags);
+		unlock_rsb(r);
+	}
+	up_read(&ls->ls_root_sem);
+
+	log_debug(ls, "dlm_recover_rsbs %d rsbs", count);
+}
+
+/* Create a single list of all root rsb's to be used during recovery */
+
+int dlm_create_root_list(struct dlm_ls *ls)
+{
+	struct dlm_rsb *r;
+	int i, error = 0;
+
+	down_write(&ls->ls_root_sem);
+	if (!list_empty(&ls->ls_root_list)) {
+		log_error(ls, "root list not empty");
+		error = -EINVAL;
+		goto out;
+	}
+
+	for (i = 0; i < ls->ls_rsbtbl_size; i++) {
+		read_lock(&ls->ls_rsbtbl[i].lock);
+		list_for_each_entry(r, &ls->ls_rsbtbl[i].list, res_hashchain) {
+			list_add(&r->res_root_list, &ls->ls_root_list);
+			dlm_hold_rsb(r);
+		}
+		read_unlock(&ls->ls_rsbtbl[i].lock);
+	}
+ out:
+	up_write(&ls->ls_root_sem);
+	return error;
+}
+
+void dlm_release_root_list(struct dlm_ls *ls)
+{
+	struct dlm_rsb *r, *safe;
+
+	down_write(&ls->ls_root_sem);
+	list_for_each_entry_safe(r, safe, &ls->ls_root_list, res_root_list) {
+		list_del_init(&r->res_root_list);
+		dlm_put_rsb(r);
+	}
+	up_write(&ls->ls_root_sem);
+}
+
+void dlm_clear_toss_list(struct dlm_ls *ls)
+{
+	struct dlm_rsb *r, *safe;
+	int i;
+
+	for (i = 0; i < ls->ls_rsbtbl_size; i++) {
+		write_lock(&ls->ls_rsbtbl[i].lock);
+		list_for_each_entry_safe(r, safe, &ls->ls_rsbtbl[i].toss,
+					 res_hashchain) {
+			list_del(&r->res_hashchain);
+			free_rsb(r);
+		}
+		write_unlock(&ls->ls_rsbtbl[i].lock);
+	}
+}
+
--- a/drivers/dlm/recover.h	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/recover.h	2005-05-12 23:13:15.830485360 +0800
@@ -0,0 +1,32 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) Sistina Software, Inc.  1997-2003  All rights reserved.
+**  Copyright (C) 2004-2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#ifndef __RECOVER_DOT_H__
+#define __RECOVER_DOT_H__
+
+int dlm_recovery_stopped(struct dlm_ls *ls);
+int dlm_wait_function(struct dlm_ls *ls, int (*testfn) (struct dlm_ls *ls));
+int dlm_recover_masters(struct dlm_ls *ls);
+int dlm_recover_master_reply(struct dlm_ls *ls, struct dlm_rcom *rc);
+int dlm_recover_locks(struct dlm_ls *ls);
+void dlm_recovered_lock(struct dlm_rsb *r);
+int dlm_create_root_list(struct dlm_ls *ls);
+void dlm_release_root_list(struct dlm_ls *ls);
+void dlm_clear_toss_list(struct dlm_ls *ls);
+void dlm_recover_rsbs(struct dlm_ls *ls);
+int dlm_recover_members_wait(struct dlm_ls *ls);
+int dlm_recover_directory_wait(struct dlm_ls *ls);
+int dlm_recover_locks_wait(struct dlm_ls *ls);
+
+#endif				/* __RECOVER_DOT_H__ */
+
--- a/drivers/dlm/recoverd.c	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/recoverd.c	2005-05-12 23:13:15.830485360 +0800
@@ -0,0 +1,714 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) Sistina Software, Inc.  1997-2003  All rights reserved.
+**  Copyright (C) 2004-2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#include "dlm_internal.h"
+#include "lockspace.h"
+#include "member.h"
+#include "dir.h"
+#include "ast.h"
+#include "recover.h"
+#include "lowcomms.h"
+#include "lock.h"
+#include "requestqueue.h"
+
+/*
+ * next_move actions
+ */
+
+#define DO_STOP             (1)
+#define DO_START            (2)
+#define DO_FINISH           (3)
+#define DO_FINISH_STOP      (4)
+#define DO_FINISH_START     (5)
+
+static void set_start_done(struct dlm_ls *ls, int event_id)
+{
+	spin_lock(&ls->ls_recover_lock);
+	ls->ls_startdone = event_id;
+	spin_unlock(&ls->ls_recover_lock);
+
+	kobject_uevent(&ls->ls_kobj, KOBJ_CHANGE, NULL);
+}
+
+static int enable_locking(struct dlm_ls *ls, int event_id)
+{
+	int error = 0;
+
+	spin_lock(&ls->ls_recover_lock);
+	if (ls->ls_last_stop < event_id) {
+		set_bit(LSFL_LS_RUN, &ls->ls_flags);
+		up_write(&ls->ls_in_recovery);
+	} else {
+		error = -EINTR;
+		log_debug(ls, "enable_locking: abort %d", event_id);
+	}
+	spin_unlock(&ls->ls_recover_lock);
+	return error;
+}
+
+static int ls_first_start(struct dlm_ls *ls, struct dlm_recover *rv)
+{
+	int error;
+
+	log_debug(ls, "recover event %u (first)", rv->event_id);
+
+	down(&ls->ls_recoverd_active);
+
+	error = dlm_recover_members_first(ls, rv);
+	if (error) {
+		log_error(ls, "recover_members first failed %d", error);
+		goto out;
+	}
+
+	error = dlm_recover_directory(ls);
+	if (error) {
+		log_error(ls, "recover_directory failed %d", error);
+		goto out;
+	}
+
+	error = dlm_recover_directory_wait(ls);
+	if (error) {
+		log_error(ls, "recover_directory_wait failed %d", error);
+		goto out;
+	}
+
+	log_debug(ls, "recover event %u done", rv->event_id);
+	set_start_done(ls, rv->event_id);
+
+ out:
+	up(&ls->ls_recoverd_active);
+	return error;
+}
+
+/*
+ * We are given here a new group of nodes which are in the lockspace.  We first
+ * figure out the differences in ls membership from when we were last running.
+ * If nodes from before are gone, then there will be some lock recovery to do.
+ * If there are only nodes which have joined, then there's no lock recovery.
+ *
+ * Lockspace recovery for failed nodes must be completed before any nodes are
+ * allowed to join or leave the lockspace.
+ */
+
+static int ls_reconfig(struct dlm_ls *ls, struct dlm_recover *rv)
+{
+	unsigned long start;
+	int error, neg = 0;
+
+	log_debug(ls, "recover event %u", rv->event_id);
+
+	down(&ls->ls_recoverd_active);
+
+	/*
+	 * Suspending and resuming dlm_astd ensures that no lkb's from this ls
+	 * will be processed by dlm_astd during recovery.
+	 */
+
+	dlm_astd_suspend();
+	dlm_astd_resume();
+
+	/*
+	 * This list of root rsb's will be the basis of most of the recovery
+	 * routines.
+	 */
+
+	dlm_create_root_list(ls);
+
+	/*
+	 * Free all the tossed rsb's so we don't have to recover them.
+	 */
+
+	dlm_clear_toss_list(ls);
+
+	/*
+	 * Add or remove nodes from the lockspace's ls_nodes list.
+	 * Also waits for all nodes to complete dlm_recover_members.
+	 */
+
+	error = dlm_recover_members(ls, rv, &neg);
+	if (error) {
+		log_error(ls, "recover_members failed %d", error);
+		goto fail;
+	}
+	start = jiffies;
+
+	/*
+	 * Rebuild our own share of the directory by collecting from all other
+	 * nodes their master rsb names that hash to us.
+	 */
+
+	error = dlm_recover_directory(ls);
+	if (error) {
+		log_error(ls, "recover_directory failed %d", error);
+		goto fail;
+	}
+
+	/*
+	 * Purge directory-related requests that are saved in requestqueue.
+	 * All dir requests from before recovery are invalid now due to the dir
+	 * rebuild and will be resent by the requesting nodes.
+	 */
+
+	dlm_purge_requestqueue(ls);
+
+	/*
+	 * Wait for all nodes to complete directory rebuild.
+	 */
+
+	error = dlm_recover_directory_wait(ls);
+	if (error) {
+		log_error(ls, "recover_directory_wait failed %d", error);
+		goto fail;
+	}
+
+	/*
+	 * We may have outstanding operations that are waiting for a reply from
+	 * a failed node.  Mark these to be resent after recovery.  Unlock and
+	 * cancel ops can just be completed.
+	 */
+
+	dlm_recover_waiters_pre(ls);
+
+	error = dlm_recovery_stopped(ls);
+	if (error)
+		goto fail;
+
+	if (neg) {
+		/*
+		 * Clear lkb's for departed nodes.
+		 */
+
+		dlm_purge_locks(ls);
+
+		/*
+		 * Get new master nodeid's for rsb's that were mastered on
+		 * departed nodes.
+		 */
+
+		error = dlm_recover_masters(ls);
+		if (error) {
+			log_error(ls, "recover_masters failed %d", error);
+			goto fail;
+		}
+
+		/*
+		 * Send our locks on remastered rsb's to the new masters.
+		 */
+
+		error = dlm_recover_locks(ls);
+		if (error) {
+			log_error(ls, "recover_locks failed %d", error);
+			goto fail;
+		}
+
+		error = dlm_recover_locks_wait(ls);
+		if (error) {
+			log_error(ls, "recover_locks_wait failed %d", error);
+			goto fail;
+		}
+
+		/*
+		 * Finalize state in master rsb's now that all locks can be
+		 * checked.  This includes conversion resolution and lvb
+		 * settings.
+		 */
+
+		dlm_recover_rsbs(ls);
+	}
+
+	dlm_release_root_list(ls);
+
+	log_debug(ls, "recover event %u done: %u ms", rv->event_id,
+		  jiffies_to_msecs(jiffies - start));
+	set_start_done(ls, rv->event_id);
+	up(&ls->ls_recoverd_active);
+
+	return 0;
+
+ fail:
+	log_debug(ls, "recover event %d error %d", rv->event_id, error);
+	up(&ls->ls_recoverd_active);
+	return error;
+}
+
+/*
+ * Between calls to this routine for a ls, there can be multiple stop/start
+ * events where every start but the latest is cancelled by stops.  There can
+ * only be a single finish because every finish requires us to call start_done.
+ * A single finish event could be followed by multiple stop/start events.  This
+ * routine takes any combination of events and boils them down to one course of
+ * action.
+ */
+
+static int next_move(struct dlm_ls *ls, struct dlm_recover **rv_out,
+		     int *finish_out)
+{
+	LIST_HEAD(events);
+	unsigned int cmd = 0, stop, start, finish;
+	unsigned int last_stop, last_start, last_finish;
+	struct dlm_recover *rv = NULL, *start_rv = NULL;
+
+	spin_lock(&ls->ls_recover_lock);
+
+	stop = test_and_clear_bit(LSFL_LS_STOP, &ls->ls_flags) ? 1 : 0;
+	start = test_and_clear_bit(LSFL_LS_START, &ls->ls_flags) ? 1 : 0;
+	finish = test_and_clear_bit(LSFL_LS_FINISH, &ls->ls_flags) ? 1 : 0;
+
+	last_stop = ls->ls_last_stop;
+	last_start = ls->ls_last_start;
+	last_finish = ls->ls_last_finish;
+
+	while (!list_empty(&ls->ls_recover)) {
+		rv = list_entry(ls->ls_recover.next, struct dlm_recover, list);
+		list_del(&rv->list);
+		list_add_tail(&rv->list, &events);
+	}
+
+	/*
+	 * There are two cases where we need to adjust these event values:
+	 * 1. - we get a first start
+	 *    - we get a stop
+	 *    - we process the start + stop here and notice this special case
+	 *
+	 * 2. - we get a first start
+	 *    - we process the start
+	 *    - we get a stop
+	 *    - we process the stop here and notice this special case
+	 *
+	 * In both cases, the first start we received was aborted by a
+	 * stop before we received a finish.  last_finish being zero is the
+	 * indication that this is the "first" start, i.e. we've not yet
+	 * finished a start; if we had, last_finish would be non-zero.
+	 * Part of the problem arises from the fact that when we initially
+	 * get start/stop/start, we may get the same event id for both starts
+	 * (since the first was cancelled).
+	 *
+	 * In both cases, last_start and last_stop will be equal.
+	 * In both cases, finish=0.
+	 * In the first case start=1 && stop=1.
+	 * In the second case start=0 && stop=1.
+	 *
+	 * In both cases, we need to make adjustments to values so:
+	 * - we process the current event (now) as a normal stop
+	 * - the next start we receive will be processed normally
+	 *   (taking into account the assertions below)
+	 *
+	 * In the first case, dlm_ls_start() will have printed the
+	 * "repeated start" warning.
+	 *
+	 * In the first case we need to get rid of the recover event struct.
+	 *
+	 * - set stop=1, start=0, finish=0 for case 4 below
+	 * - last_stop and last_start must be set equal per the case 4 assert
+	 * - ls_last_stop = 0 so the next start will be larger
+	 * - ls_last_start = 0 not really necessary (avoids dlm_ls_start print)
+	 */
+
+	if (!last_finish && (last_start == last_stop)) {
+		log_debug(ls, "move reset %u,%u,%u ids %u,%u,%u", stop,
+			  start, finish, last_stop, last_start, last_finish);
+		stop = 1;
+		start = 0;
+		finish = 0;
+		last_stop = 0;
+		last_start = 0;
+		ls->ls_last_stop = 0;
+		ls->ls_last_start = 0;
+
+		while (!list_empty(&events)) {
+			rv = list_entry(events.next, struct dlm_recover, list);
+			list_del(&rv->list);
+			kfree(rv->nodeids);
+			kfree(rv);
+		}
+	}
+	spin_unlock(&ls->ls_recover_lock);
+
+	log_debug(ls, "move flags %u,%u,%u ids %u,%u,%u", stop, start, finish,
+		  last_stop, last_start, last_finish);
+
+	/*
+	 * Toss start events which have since been cancelled.
+	 */
+
+	while (!list_empty(&events)) {
+		DLM_ASSERT(start,);
+		rv = list_entry(events.next, struct dlm_recover, list);
+		list_del(&rv->list);
+
+		if (rv->event_id <= last_stop) {
+			log_debug(ls, "move skip event %u", rv->event_id);
+			kfree(rv->nodeids);
+			kfree(rv);
+			rv = NULL;
+		} else {
+			log_debug(ls, "move use event %u", rv->event_id);
+			DLM_ASSERT(!start_rv,);
+			start_rv = rv;
+		}
+	}
+
+	/*
+	 * Eight possible combinations of events.
+	 */
+
+	/* 0 */
+	if (!stop && !start && !finish) {
+		DLM_ASSERT(!start_rv,);
+		cmd = 0;
+		goto out;
+	}
+
+	/* 1 */
+	if (!stop && !start && finish) {
+		DLM_ASSERT(!start_rv,);
+		DLM_ASSERT(last_start > last_stop,);
+		DLM_ASSERT(last_finish == last_start,);
+		cmd = DO_FINISH;
+		*finish_out = last_finish;
+		goto out;
+	}
+
+	/* 2 */
+	if (!stop && start && !finish) {
+		DLM_ASSERT(start_rv,);
+		DLM_ASSERT(last_start > last_stop,);
+		cmd = DO_START;
+		*rv_out = start_rv;
+		goto out;
+	}
+
+	/* 3 */
+	if (!stop && start && finish) {
+		DLM_ASSERT(0, printk("finish and start with no stop\n"););
+	}
+
+	/* 4 */
+	if (stop && !start && !finish) {
+		DLM_ASSERT(!start_rv,);
+		DLM_ASSERT(last_start == last_stop,);
+		cmd = DO_STOP;
+		goto out;
+	}
+
+	/* 5 */
+	if (stop && !start && finish) {
+		DLM_ASSERT(!start_rv,);
+		DLM_ASSERT(last_finish == last_start,);
+		DLM_ASSERT(last_stop == last_start,);
+		cmd = DO_FINISH_STOP;
+		*finish_out = last_finish;
+		goto out;
+	}
+
+	/* 6 */
+	if (stop && start && !finish) {
+		if (start_rv) {
+			DLM_ASSERT(last_start > last_stop,);
+			cmd = DO_START;
+			*rv_out = start_rv;
+		} else {
+			DLM_ASSERT(last_stop == last_start,);
+			cmd = DO_STOP;
+		}
+		goto out;
+	}
+
+	/* 7 */
+	if (stop && start && finish) {
+		if (start_rv) {
+			DLM_ASSERT(last_start > last_stop,);
+			DLM_ASSERT(last_start > last_finish,);
+			cmd = DO_FINISH_START;
+			*finish_out = last_finish;
+			*rv_out = start_rv;
+		} else {
+			DLM_ASSERT(last_start == last_stop,);
+			DLM_ASSERT(last_start > last_finish,);
+			cmd = DO_FINISH_STOP;
+			*finish_out = last_finish;
+		}
+		goto out;
+	}
+
+ out:
+	return cmd;
+}
+
+/*
+ * This function decides what to do given every combination of current
+ * lockspace state and next lockspace state.
+ */
+
+static void do_ls_recovery(struct dlm_ls *ls)
+{
+	struct dlm_recover *rv = NULL;
+	int error, cur_state, next_state = 0, do_now, finish_event = 0;
+
+	do_now = next_move(ls, &rv, &finish_event);
+	if (!do_now)
+		goto out;
+
+	cur_state = ls->ls_state;
+	next_state = 0;
+
+	DLM_ASSERT(!test_bit(LSFL_LS_RUN, &ls->ls_flags),
+		    log_error(ls, "curstate=%d donow=%d", cur_state, do_now););
+
+	/*
+	 * LSST_CLEAR - we're not in any recovery state.  We can get a stop or
+	 * a stop and start which equates with a START.
+	 */
+
+	if (cur_state == LSST_CLEAR) {
+		switch (do_now) {
+		case DO_STOP:
+			next_state = LSST_WAIT_START;
+			break;
+
+		case DO_START:
+			error = ls_reconfig(ls, rv);
+			if (error)
+				next_state = LSST_WAIT_START;
+			else
+				next_state = LSST_RECONFIG_DONE;
+			break;
+
+		case DO_FINISH:	/* invalid */
+		case DO_FINISH_STOP:	/* invalid */
+		case DO_FINISH_START:	/* invalid */
+		default:
+			DLM_ASSERT(0,);
+		}
+		goto out;
+	}
+
+	/*
+	 * LSST_WAIT_START - we're not running because of getting a stop or
+	 * failing a start.  We wait in this state for another stop/start or
+	 * just the next start to begin another reconfig attempt.
+	 */
+
+	if (cur_state == LSST_WAIT_START) {
+		switch (do_now) {
+		case DO_STOP:
+			break;
+
+		case DO_START:
+			error = ls_reconfig(ls, rv);
+			if (error)
+				next_state = LSST_WAIT_START;
+			else
+				next_state = LSST_RECONFIG_DONE;
+			break;
+
+		case DO_FINISH:	/* invalid */
+		case DO_FINISH_STOP:	/* invalid */
+		case DO_FINISH_START:	/* invalid */
+		default:
+			DLM_ASSERT(0,);
+		}
+		goto out;
+	}
+
+	/*
+	 * LSST_RECONFIG_DONE - we entered this state after successfully
+	 * completing ls_reconfig and calling set_start_done.  We expect to get
+	 * a finish if everything goes ok.  A finish could be followed by stop
+	 * or stop/start before we get here to check it.  Or a finish may never
+	 * happen, only stop or stop/start.
+	 */
+
+	if (cur_state == LSST_RECONFIG_DONE) {
+		switch (do_now) {
+		case DO_FINISH:
+			dlm_clear_members_finish(ls, finish_event);
+			next_state = LSST_CLEAR;
+
+			error = enable_locking(ls, finish_event);
+			if (error)
+				break;
+
+			error = dlm_process_requestqueue(ls);
+			if (error)
+				break;
+
+			error = dlm_recover_waiters_post(ls);
+			if (error)
+				break;
+
+			dlm_grant_after_purge(ls);
+
+			dlm_astd_wake();
+
+			log_debug(ls, "recover event %u finished", finish_event);
+			break;
+
+		case DO_STOP:
+			next_state = LSST_WAIT_START;
+			break;
+
+		case DO_FINISH_STOP:
+			dlm_clear_members_finish(ls, finish_event);
+			next_state = LSST_WAIT_START;
+			break;
+
+		case DO_FINISH_START:
+			dlm_clear_members_finish(ls, finish_event);
+			/* fall into DO_START */
+
+		case DO_START:
+			error = ls_reconfig(ls, rv);
+			if (error)
+				next_state = LSST_WAIT_START;
+			else
+				next_state = LSST_RECONFIG_DONE;
+			break;
+
+		default:
+			DLM_ASSERT(0,);
+		}
+		goto out;
+	}
+
+	/*
+	 * LSST_INIT - state after ls is created and before it has been
+	 * started.  A start operation will cause the ls to be started for the
+	 * first time.  A failed start will cause to just wait in INIT for
+	 * another stop/start.
+	 */
+
+	if (cur_state == LSST_INIT) {
+		switch (do_now) {
+		case DO_START:
+			error = ls_first_start(ls, rv);
+			if (!error)
+				next_state = LSST_INIT_DONE;
+			break;
+
+		case DO_STOP:
+			break;
+
+		case DO_FINISH:	/* invalid */
+		case DO_FINISH_STOP:	/* invalid */
+		case DO_FINISH_START:	/* invalid */
+		default:
+			DLM_ASSERT(0,);
+		}
+		goto out;
+	}
+
+	/*
+	 * LSST_INIT_DONE - after the first start operation is completed
+	 * successfully and set_start_done() called.  If there are no errors, a
+	 * finish will arrive next and we'll move to LSST_CLEAR.
+	 */
+
+	if (cur_state == LSST_INIT_DONE) {
+		switch (do_now) {
+		case DO_STOP:
+		case DO_FINISH_STOP:
+			next_state = LSST_WAIT_START;
+			break;
+
+		case DO_START:
+		case DO_FINISH_START:
+			error = ls_reconfig(ls, rv);
+			if (error)
+				next_state = LSST_WAIT_START;
+			else
+				next_state = LSST_RECONFIG_DONE;
+			break;
+
+		case DO_FINISH:
+			next_state = LSST_CLEAR;
+
+			enable_locking(ls, finish_event);
+
+			dlm_process_requestqueue(ls);
+
+			dlm_astd_wake();
+
+			log_debug(ls, "recover event %u finished", finish_event);
+			break;
+
+		default:
+			DLM_ASSERT(0,);
+		}
+		goto out;
+	}
+
+ out:
+	if (next_state)
+		ls->ls_state = next_state;
+
+	if (rv) {
+		kfree(rv->nodeids);
+		kfree(rv);
+	}
+}
+
+int dlm_recoverd(void *arg)
+{
+	struct dlm_ls *ls;
+
+	ls = dlm_find_lockspace_local(arg);
+
+	while (!kthread_should_stop()) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (!test_bit(LSFL_WORK, &ls->ls_flags))
+			schedule();
+		set_current_state(TASK_RUNNING);
+
+		if (test_and_clear_bit(LSFL_WORK, &ls->ls_flags))
+			do_ls_recovery(ls);
+	}
+
+	dlm_put_lockspace(ls);
+	return 0;
+}
+
+void dlm_recoverd_kick(struct dlm_ls *ls)
+{
+	set_bit(LSFL_WORK, &ls->ls_flags);
+	wake_up_process(ls->ls_recoverd_task);
+}
+
+int dlm_recoverd_start(struct dlm_ls *ls)
+{
+	struct task_struct *p;
+	int error = 0;
+
+	p = kthread_run(dlm_recoverd, ls, "dlm_recoverd");
+	if (IS_ERR(p))
+		error = PTR_ERR(p);
+	else
+                ls->ls_recoverd_task = p;
+	return error;
+}
+
+void dlm_recoverd_stop(struct dlm_ls *ls)
+{
+	kthread_stop(ls->ls_recoverd_task);
+}
+
+void dlm_recoverd_suspend(struct dlm_ls *ls)
+{
+	down(&ls->ls_recoverd_active);
+}
+
+void dlm_recoverd_resume(struct dlm_ls *ls)
+{
+	up(&ls->ls_recoverd_active);
+}
+
--- a/drivers/dlm/recoverd.h	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/recoverd.h	2005-05-12 23:13:15.830485360 +0800
@@ -0,0 +1,24 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) Sistina Software, Inc.  1997-2003  All rights reserved.
+**  Copyright (C) 2004-2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#ifndef __RECOVERD_DOT_H__
+#define __RECOVERD_DOT_H__
+
+void dlm_recoverd_kick(struct dlm_ls *ls);
+void dlm_recoverd_stop(struct dlm_ls *ls);
+int dlm_recoverd_start(struct dlm_ls *ls);
+void dlm_recoverd_suspend(struct dlm_ls *ls);
+void dlm_recoverd_resume(struct dlm_ls *ls);
+
+#endif				/* __RECOVERD_DOT_H__ */
+
--- a/drivers/dlm/requestqueue.c	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/requestqueue.c	2005-05-12 23:13:15.830485360 +0800
@@ -0,0 +1,144 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) 2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#include "dlm_internal.h"
+#include "member.h"
+#include "lock.h"
+
+struct rq_entry {
+	struct list_head list;
+	int nodeid;
+	char request[1];
+};
+
+/*
+ * Requests received while the lockspace is in recovery get added to the
+ * request queue and processed when recovery is complete.  This happens when
+ * the lockspace is suspended on some nodes before it is on others, or the
+ * lockspace is enabled on some while still suspended on others.
+ */
+
+void dlm_add_requestqueue(struct dlm_ls *ls, int nodeid, struct dlm_header *hd)
+{
+	struct rq_entry *e;
+	int length = hd->h_length;
+
+	if (dlm_is_removed(ls, nodeid))
+		return;
+
+	e = kmalloc(sizeof(struct rq_entry) + length, GFP_KERNEL);
+	if (!e) {
+		log_print("dlm_add_requestqueue: out of memory\n");
+		return;
+	}
+
+	e->nodeid = nodeid;
+	memcpy(e->request, hd, length);
+
+	down(&ls->ls_requestqueue_lock);
+	list_add_tail(&e->list, &ls->ls_requestqueue);
+	up(&ls->ls_requestqueue_lock);
+}
+
+int dlm_process_requestqueue(struct dlm_ls *ls)
+{
+	struct rq_entry *e;
+	struct dlm_header *hd;
+	int error = 0;
+
+	down(&ls->ls_requestqueue_lock);
+
+	for (;;) {
+		if (list_empty(&ls->ls_requestqueue)) {
+			up(&ls->ls_requestqueue_lock);
+			error = 0;
+			break;
+		}
+		e = list_entry(ls->ls_requestqueue.next, struct rq_entry, list);
+		up(&ls->ls_requestqueue_lock);
+
+		hd = (struct dlm_header *) e->request;
+		error = dlm_receive_message(hd, e->nodeid, TRUE);
+
+		if (error == -EINTR) {
+			/* entry is left on requestqueue */
+			log_debug(ls, "process_requestqueue abort eintr");
+			break;
+		}
+
+		down(&ls->ls_requestqueue_lock);
+		list_del(&e->list);
+		kfree(e);
+
+		if (!test_bit(LSFL_LS_RUN, &ls->ls_flags)) {
+			log_debug(ls, "process_requestqueue abort ls_run");
+			up(&ls->ls_requestqueue_lock);
+			error = -EINTR;
+			break;
+		}
+		schedule();
+	}
+
+	return error;
+}
+
+/*
+ * After recovery is done, locking is resumed and dlm_recoverd takes all the
+ * saved requests and processes them as they would have been by dlm_recvd.  At
+ * the same time, dlm_recvd will start receiving new requests from remote
+ * nodes.  We want to delay dlm_recvd processing new requests until
+ * dlm_recoverd has finished processing the old saved requests.
+ */
+
+void dlm_wait_requestqueue(struct dlm_ls *ls)
+{
+	for (;;) {
+		down(&ls->ls_requestqueue_lock);
+		if (list_empty(&ls->ls_requestqueue))
+			break;
+		if (!test_bit(LSFL_LS_RUN, &ls->ls_flags))
+			break;
+		up(&ls->ls_requestqueue_lock);
+		schedule();
+	}
+	up(&ls->ls_requestqueue_lock);
+}
+
+/*
+ * Dir lookups and lookup replies send before recovery are invalid because the
+ * directory is rebuilt during recovery, so don't save any requests of this
+ * type.  Don't save any requests from a node that's being removed either.
+ */
+
+void dlm_purge_requestqueue(struct dlm_ls *ls)
+{
+	struct dlm_message *ms;
+	struct rq_entry *e, *safe;
+	uint32_t mstype;
+
+	down(&ls->ls_requestqueue_lock);
+	list_for_each_entry_safe(e, safe, &ls->ls_requestqueue, list) {
+
+		ms = (struct dlm_message *) e->request;
+		mstype = ms->m_type;
+
+		if (dlm_is_removed(ls, e->nodeid) ||
+		    mstype == DLM_MSG_REMOVE ||
+	            mstype == DLM_MSG_LOOKUP ||
+	            mstype == DLM_MSG_LOOKUP_REPLY) {
+			list_del(&e->list);
+			kfree(e);
+		}
+	}
+	up(&ls->ls_requestqueue_lock);
+}
+
--- a/drivers/dlm/requestqueue.h	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/requestqueue.h	2005-05-12 23:13:15.830485360 +0800
@@ -0,0 +1,22 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) 2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#ifndef __REQUESTQUEUE_DOT_H__
+#define __REQUESTQUEUE_DOT_H__
+
+void dlm_add_requestqueue(struct dlm_ls *ls, int nodeid, struct dlm_header *hd);
+int dlm_process_requestqueue(struct dlm_ls *ls);
+void dlm_wait_requestqueue(struct dlm_ls *ls);
+void dlm_purge_requestqueue(struct dlm_ls *ls);
+
+#endif
+
--- a/drivers/dlm/member.c	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/member.c	2005-05-12 23:13:15.829485512 +0800
@@ -0,0 +1,356 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) 2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#include "dlm_internal.h"
+#include "member_sysfs.h"
+#include "lockspace.h"
+#include "member.h"
+#include "recoverd.h"
+#include "recover.h"
+#include "lowcomms.h"
+#include "rcom.h"
+
+/*
+ * Following called by dlm_recoverd thread
+ */
+
+static void add_ordered_member(struct dlm_ls *ls, struct dlm_member *new)
+{
+	struct dlm_member *memb = NULL;
+	struct list_head *tmp;
+	struct list_head *newlist = &new->list;
+	struct list_head *head = &ls->ls_nodes;
+
+	list_for_each(tmp, head) {
+		memb = list_entry(tmp, struct dlm_member, list);
+		if (new->nodeid < memb->nodeid)
+			break;
+	}
+
+	if (!memb)
+		list_add_tail(newlist, head);
+	else {
+		/* FIXME: can use list macro here */
+		newlist->prev = tmp->prev;
+		newlist->next = tmp;
+		tmp->prev->next = newlist;
+		tmp->prev = newlist;
+	}
+}
+
+int dlm_add_member(struct dlm_ls *ls, int nodeid)
+{
+	struct dlm_member *memb;
+
+	memb = kmalloc(sizeof(struct dlm_member), GFP_KERNEL);
+	if (!memb)
+		return -ENOMEM;
+
+	memb->nodeid = nodeid;
+	add_ordered_member(ls, memb);
+	ls->ls_num_nodes++;
+	return 0;
+}
+
+void dlm_remove_member(struct dlm_ls *ls, struct dlm_member *memb)
+{
+	list_move(&memb->list, &ls->ls_nodes_gone);
+	ls->ls_num_nodes--;
+}
+
+int dlm_is_member(struct dlm_ls *ls, int nodeid)
+{
+	struct dlm_member *memb;
+
+	list_for_each_entry(memb, &ls->ls_nodes, list) {
+		if (memb->nodeid == nodeid)
+			return TRUE;
+	}
+	return FALSE;
+}
+
+int dlm_is_removed(struct dlm_ls *ls, int nodeid)
+{
+	struct dlm_member *memb;
+
+	list_for_each_entry(memb, &ls->ls_nodes_gone, list) {
+		if (memb->nodeid == nodeid)
+			return TRUE;
+	}
+	return FALSE;
+}
+
+static void clear_memb_list(struct list_head *head)
+{
+	struct dlm_member *memb;
+
+	while (!list_empty(head)) {
+		memb = list_entry(head->next, struct dlm_member, list);
+		list_del(&memb->list);
+		kfree(memb);
+	}
+}
+
+void dlm_clear_members(struct dlm_ls *ls)
+{
+	clear_memb_list(&ls->ls_nodes);
+	ls->ls_num_nodes = 0;
+}
+
+void dlm_clear_members_gone(struct dlm_ls *ls)
+{
+	clear_memb_list(&ls->ls_nodes_gone);
+}
+
+void dlm_clear_members_finish(struct dlm_ls *ls, int finish_event)
+{
+	struct dlm_member *memb, *safe;
+
+	list_for_each_entry_safe(memb, safe, &ls->ls_nodes_gone, list) {
+		if (memb->gone_event <= finish_event) {
+			list_del(&memb->list);
+			kfree(memb);
+		}
+	}
+}
+
+static void make_member_array(struct dlm_ls *ls)
+{
+	struct dlm_member *memb;
+	int i = 0, *array;
+
+	if (ls->ls_node_array) {
+		kfree(ls->ls_node_array);
+		ls->ls_node_array = NULL;
+	}
+
+	array = kmalloc(sizeof(int) * ls->ls_num_nodes, GFP_KERNEL);
+	if (!array)
+		return;
+
+	list_for_each_entry(memb, &ls->ls_nodes, list)
+		array[i++] = memb->nodeid;
+
+	ls->ls_node_array = array;
+}
+
+/* send a status request to all members just to establish comms connections */
+
+static void ping_members(struct dlm_ls *ls)
+{
+	struct dlm_member *memb;
+	list_for_each_entry(memb, &ls->ls_nodes, list)
+		dlm_rcom_status(ls, memb->nodeid);
+}
+
+int dlm_recover_members(struct dlm_ls *ls, struct dlm_recover *rv, int *neg_out)
+{
+	struct dlm_member *memb, *safe;
+	int i, error, found, pos = 0, neg = 0, low = -1;
+
+	/* move departed members from ls_nodes to ls_nodes_gone */
+
+	list_for_each_entry_safe(memb, safe, &ls->ls_nodes, list) {
+		found = FALSE;
+		for (i = 0; i < rv->node_count; i++) {
+			if (memb->nodeid == rv->nodeids[i]) {
+				found = TRUE;
+				break;
+			}
+		}
+
+		if (!found) {
+			neg++;
+			memb->gone_event = rv->event_id;
+			dlm_remove_member(ls, memb);
+			log_debug(ls, "remove member %d", memb->nodeid);
+		}
+	}
+
+	/* add new members to ls_nodes */
+
+	for (i = 0; i < rv->node_count; i++) {
+		if (dlm_is_member(ls, rv->nodeids[i]))
+			continue;
+		dlm_add_member(ls, rv->nodeids[i]);
+		pos++;
+		log_debug(ls, "add member %d", rv->nodeids[i]);
+	}
+
+	list_for_each_entry(memb, &ls->ls_nodes, list) {
+		if (low == -1 || memb->nodeid < low)
+			low = memb->nodeid;
+	}
+	ls->ls_low_nodeid = low;
+
+	make_member_array(ls);
+	set_bit(LSFL_NODES_VALID, &ls->ls_flags);
+	*neg_out = neg;
+
+	ping_members(ls);
+
+	error = dlm_recover_members_wait(ls);
+	log_debug(ls, "total members %d", ls->ls_num_nodes);
+	return error;
+}
+
+int dlm_recover_members_first(struct dlm_ls *ls, struct dlm_recover *rv)
+{
+	int i, error, nodeid, low = -1;
+
+	dlm_clear_members(ls);
+
+	log_debug(ls, "add members");
+
+	for (i = 0; i < rv->node_count; i++) {
+		nodeid = rv->nodeids[i];
+		dlm_add_member(ls, nodeid);
+
+		if (low == -1 || nodeid < low)
+			low = nodeid;
+	}
+	ls->ls_low_nodeid = low;
+
+	make_member_array(ls);
+	set_bit(LSFL_NODES_VALID, &ls->ls_flags);
+
+	ping_members(ls);
+
+	error = dlm_recover_members_wait(ls);
+	log_debug(ls, "total members %d", ls->ls_num_nodes);
+	return error;
+}
+
+/*
+ * Following called from member_sysfs.c
+ */
+
+int dlm_ls_terminate(struct dlm_ls *ls)
+{
+	spin_lock(&ls->ls_recover_lock);
+	set_bit(LSFL_LS_TERMINATE, &ls->ls_flags);
+	set_bit(LSFL_JOIN_DONE, &ls->ls_flags);
+	set_bit(LSFL_LEAVE_DONE, &ls->ls_flags);
+	spin_unlock(&ls->ls_recover_lock);
+	wake_up(&ls->ls_wait_member);
+	log_error(ls, "dlm_ls_terminate");
+	return 0;
+}
+
+int dlm_ls_stop(struct dlm_ls *ls)
+{
+	int new;
+
+	spin_lock(&ls->ls_recover_lock);
+	ls->ls_last_stop = ls->ls_last_start;
+	set_bit(LSFL_LS_STOP, &ls->ls_flags);
+	new = test_and_clear_bit(LSFL_LS_RUN, &ls->ls_flags);
+	spin_unlock(&ls->ls_recover_lock);
+
+	/*
+	 * This in_recovery lock does two things:
+	 *
+	 * 1) Keeps this function from returning until all threads are out
+	 *    of locking routines and locking is truely stopped.
+	 * 2) Keeps any new requests from being processed until it's unlocked
+	 *    when recovery is complete.
+	 */
+
+	if (new)
+		down_write(&ls->ls_in_recovery);
+
+	/*
+	 * The recoverd suspend/resume makes sure that dlm_recoverd (if
+	 * running) has noticed the clearing of LS_RUN above and quit
+	 * processing the previous recovery.  This will be true for all nodes
+	 * before any nodes get the start.
+	 */
+
+	dlm_recoverd_suspend(ls);
+	clear_bit(LSFL_DIR_VALID, &ls->ls_flags);
+	clear_bit(LSFL_ALL_DIR_VALID, &ls->ls_flags);
+	clear_bit(LSFL_NODES_VALID, &ls->ls_flags);
+	clear_bit(LSFL_ALL_NODES_VALID, &ls->ls_flags);
+	dlm_recoverd_resume(ls);
+	dlm_recoverd_kick(ls);
+	return 0;
+}
+
+int dlm_ls_start(struct dlm_ls *ls, int event_nr)
+{
+	struct dlm_recover *rv;
+	int error = 0;
+
+	rv = kmalloc(sizeof(struct dlm_recover), GFP_KERNEL);
+	if (!rv)
+		return -ENOMEM;
+	memset(rv, 0, sizeof(struct dlm_recover));
+
+	spin_lock(&ls->ls_recover_lock);
+
+	if (test_bit(LSFL_LS_RUN, &ls->ls_flags)) {
+		spin_unlock(&ls->ls_recover_lock);
+		log_error(ls, "start ignored: lockspace running");
+		kfree(rv);
+		error = -EINVAL;
+		goto out;
+	}
+
+	if (!ls->ls_nodeids_next) {
+		spin_unlock(&ls->ls_recover_lock);
+		log_error(ls, "start ignored: existing nodeids_next");
+		kfree(rv);
+		error = -EINVAL;
+		goto out;
+	}
+
+	if (event_nr <= ls->ls_last_start) {
+		spin_unlock(&ls->ls_recover_lock);
+		log_error(ls, "start event_nr %d not greater than last %d",
+			  event_nr, ls->ls_last_start);
+		kfree(rv);
+		error = -EINVAL;
+		goto out;
+	}
+
+	rv->nodeids = ls->ls_nodeids_next;
+	ls->ls_nodeids_next = NULL;
+	rv->node_count = ls->ls_nodeids_next_count;
+	rv->event_id = event_nr;
+	ls->ls_last_start = event_nr;
+	list_add_tail(&rv->list, &ls->ls_recover);
+	set_bit(LSFL_LS_START, &ls->ls_flags);
+	spin_unlock(&ls->ls_recover_lock);
+
+	set_bit(LSFL_JOIN_DONE, &ls->ls_flags);
+	wake_up(&ls->ls_wait_member);
+	dlm_recoverd_kick(ls);
+ out:
+	return error;
+}
+
+int dlm_ls_finish(struct dlm_ls *ls, int event_nr)
+{
+	spin_lock(&ls->ls_recover_lock);
+	if (event_nr != ls->ls_last_start) {
+		spin_unlock(&ls->ls_recover_lock);
+		log_error(ls, "finish event_nr %d doesn't match start %d",
+			  event_nr, ls->ls_last_start);
+		return -EINVAL;
+	}
+	ls->ls_last_finish = event_nr;
+	set_bit(LSFL_LS_FINISH, &ls->ls_flags);
+	spin_unlock(&ls->ls_recover_lock);
+	dlm_recoverd_kick(ls);
+	return 0;
+}
+
--- a/drivers/dlm/member.h	1970-01-01 07:30:00.000000000 +0730
+++ b/drivers/dlm/member.h	2005-05-12 23:13:15.829485512 +0800
@@ -0,0 +1,29 @@
+/******************************************************************************
+*******************************************************************************
+**
+**  Copyright (C) 2005 Red Hat, Inc.  All rights reserved.
+**
+**  This copyrighted material is made available to anyone wishing to use,
+**  modify, copy, or redistribute it subject to the terms and conditions
+**  of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
+
+#ifndef __MEMBER_DOT_H__
+#define __MEMBER_DOT_H__
+
+int dlm_ls_terminate(struct dlm_ls *ls);
+int dlm_ls_stop(struct dlm_ls *ls);
+int dlm_ls_start(struct dlm_ls *ls, int event_nr);
+int dlm_ls_finish(struct dlm_ls *ls, int event_nr);
+
+void dlm_clear_members(struct dlm_ls *ls);
+void dlm_clear_members_gone(struct dlm_ls *ls);
+void dlm_clear_members_finish(struct dlm_ls *ls, int finish_event);
+int dlm_recover_members_first(struct dlm_ls *ls, struct dlm_recover *rv);
+int dlm_recover_members(struct dlm_ls *ls, struct dlm_recover *rv,int *neg_out);
+int dlm_is_removed(struct dlm_ls *ls, int nodeid);
+
+#endif                          /* __MEMBER_DOT_H__ */
+



From ARodriguez at INNODATA-ISOGEN.COM  Mon May 16 08:40:51 2005
From: ARodriguez at INNODATA-ISOGEN.COM (Rodriguez, Antonio)
Date: Mon, 16 May 2005 16:40:51 +0800
Subject: [Linux-cluster] GFS on FC4 test3
Message-ID: <D1410010F417E54F92E3A8C1E47567CC237588@MDEE2K25.MANDAUE.INNODATA.NET>

Hi,

 

I want to install GFS using this release.  Do you have any procedure on
how to go about it?

 

Thanks,

toning

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050516/c41d1d37/attachment.htm>

From apprabhu at hotmail.com  Mon May 16 11:19:27 2005
From: apprabhu at hotmail.com (Amit Prabhu)
Date: Mon, 16 May 2005 11:19:27 +0000
Subject: [Linux-cluster] Linux Cluster on RH EL 3 AS
Message-ID: <BAY12-F180D13CEEFCDE9444C0C6AC2150@phx.gbl>

Hi All,
I have been trying to setup a Linux Cluster of 2 nodes running EL3 AS.
1) My machine reboots the moment I start the cluamanager service
2) Is Redhat Cluster Management Suite a pre-requisite (manadatory) for 
Clustering in RedHat EnterPrise Linux 3 AS?

I have followed the how-to on EL 2.1 and have used the cluster-config script 
instead of "cluconfig" to configure the server.

Thanks.

-Amit

_________________________________________________________________
Don?t just search. Find. Check out the new MSN Search! 
http://search.msn.click-url.com/go/onm00200636ave/direct/01/



From Scott.Money at lycos-inc.com  Thu May 12 22:56:53 2005
From: Scott.Money at lycos-inc.com (Scott.Money at lycos-inc.com)
Date: Thu, 12 May 2005 18:56:53 -0400
Subject: [Linux-cluster] GFS Issues   
Message-ID: <OFA2FDD489.FBCC9602-ON85256FFF.007718C6-85256FFF.007E6303@ma.lycos.com>

Hello,
I am new to the list but I was hoping to run a few questions by you. 
Hopefully this is the correct forum for these questions.

We are in the process of evaluating GFS for an Oracle 9i RAC 
implementation.

I originally downloaded and installed this GFS-6.0.0-1.2.src.rpm using 
this http://www.gyrate.org/misc/gfs.txt , the Admin Guide (from RedHat) 
and a Sistina RAC install guide(from RedHat).

That worked to a degree in that I could mount one node on the system, but 
all other nodes are blocked from mounting the gfs. This problem led me to 
the 8 char limitation 
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=127828 . My node 
names are unique in the first 8 chars (e.g. bwing.domain.com and 
xwing.domain.com) . 

First Attempt Environment
 GFS-6.0.0-1.2.src.rpm
 RedHat AS3
 Kernel Version 2.4.21-15.ELsmp
 Using GNBD to export an external scsi array
 Used the example pool configs from the RAC install guide
 manual fencing (status for all nodes is "Logged In")
 2 nodes, 1 GNBD and dedicated lock_gulm server
 
Not sure what to do I looked for a newer version of GFS.

Now I am going through the steps I did before except now I am getting a 
build error on the rpm-build.


rpmbuild --rebuild --target i686 
/usr/src/redhat/SRPMS/GFS-6.0.2-26.src.rpm
--------------------
/usr/src/redhat/BUILD/GFS-6.0.2/linuxsmp/include/linux/brlock.h:88: 
warning: `always_inline' attribute directive ignored 
 /usr/src/redhat/BUILD/GFS-6.0.2/linuxsmp/include/linux/brlock.h:101: 
warning: `always_inline' attribute directive ignored 
 /usr/src/redhat/BUILD/GFS-6.0.2/linuxsmp/include/linux/brlock.h:171: 
warning: `always_inline' attribute directive ignored 
 /usr/src/redhat/BUILD/GFS-6.0.2/linuxsmp/include/linux/brlock.h:179: 
warning: `always_inline' attribute directive ignored 
 make[4]: Leaving directory 
`/usr/src/redhat/BUILD/GFS-6.0.2/linuxsmp/drivers/char/joystick' 
 ld -m elf_i386 -r -o sis.o sis_drv.o sis_ds.o sis_mm.o 
 make[4]: Leaving directory 
`/usr/src/redhat/BUILD/GFS-6.0.2/linuxsmp/drivers/char/drm' 
 make[3]: Leaving directory 
`/usr/src/redhat/BUILD/GFS-6.0.2/linuxsmp/drivers/char'
make[2]: Leaving directory 
`/usr/src/redhat/BUILD/GFS-6.0.2/linuxsmp/drivers' 
 make[1]: *** [_mod_drivers] Error 2 
 make: *** [all] Error 2 
 error: Bad exit status from /var/tmp/rpm-tmp.15039 (%build) 
 
 
 RPM build errors: 
     Bad exit status from /var/tmp/rpm-tmp.15039 (%build) 
---------- 
The warnings I think I can ignore, but I am not sure what to do about the 
build errors.

New Version environment
 GFS-6.0.2-26.src.rpm
 RedHat AS3
 Kernel version 2.4.21-27.0.4.ELsmp


Got my GFS src rpm from 
ftp://ftp.redhat.com/pub/redhat/linux/updates/enterprise/3AS/en/RHGFS/SRPMS/

Any help would be appreciated on either issue and please let me know if 
there is more information that is required.

$cott
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050512/62441392/attachment.htm>

From agk at redhat.com  Tue May 17 17:22:35 2005
From: agk at redhat.com (Alasdair G Kergon)
Date: Tue, 17 May 2005 18:22:35 +0100
Subject: [Linux-cluster] FW: Re: [Clusters_sig] Planning a Cluster meeting
	at OLS
Message-ID: <20050517172235.GP15809@agk.surrey.redhat.com>

[This got bounced because From address was unqualified]

----- Forwarded message from linux-cluster-bounces at redhat.com -----

Date: Fri, 13 May 2005 11:14:29 -0400 (EDT)
From: Ashutosh Rajekar
Subject: Re: [Clusters_sig] Planning a Cluster meeting at OLS
To: clusters_sig at lists.osdl.org, <linux-cluster at redhat.com>

Hi everyone,

I am a bit confused. From what I can gather through the past discussions, 
there are apparently three "meetings":

1) The Walldorf Cluster Summit
2) OLS BOFs
3) some other "summit" to be finalised for either before or after OLS but 
to be held in Ottawa.

For me attending more than one is out of question.

Hence I want to make sure that the one I end up attending is the main one, 
leaving the others to be summarisations, etc.

So I'd like to know what is the exact agenda for each summit (I still 
don't know the agenda for the Walldorf summit).

Of course I don't wanna pressurise the organizers, but from Andrew 
Hutton's emails it seems that OLS registration is running real tight with 
a few seats and hotel rooms left in Ottawa; and if I end up going to 
Walldorf then I need to buy tickets in advance since the prices are gonna 
go through the roof pretty soon.

Thanks and Regards,
Ash


----- End forwarded message -----



From dawson at fnal.gov  Tue May 17 18:38:00 2005
From: dawson at fnal.gov (Troy Dawson)
Date: Tue, 17 May 2005 13:38:00 -0500
Subject: [Linux-cluster] Building rpm's for RHEL 4
Message-ID: <428A3A08.7030303@fnal.gov>

Howdy,
I've got some servers running RHEL 4 all connected to a SAN.  I've got 
some time before they go into production so I'm trying GFS to see how 
well it will work for our situation, and possibly shake any bugs out 
while testing.
I was able to download the cluster code from CVS, and build it.  But my 
servers don't have any compilers, and I'd like to install on them via rpm's.
In the scripts directory there is a build_srpms.pl script.  I've tried 
it, I've read it, I've tried all sorts of incantations, but it doesn't 
seem to do anything for me.  So comes the question
How do I make the rpm's for RHEL 4 so I can install the cluster programs 
on servers without compilers?
Many Thanks
Troy Dawson
-- 
__________________________________________________
Troy Dawson  dawson at fnal.gov  (630)840-6468
Fermilab  ComputingDivision/CSS  CSI Group
__________________________________________________



From yazan at ccs.com.jo  Tue May 17 20:28:04 2005
From: yazan at ccs.com.jo (Yazan Al-Sheyyab)
Date: Tue, 17 May 2005 22:28:04 +0200
Subject: [Linux-cluster] any help ?
Message-ID: <001401c55b1e$edbc6320$69050364@yazanz>

hello ,

 i have a problem when making the ( pool_tool -c system.pool ) command

  the system.pool file is a file i build myself and related to a partition created from LVM (logical)
 and i have 15 file like it.
   
  what is the problem here and how can i solve it and result to a pool written to the /dev/pool.

   it write to me that it is unable to open the whole partitions created by lvm from (lvma-...lvmo)
 each time i use this command.

   and after that i cannot assemble them too.

   Please any help solving this problem. 



Regards
-------------------------------------------------

Yazan.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050517/a014af52/attachment.htm>

From xprincess at 126.com  Wed May 18 02:21:57 2005
From: xprincess at 126.com (lixiangna)
Date: Wed, 18 May 2005 10:21:57 +0800
Subject: [Linux-cluster] cman_tool: cannot find local node name "n1" in ccs
Message-ID: <200505180428.j4I4Seke003777@mx2.redhat.com>

hello everyone:

    I met the following problem, I don't know why?

	[root at n1 cluster_rhel4]# cman_tool join
    cman_tool: cannot find local node name "n1" in ccs

    My cluster.conf is:

<?xml version="1.0"?>
<cluster name="cluster1" config_version="1">

  <cman two_node="1" expected_votes="1">
  </cman>

  <clusternodes>
     <clusternode name="n1" votes="1">
       <fence>
         <method name="single">
           <device name="human" ipaddr="n1"/>
         </method>
       </fence>
     </clusternode>
     <clusternode name="n2" votes="1">
       <fence>
         <method name="single">
           <device name="human" ipaddr="n2"/>
         </method>
       </fence>
     </clusternode>
   </clusternodes>

   <fence_devices>
     <fence_device name="human" agent="fence_manual"/>
   </fence_devices>

</cluster>
????????
My hosts is:
   202.114.16.13           localhost.localdomain n1
   202.114.16.12           n2

 				

????????xprincess
????????xprincess at 126.com
????????2005-05-18






From teigland at redhat.com  Wed May 18 04:18:03 2005
From: teigland at redhat.com (David Teigland)
Date: Wed, 18 May 2005 12:18:03 +0800
Subject: [Linux-cluster] cman_tool: cannot find local node name "n1" in ccs
In-Reply-To: <200505180428.j4I4Seke003777@mx2.redhat.com>
References: <200505180428.j4I4Seke003777@mx2.redhat.com>
Message-ID: <20050518041803.GB10931@redhat.com>

On Wed, May 18, 2005 at 10:21:57AM +0800, lixiangna wrote:

> 	[root at n1 cluster_rhel4]# cman_tool join
>     cman_tool: cannot find local node name "n1" in ccs
> 
>     My cluster.conf is:
> 
> <?xml version="1.0"?>
> <cluster name="cluster1" config_version="1">
> 
>   <cman two_node="1" expected_votes="1">
>   </cman>
> 
>   <clusternodes>
>      <clusternode name="n1" votes="1">

This looks correct.  What version of the code are you using?  Try using
the debug option:  cman_tool join -d

Dave



From yazan at ccs.com.jo  Wed May 18 09:03:32 2005
From: yazan at ccs.com.jo (Yazan Al-Sheyyab)
Date: Wed, 18 May 2005 11:03:32 +0200
Subject: [Linux-cluster] need a technical solution please ..
Message-ID: <002501c55b88$7737c270$69050364@yazanz>

Hello ,

 i have a problem when making the ( pool_tool -c system.pool ) command

  the system.pool file is a file i build myself and related to a partition created from LVM (logical)
 and i have 15 file like it.
   
  what is the problem here and how can i solve it and result to a pool written to the /dev/pool.

   it write to me that it is unable to open the whole partitions created by lvm from (lvma-...lvmo)
 each time i use this command.

   and after that i cannot assemble them too.

   Please any help solving this problem.



Regards
-------------------------------------------------

Yazan.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050518/f428069f/attachment.htm>

From Birger.Wathne at ift.uib.no  Wed May 18 11:05:18 2005
From: Birger.Wathne at ift.uib.no (Birger Wathne)
Date: Wed, 18 May 2005 13:05:18 +0200
Subject: [Linux-cluster] Resource group corruption
Message-ID: <428B216E.1040804@ift.uib.no>

I have a problem (again)...

Running FC3 in a cluster with only one node operative (but 2 nodes 
defined in the config).
Kernel is 2.6.11-1.14_FC3smp

I created 4 gfs file systems, and tested NFS and samba setups for a 
while. Now, I copied some larger amounts of data over to the file 
systems using rsync. Everything seemed normal until I stared movind 
directories around using mv and cp. Suddenly I got
May 18 09:54:27 bacchus kernel: Unable to handle kernel NULL pointer 
dereference at virtual address 00000004
May 18 09:54:27 bacchus kernel:  printing eip:
May 18 09:54:27 bacchus kernel: c018587d
May 18 09:54:27 bacchus kernel: *pde = 237dc001
May 18 09:54:27 bacchus kernel: Oops: 0000 [#1]
May 18 09:54:27 bacchus kernel: SMP
May 18 09:54:27 bacchus kernel: Modules linked in: lock_dlm(U) gfs(U) 
lock_harness(U) nfsd exportfs nfs lockd parport_pc lp parport autofs4 
dlm(U) cman(U) md5 ipv6 sunrpc ipt_LOG iptable_filter ip_tables video 
button battery ac uhci_hcd ehci_hcd hw_random i2c_i801 i2c_core 
snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer 
snd soundcore snd_page_alloc 3c59x mii tg3 floppy ext3 jbd dm_mod 
aic7xxx ata_piix libata sd_mod scsi_mod
May 18 09:54:27 bacchus kernel: CPU:    0
May 18 09:54:27 bacchus kernel: EIP:    0060:[<c018587d>]    Tainted: 
GF     VLIMay 18 09:54:27 bacchus kernel: EFLAGS: 00010246   
(2.6.11-1.14_FC3smp)
May 18 09:54:27 bacchus kernel: EIP is at posix_acl_valid+0xf/0xac
May 18 09:54:27 bacchus kernel: eax: 00000000   ebx: 00000000   ecx: 
00000008 edx: 00000001
May 18 09:54:27 bacchus kernel: esi: 00000000   edi: 00000000   ebp: 
e16a17e8 esp: ccf08d60
May 18 09:54:27 bacchus kernel: ds: 007b   es: 007b   ss: 0068
May 18 09:54:27 bacchus kernel: Process cp (pid: 17953, 
threadinfo=ccf08000 task=d2af5020)
May 18 09:54:27 bacchus kernel: Stack: 00000000 f911aebc 00000000 
f90e9035 ccf08e0c f911aebc ccf08e64 f90f1d06
May 18 09:54:27 bacchus kernel:        00000000 00000000 ccf08db0 
ccf08e53 e16a17e8 ccf08db0 ffffffff ccf08e0c
May 18 09:54:27 bacchus kernel:        e16a17e8 ccf08db0 f90f389d 
ccf08db0 e2d4cf38 e2d4cf38 e2d4cf10 d2af5020
May 18 09:54:27 bacchus kernel: Call Trace:
May 18 09:54:27 bacchus kernel:  [<f90e9035>] 
gfs_acl_validate_set+0x35/0xa2 [gfs]
May 18 09:54:27 bacchus kernel:  [<f90f1d06>] system_eo_set+0xb6/0xdc [gfs]
May 18 09:54:27 bacchus kernel:  [<f90f389d>] gfs_ea_set+0xbd/0xc1 [gfs]
May 18 09:54:27 bacchus kernel:  [<f910eb5f>] gfs_setxattr+0x8a/0x96 [gfs]
May 18 09:54:27 bacchus kernel:  [<c017a962>] setxattr+0x142/0x183
May 18 09:54:27 bacchus kernel:  [<c0171cc4>] dput+0x7f/0x1d3
May 18 09:54:27 bacchus kernel:  [<c016984b>] link_path_walk+0xbb5/0xe27
May 18 09:54:27 bacchus kernel:  [<c016528a>] cp_new_stat64+0xf0/0x102
May 18 09:54:27 bacchus kernel:  [<c0169d46>] path_lookup+0x96/0x196
May 18 09:54:27 bacchus kernel:  [<c0169f98>] __user_walk+0x42/0x52
May 18 09:54:27 bacchus kernel:  [<c017a9ea>] sys_setxattr+0x47/0x58
May 18 09:54:27 bacchus kernel:  [<c0103f0f>] syscall_call+0x7/0xb
May 18 09:54:27 bacchus kernel: Code: c1 e9 02 f3 a5 f6 c3 02 74 02 66 
a5 f6 c3 01 74 01 a4 c7 00 01 00 00 00 5b 5e 5f c3 57 8d 48 08 31 ff 56 
31 f6 ba 01 00 00 00 53 <8b> 40 04 8d 1c c1 39 d9 73 32 0f b7 41 02 a9 
f8 ff 00 00 75 27


Rebooting brings up the services again, including mounted file systems, 
but if I umount them (which involves disabling services and rebooting, 
as I get file system busy errors when rebooting, then stopping the 
services) and run gfs_fsck I get this on 3 of the 4 file systems:
# gfs_fsck /dev/raid5/pakke
Initializing fsck
Buffer #4296077767 (5 of 134571750) is neither GFS_METATYPE_RB nor 
GFS_METATYPE_ RG.
Resource group is corrupted.
Unable to read in rgrp descriptor.
Unable to fill in resource group information.

How do I try to fix this? Why is one file system unaffected by this? And 
most importantly: Why did this happen in the first place?

I am running the RHEL4 branch of the code.

The file systems were created with 2 journals, but there has never been 
more than one node connected to the storage.

-- 
birger

 



From Robert.Westmont at lycos-inc.com  Tue May 17 20:50:46 2005
From: Robert.Westmont at lycos-inc.com (Robert.Westmont at lycos-inc.com)
Date: Tue, 17 May 2005 16:50:46 -0400
Subject: [Linux-cluster] GFS Issues
In-Reply-To: <OFA2FDD489.FBCC9602-ON85256FFF.007718C6-85256FFF.007E6303@ma.lycos.com>
Message-ID: <OFAE74F5FE.35A06292-ON85257004.00726FE7-85257004.0072B5D9@ma.lycos.com>

I recommend updating this - others may have been bit by the legacy article 
(after we confirm it really works).

Robert





Scott.Money at lycos-inc.com
Sent by: linux-cluster-bounces at redhat.com
05/12/2005 06:56 PM
Please respond to linux clustering
 
        To:     linux-cluster at redhat.com
        cc: 
        Subject:        [Linux-cluster] GFS Issues



Hello, 
I am new to the list but I was hoping to run a few questions by you. 
Hopefully this is the correct forum for these questions. 

We are in the process of evaluating GFS for an Oracle 9i RAC 
implementation. 

I originally downloaded and installed this GFS-6.0.0-1.2.src.rpm using 
this http://www.gyrate.org/misc/gfs.txt , the Admin Guide (from RedHat) 
and a Sistina RAC install guide(from RedHat). 

That worked to a degree in that I could mount one node on the system, but 
all other nodes are blocked from mounting the gfs. This problem led me to 
the 8 char limitation 
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=127828 . My node 
names are unique in the first 8 chars (e.g. bwing.domain.com and 
xwing.domain.com) . 

First Attempt Environment 
 GFS-6.0.0-1.2.src.rpm 
 RedHat AS3 
 Kernel Version 2.4.21-15.ELsmp 
 Using GNBD to export an external scsi array 
 Used the example pool configs from the RAC install guide 
 manual fencing (status for all nodes is "Logged In") 
 2 nodes, 1 GNBD and dedicated lock_gulm server 
  
Not sure what to do I looked for a newer version of GFS. 

Now I am going through the steps I did before except now I am getting a 
build error on the rpm-build. 


rpmbuild --rebuild --target i686 
/usr/src/redhat/SRPMS/GFS-6.0.2-26.src.rpm 
-------------------- 
/usr/src/redhat/BUILD/GFS-6.0.2/linuxsmp/include/linux/brlock.h:88: 
warning: `always_inline' attribute directive ignored 
 /usr/src/redhat/BUILD/GFS-6.0.2/linuxsmp/include/linux/brlock.h:101: 
warning: `always_inline' attribute directive ignored 
 /usr/src/redhat/BUILD/GFS-6.0.2/linuxsmp/include/linux/brlock.h:171: 
warning: `always_inline' attribute directive ignored 
 /usr/src/redhat/BUILD/GFS-6.0.2/linuxsmp/include/linux/brlock.h:179: 
warning: `always_inline' attribute directive ignored 
 make[4]: Leaving directory 
`/usr/src/redhat/BUILD/GFS-6.0.2/linuxsmp/drivers/char/joystick' 
 ld -m elf_i386 -r -o sis.o sis_drv.o sis_ds.o sis_mm.o 
 make[4]: Leaving directory 
`/usr/src/redhat/BUILD/GFS-6.0.2/linuxsmp/drivers/char/drm' 
 make[3]: Leaving directory 
`/usr/src/redhat/BUILD/GFS-6.0.2/linuxsmp/drivers/char' 
make[2]: Leaving directory 
`/usr/src/redhat/BUILD/GFS-6.0.2/linuxsmp/drivers' 
 make[1]: *** [_mod_drivers] Error 2 
 make: *** [all] Error 2 
 error: Bad exit status from /var/tmp/rpm-tmp.15039 (%build) 
 
 
 RPM build errors: 
     Bad exit status from /var/tmp/rpm-tmp.15039 (%build) 
---------- 
The warnings I think I can ignore, but I am not sure what to do about the 
build errors. 

New Version environment 
 GFS-6.0.2-26.src.rpm 
 RedHat AS3 
 Kernel version 2.4.21-27.0.4.ELsmp 


Got my GFS src rpm from 
ftp://ftp.redhat.com/pub/redhat/linux/updates/enterprise/3AS/en/RHGFS/SRPMS/ 


Any help would be appreciated on either issue and please let me know if 
there is more information that is required. 

$cott--
Linux-cluster mailing list
Linux-cluster at redhat.com
http://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050517/1b3c1b9d/attachment.htm>

From fls at techscan-systems.com  Wed May 18 05:39:08 2005
From: fls at techscan-systems.com (Frank L. Setinsek)
Date: Tue, 17 May 2005 23:39:08 -0600
Subject: [Linux-cluster]GFS Problem
Message-ID: <200505180637.j4I6bBa9017646@mx1.redhat.com>

Hardware Configuration:  Six node cluster, each node has a LSI Fibre Channel
Host Adapter interface to a SAN.
Software Configuration: The kernel is 2.4.21-20.EL with GFS-6.0.2-25
Problem: While four nodes are simultaneously accessing the SAN, if a 5th
node attempts to access the SAN, one of the nodes will kernel panic.
              The node that crashes seems to be random.  All the crashes
have the same error as follows:
 
May 17 21:53:52 compute-0-2.local kernel: mptscsih: ioc0: WARNING - Device
(0:0:1) reported QUEUE_FULL! 
May 17 21:53:52 compute-0-2.local kernel: SCSI disk error : host 0 channel 0
id 0 lun 1 return code = 440b0000 
May 17 21:53:52 compute-0-2.local kernel: I/O error: dev 08:12, sector
139961968 
May 17 21:53:52 compute-0-2.local kernel: Pool: IO request to device, (8,18)
blk #139961968, failed. 
May 17 21:53:52 compute-0-2.local kernel: GFS: fsid=p2-2:gfs1.3: read error
on block 17495244 
May 17 21:53:52 compute-0-2.local kernel: Panicking because of read error on
block 17495244 
May 17 21:53:52 compute-0-2.local kernel: f3d33b98 f8a2f2a2 00000032
00000031 c01217d2 0000000a 00000400 f8a4f7f5 
May 17 21:53:52 compute-0-2.local kernel: f3d33be8 f3740370 010af4cc
f3740370 00000020 00000000 f8a5c000 00000031 
May 17 21:53:52 compute-0-2.local kernel: f8a1419e f8a4d692 f8a4d57a
0000024f 00000013 f8a5c000 f8a5c000 f3d33c3c 
May 17 21:53:52 compute-0-2.local kernel: Call Trace: [<f8a2f2a2>]
gfs_asserti [gfs] 0x32 (0xf3d33b9c) 
May 17 21:53:52 compute-0-2.local kernel: [<c01217d2>] printk [kernel] 0x122
(0xf3d33ba8) 
May 17 21:53:52 compute-0-2.local kernel: [<f8a4f7f5>] .rodata.str1.4 [gfs]
0x249 (0xf3d33bb4) 
May 17 21:53:52 compute-0-2.local kernel: [<f8a1419e>] gfs_dreread [gfs]
0x12e (0xf3d33bd8) 
May 17 21:53:52 compute-0-2.local kernel: [<f8a4d692>] .rodata.str1.1 [gfs]
0x1e6 (0xf3d33bdc) 
May 17 21:53:52 compute-0-2.local kernel: [<f8a4d57a>] .rodata.str1.1 [gfs]
0xce (0xf3d33be0) 
May 17 21:53:52 compute-0-2.local kernel: [<f8a13ff9>] gfs_dread [gfs] 0x49
(0xf3d33bfc) 
May 17 21:53:52 compute-0-2.local kernel: [<f8a1513f>] gfs_get_meta_buffer
[gfs] 0x9f (0xf3d33c18) 
May 17 21:53:52 compute-0-2.local kernel: [<f8a22fc2>] get_metablock [gfs]
0xb2 (0xf3d33c50) 
May 17 21:53:52 compute-0-2.local kernel: [<f8a233db>] gfs_block_map [gfs]
0x2eb (0xf3d33c70) 
May 17 21:53:52 compute-0-2.local kernel: [<c016ad48>] init_buffer_head
[kernel] 0x38 (0xf3d33cb8) 
May 17 21:53:52 compute-0-2.local kernel: [<f8a1cdae>] get_block [gfs] 0x9e
(0xf3d33d28) 
May 17 21:53:52 compute-0-2.local kernel: [<c01567db>] __block_prepare_write
[kernel] 0x19b (0xf3d33d64) 
May 17 21:53:52 compute-0-2.local kernel: [<c014a0d0>] __alloc_pages_limit
[kernel] 0x60 (0xf3d33d94) 
May 17 21:53:52 compute-0-2.local kernel: [<c0157139>] block_prepare_write
[kernel] 0x39 (0xf3d33da8) 
May 17 21:53:52 compute-0-2.local kernel: [<f8a1cd10>] get_block [gfs] 0x0
(0xf3d33dbc) 
May 17 21:53:52 compute-0-2.local kernel: [<f8a1d41c>] gfs_prepare_write
[gfs] 0x11c (0xf3d33dc8) 
May 17 21:53:52 compute-0-2.local kernel: [<f8a1cd10>] get_block [gfs] 0x0
(0xf3d33dd8) 
May 17 21:53:52 compute-0-2.local kernel: [<c013f5d5>] do_generic_file_write
[kernel] 0x1d5 (0xf3d33df0) 
May 17 21:53:52 compute-0-2.local kernel: [<f8a1790b>] do_do_write [gfs]
0x2ab (0xf3d33e44) 
May 17 21:53:52 compute-0-2.local kernel: [<f8a17d4b>] do_write [gfs] 0x18b
(0xf3d33e90) 
May 17 21:53:52 compute-0-2.local kernel: [<f8a15c89>] gfs_walk_vma [gfs]
0x129 (0xf3d33ecc) 
May 17 21:53:52 compute-0-2.local kernel: [<f8a1f112>] gfs_sync_page [gfs]
0x52 (0xf3d33eec) 
May 17 21:53:52 compute-0-2.local kernel: [<f8a319b7>] gfs_glock_nq_init
[gfs] 0x37 (0xf3d33f30) 
May 17 21:53:52 compute-0-2.local kernel: [<f8a319f3>] gfs_glock_dq_uninit
[gfs] 0x13 (0xf3d33f40) 
May 17 21:53:52 compute-0-2.local kernel: [<f8a187e1>] gfs_sync_file [gfs]
0x61 (0xf3d33f4c) 
May 17 21:53:52 compute-0-2.local kernel: [<f8a17e20>] gfs_write [gfs] 0x90
(0xf3d33f6c) 
May 17 21:53:52 compute-0-2.local kernel: [<f8a17bc0>] do_write [gfs] 0x0
(0xf3d33f80) 
May 17 21:53:52 compute-0-2.local kernel: [<c0153a53>] sys_write [kernel]
0xa3 (0xf3d33f94) 
May 17 21:53:52 compute-0-2.local kernel: 
May 17 21:53:52 compute-0-2.local kernel: Kernel panic: GFS: Assertion
failed on line 591 of file linux_dio.c 
May 17 21:53:52 compute-0-2.local kernel: GFS: assertion: "FALSE" 
May 17 21:53:52 compute-0-2.local kernel: GFS: time = 1116388432 
May 17 21:53:52 compute-0-2.local kernel: GFS: fsid=p2-2:gfs1.3 
May 17 21:53:52 compute-0-2.local kernel: 
 
Frank L. Setinsek
 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050517/e7c20fa8/attachment.htm>

From mtilstra at redhat.com  Wed May 18 14:35:37 2005
From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra)
Date: Wed, 18 May 2005 09:35:37 -0500
Subject: [Linux-cluster]GFS Problem
In-Reply-To: <200505180637.j4I6bBa9017646@mx1.redhat.com>
References: <200505180637.j4I6bBa9017646@mx1.redhat.com>
Message-ID: <20050518143537.GA26260@redhat.com>

On Tue, May 17, 2005 at 11:39:08PM -0600, Frank L. Setinsek wrote:
> 
>    Hardware Configuration:  Six node cluster, each node has a LSI Fibre Channel
>    Host Adapter interface to a SAN.
>    Software Configuration: The kernel is 2.4.21-20.EL with GFS-6.0.2-25
>    Problem: While four nodes are simultaneously accessing the SAN, if a 5th
>    node attempts to access the SAN, one of the nodes will kernel panic.
>                  The node that crashes seems to be random.  All the crashes
>    have the same error as follows:
> 
>    May 17 21:53:52 compute-0-2.local kernel: mptscsih: ioc0: WARNING - Device
>    (0:0:1) reported QUEUE_FULL!
>    May 17 21:53:52 compute-0-2.local kernel: SCSI disk error : host 0 channel 0
>    id 0 lun 1 return code = 440b0000

It really looks like your storage cannot have more than four nodes
accessing it at a single time.  You are getting scsi errors which are
completely confusing gfs, and causing it to panic.

-- 
Michael Conrad Tadpol Tilstra
To err is human, to really foul things up requires a computer.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050518/676dcc2f/attachment.sig>

From alewis at redhat.com  Wed May 18 14:40:24 2005
From: alewis at redhat.com (AJ Lewis)
Date: Wed, 18 May 2005 09:40:24 -0500
Subject: [Linux-cluster] need a technical solution please ..
In-Reply-To: <002501c55b88$7737c270$69050364@yazanz>
References: <002501c55b88$7737c270$69050364@yazanz>
Message-ID: <20050518144024.GC30193@null.msp.redhat.com>

On Wed, May 18, 2005 at 11:03:32AM +0200, Yazan Al-Sheyyab wrote:
> Hello ,
> 
>  i have a problem when making the ( pool_tool -c system.pool ) command
> 
>   the system.pool file is a file i build myself and related to a partition created from LVM (logical)
>  and i have 15 file like it.
>    
>   what is the problem here and how can i solve it and result to a pool written to the /dev/pool.
> 
>    it write to me that it is unable to open the whole partitions created by lvm from (lvma-...lvmo)
>  each time i use this command.
> 
>    and after that i cannot assemble them too.
> 
>    Please any help solving this problem.

Don't use LVM under pool.  LVM is not cluster aware and you can hurt yourself
pretty badly using it with clustered systems.  You should just use pool on the
disks without LVM in between.

-- 
AJ Lewis                                   Voice:  612-638-0500
Red Hat Inc.                               E-Mail: alewis at redhat.com
720 Washington Ave. SE, Suite 200
Minneapolis, MN 55414
   
Current GPG fingerprint = D9F8 EDCE 4242 855F A03D  9B63 F50C 54A8 578C 8715
Grab the key at: http://people.redhat.com/alewis/gpg.html or one of the
many keyservers out there...

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050518/1da0e5e0/attachment.sig>

From fsetinsek at techniscanmedical.com  Wed May 18 17:36:39 2005
From: fsetinsek at techniscanmedical.com (Frank L. Setinsek)
Date: Wed, 18 May 2005 11:36:39 -0600
Subject: [Linux-cluster]GFS Problem
In-Reply-To: <20050518143537.GA26260@redhat.com>
Message-ID: <200505181827.j4IIRZ8D020056@mx1.redhat.com>

The storage is fronted by a switch, the SCSI errors occur after the
Queue_Full warning which seems to initiate the problem.  In this context,
what does Queue_Full mean? 

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Michael Conrad Tadpol
Tilstra
Sent: Wednesday, May 18, 2005 8:36 AM
To: linux clustering
Subject: Re: [Linux-cluster]GFS Problem

On Tue, May 17, 2005 at 11:39:08PM -0600, Frank L. Setinsek wrote:
> 
>    Hardware Configuration:  Six node cluster, each node has a LSI Fibre
Channel
>    Host Adapter interface to a SAN.
>    Software Configuration: The kernel is 2.4.21-20.EL with GFS-6.0.2-25
>    Problem: While four nodes are simultaneously accessing the SAN, if a
5th
>    node attempts to access the SAN, one of the nodes will kernel panic.
>                  The node that crashes seems to be random.  All the
crashes
>    have the same error as follows:
> 
>    May 17 21:53:52 compute-0-2.local kernel: mptscsih: ioc0: WARNING -
Device
>    (0:0:1) reported QUEUE_FULL!
>    May 17 21:53:52 compute-0-2.local kernel: SCSI disk error : host 0
channel 0
>    id 0 lun 1 return code = 440b0000

It really looks like your storage cannot have more than four nodes accessing
it at a single time.  You are getting scsi errors which are completely
confusing gfs, and causing it to panic.

--
Michael Conrad Tadpol Tilstra
To err is human, to really foul things up requires a computer.




From mtilstra at redhat.com  Wed May 18 18:57:05 2005
From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra)
Date: Wed, 18 May 2005 13:57:05 -0500
Subject: [Linux-cluster]GFS Problem
In-Reply-To: <200505181827.j4IIRZ8D020056@mx1.redhat.com>
References: <20050518143537.GA26260@redhat.com>
	<200505181827.j4IIRZ8D020056@mx1.redhat.com>
Message-ID: <20050518185705.GA6235@redhat.com>

On Wed, May 18, 2005 at 11:36:39AM -0600, Frank L. Setinsek wrote:
> The storage is fronted by a switch, the SCSI errors occur after the
> Queue_Full warning which seems to initiate the problem.  In this context,
> what does Queue_Full mean? 

You'll need to talk with the people that made the hardware to find that
out.

-- 
Michael Conrad Tadpol Tilstra
Einstein said that talking to yourself is a sign of intelligence.
Answering yourself, however, is a sign of insanity. 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050518/0e9af799/attachment.sig>

From birger at uib.no  Wed May 18 21:51:05 2005
From: birger at uib.no (Birger Wathne)
Date: Wed, 18 May 2005 23:51:05 +0200
Subject: [Linux-cluster]GFS Problem
In-Reply-To: <20050518143537.GA26260@redhat.com>
References: <200505180637.j4I6bBa9017646@mx1.redhat.com>
	<20050518143537.GA26260@redhat.com>
Message-ID: <428BB8C9.7020909@uib.no>


>On Tue, May 17, 2005 at 11:39:08PM -0600, Frank L. Setinsek wrote:
>  
>
>>
>>   May 17 21:53:52 compute-0-2.local kernel: mptscsih: ioc0: WARNING - Device
>>   (0:0:1) reported QUEUE_FULL!
>>   May 17 21:53:52 compute-0-2.local kernel: SCSI disk error : host 0 channel 0
>>   id 0 lun 1 return code = 440b0000
>>    
>>

I would suspect this is an issue with tagged queueing.

Tagged queueing lets a host tag each I/O request with an identifier so 
the I/O subsystem can answer the requests in a different order. The host 
queries the device to find out how large the queue can be. If you have 
several hosts, all assuming they have the whole queue to themselves they 
could easily fill it...

Read the documentation for your device, and see what the tagged queue 
depth is. See if it can be configured. Then find out how you can set the 
queue depth in your scsi driver. Some drivers can set for each target in 
some config file. Set max queue depth for the device in the scsi driver 
on each node to 1/6 of the total queue depth on the device (since you 
have a 6 node cluster).

Of course the easy test would be to disable tagged queueing completely, 
but the performance hit can be bad. It would quickly show if the problem 
goes away...

Remember that you will have to reconfigure the queue depth on all nodes 
before you can add a new node... So you may want to set the depth to 1/7 
of the total so there is room for one more if these nodes run something 
you cannot restart often.

-- 
birger



From phillips at redhat.com  Wed May 18 16:54:59 2005
From: phillips at redhat.com (Daniel Phillips)
Date: Wed, 18 May 2005 12:54:59 -0400
Subject: [Linux-cluster] [ANNOUNCE] Linux Cluster Summit 2005
Message-ID: <200505181254.59790.phillips@redhat.com>

Linux Cluster Summit 2005

June 20 and 21, Walldorf, Germany (Near Heidelberg)

Sponsors: Red Hat, SAP AG, Oracle, HP, and Fujitsu-Siemens.

The goal of the two-day Linux Cluster Summit workshop is to bring
together the key individuals who can realize a general purpose
clustering API for Linux, including, kernel components, userspace
libraries and internal and external interfaces.  Results of this
workshop will be presented to the Kernel Summit the following month.

The workshop is to be held in Walldorf, the hometown of SAP, Europe's 
biggest software company.

This is in the south/central part of Germany, in an adjoining town of 
Heidelberg.  Frankfurt and France (order the menu!) are within easy 
driving distance.  It is very picturesque.  There is a lot to see and 
do in the vicinity for those who are interested.  We are planning a 
field trip for the day following the workshop for those who want to 
stay an extra day and see some sights.  Details have not been finalized 
yet.  Note that June 21 is the first day of tutorials at LinuxTag and 
June 22 is the LinuxTag Business Congress.  You will be able to attend 
the rest of LinuxTag if you wish.  It's free!

Unfortunately, space is limited. We can accommodate about 100 attendees. 
Registration is by invitation for 70 participants, including those who 
attended last year, and 30 seats are available on a first come, first 
served basis. There is no charge for this workshop.  You will only be 
need to arrange travel and accommodation.  A list of recommended hotels 
and other accommodation will be posted here.

The full agenda will be posted here, in the next few days:

   http://sourceware.org/cluster/events/summit2005/

If you are interested in attending, or require further information, 
please email:

   Daniel Phillips <phillips at redhat.com>
   Heinz Mauelshagen <mauelshagen at redhat.com>

Regards,

Daniel



From phung at cs.columbia.edu  Thu May 19 17:48:20 2005
From: phung at cs.columbia.edu (Dan B. Phung)
Date: Thu, 19 May 2005 13:48:20 -0400 (EDT)
Subject: [Linux-cluster] segfault during cman_tool services
Message-ID: <Pine.LNX.4.44.0505191340100.5154-100000@algiers.clic.cs.columbia.edu>

i've been having some problems with my fs where a node
will mysteriously be removed from the cluster, even though
the node is still up.  here's what I see from syslog:

CMAN: node blade11 has been removed from the cluster : No response to messages
CMAN: killed by NODEDOWN message
CMAN: we are leaving the cluster. No response to messages
dlm: proj_lv: restbl_rsb_update failed -105
dlm: home_lv: rebuild_rsbs_send failed -105

so from blade11, I try to see what's going on, and when I do:

> cman_tool services

I get the fun pasted at the end of the message.  A while back I noticed
there was some code updates/patches, but I don't know where to find the
"Changes".  Would a cvs update on the sources help?  Let  me know if
you need more info on the system I'm running.

regards,
dan

--

lock_dlm:  Assertion failed on line 353 of file 
/usr/src/cluster-2.6.8.1/gfs-kernel/src/dlm/lock.c
lock_dlm:  assertion:  "!error"
lock_dlm:  time = 80864198
proj_lv: error=-22 num=2,1a lkf=10000 flags=84

------------[ cut here ]------------
kernel BUG at /usr/src/cluster-2.6.8.1/gfs-kernel/src/dlm/lock.c:353!
invalid operand: 0000 [#1]
Modules linked in: ipv6 evdev pcspkr psmouse sworks_agp agpgart ohci_hcd 
usbcore tg3 firmware_class lock_dlm dlm cman gfs lock_harness dm_mod 
qla2300 qla2xxx scsi_transport_fc sg sr_mod sd_mod scsi_mod ide_cd cdrom 
genrtc ext3 jbd mbcache ide_generic via82cxxx trm290 triflex slc90e66 
sis5513 siimage serverworks sc1200 rz1000 piix pdc202xx_old pdc202xx_new 
opti621 ns87415 hpt366 ide_disk hpt34x generic cy82c693 cs5530 cs5520 
cmd64x atiixp amd74xx alim15x3 aec62xx ide_core unix
CPU:    0
EIP:    0060:[<f89e6b46>]    Tainted: GF
EFLAGS: 00010286   (2.6.8.1)
EIP is at do_dlm_unlock+0x106/0x120 [lock_dlm]
eax: 00000001   ebx: ffffffea   ecx: c02b4870   edx: 000053ec
esi: f432aa00   edi: f8b301c0   ebp: f43ce000   esp: f43cfedc
ds: 007b   es: 007b   ss: 0068
Process gfs_glockd (pid: 2174, threadinfo=f43ce000 task=f43b4dd0)
Stack: f89ed876 f431cde0 ffffffea 00000002 0000001a 00000000 00010000 
00000084
       f8ba8000 f8ba8000 f89e6eef f432aa00 f8b01718 f432aa00 00000003 
f431dbd0
       f8af51f9 f8ba8000 f432aa00 00000003 00000000 f8bb83f4 f8b301c0 
00000000
Call Trace:
 [<f89e6eef>] lm_dlm_unlock+0x1f/0x30 [lock_dlm]
 [<f8b01718>] gfs_lm_unlock+0x38/0x60 [gfs]
 [<f8af51f9>] gfs_glock_drop_th+0x69/0x1a0 [gfs]
 [<f8af4588>] rq_demote+0x98/0xb0 [gfs]
 [<f8af468c>] run_queue+0xac/0xe0 [gfs]
 [<f8af6ed4>] demote_ok+0x74/0x80 [gfs]
 [<f8af700d>] gfs_reclaim_glock+0x7d/0x130 [gfs]
 [<f8ae80ca>] gfs_glockd+0x10a/0x120 [gfs]
 [<c0115950>] default_wake_function+0x0/0x20
 [<c0105d72>] ret_from_fork+0x6/0x14
 [<c0115950>] default_wake_function+0x0/0x20
 [<f8ae7fc0>] gfs_glockd+0x0/0x120 [gfs]
 [<c01042ad>] kernel_thread_helper+0x5/0x18
Code: 0f 0b 61 01 e0 c8 9e f8 c7 04 24 20 c9 9e f8 e8 16 18 73 c7

-- 



From kpreslan at redhat.com  Thu May 19 18:15:32 2005
From: kpreslan at redhat.com (Ken Preslan)
Date: Thu, 19 May 2005 13:15:32 -0500
Subject: [Linux-cluster] Resource group corruption
In-Reply-To: <428B216E.1040804@ift.uib.no>
References: <428B216E.1040804@ift.uib.no>
Message-ID: <20050519181532.GA30539@potassium.msp.redhat.com>

The oops should be fixed in CVS now.  I don't see how that would have
caused the RG problem, though.  Was GFS printing errors to the
console other than the oops?


On Wed, May 18, 2005 at 01:05:18PM +0200, Birger Wathne wrote:
> I have a problem (again)...
> 
> Running FC3 in a cluster with only one node operative (but 2 nodes 
> defined in the config).
> Kernel is 2.6.11-1.14_FC3smp
> 
> I created 4 gfs file systems, and tested NFS and samba setups for a 
> while. Now, I copied some larger amounts of data over to the file 
> systems using rsync. Everything seemed normal until I stared movind 
> directories around using mv and cp. Suddenly I got
> May 18 09:54:27 bacchus kernel: Unable to handle kernel NULL pointer 
> dereference at virtual address 00000004
> May 18 09:54:27 bacchus kernel:  printing eip:
> May 18 09:54:27 bacchus kernel: c018587d
> May 18 09:54:27 bacchus kernel: *pde = 237dc001
> May 18 09:54:27 bacchus kernel: Oops: 0000 [#1]
> May 18 09:54:27 bacchus kernel: SMP
> May 18 09:54:27 bacchus kernel: Modules linked in: lock_dlm(U) gfs(U) 
> lock_harness(U) nfsd exportfs nfs lockd parport_pc lp parport autofs4 
> dlm(U) cman(U) md5 ipv6 sunrpc ipt_LOG iptable_filter ip_tables video 
> button battery ac uhci_hcd ehci_hcd hw_random i2c_i801 i2c_core 
> snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer 
> snd soundcore snd_page_alloc 3c59x mii tg3 floppy ext3 jbd dm_mod 
> aic7xxx ata_piix libata sd_mod scsi_mod
> May 18 09:54:27 bacchus kernel: CPU:    0
> May 18 09:54:27 bacchus kernel: EIP:    0060:[<c018587d>]    Tainted: 
> GF     VLIMay 18 09:54:27 bacchus kernel: EFLAGS: 00010246   
> (2.6.11-1.14_FC3smp)
> May 18 09:54:27 bacchus kernel: EIP is at posix_acl_valid+0xf/0xac
> May 18 09:54:27 bacchus kernel: eax: 00000000   ebx: 00000000   ecx: 
> 00000008 edx: 00000001
> May 18 09:54:27 bacchus kernel: esi: 00000000   edi: 00000000   ebp: 
> e16a17e8 esp: ccf08d60
> May 18 09:54:27 bacchus kernel: ds: 007b   es: 007b   ss: 0068
> May 18 09:54:27 bacchus kernel: Process cp (pid: 17953, 
> threadinfo=ccf08000 task=d2af5020)
> May 18 09:54:27 bacchus kernel: Stack: 00000000 f911aebc 00000000 
> f90e9035 ccf08e0c f911aebc ccf08e64 f90f1d06
> May 18 09:54:27 bacchus kernel:        00000000 00000000 ccf08db0 
> ccf08e53 e16a17e8 ccf08db0 ffffffff ccf08e0c
> May 18 09:54:27 bacchus kernel:        e16a17e8 ccf08db0 f90f389d 
> ccf08db0 e2d4cf38 e2d4cf38 e2d4cf10 d2af5020
> May 18 09:54:27 bacchus kernel: Call Trace:
> May 18 09:54:27 bacchus kernel:  [<f90e9035>] 
> gfs_acl_validate_set+0x35/0xa2 [gfs]
> May 18 09:54:27 bacchus kernel:  [<f90f1d06>] system_eo_set+0xb6/0xdc [gfs]
> May 18 09:54:27 bacchus kernel:  [<f90f389d>] gfs_ea_set+0xbd/0xc1 [gfs]
> May 18 09:54:27 bacchus kernel:  [<f910eb5f>] gfs_setxattr+0x8a/0x96 [gfs]
> May 18 09:54:27 bacchus kernel:  [<c017a962>] setxattr+0x142/0x183
> May 18 09:54:27 bacchus kernel:  [<c0171cc4>] dput+0x7f/0x1d3
> May 18 09:54:27 bacchus kernel:  [<c016984b>] link_path_walk+0xbb5/0xe27
> May 18 09:54:27 bacchus kernel:  [<c016528a>] cp_new_stat64+0xf0/0x102
> May 18 09:54:27 bacchus kernel:  [<c0169d46>] path_lookup+0x96/0x196
> May 18 09:54:27 bacchus kernel:  [<c0169f98>] __user_walk+0x42/0x52
> May 18 09:54:27 bacchus kernel:  [<c017a9ea>] sys_setxattr+0x47/0x58
> May 18 09:54:27 bacchus kernel:  [<c0103f0f>] syscall_call+0x7/0xb
> May 18 09:54:27 bacchus kernel: Code: c1 e9 02 f3 a5 f6 c3 02 74 02 66 
> a5 f6 c3 01 74 01 a4 c7 00 01 00 00 00 5b 5e 5f c3 57 8d 48 08 31 ff 56 
> 31 f6 ba 01 00 00 00 53 <8b> 40 04 8d 1c c1 39 d9 73 32 0f b7 41 02 a9 
> f8 ff 00 00 75 27
> 
> 
> Rebooting brings up the services again, including mounted file systems, 
> but if I umount them (which involves disabling services and rebooting, 
> as I get file system busy errors when rebooting, then stopping the 
> services) and run gfs_fsck I get this on 3 of the 4 file systems:
> # gfs_fsck /dev/raid5/pakke
> Initializing fsck
> Buffer #4296077767 (5 of 134571750) is neither GFS_METATYPE_RB nor 
> GFS_METATYPE_ RG.
> Resource group is corrupted.
> Unable to read in rgrp descriptor.
> Unable to fill in resource group information.
> 
> How do I try to fix this? Why is one file system unaffected by this? And 
> most importantly: Why did this happen in the first place?
> 
> I am running the RHEL4 branch of the code.
> 
> The file systems were created with 2 journals, but there has never been 
> more than one node connected to the storage.
> 
> -- 
> birger
> 
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster

-- 
Ken Preslan <kpreslan at redhat.com>



From dawson at fnal.gov  Thu May 19 18:40:24 2005
From: dawson at fnal.gov (Troy Dawson)
Date: Thu, 19 May 2005 13:40:24 -0500
Subject: [Linux-cluster] Building rpm's for RHEL 4
In-Reply-To: <428A3A08.7030303@fnal.gov>
References: <428A3A08.7030303@fnal.gov>
Message-ID: <428CDD98.8000909@fnal.gov>

Troy Dawson wrote:
> Howdy,
> I've got some servers running RHEL 4 all connected to a SAN.  I've got 
> some time before they go into production so I'm trying GFS to see how 
> well it will work for our situation, and possibly shake any bugs out 
> while testing.
> I was able to download the cluster code from CVS, and build it.  But my 
> servers don't have any compilers, and I'd like to install on them via 
> rpm's.
> In the scripts directory there is a build_srpms.pl script.  I've tried 
> it, I've read it, I've tried all sorts of incantations, but it doesn't 
> seem to do anything for me.  So comes the question
> How do I make the rpm's for RHEL 4 so I can install the cluster programs 
> on servers without compilers?
> Many Thanks
> Troy Dawson

Just thought I'd answer my own question.  I'd read the perl script that 
makes the rpm's, not the Makefile.  I was also trying this on an 
'export' of the cvs, and not a 'checkout'
I believe the command I was looking for was
   make srpms

Troy

-- 
__________________________________________________
Troy Dawson  dawson at fnal.gov  (630)840-6468
Fermilab  ComputingDivision/CSS  CSI Group
__________________________________________________



From jcable at gi.alaska.edu  Fri May 20 04:39:36 2005
From: jcable at gi.alaska.edu (Jay Cable)
Date: Thu, 19 May 2005 20:39:36 -0800 (AKDT)
Subject: [Linux-cluster] gfs_fsck problem
Message-ID: <Pine.LNX.4.58.0505192020090.5760@dino.gi.alaska.edu>

Hello --

I am having trouble fscking (running gfs_fsck) on gfs filesystems.  I have
a two node cluster that is going to be used as file servers.  I recently
noticed that I could not fsck the gfs filesystem they share.  Initially I
thought this was due to some problem with my hardware, but when
I tried creating a new gfs filesystem, then immediately fsck-ing it, I
still have problems, and the fsck is unable to complete.  The storage
system appears to work fine with other file systems.

I am using the "RHEL4 branch", current as of this morning.

Does anyone have any advice as to what I might be doing wrong?

Here is what I was trying:

[root at mugen gfs]# gfs_mkfs -V
gfs_mkfs DEVEL.1116516970 (built May 19 2005 11:37:18)
Copyright (C) Red Hat, Inc.  2004-2005  All rights reserved.

[root at mugen gfs]# gfs_mkfs -j 3 -p lock_dlm -t ftp:dds_space
/dev/mapper/ftp_space-erc1
This will destroy any data on /dev/mapper/ftp_space-erc1.
  It appears to contain a GFS filesystem.

Are you sure you want to proceed? [y/n] y

Device:                    /dev/mapper/ftp_space-erc1
Blocksize:                 4096
Filesystem Size:           340773216
Journals:                  3
Resource Groups:           5202
Locking Protocol:          lock_dlm
Lock Table:                ftp:dds_space

Syncing...
All Done

[root at mugen gfs]# gfs_fsck -V
GFS fsck DEVEL.1116516970 (built May 19 2005 11:37:14)
Copyright (C) Red Hat, Inc.  2004-2005  All rights reserved.
[root at mugen gfs]#  gfs_fsck /dev/mapper/ftp_space-erc1
Initializing fsck
Buffer #4296081376 (5 of 134571718) is neither GFS_METATYPE_RB nor
GFS_METATYPE_RG.
Resource group is corrupted.
Unable to read in rgrp descriptor.
Unable to fill in resource group information.


Thanks,
 -Jay
 jcable at gi.alaska.edu



From pcaulfie at redhat.com  Fri May 20 07:15:06 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Fri, 20 May 2005 08:15:06 +0100
Subject: [Linux-cluster] segfault during cman_tool services
In-Reply-To: <Pine.LNX.4.44.0505191340100.5154-100000@algiers.clic.cs.columbia.edu>
References: <Pine.LNX.4.44.0505191340100.5154-100000@algiers.clic.cs.columbia.edu>
Message-ID: <428D8E7A.9070507@redhat.com>

Dan B. Phung wrote:
> i've been having some problems with my fs where a node
> will mysteriously be removed from the cluster, even though
> the node is still up.  here's what I see from syslog:
> 
> CMAN: node blade11 has been removed from the cluster : No response to messages
> CMAN: killed by NODEDOWN message
> CMAN: we are leaving the cluster. No response to messages
> dlm: proj_lv: restbl_rsb_update failed -105
> dlm: home_lv: rebuild_rsbs_send failed -105
> 

We have a bug open for this in bugzilla:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=139738

A workaround that seems to work for is is:

echo "5" > /proc/cluster/config/cman/max_retries

-- 

patrick



From birger at uib.no  Fri May 20 07:46:25 2005
From: birger at uib.no (Birger Wathne)
Date: Fri, 20 May 2005 09:46:25 +0200
Subject: [Linux-cluster] Still build problems with RHEL4 branch on FC3
Message-ID: <428D95D1.8010401@uib.no>


I still have problems building a fresh copy of the RHEL4 branch on FC3. 
Should I use a different branch?

I used
cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster login cvs
cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout -r RHEL4 
cluster
to fetch a completely new copy (no directory named 'cluster' in . as I 
did this)
I then tried to build using
./configure --kernel_src=/lib/modules/`uname -r`/build/
and make

I still have problems with
gfs-kernel/src/nolock/main.c
and
gfs-kernel/src/gfs/ops_file.c
using LOCK_USE_CLNT which seems to be undefined.

gcc also still claims there is a mismatch in the argument count at line 
964 of
gfs-kernel/src/gfs/quota.c (seems like it wants to compare against the 
write()
system call instead of the write() that tty points to).

I'm running gcc (GCC) 3.4.3 20050227 (Red Hat 3.4.3-22.fc3)

My kernel is 2.6.11-1.14_FC3smp (installed using yum. No kernel source 
installed)

-- 
birger



From birger at uib.no  Fri May 20 08:54:00 2005
From: birger at uib.no (Birger Wathne)
Date: Fri, 20 May 2005 10:54:00 +0200
Subject: [Linux-cluster] Still build problems with RHEL4 branch on FC3
In-Reply-To: <428D95D1.8010401@uib.no>
References: <428D95D1.8010401@uib.no>
Message-ID: <428DA5A8.6010108@uib.no>

I forgot to mention... I can still get around these errors by defining 
the missing symbol and by commenting out the offending write() in the 
quota code.
I just want to know which branch would be the correct one for FC3.
The FC4 branch fails with SOCK_ZAPPED, and I have not looked into how to 
get around that.

-- 
birger



From mtilstra at redhat.com  Fri May 20 13:00:57 2005
From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra)
Date: Fri, 20 May 2005 08:00:57 -0500
Subject: [Linux-cluster] Still build problems with RHEL4 branch on FC3
In-Reply-To: <428DA5A8.6010108@uib.no>
References: <428D95D1.8010401@uib.no> <428DA5A8.6010108@uib.no>
Message-ID: <20050520130057.GA15172@redhat.com>

On Fri, May 20, 2005 at 10:54:00AM +0200, Birger Wathne wrote:
> I forgot to mention... I can still get around these errors by defining 
> the missing symbol and by commenting out the offending write() in the 
> quota code.
> I just want to know which branch would be the correct one for FC3.
> The FC4 branch fails with SOCK_ZAPPED, and I have not looked into how to 
> get around that.

Bad news is that there isn't a branch that just works with the default
kernels in FC3.

-- 
Michael Conrad Tadpol Tilstra
use the source, Luke
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050520/90a84809/attachment.sig>

From javilin2k5 at hotmail.com  Thu May 19 18:10:25 2005
From: javilin2k5 at hotmail.com (Javier C.)
Date: Thu, 19 May 2005 20:10:25 +0200
Subject: [Linux-cluster] Unknow lock state!
Message-ID: <BAY107-F27741B591542ABA5D2B0FBDC080@phx.gbl>

Hi everybody;

We are using 3 servers with gfs and in logs of one of them i can read the 
following thing:

---------------------------
May 19 20:02:31 Linux1 kernel: lock_gulm: Error on lock 
0x4707000000000000309a286461746f7300 Got a drop lcok request for a lock that 
we don't know of. state:0x1
------------------------------

This message often appears and we do not know to what it can be.

We are using RH E.Linux Update 2. Kernel 2.4.21-15. Gfs version is 6.0.0-1.2

Thanks for any help and sorry by terrible english.

best regards.

_________________________________________________________________
Acepta el reto MSN Premium: Protecci?n para tus hijos en internet. 
Desc?rgalo y pru?balo 2 meses gratis. 
http://join.msn.com?XAPID=1697&DI=1055&HL=Footer_mailsenviados_proteccioninfantil



From alewis at redhat.com  Fri May 20 14:41:26 2005
From: alewis at redhat.com (AJ Lewis)
Date: Fri, 20 May 2005 09:41:26 -0500
Subject: [Linux-cluster] gfs_fsck problem
In-Reply-To: <Pine.LNX.4.58.0505192020090.5760@dino.gi.alaska.edu>
References: <Pine.LNX.4.58.0505192020090.5760@dino.gi.alaska.edu>
Message-ID: <20050520144126.GF6097@null.msp.redhat.com>

On Thu, May 19, 2005 at 08:39:36PM -0800, Jay Cable wrote:
> I am having trouble fscking (running gfs_fsck) on gfs filesystems.  I have
> a two node cluster that is going to be used as file servers.  I recently
> noticed that I could not fsck the gfs filesystem they share.  Initially I
> thought this was due to some problem with my hardware, but when
> I tried creating a new gfs filesystem, then immediately fsck-ing it, I
> still have problems, and the fsck is unable to complete.  The storage
> system appears to work fine with other file systems.

I see this as well.  Not sure what changed - this used to work.  I'm looking
into it.  Thanks for the report.
 
> I am using the "RHEL4 branch", current as of this morning.
> 
> Does anyone have any advice as to what I might be doing wrong?
> 
> Here is what I was trying:
> 
> [root at mugen gfs]# gfs_mkfs -V
> gfs_mkfs DEVEL.1116516970 (built May 19 2005 11:37:18)
> Copyright (C) Red Hat, Inc.  2004-2005  All rights reserved.
> 
> [root at mugen gfs]# gfs_mkfs -j 3 -p lock_dlm -t ftp:dds_space
> /dev/mapper/ftp_space-erc1
> This will destroy any data on /dev/mapper/ftp_space-erc1.
>   It appears to contain a GFS filesystem.
> 
> Are you sure you want to proceed? [y/n] y
> 
> Device:                    /dev/mapper/ftp_space-erc1
> Blocksize:                 4096
> Filesystem Size:           340773216
> Journals:                  3
> Resource Groups:           5202
> Locking Protocol:          lock_dlm
> Lock Table:                ftp:dds_space
> 
> Syncing...
> All Done
> 
> [root at mugen gfs]# gfs_fsck -V
> GFS fsck DEVEL.1116516970 (built May 19 2005 11:37:14)
> Copyright (C) Red Hat, Inc.  2004-2005  All rights reserved.
> [root at mugen gfs]#  gfs_fsck /dev/mapper/ftp_space-erc1
> Initializing fsck
> Buffer #4296081376 (5 of 134571718) is neither GFS_METATYPE_RB nor
> GFS_METATYPE_RG.
> Resource group is corrupted.
> Unable to read in rgrp descriptor.
> Unable to fill in resource group information.
> 
> 
> Thanks,
>  -Jay
>  jcable at gi.alaska.edu
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
> 

-- 
AJ Lewis                                   Voice:  612-638-0500
Red Hat Inc.                               E-Mail: alewis at redhat.com
720 Washington Ave. SE, Suite 200
Minneapolis, MN 55414
   
Current GPG fingerprint = D9F8 EDCE 4242 855F A03D  9B63 F50C 54A8 578C 8715
Grab the key at: http://people.redhat.com/alewis/gpg.html or one of the
many keyservers out there...

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050520/c7dd43bf/attachment.sig>

From alewis at redhat.com  Fri May 20 15:21:59 2005
From: alewis at redhat.com (AJ Lewis)
Date: Fri, 20 May 2005 10:21:59 -0500
Subject: [Linux-cluster] gfs_fsck problem
In-Reply-To: <20050520144126.GF6097@null.msp.redhat.com>
References: <Pine.LNX.4.58.0505192020090.5760@dino.gi.alaska.edu>
	<20050520144126.GF6097@null.msp.redhat.com>
Message-ID: <20050520152159.GG6097@null.msp.redhat.com>

On Fri, May 20, 2005 at 09:41:26AM -0500, AJ Lewis wrote:
> On Thu, May 19, 2005 at 08:39:36PM -0800, Jay Cable wrote:
> > I am having trouble fscking (running gfs_fsck) on gfs filesystems.  I have
> > a two node cluster that is going to be used as file servers.  I recently
> > noticed that I could not fsck the gfs filesystem they share.  Initially I
> > thought this was due to some problem with my hardware, but when
> > I tried creating a new gfs filesystem, then immediately fsck-ing it, I
> > still have problems, and the fsck is unable to complete.  The storage
> > system appears to work fine with other file systems.
> 
> I see this as well.  Not sure what changed - this used to work.  I'm looking
> into it.  Thanks for the report.

Ok, the latest code from the RHEL4 branch should work again.  Sorry about
that.

-- 
AJ Lewis                                   Voice:  612-638-0500
Red Hat Inc.                               E-Mail: alewis at redhat.com
720 Washington Ave. SE, Suite 200
Minneapolis, MN 55414
   
Current GPG fingerprint = D9F8 EDCE 4242 855F A03D  9B63 F50C 54A8 578C 8715
Grab the key at: http://people.redhat.com/alewis/gpg.html or one of the
many keyservers out there...

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050520/7f5f65d9/attachment.sig>

From jcable at gi.alaska.edu  Fri May 20 18:51:20 2005
From: jcable at gi.alaska.edu (Jay Cable)
Date: Fri, 20 May 2005 10:51:20 -0800 (AKDT)
Subject: [Linux-cluster] gfs_fsck problem
In-Reply-To: <20050520160104.A2EBF7372E@hormel.redhat.com>
References: <20050520160104.A2EBF7372E@hormel.redhat.com>
Message-ID: <Pine.LNX.4.58.0505201026130.25239@dino.gi.alaska.edu>


It works great for me now - Thanks!

Thanks again,
   -Jay
    jcable at gi.alaska.edu

=> Ok, the latest code from the RHEL4 branch should work again.  Sorry about
=> that.
=>
=> --
=> AJ Lewis                                   Voice:  612-638-0500
=> Red Hat Inc.                               E-Mail: alewis at redhat.com
=> 720 Washington Ave. SE, Suite 200
=> Minneapolis, MN 55414
=>
=> Current GPG fingerprint = D9F8 EDCE 4242 855F A03D  9B63 F50C 54A8 578C 8715
=> Grab the key at: http://people.redhat.com/alewis/gpg.html or one of the
=> many keyservers out there...
=>



From james.eastman at fedex.com  Mon May 23 03:22:58 2005
From: james.eastman at fedex.com (James Eastman)
Date: Sun, 22 May 2005 22:22:58 -0500
Subject: [Linux-cluster] ./configure, make ,
 make install problem on gentoo linux with a 2.6.11 kernel
In-Reply-To: <20050520152159.GG6097@null.msp.redhat.com>
References: <Pine.LNX.4.58.0505192020090.5760@dino.gi.alaska.edu>
	<20050520144126.GF6097@null.msp.redhat.com>
	<20050520152159.GG6097@null.msp.redhat.com>
Message-ID: <42914C92.2080602@fedex.com>

All:

I hope this email finds you doing well.  As you have probably already guessed, I am having issues getting GFS to build/install properly.  My plans for my 5 GFS enabled boxes is to use them as an Oracle 10g grid.  Yet I digress ... Here's OS specs on my machine(s):

oragrid5 root # uname -a
Linux oragrid5.corp.fedex.com 2.6.11-gentoo-r6 #1 SMP Mon May 2 11:07:22 CDT 2005 i686 Pentium III (Coppermine) GenuineIntel GNU/Linux
oragrid5 root #

As you can see I'm running a 2.6.11 kernel.  So I decide to get the latest CVS snapshot and began my compiling adventure.  Here're the CVS comands I ran to get the latest source code:


cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster login cvs
cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout cluster/gfs
cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout cluster/gfs-kernel
cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout cluster/gfs2
cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout cluster/gfs2-kernel

Thus I believe I have the latest available code.  So, I begin my adventure with ./configure --kernel_src=/usr/src/linux-2.6.11-gentoo-r6	 .

Here's what I see:

oragrid5 cluster # ./configure --kernel_src=/usr/src/linux-2.6.11-gentoo-r6
configure cman-kernel

Configuring Makefiles for your system...
Completed Makefile configuration

configure dlm-kernel

Configuring Makefiles for your system...
Completed Makefile configuration

configure gfs-kernel

Configuring Makefiles for your system...
Completed Makefile configuration

configure gfs2-kernel

Configuring Makefiles for your system...
Completed Makefile configuration

configure gnbd-kernel

Configuring Makefiles for your system...
Completed Makefile configuration

configure magma

Configuring Makefiles for your system...
Completed Makefile configuration

configure ccs

Configuring Makefiles for your system...
Completed Makefile configuration

configure cman

Configuring Makefiles for your system...
Completed Makefile configuration

configure dlm

Configuring Makefiles for your system...
Completed Makefile configuration

configure fence

Configuring Makefiles for your system...
Completed Makefile configuration

configure iddev

Configuring Makefiles for your system...
Completed Makefile configuration

configure gfs

Configuring Makefiles for your system...
Completed Makefile configuration

configure gfs2

Configuring Makefiles for your system...
Completed Makefile configuration

configure gnbd

Configuring Makefiles for your system...
Completed Makefile configuration

configure gulm

Configuring Makefiles for your system...
Completed Makefile configuration

configure magma-plugins

Configuring Makefiles for your system...
Completed Makefile configuration

configure rgmanager

Configuring Makefiles for your system...
Completed Makefile configuration

configure cmirror

Configuring Makefiles for your system...
Completed Makefile configuration

oragrid5 cluster #

Oh the sweet smell of ./confugre success.  This causes me to be confident.  Thus, my next step is to do a make.  Here's what I see:

oragrid5 cluster # make
cd cman-kernel && make install sbindir=/root/cluster/build/sbin libdir=/root/cluster/build/lib mandir=/root/cluster/build/man incdir=/root/cluster/build/incdir module_dir=/root/cluster/build/module sharedir=/root/cluster/build slibdir=/root/cluster/build/slib DESTDIR=/root/cluster/build
make[1]: Entering directory `/root/cluster/cman-kernel'
cd src && make install
make[2]: Entering directory `/root/cluster/cman-kernel/src'
rm -f cluster
ln -s . cluster
make -C /usr/src/linux-2.6.11-gentoo-r6 M=/root/cluster/cman-kernel/src modules USING_KBUILD=yes
make[3]: Entering directory `/usr/src/linux-2.6.11-gentoo-r6'
make[3]: *** No rule to make target `modules'.  Stop.
make[3]: Leaving directory `/usr/src/linux-2.6.11-gentoo-r6'
make[2]: *** [all] Error 2
make[2]: Leaving directory `/root/cluster/cman-kernel/src'
make[1]: *** [install] Error 2
make[1]: Leaving directory `/root/cluster/cman-kernel'
make: *** [all] Error 2
oragrid5 cluster #

Huh?  I don't get it.  My configure worked smoothly.  Also, I have ALL of the lvm2 and device mappper stuff compiled into the kernel.  Further I've done an 'emerge lvm2' and an 'emerge device-mapper' to get the OS parts for those tools installed.  I know this is probably a very simple and possibly previously answered questsion.  Any help with this or directions to the already existing thread that may help solve my problem are greatly appreciated.

-- 
James

"You know you've achieved perfection in design and development, not when you have nothing more to add, but when you have nothing more to take away." -- Antione de Saint-Exupery




From teigland at redhat.com  Mon May 23 03:45:04 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 23 May 2005 11:45:04 +0800
Subject: [Linux-cluster] ./configure, make ,
	make install problem on gentoo linux with a 2.6.11 kernel
In-Reply-To: <42914C92.2080602@fedex.com>
References: <Pine.LNX.4.58.0505192020090.5760@dino.gi.alaska.edu>
	<20050520144126.GF6097@null.msp.redhat.com>
	<20050520152159.GG6097@null.msp.redhat.com>
	<42914C92.2080602@fedex.com>
Message-ID: <20050523034504.GA6904@redhat.com>

On Sun, May 22, 2005 at 10:22:58PM -0500, James Eastman wrote:

> As you can see I'm running a 2.6.11 kernel.  So I decide to get the latest 
> CVS snapshot and began my compiling adventure.  Here're the CVS comands I 
> ran to get the latest source code:
> 
> cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster login cvs
> cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout cluster/gfs
> cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout 
> cluster/gfs-kernel
> cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout cluster/gfs2
> cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout 
> cluster/gfs2-kernel

Code on the cvs head is in the middle of an overhaul; it doesn't all work,
and the pieces that do work don't all work together yet.  So, you need to
use the FC4 cvs branch which will come closest to matching 2.6.11.

Dave



From davidnicol at gmail.com  Sat May 21 04:44:49 2005
From: davidnicol at gmail.com (David Nicol)
Date: Fri, 20 May 2005 23:44:49 -0500
Subject: [Linux-cluster] Re: [ANNOUNCE] Linux Cluster Summit 2005
In-Reply-To: <200505181254.59790.phillips@redhat.com>
References: <200505181254.59790.phillips@redhat.com>
Message-ID: <934f64a2050520214446a7038b@mail.gmail.com>

On 5/18/05, Daniel Phillips <phillips at redhat.com> wrote:
> Linux Cluster Summit 2005
> 
> June 20 and 21, Walldorf, Germany (Near Heidelberg)
> 
> Sponsors: Red Hat, SAP AG, Oracle, HP, and Fujitsu-Siemens.
> 
> The goal of the two-day Linux Cluster Summit workshop is to bring
> together the key individuals who can realize a general purpose
> clustering API for Linux, including, kernel components, userspace
> libraries and internal and external interfaces.  Results of this
> workshop will be presented to the Kernel Summit the following month.


target vision for cluster infrastructure (thoughts on reading
interview with Andrew Morton in
 Ziff-Davis eweek)  
		
April 21, 2005, edited May 20
I was surprised to see that cluster infrastructure is still missing,
yet pleased that the need for it is more widely perceived today
http://www.uwsg.iu.edu/hypermail/linux/kernel/0409.0/0238.html

than it was four years ago when the linux-cluster mailing list was formed.
http://mail.nl.linux.org/linux-cluster/2001-02/msg00000.html

although there is nothing but spam in its archive since july 2002.

A quick review of more recent developments indicates that little has changed.

There is no need for standardization accross cluster infrastructures at any one
installation, and the sense that a discussion is over "whose version
gets included" rather than "what can we add to make things easier for
everyone, even if doing so will actually hurt the prospects for the clustering
infrastructure I am promoting" still leads to benchmark wars whenever the
subject comes up.  So I gather from glancing at discussion on LKML from
last september that there has been some progress but not much.

Four years ago, I proposed a target vision for linux cluster kernel development
which I believe still makes sense. (And now I know to call it a
"target vision!")

At the time,  I had no good answer for why we would bother to implement support
for something that nobody would immediately use, and sky-pie descriptions of
wide area computing grids seemed silly. (they may still.)  

The vision was, that a Linux box could be in multiple clusters at
once, or could, if so configured, be a "cluster router" similar to the
file system shareres/retransmitters one can set up to run interference
between disparate network file systems.

Supporting this vision -- a box is in N clusters from M separate
cluster system vendors, at the same time, and these N clusters know nothing
about each other -- is in my opinion a reasonable plan of attack for selecting
features to include, or interfaces to mandate conformity to in cluster
system vendors, rather than getting into detailed fights about whose
implementation of feature F belongs in the core distribution.

In the automatic process migration model, it is easy to imagine a wide
cluster where each node might represent a cluster rather than a unit, and would
want to hide the details of the cluster it is representing.  Four
years ago, Mosix allowed pretty wide clusters containing nodes not directly
reachable from each other, but node id numbers needed to be unique across
the whole universe.

in the "galaxy cluster" vision, a cluster can represent itself as a node, to
other nodes participating in the same cluster, without revealing
internal details of the galaxy (because from far enough away, a galaxy
looks, at first, like a single star).

The closest thing to implementing this vision that was available when
I last reviewed what was available was implementing Condor to link separate
Mosix clusters.

I remember a few near-consensuses being reached on the linux-cluster
mailing list.

These included:

   Defining a standard interface for  requesting and obtaining
cluster services and enforcing compliance to it makes sense.

   Arguing about whose implementation of any particular clustering
feature is best does not make sense.  (Given free exchange of techniques
and a standard interface, the in fact better techniques will gradually nudge
out the in fact inferior ones with no shoving required.)

    A standard cluster configuration interface (CCI) defined as a fs
of its own makes sense, rather that squatting within /proc

   the CCI can be mounted anywhere, (possibly back
 within /proc) so multiple clusters on the same box will not
collide with each other -- each gets its own CCI, and all syscalls to
cluster parts include a pointer to a cluster configuration object, of which
there can be more than one defined 

   The first order of business therefore was to take a survey of
services provided by clustering infrastructures and work out standardizable
interfaces to these services

That's what I remember.  The survey of services may or may not have
been performed
formally, I know that a survey of cluster services provided can be
done easily -- is done
often, whenever anyone tries to select what kind of cluster they want to set up.

The role of the linux kernel maintainer, in the question of supporting
multiple disparate
cluster platforms, is NOT to choose one, but is to set up ground rules
under which
they can all play nice together.  Just like the file systems all play
nice together currently.

The thought of having two different spindles each with their own file
system is not
something that anyone blinks at anymore, but it was once revolutionary.

Having one computer participating in N clusters, at the same time, may in the
future be a similar non-issue.

Pertaining to the list of cluster services, here's a quick and small list of the
ones that spring to my mind as being valiud for inclusion into the CCI, without
doing too much additional research:

   services (including statistics) that cluster membership provides
to processes on the node should be defined and offered through the CCI

   node identification in each cluster, abstracted into that cluster

   information about other nodes

   extended PID must include cluster-ID as well as node-ID when discussing
   PID extension mechanisms:  if I am process 00F3 on my own node, I might
   have an extended pid of 000400F3 on a cluster in which I am node 4 and an
   extended pid of 001000F3 on a cluster in which I am node sixteen.

   the publish/subscribe model (just read that one today) is very good

   standardize a publish/subscribe RPC interface in terms of operations on
   filesystem entities within the CCI

Based on discussion on the cap-talk mailing list, i'd like to suggest
that publish/subscribe get implemented in terms of one-off capability
tickets, but that's a perfect example of the kind of implementatin
detail I'm saying we
do not need to define.  How a particular clustering system
implements remote procedure call is not relevant to mandating a standard for
how clustering systems, should their engineers choose to have their
product comply with a standard, may present available RPCs in the CCI, and how
processes on nodes subscribed to their clusters may call an available
RPC through
the CCI.

The big insight that I hope you to take away from this e-mail, if you
haven't had it already (I have been out of touch with the insight level of
LKML for a long time) is that clustering integration into the kernel makes
sense as a standards establishment and enforcement effort, rather than a
technology selection and inclusion effort (at least at first -- once
the CCI exists,
cluster providers might all rush to provide their "CCI driver modules" and then
we're back to selection and inclusion) and that is a different kind of effort.

Although not a new kind.  While writing that paragraph I realized that
the file system interface and the entire module interface, and any other
kind of plug-it-in-and-it-works-the-same interface linux supports, sound,
video, et cetera -- are all standards enforcement problems rather than
technology
selection problems.

Not recognizing that clustering is such a problem is what I believe is
holding back cluster infrastructure from appearing in the kernel.

So last septembers thread about message passing services, in my
vision, is improper. The question is not, how do we pass messages,
but, we have nodes that we know by node-id, and we have messages
that we wish to pass to them, how do we provide a mechanism so that
knowing only those things, node Id and message, an entity that
wishes to pass a message can ask the cluster system to pass the message?

Given modularization, it will then be possible to drop in and replace systems as
needed or as appropriate.

--
David L Nicol
Director of Research and Development
Mindtrust LLC, Kansas City, Missouri



From rawipfel at novell.com  Sat May 21 15:29:01 2005
From: rawipfel at novell.com (Robert Wipfel)
Date: Sat, 21 May 2005 09:29:01 -0600
Subject: [Linux-cluster] Re: [Clusters_sig] Re: [ANNOUNCE] Linux Cluster
	Summit 2005
Message-ID: <s28eff68.000@sinclair.provo.novell.com>

> Supporting this vision -- a box is in N clusters from M separate
> cluster system vendors, at the same time, and these N clusters know nothing
> about each other -- is in my opinion a reasonable plan of attack

In memory of the Amadeus consortia of companies that built and shipped a commerical single system image cluster Unix called SVR4/MK (based on the Chorus microkernel), I wonder today what the impact of web services based distributed systems management plus virtualized hosting environments might offer this analysis. Perhaps the classic SSI approach to Unix will become something confined to the boundaries of an embedded multi-core cluster that manifests a single computer to the external web services based distributed network. Is distributed fork/exec really the correct scale out primitive or something confined to these new multi-core super-servers. Iow, considering web services, where should the line be drawn between a single (internally clustered) multi-core distributed memory system, versus the larger external network - how many cores per single system image OS instance is expected? 64, 128? Compared to SVR4/MK, these days, what's on the outside looking in, is web services and grid.  Returning to the reality of many vendor's enterprise* business, the suitespot for h/a clusters still seems to be somewhere around ~8 dual-CPU nodes with many customers deploying multiple similar clusters. Nodes are never in multiple clusters at once, rather, individual nodes are members of a cluster and that cluster might be a member of a cluster of clusters.

(* excluding HPC)

> in the "galaxy cluster" vision, a cluster can represent itself as a node, to
> other nodes participating in the same cluster, without revealing

Intergalactic scale is interesting but so too is a microscopic view of each node. 
Perhaps it's a cluster too; running a single system image Linux over multiple cores. I suppose this is what you are thinking, presuming your vantage point is a single CPU core.

Robert


>>> David Nicol <davidnicol at gmail.com> 05/20/05 10:44 PM >>>
On 5/18/05, Daniel Phillips <phillips at redhat.com> wrote:
> Linux Cluster Summit 2005
> 
> June 20 and 21, Walldorf, Germany (Near Heidelberg)
> 
> Sponsors: Red Hat, SAP AG, Oracle, HP, and Fujitsu-Siemens.
> 
> The goal of the two-day Linux Cluster Summit workshop is to bring
> together the key individuals who can realize a general purpose
> clustering API for Linux, including, kernel components, userspace
> libraries and internal and external interfaces.  Results of this
> workshop will be presented to the Kernel Summit the following month.


target vision for cluster infrastructure (thoughts on reading
interview with Andrew Morton in
 Ziff-Davis eweek)  
		
April 21, 2005, edited May 20
I was surprised to see that cluster infrastructure is still missing,
yet pleased that the need for it is more widely perceived today
http://www.uwsg.iu.edu/hypermail/linux/kernel/0409.0/0238.html

than it was four years ago when the linux-cluster mailing list was formed.
http://mail.nl.linux.org/linux-cluster/2001-02/msg00000.html

although there is nothing but spam in its archive since july 2002.

A quick review of more recent developments indicates that little has changed.

There is no need for standardization accross cluster infrastructures at any one
installation, and the sense that a discussion is over "whose version
gets included" rather than "what can we add to make things easier for
everyone, even if doing so will actually hurt the prospects for the clustering
infrastructure I am promoting" still leads to benchmark wars whenever the
subject comes up.  So I gather from glancing at discussion on LKML from
last september that there has been some progress but not much.

Four years ago, I proposed a target vision for linux cluster kernel development
which I believe still makes sense. (And now I know to call it a
"target vision!")

At the time,  I had no good answer for why we would bother to implement support
for something that nobody would immediately use, and sky-pie descriptions of
wide area computing grids seemed silly. (they may still.)  

The vision was, that a Linux box could be in multiple clusters at
once, or could, if so configured, be a "cluster router" similar to the
file system shareres/retransmitters one can set up to run interference
between disparate network file systems.

Supporting this vision -- a box is in N clusters from M separate
cluster system vendors, at the same time, and these N clusters know nothing
about each other -- is in my opinion a reasonable plan of attack for selecting
features to include, or interfaces to mandate conformity to in cluster
system vendors, rather than getting into detailed fights about whose
implementation of feature F belongs in the core distribution.

In the automatic process migration model, it is easy to imagine a wide
cluster where each node might represent a cluster rather than a unit, and would
want to hide the details of the cluster it is representing.  Four
years ago, Mosix allowed pretty wide clusters containing nodes not directly
reachable from each other, but node id numbers needed to be unique across
the whole universe.

in the "galaxy cluster" vision, a cluster can represent itself as a node, to
other nodes participating in the same cluster, without revealing
internal details of the galaxy (because from far enough away, a galaxy
looks, at first, like a single star).

The closest thing to implementing this vision that was available when
I last reviewed what was available was implementing Condor to link separate
Mosix clusters.

I remember a few near-consensuses being reached on the linux-cluster
mailing list.

These included:

   Defining a standard interface for  requesting and obtaining
cluster services and enforcing compliance to it makes sense.

   Arguing about whose implementation of any particular clustering
feature is best does not make sense.  (Given free exchange of techniques
and a standard interface, the in fact better techniques will gradually nudge
out the in fact inferior ones with no shoving required.)

    A standard cluster configuration interface (CCI) defined as a fs
of its own makes sense, rather that squatting within /proc

   the CCI can be mounted anywhere, (possibly back
 within /proc) so multiple clusters on the same box will not
collide with each other -- each gets its own CCI, and all syscalls to
cluster parts include a pointer to a cluster configuration object, of which
there can be more than one defined 

   The first order of business therefore was to take a survey of
services provided by clustering infrastructures and work out standardizable
interfaces to these services

That's what I remember.  The survey of services may or may not have
been performed
formally, I know that a survey of cluster services provided can be
done easily -- is done
often, whenever anyone tries to select what kind of cluster they want to set up.

The role of the linux kernel maintainer, in the question of supporting
multiple disparate
cluster platforms, is NOT to choose one, but is to set up ground rules
under which
they can all play nice together.  Just like the file systems all play
nice together currently.

The thought of having two different spindles each with their own file
system is not
something that anyone blinks at anymore, but it was once revolutionary.

Having one computer participating in N clusters, at the same time, may in the
future be a similar non-issue.

Pertaining to the list of cluster services, here's a quick and small list of the
ones that spring to my mind as being valiud for inclusion into the CCI, without
doing too much additional research:

   services (including statistics) that cluster membership provides
to processes on the node should be defined and offered through the CCI

   node identification in each cluster, abstracted into that cluster

   information about other nodes

   extended PID must include cluster-ID as well as node-ID when discussing
   PID extension mechanisms:  if I am process 00F3 on my own node, I might
   have an extended pid of 000400F3 on a cluster in which I am node 4 and an
   extended pid of 001000F3 on a cluster in which I am node sixteen.

   the publish/subscribe model (just read that one today) is very good

   standardize a publish/subscribe RPC interface in terms of operations on
   filesystem entities within the CCI

Based on discussion on the cap-talk mailing list, i'd like to suggest
that publish/subscribe get implemented in terms of one-off capability
tickets, but that's a perfect example of the kind of implementatin
detail I'm saying we
do not need to define.  How a particular clustering system
implements remote procedure call is not relevant to mandating a standard for
how clustering systems, should their engineers choose to have their
product comply with a standard, may present available RPCs in the CCI, and how
processes on nodes subscribed to their clusters may call an available
RPC through
the CCI.

The big insight that I hope you to take away from this e-mail, if you
haven't had it already (I have been out of touch with the insight level of
LKML for a long time) is that clustering integration into the kernel makes
sense as a standards establishment and enforcement effort, rather than a
technology selection and inclusion effort (at least at first -- once
the CCI exists,
cluster providers might all rush to provide their "CCI driver modules" and then
we're back to selection and inclusion) and that is a different kind of effort.

Although not a new kind.  While writing that paragraph I realized that
the file system interface and the entire module interface, and any other
kind of plug-it-in-and-it-works-the-same interface linux supports, sound,
video, et cetera -- are all standards enforcement problems rather than
technology
selection problems.

Not recognizing that clustering is such a problem is what I believe is
holding back cluster infrastructure from appearing in the kernel.

So last septembers thread about message passing services, in my
vision, is improper. The question is not, how do we pass messages,
but, we have nodes that we know by node-id, and we have messages
that we wish to pass to them, how do we provide a mechanism so that
knowing only those things, node Id and message, an entity that
wishes to pass a message can ask the cluster system to pass the message?

Given modularization, it will then be possible to drop in and replace systems as
needed or as appropriate.

--
David L Nicol
Director of Research and Development
Mindtrust LLC, Kansas City, Missouri

_______________________________________________
Clusters_sig mailing list
Clusters_sig at lists.osdl.org
http://lists.osdl.org/mailman/listinfo/clusters_sig




From debug at MIT.EDU  Sat May 21 18:48:09 2005
From: debug at MIT.EDU (Cluster 2005)
Date: Sat, 21 May 2005 14:48:09 -0400
Subject: [Linux-cluster] Cluster 2005 Paper Final Deadline May 28
Message-ID: <5.2.1.1.2.20050521130749.03aa3798@hesiod>




********************************************************************************
**                               CALL FOR 
PAPERS                              **
********************************************************************************

                                   Cluster 2005
               The 2005 IEEE International Conference on Cluster Computing
                               September 27-30, 2005
               Burlington Marriott Boston, Burlington, Massachusetts, USA
                                http://cluster2005.org
                                 info at cluster2005.org

--------------------------------------------------------------------------------

  In less than a decade, clusters of commodity PCs have become the most
  cost-effective computing platforms for executing a wide range of
  high performance applications from molecular biology simulations to
  search and indexing on the Internet.

  In spite of countless deployed platforms and resounding application
  successes, many research and development challenges remain for achieving
  better performance, scalability, and usability. The IEEE TFCC Cluster
  2005 conference provides an open forum for researchers, practitioners,
  and users to present and discuss issues, directions, and results
  that will shape the future of cluster computing.

  As cluster architecture is becoming a more mature field of
  application it remains particularly important to maintain an event
  that links the academic interests and new ideas to the experience
  reports of practitioners and application developers.

See the upcoming Call for Participation for more details.

SCOPE
-----

  Cluster 2005 welcomes paper and poster submissions from engineers and
  scientists in academia and industry describing original research work
  in all areas of cluster computing. Topics of interest include, but
  are not limited to:


    Cluster Architecture and Networking:
     - High-Speed System Interconnects   - Lightweight Communication Protocols
     - Fast Message Passing Libraries    - Networking Cost/Performance

    Performance Analysis and Evaluation:
     - Monitoring and Profiling Tools    - Performance Prediction and Modeling
     - Performance Analysis and Visualization

    Cluster Management and Maintenance:
     - Cluster Security and Reliability  - Tools for Managing Clusters
     - Job and Resource Management       - High-Availability Clusters

    Cluster Software and Middleware:
     - Single-System Image Services      - Software Environments and Tools
     - Standard Software for Clusters      for Application Building

    Cluster Storage and Cluster I/O Subsystems:
     - I/O Libraries                     - Realtime I/O (e.g in Physics)
     - Distributed File Systems and RAIDs

    Applications:
     - Scientific Applications           - Data Distribution and Load Balancing
     - Algorithms for Distributed Apps.  - Innovative Cluster Applications
     - Scalable Internet Services on Clusters

    Grid Computing and Clusters:
     - Cluster Integration in Grids
     - Network-Based Distributed Computing
     - Cluster-Based Grid Services


TECHNICAL PAPERS
----------------

Format for submission: Full paper drafts, not to exceed 25 double-spaced, 
numbered pages
(including title, author affiliations, abstract, keywords, figures, tables and
bibliographic references) using 12 point font on 8.5x11-inch pages with 
1-inch margins
all around. A web based submission mechanisms will be activated on the 
conference web site
two weeks before the submission deadline. Authors should submit a 
PostScript (level 2) or
PDF file. Hard copy submissions can not be accepted.

POSTERS
-------

Format for submission: 1 page abstract in PDF including names of authors, 
their affiliation
and a 200 word abstract sent as an e-mail attachment to the poster chair, 
Shawn Houston
<poster at cluster2005.org>. Poster presentations will also be offered to the 
authors of
technical papers that were not accepted for oral presentation but are 
recommended by the
committee for this form of publication.

  IMPORTANT DATES
  ---------------
      - Paper Submissions Due                  May 21, 2005
        EXTENDED TO                        Sat May 28, 2005, 12:00 GMT 
(Zulu Time)
      - Tutorials Submissions Due              May 21, 2005
        EXTENDED TO                        Sat May 28, 2005, 12:00 GMT 
(Zulu Time)
      - Paper Acceptance Notification         June 30, 2005
      - Poster Submissions Due                June 30, 2005
      - Exhibition Proposal Due               June 30, 2005
      - Poster Acceptance Notification        June 30, 2005
      - Camera-Ready Paper Manuscripts Due    July 16, 2005
      - Camera-Ready Poster Abstracts Due     July 16, 2005
      - Cut-off for group hotel rates         September 5, 2005
      - Conference                            September 27-30, 2005
      - Post-Conference Workshops             September 30, 2005

ORGANIZING COMMITTEE
--------------------

     * General Chair:
       - Dimiter Avresky, Northeastern University, Boston, USA

     * General Vice Chair:
       - Daniel S. Katz, Jet Propulsion Laboratory, Caltech, USA

     * Program Chair:
       - Thomas Stricker, Google European Engineering Centre, Switzerland

     * Workshops Chair:
       - Rajkumar Buyya, Senior Lecturer, Dept. of Computer Science and
         Software Engineering, The University of Melbourne, Australia

     * Tutorials Chair:
       - Box Leangsuksun, Lousiana Tech, Louisiana, USA

     * Exhibits/Sponsors Co-Chairs:
        - Ivan Judson, Argonne National Laboratory, USA
        - Rosa Badia, CEPBA-IBM Research Institute, UPC, Spain

     * Posters Chair:
        - Shawn Houston, University of Alaska, USA

     * Publications Chair:
        - Kurt Keville, Massachusetts Institute of Technology, USA

     * Publicity Chair:
        - Kevin Gleason, Mount Ida College, Newton, USA

     * Finance/Registration Chair:
        - Madeleine Furlotte, Genzyme, USA









From jbrassow at redhat.com  Mon May 23 14:36:59 2005
From: jbrassow at redhat.com (Jonathan E Brassow)
Date: Mon, 23 May 2005 09:36:59 -0500
Subject: [Linux-cluster] any help ?
In-Reply-To: <001401c55b1e$edbc6320$69050364@yazanz>
References: <001401c55b1e$edbc6320$69050364@yazanz>
Message-ID: <368c2888b2f443aa45a07e5317c1da0a@redhat.com>

It would help if you sent along the 'system.pool' file, and the output 
of the command when run.

  brassow

On May 17, 2005, at 3:28 PM, Yazan Al-Sheyyab wrote:

> hello ,
> ?
> ?i have a problem when making the ( pool_tool -c system.pool ) command
> ?
> ? the system.pool file is a file i build myself and related to a 
> partition created from LVM (logical)
> ?and i have 15 file like it.
> ??
> ? what is the problem here and how can i solve it and result to a 
> pool?written to the /dev/pool.
> ?
> ?? it write to me that it is unable to open the?whole partitions 
> created by lvm from (lvma-...lvmo)
> ?each time i use this command.
> ?
> ?? and after that?i cannot?assemble them too.
> ?
> ?? Please any help solving this problem.?
> ?
> ?
>
>
> Regards
> -------------------------------------------------
> ?
> Yazan.
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: text/enriched
Size: 1822 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050523/397dd0bb/attachment.bin>

From james.eastman at fedex.com  Mon May 23 15:27:05 2005
From: james.eastman at fedex.com (James Eastman)
Date: Mon, 23 May 2005 10:27:05 -0500
Subject: [Linux-cluster] ./configure, make ,
	make install problem on gentoo linux with a 2.6.11 kernel
In-Reply-To: <20050523034504.GA6904@redhat.com>
References: <Pine.LNX.4.58.0505192020090.5760@dino.gi.alaska.edu>
	<20050520144126.GF6097@null.msp.redhat.com>
	<20050520152159.GG6097@null.msp.redhat.com>
	<42914C92.2080602@fedex.com> <20050523034504.GA6904@redhat.com>
Message-ID: <4291F649.1050709@fedex.com>

David:

I tried cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout -r 
FC4 cluster/gfs AND cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster 
checkout -r FC4 cluster/gfs-kernel respectively.  When I did this if did 
NOT get all the same stuff as when I did the cvs commands as listed at 
http://sources.redhat.com/cluster/gfs/.  When I cd into the cluster 
directory now I see:

oragrid5 cluster # pwd
/root/cluster
oragrid5 cluster # ls -la
total 20
drwxrwxr-x   5  500  500 4096 May 23 10:12 .
drwx------  22 root root 4096 May 23 10:17 ..
drwxrwxr-x   2  500  500 4096 May 23 10:12 CVS
drwxrwxr-x  19  500  500 4096 May 23 10:12 gfs
drwxrwxr-x   7  500  500 4096 May 23 10:12 gfs-kernel
oragrid5 cluster #

When I do the .configure in the gfs directory I see:

oragrid5 gfs # ./configure --kernel_src=/usr/src/linux-2.6.11-gentoo-r6

Configuring Makefiles for your system...
Completed Makefile configuration

oragrid5 gfs #

So, of course I think that i mught just have success.  However, when I 
do a make I see:

oragrid5 gfs # make
cd gfs_edit && make all
make[1]: Entering directory `/root/cluster/gfs/gfs_edit'
gcc -Wall -I../include -I../config -I//usr/include -DHELPER_PROGRAM 
-D_FILE_OFFSET_BITS=64 -DGFS_RELEASE_NAME=\"DEVEL.1116861794\" 
-I../include -I../config -I//usr/include  gfshex.c hexedit.c ondisk.c   
-o gfs_edit
gfshex.c:24:30: linux/gfs_ondisk.h: No such file or directory
gfshex.c:50: warning: `struct gfs_dinode' declared inside parameter list
gfshex.c:50: warning: its scope is only this definition or declaration, 
which is probably not what you want
gfshex.c: In function `do_dinode_extended':
gfshex.c:53: error: storage size of `de' isn't known
gfshex.c:57: error: dereferencing pointer to incomplete type
gfshex.c:61: error: invalid application of `sizeof' to an incomplete type
gfshex.c:65: warning: implicit declaration of function `gfs64_to_cpu'
gfshex.c:73: error: dereferencing pointer to incomplete type
gfshex.c:73: error: `GFS_FILE_DIR' undeclared (first use in this function)
gfshex.c:73: error: (Each undeclared identifier is reported only once
gfshex.c:73: error: for each function it appears in.)
gfshex.c:74: error: dereferencing pointer to incomplete type
gfshex.c:74: error: `GFS_DIF_EXHASH' undeclared (first use in this function)
gfshex.c:78: error: invalid application of `sizeof' to an incomplete type
gfshex.c:81: warning: implicit declaration of function `gfs_dirent_in'
gfshex.c:83: warning: implicit declaration of function `gfs_dirent_print'
gfshex.c:83: error: invalid application of `sizeof' to an incomplete type
gfshex.c:88: error: dereferencing pointer to incomplete type
gfshex.c:89: error: dereferencing pointer to incomplete type
gfshex.c:90: error: dereferencing pointer to incomplete type
gfshex.c:94: error: invalid application of `sizeof' to an incomplete type
gfshex.c:97: error: invalid application of `sizeof' to an incomplete type
gfshex.c:98: error: dereferencing pointer to incomplete type
gfshex.c:112: error: dereferencing pointer to incomplete type
gfshex.c:53: warning: unused variable `de'
gfshex.c: In function `do_indirect_extended':
gfshex.c:142: error: invalid application of `sizeof' to an incomplete type
gfshex.c: In function `do_leaf_extended':
gfshex.c:170: error: storage size of `de' isn't known
gfshex.c:176: error: invalid application of `sizeof' to an incomplete type
gfshex.c:181: error: invalid application of `sizeof' to an incomplete type
gfshex.c:170: warning: unused variable `de'
gfshex.c: In function `do_eattr_extended':
gfshex.c:204: error: storage size of `ea' isn't known
gfshex.c:210: error: invalid application of `sizeof' to an incomplete type
gfshex.c:213: warning: implicit declaration of function `gfs_ea_header_in'
gfshex.c:214: warning: implicit declaration of function 
`gfs_ea_header_print'
gfshex.c:214: error: invalid application of `sizeof' to an incomplete type
gfshex.c:204: warning: unused variable `ea'
gfshex.c: In function `display_gfs':
gfshex.c:239: error: storage size of `mh' isn't known
gfshex.c:240: error: storage size of `sb' isn't known
gfshex.c:241: error: storage size of `rg' isn't known
gfshex.c:242: error: storage size of `di' isn't known
gfshex.c:243: error: storage size of `lf' isn't known
gfshex.c:244: error: storage size of `lh' isn't known
gfshex.c:245: error: storage size of `ld' isn't known
gfshex.c:257: warning: implicit declaration of function `gfs32_to_cpu'
gfshex.c:262: error: `GFS_MAGIC' undeclared (first use in this function)
gfshex.c:263: warning: implicit declaration of function `gfs_meta_header_in'
gfshex.c:267: error: `GFS_METATYPE_SB' undeclared (first use in this 
function)
gfshex.c:269: warning: implicit declaration of function `gfs_sb_in'
gfshex.c:270: warning: implicit declaration of function `gfs_sb_print'
gfshex.c:278: error: `GFS_METATYPE_RG' undeclared (first use in this 
function)
gfshex.c:280: warning: implicit declaration of function `gfs_rgrp_in'
gfshex.c:281: warning: implicit declaration of function `gfs_rgrp_print'
gfshex.c:289: error: `GFS_METATYPE_RB' undeclared (first use in this 
function)
gfshex.c:291: warning: implicit declaration of function 
`gfs_meta_header_print'
gfshex.c:299: error: `GFS_METATYPE_DI' undeclared (first use in this 
function)
gfshex.c:301: warning: implicit declaration of function `gfs_dinode_in'
gfshex.c:302: warning: implicit declaration of function `gfs_dinode_print'
gfshex.c:310: error: `GFS_METATYPE_LF' undeclared (first use in this 
function)
gfshex.c:312: warning: implicit declaration of function `gfs_leaf_in'
gfshex.c:313: warning: implicit declaration of function `gfs_leaf_print'
gfshex.c:321: error: `GFS_METATYPE_IN' undeclared (first use in this 
function)
gfshex.c:331: error: `GFS_METATYPE_JD' undeclared (first use in this 
function)
gfshex.c:341: error: `GFS_METATYPE_LH' undeclared (first use in this 
function)
gfshex.c:343: warning: implicit declaration of function `gfs_log_header_in'
gfshex.c:344: warning: implicit declaration of function 
`gfs_log_header_print'
gfshex.c:352: error: `GFS_METATYPE_LD' undeclared (first use in this 
function)
gfshex.c:354: warning: implicit declaration of function `gfs_desc_in'
gfshex.c:355: warning: implicit declaration of function `gfs_desc_print'
gfshex.c:363: error: `GFS_METATYPE_EA' undeclared (first use in this 
function)
gfshex.c:373: error: `GFS_METATYPE_ED' undeclared (first use in this 
function)
gfshex.c:239: warning: unused variable `mh'
gfshex.c:240: warning: unused variable `sb'
gfshex.c:241: warning: unused variable `rg'
gfshex.c:242: warning: unused variable `di'
gfshex.c:243: warning: unused variable `lf'
gfshex.c:244: warning: unused variable `lh'
gfshex.c:245: warning: unused variable `ld'
hexedit.c:25:30: linux/gfs_ondisk.h: No such file or directory
In file included from hexedit.c:29:
hexedit.h:33: error: `GFS_BASIC_BLOCK' undeclared here (not in a function)
hexedit.c: In function `run_command':
hexedit.c:181: error: `GFS_BASIC_BLOCK' undeclared (first use in this 
function)
hexedit.c:181: error: (Each undeclared identifier is reported only once
hexedit.c:181: error: for each function it appears in.)
ondisk.c:19:30: linux/gfs_ondisk.h: No such file or directory
ondisk.c:25:30: linux/gfs_ondisk.h: No such file or directory
make[1]: *** [gfs_edit] Error 1
make[1]: Leaving directory `/root/cluster/gfs/gfs_edit'
make: *** [all] Error 2
oragrid5 gfs #

As well, if I do a ./configure then a make in the gfs-kernel directory I 
see:

oragrid5 gfs-kernel # make
cd src && make all
make[1]: Entering directory `/root/cluster/gfs-kernel/src'
cd harness && make all
make[2]: Entering directory `/root/cluster/gfs-kernel/src/harness'
rm -f linux
ln -s . linux
make -C /usr/src/linux-2.6.11-gentoo-r6 
M=/root/cluster/gfs-kernel/src/harness 
symverfile=/usr/src/linux-2.6.11-gentoo-r6/../kernel/cluster/dlm.symvers 
modules USING_KBUILD=yes
make[3]: Entering directory `/usr/src/linux-2.6.11-gentoo-r6'
make[3]: *** No rule to make target `modules'.  Stop.
make[3]: Leaving directory `/usr/src/linux-2.6.11-gentoo-r6'
make[2]: *** [all] Error 2
make[2]: Leaving directory `/root/cluster/gfs-kernel/src/harness'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/root/cluster/gfs-kernel/src'
make: *** [all] Error 2
oragrid5 gfs-kernel #

What gives?  Thanks in advance for all of your help.

-- 
James

"You know you've achieved perfection in design and development, not when you have nothing more to add, but when you have nothing more to take away." -- Antione de Saint-Exupery





David Teigland wrote:

>On Sun, May 22, 2005 at 10:22:58PM -0500, James Eastman wrote:
>
>  
>
>>As you can see I'm running a 2.6.11 kernel.  So I decide to get the latest 
>>CVS snapshot and began my compiling adventure.  Here're the CVS comands I 
>>ran to get the latest source code:
>>
>>cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster login cvs
>>cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout cluster/gfs
>>cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout 
>>cluster/gfs-kernel
>>cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout cluster/gfs2
>>cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout 
>>cluster/gfs2-kernel
>>    
>>
>
>Code on the cvs head is in the middle of an overhaul; it doesn't all work,
>and the pieces that do work don't all work together yet.  So, you need to
>use the FC4 cvs branch which will come closest to matching 2.6.11.
>
>Dave
>  
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050523/7bd3a694/attachment.htm>

From teigland at redhat.com  Mon May 23 15:55:19 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 23 May 2005 23:55:19 +0800
Subject: [Linux-cluster] ./configure, make ,
	make install problem on gentoo linux with a 2.6.11 kernel
In-Reply-To: <4291F649.1050709@fedex.com>
References: <Pine.LNX.4.58.0505192020090.5760@dino.gi.alaska.edu>
	<20050520144126.GF6097@null.msp.redhat.com>
	<20050520152159.GG6097@null.msp.redhat.com>
	<42914C92.2080602@fedex.com> <20050523034504.GA6904@redhat.com>
	<4291F649.1050709@fedex.com>
Message-ID: <20050523155519.GA15220@redhat.com>

On Mon, May 23, 2005 at 10:27:05AM -0500, James Eastman wrote:
> David:
> 
> I tried cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout -r 
> FC4 cluster/gfs AND cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster 
> checkout -r FC4 cluster/gfs-kernel respectively.  When I did this if did 
> NOT get all the same stuff as when I did the cvs commands as listed at 
> http://sources.redhat.com/cluster/gfs/.  

Those cvs commands are misleading, we'll get them clarified.

Also, you need to check out and build the entire cluster tree, not just
the gfs component since gfs and gfs-kernel depend on a bunch of the other
stuff in the cluster tree.

Dave



From ritesh.a at net4india.net  Tue May 24 13:32:46 2005
From: ritesh.a at net4india.net (Ritesh Agrawal)
Date: Tue, 24 May 2005 19:02:46 +0530
Subject: [Linux-cluster] Re: Piranha hangs
In-Reply-To: <428F42FC.6070407@net4india.net>
References: <428B8AE7.3090702@net4india.net>	<1116620024.4621.25.camel@ayanami.boston.redhat.com>
	<428F42FC.6070407@net4india.net>
Message-ID: <42932CFE.60905@net4india.net>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Hi ,

~ Now my loadbalancer is working fine with same hardware and same
configurations , i just boot the server in single proccesor kernel image
instead of SMP kernel image .

my all real servers working fine...
it means 'pulse' daemon do not working fine with smp kernel image....
i don't know what the exact reason behind it.
plz help me to findout proper reason.

Thanks for your help
Regards
Ritesh


Ritesh Agrawal wrote:
|
|
| Lon Hohberger wrote:
|
|> On Thu, 2005-05-19 at 00:05 +0530, Ritesh Agrawal wrote:
|>
|>
|>> Hi ,
|>>   i am facing strange problem in implmenting load balancer using
|>> piranha,  i make loadbalancer LB1 with following configurations
|>> LB1:
|>> private ip: 192.168.35.253
|>> public ip:192.168.24.126
|>> floating VIP:192.168.35.254
|>> service ip : 192.168.24.60
|>>
|>> and spam2 is my real server  with ip 192.168.35.22 providing http
|>> service.
|>> after staring pulse, when  i send  the request from outside world to
|>> 192.168.24.60:80. it works fine,but after some time my LB1 hanged
|>> displaying no error no clue,
|>> unable to find the proper reason of hanging.
|>
|>
|>
|> Off the top of my head, piranha hanging (or even crashing!) shouldn't
|> affect the load balancing traffic.  Piranha controls the routing
|> assignments, but it doesn't do any routing itself - the routing is done
|> in-kernel in the IPVS modules.
|>
|> So, I suspect that lb1 (or rather, its kernel) might be hung as opposed
|> to piranha...
|>
|> Can you get a serial console attached to lb1 and see if it's
|> panicking/hung?
|>
|> -- Lon
|
|
| Hi Lon,
| Thanks for your help , actually after starting the pulse , everything
| going well,  10-20 web request it handle properly , but within a minutes
| Machine hangs , i don't why , I tried to troubleshoot by attaching the
| console with server ,no kernel panic or other error message displayed in
| console.
|
| I checked /var/log/message for any error ,but i couldn't find any clue.
| If problem with hardware or software configuration then even single
| request shouldn't be handled , but it properly work for 10-20 requests,
| but after some idle time it hangs.
|
| Regards
| Ritesh
|
|
|
|
| _______________________________________________
|
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFCkyz9Foz+P95jnTIRAi25AKDwGXX0sqYr1CysDZrXhaBSlTK6RACgwKln
fMtNa2KaPnNzAXJgt6J+jXM=
=BjGw
-----END PGP SIGNATURE-----



From lhh at redhat.com  Tue May 24 15:21:37 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 24 May 2005 11:21:37 -0400
Subject: [Linux-cluster] Re: Piranha hangs
In-Reply-To: <42932CFE.60905@net4india.net>
References: <428B8AE7.3090702@net4india.net>
	<1116620024.4621.25.camel@ayanami.boston.redhat.com>
	<428F42FC.6070407@net4india.net>  <42932CFE.60905@net4india.net>
Message-ID: <1116948097.18073.2.camel@ayanami.boston.redhat.com>

On Tue, 2005-05-24 at 19:02 +0530, Ritesh Agrawal wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> 
> Hi ,
> 
> ~ Now my loadbalancer is working fine with same hardware and same
> configurations , i just boot the server in single proccesor kernel image
> instead of SMP kernel image .
> 
> my all real servers working fine...
> it means 'pulse' daemon do not working fine with smp kernel image....
> i don't know what the exact reason behind it.
> plz help me to findout proper reason.

Sounds like one or more of the IPVS kernel modules isn't SMP safe...

-- Lon



From jason at selu.edu  Tue May 24 15:24:33 2005
From: jason at selu.edu (Jason Lanclos)
Date: Tue, 24 May 2005 10:24:33 -0500
Subject: [Linux-cluster] multiple service instances with rgmanager
Message-ID: <200505241024.33190.Jason@selu.edu>


Is it possible setup a service that will run on 2 or more nodes at the same time?

I am testing RHEL4 and cluster/gfs for our mail server setup.
Currently we have 2 RHAS 2.1 boxes attached to a SAN. All filesystems are ext3.
One node handles all SMTP requests, and does virus scanning, mimedefang, spamassassin
and several other things with the incoming mail.
When its finished, messages gets handed off to the node that we have running the "mailstore"
which has all the home directories, and handles all of the IMAP/POP connections.
Needless to say, with everyone hitting the same server to receive mail, this machine
can get pretty busy, while the sendmail node has plenty of processing power to spare.

So, I've pretty much replicated our setup using RHEL4 and cluster/gfs so both boxes can 
receive mail, and service client connections for IMAP/POP.
With the setup being this simple, I know I could not even worry about setting up services in
rgmanager, but it would be nice to start/stop/check services on both nodes with one command.


-- 
Jason Lanclos                                        
Systems Administrator                                 
Red Hat Certified Engineer        
Southeastern Louisiana University		     



From mtilstra at redhat.com  Tue May 24 15:26:10 2005
From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra)
Date: Tue, 24 May 2005 10:26:10 -0500
Subject: [Linux-cluster] Unknow lock state!
In-Reply-To: <BAY107-F27741B591542ABA5D2B0FBDC080@phx.gbl>
References: <BAY107-F27741B591542ABA5D2B0FBDC080@phx.gbl>
Message-ID: <20050524152610.GB26812@redhat.com>

On Thu, May 19, 2005 at 08:10:25PM +0200, Javier C. wrote:
> Hi everybody;
> 
> We are using 3 servers with gfs and in logs of one of them i can read the 
> following thing:
> 
> ---------------------------
> May 19 20:02:31 Linux1 kernel: lock_gulm: Error on lock 
> 0x4707000000000000309a286461746f7300 Got a drop lcok request for a lock 
> that we don't know of. state:0x1
> ------------------------------
> 
> This message often appears and we do not know to what it can be.
> 
> We are using RH E.Linux Update 2. Kernel 2.4.21-15. Gfs version is 6.0.0-1.2

It can sometimes appear after there has been a Master Lock Server
failure and recovery.  It is mostly harmless, but you should shutdown
and restart the entire cluster when you get the chance.

-- 
Michael Conrad Tadpol Tilstra
Give me ambiguity or give me something else. 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050524/a7e89539/attachment.sig>

From lhh at redhat.com  Tue May 24 19:41:01 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 24 May 2005 15:41:01 -0400
Subject: [Linux-cluster] multiple service instances with rgmanager
In-Reply-To: <200505241024.33190.Jason@selu.edu>
References: <200505241024.33190.Jason@selu.edu>
Message-ID: <1116963661.18073.37.camel@ayanami.boston.redhat.com>

On Tue, 2005-05-24 at 10:24 -0500, Jason Lanclos wrote:
> Is it possible setup a service that will run on 2 or more nodes at the same time?
> 
> I am testing RHEL4 and cluster/gfs for our mail server setup.
> Currently we have 2 RHAS 2.1 boxes attached to a SAN. All filesystems are ext3.
> One node handles all SMTP requests, and does virus scanning, mimedefang, spamassassin
> and several other things with the incoming mail.
> When its finished, messages gets handed off to the node that we have running the "mailstore"
> which has all the home directories, and handles all of the IMAP/POP connections.
> Needless to say, with everyone hitting the same server to receive mail, this machine
> can get pretty busy, while the sendmail node has plenty of processing power to spare.
> 
> So, I've pretty much replicated our setup using RHEL4 and cluster/gfs so both boxes can 
> receive mail, and service client connections for IMAP/POP.
> With the setup being this simple, I know I could not even worry about setting up services in
> rgmanager, but it would be nice to start/stop/check services on both nodes with one command.

Currently, no.  It's not really that difficult to add, and might be a
consideration for a future release.

However, you can have certain things be reused between services.  So,
you can have two services running the same set of initscripts.

Something like:

   <resources>
      <script name="a" path="/tmp/a"/>
      <script name="b" path="/tmp/b"/>
   </resources>
   <service name="foo1" domain="node1dom">
      <script ref="a"/>
      <script ref="b"/>
   </service>
   <service name="foo2" domain="node2dom">
      <script ref="a"/>
      <script ref="b"/>
   </service>

"clustat" will show you the state of those services with one command,
but to start/stop them, you'll have to use two commands.

(node1dom = restricted fd with only node1 in the list; node2dom = same,
for node2)

-- Lon



From lhh at redhat.com  Tue May 24 19:44:06 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 24 May 2005 15:44:06 -0400
Subject: [Linux-cluster] LVS error: "bad load average"
In-Reply-To: <17e909a4050517031458b6846b@mail.gmail.com>
References: <17e909a4050517031458b6846b@mail.gmail.com>
Message-ID: <1116963846.18073.39.camel@ayanami.boston.redhat.com>

On Tue, 2005-05-17 at 13:14 +0300, Michael Green wrote:
> Getting the following errors in the log file from the LVS:

What version of piranha?

-- Lon




From ARodriguez at INNODATA-ISOGEN.COM  Wed May 25 07:43:01 2005
From: ARodriguez at INNODATA-ISOGEN.COM (Rodriguez, Antonio)
Date: Wed, 25 May 2005 15:43:01 +0800
Subject: [Linux-cluster] RHEL4AS w/ cvs RHEL4 branch
Message-ID: <D3D5DC9EC4913C4E805DE9487D4EA67712079F@MDEE2K25.MANDAUE.INNODATA.NET>

Hi everyone,

 

I manage to install RHEL4 cvs branch on a two server RHEL4 w/ kernel
2.6.9-5.ELsmp using the usage.txt as my guide.

Failover seems fine when following the shutdown procedure. 

But, when simulating a broken server scenario by just turning off power
to the server, the surviving node hangs

until such time that I was able to bootup and run "fence_tool join"
again on the previously failed node. 

 

Is the cvs RHEL4 branch capable of supporting this kind of scenario? 

I have a previous installation of RHEL3 AS w/ GFS 6.0 and works fine
when tested this way.

The only issue I have with that setup is the speed performance of my
cluster. 

Using RHEL4, my samba server is a little faster than my windows file
servers, hence the need to make this work on RHEL4.

 

I'm currently using manual fence method, but planning to make use of
switch fencing (brocade) for production.

 

Would appreciate any help on this. 

 

Best Regards,

Antonio Rodriguez

 

 

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050525/8291829e/attachment.htm>

From teigland at redhat.com  Wed May 25 08:06:10 2005
From: teigland at redhat.com (David Teigland)
Date: Wed, 25 May 2005 16:06:10 +0800
Subject: [Linux-cluster] RHEL4AS w/ cvs RHEL4 branch
In-Reply-To: <D3D5DC9EC4913C4E805DE9487D4EA67712079F@MDEE2K25.MANDAUE.INNODATA.NET>
References: <D3D5DC9EC4913C4E805DE9487D4EA67712079F@MDEE2K25.MANDAUE.INNODATA.NET>
Message-ID: <20050525080610.GE24147@redhat.com>

On Wed, May 25, 2005 at 03:43:01PM +0800, Rodriguez, Antonio wrote:
> Hi everyone,
> 
>  
> 
> I manage to install RHEL4 cvs branch on a two server RHEL4 w/ kernel
> 2.6.9-5.ELsmp using the usage.txt as my guide.
> 
> Failover seems fine when following the shutdown procedure. 
> 
> But, when simulating a broken server scenario by just turning off power
> to the server, the surviving node hangs
> 
> until such time that I was able to bootup and run "fence_tool join"
> again on the previously failed node. 

Do you have this in your cluster.conf?

  <cman two_node="1" expected_votes="1">
  </cman>

If not, then the live node will lose quorum when you kill the other node
and it won't do anything until the failed node has rejoined the cluster.

If you already have that in cluster.conf, then is there a manual fencing
message in /var/log/messages on the live node?

Dave



From ARodriguez at INNODATA-ISOGEN.COM  Wed May 25 08:54:16 2005
From: ARodriguez at INNODATA-ISOGEN.COM (Rodriguez, Antonio)
Date: Wed, 25 May 2005 16:54:16 +0800
Subject: [Linux-cluster] RHEL4AS w/ cvs RHEL4 branch
Message-ID: <D3D5DC9EC4913C4E805DE9487D4EA6771207A1@MDEE2K25.MANDAUE.INNODATA.NET>

Thanks for the quick reply.

Yes, my cluster.conf contains this line (required in two node cluster
configuration).
<cman two_node="1" expected_votes="1">
  </cman>

There's also a fencing messages on the surviving node, saying that the 
failed node should be restarted before any recovery can proceed. Waiting

for the node to rejoin the cluster or manual acknowledgement...

Does this mean that I cannot use the surviving node to host my resources
while my other node is fenced? 

Clustat says that my cluster is still qourate even with one node
remaining.

Please enlighten me.

Thanks,
Antonio Rodriguez


-----Original Message-----
From: David Teigland [mailto:teigland at redhat.com] 
Sent: Wednesday, May 25, 2005 4:06 PM
To: Rodriguez, Antonio
Cc: linux-cluster at redhat.com
Subject: Re: [Linux-cluster] RHEL4AS w/ cvs RHEL4 branch

On Wed, May 25, 2005 at 03:43:01PM +0800, Rodriguez, Antonio wrote:
> Hi everyone,
> 
>  
> 
> I manage to install RHEL4 cvs branch on a two server RHEL4 w/ kernel
> 2.6.9-5.ELsmp using the usage.txt as my guide.
> 
> Failover seems fine when following the shutdown procedure. 
> 
> But, when simulating a broken server scenario by just turning off
power
> to the server, the surviving node hangs
> 
> until such time that I was able to bootup and run "fence_tool join"
> again on the previously failed node. 

Do you have this in your cluster.conf?

  <cman two_node="1" expected_votes="1">
  </cman>

If not, then the live node will lose quorum when you kill the other node
and it won't do anything until the failed node has rejoined the cluster.

If you already have that in cluster.conf, then is there a manual fencing
message in /var/log/messages on the live node?

Dave




From teigland at redhat.com  Wed May 25 09:09:30 2005
From: teigland at redhat.com (David Teigland)
Date: Wed, 25 May 2005 17:09:30 +0800
Subject: [Linux-cluster] RHEL4AS w/ cvs RHEL4 branch
In-Reply-To: <D3D5DC9EC4913C4E805DE9487D4EA6771207A1@MDEE2K25.MANDAUE.INNODATA.NET>
References: <D3D5DC9EC4913C4E805DE9487D4EA6771207A1@MDEE2K25.MANDAUE.INNODATA.NET>
Message-ID: <20050525090930.GF24147@redhat.com>

On Wed, May 25, 2005 at 04:54:16PM +0800, Rodriguez, Antonio wrote:
> Thanks for the quick reply.
> 
> Yes, my cluster.conf contains this line (required in two node cluster
> configuration).
> <cman two_node="1" expected_votes="1">
>   </cman>
> 
> There's also a fencing messages on the surviving node, saying that the 
> failed node should be restarted before any recovery can proceed. Waiting
> for the node to rejoin the cluster or manual acknowledgement...

You should run 'fence_ack_manual' on the surviving node (you've obviously
just restarted the failed node by pressing its power button.)

> Does this mean that I cannot use the surviving node to host my resources
> while my other node is fenced? 
> 
> Clustat says that my cluster is still qourate even with one node
> remaining.

The special two_node option allows the cluster to have quorum even though
there's only one node remaining.  Everything should run normally on the
surviving node after you do the manual ack.  Using SAN fencing with the
brocade will eliminate this manual step.

Dave



From ARodriguez at INNODATA-ISOGEN.COM  Wed May 25 11:14:56 2005
From: ARodriguez at INNODATA-ISOGEN.COM (Rodriguez, Antonio)
Date: Wed, 25 May 2005 19:14:56 +0800
Subject: [Linux-cluster] RHEL4AS w/ cvs RHEL4 branch
Message-ID: <D3D5DC9EC4913C4E805DE9487D4EA6771207A2@MDEE2K25.MANDAUE.INNODATA.NET>

Thanks Dave! 

Now I know what my problem is.

Now I'm trying to use fence_brocade as my fencing method, and I have
this error when I simulate a failed node, or when I run the command
fence_brocade:

fence_brocade 
Can't locate Net/Telnet.pm in @INC (@INC contains:
/usr/lib/perl5/5.8.5/i386-linux-thread-multi /usr/lib/perl5/5.8.5
/usr/lib/perl5/site_perl/5.8.5/i386-linux-thread-multi
/usr/lib/perl5/site_perl/5.8.4/i386-linux-thread-multi
/usr/lib/perl5/site_perl/5.8.3/i386-linux-thread-multi
/usr/lib/perl5/site_perl/5.8.2/i386-linux-thread-multi
/usr/lib/perl5/site_perl/5.8.1/i386-linux-thread-multi
/usr/lib/perl5/site_perl/5.8.0/i386-linux-thread-multi
/usr/lib/perl5/site_perl/5.8.5 /usr/lib/perl5/site_perl/5.8.4
/usr/lib/perl5/site_perl/5.8.3 /usr/lib/perl5/site_perl/5.8.2
/usr/lib/perl5/site_perl/5.8.1 /usr/lib/perl5/site_perl/5.8.0
/usr/lib/perl5/site_perl
/usr/lib/perl5/vendor_perl/5.8.5/i386-linux-thread-multi
/usr/lib/perl5/vendor_perl/5.8.4/i386-linux-thread-multi
/usr/lib/perl5/vendor_perl/5.8.3/i386-linux-thread-multi
/usr/lib/perl5/vendor_perl/5.8.2/i386-linux-thread-multi
/usr/lib/perl5/vendor_perl/5.8.1/i386-linux-thread-multi
/usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi
/usr/lib/perl5/vendor_perl/5.8.5 /usr/lib/perl5/vendor_perl/5.8.4
/usr/lib/perl5/vendor_perl/5.8.3 /usr/lib/perl5/vendor_perl/5.8.2
/usr/lib/perl5/vendor_perl/5.8.1 /usr/lib/perl5/vendor_perl/5.8.0
/usr/lib/perl5/vendor_perl .) at /sbin/fence_brocade line 17.
BEGIN failed--compilation aborted at /sbin/fence_brocade line 17.

Please help.

Regards,
Antonio Rodriguez

-----Original Message-----
From: David Teigland [mailto:teigland at redhat.com] 
Sent: Wednesday, May 25, 2005 5:10 PM
To: Rodriguez, Antonio
Cc: linux-cluster at redhat.com
Subject: Re: [Linux-cluster] RHEL4AS w/ cvs RHEL4 branch

On Wed, May 25, 2005 at 04:54:16PM +0800, Rodriguez, Antonio wrote:
> Thanks for the quick reply.
> 
> Yes, my cluster.conf contains this line (required in two node cluster
> configuration).
> <cman two_node="1" expected_votes="1">
>   </cman>
> 
> There's also a fencing messages on the surviving node, saying that the

> failed node should be restarted before any recovery can proceed.
Waiting
> for the node to rejoin the cluster or manual acknowledgement...

You should run 'fence_ack_manual' on the surviving node (you've
obviously
just restarted the failed node by pressing its power button.)

> Does this mean that I cannot use the surviving node to host my
resources
> while my other node is fenced? 
> 
> Clustat says that my cluster is still qourate even with one node
> remaining.

The special two_node option allows the cluster to have quorum even
though
there's only one node remaining.  Everything should run normally on the
surviving node after you do the manual ack.  Using SAN fencing with the
brocade will eliminate this manual step.

Dave




From teigland at redhat.com  Wed May 25 12:02:16 2005
From: teigland at redhat.com (David Teigland)
Date: Wed, 25 May 2005 20:02:16 +0800
Subject: [Linux-cluster] RHEL4AS w/ cvs RHEL4 branch
In-Reply-To: <D3D5DC9EC4913C4E805DE9487D4EA6771207A2@MDEE2K25.MANDAUE.INNODATA.NET>
References: <D3D5DC9EC4913C4E805DE9487D4EA6771207A2@MDEE2K25.MANDAUE.INNODATA.NET>
Message-ID: <20050525120216.GI24147@redhat.com>

On Wed, May 25, 2005 at 07:14:56PM +0800, Rodriguez, Antonio wrote:
> Now I'm trying to use fence_brocade as my fencing method, and I have
> this error when I simulate a failed node, or when I run the command
> fence_brocade:
> 
> fence_brocade 
> Can't locate Net/Telnet.pm in @INC (@INC contains:

Looks like you need to install the Net-Telnet perl module.

Dave



From jnewbigin at ict.swin.edu.au  Tue May 24 23:25:10 2005
From: jnewbigin at ict.swin.edu.au (John Newbigin)
Date: Wed, 25 May 2005 09:25:10 +1000
Subject: [Linux-cluster] GFS For Update 5
Message-ID: <4293B7D6.30000@ict.swin.edu.au>

Is GFS available for RHELAS3 Update 5 (kernel 2.4.21-32.EL)?  What 
version of GFS should I be looking for?

Thanks

John.



From Rosario.Esposito at na.infn.it  Wed May 25 11:20:48 2005
From: Rosario.Esposito at na.infn.it (Rosario Esposito)
Date: Wed, 25 May 2005 13:20:48 +0200
Subject: [Linux-cluster] SAN fencing
In-Reply-To: <D3D5DC9EC4913C4E805DE9487D4EA6771207A2@MDEE2K25.MANDAUE.INNODATA.NET>
References: <D3D5DC9EC4913C4E805DE9487D4EA6771207A2@MDEE2K25.MANDAUE.INNODATA.NET>
Message-ID: <42945F90.7020601@na.infn.it>

Hello,
I have a 3-nodes with shared storage HP MSA1500cs, mounting a GFS 
filesystem.
The fibre channel switch is a HP StorageWorks Edge Switch 2/12.
I have no power switch at the moment.
I can't find any fence_xxx script to manage the SAN fencing. I tried 
fence_brocade but doesn't work...
Any ideas ?

Thanks

- Rosario Esposito



From pbruna at linuxcenterla.com  Wed May 25 16:50:10 2005
From: pbruna at linuxcenterla.com (Patricio Bruna V.)
Date: Wed, 25 May 2005 12:50:10 -0400
Subject: [Linux-cluster] diferent lans
Message-ID: <200505251250.15050.pbruna@linuxcenterla.com>

can i have a cluster with nodes in diferents lans?
-- 
Patricio Bruna
pbruna at linuxcenterla.com
RHCE/RHCI
Jefe Soporte y Operaciones LinuxCenter S.A.
Canada 239, 5to piso, Providencia, Chile
http://www.linuxcenterla.com +56-2-2745000
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050525/8c3ed0bd/attachment.sig>

From lhh at redhat.com  Wed May 25 17:23:51 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 25 May 2005 13:23:51 -0400
Subject: [Linux-cluster] diferent lans
In-Reply-To: <200505251250.15050.pbruna@linuxcenterla.com>
References: <200505251250.15050.pbruna@linuxcenterla.com>
Message-ID: <1117041831.18073.62.camel@ayanami.boston.redhat.com>

On Wed, 2005-05-25 at 12:50 -0400, Patricio Bruna V. wrote:
> can i have a cluster with nodes in diferents lans?

Yes, but be aware that the cluster uses multicast and/or broadcast to do
some messaging, so you'll probably want to use a VLAN to connect the two
together.

-- Lon





From ed.mann at choicepoint.net  Wed May 25 17:40:21 2005
From: ed.mann at choicepoint.net (Edward Mann)
Date: Wed, 25 May 2005 12:40:21 -0500
Subject: [Linux-cluster] diferent lans
In-Reply-To: <200505251250.15050.pbruna@linuxcenterla.com>
References: <200505251250.15050.pbruna@linuxcenterla.com>
Message-ID: <1117042822.29947.16.camel@storm.cp-direct.com>

Yes,

As long as each node can talk to the other.

I have one node in a DMZ on a 192 network, and another on another
network on a 172 address.

So i know that this will work. Just make sure that both nodes can talk
to each other.


On Wed, 2005-05-25 at 12:50 -0400, Patricio Bruna V. wrote:
> can i have a cluster with nodes in diferents lans?
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster



From lhh at redhat.com  Wed May 25 18:16:29 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 25 May 2005 14:16:29 -0400
Subject: [Linux-cluster] diferent lans
In-Reply-To: <200505251340.35507.pbruna@linuxcenterla.com>
References: <200505251250.15050.pbruna@linuxcenterla.com>
	<1117041831.18073.62.camel@ayanami.boston.redhat.com>
	<200505251340.35507.pbruna@linuxcenterla.com>
Message-ID: <1117044989.18073.73.camel@ayanami.boston.redhat.com>

On Wed, 2005-05-25 at 13:40 -0400, Patricio Bruna V. wrote:

> and multicast will work?

The cluster software was primarily designed to be run on one network.
Using a VLAN between the two nodes should fulfill that requirement, but
it may require quite a bit of tweaking/configuration.

It hasn't been done yet (as far as I know), so please document what you
do and post it to the list for posterity ;)

-- Lon



From lhh at redhat.com  Wed May 25 18:18:07 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 25 May 2005 14:18:07 -0400
Subject: [Linux-cluster] diferent lans
In-Reply-To: <1117042822.29947.16.camel@storm.cp-direct.com>
References: <200505251250.15050.pbruna@linuxcenterla.com>
	<1117042822.29947.16.camel@storm.cp-direct.com>
Message-ID: <1117045087.18073.76.camel@ayanami.boston.redhat.com>

Ok, it *has* been done, and it doesn't require a VLAN. ;)

On Wed, 2005-05-25 at 12:40 -0500, Edward Mann wrote:

> As long as each node can talk to the other.
> 
> I have one node in a DMZ on a 192 network, and another on another
> network on a 172 address.
> 
> So i know that this will work. Just make sure that both nodes can talk
> to each other.

Are you using broadcast or multicast with CMAN? 

-- Lon



From ben at muppethouse.com  Wed May 25 18:06:51 2005
From: ben at muppethouse.com (Ben Russo)
Date: Wed, 25 May 2005 14:06:51 -0400
Subject: [Linux-cluster] RedHat HA Cluster and Software RAID?
Message-ID: <d72eoc$i2d$1@sea.gmane.org>

I'm setting up two servers in an HA cluster (no GFS, just an active 
service system and a standby).

I have 2 external hardware RAID devices that have 2 SCSI channels each.
I have 2 SCSI HBA's in each host.

The external RAID devices only have a single RAID controller.
So for full single point of failure avoidance I would like to do
software RAID1 across the two external arrays.

The documentation I am reading says that software RAID cannot be done
on "shared" storage.

I am a little confused.  I know the quorum drives may be "shared"
and a GFS filesystem might be shared, but is a filesystem used by an HA 
service "shared"?

Can I setup a RAID1 device that migrates to an active cluster member 
with the service?

-Ben.



From cfeist at redhat.com  Wed May 25 19:22:01 2005
From: cfeist at redhat.com (Chris Feist)
Date: Wed, 25 May 2005 14:22:01 -0500
Subject: [Linux-cluster] GFS For Update 5
In-Reply-To: <4293B7D6.30000@ict.swin.edu.au>
References: <4293B7D6.30000@ict.swin.edu.au>
Message-ID: <4294D059.90607@redhat.com>

Should be available now.  Current version is 6.0.2.20-1.

Thanks,
Chris

John Newbigin wrote:
> Is GFS available for RHELAS3 Update 5 (kernel 2.4.21-32.EL)?  What 
> version of GFS should I be looking for?
> 
> Thanks
> 
> John.
> 
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster



From pbruna at linuxcenterla.com  Wed May 25 19:23:54 2005
From: pbruna at linuxcenterla.com (Patricio Bruna V.)
Date: Wed, 25 May 2005 15:23:54 -0400
Subject: [Linux-cluster] diferent lans
In-Reply-To: <1117042822.29947.16.camel@storm.cp-direct.com>
References: <200505251250.15050.pbruna@linuxcenterla.com>
	<1117042822.29947.16.camel@storm.cp-direct.com>
Message-ID: <200505251523.58231.pbruna@linuxcenterla.com>

El Mi? 25 May 2005 13:40, Edward Mann escribi?:
> Yes,
>
> As long as each node can talk to the other.
>
> I have one node in a DMZ on a 192 network, and another on another
> network on a 172 address.
>
> So i know that this will work. Just make sure that both nodes can talk
> to each other.
>
both nodes can talk to each other, i have full conection betewn them.
but in both cman reports:

kernel: CMAN: forming a new cluster
kernel: CMAN: quorum regained, resuming activity
ccsd[7001]: Cluster is quorate.  Allowing connections.

cat /proc/cluster/nodes (node pato):
Protocol version: 5.0.1
Config version: 1
Cluster name: lc
Cluster ID: 315
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 1
Expected_votes: 1
Total_votes: 1
Quorum: 1
Active subsystems: 0
Node name: pato.linuxcenterla.com
Node addresses: 192.168.3.31

cat /proc/cluster/status (node terminal):
Protocol version: 5.0.1
Config version: 1
Cluster name: lc
Cluster ID: 315
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 1
Expected_votes: 1
Total_votes: 1
Quorum: 1
Active subsystems: 0
Node name: terminal.linuxcenterla.com
Node addresses: 192.168.1.200

any ideas?


-- 
Patricio Bruna
pbruna at linuxcenterla.com
RHCE/RHCI
Jefe Soporte y Operaciones LinuxCenter S.A.
Canada 239, 5to piso, Providencia, Chile
http://www.linuxcenterla.com +56-2-2745000
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050525/103b4034/attachment.sig>

From jnewbigin at ict.swin.edu.au  Wed May 25 23:14:49 2005
From: jnewbigin at ict.swin.edu.au (John Newbigin)
Date: Thu, 26 May 2005 09:14:49 +1000
Subject: [Linux-cluster] GFS For Update 5
In-Reply-To: <4294D059.90607@redhat.com>
References: <4293B7D6.30000@ict.swin.edu.au> <4294D059.90607@redhat.com>
Message-ID: <429506E9.1060705@ict.swin.edu.au>

Chris Feist wrote:

> Should be available now.  Current version is 6.0.2.20-1.
Thanks for this info.  Is this also suitable for the new 2.4.21-32.0.1 
kernel or should I hold off for another new version?

John.

> 
> Thanks,
> Chris
> 
> John Newbigin wrote:
> 
>> Is GFS available for RHELAS3 Update 5 (kernel 2.4.21-32.EL)?  What 
>> version of GFS should I be looking for?
>>
>> Thanks
>>
>> John.
>>
>> -- 
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> http://www.redhat.com/mailman/listinfo/linux-cluster
> 
> 
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
> 
> 
> 


-- 
John Newbigin
Computer Systems Officer
Faculty of Information and Communication Technologies
Swinburne University of Technology
Melbourne, Australia
http://www.ict.swin.edu.au/staff/jnewbigin



From cfeist at redhat.com  Wed May 25 23:29:32 2005
From: cfeist at redhat.com (Chris Feist)
Date: Wed, 25 May 2005 18:29:32 -0500
Subject: [Linux-cluster] GFS For Update 5
In-Reply-To: <429506E9.1060705@ict.swin.edu.au>
References: <4293B7D6.30000@ict.swin.edu.au> <4294D059.90607@redhat.com>
	<429506E9.1060705@ict.swin.edu.au>
Message-ID: <42950A5C.9060707@redhat.com>

John Newbigin wrote:
> Chris Feist wrote:
> 
>> Should be available now.  Current version is 6.0.2.20-1.
> 
> Thanks for this info.  Is this also suitable for the new 2.4.21-32.0.1 
> kernel or should I hold off for another new version?
> 
> John.
> 

That's for the 2.4.21-32.EL kernel, we're currently building & testing our 
release for the -32.0.1.EL kernel.

Thanks,
Chris



From ARodriguez at INNODATA-ISOGEN.COM  Wed May 25 23:50:56 2005
From: ARodriguez at INNODATA-ISOGEN.COM (Rodriguez, Antonio)
Date: Thu, 26 May 2005 07:50:56 +0800
Subject: [Linux-cluster] RHEL4AS w/ cvs RHEL4 branch
Message-ID: <D3D5DC9EC4913C4E805DE9487D4EA6771207A3@MDEE2K25.MANDAUE.INNODATA.NET>

That did it! Thank you very much Dave!

Best Regards,
Tony

-----Original Message-----
From: David Teigland [mailto:teigland at redhat.com] 
Sent: Wednesday, May 25, 2005 8:02 PM
To: Rodriguez, Antonio
Cc: linux-cluster at redhat.com
Subject: Re: [Linux-cluster] RHEL4AS w/ cvs RHEL4 branch

On Wed, May 25, 2005 at 07:14:56PM +0800, Rodriguez, Antonio wrote:
> Now I'm trying to use fence_brocade as my fencing method, and I have
> this error when I simulate a failed node, or when I run the command
> fence_brocade:
> 
> fence_brocade 
> Can't locate Net/Telnet.pm in @INC (@INC contains:

Looks like you need to install the Net-Telnet perl module.

Dave




From jason at selu.edu  Thu May 26 13:37:41 2005
From: jason at selu.edu (Jason Lanclos)
Date: Thu, 26 May 2005 08:37:41 -0500
Subject: [Linux-cluster] LVM issues (CVS)
Message-ID: <200505260837.41746.Jason@selu.edu>


Using LVM from the current CVS gives me a Segmentation fault when attempting to create/extend a logical volume.

# lvcreate -L 100M -n test mail
  Rounding up size to full physical extent 128.00 MB
Segmentation fault

However, if I use the LVM2.2.01.10.tgz version, it works fine.

Is something broken in the CVS?


-- 
Jason Lanclos                                        
Systems Administrator                                 
Red Hat Certified Engineer        
Southeastern Louisiana University		     



From jason at selu.edu  Thu May 26 13:52:19 2005
From: jason at selu.edu (Jason Lanclos)
Date: Thu, 26 May 2005 08:52:19 -0500
Subject: [Linux-cluster] RHEL4 clustering/GFS
Message-ID: <200505260852.19781.Jason@selu.edu>


I've been testing RHEL4 with clustering/gfs using the RHEL4 cvs.

We are looking at upgrading our mail servers from RHAS 2.1 to RHEL4 and 
we would like to utilize gfs to help distribute the work on the machines.

My question is.. How does the CVS cluster/gfs differ from
cluster/gfs sold by RedHat for RHEL4?  
http://www.redhat.com/apps/commerce/rha/gfs/

BTW, is anyone going to the RedHat Summit in New Orleans?

-- 
Jason Lanclos                                        
Systems Administrator                                 
Red Hat Certified Engineer        
Southeastern Louisiana University		     



From mtilstra at redhat.com  Thu May 26 14:06:31 2005
From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra)
Date: Thu, 26 May 2005 09:06:31 -0500
Subject: [Linux-cluster] RHEL4 clustering/GFS
In-Reply-To: <200505260852.19781.Jason@selu.edu>
References: <200505260852.19781.Jason@selu.edu>
Message-ID: <20050526140631.GA11242@redhat.com>

On Thu, May 26, 2005 at 08:52:19AM -0500, Jason Lanclos wrote:
> I've been testing RHEL4 with clustering/gfs using the RHEL4 cvs.
> 
> We are looking at upgrading our mail servers from RHAS 2.1 to RHEL4 and 
> we would like to utilize gfs to help distribute the work on the machines.
> 
> My question is.. How does the CVS cluster/gfs differ from
> cluster/gfs sold by RedHat for RHEL4?  

It is compiled, put into rpms, and comes with a support contract.

-- 
Michael Conrad Tadpol Tilstra
Because reality is worse than your nightmares.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050526/e1e03d48/attachment.sig>

From natecars at natecarlson.com  Wed May 25 20:08:53 2005
From: natecars at natecarlson.com (Nate Carlson)
Date: Wed, 25 May 2005 15:08:53 -0500 (CDT)
Subject: [Linux-cluster] Sane configuration?
Message-ID: <Pine.LNX.4.61.0505251503220.24946@tungsten.technicality.org>

Hey all,

Just getting started with the whole GFS/CLVM thing, and wanted to run a 
config by the people who actually know what they are doing before I try to 
deploy it, in case I'm doing anything really stupid.  :)

I'm basically looking at setting up a 4-node cluster, using GFS to share 
storage. I've picked up a Dell PowerVault 660F fibre array for the central 
sotrage, and a PV51F fibre switch (equivilent to a Brocade 2400, 
apparently), which will have the array and each of the boxes connected to 
it.

Here's the kicker: I'd like to run Xen on these boxes, and be able to 
migrate VM's and such between them. (This is really why I want the shared 
storage, besides just for experience.)

My plan is to create a LUN for each of the individual nodes to boot from, 
and then have a large shared LUN for CLVM. Within the CLVM PV, I'll create 
a LV for each of the Xen domains/vm's to boot from, and then have a large 
GFS LV for the shared data.

Does this sound reasonable? Anyone see any huge holes in my logic? I'm 
hoping that GFS will compile and work within the Xen-ified Linux kernel - 
I suppose I should've tried that out before ordering hardware, but ah 
well.  :)

Hope someone has some comments, and thanks for reading!

------------------------------------------------------------------------
| nate carlson | natecars at natecarlson.com | http://www.natecarlson.com |
|       depriving some poor village of its idiot since 1981            |
------------------------------------------------------------------------



From jason at selu.edu  Thu May 26 14:22:57 2005
From: jason at selu.edu (Jason Lanclos)
Date: Thu, 26 May 2005 09:22:57 -0500
Subject: [Linux-cluster] RHEL4 clustering/GFS
In-Reply-To: <20050526140631.GA11242@redhat.com>
References: <200505260852.19781.Jason@selu.edu>
	<20050526140631.GA11242@redhat.com>
Message-ID: <200505260922.57130.Jason@selu.edu>

Michael,
	Thanks.. That what I was thinking, but I wasn't sure.
	I was browsing the documentation on redhats site for GFS/Clustering,
	and it appeared that the only locking technique was by using gulm,
	and there are several references to a pool manager, but in looking at what
    I've pulled down from cvs, there isn't any pool manager.
	So I'm assuming this is old documentation.. Is there updated documentation
	for clustering/gfs?


On Thursday 26 May 2005 09:06 am, Michael Conrad Tadpol Tilstra wrote:
> On Thu, May 26, 2005 at 08:52:19AM -0500, Jason Lanclos wrote:
> > I've been testing RHEL4 with clustering/gfs using the RHEL4 cvs.
> > 
> > We are looking at upgrading our mail servers from RHAS 2.1 to RHEL4 and 
> > we would like to utilize gfs to help distribute the work on the machines.
> > 
> > My question is.. How does the CVS cluster/gfs differ from
> > cluster/gfs sold by RedHat for RHEL4?  
> 
> It is compiled, put into rpms, and comes with a support contract.
> 

-- 
Jason Lanclos                                        
Systems Administrator                                 
Red Hat Certified Engineer        
Southeastern Louisiana University		     



From pcaulfie at redhat.com  Thu May 26 14:34:41 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Thu, 26 May 2005 15:34:41 +0100
Subject: [Linux-cluster] Sane configuration?
In-Reply-To: <Pine.LNX.4.61.0505251503220.24946@tungsten.technicality.org>
References: <Pine.LNX.4.61.0505251503220.24946@tungsten.technicality.org>
Message-ID: <4295DE81.3040805@redhat.com>

Nate Carlson wrote:
> Hey all,
> 
> Just getting started with the whole GFS/CLVM thing, and wanted to run a
> config by the people who actually know what they are doing before I try
> to deploy it, in case I'm doing anything really stupid.  :)
> 
> I'm basically looking at setting up a 4-node cluster, using GFS to share
> storage. I've picked up a Dell PowerVault 660F fibre array for the
> central sotrage, and a PV51F fibre switch (equivilent to a Brocade 2400,
> apparently), which will have the array and each of the boxes connected
> to it.
> 
> Here's the kicker: I'd like to run Xen on these boxes, and be able to
> migrate VM's and such between them. (This is really why I want the
> shared storage, besides just for experience.)
> 
> My plan is to create a LUN for each of the individual nodes to boot
> from, and then have a large shared LUN for CLVM. Within the CLVM PV,
> I'll create a LV for each of the Xen domains/vm's to boot from, and then
> have a large GFS LV for the shared data.
> 
> Does this sound reasonable? Anyone see any huge holes in my logic? I'm
> hoping that GFS will compile and work within the Xen-ified Linux kernel
> - I suppose I should've tried that out before ordering hardware, but ah
> well.  :)
> 
> Hope someone has some comments, and thanks for reading!

I can't comment on the particular hardware but I can say that GFS runs nicely
inside xen. There are even xenU packages for Fedora Core 4 available.

One thing to watch if you're building GFS & components from source is to make sure
that they are built with ARCH=xen on the make command-line; otherwise all
the timeouts are out by an order of magnitude!

I've done some (rather vague) instructions at
http://www.cix.co.uk/~tykepenguin/xencluster.html - there's even a primitive
xen fencing agent there, if that helps.

-- 

patrick



From mtilstra at redhat.com  Thu May 26 14:35:36 2005
From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra)
Date: Thu, 26 May 2005 09:35:36 -0500
Subject: [Linux-cluster] RHEL4 clustering/GFS
In-Reply-To: <200505260922.57130.Jason@selu.edu>
References: <200505260852.19781.Jason@selu.edu>
	<20050526140631.GA11242@redhat.com>
	<200505260922.57130.Jason@selu.edu>
Message-ID: <20050526143536.GA11530@redhat.com>

On Thu, May 26, 2005 at 09:22:57AM -0500, Jason Lanclos wrote:
> Michael,
> 	Thanks.. That what I was thinking, but I wasn't sure.
> 	I was browsing the documentation on redhats site for GFS/Clustering,
> 	and it appeared that the only locking technique was by using gulm,
> 	and there are several references to a pool manager, but in looking at what
>     I've pulled down from cvs, there isn't any pool manager.
> 	So I'm assuming this is old documentation.. Is there updated documentation
> 	for clustering/gfs?

CLVM replaces pool. (it even knows how to read and convert pool labeled
disks!)

And, yeah, that would be the old docs.  New docs will get released with
the rpms.

-- 
Michael Conrad Tadpol Tilstra
How do you explain counter-clockwise to someone with a digital watch?
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050526/42b33cdf/attachment.sig>

From comaniliut at gmail.com  Thu May 26 14:38:42 2005
From: comaniliut at gmail.com (Coman Iliut)
Date: Thu, 26 May 2005 10:38:42 -0400
Subject: [Linux-cluster] SAN fencing
In-Reply-To: <42945F90.7020601@na.infn.it>
References: <D3D5DC9EC4913C4E805DE9487D4EA6771207A2@MDEE2K25.MANDAUE.INNODATA.NET>
	<42945F90.7020601@na.infn.it>
Message-ID: <9cf395480505260738217338c1@mail.gmail.com>

If you also use HP servers and they have ILO (Integrated Lights Out), you 
canuse that to do the fencing.
ILO basically resets the computer that needs to be fenced. You can also get 
it to just power the server off.

Coman

On 5/25/05, Rosario Esposito <Rosario.Esposito at na.infn.it> wrote:
> Hello,
> I have a 3-nodes with shared storage HP MSA1500cs, mounting a GFS
> filesystem.
> The fibre channel switch is a HP StorageWorks Edge Switch 2/12.
> I have no power switch at the moment.
> I can't find any fence_xxx script to manage the SAN fencing. I tried
> fence_brocade but doesn't work...
> Any ideas ?
> 
> Thanks
> 
> - Rosario Esposito
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050526/073905d1/attachment.htm>

From jason at selu.edu  Thu May 26 14:43:07 2005
From: jason at selu.edu (Jason Lanclos)
Date: Thu, 26 May 2005 09:43:07 -0500
Subject: [Linux-cluster] SAN fencing
In-Reply-To: <9cf395480505260738217338c1@mail.gmail.com>
References: <D3D5DC9EC4913C4E805DE9487D4EA6771207A2@MDEE2K25.MANDAUE.INNODATA.NET>
	<42945F90.7020601@na.infn.it>
	<9cf395480505260738217338c1@mail.gmail.com>
Message-ID: <200505260943.07283.Jason@selu.edu>


Does anyone know if XioTech Magnitude's have any fencing capability?


On Thursday 26 May 2005 09:38 am, Coman Iliut wrote:
> If you also use HP servers and they have ILO (Integrated Lights Out), you 
> canuse that to do the fencing.
> ILO basically resets the computer that needs to be fenced. You can also get 
> it to just power the server off.
> 
> Coman
> 
> On 5/25/05, Rosario Esposito <Rosario.Esposito at na.infn.it> wrote:
> > Hello,
> > I have a 3-nodes with shared storage HP MSA1500cs, mounting a GFS
> > filesystem.
> > The fibre channel switch is a HP StorageWorks Edge Switch 2/12.
> > I have no power switch at the moment.
> > I can't find any fence_xxx script to manage the SAN fencing. I tried
> > fence_brocade but doesn't work...
> > Any ideas ?
> > 
> > Thanks
> > 
> > - Rosario Esposito
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > http://www.redhat.com/mailman/listinfo/linux-cluster
> >
> 

-- 
Jason Lanclos                                        
Systems Administrator                                 
Red Hat Certified Engineer        
Southeastern Louisiana University		     



From natecars at natecarlson.com  Thu May 26 14:49:34 2005
From: natecars at natecarlson.com (Nate Carlson)
Date: Thu, 26 May 2005 09:49:34 -0500 (CDT)
Subject: [Linux-cluster] Sane configuration?
In-Reply-To: <4295DE81.3040805@redhat.com>
References: <Pine.LNX.4.61.0505251503220.24946@tungsten.technicality.org>
	<4295DE81.3040805@redhat.com>
Message-ID: <Pine.LNX.4.61.0505260948250.30819@tungsten.technicality.org>

On Thu, 26 May 2005, Patrick Caulfield wrote:
> I can't comment on the particular hardware but I can say that GFS runs 
> nicely inside xen. There are even xenU packages for Fedora Core 4 
> available.
>
> One thing to watch if you're building GFS & components from source is to 
> make sure that they are built with ARCH=xen on the make command-line; 
> otherwise all the timeouts are out by an order of magnitude!
>
> I've done some (rather vague) instructions at 
> http://www.cix.co.uk/~tykepenguin/xencluster.html - there's even a 
> primitive xen fencing agent there, if that helps.

Very cool - thanks! That's really helpful info. I wasn't finding any 
records of people doing Xen+GFS out on the 'net at large, so it's good to 
know it has been done.. :)

------------------------------------------------------------------------
| nate carlson | natecars at natecarlson.com | http://www.natecarlson.com |
|       depriving some poor village of its idiot since 1981            |
------------------------------------------------------------------------



From Rosario.Esposito at na.infn.it  Thu May 26 15:00:43 2005
From: Rosario.Esposito at na.infn.it (Rosario Esposito)
Date: Thu, 26 May 2005 17:00:43 +0200
Subject: [Linux-cluster] SAN fencing
In-Reply-To: <9cf395480505260738217338c1@mail.gmail.com>
References: <D3D5DC9EC4913C4E805DE9487D4EA6771207A2@MDEE2K25.MANDAUE.INNODATA.NET>	<42945F90.7020601@na.infn.it>
	<9cf395480505260738217338c1@mail.gmail.com>
Message-ID: <4295E49B.8090604@na.infn.it>

Unfortunately I have Supermicro servers. No ILO available :-(
Other ideas ?

- Rosario


Coman Iliut wrote:

> If you also use HP servers and they have ILO (Integrated Lights Out), 
> you canuse that to do the fencing.
> ILO basically resets the computer that needs to be fenced. You can 
> also get it to just power the server off.
>
> Coman
>
> On 5/25/05, Rosario Esposito <Rosario.Esposito at na.infn.it 
> <mailto:Rosario.Esposito at na.infn.it>> wrote:
> > Hello,
> > I have a 3-nodes with shared storage HP MSA1500cs, mounting a GFS
> > filesystem.
> > The fibre channel switch is a HP StorageWorks Edge Switch 2/12.
> > I have no power switch at the moment.
> > I can't find any fence_xxx script to manage the SAN fencing. I tried
> > fence_brocade but doesn't work...
> > Any ideas ?
> >
> > Thanks
> >
> > - Rosario Esposito
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
> > http://www.redhat.com/mailman/listinfo/linux-cluster
> >



From mtilstra at redhat.com  Thu May 26 15:36:48 2005
From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra)
Date: Thu, 26 May 2005 10:36:48 -0500
Subject: [Linux-cluster] SAN fencing
In-Reply-To: <4295E49B.8090604@na.infn.it>
References: <D3D5DC9EC4913C4E805DE9487D4EA6771207A2@MDEE2K25.MANDAUE.INNODATA.NET>
	<42945F90.7020601@na.infn.it>
	<9cf395480505260738217338c1@mail.gmail.com>
	<4295E49B.8090604@na.infn.it>
Message-ID: <20050526153648.GA11962@redhat.com>

On Thu, May 26, 2005 at 05:00:43PM +0200, Rosario Esposito wrote:
> >On 5/25/05, Rosario Esposito <Rosario.Esposito at na.infn.it 
> ><mailto:Rosario.Esposito at na.infn.it>> wrote:
> >> Hello,
> >> I have a 3-nodes with shared storage HP MSA1500cs, mounting a GFS
> >> filesystem.
> >> The fibre channel switch is a HP StorageWorks Edge Switch 2/12.
> >> I have no power switch at the moment.
> >> I can't find any fence_xxx script to manage the SAN fencing. I tried
> >> fence_brocade but doesn't work...
> >> Any ideas ?
> >>
> Unfortunately I have Supermicro servers. No ILO available :-(
> Other ideas ?

write a new fencing agent for your switch.  Most of the current fencing
agents are perls scripts.

-- 
Michael Conrad Tadpol Tilstra
Schizophrenia beats being alone.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050526/6eab3d27/attachment.sig>

From yuw at cse.ohio-state.edu  Thu May 26 19:15:53 2005
From: yuw at cse.ohio-state.edu (Weikuan Yu)
Date: Thu, 26 May 2005 15:15:53 -0400
Subject: [Linux-cluster] GFS old release for RedHat AS 3
Message-ID: <9484E16A-CE1A-11D9-A655-000D932C3754@cse.ohio-state.edu>

Hi,

Though the current development of GFS is on RHEL3, I wonder if I can 
get an old stable release of GFS for RHEL3. Could anybody please give a 
pointer?

Thanks,
Weikuan



From yuw at cse.ohio-state.edu  Thu May 26 19:30:32 2005
From: yuw at cse.ohio-state.edu (Weikuan Yu)
Date: Thu, 26 May 2005 15:30:32 -0400
Subject: [Linux-cluster] GFS old release for RedHat AS 3
In-Reply-To: <9484E16A-CE1A-11D9-A655-000D932C3754@cse.ohio-state.edu>
References: <9484E16A-CE1A-11D9-A655-000D932C3754@cse.ohio-state.edu>
Message-ID: <9FEA0848-CE1C-11D9-A655-000D932C3754@cse.ohio-state.edu>


On May 26, 2005, at 3:15 PM, Weikuan Yu wrote:

> Though the current development of GFS is on RHEL3

Sorry, I meant to say RHEL4. I am looking for old release for RHEL3.

Weikuan



From mtilstra at redhat.com  Thu May 26 19:33:03 2005
From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra)
Date: Thu, 26 May 2005 14:33:03 -0500
Subject: [Linux-cluster] GFS old release for RedHat AS 3
In-Reply-To: <9FEA0848-CE1C-11D9-A655-000D932C3754@cse.ohio-state.edu>
References: <9484E16A-CE1A-11D9-A655-000D932C3754@cse.ohio-state.edu>
	<9FEA0848-CE1C-11D9-A655-000D932C3754@cse.ohio-state.edu>
Message-ID: <20050526193303.GA13100@redhat.com>

On Thu, May 26, 2005 at 03:30:32PM -0400, Weikuan Yu wrote:
> 
> On May 26, 2005, at 3:15 PM, Weikuan Yu wrote:
> 
> >Though the current development of GFS is on RHEL3
> 
> Sorry, I meant to say RHEL4. I am looking for old release for RHEL3.

http://sources.redhat.com/cluster/faq.html#develcode

-- 
Michael Conrad Tadpol Tilstra
Darth Vader sleeps with a Teddywookie.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050526/80edbd22/attachment.sig>

From bruce.walker at hp.com  Thu May 26 23:02:54 2005
From: bruce.walker at hp.com (Walker, Bruce J)
Date: Thu, 26 May 2005 16:02:54 -0700
Subject: [Linux-cluster] Cluster meeting at OLS
Message-ID: <3689AF909D816446BA505D21F1461AE403C42E19@cacexc04.americas.cpqcorp.net>

As a followup and extension to the Cluster Summit in Germany, there will
be Cluster Meeting on June 18 and June 19 (Monday and Tuesday before
OLS) at the conference hotel in Ottawa.  We have the the Rideau Suite,
in Les Suites, reserved all day, both days.
I will try to bring a projector so presentations can be an option.

Currently there are at least 11 confirmed attendees representing many
different cluster projects/technologies.  The room can hold at least 20
people.   I hope some of the folks who can make it to the Cluster Summit
in Germany will also be able to attend this meeting to share the outcome
of the Summit. 

Bruce


-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Walker, Bruce J
Sent: Monday, May 09, 2005 6:05 AM
To: clusters_sig at lists.osdl.org; linux-cluster at redhat.com
Subject: [Linux-cluster] Planning a Cluster meeting at OLS

I am trying to gauge interest in a cluster meeting at OLS (before OLS
starts).  Topics would include, but not be limited to:
   membership, fencing, apis, (kernel and non-kernel)
   DLM 
   cluster filesystems (hooks, DLMs, recovery, membership, etc.)
   common hooks for clusterwide process management
      - openssi, openmosix, bproc, kerrighed, cassat, ...
Goal would be to make progress toward common infrastructure or common
infrastructure interfaces that various groups could work with.

If you are interested and can attend, please let me (not the whole
lists) know asap:
   -  one day only (Tuesday July 19)
   - two days (Monday July 18 and Tuesday July 19)
   - 3 days (Sunday-Tuesday).

Feel free to extend this request to other mailing lists.  Based on the
responses received by May 12 (it doesn't have to be a firm commitment),
I will let everyone know if and how long we will meet so we all can plan
travel.

Bruce Walker
 

--
Linux-cluster mailing list
Linux-cluster at redhat.com
http://www.redhat.com/mailman/listinfo/linux-cluster



From lhh at redhat.com  Fri May 27 21:32:34 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 27 May 2005 17:32:34 -0400
Subject: [Linux-cluster] SAN fencing
In-Reply-To: <20050526153648.GA11962@redhat.com>
References: <D3D5DC9EC4913C4E805DE9487D4EA6771207A2@MDEE2K25.MANDAUE.INNODATA.NET>
	<42945F90.7020601@na.infn.it>
	<9cf395480505260738217338c1@mail.gmail.com>
	<4295E49B.8090604@na.infn.it>  <20050526153648.GA11962@redhat.com>
Message-ID: <1117229554.18073.283.camel@ayanami.boston.redhat.com>

On Thu, 2005-05-26 at 10:36 -0500, Michael Conrad Tadpol Tilstra wrote:
> > Unfortunately I have Supermicro servers. No ILO available :-(
> > Other ideas ?
> 
> write a new fencing agent for your switch.  Most of the current fencing
> agents are perls scripts.

... and post it to the mailing list, if you get one written =)



From ritesh.a at net4india.net  Sat May 28 09:19:07 2005
From: ritesh.a at net4india.net (Ritesh Agrawal)
Date: Sat, 28 May 2005 14:49:07 +0530
Subject: [Linux-cluster] Does Piranaha work with SMP kernel?
Message-ID: <4298378B.20605@net4india.net>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1



Hi list,
~ Recently i configured 'piranha load blanacer' using DELL poweredge 1650
server. When machine is booted with SMP kernel ,after runing pulse
daemon and sending request it hangs.

but when i boot the machine with  non SMP kernel  with same hardware
and software configuration , it works well.

it means pirana doesn't work with SMP  kernel or any other reason.
is Piranha suitable with SMP kernel ????
A big question for me.
Thanks in advance.

- --
Regards
Ritesh Agrawal
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFCmDeKFoz+P95jnTIRAjNIAJ93MqGsxsOsS+Kxi3+3juv7hCqZQgCgm8nJ
SAM93N8qjdSEX1vONbCrnKc=
=0dhY
-----END PGP SIGNATURE-----



From andrewxwang at yahoo.com.tw  Sun May 29 13:25:06 2005
From: andrewxwang at yahoo.com.tw (Andrew Wang)
Date: Sun, 29 May 2005 21:25:06 +0800 (CST)
Subject: [Linux-cluster] Gridengine 6.0 update 4 released
In-Reply-To: 6667
Message-ID: <20050529132506.79177.qmail@web18008.mail.tpe.yahoo.com>

Binary:
http://gridengine.sunsource.net/download60.html

Source:
http://gridengine.sunsource.net/files/documents/7/22/sge-V60u4_TAG-src.tar.gz

Announcement:
http://gridengine.sunsource.net/project/gridengine/news/SGE60u4-announce.html

Binary download for AMD64 and x86. Grab the source
tarball if you have IA64 and PPC/PPC64 machines.

Andrew.




_______________________________________________________________________
Yahoo!??????
????250MB????????
http://tw.promo.yahoo.com/mail_new/index.html



From fabbione at fabbione.net  Tue May 31 07:23:43 2005
From: fabbione at fabbione.net (Fabio Massimo Di Nitto)
Date: Tue, 31 May 2005 09:23:43 +0200
Subject: [Linux-cluster] documentation and configs
Message-ID: <429C10FF.7020200@fabbione.net>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi everybody,

I am actually preparing a set of debian/ubuntu packages of the entire
cluster suite based on RHEL4U1 branch and some great work already done in
Debian by Bastian Blank (in CC).

I have noticed on the mailing list and sometimes on IRC that often some
common problems show up due to simple configuration errors.

While the usage.txt is a very simple and safe point to start from,
I think it doesn't really exploit the full potentiality of the cluster suite.

In order to provide cluster users/admins a better experience I would
like to collect some configs with scenario descriptions coming from
real usage of the entire suite, and include them in at least the packages
i am preparing and ask upstream to include them in the doc/ dir perhaps?

If you are so kind to share your experience with the community, please
send me privately your config files with a description of your setup.

If upstream is interested i will collect them and strip them of sensible
data (like public ip addresses, if that's the case) before pushing it
back in one shot (there is no need to flood the mailinglist imho).

Thanks in advance
Fabio
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFCnBD8hCzbekR3nhgRAjfBAJ9nVRH5yCguqGgn6abmTYysv6uxPACeNLDB
2Tx3olyXe4XoGwLsLL80wlk=
=C8nn
-----END PGP SIGNATURE-----



From fs at lowpingbastards.de  Mon May 30 16:40:36 2005
From: fs at lowpingbastards.de (Frederik Schueler)
Date: Mon, 30 May 2005 18:40:36 +0200
Subject: [Linux-cluster] [PATCH] rgmanager ip.sh rdisc calls fix
Message-ID: <20050530164036.GB4049@mail.lowpingbastards.de>

Hello,

the rdisc (router discovery daemon) calls in the rgmanager ip.sh
resource script fail if the daemon is not installed, these calls should
be wrapped. 
See attached patch.

Kind regards
Frederik Schueler

-- 
ENOSIG
-------------- next part --------------
--- /usr/share/cluster/ip.sh	2005-05-26 12:26:10.000000000 +0200
+++ ip.sh	2005-05-30 18:39:02.000000000 +0200
@@ -594,7 +594,11 @@
 	#
 
 	# Not sure if this is necessary for ipv6 either.
-	killall -HUP rdisc || rdisc -fs
+	file=$(which rdisc 2>/dev/null)
+	if [ -f "$file" ]; then
+		killall -HUP rdisc || rdisc -fs
+	fi
+
 	return 0
 }
 
@@ -646,7 +650,11 @@
  		arping -q -c 2 -U -I $dev $addr
 	fi
 
-	killall -HUP rdisc || rdisc -fs
+	file=$(which rdisc 2>/dev/null)
+	if [ -f "$file" ]; then
+		killall -HUP rdisc || rdisc -fs
+	fi
+
 	return 0
 }
 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050530/04f4768a/attachment.sig>

From lmb at suse.de  Tue May 31 10:26:24 2005
From: lmb at suse.de (Lars Marowsky-Bree)
Date: Tue, 31 May 2005 12:26:24 +0200
Subject: [Linux-cluster] Re: [Clusters_sig] Re: [ANNOUNCE] Linux Cluster
	Summit 2005
In-Reply-To: <s28eff68.098@sinclair.provo.novell.com>
References: <s28eff68.098@sinclair.provo.novell.com>
Message-ID: <20050531102624.GT17565@marowsky-bree.de>

On 2005-05-21T09:29:01, Robert Wipfel <rawipfel at novell.com> wrote:

> outside looking in, is web services an!  d grid.  Returning to the
> reality of many vendor's enterprise* business, the suitespot for h/a
> clusters still seems to be somewhere around ~8 dual-CPU nodes with
> many customers deploying multiple similar clusters. Nodes are never in
> multiple clusters at once, rather, individual nodes are members of a
> cluster and that cluster might be a member of a cluster of clusters.

A single node must be big enough to support sane load balancing; ie, big
enough to run at least one (or more) "whole" resource entities / jobs.

That is the breaking point after which it is more sensible to deploy
more nodes - with looser coupling - than making a single node / SSI
component larger, because decoupled operation means less complexity for
fault isolation.


Sincerely,
    Lars Marowsky-Br?e <lmb at suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"



From ggilyeat at jhsph.edu  Tue May 31 18:06:27 2005
From: ggilyeat at jhsph.edu (Gerald G. Gilyeat)
Date: Tue, 31 May 2005 14:06:27 -0400
Subject: [Linux-cluster] Question :)
Message-ID: <DF33F4DAC09B3048AA095F1B5C368915A57D0A@XCH-VN02.sph.ad.jhsph.edu>


First - thanks for the help the last time I poked my pointy little head in here.
Things have been -much- more stable since we bumped the lock limit to 2097152 ;)

However, we're still running into the occasional "glitch" where it seems like a single process is locking up -all- disk access on us, until it completes its operation.
Specifically, we see this when folks are doing rsyncs of large amounts of data (one of my faculty has been trying to copy over a couple thousand 16MB files). Even piping tar through ssh (from target machine, ssh user at host "cd /data/dir/path; tar -cpsf -" | tar -xpsf -) results in similar behaviour.
Is this tunable, or simply a fact of life that we're simply going to have to live with? it only occurs with big, or long, writes. Reads aren't a problem (it just takes 14 hours to dump 1.5TB to tape...)

Thanks!

--
Jerry Gilyeat, RHCE
Systems Administrator
Molecular Microbiology and Immunology
Johns Hopkins Bloomberg School of Public Health
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050531/d43d9d7c/attachment.htm>

From cjk at techma.com  Tue May 31 18:38:18 2005
From: cjk at techma.com (Kovacs, Corey J.)
Date: Tue, 31 May 2005 14:38:18 -0400
Subject: [Linux-cluster] Question :)
Message-ID: <EE32D921D7601547AD9CA5C87906C566095157@tmaemail.techma.com>

Jerry, is this problem with the "current" supported version of GFS? If so,
what 
version are you running? I am having a similar problem with a 5 node cluster 
with 3 nodes serving as lock managers. If I rsync large ammounts of data
(0.5TB)
to a node serving as a lock manager and mounting the FS, things croak pretty
quick.
If I rsync to a node that is NOT a lock manager, it takes longer but
eventually locks
up their as well. Although at times, it will come back.
 
when we do out rsync, the gfs_scand and lock_gulmd go crazy. In the instance
where
the fs comes back, they continue to have high cpu utilization. 
 
I don't think this is "a fact of life" that anyone needs to live with by the
way, there has
to be a reason for this. I can't believe for a minute that you and I are the
only ones
experienceing this.
 
Corey
 
________________________________

From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Gerald G. Gilyeat
Sent: Tuesday, May 31, 2005 2:06 PM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] Question :)




First - thanks for the help the last time I poked my pointy little head in
here.
Things have been -much- more stable since we bumped the lock limit to 2097152
;)

However, we're still running into the occasional "glitch" where it seems like
a single process is locking up -all- disk access on us, until it completes
its operation.
Specifically, we see this when folks are doing rsyncs of large amounts of
data (one of my faculty has been trying to copy over a couple thousand 16MB
files). Even piping tar through ssh (from target machine, ssh user at host "cd
/data/dir/path; tar -cpsf -" | tar -xpsf -) results in similar behaviour.
Is this tunable, or simply a fact of life that we're simply going to have to
live with? it only occurs with big, or long, writes. Reads aren't a problem
(it just takes 14 hours to dump 1.5TB to tape...)

Thanks!

--
Jerry Gilyeat, RHCE
Systems Administrator
Molecular Microbiology and Immunology
Johns Hopkins Bloomberg School of Public Health


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050531/9fedeca5/attachment.htm>

From ggilyeat at jhsph.edu  Tue May 31 18:50:56 2005
From: ggilyeat at jhsph.edu (Gerald G. Gilyeat)
Date: Tue, 31 May 2005 14:50:56 -0400
Subject: [Linux-cluster] Question :)
Message-ID: <DF33F4DAC09B3048AA095F1B5C368915A57D0C@XCH-VN02.sph.ad.jhsph.edu>


We're using GFS-6.0.0-7.1
Not the latest patch, I realize, and if that will fix things, that'd be ideal, I think.
We're also in a 5-node situation, with three servers, and in fact our behaviour appears to be almost identical.
It -does- eventually come back for us after I kill the rsync process, so it appears to be flushing a buffer of some sort. 
Regardless, it's not really acceptable behaviour when you've got a 32node compute cluster behind one of the GFS nodes and you have researchers that need to move hundreds of gigs of data into the file system and -can't- because of this.

--
Jerry Gilyeat, RHCE
Systems Administrator
Molecular Microbiology and Immunology
Johns Hopkins Bloomberg School of Public Health



-----Original Message-----
From: linux-cluster-bounces at redhat.com on behalf of Kovacs, Corey J.
Sent: Tue 5/31/2005 2:38 PM
To: linux clustering
Subject: RE: [Linux-cluster] Question :)
 
Jerry, is this problem with the "current" supported version of GFS? If so,
what 
version are you running? I am having a similar problem with a 5 node cluster 
with 3 nodes serving as lock managers. If I rsync large ammounts of data
(0.5TB)
to a node serving as a lock manager and mounting the FS, things croak pretty
quick.
If I rsync to a node that is NOT a lock manager, it takes longer but
eventually locks
up their as well. Although at times, it will come back.
 
when we do out rsync, the gfs_scand and lock_gulmd go crazy. In the instance
where
the fs comes back, they continue to have high cpu utilization. 
 
I don't think this is "a fact of life" that anyone needs to live with by the
way, there has
to be a reason for this. I can't believe for a minute that you and I are the
only ones
experienceing this.
 
Corey
 
________________________________

From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Gerald G. Gilyeat
Sent: Tuesday, May 31, 2005 2:06 PM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] Question :)




First - thanks for the help the last time I poked my pointy little head in
here.
Things have been -much- more stable since we bumped the lock limit to 2097152
;)

However, we're still running into the occasional "glitch" where it seems like
a single process is locking up -all- disk access on us, until it completes
its operation.
Specifically, we see this when folks are doing rsyncs of large amounts of
data (one of my faculty has been trying to copy over a couple thousand 16MB
files). Even piping tar through ssh (from target machine, ssh user at host "cd
/data/dir/path; tar -cpsf -" | tar -xpsf -) results in similar behaviour.
Is this tunable, or simply a fact of life that we're simply going to have to
live with? it only occurs with big, or long, writes. Reads aren't a problem
(it just takes 14 hours to dump 1.5TB to tape...)

Thanks!

--
Jerry Gilyeat, RHCE
Systems Administrator
Molecular Microbiology and Immunology
Johns Hopkins Bloomberg School of Public Health


-------------- next part --------------
A non-text attachment was scrubbed...
Name: winmail.dat
Type: application/ms-tnef
Size: 4152 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050531/2172d412/attachment.bin>

From phung at cs.columbia.edu  Tue May 31 22:17:15 2005
From: phung at cs.columbia.edu (Dan B. Phung)
Date: Tue, 31 May 2005 18:17:15 -0400 (EDT)
Subject: [Linux-cluster] fence_sanbox2 configuration example?
Message-ID: <Pine.LNX.4.44.0505311657000.1928-100000@algiers.clic.cs.columbia.edu>

Can somebody who's using the fence_sanbox2 agent let me take a peek at 
their config?  


here's what I have so far for fencedevices:
   <fencedevices>
   <fencedevice name="human" agent="fence_manual"/>
      <fencedevice name="san" agent="fence_sanbox2"
         ipaddr="fsm.cs.columbia.edu" login="foo" passwd="bar"/>
   </fencedevices>

is the port indicated in the manpage the WWN for the host bus adaptor?

       -n port
              The port number to disable on the switch.


here's what I think I should have in each clusternode entry:

          <clusternode name="blade01" nodeid="1" votes="1">
          <multicast addr="224.0.0.1" interface="eth0"/>
             <fence>
               <method name="fibre">
                 <device name="san" port="2100000d55d32ced"/>
               </method>

               <method name="single">
                 <device name="human" port="128.59.18.1"/>
               </method>
             </fence>
          </clusternode>


and to test this, I was going to suddenly turn off networking
for the blade and see how the rest of the cluster reacts.  

-dan

--




From davidnicol at gmail.com  Tue May 31 19:29:34 2005
From: davidnicol at gmail.com (David Nicol)
Date: Tue, 31 May 2005 14:29:34 -0500
Subject: [Linux-cluster] Common Cluster Infrastructure discussion
In-Reply-To: <20050531102624.GT17565@marowsky-bree.de>
References: <s28eff68.098@sinclair.provo.novell.com>
	<20050531102624.GT17565@marowsky-bree.de>
Message-ID: <934f64a205053112295ee49ffd@mail.gmail.com>

On 5/31/05, Lars Marowsky-Bree <lmb at suse.de> wrote:
> On 2005-05-21T09:29:01, Robert Wipfel <rawipfel at novell.com> wrote:
> 
> > outside looking in, is web services and grid.  Returning to the
> > reality of many vendor's enterprise* business, the sweet spot for h/a
> > clusters still seems to be somewhere around ~8 dual-CPU nodes with
> > many customers deploying multiple similar clusters. Nodes are never in
> > multiple clusters at once, rather, individual nodes are members of a
> > cluster and that cluster might be a member of a cluster of clusters.

To restate the proposal, in response to this distinction, that "Nodes
are never in
 multiple clusters at once, rather, individual nodes are members of a
 cluster and that cluster might be a member of a cluster of clusters", in the
language of the CCI proposal, one or more nodes in a subcluster would join
the larger cluster, but not all of them, and these liaison nodes would
handle the
communications between this subcluster and other subclusters.  using
CCI, different
subclusters in this grid could run different clustering frameworks, or might be
lone boxes in the supercluster that aren't actually representing clusters.

To implement a cluster being a member of a cluster of clusters with the same
interface that is used to manage a node being a member of a cluster, that is the
idea.  

I take away from LMB's remark a requirement that the CCI must support in-cluster
selection/election for a presented service, so that the liaison nodes,
which could be
all nodes, or could be a subset of all nodes, in the subcluster, 
could present themselves
as a coherent authority to the other members of the supercluster, over
the channels
defined by the supercluster, representing a single node identifier in
the supercluster.

We want in-cluster communications to be through the CCI rather than through an
implementation detail (such as tcp/ip) because we do not want to
confuse communications
by associating node identification with any artifact, such as IP
address, which could
be broken by an architecture change, or even by a failover.

> A single node must be big enough to support sane load balancing; ie, big
> enough to run at least one (or more) "whole" resource entities / jobs.

OTOH, the grain size of a "whole job" can be tuned to fit the reality
of your hardware.

Hopefully the CCI will provide useful metrics about node capability
and current load
to allow apples-to-apples comparisons for making better load balancing decisions
in heterogenous clusters.



> "Ignorance more frequently begets confidence than does knowledge"

very good - I know it does for me!

David L Nicol
Proudly ignorant of many and much